"OpenAI's gpt-oss Java port for CPU inference"

🚀 Launching GPT-OSS Java: Pure Java LLM Inference in ~1000 Lines

Excited to share my latest open-source project: a complete Java port of OpenAI's gpt-oss inference engine running on CPU, now available at https://lnkd.in/gzCXk-pH!

🎯 Key features:
• 📚 Educational - clean, readable code for understanding LLM transformer internals
• 🏗️ Complete gpt-oss architecture - full implementation of the MoE transformer with GQA, sliding-window attention, RoPE, and SwiGLU
• 💻 CPU inference - no GPU required; designed for consumer-grade commodity hardware on local machines or cloud compute instances
• 🧠 Memory efficient - runs gpt-oss-20b models on CPU with just 16 GB of RAM
• ⚡ Performance optimized - supports a KV cache and exploits the modern JDK GC/JIT, parallel processing, the SIMD Vector API, and fused operations
• 🔢 MXFP4 dequantization - handles the original MXFP4-quantized MoE weights

📊 Performance highlights:
• ~11 tokens/sec on an Apple M3 Pro (12 CPUs, 36 GB)
• ~10 tokens/sec on an AWS EC2 m5.4xlarge (8 physical cores, 16 vCPUs, 64 GB)

Inspired by llama.cpp and llama2.c, this project demonstrates that Java can achieve impressive performance for LLM inference when properly optimized.

Check it out: https://lnkd.in/gzCXk-pH
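To give a flavor of the "SIMD Vector API" bullet: the heart of CPU inference is dense dot products, and the JDK's incubating Vector API lets Java express them as explicit SIMD. Below is a minimal sketch (class and method names are illustrative, not the project's actual code); it requires running with `--add-modules jdk.incubator.vector`.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Illustrative SIMD dot product, the core kernel of matmul-heavy LLM inference.
public class VectorDot {
    // Widest vector shape the current CPU supports (e.g. 8 floats on AVX2).
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        FloatVector acc = FloatVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        // Main SIMD loop: fused multiply-add across full vector lanes.
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc); // acc += a[i..] * b[i..], lane-wise
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        // Scalar tail for lengths not divisible by the lane count.
        for (; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

The same pattern (vector accumulator, `loopBound`, scalar tail) generalizes to the fused matmul/attention kernels the post alludes to.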
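On the MXFP4 bullet: per the OCP Microscaling (MX) spec, MXFP4 stores weights in blocks of 32 FP4 (E2M1) values sharing one E8M0 scale byte, so dequantization is a table lookup times a power-of-two scale. Here is a hypothetical sketch of one block's dequantization (names and the low-nibble-first packing order are assumptions, not the project's actual API):

```java
// Illustrative MXFP4 block dequantization (E2M1 values + shared E8M0 scale).
public class Mxfp4 {
    // The 8 non-negative E2M1 magnitudes; bit 3 of each 4-bit code is the sign.
    private static final float[] FP4_LUT = {
        0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f
    };

    /**
     * Dequantizes one 32-element block.
     * @param packed    16 bytes, two 4-bit codes per byte (low nibble first; assumed order)
     * @param scaleE8M0 shared scale byte; scale = 2^(scaleE8M0 - 127)
     *                  (the spec's NaN encoding at e=255 is omitted in this sketch)
     * @param out       destination array; 32 floats written starting at offset
     */
    static void dequantBlock(byte[] packed, int scaleE8M0, float[] out, int offset) {
        float scale = Math.scalb(1.0f, scaleE8M0 - 127);
        for (int i = 0; i < 16; i++) {
            int b = packed[i] & 0xFF;
            out[offset + 2 * i]     = decode(b & 0x0F) * scale;
            out[offset + 2 * i + 1] = decode((b >> 4) & 0x0F) * scale;
        }
    }

    private static float decode(int code) {
        float mag = FP4_LUT[code & 0x7];
        return (code & 0x8) != 0 ? -mag : mag;
    }
}
```

Because each block needs only one `scalb` and a lookup per element, dequantization stays cheap enough to run on the fly during CPU matmuls.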

Nice! Just curious, why Java?

