Microsoft Releases bitnet.cpp: Run Large AI Models on CPUs

This is huge. For years, running large language models meant relying on expensive GPUs or renting cloud compute. Now Microsoft is making it possible to run serious AI workloads directly on your CPU, no GPU required.

What is BitNet.cpp?
BitNet.cpp is Microsoft's official open-source inference framework for its 1-bit large language models, including the BitNet b1.58 family. Unlike traditional 16-bit or even 8-bit models, BitNet uses ternary weights (-1, 0, +1) combined with 8-bit activations. This design drastically reduces memory requirements and power consumption while maintaining competitive performance.

In practical terms, this means:
- Up to 6× faster inference
- Up to 82% lower energy consumption
- The ability to run 100B-parameter models on a single x86 CPU
- Roughly 5–7 tokens per second, about human reading speed

In short: Microsoft is making large-scale AI accessible to everyone.

First Open-Source 1.58-bit Model
Alongside the framework, Microsoft released BitNet b1.58 2B4T, the first fully functional open-source model that uses only 1.58 bits per weight. Despite its ultra-low precision, it performs surprisingly well on reasoning and benchmark tasks. That's why this release is drawing attention: it's small, fast, and effective, not a toy model.

Performance benchmarks show:
- On ARM CPUs: 1.37× to 5.07× faster, with 55–70% less energy use
- On x86 CPUs: 2.37× to 6.17× faster, with 71–82% less energy use

Imagine running a 100B-parameter model on your laptop without burning through your power supply. That's the level of efficiency BitNet.cpp is aiming for.

Getting Started
If you want to try it yourself, setup is straightforward.

Clone the repo
- git clone --recursive https://lnkd.in/dkJhz7YE
- cd BitNet

Create a Conda environment
- conda create -n bitnet-cpp python=3.9
- conda activate bitnet-cpp

Install dependencies
- pip install -r requirements.txt

Download the model
- huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T
...

Why This Matters
For years, the AI industry has been limited by GPU bottlenecks. Want to train or run a large model? Buy NVIDIA cards or rent cloud instances.

Now we're looking at a future where:
- Local-first AI is realistic again
- Developers without GPUs can still work with advanced models
- Energy efficiency becomes a design principle, not an afterthought
- Democratized access replaces "GPU gatekeeping"

This isn't just another AI release. It's a paradigm shift: proof that large-scale AI doesn't have to depend on high-end GPUs.

Microsoft GitHub: BitNet.cpp repository 👉 https://lnkd.in/dDry8f-V

#MicrosoftAI #BitNet #AIResearch #OpenSourceAI #EdgeAI #CPUInference #MachineLearning #DeepLearning #AIInnovation #EfficientAI
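To make the ternary idea above concrete, here is a minimal NumPy sketch of the absmean weight quantization scheme described for BitNet b1.58: weights are scaled by their mean absolute value and rounded to -1/0/+1, while activations are quantized to 8 bits per token. This is only an illustration of the scheme, not bitnet.cpp's optimized kernels, and the function names are mine.

```python
import numpy as np

def ternary_quantize(W: np.ndarray, eps: float = 1e-5):
    """Absmean ternary quantization (BitNet b1.58 style).

    Each weight is scaled by the tensor's mean absolute value, rounded,
    and clipped to {-1, 0, +1}. The scale is kept so matmul results can
    be rescaled afterwards.
    """
    gamma = np.abs(W).mean() + eps                       # per-tensor scale
    W_q = np.clip(np.rint(W / gamma), -1, 1).astype(np.int8)
    return W_q, gamma

def int8_activation_quantize(x: np.ndarray, eps: float = 1e-5):
    """Per-token absmax quantization of activations to 8 bits."""
    scale = 127.0 / (np.abs(x).max(axis=-1, keepdims=True) + eps)
    return np.clip(np.rint(x * scale), -128, 127).astype(np.int8), scale

# Toy usage: a "linear layer" with ternary weights and int8 activations.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)
x = rng.normal(size=(1, 256)).astype(np.float32)

W_q, gamma = ternary_quantize(W)
x_q, s = int8_activation_quantize(x)

# Integer matmul, then undo both scales: x ~ x_q/s and W ~ W_q*gamma.
y = (x_q.astype(np.int32) @ W_q.T.astype(np.int32)) * gamma / s
y_ref = x @ W.T
print("coarse reconstruction, relative error:",
      np.linalg.norm(y - y_ref) / np.linalg.norm(y_ref))
```

The memory win follows directly: each weight needs only ~1.58 bits of information instead of 16, and the matrix multiply reduces to additions and sign flips, which is what lets CPU SIMD kernels keep up.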
More Relevant Posts
Building your own secure, local AI web co-browser in Linux Mint

Hello! LLMs are a fun tool these days, and full of all sorts of interesting (and enterprising) uses. Today I'm going to show you how to build your own fully secure, localised AI pet: a Large Language Model in a contained, sandboxed environment. I'm going to show you how to install it on a mounted external SSD using Podman, and then how to set up its very own local Google Chrome browser to use. The 'chat' window is pictured below. When we are done, your local LLM setup will be fully capable of browsing the web with you. It will have no access to your host PC, but can still quite happily use the web.

The focus of this project is:
* Functionality
* Privacy and Security
* Minimal Hardware Requirements
* Fun

So if you have any need for the above things in an AI-driven web co-browsing service, let's get started. Here's what we will need to begin:
* 1 x external SSD. 500 GB–1 TB SanDisk SSDs are available on Amazon.
* Linux Mint or any similar Ubuntu-based system. 64 GB of RAM / 4 GB of VRAM are the minimum hardware requirements.
* Podman. Podman is a rootless containerisation service that will function as the 'box' your AI lives in. This project will likely conflict with Docker if you have it installed.
* pebkac. This is the web co-browsing service that goes inside the Podman container.

Something to know is that this guide is heavily oriented towards AMD GPUs using Vulkan engine support. If you have an NVIDIA or ROCm-based GPU setup, or don't know what they are, you've got a little bit of extra work to do.

First we need to set up the new SSD. If installing on your regular machine, skip this part. Using an external SSD if you have budget hardware gives a small speed boost, and it also makes the whole pebkac setup (nearly!) portable. Let's get started. This is where some bash knowledge comes in handy; I'll make this as easy as possible.

1. Open a terminal window and paste the code below into it:
# Create mount point with user ownership
sudo mkdir -p /mnt/ssd
sudo chown $USER:$USER /mnt/ssd

2. Edit /etc/fstab (e.g. with nano) and add the line below, replacing /dev/sda1 with your device:
/dev/sda1 /mnt/ssd ext4 defaults,user,exec 0 2

3. Mount the SSD and verify ownership by entering this into the terminal window:
mount /mnt/ssd
ls -la /mnt/ssd

Now let's build the Podman environment.

1. Install Podman & Podman Compose:
sudo apt update && sudo apt install -y podman podman-compose

2. Paste the full text below into your terminal window next. It creates a config file that directs Podman to use the mounted SSD as its storage location. Again, if not using a mounted SSD, skip this step:
# Configure Podman storage location
mkdir -p ~/.config/containers
mkdir -p /mnt/ssd/podman
cat > ~/.config/containers/storage.conf

#genai #shared #ai
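Once the container is up, a quick smoke test from the host is handy. The sketch below is built on assumptions: it presumes the stack publishes an OpenAI-compatible chat endpoint on localhost port 8080, and the port, path, and model name are placeholders (I am not describing pebkac's actual API), so adjust them to whatever your container really exposes. The point it demonstrates is that all traffic stays on 127.0.0.1.

```python
import requests

# Assumptions throughout: pebkac's real port, path, and payload format may differ.
BASE_URL = "http://127.0.0.1:8080"                 # hypothetical published port
ENDPOINT = f"{BASE_URL}/v1/chat/completions"       # OpenAI-style path, if supported

payload = {
    "model": "local-model",                        # placeholder model name
    "messages": [{"role": "user", "content": "Summarise this page for me."}],
}

resp = requests.post(ENDPOINT, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```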
Inside GPUNetIO: When Your GPU Talks Directly to the NIC

✅ GPUDirect RDMA alone: lets the NIC read/write GPU memory directly (zero-copy), but the CPU still drives the NIC (polls, posts WQEs, wakes kernels). Good for throughput, but your hot path still burns host cores and adds wakeup jitter.

✅ GPUNetIO: your CUDA kernel controls RX/TX queues and interacts with the NIC from device code (leveraging GPUDirect RDMA + GPUDirect Async). The GPU polls the NIC and pushes packets through your CUDA pipelines inline, so the CPU can step out of the packet loop. The result: lower control-path latency jitter, fewer CPU cores consumed, and tighter coupling between packet arrival and GPU compute.

🟢 How does GPUNetIO actually work?
You run a persistent CUDA kernel that owns the NIC queues via DOCA GPUNetIO's device-side APIs. The CPU only does initialization (create queues, map memory, start kernel) and then steps out.

🔹 Receive path
- NIC DMA → GPU memory.
- The CUDA kernel running on the GPU polls device-side RX/CQ state that lives in GPU-accessible memory (exposed by GPUNetIO).
- When a descriptor/CQE is ready, the kernel reads the packet directly from VRAM/HBM and continues processing; no CPU notification needed.

🔹 Transmit path
- The CUDA kernel running on the GPU writes TX descriptors in GPU memory and rings the NIC doorbell from the device (GPUDirect Async). The NIC fetches payloads from GPU memory and transmits. The kernel can also observe TX completions device-side.

🔹 Memory visibility / ordering
- In the CPU-driven model, verbs/DPDK ensure host-side memory ordering; your CUDA work waits on host→device signals.
- In the GPU-driven model, the GPUNetIO device API wraps the needed fences/ordering so that when the API tells you an RX entry is valid, the NIC's DMA writes are visible to your kernel. You don't hand-roll PCIe coherency; use the API's dequeue/peek calls and they do the right thing.
- For persistent-kernel polling, you typically use acquire-style reads/GPUNetIO helpers rather than raw relaxed loads, so you don't see stale descriptors.

Key point: there is no CPU wakeup. The kernel is already running, and it notices the DMA completion by observing the RX/CQ structures on the GPU side.

🟢 CPU-driven (GPUDirect RDMA)
- NIC → (DMA) → GPU VRAM
- NIC → (CQE) → Host CQ
- CPU poller → sees CQE → launches or signals CUDA

🟢 GPU-driven (GPUNetIO)
- NIC → (DMA) → GPU VRAM
- CUDA persistent kernel → polls device-side CQ/RX → processes packet → optional TX from device

🟢 Which should you use?
- Choose CPU-driven if you already have a mature DPDK/verbs pipeline, CPU cores are cheap, latency/jitter is fine, and you prefer simple, batched kernel launches.
- Choose GPU-driven when CPU cores are precious, you care about tight p99/p999, or your pipeline needs packet-arrival-triggered compute (security filters, RAN, telemetry, sensor streams). It eliminates the CPU from the hot path and avoids launch/wakeup jitter by using a persistent kernel.
Build terabit-scale distributed systems without a PhD in DPDK

Traditional I/O frameworks can't keep up with 400 GbE NICs or PCIe 5.0 NVMe drives. The industry has solved this with various acceleration SDKs such as DPDK, RDMA, XLIO, io_uring, VFIO, etc. All great, all incompatible. Hardware support is fragmented, development requires understanding many different complex, low-level frameworks, and the cost of swapping out hardware vendors is high.

I've created an open-source, event-driven I/O framework named libevpl to solve this problem. Build your distributed system against libevpl once, then leverage any supported network or storage backend SDK without code modifications. libevpl provides a sockets-like API that will be familiar to most developers, without sacrificing the performance that the underlying acceleration SDKs provide. libevpl reduces the cost of adopting novel acceleration strategies, which benefits both developers and vendors.

Benchmark results from the lab:
* ~400 Gbps from a single core with RDMA
* >200 Gbps from a single core over a single TCP connection
* ~400 Gbps using four cores with one TCP socket each
* 2.2 µs avg round-trip latency for 64B RPCs at QD=1
* 70 million 64B RPC ops/sec on one server while maintaining <7 µs avg RTT

Currently supported backends:
* RDMA via libibverbs (InfiniBand and RoCE)
* NVIDIA XLIO Ultra API for TCP sockets on NVIDIA NICs
* io_uring for block I/O and TCP via the Linux kernel (soon with zero-copy RX)
* VFIO-NVMe for ultra-low-latency NVMe driven directly from user space
* Traditional BSD sockets as a widely supported fallback option

Develop on your laptop and deploy to an HPC cluster.
If you work with asymmetric linear systems, you might be interested in PFLARE, an open-source library now available through PETSc's configure system. PFLARE features polynomial methods and reduction multigrids specifically designed for solving asymmetric linear systems at scale, on both CPUs and GPUs (NVIDIA, AMD and Intel), with interfaces in C, Fortran and Python. Recently I used these methods across the majority of a pre-exascale GPU machine (LUMI-G, currently #9 on the June 2025 Top500 HPC list) and showed scalable solutions to time-independent advection equations (https://lnkd.in/eWnkFJ9t). These approaches are quite general: they apply to a range of asymmetric systems without requiring Gauss-Seidel iterations, and can be used for both time-dependent and time-independent problems, on structured or unstructured grids, with or without lower-triangular structure. Check it out at: https://lnkd.in/e9xYQ-Uc
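For context, here is a minimal petsc4py sketch of the kind of nonsymmetric solve PFLARE targets: a 1D upwind advection operator solved with GMRES. The PFLARE-specific part is hedged; I have not verified the exact preconditioner registration or type string, so the commented lines are assumptions to check against the PFLARE documentation. The rest is standard PETSc usage.

```python
from petsc4py import PETSc

n = 1000  # grid points for a 1D upwind advection operator (nonsymmetric)

# Assemble a first-order upwind discretisation of u_x = f: lower bidiagonal,
# hence strongly asymmetric, which is the kind of system PFLARE targets.
A = PETSc.Mat().createAIJ([n, n], nnz=2)
for i in range(n):
    A.setValue(i, i, 1.0)
    if i > 0:
        A.setValue(i, i - 1, -1.0)
A.assemble()

b = A.createVecLeft()
b.set(1.0)
x = A.createVecRight()

ksp = PETSc.KSP().create()
ksp.setOperators(A)
ksp.setType("gmres")

pc = ksp.getPC()
pc.setType("ilu")  # baseline preconditioner; swap in PFLARE where available
# Assumed usage (unverified; check the PFLARE docs for the exact call and
# type name): after PFLARE's preconditioners are registered with PETSc,
# something like pc.setType("air") would select its AIR-style reduction
# multigrid instead of ILU.

ksp.setFromOptions()
ksp.solve(b, x)
print("iterations:", ksp.getIterationNumber(),
      "residual norm:", ksp.getResidualNorm())
```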
If you've been wrangling Kubernetes clusters for a while, you've likely noticed how CPU and memory optimization has matured into a well-stocked toolkit—open-source staples and commercial heavy-hitters alike, all making it easier to squeeze every last cycle out of your infra. But here's the twist that's keeping many of us up at night: the AI boom is flipping the script on GPUs. Suddenly, every team wants their own chatbot, fine-tuned model, or AI-infused workflow—internal tools humming with inference, external pipelines crunching predictions. Demand is everywhere, and supply? It's a patchwork of shortages across regions and clouds. The fallout? Costs trending sharply upward, with both dev sandboxes and prod environments feeling the pinch. GPUs aren't just pricey; they're a scarcity multiplier, turning what used to be predictable scaling into a budgeting nightmare. What if we borrowed a page from the CPU playbook? We've long shared those resources across workloads to hit utilization targets—why not do the same for GPUs? Tools like time slicing and MIG are proving game-changers, letting you partition a single card into isolated slices for concurrent jobs. Early adopters are reporting utilization jumps that slash bills without skimping on performance, all while keeping latency in check. Curious how pairing this with spot instances could reshape your GPU economics? I broke it down in a quick blog post—give it 3 minutes and see if it sparks any tweaks to your setup. What's one GPU optimization hack that's paid off for you lately? https://lnkd.in/e4d4A8Ce
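To make the economics concrete, here is a small back-of-the-envelope calculation. The prices and utilisation figures are illustrative assumptions, not numbers from the post: it simply shows how partitioning a card (e.g. with MIG) changes cost per job when individual workloads can't saturate a whole GPU.

```python
# Illustrative numbers only: substitute your own cloud pricing and workload mix.
GPU_HOURLY_COST = 4.00     # assumed on-demand price for one full GPU ($/hr)
JOB_GPU_FRACTION = 0.12    # each inference job keeps only ~12% of a card busy
JOBS = 5                   # concurrent always-on workloads across teams

# Without sharing: each job pins a whole GPU regardless of how little it uses.
dedicated_cost = JOBS * GPU_HOURLY_COST

# With MIG / time slicing: jobs are packed onto isolated slices of shared GPUs,
# e.g. a 7-slice MIG layout on one card.
SLICES_PER_GPU = 7
assert SLICES_PER_GPU * JOB_GPU_FRACTION <= 1.0   # the slices still fit the card
gpus_needed = -(-JOBS // SLICES_PER_GPU)          # ceiling division
shared_cost = gpus_needed * GPU_HOURLY_COST

print(f"dedicated GPUs : ${dedicated_cost:.2f}/hr for {JOBS} jobs")
print(f"shared slices  : ${shared_cost:.2f}/hr for {JOBS} jobs")
print(f"savings        : {100 * (1 - shared_cost / dedicated_cost):.0f}%")
```

With these made-up numbers the bill drops roughly 80%; the real figure depends entirely on how fragmented your workloads are and whether latency stays acceptable on a slice.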
LLMs and Decade-Old Hardware

As an avid home lab enthusiast and tinkerer, I have a dedicated server that's basically a 2010-era powerhouse: Intel i7 980X, 24GB DDR3 RAM, a GTX 760, and a GTX 750 Ti. Not exactly cutting edge, but still capable.

I started by diving into research about quantization and Docker. My goal was to use Docker to isolate the environment and use CUDA cores to speed up compute time. After looking through some of the available quantized models, I decided on TinyLlama-1.1B-chat-v1.0 because of its 4-bit quantization and 1.1B parameters, which should fit nicely on my whopping 2GB of VRAM. I then cloned the llama.cpp repo, compiled the binaries, and launched the model using the llama-cli binary, which lets you jump straight into conversation in the CLI.

At first, I ran everything on the CPU. It was slow, exceptionally slow. I tried offloading to the GTX 760, but nothing worked; I kept getting "Unsupported gpu architecture". After some digging, I realized the 760's Kepler architecture isn't compatible with the modern CUDA toolkit. That was a rough moment in my journey, as I thought I'd hit a hard limit and wouldn't be able to utilize the GPUs at all.

Then, many hours later, came an accidental breakthrough. I tried offloading again but didn't specify a GPU, and because the 750 Ti was in the first PCIe slot on my motherboard, CUDA_VISIBLE_DEVICES defaulted to 0 and suddenly it worked. I started offloading layers one by one and eventually realized I could offload all 22 layers to the 750 Ti. That moment was huge. The 750 Ti, despite being arguably lower end than the 760, uses Maxwell 1.0 architecture, which is compatible with the modern CUDA toolkit. The performance boost was wild: it went from 4.1 tokens/sec to 25.5 tokens/sec. That's a 6.2× increase. Tokens started streaming almost instantly. It was a nice reward for the time committed.

From there, I built out the whole ecosystem, which I called Net-760:
- Multiple Docker containers connected via a virtual network
- A custom REST API made with Express.js
- A simple frontend that sends prompts and streams tokens back
- Nginx as a reverse proxy
- Docker Compose YML for consistency and easy start-up

I mounted my repo to each container for an efficient and easy development workflow and committed the images so I could reuse the environments seamlessly. Now I can launch the whole network in seconds and run a fully functional LLM on decade-old hardware.

This project taught me a ton about CUDA, Docker containers, quantization, and GPU architecture, but more than anything, it taught me the value of perseverance and stubbornness in the face of daunting technical pursuits. If you're into home labs, LLMs, or just love tinkering, don't underestimate what's possible with old hardware: not only is old hardware capable, but it's cheap and sometimes even power efficient.

Here is a link to the temporary live demo site: https://lnkd.in/g7u-7FKy
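For anyone who wants to reproduce the layer-offloading experiment without driving the llama-cli binary directly, the Python bindings expose the same knobs. The sketch below uses the llama-cpp-python package; the model filename is a placeholder, and pinning the 750 Ti via CUDA_VISIBLE_DEVICES=0 assumes the same PCIe slot ordering the author describes.

```python
import os

# Pin the Maxwell-era 750 Ti (device 0 in the author's slot ordering) before
# any CUDA context is created; the Kepler-era 760 stays out of the picture.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from llama_cpp import Llama  # pip install llama-cpp-python (built with CUDA support)

llm = Llama(
    model_path="models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=22,   # offload all 22 layers, as in the post
    n_ctx=2048,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello from a 750 Ti."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```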
🚀 Solve the GPU Cost Crisis Today!!!

Headaches from LLMs locking a whole GPU but leaving capacity idle? Frustrated by your GPU cluster's low utilization? We built kvcached (KV cache daemon), an open-source library to improve your GPU cluster's utilization when serving LLMs.

🧩 What it does:
kvcached enables elastic GPU sharing for LLM inference by virtualizing the KV cache. With kvcached, each LLM uses only the GPU memory it actually needs, instead of aggressively reserving a large static allocation in advance.

⚙️ Why it matters:
– 🚫 Eliminates static GPU memory reservation, improving resource utilization
– 🧠 Enables multiple workloads to flexibly run on shared GPUs
– ⚡ Allows finer-grained and more rapid autoscaling for serverless LLM serving
– 🚀 Achieves 1.2×–28× faster time-to-first-token in multi-LLM serving

🌐 kvcached is compatible with mainstream LLM inference engines including sgl-project and vLLM. Try it with one command now: https://lnkd.in/dArTKvnr
Read more in our deep-dive blog post: 📄 https://lnkd.in/gxK8N5QT

kvcached represents our first step toward a GPU operating system: a vision where compute and memory are dynamically shared across models, workloads, and even users.

This project is a joint effort led by Berkeley's Sky Computing Lab (University of California, Berkeley) in close collaboration with Rice University and UCLA, and with valuable input from collaborators and colleagues at NVIDIA, Intel Corporation, Stanford University, and the sgl-project and vLLM communities.

👏 Incredibly grateful to our amazing team: Jiarong Xing, Yifan Qiao, Shan Yu, Xingqi Cui, Mingyuan MA, Yangmin Li, Xinyuan Tong, Yang Wang

We especially thank our advisors, Joseph Gonzalez and Ion Stoica, for their guidance and insightful feedback. We thank everyone who shared feedback, ideas, and support throughout the project's development.

We're warmly inviting collaborators from both academia and industry to join us in building the foundations of elastic GPU infrastructure. Let's make GPUs as flexible, efficient, and shared as CPUs. 💪

#LLMServing #KVCache #GPUOS #GPUSharing #GPUVirtualization #SystemsResearch #DeepLearningInfrastructure #OpenSource #Berkeley #SkyComputing #vLLM #SGLang
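To illustrate the core idea, demand-paged KV cache rather than a static per-model reservation, here is a toy allocator sketch. It is not kvcached's actual API, just the concept it builds on: KV blocks are handed out as sequences grow and returned when requests finish, so co-located models only hold the memory they are actively using.

```python
import math

class PagedKVPool:
    """Toy demand-paged KV cache: a shared pool of fixed-size blocks.

    This is not kvcached's API. Real systems manage GPU memory; this sketch
    only tracks block ownership to show why on-demand allocation beats a
    static per-model reservation.
    """

    def __init__(self, total_blocks: int, block_tokens: int = 16):
        self.free_blocks = list(range(total_blocks))
        self.block_tokens = block_tokens
        self.tokens = {}   # request id -> token count
        self.blocks = {}   # request id -> list of owned block ids

    def append(self, req_id: str, n_tokens: int) -> None:
        """Grow a request's KV cache, taking new blocks only when needed."""
        self.tokens[req_id] = self.tokens.get(req_id, 0) + n_tokens
        need = math.ceil(self.tokens[req_id] / self.block_tokens)
        owned = self.blocks.setdefault(req_id, [])
        while len(owned) < need:
            if not self.free_blocks:
                raise MemoryError("KV pool exhausted: preempt or scale out")
            owned.append(self.free_blocks.pop())

    def release(self, req_id: str) -> None:
        """Finish a request and return its blocks to the shared pool."""
        self.free_blocks.extend(self.blocks.pop(req_id, []))
        self.tokens.pop(req_id, None)

# Two "models" share one pool instead of each reserving half the GPU up front.
pool = PagedKVPool(total_blocks=64)
pool.append("chatbot/req-1", 120)      # takes 8 blocks
pool.append("summarizer/req-7", 500)   # takes 32 blocks
pool.release("chatbot/req-1")          # blocks go straight back to the pool
print(len(pool.free_blocks), "blocks free")
```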
So excited to release kvcached 🔥 GPUs should not sit idle when workloads subside. We built kvcached to make GPUs elastic, efficient, and shareable across LLMs. No more wasting HBM on static and idle KV cache. This is only the beginning of our journey toward a GPU operating system for dynamic and efficient AI infrastructure. Please stay tuned for more updates. I am grateful to work with such an amazing team from Berkeley’s Sky Computing Lab (University of California, Berkeley), Rice University, and UCLA. Special thanks to our advisors Joseph Gonzalez and Ion Stoica for their guidance and support.
GPUs hold inherent edges over conventional CPUs in areas like throughput, power efficiency, and expandability. They're transitioning from auxiliary processors supporting CPUs to core computational powerhouses, eroding CPUs' traditional stronghold in data centers. Yet the Memory Wall remains a key hurdle: GPUs pack thousands to tens of thousands of cores for massive parallel processing, but their HBM memory tops out at mere tens of GBs, clashing with working datasets that span TBs. Computations frequently depend on CPU-mediated I/O exchanges with external storage (e.g., SSDs), creating performance chokepoints.

To cut down on CPU-GPU coordination costs and streamline GPU-to-storage pathways, specialized file systems tailored for GPUs are being developed. These preserve POSIX-like file semantics and interfaces for GPU tasks, enabling seamless, high-performance direct I/O to storage. We could consider two potential strategies:

#1 Develop a streamlined GPU file system acting as a "companion" alongside the host file system. For example, GeminiFS: A Companion File System for GPUs (FAST'25) https://lnkd.in/giPr7V-6

#2 Build a GPU-driven, versatile, and highly scalable file system that sidesteps CPU oversight and relaying. It delegates essential file system functions, like metadata handling and I/O management, directly to the GPU, allowing GPU apps to achieve unmediated, high-parallelism interactions with NVMe storage through standard file APIs. For example, Managing Scalable Direct Storage Accesses for GPUs with GoFS (SOSP'25) https://lnkd.in/gxThTJin

Which one would you pick? 😊
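As a rough illustration of the bounce described above, here is a sketch contrasting the CPU-mediated path (read into host RAM, then copy over PCIe to the GPU) with a GPU-direct path. The direct path uses RAPIDS KvikIO as a stand-in for GPUDirect Storage; it is adjacent to, not the same as, the GeminiFS/GoFS designs, and the exact KvikIO call signature is an assumption to verify against its documentation.

```python
import numpy as np
import cupy as cp

PATH = "dataset.bin"        # placeholder: some large on-disk float32 tensor
N = 1 << 20                 # number of float32 elements to read

# --- CPU-mediated path: storage -> host RAM -> PCIe copy -> GPU HBM ---
host = np.fromfile(PATH, dtype=np.float32, count=N)   # CPU performs the I/O
gpu_via_cpu = cp.asarray(host)                        # extra PCIe hop to HBM

# --- GPU-direct path: storage -> GPU HBM, no host bounce buffer ---
# Assumed KvikIO usage (check the KvikIO docs for the exact signature); with
# GPUDirect Storage enabled, the NVMe DMA can target GPU memory directly.
import kvikio
gpu_direct = cp.empty(N, dtype=cp.float32)
f = kvikio.CuFile(PATH, "r")
f.read(gpu_direct)          # fills the CuPy buffer without a host copy
f.close()

print("paths agree:", bool(cp.allclose(gpu_via_cpu, gpu_direct)))
```

The file-system work cited above goes further: it keeps POSIX-like semantics while letting the GPU itself drive metadata and I/O, instead of relying on a host-side library call like this one.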