Today was a big milestone for us - we launched CUDA Tile IR, a new tile-based programming model for our GPUs.

CUDA Tile IR has two components:
1. cuTile - a Python DSL that dramatically simplifies writing high-performance CUDA kernels.
2. Tile IR - a language-agnostic virtual instruction set that (third party) compilers or DSLs can target.

This is just the beginning. You'll see more features, broader platform support, and continued performance improvements in every CUDA release going forward.

I'd love to hear from developers, researchers, and compiler folks who plan to explore cuTile or Tile IR. In particular, I'm excited to learn about novel algorithms or kernels built on cuTile, and new programming-language or compiler techniques unlocked by Tile IR.

If you're interested, here are some great starting points:
cuTile reference: https://lnkd.in/grmv8C3b
Tile IR specification: https://lnkd.in/grjE7hBG
Blog post 1: https://lnkd.in/gWUZ4sP2
Blog post 2: https://lnkd.in/gaf82Ybb
GPU Programming Insights
-
Sometimes when you set out to solve something small you end up delivering something huge. The team at Q-CTRL just did that with our partners NVIDIA and Oxford Quantum Circuits (OQC), achieving a totally new #GPU-optimized algorithm for the subgraph-isomorphism problem.

One of the toughest challenges for practical scaling of #quantumcomputing is how to map the problem of interest onto the device at hand. Which qubits are best? Which connectivity is most efficient? How can you use mathematical tricks to reduce the number of operations (and hence reduce opportunities for error)? Which parts of the process can be sped up with classical techniques? These questions are all part of a task called compilation, and even though it's less sexy than other areas, it's an actual performance bottleneck for most users in the real world.

We set out to investigate how to speed up certain subroutines with #GPUs, and in the process achieved something even more profound. The underlying problem is called the subgraph-isomorphism problem, which is key to a range of #AI / #machinelearning tasks. There are tons of algorithms for solving it, but most are stubbornly resistant to parallelization, rendering GPUs much less useful than in other areas. Until now.

Working with NVIDIA and OQC, we developed a novel solution that combines insights from the graph-database and analytics community with data-science techniques, and leverages well-established open-source software. Our new approach, named Δ-Motif, replaces traditional backtracking strategies with a data-centric approach: it decomposes graphs into fundamental motifs (small, reusable building blocks like paths and cycles), represents them in tabular form, and models graph processing with relational database operations like merges and filters. (A toy illustration of this join-based style follows at the end of this post.) This shift transforms an inherently sequential problem into one that can be executed in parallel at scale, unlocking new levels of efficiency in graph processing.

In an implementation on NVIDIA GPUs we achieved up to 600X speedups in wall-clock time on test graphs, quantum-algorithm benchmarks (QASMBench), and classical ML benchmarks (SuiteSparse Matrix Collection).

This is an amazing example of how pushing the frontiers of PRACTICAL #quantumcomputing can deliver huge outcomes of much broader appeal. Our team is proud of this development and excited to continue expanding our partnerships with NVIDIA and OQC as we help deliver true #hybridcompute to the #datacenter.

Jin-Sung Kim Oded Green Gerald Mullally Jamie Friel Jensen Huang Atsushi Sugiura

Read more at our blog post: https://lnkd.in/gn-Bsurp
Technical manuscript: https://lnkd.in/gxaJsumG
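To make the data-centric idea concrete, here is a toy pandas example (my own illustration, not the Δ-Motif algorithm or its GPU implementation) that enumerates one small motif, the triangle, using nothing but merges and a closing join; on GPUs the same pattern maps onto dataframe libraries such as cuDF, where every join runs in parallel:

import pandas as pd

# Undirected input graph stored as a table of edges (the "relational" view).
edges = pd.DataFrame({"src": [0, 0, 1, 1, 2, 3],
                      "dst": [1, 2, 2, 3, 3, 4]})

# Keep one orientation per edge (src < dst) so each triangle is found once.
e = edges[edges.src < edges.dst]

# Motif assembly by joins: grow paths a-b-c from two edges sharing vertex b ...
wedge = e.merge(e, left_on="dst", right_on="src", suffixes=("_ab", "_bc"))

# ... then close the cycle with a third join that requires edge a-c to exist.
tri = wedge.merge(e, left_on=["src_ab", "dst_bc"], right_on=["src", "dst"])

print(tri[["src_ab", "dst_ab", "dst_bc"]])  # each row is one triangle (a, b, c)

Everything here is a table operation with no backtracking, which is exactly the property that lets the work spread across thousands of GPU threads.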
-
nvidia now releases its most optimized inference kernels through a PhD student's open-source project. here's a breakdown of FlashInfer:

FlashInfer is a GPU kernel library built specifically for LLM serving. it won Best Paper at MLSys 2025, powers both SGLang and vLLM, and NVIDIA now actively ships TensorRT-LLM kernels through it. the creator, Zihao Ye, built it during his PhD at UW and now works at NVIDIA full-time.

LLM serving has a combinatorial explosion of attention kernels. every combination of KV-cache layout (paged, radix tree, tree masks), attention variant (GQA, MLA, RoPE-fused, sliding window), and batch mode (prefill, decode, append, shared prefix) needs a different kernel.

FlashInfer's insight was: all KV-cache layouts are special cases of block-sparse matrices. paged attention is just block-sparse with page_size as block width. radix tree? block-sparse. tree attention for speculative decoding? block-sparse. one abstraction can replace what used to be separate kernel implementations. (a toy sketch of the paged-to-block-sparse mapping is at the end of this post.) then JIT compilation handles the variant explosion, in the form of CUDA/CUTLASS templates that get specialized at runtime.

there are two other major innovations built on top of FlashInfer:

1. cascade attention
when multiple requests share a prefix (document QA, system prompts), FlashInfer decomposes attention into two stages: a multi-query kernel for the shared prefix (loaded once into SMEM, reused across all queries) and a batch decode kernel for unique suffixes. results merge using an associative operator on partial attention states. 31x speedup over vLLM's PagedAttention for 32K-token shared prefixes at batch size 256.

2. plan/run scheduling for CUDAGraph
LLM serving has dynamic sequence lengths. CUDAGraphs need static configurations. FlashInfer solves this with a two-phase pattern: plan() inspects request shapes and computes balanced scheduling metadata, run() launches kernels. you plan once per decode step, then replay across all transformer layers.

FlashInfer is an amazing project that i deeply respect, so i also want to share some links for anyone who wants to go deeper:
- paper (MLSys 2025 Best Paper): https://lnkd.in/gc_CTbnf
- github: https://lnkd.in/gwfQ8B72
- NVIDIA blog: https://lnkd.in/gzs_uquk
- cascade attention deep dive: https://lnkd.in/gHGqdNTV
- docs: https://docs.flashinfer.ai
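a tiny numpy sketch of that core claim: a paged KV cache is just block-sparse metadata. the page tables below are made up, and this mirrors (but is not) the CSR-style indptr/indices arrays a block-sparse attention kernel would consume:

import numpy as np

page_size = 16  # tokens per KV page = the block width of the sparse layout

# made-up page tables: which physical pages hold each request's KV cache
page_tables = [
    [3, 7, 2],      # request 0: 3 pages
    [5],            # request 1: 1 page
    [0, 1, 4, 6],   # request 2: 4 pages
]

# CSR-style block-sparse metadata: "row" i = request i's query attending to
# its own pages; indptr delimits each row, indices lists its page (block) ids
indptr = np.zeros(len(page_tables) + 1, dtype=np.int32)
indptr[1:] = np.cumsum([len(pt) for pt in page_tables])
indices = np.concatenate([np.asarray(pt, dtype=np.int32) for pt in page_tables])

print(indptr)   # [0 3 4 8]
print(indices)  # [3 7 2 5 0 1 4 6]

# an attention kernel walks row i's blocks: for each page id it loads that
# page_size-wide KV tile and folds it into an online softmax. a radix tree or
# a speculative-decoding tree just produces different indptr/indices; the
# kernel itself doesn't change.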
-
AI just delivered a computation breakthrough: translating PyTorch to CUDA isn't just a human problem anymore.

Modern AI relies on GPU-optimized CUDA kernels, but handcrafting these requires rare expertise spanning algorithms, hardware, and memory hierarchies. This bottleneck now has a scalable solution: The AI CUDA Engineer.

Sakana AI's new framework uses Large Language Models (LLMs) to convert PyTorch operations into correct CUDA kernels, and evolutionary optimization to iteratively maximize runtime efficiency.

Key innovations:
1. Automatic translation (91% success rate) via error feedback loops
2. LLM-guided evolution combining model-generated variants with profiling data
3. Innovation Archive: a repository of 17K optimized kernels that seed future optimizations via RAG

The results? A median 1.52x speedup over native PyTorch, with extreme gains like 54x faster diagonal matrix multiplications (a toy example of that kind of rewrite follows below). Their system even translated and optimized full ResNet architectures into CUDA, achieving 1.44x speedups via fused shared-memory kernels.

Why this matters: LLMs are moving beyond code generation to optimization, mastering hardware-specific constraints without human priors. With the system producing kernels faster than torch.compile for 72% of PyTorch operations, democratizing GPU programming is no longer hypothetical.

It's open for everyone: you can explore their open-sourced kernels or probe limitations 𝘳𝘪𝘨𝘩𝘵 𝘯𝘰𝘸. For industries like agriculture seeking location-specific AI, or anyone battling CUDA complexity, automating kernel engineering might just be the compute multiplier you need.

For more on the AI CUDA Engineer and other AI highlights, check out this week's LLM Watch: https://lnkd.in/dfPZhpt6
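To make "54x faster diagonal matrix multiplication" tangible, here is an illustrative PyTorch-level rewrite of that operation (my own toy example, not a kernel generated by the AI CUDA Engineer): multiplying by a diagonal matrix never needs a dense matmul, because scaling rows by broadcasting is mathematically identical and avoids materializing the N×N diagonal.

import torch

N = 4096
device = "cuda" if torch.cuda.is_available() else "cpu"
d = torch.randn(N, device=device)
X = torch.randn(N, N, device=device)

# naive: build an N x N diagonal matrix and run a full matmul, O(N^3) work
out_naive = torch.diag(d) @ X

# equivalent rewrite: scale each row of X by d, O(N^2) work and no N x N temporary
out_fast = d.unsqueeze(1) * X

assert torch.allclose(out_naive, out_fast, atol=1e-4)

The broader point is that many "kernel optimizations" are really algorithmic rewrites like this one, which is exactly the search space an LLM-plus-evolution loop can explore automatically.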
-
Last week at PyTorch Conference 2025, we announced we are open sourcing torchcomms and NCCLX/CTran. Today, we share more details about the NCCLX/CTran design and how we used them in production GenAI training and inference. Check out our white paper on this topic: https://lnkd.in/gySuXi6Y

Some features we covered in this paper:
* Host-driven collectives
* Zero-copy data transfer
* CTran/Network co-design (DQPLB)
* Zero-copy and SM-free Send/Receive for PP
* RMA Put for TP
* Fault-tolerant AllReduce
* GPU-resident collectives for EP
* Low-latency optimization
* Scalable initialization in training
* GPU memory management for comms
* Fault localization and performance observability
* CPU emulation

This paper covers years of innovation and production experience from the GPU communication teams at Meta, supporting training and inference for generations of Llama models. Hope you enjoy reading it!
-
GPU Memory Offloading with CXL/NVMe + PIM

The convergence of CXL-to-NVMe memory tiering with Processing-In-Memory (PIM) technology represents a paradigm shift in GPU computing. Instead of being constrained by limited HBM capacity, GPUs could seamlessly access a multi-tier memory hierarchy spanning from on-package HBM through CXL-attached memory down to NVMe storage, with each tier potentially embedding PIM capabilities.

The game-changer is that PIM units at each level (HBM-PIM for near-GPU compute, CXL-PIM for memory-side preprocessing, and computational storage in NVMe) would filter and process data locally, dramatically reducing data movement. For massive AI workloads, this means attention mechanisms could execute in CXL-PIM while embeddings are searched directly in storage, with the GPU orchestrating rather than computing everything.

The technical challenges are substantial, requiring new coherence protocols, tier-aware programming models, and intelligent runtime systems for data placement. However, early prototypes from Samsung's CXL-SSD and SK Hynix's PIM-enabled memory suggest this vision could materialize by 2027, fundamentally solving the memory wall problem that currently limits large-scale AI and HPC applications.
-
Modal reverse-engineered Flash Attention 4. Here's what you need to know.

FA4 is ~20% faster than Nvidia's own closed-source kernels. No official technical report yet. Modal read the source code and broke it down.

The biggest change isn't the math. It's the architecture. FA4 runs an async pipeline with 5 specialized warp types:
→ Load warp: moves Q, K, V tiles from GPU RAM into shared memory
→ MMA warp: runs the actual matmuls on Tensor Cores
→ 8 Softmax warps: normalize attention scores
→ 4 Correction warps: rescale outputs as normalization shifts
→ 1-2 Epilogue warps: write final outputs back to GPU RAM

Each warp type handles one stage. They run concurrently. Producer-consumer model with barrier synchronization. Modal's words: "it vaguely resembles a microservices diagram."

The two math tricks:

1️⃣ Fast approximate exponentials
Replace hardware SFU exponentiation with a cubic polynomial. Same bf16 precision. Avoids the SFU queueing bottleneck. Based on a 1999 paper by Schraudolph.

2️⃣ Smarter online softmax
Old: rescale every time a new max appears. New: only rescale when numerical stability is actually threatened. 10x fewer rescaling operations. (A toy version of this trick is sketched below.)

Why this matters beyond FA4: Triton's team gave up writing Blackwell attention and built Gluon at a lower level instead. GPU programming is shifting from "write a kernel" to "architect an async pipeline across specialized hardware."

💾 Save this. The full breakdown is worth the read → https://lnkd.in/eRKBFk5c
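To make trick 2️⃣ concrete, here's a minimal NumPy sketch of single-query attention with lazy rescaling (my own illustration of the idea, not FA4's kernel; the threshold and loop structure are assumptions): classic online softmax rescales the accumulator every time the running max moves, while this version only rescales when a block max has grown enough to threaten the range of exp. The math stays exact because the final division by the running sum cancels any fixed shift.

import numpy as np

def lazy_online_softmax_attention(q, K, V, block=128, threshold=20.0):
    # single-query attention over K/V processed in blocks, with lazy rescaling
    d = q.shape[0]
    m = -np.inf            # reference max used inside exp()
    l = 0.0                # running softmax denominator (in the m frame)
    acc = np.zeros(V.shape[1])  # running numerator: sum_j exp(s_j - m) * V[j]

    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q / np.sqrt(d)   # scores for this block
        block_max = s.max()

        if block_max > m + threshold:                 # rescale only when exp() could blow up
            scale = np.exp(m - block_max)             # 0.0 on the very first block (m = -inf)
            acc *= scale
            l *= scale
            m = block_max

        p = np.exp(s - m)                             # safe: s - m <= threshold here
        acc += p @ V[start:start + block]
        l += p.sum()

    return acc / l

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=(64,)), rng.normal(size=(4096, 64)), rng.normal(size=(4096, 64))
scores = K @ q / 8.0
ref = (np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()) @ V
assert np.allclose(lazy_online_softmax_attention(q, K, V), ref)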
-
NVIDIA just dropped CUDA 13.2, and it's worth paying attention to. A few things that stood out to me:

→ CUDA Tile, the new tile-based programming model, now works on Ampere and Ada GPUs (not just Blackwell). That's a massive installed base suddenly gaining access to a fundamentally better way to write GPU kernels.

→ The cuTile Python DSL lets you write GPU kernels in ~15 lines of Python that rival 200 lines of hand-tuned CUDA C++. Flash Attention in Python, within 10% of peak GPU performance. That's not a gimmick. (A plain-NumPy illustration of the "think in tiles" idea follows below.)

→ NVFP4 precision on Blackwell Ultra is delivering 36× inference speedups for DeepSeek-R1 and 6.3× for image generation. The cost implications are real: 97% fewer GPU-hours for the same workload.

→ Grouped GEMM with MXFP8 gives 4× speedups for Mixture-of-Experts models. MoE is everywhere right now, so this matters.

I'm still working through all of it; this is a dense release. But if you're building anything on GPUs, it's worth reading the release notes properly, not just the headlines.

Still a lot to learn. That's what makes this space interesting.

#CUDA #GPU #AI #MachineLearning #NVIDIA #HPC
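For anyone (like me) still learning the model: here is a plain-NumPy sketch of what "thinking in tiles" means in the abstract. To be clear, this is not cuTile syntax; it only shows the tile decomposition that a tile-level programming model lets you express directly, and that the compiler then maps onto tensor cores and shared memory for you.

import numpy as np

TILE = 32  # tile edge; a real tile compiler picks this to fit shared memory / tensor cores

def tiled_matmul(A, B):
    # C = A @ B computed tile-by-tile instead of element-by-element.
    # Each (i, j) output tile is an independent unit of work; the inner loop
    # over k streams tiles of A and B through fast memory.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % TILE == 0 and N % TILE == 0 and K % TILE == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, TILE):
        for j in range(0, N, TILE):
            acc = np.zeros((TILE, TILE), dtype=A.dtype)
            for k in range(0, K, TILE):
                acc += A[i:i+TILE, k:k+TILE] @ B[k:k+TILE, j:j+TILE]
            C[i:i+TILE, j:j+TILE] = acc
    return C

rng = np.random.default_rng(0)
A, B = rng.normal(size=(128, 96)), rng.normal(size=(96, 64))
assert np.allclose(tiled_matmul(A, B), A @ B)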
-
I've been closely studying DeepSeek's groundbreaking approach to GPU optimization. Their GPU programming innovations provide key insights for scaling AI systems.

𝗧𝗵𝗲 𝗰𝗵𝗮𝗹𝗹𝗲𝗻𝗴𝗲 𝗮𝗻𝗱 𝘁𝗵𝗲 𝘀𝗼𝗹𝘂𝘁𝗶𝗼𝗻
DeepSeek faced strict limitations on GPU access in China. Instead of accepting these constraints, they chose innovation. They went beyond standard NVIDIA NCCL limitations. Their solution? Custom communication scheduling at the streaming multiprocessor (SM) level. This wasn't a minor tweak. It was a complete reimagining of GPU resource utilization.

𝗧𝗲𝗰𝗵𝗻𝗶𝗰𝗮𝗹 𝗱𝗲𝗲𝗽 𝗱𝗶𝘃𝗲
DeepSeek arranged precise scheduling for particular SMs. They treated GPU cores as a dynamic resource pool. Each core could switch roles fluidly:
- Running the model
- Handling AllReduce operations
- Managing AllGather communications
They accomplished this by using PTX, a low-level instruction set. While most teams stay within Python and PyTorch, DeepSeek proved the value of hardware-level optimization.

𝗠𝗼𝗘 𝗶𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻
Their implementation of MoE stands out. Traditional systems use 8-16 experts, activating two at a time. DeepSeek went bigger: 𝟮𝟱𝟲 𝗲𝘅𝗽𝗲𝗿𝘁𝘀, with eight active simultaneously. This 32:1 ratio dwarfs the typical 4:1 approach. (A minimal sketch of this kind of top-k expert routing follows below.) This required solving complex challenges:
- Load balancing across GPU nodes
- Communication scheduling
- Resource allocation
- Expert utilization optimization

𝗞𝗲𝘆 𝗜𝗻𝗻𝗼𝘃𝗮𝘁𝗶𝗼𝗻𝘀
1. They replaced the traditional auxiliary loss with custom routing mechanisms.
2. Expert utilization became dynamically balanced.
3. Load distribution across GPUs reached new levels of sophistication.
4. SM scheduling achieved unprecedented optimization.

𝗜𝗻𝘀𝗶𝗴𝗵𝘁𝘀
DeepSeek's journey reinforces crucial principles:
1. Innovation through constraint: Limited access to GPU resources necessitated creative problem-solving. The restrictions served as catalysts for innovative approaches.
2. Speed versus sustainability: The intricate nature of their codebase yields significant outcomes, but that complexity raises real questions about the long-term maintainability of such systems.
3. Challenge convention: The team critically re-evaluated the conventional use of NCCL and the standard way MoE is implemented. In AI, it is imperative that we consistently question established doctrine.

How do you balance optimization versus maintainability? What unconventional approaches have you discovered?
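For reference, a minimal PyTorch sketch of top-k expert routing at that width (256 experts, 8 active per token). This is a generic illustration of the routing pattern only, not DeepSeek's custom router, load balancer, or PTX-level scheduling; the layer sizes are made up to keep it runnable.

import torch
import torch.nn.functional as F

num_experts, top_k, d_model, d_ff = 256, 8, 64, 256
tokens = torch.randn(512, d_model)                      # a batch of token activations

router = torch.nn.Linear(d_model, num_experts, bias=False)
experts = torch.nn.ModuleList(
    torch.nn.Sequential(torch.nn.Linear(d_model, d_ff),
                        torch.nn.GELU(),
                        torch.nn.Linear(d_ff, d_model))
    for _ in range(num_experts)
)

# 1. the router scores every token against every expert and keeps the top 8
logits = router(tokens)                                 # (512, 256)
weights, chosen = torch.topk(logits, top_k, dim=-1)     # (512, 8) each
weights = F.softmax(weights, dim=-1)                    # renormalize over the chosen 8

# 2. dispatch: each expert processes only the tokens routed to it
out = torch.zeros_like(tokens)
for e in range(num_experts):
    token_idx, slot = (chosen == e).nonzero(as_tuple=True)  # which tokens picked expert e
    if token_idx.numel() == 0:
        continue
    out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * experts[e](tokens[token_idx])

# each token touches only 8 of 256 expert FFNs, so per-token compute is ~1/32
# of the total expert parameters - the 32:1 ratio described above.
print(out.shape)  # torch.Size([512, 64])

The hard part in production is not this loop but everything around it: balancing how many tokens land on each expert, overlapping the all-to-all communication with compute, and keeping every GPU busy, which is where the SM-level scheduling work comes in.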
-
CUDA 13.1 is here. It is the biggest expansion of CUDA since it launched in 2006.

We are introducing CUDA Tile, a new way to program GPUs that makes powerful AI and accelerated computing easier for more developers to use.

CUDA is the foundation for developers, researchers, and organizations to solve challenges and drive economic growth. Their work is transforming healthcare, drug discovery, scientific research, and manufacturing. Over 6M developers and 4,000 apps run on CUDA.

CUDA Tile in CUDA 13.1 gives developers a simpler way to tell the GPU what to do. They can think in tiles of data instead of handling every low-level detail, while the compiler works to keep performance high on modern GPUs.

Why CUDA matters for non-developers:
• Doctors can run more accurate scans and diagnostics.
• Researchers can test more drug candidates, faster.
• Manufacturers can run smarter, safer factories.
• Companies can build AI services that feel instant instead of slow.

Stephen Jones, NVIDIA CUDA Architect, put it best: what excites him most about CUDA Tile is not just what it does on paper, but what developers will build with it in ways the team never imagined.

As CUDA grows, so does what developers can build. CUDA 13.1 and CUDA Tile expand the foundation of NVIDIA's computing platform and support the next generation of AI growth, from labs and hospitals to factories and data centers.

Go deep in today's technical blog: