GPU matrix multiplication may be the most expensive algorithm in existence: it is the main operation that OpenAI, Anthropic, and Meta spend billions of dollars of compute on. There are only 8 kernel optimizations you need to understand to reach 93.7% of the performance of NVIDIA's state-of-the-art cuBLAS library. In this thread, we'll walk through kernels that get progressively more performant, following an Anthropic engineer's blog post and the attached diagram.

Kernel 1: Simply multiplies two matrices. We use CUDA's grid, block, and thread hierarchy to assign each thread a unique entry in the result matrix C. This works, but only gets us 309 GFLOPs/s (1.3% of an A6000 GPU's potential); we can do much better.

Kernel 2: Enables global memory coalescing by using "warps" (groups of threads). Threads in the same warp can group their memory accesses into one transaction, which dramatically improves memory throughput (110 GB/s vs. 15 GB/s). Result: 1986 GFLOPs/s (8.5% of cuBLAS).

Kernel 3: Utilizes on-chip shared memory (SMEM). SMEM bandwidth is much higher than global memory bandwidth (12,080 GiB/s vs. 750 GiB/s). We load chunks of A and B into SMEM and then perform as much work as possible on them. Result: 2980 GFLOPs/s (12.8% of cuBLAS).

Kernel 4: Uses 1D blocktiling to calculate multiple results per thread. It works like the previous kernel but adds an inner loop so each thread computes multiple entries of C (it does more work in SMEM), with a 4 KB SMEM cache per block. Result: 8474 GFLOPs/s, ~3x faster than the last (36.5% of cuBLAS).

Kernel 5: Increases arithmetic intensity via 2D blocktiling. We compute an 8x8 grid of results per thread, leveraging shared memory and local registers to reduce global memory accesses. This offers another ~2x performance boost. Result: 15971 GFLOPs/s (68.7% of cuBLAS).

Kernel 6: Vectorizes memory accesses. The key is to transpose loads from A, enabling 128-bit load instructions (LDS.128) instead of 32-bit loads, for more efficient data movement. Result: 18237 GFLOPs/s (78.4% of cuBLAS).

Kernel 7: Tunes the parameters that control how much data we cache in SMEM and registers. We use a bash script to search all valid combinations and find the optimal settings. Result: 19721 GFLOPs/s (84.8% of cuBLAS).

Kernel 8: Adds "warptiling", yet another level of tiling on top of blocktiling and threadtiling. Warptiling lets different warps execute in parallel on different warp schedulers, leveraging the hardware for even more parallelism. Result: 21779 GFLOPs/s (93.7% of cuBLAS).

From reading the original post, I learned that optimizing GPU kernels requires a deep understanding of the hardware and of memory access patterns. The basics are simple and get you most of the way there (the author got ~80% of the performance in 2 weekends); it took another 4 weekends to get the last 14% (classic power law). For much more in-depth explanations, with helpful diagrams and code snippets, check out the original post here, it's really interesting: https://lnkd.in/gi-y4NFB
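For concreteness, here is a minimal sketch of what a naive Kernel 1 typically looks like; the exact indexing, block size, and launch configuration in the original post may differ, and the names below are illustrative:

```cuda
// Naive SGEMM sketch: C = alpha*A*B + beta*C, one thread per output element.
#include <cuda_runtime.h>

__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float *A, const float *B,
                            float beta, float *C) {
  // Each thread owns one (row, col) entry of C.
  const int row = blockIdx.y * blockDim.y + threadIdx.y;
  const int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < M && col < N) {
    float acc = 0.0f;
    for (int k = 0; k < K; ++k) {
      acc += A[row * K + k] * B[k * N + col];  // A is MxK, B is KxN, row-major
    }
    C[row * N + col] = alpha * acc + beta * C[row * N + col];
  }
}

// Example launch: 32x32 threads per block, one block per 32x32 tile of C.
void launch_sgemm_naive(int M, int N, int K, float alpha,
                        const float *A, const float *B, float beta, float *C) {
  dim3 block(32, 32);
  dim3 grid((N + 31) / 32, (M + 31) / 32);
  sgemm_naive<<<grid, block>>>(M, N, K, alpha, A, B, beta, C);
}
```

Every later kernel in the list keeps this same math and only changes how the data is staged: coalesced loads, SMEM tiles, per-thread tiles, and so on.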
How to Optimize Performance Using CUDA
Explore top LinkedIn content from expert professionals.
Summary
CUDA is a parallel computing platform from NVIDIA that helps software run much faster by using the graphics card’s power for intense calculations. Learning how to optimize performance using CUDA means understanding both the hardware and how your code interacts with it, so you can get the most out of your GPU for tasks like machine learning or scientific computing.
- Streamline memory movement: Minimize slow data transfers between the CPU and GPU by keeping data on the GPU for as long as possible, and use features like shared memory and pinned memory to speed up processing.
- Batch and parallelize tasks: Process data in larger batches and use parallel data loading whenever possible, so the GPU cores stay busy and no time is wasted waiting for individual tasks.
- Fuse operations and tune kernels: Combine several steps into one computation (known as kernel fusion) and adjust how your code accesses memory to reduce redundancy, making the most of the GPU’s design and improving speed.
-
I was staring at a loading spinner on an H100, and it made no sense. We pay a premium for the world's fastest hardware. We spin up instances on GCP or hit endpoints on AWS Bedrock, expecting instant intelligence. Yet, there I was waiting! The latency wasn't just "slow" - it felt wrong.

It's the classic abstraction trap. We've piled so many convenient layers on top of the GPU (frameworks, managed runtimes, safety checks) that we've suffocated the hardware. The raw compute power of the H100 is there, screaming to go fast, but it's stuck in traffic, waiting for the driver to tell it where the memory is.

So, I stopped waiting and started digging. I decided to bypass the standard orchestration layer. No more generic kernels launching one by one. I wanted to see what happens when you treat the GPU less like a black-box API and more like a raw memory device. I started prototyping a custom driver layer, using C++ and CUDA to handle the memory hierarchy directly.

The realization hit hard: the bottleneck isn't the math. It's the movement. Standard tools are designed for safety and ease of use, which means they constantly shuttle data back and forth between High Bandwidth Memory (HBM) and the compute cores. It's inefficient. By applying kernel fusion directly at the CUDA level, you stop that "ping-pong": you keep the data alive in on-chip memory (SRAM) and let the cores feast on it without waiting for the slow trip back to global memory.

We don't need faster chips to solve the latency problem. We need smarter drivers. [https://tensornative.io] (TensorNative - Eliminate the RAG bottlenecks)
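To illustrate the fusion idea described above (this is a generic sketch, not the author's driver layer; all names are invented), here are two pointwise kernels collapsed into one so the intermediate value never makes the round trip through global memory:

```cuda
#include <cuda_runtime.h>

// Unfused version: two launches, and the intermediate `tmp` buffer is
// written to and re-read from global memory (HBM).
__global__ void scale_kernel(const float *x, float *tmp, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) tmp[i] = a * x[i];
}
__global__ void add_bias_kernel(const float *tmp, float *y, float b, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = tmp[i] + b;
}

// Fused version: one launch, the intermediate value lives in a register,
// so each element is read once from HBM and written once.
__global__ void scale_add_bias_fused(const float *x, float *y,
                                     float a, float b, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float t = a * x[i];  // stays on-chip, never touches HBM
    y[i] = t + b;
  }
}
```

For memory-bound pointwise chains like this, cutting the HBM traffic in half is roughly a 2x speedup, even though the arithmetic is unchanged.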
-
I'm happy to share a new coding tutorial for doing fast matrix multiplication on #NVIDIA Hopper GPUs! This covers the warpgroup matrix-multiply-accumulate (WGMMA) instruction that specifically targets the Tensor Cores on Hopper GPUs. Using tools from the CUTLASS library, we go into detail on all aspects of correctly invoking WGMMA as a primitive for matmul when writing a CUDA kernel -- how tensor data should be laid out in memory for WGMMA, how to use CUTLASS to define these layouts of your data, and how to synchronize WGMMA as an async instruction to guard against race conditions and ensure correct behavior of your kernel. If you've read the blog post on FlashAttention-3, you'll know how heavily FA-3 exploits WGMMA -- both in terms of its higher throughput and asynchronous capabilities -- to achieve its impressive performance gains. Our hope is that this tutorial can help similarly unlock the potential of the Hopper architecture when coding up your own projects and research ideas! We're also planning at least two followups to this tutorial - one covering the overall structure of an efficient GEMM kernel with a focus on copy-compute overlapping techniques such as warp specialization, and another on persistent kernels and the Stream-K algorithm for GEMM. Work done in collaboration with my colleagues at Colfax and Hieu Pham. https://lnkd.in/g-tsnnha
-
This is how I reduced the latency of my PyTorch code by 82% using only Python & PyTorch. No fancy tools involved!

The problem? During inference, I am chaining 5 DL models to process ~25k image loads. The script takes around ~4 hours to run. The problem is that this isn't a batch job that runs overnight: various people across the company needed it to run in "real time" multiple times a day.

The solution? The first thing that might come to your mind is to reach for some fancy optimizer (e.g., TensorRT). Even though that should be done at some point, first you should ask yourself:
- I/O bottlenecks: reading & writing images
- preprocessing & postprocessing - can it be parallelized?
- are the CUDA cores used at their maximum potential?
- is the bandwidth between the CPU & GPU throttled?
- can we move more computation to the GPU?

That being said, here is what I did to decrease the latency of the script by 82%:

1. Batched the inference samples. Batching is valuable for training but also powerful for speeding up inference time; otherwise, you waste your GPU's CUDA cores. Instead of passing samples through the models one at a time, I now process 64.

2. Leveraged PyTorch's DataLoader. This has 2 main advantages: parallel data loading & preprocessing on multiple processes (NOT threads), and copying your input images directly into pinned memory (avoiding a CPU -> CPU copy operation).

3. Moved as much of the postprocessing as possible to the GPU. I saw that the tensor was moved to the CPU too early and mapped to a NumPy array. I refactored the code to keep it on the GPU as much as possible, which had 2 main advantages: tensors are processed faster on the GPU, and at the end of the logic I had smaller tensors, resulting in smaller transfers between the CPU & GPU.

4. Multithreading for all my I/O write operations. For I/O bottlenecks, using Python threads is extremely powerful. I moved all my writes under a ThreadPoolExecutor, batching my write operations.

Note that I used only good old Python & PyTorch code. → When the code is poorly written, no tool can save you. Only now is it time to add fancy tooling, such as TensorRT.

So remember, to optimize your PyTorch code by 82%:
1. Batch the inference samples
2. Leverage PyTorch's DataLoader
3. Move as much of the postprocessing as possible to the GPU
4. Use multithreading for all I/O write operations

#machinelearning #mlops #datascience
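The pinned-memory point in step 2 has a direct analogue at the CUDA level, which is what DataLoader's pin_memory option sets up for you under the hood. A minimal host-side sketch (buffer names and sizes are illustrative, not from the post) of why pinned allocations help, assuming the standard CUDA runtime API:

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
  const size_t bytes = 64ull << 20;  // 64 MiB payload, arbitrary for the example

  float *pageable = static_cast<float *>(std::malloc(bytes));  // ordinary host memory
  float *pinned = nullptr;
  cudaMallocHost(&pinned, bytes);    // page-locked (pinned) host memory

  float *device = nullptr;
  cudaMalloc(&device, bytes);

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Pageable source: the driver must first stage the data through an
  // internal pinned buffer, so the copy is slower and effectively synchronous.
  cudaMemcpy(device, pageable, bytes, cudaMemcpyHostToDevice);

  // Pinned source: DMA can read the buffer directly, and the asynchronous
  // copy can overlap with compute running on other streams.
  cudaMemcpyAsync(device, pinned, bytes, cudaMemcpyHostToDevice, stream);
  cudaStreamSynchronize(stream);

  cudaStreamDestroy(stream);
  cudaFree(device);
  cudaFreeHost(pinned);
  std::free(pageable);
  return 0;
}
```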
-
Cursor rewrote their entire MoE layer from scratch in pure CUDA and PTX. they got a 3.5x MoE layer speedup and 1.5x end-to-end training speedup on Blackwell. let's break down what they did:

initially, they tried just quantizing to naive FP8 but this gave them no speedup. on Blackwell, quantizing matrices before feeding them to an FP8 matmul consumes roughly 40% of the matmul time. when you include transpose-quantization for backward passes, it jumps to 76%. you get 2x faster matmul but spend nearly the same time just preparing the inputs. MXFP8 training can actually be slower than BF16 if you don't fuse the quantization.

it gets worse on Blackwell specifically. on Hopper, tensor core results accumulate in registers, so you can pipeline dequantization with CUDA cores while the next matmul runs. on Blackwell, results go into a new on-chip memory called TMEM. to do any arithmetic on the accumulator, you transfer from TMEM to registers, process with CUDA cores, write back, and wait. Cursor measured dequantization taking 1.76x the matmul time on Blackwell (vs 1.03x on Hopper). they couldn't even beat Hopper's realistic FP8 throughput with any variation of this approach.

the fix is to not dequantize at all. Blackwell's tcgen05.mma block_scale PTX instruction handles MXFP8 block scaling entirely in hardware, inside the tensor cores. no TMEM-to-register transfers, no CUDA core arithmetic. the scaling factors load into TMEM and get consumed during the matrix multiply itself.

but you still need to quantize the inputs. existing kernels from TransformerEngine and TorchAO run at ~4.5 TB/s and produce scale factors in the wrong memory layout, requiring a separate reshape kernel. Cursor built a quantization kernel sustaining 6.2+ TB/s that writes scales directly in the hardware-expected packed layout. they also fused quantization into SwiGLU's epilogue, so activations get quantized as they flow through the activation function. no BF16 round-trip through HBM.

for grouped GEMM (the actual MoE operation), they beat DeepSeek's DeepGEMM at 0.43ms vs 0.67ms for forward/dgrad. that benchmark excludes DeepGEMM's quantization time, since DeepGEMM doesn't ship optimized quantization kernels. the real-world gap is larger.

Cursor uses MXFP8 with 32-element block scaling (FP8E4M3 elements, E8M0 scale factors). DeepSeek V3 used 128-element blocks for the A matrix. finer blocks = better accuracy but more scale factors to manage. Cursor verified 32-block MXFP8 converges nearly identically to BF16.

MoE forward went from 25.96ms (Blackwell BF16) to 9.45ms. backward from 59.17ms to 17.04ms. end-to-end: 24k tokens/GPU vs 16k on Blackwell BF16.

the kernel was written by Stuart Sul (ML at Cursor), and the full link is in the comments.
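To make the 32-element block-scaling format concrete, here is a deliberately simplified CUDA sketch of quantizing a float tensor to FP8 E4M3 with one power-of-two (E8M0-style) scale per 32-element block. This only illustrates the numerics; it is nothing like Cursor's 6.2 TB/s kernel, all names are invented, the output layout is naive, and it assumes a toolkit/GPU with FP8 support (cuda_fp8.h, CUDA 11.8+):

```cuda
#include <cuda_fp8.h>
#include <cuda_runtime.h>
#include <math.h>

// One thread handles one 32-element block: compute the block's absolute
// maximum, pick a power-of-two scale so magnitudes fit E4M3's max normal
// value (448), then cast the scaled elements to FP8.
__global__ void quantize_mxfp8_blocks(const float *x, __nv_fp8_e4m3 *q,
                                      float *block_scales, int num_blocks) {
  int block = blockIdx.x * blockDim.x + threadIdx.x;
  if (block >= num_blocks) return;

  const float *src = x + block * 32;

  float amax = 0.0f;
  for (int i = 0; i < 32; ++i) amax = fmaxf(amax, fabsf(src[i]));

  // Power-of-two scale so that amax / scale <= 448.
  int e = (amax > 0.0f) ? (int)ceilf(log2f(amax / 448.0f)) : 0;
  float scale = exp2f((float)e);
  block_scales[block] = scale;

  for (int i = 0; i < 32; ++i) {
    q[block * 32 + i] = __nv_fp8_e4m3(src[i] / scale);  // round-to-nearest cast
  }
}
```

The hard part Cursor describes is not this arithmetic; it is doing it at full memory bandwidth, fused into the surrounding ops, and emitting the scales in exactly the packed layout the tensor-core instruction expects.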
-
I cut my kernel runtime from 12ms to 1.4ms. I didn't change the math. I changed where the data lived.

Every thread in my kernel was reading the same values — straight from global memory. 500 cycles of latency. Every. Single. Time. The GPU was computing fast. But it was spending most of its time waiting for data.

The fix: __shared__ memory. Load data once per block, then every thread reuses it on-chip at ~5 cycle latency. Same math. 100× faster memory access.

One thing that caught me: __syncthreads() is non-negotiable. Skip it and some threads will read shared memory before others finish writing. Silent bugs that are nearly impossible to debug.

The rule I follow now: if multiple threads in a block touch the same data, it belongs in shared memory.

What optimization unlocked the biggest speedup for you? #CUDA #GPU #SharedMemory #PerformanceEngineering #NVIDIA #ParallelProgramming #GPUProgramming
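A minimal sketch of the load-once-per-block pattern described above (a simple 1D 3-point average with made-up names, not the author's kernel), showing where __syncthreads() has to sit:

```cuda
#include <cuda_runtime.h>

#define TILE 256  // threads per block; illustrative choice

// Each block stages its slice of `in` (plus two halo elements) into shared
// memory once; every thread then reads its neighbors from on-chip SMEM
// instead of issuing three separate global-memory loads.
// For simplicity this sketch assumes n is a multiple of TILE.
__global__ void blur3(const float *in, float *out, int n) {
  __shared__ float tile[TILE + 2];       // +2 for the left/right halo

  int gid = blockIdx.x * blockDim.x + threadIdx.x;
  int lid = threadIdx.x + 1;             // shift by 1 to leave room for the left halo

  tile[lid] = in[gid];                   // one global load per thread
  if (threadIdx.x == 0)
    tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
  if (threadIdx.x == blockDim.x - 1)
    tile[TILE + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;

  __syncthreads();  // non-negotiable: all writes to `tile` must land
                    // before any thread reads its neighbors

  out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
```

Drop the __syncthreads() and the kernel will still run, which is exactly why the resulting races are so hard to debug.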
-
An uncomfortable truth: our CUDA-based AI stacks were built by people who optimize equations, not hardware. Data scientists know the math; systems engineers know the memory. When I instrument real workloads, about 55% of wall-clock isn't "doing AI" at all—it's vanishing into avoidable data motion and sync stalls. AI isn't slow because models are big; it's slow because memory was treated like an afterthought.

Sporkle is my fourth attempt at solving the problem: treat the whole thing as a device-management problem first, and a compute problem second. Orchestrate memory, lifetimes, and transfers cleanly across devices with one API; let the math ride on top of a sane pipeline. When you center memory, the stack stops fighting you and performance shows up everywhere.

I wrote it in a new, pretty Fortran dialect because array semantics and explicit shapes let the compiler generate ruthless code while keeping the memory model obvious. Think do concurrent, pure procedures, explicit-shape arrays, predictable strides, and ISO C interop when needed. It reads like math, behaves like a system, and deletes the pointer soup that makes "fast" code fragile and opaque.

Receipts—before NVIDIA support even lands: production CPU paths hit ~90–160 GFLOPS; a straightforward OpenGL compute backend clears 400+ GFLOPS; the triple-buffered async pipeline (fence-gated) aggregates ~3,630 GFLOPS. Big kernels see ~6.5× latency cuts inside the pipeline; small, memory-bound kernels still gain ~2.3× throughput by keeping data hot and launches sane. Under load, the thread-safe program cache even shows "negative overhead," hitting ~124.5% parallel efficiency thanks to lock-free reads and shared artifacts.

🧩 Today it runs on AMD, issuing real work without vendor SDK fairy dust; NVIDIA and Intel are next. The thesis is device-agnostic: fix data motion and the wins show up on anything with RAM and lanes.

🖥️ And because this matters: every number above was produced on my three-year-old Ryzen 7, all-AMD gaming rig. Not a datacenter box. No exotic hardware. Just disciplined systems engineering.

⚡ Why pay attention to this? Lower minimum footprint for useful AI. Speed without bigger boxes. Fewer watts per token/frame/sample. Reproducible harnesses, real traces, research-grade code. Fix the device layer and the math flies.

Currently in Alpha2 👉https://lnkd.in/gCj4JduQ
-
CUDA 13.0 quietly introduced one of the most interesting GPU performance changes in years: PTXAS can now spill registers into shared memory before falling back to local memory.

That sounds like a compiler detail. It is actually a memory-hierarchy story. In CUDA, "local memory" is local in visibility, not placement. It is thread-private, but backed by device memory and serviced through the global/L2 path. So when a kernel runs out of registers, the penalty is not just "a few extra loads." Hot state falls out of the SM's fastest storage tier.

CUDA 13.0 adds a new intermediate landing zone: asm volatile(".pragma \"enable_smem_spilling\";");

Semantics:
- register spill -> shared memory first
- overflow beyond that -> local memory second

NVIDIA did not remove register pressure. They changed where overflow lands. That is the deep idea here. Many kernels are register-bound but not shared-memory-bound, so previously unused on-chip SRAM can now absorb some spills that would otherwise go off-chip. The release notes describe this path as roughly 10x lower latency than spilling to L2.

The published numbers are interesting because they are credible, not magical. In NVIDIA's example, enabling shared-memory spilling took a kernel from 176B spill stores and 176B spill loads to 0/0, replaced that with about 46KB of shared memory, and improved runtime by 7.76%. In QUDA kernels, NVIDIA reports typical gains in the 5-10% range.

The caveat is just as important as the speedup: this is not free. Shared-memory spilling consumes per-CTA shared memory, which can reduce occupancy. NVIDIA explicitly recommends using __launch_bounds__; otherwise PTXAS may assume a larger CTA than you actually launch, over-allocate spill space, and hurt performance. There are also hard constraints: sm_75+, function scope only, valid in whole-program mode (-rdc=false), and not allowed with dynamically allocated shared memory.

That is why this is such a good systems lesson. Performance cliffs are often not about arithmetic; they are about hierarchy boundaries and overflow paths.

Registers -> shared memory -> local memory
L1 -> L2 -> DRAM

A lot of optimization is really about one question: when hot state stops fitting, which tier does it fall into next? https://lnkd.in/g9SjtHJW
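Based on the constraints quoted above, usage would look roughly like the sketch below. The pragma line and the __launch_bounds__ recommendation come straight from the post; the kernel itself is a hypothetical, register-hungry stand-in:

```cuda
#include <cuda_runtime.h>

// __launch_bounds__ tells PTXAS the real CTA size, so it does not
// over-allocate shared memory for spills (as the post warns).
__global__ void __launch_bounds__(256)
heavy_register_kernel(const float *in, float *out, int n) {
  // Opt this function into shared-memory spilling
  // (CUDA 13.0+, sm_75+, whole-program mode, no dynamic shared memory).
  asm volatile(".pragma \"enable_smem_spilling\";");

  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;

  // Deliberately register-hungry body: a large per-thread accumulator
  // array is the kind of hot state that would otherwise spill to local memory.
  float acc[32];
  for (int k = 0; k < 32; ++k) acc[k] = in[(i + k) % n];
  float s = 0.0f;
  for (int k = 0; k < 32; ++k) s += acc[k] * acc[(k + 7) % 32];
  out[i] = s;
}
```

Whether this helps a given kernel still depends on the occupancy trade-off the post describes, so it is worth checking the Nsight Compute spill and occupancy counters before and after.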
-
🐢🚀 Making GPUs Go Brrr: The Art of Deep Learning Optimization

TL;DR
🧠 Deep learning performance depends on three bottlenecks: compute, memory bandwidth, and overhead. Optimizing requires identifying which regime you're in.
🏭 Compute-bound: Maximize Tensor Core usage (e.g., matmuls) to achieve up to 312 TFLOPS.
🚚 Memory-bound: Use operator fusion to reduce costly memory transfers (e.g., x.cos().cos() is 2x faster when fused).
🐢 Overhead-bound: Framework and Python dispatch costs dominate small ops. Use tracing (jit.trace) or TorchDynamo to reduce overhead.

Problems and Solutions
🐢 Overhead-bound: Use TorchDynamo or CUDA Graphs to reduce Python and framework dispatch costs.
🚚 Memory-bound: Fuse operations (e.g., NVFuser) to avoid repeated memory reads/writes.
🏭 Compute-bound: Focus on Tensor Core utilization for matrix multiplications, as non-matmul operations are 15x slower.

Experiments & Setup
⏱️ PyTorch profiler: Reveals GPU idle gaps caused by CPU overhead (pink CPU vs. green GPU traces).
📦 Batch size test: Doubling batch size with only a 10% runtime increase indicates overhead-bound operations.
🧮 FLOP counting: Non-matmul ops (e.g., layer norm) consume 0.2% of FLOPs but achieve 250x less efficiency.

Novel Insights
🧩 Operator fusion: Fused gelu costs are similar to relu due to reduced memory transfers.
🔄 Rematerialization: Recomputation can reduce both memory and runtime, as seen in AOTAutograd's min-cut optimization.
📉 Hardware disparity: GPU compute grows faster than memory bandwidth, making memory optimizations increasingly critical.

Improvements Over Prior Work
🧪 TorchDynamo: A JIT compiler that dynamically reduces Python overhead without sacrificing flexibility.
🚀 CUDA Graphs: Eliminates kernel launch overhead but requires static execution.
🔧 NVFuser: Automates operator fusion for pointwise/reduction ops, achieving 2x speedups in some cases.

Key Architecture Details
🧠 Tensor Cores: Specialized for matmuls, achieving 312 TFLOPS, compared to 19.5 TFLOPS for general CUDA cores.
📦 Memory hierarchy: DRAM (global) → SRAM (shared) → registers. Operator fusion minimizes DRAM usage.
🔄 Asynchronous execution: The CPU queues GPU kernels to hide overhead, but small ops leave the GPU idle.

Future Work
🤖 JIT compilers: Combine flexibility and low overhead with VM-level introspection (e.g., TorchDynamo).
🧩 Hardware-software co-design: Optimize for non-matmul ops, especially on TPUs.
📉 Memory-aware training: Automate rematerialization using min-cut algorithms.

Key Visualizations
🏭 Factory analogy: Compute = factory, memory = warehouse, bandwidth = shipping. Optimizing compute means reducing shipping delays.
🔥 Flamegraph: Shows that 90% of PyTorch a + b time is overhead, not actual computation.
📈 Microbenchmark plot: Increasing compute intensity (e.g., repeat=64) shifts operations from memory-bound (0.2 TFLOPS) to compute-bound (9.75 TFLOPS). 👇
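The microbenchmark mentioned at the end has a simple CUDA analogue: keep the bytes moved fixed and crank up the arithmetic per element. A hedged sketch (the kernel, sizes, and timing harness are mine for illustration, not taken from the summarized post):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Applies cosf() `repeat` times to each element. Bytes moved stay constant
// (one load + one store per element), so raising `repeat` raises arithmetic
// intensity: at small `repeat` the kernel is memory-bound, at large `repeat`
// it becomes compute-bound.
__global__ void repeated_cos(const float *in, float *out, int n, int repeat) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  float v = in[i];
  for (int r = 0; r < repeat; ++r) v = cosf(v);
  out[i] = v;
}

int main() {
  const int n = 1 << 26;  // ~64M floats; arbitrary size for the experiment
  float *in, *out;
  cudaMalloc(&in, n * sizeof(float));
  cudaMalloc(&out, n * sizeof(float));

  for (int repeat = 1; repeat <= 64; repeat *= 2) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    repeated_cos<<<(n + 255) / 256, 256>>>(in, out, n, repeat);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("repeat=%2d  time=%.3f ms\n", repeat, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
  }
  cudaFree(in);
  cudaFree(out);
  return 0;
}
```

If runtime barely grows as `repeat` doubles, the kernel is still paying mostly for memory traffic, which is exactly the regime where operator fusion pays off.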
-
GPU MODE Compute Profiling: For those that aren't aware, GPU MODE (pardon the caps, it's their name) is an open community with tons of resources on a variety of GPU programming topics. One of the more challenging, yet beneficial, aspects of GPU programming is using profilers effectively. NVIDIA offers two great tools for profiling CUDA code: Nsight Systems and Nsight Compute.

Nsight Systems offers a higher-level picture of overall pipeline activity. It's great for looking into things like kernel execution overlapping with memory copies, synchronization overheads, stream execution, and more. When I was first getting started with the cuFFT library, I realized that my values were not what I was expecting: data was showing up late. I looked into the behavior with Nsight Systems and found that the cuFFT kernels were executing on the default stream, which indicated that I was initializing my plan incorrectly. It helped me pinpoint the bug and correct the mistake.

Nsight Compute is a profiling tool that takes a deeper look into kernel execution. It highlights things like occupancy, cache utilization, and more. This tool is vital for kernel tuning.

It's important to use these tools together, as optimizing a single kernel can sometimes penalize the rest of the system, which is why we should design around our bottlenecks. This video breaks down the magic of Nsight Compute and is worth the 2+ hours of digging into the information: https://lnkd.in/eXinbqD5 If you like my content, feel free to follow or connect! #softwareengineering #gpus
Lecture 44: NVIDIA Profiling (GPU MODE lecture on YouTube)
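For the Nsight Systems side of that workflow, it helps to wrap the phases you care about in NVTX ranges so they show up as named blocks on the timeline. A minimal sketch (region and function names are made up; the NVTX calls come from the header-only NVTX v3 that ships with the CUDA toolkit):

```cuda
#include <cuda_runtime.h>
#include <nvtx3/nvToolsExt.h>  // header-only NVTX v3, bundled with the CUDA toolkit

__global__ void my_kernel(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

void run_pipeline(float *host, float *device, int n) {
  nvtxRangePushA("h2d_copy");            // named range visible on the Nsight Systems timeline
  cudaMemcpy(device, host, n * sizeof(float), cudaMemcpyHostToDevice);
  nvtxRangePop();

  nvtxRangePushA("compute");
  my_kernel<<<(n + 255) / 256, 256>>>(device, n);
  cudaDeviceSynchronize();
  nvtxRangePop();

  nvtxRangePushA("d2h_copy");
  cudaMemcpy(host, device, n * sizeof(float), cudaMemcpyDeviceToHost);
  nvtxRangePop();
}

// Typical invocations from the shell:
//   nsys profile -o timeline ./app   # whole-pipeline view (Nsight Systems)
//   ncu --set full ./app             # per-kernel deep dive (Nsight Compute)
```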