Stop writing C++ to make your GPU go fast! I wrote a custom CUDA kernel entirely in Python today. My GPU didn't even cry, and honestly, neither did I.

If you've ever looked at CUTLASS or CuTe and felt instantly overwhelmed by the wall of C++ templates, you aren't alone. We want the speed, but we don't want the headache.

Enter CuTe-DSL: it brings the raw power of CuTe's advanced memory layouts and vectorization straight into a familiar, Pythonic interface. It looks and feels just like everyday PyTorch or Numba, but under the hood it compiles directly to native GPU code.

I just published the ultimate gentle guide to getting started with it. Here is what we cover:

→ Zero to GPU: writing your first kernel with a simple @cute.kernel decorator.
→ Demystifying Layouts: simple ASCII diagrams (like the one below!) that finally make sense of how multi-dimensional math maps to flat memory.
→ Logical vs. Zipped Divides: the secret sauce for cleanly partitioning your data without breaking your brain.
→ Free Speed: how to graduate to vectorized execution and fetch multiple floats at once with literally one line of code.

You no longer need to wrestle with raw pointer math and manual bounds checking to saturate your memory bandwidth.

Check out the full guide and fire up your GPUs! https://lnkd.in/dsbQVnvc

#Python #CUDA #GPUComputing #MachineLearning #DeepLearning #DataScience #AI
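To give a flavor of the interface (this sketch is not from the guide; the module path, decorators, and launch call follow NVIDIA's public CuTe DSL examples, and the exact signatures should be treated as assumptions), an elementwise add can look roughly like this:

```python
import cutlass.cute as cute

@cute.kernel
def add_kernel(gA: cute.Tensor, gB: cute.Tensor, gC: cute.Tensor):
    # one thread per element of a 1-D tensor
    tidx, _, _ = cute.arch.thread_idx()
    bidx, _, _ = cute.arch.block_idx()
    bdim, _, _ = cute.arch.block_dim()
    i = bidx * bdim + tidx
    if i < cute.size(gA):
        gC[i] = gA[i] + gB[i]

@cute.jit
def add(mA: cute.Tensor, mB: cute.Tensor, mC: cute.Tensor):
    threads = 256
    blocks = (cute.size(mA) + threads - 1) // threads
    # mA/mB/mC are device tensors, e.g. wrapped from torch tensors via from_dlpack
    add_kernel(mA, mB, mC).launch(grid=[blocks, 1, 1], block=[threads, 1, 1])
```

The full guide covers the layout and divide machinery that makes the partitioned, vectorized versions just as short.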
CuTe-DSL: GPU Speed Without the C++ Headache
From Python to Rust for embedding inference — here's what we gained.

We moved our RAG embedding pipeline to Text Embeddings Inference (TEI) — Hugging Face's Rust-native inference server.

Stack: TEI (https://lnkd.in/grvTMJDk) · Rust + Candle ML + Flash Attention v2 + cuBLASLt · NVIDIA A100 80GB

Results (500 reqs × batch=10 @ concurrency=10):
→ Single-req latency: 2.0 ms
→ Throughput: 1,000 req/s | 10,000 texts/s batched
→ VRAM footprint: ~2 GB (2.6% of 80 GB)
→ Cold start: 6 seconds
→ Failures: 0 / 500

Why Python costs you throughput:
→ GIL serializes tokenization, I/O, and response serialization
→ No built-in dynamic batching
→ Every request pays GC pauses + dynamic dispatch overhead
→ Triton/vLLM solve KV-cache problems embeddings don't have

What TEI does differently (87% Rust, 9% CUDA):
→ No Python in the hot path — HTTP, tokenization, batching, dispatch: all compiled Rust
→ Flash Attention via Candle: Q·K^T·V fused into one kernel, HBM reads drop from O(N²) to O(N)
→ cuBLASLt for QKV/FFN projections — FP16 tensor cores, zero GEMM dispatch overhead
→ Token-based dynamic batching — SM occupancy stays high regardless of input-length variance
→ Zero-copy safetensors mmap — no torch.load(), no graph tracing, boot in seconds

The model needs ~2 GB of VRAM. The bottleneck was never the hardware — it was the runtime between your code and the GPU.

Use vLLM for LLMs. Use TEI for embeddings. https://lnkd.in/gHNxzBJy

#MachineLearning #Rust #CUDA #Embeddings #InferenceOptimization #RAG #MLOps #FlashAttention #HuggingFace
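Side note, not from the post: if you want to poke at TEI yourself, the HTTP API is small enough to hit from a few lines of Python. The /embed route and request shape follow TEI's documented API; the URL, model, and texts here are placeholders, and a TEI server is assumed to already be running (e.g. via the official Docker image).

```python
import requests

TEI_URL = "http://localhost:8080"  # placeholder; point at your TEI instance

def embed(texts):
    # TEI exposes POST /embed taking {"inputs": [...]} and returns one vector per input text
    resp = requests.post(f"{TEI_URL}/embed", json={"inputs": texts}, timeout=10)
    resp.raise_for_status()
    return resp.json()

vectors = embed(["LSTK deviation clause", "P&ID specification"])
print(len(vectors), len(vectors[0]))  # number of texts, embedding dimension
```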
Proud of what the team built under the hood at TenderGenie. https://lnkd.in/dd2TwD7p

Processing thousands of pages of EPC specs, oil & gas RFPs, and construction annexures in real time means retrieval latency is on the critical path of every bid decision. A slow embedding layer = slower compliance checks, slower scope extraction, slower bid turnaround.

Self-hosted TEI was the right call. Sub-2ms latency, 10,000 texts/s, zero Python in the hot path.

But beyond performance, enterprise clients in oil & gas deal with commercially sensitive bid data. Self-hosted means documents are vectorized entirely within their own infrastructure. No third-party API calls, no data exposure.

And every tender processed, every compliance decision, every scope interpretation compounds into institutional memory: domain-specific embeddings that understand what an LSTK deviation means or how a P&ID spec maps to a commercial obligation. Generic OpenAI embeddings don't.

In high-stakes, document-heavy industries, infrastructure choices are product decisions.

Thanks, Microsoft, for the Founders Hub sponsorship of $150,000. Many more experiments are lined up. TenderGenie DataSmith AI
A milestone for us. A faster, sharper experience for you.

We've moved our entire embedding infrastructure to a self-hosted Rust-native stack. The numbers: sub-2ms retrieval latency, 1,000 req/s on a single GPU, zero downtime, zero data leaving your environment.

What this means for bid & proposals teams:

→ Faster intelligence
Compliance gaps, scope risks, and commercial obligations surface quicker. Every second saved compounds across hundreds of bid decisions.

→ Your data stays yours
Your tender documents, RFPs, and commercial strategies are processed entirely within your own environment. No third party ever sees your data.

→ Sharper relevance
Our models understand LSTK contracts, NEC clauses, P&ID specs, and BOQ structures — not just generic text. The more tenders you process, the sharper the intelligence gets.

We're building the intelligence layer that serious bid & proposals teams deserve. This is one step further in that direction.

#Procurement #BidManagement #TenderManagement #EPC #DigitalTransformation
Most Python GPU code never touches the GPU directly. @vectorize. CuPy. PyTorch ops. You get your speedup, you move on.

But at some point, the abstraction starts costing you. Your pipeline has too many unnecessary transfers. Your intermediate results keep bouncing between CPU and GPU when they never needed to leave the device. The bottleneck isn't the kernel, it's the bus.

That's when you need to understand what's actually happening underneath. I wrote a deep dive into GPU-accelerated Python with Numba that goes past the ufunc basics, into the mental model you actually need:

→ CUDA device functions and why the compiler inlines them for zero overhead
→ Why your pipeline is slow (spoiler: it's PCIe transfers, not compute)
→ Device arrays that keep data on-GPU across multiple operations
→ Warp divergence — what it is and how a single branchless multiply fixes it
→ Synchronization gotchas that make fast kernels look slow in benchmarks
→ The memory hierarchy from registers to global memory to host RAM, and the 100-1000x latency gaps between them

#CUDA #GPU #HPC #HighPerformanceComputing #DeepLearning #MLEngineering #ParallelComputing
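Not from the write-up itself, but a minimal Numba sketch of two of the ideas above: device arrays that keep intermediates on the GPU between kernel launches, and a branchless multiply in place of a divergent branch. Array names and sizes are purely illustrative.

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale(x, out, factor):
    i = cuda.grid(1)
    if i < x.size:
        out[i] = x[i] * factor

@cuda.jit
def relu(x, out):
    i = cuda.grid(1)
    if i < x.size:
        v = x[i]
        # branchless ReLU: multiply by the boolean instead of branching,
        # so threads within a warp all execute the same instructions
        out[i] = v * (v > 0.0)

h_x = np.random.randn(1_000_000).astype(np.float32)

d_x = cuda.to_device(h_x)              # one host -> device copy
d_tmp = cuda.device_array_like(d_x)    # intermediate lives on the GPU
d_out = cuda.device_array_like(d_x)

threads = 256
blocks = (h_x.size + threads - 1) // threads
scale[blocks, threads](d_x, d_tmp, 2.0)
relu[blocks, threads](d_tmp, d_out)    # no CPU round-trip in between

result = d_out.copy_to_host()          # one device -> host copy at the end
```

The point is the shape of the pipeline: data crosses PCIe exactly twice, no matter how many kernels run in the middle.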
𝟱 𝗣𝘆𝘁𝗵𝗼𝗻 𝗟𝗟𝗠 𝗹𝗶𝗯𝗿𝗮𝗿𝗶𝗲𝘀 𝗲𝘃𝗲𝗿𝘆 𝗔𝗜 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿 𝘀𝗵𝗼𝘂𝗹𝗱 𝗸𝗻𝗼𝘄 🧠

Here are 5 open-source Python libraries that separate the builders from the users:

𝟬𝟭 — 𝗩𝗟𝗟𝗠 𝘀𝗲𝗺𝗮𝗻𝘁𝗶𝗰 𝗿𝗼𝘂𝘁𝗲𝗿
Stop sending every prompt to the same model. Route intelligently based on intent and cut costs dramatically.

𝟬𝟮 — 𝗛𝗮𝘆𝘀𝘁𝗮𝗰𝗸
Production-ready AI orchestration for RAG pipelines and agent workflows.

𝟬𝟯 — 𝗹𝗹𝗺𝗳𝗶𝘁
One command to find which LLMs actually run on your hardware. Scores models on quality, speed, fit, and context.

𝟬𝟰 — 𝗟𝗮𝗻𝗴𝗚𝗿𝗮𝗽𝗵
When LangChain isn't enough. Build stateful, long-running agents that track context and recover from failures.

𝟬𝟱 — 𝗕𝗶𝘁𝘀𝗮𝗻𝗱𝗯𝘆𝘁𝗲𝘀
Run 70B-parameter models on consumer GPUs. k-bit quantization that halves memory with barely any performance hit.
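To make the last item concrete (my example, not from the post; the model name is only a placeholder), the standard bitsandbytes 4-bit path through transformers looks like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"  # any causal LM; example only

# 4-bit NF4 quantization with bf16 compute: weights shrink roughly 4x vs fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Quantization lets you", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```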
A Python loop. 662 nanoseconds per iteration. Add two characters. Same loop. Same algorithm. 50–200× faster. That's @jit, and understanding why it works is a systems-level education.

I break it down here; it covers:

▸ Why Python is structurally slow (not just "interpreted": it's the boxing, type dispatch, and GC pressure on every single loop iteration)
▸ What Numba actually is under the hood - a 5-stage compilation pipeline: Python bytecode → type inference → Numba IR → LLVM IR → machine code or CUDA PTX. The same backend Clang uses.
▸ A real benchmark breakdown - pure Python (662 ns) vs Numba (193 ns) vs built-in C (128 ns), why Numba doesn't always win, and when it wins massively
▸ The HPC memory hierarchy explained - registers, L1/L2 cache, DRAM, PCIe, GPU HBM - and why the most common GPU bottleneck isn't compute, it's the data transfer
▸ CUDA C++ vs PyCUDA vs Numba - a side-by-side comparison of when to use which, with no fluff
▸ The Monte Carlo Pi exercise - how adding @jit to a 1M-iteration loop gives a 50–200× speedup, and why this is the sweet spot Numba was built for
▸ The core architectural insight: Python is a control plane, not a compute plane - the same pattern behind PyTorch, TensorFlow, and JAX

#Python #Numba #GPU #CUDA #HPC #DataScience #MachineLearning #ScientificComputing #PerformanceEngineering #NumPy #SoftwareEngineering
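For reference, a minimal version of that Monte Carlo Pi exercise (my sketch, not the article's exact code): two identical loop bodies, one decorator apart. Timings will vary by machine.

```python
import random
from numba import njit

def pi_python(n):
    hits = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return 4.0 * hits / n

@njit  # same body, compiled to machine code on first call
def pi_numba(n):
    hits = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return 4.0 * hits / n

n = 1_000_000
pi_numba(10)  # warm-up call pays the one-off compilation cost
print(pi_python(n), pi_numba(n))
```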
CuTile.jl already allows you to leverage NVIDIA's CuTile compiler in #julialang. This blog from NVIDIA shows how GPU kernels can be ported from #python to #julialang using agentic tools like Claude Code. Just point Claude at the SKILL.md file and get started!
Huge thanks to Vinishka Kalra from Packt for sharing a reviewer's copy of "GPU-Accelerated Computing with Python 3 and CUDA" by Niels Cautaerts and Hossein Ghorbanfekr!

This book does an excellent job of bridging the gap between low-level GPU programming and high-level Python tools. It follows a logical progression:

🔴 Fundamentals: introduces GPGPU concepts and teaches how to write, profile, and debug CUDA kernels using Numba-CUDA.
🔴 Optimization & Scaling: explores performance optimization, enabling concurrency with CUDA streams, and scaling computations to multiple GPUs.
🔴 High-Level Ecosystems: covers accelerated array programming and data science with CuPy and the RAPIDS ecosystem (cuDF and cuML), and introduces JAX for solving optimization problems.
🔴 Real-World Applications: applies these concepts to practical problems, including solving the heat equation, processing images, simulating atomic interactions, and implementing a transformer-based language model from scratch.

For those seeking a comprehensive, practical guide to mastering CUDA, CuPy, and Numba natively within Python, this book is highly recommended.

#CUDA #Python #Numba #CuPy #JAX #RAPIDS #GPUComputing #MachineLearning #SoftwareEngineering #DataScience #HighPerformanceComputing #HPC #AI
🎙️ Just dropped on Talk Python To Me (Ep. 544): WheelNext is on Air! 🎧 Listen here → https://lnkd.in/ehWRzNcG

`pip install <package>` is incredible for pure Python, but it's often insufficient for scientific Python, where compiled code is all the rage. Large wheels, no GPU detection, poor CPU optimizations. We're fixing this!

A coalition from NVIDIA, Astral, Quansight, Meta, AMD, Intel, Red Hat, and many others has been building WheelNext, a community focused on re-inventing the Wheel (pun intended)!

The goal: automatically getting the right CUDA version and the right CPU optimizations. Smaller wheels. Better performance unlocked for scientific computing. PEP 825 is live: https://lnkd.in/exJMa4Tk

On the episode, Ralf Gommers, Charlie Marsh, Michael Kennedy, and I dig into why this matters, how it works, and when it's coming to your workflow.

Credit to many of the fantastic people who helped us get this far: Michal Gorny, Konstantin, Andrey Talman, Dr. Andy R. Terrel (he/him), Michael Sarahan, Barry Warsaw, Donald Stufft, Emma Smith, Eli Uriegas, Chris Gottbrath

If you maintain a package with native code, now is the time to get involved!

#Python #OpenSource #CUDA #PythonPackaging #DeepLearning #pytorch #NVIDIA
🎉 CUDA 13.2 just dropped, and GPU programming just got simpler. This release expands CUDA Tile support to Ampere and Ada GPUs while delivering a stronger CUDA Python stack for cluster-scale workloads.

What's new:
✅ Install cuTile Python directly from PyPI: pip install cuda-tile
✅ Enhanced CUDA Python profiling and debugging across Numba-CUDA flows and Nsight tools
✅ Modern CUDA C++ and refreshed math libraries optimized for AI and HPC kernels

Ready to accelerate your workflows? 📝 Read the technical deep dive:
Interesting. Any benchmarking you did vs the C++ setup?