🎙️ Just dropped on Talk Python To Me (Ep. 544)! WheelNext is on air! 🎧 Listen here → https://lnkd.in/ehWRzNcG `pip install <package>` is incredible for pure Python, but it often falls short for scientific Python, where compiled code is all the rage: large wheels, no GPU detection, poor CPU optimizations. We're fixing this! A coalition from NVIDIA, Astral, Quansight, Meta, AMD, Intel, Red Hat, and many others has been building WheelNext, a community focused on re-inventing the Wheel (pun intended)! The goal: automatically get the right CUDA version and the right CPU optimizations, with smaller wheels and better performance unlocked for scientific computing. PEP 825 is live: https://lnkd.in/exJMa4Tk On the episode, Ralf Gommers, Charlie Marsh, Michael Kennedy, and I dig into why this matters, how it works, and when it's coming to your workflow. Credit to many of the fantastic people who helped us get this far: Michal Gorny, Konstantin, Andrey Talman, Dr. Andy R. Terrel (he/him), Michael Sarahan, Barry Warsaw, Donald Stufft, Emma Smith, Eli Uriegas, Chris Gottbrath. If you maintain a package with native code, now is the time to get involved! #Python #OpenSource #CUDA #PythonPackaging #DeepLearning #pytorch #NVIDIA
Jonathan Dekhtiar’s Post
Stop writing C++ to make your GPU go fast! I wrote a custom CUDA kernel entirely in Python today. My GPU didn't even cry, and honestly, neither did I. If you've ever looked at CUTLASS or CuTe and felt instantly overwhelmed by the wall of C++ templates, you aren't alone. We want the speed, but we don't want the headache. CuTe-DSL: It brings the raw power of CuTe's advanced memory layouts and vectorization straight into a familiar, Pythonic interface. It looks and feels just like everyday PyTorch or Numba, but under the hood, it compiles directly to native GPU code. I just published the ultimate gentle guide to getting started with it. Here is what we cover: Zero to GPU: Writing your first kernel with a simple @cute.kernel decorator. Demystifying Layouts: We use simple ASCII diagrams (like the one below!) to finally make sense of how multi-dimensional math maps to flat memory. Logical vs. Zipped Divides: The secret sauce to cleanly partitioning your data without breaking your brain. Free Speed: How to graduate to vectorized execution and fetch multiple floats at once with literally one line of code. You no longer need to wrestle with raw pointer math and manual bounds checking to saturate your memory bandwidth. Check out the full guide and fire up your GPUs! https://lnkd.in/dsbQVnvc #Python #CUDA #GPUComputing #MachineLearning #DeepLearning #DataScience #AI
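If the C++ templates put you off, here is roughly what "hello world" looks like in the DSL. This is a hedged sketch modeled on the CUTLASS Python examples: the `cute.arch` index helpers, `cute.size`, and the `.launch()` signature are my best reading of the API, so double-check against the guide. Host tensors come in via DLPack conversion from PyTorch, which is left out here.

```python
import cutlass
import cutlass.cute as cute

@cute.kernel
def add_kernel(gA: cute.Tensor, gB: cute.Tensor, gC: cute.Tensor):
    # Recover a flat global thread index from the block/thread coordinates.
    tidx, _, _ = cute.arch.thread_idx()
    bidx, _, _ = cute.arch.block_idx()
    bdim, _, _ = cute.arch.block_dim()
    i = bidx * bdim + tidx

    # Map the flat index onto the 2-D logical layout of the tensor.
    m, n = gA.shape
    mi, ni = i // n, i % n
    gC[mi, ni] = gA[mi, ni] + gB[mi, ni]

@cute.jit
def add(mA: cute.Tensor, mB: cute.Tensor, mC: cute.Tensor):
    # One thread per element, 256 threads per block (illustrative launch shape).
    # Toy example: assumes the element count is a multiple of 256, so no bounds check.
    num_elems = cute.size(mA)
    add_kernel(mA, mB, mC).launch(
        grid=(num_elems // 256, 1, 1),
        block=(256, 1, 1),
    )
```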
The strength of Python lies in its community, and at Quansight, we are excited to contribute to a cross-industry initiative that makes the Python packaging ecosystem more capable and flexible. Our co-CEO, Ralf Gommers, recently joined host Michael Kennedy on the Talk Python To Me podcast, together with Jonathan Dekhtiar from NVIDIA and Charlie Marsh from Astral, to discuss the collaborative work on WheelNext and wheel variants. Working alongside many other companies and open source projects, we are developing a new standard for hardware-aware packaging. This effort ensures that whether you are using a specialized GPU or a modern CPU, the tools you rely on, like NumPy and PyTorch, will automatically perform at their best. Listen to the full conversation here: https://lnkd.in/gnxvNdM6
Dmitry built a complete LLM compiler from scratch to document how a modern ML compiler stack works end to end. 5,000 lines, two weeks, no external libraries: just pure Python and raw CUDA. The pipeline takes a PyTorch graph through six intermediate representations: Torch IR, Tensor IR, Loop IR, Tile IR, Kernel IR, CUDA. Each lowering moves closer to the hardware: decompose Torch ops, convert to loops and fuse, schedule kernels, render CUDA. GELU at seq=32 runs in 31 µs in eager PyTorch and 6 µs in his stack, a 4.87x speedup. Softmax sits at parity with eager. Matmul lands between 50% of and slightly above NVIDIA cuBLAS performance, depending on shape. Vendor kernels are still hard to beat at full prefill on the FFN-width matmuls, which is why every production stack falls back to cuBLAS, cuDNN, and CUTLASS on the heavy hitters. https://lnkd.in/g6qbVdFv #CUDA #MLCompilers #PyTorch
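To make "each lowering moves closer to the hardware" concrete, here is a toy, hypothetical sketch (not Dmitry's code) of one such pass: fusing two elementwise loop nests in a minimal Loop-IR-style representation and then rendering the result as a single CUDA kernel string.

```python
from dataclasses import dataclass

# A deliberately tiny "Loop IR": one loop over n elements applying an elementwise expression.
@dataclass
class Loop:
    n: int      # trip count
    op: str     # expression in terms of the loop variable "i", e.g. "a[i] * b[i]"
    out: str    # output buffer name

def fuse(producer: Loop, consumer: Loop) -> Loop:
    """Fuse two loops when the consumer reads the producer's output elementwise."""
    assert producer.n == consumer.n, "only loops with identical trip counts are fused here"
    # Substitute the producer's expression into the consumer, eliminating the temp buffer.
    fused_op = consumer.op.replace(f"{producer.out}[i]", f"({producer.op})")
    return Loop(n=producer.n, op=fused_op, out=consumer.out)

def render_cuda(loop: Loop, name: str = "fused_kernel") -> str:
    """Lower the fused loop to a single CUDA C kernel string."""
    return (
        f"__global__ void {name}(const float* a, const float* b, float* {loop.out}, int n) {{\n"
        f"  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        f"  if (i < n) {loop.out}[i] = {loop.op};\n"
        f"}}\n"
    )

# producer: tmp = a * b; consumer: c = tmp + b  ->  one kernel, no tmp buffer.
producer = Loop(n=1024, op="a[i] * b[i]", out="tmp")
consumer = Loop(n=1024, op="tmp[i] + b[i]", out="c")
print(render_cuda(fuse(producer, consumer)))
```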
CuTile.jl already allows you to leverage NVIDIA's CuTile compiler in #julialang. This blog from NVIDIA shows how GPU kernels can be ported from #python to #julialang using agentic tools like Claude Code. Just point Claude at the SKILL.md file and get started!
Running an LLM locally is not like downloading software and double-clicking an executable. What you actually download is a set of model artifacts, basically a recipe. The inference engine is the chef, and each engine has its own opinions about how to load, quantize, schedule, and serve that model. That’s also why “this engine is written in C++” or “that one is in Python” is not enough to explain real-world inference performance. A lot of the magic happens in places people rarely talk about: Memory mapping. SSD to RAM to GPU movement. Quantization strategy. Scheduler design. KV cache behavior. Hardware-specific optimizations. The deeper I go into local inference, the more I see that this is really a systems problem, not just a model problem. And once you understand that, you stop looking for the “best” engine in absolute terms and start looking for the engine that makes the right tradeoffs for your hardware, latency target, and serving pattern. #LLM #LocalLLM #AIInfrastructure #AIEngineering #MachineLearning #DeepLearning
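As one concrete instance of the "systems problem": KV-cache memory alone often dictates which engine, context length, and quantization you can actually run. A back-of-the-envelope sketch using the standard sizing formula, with made-up model numbers (not from any specific engine):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Keys and values are both cached per layer, hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical 8B-class model: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 cache.
gib = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                     seq_len=32_768, batch=4, bytes_per_elem=2) / 2**30
print(f"KV cache: {gib:.1f} GiB")  # 16.0 GiB, before the weights are even counted
```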
An Implementation Guide to Running NVIDIA Transformer Engine with Mixed Precision, FP8 Checks, Benchmarking, and Fallback Execution. In this tutorial, we walk through an advanced, practical implementation of the NVIDIA Transformer Engine in Python, focusing on how mixed-precision acceleration can be applied in a realistic deep learning workflow. We set up the environment, verify GPU and CUDA readiness, attempt to install the required Transformer Engine components, and handle compatibility issues gracefully so that the notebook remains runnable even when the full extension cannot be built....
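In the spirit of the tutorial's "fallback execution," here is a minimal sketch of the pattern: use the Transformer Engine FP8 path when the library imports, and fall back to plain PyTorch mixed precision otherwise. The `te.Linear`, `te.fp8_autocast`, and `recipe.DelayedScaling` usage follows the Transformer Engine PyTorch API as I understand it; check the docs for your installed version.

```python
import torch

try:
    import transformer_engine.pytorch as te
    from transformer_engine.common import recipe
    HAS_TE = True
except ImportError:
    HAS_TE = False

def make_linear(in_features, out_features):
    # Use the Transformer Engine layer when available, otherwise plain torch.nn.
    if HAS_TE:
        return te.Linear(in_features, out_features, bias=True)
    return torch.nn.Linear(in_features, out_features, bias=True)

layer = make_linear(1024, 1024).cuda()
x = torch.randn(16, 1024, device="cuda")  # dims kept FP8-friendly (multiples of 16)

if HAS_TE:
    # FP8 matmuls with delayed scaling; needs FP8-capable hardware (Hopper/Ada class).
    fp8_recipe = recipe.DelayedScaling()
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        y = layer(x)
else:
    # Fallback: ordinary BF16 autocast execution.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        y = layer(x)
```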
From Python to Rust for embedding inference — here's what we gained. We moved our RAG embedding pipeline to Text Embeddings Inference (TEI) — Hugging Face's Rust-native inference server. Stack: TEI (https://lnkd.in/grvTMJDk) · Rust + Candle ML + Flash Attention v2 + cuBLASLt · NVIDIA A100 80GB Results (500 reqs × batch=10 @ concurrency=10): → Single-req latency: 2.0 ms → Throughput: 1,000 req/s | 10,000 texts/s batched → VRAM footprint: ~2 GB (2.6% of 80GB) → Cold start: 6 seconds → Failures: 0 / 500 ——— Why Python costs you throughput: → GIL serializes tokenization, I/O, and response serialization → No built-in dynamic batching → Every request pays GC pauses + dynamic dispatch overhead → Triton/vLLM solve KV-cache problems embeddings don't have What TEI does differently (87% Rust, 9% CUDA): → No Python in the hot path — HTTP, tokenization, batching, dispatch: all compiled Rust → Flash Attention via Candle: Q·K^T·V fused into one kernel, HBM reads drop O(N²)→O(N) → cuBLASLt for QKV/FFN projections — FP16 tensor cores, zero GEMM dispatch overhead → Token-based dynamic batching — SM occupancy stays high regardless of input length variance → Zero-copy safetensors mmap — no torch.load(), no graph tracing, boot in seconds The model needs ~2 GB VRAM. The bottleneck was never the hardware — it was the runtime between your code and the GPU. Use vLLM for LLMs. Use TEI for embeddings. https://lnkd.in/gHNxzBJy #MachineLearning #Rust #CUDA #Embeddings #InferenceOptimization #RAG #MLOps #FlashAttention #HuggingFace
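If you want to reproduce a setup like this, the client side is just HTTP. A minimal sketch against a locally running TEI server; the `/embed` route and `{"inputs": ...}` payload follow TEI's documented API, while the port and the example texts are assumptions:

```python
import requests

TEI_URL = "http://localhost:8080"  # wherever your text-embeddings-inference server listens

def embed(texts):
    # TEI handles dynamic batching server-side; just send the raw strings.
    resp = requests.post(f"{TEI_URL}/embed", json={"inputs": texts}, timeout=10)
    resp.raise_for_status()
    return resp.json()  # list of embedding vectors, one per input text

vectors = embed(["tender compliance clause", "scope of work for EPC contract"])
print(len(vectors), len(vectors[0]))  # e.g. 2 x 768, depending on the model
```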
Proud of what the team built under the hood at TenderGenie. https://lnkd.in/dd2TwD7p Processing thousands of pages of EPC specs, oil & gas RFPs, and construction annexures in real time means retrieval latency is on the critical path of every bid decision. A slow embedding layer = slower compliance checks, slower scope extraction, slower bid turnaround. Self-hosted TEI was the right call. Sub-2ms latency, 10,000 texts/s, zero Python in the hot path. But beyond performance, enterprise clients in oil & gas deal with commercially sensitive bid data. Self-hosted means documents are vectorized entirely within their own infrastructure. No third-party API call, no data exposure. And every tender processed, every compliance decision, every scope interpretation compounds into institutional memory. Domain-specific embeddings that understand what an LSTK deviation means or how a P&ID spec maps to a commercial obligation. Generic OpenAI embeddings don't. In high-stakes, document-heavy industries, infrastructure choices are product decisions. Thanks to Microsoft for the Founders Hub sponsorship of $150,000. Many more experiments are lined up. TenderGenie DataSmith AI
A milestone for us. A faster, sharper experience for you. We've moved our entire embedding infrastructure to a self-hosted Rust-native stack. The numbers: sub-2ms retrieval latency, 1,000 req/s on a single GPU, zero downtime, zero data leaving your environment. What this means for bid & proposals teams: → Faster intelligence Compliance gaps, scope risks, and commercial obligations surface quicker. Every second saved compounds across hundreds of bid decisions. → Your data stays yours Your tender documents, RFPs, and commercial strategies are processed entirely within your own environment. No third party ever sees your data. → Sharper relevance Our models understand LSTK contracts, NEC clauses, P&ID specs, and BOQ structures — not just generic text. The more tenders you process, the sharper the intelligence gets. We're building the intelligence layer that serious bid & proposals teams deserve. This is one step further in that direction. #Procurement #BidManagement #TenderManagement #EPC #DigitalTransformation
Disk-LLM: When I was a PhD student, I dealt with challenging acceleration-sensor data, and a collaborator and I came up with the idea of using NumPy memmap to handle large datasets with limited RAM. That got me thinking: what if we apply NumPy memmap to LLMs, keep large models on disk, and avoid loading everything into RAM? The approach is not new, there are already similar techniques, but NumPy memmap is native to NumPy and Python. It doesn't require C++, CUDA kernels, or massive frameworks. So I started an open-source project on my GitHub; when you have time, check it out, dig into it, and let me know your thoughts. https://lnkd.in/gj8cgPhg https://lnkd.in/gkch7CDe #research #opensource #LLMs #GENAI #NumPy
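For anyone unfamiliar with the building block, this is the core trick: np.memmap gives you an array backed by a file on disk, so only the slices you actually touch get paged into RAM. A tiny illustration (the file name and shapes are made up, not taken from the Disk-LLM repo):

```python
import numpy as np

# Create a ~4 GiB weight matrix on disk without ever holding it all in RAM.
shape = (32_768, 32_768)  # 32768 * 32768 * 4 bytes = 4 GiB in float32
w = np.memmap("weights.bin", dtype=np.float32, mode="w+", shape=shape)
w[:1024] = np.random.rand(1024, shape[1]).astype(np.float32)  # write one tile
w.flush()

# Later (e.g. at inference time): open read-only and touch only what you need.
w = np.memmap("weights.bin", dtype=np.float32, mode="r", shape=shape)
x = np.random.rand(shape[1]).astype(np.float32)
y = w[:1024] @ x  # only the first 1024 rows are paged in from disk
print(y.shape)
```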
Thanks for your work! I have a slightly different problem. My Python package provides a man page for Linux. Will that ever be supported? If pip install could help with man pages, that would be great!