GPU matrix multiplication may be the most expensive algorithm in existence: it is the core operation that OpenAI, Anthropic, and Meta spend billions of dollars of compute on. There are only 8 kernel optimizations you need to understand to reach 93.7% of the performance of NVIDIA's state-of-the-art cuBLAS library. In this thread, we'll go over kernels that get progressively more performant, from an Anthropic engineer's blog post, following the attached diagram.

Kernel 1: Simply multiplies two matrices. We use CUDA's grid, block, and thread hierarchy to assign each thread a unique entry in the result matrix C. This works, but only gets us 309 GFLOPs/s (1.3% of an A6000 GPU's potential). We can do much better.

Kernel 2: Enables global memory coalescing by using "warps" (groups of 32 threads). Threads in the same warp can combine their memory accesses into one transaction, which dramatically improves memory throughput (110 GB/s vs. 15 GB/s). Result: 1986 GFLOPs/s (8.5% of cuBLAS).

Kernel 3: Utilizes on-chip shared memory (SMEM), whose bandwidth is much higher than global memory's (12,080 GiB/s vs. 750 GiB/s). We load chunks of A and B into SMEM and then perform as much work as possible on them. Result: 2980 GFLOPs/s (12.8% of cuBLAS).

Kernel 4: Uses 1D blocktiling to calculate multiple results per thread. It works like the last kernel but adds an inner loop so each thread computes several C entries from a 4 KB SMEM cache per block. Result: 8474 GFLOPs/s, ~3x faster than the last (36.5% of cuBLAS).

Kernel 5: Increases arithmetic intensity via 2D blocktiling. We compute an 8x8 grid of results per thread, leveraging shared memory and local registers to reduce global memory accesses. This offers another ~2x performance boost. Result: 15971 GFLOPs/s (68.7% of cuBLAS).

Kernel 6: Vectorizes memory accesses. The key is to transpose loads from A, enabling 128-bit load instructions (LDS.128) instead of 32-bit loads for more efficient data movement. Result: 18237 GFLOPs/s (78.4% of cuBLAS).

Kernel 7: Tunes the parameters that control how much data we cache in SMEM and registers. We use a bash script to search all valid combinations and find the optimal settings. Result: 19721 GFLOPs/s (84.8% of cuBLAS).

Kernel 8: Adds "warptiling", yet another level of tiling on top of blocktiling and threadtiling. Warptiling lets different warps execute in parallel on different warp schedulers, leveraging the hardware for even more parallelism. Result: 21779 GFLOPs/s (93.7% of cuBLAS).

From reading the original post, I learned that optimizing GPU kernels requires a deep understanding of the hardware and its memory access patterns. The basics are simple and get you most of the way there (the author got ~80% of the performance in 2 weekends); it took another 4 weekends to get the last 14% (classic power law). For much more in-depth explanations with helpful diagrams and code snippets, check out the original post, it's really interesting: https://lnkd.in/gi-y4NFB
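To make the blocktiling idea (Kernels 3-5) concrete, here is a minimal NumPy sketch of my own: it illustrates the tiling structure, not the post's actual CUDA code, and NumPy of course has no real shared memory to stage into.

```python
# Toy blocktiling sketch: compute C = A @ B one tile at a time, mirroring how
# a CUDA thread block would stage chunks of A and B in shared memory (SMEM)
# before multiplying. Illustrative only; the real speedup needs CUDA.
import numpy as np

def tiled_matmul(A, B, tile=32):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            # The accumulator plays the role of per-thread registers.
            acc = np.zeros((min(tile, M - i), min(tile, N - j)), dtype=A.dtype)
            for k in range(0, K, tile):
                # These two slices correspond to the SMEM staging step.
                a_tile = A[i:i + tile, k:k + tile]
                b_tile = B[k:k + tile, j:j + tile]
                acc += a_tile @ b_tile
            C[i:i + tile, j:j + tile] = acc
    return C

A = np.random.rand(128, 128).astype(np.float32)
B = np.random.rand(128, 128).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)
```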
How to Maximize GPU Utilization
Summary
Maximizing GPU utilization means making sure your graphics processing unit is always busy and handling as much work as possible, instead of waiting on slow data transfers or inefficient coordination between tasks. This is crucial for speeding up tasks like training machine learning models, reducing costs, and getting the most out of expensive hardware.
- Monitor bottlenecks: Track how much time your GPU spends computing versus waiting for data or instructions to help identify areas where improvements can be made.
- Streamline workload: Use methods like operator fusion and batch processing so your GPU handles bigger chunks of data at once, reducing idle time and memory swapping.
- Improve scheduling: Configure job schedulers and cache strategies to keep data ready and prevent delays caused by fragmented workloads or excessive memory swapping.
Teams often focus on buying powerful GPUs, but very few actually use them efficiently. High GPU cost does not guarantee high performance; utilization is what truly matters. Many training pipelines look busy, but GPUs often sit idle between operations. Understanding the right metrics changes everything:

- Streaming Multiprocessor (SM) Utilization shows how effectively GPU cores stay active. Low values usually mean stalled pipelines or poor workload distribution.
- Memory Bandwidth Utilization reveals data movement limits. Sometimes compute is fast and data transfer becomes the real bottleneck.
- GPU Memory Occupancy determines how well memory is allocated. Underutilized memory reduces batch sizes and slows training efficiency.
- Compute-to-Communication Ratio matters in distributed training. Too much synchronization wastes compute time across nodes.
- Kernel Execution Efficiency measures how well kernels translate theory into real performance. Poor fusion and small kernels introduce hidden delays.
- GPU Idle Time Percentage exposes scheduling gaps. Data loading, networking, or batching issues often starve GPUs.
- End-to-End Throughput per GPU connects engineering metrics to business impact. Higher throughput means faster experiments and lower training cost.

Optimization is rarely about bigger hardware. It's about removing invisible inefficiencies. The best ML teams don't just train models; they engineer utilization. Because in modern AI systems: Performance = Utilization × Efficiency × Throughput.
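A minimal sketch I've added (not from the post) for sampling a few of these metrics with NVIDIA's NVML bindings, assuming `pynvml` is installed (e.g. via the `nvidia-ml-py` package). Sustained low SM utilization while wall-clock time keeps ticking is the classic sign of an idle-GPU gap.

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

for _ in range(10):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent, last period
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes used/total
    print(
        f"SM util: {util.gpu:3d}%  "
        f"mem-bus activity: {util.memory:3d}%  "
        f"occupancy: {mem.used / mem.total:.1%}"
    )
    time.sleep(1.0)

pynvml.nvmlShutdown()
```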
-
🐢🚀 Making GPUs Go Brrr: The Art of Deep Learning Optimization

TL;DR
🧠 Deep learning performance depends on three bottlenecks: compute, memory bandwidth, and overhead. Optimizing requires identifying which regime you're in.
🏭 Compute-bound: Maximize Tensor Core usage (e.g., matmuls) to achieve up to 312 TFLOPS.
🚚 Memory-bound: Use operator fusion to reduce costly memory transfers (e.g., x.cos().cos() is 2x faster when fused).
🐢 Overhead-bound: Framework and Python dispatch costs dominate small ops. Use tracing (jit.trace) or TorchDynamo to reduce overhead.

Problems and Solutions
🐢 Overhead-bound: Use TorchDynamo or CUDA Graphs to reduce Python and framework dispatch costs.
🚚 Memory-bound: Fuse operations (e.g., NVFuser) to avoid repeated memory reads/writes.
🏭 Compute-bound: Focus on Tensor Core utilization for matrix multiplications, as non-matmul operations are 15x slower.

Experiments & Setup
⏱️ PyTorch profiler: Reveals GPU idle gaps caused by CPU overhead (pink CPU vs. green GPU traces).
📦 Batch size test: Doubling the batch size with only a 10% runtime increase indicates overhead-bound operations.
🧮 FLOP counting: Non-matmul ops (e.g., layer norm) consume 0.2% of FLOPs but achieve 250x less efficiency.

Novel Insights
🧩 Operator fusion: Fused gelu costs are similar to relu due to reduced memory transfers.
🔄 Rematerialization: Recomputation can reduce both memory and runtime, as seen in AOTAutograd's min-cut optimization.
📉 Hardware disparity: GPU compute grows faster than memory bandwidth, making memory optimizations increasingly critical.

Improvements Over Prior Work
🧪 TorchDynamo: A JIT compiler that dynamically reduces Python overhead without sacrificing flexibility.
🚀 CUDA Graphs: Eliminates kernel launch overhead but requires static execution.
🔧 NVFuser: Automates operator fusion for pointwise/reduction ops, achieving 2x speedups in some cases.

Key Architecture Details
🧠 Tensor Cores: Specialized for matmuls, achieving 312 TFLOPS, compared to 19.5 TFLOPS for general CUDA cores.
📦 Memory hierarchy: DRAM (global) → SRAM (shared) → registers. Operator fusion minimizes DRAM usage.
🔄 Asynchronous execution: The CPU queues GPU kernels to hide overhead, but small ops leave GPUs idle.

Future Work
🤖 JIT compilers: Combine flexibility and low overhead with VM-level introspection (e.g., TorchDynamo).
🧩 Hardware-software co-design: Optimize for non-matmul ops, especially on TPUs.
📉 Memory-aware training: Automate rematerialization using min-cut algorithms.

Key Visualizations
🏭 Factory analogy: Compute = factory, memory = warehouse, bandwidth = shipping. Optimizing compute means reducing shipping delays.
🔥 Flamegraph: Shows that 90% of PyTorch a + b time is overhead, not actual computation.
📈 Microbenchmark plot: Increasing compute intensity (e.g., repeat=64) shifts operations from memory-bound (0.2 TFLOPS) to compute-bound (9.75 TFLOPS).
👇
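To see the memory-bound point in practice, here's a small benchmark sketch I've added (not from the original post): it times eager x.cos().cos() against a torch.compile'd version, which can fuse the two pointwise ops into one kernel (one read and one write of x instead of two of each). Actual speedups will vary by GPU.

```python
import torch
from torch.utils import benchmark

def double_cos(x):
    return x.cos().cos()

compiled = torch.compile(double_cos)

x = torch.randn(2**24, device="cuda")
compiled(x)  # warm up and trigger compilation

for label, fn in [("eager", double_cos), ("compiled (fused)", compiled)]:
    t = benchmark.Timer(stmt="fn(x)", globals={"fn": fn, "x": x})
    print(label, t.timeit(100))
```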
-
Supercharge Your Model Training: Essential Techniques and Tricks 🚀

Are you tired of long model training times and an inefficient training process? I have always struggled to understand which techniques can be chained together for cumulative improvement, and the order-of-magnitude gain from each. Here is an array of powerful techniques to accelerate training, with their effect sizes. The key in most cases is to know the GPU's memory architecture 💾 and utilize it optimally by reducing data movement between on-chip registers, cache, and off-chip high-bandwidth memory. Frameworks like PyTorch make this pretty simple, letting you do it in a few lines of code at most.

- Switch to Mixed Precision: 🔢 Implementing bfloat16 can lead to a potential 3x speedup by reducing the amount of data transferred, thus enabling larger batch sizes. Although GPUs may promise up to an 8x improvement, actual gains can be lower due to memory constraints. Benchmarking is essential! (See the sketch after this post.)
- PyTorch Compile: 🖥️ Experience about a 2.5x speed increase by minimizing unnecessary memory-bus traffic. This approach prepares your computations for more efficient execution.
- Flash Attention: ⚡ Utilize a fused kernel specifically optimized for attention-heavy models, which can boost performance by up to 40% by enhancing memory-hierarchy utilization.
- Optimized Data Formats: 📊 Aligning your vocab size to a power of 2 can provide a straightforward 10% speed boost by improving memory-access efficiency.
- Hyperparameter Tuning: 🛠️ Gain an additional 5-10% speed by tweaking hyperparameters and employing fused kernels for optimizers like AdamW.
- Bespoke Fused Kernels: 🧩 Push the boundaries with custom kernels designed specifically for your model's architecture to achieve optimal performance.
- Leverage Additional Optimizations: ➕ Employ vector operations (e.g., AVX-512) on CPUs or use sparse kernels for pruned models to further enhance memory efficiency.
- Scale Responsibly: 📈 Before moving to a multi-GPU setup, ensure you've maximized the potential of single-GPU optimizations to avoid inefficiencies. Once your setup is optimized, scaling across multiple GPUs can dramatically reduce training times by parallelizing the workload and minimizing data transfers. You can do this almost trivially using tools like Hugging Face Accelerate.

Remember, the effectiveness of these techniques varies with your specific model, hardware setup, and other variables. Extensive benchmarking is crucial to find the right balance between speed and accuracy. Optimization is a continuous journey: stay proactive in exploring new methods to reduce training times and remain competitive in the fast-evolving field of machine learning. For more insights, check out Karpathy's latest video, where he replicates GPT-2 on 8x A100s, astonishingly beating GPT-3 on HellaSwag. It's incredible to see such advancements, allowing what once took months to be accomplished virtually overnight. 🌙✨
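As an illustration of the first item (mixed precision), here's a minimal sketch with a toy model standing in for yours. With bfloat16, unlike fp16, no gradient scaler is required; this needs an Ampere-or-newer GPU.

```python
import torch
import torch.nn as nn

# Toy stand-ins for a real model and data (illustrative only).
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

x = torch.randn(64, 1024, device="cuda")
y = torch.randn(64, 1024, device="cuda")

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    # Ops inside autocast run in bfloat16 where safe, halving memory traffic.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```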
-
My next tutorial on pretraining an LLM from scratch is now out. It starts with a step-by-step walkthrough of understanding, calculating, and optimizing the loss. After training, we update the text generation function with temperature scaling and top-k sampling. And finally, we also load openly available pretrained weights into our scratch-built model architecture.

Along with this pretraining tutorial, I also have bonus material on speeding up LLM training. These tips apply not just to LLMs but also to other transformer-based models like vision transformers:

1. Instead of saving the causal mask, create it on the fly to reduce memory usage (here it has minimal effect, but it can add up in long-context models like Llama 3.2 with its 131k-input-token support)
2. Use tensor cores (only works for Ampere GPUs like the A100 and newer)
3. Use the fused CUDA kernels for `AdamW` by setting `fused=True`
4. Pre-allocate and re-use GPU memory via the pinned-memory setting in the data loader
5. Switch from 32-bit float to 16-bit brain float (bfloat16) precision
6. Replace from-scratch implementations of attention mechanisms, layer normalizations, and activation functions with PyTorch counterparts that have optimized CUDA kernels
7. Use FlashAttention for more efficient memory read and write operations
8. Compile the model
9. Optimize the vocabulary size
10. After saving memory with the steps above, increase the batch size

A short sketch combining a few of these steps follows below.

Video tutorial: https://lnkd.in/gDRycWea
PyTorch speed-ups: https://lnkd.in/gChvGCJH
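Here's a sketch I've put together (not from the tutorial) combining steps 2, 3, 5, 8, and 9 above, with a toy two-layer model standing in for the scratch-built GPT:

```python
import torch
import torch.nn as nn

torch.set_float32_matmul_precision("high")  # step 2: enable TF32 tensor cores

# Tiny stand-in for the tutorial's scratch-built GPT (illustrative only).
model = nn.Sequential(
    nn.Embedding(50304, 768),       # step 9: vocab padded to a "nice" size
    nn.Linear(768, 50304, bias=False),
).cuda()

model = torch.compile(model)        # step 8: compile the model

# Step 3: fused AdamW CUDA kernel.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)

tokens = torch.randint(0, 50304, (8, 1024), device="cuda")

# Step 5: bfloat16 autocast. (Step 7 would apply inside a real attention
# block via torch.nn.functional.scaled_dot_product_attention.)
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    logits = model(tokens)
    loss = logits.float().mean()    # dummy loss for the sketch

loss.backward()
optimizer.step()
```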
-
𝗘𝘅𝗽𝗹𝗮𝗶𝗻 𝗧𝗵𝗶𝘀: 𝗟𝗹𝗮𝗺𝗮 𝟯 𝗡𝗲𝗲𝗱𝘀 𝟮.𝟰𝗧𝗕. 𝗬𝗼𝘂𝗿 𝗚𝗣𝗨 𝗛𝗮𝘀 𝟴𝟬𝗚𝗕. 𝗜𝘁 𝗦𝘁𝗶𝗹𝗹 𝗧𝗿𝗮𝗶𝗻𝘀.

Training Llama-3 405B needs ~2.4TB with BF16 + 8-bit Adam:
• Weights: 810GB
• Gradients: 810GB
• Optimizer: 810GB (vs. 3.24TB with standard Adam!)
• Total: ~2.4TB (an illustrative, config-dependent budget; FP32 master weights, ZeRO stage, and offload all change the totals — quick arithmetic check below)

Your H100? 80GB. You'd need 30+ GPUs just to hold everything.

𝗧𝗵𝗿𝗲𝗲 𝗧𝗿𝗶𝗰𝗸𝘀 𝗧𝗵𝗮𝘁 𝗠𝗮𝗸𝗲 𝗜𝘁 𝗪𝗼𝗿𝗸

𝟭. 𝗗𝗮𝘁𝗮 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹: Split the batch. Problem: each GPU still needs 2.4TB. Fix: ZeRO shards it across N GPUs.
𝟮. 𝗠𝗼𝗱𝗲𝗹 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹: Split the layers. Problem: sequential bottleneck. Fix: pipeline the batches.
𝟯. 𝗦𝗲𝗾𝘂𝗲𝗻𝗰𝗲 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹: Split the tokens. This is the game changer: 8K tokens → 8 GPUs → 1K each. But attention needs every token to see all the others.

𝗧𝗵𝗲 𝗠𝗮𝗴𝗶𝗰 𝗠𝗼𝗺𝗲𝗻𝘁: Instead of moving the 2.4TB model, GPUs only exchange attention keys/values (K,V). Each GPU:
• Computes K,V for its 1K tokens (32MB)
• Sends them to the others via all-to-all
• Receives 7×32MB = 224MB total
• Computes attention, then deletes the copies

𝟮𝟮𝟰𝗠𝗕 𝗺𝗼𝘃𝗲𝗱 𝗶𝗻𝘀𝘁𝗲𝗮𝗱 𝗼𝗳 𝟮.𝟰𝗧𝗕. That's 10,000x less.

𝗧𝗵𝗲 𝗥𝗲𝘀𝘂𝗹𝘁: Combine them all (ZeRO + tensor + pipeline + sequence parallel) and each GPU holds ~75GB instead of 2.4TB. This exact choreography powers ChatGPT, Claude, and every frontier model. Without it? 10K-token limits. With it? Entire books in one context. Not magic. Just brilliant engineering making the impossible routine.
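A quick back-of-the-envelope check of the figures above, as a plain Python sketch. It deliberately ignores activations, FP32 master weights, and communication buffers, so treat it as illustrative only.

```python
params = 405e9  # Llama-3 405B

weights_gb    = params * 2 / 1e9       # BF16 weights: 2 bytes/param
grads_gb      = params * 2 / 1e9       # BF16 gradients
optim_gb      = params * 2 * 1 / 1e9   # 8-bit Adam: 2 states x 1 byte
optim_fp32_gb = params * 2 * 4 / 1e9   # standard Adam: 2 states x 4 bytes

print(f"weights:   {weights_gb:,.0f} GB")   # ~810 GB
print(f"gradients: {grads_gb:,.0f} GB")     # ~810 GB
print(f"optimizer: {optim_gb:,.0f} GB (vs {optim_fp32_gb / 1000:.2f} TB fp32)")

total_tb = (weights_gb + grads_gb + optim_gb) / 1000
print(f"total:     ~{total_tb:.1f} TB; sharded over 32 GPUs: "
      f"{total_tb * 1000 / 32:.0f} GB each")   # ~76 GB, matching "~75GB"
```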
-
Most engineers think model cost is about API tokens or inference time. In reality, it's about how your requests compete for GPU scheduling and how effectively your data stays hot in cache. Here's the untold truth 👇

1. 𝐄𝐯𝐞𝐫𝐲 𝐦𝐢𝐥𝐥𝐢𝐬𝐞𝐜𝐨𝐧𝐝 𝐨𝐧 𝐚 𝐆𝐏𝐔 𝐢𝐬 𝐚 𝐰𝐚𝐫 𝐟𝐨𝐫 𝐩𝐫𝐢𝐨𝐫𝐢𝐭𝐲. Your model doesn't just "run." It waits its turn. Schedulers (like Kubernetes device plugins, Triton schedulers, or CUDA MPS) decide who gets compute time — and how often. If your jobs are fragmented or unbatched, you're paying for idle silicon. That's like renting a Ferrari to sit in traffic.

2. 𝐂𝐚𝐜𝐡𝐢𝐧𝐠 𝐥𝐚𝐲𝐞𝐫𝐬 𝐪𝐮𝐢𝐞𝐭𝐥𝐲 𝐝𝐞𝐜𝐢𝐝𝐞 𝐲𝐨𝐮𝐫 𝐛𝐮𝐫𝐧 𝐫𝐚𝐭𝐞. Intermediate activations, embeddings, and KV caches live in high-bandwidth memory. If your model keeps reloading them between requests, you're paying full price every time. That's why serving infra (like vLLM, DeepSpeed, or FasterTransformer) focuses more on cache reuse than raw FLOPS (rough sizing sketch at the end of this post). The real optimization isn't in "faster models." It's in smarter scheduling and cache locality. Your cost per token can drop 50% with zero model changes — just better orchestration.

3. 𝐓𝐡𝐞 𝐡𝐢𝐝𝐝𝐞𝐧 𝐭𝐚𝐱: 𝐟𝐫𝐚𝐠𝐦𝐞𝐧𝐭𝐚𝐭𝐢𝐨𝐧 𝐚𝐧𝐝 𝐞𝐯𝐢𝐜𝐭𝐢𝐨𝐧. When too many models share the same GPU cluster, the scheduler starts slicing compute and evicting caches. This leads to context thrashing, where memory swaps cost more than inference. At scale, this kills both performance and margins.

So if you're wondering why your inference bill doubled while latency stayed the same — don't blame the model. Blame the infrastructure design. The real bottleneck isn't model size; it's architectural awareness. Understanding schedulers, memory hierarchies, and caching strategies is what separates AI engineers from AI architects.

And that's exactly what we go deep into inside the Advanced System Design Cohort — a 3-month, high-intensity program for Senior, Staff, and Principal Engineers who want to master the systems that power modern AI infra. You'll learn to think beyond API calls, about how compute, caching, and scheduling interact to define scale and cost. If you're ready to learn the architectures behind real AI systems, there's a form in the comments. Apply, and we'll check if you're a great fit. We're selective, because this is where future technical leaders are being built.
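To put a number on point 2, here's a rough sketch I've added estimating the per-request KV-cache footprint for a hypothetical Llama-70B-style configuration. Every value here is an assumption for illustration, not a measurement.

```python
# Assumed GQA config (illustrative): 80 layers, 8 KV heads, head_dim 128.
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2    # fp16/bf16
ctx = 8192            # tokens kept hot per request

# Two tensors (K and V) per layer, per token.
kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx
print(f"KV cache per request: {kv_bytes / 2**30:.1f} GiB")  # ~2.5 GiB

# Recomputing (prefilling) this on every request instead of reusing it is
# the "paying full price every time" the post describes.
```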
-
“Just rent a GPU for training.” Until you need:
- Multi-node training for 70B+ models
- $5/hour per GPU (not $30/hour)
- 90%+ GPU utilization
Then you build your own ML infra.

Here's the reality: most ML engineers think training infrastructure =
- Rent some A100s
- Install PyTorch
- Run the training script
- Scale with more GPUs

The pain starts around 8 GPUs. Remember: you're not training ONE model on ONE GPU. You're orchestrating DOZENS of experiments across hundreds of GPUs with checkpointing, fault tolerance, and resource sharing. That's a scheduling problem, not a training problem.

What you actually need:
> A job scheduler that understands GPU topology
> A distributed checkpoint manager that doesn't waste bandwidth
> A network fabric optimized for all-reduce
> Elastic training that handles node failures
This is the actual platform.

Your training cost breakdown at scale:
> Compute: $10/GPU-hour (you pay $30 on cloud)
> Data transfer: $2/TB (kills you with large datasets)
> Storage: $0.02/GB-month (checkpoints add up fast)
> Network: included (but becomes the bottleneck)
The hidden cost? Idle GPU time while debugging.

The first principle of distributed training: bandwidth >> compute for models over 10B params. Ring all-reduce needs 2(N-1)/N bandwidth efficiency. With 64 GPUs on 3.2 Tbps InfiniBand, you max out at ~200 GB/s actual throughput. This is why "just add more GPUs" plateaus.

Training Llama 70B:
- 140GB model weights
- Optimizer states: 280GB
- Checkpoints every 1K steps
- 30 checkpoints = 12.6TB
One training run ≈ $250 in storage, and you run 50 experiments/month (arithmetic sketch below).

“We need to train 10 models simultaneously with different hyperparameters.” Now your platform needs:
> Gang scheduling for multi-GPU jobs
> Spot-instance preemption handling
> Shared dataset caching across jobs
> Priority queues with fairness
90% of DIY platforms can't do this.

> Use cloud when you're training <5 models/month, using standard frameworks, can tolerate random failures, and engineering time costs more than the GPU markup.
> Build your own when you train 20+ models/month, need 70B+ params, want <$10/GPU-hour, or are spending $50K+/month.

The actual math: AWS p5.48xlarge (8× H100) costs $98/hour; 100 training runs × 48 hours = $470,400/year. Your bare-metal 64× H100 cluster at $2.5M upfront: depreciation + power = $150K/year; at 60% utilization = $312,500, plus a $200K engineer and $50K maintenance. Break-even: 18 months.

Production training platforms have four layers:
- Orchestration (job queue, gang scheduler, resource manager)
- Execution (distributed runtime, checkpoint manager, fault handler)
- Storage (dataset cache, checkpoint store, artifact registry)
- Telemetry (GPU util, training metrics, cost per epoch)
Most teams build layer 2 and skip the rest.

That's it. Building training infrastructure is a 9-month project with upfront hardware costs. But at 100+ training runs/month? ROI in 12 months.

#ml #gpu #llm #infra #cloud #nvidia #inference #aws #ai
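As mentioned above, the checkpoint-storage arithmetic checks out in a few lines of Python (using only the post's own numbers):

```python
weights_gb, optim_gb = 140, 280
ckpt_gb = weights_gb + optim_gb       # 420 GB per checkpoint
n_ckpts = 30
total_tb = ckpt_gb * n_ckpts / 1000   # 12.6 TB per training run

price_per_gb_month = 0.02
cost = ckpt_gb * n_ckpts * price_per_gb_month
print(f"{total_tb:.1f} TB stored -> ${cost:,.0f}/month per run; "
      f"x50 experiments/month = ${cost * 50:,.0f}/month")
```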
-
People often ask how prices like $2.8/M tokens for Llama 405B, while being super fast, are still profitable at Lepton AI. We've even been asked by a leading GPU provider! So I figured we should share some technical analysis; this information could benefit the community. We've taken these statistics and analyses for granted, but they might not be obvious to everyone.

1. Big batches: Each request receives an output of ~30 tokens/second. Batching (grouping multiple requests simultaneously) significantly improves total throughput, often 10x or higher than a single request. GPUs are more efficient with larger batches.
2. Dynamic batching: This technique immediately adds a new request to an existing batch instead of making it wait, ensuring the GPU always works at high capacity (toy sketch at the end of this post).
3. Input tokens: The ~30 tokens/second refers to output tokens. Input tokens are processed much faster (known as "prefilling"). Typically, the input is many times longer than the output (3x to 10x). This increases the total number of tokens processed, explaining why input and output are often billed separately.
4. Quantization: Using 8-bit integers or 8-bit floats instead of 16-bit floats reduces memory usage and speeds up processing because the GPU accesses less memory. Newer GPUs also have hardware instructions for lower-bit numbers, increasing speed further. For example, the new NVIDIA Blackwell GPU supports 4-bit floats (FP4). Quantization also saves memory, allowing even bigger batches (see point 1), making it more economical.
5. Speculative decoding: This method uses a smaller model to predict the next token. For example, predicting "you" after "it is good to see" doesn't require a large model, and smaller models make such predictions faster. The Medusa algorithm by Tianle Cai is a specific example of this approach.
6. Prompt caching: LLMs often encounter repeated prefixes, such as "you are a smart AI agent" in system prompts. Caching these prefilled prompts avoids recalculating them, speeding up repeated requests.
7. Optimizing GPU setups: This involves using large GPUs for big models, small GPUs for small models, and matching GPUs to specific tasks—some are better for prefilling, others for decoding. There are many optimization opportunities here.

This is not a complete list. We integrate these methods (and a growing number of others) in our runtime to ensure profitability with reasonable traffic. Lepton is created by experts who have developed key AI software over the past decade - Caffe, ONNX, PyTorch - alongside cloud experts like the creator of etcd and core contributors to Kubernetes. We provide not only LLM APIs but also a full cloud-native experience to help you find, use, and optimize GPUs on our cloud platform. We love the open-source and open-access community. What AI technical explanation would you like to hear next?
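Here's the toy sketch of the dynamic (continuous) batching idea from point 2, referenced above. Everything is illustrative: `decode_step` stands in for one forward pass of a real serving engine, and no real model is involved.

```python
import collections

queue = collections.deque()   # incoming requests waiting for a slot
active = []                   # requests currently in the running batch
MAX_BATCH = 8

def decode_step(batch):
    """Placeholder for one forward pass: one new token per active request."""
    for req in batch:
        req["generated"] += 1

def serve(steps=100):
    for _ in range(steps):
        # Admit waiting requests into free batch slots immediately,
        # instead of waiting for the whole batch to finish.
        while queue and len(active) < MAX_BATCH:
            active.append(queue.popleft())
        if active:
            decode_step(active)
        # Retire finished requests, freeing their slots for the queue.
        active[:] = [r for r in active if r["generated"] < r["max_tokens"]]

queue.extend({"generated": 0, "max_tokens": t} for t in (5, 40, 12))
serve()
```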
-
This is how I 𝗿𝗲𝗱𝘂𝗰𝗲𝗱 the 𝗹𝗮𝘁𝗲𝗻𝗰𝘆 of my 𝗣𝘆𝗧𝗼𝗿𝗰𝗵 𝗰𝗼𝗱𝗲 by 𝟴𝟮% 𝘂𝘀𝗶𝗻𝗴 𝗼𝗻𝗹𝘆 𝗣𝘆𝘁𝗵𝗼𝗻 & 𝗣𝘆𝗧𝗼𝗿𝗰𝗵. 𝗡𝗢 𝗳𝗮𝗻𝗰𝘆 𝘁𝗼𝗼𝗹𝘀 𝗶𝗻𝘃𝗼𝗹𝘃𝗲𝗱!

𝙏𝙝𝙚 𝙥𝙧𝙤𝙗𝙡𝙚𝙢? During inference, I chain 5 DL models to process ~25k images, and the script takes ~4 hours to run. The problem is that this isn't a batch job that runs overnight: various people across the company need it to run in "real time" multiple times a day.

𝙏𝙝𝙚 𝙨𝙤𝙡𝙪𝙩𝙞𝙤𝙣? The first thing that might come to mind is some fancy optimizer (e.g., TensorRT). Even though that should be done at some point, first you should 𝗮𝘀𝗸 𝘆𝗼𝘂𝗿𝘀𝗲𝗹𝗳:
- Where are the I/O bottlenecks (reading & writing images)?
- Can preprocessing & postprocessing be parallelized?
- Are the CUDA cores used at their maximum potential?
- Is the bandwidth between the CPU & GPU throttled?
- Can we move more computation to the GPU?

That being said, here is how I 𝗱𝗲𝗰𝗿𝗲𝗮𝘀𝗲𝗱 the 𝗹𝗮𝘁𝗲𝗻𝗰𝘆 of the script by 𝟴𝟮%:

𝟭. 𝗕𝗮𝘁𝗰𝗵𝗲𝗱 𝘁𝗵𝗲 𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝘀𝗮𝗺𝗽𝗹𝗲𝘀. Batching is valuable for training but also powerful for speeding up inference; otherwise, you waste your GPU's CUDA cores. Instead of passing samples through the models one at a time, I now process 64.

𝟮. 𝗟𝗲𝘃𝗲𝗿𝗮𝗴𝗲𝗱 𝗣𝘆𝗧𝗼𝗿𝗰𝗵'𝘀 𝗗𝗮𝘁𝗮𝗟𝗼𝗮𝗱𝗲𝗿. This has 2 main advantages: parallel data loading & preprocessing on multiple processes (NOT threads), and copying your input images directly into pinned memory (avoiding a CPU -> CPU copy operation).

𝟯. 𝗠𝗼𝘃𝗲𝗱 𝗮𝘀 𝗺𝘂𝗰𝗵 𝗼𝗳 𝘁𝗵𝗲 𝗽𝗼𝘀𝘁𝗽𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴 𝗼𝗻 𝘁𝗵𝗲 𝗚𝗣𝗨. I saw that the tensor was moved to the CPU too early and mapped to a NumPy array. I refactored the code to keep it on the GPU as long as possible, which had 2 main advantages: tensors are processed faster on the GPU, and at the end of the logic I had smaller tensors, resulting in smaller transfers between the CPU & GPU.

𝟰. 𝗠𝘂𝗹𝘁𝗶𝘁𝗵𝗿𝗲𝗮𝗱𝗶𝗻𝗴 𝗳𝗼𝗿 𝗮𝗹𝗹 𝗺𝘆 𝗜/𝗢 𝘄𝗿𝗶𝘁𝗲 𝗼𝗽𝗲𝗿𝗮𝘁𝗶𝗼𝗻𝘀. For I/O bottlenecks, Python threads are extremely powerful. I moved all my writes under a 𝘛𝘩𝘳𝘦𝘢𝘥𝘗𝘰𝘰𝘭𝘌𝘹𝘦𝘤𝘶𝘵𝘰𝘳, batching my write operations.

Note that I used only good old Python & PyTorch code. → When the code is poorly written, no tool can save you. Only now is it time to add fancy tooling, such as TensorRT.

So remember, to cut your PyTorch latency by 82%:
1. Batch the inference samples
2. Leverage PyTorch's DataLoader
3. Move as much of the postprocessing as possible to the GPU
4. Use multithreading for all I/O write operations

#machinelearning #mlops #datascience
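A minimal sketch of tricks 1, 2, and 4, with a random-tensor dataset and a one-layer model standing in for the real pipeline (my illustration, not the author's code):

```python
from concurrent.futures import ThreadPoolExecutor
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

class RandomImages(Dataset):
    """Stand-in for the real image dataset (illustrative only)."""
    def __len__(self):
        return 25_000
    def __getitem__(self, i):
        return torch.rand(3, 224, 224), i

model = nn.Conv2d(3, 8, 3).cuda().eval()   # stand-in for the 5-model chain

loader = DataLoader(
    RandomImages(),
    batch_size=64,        # trick 1: batch the inference samples
    num_workers=8,        # trick 2: parallel loading in worker processes
    pin_memory=True,      # trick 2: pinned memory for fast async H2D copies
)  # (wrap this script in `if __name__ == "__main__":` on spawn platforms)

def save_output(arr, ids):
    pass  # placeholder for the real disk writes

writer = ThreadPoolExecutor(max_workers=4)  # trick 4: threaded I/O writes

with torch.inference_mode():
    for batch, ids in loader:
        batch = batch.cuda(non_blocking=True)  # async copy from pinned mem
        out = model(batch)
        # Trick 3 (abridged): keep postprocessing on the GPU and move only
        # the small final tensors back to the CPU.
        writer.submit(save_output, out.mean(dim=(2, 3)).cpu().numpy(), ids)

writer.shutdown(wait=True)
```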