Optimizing GPU Memory: 3 Techniques for Efficient Local Training Using PyTorch
Training and fine-tuning deep learning models locally can be tough—especially when working with limited GPU resources. My setup includes an RTX 3050 Ti with just 4GB of VRAM, and during a recent project fine-tuning EfficientNetB3 for a classification task, I ran headfirst into the dreaded OutOfMemory (OOM) error.
This article walks through several practical memory optimization techniques that helped me work around OOM issues and complete training successfully—without needing access to cloud GPUs or expensive hardware.
Whether you're working on a laptop GPU or optimizing for edge environments, these strategies can help you train smarter, not harder.
1. Go-To Memory-Saving Techniques
When training locally, your first step should be applying quick, low-effort optimizations. These are often overlooked but can significantly reduce GPU memory usage and improve training stability.
Garbage Collection
In PyTorch, for example, GPU memory isn't always released immediately after each batch or epoch. You may still be holding references to unused tensors, especially when experimenting in notebooks or long-running scripts.
Manually clearing memory returns cached, no-longer-used allocations to the GPU before they pile up (note that tensors you still hold references to are not freed):
import torch

# Release cached GPU blocks that PyTorch is no longer using
torch.cuda.empty_cache()
Restarting the notebook kernel can also help when memory stays stuck between runs.
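As a rough sketch (num_epochs and train_one_epoch are placeholders, not code from the project), this is how you might clear memory and check peak usage between epochs:

import gc
import torch

for epoch in range(num_epochs):
    train_one_epoch(model, dataloader)  # placeholder for your own training step

    # Drop Python references to dead tensors, then release PyTorch's cached GPU blocks
    gc.collect()
    torch.cuda.empty_cache()

    # Peak memory for this epoch, handy for spotting creeping usage
    peak_mb = torch.cuda.max_memory_allocated() / 1024 ** 2
    print(f"Epoch {epoch}: peak GPU memory {peak_mb:.0f} MB")
    torch.cuda.reset_peak_memory_stats()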
Optimized DataLoaders
Efficient data loading keeps the GPU constantly fed with data instead of waiting on the CPU, which avoids performance bottlenecks and reduces idle memory consumption. Picking the right batch size matters just as much: aim for the largest size that fits in VRAM without triggering OOM (see the sketch after the snippet below).
Here is a typical setup:
dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=2,    # background workers keep the GPU fed with batches
    pin_memory=True   # page-locked memory for faster host-to-GPU transfer
)
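If you are unsure which batch size fits, one rough way to find out is to probe increasing sizes until CUDA raises an out-of-memory error. This is only a sketch (largest_batch_size and sample_shape are made-up names), and it exercises just the forward pass, so leave headroom for gradients and optimizer state during real training:

import torch

def largest_batch_size(model, sample_shape, candidates=(8, 16, 32, 64)):
    best = candidates[0]
    for bs in candidates:
        try:
            dummy = torch.randn(bs, *sample_shape, device='cuda')
            with torch.no_grad():
                model(dummy)          # forward pass only; training needs more memory
            best = bs
        except torch.cuda.OutOfMemoryError:
            break
        finally:
            torch.cuda.empty_cache()  # return cached blocks before the next attempt
    return best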
2. Advanced Memory Optimization Techniques
Gradient Accumulation
Normally, gradients are computed and an optimizer step is taken after every batch. But if memory limits force you to use a very small batch size, the gradient estimates become noisier and convergence can suffer.
Gradient accumulation simulates a larger batch size by accumulating gradients over multiple mini-batches before updating weights:
accumulation_steps = 4  # effective batch size = batch_size * accumulation_steps

for i, (inputs, targets) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, targets)

    # Scale the loss so the accumulated gradient averages over the mini-batches
    loss = loss / accumulation_steps
    loss.backward()

    # Update weights only once every accumulation_steps mini-batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
Mixed Precision Training
Mixed precision uses 16-bit floating point (FP16) operations where possible instead of the default 32-bit (FP32). This reduces memory consumption and often speeds up training thanks to better hardware utilization (especially on NVIDIA GPUs with Tensor Cores).
In PyTorch, this can be enabled using the torch.amp module. The GradScaler scales the loss up before the backward pass so small FP16 gradients don't underflow to zero, then unscales them before the optimizer step:
from torch.amp import autocast, GradScaler

scaler = GradScaler('cuda')

for inputs, targets in dataloader:
    # Run the forward pass in mixed precision (FP16 where it is safe to do so)
    with autocast('cuda'):
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

    # Scale the loss to prevent FP16 gradient underflow, then step and update the scale factor
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
Gradient Checkpointing
When training large models, intermediate activations from each layer are stored in memory to compute gradients during backpropagation. This can quickly consume GPU memory, especially with deep architectures.
Gradient checkpointing helps by not storing all activations—instead, it recomputes them on the fly during the backward pass. The tradeoff? You use less memory but pay a bit more in compute time.
Let’s say you have a custom model made up of several blocks:
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(...)
        self.block2 = nn.Sequential(...)
        self.block3 = nn.Sequential(...)
        self.classifier = nn.Linear(...)

    def forward(self, x):
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        return self.classifier(x)
If block2 is memory-heavy, you can apply checkpointing just to that part:
from torch.utils.checkpoint import checkpoint

def forward(self, x):
    x = self.block1(x)
    # Recompute block2's activations during backward instead of storing them
    x = checkpoint(self.block2, x, use_reentrant=False)
    x = self.block3(x)
    return self.classifier(x)
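If the memory-heavy part of your network is already an nn.Sequential, PyTorch also offers checkpoint_sequential, which splits it into segments and checkpoints each one. A minimal sketch (the segment count of 2 is just an example):

from torch.utils.checkpoint import checkpoint_sequential

def forward(self, x):
    x = self.block1(x)
    # Split block2 into 2 segments and recompute activations inside each during backward
    x = checkpoint_sequential(self.block2, 2, x, use_reentrant=False)
    x = self.block3(x)
    return self.classifier(x)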
A Project!
This article is based on a real-world experiment I conducted while fine-tuning EfficientNetB3 on a multi-class image classification problem — using a local GPU with only 4GB of VRAM (RTX 3050 Ti).
The challenge? Avoiding OutOfMemory errors while keeping training performance reasonable.
Throughout the project, I tested and applied multiple memory optimization techniques — from basic tricks to advanced strategies — which I’ve documented here.
🔗 You can find the full project code and setup in the GitHub repository
Wrap-Up: Making the Most of Limited VRAM
Training and fine-tuning deep learning models on a local, low-memory GPU isn’t easy — but with the right techniques, it’s absolutely doable.
There are other techniques worth exploring beyond the ones covered here, but the real power comes from combining these methods. In my case, mixing smaller batches, smarter data loading, and mixed precision let me fine-tune EfficientNetB3 on a 4GB GPU without OOM errors, and still achieve good results.
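To make that concrete, here is a rough sketch (not the exact project code) of mixed precision and gradient accumulation working together in a single loop:

from torch.amp import autocast, GradScaler

scaler = GradScaler('cuda')
accumulation_steps = 4

for i, (inputs, targets) in enumerate(dataloader):
    # non_blocking=True pairs with pin_memory=True in the DataLoader
    inputs = inputs.to('cuda', non_blocking=True)
    targets = targets.to('cuda', non_blocking=True)

    with autocast('cuda'):
        outputs = model(inputs)
        loss = loss_fn(outputs, targets) / accumulation_steps

    scaler.scale(loss).backward()

    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()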
That said… there's no magic trick.
Memory optimization helps push the limits, but it can’t replace the benefits of better hardware. If you’re working on larger datasets or models, a more powerful GPU will always bring more reliability, faster experimentation, and less frustration.
Thanks for reading!
Happy training!