Vision Transformers with PyTorch: I trained a small Vision Transformer architecture from scratch using PyTorch to better understand how it works. The 28 x 28 input image is converted into a 4 x 4 grid of 7 x 7 patches using a convolutional operation, giving a sequence of shape (16, 500): 16 patches, each with an embedding dimension of 500. Next, we apply positional embeddings via element-wise addition, so the shape stays the same. We then compute the Queries (Q), Keys (K), and Values (V) by matrix-multiplying the (16, 500) sequence with the learnable 500 x 500 weight matrices Wq, Wk, and Wv. From there, we compute the attention scores, apply a softmax, and multiply by V to get a (16, 500) tensor. Finally, we apply layer normalization, pass the features through a feed-forward network that expands and contracts the dimensions, and route the output to a linear classifier. To test the pipeline, I trained it on the MNIST dataset and reached ~94% accuracy in 5 epochs. #ComputerVision #DeepLearning #VisionTransformers
Vision Transformers with PyTorch Achieve 94% Accuracy on MNIST
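A condensed, single-head sketch of the attention step described in the post above. The (16, 500) shapes match the post; the random matrices here stand in for the learned Wq/Wk/Wv projections, and the residual connections, layer norm, FFN, and classifier are omitted.

```python
import torch
import torch.nn.functional as F

seq_len, d = 16, 500                       # 16 patches, embedding dim 500
x = torch.randn(seq_len, d)                # patch embeddings + positional embeddings

Wq, Wk, Wv = (torch.randn(d, d) * d**-0.5 for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv           # each (16, 500)

scores = Q @ K.T / d**0.5                  # (16, 16) raw attention scores
attn = F.softmax(scores, dim=-1)           # softmax over the keys
out = attn @ V                             # back to (16, 500)
print(out.shape)                           # torch.Size([16, 500])
```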
More Relevant Posts
I benchmarked TurboQuant (ICLR 2026) on a DGX Spark (GB10 / SM121). Here's what the paper doesn't tell you.

The algorithm works, really well. On Qwen2.5-3B (dense attention):
3-bit KV cache compression
5× memory reduction (289 MB → 57.6 MB @ 8K context)
Cosine similarity: 0.9943
Perfect Needle-in-a-Haystack retrieval
QJL bias correction is real: MSE-only quantization introduces systematic bias, and QJL removes it.

Then I tried Qwen3.5-35B. Qwen3.5-35B-A3B is a hybrid attention model:
40 layers total
10 layers → full attention (with KV cache)
30 layers → GatedDeltaNet (linear attention, fixed state, no KV cache)
👉 Only 25% of the layers have a KV cache to compress.

So what happens in practice?
Paper claim: ~6× KV cache reduction
Real benchmark (200K context): ~1.6 GB → ~400 MB
Useful, but nowhere near 6×.

Why? Because TurboQuant can only compress what exists. If your model doesn't use a KV cache, there's nothing to compress.

Key insight: TurboQuant's performance is bounded by the attention architecture.
Dense attention → near paper-level gains
Hybrid attention → structurally capped gains

Speed? Not on GB10 (yet). The PyTorch implementation is ~500× slower than the baseline; the paper's 8× speedup depends on fused CUDA kernels (H100), which are not available for SM121.

The real lesson: before estimating TurboQuant's benefit on any model, check what fraction of its layers actually use a KV cache (a back-of-the-envelope sketch is below).

TL;DR: The algorithm works. The paper is not wrong. But real-world gains depend on the model architecture.

Final thought: TurboQuant doesn't fail on Qwen3.5. Qwen3.5 just doesn't give it enough KV cache to matter.

🔗 Full benchmark: https://lnkd.in/gnnk--ub

#LLM #MachineLearning #Inference #KVCache #TurboQuant #AIInfrastructure #DGXSpark
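A back-of-the-envelope sketch of why hybrid attention caps the gains. The head counts, head dim, and dtype below are illustrative placeholders, not numbers taken from the benchmark.

```python
def kv_cache_bytes(kv_layers, kv_heads, head_dim, seq_len, bytes_per_value):
    # K and V each store (kv_heads * head_dim) values per token, per KV-cache layer.
    return 2 * kv_layers * kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical hybrid model: 40 layers, only 10 of which keep a KV cache.
hybrid_fp16 = kv_cache_bytes(kv_layers=10, kv_heads=2, head_dim=128,
                             seq_len=200_000, bytes_per_value=2)
hybrid_3bit = hybrid_fp16 * 3 / 16        # ideal 3-bit quantization of that cache
dense_fp16 = kv_cache_bytes(kv_layers=40, kv_heads=2, head_dim=128,
                            seq_len=200_000, bytes_per_value=2)

print(f"hypothetical fully dense fp16 cache: {dense_fp16 / 2**30:.2f} GiB")
print(f"hybrid fp16 cache:                   {hybrid_fp16 / 2**30:.2f} GiB")
print(f"hybrid 3-bit cache:                  {hybrid_3bit / 2**30:.2f} GiB")
# The 30 linear-attention layers hold no growing KV cache, so there is far less
# to compress in the first place; the absolute savings are capped by the 10 KV layers.
```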
Part 3: Slicing the FFN Monster and the GLU Challenge 📉

The feed-forward network (FFN) is the parameter monster of modern LLM architectures. While multi-head attention gets all the spotlight, the massive FFN layers usually make up two-thirds of the total parameter count. They are the dense knowledge reservoirs, storing associations learned during pre-training. If you want to shrink a model efficiently for edge hardware or extreme latency budgets, you cannot ignore FFN width. This final pillar of width pruning focuses on FFN sparsification.

The Core Metric: Neuron Pruning
Similar to pruning attention heads, we analyze individual neurons within the dense FFN layers during inference on our target-domain data. We apply a metric, often weight magnitude or the L2 norm of the incoming weights, to score each neuron's contribution, then identify and "slice out" (set the weights to zero) the neurons that contribute the least.

The Advanced Challenge: Gated Linear Units (GLU)
Modern architectures (Llama-3, Gemma, etc.) don't use simple ReLU FFNs; they use Gated Linear Unit variants such as SwiGLU. Pruning these is significantly more complex because they involve multiplicative interactions: if you prune a neuron in the "gate" projection but not its counterpart in the parallel "up" projection, the element-wise product is no longer aligned and the gating logic breaks. Structural width pruning on GLU architectures therefore requires meticulous index mapping across the gate, up, and down projections (see the sketch below).

The Final Scalable Architecture
Width pruning, combined with depth pruning (see Part 1's setup), lets you engineer a custom, blazingly fast small language model. We are no longer just consuming APIs; we are re-architecting the networks themselves to match our production demands exactly.

I'd love to hear your thoughts on this series! Which structural technique are you most excited to experiment with in your own stack? 👇

#AI #DeepLearning #GLU #ModelRearchitecting #SoftwareEngineering #TechLeadership #LLMs
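A minimal sketch of GLU-aware neuron pruning. It assumes the common Llama-style weight layout (gate/up weights of shape [d_ff, d_model], down of shape [d_model, d_ff], no biases) and scores neurons by the L2 norm of their gate rows; the key point is that all three projections are sliced with the same indices.

```python
import torch

def swiglu_keep_indices(w_gate: torch.Tensor, keep_ratio: float = 0.75) -> torch.Tensor:
    """Score each FFN hidden neuron by the L2 norm of its gate-projection row
    and return the indices of the neurons to keep."""
    scores = w_gate.norm(dim=1)                       # one score per hidden neuron
    k = max(1, int(keep_ratio * scores.numel()))
    return torch.topk(scores, k).indices.sort().values

def prune_swiglu(w_gate, w_up, w_down, keep_ratio: float = 0.75):
    """Slice gate, up, and down with the SAME indices so the element-wise
    silu(x @ w_gate.T) * (x @ w_up.T) gating stays aligned after pruning."""
    keep = swiglu_keep_indices(w_gate, keep_ratio)
    return w_gate[keep, :], w_up[keep, :], w_down[:, keep]

# g, u, d = torch.randn(1024, 256), torch.randn(1024, 256), torch.randn(256, 1024)
# g2, u2, d2 = prune_swiglu(g, u, d, keep_ratio=0.5)   # 1024 -> 512 hidden neurons
```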
Geospatial Technologies essential keywords, daily tips 🌎

Keyword: GeoTorch
Category: Programming

**GeoTorch** is an open-source, GPU-accelerated geospatial deep-learning framework built on top of PyTorch. It streamlines the ingestion, preprocessing, and augmentation of raster, vector, and multi-spectral imagery by leveraging GDAL, Rasterio, and GeoPandas under the hood, while exposing a familiar PyTorch `Dataset`/`DataLoader` API. This integration allows GIS practitioners to scale complex models, such as CNNs for land-cover classification, U-Nets for semantic segmentation, or transformer-based architectures for point-cloud processing, directly onto large geospatial datasets without manual data conversion or extensive preprocessing pipelines. 🚀 GeoTorch's design embraces

#TechGeoMapping #EssentialKeywords
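Not GeoTorch's actual API: a generic PyTorch `Dataset` wrapped around Rasterio, just to illustrate the raster-to-tensor pattern the post describes. The file paths, band layout, and single-band label rasters are placeholder assumptions.

```python
import rasterio
import torch
from torch.utils.data import Dataset, DataLoader

class RasterTileDataset(Dataset):
    """Pairs of image tiles and label masks stored as GeoTIFFs."""
    def __init__(self, tile_paths, label_paths):
        self.tile_paths = tile_paths
        self.label_paths = label_paths

    def __len__(self):
        return len(self.tile_paths)

    def __getitem__(self, idx):
        with rasterio.open(self.tile_paths[idx]) as src:
            image = torch.from_numpy(src.read()).float()     # (bands, H, W)
        with rasterio.open(self.label_paths[idx]) as src:
            mask = torch.from_numpy(src.read(1)).long()      # (H, W) class ids
        return image, mask

# loader = DataLoader(RasterTileDataset(tiles, masks), batch_size=8, shuffle=True)
```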
Day 30/30 of the ML Challenge: a Llama-based transformer from scratch. For the final build, I stepped away from standard predictive modeling and moved into market microstructure. I built a custom Llama-based transformer architecture designed to ingest Level 3 Limit Order Book (LOB) tick data and predict impending liquidation cascades. The core objective is to identify the mathematical signature of a liquidity vacuum before the floor falls out. The engine processes a 128-tick rolling sequence window, computing forward passes to output real-time threat probabilities. It is built entirely in PyTorch; RoPE and SwiGLU activations are used to capture the non-linear, high-speed dynamics of institutional order flow. Building 30 projects in 30 days was a fun but strict exercise in discipline and execution. The focus now shifts from rapid prototyping to deep optimization and architectural rigor. The full explanation, math, and code are in the repository. Repo: https://lnkd.in/gi7EGEHe #MachineLearning #QuantitativeFinance #HighFrequencyTrading #PyTorch #Llama3 #AlgorithmDesign #QuantitativeResearch #Engineering
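A compact sketch of the rotary position embedding (RoPE) step mentioned above, in the half-split formulation used by Llama-style blocks; the head dimension and the 128-tick window length are illustrative, not the repo's actual configuration.

```python
import torch

def apply_rope(x: torch.Tensor) -> torch.Tensor:
    # x: (seq_len, num_heads, head_dim); head_dim must be even
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    freqs = 1.0 / (10000 ** (torch.arange(half) / half))       # per-pair rotation speed
    angles = torch.arange(seq_len)[:, None] * freqs[None, :]   # (seq_len, half)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # rotate each (x1, x2) pair by its position-dependent angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# q = torch.randn(128, 8, 64)      # 128-tick window, 8 heads, head_dim 64
# q_rot = apply_rope(q)            # positions are now encoded in the rotation
```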
The world exists in more dimensions than two, which means our AI models should be able to predict in more than two as well. My state-based architecture, grounded in neuroscience and object-oriented programming, lives in RAM and streams activation states as input arrives; it can capture multi-dimensional representations of real-world objects after detecting their presence through sensor streams. This is still an immature example compared to where we are going and what we are working on right at this moment, but I thought it was pretty and wanted to show you. https://lnkd.in/eqbk_QzN
Graph transformers promise to fix key limitations of message-passing GNNs, but they introduce a scalability bottleneck of their own. How can we get the best of both worlds?

➜ Oversquashing: information between distant nodes gets compressed at each hop, like a game of telephone where each person summarizes multiple messages into one sentence.
➜ Limited expressivity: standard GNNs are bounded by the 1-WL isomorphism test, meaning they can't distinguish certain structurally different graphs, e.g., two disjoint triangles vs. a six-node cycle look identical to them. (For a deeper dive into GNN expressivity, invariance, and equivariance, see my article: https://lnkd.in/dBmeCx9z)
➜ Poor long-range modelling: a consequence of the above; distant signals get lost and long-range tasks suffer.

Graph transformers fix this by letting every node attend to every other node, just like LLMs let every token attend to every token. But this means N² operations, and in graphs with hundreds of thousands of nodes, that's a hard wall.

A new paper by Jonas De Schouwer, Haitz Sáez de Ocáriz Borde, and Xiaowen Dong (GRaM, ICLR 2026) tackles this with k-MIP attention: each query attends only to its top-k keys by inner product. Using symbolic matrices (KeOps), the full N² matrix is never stored; values are computed lazily in GPU registers. Result: linear memory, a 10x speedup, and 500k+ node graphs on a single A100.

➜ Theoretically, k-MIP transformers can approximate any full-attention transformer to arbitrary precision, so no expressivity is lost.
➜ The expressivity of graph transformers comes from positional/structural encodings, not the Transformer itself. Without them, GraphGPS is no more powerful than the 1-WL test.
➜ k-MIP is NOT an approximation of full attention; it's a fundamentally different way of doing attention that's both efficient and expressive when composed across layers.

📄 Paper: https://lnkd.in/dk5iRAbZ

#GraphTransformers #GeometricDeepLearning #GraphNeuralNetworks #ICLR2026
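A rough toy sketch of the selection idea described above: each query attends only to its top-k keys by inner product. This dense version still materializes the full score matrix, so it shows the logic but not the KeOps symbolic-matrix memory savings the paper relies on.

```python
import torch
import torch.nn.functional as F

def kmip_attention(q, k, v, top_k=16):
    # q, k, v: (num_nodes, d)
    scores = q @ k.T                               # (N, N) inner products
    topk = scores.topk(top_k, dim=-1)              # keep the k best keys per query
    weights = F.softmax(topk.values, dim=-1)       # softmax over the kept keys only
    return torch.einsum("nk,nkd->nd", weights, v[topk.indices])

# out = kmip_attention(torch.randn(1000, 64), torch.randn(1000, 64), torch.randn(1000, 64))
```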
🚀 CUDA: the memory model

When I started learning CUDA, I thought performance was about parallelism. I was wrong, because memory is the real bottleneck. 🧠 Here is what I learned:

1. The memory hierarchy defines performance
→ Registers, shared memory, global memory
→ Optimizing is mostly about minimizing slow memory accesses and maximizing data reuse.

2. Registers: fastest but limited
→ Private to each thread
→ On-chip
→ Allocated per thread

3. Shared memory: the optimization playground
→ Shared within a block
→ On-chip
→ Data reuse and tiling are critical for matrix operations
→ Shared memory is where most performance gains come from!

4. Global memory: large but costly
→ Accessible by all threads in the grid
→ Off-chip
→ Most CUDA performance issues originate here!

5. Constant memory
→ Read-only
→ Cached
→ Efficient when all threads read the same value

6. Texture memory
→ Cached
→ Optimized for spatial locality
→ Useful for irregular access patterns

💡 Key insight: most CUDA kernels are memory-bound, not compute-bound. Optimization is largely about controlling how data moves (a tiling sketch follows below).

⚡ Next: vector multiplication (optimized with shared memory and tiling)
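To make the shared-memory and tiling point concrete, here is a rough sketch of the classic tiled matrix multiplication pattern. It uses Numba's CUDA Python rather than CUDA C so it stays in the same language as the other examples here; the tile size and matrix shapes are placeholders.

```python
import numpy as np
from numba import cuda, float32

TILE = 16  # tile edge; one thread block computes one TILE x TILE output tile

@cuda.jit
def tiled_matmul(A, B, C):
    sA = cuda.shared.array((TILE, TILE), dtype=float32)
    sB = cuda.shared.array((TILE, TILE), dtype=float32)
    x, y = cuda.grid(2)
    tx, ty = cuda.threadIdx.x, cuda.threadIdx.y
    acc = 0.0
    for t in range((A.shape[1] + TILE - 1) // TILE):
        # stage one tile of A and one tile of B into fast on-chip shared memory
        sA[tx, ty] = A[x, t * TILE + ty] if x < A.shape[0] and t * TILE + ty < A.shape[1] else 0.0
        sB[tx, ty] = B[t * TILE + tx, y] if t * TILE + tx < B.shape[0] and y < B.shape[1] else 0.0
        cuda.syncthreads()              # wait until the whole tile is loaded
        for k in range(TILE):
            acc += sA[tx, k] * sB[k, ty]  # reuse each staged value TILE times
        cuda.syncthreads()              # wait before the tile is overwritten
    if x < C.shape[0] and y < C.shape[1]:
        C[x, y] = acc

# A = np.random.rand(64, 64).astype(np.float32); B = A.copy(); C = np.zeros_like(A)
# tiled_matmul[(4, 4), (TILE, TILE)](A, B, C)   # numba copies arrays to/from the GPU
```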
most ML code runs slow by default. not because fast is impossible - because fast was locked behind infrastructure pain. building FlashAttention from source needed ~96 GB of RAM and hours of compilation. so most people just didn't. Hugging Face just shipped Kernel Hub. pre-compiled, optimized CUDA kernels - FlashAttention, RMSNorm, quantization ops - loaded with one function call: get_kernel("kernels-community/flash-attn") detects your PyTorch and CUDA version automatically. seconds, not hours. if you're a student or undergrad researcher running experiments on a lab GPU or Colab - your baseline performance just improved without you doing anything. 2.47x speedup on RMSNorm over PyTorch defaults on H100, in three lines of code. the gap between "I know what fast ML looks like" and "I can actually run it" just got a lot smaller. check it out : https://lnkd.in/gSYzqRyq
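A minimal usage sketch based on the call shown in the post. `get_kernel` comes from Hugging Face's `kernels` package; which functions the downloaded module actually exposes is an assumption here, so the sketch just inspects it.

```python
from kernels import get_kernel  # pip install kernels

# downloads a pre-compiled kernel matching the local PyTorch / CUDA version
flash_attn = get_kernel("kernels-community/flash-attn")

# attribute names on the returned module depend on the hub repo; inspect first
print(dir(flash_attn))
```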
Building a Local Research Agent: GemmaCore

I saw gemma4 and thought, oh, that's nice, so I wanted to see how far I could push a fully local research agent on my current setup. The result? GemmaCore, an agent-based research automation framework that lives entirely on my machine. Prompt a topic, hit go, approve the research steps as they come up, and you have an article as a .md file at the end of the loop.

I'm running Gemma 3 4B on an AMD 5700 XT (8 GB VRAM). It's a reminder that you don't need a server farm to build something powerful; you just need the right architecture.

🛠️ How it works:
The Brain: an OperatorAgent logic core that handles iterative research loops.
The Memory: integrated ChromaDB for long-term context retention and retrieval (rough sketch below).
The Skills: extensible modules for browser interaction and filesystem management.
The Safety: a built-in human-in-the-loop approval system via Tkinter.

🏗️ Why this matters: most research agents rely on heavy API costs and third-party data privacy. By building this locally, I've got:
Zero latency between the LLM and the local filesystem.
Privacy by default (my research stays on my hardware).
Extensibility to add custom skills as needed.

See it in the video below. Link to the repo in the comments!

#AI #MachineLearning #Gemma #LocalLLM #SoftwareEngineering #AMD #Python
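A rough sketch of the "Memory" piece, assuming a vanilla ChromaDB setup; the collection name, documents, and query string are placeholders, not GemmaCore's actual code.

```python
import chromadb

client = chromadb.PersistentClient(path="./gemmacore_memory")
memory = client.get_or_create_collection("research_notes")

# store a finding from one research-loop iteration
memory.add(
    ids=["note-001"],
    documents=["Gemma 3 4B runs locally on an 8 GB VRAM card for this agent."],
    metadatas=[{"topic": "local-llm-hardware"}],
)

# later iterations retrieve relevant context before the next LLM call
results = memory.query(query_texts=["what hardware does the local model need?"], n_results=3)
print(results["documents"])
```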
Stop chasing model size. Start chasing efficiency.

I just finished an intensive deep dive into the LLM fine-tuning masterclass by Krish Naik, and the "aha!" moment wasn't about the code, it was about the math of efficiency. If you're still trying to full-parameter fine-tune models on consumer GPUs, you're fighting a losing battle. Here is the technical breakdown of how the industry is actually doing it in 2026:

1. The Quantization Logic (FP32 → INT8/4)
It's not just about "compressing" files. I learned the intuition behind symmetric vs. asymmetric quantization. By using calibration to find the zero-point offset, we can squeeze a Llama-3 model into 4-bit precision with almost zero loss in reasoning. (A tiny sketch of the zero-point math is below.)

2. LoRA & QLoRA: The "Adapter" Revolution
Why update 70 billion parameters when you can update 1 million? LoRA uses matrix decomposition to train tiny "adapters" while the base model stays frozen. QLoRA is the real game-changer: it quantizes the model to 4-bit first, then adds the adapters. This is how I'm now running fine-tuning jobs on a basic Colab instance that used to require an A100.

3. The Rise of 1-bit LLMs (BitNet b1.58)
The research is mind-blowing. We are moving toward ternary weights (-1, 0, 1). Replacing floating-point multiplication with simple integer addition is going to make AI 10x faster and significantly greener.

4. No-Code Pipelines (Vext & Gradient.ai)
For production, speed is king. I explored building RAG pipelines in minutes using Vext, removing the boilerplate of vector DB management and API keys so we can focus on the actual user experience.

The Verdict: building AI is getting easier, but understanding the underlying architecture is what makes you an engineer, not just a user. Huge thanks to Krish Naik for breaking down the "why" behind the "how."

To my fellow builders: are you still relying on RAG alone, or have you started experimenting with QLoRA adapters for specialized tasks? Let's swap notes!

#GenAI #LLMs #FineTuning #LoRA #QLoRA #AIEngineering #MachineLearning #Python #DeepLearning
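A minimal sketch of the asymmetric (zero-point) quantization intuition from point 1, in plain PyTorch; the toy weight tensor and bit width are illustrative, not what any specific library does internally.

```python
import torch

def asymmetric_quantize(x: torch.Tensor, bits: int = 8):
    """Map a float tensor onto the unsigned integer grid [0, 2^bits - 1]
    using a scale and a zero-point offset derived from the tensor's range."""
    qmin, qmax = 0, 2**bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = qmin - torch.round(x.min() / scale)
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

w = torch.randn(4096) * 0.02                   # toy FP32 weights
q, s, z = asymmetric_quantize(w, bits=4)
print((w - dequantize(q, s, z)).abs().max())   # worst-case quantization error
```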