Vision Transformers with PyTorch Achieve 94% Accuracy on MNIST

I trained a small Vision Transformer from scratch in PyTorch to better understand how the architecture works. The 28 x 28 input image is split into a 4 x 4 grid of 7 x 7 patches using a convolutional operation, giving a sequence of shape (16, 500): 16 patches, each with an embedding dimension of 500. Learnable positional embeddings are then added element-wise, so the shape stays the same. Next, the Queries (Q), Keys (K), and Values (V) are computed by matrix-multiplying the (16, 500) sequence with the learnable 500 x 500 weight matrices Wq, Wk, and Wv. From there, the attention scores are computed, passed through a softmax, and multiplied by V, again yielding a (16, 500) tensor. Finally, layer normalization is applied, the features pass through a feed-forward network that expands and then contracts the dimension, and the output is routed to a linear classifier. To test the pipeline, I trained it on the MNIST dataset and achieved ~94% accuracy in 5 epochs. #ComputerVision #DeepLearning #VisionTransformers
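The pipeline above can be sketched roughly as follows. This is a minimal illustration, not the author's actual code: the patch size (7 x 7), embedding dimension (500), and overall flow follow the post, while details such as using a single attention head, the 4x FFN expansion factor, mean-pooling before the classifier, and the class name TinyViT are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal single-block ViT sketch matching the post's described shapes."""
    def __init__(self, embed_dim=500, num_classes=10):
        super().__init__()
        # 28x28 image -> 4x4 grid of 7x7 patches via a strided convolution
        self.patch_embed = nn.Conv2d(1, embed_dim, kernel_size=7, stride=7)
        # learnable positional embeddings, added element-wise (shape unchanged)
        self.pos_embed = nn.Parameter(torch.zeros(1, 16, embed_dim))
        # learnable 500x500 projections for Q, K, V
        self.Wq = nn.Linear(embed_dim, embed_dim, bias=False)
        self.Wk = nn.Linear(embed_dim, embed_dim, bias=False)
        self.Wv = nn.Linear(embed_dim, embed_dim, bias=False)
        self.norm = nn.LayerNorm(embed_dim)
        # feed-forward network: expand, then contract the dimension
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                        # x: (B, 1, 28, 28)
        x = self.patch_embed(x)                  # (B, 500, 4, 4)
        x = x.flatten(2).transpose(1, 2)         # (B, 16, 500)
        x = x + self.pos_embed                   # positions added element-wise
        q, k, v = self.Wq(x), self.Wk(x), self.Wv(x)
        scores = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)
        x = torch.softmax(scores, dim=-1) @ v    # (B, 16, 500)
        x = self.ffn(self.norm(x))               # layer norm, then FFN
        return self.classifier(x.mean(dim=1))    # pool patches -> (B, 10)
```

A forward pass on a batch of MNIST-shaped inputs, `TinyViT()(torch.randn(8, 1, 28, 28))`, returns logits of shape (8, 10), confirming the (16, 500) sequence shapes line up end to end.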
