An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Paper Explained (ViT Paper) PART 2
This is the coding part of the ViT “series”. The walkthrough is based on the DINO GitHub repo, though it won’t be exactly identical: the point is to briefly showcase how the important modules (patch embedding, multi-head attention, and the MLP block) are implemented, so I’ll exclude dropout and fancy weight initializations. We will be using PyTorch, and, obviously, knowing how transformers work is essential for this “tutorial”.
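As a taste of what’s ahead, here is a minimal sketch of the first of those modules, patch embedding, in the spirit of the DINO repo’s ViT code (the class name `PatchEmbed` and the default hyperparameters here are my assumptions for illustration, not quoted verbatim from the repo):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and linearly embed each one.

    A Conv2d with kernel_size == stride == patch_size is equivalent to cutting
    the image into patches and applying one shared linear projection to each.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                  # (B, embed_dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)
        return x

# quick shape check: a 224x224 image yields 14x14 = 196 patch tokens
tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

The strided-convolution trick is a common way to implement the patchify-and-project step from the paper without an explicit reshape into 16x16x3 patch vectors; both formulations produce the same sequence of patch embeddings.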