An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale Paper Explained (ViT paper) PART 2

This is the coding part of the ViT “series”. The walkthrough is based on the DINO GitHub repo, though it won’t be exactly identical: the point is to briefly showcase how the important modules (patch embedding, multi-head attention, and the MLP) are implemented, so I’ll exclude dropout and fancy weight initializations. We will be using PyTorch, and obviously, knowing how transformers work is essential for this “tutorial”.
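To give a taste of what the walkthrough covers, here is a minimal sketch of the patch-embedding module. The class and attribute names (`PatchEmbed`, `proj`) follow the conventions used in DINO-style ViT code, but this is an illustrative approximation under those assumptions, not the repo’s exact implementation:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into patches and linearly embed each one.

    A Conv2d with kernel_size == stride == patch_size is equivalent to
    cutting the image into non-overlapping patches and applying the
    same linear projection to every patch.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (batch, channels, height, width)
        x = self.proj(x)        # (batch, embed_dim, H/patch, W/patch)
        x = x.flatten(2)        # (batch, embed_dim, num_patches)
        x = x.transpose(1, 2)   # (batch, num_patches, embed_dim)
        return x

# A 224x224 image with 16x16 patches yields 14*14 = 196 tokens of dim 768.
tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

The convolution trick is the standard way to implement patch embedding in one operation instead of an explicit reshape-then-linear step.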

Read More

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale Paper Explained (ViT paper) PART 1

By now, anyone interested in deep learning should be at least familiar with the word “Transformers”. Transformers were born out of the influential paper Attention Is All You Need. Originally used for NLP (natural language processing) tasks, they are now the standard for them (instead of RNN-based architectures). What this paper suggests is that transformers can also be used for vision tasks, challenging the long-dominant convolutional neural networks.

Read More

Self-supervised learning is our best shot

The most popular type of machine learning is supervised machine learning. It works in this manner: you gather lots of human-labeled data, and the labels that humans annotate act as “points of interest” or “guides” that the machine learning system uses to learn something useful about the data. For example, annotating a picture of a cat with the label “cat” tells the model that this picture contains features distinct to cats, so it should try to figure out those features; annotating objects within a picture with bounding boxes guides the model that within these boundaries there is something of interest, so it should learn how to figure those boundaries out.

Read More