
Vision Transformer

AGI Lambda

5m 8s · 901 words · ~5 min read
Auto-Generated

[0:00] Let's understand Vision Transformers. We first divide the image into sub-images known as patches. A patch is nothing but the pixel values of that area of the image; you can see the pixel values of a single patch. However, since this is an RGB image, each patch has three channels instead of a single 2D array. The problem with these patches is that the pixel values range from 0 to 255, so we simply normalize these values, and now the input image is ready to be fed into the Vision Transformer. Remember, here we are using a patch size of 8x8 with a total of 64 patches for clear visualization; in the actual Vision Transformer paper, the patch size is 16x16.

To convert a patch into a one-dimensional array, we flatten all three normalized channels and obtain a vector. We do this for every patch in the image, obtaining one linear vector per patch. Let's rearrange these vectors for better visualization. Instead of using these normalized pixel values directly after flattening them, we transform them into an embedding vector for each patch: each flattened patch is passed through a neural network to obtain its embedding vector, one by one for each patch. Now, these embedding vectors can be treated like word embeddings.

For the attention mechanism, we take each embedding vector and make three copies of it to feed into the query, key, and value matrices. We get the output query, key, and value vectors for the attention layer. This is a simple process, and each embedding vector can be processed in parallel. However, this parallelism creates a problem: how will the attention mechanism know which patch it is processing? The first patch, the 10th patch, or the last patch? Just as in sentences, where the position of a word is important to the meaning, and changing the position can alter the meaning even with the same words, the position of each patch is important for understanding the whole image. But how do we feed position information to the attention part of the Transformer? We add positional encoding to the embedding vector to incorporate position information into it. After adding the positional encoding, we get the final vector to feed into the attention block.

Now the data is ready for the attention block to develop relationships between patches. We apply multiple attention blocks and then obtain the output. After these attention layers, the final output is ready. At the end, we apply a neural network with softmax for classification. One option is to take the last patch embedding and feed it to the classification network, just as we do in text for next-token generation. But in practice, we add an extra learnable embedding vector. The goal of this extra embedding vector is to gather all the important information from the other patches, via the attention mechanism, for classification. Then, at the end, we feed this vector to the classification layer to get the softmax distribution over class labels.

Now, what is the purpose of the attention block here? Take the first patch of the image: we calculate the attention of the first patch with all the other patches, which tells us how the first patch relates to the rest of the image. The same goes for every other patch. This helps the model understand how different parts of the image are related to each other, enabling it to develop an overall understanding of the information in the image.
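As a concrete sketch of the patchify-and-embed step just described: the sizes below follow the video's illustration (8x8 patches on a 64x64 RGB image, giving 64 patches), while the embedding width and the use of a single linear layer as the "neural network" are my own assumptions (a linear projection is what the original ViT paper uses).

```python
import torch
import torch.nn as nn

patch_size, img_size, embed_dim = 8, 64, 128  # illustrative sizes, not from the paper

image = torch.rand(3, img_size, img_size)  # RGB image, already normalized to [0, 1]

# Carve the image into non-overlapping 8x8 patches:
# unfold twice -> (3, 8, 8, 8, 8) = (channels, grid_h, grid_w, patch_h, patch_w)
patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)

# Flatten all three channels of each patch into one vector -> (64, 192)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch_size * patch_size)

# Pass each flattened patch through a (linear) network to get its embedding
to_embedding = nn.Linear(3 * patch_size * patch_size, embed_dim)
patch_embeddings = to_embedding(patches)
print(patch_embeddings.shape)  # torch.Size([64, 128]) -- one embedding per patch
```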
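The query/key/value step (the "three copies" of each embedding vector) can likewise be sketched as single-head self-attention. The projection widths here are assumptions, and a real ViT uses multi-head attention inside a full Transformer block; this only shows the core mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_patches, embed_dim = 64, 128
x = torch.rand(num_patches, embed_dim)  # one embedding vector per patch

# The same vector is fed to all three projections -- the "three copies"
W_q, W_k, W_v = (nn.Linear(embed_dim, embed_dim) for _ in range(3))
q, k, v = W_q(x), W_k(x), W_v(x)

# Every patch attends to every other patch: a 64x64 score matrix.
# This all-pairs comparison is what gives the ViT a global receptive field.
scores = q @ k.T / embed_dim ** 0.5
attn = F.softmax(scores, dim=-1)
out = attn @ v  # (64, 128): each patch's output mixes information from all patches
```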
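Finally, the positional encoding, the extra learnable classification vector, and the softmax head can be sketched under the same assumptions (learnable position embeddings, as in the ViT paper; the class count is arbitrary, and the stack of attention blocks is elided).

```python
import torch
import torch.nn as nn

num_patches, embed_dim, num_classes = 64, 128, 10
patch_embeddings = torch.rand(num_patches, embed_dim)

# The extra learnable embedding vector ("class token") prepended to the patches
cls_token = nn.Parameter(torch.zeros(1, embed_dim))
tokens = torch.cat([cls_token, patch_embeddings], dim=0)  # (65, 128)

# Position information is simply added to each vector, one embedding per position
pos_embedding = nn.Parameter(torch.zeros(num_patches + 1, embed_dim))
tokens = tokens + pos_embedding

# ... the stack of attention blocks processes `tokens` here ...

# Only the class token's final vector is fed to the classification layer
head = nn.Linear(embed_dim, num_classes)
probs = head(tokens[0]).softmax(dim=-1)  # softmax distribution over class labels
```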
Now, if we take the first and last patch of the image, we know that the attention between the first and last patch is calculated in the attention layer, and likewise for all the other patches. So we can say that the attention mechanism in Vision Transformers gives them a global receptive field. Convolutional neural networks, by contrast, have local receptive fields, so they come with a built-in inductive bias. In this image, you can see the texture of the cat, because texture is a local feature; convolutional neural networks may predict the class of the image using the texture of the object. As explained earlier, Vision Transformers, with their global receptive field, focus more on global features, like the shape of the object, when classifying it. Remember, these images are not extracted from an actual Vision Transformer or CNN; I just used them to illustrate the difference between CNNs and Vision Transformers.

This was the image we started with. We keep a fixed patch size of 16x16, because increasing it to a higher value would prevent the Vision Transformer from gaining a good understanding of each patch. So the image size should be 128x128 for 64 patches, and there will be a total of 64² = 4096 attention values calculated at each layer. If the image size is now 256x256, the count jumps to 256² = 65,536 attention values; at 512x512 it is already over a million. For an image size of 2048x2048, the total number of attention values calculated by each layer grows into the hundreds of millions, making this approach less suitable for high-resolution images. But what do you think is an alternative to the Vision Transformer? Vision Transformers cannot compete with convolutional neural networks when there is only a small amount of data; they require large datasets to perform well and compete with convolutional neural networks.
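The scaling numbers above follow from a two-line calculation: with a 16x16 patch size, an image of side N yields (N/16)² patches, and each layer scores every patch against every patch. A quick sketch (ignoring the extra class token for simplicity):

```python
def attention_values_per_layer(image_size: int, patch_size: int = 16) -> int:
    """Patch-pair attention scores computed in one layer."""
    num_patches = (image_size // patch_size) ** 2
    return num_patches ** 2  # every patch attends to every patch

for size in (128, 256, 512, 2048):
    print(size, attention_values_per_layer(size))
# 128 -> 4,096    256 -> 65,536    512 -> 1,048,576    2048 -> 268,435,456
```

This quadratic growth in the number of patches is why plain Vision Transformers become expensive on high-resolution images.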
