
The Only Embedding Model You Need for RAG

Prompt Engineering

13m 54s · 2,221 words · ~12 min read

[0:00] I usually cover LLM releases, but today we're going to look at a new embedding model, because this one almost seems too good to be true: it's multimodal, it's multilingual, and you can use the same model for both text and code retrieval tasks. This is the new Jina Embeddings V4, a universal embedding model for multimodal and multilingual retrieval. And the best part: the weights are available on Hugging Face, so you can download them and start using them right away.

If you have seen some of my videos, you know I cover a lot of stuff related to RAG and search, and you have probably seen this picture before. Embeddings play a critical role in retrieval tasks. But in real life, documents contain images, text, and tables, and sometimes they are in really complex layouts as well. Traditionally, people would convert those images into text descriptions and then use the same text embedding model to create the embedding representation. As you can imagine, you lose a lot of information in this translation. There were other approaches like ColPali, which takes a page from a PDF file, converts it into an image, and then uses an image encoder to project its representation into the same space as the textual representation. The idea is that you're now using the same model for both text and image queries, and this works really well.

ColPali-type approaches are inspired by ColBERT-style multi-vector representations. With the normal dense embeddings that people use for text, you take your input text, compute embeddings for each token, and then pool them into a single vector, so irrespective of the input length, you get exactly the same size vector at the output. The multi-vector representation is a little different: you keep an embedding for each token.
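That difference between dense and multi-vector scoring can be sketched in a few lines of NumPy. This is a toy illustration with placeholder token embeddings, not the actual ColBERT or ColPali implementation:

```python
import numpy as np

def dense_score(query_tokens, doc_tokens):
    # Dense retrieval: mean-pool token embeddings into ONE vector per side,
    # then compare with a single cosine similarity.
    q = query_tokens.mean(axis=0)
    d = doc_tokens.mean(axis=0)
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

def maxsim_score(query_tokens, doc_tokens):
    # ColBERT-style late interaction: keep ALL token embeddings; for every
    # query token, take its best match among document tokens, then sum.
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    return float((q @ d.T).max(axis=1).sum())
```

The trade-off is visible right away: the dense version stores one vector per chunk, while the MaxSim version stores one vector per token, which is where the storage blow-up comes from.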
So, instead of one chunk-level embedding, you now get token-level embeddings, which are a lot more accurate. The problem is that the storage needed for your embeddings explodes. ColPali does a very similar thing on the vision side: it divides each page into a number of patches and computes embeddings for those patches. Again, this results in a much larger vector space than a simple dense representation.

Then, recently, Cohere released Embed 4, their multimodal embedding approach, which is state-of-the-art on a number of benchmarks. The beauty is that you get a fixed-size vector, which means your storage cost is not going to be as outrageous as with some of the other representations like ColPali. I have covered all of these approaches in a number of my videos; links are going to be in the video description if you're interested.

Now, this new Jina V4 embedding model tries to combine everything into a single model, and in fact, they have LoRA adapters on top of the embeddings, which makes it even more interesting. But before looking at that: today, NVIDIA released their multimodal RAG NeMo Retriever, which is built on top of a Llama 3.2 model. This is again very similar to the ColPali approach that we saw before. For normal text embedding, you have an image-to-text parser, and then, as I said, you use the same representation for both text and images. In this case, they have a vision encoder, which encodes images, but they use the same embedding space for both text and images. And this is trained on top of a 1-billion-parameter model.
So, it's a relatively small model; it's a little bigger than some of the traditional dense embedding models, but a lot smaller than the Jina V4 embedding model. The NeMo Retriever is already available on NVIDIA's website, so you can play around with it if you want; a link is going to be in the video description.

Okay, coming back to Embeddings V4 from Jina AI. Here's the architectural diagram, and this is probably one of the most fascinating retrieval embedding architectures I have seen. Let me walk you through it; it combines a number of different ideas that we discussed in this video and some others from the literature. As I said, it can process both text and images. Images are fed into a vision encoder. This is based on the Qwen 2.5 vision language model.

[5:06] They're using the 3.8-billion-parameter model, so the backbone is almost 4 billion parameters. Once images are encoded, the same language model decoder processes both text and images. But they also have LoRAs. I think they have a few different types: one for text retrieval, another for code search, and I think there is one for classification as well. So, depending on your task, you not only provide your input images and text, you also tell the model what type of task you're doing, and based on that, it applies the LoRA adapter specific to that task.

But that's not it. It generates your embedding vector, but you can also define whether you want a dense embedding or a multi-vector representation. Going back to the diagram: you can generate either a fixed-size dense embedding vector, which basically pools all the token-level embeddings into a single vector, or you can use exactly the same model to get a multi-vector representation. Which is kind of crazy.

Okay, but that's not it. There's one more trick they're using, and that is Matryoshka embeddings. That means you can use the same model to generate output embeddings of different sizes. If you look at the new embedding models from OpenAI, you can use the same model and just truncate the output embedding to the size you need. For example, with the large text embedding model, you can extract just the first 256 dimensions, or go all the way up to, I think, 3072 dimensions. It's basically the same embedding vector; you are just truncating it to a smaller size. A smaller size will save you on cost and speed. So the same embedding model can give you different vector dimensions based on your needs, and Jina Embeddings V4 actually supports that.
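Matryoshka truncation itself is simple enough to sketch. A minimal toy version in NumPy (the dimension sizes here are just illustrative):

```python
import numpy as np

def matryoshka_truncate(embedding, dim):
    # Matryoshka embeddings are trained so that the FIRST `dim` dimensions
    # already form a usable embedding on their own; just slice and
    # re-normalize so cosine similarity still behaves as expected.
    v = np.asarray(embedding, dtype=np.float32)[:dim]
    return v / np.linalg.norm(v)
```

Note that this only works because the model was trained with a Matryoshka objective; truncating an ordinary embedding this way degrades retrieval quality much faster.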
So, you can go as low as 128 dimensions, or all the way up to 2048 dimensions. There is so much going on in a single embedding model, which is kind of crazy. But that's not it: it supports up to 32,000 tokens, so your chunk size can be up to 32,000 tokens in theory. I would highly recommend keeping your chunk sizes lower, though. Still, this huge context size makes it a very good candidate for late chunking as well.

If you are not familiar with late chunking, here's a quick overview; there is a detailed video on my channel, which I highly recommend. Normally, if you divide your document into smaller chunks, you lose context, right? The idea is that you take your whole document, use an embedding model that supports long context, and generate embeddings for the entire document. Since it's a multi-vector representation, you're creating token-level embeddings for the whole document. Then you chunk the document after generating the embeddings, and pool the token-level embeddings for each specific chunk. At the end, you get chunk-level embeddings, but you started with document-level embeddings, and this preserves global information. If you're interested in the topic, I have created a video on that; a link is going to be in the video description.

Okay, some more quick details before we look at an example notebook, in which I'm going to show you how to use this system. This is a relatively large model for an embedding model, so keep that in mind. It supports text and images, and it can process visually rich images up to 20 megapixels, which is pretty awesome, especially for high-resolution images of pages. It supports 29 different languages. And as I said, it does both single-vector and multi-vector representations, which is pretty incredible. So, here's a quick notebook from the Jina AI team.
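The late-chunking idea described above can be sketched with placeholder embeddings; this is my own toy illustration, not code from the notebook:

```python
import numpy as np

def late_chunk(token_embeddings, chunk_spans):
    # Late chunking: the model has ALREADY seen the whole document, so every
    # token embedding carries global context. Only now do we pool per chunk.
    chunks = []
    for start, end in chunk_spans:
        pooled = token_embeddings[start:end].mean(axis=0)
        chunks.append(pooled / np.linalg.norm(pooled))
    return np.stack(chunks)

# Contrast with naive chunking: split the text first, embed each chunk in
# isolation, and every chunk embedding is blind to the rest of the document.
```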
Now, keep in mind, since it's a 4-billion-parameter model, some of the examples are not going to run on a single T4 GPU, so be aware of that if you try to run this notebook. The model weights are available on Hugging Face; we download them, and you can actually see the architecture here. In this example, they have different languages: English, Spanish, Japanese, Portuguese, German, and Arabic, and I believe all of them are translations of "May the Force be with you," right? Then, there are three different images. And do let me know if you want me to create a more detailed video on doing multimodal RAG with this embedding model.

So, here's the first image, here's the second one, and here's the third one. They are inspired by different themes, but all of the queries are related to Star Wars, so if we do some sort of similarity measurement or distance computation, they all should pick this image as the closest one.

Now, here's the interesting part. When you're embedding your query, which is basically your text description, you need to tell the model what task you're doing. We want to do retrieval, so it will select the LoRA adapters specific to a retrieval task. We also need to tell it whether the inputs are queries or documents. So, for those text embeddings, we tell it that these are queries and we want to do a retrieval task. Then, we embed the images in the same space. This is relatively slow on a T4 GPU, but after that, we compute the cosine similarity of each query against those three images. Here's a helper function for the cosine similarity: we provide our text embeddings and our image embeddings, and then they basically show the results. For example, for this query, it picks the Star Wars scene, the lightsaber fight, at the top, and it does so consistently for all of those queries, which is pretty neat.
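The query-versus-image scoring in the notebook boils down to a cosine similarity matrix. The helper below is generic; the commented-out model calls sketch how the notebook obtains the embeddings, but the method names (`encode_text`, `encode_image`) and the `task`/`prompt_name` arguments are assumptions from memory — double-check them against the official Jina AI example before relying on them:

```python
import numpy as np

def cosine_sim_matrix(query_embs, doc_embs):
    # Rows = queries, columns = candidate images/documents.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    return q @ d.T

# Sketch of the notebook flow (requires a GPU and the model download):
# from transformers import AutoModel
# model = AutoModel.from_pretrained("jinaai/jina-embeddings-v4",
#                                   trust_remote_code=True)
# q_embs = model.encode_text(texts=queries, task="retrieval",
#                            prompt_name="query")    # selects retrieval LoRA
# i_embs = model.encode_image(images=images, task="retrieval")
# best = cosine_sim_matrix(np.array(q_embs), np.array(i_embs)).argmax(axis=1)
```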
In a multimodal RAG pipeline, you would compute embeddings for your documents by first converting them into images, then computing these embeddings, and then you run your queries on top of those. I have covered similar systems in some of my previous videos, so I'm going to put links to those in the description.

Now, I think there's one more example, which is embeddings for text matching. In this case, I ran out of VRAM on the T4 GPU, so just be aware of that; especially if you're trying to embed a lot of image data, you can easily run into this issue. One of the other tasks is text matching, which is basically topic clustering, and I think it can figure out which topics are similar as well. You can also use the same embedding model for code retrieval, which is pretty neat, right? You don't need a separate embedding model specifically trained for code retrieval tasks, so if you are creating a coding agent, you can potentially use this model. And you can use the same model for multi-vector representation; again, since there were, I think, more images involved, and since the multi-vector representation is a lot larger, we ran into VRAM issues there too.

Now, if you're looking for an out-of-the-box multimodal RAG system, I highly recommend checking out my localGPT-Vision project, which uses a combination of a multimodal retrieval system and a vision language model to create an end-to-end multimodal RAG system. It supports a number of different models, not only open-source but also some proprietary ones. So, I'd highly recommend checking it out, and if you like the project, make sure you star it on GitHub. Anyways, this is a very interesting approach, and it aligns very well with my own interests in RAG and search systems. If you're doing anything related to RAG or search and are looking for advice or consulting, please do reach out to me.
I am helping a number of different companies and would love to help you out as well. Details are going to be in the video description. Anyways, I hope you found this video useful. Thanks for watching, and as always, see you in the next one.
