[0:00] If you're going to fine-tune a large language model, what you need is some high quality data. This means starting off, probably, with a set of documents and converting them into some kind of questions and answers. There are various steps in this process, each of which I'll go through in this video: document ingestion, converting documents into text; then chunking; then generation of questions and answers; then visualization of that data set to ensure it's comprehensive; and last of all, I'll cover some techniques for creating an evaluation data set. Throughout this video I'll be using scripts from the advanced fine-tuning repository, working in the data prep folder, but you should be able to follow along if you want to build it from scratch yourself, because I'll describe everything in detail. I'm going to start off with a little bit of theory: the goals for generating data, and what it means to generate high quality synthetic data. I'll walk through the pipeline, a graphic showing you the various stages, and then I'll talk about each of those stages. For document ingestion, I want to show you the performance of different approaches to converting PDFs to markdown. I'll talk about chunking approaches and trade-offs, and approaches to question-answer pair generation. I'll then spend quite a bit of time on visualization. This is probably underappreciated, and I haven't covered it enough on the channel before, but being able to see the distribution of your questions is critical to ensure you can comprehensively cover your subject matter. I'll compare the performance of different cutting-edge models on QA generation, and last of all, I'll talk a bit about evaluation data set creation, which is a specific but pretty important topic if you want to fine-tune effectively.
Now, I will go further and actually do the fine tuning, but I'm going to reserve that for a follow-on video. So what are the goals, and what does it mean to have a high quality synthetic data set? There are a few things we're trying to achieve. The first is coverage: if I have a document, I want to make sure that the questions I generate for fine tuning or for evaluation cover every aspect of that document. So coverage is going to be important, and it needs to reflect the topics, formats, and different difficulties of questions that might be posed. The second goal is contextualization. It's useless to just ask a question like "how long is the field?", because there's no way to know what field that is referring to: is it a rugby field or a soccer field? This is what I refer to as contextualization: the questions must carry the correct context within them so that they are correctly posed and there's a meaningful answer that can come back. The third thing is to have a representative evaluation data set. If you have a big training data set, you're going to use the evaluation data set during fine tuning to check the loss, to make sure you're not overfitting, and you're also going to check the answers on the evaluation data set. You need this to be representative of the training set; if it only covers a subset of topics or categories, it's not representative. So I'll talk about some of the design choices in setting up your evaluation set. And last of all, closely related, is consistent grading. When we generate questions and answers, we need a way to grade answers as correct or not, and we want our judge to mark answers consistently, based on some kind of rubric.
If the judge is inconsistent, maybe only marking certain verbatim answers as correct, then you're going to find that answers that are actually correct get marked incorrect, and this will introduce inconsistencies into your pipeline. So having consistent grading is pretty important. Okay, here is the pipeline. I'll just lay it out; if you don't get something, don't worry too much, we'll go through each step. We start off with some documents and extract the text. We'll use a few approaches, one of which will literally be to send the pages into a vision-capable LLM. Then we'll have a body of text, which we'll split into chunks. We're also going to take the text from each document and generate a summary, so we'll have a set of summaries, one per document. Given these chunks and document summaries, we're going to generate pairs of questions and answers for each chunk. When we prompt a language model for questions and answers, we'll pass in a chunk but also the document summary, which helps give context to the LLM. This gives us a training set of questions and answers, which we're then going to visualize in two ways. One is by tagging: we'll get an LLM to assign some keyword tags to each question, and then plot the questions organized by tags. The second approach is embeddings: we'll calculate the embedding of each question, and then do a plot that clusters everything so we can see the spread of the data. That will let us see how different language models compare in generating question-answer data sets. Then we're going to create an evaluation set, and the way I'll do this is by uniform sampling. So I'm not going to take a random subset of the training set.
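As a rough sketch of the stages just described, the pipeline might look like this. The function names and the `llm` helper here are illustrative, not the repo's actual API; chunk size and prompts are assumptions.

```python
# Hypothetical sketch of the pipeline: ingest -> chunk -> summarize -> QA.
# Not the advanced-fine-tuning repo's actual code; names are illustrative.

def chunk_text(text: str, max_chars: int = 5000) -> list[str]:
    """Split extracted document text into fixed-size chunks."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def summarize(text: str, llm) -> str:
    """One summary per document, later passed alongside each chunk."""
    return llm(f"Summarize this document:\n{text}")

def generate_qa(chunk: str, summary: str, llm) -> str:
    """Prompt with both the chunk and the document summary for context."""
    prompt = (
        f"Document summary:\n{summary}\n\n"
        f"Chunk:\n{chunk}\n\n"
        "Generate question-answer pairs grounded in this chunk. "
        "Each question must contain enough context to stand alone."
    )
    return llm(prompt)

def build_dataset(doc_text: str, llm) -> list[str]:
    """Run the full flow for one document: summary plus per-chunk QA."""
    summary = summarize(doc_text, llm)
    return [generate_qa(c, summary, llm) for c in chunk_text(doc_text)]
```

The key design point is that `generate_qa` always receives the document summary, which is how each question ends up properly contextualized.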
[5:19] I'm going to make sure that it reflects the categories that are in the training set. I'm then going to rephrase the questions taken from the training set to avoid overfitting, and that's how I'll create the eval set. There are some other ways you could do this too, which I'll talk about, but this is the main one I'm going to use. So let's get started and walk through each of these sections, the first of which is document ingestion. This is where you convert the documents into text; we're not even converting to chunks yet, just trying to get a document into text. There are three methods I'll cover. The first is a library called marker-pdf by Vik P. It is probably the most accurate, as you'll see, and I'll show a demo. The second is MarkitDown by Microsoft, which is very fast and cheap, and you can just run it on your CPU. The third option is to send in the PDF page by page and have Gemini Flash create markdown, which is surprisingly accurate, versatile, and also pretty cost-effective. I'll show you each of those. In fact, I'll go over now to Windsurf, where I've cloned advanced fine tuning, and I'm in the data prep folder. I've just cd'd into data prep, and I have a test script here in the test scripts folder. It's called, I know my fonts are small, let me just increase those, PDF to markdown, and it's a simple demo script that takes a PDF file, uses the three approaches I mentioned to convert it, and then reports the time each conversion takes. So let's see the time it takes to convert, and the quality of the results we get. I'll cd into test scripts, and I'm going to uv run this script, PDF to markdown, and I need to pass in the path of the PDF. So I'm just going to copy this, and I need to wrap it in quotes because it has spaces in it.
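The eval-set approach just mentioned, uniform sampling per category followed by rephrasing, could be sketched as below. The `category` and `question` field names and the `llm` helper are assumptions, not necessarily the repo's actual schema.

```python
# Sketch of category-uniform eval sampling plus rephrasing.
# Field names ("category", "question") are assumed, not the repo's schema.
import random
from collections import defaultdict

def uniform_sample_eval(qa_pairs: list[dict], per_category: int,
                        seed: int = 0) -> list[dict]:
    """Take the same number of QA pairs from every category, so the
    eval set mirrors the category spread of the training set rather
    than being a random (possibly skewed) subset."""
    by_cat = defaultdict(list)
    for pair in qa_pairs:
        by_cat[pair["category"]].append(pair)
    rng = random.Random(seed)
    eval_set = []
    for pairs in by_cat.values():
        eval_set.extend(rng.sample(pairs, min(per_category, len(pairs))))
    return eval_set

def rephrase_for_eval(pair: dict, llm) -> dict:
    """Rephrase the sampled question so eval items are not verbatim
    copies of training items (guards against memorization/overfitting)."""
    return {**pair, "question": llm(f"Rephrase: {pair['question']}")}
```

A fixed seed keeps the split reproducible between runs, which matters if you regenerate the training set later.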
.pdf, and it's not able to find PDF to markdown because I need to add .py. So it's now installing all of these dependencies, because I've set them as dependencies at the top of the script, and it's going to take the PDF and pass it through marker, pass it through MarkitDown, and then pass it through Gemini Flash, and we'll see what we get back. Now, to run this you do need to add an API key, because we have to call Gemini Flash. You could call it directly, but I'm calling it through OpenRouter, so I've put in an OpenRouter API key. Actually, I think I'm reading it from a .env file, which is located within the data prep folder. You can check out the sample .env; it's basically just the OpenRouter API key set equal to that key. And you can see I actually have the syntax wrong here. So I need to paste this, put in PDF path, and then I'll copy in the PDF path. This here, and finally I've put in the right command, and I need to put in the path. It's an underscore, and we should be running now. Okay, great, so we're running with marker. Marker does use OCR, it uses Surya, and it uses a variety of models to identify layouts and even equations. So it's very targeted in how it converts a PDF into text, but it does mean there are quite a few models that need to be downloaded and then run. The first time you run it, if you're running on a Mac, which I am, or on a CPU, it's going to take a bit longer because it has to download the models. So I'd expect the processing time here to be a bit longer for marker, just because it's got to do those downloads. Okay, so marker is complete, MarkitDown was almost instantaneous, and Gemini Flash was also pretty fast. Now, in my code here, when I call Gemini Flash, I'm using an asynchronous method to send the pages in parallel.
So I have a default of 32 as a batch size; you could increase that, since you can have a higher concurrency with Gemini. This means each page is processed in parallel, which speeds things up quite a bit. I've got complete parallelization because the PDF I'm converting is just 24 pages, so this is the fastest you'd be able to get. If your document is longer than the concurrency, it's going to add time, because some of those conversion steps will have to run in series. But what you can see here is the time for each of these methods: marker takes about 19 seconds, MarkitDown takes 0.08 seconds, so extremely fast, and Gemini Flash takes about 20 seconds. And this is running marker-pdf locally. If you decided to run marker via their API endpoint, I think you'd get a response time of a few seconds, so it won't be as fast as MarkitDown, but it would probably be about 5 to 10x faster than using Gemini Flash. Now, let's take a look at the quality we're getting. First, we have MarkitDown, and what you can do is open the markdown as a preview. You can see some of the problems already: there are spaces missing between words, and that's going to be annoying when it comes to chunking, because it reduces quality. MarkitDown will split things out by page, so you can see it's getting the text from each page. But here again, for example, it's putting in this added character before "the field of play". So it is getting the raw text, but it's certainly making some errors in the correct breakdown of paragraphs and new lines, and even just joining things together without a space or a new line between them. By contrast, if you look at Gemini, and take a look in preview mode, it's doing a somewhat better job.
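The batched, page-by-page conversion described above can be sketched with a semaphore that caps how many requests are in flight at once. This is an illustrative pattern, not the repo's actual code; `convert_page` stands in for the real Gemini Flash call via OpenRouter.

```python
# Sketch of bounded-concurrency page conversion (illustrative only).
# `convert_page` stands in for the actual Gemini Flash API call.
import asyncio

async def convert_pdf_pages(pages: list, convert_page,
                            concurrency: int = 32) -> list:
    """Convert pages in parallel with at most `concurrency` in flight.
    With a 24-page PDF and concurrency 32, every page runs at once;
    longer documents queue behind the semaphore."""
    sem = asyncio.Semaphore(concurrency)

    async def worker(page):
        async with sem:  # waits here once `concurrency` calls are active
            return await convert_page(page)

    # gather preserves input order, so the markdown stays in page sequence
    return await asyncio.gather(*(worker(p) for p in pages))
```

Because `asyncio.gather` returns results in the order the coroutines were passed, the per-page markdown can simply be concatenated afterwards.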
You can see here, for example, it's got some new lines that are more correctly displayed. Let's scroll down a little more. In this section on page six, it's correctly bolding and then giving a description, so we don't have the same issue we saw with MarkitDown where there are characters jammed adjacent to each other. Overall, I would say the Gemini approach is pretty accurate, and will probably be a good approach for many applications. Also, it will sometimes convert images. Here in the appendix there was some text on the image. If you look at the very last page, sorry, second last page, you can see it's the field, and it's got some text on it, and that text has been converted, so it's available here. It's maybe not in a very useful format, but it's doing better than what the MarkitDown library is doing. Now we'll go to the gold standard, which is marker, and look how tidily this is organized. If I open up the preview, you can see everything is very neat. Look at this table here, it's beautifully parsed.
[12:18] We've got a beautifully parsed table here, and another beautifully parsed table. And look at this, very nice indentation. So you can see the difference: marker is just of much higher quality. Now, it's not going to convert the image; that's just because it's not designed to convert images, so that's something to take into account. If you set a flag, it will optionally allow you to inject images in here. So if you built a more advanced pipeline, you could forward the markdown text along with the images as part of the chunks when you're generating questions. For now, I'm just dealing with text. Anyway, this should give you a clear picture of how much better it is to use marker versus the other options in terms of quality. So with that, I'm going to cover the last topic, which is related to data set evaluation.
[27:34] Okay, so we're now in a position to generate some of this QA, and I'll just go through some of the parameters. We've defined a model, a temperature, max tokens, top-p, top-k, and min-p. These make sure that when you've got a non-zero temperature, you're not able to pick wild tokens that could throw the answer off. We've got our context, and I've also set this up with batching, so we'll run many requests in parallel to speed things up. You can choose to skip tables, that is, not generate questions for tables, but that seems like a bad idea, so we're not going to skip them. All right, so we'll move past the chunking on to QA generation. I'm going to go down here and find a script. So here we are, and there are a few options for generating questions. We can pass a custom configuration file, we can force regeneration, we can run in test mode, we can process only a given document, and we can also iterate. This is what I talked about: iterate means we'll generate questions, pass them back, and see if the language model wants to generate more. So let's run in test mode, and let's run with iteration like this. And it's not iteration, it's iterate. And everything has been generated already, so I'm going to add force. What's happening here is, it's found a document to process, and it's now generating for iteration one, and it's generated five new pairs.
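To make the parameters just listed concrete, here is a sketch of what such a generation config might contain. The specific values and the model id are illustrative assumptions, not the repo's actual defaults (apart from the batch size of 32 and skip-tables choice mentioned in the video).

```python
# Illustrative QA-generation settings; exact values in the repo's
# config may differ. min_p / top_p / top_k trim the tail of the token
# distribution so a non-zero temperature can't pick wildly unlikely tokens.
qa_generation_config = {
    "model": "google/gemini-2.5-flash",  # assumed OpenRouter-style model id
    "temperature": 0.7,    # some diversity across generated questions
    "max_tokens": 4096,    # room for several QA pairs per response
    "top_p": 0.95,         # nucleus sampling: keep top 95% of probability mass
    "top_k": 50,           # never consider more than 50 candidate tokens
    "min_p": 0.05,         # drop tokens below 5% of the top token's probability
    "batch_size": 32,      # parallel requests, as discussed earlier
    "skip_tables": False,  # keep generating questions for tables
}
```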
[29:04] And now it's going to move to the next iteration, iteration two, and it didn't generate any pairs, so it's going to stop iterating and move to chunk two. On chunk two, it does one iteration and generates 11 pairs; on iteration two it generates seven, and on iteration three, two pairs. The number of iterations is currently capped at three. But what I don't like is that we didn't have enough iterations. The reason is that if I go up to the config file here, and check within QA, we can add a flag that sets the max iterations. So let's add max iterations of 10. Just paste here, save, and if we run again with iterate and force, it should do something similar. There is some stochasticity here, so we may find it runs out of questions earlier. For the first chunk, it's generating six questions, with no new questions on the second iteration. For the second chunk, it's generated 13 questions in the first iteration, three on the second, and three on the third. So it's clearly finding a lot more room for questions to be asked, and it's even getting as far as 10 iterations.
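The iterate behavior shown here can be sketched as a simple loop: keep asking for more questions about the same chunk, passing back what the model already produced, and stop either at zero new pairs or at the max-iterations cap. This is an illustration of the pattern, not the repo's actual implementation; `generate` stands in for the LLM call.

```python
# Sketch of iterative QA generation with a max-iterations cap.
# `generate(chunk, existing)` stands in for the real LLM call, which
# would include the previously generated questions in the prompt.

def iterate_qa(chunk: str, generate, max_iterations: int = 10) -> list:
    """Repeatedly ask the model for MORE questions about the same chunk,
    stopping early once an iteration produces nothing new."""
    pairs = []
    for _ in range(max_iterations):
        new_pairs = generate(chunk, pairs)  # model sees its prior questions
        if not new_pairs:  # zero new pairs: the chunk is exhausted
            break
        pairs.extend(new_pairs)
    return pairs
```

With a cap of 10, most 5000-character chunks hit the zero-new-pairs exit before the cap, which matches the behavior seen in the run above.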
[30:57] Now, generally I find that for a given chunk with a max length of 5000, within 10 iterations you're likely to get to the end, and it will generate zero on some iteration. So that's why I recommend a config value of about 10. But let's go ahead and take a look at the data, take a look at the QA here, and you can see the generations all combined for the first chunk. So here are our six questions, and then for the second chunk there are many more. Notice how each one has a question, eval criteria, an answer, a difficulty, and a category for the question as well. Okay, so now we have a series of questions. We would then want to run that for the full data set. I've run it just in test mode, which only runs two chunks, but we'd need to run on the full data set to get questions and answers covering the full document. So now we have the question and answer pairs, and what we want to do is visualize what they look like, and compare generating question-answer pairs with different models, because we want to see: should you use Gemini Flash? What's the best model to use for question generation? There are two ways we can visualize the questions. The first is an embedding-based approach, where we calculate the embedding for each question and plot them, and the second is a tag-based approach. So I'm going to go now to the embed-viz folder, which is right here, and cd into embed-viz. Within this folder there's another ReadMe; let me just close down the other ones. embed-viz allows me to do a few different types of visualization: create embeddings, visualize them within a single data set, visualize train versus eval splits, and compare different data sets.
It also supports a tag-based approach, where we generate tags with Gemini Flash and then compare the data sets based on tag distributions and visualizations. Now, you might be wondering: why didn't we generate tags when we generated the questions and answers? The reason is that you want the same model to generate tags across all the data sets so the tagging is consistent. Different models will assign tags differently, and if you have each model tag its own questions, it's probably not going to categorize them the same way the other models would, so you'd end up with tag distributions that aren't comparable. That's why I recommend using a single model for embeddings across all the data sets and a single model for tagging across all the data sets. Now, there is a script here you can run if you want to create some embeddings, the generate embeddings script. It uses the Nomic ModernBERT model; it takes each question, embeds it locally, saves that embedding, and then the embeddings can be used to generate a comparison plot. If you want to compare embeddings, the comparison script will actually run the generation of the embeddings in the background.
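The consistency point above, one fixed tagger across every data set, can be sketched like this. The function names are hypothetical; `tagger` stands in for a call to the single chosen tagging model, and the normalization mirrors the lowercase-hyphenated tag format used later in the video.

```python
# Sketch: tag ALL datasets with the SAME model so tag distributions
# are comparable. `tagger(question)` stands in for one fixed LLM call.

def normalize_tag(tag: str) -> str:
    """Lowercase, hyphen-separated, matching the tag format described."""
    return "-".join(tag.lower().split())

def tag_all_datasets(datasets: dict, tagger) -> dict:
    """datasets maps dataset name -> list of questions; the same tagger
    runs over every dataset, so differences in tag distributions reflect
    the question sets, not the quirks of different tagging models."""
    return {
        name: [[normalize_tag(t) for t in tagger(q)] for q in questions]
        for name, questions in datasets.items()
    }
```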
[34:36] So the script I have for that is this compare embeddings script, and I can just run it. Actually, what I can do is point it to a Hugging Face data set. So I'm going to go over to Hugging Face, search here for Trellis Touch Rugby, and sort by recently created. What I've done is create a number of data sets using different models: with Flash, with Gemini Pro, with o1, with o4-mini. So here, for example, if I look at o4-mini, I can copy that data set and paste it here, and why not compare one more data set as well. So I'll go back and take a look at the Gemini Pro one. By the way, you can see the "two chunks" at the end of the name. This just means I'm only generating questions for two chunks, which is a restricted data set; it just makes it easier to visualize by taking a smaller data set. So let's copy-paste this one, and let's try to visualize. Now, this actually will work, but there's an extra flag I want to add, which is interactive. By adding the interactive flag, I'll be able to open an HTML file.
[36:02] And the HTML file is going to allow us to interactively inspect the results. So here's the embedding comparison result; I'm going to copy the path to the HTML file and open it up. Okay, so this is the comparison based on embeddings. You can see we've got the o4-mini data set and the Gemini Pro Touch Rugby data set. And this is a comparison done using t-SNE, t-distributed stochastic neighbor embedding, which uses a Student's t-distribution; it's basically collapsing the multi-dimensional data down into two dimensions so we can visualize it. What you can see, broadly, is two things. There are more green dots than purple dots, so the Pro model is generating more questions than the o4-mini model in this case. But you can see that, roughly speaking, the dots cover the same kind of space, at least in these two dimensions. What this tells us is that the models are getting a reasonable amount of coverage across the documents, which is a good sign. If you had a particularly weak or biased model, you'd see the dots appear only within a certain part of the graph, and that would be a concern because it might indicate your coverage is not great. Ideally, when you run a large number of models, you want to pick a model that covers basically what any model is able to cover. If you wanted to go a step further, you could also show Flash and some of the other models like o1. In fact, maybe I should just go ahead and do that. Let's take a look at o1, Flash, and Sonnet. So I'll rerun my script, and I'm going to put on here o1.
[37:57]I'll put on here Flash.
[40:01] I'll put on here Sonnet, and I'll make it interactive. That should show five sets of dots; I think the script supports at most five right now. If I open up that file, you can see the five different models. So this is answering the question of which model you should choose to generate questions and answers, and I would say, looking at this, basically all of the models are fairly performant. The Flash model definitely generates fewer questions, there are fewer yellow dots, so it's potentially giving you less coverage. You can see that the Gemini Pro model has very good coverage, and the o4-mini model also has pretty good coverage. Maybe down here there aren't any o4-mini points, so you could say that's a little bit of a drawback; also, Gemini Pro doesn't get down here. You can check these questions: "who holds the copyright for Touch Rugby rules?" and "what copyright restrictions apply to the Touch Rugby Australia playing rules according to this document?". So you can see those questions are related, and they're close to each other. Also, within a given data set there aren't too many overlapping points, which indicates you aren't duplicating information. Here: "what range of expertise and perspectives were represented in the group that developed Touch Football Australia's eighth edition playing rules?", and here's a question on process and timeline. So even though they're close, they're definitely not replicas. Basically, I would say any of these models is probably going to be fairly strong. If anything, I'd probably use the Pro model, or maybe the o4-mini model; they may give the best performance in terms of question generation. Now, that was the demo for embeddings. You can also run a comparison with tags instead of embeddings.
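The eyeball check above, "do the dots cover the same space?", could also be quantified. As one illustration (this metric is not something the repo computes), you could measure the fraction of a reference model's question embeddings that have a close neighbor among a candidate model's embeddings:

```python
# Illustrative coverage metric over question embeddings (not from the repo).
import math

def coverage(candidate: list, reference: list, radius: float) -> float:
    """Fraction of reference embeddings with at least one candidate
    embedding within `radius` (Euclidean distance). A low value suggests
    the candidate model leaves parts of the question space uncovered."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    covered = sum(
        1 for r in reference if any(dist(r, c) <= radius for c in candidate)
    )
    return covered / len(reference)
```

In practice you would run this on the raw high-dimensional embeddings rather than the 2D t-SNE projection, since t-SNE distorts distances.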
So let's take a quick look at how that works. For tags, it's a pretty similar procedure. We have a script for generating tags, and once we've run it, we can run a comparison script, which, I think, will also call the script for making the tags. So what I'm going to do is copy-paste this here, and then copy from my previous command the names of all of these models.
[42:40] You can also control the minimum or maximum number of tags that you want to include.
[42:48] The default for max tags is five; let's put that a bit bigger. This is the max number of tags shown in the plot. So let's run it like this, and while this is running, note that you can set which model generates the tags. If you go to the config file and scroll down to the tagging section, you can set the temperature, which I've set low, and I'm using GPT-4o mini. You can use any model; you could use Gemini as well if you wish. So it's going to run generate tags on each of the data sets, passing in each of the questions. We can take a look briefly at what that script looks like: generate tags, and we should quickly check what the prompt involves. Here's the system prompt: you're a helpful assistant that generates concise, descriptive tags for questions. Generate exactly max tags (so that's where the number of max tags comes into play) that capture the key topics, concepts, and skills tested in the question. Each tag should be one to three words, lowercase, with hyphens between words. So we've now generated the tags, and you can see that the comparison will generate two plots. One is very much like the plot we checked earlier: reduced down to two dimensions, with the five different models. So again, we're seeing a pretty good spread in the tags. There is one area, maybe, where Gemini is creating some tags that other models aren't. And if you want to visualize it in a different way that's a little more intuitive, you can basically do a histogram of the tags. So if I copy the path here and open this, you see now I have a histogram across the data sets.
And this is an easy way to check whether you have uniformity in tags; it might indicate that you want to generate more questions, or tweak your question generation prompt to cover more on certain topics.
[45:52] So here, for example, football rules, sports equipment, and you can see in green we've got pretty good coverage of all the tags in Pro 2.5. Now, we don't necessarily want this to be a flat distribution, because we want it to reflect whatever the reality of the document is. So I'm not looking for it to be flat; what I'm comparing is the ability of the different models to cover all of these tags.
[46:27] For example, if you look at FIT rules, that particular tag is not appearing so often for either the Flash model or the Sonnet model. So those are a few examples where those two models are not creating that tag. But all of the models are including tags on Touch Rugby, sports regulations, game rules, and Touch Football.
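The per-dataset tag histogram being compared here boils down to a frequency count. A minimal sketch (hypothetical function names, not the repo's code) might look like:

```python
# Sketch of the tag-histogram comparison: count tag frequencies per
# dataset and keep the most common tags, mirroring the max-tags-shown
# setting in the plot. Names here are illustrative.
from collections import Counter

def tag_histograms(tagged: dict, top_n: int = 5) -> dict:
    """tagged maps dataset name -> list of per-question tag lists.
    Returns the top_n (tag, count) pairs per dataset, which is what
    a grouped bar chart across datasets would be drawn from."""
    return {
        name: Counter(t for tags in question_tags for t in tags).most_common(top_n)
        for name, question_tags in tagged.items()
    }
```

Comparing these counts across datasets shows at a glance which models under-produce a given tag, like the FIT rules example above.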
[54:41] So that is an overview of data preparation. I'm going to follow up with a fine tuning video, where I'll show you how using a comprehensive data set like this outperforms a naive data set with simpler chunking and simpler question and answer generation. I'll go through the full training with Unsloth on one of the more recent models, Gemma. I'll also cover a bit on the Mistral model, and the script will work with Llama models too; it should even work with Llama 4 if you have enough GPU memory. I'll put all of the links below in the description. You can find the scripts via the repo called Advanced-fine-tuning on Trellis.com, and in the meantime, if you have any questions, let me know down below in the comments.



