[0:02]We're putting the AI in AIVTuber with three different ways to create your perfect personality. The success of this project hinges on this moment, but despite how important it is, pretty much no one has put in the time into showing how it's done. So let's get started and hope I don't mess everything up.
[0:22]Hey there, Limit here, and welcome to the second part in our five-part prototype, where we'll be checking out our options for an AI personality. Like I said, this is arguably the most important part of this whole system, as it's quite literally the brain of our VTuber. Okay, before I get started coding though, I did about five hours of research on my options, and it wasn't anything interesting. I basically just wrote out this script with stuff I already knew from my past experience, and I searched stuff that I had heard about but have never done myself. I think it's important I should clarify some things here before we actually get into what I did to create my custom AI. Especially if you aren't too familiar with the technologies of AI as we know it today. Nero is an AIV Tuber. I don't think that's specific enough. Nero is an AI powered VTuber. That's more like it. You see, Neuro Sama isn't a singular AI entity, unless Vito is holding onto a billion dollar invention. To have the capabilities to do all she can from communication to playing games, she would have to be what we call in artificial general intelligence, or AGI for short. The general in artificial general intelligence means that the AI is capable of doing a wide variety of novel tasks at a pretty high level. It's kind of like making a person into an AI so it can control a body, kind of like Chappie. Who even remembers that movie? As much as Open AI claim that they can make an AGI right now, fact remains that we aren't there yet. So then what is Nero Sama if not in artificial general intelligence? Filtered. She is instead a system of AI entities. Why am I even saying entities? They're called models. She is a bunch of AI models working together to operate a whole person. Each model is less powerful and less capable at being a whole person, but instead they focus at getting good at their one job. Here's an example of how we could talk to our AI. We have a speech to text model to turn our spoken words into text. Then we take our output and ask the text-to-text model like GPT models for a response. Then we feed that response into a text-to-speech AI to say the response out loud. Filtered. That's just an example, but it's worth noting that we also have more holistic models, such as speech-to-speech models, which can replace this entire example pipeline. As well, we also have multimodal models, which can take in a combination of text, audio, and visuals and output any combination of text, audio, and visuals depending on what it was trained to do. I'm not going to bore you with all the technicalities or give you a full lesson on AI in this video. If you want to know more, however, I'll be making a video dedicated to that sometime soon and have it linked in the description below. I'm going to continue with the assumption that you have some understanding on the workings of these models and the basics of machine learning in general. Okay, what was the point of all that? I just wanted to drive the point that what we're working on here isn't the whole AI system, but just as text-to-text component. And I'm going with a system instead of a whole speech-to-speech model because of three reasons. Firstly, a system performs better. The more we try to get a single model to do, the worse it starts to perform in everything it tries to do. Think of it as a guy that's constantly switching hobbies. They're mediocre at everything and not really good at anything. So if we want our multimodal model to be as smart as a basic text-to-text model, we need a bigger and more resource-intensive model. Frankly speaking, I don't have the resources nor money to even do that. Even if I did, the second part of customizability did me over. As you'll see, we can quickly swap text-to-text models in and out, and this will be great for when we want to change the brain out behind the voice without worrying about affecting the voice in any way, and vice versa. The last thing to know is data. To train our model, we need a lot of the right data. We don't have any collection of high-quality data for our purposes, so we have to make our own, and this is the simplest way to ease into that. With that out of the way, let's finally start coding. I wanted to easily swap between different implementation from the command line, so I made an abstract implementation and use that in our bot. Then I made a dictionary with specific implementations of that abstract class. Every one of these specific implementations are used exactly the same way as the abstract, so we can use the exact same code without having an if block for every single different implementation. At runtime, we're passing in our heat to this dictionary, and that allows us to pull the right implementation, create an instance, and call it like so, without any regard for the implementation, as long as we're getting what we expected. For the first implementation, we have just prompting Chat GPT. I know, I know. This is basically the equivalent to making a meal out of dollar store ingredients. That is horrific. But it gets better, I promise. This quite literally only consists of telling Chat GPT how it should behave. You describe the character and its personality, give it some context, perhaps give it some memory, and let it try to act the scene. With our abstract class, all we really have to do is implement this function, which basically takes in the script and the prompt, and all we have to do is implement how we send that information to our model. In this case, since we're working with Open AI or Chat GPT, we can reuse the exact same code that we had before. The only difference here compared to last video is that we're making use of a proper prompt and giving it to Chat GPT in a system message before we send anything else to it. We're also adding all that extra stuff, like a script, and using some prompt engineering techniques to try and get better responses. Anyways, our implementation is done, and all that's left is to actually make the prompt. J. A. I. Son is meant to be a model that models me, so I need an accurate description of my own personality. The only problem is that I'm kind of an NPC, more so than I am in these videos. So I tried to describe myself, but I also got my friends to describe me as well. And I realized that's kind of disingenuous to have them tell me straight up, especially inside of a group server, so I made an anonymous survey and waited 24 hours to get the responses. I wasn't expecting to get that many responses, especially since they're probably going to forget while playing TF2. But when I came back, I had seven. You actually got like seven responses, that's way more than I thought we would get. Okay, okay. Let's see what they say. Hi, I'm Jason, and I suck all 69 of Briar's grippers for pleasure. He needs to let go of Ari and Briar, too obsessed, but we all know he's into Zoe and Annie. Has a miserably pitiful fake laugh, and can't seem to smile and laugh. Well, I tried to scrap together what I could from those descriptions and tried to keep it as true, but relevant as I could. Here's the final description I used in this implementation. What we got was something starkly different from the usual customer service worker being held at gunpoint tone, but it still sounds very weird and sometimes breaks character. But that's just the thing of working with this dollar store solution. You get what you pay for. Models are terrible actors. If your model is trained to speak in one way, it'll have a hard time trying to speak in any other. Not to mention it's also really hard to describe a person's personality in the first place. You may feel like you know your own personality, but even that feeling is ambiguous at times and isn't specific enough to get the results you want. But I'll give credit where credit is due though. It did a lot better at staying in character than I expected it to, and it will probably get better the more these models improve. But it's just like storytelling, you get the best sense of a situation by showing, not telling. Now I can already hear some of you asking, what about a more curated system like Character AI? Well, at a simple level, you're fundamentally still doing the same thing. You're giving a prompt to an already trained model and telling it to act a certain way. And they don't even have an API, so even if Character AI was significantly better, we would need to do some sort of flimsy hacky solution that would probably break the next time they update their site. In any case, it's not going to be definitely better unless we try our second implementation. Fine-tuning. Fine-tuning is additional training we do to a model to get it to act in accordance to our preferences. In this case, we're going to give our model examples of how it should respond to various inputs, and it will try its best to learn how to respond in that way. It's showing, not telling. But we need something to show in the first place, so we're going to be making a data set using some of my Discord conversations. And yes, I did ask for permission. I spent way too long going through this conversation, and I was getting PTSD from some of the shit that I've done in the past couple years. Like, how the fuck did I soft-lock my VS code with a furry picture? And why did I get a shawarma platter with only hot sauce? After five hours, I cooked up a data set worthy of fine dining. That is to say the portions are tiny, only 100 conversations. But some are decently long. Plus, you don't need nearly as much data to fine-tune a model as you would need to train from nothing. Also a note here, your data set should be formatted similarly to how it will be used in training, so we need a similar prompt and to format the conversations into a script. So it will behave the same way when we give it the same situation outside of the fine-tuning process. I also made a script for this exact purpose. You can give it a prompt and a multi-turn conversation and it will make multiple conversation examples where the AI is responding at different points in time within the same conversation. And for this implementation, I'm fine-tuning a GPT model, specifically 40 Mini, which is now available for fine-tuning. And this bad boy is cracked. Like 1/10th of the cost for three times the performance, but in order to fine-tune their model, I need a data set to be in a specific format. And they have a whole document on fine-tuning their models, and I used the format they specified there in order to create my own data set. In fact, all of the data sets and scripts that I make in this video will be following that format. Now, I won't go much into detail on this format, but it's basically a bunch of JSON objects that contain a single array of messages. For actually fine-tuning, you can follow the docs and do it programmatically, and I also have a script for that inside the repo. But I would highly recommend doing it through their user interface. Just upload your files, select the model you want to train, and then train. One thing I absolutely have to say though, is that for your training and testing data sets, do not put examples you trained on inside of your tests. This is a cardinal sin to AI programmers, and Zeus himself will smite you if you get caught doing this. That and you'll overfit your model. Once it's done fine-tuning though, you'll have this model checkpoint name you can copy and use just like any other GPT model in your code. Instead of calling say GPT 3.5 Turbo, I can call on my J Ison model and use that however I like. Now, unfortunately, there's no actual way I can share this model with you, not unless I've I give you my account. I guess that. But you can pretty simply make your own fine-tuned model as well. The first model I fine-tuned cost me about two cents Canadian to train, and already it's responding way better than I would have with the first version. It doesn't break character, and it sounds like something that I would write. And I continued to improve on the wording and formatting of the prompt, and each time I got better and better results. So the lesson here is to fine-tune your models for your use case when you can. You'll always get better results, and you can use it in the exact same code as you were for just prompting. And it will also run faster and cheaper, since now you don't have to repeatedly describe the personality inside the prompt. But we're not done yet though. Up until now, I've been exclusively showing solutions with Open AI servers, and while the results are great, there are some caveats. There's the obvious of needing to pay Open AI in order to train and use their models, but I think they're pretty fair in the cost, especially with our small scale setup. But there's also censorship. If you're trying to make an AI that is way too dank, then it will likely be held back by Open AI's filtering system. There's not much you can really do about that, since Open AI is more like closed AI. So we can't go in and get rid of the filter ourselves. But that brings us to our final implementation, running it locally.
[12:19]Now, we have the power to do anything we want, theoretically. Despite having all that endless power, we can't actually wield it because now that we're not running on Open AI server, we have to run it on our own hardware. And this is one of the cases where hardware actually matters. There are many openly available models that you can use. Among the most powerful is actually a model coming from Mr. Zuckerberg himself. In spite of being a massive tech giant, Meta is coming in with a W, releasing some of the most popular and powerful openly available AI models that they've spent millions of dollars pre-training for us. The latest in their series of text-to-text models is Lama 3.1. You'll notice numbers like 405 B or 8 B after them, and these are the sizes of the models. Generally, the larger the model is, the more capable it is as well, and usually by significant margins. So we should use the 405 billion parameter model, right? Wrong. The bigger it is, the more resources you need too. To run a 405 billion model as is, you'd probably need a server with 10 RTX 4090s, and that probably still isn't enough to run it efficiently. In fact, I just searched it up, and it takes 800 GB to run an uncompressed version, or 240 GB for a four-bit compressed version. I don't know about you, but I don't have several H200s lying around. I do have a single 3070 with 24 GB of VRAM, so I can run an 8 billion model that is four-bit quantized compressed along with some further optimizations. Downloading the model from their website will give you the base 8 billion version. But don't worry, since we normally don't just download the models this way anyways. There are several, I guess you would call them AI hubs, that provide resources for AI such as models, data sets, papers, and tutorials. Some popular ones are Hugging Face and Kaggle, but there are also libraries that help with making, getting, and training models like PyTorch and Unsloth. Unsloth is the one that we're interested in, since they give us what we need in order to run these behemoth models on our consumer grade PCs. They have a bunch of tutorials on how to use their libraries to fine-tune these models as well. Now, this is the part of the video where I have to warn you. If you aren't developing in a Linux-based environment, like in Linux OS or WSL 2, do so now. If you are using WSL, check your version and make sure you are using version 2. And if you don't have an RTX card, go buy one, or you can watch your PC scream and cry if it even gets that far. Unsloth is a bit finicky and that it has some dependencies that are only available on Linux. People have managed to install and use Unsloth on Windows, but their solutions wouldn't work on my hardware, and it may not work for yours either. So I switched over to using WSL 2. These dependencies also make use of CUDA, which is only available on NVIDIA GPUs. Even after switching to WSL 2, I had to deal with my Linux completely breaking. This shit is driving me in fucking sane. For some reason, I wasn't able to connect or install anything anymore. I was just stuck in this infinite loop of searching for available downloads, and none of the usual fixes worked either. So I ended up deleting my entire Linux distro and downloaded a new one. I mean, I was already meaning to clean up my WSL anyways. Well, after spending a millennia in the trenches setting back up my environment, I spent a while trying to train following the wrong tutorial before I realized that there was actually one for my exact purpose. But once I realized and updated my script, training was actually surprisingly easy. Some notes here. The name you give your model will be the directory that your model is saved. It's also the name that you can give to the bot, and it will load up this exact model. The script that I included for fine-tuning inside of my project is only going to save the LORA adapters. These adapters are basically just the adjustments that we made to the base model, and they are not the models themselves. However, it's still usable as long as you're using the same system that you trained on, and if you're using Unsloth. Otherwise, you'll have to go through the process of patching and downloading the model through Unsloth as well. But after that, you can use your saved adapters. I also can't take credit for any of this code, I basically stole it all from their example notebook. But hey, if it ain't broke, don't fix it. Now, after fine-tuning with my tiny data set, we get some pretty decent results. The vibe of it sounds pretty similar to the GPT version. I wasn't sure if it would be enough, so I had also downloaded an archive of Discord messages off of Kaggle back when I was trying to fix my Unsloth dependencies. This archive, in its unfiltered state, contains 50 million Discord messages from hundreds of thousands of conversations from many, many Discord servers. But in its state, it's just a wall of text that we can't fine-tune on. So I also made a rough script to go through all of these conversations and make a rough conversation data set. One instance per conversation, giving us about 300,000 samples. All I did was randomly select a number and choose that many messages from the start of every conversation. Then I separate each message in a conversation, which I did by noticing that there was a special tab white space that was separating each message from the name of the next user. Then with each separated message, I split the name and the content of the message by the first colon that appeared. We now have the name and the content of each message in the conversation, so I took the code to construct the script and the prompt, and use them to create my next entry in my data set. Using the last message as the name and the message contents of my AI for that instance. From this larger data set, I made a smaller data set of 500 messages for testing and another 500 for training, and I used that to train both local and Open AI models. And the results were a personality. But this really speaks to the importance of data set quality. I got my best results from manually making the data set, and although the AI fed on public Discord servers did give us some funnier results. It was more prone to not behaving well in conversations, specifically ignoring what the other person is saying. This was especially evident when I trained my local model with 300,000 samples. This is because I didn't bother to make sure that the conversations made sense in that Discord data set, and when I went to go take a look, it's not uncommon to see multiple conversations happening at the same time, or even just people randomly coming in to say hi. This was also evident in my AI's behavior.
[18:28]So that's it. Three ways to make a custom AI personality, but really I would only recommend fine-tuning a GPT model or a local model. They both have their pros and cons. Locally, you aren't being censored, don't have to give any potentially sensitive information, and don't have to pay Open AI. But you also need the hardware, as well as the money to pay your electricity bill. I personally might use Open AI moving forward, because I need the resources to run other models at the same time, and I think their costs are pretty fair, especially at the scale I'm running at. We'll see how this changes things. And if I do end up with a good model, I can use that same data set to train a model locally and also get good results. When that happens, I'll probably share that version with you, but unfortunately for now, I won't be sharing any of these models because I either can't in the case of Open AI, or they're too large in the case of local models. However, all the scripts I used to make them and documentation to use them will be in the repo linked in the description below. This again took way longer than I expected it to, and the timer might have only increased by 20 hours, but I probably spent closer to 40 across three days with all the downloads and environment configuration. Anyways, as always, like and subscribe to continue following my journey, and comment to let me know what I can improve on. You've heard me yap for long enough, so I'll get up and out of your way. Thanks for watching.



