
Running a 35B AI Model on 6GB VRAM, FAST (llama.cpp Guide)

Codacus

15m 6s · 2,244 words · ~12 min read
Auto-Generated

[0:00] Can you run a 35-billion-parameter AI model on an 8-year-old GPU with only 6 gigs of VRAM at acceptable speed? Yes, you can. We're going to explore every way to not just run this model, but run it fast enough that it doesn't feel like you're talking through a satellite phone. And we're going to run it at full context, making it usable, not just possible. Five tricks, one Docker command. Let's get into it.

Before we touch any flags, let me introduce the three characters in this story. First, llama.cpp, the engine. I'm using llama.cpp because it gives you crazy flexibility: every knob exposed, every memory decision yours to make. If you want to micromanage where each piece of the model lives, and we will, this is the tool. Second character, Qwen 3.6 35B-A3B, the model. Total size, 35 billion parameters, but it's a mixture of experts, so only 3 billion are active for any given word: 256 tiny specialists, and eight of them wake up per token. Hold that thought; we'll come back to it. Third character, the worst-case rig I had on hand, built almost 8 years ago: an 8-year-old GTX 1060 with 6 gigs of VRAM on PCIe Gen 3, an 8-year-old i3-8100 with 4 cores and no hyper-threading, and 24 gigs of DDR4. Nothing on this list would impress anyone in 2026. If your setup is better, and it probably is, your numbers will come out better than mine. The point is that this rig is a floor, not a ceiling.

Cast introduced. Let's start with the dumb thing first, the way most people would actually try it. The obvious move: split the model in half. Top half on the GPU, bottom half on the CPU and RAM. Do as much as fits, push the rest off-board. That's what -ngl does: number of GPU layers. We tell it 20, meaning the first 20 layers go on the card and the rest stay on the CPU. It loads, just barely, and the model starts answering at about 3 tokens a second. Which means watching a sentence appear over 20-30 seconds. We are in satellite-phone territory.

Why is it crawling? Because every layer carries its experts with it. When a layer is on the CPU, the entire layer, including its expert blocks, lives on the CPU, and per token the data has to make the trip across PCIe. The bus chokes. Three tokens a second, useless for anything real-time. So this is our floor, the dumb baseline.

Now the question becomes: what do we know about this model that the dumb baseline doesn't? This is where the type of model we're running starts to matter. Most older AI models are what's called dense: every neuron in every layer fires for every word. If a dense model has 35 billion parameters, all 35 billion run. Mixture of experts is different. The bulk of the weights live in those expert blocks, and per token the model only wakes up a handful of them. Which means most of the model is sleeping most of the time: dead weight if it's sitting on the GPU, but cheap rent if it's sitting in RAM. So the smart split isn't half the layers each. It's: keep the small, fast-firing parts on the GPU, push the giant sleeping experts onto the CPU. Per token, the GPU does its job, then asks for whichever eight experts are needed, then does its job again. llama.cpp has the exact flag for this: --n-cpu-moe 41. Take every layer's experts and pin them to the CPU.
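
To make the two setups concrete, here's roughly what the two launch commands look like. This is a sketch, not the video's exact command: the model filename is a placeholder, while -ngl and --n-cpu-moe are real llama.cpp flags.

```bash
# Dumb baseline: split the layers. The first 20 layers go on the GPU, the rest
# (experts included) stay on the CPU, so every token drags expert weights
# across PCIe (~3 tok/s on the GTX 1060 in the video).
llama-server -m qwen3.6-35b-a3b-q4_k_m.gguf -ngl 20

# MoE-aware split: offload all layers with -ngl, but pin every layer's expert
# tensors to the CPU with --n-cpu-moe. The small attention/router weights stay
# on the card; the big, mostly-idle experts live in system RAM.
llama-server -m qwen3.6-35b-a3b-q4_k_m.gguf -ngl 99 --n-cpu-moe 41
```

--n-cpu-moe N keeps the expert weights of the first N layers on the CPU, so a value at or above the model's layer count pins all of them.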

[4:06] Everything else, send to the GPU. Reload. Same model, same hardware, one different flag. Speed jumps from 3 tokens a second to 10: about 230% faster, no hardware change. That would be the end of most YouTube videos. Cool trick, runs at reading speed, ship it. But there are four more flags, and each one makes it faster. All right, on to the next one. Flag two: --no-mmap. By default, llama.cpp does this clever thing: it pretends the whole model file is in RAM, but really, it's still on disk.

The OS pages chunks in only when they're needed. Sounds smart; it's actually slow for what we're doing. Every few tokens, the model asks for an expert that hasn't been loaded yet. Disk read, wait, token comes out late. With --no-mmap, llama.cpp reads the entire model into RAM upfront, the whole 20 gigs. Once it's there, every expert is already loaded. No more disk reads during inference, no more page faults mid-token. Every lookup is predictable. 10 tokens a second jumps to 13 and a half, about a 35% bump from one flag. No code, no retraining, no quantization tricks, just telling the OS: hey, stop being clever about my RAM. Two flags down, three to go.

Now check the GPU. At 13 tokens a second, it still isn't full. Two whole gigabytes of VRAM are just sitting there, free real estate we can spend. So we change one number, from 41 down to 35. That pulls six layers' worth of experts from the CPU back onto the GPU. More work happens on the fast chip, less crosses the PCIe bus. VRAM goes from 4 gigs to 5 and a half, and speed jumps from 13 and a half to 17. There's a trade-off, by the way: the bigger the model's footprint on the GPU, the less room for the context window. We dropped from 100,000 tokens of context down to about 64,000. Fine for most chats, not fine if you're feeding it a whole code base. Hold that thought. 17 tokens a second, faster than I read out loud, on a card from when Bohemian Rhapsody was in theaters. Three flags, 17 tokens a second. We could honestly stop here, but I want my context back.

Real quick: if you like seeing what old hardware can actually do, I do this every couple of weeks. New model, cheap setup, no cloud bill. Subscribe and you'll catch the next one. Right, back to it.

Remember that context window we had to shrink? Time to get it back. We're at 17 tokens a second, but context is chopped down to 64,000 tokens. To get more, you'd usually need more VRAM, which we don't have. Here's why context costs VRAM: every token you've ever shown the model, the model remembers. Specifically, it stores two vectors per token per layer, the keys and the values, the KV cache, and it grows linearly. Twice the context, twice the memory. Heads up: I was already using Q8 quantization on the cache. Q8 is basically lossless, a negligible quality drop, but push past Q8, to Q4 or Q3, and the answers start to fall apart. Earlier this year, Google DeepMind published TurboQuant: random rotation, then aggressive quantization. Four bits for keys, three for values, and somehow still almost lossless.
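
Flags two and three, plus the Q8 cache the video was already running, turn the launch command into something like this. Still a sketch: the model filename is a placeholder, while --no-mmap, -c, -ctk, and -ctv are standard llama.cpp flags.

```bash
# Flag 2: --no-mmap reads the whole ~20 GB model into RAM up front,
#         so no page faults mid-token (10 -> 13.5 tok/s in the video).
# Flag 3: --n-cpu-moe 35 pulls six layers' worth of experts back onto the GPU,
#         spending the spare VRAM; the context window shrinks to fit (~64k).
# KV cache quantized to Q8 (near-lossless). Note: a quantized V cache may also
# require flash attention to be enabled, depending on the llama.cpp build.
llama-server -m qwen3.6-35b-a3b-q4_k_m.gguf \
  -ngl 99 --n-cpu-moe 35 \
  --no-mmap \
  -c 65536 \
  -ctk q8_0 -ctv q8_0
```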

[7:40] Q3 and Q4 territory with quality you can't tell apart from Q8. The math is in the paper if you're into that. Two flags: turbo 4 for the keys, turbo 3 for the values. The asymmetry isn't a typo. This model uses grouped-query attention with an 8-to-1 ratio, so the values can take heavier compression than the keys. Don't worry about it, just type the flags.

Okay, flags on. First, let's not get greedy: bump context from 64,000 back to 128,000. Reload. It loads. 5.3 gigs of VRAM, 128k context, no OOM. That's already a win: half a flag and we double the context. But only 0.7 gigs free. Can we be greedier? Push it to 256,000. Reload. Out of memory. Wait. What if we pull one more layer of experts back to the CPU? --n-cpu-moe goes from 35 to 36. Try again. It fits, just barely: 5.9 of 6 gigs. 64,000 stretches to 256,000, four times the entire training context of the model, on the same 6-gig GPU. Speed? Still 17 tokens a second, same as before.
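
The 256k-context version looks roughly like this. One caveat: -ctk and -ctv are real llama.cpp flags, but the TurboQuant cache types below are placeholders taken from the video; a stock build only accepts types like q8_0, q5_1, or q4_0, so substitute accordingly.

```bash
# Free one more layer's worth of VRAM (--n-cpu-moe 35 -> 36) and compress the
# KV cache harder to stretch the context from 64k to 256k on the same 6 GB card.
# NOTE: "turbo4"/"turbo3" are the video's TurboQuant cache types, not values a
# stock llama.cpp build accepts; use q8_0/q5_1/q4_0 there if yours doesn't.
llama-server -m qwen3.6-35b-a3b-q4_k_m.gguf \
  -ngl 99 --n-cpu-moe 36 \
  --no-mmap \
  -c 262144 \
  -ctk turbo4 -ctv turbo3
```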

[8:57] Compression doesn't slow it down. The cache is small enough that the lookup is essentially free. What that gets you in practical terms: you can paste a small book in and ask questions about it. You can drop an entire code base in as context without the model forgetting page one by the time it reads page 50. Four flags, same speed, four times the context. That's the TurboQuant trick. One paper, two flags, free real estate.

One more flag and we're done. This one isn't about speed. It's the difference between a setup that works in a demo and a setup that survives Tuesday. There's a problem you don't see until you leave the server running for a day. Check /proc/meminfo: Mlocked, 12 kilobytes. Twelve, as in basically nothing. All those experts we put in RAM? The kernel treats them as ordinary, evictable memory. Hours later, when memory gets tight or the system idles, it starts paging some of them out to disk. Next inference: page fault, stutter, random slow tokens. The fix is annoying because it lives in three places. The LXC container needs permission to lock memory. Docker needs the IPC_LOCK capability. llama.cpp needs the --mlock flag. Skip any one of them and it silently falls back to the default. No error, just a slow leak. With all three in place, you literally tell the kernel: do not touch this RAM, do not page it, do not move it, it's mine. Recheck meminfo: Mlocked, 16 gigabytes. Every expert is glued in place. The day-three slowdown? Gone. Same 17 tokens a second, by the way. The speed didn't change, because on a fresh boot the experts were already cached. What changed is that this thing now runs for a week without degrading. Production behavior. Five flags, that's the whole list. 35 billion parameters, 6 gigs of VRAM, 256,000 tokens of context, 17 tokens a second. Stable.

Okay, so we're done with the wins: five flags, one Docker command, 17 tokens a second. Now I want to talk about the thing I tried that didn't work, because most videos skip that part. Speculative decoding. The idea is beautiful: you run a tiny model alongside the big one. The tiny one guesses the next eight tokens; the big one verifies them in a single batch instead of eight serial passes. When it works, you get two to four times the speed. I added Qwen 3.5, the 800-million-parameter version, as the drafter. Same tokenizer as the target, 248,000 tokens. Should plug right in. The drafter held its own: acceptance came in around 65%, two out of every three guesses landed. Solid for an 800-million-parameter draft. And the speed dropped from 17 to 11. Slower, even with decent predictions. That broke my brain for a day.

Here's why it doesn't work. Two reasons, both architectural. First, mixture of experts. Each token in the batch picks its own eight experts out of 256, so eight tokens batched together can pull from 64 different experts per layer, each one fetched fresh from CPU RAM over the PCIe bus. The verification stops being a batch and turns into a memory thrash. Second, this model uses state-space layers: 30 out of 40 layers are SSM. SSM layers compute one position at a time; each step depends on the state of the step before it. You can't parallelize them across a draft window, so the verify-eight-tokens-in-one-pass trick doesn't apply. The math actually goes the wrong way: per-token verification time stays the same when you batch, because expert loading dominates, and on top of that you paid the cost of running the draft model. Net negative. Someone benchmarked this on a 3090 across 19 different configurations. Same result: speculative decoding works for dense transformers, and it doesn't work for what we're running here.
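
For completeness, here's roughly how that drafter experiment gets wired up in llama.cpp. The filenames are placeholders and the draft-length flag names vary a little between builds; this is the configuration that lost speed here, not a recommendation.

```bash
# Speculative decoding: a small draft model proposes tokens, the big model
# verifies them in a batch. On this MoE + SSM target it went 17 -> 11 tok/s.
llama-server -m qwen3.6-35b-a3b-q4_k_m.gguf \
  -ngl 99 --n-cpu-moe 36 --no-mmap --mlock \
  --model-draft qwen3.5-0.8b-q8_0.gguf \
  --gpu-layers-draft 99 \
  --draft-max 8
```
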
There's a follow-up paper called DFlash: a block-diffusion drafter that generates eight tokens in one shot instead of one at a time, and there's a working drafter for Qwen 3.6's 27-billion dense version. Different model, same trick I want, finally working. Worth coming back to.

All right, let's wrap. Here's the whole thing: one Docker command, five flags that matter. 35 billion parameters, 6 gigs of VRAM, 256,000 tokens of context, 17 tokens a second, on a card older than every AI startup you've heard of. The hardware isn't the bottleneck anymore; the defaults are. And remember, these numbers came off about the worst-case rig you can dig up in 2026: an eight-year-old GPU, an eight-year-old CPU, plain DDR4. If you've got anything from this decade, a newer card, faster RAM, PCIe Gen 4, your numbers will go up. This is the floor. What you can actually run on the card already in your machine is bigger than what the AI press tells you. Most of the work is done. You just have to know which flags to set.

There's a follow-up worth trying: the dense 27-billion model with the DFlash drafter, to see if we can crack 25 tokens a second on the same 1060. If it works, that's a different kind of crazy. If it doesn't, you'll get the same honest write-up.

Honest question for the comments: what's a real workload you'd want to run on this setup? Code-base Q&A, long-doc summarization, some weird local agent thing? Drop it below. I'm curious what people would actually use this for. That's it. See you in a couple of weeks.
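
To tie the five flags back to the one-Docker-command claim, here's a sketch of what that command could look like. The image tag, mount path, and model filename are assumptions; --gpus, --cap-add IPC_LOCK, and --ulimit memlock are standard Docker options, and the TurboQuant cache types are again placeholders for whatever your build accepts.

```bash
# All five tricks in one (hypothetical) docker run. The server image's
# entrypoint is llama-server, so everything after the image name is a flag.
#   1. -ngl 99 --n-cpu-moe : all layers on the GPU, every expert in CPU RAM
#   2. --no-mmap           : load the whole model into RAM up front
#   3. --n-cpu-moe 36 + -c 262144 : spend spare VRAM on layers, then on context
#   4. -ctk/-ctv           : aggressive (TurboQuant-style) KV-cache quantization
#   5. --mlock + IPC_LOCK + memlock ulimit : the kernel can't page the experts out
docker run --rm --gpus all \
  --cap-add IPC_LOCK --ulimit memlock=-1:-1 \
  -v /path/to/models:/models -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/qwen3.6-35b-a3b-q4_k_m.gguf \
  -ngl 99 --n-cpu-moe 36 \
  --no-mmap --mlock \
  -c 262144 -ctk turbo4 -ctv turbo3 \
  --host 0.0.0.0 --port 8080
```

If the container itself runs inside LXC, the LXC config also needs its memory-lock limit raised, which is the third piece of the three-places problem from the mlock section.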
