[0:00]In the past week or so, so many people have been complaining about hitting their Claude Code limit insanely fast. Claims like a prompt that used to be about 1% of the limit now eating around 10%. You could go through X and find tons and tons of threads about this topic.
[0:12]Even on a $200 per month plan, people are reaching the session limit way too fast. And then we got this post from an Anthropic employee that basically said that they are working on a little change with peak hours and off-peak hours.
[0:25]But even after that, some people were saying they were still hitting it really quick, even during off-peak hours. So anyways, I've been playing around a ton, trying different things, doing research, and I have 18 token management hacks for you guys that I've organized from tier one all the way up to tier three, so they get more advanced as we go.
[0:39]I'm very confident that by the end of this video, you will feel like your Claude Code usage has doubled, tripled, maybe even 5x. So, let's not waste any time and just get straight into the video.
[0:48]So, as I've been optimizing my own token management, I think that what's really important to realize first is how tokens actually work. Because once you realize how Claude uses tokens, it makes it very clear how you should actually reverse engineer the way that you work in order to use less tokens.
[1:04]So, a token is the smallest unit of text that an AI model reads and charges you for. Roughly one token per word is not exactly true, but it's a good baseline.
[1:13]So, every time that you send a message, Claude re-reads the entire conversation from the beginning, and all of those are tokens that it's charging you for. So message one, it will read it, then it will read its reply, and then message two, and then the reply all the way up to your latest prompt, every single time.
[1:28]And I think that alone is a huge light bulb moment for a lot of people. This means cost compounds, not adds. Message one might cost 500 tokens. Message 30 costs 15,000 plus, because it's re-reading everything before it.
[1:45]One developer tracked a 100 plus message chat, 98.5% of all tokens were spent re-reading old history, only 1.5% went toward new output. Like, that's a huge waste.
[1:57]Now, yes, the argument has to be made that, well, it needs the context, and it needs to understand what we're doing, but still, 98.5% is crazy. So, take a quick look at this graphic here.
[2:04]Along the x-axis, we have message number, and as it increases, you can see that we have our per message cost and our cumulative tokens increasing, but it's not linear. It's basically each message is re-reading all of the past ones, and it has to count that in.
[2:18]So message one could be 500, message 30 could be 15,500, which is 31 times more. And then after 30 messages, you might already be at almost a quarter million cumulative tokens.
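To make the compounding concrete, here's a tiny sketch of that math. The 500-tokens-per-exchange figure is illustrative, not real billing:

```python
# Toy model of per-message cost when the whole history is re-sent
# on every turn. Assumes each exchange adds ~500 tokens (illustrative).
TOKENS_PER_EXCHANGE = 500

def per_message_cost(n):
    # Message n re-reads all n exchanges accumulated so far.
    return TOKENS_PER_EXCHANGE * n

def cumulative_cost(n_messages):
    # Total spend over a session: quadratic, not linear.
    return sum(per_message_cost(n) for n in range(1, n_messages + 1))

print(per_message_cost(1))    # 500
print(per_message_cost(30))   # 15000 -- 30x message one
print(cumulative_cost(30))    # 232500 -- almost a quarter million
```

The per-message cost grows linearly, so the cumulative total grows quadratically, which is why long sessions drain the limit so much faster than short ones.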
[2:27]Now, on top of all of your own messages, Claude will also re-load your claude.md file, all connected MCP server tool definitions, system prompts, and referenced files on every single turn.
[2:40]This overhead is invisible but constant. And a really important thing to realize is that bloated context doesn't just cost more, it produces worse output. Research on the lost in the middle phenomenon shows models pay less attention to content buried in long contexts.
[2:50]Keeping context tight isn't just a cost move, it's a quality move. Lost in the middle basically says that models pay the most attention at the beginning of your session and at the end.
[2:59]So all that stuff in the middle of your session, kind of in this dip, is getting ignored. All right. Now that we kind of understand a little bit more about how Claude code works and how tokens work, let's move into the hacks. We're going to start here with tier one hacks.
[3:07]These are the ones that are going to be super easy to implement, and everyone should be able to understand. So, we've got nine of these. Number one is to start fresh conversations. Use /clear between unrelated tasks.
[3:19]Don't carry context about topic A into a conversation about topic B. Every message in a long chat is dramatically more expensive than the same message in a fresh chat, because it drags the whole history with it. So this one habit is the number one thing that extends session life, and it's pretty obvious based on what we just talked about, so that's why this was number one.
[3:36]Okay, number two is to disconnect MCP servers. Every connected MCP server loads all its tool definitions into your context on every message, even if you never touch it.
[3:45]This is another source of completely invisible tokens that is quietly eating away at your budget. One server alone might be something like 18,000 tokens per message.
[3:55]So, run /mcp at the start of each session. Disconnect everything you won't use. And better yet, if you're able to find CLIs for something, so, for example, rather than having the Google Workspace or Google Calendar MCP server, which eats a lot of tokens, just use the Google Workspace CLI.
[4:10]It's faster, it's cheaper, and I think the future is moving towards having our agents use CLIs rather than MCPs. All right, number three, batch prompts into one message.
[4:20]Three separate messages cost roughly 3x what one combined message costs because of the way that tokens work, right? Instead of sending "summarize this" as one message, then "now extract the issues," then "now suggest a fix," send it all in one prompt.
[4:33]If Claude gets something slightly wrong, edit your original message and regenerate instead of sending a follow-up correction. Follow-ups stack onto history permanently. Edits replace the bad exchange entirely.
[4:44]Now, I will say there is an argument to be made here that potentially doing it this way, where you're doing task one, task two, then task three, might actually be better output quality. I think it depends on the actual use case.
[4:56]Basically, the idea would be if you can give AI one specific task at a time, it's going to do better because it's more specialized and it's more focused. But this is definitely something that you should be aware of.
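Here's a rough sketch of why batching wins under the re-read model from earlier. The overhead, task, and reply sizes are made-up placeholders, and it only counts input tokens:

```python
# Toy comparison: three follow-up messages vs. one batched prompt.
# BASE is the per-turn overhead (system prompt, claude.md, etc.);
# all numbers are made up, and only input tokens are counted.
BASE = 2000
TASK = 100    # tokens per task instruction
REPLY = 800   # tokens per model reply added to history

def sequential(n_tasks):
    total = 0
    history = BASE
    for _ in range(n_tasks):
        history += TASK
        total += history   # full history is re-read on each turn
        history += REPLY
    return total

def batched(n_tasks):
    # One turn carrying every task at once.
    return BASE + n_tasks * TASK

print(sequential(3))  # 9000
print(batched(3))     # 2300 -- roughly a quarter of the input cost
```

The exact ratio depends on how big the replies are, but the shape is always the same: every follow-up pays for all the replies that came before it.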
[5:04]Okay, number four is to use plan mode before any real task. This lets Claude map out the approach, ask you the right questions, and it prevents the single biggest source of token waste, which is just having Claude go down the wrong path, writing code, and then basically everything that it just did, you have to basically like scrap and redo. It's just a huge waste of time and tokens.
[5:23]So, you can add something like this to your claude.md: Do not make any changes until you have 95% confidence in what you need to build. Ask me follow-up questions until you reach that confidence. This is something that I'm putting into all of my claude.mds when I am having it help me build things.
[5:37]Number five, we have run /context and /cost. /context shows you exactly what's eating your tokens right now. So your conversation history, your MCP overhead, loaded files, stuff like that. And /cost shows your actual token usage and estimated spend for that current session.
[5:52]Most people have no idea where their tokens are going. These two commands make the invisible visible, because if you don't actually know that you're bleeding because of MCPs, then how would you be able to fix that?
[6:02]So, when you run /context, this is what it will look like. It'll basically give you a snapshot of how many tokens you're at, what the cap is, and it will estimate usage across the different categories.
[6:12]And this was run in a completely fresh session, no chats. So, what that tells me is, okay, before I even talk to Claude, I'm already down 51,000 tokens because of things like the system prompt, the system tools, my custom agents, my skills, and memory files. And here, I've actually cleared out all of the MCPs, so there wasn't anything in there.
[6:35]But those can, like I said, completely blow up your tokens right from the get-go. Okay, number six is to set up a status line. This kind of goes hand in hand with having more visibility. You only actually see this in your terminal, though, so you will have to do it there.
[6:46]And it basically lets you see what's going on. So right here in my terminal, I've got this set up so that I can see the model I'm using, a visual progress bar of my usage, and 52,000 tokens out of my whole 1 million context window, which is 5%.
[7:04]And just to clarify, this isn't my session, like my five-hour session, this is basically just indicating that I'm 5% of the way or 52K out of 1,000K. So all you have to do is in Claude code in the terminal, do /statusline, and explain that you want to replicate this setup.
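If you'd rather wire the status line up by hand instead of asking /statusline, my understanding is that it lives in your Claude Code settings.json as a command that prints the line; the script path below is a placeholder, so check the current docs for the exact shape:

```json
{
  "statusLine": {
    "type": "command",
    "command": "bash ~/.claude/statusline.sh"
  }
}
```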
[7:18]Number seven is just super simple, but keep your dashboard open. Same thing with visibility. You might run into issues with your limit and just get hit out of nowhere, but if you have it pulled up next to you or you have it ready so that you can switch into that tab and you know, check every 20, 40 minutes, then you're going to be able to pace yourself a little bit better.
[7:35]You could even set up an automation to check in on it every 30 minutes and send you a text or a Slack message saying, hey, by the way, you're getting near your usage limit.
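As a sketch of that automation, here's a minimal Python loop that pings a Slack incoming webhook on an interval. The webhook URL is a placeholder, and actually reading your usage automatically would need its own data source, so this version just sends a periodic reminder:

```python
# Hypothetical reminder loop: posts to a Slack incoming webhook every
# 30 minutes during a heavy session. The webhook URL is a placeholder.
import json
import time
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def build_payload(text):
    # Slack incoming webhooks accept a JSON body with a "text" field.
    return json.dumps({"text": text}).encode()

def ping(text):
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=build_payload(text),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

def remind_every(minutes=30):
    # Call this from a long-running shell or a cron-style scheduler.
    while True:
        ping("Heads up: go check your Claude usage dashboard.")
        time.sleep(minutes * 60)
```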
[7:45]All right, so number eight, we have be smart with pasting. Before you drop a document or a file or something large, just ask yourself, does Claude need to read this whole thing?
[7:57]Sometimes it does, sometimes it needs that full context, but sometimes it just needs one section or one page. So, if the bug is in one function, then paste just that function, or if it just needs the context of one little paragraph, just paste that. If you want Claude to be precise about what it reads, you need to be precise about what you feed it.
[8:10]And number nine, our last Tier one hack, is to actually watch Claude code work. Don't just fire off a prompt and walk away or switch tabs. Watch what Claude is doing, especially on longer tasks.
[8:20]Sometimes it gets stuck in its own loops, re-reading the same files, retrying the same approach, or exploring paths that clearly aren't going anywhere. If you actually sit and watch it, you'll catch it going down the wrong path.
[8:29]And if it's doing that, you might as well just stop it right there. Kind of the same idea as plan mode.
[8:37]Why would you let it go down the wrong path, waste all your tokens, and then just have to scrap it all? In a bad loop, 80%+ of the tokens being used are producing zero value. So if you're able to just watch your session run until you know it's going down the right path, it could save you thousands of tokens.
[8:49]All right, let's kick it up a little bit. Let's move into our tier two hacks, and for these ones, we have five of them. So, number one is to keep your claude.md file lean. Place it in your project root.
[9:02]Claude auto-reads it at the start of every chat as system context. So keep it under 200 lines. Include things like your tech stack, your coding conventions, your build commands, the 95% confidence rule, only the most important things.
[9:15]Treat this file like an index: route to where more data lives. This is a complete mindset shift. The file basically tells Claude Code where everything it needs is and what to do every single time.
[9:25]So, it can point to files that are huge, but that way it just says, okay, I don't need this right now, but if I do need this, I know exactly where to look. And because it knows exactly where to look, it's not going to waste time or tokens searching through and reading other files. It's just able to grab it right there by the file name.
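A lean, index-style claude.md might look something like this; the stack, commands, and file paths are all hypothetical examples:

```markdown
# CLAUDE.md -- lean index, kept under 200 lines

## Stack
- Next.js 14, TypeScript, Postgres (Prisma)

## Commands
- Build: `npm run build` | Test: `npm test`

## Rules
- Do not make any changes until you have 95% confidence in what you
  need to build. Ask me follow-up questions until you reach that
  confidence.

## Where things live (read only when needed)
- API conventions: docs/api-style.md
- DB schema notes: docs/schema.md
- Deploy runbook: docs/deploy.md
```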
[9:39]And the reason I say this is a mindset shift is because you should be doing this with other things, not just your claude.md: with your skills, or with your master reference guide sheets.
[9:47]I saw someone talking about how they created an index that's super lean, and it shows Claude Code exactly where to go in the Claude Code documentation. So, if it needs help with something related to Claude Code, it doesn't have to search through the whole docs; it can just say, okay, here's my index file, I know exactly which URL to look up.
[10:03]Super simple. You want to keep this lean and trim it all the time. It's always a work in progress, because every single chat, not just like your session, every single message, claude.md gets read.
[10:14]So, if your claude.md file is 1,000 lines, every single time you shoot off a message, even if you just say hi, the whole thing's going to get read. Okay, number two here is to be surgical with file references.
[10:22]Don't just say something like, here's my whole repo, go find the bug. Say something more like, check the verifyUser function inside the auth.js file. Or you could also use @filename to point at specific files instead of, once again, letting Claude explore freely.
[10:38]The whole idea of being specific and routing. All right, so number three, I'm saying to compact at around 60% capacity. Auto-compact triggers at like 95%, by which point your context is already pretty degraded.
[10:49]So, run /context to check your capacity percentage, or you should have the status line set up. And at about 60%, just run the /compact with specific instructions on what it should actually be preserving.
[11:00]After three to four compacts in a row, the quality does start to degrade. So, at that point, once you've done three or four, just get a session summary, slash clear, give the session summary back, and then keep going.
[11:10]All right, so number four, short breaks are actually costing you. Claude Code uses prompt caching to avoid re-processing unchanged context, but the cache has a five-minute timeout.
[11:20]So, if you step away, and you come back and it's been longer than five minutes, your next message re-processes everything from scratch at full cost. And that is why some people feel like their usage just randomly spikes if they might have, you know, stepped away and came back.
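Here's a rough sketch of why that five-minute cutoff stings, assuming cache reads are billed at about a tenth of the uncached input rate (Anthropic's published ratio for prompt caching); the context size is an arbitrary example:

```python
# Why a >5 minute pause costs more, as a toy model. Assumes cache reads
# are billed at ~10% of the uncached input rate and the cache expires
# after five minutes; the context size is an arbitrary example.
CONTEXT_TOKENS = 50_000
CACHE_READ_DISCOUNT = 0.1

def next_message_input_cost(paused_over_5_min):
    if paused_over_5_min:
        return CONTEXT_TOKENS  # cache expired: full re-process
    return int(CONTEXT_TOKENS * CACHE_READ_DISCOUNT)

print(next_message_input_cost(False))  # 5000
print(next_message_input_cost(True))   # 50000 -- 10x for stepping away
```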
[11:32]So, if you're going to do that, just consider doing a /compact or a /clear before you step away. All right, number five, command output bloat.
[11:40]When Claude runs shell commands, the full output enters your context window. So, if a command comes back with 200 commits' worth of history, or just tons and tons of data, all of that becomes tokens that get sent to the model.
[11:54]So really, the takeaway here is to be intentional about what you let Claude run. If you know in a certain project that it doesn't need to use certain commands, then you can go ahead and in that project, deny those permissions.
[12:05]And this is another one that seems invisible, because when it runs a bash command, the terminal often collapses the output to a single line, so you don't actually see all the tokens that just got sent.
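To deny specific commands per project, my understanding is that you can use the permissions block in that project's .claude/settings.json; the matchers below are examples only, so check the docs for your version:

```json
{
  "permissions": {
    "deny": [
      "Bash(git log:*)",
      "Bash(find:*)",
      "WebSearch"
    ]
  }
}
```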
[12:16]All right, so sitting here editing this video, there's just one more thing that I wanted to get off my chest, and it's basically about hitting your limit. The goal of this video, and your goal, should be to optimize so that you don't hit your limit.
[12:30]But I don't think that hitting your limit should have a negative connotation, because ultimately, if you're doing a lot of these hacks and you're not just being wasteful with tokens, then hitting your limit is actually a good thing if you think about it.
[12:43]Because it means that you are using this tool so much, and I think that's what you want to be. I think you want to be a power user of this tool to the point where it's like, got to wait again. And you know, waiting sucks, but people that are using it that much are going to be so much more productive and so much farther ahead than people who are never hitting their limits, never getting their money's worth, and never truly getting the leverage that you are now getting.
[13:10]So anyways, quick little raw rant there, but I think it's an important mindset shift to have, you know? Just something to think about. All right, so we're moving on to Tier three now. I hope you guys feel like you already have a lot of things that you want to implement, and these ones are getting a little crazier as well.
[13:25]So we've got four of these here, and I've got a few bonus ones also. But number one is to pick the right model. Sonnet as your default for most coding work. Haiku for subagents, formatting, and simple tasks (roughly 3x cheaper than Sonnet). Opus for deep architectural planning only, and only when Sonnet wasn't enough.
[13:38]Try to keep this under 20% of usage, or unless you just really, really need it for that project. Now, a little tip here is when you have a huge code base and you want to do certain things like maybe a review, then try bringing in Codex.
[13:49]There isn't an official plugin now, and I made a video about this. I'll tag that right up here. But you could basically have, you know, Opus and Sonnet working together to build you, you know, a project or a code base.
[13:58]And then you could just bring in Codex to actually review everything, and that way you're saving yourself on the Claude tokens. The next one, number two here is the cost of subagents.
[14:05]Agent workflows use roughly 7-10x more tokens than a standard single-agent session. Now, why is that? Because they wake up with their own full context, and it's a separate instance.
[14:18]So, they basically have to reload everything when you start up the new session. All of those files, all of the system tools, like everything like that. Now, what you can do, though, which is helpful, is to delegate to subagents for one-off tasks, especially if you want that one-off task to use Haiku.
[14:32]So, maybe you need to process a lot of information, or maybe you need to do a ton of research and get just like a summary back. Now, yes, tokens are still tokens, no matter what, at the end of the day, but if you can make 80% of your tokens a cheaper model, rather than 80% of your tokens an expensive model, then you're going to be saving money.
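Quick back-of-the-napkin math on that 80/20 split. The per-million-token prices here are placeholders chosen only to show a ~3x ratio, not real list prices:

```python
# Toy cost split: push the bulk of tokens to a cheaper model. Prices
# are placeholder $/million-token figures chosen only to show a ~3x
# ratio, not real list prices.
SONNET_PER_MTOK = 3.0
HAIKU_PER_MTOK = 1.0

def session_cost(total_mtok, haiku_share):
    haiku = total_mtok * haiku_share * HAIKU_PER_MTOK
    sonnet = total_mtok * (1 - haiku_share) * SONNET_PER_MTOK
    return haiku + sonnet

print(session_cost(10, 0.0))  # 30.0 -- everything on Sonnet
print(session_cost(10, 0.8))  # about 14.0 -- 80% routed to Haiku
```

Same total tokens, less than half the spend, just from routing the grunt work to the cheaper model.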
[14:48]And then, of course, agent teams are cool. Sometimes I really do like them, and they help me get higher quality outputs, but they're very, very expensive, so try to use them very sparingly.
[14:59]All right, so number three is to understand peak hours. So, we just talked about it at the beginning how they've adjusted how fast your five-hour session window drains based on demand during the peak hours, which are 8:00 a.m. to 2:00 p.m. Eastern Time on weekdays.
[15:13]But off-peak, this is when your usage is kind of either normal or it lasts a little longer. And these are afternoons, evenings, weekends. So, if you actually think about this strategically, maybe you want to make sure that you're running big refactors or multi-agent sessions or big projects during off-peak hours only. Otherwise, you're going to, you know, drain right through that peak session.
[15:33]And on top of this, we'll call this a little hack 3.5, which is the one I kind of alluded to earlier when I said, hey, just keep your Claude usage dashboard open so you can see your usage at all times. If you're near a reset and you have room left in your allocation, then go heavy.
[15:46]Try to hit that usage limit before it resets. Get your money's worth. Let your agents go loose at that point. And on the other side, if you're getting near your limit, but you still have lots of time, then step away.
[15:56]This is your time to take a break, take a walk, make some lunch. Come back with a full budget instead of burning the last 5% on something small and getting stuck mid-task and having to just kind of, you know, lose that flow state that you might have been in.
[16:10]Okay, number four, your system's constitution, which is claude.md. This should contain stable decisions, architecture rules, and progress summaries. Think of it like the source of truth that makes every prompt shorter and shorter.
[16:22]Save decisions, not conversations. Every architectural call you store there is a paragraph that you never have to type again. So, this builds on top of the way that you were thinking about it back in Tier one.
[16:31]You can add rules in there that basically tell it, hey, I want you to help me make sure I'm being smart about tokens. Use subagents for any exploration or research. If a task needs 3+ files or multi-file analysis, spawn a subagent and only return summarized insights. Spawn that subagent in Haiku.
[16:47]And here's a little prompt that I have at the bottom of my claude.md. And I will say before I read this out, you have to be careful, because when you make a file like this kind of self-learning or self-evolving, you have to check on it frequently, because you don't want it to accidentally get too bloated.
[17:02]But here I said: applied learning — when something fails repeatedly, when Nate has to re-explain, or when a workaround is found for a platform or tool limitation, add a one-line bullet here. Keep each bullet under 15 words, no explanations.
[17:13]Only add things that will save time in future sessions. And then it's got some bullets. Now, I'm not saying this is the most optimal prompt, but I think this sort of system of having your claude.md actually learn and continuously think about how it can save you time and tokens is a good idea to play with.
[17:28]All right, so I know that we just went through a ton of stuff. This whole slide deck will be available for download for free in my free school community. The link for that will be down in the description. But right now, what you should go do are these things.
[17:40]Go run /context, see what it looks like. Go to some of your active sessions, run /cost. Status line, make sure it's showing your model, your context percentage, and your token count. Make sure you pull up your Claude usage dashboard so you can see your remaining allocation and what time it resets.
[17:54]Disconnect unused MCP servers. Start complex tasks in plan mode. Use /clear when switching to an unrelated task. Manually compact at 60% context. Batch your multi-step instructions into single messages, schedule heavy sessions for off-peak hours, and really just be mindful about timing.
[18:12]So, I wanted to kind of leave you guys with one, maybe two messages. The first thing is just the idea that there is a balance between quality and cost. And so, that's kind of a game that you have to play a little bit, and sometimes you do have to go for the higher quality, which ultimately is going to cost you more money, and that's just the way it works.
[18:28]But the other thing is just to keep it simple, and think about what we talked about at the beginning of this video: how tokens actually work, how Claude Code actually charges you. Most people don't need a bigger plan. They need to stop re-sending their entire conversation history 30 times when they could send it five times.
[18:44]It's not a limits problem. It's a context hygiene problem. But anyways, that is going to do it for this one. If you guys enjoyed or you learned something new, please give it a like, it helps me out a ton. And as always, I appreciate you guys making it to the end of the video. I'll see you all in the next one. Thanks, everyone.



