How to Build a Scalable RAG System for AI Apps (Full Architecture)

ByteMonk

15m 50s · 2,702 words · ~14 min read

YouTube auto captions

Transcript source

This transcript was extracted from YouTube's auto-generated caption track. The transcript below is server-rendered so it can be read, searched, cited, and shared without opening the original YouTube player.

[0:00] Large language models don't know anything about your private data. They were trained on the public internet, not your internal wiki. RAG, or retrieval-augmented generation, solves this. The idea is simple. When a user asks a question, you first retrieve relevant pieces of information from your own documents. Then you augment the user's question with that retrieved context. And finally, you let the LLM generate an answer based on what you gave it. Retrieval, augmentation, generation. That's where the name comes from. And the beauty of this approach is that you don't need to retrain the model. You don't need millions of dollars in compute. You just need a way to search your documents and a way to pass that context to the LLM. It's simple, powerful, and it works. At least in demos.

Now, here is what I found interesting. Google Research recently published a study showing that when your retrieval is off, your LLM doesn't just give slightly wrong answers. It actually hallucinates more than if you had given it no context at all. Bad retrieval is worse than no retrieval. So what separates a RAG system that works in a Jupyter notebook from one that actually holds up in production? That's what we are exploring today. We'll break down the full architecture of a production-ready RAG system, piece by piece: from how you process your documents, to how you store them, to how you intelligently retrieve and validate answers before they reach your users. Whether you are just getting started or you have already built a few RAG systems, I think there is something here for you. Let's get into it.
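
The retrieve, augment, generate loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the video's implementation: the word-count `embed` function stands in for a real embedding model, `fake_llm` stands in for a real LLM call, and every name and sample chunk here is a hypothetical chosen for the example.

```python
import math
import re
from collections import Counter
from typing import Callable

def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Step 1: return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def augment(query: str, context: list[str]) -> str:
    """Step 2: combine the retrieved context and the query into one prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only the context below.\n\nContext:\n{joined}\n\nQuestion: {query}"

def generate(prompt: str, llm: Callable[[str], str]) -> str:
    """Step 3: hand the prompt to the model (any callable taking a prompt string)."""
    return llm(prompt)

# Hypothetical sample data, invented for this sketch.
CHUNKS = [
    "Parental leave for employees in California is described in the HR policy.",
    "The office kitchen is cleaned every Friday.",
    "Expense reports are due by the fifth of each month.",
]

def fake_llm(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    return f"(model output for a prompt of {len(prompt)} characters)"

query = "What is the parental leave policy for employees in California?"
context = retrieve(query, CHUNKS, k=1)
prompt = augment(query, context)
answer = generate(prompt, fake_llm)
```

Note that nothing in this loop checks whether the retrieved chunk is current, complete, or correct, which is exactly the weakness the rest of the video addresses.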

[1:37] Let me quickly walk you through how basic RAG works. A user sends a query. That query gets converted into an embedding, which is just a numerical representation of the text. You then search your vector database for chunks that are similar to this embedding. That's step one, retrieve. The chunks you get back become your context. In step two, augment, you take that context and combine it with the original query into a prompt. Finally, in step three, generate, you pass that prompt to the LLM and it generates a response based on the context you gave it. Retrieve, augment, generate. Now, I made a detailed video on RAG basics last year. I'll link it in the description if you want to go deeper on the fundamentals. But for now, let's see where this simple flow breaks down.

Say a user asks, "What's our parental leave policy for employees in California?" Your system searches through the documents and finds some chunks that mention parental leave and some that mention California. Looks relevant. But here is what can go wrong. Maybe that parental leave chunk is from a 2019 policy that's been updated twice since then. Your system doesn't know the difference. It just knows the words matched. Or maybe the document was split in a way that cut off the eligibility criteria mid-sentence, so your LLM gets half the information it needs. Or maybe there's a table in the original document that listed all the state-specific policies, but when you extracted the text, that table became a jumbled mess of words that doesn't make sense anymore. Your search worked. It found text that looked relevant. But the context your LLM received was incomplete, outdated, or just wrong.

And here is the part that surprised me. You would think the LLM would say, "I don't know," when the context isn't good enough. But that's not what happens. When you give an LLM context, it becomes more confident, even if the context is bad. It fills in the gaps with information that sounds right but is completely made up.
So now you have a system that's confidently wrong, which is far worse than a system that just admits it doesn't know. And this is the gap between demo RAG and production RAG. In a demo, the documents are clean, the questions are predictable, and everything works. In production, your data is messy: tables, images, headers, footers, multiple versions of the same document, users asking vague or ambiguous questions. And somehow, you need to handle all of that.

So how do we fix this? Let me show you what a production RAG architecture actually looks like. On the left, you have your data sources: documents, code, images, spreadsheets, all the unstructured and structured data that lives inside an organization. Now, in basic RAG, you would just chunk this and embed it. But look at what happens here. First, the data goes through a restructuring layer. This is where you parse raw documents and understand their structure. What's a heading, what's a paragraph, what's a table? Then you have structure-aware chunking. Instead of blindly splitting every 500 tokens, you chunk in a way that respects the document's natural boundaries. Tables stay whole, headings stay with their content. After that, metadata creation. For each chunk, you generate summaries, extract keywords, and even create hypothetical questions that the chunk might answer. All of this flows into your database. And notice it says database, not just vector store. Because in production, you often need both embeddings and relational data working together.

Now let's look at the query side. When a user asks a question, it doesn't just go straight into the vector search. You have a reasoning engine with a planner that figures out what the query actually needs and then executes the right tools to get the answer. You have a multi-agent system where different agents can work on different parts of the problem. And below that, you have these human-thought validation nodes: a gatekeeper, an auditor, a strategist.
These verify the process and catch problems before anything gets returned to the user. On the right side, you have evaluation. Not just "did the user click thumbs up," but actual metrics. Qualitative evaluation using LLM judges. Quantitative evaluation measuring precision and recall. And performance evaluation tracking latency and cost. And see the red section at the left: stress testing using red teaming. Biased opinions, information evasion, prompt injection. You need to know how your system breaks before your users find out. That's the full picture.

Now, let's zoom into some critical pieces, starting with data ingestion. In basic RAG, you take your documents, split them into chunks, and embed them. Simple. But in production, your data isn't simple. You have PDFs with tables, Word documents with headers and footers, HTML pages with navigation menus mixed into the content. If you just extract text and chunk blindly, you lose all that structure. So the first step is restructuring. You parse the document and understand what's a heading, what's a paragraph, what's a table, what's a code block. Structure carries meaning, and you want to preserve it.

Next is structure-aware chunking. Instead of splitting every 500 tokens regardless of what's there, you chunk based on the document's natural boundaries. You keep a heading together with the paragraph it describes. You don't cut a code function in half. The sweet spot most teams land on is somewhere between 256 and 512 tokens per chunk, with some overlap to maintain context across boundaries. But the exact number matters less than respecting the structure.

And then there is metadata creation. For each chunk, you're not just storing text and embeddings. You generate a summary of what the chunk contains. You extract keywords. And here is a useful trick: you generate hypothetical questions that this chunk could answer. Why? Because when a user asks a question, you're matching their query to your chunks.
If your chunks have pre-generated questions attached to them, you are matching question to question, which works much better than matching a question to a random paragraph. Now, this whole ingestion pipeline isn't glamorous. Nobody gets excited about parsing PDFs. But this is the foundation. If your data is poorly structured going in, nothing downstream can fix it.

Now, all this processed data needs to go somewhere. In most tutorials, you'll see a vector database. You store your embeddings, you do similarity search, and you're done. But in production, you often need more than just vectors. Think about it: you might need to filter results by date, so you only get the latest version of a policy document. You might need to filter by department or by document type. You might need to join information across multiple chunks that belong to the same document. This is relational data, and trying to do all of this with just vector similarity doesn't work well. So production systems often use a database that can handle both: embeddings for semantic search and relational data for filtering and joining.

And this is actually where a tool like Neon, which is also the sponsor of this video, becomes really useful. Neon is serverless Postgres, so you get the full power of a relational database. But it also supports pgvector, which means you can store and search embeddings right alongside your relational data. So instead of managing two separate systems, one for vectors and one for metadata, you have everything in one place. You can do a semantic search and filter by date in the same query. You can join chunks back to their parent documents. You can track versions and timestamps. And because it's serverless, it scales automatically. You're not paying for idle capacity when your RAG system isn't being used, and you're not scrambling to provision more resources when traffic spikes. It can scale down to zero when there is no traffic and spin back up in milliseconds when a request comes in.
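
To make the "semantic search and filter in the same query" idea concrete, here is a small in-memory Python sketch. It is not Neon's or pgvector's API; in Postgres with pgvector the same logic would typically be one SQL statement with a WHERE clause and an ORDER BY on a vector distance. The `Chunk` fields, the three-dimensional embeddings, and the sample rows are all hypothetical values invented for illustration, and each document version is simplified to a single chunk.

```python
import math
from dataclasses import dataclass
from datetime import date

@dataclass
class Chunk:
    doc_id: str            # relational metadata: which document this came from
    department: str        # relational metadata: owning department
    effective: date        # relational metadata: version date
    text: str
    embedding: list[float] # vector used for semantic search

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def search(chunks: list[Chunk], query_embedding: list[float],
           department: str, k: int = 3) -> list[Chunk]:
    """Filter on metadata first, then rank the survivors by vector distance."""
    # Relational step 1: keep only chunks from the requested department.
    candidates = [c for c in chunks if c.department == department]
    # Relational step 2: keep only the latest version of each document.
    latest: dict[str, Chunk] = {}
    for c in candidates:
        if c.doc_id not in latest or c.effective > latest[c.doc_id].effective:
            latest[c.doc_id] = c
    # Semantic step: order by distance to the query embedding.
    ranked = sorted(latest.values(),
                    key=lambda c: cosine_distance(query_embedding, c.embedding))
    return ranked[:k]

# Hypothetical rows; the embeddings are hand-written, not from a real model.
CHUNKS = [
    Chunk("parental-leave", "hr", date(2019, 1, 1), "Parental leave policy, 2019 version.", [1.0, 0.0, 0.0]),
    Chunk("parental-leave", "hr", date(2023, 6, 1), "Parental leave policy, 2023 version.", [0.9, 0.1, 0.0]),
    Chunk("handbook", "hr", date(2022, 3, 1), "Dress code guidelines.", [0.1, 0.9, 0.0]),
    Chunk("budget", "finance", date(2024, 1, 1), "Q3 budget summary.", [0.95, 0.0, 0.05]),
]

results = search(CHUNKS, [1.0, 0.0, 0.0], department="hr", k=2)
```

In this sample the 2019 chunk is the closest vector match to the query, yet it never reaches the results because the version filter runs first. Pure vector similarity would have returned the outdated policy.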
Now, here is something interesting. Neon was recently acquired by Databricks for about a billion dollars. And the reason Databricks gave for this acquisition was AI agents. They said over 80% of databases created on Neon are now being provisioned by AI agents, not humans. Think about that. AI agents spinning up databases on their own to store and retrieve information. That's where things are heading. And Neon's architecture, being serverless and instant to provision, makes it a natural fit for these agentic workflows. For RAG specifically, having Postgres under the hood means you can also do things like database branching. You can create a copy of your entire knowledge base instantly to test a new chunking strategy or a new embedding model without touching production. When you are happy with the results, you merge it back. It's like Git for your database.

Now, let's talk about what happens when a query comes in. In basic RAG, you just take the query, embed it, and find the most similar chunks. But that assumes the query is clear and well-formed. In reality, users ask vague questions. They use different words than what's in your documents. Sometimes, one query actually needs information from multiple places. And this is where hybrid search comes in. Instead of just vector similarity, you combine it with keyword search. Vector search is great for meaning, but keyword search catches exact matches that vectors might miss: things like product names, error codes, specific terms. Most production systems today use both, and then rerank the combined results to surface the most relevant chunks.

But even with good retrieval, you still have a problem. You've got chunks of text. Now what? In basic RAG, you take your retrieved chunks, stuff them into a prompt, and ask the LLM to generate an answer. And that works for simple questions. But what if the user is asking something complex?
Something like, "Compare our Q3 performance in Europe vs. Asia and suggest which region we should focus on next quarter." That's not a single retrieval. You need Q3 data for Europe, you need Q3 data for Asia, you might need historical trends, you might need market forecasts. And then you need to actually reason across all of that to make a recommendation. And this is where the reasoning engine comes in. Instead of going straight from query to retrieval to generation, you have a planner that first breaks down what the query actually needs. What information is required? What steps do we need to take? What tools should we use? The planner creates a plan, and then the tool execution layer carries it out. Maybe it runs multiple retrievals. Maybe it calls an external API for market data. Maybe it does some calculations.

And in more advanced setups, you have a multi-agent system, with different agents specializing in different things. One agent might be good at retrieving financial data, another might be good at summarization, another might handle calculations. These agents work on the processed database, fetch information relevant to their part of the journey, and then their output gets combined into a final response. And this is what people mean when they talk about agentic RAG. The system isn't just retrieving and generating. It's reasoning, planning, and coordinating multiple steps to solve complex problems.

But here is the thing. More complexity means more chances for things to go wrong. An agent might retrieve the wrong information. The planner might misunderstand the query. The final response might sound confident but be completely off. And this is why you need validation. So let's look at the bottom of this architecture. There is a conditional router that sends output through the validation nodes before anything reaches the user. You have a gatekeeper that checks if the response actually answers the question.
You have an auditor that verifies the information is grounded in the retrieved context and not hallucinated. And you have a strategist that evaluates whether the response makes sense given the broader context. The idea is to mimic how a human would think. Before you send an answer, you ask yourself: does this actually address what was asked? Is this accurate? Does this make sense? In production, these validation nodes catch a lot of problems before they reach users. A response that sounds confident but contradicts the source documents. An answer that addresses a different question than what was asked. A recommendation that doesn't account for important constraints.

Now, on the right side of the diagram, you have evaluation. This is how you measure whether your system is actually working. And there are three types. Qualitative evaluation uses LLM judges to assess things like faithfulness, relevance, and depth. Is the response faithful to the retrieved context? Is it relevant to the query? Is it thorough enough? Quantitative evaluation measures retrieval precision and recall. Of the chunks you retrieved, how many were actually relevant? Of all the relevant chunks in your database, how many did you successfully retrieve? And performance evaluation tracks latency and cost. How long does each query take? How many tokens are you using? This matters when you're running at scale. Without evaluation, you're flying blind. You might think your system is working great while users are quietly getting bad answers.

And finally, at the top of the diagram, stress testing. Before you deploy to production, you need to know how your system breaks. This is red teaming. You deliberately try to break your own system. Can someone inject a prompt that makes your system ignore its instructions? Can someone get it to leak information it shouldn't share? Can someone get biased or harmful outputs by phrasing questions in certain ways?
You test for biased opinions, information evasion, and prompt injection. You find the weaknesses before your users do. And this isn't optional. If you are deploying a RAG system that touches real users or real business decisions, you need to know its failure modes.

So that's the full architecture. Data ingestion that preserves structure and creates rich metadata. A database layer that combines vectors with relational data. Hybrid retrieval that uses both semantic and keyword search. A reasoning engine that can plan and execute complex queries. Multi-agent coordination for problems that need multiple steps. Validation nodes that catch errors before they reach users. Evaluation to measure how well things are actually working. And stress testing to find the cracks. It's a lot more than chunk, embed, retrieve, generate. But this is what it takes to build something that actually works in production.
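
As a concrete footnote to the quantitative evaluation step mentioned in the transcript, retrieval precision and recall for a single query can be computed as below. This is a minimal sketch that assumes you already have a labeled set of relevant chunk IDs per query; the function name and the sample IDs are hypothetical.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision: share of retrieved chunks that are relevant.
    Recall: share of relevant chunks that were retrieved."""
    retrieved_set = set(retrieved)
    hits = retrieved_set & relevant
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 chunks retrieved, 3 chunks labeled relevant, 2 in common.
precision, recall = precision_recall(["c1", "c2", "c3", "c4"], {"c1", "c3", "c7"})
# precision = 2/4 = 0.5, recall = 2/3
```

Averaging these two numbers over a fixed set of test queries gives a baseline you can compare against whenever the chunking strategy or embedding model changes.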
