5 AI Engineer Projects to Build in 2026 | Ex-Google, Microsoft

Aishwarya Srinivasan

19m 41s | 3,456 words | ~18 min read
AI audio transcription

This transcript was generated from the video's audio because no usable YouTube caption track was available. The transcript below is server-rendered so it can be read, searched, cited, and shared without opening the original YouTube player.

[0:00] If you're trying to break into AI engineering in 2026, or even if you're trying to level up from where you are right now, I want to walk you through five portfolio projects that I think will genuinely make a difference when you're sitting in front of a hiring manager. And I'm not talking about the kind of projects where you follow a tutorial, deploy a basic chatbot and put it on your resume. I'm talking about the kind of work that makes someone look at your GitHub and say, okay, this person actually understands how production AI systems work. Now, I know the whole portfolio advice space can feel overwhelming because everyone has different opinions, and it's hard to know what actually matters versus what's just noise. So, what I have done to help you is distill this down to five projects that each target a distinct, in-demand skill. And I'm going to walk you through each one step by step so you know exactly what to build, what tools to use, and more importantly, why each piece matters. Hi, I'm Aishwarya Srinivasan. I've spent over 10 years building and shipping machine learning and AI systems. I have a master's in data science from Columbia University and I've worked at companies like Microsoft, Google and IBM. And I've led AI developer relations at Fireworks AI, which is one of the leading AI infrastructure startups. I've been on both sides of the hiring table many, many times, and what I'm sharing with you today comes directly from that experience. All right, so let's dive in. So, the first project that I want you to build is a production-grade retrieval-augmented generation, or RAG, system. Now, if you've been anywhere near the AI space, you've definitely heard of this term. And there is a good reason it keeps coming up. RAG is genuinely one of the most common patterns in enterprise AI right now. But here is what I want you to understand: the gap between a RAG demo and a RAG system that's actually production ready is enormous. 
And that gap is exactly where you can differentiate yourself. What you're building is essentially a domain-specific "ask my docs" system. So you pick a corpus of documents. It could be technical documentation, research papers, legal contracts, health care documents, whatever domain interests you. And you build a system that retrieves the right information and answers questions with proper citations. The keyword here is citations, because anybody can get a large language model to generate a plausible-sounding answer. But grounding that answer in actual retrieved evidence is what makes it trustworthy. I'd highly recommend breaking this into three phases. In the first phase, you're just getting the fundamentals working. You'll ingest your documents, whether those are PDFs, markdown files or web pages, and then chunk them into pieces of about 500 to 800 tokens with roughly 100 tokens of overlap between chunks. Now, that overlap matters because you don't want to accidentally slice an important sentence right at the boundary and lose the context. Then store those chunks as embeddings in a vector store. You can use Chroma or Weaviate, both are excellent choices to start with. And then you build a retrieval pipeline that pulls the top K most relevant chunks for a given query and generates an answer that cites where the information came from. At this stage, your deliverable is straightforward. You can show someone a document, show them the answer your system produced and point them to the exact paragraph it drew from. The second phase is where you graduate from a demo to production quality, and this is honestly the part that most people never get to, which is precisely why it is so valuable. So here you're going to implement hybrid retrieval, which means combining traditional BM25 keyword search with vector-based semantic search. The reason you want both is that vector search excels at understanding meaning and intent. 
But sometimes a user is searching for a very specific term or phrase, and BM25 handles that beautifully. Now, on top of that, you'll add a cross-encoder reranker, which takes your initial set of retrieved chunks and rescores them using a model that evaluates the query and each chunk together as a pair. Now, this consistently and dramatically improves the precision of your results. You'll also want to implement citation enforcement, meaning that the system should explicitly decline to answer if the retrieved chunks do not actually support a response, rather than hallucinating something that sounds plausible. And I would say store all of your prompts in a versioned config file, because prompts are a part of your system architecture, and treating them that way shows real engineering maturity. Now, the third phase is what makes this truly shippable. You're going to curate a golden evaluation data set of around 50 to 200 question-answer pairs that you've manually verified for correctness. Then you'll write an offline evaluation script that measures faithfulness, which essentially asks the question: are the claims in the generated answer actually supported by the retrieved chunks? Now, wire this into your continuous integration pipeline, so that every pull request automatically triggers an evaluation run. If quality drops below your threshold, the build fails. This is exactly how production AI teams operate, and having this discipline visible in your portfolio immediately tells the hiring manager that you understand the full life cycle of an AI system. Now, for your tech stack, I would suggest LangChain or LangGraph for orchestration, ChromaDB or Weaviate for your vector store, Cohere's reranker or cross-encoder models from Sentence Transformers for reranking, and Ragas for your evaluation framework, since it's specifically designed for assessing RAG systems. Now, the second project is building a local AI assistant that runs entirely offline using a small language model. 
And I think this one is more important than a lot of people realize. You might be wondering, why would you bother running a smaller language model locally when you can just call a powerful API like GPT-5? The answer is that in the real world, there are countless scenarios where you cannot or shouldn't share the data with an external service. There are privacy regulations to consider, there are latency requirements that rule out network round trips, there are cost constraints at scale, and there are edge deployment situations where internet connectivity isn't guaranteed. So companies care deeply about these constraints, and most candidates have essentially zero hands-on experience navigating them. So here is what you'll do. Start by installing Ollama, which is genuinely the simplest way to get an open source model running locally on your laptop. Pull down a model in the 3 to 7 billion parameter range; Llama 3.2, Phi-4 Mini and Mistral 7B are all strong choices. And then build either a command line tool or a FastAPI wrapper around it. The most important thing in the first phase is measurement. I want you to rigorously benchmark your model's inference performance: tokens per second, time to first token, and total response latency. Write all of this down and include it in your documentation, because these numbers tell a story about the practical trade-offs of local inference. Now, in the second phase, you're going to add structure and determinism, which is where this project starts to really get interesting from an engineering perspective. You'll enforce a JSON output schema on your model's responses, validate them with Pydantic and implement a retry mechanism that catches invalid outputs and re-prompts once before failing gracefully. This pattern of constrained generation, plus validation, plus retry is something that you'll encounter constantly in production systems, and it's remarkably rare to see in portfolio projects. 
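The validate-and-retry pattern described here might look roughly like the sketch below. To stay dependency-free it uses a small hand-written validator in place of Pydantic, and the `Ticket` schema, the `generate` callable (standing in for a call to a local model) and the wording of the re-prompt are all hypothetical names introduced for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Ticket:
    """Hypothetical target schema; the fields are illustrative only."""
    category: str
    priority: int


def parse_ticket(raw: str) -> Ticket:
    """Validate raw model output; raise ValueError if it does not fit the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    category, priority = data.get("category"), data.get("priority")
    if not isinstance(category, str):
        raise ValueError("'category' must be a string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(priority, int) or isinstance(priority, bool):
        raise ValueError("'priority' must be an integer")
    return Ticket(category=category, priority=priority)


def generate_ticket(
    prompt: str,
    generate: Callable[[str], str],  # assumption: wraps the local model call
    max_retries: int = 1,
) -> Optional[Ticket]:
    """Ask the model, validate, re-prompt up to max_retries times, then give up."""
    attempt_prompt = prompt
    for _ in range(max_retries + 1):
        raw = generate(attempt_prompt)
        try:
            return parse_ticket(raw)
        except ValueError as err:
            # Feed the validation error back so the retry has something to fix.
            attempt_prompt = (
                f"{prompt}\n\nYour previous reply was rejected ({err}). "
                "Reply with a single JSON object only."
            )
    return None  # graceful failure: the caller decides what to do next
```

With Pydantic, `Ticket` would become a `BaseModel` subclass and the body of `parse_ticket` would reduce to a single validation call; the retry loop would stay the same. Returning `None` rather than raising is one design choice among several for "failing gracefully".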
I would also encourage you to experiment with the temperature setting. Run the same set of prompts at temperature 0 versus 0.7 and carefully document the variance in outputs. This demonstrates that you understand the stochastic nature of a language model and you know how to control it when reliability matters. Now, the third phase is a model comparison study. And honestly, this might be the most impressive deliverable of the entire project. So pick three models, let's say Llama 3.2 3B, Phi-4 Mini and Mistral 7B, and benchmark all of them on the same hardware. Compare the memory usage, tokens per second, and most importantly, output quality on a standardized set of 30 to 50 test prompts. Write it all up in a concise technical report with actual numbers and analysis. This is incredibly valuable for your portfolio because model selection is a decision that every AI team has to make regularly. And showing that you can approach this systematically with data, rather than just going with whatever is popular on X or LinkedIn, tells a hiring manager that you think like an engineer. And if you want to go that extra mile, try quantized versions of these models, such as GGUF Q4 or Q5 quantization, and document the quality versus speed trade-off here. Now, that level of rigor will absolutely make your project stand out. And it will also help you understand these basics of AI engineering. Project three is something that almost nobody includes in their portfolio, and that's precisely why I'm telling you to do it. Now, you're going to take that RAG application that you built in project one and add a comprehensive monitoring and observability layer to it. And the reason this matters so much is something I really want you to internalize. In production environments, building the initial system accounts for maybe 30% of the work. 
The remaining 70% is knowing whether it's working correctly, understanding why it fails when it does and being able to diagnose and fix these issues quickly. If you can demonstrate that you can think about systems this way, you immediately set yourself apart from the vast majority of candidates who only know how to build. In the first phase, you're going to instrument every step of your RAG pipeline with tracing. That means for every single request that comes in, you can see exactly which chunks were retrieved, how the reranker reordered them, what prompt was sent to the language model, what the response was and how many tokens were consumed. So tools like LangSmith, Langfuse or Braintrust all make this relatively straightforward to set up. If you're just getting started, I would particularly recommend Langfuse because it's open source and you can self-host it, which means that you won't run into usage issues while you're experimenting and iterating. Then the second phase is about tracking quality metrics over time. And this is where you start thinking like a site reliability engineer, but for AI. You want to measure latency at the P50 and P95 percentiles, not just the average, because averages have a nasty habit of hiding your worst-case performance from you. Then track your cost per request, so you can actually quantify what each query costs in dollar terms. Measure the citation coverage, which is the percentage of your answers that are properly grounded in retrieved evidence, and monitor your failure rate, which means how often the system errors out or produces a response that it cannot support. The deliverable that you're aiming for here is that when somebody asks you what happened when quality degraded last Tuesday, you can actually pull up a dashboard, point to that anomaly and walk them through the root cause. Then the third phase connects everything with regression gating. 
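As a rough illustration of the second-phase metrics and of the threshold check that regression gating relies on, here is a minimal sketch. The `RequestRecord` fields, the nearest-rank percentile definition and the metric names are assumptions made for this example; a tracing tool such as Langfuse would normally supply the raw per-request data.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class RequestRecord:
    """One logged request; the field names are hypothetical."""
    latency_ms: float
    cost_usd: float
    cited: bool   # answer was grounded in retrieved evidence
    failed: bool  # request errored out or produced an unsupported answer


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of empty data")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records: List[RequestRecord]) -> Dict[str, float]:
    """Aggregate the per-request log into the dashboard metrics."""
    n = len(records)
    latencies = [r.latency_ms for r in records]
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "mean_ms": sum(latencies) / n,
        "cost_per_request_usd": sum(r.cost_usd for r in records) / n,
        "citation_coverage": sum(r.cited for r in records) / n,
        "failure_rate": sum(r.failed for r in records) / n,
    }


def below_threshold(metrics: Dict[str, float], minimums: Dict[str, float]) -> List[str]:
    """Return the names of metrics that fell below their required minimum."""
    return [name for name, floor in minimums.items() if metrics[name] < floor]
```

With 18 requests at 100 ms and 2 at 2,000 ms, this reports a P50 of 100 ms, a mean of 290 ms and a P95 of 2,000 ms, which is the tail that the average hides. In a CI job, a non-empty result from `below_threshold` would translate into a non-zero exit code so that the build fails.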
Your evaluation data set from project one now runs automatically as part of your continuous integration pipeline. And if the faithfulness or any other key metric drops below a defined threshold, the build fails and the change doesn't get merged. You're also versioning your prompts and configuration files right alongside your code, because a prompt change can affect the system behavior just as dramatically as a code change. Now, this kind of operational discipline is exactly what production AI teams practice every single day. And seeing something like that in a portfolio project is genuinely rare and impressive. Now, I want to be straightforward with you. This project won't look as flashy as some AI image generator or chatbot with a pretty front end. But this does signal something that employers value enormously: you think about systems holistically, not just the model in the middle. And that system-level thinking is exactly the gap that most AI teams are desperately trying to fill right now. Now, the fourth project is fine tuning. And I want to set the right expectations right now, because there's a lot of confusion in the community about when and why you would actually fine tune a model. Now, fine tuning is not about making a model generally smarter. It is about making a model consistently excellent at a specific, well-defined task where even careful prompting falls short. So before you fine tune anything, you need a clear task where you can credibly demonstrate the gap: here's the best the base model can do with the most carefully engineered prompt, and here is the measurable improvement after fine tuning. For your task, I would strongly recommend either structured JSON extraction from messy unstructured text, or tool-call accuracy, where the model needs to select the right function and populate the correct parameters for that function. And these are the problems where fine tuning provides clear, quantifiable improvements that you can present with actual numbers. 
In the first phase, you'll do supervised fine tuning using LoRA. You're going to start with a clean data set of somewhere between 2,000 and 10,000 examples, and I've added links to some of these clean data sets below. I want to really emphasize the word clean here, because data quality matters far more than data quantity. If your training examples are inconsistent or poorly formatted, you're essentially teaching the model to be inconsistent and poorly formatted, which is obviously counterproductive. Then I'll say either use LoRA or QLoRA for parameter-efficient fine tuning, which is wonderful because it means that you don't need massive GPU clusters to do meaningful work. A single A100 or even a T4 with QLoRA will get the job done. For your base model, I would suggest Qwen3 8B, that should do the job. Then evaluate on a held-out data set with concrete metrics. Things like JSON validity rate, exact match accuracy and refusal correctness, which measures whether the model properly declines to answer when it should. Now, the second phase introduces preference tuning, and this is where you demonstrate that your understanding goes beyond the basics. So, instead of just showing the model what the correct output looks like, you're going to show it comparisons. For the same input prompt, here is a good output and here is a worse one, and I want you to learn the difference. You'll generate multiple outputs per prompt, label which ones are better and which ones are worse, and then train using a DPO-style preference optimization approach. Then re-evaluate on your test set and show the incremental improvement over your supervised fine tuning baseline. This is powerful because it demonstrates familiarity with modern alignment and training techniques that are actively used in the industry today. All the fine tuning steps that we're talking about here are stackable, so they stack one on top of the other. 
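The held-out evaluation from the first phase might be scored along these lines. This is a sketch under stated assumptions: the `REFUSE` sentinel, the choice to compare parsed JSON rather than raw strings, and the exact definition of each rate are illustrative conventions, not something the video specifies.

```python
import json
from typing import Any, Dict, List, Optional

REFUSAL = "REFUSE"  # hypothetical sentinel the model emits when it declines


def parse_json(text: str) -> Optional[Any]:
    """Return the parsed JSON value, or None if the text is not valid JSON.

    Caveat for this sketch: a literal JSON `null` also parses to None and
    would therefore be counted as invalid.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def rate(hits: int, total: int) -> float:
    """Safe ratio that returns 0.0 for an empty denominator."""
    return hits / total if total else 0.0


def evaluate(predictions: List[str], references: List[str]) -> Dict[str, float]:
    """Score model outputs against held-out references.

    A reference is either the refusal sentinel or a JSON string.
    """
    assert len(predictions) == len(references)
    answerable = valid = exact = refusal_ok = 0
    for pred, ref in zip(predictions, references):
        pred_refused = pred.strip() == REFUSAL
        ref_refused = ref.strip() == REFUSAL
        # Correct if the model refuses exactly when the reference does.
        refusal_ok += pred_refused == ref_refused
        if ref_refused:
            continue
        answerable += 1
        parsed = parse_json(pred)
        if parsed is not None:
            valid += 1
            # Comparing parsed values ignores key order and whitespace.
            exact += parsed == json.loads(ref)
    return {
        "json_validity_rate": rate(valid, answerable),
        "exact_match": rate(exact, answerable),
        "refusal_correctness": rate(refusal_ok, len(references)),
    }
```

Running the same function on the base model, the supervised fine-tuned model and the preference-tuned model gives the before-and-after comparison in one table.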
And for your tooling, Hugging Face's TRL library handles both SFT and DPO training really well. Axolotl is another excellent framework that takes a lot of the configuration complexity off your plate. If you need GPU compute without the hassle of managing cloud infrastructure, Fireworks AI is a great platform to use. And here is something that I would highly encourage you to include in your project write-up. Show the training curve, present the before and after metrics very clearly, and honestly discuss what went wrong during the process and how you iterated past it. Hiring managers genuinely love seeing that you can troubleshoot and adapt, because that's what actual day-to-day work looks like. Things are not going to be perfect and you'll have to make these adjustments. Now, the fifth and final project is a real-time multimodal application, and this is your opportunity to demonstrate that you can handle the unique challenges of streaming data, tight latency budgets and the inherent messiness of systems that need to respond in real time rather than in comfortable batch processing scenarios. I would suggest picking one of three tracks depending on what excites you the most. So the first option is a voice assistant, where you take in the audio input through automatic speech recognition, pipe the transcription through a language model for reasoning, and convert the response back to speech with text-to-speech synthesis. The second option is a computer vision pipeline, where a webcam feed goes into an object detection model, and then a language model reasons about what's being detected. The third option is a streaming log analyzer, where you ingest server logs in real time, run anomaly detection, and have a language model generate a human-readable explanation of what's going wrong. 
If you ask me which one I would choose, I would go with the voice assistant track, because voice AI is experiencing remarkable growth right now and the tooling has matured significantly. You can use Deepgram or Whisper for speech recognition, any capable language model for the reasoning layer, and ElevenLabs or Cartesia for speech synthesis. WebSockets work beautifully for orchestrating the whole pipeline. So now in the first phase, just focus on getting the streaming pipeline functioning end to end. You don't have to worry about optimization yet. Your goal is to get data flowing in, structure your events properly and have the language model responding to real-time input. Getting this to work at all is a meaningful accomplishment. Now, the second phase is where you add latency tracking, and this is honestly the part that demonstrates the most engineering maturity. You need to decompose your end-to-end latency into a detailed budget. For a voice assistant, that means separately measuring ASR latency, LLM time-to-first-token, TTS time-to-first-byte and all the overhead in between. I would say build a visualization that shows this breakdown for every single request. Being able to say something like, our total response time is 1.2 seconds, and here is exactly how that breaks down across each component: that kind of analysis is what makes interviewers genuinely excited about a candidate, because it shows that you understand performance engineering. Now, the third phase adds resilience, which is all about what happens when things go wrong. What does your system do when the speech recognition service goes down? What happens when the language model times out? You need graceful degradation strategies. Maybe the system falls back to a simpler response or openly acknowledges a delay, rather than hanging silently. 
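A per-stage latency budget combined with a timeout fallback could be sketched as below, using `asyncio.wait_for` from the standard library. The stage callables (`asr`, `llm`, `tts`), the fallback sentence and the single shared timeout are hypothetical simplifications. The sketch also times each stage as a whole, whereas a streaming system would record time-to-first-token and time-to-first-byte instead.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict, Tuple

Stage = Callable[[Any], Awaitable[Any]]

# Hypothetical canned reply used when an upstream stage is too slow.
FALLBACK_REPLY = "Sorry, I'm running a bit slow. Give me a moment."


async def run_stage(
    name: str, stage: Stage, payload: Any, timeout_s: float, timings: Dict[str, float]
) -> Any:
    """Run one stage under a timeout and record its wall-clock latency in ms."""
    start = time.perf_counter()
    try:
        return await asyncio.wait_for(stage(payload), timeout=timeout_s)
    finally:
        # Recorded even on timeout, so slow stages still show up in the budget.
        timings[name] = (time.perf_counter() - start) * 1000


async def handle_turn(
    audio: bytes, asr: Stage, llm: Stage, tts: Stage, timeout_s: float = 1.0
) -> Tuple[Any, Dict[str, float]]:
    """Process one voice turn; degrade to a canned reply if ASR or the LLM times out."""
    timings: Dict[str, float] = {}
    try:
        text = await run_stage("asr", asr, audio, timeout_s, timings)
        reply = await run_stage("llm", llm, text, timeout_s, timings)
    except asyncio.TimeoutError:
        reply = FALLBACK_REPLY  # acknowledge the delay instead of hanging
    # A TTS timeout is left to propagate in this sketch; a real system
    # would need its own fallback here, such as pre-recorded audio.
    speech = await run_stage("tts", tts, reply, timeout_s, timings)
    return speech, timings
```

The returned `timings` dictionary is the raw material for the per-request breakdown: summing its values and comparing against the end-to-end time shows how much overhead sits between stages.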
So I would say implement timeout handling so nothing blocks indefinitely, and build a replay mode where you can feed recorded input back through the pipeline for debugging purposes. This is how real production systems are designed, and showing that you've thought carefully about failure modes and recovery puts you in a completely different category than somebody who only builds happy-path versions. All right, let me bring all of this back together. So we spoke about five different projects: a production RAG system with proper evaluation and CI gating, a local model benchmarking application, a monitoring and observability layer, a fine tuning project with measurable before and after improvements, and a real-time multimodal application with latency analysis and resilience. Now, each one targets a different skill set that companies are actively looking for right now. And together they all tell a cohesive story about who you are as an AI engineer. As always, I've linked all the tools, frameworks and resources that I've mentioned throughout the video in the description below, so definitely go check that out. And if you found this video helpful, I would really appreciate it if you would subscribe and hit that bell icon so you don't miss my upcoming content. I regularly post about AI and ML career guidance, free resources and learning paths, along with technical deep dives and explanatory videos. And I also share my own journey as an immigrant building a career as an AI leader here in the US. If you have any questions or suggestions for topics that I should cover in the next video, please do put them in the comments below. I do read all of them. Now go build something. I'll see you in the next one.
