
Build a Multimodal Live Streaming Agent with ADK

Google for Developers


[0:00]Hi everyone. Welcome to another episode of Hands-on with AI agents. And in this episode, we're going to be talking about multimodal live agents with Agent Development Kit.

[0:15]And today, the video is going to be focused on two main parts. First, we're going to build a multimodal live agent with the Gemini API. And in the second part, we'll see how to build the same agent, but with Google's Agent Development Kit. And to do that, we have two special guests here today, Heiko and Socrates, who are Gen AI Black Belts. Hi folks. Hi Sita, really nice to meet you and thanks for having us. And hi Sita, thank you for having me again. And Heiko, let's start with you. All right, let's get going. As Sita already mentioned, today we're going to see how to build live agents with which we can interact via voice, image, and video, using Google's latest Agent Development Kit (ADK) and the Live API. So first, the architecture. There are four key components to this architecture. We have a client, which in this case is just a UI in the browser. We have a proxy, which establishes the connection to the multimodal Live API and also handles everything around tool usage and authentication. So the proxy, built on the Gen AI Python SDK, is the heart of this application. Every request that comes in from the user is routed to the proxy before it goes to the Live API. The Live API is obviously the brain of the whole application, but because the Live API is provided by Google, there's not much you have to do yourself. Now, the neat part is that you can build your own tools and your own agents in order to customize it and give it the capabilities that you want. In this case, we have a Get Weather tool, which runs as a Cloud Run function and interacts with a weather API whenever the user asks a weather-related question. So if the request is "What is the weather in London?", Gemini's multimodal Live API will recognize that the request is about weather and that it needs to use a tool to get the current weather for London. What Gemini will then do is issue a function call, just like you're used to with regular LLMs. The function call is then passed to the tool handler, which decides which tool the application needs to use. And that is actually part of the challenge you'll face when you build an application like this, and Socrates will later show you how it can be simplified with the Agent Development Kit. But for now, let's look at the tool handler itself so you understand exactly what is happening. The tool handler sits on the server side, at the core of the server-side code. It basically executes a tool based on what comes back from the Live API. So the Live API will decide: for this weather-related question, I actually need the weather tool. You then have to take the function call that comes from the Live API, parse it, and match it with the corresponding tool running in the background. The tool itself is, as I mentioned, a Cloud Run function. A rough sketch of the tool handler's dispatch step is shown below.
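For illustration, here is a minimal sketch of that dispatch step in Python. It is a hedged example rather than the demo's exact code: the registry layout, the stub tool, and the function names are assumptions.

```python
# Minimal sketch of a tool handler: match a function call coming back
# from the Live API to a local tool and execute it. Names and the stub
# implementation are illustrative assumptions, not the demo's code.
from typing import Any, Callable


def get_weather(city: str) -> dict[str, Any]:
    """Stub standing in for the real Cloud Run weather tool (sketched later)."""
    return {"city": city, "note": "replace with a real weather lookup"}


# Registry mapping the function names the Live API may emit to local tools.
TOOLS: dict[str, Callable[..., dict[str, Any]]] = {"get_weather": get_weather}


def handle_tool_call(name: str, args: dict[str, Any]) -> dict[str, Any]:
    """Dispatch a function call from the Live API to the matching tool and
    return its result, so it can be sent back as a tool response."""
    tool = TOOLS.get(name)
    if tool is None:
        return {"error": f"unknown tool: {name}"}
    return tool(**args)
```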

[4:37]So if you go to the weather tools and then the get weather tool, you will see the implementation of the tool itself: how it interacts with the OpenWeather API, how it gets the information from the weather API in real time, and how it sends that information back to the Live API, which then formulates a response that sounds human to the user. So let's have a look at how that works in real life. Hey Gemini, how's it going? Hey, I'm doing well. How about you? I'm doing well too. Can you please get me the weather for London? The weather in London is scattered clouds with a temperature of 14 degrees and 62% humidity. Excellent. So what you have just seen is everything that goes on in the background. I asked Gemini about the weather in London. Gemini realized it needed a tool for this, so it called the Get Weather function. It also figured out which parameter to use: because I asked for the weather in London, the city parameter needs to be London. The tool then interacts with the weather API, gets the response back, and puts that response back into the Live API, so the Live API can respond to me in human language: it is 14 degrees in London with scattered clouds. That's basically how you build a live application with Google's Gemini Live API and very basic tools like the weather tool. And now I want to hand over to Socrates, who can show us how to make it even easier for you to build a live interactive application with agents in it. Socrates, I hope you are ready for us. I'm ready for this.
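For reference, here is a hedged sketch of what such a Get Weather Cloud Run function might look like. The OpenWeatherMap endpoint and its response fields are real; the function name, environment variable, and returned keys are illustrative assumptions rather than the demo repository's exact code.

```python
# Sketch of an HTTP Cloud Function that looks up the current weather via
# OpenWeatherMap and returns the fields the Live API needs in order to
# phrase a spoken answer. Names and response shape are illustrative.
import os

import functions_framework
import requests

OPENWEATHER_URL = "https://api.openweathermap.org/data/2.5/weather"


@functions_framework.http
def get_weather(request):
    """Expects ?city=London and returns a small JSON weather summary."""
    city = request.args.get("city", "London")
    resp = requests.get(
        OPENWEATHER_URL,
        params={
            "q": city,
            "appid": os.environ["OPENWEATHER_API_KEY"],  # assumed env var
            "units": "metric",  # Celsius, as in the demo's answer
        },
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    return {
        "city": city,
        "conditions": data["weather"][0]["description"],
        "temperature_c": data["main"]["temp"],
        "humidity_pct": data["main"]["humidity"],
    }
```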

[6:48]Thank you, Heiko, for the amazing demo, and let's now see how we can simplify your life with Agent Development Kit. Up to this point, what Heiko showed you included a front end: a web interface that captured audio, text, video, and screen sharing and connected to the proxy, and the proxy was then responsible for doing all the magic and interacting with the Gemini live streaming API. And whenever we had a response, we would take the audio and text and present them in the web interface. Now, with Agent Development Kit, what we do is take this proxy, remove it completely, and add the implementation that I will show you next. In this back end, on the proxy side, we need just two main functions. The first is handle client messages, in which we push all the audio, text, or video data we have into a live request queue, as we call it in ADK. ADK implements a wrapper around the Gemini live streaming API, which means we don't need to know how to set up web sockets ourselves in order to interact with the model and get results back; ADK handles all of that itself. The overall shape of that back end is sketched below.
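To make that structure concrete, here is a minimal skeleton of such a back end, assuming a FastAPI websocket endpoint; the route, handler names, and placeholders are assumptions for illustration, and the two handler bodies are sketched in the next section.

```python
# Skeleton of the ADK-based back end that replaces the hand-written proxy:
# one websocket endpoint running the two handler coroutines concurrently.
# FastAPI and the names used here are assumptions for illustration.
import asyncio

from fastapi import FastAPI, WebSocket
from google.adk.agents import LiveRequestQueue

app = FastAPI()


async def handle_client_messages(ws: WebSocket, queue: LiveRequestQueue):
    ...  # client -> ADK: push audio/image/text into the live request queue


async def handle_agent_responses(ws: WebSocket, live_events):
    ...  # ADK -> client: forward ADK events (audio, text, tool calls) to the UI


@app.websocket("/ws")
async def live_endpoint(ws: WebSocket):
    await ws.accept()
    queue = LiveRequestQueue()
    # live_events would be the event stream returned by runner.run_live(...)
    # (see the final sketch); ADK manages the Live API connection itself.
    live_events = None  # placeholder for runner.run_live(...)
    await asyncio.gather(
        handle_client_messages(ws, queue),
        handle_agent_responses(ws, live_events),
    )
```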

[8:16]That means it reads from the live queue, wraps the messages in the right way, and inserts them into the Gemini live streaming API. Then, every time it receives a response from the model, it translates those responses into the events that Agent Development Kit exposes to end users. Accordingly, you need to implement a second handler, a second function, that is responsible for receiving all these messages and returning them to the front end. Of course, you can use the same messages for debugging or other purposes as well. But let's see what the code looks like. The first function is handle client messages. Let's have a look. What happens is that the back end receives all the messages through a web socket connection with the front end. All the data is located in a JSON object that can have different types, because we receive audio, images, and text from the front end. In that case, all we need to do is identify what type of data we received and then, as I mentioned earlier, within the session of this particular agent, fill the live request queue with the data we received; in this particular case, we have audio data. As simple as that. We don't need to do anything else: we send the request directly to ADK, and ADK is responsible for wrapping the data, transforming it, and sending it to the live streaming API. In a similar way, we send images, as you can see here; the only difference is that we need to set the right MIME type, for example image/jpeg. And if we need to send text, the only thing is to prepare the content and send just that content to the live request queue. As simple as that. So we don't need to spend effort creating web sockets, wrapping the data in the right way, and then sending it to the model; ADK does this for you. That is the function responsible for sending data through ADK to the live streaming API. But how can we decode and receive the events back? That is the second function we need to describe: handle agent responses. As we have seen in other videos as well, the interaction with the agents we build is based on events. From my session, I read all the different events I am getting. The first event I can get from ADK is about interruption. Interruption is a flag that comes together with the packages from ADK, so if it is an interruption, I send an interrupted signal back to the UI. If instead it was a function call, ADK immediately has all the necessary information to identify that this is a function call: it replies with the name and the arguments it needs to trigger, and it also triggers the tool behind the scenes for us. That means we don't need to create any kind of tool handling; the whole tool handling happens inside ADK. The next part is about getting the function responses. As you can see, it is a different flag from which we can get the name of the function that was triggered and the response we got back, and we send that back to the UI. The next one is about getting the data back, meaning a response, a text: it is the transcript. By the way, with live agents we have the ability to return the audio and the transcripts at the same time to the end user. Both handler functions are sketched below.
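A hedged sketch of both handlers follows. The LiveRequestQueue calls and the ADK event fields (interrupted, turn_complete, partial, inline_data, get_function_calls, get_function_responses) follow ADK's streaming API, while the JSON message shape exchanged with the browser (mime_type and base64 data fields) is an assumption about this particular front end, not part of ADK.

```python
# Hedged sketch of the two handler functions described above. The
# browser-facing JSON schema is assumed; the ADK/GenAI types are not.
import base64
import json

from fastapi import WebSocket
from google.adk.agents import LiveRequestQueue
from google.genai import types


async def handle_client_messages(ws: WebSocket, queue: LiveRequestQueue):
    """Client -> ADK: push audio chunks, video frames, and text into the queue."""
    async for raw in ws.iter_text():
        message = json.loads(raw)
        mime_type = message["mime_type"]
        if mime_type in ("audio/pcm", "image/jpeg"):
            # Audio and video frames go in as realtime blobs; ADK wraps them
            # and forwards them to the Live API for us.
            queue.send_realtime(
                types.Blob(data=base64.b64decode(message["data"]),
                           mime_type=mime_type)
            )
        elif mime_type == "text/plain":
            # Plain text goes in as regular user content.
            queue.send_content(
                types.Content(role="user",
                              parts=[types.Part.from_text(text=message["data"])])
            )


async def handle_agent_responses(ws: WebSocket, live_events):
    """ADK -> client: translate ADK events back into messages for the UI."""
    async for event in live_events:
        # Interruptions and turn boundaries arrive as flags on the event.
        if event.interrupted or event.turn_complete:
            await ws.send_text(json.dumps({
                "interrupted": bool(event.interrupted),
                "turn_complete": bool(event.turn_complete),
            }))
            continue
        # Tool handling happens inside ADK; we only report it to the UI.
        for call in event.get_function_calls():
            await ws.send_text(json.dumps(
                {"function_call": call.name, "args": call.args}))
        for resp in event.get_function_responses():
            await ws.send_text(json.dumps(
                {"function_response": resp.name, "response": resp.response}))
        if not event.content or not event.content.parts:
            continue
        part = event.content.parts[0]
        if part.inline_data and part.inline_data.mime_type.startswith("audio/"):
            # Audio comes back as inline data; base64-encode it for the browser.
            await ws.send_text(json.dumps({
                "mime_type": "audio/pcm",
                "data": base64.b64encode(part.inline_data.data).decode("ascii"),
            }))
        elif part.text and not event.partial:
            # Forward only the complete (non-partial) transcript text.
            await ws.send_text(json.dumps(
                {"mime_type": "text/plain", "data": part.text}))
```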
In this if statement, what you can see is that if the data is text, for example, and we have the full content, we have a check saying: if it is not partial information, return this full text back to the user. Why is this important? Because ADK will return the text, the transcription, in real time, part by part, as it recognizes it; or it can return, in the way we have here, the complete text of the interaction. Then we also have a way to return images, but most importantly, we have a way to return the audio data. How do we get the audio data? Again, it is very simple: we get it from the inline data of the particular event, and then we return that data back to the UI. As you can imagine, with these two simple functions, which are not more than 100 lines of code, you can implement live agents. The last missing part is how we instantiate such an agent. We saw in earlier videos that in order to create a normal agent, you use the Agent class. Let's have a look. Okay, what are the key components for instantiating a live agent? ADK first gives you the ability to create a voice config: you can configure what type of voice you want to have. Second, it gives you the ability to configure the run of this live agent. Here we have enabled the audio transcript, for example; as you can see, we pass the audio config to set up the voice the way we want, and we also set the response modalities, which in our case will be audio with transcripts enabled. And then the only thing we need to do is tell the runner to run live. The difference from a normal agent is that there ADK says runner.run; here we say runner.run_live, and we immediately transform a simple agent into a live agent. The only other difference with run live is that we also need to pass the configuration of this run. That was the part about how we instantiate and start the runner. And for completeness, the only piece I want to show you is how I instantiated the agent itself. It is exactly the same way we have seen in other videos: I define an agent constructor with a specific model, a specific name, a specific description, specific instructions (global and local), and a specific set of tools. And that's all. This is exactly the same way we would instantiate any agent, even if it were not a live agent. Let's see the outcome of making all these changes and introducing ADK as the back end of our live agents. Is it different from what Heiko showed you earlier, or is the performance exactly the same? Let's see a demo. Hi Gemini, can you hear me? Hello Socrates Kartakis, how can I help you with Google Cloud today? Could you let me know what your name is, please? I am a personal assistant AI designed to provide information and support related to the business. But do you have a specific name? You can call me Gemini. Okay, great. Could you let me know what Gen AI services Google Cloud provides?

[16:42]Yes, Google Cloud offers Agent Assist, Agent Development Kit (ADK), and Gemini for Google Cloud, which provides AI-powered end-user assistance with a wide range of Google Cloud products. Okay, could you give me more details about Agent Development Kit? This sounds interesting. Agent Development Kit (ADK) is a flexible and modular framework for developing and deploying AI agents. ADK can be used with popular LLMs and open-source generative AI tools and is designed with a focus on tight integration with the Google ecosystem and Gemini models. ADK makes it easy to get started with simple agents powered by Gemini models and Google AI tools while providing the control and structure needed for more complex agent architectures and orchestration. Okay, this sounds interesting. Thank you very much for your help. I will dive into the ADK documentation on my own. You're welcome, Socrates. If you have any more questions in the future, feel free to ask. So as you saw already, with ADK we were able to run different functions, receive results, interact with the model, and interrupt the model. I was actually extremely rude to Gemini by interrupting it continuously, and the performance was extremely good, essentially identical to what Heiko showed you. So ADK helped us simplify the development of live agents down to just two functions. We only had to configure the voices the way we wanted (of course, you could use the default options), and then the rest was just running the agent with the run_live function. And that's all. Boom, you have a live agent up and running. A sketch of that setup follows below.
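For completeness, here is a hedged sketch of the last pieces Socrates described: instantiating the agent and starting it with run_live. The model name, voice, app/user IDs, and the toy tool below are illustrative assumptions; the RunConfig fields mirror what the video describes (voice config, audio responses, transcription).

```python
# Hedged sketch of instantiating a live agent and starting it with
# run_live. Model name, voice, app/user IDs, and the toy tool are
# illustrative assumptions, not the demo repository's exact values.
from google.adk.agents import Agent, LiveRequestQueue
from google.adk.agents.run_config import RunConfig
from google.adk.runners import InMemoryRunner
from google.genai import types


def get_weather(city: str) -> dict:
    """Toy tool; in the demo this would call the Cloud Run weather function."""
    return {"city": city, "note": "replace with a real weather lookup"}


root_agent = Agent(
    model="gemini-2.0-flash-live-001",  # assumed Live-API-capable model name
    name="live_assistant",
    description="A personal assistant that can answer questions and use tools.",
    instruction="Be concise and friendly. Use tools when they help.",
    tools=[get_weather],  # tools are registered directly; ADK handles the calls
)

run_config = RunConfig(
    # Voice configuration for the spoken replies.
    speech_config=types.SpeechConfig(
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Aoede")
        )
    ),
    # Respond with audio, and also emit transcripts of that audio.
    response_modalities=["AUDIO"],
    output_audio_transcription=types.AudioTranscriptionConfig(),
)

runner = InMemoryRunner(agent=root_agent, app_name="live_app")


async def start_live_session(queue: LiveRequestQueue):
    session = await runner.session_service.create_session(
        app_name="live_app", user_id="demo_user"
    )
    # run_live (instead of run) is what turns the same agent into a live agent.
    return runner.run_live(
        session=session, live_request_queue=queue, run_config=run_config
    )
```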

Exactly. And that's what AI engineers really want to focus on, right? They don't want to focus on how to build a UI or how to build a proxy server; they actually want to build agentic systems. And now with ADK, it is super easy to build agentic systems in a live fashion, which is super powerful. Probably the next step for everyone is to go to the ADK public documentation and have a look at how they can start implementing live agents using the wrapper from ADK.

[20:24]Absolutely. Thank you both, Heiko and Socrates, for this great session on how to build multimodal live agents with Gemini and Google's ADK. And for all of you watching out there, thank you for tuning in. We hope you liked this video. Check out the links in the description for all the resources, and let us know in the comments what you thought about this video and what we should build next. Thank you. Thanks, everyone. Thank you. Bye-bye.
