[0:00]Hi everyone, welcome to another episode of Hands On with AI Agents. In this episode, we're going to be talking about an agent for brand search optimization. And the cool thing is that it even includes a computer-use sub-agent.
[0:18]Let's dive in. Hi Nikhil. Hi Sita. Thanks for having me. So what even is a brand search optimization agent? A brand search optimization agent automates a search engine optimization workflow: for a given brand, it tries to optimize the brand's product titles by comparing them against the top search results for a set of keywords. It will find the keywords, find the top search results for those keywords, and then compare and contrast them to give you an analysis of the titles you should be optimizing for your brand. Sounds wonderful. Let's jump into it then. I'm guessing we're going to be using ADK here, no surprises there. Yep, we are using Google's recently released Agent Development Kit. So let's go into the architecture then. Talking about the flow, a search engine optimizer working with a brand cares about showing that brand at the top of the search results for a set of keywords. In this workflow, given a brand name, the agent first finds keywords based on the titles, descriptions, and attributes from a BigQuery table. That's step one. Once it has the keywords, it takes the top keywords from the list and uses a web browser agent to search for them on the web. In this case, I'll be using Google Shopping as the example. It gets the titles of the top products from the webpage and then compares them with the products and titles coming from BigQuery. The analysis then yields a report where you can see your brand's titles side by side with the top brands', and you can find insights like which keywords are missing and what extra information needs to be added so that your brand comes to the top for those search keywords. Okay, so now that we have seen the flow of the agent, let's move to the architecture. As I talked about in the introduction to ADK, we have different tools and sub-agents here.
On the left, you can see the brand search optimization root agent. The main job of this agent is to route you to different sub-agents. We have three sub-agents here: the keyword finding agent, the search results agent, and the comparison root agent, which in turn connects to two more sub-agents. Starting with the keyword finding agent, it has a tool for connecting to BigQuery and fetching data. This tool lets you send the brand name and get the title, description, and attributes from a BigQuery table. Then the LLM behind the scenes, a Gemini model, finds keywords based on the titles, descriptions, and attributes we got from BigQuery. So this agent ultimately gives you a list of keywords, and it also lets you choose the top keywords based on some criteria, which you can change in the prompt. The next agent is the search results agent, which has tools such as going to a URL, finding elements, scrolling down, and getting the page source. From the tools, you can guess this is the web browser agent.
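The hierarchy described here can be sketched as a tree of agents. This is a minimal illustration using a plain dataclass as a hypothetical stand-in for ADK's actual Agent class; the names mirror the architecture above, but the exact identifiers in the sample may differ.

```python
from dataclasses import dataclass, field

@dataclass
class AgentSketch:
    # Hypothetical stand-in for an ADK agent: a name, a description,
    # and optional sub-agents the parent can route to.
    name: str
    description: str
    sub_agents: list = field(default_factory=list)

comparison_root = AgentSketch(
    "comparison_root_agent", "Loops generator and critic to produce a report",
    sub_agents=[
        AgentSketch("comparison_generator_agent", "Drafts title comparisons"),
        AgentSketch("comparison_critic_agent", "Critiques and refines the draft"),
    ],
)

root_agent = AgentSketch(
    "brand_search_optimization", "Routes between the three sub-agents",
    sub_agents=[
        AgentSketch("keyword_finding_agent", "Finds keywords from BigQuery data"),
        AgentSketch("search_results_agent", "Web-browses for top search results"),
        comparison_root,
    ],
)

def agent_names(agent):
    # Flatten the tree depth-first to list every agent in the hierarchy.
    names = [agent.name]
    for sub in agent.sub_agents:
        names.extend(agent_names(sub))
    return names

print(agent_names(root_agent))
```

In real ADK code each node would also carry a model, instruction prompt, and tool list; the point here is just the routing structure: one root, three sub-agents, and two nested agents under the comparison root.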
[4:03]Behind the scenes, it uses Selenium and WebDriver. What this agent does is take a keyword, search for it on a website, say Google Shopping, get the top three results for that keyword, and save the titles of those results for the next step, which is comparing with your own brand's titles. Coming to the next agent, we have two parts: the generator and the critic. The generator compares your titles with the top search result titles and creates a report of what the new titles should look like. The critic then improves on the comparison to give you a better report, and this is where you, as a company, can also introduce your own policies and specific rules.
[4:57]So, ultimately, the comparison root agent is where we stop and get a report. Any questions here? Yeah, when you're building this multi-agent system, how do you decide which of these functions should be a sub-agent? How do you scope each of them? That's an excellent question, Sita. This is very similar to software development, where we divide a multi-agent system by specialty and expertise. Sub-agents should have their own specialties. In this case, keyword finding is one of the tasks the multi-agent system does, so it's natural for it to be a sub-agent. This also handles complexity by breaking down the tasks, again similar to software development. It lets you expand reach and capacity in the future: if you want a more complex keyword finding agent, you can replace that sub-agent with something more sophisticated, and it also increases the robustness and flexibility of your overall system. So web browsing is separated out, keyword finding is its own sub-agent, and comparison is a sub-agent whose output is a report. Sounds good. I've also been wanting to ask: why did we not use a web scraping API as an agent, and why use a computer use agent here? What's the need to open up a new tab in a browser? Yeah, that's an excellent question. Web scraping is the natural way to think about getting data from web pages or HTML content. But modern websites load data dynamically after the initial page loads; there is JavaScript code that adds more dynamic elements. What Selenium, or web-browser-based agents, do is open a real web browser, so they can see the dynamically loaded content as well.
[7:05]And there are some websites that need complex user interactions to access the data, for example, clicking a start button or scrolling down the page to actually see the content. Web-browser-based agents, as opposed to scrapers, let you do that as well. There are more advantages, such as getting past anti-scraping measures, extracting data from elements that are not directly in the HTML source, and navigating complex website structures by mimicking real user behavior.
[7:43]So overall, there are advantages to using a web browser agent based on Selenium or WebDriver versus just using a web scraper. And now I guess it's code time. Can you show us how to implement this architecture? Absolutely. Before we dive into the code, let's look at the README and the setup needed for the multi-agent system. This agent needs a BigQuery table set up with the sample data. There are instructions in the README, and as you follow them, you'll find a run.sh script inside the deployment folder. This needs to be run to set up the BigQuery table with sample data. It first installs all the requirements with poetry install and then populates the BigQuery table. If I go inside this Python file, you can see the sample data we will be working with. There are titles, descriptions, attributes, and brand; it creates the dataset if it doesn't exist and populates the BigQuery table. Once this step is successful, what you should see on the GCP side is this table with the title, description, attributes, and brand. Any questions on this? Even if someone doesn't use Google Cloud or BigQuery, they can still use this agent by swapping out this particular component, right? Yes, that's absolutely right. BigQuery is just one of multiple options you can use here, and later on, I'll show how to use your own database client, swapping out BigQuery for your own dataset. It's fairly easy to do that.
[9:50]You just need to change a Python function that takes the brand name and returns the same data as the BigQuery tool. With that, let's jump to the main agent. It's in the brand search optimization package, under the folder brand_search_optimization. When you look at agent.py, this is our main agent, the root agent that does the routing. It's 38 lines of code. It has the sub-agents defined: as I said, keyword finding, search results, and comparison, and it has its own description. When I click on the root prompt, you'll find the prompt it uses, and the description is here: a helpful assistant for brand search optimization. Looking at the prompt, the root agent gathers the brand name, then follows a series of steps: finding the keywords, searching the keywords on the web, and then transferring to the comparison root agent, which has two more agents behind the scenes. It has some constraints, and it needs to complete all the steps. So this is one way to achieve a series of workflow steps. Another way would be to import SequentialAgent from google.adk.agents and use it to run the same steps in order. But for now, I've used prompting to achieve the same goal. With this, let's jump to the sub-agents. The first one is keyword finding. The main function of this agent is to find keywords, using Gemini's inherent capability to understand the titles and descriptions and come up with keywords for a blob of text. What you can see here is that it uses a tool, the one I was referring to earlier. This is the tool you'll need to replace if you're not using GCP or BigQuery.
[12:12]What this tool does is initialize a database client and call it with a SQL query. The query is constructed dynamically, given the table ID, dataset ID, and project, which come from your environment file, and it filters for the particular brand you're looking for. Right now, the results are limited to three, but you can configure it to return more.
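A sketch of how such a query might be assembled, assuming a table with title, description, attributes, and brand columns. The identifiers here are illustrative, not the sample's exact code, and a real implementation should use BigQuery query parameters instead of string interpolation.

```python
def build_brand_query(project_id, dataset_id, table_id, brand, limit=3):
    # Fully-qualified table name assembled from environment-style config.
    table = f"`{project_id}.{dataset_id}.{table_id}`"
    # Illustrative only: prefer parameterized queries in production to
    # avoid SQL injection from untrusted brand names.
    return (
        f"SELECT title, description, attributes FROM {table} "
        f"WHERE brand = '{brand}' LIMIT {limit}"
    )

query = build_brand_query("my-project", "products", "catalog", "ExampleBrand")
print(query)
```

Raising `limit` is how you would configure the tool to return more than three rows per brand.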
[12:41]It will run this query job, get the results, and then transform them into a markdown table like this. This is the formatting we've been working with; you can also use your own formatting if you prefer anything other than markdown.
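A sketch of that transformation, assuming the rows come back as dictionaries; the column set follows the table described above, and the helper name is made up for illustration.

```python
def rows_to_markdown(rows):
    # Render query results as a markdown table; swap this out for any
    # format the downstream keyword-finding prompt should consume.
    header = "| title | description | attributes |"
    divider = "| --- | --- | --- |"
    body = [
        f"| {r['title']} | {r['description']} | {r['attributes']} |"
        for r in rows
    ]
    return "\n".join([header, divider] + body)

table = rows_to_markdown([
    {"title": "Kids' Joggers", "description": "Lightweight shoes",
     "attributes": "size: 10T, color: blue"},
])
print(table)
```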
[13:05]So the output from the keyword finding agent is a list of keywords. If I go back to the prompt, you can also see that I'm asking it to remove duplicate keywords, group keywords with similar meanings, and rank them based on some criteria I've mentioned there. Generic keywords are ranked higher so that your brand comes up at the top for generic keywords, which are more common and frequent.
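In the sample this dedupe-and-rank step is done by the Gemini prompt itself. A deterministic sketch of the same idea, where the generic_terms set is a hypothetical input standing in for the model's judgment of which keywords are generic:

```python
def dedupe_and_rank(keywords, generic_terms):
    # Drop case-insensitive duplicates, keeping the first occurrence.
    seen, unique = set(), []
    for kw in keywords:
        key = kw.lower().strip()
        if key not in seen:
            seen.add(key)
            unique.append(kw)
    # Stable sort: generic (more common) keywords rank first, matching
    # the prompt's instruction to favor generic terms.
    return sorted(unique, key=lambda kw: kw.lower() not in generic_terms)

ranked = dedupe_and_rank(
    ["AcmeGlide 5", "running shoes", "Running Shoes", "kids sneakers"],
    generic_terms={"running shoes", "kids sneakers"},
)
print(ranked)
```

Grouping by similar meaning is left to the model in the real agent, since it needs semantic understanding that a simple function can't provide.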
[13:40]And you can also define your own criteria here to group and rank the keywords. So this is step one, and we'll move to the web browser agent in the next step. And how are you passing the input to this agent? From the root agent? That's a very good question. Yes, if I go back to the root agent, you can see that it takes the keywords from the keyword finding agent and passes them to the search results agent. For debugging purposes, I've also let it use a user-inputted keyword. So there are two modes: it can take the keywords from the keyword finding agent, or it can ask the user for a keyword to search. You'll see the output of all the keywords when this step finishes. You can pick a keyword from that list, or it will pick the top one from the earlier agent. Moving on to the search results agent, this is the agent that does web browsing, and I'll start with the agent definition at the bottom. You can see that it has a list of tools: going to a URL, taking screenshots, finding elements, and so on. What they do behind the scenes is call Selenium actions. There is also a wrapper tool, analyze webpage and determine action, which is used when the task at hand is a bit more complicated. So there are some atomic tools; let's look at them one by one. The first one is go to URL: it uses the driver object, takes the URL, and navigates to it. Taking a screenshot takes the tool context, which is the short-term state that ADK maintains across tools, as input, and saves the screenshot under a file name; this is later used as an artifact by the other tools and agents. Click at coordinates is self-explanatory: it takes X and Y coordinates, finds the element, and clicks on it.
Finding an element with text needs a text pattern; it finds the element in the driver object, and it throws more meaningful errors if it cannot find the element you're looking for. Clicking with text again takes a string, finds the element by the pattern, and clicks on it.
[16:40]Enter text into element is the one this particular workflow will use, because we'll be searching for the keyword on the website. So the text to enter will be our keyword, and the element ID will be the ID associated with the search bar. It enters the text into the search bar and then sends the Enter key to run the search. Scrolling down might be needed, depending on the website we visit. For the Google Shopping example, I've seen the agent scroll down one or two times, but the agent decides when to scroll down on its own.
[17:34]So this is also a useful tool. Getting the page source is the most important tool of the brand search optimization agent, because the agent relies heavily on the page source. To stay within limits, I have put in a hardcoded cap here, which you can change later to a different value. What it does is get the page source and truncate it at that limit; if the page source is shorter than the limit, it is returned unchanged. Analyze webpage, as I said earlier, determines the series of steps needed to achieve an end goal. In this case, the end goal is to find the titles of the top three products. Given a page source, this tool takes the page source and the user's task as input and finds the series of atomic steps the agent needs to take to achieve the goal. So if the goal is to find the titles of the top three products, it can decide to find the search bar, enter the text into the search bar, scroll down if needed, click on some buttons if needed, and then find the top three search results in the source and get their titles.
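The truncation step itself is simple; a sketch, with a made-up limit value (the sample hardcodes its own):

```python
PAGE_SOURCE_LIMIT = 100_000  # illustrative cap, not the sample's exact value

def get_limited_page_source(page_source, limit=PAGE_SOURCE_LIMIT):
    # Truncate oversized page sources so they fit in the model's context;
    # anything under the limit passes through unchanged.
    return page_source if len(page_source) <= limit else page_source[:limit]

print(len(get_limited_page_source("<html>" + "x" * 200_000)))  # capped at the limit
```

Tuning the cap trades completeness of the page against context-window cost; on pages larger than the cap, content near the bottom is simply invisible to the agent.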
[19:16]Now, this is a detailed workflow, Nikhil. Thank you. Basically, we've seen two steps so far. The first is where the sub-agent identifies the keywords, and the second is where it uses the identified keywords to perform a computer-use-driven search. And the root agent can use these sub-agents, and the tools within the computer use agent, as and when it decides to, right? Yes, that's the overall flow, but behind the scenes, the root agent is transferring you to the search results agent, and that agent owns these tools. So the tool usage happens within the search results agent. In this case, when the search agent is triggered, you're still within the scope of that agent until the task is done. Once the task is done, you are transferred back to the root agent for the next routing, which is the comparison. Sounds good. So let's look at the comparison agent, and particularly at the prompt. You can see there is a root agent that calls the generator agent, which is this one, and then the critic agent, and they loop to find the best comparison report. When the comparison critic agent is satisfied, we relay the information back to the user.
[21:00]Now, this is one way to achieve the looping, but ADK also has a built-in LoopAgent, which you can import like this.
[21:16]And then use the LoopAgent instead of the standard LLM agent. But for the purposes of this demo, I've achieved this through a series of prompt instructions, and you'll see it in action shortly. Ultimately, what this agent does is first create a comparison between the titles of your brand and the titles of the top search results, and then compare and critique until it finds the best way to improve the titles. Some brand owners might have their own policies, or company- and business-specific rules, and those can go in the critic agent. That's why we have it here: so it's not just the Gemini model coming up with the comparisons; you have some rules in action to improve on the Gemini-generated comparisons and title suggestions. So that concludes the agentic flow. What we have seen is three sub-agents: keyword finding, search results, and comparison. Ultimately, given a brand name, you get a comparison between your brand's titles and the top search results. With that, let's move on to evaluation. Brand search optimization also shows how to do evaluation given an eval dataset, which you can find inside the eval folder, under data. I have a sample run of the agent from the beginning, where it asks for a brand name, through to the final comparison report. This eval set is then used by ADK's (the Agent Development Kit's) evaluation tool to do the evaluation based on the configs mentioned here.
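The generate-critique loop can be sketched deterministically. Here the generator and critic are stub functions standing in for the Gemini calls, and the policy check is a toy substring test rather than a real brand rule:

```python
def generate_report(titles, top_titles, feedback=None):
    # Stub generator: in the real agent this is a Gemini call that drafts
    # side-by-side title comparisons, optionally folding in critic feedback.
    note = f" (revised per: {feedback})" if feedback else ""
    return f"Compared {len(titles)} brand titles vs {len(top_titles)} top titles{note}"

def critique_report(report, policies):
    # Stub critic: returns (satisfied, feedback). Brand policies and
    # business-specific rules plug in here, so the final report is not
    # purely model-generated.
    for policy in policies:
        if policy not in report:
            return False, f"apply policy: {policy}"
    return True, None

def comparison_loop(titles, top_titles, policies, max_iters=5):
    # Alternate generate and critique until the critic is satisfied or
    # the iteration budget runs out, mirroring the prompt-driven loop.
    feedback = None
    for _ in range(max_iters):
        report = generate_report(titles, top_titles, feedback)
        satisfied, feedback = critique_report(report, policies)
        if satisfied:
            return report
    return report
```

The `max_iters` bound matters in the real system too: without it, a critic that is never satisfied would loop forever.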
[23:25]And this serves as an evaluation tool to find out how many tools were called and whether the agent was able to achieve what it was meant to do. To do this, we write eval.py, import AgentEvaluator, and invoke it like this: AgentEvaluator.evaluate with the module and the data file path. The way to run eval.py is under deployment, in eval.sh, where you can see the command-line utility that comes with ADK: adk eval with the folder name, the dataset file, and a config file path. So why is eval helpful in this case? Because we have so many tools, we want to make sure that the right tools are called for the right keywords and that the agent is taking the right actions.
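The core of "were the right tools called" is trajectory matching: comparing the tools the agent actually invoked against the expected sequence from the eval dataset. A simplified sketch of that idea (not ADK's actual metric; the tool names below are illustrative):

```python
def tool_trajectory_score(expected_calls, actual_calls):
    # Fraction of positions where the agent called the expected tool;
    # a simplified take on trajectory matching, not ADK's implementation.
    if not expected_calls:
        return 1.0
    matches = sum(
        1 for exp, act in zip(expected_calls, actual_calls) if exp == act
    )
    return matches / len(expected_calls)

score = tool_trajectory_score(
    ["get_product_details_for_brand", "go_to_url", "enter_text_into_element"],
    ["get_product_details_for_brand", "go_to_url", "take_screenshot"],
)
print(score)  # 2 of 3 positions match
```

A config file then typically sets the threshold such a score must reach for the eval case to pass.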
[24:31]And once we make sure it's working for a few queries, we can save those sessions and use them to increase the robustness of our agent. You may find that in some cases the agent is not calling the right tools; you can see that in the eval scores and then improve the prompt to ultimately make your agent more robust. So eval is very important when it comes to increasing the robustness and overall flow of the agent.
[35:57]So this concludes my demo. What we have seen so far is what brand search optimization is. We went through the agentic flow, the architecture, and the code, including the three sub-agents, one of which was a web browsing agent. We looked at evaluation, testing, and an example through ADK's web app. The example went from taking the brand name as input through to creating the report. Thank you, Nikhil. Why don't you point us to more resources? Brand search optimization is included as one of the ADK samples on GitHub; you can find it in Google's GitHub repository under ADK samples. The setup is covered in the README, and we're very happy to get more suggestions and contributions from all of you. Thank you all for watching. We hope you liked this video. All of the resources mentioned are in the description. Let us know what you think and what we should build next. Happy coding.



