
AWS Glue Tutorial for Beginners| Learn everything about Glue in 30 mins| Glue Data Catalog| Glue ETL

AWS Made Easy

33m 44s · 5,329 words · ~27 min read
YouTube auto captions

This transcript was extracted from YouTube's auto-generated caption track. The transcript below is server-rendered so it can be read, searched, cited, and shared without opening the original YouTube player.


[0:01]Hello everyone, welcome to the Zero to Hero course on AWS Glue. In this course, we are going to discuss the AWS Glue service in depth, understand its basic concepts, and also do some hands-on labs. Firstly, let's see the topics that we are going to cover in this course. We are going to discuss what AWS Glue is and what it is used for, then the different Glue components, and we'll walk through the AWS Glue console and the components that are there. After that, we are going to do some very quick prep for the hands-on lab in the later sections. Then we will start discussing the main components of AWS Glue. One part is the Data Catalog, which consists of Databases, Tables, Crawlers, and Connections. In the next section, we are going to discuss AWS Glue ETL, which consists of Glue Jobs and Triggers. These are the topics that we are going to cover in this course. Along with the theory, we are also going to do some hands-on to better understand Glue and see how it works. So, without any delay, let's get started.
To begin with, let's discuss what AWS Glue is and what it is used for. Going by the AWS definition, AWS Glue is a fully managed ETL service. Let's discuss these two parts separately: ETL service and fully managed. ETL stands for Extract, Transform, and Load. You can have your data in a source, extract the data, do some transformations, and load it to a target.
This process is called ETL in data engineering pipelines, and Glue provides this ETL functionality. The other thing is that it's fully managed, which means AWS manages this service for you. It manages all the backend infrastructure, the servers, and the software provisioning. You don't need to provision or deploy any servers, and you don't need to install any software. AWS does that for you completely, which is why it is called a fully managed ETL service.
There is another aspect to Glue besides ETL. The two main features of AWS Glue are the Data Catalog and the Spark ETL engine. So, what is the Data Catalog? It's a persistent technical metadata store. What do we mean by that? You can have your data in any of several data stores, such as S3, RDS, DynamoDB, or others in AWS, and the metadata of this data can be stored in a catalog in AWS Glue. By metadata, we mean things like the schema-related information and the source where the data is stored. So, you can have a centralized metadata repository in AWS Glue for data which is sitting in many services in AWS. Glue can connect to 70 different data sources, and the catalog can be connected to any of those, so you can maintain the metadata about your data in a centralized Data Catalog.

[3:59]We'll discuss the use of this Data Catalog in detail in the later sections, but I hope you got a fair picture of what the Data Catalog and ETL stand for. Using the Glue ETL engine, you can visually create, run, and monitor your ETL pipelines. We will see how to create those ETL jobs in the hands-on sections, but that's what the ETL functionality of Glue means. I hope I was able to give you a fair picture of the two main functionalities of AWS Glue, which are the Data Catalog and the ETL engine.
With that idea, let's see the different Glue components and discuss each one in detail. There are two main categories of components in AWS Glue: Data Catalog and ETL. Under Data Catalog, we have Databases, Tables, Crawlers, and Connections. Under ETL, we have ETL jobs, Triggers, and Workflows. These are the main components of AWS Glue, and we will discuss each of those concepts in detail in this course.
So, let's go to the AWS console and see what the AWS Glue console looks like and what components it has. Here I'm logged in to my AWS console. Let's go to AWS Glue. This is how the Glue console looks. If you see here, these are the two main components, like we discussed: one is Data Catalog, and the other is Data Integration and ETL. If you expand Data Catalog, there are Databases, Tables, and Stream Schema Registries. That last one is a newer feature of AWS Glue which we are not going to discuss in this video, as I think it is not required for a basic understanding of Glue. We will discuss it in upcoming videos. And then there are Connections, Crawlers, and so on.
These are the components that we are going to look into in Data Catalog. If you expand the ETL section, there are ETL jobs, Visual ETL, and Notebooks, so you can write your code in a notebook, or you can visually create and edit your ETL jobs. We'll see that in the hands-on section. There are also interactive sessions, Triggers, Workflows, and so on. So, this is how the AWS Glue console looks, and these are the different components and where you can find them.
Now, let's discuss each component one by one in detail. Firstly, we'll talk about the Database. A database is a component of the Glue Catalog, like we discussed. It's a logical container within the Glue Data Catalog that stores metadata tables. It is just a logical container, meaning there is no physical database as such. You can create tables in Glue, and all those tables can be grouped under a database. I repeat, it's not a physical database. And the tables are not physical either. We just create a metadata table, that is, a table definition, and the data still sits in the source itself. These tables contain information about data stored in various sources, such as Amazon S3, Amazon RDS, Amazon Redshift, and more. The data still sits in its original source, and the tables just contain information about that data. The grouping of these tables into a logical namespace is called a database. I hope I was able to convey that idea clearly. The Glue database helps organize and manage metadata, making it easier to catalog, search, and query your data assets.
By grouping your tables into a database, it becomes easier to organize and manage your metadata. And like we discussed, it's not a physical database, just a logical namespace. So, let's see in the Glue console how to create a database.
Before we get to that, we need to do some prep for the hands-on lab. We just need to create two things: an S3 bucket, which will be required for the demo, and an IAM role. We will create these two and then get started. Let's create the S3 bucket first. I'll open S3 and create a bucket here. I'm going to create all the resources in North Virginia (us-east-1), so make sure you create everything in the same region. I'm going to call it Glue tutorial bucket, and leave all the default settings.

[8:55]Let's create. It says that the name already exists, so let's call it Glue tutorial bucket one. The name needs to be unique across AWS. Now we have a bucket, and we are going to use it for all the demo purposes. The next thing we need to do is create an IAM role. Click on Roles here, and click on Create Role. Select AWS service, and let's search for Glue. We are going to create a Glue service role. Click on Next. For the purpose of this tutorial, we'll just give Amazon S3 full access, because this role will have to scan S3 data, and we'll also give CloudWatch Logs full access. I think this should be enough for now. If we see any errors, we can come back and add permissions at a later stage. Let's call this Glue tutorial role.
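The same role can be sketched in code. This is a minimal sketch, not what the video runs: it assumes boto3 and AWS credentials are available, and the role name `glue-tutorial-role` is a guess at how "Glue tutorial role" was spelled in the console. The two managed policy ARNs correspond to the S3 and CloudWatch Logs full-access policies mentioned above.

```python
import json

# AWS-managed policies matching the two permissions chosen in the console.
MANAGED_POLICY_ARNS = [
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/CloudWatchLogsFullAccess",
]


def glue_trust_policy() -> dict:
    """Trust policy that lets the Glue service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_glue_role(role_name: str = "glue-tutorial-role") -> None:
    """Create the role and attach the policies (needs boto3 and credentials).

    The default role name is an assumption, not taken from the video.
    """
    import boto3  # imported lazily so the helpers above work without boto3

    iam = boto3.client("iam")
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(glue_trust_policy()),
    )
    for arn in MANAGED_POLICY_ARNS:
        iam.attach_role_policy(RoleName=role_name, PolicyArn=arn)
```

Full-access policies keep the demo short; a real project would normally scope the role down to the specific bucket and log groups.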

[10:08]And click on Create. Now we have the role ready. Next, we will create a folder in the bucket and call it Landing zone.

[10:31]Create folder. And we'll create another folder inside it called Customers, where we can upload the data. I'm going to do one more thing here: create a folder named load_date=20230603. I'll tell you why I'm doing this later, but we will create this folder.
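The folder layout can be sketched as S3 object keys. S3 has no real folders; a key prefix such as `landing_zone/customers/load_date=20230603/` plays that role. In the sketch below the prefix spelling, the file name `customers.csv`, and the bucket name passed in are all assumptions, since the transcript only gives the spoken names.

```python
def partition_prefix(base: str, table: str, load_date: str) -> str:
    """Build a key prefix like landing_zone/customers/load_date=20230603/."""
    return f"{base}/{table}/load_date={load_date}/"


def upload_sample(bucket: str, local_path: str, load_date: str = "20230603") -> str:
    """Create the bucket and upload one CSV under the dated prefix.

    Needs boto3 and credentials. The names used here are placeholders,
    not the exact ones from the video.
    """
    import boto3  # imported lazily so partition_prefix works without boto3

    key = partition_prefix("landing_zone", "customers", load_date) + "customers.csv"
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket=bucket)  # us-east-1 needs no LocationConstraint
    s3.upload_file(local_path, bucket, key)
    return key
```

Uploading an object under the prefix is enough for the console to show the nested folders.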

[11:11]And then upload our data into this folder. Let's upload a sample CSV file here. This is our source data, which we are going to use for demo purposes. Now we have everything that we need ready, so let's go ahead and create a database in the Glue Catalog. Here in the Glue console, click on Databases, and click on Add Database. Let's call this sample_db, or, say, project_db. You can call it whatever you want. It's just a logical grouping of all the tables. Location is optional. Click Create Database. It's very simple; we just created a database.
Next, let's see what tables are. We already discussed what a table means: it's the metadata of your data which is stored in various data sources. AWS Glue tables play a crucial role in organizing, querying, and transforming data by providing a structured way to describe and access data. Again, it's not a physical data table, as in the data doesn't move into this table in AWS Glue. The data still sits in your source, which can be S3, RDS, etc., and the table just contains the metadata about your data, like where it is stored and what the schema is. So, let's see how to create a table in the Glue Data Catalog. We are going to create the table under this database, so let's open the database and click on Add Table. We are going to create the table for the data that we just uploaded to S3. The name of this table is going to be customers, the database is project_db, it's a standard AWS Glue table, and the source is S3.
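To make the "metadata only" point concrete, here is a hedged sketch of the request bodies that the Glue API calls `create_database` and `create_table` accept. The column names and the S3 location are hypothetical placeholders (the video never lists the CSV columns); the point is that a table definition holds only a schema, a format, and a pointer to where the data lives.

```python
def database_request(name: str, description: str = "") -> dict:
    """Keyword arguments for glue.create_database."""
    database_input = {"Name": name}
    if description:
        database_input["Description"] = description
    return {"DatabaseInput": database_input}


def csv_table_request(database: str, table: str, location: str, columns: list) -> dict:
    """Keyword arguments for glue.create_table describing a comma-delimited CSV.

    columns is a list of (name, type) pairs; no data is copied anywhere.
    """
    return {
        "DatabaseName": database,
        "TableInput": {
            "Name": table,
            "TableType": "EXTERNAL_TABLE",
            "Parameters": {"classification": "csv"},
            "StorageDescriptor": {
                "Columns": [{"Name": n, "Type": t} for n, t in columns],
                "Location": location,
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                    "Parameters": {"field.delim": ","},
                },
            },
        },
    }


def create_catalog_objects(db_req: dict, table_req: dict) -> None:
    """Send both requests (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builders work without boto3

    glue = boto3.client("glue", region_name="us-east-1")
    glue.create_database(**db_req)
    glue.create_table(**table_req)
```

A call such as `csv_table_request("project_db", "customers", "s3://example-bucket/landing_zone/customers/", [("customer_id", "string")])` would describe the manual route; the crawler discussed next fills in the same structure automatically.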

[13:20]The data is sitting in our account itself, and we can give the S3 path. We'll open this bucket, then Landing Zone, then Customers, and select this entire customers folder. The format of the data is CSV, and the delimiter is comma. Click on Next. Here you can edit the schema manually: you can click on Add and specify your column name, data type, etc. There are two ways of creating a table. You can manually define the schema and everything else for your table, or you can run what is called a Glue Crawler to create the table. So, let's discuss what a Glue Crawler is first, and then come back and create the table using one.
A Crawler is a program that connects to a data store, automatically scans the data, determines its schema, and creates metadata tables in the Glue Data Catalog. This picture depicts it very accurately. There is data sitting in your data store, and here is your Data Catalog. The Crawler connects to the data store, scans the data sitting there, and infers things like its schema. Then, using that schema, it creates a metadata table in the Glue Data Catalog. That is the functionality of a Crawler.
With that understanding, let's now create a table using a Crawler. We will abandon this process, because here we would have to edit the schema manually. Let's go back to Databases, click on the database, and click on Add Tables Using Crawler. The other option is for manually adding a table. Let's call this customers, and click on Next. Is your data already mapped to Glue tables? Not yet. The data source is S3. Browse S3: Glue tutorial bucket one, Landing zone, customers.
We will select this entire customers folder. We are going to choose Crawl all sub-folders, and click Add S3 data source. Click on Next. Now we need to select the IAM role which the Crawler will use to scan this data and create a table for us. We are going to use the role that we created in the previous step. Click on Next. For the database in which the table needs to be created, select the project_db that we created.

[16:06]If you want to add a prefix to the table name, you can add it. Otherwise, the name of the folder, which is customers in this case, will be used as the table name. Let's keep it as customers. Then there is frequency: when do you want to run this Crawler? We are going to select On-demand, so that it runs only when we need it. We'll review everything and click on Create Crawler. Now the Crawler is created. Let's click on Run Crawler. Let's go back here and, if you see, the Crawler is running. Let's wait for it to finish and see if it creates a table.
The Crawler has run, and it says it's in the stopping state now. Let's click on this Crawler and see what happened. It says that it has failed. Let's see what the error is. Click on View CloudWatch Logs and look at the logs for that Crawler run. We see that the role is not authorized to perform a Glue action. So, it looks like we need to add Glue permissions to this role as well. Let's go back to IAM, add those permissions to the role, and rerun. Add permissions. I think that should work. Now let's go back and rerun the Crawler.
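The crawler configured above maps onto the Glue `create_crawler` call roughly as follows. This is a sketch under assumptions: the bucket path and role name are placeholders, and the video does not say which policy was added to fix the authorization error, so attaching the AWS-managed `AWSGlueServiceRole` policy is one plausible fix rather than the one actually used.

```python
# Assumption: one managed policy that grants the Glue permissions a crawler
# role typically needs. The video does not name the policy it attached.
GLUE_SERVICE_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"


def crawler_request(name: str, role: str, database: str, s3_path: str,
                    table_prefix: str = "") -> dict:
    """Keyword arguments for glue.create_crawler with a single S3 target."""
    request = {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if table_prefix:
        request["TablePrefix"] = table_prefix
    # No "Schedule" key: the crawler runs only when started explicitly,
    # which corresponds to the On-demand choice in the console.
    return request


def grant_glue_permissions(role_name: str) -> None:
    """Attach the assumed Glue policy to the role (needs boto3 and credentials)."""
    import boto3

    boto3.client("iam").attach_role_policy(
        RoleName=role_name, PolicyArn=GLUE_SERVICE_POLICY_ARN
    )


def create_and_start_crawler(request: dict) -> None:
    """Create the crawler and kick off one run (needs boto3 and credentials)."""
    import boto3

    glue = boto3.client("glue", region_name="us-east-1")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
```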

[17:58]It's running again. Let's wait for this run to complete and see if it works.
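Waiting for the crawler can also be done from code by polling its state, which moves through RUNNING and STOPPING before returning to READY. The sketch below is one way to do it; the polling interval and timeout are arbitrary choices, and the crawler name is whatever was used when creating it.

```python
import time


def wait_for_crawler(get_state, poll_seconds: float = 15.0,
                     timeout_seconds: float = 900.0, sleep=time.sleep) -> float:
    """Poll until get_state() returns "READY"; return the seconds waited.

    get_state is any zero-argument callable returning the crawler state,
    which keeps this loop testable without AWS access.
    """
    waited = 0.0
    while True:
        if get_state() == "READY":
            return waited
        if waited >= timeout_seconds:
            raise TimeoutError("crawler did not return to READY in time")
        sleep(poll_seconds)
        waited += poll_seconds


def crawler_state(name: str) -> str:
    """Current crawler state from the Glue API (needs boto3 and credentials)."""
    import boto3

    glue = boto3.client("glue", region_name="us-east-1")
    return glue.get_crawler(Name=name)["Crawler"]["State"]


def last_crawl_status(name: str) -> str:
    """Outcome of the most recent run, e.g. SUCCEEDED or FAILED."""
    import boto3

    glue = boto3.client("glue", region_name="us-east-1")
    return glue.get_crawler(Name=name)["Crawler"]["LastCrawl"]["Status"]
```

A typical use would be `wait_for_crawler(lambda: crawler_state("customers"))` followed by a check of `last_crawl_status("customers")`, since READY alone does not say whether the run succeeded.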

[19:07]It looks like it has completed running. Let's click on the Crawler again and see. Yes, this time the run is complete. Now let's go back to Tables and see if it has created the table. Yep, we can see that the customers table is created. If you go to Databases and open the database, the table is there too. Let's look at this customers table. These are the column names and data types that the Crawler has inferred. If you are curious to see what the file looks like, this is it. The Crawler has inferred the schema of this file and loaded it into the Glue database. So, in two steps, we have created a database and successfully created a table with S3 data as our source.
Before we go forward to the next components of AWS Glue, I want to discuss another aspect here. Once you have created this table, what is the point? What is the use of this table? One thing is that you are maintaining the metadata of your data, which is great. But what if you want to query this data using SQL? We can make use of Athena for that, so let me quickly demonstrate it. You can query this data, which is sitting in S3, using SQL, which is cool, right? You don't have to move the data to a database. In the Athena console, click here and click on Query editor. In the Query editor, you can see the Data Catalog, the database is project_db, and this is the customers table that we just created. You can click on Preview Table here. If you are using Athena for the first time, you need to select an S3 location where the query results will be stored. Click on Edit Settings here. Let's create a folder in this bucket itself and call it Query results. Create folder. Let's go back here and select this Query results folder. Click on Save.

[21:35]That's it. You can now start querying your data. Let's run this query and see if we get the data.
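The preview query can be sketched against the Athena API as well. The output location below is a placeholder for the Query results folder, and the second query assumes the table exposes a `load_date` column of string type for the dated folder created earlier; both are assumptions to check against the actual table.

```python
def preview_sql(table: str, limit: int = 10) -> str:
    """A preview query in the style of the console's Preview Table action."""
    return f'SELECT * FROM "{table}" LIMIT {limit}'


def partition_sql(table: str, load_date: str) -> str:
    """Query restricted to one load_date folder (assumes a string column)."""
    return f'SELECT * FROM "{table}" ' + f"WHERE load_date = '{load_date}'"


def athena_request(sql: str, database: str, output_location: str) -> dict:
    """Keyword arguments for athena.start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request: dict) -> str:
    """Submit the query and return its execution id (needs boto3 and credentials)."""
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")
    return athena.start_query_execution(**request)["QueryExecutionId"]
```

Athena runs queries asynchronously, so the returned id would then be passed to `get_query_execution` and `get_query_results` to fetch the rows.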

[21:53]Yep. If you see, you are able to run a SQL query on the data which is sitting in S3. This is another cool feature of AWS Glue: once a table is created in the Catalog, you can query that data using Athena. Another thing that I wanted to highlight is this load_date column. This column is not present in the data, but it is added here, and if you see, it is marked as partitioned. What that essentially means is that our data is partitioned by load_date. If you get another customers file, you can add it under a folder for a different date. You can create any number of folders with this format. If you want to query the customers data only from one date, you can put that filter in the query, and it will scan only that folder and exclude the rest. That gives you better performance while querying. That is the importance of partitioning your data in S3 and using that partitioning to get better query performance in Athena.
Next, I wanted to discuss connections. In AWS Glue, a connection is a configuration object that enables Glue to connect to your data stores. Like we discussed, your data can be in any data source, like S3, Redshift, or RDS. You need to be able to connect to that data to scan through it, infer the schema, etc. To connect, you need things like your credentials (username and password), the database endpoint, and so on. You can create a configuration object which contains all those endpoints, usernames, and passwords, store it in AWS Glue, and use it later to connect to that data store.
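As a rough illustration of such a configuration object, the sketch below builds the body of a Glue `create_connection` call for a JDBC store. The connection name, endpoint, and credentials are hypothetical, and a real Redshift connection usually also needs network settings (subnet and security group) that are left out here.

```python
def jdbc_connection_request(name: str, jdbc_url: str,
                            username: str, password: str) -> dict:
    """Keyword arguments for glue.create_connection for a JDBC data store."""
    return {
        "ConnectionInput": {
            "Name": name,
            "ConnectionType": "JDBC",
            "ConnectionProperties": {
                "JDBC_CONNECTION_URL": jdbc_url,
                "USERNAME": username,
                "PASSWORD": password,
            },
        }
    }


def create_connection(request: dict) -> None:
    """Store the connection in Glue (needs boto3 and credentials)."""
    import boto3

    boto3.client("glue", region_name="us-east-1").create_connection(**request)
```

For example, `jdbc_connection_request("redshift-demo", "jdbc:redshift://example-host:5439/dev", "demo_user", "change-me")` describes a made-up cluster; in practice the password would come from a secret store rather than source code.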
Here in the Glue console, if you expand the catalog, you can click on Connections and add a connection. If you click on Create Connection, you can specify your data source. Let's assume it's Redshift. You can select Redshift and click on Next. Then you select your Redshift cluster and give the database name, username, and password. You give all these inputs and store it, so that next time you connect to that Redshift cluster, you don't need to specify all these things manually. You specify this connection, and the Crawler, or Glue in general, will automatically connect to that store using these credentials. That is the idea of connections in AWS Glue.
The next important topic that we are going to talk about is AWS Glue ETL jobs. ETL jobs are at the core of AWS Glue's ETL functionality. They are used to transform your data. Like this picture depicts, you can extract the data from a source, transform it in the job, and load it to a target. The job can be written in Spark, for example PySpark. Spark is a very powerful tool for data processing in the big data world, and you can leverage the power of Spark using AWS Glue. You don't need to install Spark or set up any Spark cluster, because AWS Glue is serverless and fully managed. You can just use Spark in your ETL job to transform your data. There are two ways to create these ETL jobs. You can create the job visually, by specifying your source, your target, and the transformations that you want to apply, and Glue will automatically generate a script for you. Or you can bring your own script with the business logic and run it in AWS Glue without having to set up any servers or any environment.
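Before looking at a Glue job, the extract, transform, load shape itself can be shown in a few lines of plain Python. This is only a conceptual stand-in using the standard library, not Spark or Glue, and the sample columns are invented for the illustration.

```python
import csv
import io


def extract(text: str) -> list:
    """Extract: parse CSV text from the source into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: list, drop: str) -> list:
    """Transform: remove one column from every row."""
    return [{k: v for k, v in row.items() if k != drop} for row in rows]


def load(rows: list) -> str:
    """Load: serialize the rows back to CSV text for the target."""
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Chaining them as `load(transform(extract(source_text), "phone"))` mirrors what a Glue job does at scale, with S3 in place of the strings and Spark doing the work in parallel.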
That is the idea of AWS Glue ETL jobs. I hope I was able to communicate that, but let's see in action how these ETL jobs work and how to create them. Here in Data Integration and ETL, click on ETL jobs. Like we discussed, you can create a job using the visual tool, use an interactive notebook, or use the script editor, where you can write a customized script of your own. For the sake of simplicity, let's use Visual ETL. Here in the visual editor, we specify what the source is, what transforms we want to apply, and what the target is. Our source is going to be S3, so I have selected the S3 source. Next are the transforms. You can do any of these transforms, like Rename Field, Filter, conditional, or any other transformation that you want. For simplicity, let's select one transform called Drop Fields, which is going to drop a field. Actually, before we do that, let's edit the first step. In our data source, you can give an S3 source location or a Data Catalog table. Since we have created a catalog table, let's select that: project_db and customers. This is going to be our input. The next thing is to edit the transformation. We are going to drop a field. Let's assume we want to drop this phone field, phone two. Next, we want to specify a target. We will write the output to S3.

[27:50]For the S3 target's node parents, meaning the transform after which this node runs, we need to select the Drop Fields transform, so that the data is loaded into this target after the transform.

[28:07]And what is the format that you want to store? Let's say we want to store it in Parquet format. There is also a compression type setting. And let's store it in S3. Before that, let's create a folder here called transformed_zone. So, Landing Zone is our input, and Transformed Zone is going to be our output.

[28:34]Let's come back here and select this folder as the output. Yep, I think we should be good.

[28:52]Just to summarize what we are doing: we are reading the data from this S3 location using the catalog table, dropping a field, and then storing the output in an S3 bucket in Parquet format. That is a very simple ETL job. Let's call this customers_ETL_drop_field, and let's save this job.

[29:21]We need to give an IAM role, so let's select this Glue tutorial role and see if that works. After that, you can select other things, like whether it's a Spark job, the Glue version, and the language. We'll leave those as default, but you can play around with these parameters, like the number of workers that you need. These all depend on your workload. Click on Save here, and now our job is being saved.
Let's see the script of our job. Depending on the visual steps that we specified, Glue automatically generates an ETL script for us. If you see here, it is reading the data into a Glue DynamicFrame, which is basically an abstraction of your data. Then it is calling the transformation DropFields.apply. And then it is writing the DynamicFrame to an S3 location in Parquet format. I think this should be good. Let's click on Run. It has started. Let's click on Runs here, and it says running. Let's give it some time and see what happens.
Now the job has run successfully, and it says succeeded. Let's go to the output location and see if it has loaded the data. Yep, it has loaded the data in Parquet format. Let's see if we can view the data with an SQL query. There is some problem with the serialization, but this Parquet file is stored in this S3 bucket as the output of the job. That's how you can create jobs and run them. We discussed only the option of creating the job visually like this. You can also create a job by specifying your own script, so you can put any complicated logic in the script here and run the job.
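A hand-written script with the same three steps might look roughly like the sketch below. It is not the script Glue generated in the video: the dropped column name `phone2` and the output path are assumptions (the transcript only says "phone two" and "transformed zone"), and the `awsglue` modules exist only inside the Glue job runtime, which is why they are imported inside `main()`.

```python
import sys

DATABASE = "project_db"
TABLE = "customers"
DROPPED_FIELDS = ["phone2"]  # assumed spelling of the "phone two" column
OUTPUT_PATH = "s3://example-glue-tutorial-bucket/transformed_zone/"  # placeholder


def sink_options(path: str) -> dict:
    """Connection options for an unpartitioned S3 target."""
    return {"path": path, "partitionKeys": []}


def main() -> None:
    # These imports resolve only inside the AWS Glue job environment.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import DropFields
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Extract: read the catalog table into a DynamicFrame.
    source = glue_context.create_dynamic_frame.from_catalog(
        database=DATABASE, table_name=TABLE
    )
    # Transform: drop the unwanted column.
    trimmed = DropFields.apply(frame=source, paths=DROPPED_FIELDS)
    # Load: write the result to S3 as Parquet.
    glue_context.write_dynamic_frame.from_options(
        frame=trimmed,
        connection_type="s3",
        connection_options=sink_options(OUTPUT_PATH),
        format="parquet",
    )
    job.commit()


# Glue passes --JOB_NAME on the command line; this guard keeps the module
# importable elsewhere without starting a Spark context.
if __name__ == "__main__" and any(a.startswith("--JOB_NAME") for a in sys.argv):
    main()
```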
The next thing that I wanted to discuss is triggers. You can have triggers for your job. Triggers are basically the events which can start your job. Triggers can be of two types: an event-based trigger or a scheduled trigger. Click on Add Trigger, and let's call this customer_ETL_trigger. For the trigger type, if you select On-demand, it fires only when you run the trigger yourself. Or you can schedule a trigger. For frequency, you can select daily, hourly, or whatever. And what is the minute of the hour? Let's set it to the first minute. Click on Next. What is the target? The target is a job, the ETL job that we created. Click on Next, and create. Now we have created this trigger, and if you see, it is a scheduled trigger.

[32:32]It will trigger this ETL job, which is the target, at one minute past every hour. That is the concept of triggers. You can also have an event-based trigger. Let me just give you a demo: Crawler event. Whenever a job succeeds, or whenever a Crawler runs, if you want to run a particular job after that, you can have an event-based trigger here. I think we have covered pretty much all the basic concepts of Glue.
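Both trigger styles can be sketched as bodies for the Glue `create_trigger` call. The trigger and job names follow the ones spoken in the video; the second trigger's name and the crawler it watches are hypothetical examples. The cron expression `cron(1 * * * ? *)` means minute 1 of every hour.

```python
def scheduled_trigger_request(name: str, job_name: str, minute: int = 1) -> dict:
    """Keyword arguments for an hourly scheduled trigger."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({minute} * * * ? *)",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def crawler_event_trigger_request(name: str, job_name: str, crawler_name: str) -> dict:
    """Keyword arguments for a trigger that starts a job after a crawler succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "CrawlerName": crawler_name,
                    "CrawlState": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def create_trigger(request: dict) -> None:
    """Register the trigger in Glue (needs boto3 and credentials)."""
    import boto3

    boto3.client("glue", region_name="us-east-1").create_trigger(**request)
```

For instance, `scheduled_trigger_request("customer_ETL_trigger", "customers_ETL_drop_field")` corresponds to the hourly trigger created in the console.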

[33:05]I hope I was able to give you a fair idea about the catalog and the ETL engine of AWS Glue, and to demonstrate some concepts using the hands-on lab. Of course, this is not an extensive tutorial. There are a lot of other features in Glue, like the schema registry and Workflows. We can explore those in upcoming videos. I hope you found this video helpful. If you have any questions, do let me know in the comments below. Also let me know if you want me to cover any topic related to AWS Glue in depth in my next videos. Thank you, and I'll see you in the next video.
