[0:00] In this quick tutorial, we will understand what exactly Snowflake is and how it is different from traditional data warehouses. We will also do some hands-on practice and go over the key features and benefits.
[0:11] In the year 2020, one company IPO'd on the New York Stock Exchange, raising 3.36 billion dollars. That made it the biggest software IPO in the world. That company was Snowflake. Right now, 751 of the Forbes Global 2000 companies are using Snowflake, so you can already tell it's very, very popular. And if you go to any job portal, you will find tons of jobs. See, here on Naukri right now I'm finding 34,000 jobs. So what exactly is Snowflake? See, it's a data warehouse. So you need to know what exactly a data warehouse is, and for that I have a separate video. But I will quickly go over the architecture. Let's say there is a company like Amazon or Blinkit. They will have relational databases, say an orders table and an inventory table, and this can be Oracle or Microsoft SQL Server, etc. They will also have unstructured data, right? They will have PDF files, log files, some JSON objects and so on. Now, if you want to do data analytics or AI, you can't do it directly on these raw data sources because of several issues which I have explained in that video. So what companies do is extract all this data and put it into a data lake. A data lake can be simple Amazon S3 object storage or Azure Blob Storage, etc. It's like a dumping ground where you dump all the data as it is. And then you transform and load it into your data warehouse. So here you perform certain transformations; you do data cleaning. Let's say you have business in multiple countries; then you might have multiple currencies. When you want to do analytics, you want to bring them into the same currency, so you might do a currency conversion, and you will do some aggregation. Those kinds of things happen in the transform step, and then you load it into a data warehouse. Now this data warehouse holds your analytics- and AI-ready data. Here you can hook up your Power BI dashboard, your Jupyter notebook, your AI program, etc. And Snowflake is this data warehouse.
So there are many data warehouses, Redshift, Teradata and so on, and Snowflake is one of them. It is a cloud-based SQL data warehouse. So first of all, you can't install Snowflake on your local computer. It is cloud native, and you mainly write SQL queries when you're working with Snowflake. So what are the differences between an RDBMS and a data warehouse? Well, the primary use for an RDBMS such as Oracle, MySQL, etc. is OLTP, online transaction processing, whereas a data warehouse is for OLAP, analytics. So you will use Snowflake for data warehousing. You will not use Snowflake to store your payment transactions, etc., right? That's the transactional data use case. The typical workload for OLTP is many small reads and writes with high concurrency: day-to-day transactions like orders, payments, etc.
[3:36] The typical workload for a data warehouse is a few very large queries scanning billions of rows. A data warehouse will have billions of rows, and you will run large queries on them.
[3:49] You know, when you're performing analytics or doing AI, you're running large queries, which will typically take time. And then an RDBMS is self-managed; you can of course get it as SaaS from one of the cloud providers. A data warehouse, especially Snowflake, is fully managed, okay? So whatever we are discussing right now is for Snowflake. There are other data warehouses which are not fully managed, but Snowflake is a fully managed SaaS platform. They started Snowflake in 2012, and between 2012 and 2015 it became generally available as a cloud-native data warehouse. So that's the timeline. In 2020, they went IPO, right, that biggest IPO, and they came up with this data cloud vision and launched Snowpark. Up till here, it was all SQL; if you were working in Snowflake, you would mainly be writing SQL queries. But with the competition from Databricks and so on, they realized they needed to support other programming languages. So with Snowpark, they started supporting Python, Java and so on. And today, they call themselves an AI Data Cloud. If you go to their website, they will say Snowflake is all about analytics, AI, data engineering, apps and collaboration: power them all in the AI Data Cloud. So they are going in the direction of becoming a unified analytics platform. A unified analytics platform means you can do your data engineering, data analytics and AI all in one place, just like Databricks. In Databricks, you can do all of this in one place. So Snowflake is a cloud-based AI data platform that combines data warehousing, data engineering, machine learning, AI application development, data sharing and governance in a unified environment. Now, I could go over all kinds of theory, but what I would like to do instead is get you started with a Snowflake free account, all right? So go to snowflake.com, their website, and click on Start for Free.
Once again, it's a cloud-based environment, so you can't download any software here; you have to go to their website and register. So I'm using a learning email ID for registration. Just click on Continue. For company, you can give any company name. Uh, Data Engineer, and whatever, just select one of these, okay? It doesn't matter. Select your region. You can just say Asia, or wherever you are, just select that region. Get Started.
[6:36] Select the programming languages that you work with: Python. You can select this, or you can just skip this survey, and it will send you an email. I got this email; I will click on it to activate my account. Okay, give your password, etc., and click on Get Started. It will take a few minutes to prepare the environment, and then you will get this screen, okay? So what I would like to do is explore a sample data set, so just click on Start Here. All right, it took some time to load, but basically we are done with the registration. You get $400 in free credits and a 30-day free trial, okay? We will be doing our learning with that free trial. In this lecture, we are going to go through a sample use case for Tasty Bytes, a fictitious global food truck chain. Snowflake provides this data set for our learning. Here, we will load the data from AWS S3 and then run some SQL queries, okay? So whatever you're seeing here is SQL code, and this thing here is called a worksheet, a SQL worksheet. It's like a Jupyter notebook for SQL, where you can move your mouse cursor here and run this particular cell. You know how in a Jupyter notebook you can run the code cell by cell? It's the same thing, but for SQL. Now let's go through these useful SQL worksheet tips. Before you work on a worksheet, it has a context, which means you need to set a role, warehouse, database and schema. For our worksheet, the role is the Snowflake learning role. See, it is loading the roles and warehouses. It has already created that role as part of the default setup. And then for the warehouse, we have this SNOWFLAKE_LEARNING_WH; WH means warehouse, okay? So see, these are the roles. We are using the learning role, and then we are using this particular warehouse, okay? And database and schema. So if you go to databases here, you have all these learning databases. If you click on it, you will find the learning databases.
You will also find the sample data here, and so on, all right? If you look at this visual, you will get more of an idea. But here also, see, you will see all this sample data. Let's go to next: you can run a single query or multiple queries. So when you move your mouse cursor here and click on this, or press Ctrl+Enter, you will run a single query. And if you say Run All, it will run everything, okay? Then you can explore query results and performance. Try using the object explorer, okay?
[9:39] All right, now let's get started with this worksheet. So here, what we are doing essentially is using a role. Any SQL workload that you're running will have some role, right? You're a data analyst, a data engineer, whatever kind of user you are. This code will be doing that. So move your mouse cursor here and press Ctrl+Enter if you are using Windows; if you are using Mac, there will be a different shortcut. Or you can place the mouse cursor here and simply click on this button. If you say Run All, it will run all the code, but I don't want you to run all the code; just run the first single line. The UI is initializing; it's going to take some time. Then you will move to the second line, which is USE WAREHOUSE SNOWFLAKE_LEARNING_WH, okay? We have this learning environment right now, so we are using that learning warehouse, okay? See, statement executed successfully. So we are using the Snowflake learning role. Here it will show you the query statistics, like this query took 35 milliseconds, or whatever that time period is, and initially it will take more time to run those queries. Then you come here and again run that particular cell, so you are now using the Snowflake warehouse, okay? Then go here, Ctrl+Enter again; there you are using this database called SNOWFLAKE_LEARNING_DB. Now if you go to databases here: SNOWFLAKE_LEARNING_DB. Here you have a public schema, you have an information schema, and you have my username, DHAWAL_P, load sample data, okay? So see, this is load sample data. The thing that you are seeing here, this SNOWFLAKE_LEARNING_DB, is a database, okay? And underneath it, what you have are these schemas, okay? You have the information schema and this public one.
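The context-setting steps described above boil down to four statements. A minimal sketch — the object names below follow the learning-environment naming mentioned in the video and may differ in your account:

```sql
-- Set the worksheet context: role, warehouse, database, schema.
-- These names come from Snowflake's learning setup; adjust to your account.
USE ROLE SNOWFLAKE_LEARNING_ROLE;
USE WAREHOUSE SNOWFLAKE_LEARNING_WH;
USE DATABASE SNOWFLAKE_LEARNING_DB;
USE SCHEMA PUBLIC;
```

Every query the worksheet runs afterwards executes under this role, on this warehouse, against this database and schema.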
[11:57] Now, all right, let's try creating that object. So what we are going to say is SET schema_name = CONCAT(current_user(), ...), which will be DHAWAL_P. See, DHAWAL_P underscore load sample, okay? So let's do Ctrl+Enter, and we are saying USE SCHEMA IDENTIFIER($schema_name). So my schema name will be DHAWAL_P underscore load whatever, and by default now, whatever operation we do is going to use this particular schema. Now here, we are going to create a table. The table is for a menu. See, you have a food truck chain, and a food truck will have a menu of food items. That will have menu ID, menu type ID, truck brand name, item category and so on, okay? So this is a simple SQL statement. And by the way, the SQL that you are running here is Snowflake's own SQL dialect. There is ANSI SQL, right, the standard SQL that you pretty much use everywhere, and then there are all these dialects. So when you use SQL in Snowflake, you will mostly be familiar with all the commands, but some commands will vary; the Snowflake folks have done their own customization here. I was getting some error at line 31, and when I reran it, it just worked, okay? So I have run all the way till the point where you create the menu. And when you click on this particular query ID, you will see the details of that query, such as how much time it took, start time, end time and so on. Now with this, the table should be created. So let's refresh this, and under this load-sample schema, we should see that table. See, now you have a menu table. Initially it showed no objects, and when you click on it, you will see the columns, etc. Somehow this Snowflake instance is slow for me, but when I was using it yesterday it was pretty fast, so I don't know what's going on. Anyway, we see this table now. Now when you say SELECT * FROM menu, Ctrl+Enter: obviously, there are no menu items yet, okay?
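The schema and table creation described above looks roughly like this sketch. The column list is a simplified assumption based on the columns mentioned in the walkthrough; the actual worksheet's table has more columns:

```sql
-- Build a schema name from the current user, as the worksheet does.
SET schema_name = CONCAT(CURRENT_USER(), '_LOAD_SAMPLE_DATA');
USE SCHEMA IDENTIFIER($schema_name);

-- A simplified menu table; the real worksheet defines more columns.
CREATE OR REPLACE TABLE menu (
    menu_id                      NUMBER,
    menu_type_id                 NUMBER,
    truck_brand_name             VARCHAR,
    item_category                VARCHAR,
    menu_item_name               VARCHAR,
    cost_of_goods_usd            NUMBER(10,2),
    sale_price_usd               NUMBER(10,2),
    menu_item_health_metrics_obj VARIANT   -- semi-structured JSON column
);
```

Note the VARIANT column: that is Snowflake's type for semi-structured data, and it's what the JSON queries later in the lecture run against.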
Now, uh, the result will come up, and I will show you that there are no menu items. What we are doing here is this: the Snowflake folks have put some sample data in an S3 bucket, and we are going to fetch that sample data from S3 into the Snowflake environment. Here we are using this command, CREATE OR REPLACE STAGE blob_stage, okay? You can read the documentation on this. By the way, there are two kinds of stages, internal and external, and this is an external stage that we are creating. A stage is more like a pointer, okay? This blob_stage will now point to this directory. We have not got the data yet, okay? All we are doing is creating a reference. We are saying that blob_stage means this particular S3 location and whatever objects it has, and the file format is CSV; there are CSV files in there. Then let me just run this thing, so click on this and execute. Meanwhile, see, the menu table is showing its schema: menu ID is a number, menu type ID, menu type, okay? These are all the categories; you can go through it. It's a menu table. Now, I created this particular stage here. And when I executed LIST, which lists the files which are present, it shows me this, okay? It has this one file, a csv.gz file, which is a compressed file containing my menu items, okay? Now here, I will run this COPY INTO menu command. What this command will do is, see, it says COPY INTO menu, which means into this menu table. Right now this table is empty, okay? When we ran the SELECT * command, we did not see any results; it was all empty. And now we are saying: copy into menu from blob_stage. And what is blob_stage? It is this S3 bucket, okay? There we have raw_pos and menu, okay? So whatever data you have there, just copy that into the table.
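An external stage like the one described is just a named pointer to a cloud storage location; creating it moves no data. A minimal sketch — the S3 URL below is a placeholder, not the actual tutorial bucket:

```sql
-- Create an external stage pointing at an S3 location.
-- The URL here is a placeholder; the tutorial uses Snowflake's own sample bucket.
CREATE OR REPLACE STAGE blob_stage
    URL = 's3://example-bucket/raw_pos/menu/'
    FILE_FORMAT = (TYPE = 'CSV');

-- List the files the stage can see; nothing has been loaded yet.
LIST @blob_stage;
```

The `@` prefix is how Snowflake SQL refers to a stage, and LIST is how you confirm the stage actually resolves to files before running a load.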
So here we just created a reference called blob_stage, which is pointing to this S3 bucket, and here we are actually copying the data. Once the data is copied, let's run this command, SELECT COUNT(*). And there you go: it says you have 100 rows. Now, SELECT TOP 10. When you say SELECT TOP 10 *, it will show you the top 10 rows from that menu table, okay? So in this menu table, we now have data from S3, and we are saying: show me the first 10 rows in this table. All right. See, these are the menu items. So menu ID, you have the truck brand name, you have the actual menu item, which is lemonade, a beverage, a cold option, cost of goods, sales price. And then there is a JSON field also, menu item health metrics. That's a JSON field, and each JSON object looks something like this: it has this key called menu_item_health_metrics, which has ingredients, all these ingredients, okay? So now here: what menu items does the Freezing Point brand sell? When you run this, it will show you that. I'm not going to run it; you can run it, it's just taking some time. Then when you run this particular command, it will show you the profit. Profit is sales minus cost of goods, right? It's a simple SQL query, folks. Okay, that's what it will run. And this particular query is probing this JSON field, and it is showing you the Mango Sticky Rice ingredients. Here we are saying: when the menu item is Mango Sticky Rice and the truck brand is this, go to menu_item_health_metrics, see, menu item health metrics here, and show me the ingredients. So whatever object you get, take .value.ingredients and flatten that as an array, okay? So see, for Mango Sticky Rice, let's run this command and see. All right, eventually you got all these ingredients. So to summarize, in this particular lecture, we loaded data from S3 and then we ran a bunch of SQL queries, okay?
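Putting the load and the queries together, a sketch of the pattern the worksheet follows. The exact JSON paths in the FLATTEN query are assumptions based on the structure described above, not the tutorial's verbatim code:

```sql
-- Load the staged CSV file into the menu table.
COPY INTO menu FROM @blob_stage;

-- Verify the load; the tutorial data has 100 rows.
SELECT COUNT(*) FROM menu;
SELECT TOP 10 * FROM menu;

-- Profit per item: sale price minus cost of goods.
SELECT menu_item_name,
       sale_price_usd - cost_of_goods_usd AS profit_usd
FROM menu;

-- Probe the VARIANT column: ingredients for one item.
-- The JSON path mirrors the structure described in the lecture
-- and may not match the tutorial's exact field names.
SELECT m.menu_item_name,
       obj.value:"ingredients" AS ingredients
FROM menu m,
     LATERAL FLATTEN(input => m.menu_item_health_metrics_obj:"menu_item_health_metrics") obj
WHERE m.menu_item_name = 'Mango Sticky Rice';
```

LATERAL FLATTEN is the Snowflake idiom for exploding a JSON array or object into rows so you can query it like a table.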
So here, in this particular table, we loaded data, and we ran a bunch of SQL queries. Now folks, experience-wise this looks a lot like MySQL. MySQL is a relational database, right? You have a bunch of databases, and here you run all these queries, right? That's what it looks like: see, here on the left-hand side are your databases, here you are running SQL queries, and you see the result. But the difference is, that is OLTP, transactional data; this is OLAP, analytical processing. And if you look at some of the databases here, they're pretty big. See, I'm in the sample data, and in this SF100TCL schema there is a table called store_sales. I created a new SQL worksheet, okay, by clicking on this. Then I set the context: role, warehouse, database. And then when I say SELECT COUNT(*) FROM store_sales, check how many records it has. Pause this video and tell me how many records this is. Well, it's 288 billion records. This is what big data is, and that's where Snowflake shines. MySQL, Oracle, etc., the OLTP databases, are not designed to handle this kind of workload, okay? So folks, that is the beauty of Snowflake, and I hope via this tutorial you got started with some basic commands and got some basic familiarity with the Snowflake environment. Let's discuss some of the key features and benefits of Snowflake. The first one: it is cloud native and fully managed. You log in to their website, and everything is given to you as a managed service. You don't have to do any kind of management: creating clusters, maintenance, installation and so on. This lets you focus on your business problem, and you can avoid investing time in infrastructure details. It also runs on AWS, Azure and GCP, the major cloud providers, with the same experience. The second benefit is the separation of compute and storage. In traditional systems, compute and storage were tightly coupled, okay?
And sometimes what happens is, let's say you don't have too much storage usage, but you are using compute very often, or you have high compute usage. In that case, you have to scale both of them, because these two things are tightly coupled in traditional systems. But in Snowflake, they are independent; you can scale them individually. Let's say you want more compute but less storage, or more storage and less compute: you can configure that very easily. In the Snowflake example that we saw in one of the past lectures, we saw this virtual warehouse, right? A virtual warehouse enables exactly that, okay? Here you can select one of the warehouses, and these are virtual warehouses. Through a virtual warehouse, you get scalable compute. The next benefit is automatic performance optimization. It has a bunch of features such as micro-partitioning, pruning and caching, and through all this optimization you get very high performance when you are doing big data analytics. The next one is support for structured and semi-structured data. It obviously works with structured data such as tables and columns, but it also works with semi-structured data; for example, you can work with JSON, okay? You can use the VARIANT data type to store and query semi-structured data. In the SQL use case that we saw before, we saw the use of FLATTEN. You know, we had this JSON column with some JSON data in it. JSON is semi-structured data, but you can still query it, using functions like FLATTEN. The next one is time travel and zero-copy cloning. If you have attended the Codebasics Databricks module, we learned about time travel there. You can do a query such as SELECT * FROM sales AT a given offset, okay? So you can go back in time and query the data.
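The time travel query mentioned above can be written in a few forms. A sketch — the table name is illustrative, and the query ID in the last statement is a placeholder:

```sql
-- Query the table as it looked 10 minutes ago (OFFSET is in seconds).
SELECT * FROM sales AT (OFFSET => -60 * 10);

-- Or as of a specific timestamp.
SELECT * FROM sales AT (TIMESTAMP => '2024-01-15 08:00:00'::TIMESTAMP_LTZ);

-- Or just before a particular statement ran (replace with a real query ID).
SELECT * FROM sales BEFORE (STATEMENT => '<query-id>');
```

This is handy for recovering from accidental updates or deletes: you query the old state, inspect it, and restore what you need.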
It also has zero-copy cloning, which means it can instantly create a clone of a table or database without duplicating data. We'll go over this in detail in future lectures, but these are some of the special features you get in Snowflake. Built-in security and data governance is a big one. Nowadays, all organizations ask for data governance features, okay? For that reason, the majority of popular data warehouses and data platforms provide these kinds of features. And what are the features? Well, number one is role-based access control, okay? This is something that every organization is looking for. Multi-factor authentication. See, in my company AtliQ, when we talk to clients about projects, this is the first question they always ask, okay? What about data security? How do I get data governance? And solutions like Snowflake and Databricks provide you all those built-in features. Always-on encryption. Then row- and column-level access policies.
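The zero-copy cloning mentioned above is a single statement. A sketch — the object names here are illustrative, not from the tutorial:

```sql
-- Zero-copy clone: created instantly, no data is duplicated.
-- Storage is shared until either the clone or the original changes.
CREATE TABLE menu_clone CLONE menu;

-- Cloning works at the schema and database level too,
-- e.g. to spin up a dev copy of production.
CREATE DATABASE dev_db CLONE prod_db;
```

Because only metadata is copied up front, cloning a multi-terabyte database takes seconds, which is why it's popular for creating test and dev environments.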
[24:53] So yes, you can set access control even at the row level, okay? Let's say some people have access to a table, and only a few individuals have access to certain rows: you can configure your security and access control to that level, at the row and column level. The next one is Snowpark and the developer ecosystem. Snowflake used to be SQL-heavy, you know; it was mainly designed for SQL usage. But later on, maybe because of the competition with Databricks, they realized they needed to add support for Python, Java, all these programming languages, and they came up with Snowpark, through which you get Python support in Snowflake, as well as Java and other languages. You can also get continuous data ingestion with Snowpipe. Snowpipe is a feature through which you can set up your data pipeline so that, let's say you have daily files coming into a landing folder in AWS S3 or ADLS, okay? Through Snowpipe, you can monitor that directory, and as soon as new files arrive, the data gets ingested into Snowflake, okay? You can set those things up using Snowpipe. And then, fully elastic, pay as you go. When you're using Snowflake, you are fully elastic; you are paying as you go. Many organizations like this kind of billing policy, you know; they just want to pay per their usage. That kind of benefit you get with Snowflake. All right, so just to summarize, these are the key features and benefits, and you can take a screenshot or just try to summarize this in your own head, organize your thoughts, because this is one of the popular questions they ask in a Snowflake interview. You may be wondering, what's the difference between Snowflake and traditional data warehouses? See, when the big data revolution came, there were traditional data warehouses such as Teradata.
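The daily-files scenario described above maps to a Snowpipe definition like this sketch. The stage and table names are assumptions, and auto-ingest additionally requires cloud event notifications (e.g. S3 notifications) to be configured on the bucket:

```sql
-- Auto-ingest pipe: loads new files from the landing stage as they arrive.
-- landing_stage and raw_orders are illustrative names, not from the tutorial.
CREATE OR REPLACE PIPE orders_pipe
    AUTO_INGEST = TRUE
AS
COPY INTO raw_orders
FROM @landing_stage
FILE_FORMAT = (TYPE = 'CSV');
```

The pipe is essentially a stored COPY INTO statement that Snowflake runs for you whenever the stage receives new files, so you get continuous ingestion without scheduling batch jobs.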
And at that time, in the old days, the approach was different. These data warehouses were on-prem, okay? There was no cloud-native approach. So let's summarize the difference between traditional data warehouses and Snowflake. Nowadays, by the way, Teradata is also available in the cloud, but I want to summarize this so that you can understand the evolution of this technology. Traditional data warehouses are typically on-prem and VM-based. They are tightly coupled in terms of compute and storage. For Snowflake, this is the biggest feature, folks, okay? Just remember that it separates compute and storage. And it is cloud native, okay? So it avoids all kinds of management. Your payment policy is quite flexible, and you pay as you use either compute or storage. The second one is that Snowflake has automatic optimization, okay? Micro-partitioning, pruning, caching: there are a bunch of features due to which it is very optimized. You know, in one of the previous lectures, we saw that we had 288 billion rows. When you are running queries on that kind of data, you need optimization, you need speed, and Snowflake provides that. If you want to do performance tuning in traditional data warehouses, see, you have to do manual indexing, partitioning and optimization. If you have worked with an RDBMS such as Oracle, SQL Server or MySQL, you know that whenever your query runs slow, you create some kind of index, you do partitioning, you do all this optimization manually. In Snowflake, you get it out of the box. The next one is that traditional data warehouses typically support structured data types, okay, whereas Snowflake supports structured and semi-structured. Meaning, we just saw, right, in one of the previous lectures, we had a JSON column and we were running queries on it. JSON is semi-structured data, and Snowflake provides support for it. By the way, nowadays these databases are evolving.
So even if you're using Postgres or many other popular databases, they are also adding this kind of semi-structured data support. In terms of maintenance: Snowflake is fully managed, very little maintenance, whereas in traditional data warehouses, you have to do manual backups, patching and tuning. If you have talked to any DBA, a database administrator, you will understand, you know, they spend so much time on this. When I was at Bloomberg, they would have scheduled maintenance. The databases would be down on, let's say, a weekend for a few hours, and the team would come in, you know, work on the weekend and do all these activities. The next one is the cost model: in Snowflake it is very flexible, pay per use, whereas traditionally it is fixed hardware and license cost. And the problem with that approach is that you might be renting or buying very expensive hardware and licenses, but if there is no usage, then that money goes to waste, right? It's like having a subscription service where you are paying some fixed amount every month. Let's say Netflix: you're paying every month, let's say $20 or $30. Whereas the other approach, and I use this approach, is: if I want to watch a movie, I go to YouTube, I rent a movie for $3 or $4, whatever that amount is, and I just watch it. So let's say there are three months where I have not watched any movies, and by the way, I don't watch many movies or shows. If for three or four months I have not watched any movies, I will not be paying that fixed monthly subscription, right? I'm just paying per use. That kind of benefit you get with Snowflake. And then the cloud support: Snowflake has multi-cloud support; it runs on AWS, Azure, GCP, etc. Nowadays there are companies, you know, whose entire ecosystem is on Azure or AWS. So when you have these preferred cloud providers, having Snowflake on that cloud provider makes things easier.
And then for use cases: if you have dynamic, large-scale, modern analytics workloads, you use Snowflake; otherwise, you use a traditional warehouse. Nowadays, see, when companies are adopting data engineering practices, they will go with a modern solution like Snowflake or Databricks.



