[0:00]This four-hour Databricks Masterclass will make you a Databricks data engineer, even if you do not have any prior experience. In this masterclass, you will learn about Spark clusters, Spark architecture, Databricks fundamentals, magic commands, Databricks utilities, the Databricks File System, reading external data from cloud storage, secret scopes, Spark transformations, and loading data incrementally using Auto Loader. After covering all these areas, you will learn something special that is required to crack interviews. I am talking about Delta Lake. You will learn Delta Lake in detail: the Delta log, external Delta tables versus managed Delta tables, tombstoning of files, data versioning, time travel, Delta table optimization techniques such as ZORDER BY, the VACUUM command, and much more. Wow, that sounds like your one-stop solution to become a Databricks data engineer. But you should only watch this video if you are dedicated enough to learn everything with excitement. If your answer is yes, let's get started with this amazing Databricks masterclass. So, welcome, welcome, welcome to the Databricks Masterclass. This session is really interesting because demand for Databricks is growing rapidly. Companies are going after Databricks developers, and they should, because Databricks is one of those technologies that is in demand, powerful, and strongly inclined towards the future. They are continuously adding so many updates that you need to stay up to date just to cope with the pace. If you look at Databricks a few years back, it was totally different, but now the scenario has completely changed. Let me give you a hint. Earlier, we used mounting if we wanted to read data from an external location, but now we use Unity Catalog. 
Now we use Delta Lake, and Delta Lake itself is being updated very frequently. Trust me, there are so many updates in Delta Lake, and it's very important to stay updated. When I looked at the courses available for Databricks, even the paid ones, they were really good. The only thing missing was that the information was not updated. It was not in line with the latest trends. So, here I am with the Databricks Masterclass, and I will be covering all the latest information and trends in Databricks. I'm trying to fill that gap, and I hope that you will love this session. I have read books related to Databricks and books related to Delta Lake, and let me show you one as well. This book is a gem. It is specifically about Delta Lake and, I think, was recently published. There is so much knowledge that is missing right now on the online platforms, so I thought it was my duty to my community to pass on this knowledge so that you can be competitive too. Trust me, if you want to crack interviews today, you need to be updated. You need to be someone who knows all the latest trends and technologies that are being used, someone who is well versed with all the recent updates. So, now you have found the best video, which covers everything from scratch. If you do not have any prior experience or prior knowledge, you do not need to worry, because I will explain everything from scratch and cover all the latest updates. And just a hint: in this masterclass, we will cover Delta Lake in detail, which is one of the most in-demand technologies right now. 
And we will see how to work with Delta Lake, and I will try to pass on the knowledge that I have gained so far. You do not need to worry and you do not need to pay for any resources, because everything is available for free. You just need to be focused. Take out your notebooks, pens, pencils, everything, and start taking notes, because this is the one-stop solution that you were looking for, and now you have it. So, let's get started with our Databricks Masterclass journey. First of all, let's discuss the prerequisites. There are not many. Obviously, you need a laptop or PC with a stable internet connection so that you can learn from this video. Next, a Databricks account. Oops. If you do not have a Databricks account, you do not need to worry, because I am here to show you how to set up a free Databricks account, and this will not be a Databricks Community Edition account. No, no, no. As I told you, this video is a level up compared to other videos. We will create a cloud account, because that is required if you want to crack interviews. If you want to show that you are experienced, you should have the hands-on practice and know how things work within a cloud account. Even if you do not have any cloud account, we will create everything from scratch, so you do not need to worry. Next, the main thing: excitement to learn Databricks, because Databricks is like the Hulk. I call it the Hulk because everyone is running towards Databricks. Why? The main advantage of Databricks is that it is independent of any single cloud environment. You can use Databricks with any of the major clouds, and that is the best thing. 
If you are using Azure, you can use Databricks. If you are using GCP, that is Google Cloud Platform, you can use Databricks. If you are using AWS, you can use Databricks. So, it is cloud independent. All three major cloud players integrate Databricks, and that's its biggest advantage. That's why so many companies rely on Databricks: when you want to transform data and work with big data sets, Spark clusters on Databricks are a go-to choice. Did I just say Spark clusters? You do not need to worry. I will explain everything that I am saying right now. So, what is a Spark cluster? What is a cluster?
[8:10]So, let me give you an example so that you can get this concept easily instead of making it complicated. You must have worked in a computer lab, maybe in your school or in your college. You would have seen many computers in that room, right? Let's say you saw 30 computers in one room. Now, if I ask you to combine all 30 computers and treat them as one computer, can you imagine the power of that one computer? Can you imagine how fast it could process data? Obviously, it is much more than one single computer. You could instead take one computer and increase its RAM, its storage, everything, but there is a limit to that. There is no such limit when you combine computers. Now you would say, I only have 30 computers, so what? You can include another computer lab of 30 computers, and now you have 60 computers. Then you can include another lab of 30 computers, and now you have 90 computers. So, there is practically no limit when you scale your machines horizontally. And in Spark language, we call machines nodes. Let me type it for you, and start taking notes right now. We call them nodes instead of machines. You can call them machines as well, but in Spark language, we call them nodes. So, just imagine: now we have 90 machines, now we have 90 nodes.
[10:28]And we are treating them as one single machine because they are interconnected. Now, these machines... okay, okay. Your question is valid.
[11:15]Why do we need to combine so many machines? Because we can distribute our work. Just imagine I got a task to process, let's say, 1 TB of data. This is a hypothetical situation, so don't worry about the number. Now I can ask these 90 machines to split the data so that each one takes some part of it. That part is called a subset of the data, and the splitting is called data partitioning. So, this 1 TB of data can be distributed among 90 machines, and these 90 machines can work in parallel and deliver the results.
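To make the partitioning idea concrete, here is a minimal sketch in plain Python. It is not Spark code: the threads only stand in for the 90 nodes, and the function names (`split_into_partitions`, `process_partition`, `process_in_parallel`) are made up for this illustration. The "work" is just summing numbers, assuming each partition can be processed independently.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Split a list into num_partitions contiguous chunks of near-equal size."""
    base, extra = divmod(len(records), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions get one additional record each.
        size = base + (1 if i < extra else 0)
        partitions.append(records[start:start + size])
        start += size
    return partitions


def process_partition(partition):
    """Stand-in for the real work one node does on its share of the data."""
    return sum(partition)


def process_in_parallel(records, num_nodes):
    """Give each simulated node one partition, then combine the partial results."""
    partitions = split_into_partitions(records, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partial_results = list(pool.map(process_partition, partitions))
    return sum(partial_results)


if __name__ == "__main__":
    data = list(range(1, 1001))  # hypothetical data set of 1,000 records
    print(process_in_parallel(data, num_nodes=90))  # 500500
    # Rough share per node if 1 TB (1024 GB) is spread evenly over 90 nodes:
    print(round(1024 / 90, 1), "GB per node")  # 11.4 GB per node
```

The point of the sketch is only the shape of the work: split, process each piece independently, combine. A real cluster does the same thing across separate machines rather than threads in one process.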
[12:10]That is why we need parallel processing. That's why we need clusters. And that's why we need Apache Spark. Got it? I hope you now know the concept of a Spark cluster. After this, I am going to cover one very important topic, and not just for the sake of learning. If you want to crack interviews, this is the area that you should definitely focus on. It is none other than Spark architecture. I know this topic is a little bit complicated, but I will make it very easy to understand. Just stay with me for the next few minutes, and then you will say, okay, now we have understood everything related to Spark architecture. Right now, we don't know what is inside a cluster. So far, we just know that a cluster is a group of machines. You know this part? Very good. So, for this example, let's say we have three machines: machine one, machine two, machine three. Our cluster has just three machines. Let's understand Spark architecture with this example. First of all, let's say I want to process 1 TB of data. And let's say this is you.
[14:11]Don't mind my drawing. Okay, this is you, right? And now you have a cluster. This is a cluster.
[14:20]So, let's call it a cluster, because you just know the cluster, you don't know what is going on inside it, right? So, this is your cluster. Now, every Spark cluster has a cluster manager. This can be Spark's own standalone cluster manager or an external one, such as YARN or Kubernetes. And, as the name suggests, it manages the cluster. Let me tell you what it will do. First of all, it will use one machine, one node, to create the driver program. That is the first thing. It will treat one machine as the driver machine. Just take notes. So, this is our driver program. Now, this is done. One machine is occupied, and we have two machines left, right? Okay. Now what will happen? When we submit the code to the cluster, it first goes to the cluster manager. Then the cluster manager creates this driver program. This you understood, right? Now, this driver program will divide your data processing into small tasks and give the directions for what actually needs to happen. Let me repeat: it will break your data processing command into small tasks and will give instructions for what actually needs to happen. So, it has some information. Now, this information will go to the cluster manager. This information also includes the number of machines required.
[16:47]That is, the number of machines required to actually execute the code. And those machines are called worker nodes. Got it? So, let's say this driver program says, I need both of the remaining machines: I need this machine, and I need this machine as well. Okay. Now, the cluster manager will create a worker node on both machines: worker one and worker two. Now the information will go to these worker nodes, worker one and worker two. Right?
[17:50]Now, the cluster manager's responsibility is done. He says, you gave me 1 TB of data, and I created a driver program for you. The driver program gave me the information to create two worker nodes, and I have created them. The manager's task is done. Now, these two worker nodes will actually process the data. They will do the hard work.
[18:16]Then they will return the result back to the driver program. Now there is no intervention from the cluster manager. He's out. Now these are communicating directly: this worker node is communicating with the driver program, and this worker node is communicating with the driver program. The whole cluster is in sync, and we get the processed data out of it. Understood? This is the easiest example I could have taken. So, if you have understood some of this part, that's really good. When I was learning this part, I watched videos so many times and read books so many times, because it is really confusing. But try to grasp it. Rewatch this part, and trust me, you will get it. I know you have understood at least 50% of it, I am sure about that, but just rewatch this part. So, let me summarize. This is you. You want your Spark cluster to process your data, right? You have a cluster manager. This cluster manager will create one driver program. This driver program will look at your code: okay, you want to do this, you want to do this. It will break that processing into small tasks and provide that information to the cluster manager. That information will also say that we need two worker nodes. The cluster manager will create two worker nodes and say, okay, my task is done, now you can communicate. Now, these worker nodes and the driver program will communicate with each other. One thing to note: the driver program will not do any of the execution. No. It will not process your data. It will just give the directions. It will just orchestrate the tasks, that's it. The hard work will be done by the worker nodes. Simple. And they do not need to use their brain. No. 
Everything will be instructed by the driver program: which task should be done first and what the orchestration flow should be. Everything will be provided by the driver program.
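The flow described above can be mimicked with a small toy model in plain Python. This is only a sketch of the roles, not the Spark API: the class names (`ClusterManager`, `Driver`, `Worker`, `Task`) and machine names are invented for illustration, threads stand in for machines, and the job is just a sum. In real Spark the pieces are more detailed than this.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Task:
    task_id: int
    rows: list


class Worker:
    """Does the actual computation on the slice of data it is handed."""

    def __init__(self, name):
        self.name = name

    def run(self, task):
        return (task.task_id, self.name, sum(task.rows))


class ClusterManager:
    """Hands out free machines as workers, then steps out of the way."""

    def __init__(self, free_machines):
        self.free_machines = list(free_machines)

    def allocate_workers(self, count):
        if count > len(self.free_machines):
            raise ValueError("not enough free machines in the cluster")
        allocated = [Worker(name) for name in self.free_machines[:count]]
        self.free_machines = self.free_machines[count:]
        return allocated


class Driver:
    """Plans and orchestrates the job; it never processes the rows itself."""

    def __init__(self, manager):
        self.manager = manager

    def plan(self, rows, num_tasks):
        # Break the job into small tasks (round-robin split of the rows).
        return [Task(i, rows[i::num_tasks]) for i in range(num_tasks)]

    def run_job(self, rows, num_workers):
        tasks = self.plan(rows, num_workers)
        # Tell the cluster manager how many workers the job needs.
        workers = self.manager.allocate_workers(num_workers)
        # From here on, driver and workers talk to each other directly.
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            futures = [pool.submit(w.run, t) for w, t in zip(workers, tasks)]
            results = [f.result() for f in futures]
        # The driver only combines the partial results the workers send back.
        return sum(partial for _, _, partial in results)


if __name__ == "__main__":
    # Three machines: one hosts the driver, two are left for workers.
    manager = ClusterManager(["machine-2", "machine-3"])
    driver = Driver(manager)
    print(driver.run_job(list(range(1, 101)), num_workers=2))  # 5050
```

Notice that `Driver.run_job` never touches the rows except to split them and to add up what comes back, which is the "gives directions, does not do the hard work" idea from the explanation.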
[20:37]Simple. That was all about Spark architecture. If you haven't understood 100% of it, that is fine. That means you are learning. Now it's time to rewatch this part, and once you have rewatched it, let's start with the next question: what is Databricks? So, we just talked about clusters, right? Now, the best thing is that you do not need to manage these clusters yourself. Databricks is actually doing all the hard work for us. It manages the whole cluster for us. It is a kind of management layer for our Spark clusters. That's why we use Databricks. Nowadays, we do not need to manage clusters ourselves; that is handled by Databricks. That's why we love Databricks, and that's why Databricks is the OG. So, Databricks is managing our cluster. That's good news, right? We just need to configure our cluster. Don't worry, I will show you how you can do it. So, that was all about Databricks. Now you understand what Databricks actually is: it manages and takes care of our clusters. So, now, finally, it's time to talk about account creation. There are basically two ways to create an account. One is Community Edition. What is Community Edition? It is a kind of free account provided by Databricks itself for when you want to learn something, when you just want some exposure to Databricks. 
You can create this account as well, but if you want to crack interviews and show that you are experienced in Databricks, you need to show that you are well versed in working with Databricks along with cloud storage. In the real world, we do not use Community Edition accounts. No. We use cloud accounts, because a company's data, an organization's data, is not available on Community Edition. It is stored in data lakes. It is stored in the storage accounts of cloud providers such as AWS, Azure, and GCP. So, in the real world, we work with Databricks and we connect our Databricks account to a cloud provider. That's why, in this video, I will show you how to create a Databricks account that is linked to a cloud provider, and then we will learn everything against that cloud provider, or let's say cloud storage account, so that you can actually have that exposure. You can actually have that experience.
[23:37]You will see how things work with the cloud, because it is quite different. With Community Edition, you cannot use the storage paths provided by the cloud, and you cannot actually see the data in cloud storage, because that is not shown to you. So, there are many things missing in Community Edition, but I will show you how you can create your free cloud Databricks account, and you will definitely learn a lot. Trust me. This can be the game changer for you, because you will feel confident while sitting in the interview and you can tell them that you are experienced because you know how to work with cloud accounts. If you sit in an interview and say, I have only worked with Community Edition, you are much less likely to be considered. But now they will consider you, because you are watching my video and I am providing the best quality of education. So, for this video, we will use Azure. I love Azure. It doesn't really matter, because you can use any cloud provider. We are learning Databricks. We are not learning Azure, AWS, or GCP. We are learning Databricks along with a cloud storage service, right? And as I have mentioned, Databricks is independent of the cloud. So, when you learn Databricks connected to one cloud platform, most of what you learn carries over to the others. This is the best thing about Databricks. And why have I picked Databricks with Azure? Because you need to configure so much when you work with Azure. You need to use Key Vault. You need to use a service principal. You need to configure secret scopes in Databricks when you want to access data stored in the data lake, and that's what recruiters want you to know. That's what they are looking for. You should be well versed in all the configurations that are required to access the data. 
So that's why I have picked Microsoft Azure and it will give you like a lot of knowledge, a lot of skills.
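As a preview of the kind of configuration mentioned here, the sketch below builds the Spark settings commonly used to read Azure Data Lake Storage Gen2 with a service principal (OAuth). It is a hedged illustration, not the exact setup used later in this video: the helper name `adls_oauth_conf` and all account, scope, and key names are placeholders, and the values would normally come from a Key Vault-backed secret scope rather than being typed into a notebook.

```python
def adls_oauth_conf(storage_account, client_id, client_secret, tenant_id):
    """Build the Spark settings for service-principal access to one ADLS Gen2 account."""
    suffix = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


if __name__ == "__main__":
    # Placeholder values only. Inside a Databricks notebook, the secret would be
    # read from a secret scope instead, for example:
    #   client_secret = dbutils.secrets.get(scope="my-scope", key="sp-client-secret")
    # ("my-scope" and "sp-client-secret" are hypothetical names), and each pair
    # would then be applied with spark.conf.set(key, value).
    conf = adls_oauth_conf("mystorageaccount", "app-id", "app-secret", "tenant-id")
    for key, value in conf.items():
        print(key, "=", value)
```

Keeping the settings in one small function makes it easy to see which three pieces come from the service principal (client ID, client secret, tenant ID) and which piece is just the storage account name.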



