Thumbnail for SQL Table Partitioning (Visually Explained) | Optimize Big Table Performance | #SQL Course 42 by Data with Baraa

SQL Table Partitioning (Visually Explained) | Optimize Big Table Performance | #SQL Course 42

Data with Baraa

33m 22s6,215 words~32 min read
YouTube auto captions
Transcript source

YouTube auto captions

This transcript was extracted from YouTube's auto-generated caption track. The transcript below is server-rendered so it can be read, searched, cited, and shared without opening the original YouTube player.

Timestamped outline
Pull quotes
[0:00]So, today we're going to talk about amazing technique on how to optimize the performance of big tables.
[0:00]So, as usual, we're going to deep dive into what is exactly partitions and why we actually need it.
[0:00]And then we're going to go through the whole process step by step on creating a partition table.
[0:32]It's a technique in order to divide a large table into smaller pieces, and each piece we call it a partition.
Use this transcript
Related transcript hubs

[0:00]Hey friends. So, today we're going to talk about amazing technique on how to optimize the performance of big tables. So, the indexing is not the only method that we have. We have amazing one called SQL partitioning. So, as usual, we're going to deep dive into what is exactly partitions and why we actually need it. And then we're going to go through the whole process step by step on creating a partition table. So, this one going to be really fun. Let's start now with the first question. What is SQL partitioning?

[0:32]All right, so what is SQL partitioning? It's a technique in order to divide a large table into smaller pieces, and each piece we call it a partition. Well, this sounds like we are dividing one big table into smaller tables, but it's not like that. We are just dividing one table into smaller partitions, so we're going to see it in the database still as one solid table, but behind the scenes, it is split into multiple partitions. So, now let's going to understand what this means. Okay, so now let's say that you have a table at your database, and over the time, this table is getting bigger and bigger, where you have like hundreds of millions of rows. Now, once you have such a big table, what can happen? Everything going to be slow. So, for example, if we are reading the table and the execution plan is doing full scan of the table, this can take SQL a long time until all the rows are fetched. And if you decide to make like an index for this table, what going to happen, SQL going to go and build a very big B-tree index, where there are a lot of branches and files, and so on. And having a big index is not always a good thing, because if you do operations like delete rows, update rows, or inserting rows, these operations going to need a long time to process. So, having a big index doesn't mean that you can have a good performance for your big table. So, that's means having a big table is problematic because everything going to be slow. So, now what we can do in order to optimize the performance of this big table? Well, we can use SQL partitioning, and in order to do that, we have to understand the behavior and the transactions that are happening on our table. And what usually happened is that the table grows over the time. So, you can have like a subset of data that belongs to 2023, and another one that is created and updated in 2024, and then you have something like more current in 2025. So, that means we have like in our table, old data, and as well new data. And we usually interact with the new data more often than the old data. So, maybe for example, for 2023, there is like only one read transaction, and for the data in 2024, we have done like two reads and one write. So, it is a little bit more than 2023, but for the new data, for the current year, there will be heavy transactions. So, we can have a lot of reads, a lot of writes, we are updating, inserting, reading. So, a lot of things are going on for the new data. So, that means we are accessing frequently the big table, only to interact with the new data, and we rarely need the old data. So, what we can do? We can go and divide this big table, and we usually divide it by like a date. So, that's means we can go and split this table by the year, and we put each year in one partition. So, at the end, we're going to have like three partitions. And now it's really important to understand that those are three partitions, they are not three tables. So, that means at the client side, the users can see only one table, but behind the scenes, we have like three partitions. Now, let's say that you have a query in order to read the data from 2025. And now what can happen, SQL will not go and scan all the data from the table. It going to go and only target one partition, the 2025. So, that's means SQL is only scanning the relevant informations, the relevant partition, and not the entire table. And now we have another benefit of having partitions. Let's say that you are using a modern database. And normally they support parallel processing. So, if you have the infrastructure for that, what going to happen, the database engine can process each partition independently and parallelly. So, whether you are reading or writing data, so what going to happen, SQL going to process your queries parallelly, which of course going to reduce the overall execution time. So, that means if you have a modern infrastructure, like maybe for example, the Azure synapse and so on, go with the partitions. Because the partition then could be stored in different servers, and this helps of course, the SQL engine to use all the resources at once. So, that means partitions allows scalability, and as well parallel processing. Partitions going to make the indexing more efficient. So, instead of having one very big index for the whole table, if you put an index on a partition table, what going to happen, each partition going to get its own index. Which means the size of the indexes going to be smaller, and of course, this helps a lot with searching for data, or as well extending the index itself. So, for example, if we are inserting data to the partition 2025, the SQL will not go and change anything on the other indexes. It's going to go and only change the index of the partition 2025. So, that you can see the power of the partitioning. It improves significantly the performance of your table, whether you are reading or writing data to this big table. So, this is what we mean with partitioning, and why we need it.

[5:29]All right, friends, so now we're going to go to the process of creating partitions in SQL. At the start, it might sounds a little bit complicated, but we're going to do it step by step, and I have a sketch for that. So, we have like four steps because we have in the database like multiple layers. So, let's see how we can do that. Let's go. So, the first step is, we're going to go and define the partition function. So, what is that? We're going to go and define here in the function the logic on how to divide your data into partitions. And this going to be based on the partition key, like column, region, and so on. But the most famous one is the date, and that's because our tables like get bigger over the time.

[6:17]And there are like multiple types of functions. We're going to focus on the range function. So, how it going to work? We're going to have like a range of dates, and then we have to define like boundary values. And let's say that I would like to make a partition for each year. And in order to do that, we have to define the partition boundary. So, it is like a value. The boundary of the years could be like the first day of the year or the last day of the year. So, here in this example, we're going to take for the boundary the last day of the year. So, the last day of 2023, 2024 and 2025. So, we call those values the boundary of our function. Now, between the boundaries, we're going to have our partitions. So, for examples, all the rows for 2025 and earlier years, it's going to be the partition one. So, between the boundary and everything before is one partition, and after that, between the two boundaries, we have partition two. So, this partition going to be for all rows of 2024, and then we have another section, the partition three, where we have all rows of 2025, and then between the last boundary and everything onwards, it's going to be partition four. And here we're going to have all the rows from 2026 onwards. So, that's we have now a logic. We are telling SQL how to divide our data into multiple partitions. And here there is like two methods, the left and the right. So, what are those two methods? So, again, we have our boundary, and now the big question, to which partition does this boundary belongs to? Is it partition one or partition two? And that's why we have those two methods. If you say it is left, that's mean the boundary belongs to the partition number one. But in the other hand, if you say it is right, then the boundary going to be part and belongs to the partition number two. So, you have to decide whether the boundaries belongs to the left partition or to the right partition. And with that, in the partition one, we're going to have all the rows of 2023, including the last day of 2023. Because in the partition two, we only focus on 2024. So, it's just the boundary belongs to the left partition, it's very simple. Now, let's go and implement that in SQL. So, let's do it, the syntax is very simple. We're going to say create partition function, and then we have to give it a name. So, it's going to be partition by year since we are dividing the data by the year. And after that, we have to define the data type. So, we are splitting the data by a date. So, it's going to be date. And after that, we have to define the partition function type. So, in our example, we are using the range. And now comes the very important step, we have to define the boundaries.

[9:01]So, we're going to say for values, and we're going to enter here three boundaries like in our example. For each year, we're going to define a date, so 2023, and the last day of the year. Same goes for 2024, and for the last one, 2025. So, with that, we have defined the logic, the range, we have defined the boundaries, and we tell the SQL, the boundaries are a date. So, let's go and execute our function. Okay, so that's it, as you can see, it's very simple. We just created a function that splits the data by the date, using the range left. And of course, this function is not yet attached to any tables or anything, it is just a logic that is stored in the database. All right, so now since our partition function is stored inside the database, we will have metadata about those functions stored in the system schema. So, we have there a dedicated table called partition functions, and there we can find information about all functions that we have inside our database. So, let's go and execute it. And as you can see, we find now our new created partition function. So, partition by year, it is range, and it has an ID and so on. And I really recommend you to check it before creating any new partition function, maybe you have already one in the project.

[10:25]Okay, so now let's check the next step in our process. We're going to go and build now the file groups. So, what is a file group? It is like a logical container of one or more data files. So, it's very simple, it's like folders. We're going to go and create now like multiple folders, so later we can insert inside them files. And this is really nice because it gives us like freedom and flexibility. Where we can go and decide how the data files are organized for each partition. So, what we usually do, we go and create for each partition a file group. So, we're going to have like four folders, or four file groups, for 2023, 2024 and so on. So, now let's go back to SQL in order to do that. All right, so now let's go and create those file groups. The syntax is very simple. So, it's going to say alter database, and now we have to tell the database where these file groups should be stored. In which database? So, I'm going to stay with the SalesDB. And then we have to tell, okay, add file group, and after that we have to define the name of the file group. So, the first one going to be for 2023. So, the syntax is very simple. Let's go and do it for the other years. So, we need 2024, 5 and 6. Okay, so that's all, we can just select everything and execute. So, as you can see, it's very simple. We have just created four file groups, and they are empty. So, we don't have anything inside those containers. Now, let's say that you have made mistake with the namings and so on, and you would like to drop one of them. So, the syntax is as well very easy. So, it's going to stay alter database SalesDB and instead of add, you're going to say remove. So, once you execute this, this file group will be dropped. But we need it. So, let's go and recreate it. Now, as usual, after creating stuff, let's check whether everything is created correctly, and whether we have any duplicate or anything wrong. So, with that, we have as well a file group table inside the system schema, and let's go and execute it. So, I'm just filtering with the type FG for file group. So, let's execute it. And now we can see in our database, we have four file groups. Now, four of those file groups, we just created it, right? So, we have the 2023, 24 and so on. But we have something called primary file group. This is the default file group that is created for each database. So, it is a container for all data files in your database. And as you can see, we have here a flag saying, it is a default. So, it's a default, and we have it one, and for the rest, they are not the defaults. So, this is really nice to see all the file groups inside your database to check that you don't have duplicate and so on.

[13:07]Okay, now moving on to the third step, where things going to get more physically. So, far we have like a function, the file group, and all those stuff are logical stuff, we don't have data yet. In order to have data, we have to go and create data files. So, as we learned before, data files going to contain our actual data, and they're going to be stored physically in the database. So, you can go and assign for each file group like one or multiple data files, and the file format here is .NDF. It is secondary data files. We have like primary and secondary. But in the partitions, we usually go with this format, the NDF. So, again, the file groups are a logical containers, and the data files are physical files, where our actual data going to be stored physically in these files. So, now let's go back to SQL in order to create some data files. Okay, so now we're going to come to the little bit annoying part where we're going to go and create files, but the syntax is as well very simple. So, it's going to say the same things, alter database, and our database is SalesDB. And then this time I'm going to say add file. And now we have to give SQL not only the name, but the physical place of the files. So, let's do it step by step. We're going to open new two parentheses. So, first we have to define for SQL the logical name. It is not the file name, it is the logical name of the file. So, let's give it a name, for example, P 2023. And then comma. So, this is the logical name. And now the next one is, we're going to give the physical name of the file together with the path. So, we're going to say file name equal, and now we have to define for SQL the complete path of the file. In SQL server, there is like a default path where the data going to be stored, and I'm going to go and use the same path. And the path really depends on the version and as well the type of the SQL server that you are using. So, for the current version that I'm using for this tutorial, we can find it over here in this path. So, if you go to the C, then program files, Microsoft SQL server, MSSQL, and the version for me is 16, SQL Express, and then inside MSSQL data, and so on. So, we're going to go inside this folder. And now we can see over here all the database files. So, we can see, for example, here, the SalesDB, the SalesDB logs, and we have here the AdventureWorks and so on. You're going to see all the files of your database. And what we're going to do? We're going to put as well our partitions files inside the default folder. But for real project, you have to ask the database administrators about the exact location where you can put your partitions. So, let's go back to SQL, and I'm going to put this path over here. And then we have to specify the file name. So, it's going to be P 2023. And now we have to specify the file name. So, NDF. And with that, we have now a complete path with the file name. So, we are almost there, but we are not done yet. We have to tell SQL where to put this file. In which container, in which file group. So, we're going to go over here and we're going to say to file group. And here make sure to select the correct one. So, FG 2023. All right, so that's all. Let's go and execute it. So, let's do it. And with that, we have created a file inside a file group. I will not be creating like multiple files inside one file group, it's going to be like one to one. So, now what we're going to do? We're going to go and create the other files for each file group, for each year. So, we just have to copy and paste, and just change the names. So, for 2024, it's going to be like this. So, that's it. And the same thing for 2025.

[16:42]And for the last one, 2026. And we can go and select now everything and execute it. So, that's it. With that, we have created now four different files, and we have mapped as well each file to the correct file group. And I usually don't create like a lot of files. I just create like one for each year, or maybe for a bunch of years, so you don't have to go and make for each day like a partition or something like that. Okay, as usual, after creating stuff, we have to go and check the metadata. Now, I have here prepared a query, where we query the file groups together with the files. So, all the data information could be found inside the table master files, and then we join those tables and select our database. So, let's go and query this one. And now you're going to get a list of all files inside your database. So, we see over here, we have the primary for the database itself, and you can see the path of the file, and as well the size of it. And we can see over here, we have four files, and the file group that is assigned to, and the complete path of each file. And you can monitor over here, of course, how the size of each file is growing over the time. Maybe one of them is getting like really big, and then you can think about, let's go and split it to multiple files. So, that's about how to create data files.

[18:08]All right, so now we're going to move to the last step, where we're going to go and define the function scheme. Now, if you have a look to this picture, you see that there is something missing. From one side, we have defined how to divide our data into multiple partitions, and from the other side, we have prepared all the files and the file groups and so on. And now what is missing is the connection, how to connect those partitions to the file groups. And we can do that by using the partition scheme. So, all what we are doing now is just defining which partition belongs to which file group. So, for example, we're going to go and map the partition one to the file group 2023. And with that, all the data of 2023 and earlier going to go to the file group 2023. And of course, we have to go and map each partition to a file group. If you don't do that, you will get error in SQL. And once we build the partition scheme, then we're going to have all the components ready in order to have a partition table. So, now let's have a quick summarize. The partition function going to decide on how to split your data into multiple partitions. The partition scheme going to go and map the partitions to a file group, and the file groups are like folders in order to organize your files. And each file group has one or more data files, where your actual data going to be stored physically in these files. At the start, it might be confusing, but now as you understand each layer, then it's going to make it easier for you to build partitions. So, now let's go back to SQL in order to build the partition scheme. Okay, so now we have the easiest part, where we're going to connect everything together. So, the syntax is as well very simple. It's going to say create partition scheme, and now we have to give it a name. So, let's go with like scheme partition by year. And now we have to map the partition function with the file groups. So, first we're going to say as, and then we define here the partition function. So, as partition, and now we need the partition function that we have created. So, as partition partition by year. And then after that, we're going to map it to the file groups. And here it is very important to map it in the correct order. So, the order is very important. So, the first one was file group 2023, the second one, 2024. And we have 2025, and the last one, 2026. So, again, the order is very important. And as well, it's going to be a little bit tricky. So, sometimes, as you are creating like the functions, maybe you make a mistake that you don't know how much partitions SQL going to create. Like in our example, we have three boundaries, and SQL going to create four partitions.

[20:55]So, it happens sometimes that you think, okay, I have three boundaries, and then I'm going to get three partitions, which is not really correct. So, for example, let me just remove one of those. And let's say I have only three file groups. And let's go and execute this one over here. Now, we are getting error, it says the partition function generates more partitions than the file groups. And that is really correct, because our definition of the logic going to split the data into four partitions. And now we are giving SQL only three file groups, which is not correct. So, we have to go and add the last one. And one more thing, SQL will not go and check whether you are mapping things correctly to the file groups, because it doesn't really care about the naming of those files. So, for example, if you go and put this one at the end, what can happen, it's going to be a big problem. So, all the years of 2023 going to be stored inside 2024, 2024 going to be in 2025. So, everything going to be mixed, and SQL can do it like you tell it. So, that's why make sure you have the correct sort. So, that's it. Let's go and create our scheme. So, it is working. This is very simple. We just mapped now the partitions to the file groups. And as usual, we check things after creating. And I have prepared here like a really nice query from the metadata in order to see the whole thing, the functions, the file groups, the schemes. You can of course, and add to it the data files, but I'm just going to stick with this over here. So, again, SQL server, we have a dedicated table for the partition schemes. Then I'm just joining it with the functions, and then with the destination data spaces in order to get the partition number and the file groups. So, let's go and execute it. And now we can see very nicely the scheme that we have created, and the function name of the partition, and then we can see the partition number and the file group name. So, we can see how things are mapped together. So, if you get it like this, then so far, everything is good.

[22:48]All right, so so far what we have done, we have prepared all the layers. So, we have the setup is ready to be used in any table. So, we have the functions, the files, the file groups and scheme and everything is ready, but still we are not using it. The logic just exists and the files are empty. So, now what we're going to do? We're going to go and create a table, but not a normal one, a partitioned table. So, let's go and do that, it's very simple as well. So, create table, and we have to give it a name. So, let's get it as well in the schema. Sales orders, and I'm just going to give it the name partitions. So, now we have just to define like few columns inside this table. So, let's get an order ID, and data type int. And let's go and get an order date, with the data type date. And maybe just one more called sales and the data type int. So, this is very normal table that we create in databases. But it's still not yet partitioned. Now, in order to use everything that we have defined, we're going to go do the following. We're going to say on, and now we have to tell SQL only the name of the partition scheme. So, everything else is like connected and mapped together, because the scheme is mapping the function with the file groups, the file groups are mapped to the data files, and everything is like connected together. And here in the table, we have just to give the name of the scheme. So, the name of the partition scheme is scheme partition by year. And now it's very important to give a column. And since the whole logic and the function is based on a date, we cannot go and specify here, for example, the order ID or sales, because it makes no sense. We're going to go and pick the order date and put it over here. And with that, we have created a partition table. So, now what we're going to do? We're going to go and start inserting data to our table. So, let's go and do that. We're going to say insert into sales order partition, and we're going to pick values like this. So, one, and then let's get any date, like 2023, like, for example, May, the middle of the month, and the sales could be anything like, let's say 100. So, let's go and execute this. And let's go query our table. So, it is this one over here. All right, so now we have one record inside our partition table. And now the big question is, in which partition, in which data file, did SQL store this record? So, we have to test whether everything is working fine. So, in order to do that, I have prepared as well a query. So, we are again asking the table partitions with the destination data spaces, where we're going to get the number of rows in each partition, and then we have the file group. And we are focusing on our table Orders partitioned. So, let's go and execute this one. And now we can see very easily, we have the four partitions, our new record is inserted in the correct place, in 2023 file group, and in the correct partition, which is correct. Now, let's go and add more records. I'm just going to go and duplicate it, record number two. And I'm just going to pick a date in 2024, and this one going to be like 20. Just change the value. So, 50. Let's go and execute it. And now we have a second row inside our table. And again, the big question is, whether it is working. So, let's go and execute this again. And now we can see our record is inserted in the partition two in the file group 2024, which is correct. Now, let's go and check the boundaries whether it is working correctly. So, I'm going to go and here, in the third row, I'm going to say the last day of 2025. So, it's going to be month 12, and the last day. So, 20. Let's go and insert it and check our table. So, we have a new record. And now, let's go and check my expectation here. That this row going to be inserted in the file group 2025. So, let's go and execute, and that is correct. As you can see, the record is inserted in the correct partition. And this is really important to test the boundaries whether they are working correctly. Because it's a little bit tricky. You have this range left, right, and boundaries, and so on. So, you can do it like this to check whether the expectation of your logic is working correctly. And the last one, I'm just going to do it very fast. So, let's do it. 2026, and I'm going to pick the first day of this year. So, let's go and insert it. And now what is the expectation? I think it is pretty simple. So, let's go and query, and the first day of this year is inserted in the partition number four. So, I can say everything is working correctly. If you get it like this, then you have created successfully a partition table. And you have prepared all the layers of this partition correctly. I know this is a lot of work, but to be honest, it is fun. Because for the first time in database you feel like you are controlling stuff. Usually in database everything like behind the scenes, and you don't know exactly where the files are stored of your tables and so on. There is a lot of abstraction in databases, but here, like we are getting deep in databases, and we are controlling and managing all those files, which is sometimes it's nice to have this freedom and flexibility.

[28:29]All right, so now let's have a quick summarize how everything is connected together. So, we have a table, and then we specify for SQL that is connected to a partition scheme, and in the partition scheme we have everything connected. It is linked to a specific partition function, and there we have the partitions. And at the same time, it is connected to file groups. And the file groups are connected to the data files. So, as you can see, all those layers and elements are connected together. Now, let's see how this works. So, we have inserted the last day of 2025. And now the first thing that's going to happen, the partition function going to decide to which partition it belongs. So, as you can see, it is a boundary value, and since we have defined it as a left, it going to target the left partition, the partition three. And then the partition scheme going to connect it to the right file group, and in this scenario, it's going to be the file group 2025. And we have here only one file, so it's going to as well go to the correct data file. And in this file, the SQL going to store this row. So, it is pretty easy.

[29:32]And now we come to very important part where we're going to understand how the partitions are really improving the performance of my query. And of course, we can do that by checking the execution plan. So, now in order to compare like the behavior with and without the partition, what we have to do is to create a mirror table without partition. So, we have our table here, the partitioned one. What I'm just going to do, I will go over here and say into sales orders. And we're going to call it sales orders underscore no partition. So, we are taking the data and the structure from the orders partition, and of course, it will not be partitioned. So, let's go and execute it. Now, if you go over here, we can see that we have two tables. We have the no partition and the partitioned one. So, now what we're going to do? We're going to write a query on both tables, and then compare the execution plan. So, first, let's start with the no partition. So, from, and now in order to see the effect of the partition, what we can do? We're going to say where order date equal to, and now we just going to pick a value like 2026, the 1st of January.

[30:35]So, let's go and query it. And we're going to do the same thing in new query, but this time for the partitioned. So, partitioned. Now, in order to see the execution plan, make sure to activate it. So, we go to the action bar over here, and we're going to say include the actual execution plan. So, let's click on it, and execute, and with that, we have here an execution plan. And let's do the same thing for the no partitions. So, execute. And we have here execution plan. So, now let's check what do we have in the execution plan. We're going to focus on this one over here. So, right click on it, and then go to properties. And now we can see a lot of details about the execution plan, but what is interesting is the number of rows. So, as you can see, we are reading four rows, that's means the whole table. And of course, we have here the CPU and the other costs. Now, let's go and check the partition. So, let's click over here. So, now if you check over here, you can see that the total number of rows is one. So, SQL didn't read all four rows. It reads only row. And that's because we have in this partition only one row. And as you can see, the number of partitions that is used is as well only one.

[31:44]So, as you can see, it's worth the efforts. We have optimized our queries, and this has a great impact on big tables.

[32:30]The number of resources and the number of reads going to be reduced massively. All right, friends, so that's all about the SQL partitioning. It is amazing for fact tables and large tables. And now in the next tutorial, what I have done, I have collected all the best practices and tips and tricks about how to optimize the performance of SQL through all my projects in the last 15 years. So, it's going to be amazing, don't miss that. If you like this video and you want me to create more content like this, I'm going to really appreciate it if you support the channel by subscribing, liking, sharing, commenting. All those stuff going to help the channel with the YouTube algorithm and as well my content going to reach to the others. So, thank you so much for watching and I will see you in the next tutorial. Bye.

Need another transcript?

Paste any YouTube URL to get a clean transcript in seconds.

Get a Transcript