[0:06]Scaling is probably one of the most important aspects of computing, and a common cause of bankruptcy. If our processes use more memory and CPU than they need, they are wasting money or stealing those resources from others, thus making them less efficient. On the other hand, if we give processes less memory and CPU than they need, their performance will be affected negatively, making user experience suck. And that's the good outcome. A much worse situation is that underpowered processes might crash with out-of-memory and other similar exceptions. Hence, the goal is to assign just the right amount of resources to processes. Not too much, not too little, just right. We do that through scaling, and we need to answer three questions. What do we scale? Where do we scale? And who scales? We'll take a quick break for me to introduce you to Port, the sponsor of this video. Actually, I probably do not have to introduce you to Port if you watched other videos on this channel. Nevertheless, here it goes. Port is all about creating developer experiences and delivering them in a portal using the building blocks you need, according to your stack, developer personas, and culture. Now, that was taken from their website. What I can tell you is that Port is, without doubt, my favorite tool for building developer portals or, to be more precise, developer consoles sitting on top of internal developer platforms. It is easy to set up and easy to use. It allows us to create data models and actions that enable developers to manage services provided as part of your platform. It is awesome, and I strongly recommend it. Try it out if you haven't already. Big thanks to Port for sponsoring this video, and now let's go back to the main subject. So, it is about what, where, and who. Let's start with what we scale. We can scale processes, applications, so that they get the right amount of memory and CPU. Remember, it's about finding the right amount of resources. 
Too much is expensive, and not enough means that apps will be slow or will crash. That memory and CPU is provided through nodes, through servers. Those also need to have just the right amount of memory and CPU. If all the servers combined do not have the resources applications need, we will not be able to scale up those apps. On the other hand, if they have more resources than those applications need, we are wasting money. If you're using a hyperscaler like AWS, Azure, or Google Cloud, using more server capacity than needed increases cost for no good reason. Having less means that some applications won't be able to run. All in all, we scale both applications and servers. Applications need resources and servers provide those resources. Ideally, applications should be configured to consume just the right amount of memory and CPU, and servers should provide that same amount of resources. No more, no less. The next is where or, to be more precise, the direction. We can scale stuff by increasing the size or by increasing the number. It can go up and down or left and right. It can be vertical or horizontal. Vertical scaling means that we are adding more memory and CPU. We increase the amount a process or an application is allowed to consume, and we increase resources assigned to servers. We make them taller or we make them shorter. Horizontal scaling, on the other hand, is about increasing the number of something. Instead of increasing or decreasing memory and CPU assigned to an application, we increase or decrease the number of replicas of that application. The same can be said for servers. We can change their size vertically by increasing or decreasing memory and CPU, or we can change the number of servers. A collection of servers is a cluster, and we can change the total amount of resources of a cluster by having bigger servers or by changing the number of servers in the cluster. Finally, we have the who. Who scales applications and servers? 
It can be humans who reconfigure applications to use a different amount of resources. Similarly, humans can create virtual machines with different sizes or add physical memory and CPU to physical servers. Alternatively, we can have machines doing that work. We can have processes that scale applications and virtual servers based on criteria like memory and CPU consumption or, as we will see later, any other data. That's what we call automation. Now, it is not a secret that I'm not very fond of humans. I tend to have a much easier time reasoning with machines. So, you might think that I will advocate for automation of any type of scaling. But, as we will see later, that is not the case. There are situations when scaling is better done manually. All in all, we scale both applications and servers. We scale vertically or horizontally, and scaling itself can be done manually or automatically. So, let's see all those combinations in action. I will use Kubernetes to demonstrate scaling, but the principles should apply to any other type of platform. So, do not run away if Kubernetes is not your thing. You can probably apply the same principles to mainframes as well.
[5:40]Here's the definition of an application. It is a simple Kubernetes Deployment and a Service. The only important thing to note is that I did not specify resource requests and limits. There is no mention of how much memory and CPU it should use. As a side note, do not get scared by the environment variable MEMORY_LEAK_MAX_MEMORY. This is a very simple demo application that hardly uses any memory and CPU when running in normal mode, and that environment variable instructs it to leak some memory so that we have something to work with. In any case, the important thing to note is that the Deployment does not have resource requests and limits. That means that Kubernetes will not be able to make an intelligent decision about where to put it, that it will be the first to be kicked out in case there are not sufficient resources in the cluster, and quite a few other things that might cause issues. I should have done better than that. I should have specified how much memory and CPU that application is expected to use, but I didn't. The main reason why I omitted resources is simple. I have no idea, literally no idea, how much CPU and memory that application should use. We can fix that. We can let Kubernetes manage memory and CPU of that application by using Vertical Pod Autoscaler, or VPA. Here's a simple definition. That is a Vertical Pod Autoscaler definition that targets the silly-demo Deployment. Now, there are a few modes we can apply. We will discuss those later. For now, what matters is that the VPA we are looking at has the update mode set to auto. As a result, VPA will automatically update resource requests and limits of the pods managed by that Deployment. Now, before we apply that manifest, I must stress that VPA is not baked into Kubernetes. It needs to be installed separately, and how we do that depends largely on the provider. Google Cloud GKE, the one I'm using today, has the option to enable VPA when creating or updating clusters. 
In some other cases, like, for example, AWS EKS, it needs to be installed separately. Now, with that note out of the way, let's get back to where we were and apply the VPA and the app. Now we can list all Vertical Pod Autoscalers. The output is depressing. It's not showing much, and that's normal. VPA might need a few minutes to observe resource usage of the target pods before it calculates memory and CPU recommendations. Hence, we should wait for a while before we list all Vertical Pod Autoscalers again. This time, the output is not depressing. We can see some useful data over there. There is close to no CPU usage, but memory is over one gigabyte. We can see more detail if we describe the VPA. The key is the recommendation section that shows us a recommendation for the container silly-demo. Over there, we can see the lower bound, which is the minimum amount of recommended resources, the target, being the recommended values, the uncapped target, representing the most recent recommendation, and, finally, the upper bound, which shows the maximum amount of recommended resources. Since the update mode is set to auto, VPA will be updating the target resource, the Deployment, periodically. However, more often than not, that update is not applied to the pods created before VPA calculated the recommendation. So, to see it in action, we will delete those pods and take a look at the resources of the new pod. We can see that VPA automagically assigned a bit of CPU and over one gigabyte of memory. From now on, Kubernetes will have the information it needs to make the right decisions when scheduling or moving pods related to that application. And now comes the important question. When should we use vertical scaling? Well, I will leave the answer to that question for the end, when we'll discuss when to use each of the scaling models. For now, I will only say that vertical scaling, at least when Kubernetes is concerned, does not work with horizontal scaling, which we will explore in a moment. 
That might change in the future, but for now, just say no. As such, it is useful only for applications that cannot run multiple replicas. So, single-replica applications might be good candidates for VPA, and not much more. Also, since single-replica applications are the best candidates for vertical scaling, and we do not tend to design applications like that anymore, that also means that good candidates for vertical scaling are apps designed a while ago. More often than not, those types of applications should not be restarted more than absolutely necessary. Right now, changes to pod resources result in restarts. So, that's another potentially big issue one should consider. That will change soon, when in-place update of pod resources graduates. Nevertheless, that day is not today, so beware. Now, to be clear, when I say that they do not scale vertically, what I really mean is that they do not scale vertically automatically. We still need to specify resource requests and limits. That should not be optional. Hence, if we should not scale vertically automatically, manual operations are the only alternative, and VPA can help with that as well. So, let's delete what we did and deploy the application as a single replica again. Vertical Pod Autoscaler has four update modes. There is auto, which assigns resource requests on pod creation as well as updates them on existing pods. That's the mode we used so far. Then there is the recreate mode which, at the moment, does the same as auto. It recreates pods when recommended resources change. However, and this is important, once in-place pod updates become GA, auto will be changing resources without recreating pods. So, that's probably the one you should use. Further on, there is the initial mode, which assigns resources only on pod creation. More often than not, that's a much safer option to choose. Instead of restarting pods for the sake of adjusting resources, wait until pods are created. Sooner or later, we have to upgrade apps. 
So, that's typically the opportunity for the VPA to adjust resources. Finally, there is the off mode, which does not change anything. Now, you might be wondering, what's the purpose of a scaling mode that does not do anything? However, that's probably the most useful mode. It allows us to find out which resource amounts we should assign without assigning them. Here's an example. The only difference when compared to the previous VPA is that the update mode is set to off. Hence, it will not manage resources in the target. Now, let's apply it, wait for a few minutes for VPA to calculate the recommendation, and describe it. This is almost the same output as before, except that the update mode is set to off. We're still getting the lower bound, target, uncapped target, and upper bound recommendations. As I already mentioned, the only tangible difference is that the VPA is not applying any changes to the target. Instead, it provides the recommendation that we can use to set memory and CPU resources ourselves. We could accomplish a similar result using the top command. That also gives us CPU and memory. However, the major difference is that the top command gives us current usage, while VPA gives us recommendations based on consumption over a period of time. Now, to be clear, we could get even better results through Prometheus or a similar observability tool. Nevertheless, VPA with the off mode is a more convenient and easier to digest solution. So, let's say that we take the VPA recommendations and incorporate them into the Deployment or whichever resource type we might be using. The end result might look like this. This time, the Deployment has hard-coded resource limits set to a generous one CPU and two gigabytes of memory, and requests set to an underwhelming 50 milliCPU and half a gig of memory. Those numbers should have been taken from the VPA recommendations. However, in this case, I put lower values for requests since we will need them for what's coming next. 
Let's apply it and move on.
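As a rough sketch of the definitions discussed in this section, the Python below builds a VPA manifest and a container resources fragment as plain dictionaries and prints them as JSON, which kubectl apply -f accepts alongside YAML. The function names are made up for illustration, the silly-demo name is taken from the demo, and 512Mi is an assumed reading of "half a gig". It is not the exact manifest shown in the video.

```python
import json

# The four update modes discussed in this section.
VPA_UPDATE_MODES = ("Off", "Initial", "Recreate", "Auto")


def vpa_manifest(deployment: str, update_mode: str = "Off") -> dict:
    """Build a VerticalPodAutoscaler manifest that targets a Deployment.

    "Off" only produces recommendations, "Initial" applies them on pod
    creation, "Recreate" and "Auto" also apply them to existing pods.
    """
    if update_mode not in VPA_UPDATE_MODES:
        raise ValueError(f"unknown update mode: {update_mode}")
    return {
        "apiVersion": "autoscaling.k8s.io/v1",
        "kind": "VerticalPodAutoscaler",
        "metadata": {"name": deployment},
        "spec": {
            "targetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "updatePolicy": {"updateMode": update_mode},
        },
    }


def container_resources() -> dict:
    """Hard-coded resources as described above (values partly assumed).

    Limits: one CPU and two gigabytes of memory.
    Requests: 50 milliCPU and roughly half a gigabyte (512Mi assumed).
    """
    return {
        "limits": {"cpu": "1", "memory": "2Gi"},
        "requests": {"cpu": "50m", "memory": "512Mi"},
    }


if __name__ == "__main__":
    # Recommendation-only VPA, followed by the resources block that would
    # go under spec.template.spec.containers[].resources of the Deployment.
    print(json.dumps(vpa_manifest("silly-demo", "Off"), indent=2))
    print(json.dumps(container_resources(), indent=2))
```

Switching the second argument to "Auto" gives the first variant used in this section, while "Off" gives the recommendation-only variant whose output would then be copied into the Deployment by hand.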
[14:00]Horizontal scaling is a completely different beast than vertical. We are not trying to increase memory and CPU resources assigned to applications, but to increase the number of replicas. In the case of Kubernetes, we accomplish that through Horizontal Pod Autoscaler, or HPA. Here's an example. Just as the VPA we used earlier, this HPA also references the silly-demo Deployment. That's where similarities stop. We are specifying that the application should have a minimum number of replicas set to two and that it should never go above five. The exact number of replicas will be decided based on metrics. In this case, we have two. The first one is based on CPU utilization, where we expect to scale up if it reaches 80%. The second metric is based on memory utilization, also set to 80%. In both cases, the HPA will calculate those percentages by observing the actual resource usage and comparing it with the resource requests we specified in the Deployment. Now, let's apply it and list all Horizontal Pod Autoscalers. Initially, just as with VPA, it does not have data to do anything, except to scale to two pods, since that is the minimum we set. So, let's give it a few moments and output Horizontal Pod Autoscalers again. This time, we can see that the memory usage, the second one, is way above the target 80%. As a result, it scaled immediately to five replicas. We can confirm that by listing all the pods. There's not much more to it. HPA is relatively simple to define and it works fairly well. Just remember not to mix it with VPA. Those two are not aware of each other, so both might be scaling up and down, left and right, without taking into account that the other one already accomplished the goal. There is, however, a better way to scale applications horizontally. And to see it, let's first delete what we did so far and apply the app without the scaler again.
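The HPA described above, and the way it arrives at a replica count, can be sketched in Python as follows. The manifest mirrors the values mentioned (two to five replicas, CPU and memory at 80%), and desired_replicas follows the basic formula from the Kubernetes documentation, ceil(currentReplicas * currentMetric / targetMetric), clamped to the configured range. The function names are hypothetical, and the sketch deliberately ignores the tolerance and stabilization behavior of the real controller.

```python
import math


def hpa_manifest(deployment: str, min_replicas: int = 2,
                 max_replicas: int = 5, target_percent: int = 80) -> dict:
    """Build an autoscaling/v2 HPA with CPU and memory utilization targets."""

    def resource_metric(name: str) -> dict:
        # Utilization is expressed as a percentage of the pod's requests.
        return {
            "type": "Resource",
            "resource": {
                "name": name,
                "target": {
                    "type": "Utilization",
                    "averageUtilization": target_percent,
                },
            },
        }

    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": deployment},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [resource_metric("cpu"), resource_metric("memory")],
        },
    }


def desired_replicas(current_replicas: int, utilization_percent: float,
                     target_percent: float, min_replicas: int,
                     max_replicas: int) -> int:
    """Simplified HPA calculation for a single metric.

    With several metrics, the controller computes a count per metric and
    uses the highest one.
    """
    raw = math.ceil(current_replicas * utilization_percent / target_percent)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Hypothetical numbers: pods using about twice their memory request.
    print(desired_replicas(2, 200, 80, min_replicas=2, max_replicas=5))
```

With the hypothetical 200% memory utilization, two replicas turn into five, which is in line with the jump straight to the maximum seen in the demo. Utilization far below the target is clamped to the minimum of two.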
[16:05]The better way to scale apps horizontally is Kubernetes Event-driven Autoscaling, or KEDA. Here's an example. It is a ScaledObject which, just as with the VPA and HPA manifests we used, is referencing the Deployment silly-demo. It has the minimum number of replicas set to one and the maximum to 20. The difference when compared with the HPA is in the triggers, which can contain any number of scalers. Look at that, for example. We can trigger scaling based on data from ActiveMQ, Apache Kafka, Apache Pulsar, ArangoDB, AWS this and that, and so on and so forth. Almost anything anyone could imagine as the source of data to scale apps is available. In this case, we are using Prometheus running inside the same cluster. We are setting 250 megabytes as the threshold and using a query that retrieves memory usage of the application. Now, to be clear, that's a silly example that does not show KEDA in its full glory, since the result will be almost the same as if we used HPA. I was too lazy to set up a complicated demo, so I'm leaving it to your imagination to run wild and figure out all the use cases we could accomplish using all those scalers. Or, if your imagination is not working today, you might want to check the video over there and see it in action. That's a detailed video about what KEDA is. Now, let's go back, apply the manifest, and see it in action by listing all ScaledObjects. There is not much to see there, since KEDA delegates the actual scaling to the Horizontal Pod Autoscaler. It, in a way, extends the capabilities of HPA by allowing it to use data from a variety of sources. Okay. Let's list all Horizontal Pod Autoscalers then. We can see that an HPA was created by KEDA and instructed to scale the application by providing it with the data which, in this case, is coming from Prometheus. We can confirm that it is working by listing all the pods. Now, that does not look good. 
Some pods are running while others are crashing due to insufficient memory or other reasons, which all boil down to the fact that the cluster does not have the capacity to run 20 replicas of that application. As a matter of fact, quite a few pods are in the pending state, meaning that Kubernetes cannot even try to put them anywhere. The cluster is clearly too small, bringing us to a perfect place to talk about scaling nodes.
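A ScaledObject along the lines of the one described in this section might look roughly like the sketch below, again built as a Python dictionary and printed as JSON. The replica range and the Prometheus trigger type come from the demo. The Prometheus address, the PromQL query, and the interpretation of "250 megabytes" as 250,000,000 bytes are assumptions made for illustration and would need to match the actual cluster.

```python
import json

# Assumed in-cluster Prometheus address; adjust to the real Service.
PROMETHEUS_URL = "http://prometheus-server.monitoring:80"

# Assumed query returning the total memory used by the app's containers.
MEMORY_QUERY = 'sum(container_memory_working_set_bytes{pod=~"silly-demo-.*"})'


def keda_scaled_object(deployment: str, threshold_bytes: int,
                       min_replicas: int = 1, max_replicas: int = 20) -> dict:
    """Build a KEDA ScaledObject with a single Prometheus trigger."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": deployment},
        "spec": {
            # KEDA assumes a Deployment unless kind is given explicitly.
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": min_replicas,
            "maxReplicaCount": max_replicas,
            "triggers": [
                {
                    "type": "prometheus",
                    "metadata": {
                        "serverAddress": PROMETHEUS_URL,
                        # Trigger metadata values are passed as strings.
                        "threshold": str(threshold_bytes),
                        "query": MEMORY_QUERY,
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # 250 megabytes, taken here as 250 * 1000 * 1000 bytes.
    print(json.dumps(keda_scaled_object("silly-demo", 250_000_000), indent=2))
```

More triggers can be appended to the triggers list, which is where KEDA goes beyond the CPU and memory metrics of a plain HPA.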
[18:39]Just as with applications, servers, or nodes, can be scaled vertically or horizontally. Most of the time, scaling nodes vertically does not make sense. If we need a bigger node, we should create a bigger one instead of attempting to increase memory or CPU of an existing node. At least that's the case when running in the cloud. If you're on-prem, you might be used to dynamic scaling of resources using, let's say, VMware, but I'm here to tell you that is just silly. If a node is too small, create a bigger one and move the app that needed more capacity to that node. If the app cannot be moved somewhere else, you should move somewhere else. You should change the company. I can safely say that it is unhealthy for you to stay there. With that note, which probably offended at least a few of you, let's move to the only reasonable direction to scale nodes.
[19:35]Let's take a look at the nodes we are currently running in the cluster. There is only one node, a small one, a tiny one. So, it's no surprise that many of the 20 replicas of the application failed or could not be scheduled. We need more, and the question is, how to get more? One option would be to go to the Google Cloud console, locate the GKE cluster, click a few buttons, fill in a few fields, and that's it. Easy, right? Well, ignoring the discussion of why ClickOps is not the right sect to join, scaling that way is close to impossible simply because the need to increase or decrease the number of nodes can materialize at any moment. Right now, we might have a traffic spike compelling us to scale up, only to realize shortly afterwards that we need less. We would need to have a dedicated team of people watching pods and clicking buttons 24/7. Most of the teams that scale nodes manually end up over-provisioning. They end up with more nodes than they usually need just so that, when the increase in traffic happens, there is enough capacity. Companies with such teams should be proactive and file for bankruptcy right away. If you know that it is coming, why not get it over with sooner rather than prolonging the inevitable? Fortunately, at least when Kubernetes is concerned, horizontal scaling of nodes is a solved problem. We just need to enable it. In Google Cloud GKE, that is fairly easy. All we have to do is update the cluster to tell it to enable autoscaling and specify the minimum and the maximum number of nodes, and that's all it takes, at least when GKE is concerned. Since the cluster was under pressure to expand long before we enabled autoscaling, it should probably take only a few moments until it increases the capacity. We can confirm that's what happened by listing all the nodes. And look at that. There we go. There are now three nodes. 
One, the original, is 42 minutes old, and the other two were created less than a minute ago. The Cluster Autoscaler calculated that three nodes should be enough to run all the pods, including those 20 replicas of the silly-demo app. We can confirm that, just in case, by listing all the pods. Now, don't be scared by seeing that the status of some pods is ContainerStatusUnknown or Error. Those are failures from the past that will go away eventually. What matters is that 20 pods are now running. Here's the best part, though. Just as the number of nodes increased to meet requests for additional resources, it goes down when that same demand decreases. And we can demonstrate that by changing the KEDA ScaledObject to this one. The updated object now has the threshold set to two gigabytes. Given that silly-demo pods consume less than one gigabyte of memory, that should trigger KEDA to scale down the number of pods and to continue scaling down over and over again until, eventually, it reaches only one or two pods. So, let's apply it and wait for a while. I mean, we will not wait for a while. I will fast forward. Scaling down is a more conservative operation, so it tends to take much longer to remove pods than to add them. After a while, let's say five minutes, we can retrieve the pods and see what's going on. We can see even more errors, but that's to be expected since the application we are running, the one that I wrote, does not know how to deal with the SIGTERM signals Kubernetes has been sending in attempts to kill it gracefully. Now, that would be a subject of a completely different video, which I can make if you're interested. Just let me know in the comments. What matters is that those new failures are not caused by insufficient capacity since, as we already saw, Cluster Autoscaler created a sufficient number of nodes. If we ignore those errors, we can observe that the number of running pods has now dropped to 15. 
That means that KEDA detected that the actual memory usage is way below the threshold and started scaling down. Since, as I mentioned, going down is more conservative, we should wait a while longer. Some 20 minutes should do for what I want to show. So, let's fast forward and output pods again. The output now shows that only two pods are running. Eventually, it will drop to one, but there's probably no need to wait for that. It's clear that horizontal scaling of applications goes in both directions. Nevertheless, I did not show all that for the sake of confirming that apps go up and down, but rather to confirm that Cluster Autoscaler also works in both directions. With a significantly smaller number of pods in the cluster, the number of nodes should also drop. Let's check that out. There we go. It dropped from three to two nodes. Mission successful. Money saved. No one goes bankrupt. At least not due to an unreasonable bill from Google.
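Enabling node autoscaling on GKE, as described in this section, comes down to a single cluster update. The Python sketch below only assembles the gcloud command so that it can be inspected before being run. The flags (--enable-autoscaling, --min-nodes, --max-nodes, --zone) are real gcloud flags, while the cluster name, zone, and node counts are hypothetical placeholders since the video does not state them.

```python
import shlex


def gke_enable_autoscaling(cluster: str, min_nodes: int, max_nodes: int,
                           zone: str) -> list[str]:
    """Assemble the gcloud command that enables node autoscaling.

    Without --node-pool, gcloud applies the change to the default pool.
    """
    if not 0 <= min_nodes <= max_nodes:
        raise ValueError("expected 0 <= min_nodes <= max_nodes")
    return [
        "gcloud", "container", "clusters", "update", cluster,
        "--enable-autoscaling",
        "--min-nodes", str(min_nodes),
        "--max-nodes", str(max_nodes),
        "--zone", zone,
    ]


if __name__ == "__main__":
    # Hypothetical cluster name, zone, and node range.
    command = gke_enable_autoscaling("demo", 1, 5, "us-east1-b")
    print(shlex.join(command))
    # To execute it: subprocess.run(command, check=True)
```

Returning the arguments as a list keeps the command safe to hand to subprocess.run without going through a shell, and printing it first makes it easy to review what would be changed.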
[24:26]By now, you probably understand, at least at a high level, how scaling works, what to do, and when to do it. Nevertheless, a short set of instructions might be useful. Let's start with apps. Use vertical scaling with legacy apps that cannot run in more than one replica. For everything else, Vertical Pod Autoscaler is useful mostly to provide recommendations for how much memory and CPU to assign to pods manually. The biggest obstacle to wider adoption of vertical scalers, at least when Kubernetes is concerned, lies in conflicts with horizontal scalers. That might change soon, given that quite a few companies, especially those associated with cost reduction, are working on better ways to scale vertically. Nevertheless, none of those I've seen so far are good enough to be used for anything but recommendations that should be applied manually. While scaling apps vertically has limited value, horizontal scaling is a must for all applications that can run multiple replicas and do not get penalized by being dynamic. Those are typically stateless apps. More often than not, horizontal scaling should be part of every app and, when Kubernetes is concerned, the main question is whether to use Horizontal Pod Autoscaler or KEDA. The former is more than fine when scaling decisions are made based on memory and CPU, while KEDA shines for scaling based on any other criteria. If we turn our attention to servers, vertical scaling is, most of the time, just silly. Just as vertical scaling of apps results in recreation of pods, at least until in-place updates reach GA, vertical scaling of servers results in creation of new servers, at least when VMs in hyperscalers are concerned. If new nodes are created anyway, we can just as well scale horizontally. On the other hand, I cannot imagine a single reason why anyone would not always, and I repeat, always use horizontal scaling for the nodes. It's a no-brainer. Enable Cluster Autoscaler right away. Just do it. Thank you for watching. 
See you in the next one. Cheers.



