Manage Checkmk IT monitoring in Ansible

Christian Lempa

34m 44s | 5,943 words | ~30 min read
Auto-Generated

[0:00]Hey everybody, this is Christian, and today I'm going to show you how to automate the IT monitoring platform Checkmk by using Ansible. If you remember, in my first Checkmk tutorial we covered some of the basics of installing the Checkmk monitoring solution in my homelab. We added some hosts and activated some services, and while that approach was really great for a smaller setup like my homelab, some of you reached out to me asking how to actually handle this at scale. If you have a larger infrastructure with many different servers and things you want to monitor, installing and configuring everything manually becomes a huge pain and a lot of work. That's exactly why in this video we are taking a look at the Checkmk Ansible collection. For this video, I've created some Ansible playbooks that take everything we did in the first tutorial and turn it into a fully automated process that you can easily replicate for your entire infrastructure. Trust me, once you see how easy it is to automate this entire process, you will never want to go back to clicking through the web interface again. So let's dive in, and of course, a huge thanks goes out to Checkmk for sponsoring this video.

Alright guys, just a few words before we start with this tutorial. This here is the GitHub page for the Checkmk Ansible collection. You can see that it's using the Checkmk REST API in the background, which means it shares a lot of similarities with the API when it comes to the fields, values and structure of the requests. As you will see later in this tutorial, that is pretty useful. It's also worth saying that this collection is an open source community project of Checkmk, so there is no commercial support for it and Checkmk does not guarantee proper functionality. On the other hand, everybody is welcome to submit issues or pull requests for changes or new features to this community project.
And I can also say from my personal experience that I didn't run into any issues. Overall I had a great experience with the Ansible collection for Checkmk, and the team was really super responsive and helpful whenever I had questions for reviewing this video. By the way, although I'm primarily using the examples from the Python files in this video, there's also more structured documentation on the Ansible Galaxy website. And if you're not familiar with Checkmk or Ansible at all, then, well, watch my video tutorials on YouTube. There you will learn all of the basics and the foundation you need to follow this one. And while you're at it, also make sure to give it a like and subscribe to the channel, that would be amazing. Also, you will find all the Ansible playbooks as templates in my boilerplates repository on GitHub, so that you don't need to write down all the stuff that we are covering here. And of course, all the links are in the description of this video.

Of course, before we can use this collection, we first have to install it on our local device. So just scroll down to this section here, install this collection, copy this command and paste it into the terminal on your local device. Also, something you should keep in mind: if you're updating your Checkmk instance to a newer version, you should regularly update the Ansible collection to the latest version as well. Here on the GitHub page, you can see there is a new release that was just published last week. You can basically use the same command with the upgrade parameter, and then it updates the collection to the latest version.

And the first example of automating Checkmk that I want to show you is how to activate changes on the site. When you scroll down, this is also the first example that you find on the readme page.
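As a sketch of the install step described above, the collection can also be pinned in a requirements file instead of being installed by a one-off command. The file name `requirements.yml` is an assumption for this example; `checkmk.general` is the collection namespace the module names in this tutorial belong to.

```yaml
# requirements.yml (file name chosen for this sketch)
# Lists the Ansible collections this project depends on.
collections:
  - name: checkmk.general
```

This file would be installed with `ansible-galaxy collection install -r requirements.yml`, and adding `--upgrade` to the same command pulls the latest release.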
If you remember from my first video tutorial, every change that you make to the configuration of your site in Checkmk first has to be activated before it affects the active monitoring configuration. So I would always have to click on this button when I change the configuration, and of course, when we're automating the entire configuration in Ansible, we also want an automated way to activate these changes. We will just start by using this template here, and I want to work in a new project directory. Let's go to my tutorial section and create a new directory, Ansible Checkmk, I think that's fine. Let's cd into that and open it in VS Code.

So first of all, I want to create a few files here, starting with the ansible.cfg. Here you can add some general configuration settings for Ansible; you probably know this from my other tutorials, so I won't go into too much detail. Basically I'm just setting my remote user, and I set an inventory file that I need to create here. In this inventory I put all of the servers in my homelab or in my infrastructure that I want to use in the Ansible playbooks. And then we will start by creating a simple playbook, for example, let's name this activate-changes.yaml, where we can paste this example template.

Okay, of course, before we can make changes to our Checkmk platform, we first have to fill in this information here: the server URL, the site that we're using, and we also have to add credentials for the administrative user and its secret. Now, it is generally recommended that you don't use the administrative user that you use for your manual configuration or for logging into the monitoring platform. It is always recommended to create a separate user for the Ansible automation, but for simplicity, I will just skip that and use my basic administrative user here. I still don't want to put my secret or my password of the automation user in clear text here in the playbooks.
One way that I just recently discovered is using Ansible Vault. Ansible Vault will create an encrypted secrets file where I can put in the username and password. It is actually pretty simple: you just need to enter ansible-vault create and then the name of a secrets file, for example, secrets.yaml, I think that's totally fine. Then we need to put in a vault password that protects access to this file. I'm just putting in my password, and then we want to define these two variables here. So I'm just copying the automation user and the automation secret. Of course, I'm changing these values now to my real administrative user and password. Again, for simplicity, I will put my admin user in here, and okay, now we have saved this file. You can see it here, but when you try to open it, you can see it is encrypted; without the password, you can't see anything in here.

The next thing that I want to change is the server URL. Depending on how you installed Checkmk, whether you followed my tutorial or used a different way, you should have a URL where you usually log into your Checkmk platform. So I can just copy this here and put it into a separate variable. At the beginning, I will create a new section that is called vars, and here I will put in a server URL. And the site I will also define here; this is Checkmk. And then we can basically just replace these values with the variable placeholders. The way you do this is also simple: just add two curly brackets and fill in the variable names that you've used above. At the beginning, I will also add a new section, vars_files, and add the secrets.yaml file. So just replace the credentials as well. And of course, here in the sites definition, you could also specify if you only want to activate the changes for a specific site. For simplicity, I will also replace it with my variable here. Alright, so just a few things that I still want to change. First of all, I want to give this playbook a reasonable name.
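For illustration, the secrets file described above might contain something like the following before Ansible Vault encrypts it. The variable names follow the two credentials mentioned in the video; both values are placeholders, not real credentials.

```yaml
# secrets.yaml as it looks in the editor opened by `ansible-vault create secrets.yaml`.
# On disk the file is stored encrypted; the values below are placeholders.
automation_user: "automation"
automation_secret: "replace-with-your-secret"
```

Once saved, the file can be changed later with `ansible-vault edit secrets.yaml`, which asks for the vault password again.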
And another thing that I want to change here is I don't want to activate the changes from the hosts. So there are several ways in Ansible how you can do this.

[8:08]If you are running some queries or requests from your local device, like this is a basic API request in the background, it doesn't have to be initiated from the hosts in the inventory file. Instead, it can be initiated from my local device. So therefore I'm changing the hosts to localhost.
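Putting the pieces described so far together, the activation playbook could look roughly like this minimal sketch. The server URL and site name are placeholders, and `gather_facts: false` is an addition of this sketch (not mentioned in the video) to skip fact collection for a pure API call.

```yaml
---
# activate-changes.yaml: a sketch, assuming secrets.yaml defines
# automation_user and automation_secret.
- name: Activate pending Checkmk changes
  hosts: localhost          # the API request is sent from the local machine
  gather_facts: false       # assumption: facts are not needed here
  vars_files:
    - secrets.yaml
  vars:
    server_url: "https://checkmk.example.com/"   # placeholder URL
    site: "cmk"                                  # placeholder site name
  tasks:
    - name: Activate changes on the site
      checkmk.general.activation:
        server_url: "{{ server_url }}"
        site: "{{ site }}"
        automation_user: "{{ automation_user }}"
        automation_secret: "{{ automation_secret }}"
        sites:
          - "{{ site }}"    # limit activation to this one site
      run_once: true
```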

[8:28]So this is the simplest way to make sure this request is initiated from your local machine. First of all, I want to check how this is working. When we go back to the Checkmk monitoring platform, you can see there are no pending changes that have to be activated. But maybe I can simulate a basic change: if I go somewhere in the configuration of, I don't know, server prod one, I might want to change something in here. For example, let's monitor the systemd socket summary, whatever, so just make a simple change that we would have to activate manually here.

Alright, so to see how this is automated by using Ansible, instead of clicking this in the web UI, we just go back to the terminal and run ansible-playbook with the name here, activate-changes. Now, what I also want to add is the flag --ask-vault-pass, which is required to decrypt the secrets.yaml file. So when I execute this, I need to fill in my vault password for the credentials. Now it's doing the activation, and you can see it changed localhost. Localhost, remember, just means it is executing the request from my local device. If we go back to the web UI, the alert button here at the top should disappear, and yes, you can see that works. So now we have no pending changes; we just activated everything automatically. And I would always recommend that if you make any changes to your Checkmk configuration via Ansible playbooks, you add this small task at the bottom of all the steps, and then everything is activated automatically.

Alright, so this is already pretty useful. The next thing I want to do is add more hosts to my Checkmk instance. If we go back in here, and you might remember this from my first video tutorial, these are the hosts that I had configured in that tutorial. So I added my server production one, two, seven, some containers, like Traefik one and two, DNS, my firewall and also my Proxmox production one and two.
But of course, I still have other servers in my infrastructure. For example, my recently created Kubernetes cluster, which is running on server production three, four and five. I haven't yet added these to the Checkmk site, and I thought it might be a great example to show you how to fully automate adding new hosts in an Ansible playbook. This is pretty cool and it definitely saves you a lot of time, especially if you have a larger infrastructure that you want to monitor.

Alright, so if we go back to the Ansible collection of Checkmk, you can see, when you scroll up here, some really great examples that show you in detail everything you can do with this collection. You can use the lookup plugins to query existing information. And when you want to change something, you just go to the module section. Here you can find the general activation module that will activate the changes, but also many others, for example, hosts. This is what I want to use to add new hosts to this configuration. And here, this is a pretty huge documentation with some examples. I just want to do something very simple here: I want to create a host with an IP address, so I will just copy this.

And maybe I want to create this in a new Ansible playbook that I want to call manage-hosts.yaml. So I will just put it in here, but of course, we still need the same structure as in the activate changes playbook, with all the variables and stuff like that. So I will also copy this and put them both together. Here you can see I will just replace these four values at the beginning. These I always have to define in the same way. I just want to put the activation step at the bottom, of course, because I first want to add the hosts, then I want to activate the changes. I think let's just start with the first server that I want to create. Of course, I want to use the name server prod three.
I want to put it in the main folder, the state should be present, and I don't need an alias for this, so I can delete it. And in the attributes section, you can fill in all the settings that you usually would have to set in the web UI. One thing that I at first didn't really know is where to look when I want to change a setting that is not in the example templates of the collection. For example, if I want to add some SNMP setting or change the API integrations to some value, how do I actually find out what these values are and what names for the fields I have to put into the attributes section here? I want to show you that, because this is really important for all the other future configurations that you're building in playbooks.

The easiest way to find these things is the official API documentation of Checkmk, because as I said in the beginning, the Ansible collection is basically just a different interface that executes the API requests in the background. So a lot of these attributes, fields and so on are all the same. Just go to the Checkmk monitoring platform, and here in the left sidebar, under Help, you have a developer resources section where you can find the REST API documentation. When you click on that, it will take you to this documentation. And to be honest, guys, I just want to say at this point, this is really one of the best documentations that I've found anywhere. It's very, very detailed and you'll find all the information that you're looking for in here.

Alright, so here in this documentation, in the left sidebar, you can find hosts. This explains how to create hosts, delete hosts and all of this stuff using the API. And when you go to create hosts, you can see there is rich documentation about all the fields and values that you can fill in. Here, for example, you can find the attributes.
When you click on this, it opens a dropdown menu where you can see all the names of the fields and the values you have to put in. In the case of an IP address, it is very simple because it's just a simple string. But if you want to change some other stuff in here, like this dropdown menu, you have to know what the values are that you need to put in. In the documentation, you can see under the tag_agent field a great explanation of the values and what they mean in the web UI. You will find the same for any other things that we are configuring later, like rule sets, where it gets a lot more complex. This is always the best source of truth for creating these playbooks according to your requirements. Alright, I know this is just an aside, but I think it is important to know, so I wanted to explain it at this point.

Of course, the IP address I need to change, and I think that's fine. So when we return to the terminal, we just need to run this one. I think I called it manage-hosts.yaml. Oh yeah, okay, fine. Let's execute this. And because I also added the activate changes step at the end, it should also be automatically activated. So when I go back to the Checkmk instance and go to hosts, ah, there it is. You can see a new host with this IP address has been configured and there are no pending changes.

Of course, you can simply repeat this step for all the other hosts that you want to create. I usually tend to create separate tasks for each of these hosts, but it is basically up to you how you do this. You can see how quick it is to use automation tools like this, it's really amazing. And you can also see that, because we're using Ansible, it automatically checks if these configuration settings already exist and only changes things that have to be changed. Yeah, like the create server three task: that host existed before, so it's not trying to add this configuration again.
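As a rough sketch of the host creation described above, the playbook below adds several hosts and then activates the changes. The video uses one task per host; the `loop` here is an alternative chosen for this sketch. Host names, IP addresses, the server URL and the site name are all placeholders.

```yaml
---
# manage-hosts.yaml: a sketch, assuming secrets.yaml defines
# automation_user and automation_secret.
- name: Manage Checkmk hosts
  hosts: localhost
  gather_facts: false       # assumption: facts are not needed here
  vars_files:
    - secrets.yaml
  vars:
    server_url: "https://checkmk.example.com/"   # placeholder URL
    site: "cmk"                                  # placeholder site name
    new_hosts:                                   # placeholder hosts
      - { name: "srv-prod-3", ipaddress: "192.0.2.13" }
      - { name: "srv-prod-4", ipaddress: "192.0.2.14" }
      - { name: "srv-prod-5", ipaddress: "192.0.2.15" }
  tasks:
    - name: Create hosts with an IP address
      checkmk.general.host:
        server_url: "{{ server_url }}"
        site: "{{ site }}"
        automation_user: "{{ automation_user }}"
        automation_secret: "{{ automation_secret }}"
        name: "{{ item.name }}"
        folder: "/"                      # the main folder
        attributes:
          ipaddress: "{{ item.ipaddress }}"
        state: "present"
      loop: "{{ new_hosts }}"
      loop_control:
        label: "{{ item.name }}"         # keep the task output short

    - name: Activate changes
      checkmk.general.activation:
        server_url: "{{ server_url }}"
        site: "{{ site }}"
        automation_user: "{{ automation_user }}"
        automation_secret: "{{ automation_secret }}"
      run_once: true
```

Other keys under `attributes` use the field names from the REST API documentation for creating hosts, as explained above.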
It's only adding the new servers that I have added to this configuration. And if we refresh the page, you can see there are my other servers. Okay, so that's basically how you can add any host configuration, no matter if it is for a virtual server, a container or whatever you want to monitor. It is always the same procedure.

The next step is installing the agent on these servers. In the previous video, we did that manually: we opened an SSH connection to this server, prod three, downloaded the Checkmk agent package, ran the installation, and then we also needed to register it for the HTTPS certificates and all of these things. Just imagine you had to do this for all the servers; it becomes a lot of work. But of course, there is also an automated way of installing the agent in the Ansible collection, and it can also do the registration automatically. I should also mention at this point that there's even a role to install a full Checkmk server and manage Checkmk sites. Also pretty useful. We are not going through this in this video, because I think it's a bit out of scope and otherwise it's just getting too long, but I think the agent role is really super useful.

There's also another documentation here that explains all the fields and the values that you need to put in, but I will just show you my simplified template. It's actually pretty simple. So let's go in here and create a new file, install-agent.yaml, I think that's fine. This is of course a task that I don't want to execute on localhost. This is something that I want to do on remote machines, which is why it was important that we added this inventory with all of the servers. So I can basically just grab server production three, four and five, where the agent isn't installed yet, and put them here. Now we also need to add the roles that we want to apply to these hosts.
In our case, it is the checkmk.general.agent role, so we want to install the agent on all these servers. Then of course, we need to add the address of the agent server, which is usually similar to the URL, just without the protocol at the beginning. The protocol you define here, in the Checkmk agent server protocol setting. So make sure to set this to HTTPS if you're using HTTPS, because the default value is HTTP, if I'm correct. Also define your Checkmk site, and if you want to run the auto activation and the registration stuff, also make sure to set the agent auto activate setting to true. That of course also requires the username and password. Any other flags are documented here in this file, so just have a look if there might be something else that is required in your environment. I just found that these values are sufficient for my setup.

So let's do this and run ansible-playbook install-agent.yaml. This is also really, really useful if you want to automatically update the Checkmk agent on all your hosts. If you have a large infrastructure, you can see how fast and quick this is, it's really amazing. Yeah, it's doing everything for us automatically here: installing the agent for the specific operating system, configuring all the necessary stuff and registering the agent, it's really amazing. And of course, if there is a new version of Checkmk deployed and you want to update the agent version on all your servers, just change the value in here and make sure to add the servers, and it automatically gets distributed to your entire fleet.

Alright, so once that is complete, we can go back to the Checkmk instance and go into the service discovery of the newly created host. Here you should see that the connection from the Checkmk agent to the Checkmk server is working. This is indicated by the state here and by the fact that we have a list with all the undecided services.
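The agent playbook described above might look roughly like the following sketch. The `checkmk_agent_*` variable names are written as I recall them from the agent role's README and may differ between collection releases, so they should be checked against the documentation of the installed version. Host names, the version string, the server address and the site name are placeholders, and `become: true` is an assumption of this sketch because package installation usually needs root.

```yaml
---
# install-agent.yaml: a sketch, assuming secrets.yaml defines
# automation_user and automation_secret.
- name: Install and register the Checkmk agent
  hosts:                        # placeholder inventory host names
    - srv-prod-3
    - srv-prod-4
    - srv-prod-5
  become: true                  # assumption: installing packages needs root
  vars_files:
    - secrets.yaml
  vars:
    # Variable names as recalled from the role README; verify before use.
    checkmk_agent_version: "2.3.0p1"             # placeholder; match your server
    checkmk_agent_server_protocol: "https"       # the role default is reportedly http
    checkmk_agent_server: "checkmk.example.com"  # server address without protocol
    checkmk_agent_site: "cmk"                    # placeholder site name
    checkmk_agent_user: "{{ automation_user }}"
    checkmk_agent_pass: "{{ automation_secret }}"
    checkmk_agent_auto_activate: true            # activate changes after registration
    checkmk_agent_tls: true                      # register the agent for TLS
  roles:
    - checkmk.general.agent
```

To roll out a newer agent later, only the version value needs to change before the playbook is run again.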
So currently nothing is monitored, but that is the next step that we also want to automate in Ansible. Usually you have to go into each of these hosts manually and accept all the undecided services to add them to the monitoring configuration, then update the host labels in here and, of course, also activate the pending changes. Now, as you can imagine, in the Ansible collection for Checkmk there is also a separate module for the service discovery. It is here in the general discovery section, and there you can again find some example configuration templates. I picked the second example here, fix_all, which means update the labels, remove vanished services and add all the discovered services.

There is a general recommendation that I probably should mention at this point. In the first tutorial, we manually decided what services we want to monitor, and we disabled all the others that might not be important for us. But to follow the best practices for Checkmk, it is generally recommended to accept all, so monitor as much as you can. That way you automatically get notified about anything that might go wrong on your system, and only when something does not fit do you make specific adjustments or configure rules or thresholds. Later on we will also do that by creating some rules with the Ansible playbooks. But here, as you can see, we probably should accept all the monitored services.

One thing that I also want to mention at this point: you can see there is a critical warning for the Proxmox virtual environment memory usage, which says that it uses 91% of the memory, which is above the critical threshold of 90%. However, this is actually not really a big deal, because when you compare it to the memory monitoring on the actual host operating system, the memory usage is in fact only 21%. The reason for this is that the Proxmox virtual environment somehow reports the memory usage in a slightly wrong way.
You can even see this when you go to the Proxmox server and into the details of this server. You can see that the memory usage here adds cached or reserved memory into this calculation, which is actually not correct. So what I usually do is accept all the other monitoring services, and then I just disable this Proxmox virtual environment memory check, or I create a separate rule for it, so that this is not really a big deal for me.

But anyway, to enable all these services, let's go back to the Ansible collection. I think we can still put this into the same manage-hosts playbook. And then of course, we need to replace the credentials and the other variables in here. Now, this is an example for one specific host, but if you want this for all the hosts, you have to change it a bit. So change host_name to hosts and then provide an array of the host names you want to discover the services on, like in my example, server production three, four and five. And once this is complete, I can just run the manage-hosts playbook again, put in the password, and then it goes through the service discovery.

When I go back to the Checkmk instance and refresh the page, you should see that all of the services are now monitored. There are no pending changes, because we also automatically activated them in the last step of our automation, so everything should be correct. And if we go to the monitoring dashboard, you should also see that all the services are now reported in the dashboard. Now, if you still see these warnings here, don't worry, sometimes Checkmk takes a little time until all the checks are rescheduled.

Okay, guys, so that's how you add newly monitored hosts and discover services in Checkmk. Now let's move on to another pretty important aspect of automating the Checkmk configuration: managing rules. And this is honestly not always as straightforward as the previous sections, because rules have many different parameters and conditions depending on their nature.
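The bulk service discovery described above could be sketched as the following task fragment, which would sit in the task list of the manage-hosts playbook before the activation step. It assumes the same `server_url`, `site` and credential variables as the earlier playbooks; the host names are placeholders.

```yaml
# Task fragment for manage-hosts.yaml (host names are placeholders).
- name: Discover services on the new hosts
  checkmk.general.discovery:
    server_url: "{{ server_url }}"
    site: "{{ site }}"
    automation_user: "{{ automation_user }}"
    automation_secret: "{{ automation_secret }}"
    hosts:                 # 'hosts' takes a list; 'host_name' targets one host
      - srv-prod-3
      - srv-prod-4
      - srv-prod-5
    state: "fix_all"       # add new services, remove vanished ones, update labels
```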
For example, if you go to Setup, Service monitoring rules, you can see how many different rules exist in Checkmk, and this is just one section; there are many other rules for monitoring specific services and integrations. For example, one thing that I use, in the HTTP, TCP, email section, is a check for my DNS server. This rule defines a specific DNS query to be made. This way, I can not only check if my DNS server is up and running, but also if it resolves the names of my homelab correctly. For other environments, you might have completely different rules and service monitoring configurations, so it was really hard to find examples that everybody needs. Still, I'd like to try and give you two practical examples that I have come across as use cases in my homelab.

We'll start with the DNS monitoring. If we go back to the Ansible collection and want to replicate this rule setting, there is a module specifically for managing rules. As I said, these examples might only apply to this particular use case, and you might have completely different configurations that you want to use. Here the API documentation in Checkmk is really useful, because it gives you all the necessary fields and configurations of these rules. So when we go into rules, create rule, you can find all the necessary fields and configurations and how to structure your Ansible playbook.

And here is something super helpful that I discovered: sometimes it is useful to first create the rule that you want to automate in the web UI, so that you get all the fields that are specific to that particular rule. Then, if you go back to the section, there is a button that is called export this rule for API. This gives you a value representation of the raw fields that are specific to that one rule set. However, this is not all that you have to add in the playbook.
This is not the entire structure, it is just the structure of the value_raw parameter. The other parts, like the properties or the conditions, you still have to add in the Ansible playbook. But there's also another way of getting this information from existing rules, and that is by using the lookup plugins. So when you have existing rules in Checkmk and you just want to know what they look like in an Ansible structure, you can use the lookup plugin for rules. Here is a basic example that just gets a rule with a particular rule ID. Of course, you have to have this ID first, but you can find it very easily in the web UI: go back to that specific rule, go to edit, and then in the rule properties, when you click on show more, this will reveal the unique rule ID as well as the rule set name that you also might need in the Ansible playbook. So let's copy the rule ID and put it in here.

One error that I ran into only happens on macOS with this Ansible collection. It was probably designed for Linux operating systems, so you might need to run a small virtual Linux machine and run the same command there again. And here you can see the output that contains the entire settings for that rule. I know it can sometimes be a bit tricky to transform this into an actual Ansible playbook. But I think once you have created your templates and you're just reproducing the same configuration, it is fairly easy to change certain values in here, because the properties and the conditions are very similar for most of the rules that you want to create. The value_raw field, which you can basically copy from existing configurations, might be the only thing that is very different for each type of rule set.

There is a second example that I also want to show you. If we go back to the Checkmk monitoring solution, you can see I have a few critical warnings in here; some of them are valuable, some of them not.
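A lookup like the one described above might be sketched as the following task fragment. It assumes the same `server_url`, `site` and credential variables as the earlier playbooks, and the rule ID is a placeholder that has to be replaced with the ID shown in the rule properties in the web UI.

```yaml
# Task fragment: print an existing rule so its structure can be reused.
- name: Show an existing rule by its ID
  ansible.builtin.debug:
    msg: "{{ existing_rule }}"
  vars:
    existing_rule: "{{
      lookup('checkmk.general.rule',
        rule_id='00000000-0000-0000-0000-000000000000',
        server_url=server_url,
        site=site,
        automation_user=automation_user,
        automation_secret=automation_secret
      )
    }}"
```

The printed structure contains the conditions, properties and raw value, which can then be carried over into a rule task.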
But one thing that really bothered me here was the temperature warnings for some of the internal drives. Here you can see that the NVMe drive in my secondary Proxmox server is always reporting a critical temperature. However, this is not actually an issue, because these thresholds were originally created for magnetic hard drives, which do indeed struggle with temperatures higher than 35 and 40 degrees. NVMe drives and SSDs can take much higher temperatures, NVMe sometimes over 60 or 80 degrees, and this is completely normal for this type of drive. By the way, this is a problem I don't seem to be the only person having. You can see Checkmk has created a knowledge base article that explains this particular problem and how to create a rule that changes the threshold values for these drives. In my case, these are the Crucial one terabyte SSDs and the Western Digital one terabyte NVMe drive.

Now, the way you would manually create a rule for this is to open the action menu and go to parameters for this service. Click on temperature, and then there is a button, add a rule for current host and sensor ID. This will automatically create a service monitoring rule, and it already makes it active for this explicit host. If you're using the same Western Digital drive in several servers, you could disable this condition, so the rule becomes active for all the hosts that have these drives. Maybe you have different drive sizes, so you can use an asterisk here to make this active for that particular model. And then in the values, you can change the upper temperature levels to warning at, I don't know, for NVMe drives, let's set it to 60 degrees, and critical at 80 degrees. This is how you would create such a rule in the web UI. Of course, I've already transformed this into an Ansible playbook. So here is an example that we can use, and we want to copy this to our project.
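A rule like the one described here could be sketched as the following task fragment. This is not the template from the video: the rule set name, the shape of `value_raw` and the way the sensor condition is expressed are assumptions based on the general rule format of the collection, so the real values are best taken from the export function or the rule lookup mentioned earlier. The drive pattern is a placeholder, and the task assumes the same `server_url`, `site` and credential variables as the earlier playbooks.

```yaml
# Task fragment: raise temperature thresholds for a drive model.
- name: Raise temperature thresholds for NVMe drives
  checkmk.general.rule:
    server_url: "{{ server_url }}"
    site: "{{ site }}"
    automation_user: "{{ automation_user }}"
    automation_secret: "{{ automation_secret }}"
    ruleset: "checkgroup_parameters:temperature"   # assumed rule set name
    rule:
      location:
        folder: "/"
        position: "top"
      properties:
        description: "Higher temperature levels for NVMe drives"
        disabled: false
      conditions:
        host_labels: []
        host_tags: []
        service_labels: []
        service_description:
          match_on:
            - "Example NVMe Model*"   # placeholder; use your drive's sensor name
          operator: "one_of"
      # Warning at 60 and critical at 80 degrees; structure assumed.
      value_raw: "{'levels': (60.0, 80.0)}"
    state: "present"
```

Because no host name condition is set, the rule would apply to every host that reports a matching sensor, which corresponds to disabling the explicit host condition in the web UI.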
Maybe let's change the name from DNS rule to manage rules, I think that makes a little more sense, and put them here underneath. Of course, I need to change a few things in here. So the value_raw: these are basically the levels, 60 and 85, and I set 80 degrees. Okay, fine, and then in match_on, here we need to put in the name of the drive. So that one here; you can also add other items. For example, let's go back to the monitoring. Here we also had problems with this one, CT1000 asterisk, so then the new thresholds should become active for these two drive models. Theoretically, I probably should create a separate rule for these SSDs and reduce the temperature levels, because NVMe drives really take the high temperatures and SSDs a bit lower, but I think that should still be fine.

Alright, so that has been changed. Let's go back and go to service monitoring rules. Now we can search for temperature, and there should be two rules in here. The first one I had already created; the second one is the one that we've just created for the Western Digital and the Crucial SSDs. Now if we go back to monitoring, we might want to wait a bit, but ah no, you can see it's already updated. So now we still have one warning for the Proxmox server, but the other warnings have gone away. If we go to the second Proxmox node, you can see there's a temperature for all the drives, and if we go in here, you can see the thresholds have been changed to a warning temperature of 60 degrees and a critical temperature of 80 degrees. So I think that's pretty nice. And yeah, this is a pretty simple way to customize the rules of Checkmk according to your needs and requirements.

Okay, yeah, so I think that's enough for today. Of course, we haven't covered each and every section of the Ansible collection.
There are certainly other parts that might also be interesting, but I think you now understand how it basically works, and you can check out the documentation on your own and find out how to adjust it according to your needs. And check out my boilerplates repository on GitHub to find all the templates that I have created for this tutorial, and have fun automating the IT monitoring for your homelab. Please let me know what you think about it, and if you have any other questions or topics that I should cover in future videos, please let me know. And if you enjoy these videos and tutorials, if they are valuable for you, it would be amazing if you'd consider supporting me on Patreon or becoming a YouTube member. And again, thanks to all the people who are already supporting me here, you guys are just amazing. As always, thank you so much for watching and I'm going to catch you in the next video, take care, bye bye.
