[0:00] Sometimes in IT, you try to be useful and you just end up breaking everything. In today's episode, I will discuss how I offered to help my buddy who's a systems engineer and I did just this. If you work in tech long enough, you know that you will eventually bring down production. It doesn't matter how good you are, how many certs you have, or how many years you have in the game. If you do this long enough, you will break something. You'll delete an important file, you'll bork a bunch of PCs at once through a GPO, or you'll make a tiny VLAN change that snowballs into something massive. It's not a matter of if, it's just a matter of when. And for me, it happened on a weekday with something as random as printers. So here's how the whole thing started. My buddy, who's a systems engineer, has an audit remediation project that he's working on. Basically, the task was to harden a bunch of printers, things like SNMP V3 settings, firmware updates, and the usual compliance stuff that nobody really wants to touch, but you eventually have to. And I'm thinking, if it's printers, it's going to be easy. I love printers. They're like little puzzles you get to try and figure out. I figured I would be flipping a switch here and there, updating a driver, doing some firmware updates, and not really talking to anybody, which is the type of ticket that I look for on the job. It's not like we're dealing with firewall changes, domain controllers, or anything else that's mission critical. That's at least what I initially thought when I took on the ticket. So he looped me in and explained to me that the plan was simple. We were going to turn off SNMP V3, we'd push firmware updates to all of them, and after this, we would be good. In my experience, this did not sound scary at all. I did a little bit of research. I asked ChatGPT whether things would break when I did this, and ChatGPT told me that I was fine. I discussed it with him, and he told me that I would be fine.
[1:31] All signs pointed to the fact that this was not going to do anything harmful. I volunteered to help, and I figured I'd rope in one of my tier one buddies to help me out because these were like 30 printers, and I didn't want to go through all of them myself and just manually change these settings. Couple hours of work, tops. Now, before I get into what actually went wrong, I'm going to explain what SNMP is, because if you're newer to IT, this is a protocol that you're going to run into at some point. SNMP stands for Simple Network Management Protocol. In plain English, it's just a way for devices on the network, like printers, routers, and switches, to be able to talk back to management systems. It's how you can pull things like performance data, status, or certain errors, and you don't have to physically touch the device. For example, with printers, SNMP is going to be what allows the print server to monitor certain things. Hey, this printer's online, it has no paper jams, it's low on toner. Without it, you're kind of managing blind. The device might still work, but you just don't have that feedback loop telling you whether certain things and services are alive or dead. Now, SNMP actually has three versions. Version one and version two are both old. You have plain text community strings, nothing's encrypted, and this is easy to sniff. V3 came along with authentication and encryption, so it's really secure. But like anything in IT, something that's more secure often means something that's more complicated. And again, the remediation project that we were doing was just that we were going to disable SNMP V3 all the way around and update the firmware on these printers. It sounds backwards. But the logic was that SNMP V3 was misconfigured in many cases and we just wanted to get rid of it to clean things up and make it simpler. From my point of view, it was just another checkbox to tick, just another task I was doing during my day. So the plan was simple.
I grabbed the T1, we started working through the printers one by one, disabling SNMP V3 where we could find it and updating firmware as we could. Again, I felt solid. I had talked with the system engineer, I talked with GPT, I thought nothing would go wrong. As soon as we started rolling out these changes, things started to go sideways. Rather, it was probably about an hour after that I realized things started going sideways. Printers started dropping off one by one, people were calling into our help desk and the printers could not print at four different bank branches. This was a moment where my head just dropped, my heart dropped, and I knew that it had to be something that we had done. I still didn't know what. So the primary contact of the bank escalated this thing to a P1, which basically just means all hands on deck until we get it fixed ASAP. My manager reaches out to me and says, hey, I see you on this printer remediation ticket, that's related to this printer P1. Did you have anything to do with this? This is not a message that I was super excited about getting. The systems engineer that I was working with pinged me as well and asked me the same thing. I tried to come up with an explanation, again, I still didn't know the root cause of the issue at this point. I was thinking that because we had rebooted some of these printers, maybe they didn't have static IPs and DHCP had given them a new IP or something like that, and that's why they were showing as offline. But that theory kind of went out the window because it wasn't just a handful of the printers that didn't come back online, it was literally all of the printers that we had worked on. Every single one we touched looked dead from the server side. That was when I knew that we had broken something fundamental and we needed to figure out a way to fix it. 
So here's where the technical side comes in, because if you've never worked in an environment that uses a print server, it's going to be good to understand. A print server is basically like a middleman between users and printers. Instead of users mounting and connecting to a printer directly, the print server hosts that printer. It's got the queues, it's got the print drivers, and then it passes all of that out to the users. That way, if you have 200 users at a bank, you're not trying to install 200 drivers on 200 different machines and manually mounting this printer to all these different machines. You just install it once, make sure you have the correct driver on the print server, and then everybody can print through there. And again, documents are queued through the print server as well. Now, part of how this works is that the server has to know whether the printer is online or offline. That's where SNMP comes into play. The server talks to the printer using SNMP. If it gets a response, the printer's green. If it doesn't get a response, the printer, from the server's perspective, is offline. So when we turned off SNMP V3 on the side of the printers, it was actually still enabled on the server side of things. So all of a sudden, the server is trying to poll all of these devices using SNMP, and the printers aren't responding in the way that it expects. Same IPs, same config, same everything, but as far as the server was concerned, they were all dead. So users couldn't print, the branches obviously thought that all of the printers were toast because of what we had done. Ultimately, it was just a miscommunication between the server and these printers. So at this point, I'm scrambling, I'm double checking IPs, I'm looking at DHCP, I'm talking with the system engineer, thinking what the heck could we possibly have done that threw all these offline? 
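To make the mismatch described above concrete, here is a minimal Python sketch that simulates it. Nothing in it touches a real network or a real print server. The class names, field names, and the printer name are all hypothetical, and the rule that a server which is not polling simply shows the printer as online is a simplifying assumption for illustration.

```python
from dataclasses import dataclass


@dataclass
class Printer:
    """Hypothetical stand-in for a physical printer."""
    name: str
    powered_on: bool = True
    snmp_enabled: bool = True  # device-side setting (the one we changed)


@dataclass
class ServerPort:
    """Hypothetical stand-in for the printer object on the print server."""
    printer: Printer
    snmp_status_enabled: bool = True  # server-side SNMP checkbox


def snmp_poll(printer: Printer) -> bool:
    """Simulated poll: the printer answers only if it is on and SNMP is enabled."""
    return printer.powered_on and printer.snmp_enabled


def server_status(port: ServerPort) -> str:
    """Status the server would show for this printer in this simplified model."""
    if not port.snmp_status_enabled:
        # Assumption: with polling off, the server does not mark the printer offline.
        return "online"
    return "online" if snmp_poll(port.printer) else "offline"


def ports_at_risk(ports: list[ServerPort]) -> list[str]:
    """Pre-change check: ports that still poll a printer that will not answer."""
    return [
        p.printer.name
        for p in ports
        if p.snmp_status_enabled and not p.printer.snmp_enabled
    ]


port = ServerPort(Printer("branch1-lobby"))  # hypothetical printer name
print(server_status(port))                   # online

port.printer.snmp_enabled = False            # the change made on the printer
print(server_status(port))                   # offline
print(ports_at_risk([port]))                 # ['branch1-lobby']

port.snmp_status_enabled = False             # the fix on the print server
print(server_status(port))                   # online
```

The printer itself never changes state in this model. Only the server's view of it does, which is why checking IPs and DHCP turned up nothing.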
And after about 30 minutes, the manager who originally pinged me, who had been a system administrator for almost a decade, reached out to me and said, hey, I've seen this before. He says, this is on the print server. And that was ultimately the answer. Again, the printers were online, but the server thought they were offline. The fix was so stupidly easy. It was literally unchecking SNMP on the printer properties side of the print server. As soon as we did this for every printer object, the printers started showing online again. Users started printing, the branches calmed down, and everything was calm and tranquil. It was awesome. This whole outage lasted probably half an hour to 45 minutes, but when you're a sys admin who caused the outage, that felt like a lifetime. And so once everything was back up, it was time to own it. The systems engineer who I was working with reached out to the bank's contact and ultimately just owned the issue and explained what happened. He took responsibility and explained that it was his remediation project. I also took the opportunity to apologize to him for doing this because really it was my fault, it wasn't his fault. Because the truth is I was the one who was pushing these config changes, making these printers fall offline, and then leading people astray thinking that it was DHCP or something like that when the printers started going down. The good thing is that nobody was mad. The client was ultimately happy just that the problem was fixed. My manager talked it up as a learning moment and the engineer was super cool about it. Again, we all just owned up to our mistakes. And then honestly, as the dust settled, we all kind of chuckled a little bit because we were making a change as simple as updating firmware and changing SNMP on printers and we brought down production for three banks. So what did I actually learn from this? First off, IT will humble you.
You can have the right intentions, do all of the research with GPT, talk with internal contacts, be as sure as you possibly can, and things can still explode into something terrible. That's just how things go. And then second, is that devices do not just live in isolation. Changing a setting on a printer doesn't just affect that printer. It affects how the printer talks with the print server and ultimately how that printer interacts with hundreds of users. I was so focused on the device that I didn't even think about server side if I'd need to make any config changes there. It was a rookie mistake. Thirdly, this experience showed me the value of having experienced people around. My manager, who had again, 10 years of system administrator experience, ultimately saved the day because he had seen the issue before. He helped us to resolve this way quicker than we might have had I still been poking around on DHCP. And at the end of the day, it wasn't the end of the world. Production was brought down for 30 to 45 minutes, we owned up to it, we fixed it, everyone had a good laugh, and we closed the ticket, thank goodness. So that is how I tried to fix some printers and brought down three, four bank branches. Appreciate you guys. Hopefully this story has been useful for you. Let me know your worst horror story that you've had working in IT and when you brought down production. Thanks so much. Have a good day. Be safe, be smart, make some good decisions and good luck with those printers.



