
We Lost Internet at a Bank – And DHCP Wasn’t the Problem...

Jake's Tech

9m 14s · 1,918 words · ~10 min read
Transcript source: YouTube auto captions

This transcript was extracted from YouTube's auto-generated caption track. The transcript below is server-rendered so it can be read, searched, cited, and shared without opening the original YouTube player.


[0:00] Picture this. It's a normal morning working in IT, and suddenly a bunch of PCs at a remote branch just drop off the network. No changes, no reboots, just dead. This wasn't just a restart-your-router situation. This one unraveled into a full-blown P1 involving firewalls, SD-WAN, and a 300-day-old ASA connection that came back to bite us. Let me take you through how we found it and everything I learned along the way.
So, we were working on a new PC deployment project of about 100 PCs. Everything looked clean and was working smoothly, and the next day, people started reporting that they were losing internet. It wasn't everyone, just a few PCs, but it was more by the minute. The switch was online, the firewall was online, and there was plenty of DHCP space. So, what gives?
To start off, I did some basic connectivity troubleshooting. I confirmed that these PCs could not ping their default gateway, and I also confirmed that they couldn't ping the DHCP server. We were getting a general ping failure, and running ipconfig /all on these devices showed that they had an APIPA IP, which is an auto-assigned address (in the 169.254.0.0/16 range) that a device gives itself when it can't get an IP from DHCP.
So, my first thought is DHCP is full, because we see this pretty often: you have a DHCP scope that can give out maybe 100 IPs. Especially with a project like this where you're plugging new PCs in, I'm thinking, okay, we've exhausted it. There are no more leases to give out. We just need to expand the DHCP scope. But I checked, and there were about 200 available IPs, so DHCP was not full. Also, other PCs connected to this switch were still working fine.
So, I hopped into our 9200 switch stack, which is three big switches. I remoted into it and tried to track down the port that this PC was on. We didn't have a cable tracer. For a minute, I was kind of at a loss for how we were going to track this down, because we didn't have LLDP enabled on the switch.
After fumbling around in the switch for 5 to 10 minutes, we came to our senses, and I was able to track down the PC by its MAC address by showing the MAC address table on the switch. So, I found the port that the PC was on. Now, I'm looking at the port configuration. Did something change? Is there port security? Is there DHCP snooping? The VLAN looked good. There were no real errors on the port, aside from a bunch of output drops, which I think are innocuous.
I started looking at how DHCP was set up. This switch was set up as a DHCP helper, a DHCP relay. Basically, it has a virtual interface, a Layer 3 interface. The switch receives these DHCP requests and then forwards them along to the DC. The DHCP helper address looked correct. It was the DC's IP, so it was pointing these DHCP requests to the right place.
So, I'm thinking, is this an individual PC issue? But it's multiple PCs. Is it something to do with these switch ports? I couldn't really find anything. So, we plugged a new PC into the non-functional port, just to verify that no PC would work on that port. And it could not get an IP. It was stuck on identifying network or unidentified network. This tells me something has to be wrong upstream or at the VLAN level.
As I said, the VLANs looked fine. So, we decided, hey, let's just give this thing a free static IP within that DHCP range. We did that, set the default gateway, and everything worked fine. It was able to connect to the internet. So, this confirmed that this is an issue with DHCP, not routing, not VLANs, nothing like that.
So, now I'm back in the switch, fumbling around, trying to figure out what the heck is going on, because the DHCP server looks healthy. I checked the DHCP server events and couldn't see any errors or anything like that. I go into the switch, and I can ping the DC, which is the DHCP server. So, the switch has connectivity to the DC. At this point, I'm just completely baffled.
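The helper (relay) behavior described above, where the switch picks up the client's broadcast and forwards it to the DC as unicast, can be sketched roughly as follows. This is a toy model and not how the switch is actually implemented: the `DhcpMessage` class, the `relay_to_helper` function, and every address here are hypothetical and chosen only for illustration. The one protocol detail it leans on is that a relay agent writes its own interface address into the message's `giaddr` field.

```python
from dataclasses import dataclass, replace

BROADCAST_IP = "255.255.255.255"


@dataclass(frozen=True)
class DhcpMessage:
    client_mac: str            # chaddr: identifies the client at Layer 2
    src_ip: str
    dst_ip: str
    giaddr: str = "0.0.0.0"    # relay agent address, empty until relayed


def relay_to_helper(msg: DhcpMessage, svi_ip: str, helper_ip: str) -> DhcpMessage:
    """Re-send a client's broadcast as a unicast to the helper address."""
    if msg.dst_ip != BROADCAST_IP:
        raise ValueError("only client broadcasts are relayed in this sketch")
    # The relay stamps its own interface address into giaddr so the server
    # knows which subnet the request came from and where to send the reply.
    return replace(msg, src_ip=svi_ip, dst_ip=helper_ip, giaddr=svi_ip)


# Hypothetical addresses, for illustration only.
discover = DhcpMessage(client_mac="aa:bb:cc:dd:ee:ff",
                       src_ip="0.0.0.0", dst_ip=BROADCAST_IP)
relayed = relay_to_helper(discover, svi_ip="10.20.30.1", helper_ip="10.0.0.10")
print(relayed)
```

The client's MAC address rides along unchanged, which is what lets the reply find its way back to a PC that still has no IP of its own.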
The SVI is up, the helper address is correct, there's no DHCP snooping or port security, and there's connectivity all the way up to the DHCP server. So, as one would, being a system administrator, I reached out to our network engineers as the next escalation point to work with them and figure this out. And the NE said, give me 10 minutes. I think I know what's going on.
This is where it starts getting deep. I'm going to break this down for you, and this is the stuff that you do not see in the textbook when you're learning about DHCP. No CompTIA exam teaches you this, but it's super important to know and understand.
Here's what normally happens with DHCP relay. A client has no IP address, so it sends a DHCP discover broadcast message. This is at Layer 2, so it's not sending this to a specific IP, it's just broadcasting it out. The switch sees that it's a DHCP broadcast message, and it says, ah, I know what to do with these DHCP requests. It uses that helper address and sends the request along as a unicast message to that DC IP, because the switch does have Layer 3 connectivity. The PC still doesn't. The DHCP server replies back to the switch. The switch says, ah, I remember who this is for, this MAC address at Layer 2, and then it gives the IP to that PC. This is all simple enough, and we verified connectivity all the way from that PC up to the DC. It works, unless something messes with the return path.
So, enter our ASA, the firewall, and our SD-WAN device, which is called a VCE, a VeloCloud Edge. The VCE is at the very edge of the network, and it handles SD-WAN and routing uplinks. We have an ASA in our data center that sees traffic and keeps stateful connections. So, here's the catch. Our VCE has two ISPs connected to it, so it can send traffic over one of two public IPs, because it's in something called an active-active setup.
Basically, it chooses the best public IP to send traffic over, the best ISP, based on quality of service and other factors, like the bandwidth being used. So, each of these two WAN links has a separate public IP on the VCE. This means that a packet that goes out on WAN link 1 might get a reply on WAN link 2. And an ASA, a firewall, doesn't like that, because it has stateful connections.
So, why did this break things? The ASA creates a connection object for every flow. It sees the source IP, the destination IP, the port, and the protocol, and then it locks that connection down. It's like an open connection, what we call a flow. When the VCE failed over to that second link, or even just flapped over to it, due to one of many different factors, the return traffic came from a different IP than the ASA expected. The ASA sees this incoming connection and says, nope, this doesn't match any IP that I have documented. Blocked. It throws it, in my network engineer's terms, off the cliff.
This includes DHCP traffic. So, even though the DHCP server received the request and sent a reply back, it never made it back to the client, because the ASA dropped it silently. To make this even worse, the NE found an old ASA connection that was over 300 days old, still holding that original public IP from before the failover or flap. This means no new connections could get established until this was dealt with.
Okay, so here are the fixes and what we should learn. And this is me relaying what the NE told me. He disabled that active-active setup and set it to active-standby. This means that only one public IP is used at a time. This ensures traffic leaves on one path and comes back on that same path. The ASA sees consistent flows and stops dropping traffic. Now, he mentioned that he could have just cleared connections on the ASA and that would have fixed things.
That manually flushes those stale connections and temporarily takes care of our issue, but it doesn't solve the underlying problem.
This whole situation shows us a few things. Stateful firewalls like the ASA are flow sensitive. If your WAN path changes mid-session, it's probably going to break things. SD-WAN active-active can cause asymmetric routing. That's dangerous unless your firewall is aware of both paths. Old connections can silently kill new traffic. Just because a connection is idle does not mean that it's harmless. And sometimes DHCP isn't really about DHCP. In this case, DHCP failed due to a state mismatch on the firewall, not because the relay, VLAN, or DHCP server was broken, which I was so zoned in on at the beginning because I'd seen it time and time again.
So, if you ever see an APIPA address, the network stuck on identifying or unidentified, and DHCP packets leaving but getting no reply, don't just blame DHCP. Ask yourself: Is there a firewall in the path? Could asymmetric routing be happening? Is the device NATing through multiple public IPs? And could an SD-WAN failover be changing that return path? This is the kind of thing that separates a help desk tech from somebody who's at that next level, like a network engineer.
And here are some things that I learned, or just reinforced, through this three-hour troubleshooting session. DHCP relies on Layer 2 and Layer 3. Even though the client, the PC, doesn't have an IP address yet, the switch receives that Layer 2 message and then uses its virtual interface to relay that message to your DHCP server. Stateful firewalls like the ASA do not like asymmetric traffic. If the return path doesn't match, it's going to drop that packet, period. And VCE flow logic is based on that first packet. So, if traffic flips paths, the VCE may try to reply down the wrong one and get blocked. And the one to never forget is that even idle connections that are 300 days old can block live traffic.
So, one zombie session can knock an entire branch off the internet. Okay, again, these are the type of issues that don't show up in your CompTIA labs, but this is the kind of thing you have to know how to navigate in the real world. It was a huge learning experience for me, and I'm blessed to have had that network engineer to, one, solve my problem and, two, take the time to explain the problem to me.
If this helped you, or if you have your own horror story about stuck firewall connections, SD-WAN, or DHCP mysteries, drop it in the comments. I'm building more content like this. Real incidents, real fixes, no fluff. Appreciate you guys. Have a great day. Be safe, be smart, and make some good decisions.
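The flow-matching idea from the incident can be sketched as a toy model: a firewall that only passes traffic matching a tracked source IP, destination IP, port, and protocol. This is not the ASA's actual logic. The `Flow` and `StatefulFirewall` names are hypothetical, and the addresses come from documentation and private ranges purely for illustration. UDP port 67 is the standard DHCP server port.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    src_ip: str
    dst_ip: str
    port: int
    protocol: str


class StatefulFirewall:
    """Toy model: traffic passes only if it matches a tracked flow exactly."""

    def __init__(self) -> None:
        self.flows: set[Flow] = set()

    def track(self, flow: Flow) -> None:
        self.flows.add(flow)

    def permits(self, flow: Flow) -> bool:
        return flow in self.flows

    def clear_connections(self) -> None:
        # Rough analogue of manually flushing stale connections.
        self.flows.clear()


# Hypothetical addresses: one public IP per WAN link, one DHCP server.
WAN1_IP, WAN2_IP, DHCP_SERVER = "198.51.100.10", "203.0.113.20", "10.0.0.10"

fw = StatefulFirewall()
fw.track(Flow(WAN1_IP, DHCP_SERVER, 67, "udp"))  # old entry pinned to WAN link 1

on_wan1 = Flow(WAN1_IP, DHCP_SERVER, 67, "udp")
on_wan2 = Flow(WAN2_IP, DHCP_SERVER, 67, "udp")  # same traffic after the path flips

print(fw.permits(on_wan1))  # True
print(fw.permits(on_wan2))  # False: no tracked flow has this source IP
```

In this model, `clear_connections()` empties the table, but the next path flip would recreate the same mismatch, which lines up with the point that clearing connections is only a temporary fix compared with moving to active-standby.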
