Troubleshooting a recent outage (ESXi 6.5)

I’ve been administering vSphere since version 4.0, but it has been so reliable for me that I don’t have much experience troubleshooting problems. I want to figure out more about what happened during this outage, but I’m not sure how to go about it.

Environment: HPE ProLiant (G8 – G10) servers and Dell PowerEdge R740. Dell SAN switches and Compellent SC 3020 iSCSI SAN, Cisco switches for management and VM traffic. ESXi 6.5 with the latest build. We have separate traditional vSwitches for management, iSCSI, and VM traffic. No changes have been made recently.

0800: One of our Dell SAN switches crashed, and I got warnings from our servers about degraded redundancy. I was planning to investigate that evening since everything was still running.

1500: All of our VMs started to become unresponsive, although they were still replying to pings. I drove to the site and saw that the switch that had crashed earlier in the day had come back online by itself (7 hours later!).

I noticed our host servers were also becoming disconnected from vCenter, and VMs were becoming disconnected from storage. This is where things started getting confusing for me: I had assumed I was dealing with a storage problem, but now I was having network problems too.

I tried to log into our host servers directly through the web interface over the LAN, but it always timed out. Our management interface isn’t on the same switches as the SAN. At this point I had our network team check our switches; they found them running at high utilization (99%) and rebooted the switch stack. However, after the reboot I still could not log into the servers. At this point I should have used direct console access with a keyboard and monitor, but in my panic I totally forgot that was even an option.

I needed to get everything back online ASAP, so I shut down the SAN switches, SAN, and all the ESXi host servers (about 6 of them). I brought the SAN switches, SAN, and host servers back online and everything started to behave normally again. I was able to log into the servers, they reconnected to vCenter, and I was able to start the VMs. The outage lasted about 2.5 hours total.

I know I made mistakes here in my panic, but I really want to better understand what happened. It appears the storage problem somehow affected the servers to the point where even the management network could not be used (like a denial of service or a race condition of some sort).
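One way to start correlating the storage and management symptoms is to pull only the log lines from the outage window out of each host's logs. The following is a minimal sketch, assuming the log lines begin with an ISO 8601 UTC timestamp ending in `Z`, as ESXi logs such as vmkernel.log normally do. The function names are hypothetical, and the window boundaries have to be supplied by whoever runs it, converted from local time to UTC.

```python
from datetime import datetime, timezone


def parse_timestamp(line):
    """Return the leading ISO 8601 UTC timestamp of a log line, or None.

    Assumes the first space-separated token looks like
    2019-03-05T15:02:11.123Z (the format is an assumption about the logs).
    """
    token = line.split(" ", 1)[0]
    if not token.endswith("Z"):
        return None
    try:
        # Strip the trailing "Z" and mark the result as UTC explicitly.
        return datetime.fromisoformat(token[:-1]).replace(tzinfo=timezone.utc)
    except ValueError:
        return None


def lines_in_window(lines, start, end):
    """Yield the log lines whose timestamp falls within [start, end]."""
    for line in lines:
        ts = parse_timestamp(line)
        if ts is not None and start <= ts <= end:
            yield line
```

Running this once for the window around the 0800 switch crash and once for the window starting at 1500 would give two much smaller sets of lines to compare across hosts. Lines without a leading timestamp (for example, continuation lines) are skipped by this sketch.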

Thanks in advance for any advice you might have for what I can look at.

Posted on Reddit by adept1onreddit

One Comment

  1. Sounds like a potential APD/PDL issue. But to truly find out, you should open a ticket and submit all the logs. Support should be able to tell you exactly what happened so you can potentially prevent it in the future.
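While waiting on support, a rough first pass over the host logs (for example vmkernel.log or vobd.log from a support bundle) can show whether APD or PDL messages appear at all. This is only a sketch: the substrings in `DEFAULT_PATTERNS` are assumptions about what the messages contain and should be adjusted after looking at the real logs, and the function names are hypothetical.

```python
from collections import Counter

# Assumed substrings, matched case-insensitively. These are guesses at
# typical wording, not an authoritative list of ESXi messages.
DEFAULT_PATTERNS = {
    "apd": ("all paths down", "storage.apd"),
    "pdl": ("permanent device loss", "permanentloss"),
    "path_redundancy": ("redundancy.degraded", "redundancy.lost"),
}


def classify(line, patterns=DEFAULT_PATTERNS):
    """Return the labels whose substrings occur in the given log line."""
    lowered = line.lower()
    return [
        label
        for label, needles in patterns.items()
        if any(needle in lowered for needle in needles)
    ]


def summarize(lines, patterns=DEFAULT_PATTERNS):
    """Count how many lines match each label."""
    counts = Counter()
    for line in lines:
        counts.update(classify(line, patterns))
    return counts
```

Calling `summarize(open(path, errors="replace"))` on each host's log gives a per-label count. A count of zero only means the assumed substrings were not found, not that the condition did not occur.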
