Twice in 72 hours I’ve had VMware ESXi 6.5 on a Dell PowerEdge R710 stop responding to heartbeats from VCSA and, more interestingly, access via the keyboard on the local console (ESXi shell, F2 and F12 keys) is either very sluggish or does not work at all. SSH to the host on the management network, however, works just fine. I have tried restarting services like vpxa and hostd, only to have the shell (whether local or remote via SSH) become unresponsive, yet I am still able to initiate new SSH sessions and log in as often as I like. One or two of the guest VMs stop responding, while the others that were running continue to run seemingly unaffected. For instance, both VCSA and the VPN endpoint I use to connect to the network run on the host in question, yet during both incidents VCSA reported the host as unresponsive and two VMs were not responding.
It happened unexpectedly; the first instance was three nights ago. I was shuffling and replacing server hardware and unplugging cables from the switch (without touching the server in question or accidentally unplugging one of its network cables) when I first noticed that one of the host’s guest VMs was down. As I stated above, I was able to SSH in, but the local console would respond to my keystrokes for bursts of a minute or two and then become unresponsive. I unplugged and replugged the USB keyboard several times, and while that would sometimes “wake” things up enough to get a response to a key press, it wasn’t consistent.
This server does a lot of local disk I/O and some slower 1 Gbit iSCSI network I/O, as it runs Veeam backing up VMs to local storage, with the Veeam config and server instance itself on the iSCSI datastore. I would like to imagine that ESXi 6.5 isn’t pulling a Windows 98 and becoming crippled-until-reboot when I/O goes awry, but I suspect it may have something to do with it. I need to examine the logs, but there are a lot of them, the times at which the events happen aren’t entirely known, AND most things keep logging and running as usual after the whole issue starts. So yeah, figuring out where to look is a bit tough.
Any ideas on what could be going on, or what to grep for in the log directory to shed some light on where to look?
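Since the exact times of the events aren’t known, one way to narrow the search might be to count storage-related messages per hour and look for spikes. The following is only a rough sketch under stated assumptions: the phrases in `PATTERNS` are illustrative guesses at what I/O trouble could look like in the logs (not a confirmed list for this problem), the function name `bucket_matches` is hypothetical, and it assumes each log line begins with an ISO-8601 timestamp.

```python
import re
from collections import Counter

# ASSUMPTION: illustrative storage/latency phrases, not an exhaustive
# or authoritative list of what this particular fault would log.
PATTERNS = [
    "performance has deteriorated",
    "lost access to volume",
    "state in doubt",
    "timed out",
]

# ASSUMPTION: lines start with a timestamp such as 2017-03-01T02:15:01.123Z;
# only the date and hour are captured, so hits are bucketed per hour.
TS_HOUR = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2})")


def bucket_matches(lines, patterns=PATTERNS):
    """Count case-insensitive pattern hits per (hour, pattern) pair."""
    counts = Counter()
    lowered = [p.lower() for p in patterns]
    for line in lines:
        m = TS_HOUR.match(line)
        if not m:
            continue  # skip continuation lines with no leading timestamp
        text = line.lower()
        for p in lowered:
            if p in text:
                counts[(m.group(1), p)] += 1
    return counts
```

To use it, open a log file copied off the host and pass the file object in, for example `bucket_matches(open("vmkernel.log", errors="replace"))`, then sort the resulting items to see which hours stand out. Bucketing by hour is a deliberate choice here: it trades precision for a quick overview when the start time of the problem is only roughly known.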