VMware

Two instances in 3 days of a VMware host becoming unreliable and unresponsive while the majority of guest VMs keep running unaffected

Twice in 72 hours I’ve had VMware ESXi 6.5 on a Dell PowerEdge R710 stop responding to heartbeats from VCSA and, more interestingly, access via keyboard to the ESXi shell and the F2 / F12 keys on the local console is either very sluggish or does not work at all. SSH to the VMware server on the management network itself, however, works just fine. I have tried restarting services like vpxa and hostd, only to have the shell (whether local or remote via SSH) become unresponsive, yet I am still able to open new SSH sessions and log in as much as I like. One or two of the guest VMs stop responding, but any others that were running continue to run seemingly unaffected. For instance, VCSA and the VPN endpoint I use to connect to the network both run on the host in question and kept working, yet during both incidents VCSA reported the host as unresponsive and two VMs were not responding.

It happened unexpectedly; the first instance was three nights ago. I was shuffling and replacing server hardware and unplugging cables from the switch (not coming in contact with the server in question, not accidentally unplugging one of its network cables, nothing) when I first noticed that one of the host’s guest VMs was down. As I stated above, I was able to SSH in, but the local console would respond to my keystrokes for bursts of a minute or two and then become unresponsive. I unplugged and replugged the USB keyboard several times, and while at times that would “wake” things up enough to get a response to a key press, it wasn’t consistent.

This server does a bunch of local disk I/O and some slower 1Gbit iSCSI network I/O, as it runs Veeam backing up VMs to local storage and the Veeam config and server instance itself to the iSCSI datastore. I would like to imagine that ESXi 6.5 isn’t pulling a Windows 98 and becoming crippled-until-reboot when I/O goes awry, but I suspect I/O may have something to do with it. I need to examine the logs, but there are a lot of them, the times when the events happened aren’t entirely known, AND most things keep logging and running as usual after the whole issue starts. So yeah, figuring out where to look is a bit tough.

Any ideas on what could be going on, or what to grep for in the log directory to shed some light on where to look?

Thank you!



Posted on Reddit by dataslanger

 


5 Comments

  1. The only thing I’ve encountered like that was a result of local storage on the ESXi hosts being severely hammered. I ran into this a little while back while trying to set up a Windows S2D cluster on top of several ESXi hosts. Under conditions of high I/O I found that the hosts would show as unresponsive and inaccessible in vCenter. Nonetheless, I could connect to any VMs running on those hosts and also SSH to the hosts themselves. I was able to confirm that it had to do with the increased load caused by the S2D cluster, because if I powered off the S2D cluster VMs (from the CLI using SSH) the hosts would almost immediately become responsive again in vCenter.

    Beyond confirming that I’ve seen this type of situation, I can’t provide too much more help. I would probably look at your VMs, identify which ones generate the most I/O, and see if powering one or more of them off while you’re encountering this issue causes it to magically resolve. In my case I ended up moving my S2D cluster to a different tier of storage (SSD based) on the same hosts and I never had the problem again.

  2. UPDATE: To make tracking down the logs even more of a pain, NTP was disabled on the host for whatever reason. I worked out both the offset between the server’s clock and real time and the approximate time the errors would have been logged, based on the last timestamp of web traffic that one of the guest VMs on the affected server redirected to a non-affected server. Using this info I was able to find I/O errors in the VMware host logs. I also checked the status report for a Veeam backup and found that it failed ~23% through the backup process at the approximate time of the crash.

    The VMware ESXi logs report more or less:

    … warning hostd[CCC2B70] [Originator@6876 sub=Statssvc.vim.PerformanceManager] Calculated read I/O size 716568 for scsi0:1 is out of range — 716568,prevBytes = 36129310208 curBytes = 37914281472 prevCommands = 147320curCommands = 149811

    … warning hostd[CCC2B70] [Originator@6876 sub=Statssvc.vim.PerformanceManager] Calculated write I/O size 971133 for scsi0:1 is out of range — 971133,prevBytes = 136700586496 curBytes = 138285476352 prevCommands = 234291curCommands = 235923

    ..

    Looking at scsi0:1 (which must refer to a device on a VM, because the only references I can find to it are other “scsi0:1” errors in the hostd log and “vscsi0:1” entries for guests whose only assigned datastores all reside on the same hard disk), I am thinking it is a RAID-1’d 2TB drive behind my Dell PERC H200.

    The problem is that both perccli and megacli do NOT work to display anything related to the PERC H200 so I am not sure how I would go about checking the condition of the disk(s) as SMART functions aren’t working thru the PERC.

    ALSO: the disk that Veeam was backing up to is NOT the 2TB RAID-1 array but another RAID logical disk on the same RAID card. So maybe the RAID card is causing the problem? Maybe both arrays have problem disks? I can’t check!

    For the time being I am disabling the backups. Any input on what to do or ideas would be great still.

    Thanks guys

  3. Sounds like hostd is busy keeping your disk path states updated. Any ScsiDeviceIO module messages in the vmkernel log?
    When the host is sluggish but you can SSH in, run esxtop and check the controller and device stats (d for disk adapters, u for disk devices). You’ll see if I/O is being queued or if you have major latencies there.
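For anyone sifting through a large hostd log for the warnings quoted in comment 2, here is a minimal Python sketch that pulls out each "Calculated read/write I/O size ... is out of range" line and recomputes the size from the counters in the same line. The numbers in the two quoted warnings are consistent with the reported size being the change in bytes divided by the change in commands (integer division), so that is what the sketch assumes. The function name `parse_io_warning` and the regular expression are illustrative, not anything VMware ships, and the pattern is based only on the two lines shown above.

```python
import re

# Pattern modeled on the two hostd warnings quoted in comment 2 (an assumption;
# other ESXi builds may format the line differently). The \s* before
# "curCommands" tolerates the missing space seen in the quoted lines.
WARNING_RE = re.compile(
    r"Calculated (?P<op>read|write) I/O size (?P<size>\d+) "
    r"for (?P<dev>\S+) is out of range"
    r".*?prevBytes = (?P<prev_bytes>\d+)\s*curBytes = (?P<cur_bytes>\d+)"
    r"\s*prevCommands = (?P<prev_cmds>\d+)\s*curCommands = (?P<cur_cmds>\d+)"
)


def parse_io_warning(line):
    """Return the parsed fields of one warning line, or None if it does not match."""
    m = WARNING_RE.search(line)
    if m is None:
        return None
    delta_bytes = int(m["cur_bytes"]) - int(m["prev_bytes"])
    delta_cmds = int(m["cur_cmds"]) - int(m["prev_cmds"])
    return {
        "op": m["op"],
        "device": m["dev"],
        "reported_size": int(m["size"]),
        # Average bytes per command over the sampling interval.
        "recomputed_size": delta_bytes // delta_cmds if delta_cmds else None,
        "commands": delta_cmds,
    }


# The read warning from comment 2, with the leading timestamp left out.
sample = (
    "warning hostd[CCC2B70] [Originator@6876 sub=Statssvc.vim.PerformanceManager] "
    "Calculated read I/O size 716568 for scsi0:1 is out of range -- 716568,"
    "prevBytes = 36129310208 curBytes = 37914281472 "
    "prevCommands = 147320curCommands = 149811"
)

if __name__ == "__main__":
    print(parse_io_warning(sample))
```

Run over every line of the log, this gives a quick per-device list of when the warnings occurred and how large the average request was, which could then be lined up against ScsiDeviceIO messages in the vmkernel log and the Veeam job times.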
