1.0 Background: [Setup]
I have a problem I hope someone here can help me with. I have a large vSphere 6.0 environment with a few hundred hosts. These hosts are distributed across 4-5 vCenter environments.
I have several HPE Nimble storage arrays organized into groups and they present various LUNs to the various vCenter environments.
The vCenter environments are organized into two groups: a couple are for hosting servers as guests and the others are for VDI. The storage arrays are arranged in a similar fashion, so that one array cluster only presents LUNs to vCenters that have servers as guests and the other array presents LUNs to the vCenters used for VDI.
The exceptions to this are two volumes. One volume, located on the SAN that presents LUNs to the VDI environment, holds template and ISO images and is also made available to the vCenters for servers. The other volume is hosted on the SAN for servers and is presented to all environments for temporary storage and the movement of data between the various vCenters. This LUN is generally empty and is not used for heart-beating etc. All volumes are mounted using the iSCSI software adapter.
2.0 The Problem:
While troubleshooting an unrelated issue with the storage vendor, we noted these lightly used volumes were showing up really high on the IOPS and network throughput list, higher than many of our production LUNs, some of which contain 200-300 VMs. IOPS and throughput for these volumes were way out of line for their use. We were able to determine this IO activity was not related to any actual workloads, so it must be coming from the hosts or be VMware/ESXi related in some way.
3.0 What has been Done:
We opened a ticket with VMware support. They ran through all of the storage-related performance checks and did not find anything out of order. We disabled ATS on the hosts to see if it would change anything; it did not.
We were able to determine the scope as follows: of the 5 vCenters the two volumes are mounted to, the problem only shows up on three, 1 vCenter for servers and 2 VDI vCenters. In our largest vCenter, which contains 4 clusters, the problem only shows up in two of the clusters.
We know that if we unmount the volume the problem appears to go away. We see corresponding drops in IOPS and throughput.
We’ve taken packet traces and checked the ESXi host logs. We found something interesting but have not been able to get anything to line up.
We can see the high IOPS/throughput when setting up a performance chart: Performance > Datastore > Realtime > (datastores) > Read Rate. In places where this problem does not exist, the throughput is < 20 KBps, but in places where we are seeing the problem we see 3000-8000 KBps.
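For anyone who wants to compare hosts or clusters the same way, the split between "normal" and "affected" can be sketched roughly in Python. This assumes the read-rate readings from the chart have been copied out by hand into a dictionary keyed by host or cluster name. The function name and the 1000 KBps cutoff are hypothetical; the cutoff was picked only because it falls between the < 20 KBps and 3000-8000 KBps ranges described above.

```python
def flag_high_read_rate(samples, cutoff_kbps=1000.0):
    """Return (name, average KBps) pairs whose average read rate exceeds the cutoff.

    samples: dict mapping a host/cluster name to a list of read-rate readings in KBps.
    cutoff_kbps: hypothetical threshold separating idle from affected mounts.
    """
    flagged = []
    for name, readings in samples.items():
        if not readings:
            # No readings collected for this entry, so nothing to compare.
            continue
        average = sum(readings) / len(readings)
        if average > cutoff_kbps:
            flagged.append((name, average))
    # Highest average first, so the worst offenders are at the top.
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

Feeding it one entry per host makes it easier to see whether the affected hosts line up with particular clusters or vCenters.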
Most of the errors we are seeing in the vmkernel.log on the hosts are MISCOMPARE ON VERIFY.
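To try to line these errors up with the throughput spikes, one option is to count them per hour on each host. The following is only a rough Python sketch: it assumes each log line starts with an ISO-style timestamp (for example 2016-01-01T10:00:01.000Z), and lines that do not match are counted under "unknown". The function name is made up for illustration.

```python
import re
from collections import Counter

# Assumed line prefix: YYYY-MM-DDTHH..., captured up to the hour.
TIMESTAMP_HOUR = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2})")

def count_miscompares_by_hour(lines, needle="MISCOMPARE"):
    """Count log lines containing `needle` (case-insensitive), grouped by hour."""
    counts = Counter()
    wanted = needle.lower()
    for line in lines:
        if wanted not in line.lower():
            continue
        match = TIMESTAMP_HOUR.match(line)
        # Lines without a recognizable timestamp are still counted, under "unknown".
        key = match.group(1) if match else "unknown"
        counts[key] += 1
    return counts
```

It can be called on an open file handle for a copy of vmkernel.log, and the hourly counts from affected and unaffected hosts can then be compared side by side.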
Please let me know if additional information is needed.