VMware

Awful support call

Hi

So I have had a call open with VMware for around 2 months: we have guests randomly losing network connectivity on our 6.5 cluster (between 5 and 20 instances per day, with no real consistency). VMware suspected the SAN was at fault, so we’ve spent a considerable amount of time looking there (HPE Gen10 blades / Nimble storage), and while some issues and outstanding updates were found and dealt with accordingly, we have seen no improvement in behaviour.

After we pushed back on VMware, the case was moved to another engineer, who doesn’t think storage is the cause; that was around 3 weeks ago. No one is offering any real solutions, and we have gone through 3 engineers in 3 weeks due to staffing issues on the VMware side, so it has mostly been repetitive requests for logs and nothing actually happening… despite the case apparently being “high priority and already with the escalations team”.

I’m making this post out of desperation more than anything, to see if anyone has any pearls of wisdom; if I have to run “Get-Log -Bundle” one more time there is a high chance I’ll cry /s.

In case anyone is wondering, the current state of the issue is this: just before we see TCP/IP connectivity drop, the vmware.log of the affected VM shows:

2019-11-27T10:47:44.171Z| vmx| I125: VigorTransportProcessClientPayload: opID=34187407-8c-3287 seq=1779624: Receiving Ethernet.IsPresent request.
2019-11-27T10:47:44.171Z| vmx| I125: VigorTransport_ServerSendResponse opID=34187407-8c-3287 seq=1779624: Completed EthernetClass request.
2019-11-27T10:47:44.172Z| vmx| I125: VigorTransportProcessClientPayload: opID=34187407-8c-3287 seq=1779625: Receiving Ethernet.SetStartConnected request.
2019-11-27T10:47:44.172Z| vmx| I125: VigorTransport_ServerSendResponse opID=34187407-8c-3287 seq=1779625: Completed EthernetClass request.
2019-11-27T10:47:44.173Z| vmx| I125: VigorTransportProcessClientPayload: opID=34187407-8c-3287 seq=1779626: Receiving Ethernet.SetAllowGuestControl request.
2019-11-27T10:47:44.173Z| vmx| I125: VigorTransport_ServerSendResponse opID=34187407-8c-3287 seq=1779626: Completed EthernetClass request.
2019-11-27T10:47:44.173Z| vmx| I125: VigorTransportProcessClientPayload: opID=34187407-8c-3287 seq=1779627: Receiving Ethernet.ConnectionControl request.
2019-11-27T10:47:44.173Z| vmx| I125: VigorTransport_ServerSendResponse opID=34187407-8c-3287 seq=1779627: Completed EthernetClass request.
2019-11-27T10:47:44.174Z| vmx| I125: VigorTransportProcessClientPayload: opID=34187407-8c-3287 seq=1779628: Receiving HotPlugManager.BeginBatch request.
2019-11-27T10:47:44.174Z| vmx| I125: VigorTransport_ServerSendResponse opID=34187407-8c-3287 seq=1779628: Completed HotPlugManager request.
2019-11-27T10:47:44.177Z| vmx| I125: VigorTransportProcessClientPayload: opID=34187407-8c-3287 seq=1779635: Receiving HotPlugManager.EndBatch request.
2019-11-27T10:47:44.177Z| vmx| I125: VigorTransport_ServerSendResponse opID=34187407-8c-3287 seq=1779635: Completed HotPlugManager request.
2019-11-27T10:47:46.057Z| vmx| I125: VigorTransportProcessClientPayload: opID=1ce27f9b-79-32a6 seq=1779751: Receiving Ethernet.IsPresent request.
2019-11-27T10:47:46.057Z| vmx| I125: VigorTransport_ServerSendResponse opID=1ce27f9b-79-32a6 seq=1779751: Completed EthernetClass request.
2019-11-27T10:47:46.058Z| vmx| I125: VigorTransportProcessClientPayload: opID=1ce27f9b-79-32a6 seq=1779752: Receiving Ethernet.SetStartConnected request.
2019-11-27T10:47:46.058Z| vmx| I125: VigorTransport_ServerSendResponse opID=1ce27f9b-79-32a6 seq=1779752: Completed EthernetClass request.
2019-11-27T10:47:46.058Z| vmx| I125: VigorTransportProcessClientPayload: opID=1ce27f9b-79-32a6 seq=1779753: Receiving Ethernet.SetAllowGuestControl request.
2019-11-27T10:47:46.058Z| vmx| I125: VigorTransport_ServerSendResponse opID=1ce27f9b-79-32a6 seq=1779753: Completed EthernetClass request.
2019-11-27T10:47:46.060Z| vmx| I125: VigorTransportProcessClientPayload: opID=1ce27f9b-79-32a6 seq=1779754: Receiving Ethernet.ConnectionControl request.
2019-11-27T10:47:46.060Z| vmx| I125: VigorTransport_ServerSendResponse opID=1ce27f9b-79-32a6 seq=1779754: Completed EthernetClass request.
2019-11-27T10:47:46.061Z| vmx| I125: VigorTransportProcessClientPayload: opID=1ce27f9b-79-32a6 seq=1779755: Receiving HotPlugManager.BeginBatch request.
2019-11-27T10:47:46.061Z| vmx| I125: VigorTransport_ServerSendResponse opID=1ce27f9b-79-32a6 seq=1779755: Completed HotPlugManager request.
2019-11-27T10:47:46.063Z| vmx| I125: VigorTransportProcessClientPayload: opID=1ce27f9b-79-32a6 seq=1779762: Receiving HotPlugManager.EndBatch request.
2019-11-27T10:47:46.063Z| vmx| I125: VigorTransport_ServerSendResponse opID=1ce27f9b-79-32a6 seq=1779762: Completed HotPlugManager request.

The network drop only ever affects a single VM; the other VMs on the same host are unaffected.
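
For what it’s worth, those Ethernet.IsPresent / SetStartConnected / ConnectionControl / HotPlugManager entries look like something in the management stack pushing a device (re)configuration at the vNIC, so one angle we are looking at is lining the timestamps up against vCenter tasks and events for the affected VM. A minimal PowerCLI sketch of that (the vCenter and VM names are placeholders; the timestamp is taken from the excerpt above):

    # Pull vCenter events around the drop window for one affected VM and see
    # whether anything (reconfigure task, vMotion, HA action, a script) lines
    # up with the Ethernet.*/HotPlugManager requests in vmware.log.
    Connect-VIServer vcenter01.example.local          # placeholder vCenter name

    $vm   = Get-VM -Name 'app-vm-01'                  # placeholder VM name
    $drop = Get-Date '2019-11-27 10:47:44'            # stand-in for the (UTC) timestamp in vmware.log

    Get-VIEvent -Entity $vm -Start $drop.AddMinutes(-10) -Finish $drop.AddMinutes(10) -MaxSamples 1000 |
        Sort-Object CreatedTime |
        Select-Object CreatedTime, UserName, FullFormattedMessage |
        Format-Table -AutoSize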

This is currently the main focus of the investigation; however, every time we have reached this point over the last few weeks, a new engineer seems to be assigned and I start the process again…

Any thoughts would be amazing!




Comments

  1. Well, that sure sounds like a really elusive issue, in everyone’s defense. Anyone would be scratching their head on this one.

    That said, I struggle to remember exactly what Vigor is, but it might be a Web Client management interface component. The events you list suggest that there’s some flapping of the network… uplink? for the VM, and that these events are being communicated across the stack. It isn’t totally clear whether it’s a literal uplink disconnect/reconnect (that would stand out a lot); it might be more internal in nature, between the VM and the network resources on the host.

    These entries are still pretty informational in nature, but I think they point to a bigger problem underneath. You have to start correlating the hostd, vmkernel, and vobd logs to see what else was going on at the time – a rough sketch of one way to pull those is below. If it turns out you are logging literal network flaps that coincide with this and nobody has caught on, I will probably laugh and cry at the same time.

    Edit: Do you have Guest Introspection, NSX, or any other components that might have a hand in the networking stack for the VM?
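
    If it helps, here is a rough PowerCLI sketch for pulling those host logs around the drop window (the host name is a placeholder, and the exact log keys vary by build – Get-LogType shows what the host actually exposes):

        # List the log keys the host offers (hostd, vmkernel, vobd, ...).
        $esx = Get-VMHost -Name 'esx01.example.local'   # placeholder host name
        Get-LogType -VMHost $esx

        # Grab entries from each log around the timestamp seen in vmware.log;
        # adjust the keys to whatever Get-LogType returned, and use
        # -StartLineNum/-NumLines if the default window is too short.
        foreach ($key in 'hostd', 'vmkernel', 'vobd') {
            (Get-Log -Key $key -VMHost $esx).Entries |
                Where-Object { $_ -match '2019-11-27T10:4[6-8]' } |
                ForEach-Object { "[$key] $_" }
        }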

  2. I’m extremely confused here – you’re saying that Ethernet connectivity is lost, but you’re being told to chase a storage problem?

    Is there any commonality between the VMs that experience the problem – VMNIC type? OS type? VMware Tools version? What is happening on the host/VM/vCenter immediately before the problem? I’m sure you’ve checked most of this, but I’m asking to get more info into the post so that others much smarter than me can help. A quick PowerCLI inventory along these lines is sketched below.
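
    Something like this (PowerCLI) pulls the commonality data in one pass – the VM names are placeholders for whichever guests have dropped:

        # Adapter type, guest OS, Tools version, current host and port group
        # for a set of affected VMs.
        $affected = Get-VM -Name 'app-vm-01', 'app-vm-02'   # placeholder names

        $affected | Select-Object Name,
            @{N = 'AdapterType';  E = { ($_ | Get-NetworkAdapter).Type -join ',' } },
            @{N = 'GuestOS';      E = { $_.Guest.OSFullName } },
            @{N = 'ToolsVersion'; E = { $_.Guest.ToolsVersion } },
            @{N = 'Host';         E = { $_.VMHost.Name } },
            @{N = 'PortGroup';    E = { ($_ | Get-NetworkAdapter).NetworkName -join ',' } } |
            Format-Table -AutoSize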

  3. We are working through a similar network drop issue that was caused by one CPU getting busy: the guest won’t use the other CPUs for NIC processing unless RSS is enabled, which the vmxnet3 driver in newer VMware Tools versions supports. Might be worth checking – a quick in-guest check is sketched below.
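
    For reference, a quick in-guest check (Windows PowerShell, run inside an affected VM) – the adapter name is a placeholder, Get-NetAdapter lists the real ones:

        # Is RSS enabled on the vmxnet3 adapter in the guest?
        Get-NetAdapter
        Get-NetAdapterRss -Name 'Ethernet0' | Format-List Name, Enabled, NumberOfReceiveQueues

        # To switch it on for a test (assumes vmxnet3 and reasonably current Tools):
        # Enable-NetAdapterRss -Name 'Ethernet0'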

  4. We had a similar network connectivity issue. It turned out that the Intel native driver introduced in 6.5 was causing it. VMware recommended upgrading to a newer version of the native driver, but the issue kept occurring. They then suggested switching to the legacy Linux-style (vmklinux) driver, and the issue went away. Using Log Insight we could see indrv_uplinkreset entries that correlated with the issue. A sketch of how to check which driver each uplink is using is below.
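
    For what it’s worth, this is roughly how we checked which driver each uplink was bound to (PowerCLI esxcli interface; the host name is a placeholder, and the exact module names depend on the Intel NIC model in the blades):

        # Show the driver bound to each physical uplink on a host.
        # ixgben/i40en indicate the native Intel drivers; ixgbe/igb are the
        # older vmklinux-style ones support may point you at.
        $esxcli = Get-EsxCli -VMHost (Get-VMHost -Name 'esx01.example.local') -V2

        $esxcli.network.nic.list.Invoke() |
            Select-Object Name, Driver, Description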

  5. Have you updated the firmware on your switches, storage, and servers? We had an issue with Cisco switches dropping packets and using duplicate MAC addresses; clearing ARP actually made it worse, and a firmware update fixed it. Not saying this is your issue, but I’m curious whether all your firmware is up to date. A quick duplicate-MAC check across the VMs is sketched below.
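
    A quick PowerCLI pass for spotting duplicate MACs on the VM side, in case that rings a bell:

        # Group every virtual NIC in the inventory by MAC and show any repeats.
        Get-VM | Get-NetworkAdapter |
            Group-Object -Property MacAddress |
            Where-Object { $_.Count -gt 1 } |
            ForEach-Object {
                $_.Group | Select-Object @{N = 'VM'; E = { $_.Parent.Name } }, Name, MacAddress
            }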

  6. Are you using any third-party SFPs, like Axiom? We recently had to have all of our Axiom SFPs swapped out for Cisco-branded ones because we saw some very strange issues with the Axiom SFPs.

  7. Are your QLogic cards using the qfle3 driver? We just went through a shit show that only switching to the bnx2x driver fixed. We had to install bnx2x and then unload and uninstall the qfle3 driver from the host. A rough sketch of the module switch is below.
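
    Roughly what that looked like for us via esxcli (PowerCLI V2 interface; the host name is a placeholder – this needs maintenance mode and a reboot, so treat it as an outline of what support walked us through rather than a recipe):

        # See which of the two modules are present/enabled on the host.
        $esxcli = Get-EsxCli -VMHost (Get-VMHost -Name 'esx01.example.local') -V2

        $esxcli.system.module.list.Invoke() |
            Where-Object { $_.Name -match 'qfle3|bnx2x' }

        # Disabling the native qfle3 module lets bnx2x claim the cards after a reboot:
        # $esxcli.system.module.set.Invoke(@{ module = 'qfle3'; enabled = $false })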

  8. I had a similar issue – I eventually traced it to something broken inside Windows on that specific VM. It would just drop packets randomly. It didn’t matter which port on the physical switch; I tried VMXNET3 and E1000, and tried creating a dedicated port group pinned to a specific physical NIC. It would drop pings even from other VMs on the same host. I also tried turning off the firewalls/AV and anything else that might be interfering with packets. Power management wasn’t a factor. Nothing I tried helped. It was really weird – packet captures would just go dark until it reconnected.

    Ended up building a new Windows Server install and migrating all the roles over. I couldn’t be bothered to spend much more time troubleshooting an OS that was going to be obsolete soon anyway – this issue just hastened its decommissioning.

  9. I have a couple of clarifying questions:

    1. What is the duration of “losing network connectivity”? Is it a blip, or do they go offline until they are power-cycled or some other action is taken?
    2. Is there no network connectivity at all, or are Windows DHCP VMs just showing a 169.254.x.x address? (A quick in-guest check for that is sketched below.)
    3. Are these all Windows VMs, or a mix?
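
    For question 2, something as simple as this inside an affected Windows guest will show whether it has fallen back to an APIPA address:

        # Any 169.254.x.x (APIPA) addresses currently bound in the guest?
        Get-NetIPAddress -AddressFamily IPv4 |
            Where-Object { $_.IPAddress -like '169.254.*' } |
            Select-Object InterfaceAlias, IPAddress, PrefixOrigin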

  10. Long shot, but I have had similar intermittent networking issues. In my environment all the network interfaces are trunked and all VLANs are tagged in the DvSwitch port groups; our uplinks are active/active with load balancing set to “Route based on originating virtual port”.
    At times a VM has lost network while other VMs on the same host and VLAN were fine.
    It was traced to a misconfigured VLAN on one of the physical switch ports – because of the load balancing, any VM could be using any of the links at any time.
    The best way to detect this is to turn on the VLAN health check on the DvSwitch; a rough sketch is below.
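
    A rough PowerCLI sketch for switching the health check on – the switch name is a placeholder, and the API type/method names are from memory, so double-check them against the vSphere API reference before using:

        # Enable the VLAN/MTU health check on a distributed switch.
        $vds = Get-VDSwitch -Name 'DSwitch01'   # placeholder switch name

        $vlanMtu = New-Object VMware.Vim.VMwareDVSVlanMtuHealthCheckConfig
        $vlanMtu.Enable   = $true
        $vlanMtu.Interval = 1   # minutes between checks

        $vds.ExtensionData.UpdateDVSHealthCheckConfig(@($vlanMtu))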

  11. I see you mentioned the drivers are already up to date, so I assume that includes the blade firmware?

    What about the chassis and modules?

    We had an issue with our Gen10 blades and the FlexNIC ports where the entire network stack would just die until a reboot (a little bit worse than your situation). It turned out to be a combination of the VMware HPE drivers, the FlexNIC firmware, and the Virtual Connect version; before it was fixed, we had to roll back to VC 4.5. A quick sketch for pulling the NIC driver and firmware versions off a host is below.
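
    If it helps, this is roughly how we pulled the NIC driver and firmware versions off each host to compare against the supported combination (PowerCLI esxcli interface; the host name is a placeholder):

        # Driver name, driver version and firmware version per physical uplink.
        $esxcli = Get-EsxCli -VMHost (Get-VMHost -Name 'esx01.example.local') -V2

        foreach ($nic in $esxcli.network.nic.list.Invoke()) {
            $info = $esxcli.network.nic.get.Invoke(@{ nicname = $nic.Name }).DriverInfo
            [pscustomobject]@{
                Nic             = $nic.Name
                Driver          = $info.Driver
                DriverVersion   = $info.Version
                FirmwareVersion = $info.FirmwareVersion
            }
        }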
