VMware

Weird networking issue with new environment

I’m starting to see some odd issues that appear to be network-related on a new compute stack I’m building, but I can’t pin it down and I’m looking for opinions on things I can check.

The symptom first showed up when I was trying to do a file transfer between a VM on my old stack and a VM on my new stack. When I moved the target VM back to the old stack, the problem apparently went away. Today, I was trying to vMotion between hosts on the new stack (something that worked fine two weeks ago), only to get sporadic failures (mostly timeouts).

The vMotion network is isolated on my two new Cisco 9300 switches with 10Gb uplinks, so I think I can rule out anything related to the old compute stack or the old portion of the network. I also tried this both with and without jumbo frames, with the same results.

The hosts are HP DL360 Gen10s running VMware ESXi 6.7.0, build 13981272 (6.7.0 U2), with 2x10Gb uplinks and some 1Gb uplinks. The 1Gb links are on vSwitch0 and are used for management and the uplink to the core network. vMotion is on the 10Gb links and shares vSwitch1 with two other VLANs. All four hosts are connected to a stacked pair of Cisco 9300-24UX switches running 16.09.03. Both vSwitches are standard, not distributed.

I can’t definitively say when this problem started, but vMotion seemed to work fine before I started loading about a dozen non-production VMs onto the stack.

During the file transfer test, I was attempting a simple copy (from the Command Prompt) from a source Windows machine on the old compute stack (with 1Gb uplinks) to the target VM on the new stack (with 10Gb uplinks). The files were about 256MB each. The first two or three of about a dozen files would succeed, but then the command would time out and fail. The file transfers also seemed a little slow, but I didn’t time them.
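Since the transfers above were not timed, a small script can put numbers on "a little slow". The following is only a sketch: the `timed_copy` helper and the example paths are made up for illustration, and the reported rate includes disk and SMB overhead, not just the network.

```python
import shutil
import time
from pathlib import Path


def timed_copy(src: Path, dst: Path) -> tuple[int, float, float]:
    """Copy src to dst and return (bytes copied, seconds, MB per second)."""
    start = time.perf_counter()
    shutil.copyfile(src, dst)
    elapsed = time.perf_counter() - start
    size = Path(dst).stat().st_size
    # Guard against a zero reading on very small files or coarse timers.
    mb_per_s = size / 1_000_000 / elapsed if elapsed > 0 else float("inf")
    return size, elapsed, mb_per_s


def copy_all(sources: list[Path], target_dir: Path) -> None:
    """Copy each file in turn and print one line per file."""
    for src in sources:
        try:
            size, secs, rate = timed_copy(src, target_dir / src.name)
            print(f"{src.name}: {size} bytes in {secs:.2f}s ({rate:.1f} MB/s)")
        except OSError as exc:
            # A timeout or dropped share surfaces here as an OSError.
            print(f"{src.name}: FAILED ({exc})")
```

Calling `copy_all` with the dozen test files and a path on the target VM's share (both hypothetical here) would show whether the rate degrades before the copy fails or drops off suddenly.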

During the vMotion test, about half the time the vMotion fails with some variant of a communication error. I can ping to and from vmk1 (the vMotion vmkernel port) on all hosts with no packet loss, including pinging with jumbo frames (when I had them enabled).
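One caveat on the jumbo-frame ping: it only proves the path carries jumbo frames if the packet is sent unfragmented at the full size. The arithmetic below is a sketch that assumes plain IPv4 with no header options; the function name is made up for illustration.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header, assuming no options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_icmp_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError(f"MTU {mtu} is too small to carry an ICMP echo")
    return mtu - overhead


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: max unfragmented ICMP payload {max_icmp_payload(mtu)} bytes")
```

For a 9000-byte MTU this gives 8972 bytes, which is the size commonly passed to `vmkping` together with `-d` (don't fragment) and `-I vmk1`. A ping at that size that fails while a smaller one succeeds would point to an MTU mismatch somewhere along the path.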

Any suggestions on how I can start to troubleshoot this? Is there any chance this is storage-related, even though I’m not doing storage vMotion and my storage is all FC?



View Reddit by vmFrank

 


3 Comments

  1. Always check drivers and firmware first on a new cluster, especially with new hosts and 10Gb adapters.

    Next, the communication failures during the file transfer sound like a misconfigured EtherChannel. Make sure the connections between all switches carry the required VLANs and are configured correctly.

  2. What is your underlying storage, and how is it connected? I agree it could well be drivers and firmware: the driver and firmware families should match, and a mismatch has a high probability of being your issue. Still, I’m curious about the storage details. Also, you could try iperf with a large window size, maybe 5 MB, to really saturate the pipe for long periods and see whether you get consistent bandwidth or the dreaded sawtooth ("sharktooth") pattern that indicates packet loss with TCP on the network side.
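To make the iperf suggestion a bit more concrete, here is a rough sketch of how per-interval throughput readings could be checked for that sawtooth shape. It assumes the interval figures have already been copied out of the iperf output by hand into a list of Mbit/s values. The function name, the 30% drop threshold, and the sample numbers are all invented for illustration, not measurements from this environment.

```python
from statistics import mean, pstdev


def summarize_throughput(samples_mbps, drop_fraction=0.3):
    """Summarize per-interval throughput and count sharp interval-to-interval drops.

    A drop counts as "sharp" when throughput falls by at least drop_fraction
    relative to the previous interval (the threshold is an arbitrary assumption).
    """
    if len(samples_mbps) < 2:
        raise ValueError("need at least two interval samples")
    avg = mean(samples_mbps)
    # Coefficient of variation: spread relative to the mean.
    cv = pstdev(samples_mbps) / avg if avg else float("inf")
    sharp_drops = sum(
        1
        for prev, cur in zip(samples_mbps, samples_mbps[1:])
        if prev > 0 and (prev - cur) / prev >= drop_fraction
    )
    return {"mean_mbps": avg, "cv": cv, "sharp_drops": sharp_drops}


if __name__ == "__main__":
    # Made-up sample data: one steady run and one sawtooth-like run.
    steady = [9400, 9410, 9390, 9405]
    sawtooth = [9000, 4500, 6000, 8000, 9000, 4400, 6100]
    print("steady:  ", summarize_throughput(steady))
    print("sawtooth:", summarize_throughput(sawtooth))
```

A steady run should show a low coefficient of variation and no sharp drops. Repeated sharp drops followed by slow recovery are consistent with TCP backing off after loss, though they do not by themselves show where the loss occurs.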
