4 Node Cluster
Dell 7415 – All flash
QLogic FastLinQ Adapters – FS.com modules and fiber cables.
Cisco 9500 x16 w/ 8x 10Gbps SFP+ Module
vNetwork is vDS. vSAN Port Group is tagged as VLAN 125.
Each host has two adapters: vmnic2 is plugged into te1/1/(1,3,5,7) and vmnic6 into te2/1/(1,3,5,7).
The port groups on the vDS are set to explicit failover. VM network and vMotion traffic are both set to active on vmnic2 and standby on vmnic6. The vSAN traffic is reversed: vmnic6 is active and vmnic2 is standby.
All the Cisco ports are configured like this:
switchport trunk allowed vlan 1,30,100,110,120,125,1010
switchport mode trunk
spanning-tree portfast trunk
So, here's the strangeness. When the port group failure detection is set to Link State Only with Notify Switches enabled, we get MAC flapping between the two ports associated with a host. So if ESXi1's active port is te2/1/1 and its standby is te1/1/1, we get a ton of flapping on those ports, and that happens on all 4 nodes.
If I set it to not notify the switches and use beacon probing, I don't get any MAC flapping.
When I run the vSAN health checks, everything greens up. Then I run them again, and the unicast ping and large MTU ping checks fail. Then I run them again, and everything greens up again. It's *always* the vSAN network. The vMotion unicast and vMotion MTU pings always work.
With the Cisco 9500s, we can only set the MTU globally, and it is at 1500, as are all the vmk adapters.
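For reference, a 1500-byte MTU leaves 1472 bytes of ICMP payload once the 20-byte IPv4 header (no options) and the 8-byte ICMP header are subtracted. Below is a minimal Python sketch of that arithmetic. It also builds a vmkping command line using the don't-fragment flag (-d) and payload size (-s). The helper names and the placeholder address 192.0.2.11 are made up for illustration; vmk1 is simply the interface used in the vmkping run in this post.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_icmp_payload(mtu: int) -> int:
    """Largest ICMP payload that fits in one unfragmented packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def vmkping_df_command(vmk: str, target: str, mtu: int) -> str:
    """Build a vmkping command that sets DF (-d) and the payload size (-s)."""
    return f"vmkping -I {vmk} -d -s {max_icmp_payload(mtu)} {target}"


if __name__ == "__main__":
    # 192.0.2.11 is a documentation placeholder, not a real host address.
    print(vmkping_df_command("vmk1", "192.0.2.11", 1500))
```

If a ping at that size with -d fails while a default-size ping works, something in the path is probably using a smaller MTU than expected.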
When the vmkpings fail, it's ALWAYS the same two hosts. I moved the hosts around on the switch to see if it was the switch, and the problem follows the hosts. If I run a continuous vmkping between those two hosts, the pings on the vSAN network will go through, then out of nowhere stop. It can drop anywhere between 5 and 200 packets, but it always comes back.
I'm about 98% positive it's a switch configuration issue; however, with the failure always being on the same two hosts, it makes me question that.
I started a ticket with support, and the guy went through everything and said it looked good EXCEPT two values, seen here:
[root@esxi14:/var/log] vsish -e get /net/pNics/vmnic6/stats
He stated that those numbers seemed high given that all MTUs are set to 1500.
Any thoughts, questions comments, concerns or anything would be helpful.
Edit: Well, it *was* always the same two hosts that had communication issues. One more host just joined in.
It looks like this:
11 -> 14 = Fail
14 -> 11 = Fail
12 -> 14 = Fail
14 -> 12 = Fail
…..and then I run it again, and it is just 11 and 14.
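One quick way to see whether a single host is the common factor is to count how often each host shows up in the failing pairs. Here is a rough Python sketch using the host numbers from the list above. The function name is made up for illustration, and it only tallies results you already have; it does not run any pings.

```python
from collections import Counter


def hosts_by_failure_count(failed_pairs):
    """Count how many failing (source, destination) pairs each host appears in."""
    counts = Counter()
    for src, dst in failed_pairs:
        counts[src] += 1
        counts[dst] += 1
    # Most frequently involved host first.
    return counts.most_common()


# Failing pairs copied from the health check results above.
FAILED = [(11, 14), (14, 11), (12, 14), (14, 12)]

if __name__ == "__main__":
    for host, count in hosts_by_failure_count(FAILED):
        print(f"host {host}: in {count} failing pairs")
```

A host that appears in every failing pair is the first place to look, though it does not rule out the switch port or cable it is attached to.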
vmkping output from 14
vmkping -I vmk1 11 -c 1000000
PING 11 (11): 56 data bytes
64 bytes from 11: icmp_seq=1 ttl=64 time=0.139 ms
64 bytes from 11: icmp_seq=2 ttl=64 time=0.111 ms
64 bytes from 11: icmp_seq=3 ttl=64 time=0.108 ms
64 bytes from 11: icmp_seq=4 ttl=64 time=0.111 ms
64 bytes from 11: icmp_seq=5 ttl=64 time=0.117 ms
64 bytes from 11: icmp_seq=11 ttl=64 time=0.137 ms
64 bytes from 11: icmp_seq=12 ttl=64 time=0.147 ms
64 bytes from 11: icmp_seq=13 ttl=64 time=0.130 ms
64 bytes from 11: icmp_seq=14 ttl=64 time=0.183 ms
64 bytes from 11: icmp_seq=15 ttl=64 time=0.138 ms
64 bytes from 11: icmp_seq=61 ttl=64 time=0.143 ms
There are 5 packets lost between icmp_seq 5 and 11 (6 through 10), and 45 packets lost between 15 and 61 (16 through 60).
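Counting the gaps by eye gets tedious on a long run, so here is a small Python sketch that pulls the icmp_seq values out of saved vmkping output and reports each gap with the number of missing replies. It assumes the output keeps the icmp_seq=N format shown above. The function name and the shortened sample are only for illustration.

```python
import re

SEQ_RE = re.compile(r"icmp_seq=(\d+)")


def find_gaps(ping_output: str):
    """Return (last_seen, next_seen, missing_count) for each gap in icmp_seq."""
    seqs = [int(m.group(1)) for m in SEQ_RE.finditer(ping_output)]
    gaps = []
    for prev, cur in zip(seqs, seqs[1:]):
        if cur - prev > 1:
            gaps.append((prev, cur, cur - prev - 1))
    return gaps


# Shortened sample in the same format as the vmkping output above.
SAMPLE = """\
64 bytes from 11: icmp_seq=4 ttl=64 time=0.111 ms
64 bytes from 11: icmp_seq=5 ttl=64 time=0.117 ms
64 bytes from 11: icmp_seq=11 ttl=64 time=0.137 ms
64 bytes from 11: icmp_seq=15 ttl=64 time=0.138 ms
64 bytes from 11: icmp_seq=61 ttl=64 time=0.143 ms
"""

if __name__ == "__main__":
    for last_seen, next_seen, missing in find_gaps(SAMPLE):
        print(f"{missing} replies missing between seq {last_seen} and {next_seen}")
```

Replies lost at the very start or end of a run will not show up as gaps, so compare against the summary line vmkping prints when it finishes.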
Edit: For troubleshooting, I moved the primary vSAN network off of the add-in module of the 9500 and into the main chassis of the switch, and I get the same results with the same two hosts. Fart.
I'll check out the SVL and the firmware. Thanks!!!!
#show stackwise-virtual link
Stackwise Virtual Link(SVL) Information:
S-Suspended P-Pending E-Error T-Timeout R-Ready
Switch  SVL  Ports                     Link-Status  Protocol-Status
------  ---  -----                     -----------  ---------------
1       1    TenGigabitEthernet1/0/14  U            R
             TenGigabitEthernet1/0/15  U            R
2       1    TenGigabitEthernet2/0/14  U            R
             TenGigabitEthernet2/0/15  U            R
Stackwise Virtual Configuration:
Stackwise Virtual : Enabled
Domain Number : 1
Switch  Stackwise Virtual Link  Ports
------  ----------------------  ------
1       1                       TenGigabitEthernet1/0/14
2       1                       TenGigabitEthernet2/0/14
And according to everything else, it's set up correctly. We are not using VLAN 4094 for anything anywhere else.
The firmware on the QLogics is at 14.10.07, which is the latest version I can get from the Dell website.
I took the standby adapters out of the mix until I can get some more insight from a better networking person than I am.
Thank you for all the help.
Those Cisco 9500 x16s are running on 16.10.1, and I have heard that that firmware is less than great…possible issue?
Well… found the issue with the help of VMware tech support: "If they are the issue you are having with 6.7 U3 makes sense, Friday last week… we received an informed about Dell servers having all kind of networking issues with 6.7U3, so they are not recommending to upgrade yet, so wanted to gather that information from you to report to my seniors about another case of Dell servers with the 6.7U3 causing issues."
Turns out to be an internally known issue.
View Reddit by PBandCheezWhiz – View Source