VMware

Strange vSAN Connection Issues

4 Node Cluster

Dell 7415 – All flash

QLogic FastLinQ Adapters – FS.com modules and fiber cables.

Cisco 9500 x16 w/ 8x 10Gbps SFP+ Module

vNetwork is vDS. vSAN Port Group is tagged as VLAN 125.

Each host has two adapters: vmnic2 is plugged into te1/1/(1,3,5,7) and vmnic6 into te2/1/(1,3,5,7).

The port groups on the vDS are set to explicit failover. VM network and vMotion traffic are both set to active on vmnic2 and standby on vmnic6. The vSAN traffic is set in reverse: vmnic6 is active and vmnic2 is standby.

All the Cisco ports are configured like this:

interface TenGigabitEthernet1/1/5
 switchport trunk allowed vlan 1,30,100,110,120,125,1010
 switchport mode trunk
 switchport nonegotiate
 spanning-tree portfast trunk

​

So, here’s the strangeness. When the port group failover detection is set to Link State Only with Notify Switches enabled, we get MAC flapping between the ports associated with a host. So if ESXi1’s active port is te2/1/1 and its standby is te1/1/1, we get a ton of flapping on those ports, and that happens on all 4 nodes.

If I set it to not notify the switches and to use beacon probing, I don’t get any MAC flapping.

When I run the vSAN health checks, everything greens up. Then I run them again, and the Unicast Ping and Large MTU ping checks fail. Then I do it again, and everything greens up again. It’s *always* the vSAN network. The pings for the vMotion unicast and vMotion MTU checks always work.

With the Cisco 9500s, we can only set the MTU size globally, and it is at 1500, as are all the vmk adapters.
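For reference, the largest ping payload that fits in one unfragmented packet follows directly from the MTU: subtract the 20-byte IPv4 header and the 8-byte ICMP header. A minimal sketch of that arithmetic, assuming IPv4 with no IP options (the function name is only illustrative):

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one packet at this MTU."""
    return mtu - IPV4_HEADER - ICMP_HEADER


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: max unfragmented payload {max_ping_payload(mtu)} bytes")
```

With everything at 1500, a don't-fragment test such as `vmkping -I vmk1 -d -s 1472 <target>` should pass end to end, and a payload one byte larger should not.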

When the vmkpings fail, it’s ALWAYS the same two hosts. I moved the hosts around on the switch to see if it was the switch, and the problem follows the hosts. If I run a continuous vmkping between those two hosts, the pings on the vSAN network will go through, then out of nowhere stop. It can drop anywhere between 5 and 200 packets, but it always comes back.

I’m about 98% positive it’s a switch configuration issue; however, with the failure always being on the same two hosts, it makes me question that.

I started a ticket with support, and the guy went through everything and said it looked good EXCEPT for two values, seen here.

[root@esxi14:/var/log] vsish -e get /net/pNics/vmnic6/stats

rx_1519_to_max_byte_packets: 49042027

tx_1519_to_max_byte_packets: 10999754

He stated that those numbers seemed high given that all MTUs are set to 1500.
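One possible reading of those counters, offered as an assumption rather than a diagnosis: a full 1500-byte packet becomes a 1518-byte Ethernet frame once the 14-byte header and 4-byte FCS are added, and 1522 bytes if it also carries an 802.1Q VLAN tag. If this NIC counts the tag in the frame size (not confirmed here), full-size tagged frames would land in the 1519_to_max bucket even with every MTU at 1500. A rough sketch of the arithmetic, with illustrative names:

```python
ETH_HEADER = 14  # destination MAC + source MAC + EtherType
ETH_FCS = 4      # frame check sequence
DOT1Q_TAG = 4    # 802.1Q VLAN tag


def frame_size(mtu: int, tagged: bool) -> int:
    """On-wire Ethernet frame size for a packet that fills the MTU."""
    return mtu + ETH_HEADER + ETH_FCS + (DOT1Q_TAG if tagged else 0)


def in_1519_to_max_bucket(size: int) -> bool:
    """True if a frame of this size would fall in a 1519-and-up counter."""
    return size >= 1519


if __name__ == "__main__":
    for tagged in (False, True):
        size = frame_size(1500, tagged)
        label = "tagged" if tagged else "untagged"
        print(f"{label}: {size} bytes, 1519+ bucket: {in_1519_to_max_bucket(size)}")
```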

​

​

Any thoughts, questions, comments, concerns or anything would be helpful.

​

Edit: Well it *was* always the same two hosts that had communication issues. One more host just added to it.

It looks like this:

11 -> 14 = Fail

14 -> 11 = Fail

12 -> 14 = Fail

14 -> 12 = Fail

…..and then I run it again, and it is just 11 and 14.
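A quick way to see which host the failures have in common is to tally how often each host appears in a failing pair. This is only a sketch using the four failures listed above; the helper name is made up for illustration:

```python
from collections import Counter

# Failing source -> destination pairs, as listed in the post.
FAILURES = [("11", "14"), ("14", "11"), ("12", "14"), ("14", "12")]


def hosts_by_failure_count(pairs):
    """Count how many failing pairs each host takes part in, most frequent first."""
    counts = Counter()
    for src, dst in pairs:
        counts[src] += 1
        counts[dst] += 1
    return counts.most_common()


if __name__ == "__main__":
    for host, count in hosts_by_failure_count(FAILURES):
        print(f"host {host}: in {count} failing pairs")
```

Host 14 shows up in every failing pair, which points at that host (or its path) rather than at 11 or 12.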

​

vmkping output from 14

vmkping -I vmk1 11 -c 1000000
PING 11 (11): 56 data bytes
64 bytes from 11: icmp_seq=1 ttl=64 time=0.139 ms
64 bytes from 11: icmp_seq=2 ttl=64 time=0.111 ms
64 bytes from 11: icmp_seq=3 ttl=64 time=0.108 ms
64 bytes from 11: icmp_seq=4 ttl=64 time=0.111 ms
64 bytes from 11: icmp_seq=5 ttl=64 time=0.117 ms
64 bytes from 11: icmp_seq=11 ttl=64 time=0.137 ms
64 bytes from 11: icmp_seq=12 ttl=64 time=0.147 ms
64 bytes from 11: icmp_seq=13 ttl=64 time=0.130 ms
64 bytes from 11: icmp_seq=14 ttl=64 time=0.183 ms
64 bytes from 11: icmp_seq=15 ttl=64 time=0.138 ms
64 bytes from 11: icmp_seq=61 ttl=64 time=0.143 ms

​

There are 5 packets lost between 5 and 11 (sequence numbers 6 through 10), and 45 packets lost between 15 and 61.
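Counting drops by eye gets tedious on a long run, so here is a small sketch that pulls the `icmp_seq` values out of saved ping output and reports each gap. It assumes the reply line format shown above; the function name is illustrative. Note that the number of lost packets is the count of sequence numbers strictly between two replies, which is one less than the plain difference.

```python
import re

SEQ_RE = re.compile(r"icmp_seq=(\d+)")


def find_gaps(lines):
    """Return (last_seen, next_seen, lost) for every gap in icmp_seq."""
    seqs = [int(m.group(1)) for line in lines if (m := SEQ_RE.search(line))]
    gaps = []
    for prev, cur in zip(seqs, seqs[1:]):
        if cur - prev > 1:
            # Replies prev and cur both arrived; everything between was lost.
            gaps.append((prev, cur, cur - prev - 1))
    return gaps


if __name__ == "__main__":
    # Sequence numbers taken from the output above; timing fields omitted.
    seen = (1, 2, 3, 4, 5, 11, 12, 13, 14, 15, 61)
    sample = [f"64 bytes from 11: icmp_seq={n} ttl=64" for n in seen]
    for prev, cur, lost in find_gaps(sample):
        print(f"{lost} packets lost between seq {prev} and {cur}")
```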

​

Edit: For troubleshooting, I moved the primary vSAN network off of the add-in module of the 9500 and onto the main chassis of the switch, and I get the same results with the same two hosts. Fart.

I’ll check out the SVL and firmware. Thanks!!!!

Using this:

[https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst9500/software/release/16-10/configuration_guide/ha/b_1610_ha_9500_cg/configuring_cisco_stackwise_virtual.html](https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst9500/software/release/16-10/configuration_guide/ha/b_1610_ha_9500_cg/configuring_cisco_stackwise_virtual.html)

​

#show stackwise-virtual link
Stackwise Virtual Link(SVL) Information:
----------------------------------------
Flags:
------
Link Status
-----------
U-Up D-Down
Protocol Status
---------------
S-Suspended P-Pending E-Error T-Timeout R-Ready
-----------------------------------------------
Switch  SVL  Ports                      Link-Status  Protocol-Status
------  ---  -----                      -----------  ---------------
1       1    TenGigabitEthernet1/0/14   U            R
             TenGigabitEthernet1/0/15   U            R
2       1    TenGigabitEthernet2/0/14   U            R
             TenGigabitEthernet2/0/15   U            R

#show stackwise-virtual
Stackwise Virtual Configuration:
--------------------------------
Stackwise Virtual : Enabled
Domain Number : 1

Switch  Stackwise Virtual Link  Ports
------  ----------------------  ------
1       1                       TenGigabitEthernet1/0/14
                                TenGigabitEthernet1/0/15
2       1                       TenGigabitEthernet2/0/14
                                TenGigabitEthernet2/0/15

And according to everything else, it’s set up correctly. We are not using VLAN 4094 for anything anywhere else.

The firmware on the QLogic adapters is at 14.10.07, which is the latest version I can get from the Dell website.

​

I took the standby adapters out of the mix until I can get some more insight from a better networking person than I am.

​

Thank you for all the help.

Those Cisco 9500 x16s are running on 16.10.1, and I have heard that that firmware is less than great…possible issue?

​

Well… found the issue with the help of VMware tech support: “If they are the issue you are having with 6.7 U3 makes sense, Friday last week… we received an informed about Dell servers having all kind of networking issues with 6.7U3, so they are not recommending to upgrade yet, so wanted to gather that information from you to report to my seniors about another case of Dell servers with the 6.7U3 causing issues.”

Turns out to be an internally known issue.


Posted on Reddit by PBandCheezWhiz

4 Comments

  1. Sounds like you have an issue between the two switches.

    As always check drivers/firmware.

    Don’t use beacon probing unless you have 3 switches and 3 network adapters.

    Check that the switches have a correct VPC configuration between them, that their firmware is current, and that VLAN 125 is allowed over it with the correct MTU.

    The behavior sounds like a malformed EtherChannel.

  2. And my final edit.

    I just could NOT get the qedentv drivers to update. They would just never take or install correctly. So I reloaded each ESXi host again, but WITHOUT the Dell customized ISO, and then did the updates via VUM, and it worked. As of right now, fingers crossed, everything is staying connected and working as it should.
