4-node cluster running 6.5
Red Hat VM
dVswitch, 2 links connected to a Nexus core
LB set to Route based on physical NIC load

I have one VM that seems to lose network connectivity in a strange way. From inside the VM, if I ping a certain hostname, the name resolves but the ping to that IP fails. If I ping the IP directly, it responds. I’m told this happens about every 6 months and the last time was Dec 2019.

I’m told that in the past, when this occurred, they would migrate the VM to a different host in the cluster and it would just start working.

I’m here now, and migrating isn’t a real solution. The system is working again because we did the host migration, so I can’t reproduce the problem.
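Since it can't be reproduced on demand, one option is to capture the guest's view the next time it happens, before anyone migrates the VM. The following is only a minimal sketch, assuming a Linux guest with Python 3.7+, the iputils `ping`, and the `ip` command available. The function names (`unique_addresses`, `ping_once`, `snapshot`) are made up for illustration. It lists every address the hostname resolves to, pings each one, and dumps the neighbor (ARP) table, which should show whether the name resolves to a different address than the one that answers when pinged directly.

```python
import socket
import subprocess
import sys


def unique_addresses(addrinfo):
    """Return the distinct IP strings from socket.getaddrinfo() results, in order."""
    seen = []
    for _family, _type, _proto, _canon, sockaddr in addrinfo:
        ip = sockaddr[0]
        if ip not in seen:
            seen.append(ip)
    return seen


def ping_once(ip, timeout_s=2):
    """Send one ICMP echo with the system ping; True if a reply came back."""
    # Assumption: Linux iputils flags (-c count, -W reply timeout in seconds).
    # Older releases may need `ping6` for IPv6 addresses.
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def snapshot(hostname):
    """Print each resolved address with its ping result, then the neighbor table."""
    addresses = unique_addresses(socket.getaddrinfo(hostname, None))
    for ip in addresses:
        status = "reply" if ping_once(ip) else "NO reply"
        print(f"{hostname} -> {ip}: {status}")
    # Neighbor/ARP cache as the guest currently sees it.
    neigh = subprocess.run(["ip", "neigh", "show"], capture_output=True, text=True)
    print(neigh.stdout)


if __name__ == "__main__" and len(sys.argv) == 2:
    snapshot(sys.argv[1])
```

Run it inside the affected VM with the problem hostname as the only argument, and keep the output alongside the time and the ESXi host the VM was on.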

I’m at a loss. In the past, when I’ve seen these types of issues, they were related to the load balancing setting on the switch or portgroup, and they usually affected multiple VMs. This only occurs on 1 VM, and there is another identical VM for this app in the cluster that never experiences it.

Can I get some ideas on where to troubleshoot next?

The only strange thing I have found is that the host the affected VM is on is showing the wrong CDP info: it’s showing one of the other hosts in the cluster as its CDP neighbor. I’ve never seen this before… possibly related? No other VMs on this host are having issues or have had them in the past.



I’ve compared the networking settings between all the hosts and they appear to be the same, but I’m going to go over them again. Any pointers anyone has would be great. This is a real head-scratcher for me, but maybe it’s something people have run into before?
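For the second pass over the host settings, a small script can make differences harder to miss than comparing by eye. This is just a sketch: it assumes the settings have already been exported by hand into one dictionary per host, and the host names, keys, and values shown are hypothetical placeholders, not taken from this environment.

```python
def diff_settings(per_host):
    """Return {setting: {host: value}} for settings that differ between hosts."""
    all_keys = set()
    for settings in per_host.values():
        all_keys.update(settings)
    differing = {}
    for key in sorted(all_keys):
        # A setting missing on a host shows up as None, which also counts as a difference.
        values = {host: settings.get(key) for host, settings in per_host.items()}
        if len(set(values.values())) > 1:
            differing[key] = values
    return differing


# Hypothetical example data: replace with the values recorded from each host.
hosts = {
    "host-a": {"teaming": "physical NIC load", "uplinks": "2", "cdp": "listen"},
    "host-b": {"teaming": "physical NIC load", "uplinks": "2", "cdp": "both"},
}

for setting, values in diff_settings(hosts).items():
    print(setting, values)
```

Only settings whose values are not identical on every host are printed, so an empty result means the recorded settings match.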

  1. Double-check your load balancing and teaming method. Our UCS/Nexus environment requires source MAC hash to function properly, whereas our UCS/ACI environment requires that we use originating source IP (probably slightly wrong terminology, offhand). The impact of an incorrect configuration was what appeared to be odd ARP issues, where communication would be lost between guest VMs, but they could still be reached from workstations or servers elsewhere on the network. That was also resolved by migration to a shared host.

    Look into the vendor-specific docs for the compute platform and the switch stack, IMO.

