This is definitely one for VMware support, but this is only my homelab so I don’t have that option, so I’m putting it out there for the community in case someone else bumps into this or has any input.
tl;dr: Is there an incompatibility between NSX-T and the NSX network introspection driver included with VMware Tools? I’m seeing IPv6 address corruption on outbound packets, but only when the traffic originates from a VM with the driver installed; replies to externally initiated connections seem fine. The problem goes away when the driver is uninstalled. These are VMs that are **not** attached to a logical segment, just a plain old DVS. Linux VMs running open-vm-tools are (unsurprisingly) unaffected.
For some background, I’ve been kicking the tires on NSX-T for a couple of weeks now on a couple of test hosts, and decided to roll it out across all my hosts over the Christmas break whilst I had some downtime. All seemed to be working fine initially and I was looking forward to some network virt loveliness. The whole network is dual stacked and running the latest versions of everything: ESXi 6.7 U3b/NSX-T 2.5.1.
It started a couple of days ago with Windows NPS failing to start on three separate Windows VMs; surely they can’t all be hosed? Eventually I tracked it down to them not being able to resolve client FQDNs from DNS on startup. All these VMs are dual stacked, with the primary DNS server being v6. nslookup just hangs and won’t resolve anything; IPv4 is fine.
Now these three VMs are also AD DCs, so they have DNS running locally, and trying `nslookup - ::1` also times out. Strange. DNS is running and listening on the correct addresses, and I can query it externally from my laptop no problem. Stranger.
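For anyone wanting to reproduce the test without relying on nslookup’s quirks, here’s a rough Python sketch that fires a raw AAAA query at a resolver over IPv6 UDP and reports whether anything comes back at all. The server address and record name in the usage below are placeholders, not my actual lab values:

```python
import secrets
import socket
import struct

def build_aaaa_query(name: str) -> bytes:
    """Minimal DNS query: header + QNAME + QTYPE AAAA (28) + QCLASS IN (1)."""
    txid = secrets.randbits(16)
    # Header: id, flags (RD=1), QDCOUNT=1, ANCOUNT/NSCOUNT/ARCOUNT=0
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(bytes([len(p)]) + p.encode() for p in name.split(".")) + b"\x00"
    return header + qname + struct.pack(">HH", 28, 1)

def probe(server: str, name: str, timeout: float = 2.0) -> bool:
    """Send the query to the server over IPv6 UDP; True if any reply arrives."""
    with socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(build_aaaa_query(name), (server, 53))
        try:
            s.recvfrom(512)
            return True
        except socket.timeout:
            return False
```

If the behaviour matches what I saw, `probe("::1", "dc1.example.com")` from an affected VM just times out, while the exact same query from an outside machine gets an answer.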
Deploy a fresh VM from the same template: same issue. Download that template and run it locally in Workstation: fine.
Cut forward several hours of frustrating troubleshooting: a clean ESXi install works perfectly, but as soon as I prepare the host as an NSX-T transport node the problem returns. Remember, these VMs aren’t attached to a logical segment, just a plain old DVS.
Digging into the dvfilters with nsxcli and running some captures, I can see that IPv6 traffic originating from the VM is being mangled like so:
1. cmd.exe -> `ping 2001:db8:0001::1`
2. On the wire as destination address: `0:0:2001:db7:0001:::`
3. Lost in the ether never to be seen again
4. If I hit the VM externally its reply traffic doesn’t seem to be affected.
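For anyone wanting to check their own captures for the same mangling, here’s a quick Python helper (a sketch, nothing official) that pulls the source and destination straight out of a raw IPv6 header once the Ethernet framing has been stripped. The source address `2001:db8::10` is just an illustrative value:

```python
import ipaddress

def ipv6_addrs(pkt: bytes) -> tuple[str, str]:
    """Extract (src, dst) from the fixed 40-byte IPv6 header."""
    if len(pkt) < 40 or pkt[0] >> 4 != 6:
        raise ValueError("not an IPv6 header")
    return (str(ipaddress.IPv6Address(pkt[8:24])),
            str(ipaddress.IPv6Address(pkt[24:40])))

# Hand-built header: version nibble 6, remaining fields zeroed, then the
# source/destination a healthy ping to 2001:db8:1::1 should carry.
hdr = (b"\x60" + bytes(7)
       + ipaddress.IPv6Address("2001:db8::10").packed
       + ipaddress.IPv6Address("2001:db8:1::1").packed)
print(ipv6_addrs(hdr))  # ('2001:db8::10', '2001:db8:1::1')
```

On a healthy packet the destination parsed here matches what the guest actually pinged; on the mangled packets I was capturing, it doesn’t.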
Linux VMs with open-vm-tools are fine, so I start to suspect a Tools driver issue.
Maybe it’s a bug in the 11.0.1 vmxnet3 driver (I only upgraded everything last week), so I try disabling offloading etc. Nope. Changing the NIC to e1000e: same.
Eventually I install a fresh copy of Server 2019 from the install ISO to make sure my template isn’t hosed; with e1000e and no Tools installed it works perfectly. OK, getting somewhere.
Install Tools with the Complete install option (usual habit; I’m wondering if maybe it’s a bad one now) and immediately the problem returns. Uninstall and it goes away again. This time I install with the Typical option and everything still works; change the NIC back to vmxnet3 and it’s still OK.
Have a look at the optional install selections, as I haven’t looked in a while, and take a first guess at VMCI -> NSX Network Introspection driver. Bingo, that’s the kiddie, and I can make the issue come and go on demand.
I’ve tried with 11.0.0 Tools and hit the same issue; I haven’t tried any older versions yet.
I need to do some more research on what this driver actually does/enables in detail to see if there is anything else I’m missing. From a cursory google and a read of the release notes I can’t tell whether it is meant for NSX-V rather than NSX-T, which was my initial thought.
Has anyone run into anything like this before? It’s certainly run me around well; I haven’t had one this good in a while.
As a side note, has anyone else found it harder to go from NSX-V to NSX-T than it was to go from nothing to NSX-V? There’s so much to relearn with a different twist. It’s really embarrassing, but I swear it took me 2-3 weeks to properly grasp how the transport zone/uplink profile/transport node profile concepts interact. Same with VLAN segments. It all seems so logical now, but boy, it was hard to get my head around coming from V; it makes perfect sense from a platform-agnostic perspective.