VMware 6.5 on M610 w/ InfiniBand – Cluster Networking Woes

Hello everyone!

I’ve been configuring a cluster of M610 blades as a little makeshift vSAN lab, and I intend to run cluster traffic (vMotion, vSAN) over the InfiniBand switch (M3601Q) using IPoIB.

I currently have four blades with ESXi 6.5 installed. Each has a Mellanox ConnectX InfiniBand card installed (three are MT26428s, one is an MT25418; 40 Gb/s vs 20 Gb/s respectively).

To get the mezzanine cards to show up properly I followed the directions here: [https://forums.servethehome.com/index.php?threads/10gb-sfp-single-port-cheaper-than-dirt.6893/page-9#post-145423](https://forums.servethehome.com/index.php?threads/10gb-sfp-single-port-cheaper-than-dirt.6893/page-9#post-145423)

After installing the MLNX OFED driver I can see the associated port showing as Connected at 40000 Mbps in my Physical NIC display (or 20000 Mbps for the one older card).

I have opensm running in MASTER mode on a barebones Linux install on an old M600 that also has an InfiniBand card, acting as the subnet manager.

Normally, if I were running these nodes on Linux, at this point I would use ibping, ibstat, etc. to confirm that the link is functional on each node. I’m unsure how to do that on ESXi, so I just went forward assuming that each NIC showing as ‘Connected’ means it is working.
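
For reference, the closest ESXi-side equivalent of a basic reachability test is probably `vmkping` from the ESXi shell. It tests IP connectivity over a chosen VMkernel interface rather than the InfiniBand link itself, so it is not a substitute for ibping. A minimal sketch that just builds the command lines to run on each host follows. The interface name `vmk1`, the peer addresses, and the 1472-byte payload (which assumes a 1500-byte MTU) are placeholders, not values from this setup.

```python
def vmkping_commands(vmk, peer_ips, payload=1472):
    """Build vmkping command lines to test each peer from one VMkernel interface.

    -I selects the outgoing VMkernel interface, -s sets the ICMP payload size,
    and -d sets the don't-fragment bit so an MTU mismatch shows up as a failure.
    """
    return ["vmkping -I {} -s {} -d {}".format(vmk, payload, ip) for ip in peer_ips]


if __name__ == "__main__":
    # Hypothetical values: vmk1 and the 10.10.10.x addresses are placeholders.
    for cmd in vmkping_commands("vmk1", ["10.10.10.12", "10.10.10.13", "10.10.10.14"]):
        print(cmd)
```

If the pings fail over the InfiniBand-backed interface but succeed over the 1 Gb one, that would point at the IPoIB path rather than at the vSAN configuration.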

Each node has a new virtual switch created, with the 40(20)G NIC as its uplink and a VMkernel adapter that has vMotion, vSAN, Provisioning, and Replication enabled (the 1 Gb link on each holds the Management and VM networks).

With that configuration set, I end up with 4 network partitions and no inter-node communication for vSAN. If I put a VMkernel adapter with those functions on the 1 Gb NIC instead, everything works hunky-dory (just very slowly).
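
Four partitions across four hosts suggests each host sees only itself. One way to confirm that per host is `esxcli vsan cluster get`, which reports how many members the local host believes are in its sub-cluster. Below is a rough sketch for checking saved output from that command. It assumes the output contains a `Sub-Cluster Member Count:` line. The function names are made up for illustration.

```python
import re


def subcluster_member_count(cluster_get_output):
    """Return the integer after 'Sub-Cluster Member Count:', or None if the line is absent."""
    match = re.search(r"Sub-Cluster Member Count:\s*(\d+)", cluster_get_output)
    return int(match.group(1)) if match else None


def looks_partitioned(cluster_get_output, expected_hosts):
    """True when the host reports fewer sub-cluster members than expected."""
    count = subcluster_member_count(cluster_get_output)
    return count is not None and count < expected_hosts
```

With four hosts, `looks_partitioned(output, 4)` returning true on every host would match the symptom described above. `esxcli vsan network list` on each host should also show which VMkernel interface is actually tagged for vSAN traffic.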


I’m not sure how to proceed from here. Anyone have any guidance on where to look next?

