Hi guys, this problem is driving me crazy and preventing me from putting this kit into production.
We have small clusters of 6 hosts each, running ESXi 6.7 U3 with vCenter 6.7 U3 and a 3PAR 8200 storage array. The servers are DL360 Gen10s with 4 Intel X710 10GbE ports and a 32GB SD card in each.
ESXi is installed on the SD card, and the scratch location is set to a shared iSCSI datastore, with each host having its own folder. This appeared to work fine until I rebooted the whole cluster of 6 at the same time. When I do that, the datastore with the scratch folders randomly disappears on some hosts, and the affected host loses its ScratchConfig.ConfiguredScratchLocation setting, which reverts to the default.
I have tried setting Syslog.global.logDir, which I thought made an improvement, but after more reboots the problem was still there.
When a host drops its scratch location, it hangs longer on boot and also has this error message in the log:
LVM: 15237: Failed to open device naa.60002ac0000000000000000200020cb8:1 : Atomic test and set of disk block returned false for equality
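For anyone who wants to check their own hosts for the same thing, a rough Python sketch along these lines could count how often each device shows this failure in a copy of the log. It assumes the lines look exactly like the one quoted above (other builds may format them differently), and the function name and sample data are just made up for illustration.

```python
import re
from collections import Counter

# Pattern built from the single log line quoted above (an assumption:
# other ESXi builds may word or space this message differently).
ATS_FAILURE = re.compile(
    r"LVM: \d+: Failed to open device (?P<device>naa\.[0-9a-f]+):(?P<partition>\d+) : "
    r"Atomic test and set of disk block returned false for equality"
)


def count_ats_failures(lines):
    """Return a Counter mapping 'device:partition' to the number of matching failures."""
    hits = Counter()
    for line in lines:
        match = ATS_FAILURE.search(line)
        if match:
            hits[f"{match['device']}:{match['partition']}"] += 1
    return hits


if __name__ == "__main__":
    # Illustrative input only; in practice, pass the lines of a saved log file.
    sample = [
        "LVM: 15237: Failed to open device naa.60002ac0000000000000000200020cb8:1 : "
        "Atomic test and set of disk block returned false for equality",
        "some unrelated line",
    ]
    for device, count in count_ats_failures(sample).items():
        print(device, count)
```

Running it against a log saved from each host after a full cluster reboot would show whether it is always the same LUN that fails to open.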
It seems that whenever that error happens, the datastore is dropped and the scratch setting is lost. It's utterly infuriating! Out of desperation I removed the elx iSCSI driver/VIB, which made no difference. I just don't know where to go from here, as pretty much whenever we reboot a whole cluster, some of the hosts will drop their scratch location and require manual intervention to fix. Not acceptable with only a 6 host cluster!
Any tips to try and fix this? It is driving me insane and making me feel useless that I cannot fix the problem. I should add that it does not happen to all hosts at the same time; it's more random. Of course, once the setting has reverted with the error above, it needs manual intervention to fix. Also, as soon as the host has booted, I CAN access the datastore it had a problem with during boot. It's not lost forever, which is even more bizarre.