We have one of several systems that is not behaving as expected. All the systems are configured in the same way, but this one in particular is misbehaving…
We are required to pin workloads into pre-defined groups. We have done this using affinity rules to keep each group of VMs together; one VM from each group is also in an anti-affinity rule, which keeps the groups apart on separate hosts. This works well and is proven on all systems except one.
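For anyone reproducing this, the rule layout can be modelled and sanity-checked offline. The sketch below uses made-up VM names and is only an illustration of the constraint pattern described above (one "anchor" VM per affinity group placed in a single shared anti-affinity rule), not an extract from our environment:

```python
# Illustrative model of the rule layout: each group is kept together by an
# affinity rule, and one anchor VM per group sits in one anti-affinity rule
# that forces the groups onto separate hosts. All names are hypothetical.

affinity_groups = {
    "group-a": {"a1", "a2", "a3"},
    "group-b": {"b1", "b2", "b3"},
    "group-c": {"c1", "c2"},
}

# One VM from each group in the shared anti-affinity rule.
anti_affinity_rule = {"a1", "b1", "c1"}

def check_rule_layout(groups, anti_rule):
    """Return a list of problems with the affinity/anti-affinity layout."""
    problems = []
    # The anti-affinity rule should contain exactly one VM per group;
    # zero anchors leaves a group unconstrained, two make it unsatisfiable.
    for name, members in groups.items():
        anchors = members & anti_rule
        if len(anchors) != 1:
            problems.append(
                f"{name} has {len(anchors)} VMs in the anti-affinity rule (expected 1)"
            )
    # Groups must not share VMs, or DRS receives contradictory constraints.
    all_vms = [vm for members in groups.values() for vm in members]
    if len(all_vms) != len(set(all_vms)):
        problems.append("a VM appears in more than one affinity group")
    return problems

print(check_rule_layout(affinity_groups, anti_affinity_rule))  # → []
```

An empty list means the layout is internally consistent; on the misbehaving system a check like this (fed from the real rule definitions) would at least rule out an unsatisfiable rule set as the cause.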
This system goes into a migration frenzy every 5 minutes (probably triggered by the periodic DRS invocation).
The VMs in each group have their memory requirements “reserved”, but the totals for each group fall well within the host resource capacity. The hosts have 64GB RAM each, and none of the groups has a total VM memory reservation of more than 57GB. What we are seeing on this one particular system is that even a group with a total reservation of 47GB refuses to power on the last VM, claiming there is “not enough memory”. This despite the host summary showing between 50% and 75% memory utilisation, and the final VM only trying to reserve 8GB!
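One thing worth noting when doing this arithmetic: memory admission control counts more than the configured reservations. Each powered-on VM adds a per-VM memory overhead, and the VMkernel itself reserves a slice of host RAM, so the usable capacity of a 64GB host is somewhat less than 64GB. A rough model of the check (the overhead figures below are illustrative assumptions, not measured values from any host):

```python
def can_power_on(host_ram_gb, vmkernel_reserved_gb, reservations_gb, per_vm_overhead_gb):
    """Rough model of memory admission control: the sum of VM reservations
    plus per-VM overhead must fit in what remains after the VMkernel's share."""
    usable = host_ram_gb - vmkernel_reserved_gb
    needed = sum(reservations_gb) + per_vm_overhead_gb * len(reservations_gb)
    return needed <= usable

# Hypothetical group of six VMs whose reservations total 55GB on a 64GB host.
group = [12, 10, 9, 8, 8, 8]

# Modest overheads: 55 + 6*0.5 = 58GB needed vs 62GB usable -> fits.
print(can_power_on(64, 2, group, per_vm_overhead_gb=0.5))
# Larger overheads: 55 + 6*1.5 = 64GB needed vs 58GB usable -> rejected.
print(can_power_on(64, 6, group, per_vm_overhead_gb=1.5))
```

This is only a sketch of the principle; it does not explain why one system behaves differently from its identically configured siblings, but it shows how a “not enough memory” rejection can occur even when the raw reservation totals appear to fit.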
Some of the groups will power on OK, but during one of the migration frenzies a VM from one group ends up on a different host from the rest, causing an affinity rule violation. Every 5 minutes the whole group then moves to another host, and this repeats.
If we set DRS to partially automated, expecting to see what recommendations were being raised, we get nothing. We tried this and waited half an hour: not a peep out of the system. As soon as we set it back to fully automated, the migration frenzies kicked off again.
Comparing this system with the others, we have not been able to identify any significant configuration differences.
My feeling is that something is messed up inside vCenter (VCSA 6.7u2), and I would be tempted to deploy a new one. I would, however, like to understand, if possible, how and why this one got into the state it is in. The reason being: I want to know if there is something we might have started doing that could have caused this.
Has anyone out there experienced similar problems?