I’m currently working on a Horizon deployment in an environment that appears to be hugely overcommitted. CPU utilization sits at 99.99% throughout the working day across all 7 hosts, with very little fluctuation. Memory is not as bad (generally 70-80%). I am confident the environment is overcommitted based on my understanding below, but I want to understand how strict HA is in reserving 20% of the resources for failover (or does it ‘dip into’ those resources based on some tolerance level that depends on workload?)
Here’s the scenario:
Total VMs = 261 (Win 10 1909)
VM configuration = 4 GB RAM, 2 sockets x 2 cores = 4 vCPU.
7x ESXi hosts in an HA cluster with 20% of cluster resources reserved for failover capacity. DRS is enabled.
CPU: Each host has Xeon Gold 6148 @ 2.4 GHz = 2 sockets x 20 cores per socket = 40 cores. It shows in vSphere as 80 logical processors, so presumably this is due to hyperthreading?
Host CPU in GHz: Each host has 2.4 GHz x 40 cores = 96 GHz.
Total Cluster Logical Cores = 7 x 80 = 560
Total Cluster CPU Resources = 7 hosts x 96 GHz = 672 GHz
Total Cluster Memory Resources: Each host has 382 GB RAM = 382 x 7 = 2.674 TB
So total requirements for the 261 VMs:
RAM = 261 x 4 GB = 1.044 TB
CPU = 261 x 4 vCPU = 1044 vCPUs
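To sanity-check the arithmetic above, here is a rough sketch that turns the figures into overcommit ratios. It uses only the numbers from this post (with 382 GB per host, the value that matches the 2.674 TB total). The variable names are my own, and treating hyperthreads as "logical CPUs" separate from physical cores is an assumption about how best to present the ratio.

```python
# Cluster figures taken from the post; names are illustrative only.
HOSTS = 7
CORES_PER_HOST = 2 * 20        # 2 sockets x 20 cores
GHZ_PER_CORE = 2.4
RAM_GB_PER_HOST = 382
VMS = 261
VCPU_PER_VM = 4
RAM_GB_PER_VM = 4

physical_cores = HOSTS * CORES_PER_HOST        # 280 physical cores
logical_cpus = physical_cores * 2              # 560 with hyperthreading
cluster_ghz = physical_cores * GHZ_PER_CORE    # about 672 GHz
cluster_ram_gb = HOSTS * RAM_GB_PER_HOST       # 2674 GB

total_vcpus = VMS * VCPU_PER_VM                # 1044 vCPUs
total_vm_ram_gb = VMS * RAM_GB_PER_VM          # 1044 GB configured

# Ratios of what the VMs are configured with to what the cluster has.
vcpu_per_core = total_vcpus / physical_cores
vcpu_per_logical = total_vcpus / logical_cpus
ram_fraction = total_vm_ram_gb / cluster_ram_gb

print(f"vCPU per physical core: {vcpu_per_core:.2f}:1")
print(f"vCPU per logical CPU:   {vcpu_per_logical:.2f}:1")
print(f"Configured VM RAM as a share of cluster RAM: {ram_fraction:.0%}")
```

On these numbers the CPU side is where the pressure shows up (roughly 3.7 vCPUs per physical core), while configured VM memory is well under half of the cluster total.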
The environment is suffering from slow logons and sluggish performance on desktops. I can see that CPU ready values on the individual hosts are around 300 ms (this is during ‘downtime’, i.e. 6pm on a Friday evening…), but at peak use they are typically 1200 ms or more.
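For context on those ready values, this is a small sketch of the commonly used conversion from a CPU ready summation (ms) to a percentage. It assumes the numbers come from the vSphere real-time performance chart, which samples every 20 seconds; other chart ranges use longer intervals, so `interval_s` would need changing. The function name and the `vcpus` parameter are my own, and I'm not certain how host-level values (summed over everything running on the host) should be normalised, so per-VM figures are probably more meaningful.

```python
def cpu_ready_percent(ready_ms, interval_s=20, vcpus=1):
    """Convert a CPU ready summation in ms to a percentage of the sample interval.

    interval_s: chart sample interval in seconds (20 assumed for real-time charts).
    vcpus: divide by the vCPU count to get an average per-vCPU figure.
    """
    return ready_ms / (interval_s * 1000 * vcpus) * 100


# Values quoted in the post, treated as single real-time samples.
for ready_ms in (300, 1200):
    print(f"{ready_ms} ms ready -> {cpu_ready_percent(ready_ms):.1f}%")

# The same 1200 ms averaged over a 4 vCPU desktop, if the value is per VM.
print(f"1200 ms over 4 vCPUs -> {cpu_ready_percent(1200, vcpus=4):.1f}% per vCPU")
```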
So taking the above figures into account, and NOT including the 20% reserved capacity for HA failover, this environment must be hugely overcommitted? If I take 20% off the total CPU and RAM resources and then compare the VM requirements against what is left, it’s not pretty, right? Will HA be reserving an aggregate of 20% of cluster resources across all 7 hosts to accommodate a failure?
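Here is the "take 20% off" calculation spelled out as a rough sketch. It simply subtracts 20% from the raw cluster totals, which is my assumption about how to model the reservation; whether HA actually accounts for capacity this way is exactly what I'm asking. Variable names are illustrative.

```python
# Raw cluster totals from the post.
HOSTS = 7
PHYSICAL_CORES = HOSTS * 40          # 280
CLUSTER_GHZ = PHYSICAL_CORES * 2.4   # about 672 GHz
CLUSTER_RAM_GB = HOSTS * 382         # 2674 GB
HA_RESERVED = 0.20                   # 20% failover capacity

TOTAL_VCPUS = 261 * 4                # 1044
TOTAL_VM_RAM_GB = 261 * 4            # 1044

# Capacity left if 20% is simply set aside (an assumed, simplified model).
usable_cores = PHYSICAL_CORES * (1 - HA_RESERVED)
usable_ghz = CLUSTER_GHZ * (1 - HA_RESERVED)
usable_ram_gb = CLUSTER_RAM_GB * (1 - HA_RESERVED)

vcpu_per_usable_core = TOTAL_VCPUS / usable_cores
ram_fraction_of_usable = TOTAL_VM_RAM_GB / usable_ram_gb
reserved_host_equivalents = HOSTS * HA_RESERVED

print(f"Usable cores after HA reserve: {usable_cores:.0f}")
print(f"Usable GHz after HA reserve:   {usable_ghz:.1f}")
print(f"Usable RAM after HA reserve:   {usable_ram_gb:.1f} GB")
print(f"vCPU per usable core:          {vcpu_per_usable_core:.2f}:1")
print(f"Configured VM RAM / usable RAM: {ram_fraction_of_usable:.0%}")
print(f"20% of 7 hosts is about {reserved_host_equivalents:.1f} hosts' worth of capacity")
```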
I’m not 100% sure my workings above are correct, so any pointers here would be appreciated. The business is aware of this and is building another host (just 1, I believe), which of course is not sufficient. Is there any other method (esxtop?) that we can use to illustrate how overcommitted this environment is? I also noticed the MTU size is 1500 (which doesn’t help..).
I’m typically at the ‘master image and pool management’ end of Horizon administration, so I am out of my comfort zone here and any assistance would be appreciated.