I just started at a new employer as a DevOps Specialist, and one thing that came up on the very first day is that, as a result of their contracts with customers, their entire production environment runs on VMware at a “big blue” third-party hosting provider. This environment seems to have *MAJOR* reliability issues on a regular basis. From what I’ve learned, there have been over a dozen customer-impacting outages in production already this year, and a very large portion of them can be traced to specific VMware issues.
While I am trying to get information about what the environment looks like, the attitude at the company is “good luck, they don’t tell us anything they aren’t required to by contract”. The most recent outage happened when VMs were placed on a host that we had previously been promised would “no longer be used” to host our services because it “had known hardware issues”. Of course, 3 VMs were on it a few weeks later.
So, with little to no information likely to come from the hosting provider, and having no control over the environment myself, I want to get all the information I can about the host from inside the VMs. I have domain admin or root, as applicable, on all VMs, and can install any tools (official or otherwise) that might help me.
What information can I extract from within the VMs, and how can I do that? Are there any APIs I can read useful info from, like the HTTP-accessible APIs you get in AWS? My top priority would be to identify which host each VM is on, or at least pseudo-identify the hosts (for example, knowing whether a host we were told would not be used is in fact being used, or, when multiple systems are misbehaving, being able to tell that they are all on the same host). Of course, anything else could help as well.
Posted on Reddit by Kell_Naranek