VMware ESXi 6.7 U3 on Cisco UCS C220 M5 server: guest issues with RAM above 256 GB

My environment has six Cisco UCS C220 M5 servers. Hosts 5 and 6 were just deployed, and the only notable difference in their configuration is a UCSC-RAID-M5 RAID controller and 8 SSDs. Each host runs Xeon Gold 6234 processors with 1.5 TB of RAM. I installed ESXi using the Cisco custom ESXi installation ISO from the VMware site.

Deploying a template, or setting up a VM and installing from ISO, on any of the other hosts with any amount of RAM is no issue. When I deploy from template to Host 5 or 6 and give the VM anything above 256 GB of RAM, Windows will boot, but I get I/O errors when I try to save a new IP, join the domain, or perform a number of other general configuration tasks. If I shut down and reduce the RAM to below 256 GB, everything works as expected. These hosts are for a SQL availability group and will need more than 256 GB of RAM.

I have a VMware ticket open, and the engineer found this in the ESXi log:

0.166 Memory Module 1 DDR4_P1_A1_ECC 8.1 0 **error** 0 Memory 2020-09-21T15:09:35 85 00 51 01 39 20 00 a6 08 01 7f 68 0c 01 00 08 00 48 20 20 00 58 00 00 fa 00 00 00 00 00 00 00 fd 00 ff 00 f1 f1 f1 00 00 00 00 00 00 00 00 ce 44 44 52 34 5f 50 31 5f 41 31 5f 45 43 43
0.167 Memory Module 2 DDR4_P1_A2_ECC 8.2 0 error 0 Memory 2020-09-21T15:09:35 86 00 51 01 39 20 00 a7 08 02 7f 68 0c 01 00 08 00 48 20 20 00 58 00 00 fa 00 00 00 00 00 00 00 fd 00 ff 00 f1 f1 f1 00 00 00 00 00 00 00 00 ce 44 44 52 34 5f 50 31 5f 41 32 5f 45 43 43
0.168 Memory Module 3 DDR4_P1_B1_ECC 8.3 0 error 0 Memory 2020-09-21T15:09:35 87 00 51 01 39 20 00 a8 08 03 7f 68 0c 01 00 08 00 48 20 20 00 58 00 00 fa 00 00 00 00 00 00 00 fd 00 ff 00 f1 f1 f1 00 00 00 00 00 00 00 00 ce 44 44 52 34 5f 50 31 5f 42 31 5f 45 43 43

The list goes on and includes errors on all 24 DIMMs.
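
A hedged aside on reading these lines: the hex bytes after the timestamp look like a raw IPMI SDR "full sensor record". That is an assumption, based on the header bytes `51 01 39` and on the trailing bytes being the ASCII sensor name. The Python sketch below decodes a few fields under that assumption, using the first log line above with the emphasis asterisks dropped. The function name `decode_sensor_line` and the field offsets are illustrative, not taken from any VMware or Cisco tool.

```python
import re

# First log line from the post, copied without the ** emphasis markers.
SAMPLE = (
    "0.166 Memory Module 1 DDR4_P1_A1_ECC 8.1 0 error 0 Memory "
    "2020-09-21T15:09:35 85 00 51 01 39 20 00 a6 08 01 7f 68 0c 01 00 08 "
    "00 48 20 20 00 58 00 00 fa 00 00 00 00 00 00 00 fd 00 ff 00 f1 f1 f1 "
    "00 00 00 00 00 00 00 00 ce 44 44 52 34 5f 50 31 5f 41 31 5f 45 43 43"
)

TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")


def decode_sensor_line(line):
    """Decode the hex tail of one log line.

    Assumes the tail is an IPMI SDR full sensor record (record type 0x01);
    the byte offsets below follow that layout and are not verified against
    this specific platform.
    """
    match = TIMESTAMP.search(line)
    if match is None:
        raise ValueError("no timestamp found in line")
    # Everything after the timestamp is space-separated hex bytes.
    raw = bytes.fromhex(line[match.end():])
    # Byte 47 is the ID string type/length code; low 5 bits give the length.
    name_len = raw[47] & 0x1F
    return {
        "record_id": raw[0] | (raw[1] << 8),      # little-endian record ID
        "record_type": raw[3],                    # 0x01 = full sensor record
        "sensor_number": raw[7],
        "entity": f"{raw[8]}.{raw[9] & 0x7F}",    # entity ID . instance
        "sensor_type": raw[12],                   # 0x0C = Memory in IPMI
        "name": raw[48:48 + name_len].decode("ascii"),
    }


if __name__ == "__main__":
    for key, value in decode_sensor_line(SAMPLE).items():
        print(f"{key}: {value}")
```

If the assumption holds, the decoded name and entity match the text columns of the same line (`DDR4_P1_A1_ECC`, `8.1`). That would suggest the hex is the sensor's definition record rather than extra detail about a specific fault, so the `error` column is the part worth chasing with Cisco.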

Cisco is reviewing logs but not finding anything on the hardware side. I have been testing memory and validating the configuration, but we are not getting anywhere. After the engineer found this, he escalated to their VM Management team, and I haven’t heard back in over 48 hours. The project timeline is completely blown, and I’m at a loss as to what else to look at.

I had no issues creating virtual disks in the RAID controller or adding datastores from either host. Other than the RAM issue on the guest, everything else seems fine.

Any help is appreciated.


Posted on Reddit by TS_Quint

2 Comments

  1. Have you set your BIOS options to the recommended settings for virtualization? Cisco has a good document to roll through:

    https://www.cisco.com/c/dam/en/us/products/collateral/servers-unified-computing/ucs-b-series-blade-servers/whitepaper_c11-740098.pdf

    A firmware bug I ran into a long time ago in the 3.x code base was flagging the wrong DIMMs as failed. That was a fun one. Verify that the firmware you’re running is at the recommended level for your version (3.2.x or 4.1.x).

    I’ve run UCS blades (B200 M5) with 1.5 TB of RAM and massive MS SQL VMs without issue, so it’s certainly possible. If you don’t get any traction with your case, escalate to the duty manager of either company and make them get on a conference call together and figure this shit out.

  2. Run a memory test, preferably from the Cisco management console.

    If not, just grab the latest memtest UEFI ISO and let it rip.

    My bet is that some DIMMs are bad.
