Random LINT1/NMI PSOD on Gaming VM with GTX 980

Hi, for several weeks now I have had the following issue: while I am playing a game, ESXi 6.7.0 Update 3 (Build 14320388) randomly crashes with a PSOD and this error message:

>LINT1/NMI (motherboard nonmaskable interrupt), undiagnosed. This may be a hardware problem;

My system is:

* Dell R720 (Got super lucky with that one and I am super happy to have it)
* 2x Xeon E5-2650 v2
* 128GB RAM
* PERC H710P Mini with 5x 900GB SAS HDD (RAID5)
* 512GB NVMe (attached to CPU2)
* 64GB SATA SSD (for testing out vFlash)
* GTX 980 passthrough (tried on different slots, CPU1 and CPU2)
* According to the iDRAC internal web panel, all firmware is up to date

The thing is, it’s super random. Right now I am playing XCOM 2 and Borderlands 3, and both have triggered the crash.

iDRAC reports the following after such a crash:

>A bus fatal error was detected on a component at bus 64 device 2 function 0.
>
>A bus fatal error was detected on a component at slot 4.
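
To line that iDRAC entry up with the hex `segment:bus:device.function` form that tools such as `lspci` print, here is a minimal Python sketch. It assumes iDRAC reports bus, device and function in decimal and that the PCI segment is 0; both are assumptions, and `idrac_to_bdf` is a made-up helper name for illustration only.

```python
def idrac_to_bdf(bus: int, device: int, function: int, segment: int = 0) -> str:
    """Format decimal bus/device/function numbers as a hex PCI address.

    Assumption: the numbers in the iDRAC log are decimal and the
    PCI segment (domain) is 0.
    """
    return f"{segment:04x}:{bus:02x}:{device:02x}.{function:x}"


if __name__ == "__main__":
    # Values taken from the iDRAC message quoted above.
    print(idrac_to_bdf(64, 2, 0))
```

If the assumption holds, the printed address is the one to look for in the host's PCI device list when checking which component sits at that location.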

I have literally zero clue why this is happening. The most annoying part is that it also takes down all the other VMs, where I am testing things out to teach myself about virtualization.

So far I have really liked ESXi, but this is a real issue, as I have no tower to put the GTX into to play games 🙁


Gaming VM:

* Windows 10
* 8 vCPU (initially 16)
* 24GB RAM (16GB did not help)
* 100 GB NVMe VHD for OS
* 200 GB NVMe VHD for Games
* 200 GB PERC VHD for Games
* GTX 980 with passthrough
* GTX 980 HiDef Audio passthrough (currently not added, as I suspected it was causing the issues)


I tried the following, to no avail:

* Disabling C1E (read about it somewhere)
* Low latency with frequency reservation (assumed it might be a timing issue)
* hypervisor.cpuid.v0 = FALSE (otherwise the driver install is not possible)
* pciPassthru.64bitMMIOSizeGB = 64 (also tried 4 and 24, as I was not sure whether it refers to RAM, VRAM, or both)
* pciPassthru.use64bitMMIO = TRUE
* pciHole.start = 2048
* pciHole.end = 6144 (also tried 8196, as I read somewhere that this increases stability, though other places say ESXi 6.5+ does this automatically? Additionally, I am not sure whether the hole should be exactly 4GB or more)
* pciPassthru0.msiEnabled = FALSE
* Moving the GPU to a slot on the other CPU (this only changes the slot and bus numbers in the iDRAC log)
* Simplifying the VM (not using NVMe, and with fewer connected VHDs)
* Running memtest
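
For reference, the advanced settings from the list above can be written out as `.vmx`-style lines with a small Python sketch. The values are exactly the ones listed; the helper names `to_vmx_lines` and `pci_hole_size_mb` are made up for illustration, and the assumption that `pciHole.start` and `pciHole.end` are given in MB is mine.

```python
# Advanced settings exactly as listed above (values kept as strings).
TRIED_SETTINGS = {
    "hypervisor.cpuid.v0": "FALSE",
    "pciPassthru.use64bitMMIO": "TRUE",
    "pciPassthru.64bitMMIOSizeGB": "64",
    "pciHole.start": "2048",
    "pciHole.end": "6144",
    "pciPassthru0.msiEnabled": "FALSE",
}


def to_vmx_lines(settings: dict[str, str]) -> list[str]:
    """Render settings in the key = "value" form used in .vmx files."""
    return [f'{key} = "{value}"' for key, value in settings.items()]


def pci_hole_size_mb(settings: dict[str, str]) -> int:
    """Size of the PCI hole, assuming start and end are both in MB."""
    return int(settings["pciHole.end"]) - int(settings["pciHole.start"])


if __name__ == "__main__":
    for line in to_vmx_lines(TRIED_SETTINGS):
        print(line)
    print("PCI hole size (MB):", pci_hole_size_mb(TRIED_SETTINGS))
```

Under the MB assumption, 2048 to 6144 gives a 4096 MB (4 GB) hole, which is the "exactly 4GB" case mentioned in the list.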


What is really weird is that running **Folding@home GPU folding** causes **no issue at all**. It has never crashed while FAH is running, and I have it running nearly all the time (get that covid).

The best option would be to get a desktop and run the GTX bare-metal, but due to unforeseen circumstances I can’t afford one, so I am desperately trying to get this to work. The GTX itself was a present from a friend so that I could play some games again, but unfortunately these PSODs keep happening.

I am not sure if this is the right sub, but r/esxi does not have a lot of users and r/VFIO is unfortunately the wrong one, as they deal with Proxmox, KVM, and similar.

Any clues what I could try? I can’t find anything useful on the internet anymore.

