Huge VM stun on snapshot merge?

I’ve recently noticed this on some machines that run a web server and a MySQL database.
It happens after Veeam finishes backing up and has to merge the snapshot file. During the merge, which can take anywhere from 30 seconds to 10 minutes, some of my machines stop responding to ping, stop serving requests, and don’t even respond on a direct console connection.

My datastores are NFS3; could this have anything to do with it?

I also get the impression that machines with more disks are more affected, or maybe it’s just the machines with higher I/O.
Any clue on what could be happening? Any ideas on how to mitigate this effect?

***EDIT: Initially stated VMFS3 instead of NFS3
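
To put numbers on the stalls described above, one option is to log a timestamped reachable/unreachable result for each ping probe during the backup window and then collapse the failures into outage spans. The sketch below does only the collapsing step; the function name `outage_windows` and the sample data are made up for illustration, and how the probes are collected is left open.

```python
from typing import List, Tuple


def outage_windows(samples: List[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Collapse probe results into (start, end) spans of unreachability.

    samples: (timestamp_in_seconds, reachable) pairs, sorted by time.
    A span starts at the first failed probe and ends at the next
    successful probe (or at the last sample if the VM never came back).
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                    # first failed probe of a new span
        elif ok and start is not None:
            windows.append((start, ts))   # VM answered again, close the span
            start = None
    if start is not None:
        # Still unreachable when the log ends.
        windows.append((start, samples[-1][0]))
    return windows


# Illustrative data only: one probe per second, three misses in a row.
samples = [(0, True), (1, True), (2, False), (3, False), (4, False), (5, True)]
for start, end in outage_windows(samples):
    print(f"unreachable for about {end - start:.0f}s starting at t={start}")
```

Comparing these spans against the time the snapshot removal task runs would show whether the unresponsiveness really lines up with the merge.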

  1. High I/O and lots of changes between the snapshot and the merge can cause this. You can help by giving the DB disks dedicated PVSCSI controllers. I’m not sure whether migrating your VMFS version would help as well, but I can’t imagine it would be anything but good. Your best bet would be to set the DB disks to independent persistent and use native Veeam MySQL backups.

  2. Make sure you’re on the latest version of Veeam and ESXi (check compatibility with your apps first).

    As mentioned below, migrating the machines to the latest datastore may help.

    I’ve seen setting the job to incremental and changing the timings around help in the short term.

    Beyond that, it usually comes down to disk I/O, as mentioned previously, unless there is some misconfiguration in VMware/Veeam. If you’re sitting on an old SAN, you’re going to hit issues with modern workloads. There may be some specific storage tweaks you can make, depending on the platform.

  3. I think this is expected. There are a lot of factors that could be involved: server utilization at the time of cleanup, the disk array, etc.

    The only way I’ve been able to get around this is to schedule backups at a less busy time so they don’t impact the applications.
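
For picking that less busy time, a rough sketch: given 24 hourly average IOPS figures for a VM or datastore (however you collect them), find the contiguous window with the lowest total load. The function name `quietest_start_hour` and the numbers are hypothetical and only show the idea; this is not a Veeam or VMware feature.

```python
from typing import List


def quietest_start_hour(avg_iops_by_hour: List[float], window_hours: int) -> int:
    """Return the start hour (0-23) of the quietest contiguous window.

    avg_iops_by_hour: 24 values, index 0 = 00:00-01:00, and so on.
    window_hours: how long the backup plus snapshot merge is expected to run.
    Windows are allowed to wrap past midnight.
    """
    if len(avg_iops_by_hour) != 24:
        raise ValueError("expected exactly 24 hourly values")
    if not 1 <= window_hours <= 24:
        raise ValueError("window_hours must be between 1 and 24")

    best_hour = 0
    best_load = float("inf")
    for start in range(24):
        # Total load over the window, wrapping around midnight with modulo.
        load = sum(avg_iops_by_hour[(start + i) % 24] for i in range(window_hours))
        if load < best_load:
            best_hour, best_load = start, load
    return best_hour


# Illustrative numbers only: busy all day, quiet between 02:00 and 04:00.
iops = [800.0] * 24
iops[2] = 120.0
iops[3] = 90.0
print(f"start the job at {quietest_start_hour(iops, 2):02d}:00")
```

The window length matters as much as the start time: the merge happens at the end of the job, so it is the tail of the window that needs to be quiet.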

