vSphere Replication – Lessons learned

I got the opportunity to work with vSphere Replication in a real production environment for the first time.
It was not a smooth experience, and I hope the lessons learned will help other VMware users.

Quiescing, backup, vSAN = ESXi host disconnected

Environment: vSphere Replication 6.1.2, ESXi 6.0 U3, and an RPO of 15 minutes.
We regularly saw “random” ESXi host disconnections from vCenter.
The virtual machines were still running though.
vSAN was reporting “unexpected Virtual SAN cluster members”.
Restarting the management agents was not enough to regain access to the host.

After a long investigation, we identified the following points:
-It only affects ESXi servers that host a source VM in vSphere Replication.
-It only happens with Windows VMs replicated with “Guest OS Quiescing” enabled in the replication options.
-A backup at the VM level (network-based backup) seems to be needed to trigger the bug.

How to regain access to a disconnected host:
-Identify the VMs on the disconnected host that are protected by vSphere Replication.
-Connect to the guest OS and shut down the VMs.
-The ESXi host will then reconnect to vCenter.

Permanent fix:
Do not use “quiescing” for Windows VMs if there is also a backup at the VM level.

VM backups affect vSphere Replication jobs

With vSphere Replication quiescing:
The host may be disconnected under some conditions (see the previous topic).
Errors such as “Disk consolidation needed” may be triggered.
See the resolution section in KB 2040754.

Even without vSphere Replication quiescing:
If the backup starts during a replication job, the job is cancelled and the RPO will not be met.
Below is the sequence of tasks when using vSphere Replication and TSM with network backup:
Sync started by VR Scheduler
Task: Backup virtual machine (scheduled)
Task: Create virtual machine snapshot
Sync aborted: Filter failed
Task: Rename snapshot
Task: Remove snapshot
Virtual machine disks consolidation succeeded.
Virtual machine vSphere Replication RPO is violated
Sync started by VR Scheduler
Sync completed

Unfortunately, it is impossible to control the schedule of vSphere Replication jobs, so there is a risk of missing the RPO each time there is a backup.
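To spot this pattern in a longer event history, the sequence above can be scanned for syncs that were aborted after a snapshot was created mid-sync. This is only a minimal sketch: it assumes the events have been exported as an ordered list of plain strings, and matching on the message prefixes below is my assumption, not something vSphere Replication provides.

```python
from typing import List

# Message prefixes taken from the sequence above. Matching on these
# prefixes is an assumption about how an exported event list looks.
SYNC_START = "Sync started"
SYNC_DONE = "Sync completed"
SYNC_ABORT = "Sync aborted"
SNAPSHOT = "Task: Create virtual machine snapshot"


def syncs_aborted_by_snapshot(events: List[str]) -> int:
    """Count syncs that were aborted after a snapshot was created mid-sync."""
    in_sync = False
    snapshot_during_sync = False
    aborted = 0
    for event in events:
        if event.startswith(SYNC_START):
            in_sync, snapshot_during_sync = True, False
        elif event.startswith(SNAPSHOT) and in_sync:
            snapshot_during_sync = True
        elif event.startswith(SYNC_ABORT):
            if in_sync and snapshot_during_sync:
                aborted += 1
            in_sync = False
        elif event.startswith(SYNC_DONE):
            in_sync = False
    return aborted


events = [
    "Sync started by VR Scheduler",
    "Task: Backup virtual machine (scheduled)",
    "Task: Create virtual machine snapshot",
    "Sync aborted: Filter failed",
    "Task: Rename snapshot",
    "Task: Remove snapshot",
    "Virtual machine disks consolidation succeeded.",
    "Virtual machine vSphere Replication RPO is violated",
    "Sync started by VR Scheduler",
    "Sync completed",
]
print(syncs_aborted_by_snapshot(events))
```

A count above zero only suggests that backups are colliding with replication; it does not prove that the snapshot caused the abort.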

RPO violations are difficult to troubleshoot

If you end up with RPO violations, it will be very difficult to identify their origin.
VMware support may redirect you to “vSphere Replication RPO Violations”.

However, it is not really helpful.
As a user, I should not have to dig through the logs to find out why the RPOs are not met.
The RPO violation “events” should be more precise and tell me, for example, that:
-There were too many block changes in the VM
-There is a bandwidth constraint
-etc.
The only information available in the events is the “bytes transferred” for the “sync completed” event, but it doesn’t mention, for example, the number of blocks changed.

VMware support may also recommend isolating the replication traffic or adding a new appliance. This could help in some cases, but it is better to identify the bottleneck first before making any changes.

Moreover, there is no information on how vSphere Replication schedules all the jobs.

Recommendation for VMware:
Please provide more visibility to the users by showing exactly what vSphere replication is doing and why.

RPO violations for large VMs

Based on my observations, RPO violations for large VMs may happen even with low activity on the VM.
Small VMs don’t have RPO issues.
There are RPO violations for VMs with more than 4 TB assigned, even with low disk usage and low activity.
For example, vSphere Replication takes a few seconds to replicate 10 megabytes in one job and sometimes 40 minutes for the same amount in another.

To rule out block changes completely, I tested with one new 4 TB VM without any OS installed.
That really means 0 block changes.
At the beginning, all replication jobs took less than 2 seconds with 0 bytes transferred.
But later on, this increased to 5 minutes for 0 bytes transferred.
And then to more than 15 minutes, still for 0 bytes, triggering an RPO violation.
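The reasoning here is simple arithmetic: a sync that runs longer than the RPO cannot meet it, whatever the number of bytes transferred. Here is a minimal sketch of that check, assuming you have collected the sync durations yourself (for example by hand from the events). The function name is hypothetical, and the 960 seconds is an illustrative placeholder for "more than 15 minutes", not a measured value. It is also a simplified model, not how vSphere Replication itself evaluates the RPO.

```python
from typing import List


def syncs_longer_than_rpo(sync_durations_s: List[float], rpo_minutes: float) -> List[int]:
    """Return the indexes of syncs whose duration alone exceeds the RPO."""
    limit_s = rpo_minutes * 60
    return [i for i, d in enumerate(sync_durations_s) if d > limit_s]


# 2 s and 5 min (300 s) come from the test above; 960 s is an
# illustrative placeholder for "more than 15 minutes".
durations_s = [2, 300, 960]
print(syncs_longer_than_rpo(durations_s, rpo_minutes=15))
```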

And as discussed earlier, it is very difficult to troubleshoot.
The same problem occurs with one VM with one very large disk, and also with one VM with 10 empty disks of 1 TB.

So the problem does not seem to be specific to VMs with disks larger than 2 TB.

From the vSphere Replication 6.1.2 release notes:
vSphere Replication tracks larger blocks on disks over 2TB. Replication performance on a disk over 2TB might be different than replication performance on a disk under 2TB for the same workload depending on how much of the disk goes over the network for a particular set of changed blocks.

Finally, increasing the RPO from 15 to 30 minutes didn’t help. There were still random RPO violations.
The problem doesn’t seem to be linked to the amount of change in the first place, so increasing the RPO was not really expected to help.

Maintenance tasks for a protected VM are complex

From the vSphere Replication FAQ:

Reverting to a virtual machine snapshot that was created after vSphere
Replication was configured, at the source location, typically causes vSphere
Replication to perform a full sync for a virtual machine

Virtual machines protected by VMware vSphere High Availability (vSphere HA)
can be replicated. However, when a replicated virtual machine is recovered by
vSphere HA, vSphere Replication might require a full sync.

The two points above could affect the RPO for very large VMs because a full sync may take many hours.
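To get a feel for "many hours", here is a rough back-of-the-envelope estimate. It is a sketch under loud assumptions: it treats the full sync as if all the data had to cross the link at a constant sustained rate, so it is an upper bound rather than a prediction of what vSphere Replication will actually send. The function name and the throughput figure are mine, for illustration only.

```python
def full_sync_hours(size_tb: float, throughput_mbit_s: float) -> float:
    """Estimate the hours needed to move size_tb terabytes at a sustained rate.

    Assumes decimal units (1 TB = 1e12 bytes) and that all the data
    crosses the link, which is a worst-case simplification.
    """
    size_bits = size_tb * 1e12 * 8
    seconds = size_bits / (throughput_mbit_s * 1e6)
    return seconds / 3600


# Hypothetical example: a 4 TB VM over a link sustaining 1000 Mbit/s.
print(round(full_sync_hours(4, 1000), 1))  # 8.9
```

Even on a dedicated 1 Gbit/s link, this worst case is far beyond a 15 minute RPO.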

vSphere Replication can replicate virtual machines with snapshots. Snapshots at the source are not reproduced at the target location.
This is a major limitation for people who rely on snapshots, for example with Citrix Machine Creation Services, and who want to replicate the image AND the snapshots to a remote site.

NO API !!
It is difficult to deliver an SDDC if manual tasks are needed to start replication.

It is not possible to change the size of a virtual machine’s disks while it is being replicated.
There is a procedure but it is not straightforward.

Replication traffic is going through the vSphere replication appliance

I always thought that the vSphere Replication traffic was going directly from the source ESXi servers to the destination ESXi servers. I was wrong.
The diagram in the documentation is misleading, and I didn’t read the full documentation.

The diagrams in this post are clearer.

Conclusion

Do not use quiescing in vSphere Replication for Windows VMs if the VMs are also backed up at the VM level.
Be aware of the limitations of vSphere Replication.
If you need DR for large VMs, test vSphere Replication first to ensure that it can meet the RPO requirements.
You may have to consider SAN to SAN replication (so no vSAN) or another technology.
