VCP 5 - Objective 6.4 – Perform Basic Troubleshooting for HA/DRS Clusters and vMotion/Storage vMotion

Identify HA/DRS and vMotion requirements

HA Requirements

All hosts must be licensed for HA (Essentials Plus, Standard, Enterprise, and Enterprise Plus). Any 3.5 hosts must have a patch applied to account for file locks.
Need at least 2 hosts in the cluster.
Hosts should be configured with a static IP address. If using DHCP, be sure to use reservations so the host's IP will not change after a reboot.
Need at least 1 management network that is common across all the hosts. Best practices call for at least 2 management networks. For ESX hosts this will be a service console, for ESXi hosts earlier than version 4.0 this will be a vmkernel interface, and for ESXi hosts 4.0 and above this will be a vmkernel network with the Management Network enabled.
Hosts must have access to all the same VM Networks in the cluster.
VMs must reside on Shared Storage.
In order to enable VM Monitoring, you must have VMware Tools installed.
Host Certificate Checking should be enabled.
Should not mix IPv4 and IPv6 clusters as you will be more likely to have a network partition.

DRS Requirements

All hosts in a DRS cluster must have access to shared storage.
Processors must be from the same vendor and the same processor family (e.g. Intel Xeon). If the processors are not an exact match you can use EVC to mask features from the processors and provide a common baseline across the cluster. EVC was explained earlier in this guide. You can also use a CPU mask to hide certain features of the CPU from the VM. This is applied at the VM level whereas EVC is applied at the cluster level.
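To make the EVC idea a bit more concrete, here is a minimal Python sketch that treats each host's CPU as a set of feature flags and takes the intersection as the common baseline. The host names and flags are made up for illustration, and real EVC picks from predefined baselines per processor generation rather than computing an arbitrary intersection, so treat this as a mental model only.

```python
# Hypothetical feature flags per host; not read from real hardware.
host_features = {
    "esx01": {"sse3", "ssse3", "sse4.1", "sse4.2", "aes"},
    "esx02": {"sse3", "ssse3", "sse4.1", "sse4.2"},
    "esx03": {"sse3", "ssse3", "sse4.1"},
}


def common_baseline(features_by_host):
    """Return the feature flags that every host in the cluster supports."""
    feature_sets = list(features_by_host.values())
    if not feature_sets:
        return set()
    return set.intersection(*feature_sets)


def masked_features(features_by_host):
    """Return, per host, the features that would be hidden from its VMs."""
    baseline = common_baseline(features_by_host)
    return {host: feats - baseline for host, feats in features_by_host.items()}


baseline = common_baseline(host_features)
hidden = masked_features(host_features)
print("baseline:", sorted(baseline))
for host, feats in hidden.items():
    print(host, "hides:", sorted(feats))
```

The newest host gives up the most features, which is the trade-off you accept in exchange for being able to vMotion across the whole cluster.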

vMotion Requirements

Each host must be correctly licensed: Essentials Plus and up for vMotion, and Enterprise and up for Storage vMotion.
Each host must meet the shared storage requirements for vMotion
- Datastores must be available to all the hosts participating within the migration
Each host must meet the networking requirements for vMotion
- Hosts must have a vmkernel port that has been assigned to vMotion. This network must reside on the same subnet on both hosts. It must also be named identically. Also, the networks that the VMs are attached to must also reside on both hosts and be named the same.

Cannot vMotion VMs that are using RDMs for clustering purposes
Cannot vMotion a VM that is backed by a device that isn't accessible to the target host, e.g. a CD-ROM connected to local storage on a host. You must disconnect these devices first. USB is supported as long as the device is enabled for vMotion.
Cannot vMotion a VM that is connected or backed by a device on the client. You must also disconnect these first.
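The requirements above boil down to a handful of set comparisons between the VM and the target host. Here is a rough Python sketch of that check. The `Host` and `VM` classes and the `vmotion_blockers` function are hypothetical names of my own, not part of any VMware SDK, and the check ignores licensing and CPU compatibility.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    datastores: set = field(default_factory=set)
    networks: set = field(default_factory=set)  # VM network names on this host
    vmotion_enabled: bool = False                # vmkernel port assigned to vMotion


@dataclass
class VM:
    name: str
    datastores: set = field(default_factory=set)
    networks: set = field(default_factory=set)
    local_devices: list = field(default_factory=list)  # e.g. host or client CD-ROM


def vmotion_blockers(vm, target):
    """Return a list of reasons the VM cannot be migrated to the target host."""
    reasons = []
    if not target.vmotion_enabled:
        reasons.append("target has no vmkernel port assigned to vMotion")
    missing_ds = vm.datastores - target.datastores
    if missing_ds:
        reasons.append(f"datastores not available on target: {sorted(missing_ds)}")
    missing_net = vm.networks - target.networks
    if missing_net:
        reasons.append(f"networks not present on target: {sorted(missing_net)}")
    if vm.local_devices:
        reasons.append(f"disconnect these devices first: {vm.local_devices}")
    return reasons


vm = VM("web01", {"ds1"}, {"VM Network"})
good_host = Host("esx02", {"ds1", "ds2"}, {"VM Network"}, vmotion_enabled=True)
bad_host = Host("esx03", {"ds2"}, {"VM Network"}, vmotion_enabled=False)

print(vmotion_blockers(vm, good_host))
print(vmotion_blockers(vm, bad_host))
```

Because networks are matched by name, a port group called "VM Network" on one host and "VM network" on another would show up as missing, which mirrors the identical-naming requirement.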

Storage vMotion Requirements and Limitations

VM disks must be in persistent mode or be RDMs. For RDMs in virtual compatibility mode you can migrate the mapping file or convert to thick or thin provisioned disks so long as the destination is not an NFS datastore. RDMs in Physical Compatibility mode support the migration of the mapping file only.
You cannot migrate VMs during a VMware tools install.
The host that the storage vMotion is running on must be licensed for Storage vMotion.
ESX(i) 3.x hosts must be configured for vMotion. ESX(i) 4 + do not require vMotion to perform a Storage vMotion.
Obviously the host needs access to both the source and target datastores.

Verify vMotion/Storage vMotion configuration

The easiest way I have found to verify the vMotion compatibility of a given VM is to select it in the Inventory and click the Maps tab. This will show you the vMotion Map and display the following information

The networks your VM is attached to and which hosts are in turn attached to that network
The datastores your VM resides on and which hosts in turn have access to those datastores.
The current CPU usage of the hosts.
Hosts marked with a red X are not suitable and violate one of the above requirements.
Hosts enclosed with a green circle are compatible for the vMotion, but still do not guarantee that it will complete.

For Storage vMotion, just ensure that you have met the requirements above. There really aren't a lot of requirements for Storage vMotion.

Verify HA network configuration

Refer to section 5 regarding setting up HA as all the networking configuration is in there.

Verify HA/DRS cluster configuration

I've already spoken about how to set up HA and DRS clusters in section 5 of this study guide, and I would refer to it for this section as well. One thing to note is the HA section on the Summary tab of a cluster. Here you can see your admission control settings, current and configured failover capacity, as well as the status of Host, VM, and Application monitoring. The Runtime Info link gives you your slot information (slot sizes, slots available and in total, as well as counts of good and bad hosts). The Cluster Status link will show you who the master host is, the number of slaves connected to it, as well as which datastores are used for datastore heartbeating and a count of protected VMs. The Configuration Issues link will show you a list of all issues that have been detected on the cluster.

The DRS section will show you your selected automation level, DPM status, DRS recommendations and faults, your configured migration threshold, and your target and current host load standard deviation. The resource distribution chart will show you a stacked graph of VM memory and CPU usage statistics across the hosts in the cluster.

Troubleshoot HA capacity issues/Troubleshoot HA redundancy issues

I'm going to combine these two sections and outline all of the scenarios in the vSphere Troubleshooting guide.

Selecting Host Failures Cluster Tolerates causes the cluster to turn red (invalid).
- Could be caused by having hosts in the cluster that are disconnected, in maintenance mode, not responding, or have an HA configuration error.
- Could also be caused if you have one or more VMs with a CPU or memory reservation much larger than the others. Since this admission control policy uses slot sizes, and slot sizes take reservations into account when calculated, this may skew the size of the slot.
- Solution is to simply check that all hosts are healthy and connected. This policy only includes resources from those hosts which are connected and healthy.
"Not Enough Failover Resources Fault" when trying to power on a VM (using host failures cluster tolerates policy).
- Again, could be caused by a disconnected host.
- Same, could be caused by a VM with an abnormally large CPU/Memory reservation.
- Problem could occur if there are no free slots in the cluster, or if powering on the VM causes the slot size to increase (if it has a large reservation). In this case you could use HA advanced settings to lower the slot size, modify the reservations, or lower the number of host failures that your cluster will tolerate.
- You could also consider using a different policy such as % of cluster resources.
vCenter is not choosing the heartbeat datastore that you specified
- The specified number of datastores to use is more than required. vCenter will only choose the optimal number of datastores to use from the list and ignore the rest.
- A datastore might not be chosen if it's only available to a limited number of hosts in the datacenter. It also may not be chosen if it lives on the same LUN or NFS server as a datastore that has already been chosen.
- Won't be chosen if the datastore is down or experiencing connectivity issues.
- If there is currently a network partition or a host is isolated, it will continue to use the datastores specified at the time it was isolated even if the user preferences have changed.
Operation fails when trying to remove a datastore that is used for heartbeating.
- If a datastore is unmounted and that datastore was chosen to be used for HA, then normally another datastore is chosen as its replacement. The HA agent will then close all the open handles it has to the datastore being removed. However, if there is currently a network partition or the host is isolated, the HA agent doesn't unlock these files, thus causing an error (HA agent failed to quiesce file activity on the datastore).
VM appears as unprotected even though it has been powered on for several minutes.
- Can be caused if a master host has not been elected and/or vCenter is unable to communicate with a master host. Should show an HA warning/error of Agent unreachable or Agent uninitialized.
- Multiple master hosts exist and the one that vCenter sees is not responsible for that VM. It's likely that vCenter will be reporting a network partition, as this is normally what causes multiple masters to be elected.
- Agent cannot access the datastore where the VMs configuration file is located. Normally occurs during an all paths down condition in the cluster.
Virtual Machine restart fails
- Caused if the VM was not protected at the time of the failure.
- Insufficient resources on the hosts that the VM is compatible with.
- HA attempted to restart the VM but encountered a fatal error every time it tried.
- Could also be a false positive and the VM could actually be running.
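Since so many of the capacity scenarios above come back to slot sizes, here is a small Python sketch of how the Host Failures Cluster Tolerates math works out and how one big reservation skews it. The numbers and dictionary keys are made up. I'm assuming a 32 MHz CPU default for VMs without a reservation (check the Availability guide for your version), and the 1 MB memory default is only there to guard against dividing by zero.

```python
def slot_size(vms, default_cpu_mhz=32, default_mem_mb=1):
    """Slot = largest CPU reservation and largest (memory reservation + overhead)."""
    cpu = max([vm["cpu_res_mhz"] for vm in vms] + [default_cpu_mhz])
    mem = max([vm["mem_res_mb"] + vm["mem_overhead_mb"] for vm in vms] + [default_mem_mb])
    return cpu, mem


def slots_per_host(host, cpu_slot, mem_slot):
    """A host holds as many slots as its scarcer resource allows."""
    return min(host["cpu_mhz"] // cpu_slot, host["mem_mb"] // mem_slot)


def failover_capacity(hosts, vms):
    """Count how many hosts can fail (largest first) while every VM keeps a slot."""
    cpu_slot, mem_slot = slot_size(vms)
    slots = sorted(slots_per_host(h, cpu_slot, mem_slot) for h in hosts)
    tolerated = 0
    while len(slots) > 1:
        slots.pop()  # worst case: the host with the most slots fails
        if sum(slots) < len(vms):
            break
        tolerated += 1
    return tolerated


hosts = [{"cpu_mhz": 10000, "mem_mb": 32768} for _ in range(3)]
small_vms = [{"cpu_res_mhz": 0, "mem_res_mb": 0, "mem_overhead_mb": 100} for _ in range(10)]
big_vm = {"cpu_res_mhz": 2000, "mem_res_mb": 16384, "mem_overhead_mb": 200}

print(slot_size(small_vms), failover_capacity(hosts, small_vms))
print(slot_size(small_vms + [big_vm]), failover_capacity(hosts, small_vms + [big_vm]))
```

With only the small VMs, each host holds hundreds of slots. Add the one VM with the large reservation and each host drops to a single slot, which is exactly the kind of situation that turns the cluster red or throws the failover resources fault.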

Interpret the DRS Resource Distribution Graph and Target/Current Host Load Deviation

I spoke about the resource distribution graph above.

As for deviation loads…

Target deviation is calculated based on your DRS settings (the migration threshold). The current utilization of VMs and hosts is then used to calculate your current deviation. If your current deviation exceeds your target deviation, the cluster is labeled as imbalanced. DRS runs every five minutes and attempts to move workloads around if you are imbalanced.
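As a rough illustration of the comparison, the Python sketch below computes a load per host and takes the standard deviation across hosts. The exact formula DRS uses isn't covered in these notes, so the choice of population standard deviation, the idea of load as VM entitlements divided by host capacity, and the target value passed in are all assumptions for the sake of the example.

```python
from statistics import pstdev


def host_load(vm_entitlements_mhz, host_capacity_mhz):
    """Load of one host: what its VMs are entitled to, relative to its capacity."""
    return sum(vm_entitlements_mhz) / host_capacity_mhz


def current_deviation(hosts):
    """Standard deviation of the per-host loads across the cluster."""
    loads = [host_load(h["vm_entitlements"], h["capacity"]) for h in hosts]
    return pstdev(loads)


def is_imbalanced(hosts, target_deviation):
    """The cluster is imbalanced when current deviation exceeds the target."""
    return current_deviation(hosts) > target_deviation


# Made-up clusters: one lopsided, one evenly loaded.
lopsided = [
    {"capacity": 10000, "vm_entitlements": [4000, 4000]},
    {"capacity": 10000, "vm_entitlements": [2000]},
]
even = [
    {"capacity": 10000, "vm_entitlements": [3000, 2000]},
    {"capacity": 10000, "vm_entitlements": [5000]},
]

print(round(current_deviation(lopsided), 3), is_imbalanced(lopsided, 0.2))
print(round(current_deviation(even), 3), is_imbalanced(even, 0.2))
```

A more aggressive migration threshold corresponds to a smaller target value here, so the same cluster gets flagged as imbalanced sooner.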

Troubleshoot DRS load imbalance issues

DRS clusters become overcommitted (yellow) when the cluster no longer has the resources to satisfy every VM within it. Suddenly losing a host can temporarily cause a cluster to turn yellow as it immediately loses a good chunk of resources.

A DRS cluster will turn red (invalid) when the resource pool tree below it becomes invalid. This can happen if you are reconfiguring a resource pool while a VM is failing over. The solution is simply to power off some VMs in order to get back to a consistent state in the resource pools within your cluster. The cluster can also become invalid if the reservations of the VMs are greater than that of their parent resource pool.
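The last case, child reservations adding up to more than the parent has, is easy to picture as a walk down the resource pool tree. Here is a minimal Python sketch of that check. The tree layout and the `find_overcommitted` name are made up, and it ignores expandable reservations entirely.

```python
def find_overcommitted(pool, path=""):
    """Return (pool path, reserved by children, pool reservation) for each bad pool."""
    name = f"{path}/{pool['name']}"
    children = pool.get("children", [])
    problems = []
    if children:
        reserved = sum(child["reservation"] for child in children)
        if reserved > pool["reservation"]:
            problems.append((name, reserved, pool["reservation"]))
    for child in children:
        problems.extend(find_overcommitted(child, name))
    return problems


# Made-up tree; reservations in MHz. VMs are leaves with no children.
cluster = {
    "name": "cluster", "reservation": 10000,
    "children": [
        {"name": "Prod", "reservation": 6000, "children": [
            {"name": "db01", "reservation": 4000},
            {"name": "db02", "reservation": 3000},
        ]},
        {"name": "Test", "reservation": 2000, "children": [
            {"name": "test01", "reservation": 1000},
        ]},
    ],
}

problems = find_overcommitted(cluster)
for pool_path, reserved, available in problems:
    print(f"{pool_path}: children reserve {reserved}, pool has {available}")
```

In this example powering off db02 (or lowering its reservation) brings the Prod pool back to a consistent state, which matches the fix described above.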

Troubleshoot vMotion/Storage vMotion migration issues

See above and section 5

Interpret vMotion Resource Maps

See vMotion section above.

Identify the root cause of a DRS/HA cluster or migration issue based on troubleshooting information

Again, here is the end-all, catch-all for the section. Use all of the information above and in the troubleshooting guide to determine which steps to take towards finding the root cause of a problem. Again, real-world experience is going to be your best option for this one.

2 thoughts on “VCP 5 – Objective 6.4 – Perform Basic Troubleshooting for HA/DRS Clusters and vMotion/Storage vMotion”

Matt Vogt says:

January 30, 2012 at 11:00 pm

Good stuff as usual, Mike. Quick note/question/possible edit, in the “Verify HA/DRS cluster configuration”, you talk about the “HA section on the Summary tab of a host”, but I think you mean the summary tab of a cluster to see the failover capacity, admission control, etc.

Cheers,
Matt

1. Anonymous says:
  
  January 31, 2012 at 8:38 pm
  
  Right you are Matt. Thanks for the catch! I’m sure I have lots of them in these notes as I’m kind of just hammering them out! I do plan on going back through them and fixing what I can find before publishing a pdf or something. Thanks so much for the comments, and the tweets as well!