Friday, June 6, 2014

vSphere Performance Troubleshooting

I wanted to follow my previous two posts with vSphere Performance Troubleshooting. There are three golden questions when doing general performance troubleshooting.
  1. What are the symptoms?
  2. When did it work last?
  3. What has changed since it worked?
Don't let the user speculate on the root cause of the problem. Get a specific description of what they are doing and exactly what symptoms they see. Keep an open mind and consider all the components involved while staying focused on the symptoms.

Performance issues are very subjective. It basically comes down to the fact that the end user is not happy. When diagnosing performance problems in vCenter, vCenter Operations Manager, or esxtop, there are thousands of metrics you can look at. But very few of them actually answer the question, "Is the user happy?"

The best metrics to help diagnose user experience are metrics that measure contention, which is demand deprivation. A resource was wanted, but denied.

The first thing you should do is ask some questions about the symptoms of the problem. Is it all the VMs on a host? Is it all the resources in a cluster? Are the virtual machines on the same datastore? Is it a network issue? Start wide with your questioning and then narrow it down. Try to nail down a baseline comparison between time frames when they are experiencing the problem and when there are no problems. If you can get specific time frames, you can use vCenter Operations Manager to look at the times when the user is experiencing the degraded performance. vCOps retains months of metrics at 5-minute intervals.
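As a rough illustration of the baseline comparison, the sketch below compares the average of a metric during a "good" window with its average during the problem window. The function name, the sample values, and the idea of exporting the 5-minute samples into plain lists are all assumptions made for the example, not a vCOps feature.

```python
from statistics import mean


def compare_to_baseline(baseline_samples, problem_samples):
    """Return how many times larger the problem-window average is
    compared with the baseline-window average."""
    base_avg = mean(baseline_samples)
    prob_avg = mean(problem_samples)
    if base_avg == 0:
        # Avoid dividing by zero when the baseline metric was flat at 0.
        return float("inf") if prob_avg > 0 else 1.0
    return prob_avg / base_avg


# Hypothetical disk latency samples (ms), one per 5-minute interval.
good_window = [4.0, 5.0, 6.0, 5.0]
bad_window = [20.0, 30.0, 25.0, 25.0]

ratio = compare_to_baseline(good_window, bad_window)
print(f"Problem window average is {ratio:.1f}x the baseline")
```

A large ratio on one metric and a flat ratio on the others is a hint about where to narrow the questioning next.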

A performance problem can be defined as work that can't be done fast enough with the physical resources supplied. The demand exceeds the usage: the virtual machine is demanding more GHz, GB, or IOPS than the resource is able to deliver.

Contention = Demand - Usage
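A minimal sketch of that formula, applied per resource. The resource names and numbers are made up for illustration, and clamping negative results to zero is my own assumption (usage at or above demand simply means no contention).

```python
def contention(demand, usage):
    """Contention = Demand - Usage, never less than zero."""
    return max(demand - usage, 0.0)


# Hypothetical demand and usage for one VM: (demand, usage) per resource.
vm_metrics = {
    "cpu_ghz": (4.0, 3.0),        # wanted 4.0 GHz, got 3.0 GHz
    "memory_gb": (8.0, 8.0),      # wanted 8 GB, got 8 GB
    "disk_iops": (1500.0, 900.0), # wanted 1500 IOPS, got 900 IOPS
}

for resource, (demand, usage) in vm_metrics.items():
    print(f"{resource}: contention = {contention(demand, usage)}")
```

In this made-up case the VM is deprived of CPU and disk I/O but not memory.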

This can also happen from the virtual machine's perspective. Even if there are enough resources on the host, performance will be degraded if the virtual machine doesn't have enough virtual hardware to meet the demand. You are basically starving the virtual machine of needed resources: it can't demand from the hypervisor the resources that the application requires.

Houston we have a problem!

Also remember, the root problem can affect multiple resource areas. For example, a memory shortage could cause issues with swap-in rate from network storage, high network utilization, slow disk I/O, and maybe CPU deprivation. It is a cumulative effect from the source memory problem.

Every problem has a solution: you can either reduce the demand by tuning the application (e.g. fixing a memory leak or an inefficient algorithm for accessing disk) or increase the resources that are being supplied.

The top tools to use when troubleshooting contention metrics are the vCenter Server performance tab, vCenter Operations Manager, and esxtop/resxtop. When trying to whittle down an issue, you will probably use all of these tools in conjunction to find the root cause.

Top Contention Metrics include:
  •  All types of contention (wanted vs delivered)
  • CPU Ready(%)
  • Disk Latency(ms per IO)
  • Memory Contention(%) - Swap-In Rate (KBps)
  • Network Packets Dropped(#) 
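One practical wrinkle with CPU Ready: esxtop shows it as a percentage, while the vCenter performance charts report it as a summation in milliseconds per sample interval (20 seconds for the real-time chart). The sketch below converts the summation to a percentage. The function name is my own, and dividing by the vCPU count to get a per-vCPU average is an assumption about how you want to read a VM-level value.

```python
def ready_percent(summation_ms, interval_s=20, num_vcpus=1):
    """Convert a CPU Ready summation (ms) into a percentage of the
    sample interval, averaged across num_vcpus."""
    return summation_ms * 100 / (interval_s * 1000 * num_vcpus)


# 1000 ms of Ready time in a 20-second real-time sample.
print(ready_percent(1000, interval_s=20))

# The same per-vCPU figure for a 4-vCPU VM that reports 4000 ms in total.
print(ready_percent(4000, interval_s=20, num_vcpus=4))
```

For the historical charts the interval is longer (for example 300 seconds for the past-day chart), so pass the matching `interval_s` or the percentage will be overstated.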
Next, I thought I would go through some common problems you find in virtualized environments.

Problem: Slow Storage

Symptoms may include:
  • Latency per I/O is high (> 20 ms sustained)
  • CPU contention (or Ready) is high for other VMs on the same host
    • VirtualMachine-01 is squatting on a pCPU waiting for I/O. VirtualMachine-02, VirtualMachine-03, and VirtualMachine-04 are accumulating Ready time waiting for VirtualMachine-01 to leave the pCPU.
    • VirtualMachine-01 counts the time as Wait (but not idle), not Run or Used or Utilization.
    • VirtualMachine-01 CPU utilization, demand, or workload can be low.
    • Host CPU utilization, demand, or workload can be low.
    • Wait - Idle time for a vCPU is much more than 0
Solutions may include:
  • Consult the storage team to look at meeting the IOPS demand
  • Move virtual machines onto other datastores or high performance datastores
  • Tune guest OS or application to reduce demand for disk IOPS
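The word "sustained" in the latency symptom matters, since a single spike above 20 ms is not the same as a datastore that stays slow. Here is a small sketch of one way to check for that. The 20 ms threshold comes from the symptom list, while the function name and the "three consecutive samples" rule are assumptions I picked for illustration.

```python
def sustained_high_latency(samples_ms, threshold_ms=20.0, min_consecutive=3):
    """Return True if latency stays above threshold_ms for at least
    min_consecutive samples in a row."""
    run = 0
    for sample in samples_ms:
        # Extend the current run of high samples, or reset it.
        run = run + 1 if sample > threshold_ms else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical per-interval latency samples in milliseconds.
spiky = [5.0, 25.0, 6.0, 30.0, 7.0]       # isolated spikes only
sustained = [5.0, 22.0, 25.0, 31.0, 8.0]  # three high samples in a row

print(sustained_high_latency(spiky))
print(sustained_high_latency(sustained))
```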
Problem: Host CPU Shortage

Symptoms may include:
  • CPU Contention(%) or CPU Ready(%) is high
  • CPU Workload(%) is high, near or over 100%
  • CPU Utilization(%) is high
  • Disk latency is low (storage squatting is not causing the Ready%)
Solutions may include:
  • Tune guest OS or application to reduce demand for CPU
  • Move VMs onto other hosts
  • Check DRS is working in the cluster
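High Ready shows up in both the slow storage and the host CPU shortage scenarios, so it helps to look at disk latency and host workload together. The sketch below encodes that reasoning as a rough first-pass triage. Only the 20 ms latency figure comes from this post; the Ready and workload thresholds, the function name, and the labels are assumptions chosen for the example and should be tuned for your environment.

```python
def classify_ready(ready_pct, disk_latency_ms, host_workload_pct,
                   ready_threshold=10.0, latency_threshold_ms=20.0,
                   workload_threshold=90.0):
    """Rough first-pass guess at why CPU Ready is high."""
    if ready_pct < ready_threshold:
        return "no significant CPU contention"
    if disk_latency_ms > latency_threshold_ms:
        # Ready is high while a VM squats on a pCPU waiting for I/O.
        return "possible slow storage"
    if host_workload_pct >= workload_threshold:
        # Ready is high, storage is fine, and the host is busy.
        return "possible host CPU shortage"
    return "inconclusive"


print(classify_ready(ready_pct=15.0, disk_latency_ms=35.0, host_workload_pct=40.0))
print(classify_ready(ready_pct=15.0, disk_latency_ms=5.0, host_workload_pct=98.0))
```

This only suggests where to look first. It does not replace checking the individual metrics in esxtop or vCOps.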
Problem: VM CPU Costop High

Symptoms may include:
  • Costop(%) is high
    • Costop(%) is like Ready(%), but here one vCPU runs hot while the others are cold, and the skew between hot and cold grows larger
Solutions may include:
  • Reduce the number of vCPUs in the virtual machine
  • Reduce the number of virtual machines or vCPUs on the host or cluster
  • Tune the guest OS or application to better utilize all the vCPUs (be aware of single-threaded applications)
Problem: Host Memory Shortage

Symptoms may include:
  • Contention(%) and/or Swap-In Rate (KBps) is not zero
  • Active(%) is high (this measures how much of the pRAM is actively being used, not just allocated/granted/consumed)
  • Overall Workload(%) is high (This is vCOps which is similar to Active)
Solutions may include:
  • Tune guest OS or application to reduce demand for memory
  • Move VMs to other hosts
  • Check DRS is working in the cluster
  • Consider reducing the memory in some virtual machines (some applications like databases opportunistically use all the vRAM installed)
Problem: VM Undersized (memory)

Symptoms may include:
  • Active(%) is near 100%
  • Contention(%) and/or Swap-In Rate (KBps) is zero
  • Inside the guest OS, there is more swapping than usual
  • Host is supplying all the memory that the VM demands of the host, but the VM doesn't have as much vRAM installed as it needs
Solutions may include:
  • Add vRAM to the virtual machine
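The two memory scenarios can look similar to the end user, but the metrics point in different directions: host-level swap-in means the host is short, while zero swap-in with near-100% Active and swapping inside the guest means the VM itself is too small. A hedged sketch of that distinction follows. The function name, the 95% cutoff for "near 100%", and the boolean `guest_swapping` input (which you would have to observe inside the guest OS) are assumptions for illustration.

```python
def classify_memory(vm_active_pct, swap_in_kbps, guest_swapping,
                    active_threshold=95.0):
    """Rough first-pass guess at which memory problem is in play."""
    if swap_in_kbps > 0:
        # The hypervisor is swapping, so the host cannot supply the demand.
        return "possible host memory shortage"
    if vm_active_pct >= active_threshold and guest_swapping:
        # The host delivers everything asked for, but the VM needs more vRAM.
        return "possible undersized VM"
    return "no memory problem indicated"


print(classify_memory(vm_active_pct=70.0, swap_in_kbps=250.0, guest_swapping=False))
print(classify_memory(vm_active_pct=99.0, swap_in_kbps=0.0, guest_swapping=True))
```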
These are just a few of the more common scenarios. Keep in mind that metrics measured in ESXi, vCenter, and vCenter Operations Manager are accurate, but metrics measured in the guest OS are not always accurate. Timekeeping inside a virtual machine is subject to jitter (time is accurate to within a second, not a millisecond).

Metrics that depend on highly accurate time, like CPU utilization, are not reliable when measured inside the guest. Sometimes, though, looking at the metrics in the guest operating system is essential: swapping in the guest OS isn't visible from outside the operating system.

Don't assume the problem is somewhere else and not your problem, but remember it might be.