Monday, May 26, 2014

vCenter Operations Manager Troubleshooting

vCenter Operations Manager 5.8 (vC Ops) is a tool from VMware that collects massive amounts of data from a variety of sources. You might wonder: what is the difference between the metrics collected from ESXi by vCenter Server and the metrics collected by vCenter Operations Manager? vCenter shows you a lot of different metrics for the past hour in 20-second increments, but as you research further back in time it reveals fewer metrics and the data points become more averaged out. For example, the past day is rolled up into 5-minute intervals, the past week into 30-minute intervals, and the past month into two-hour intervals in vCenter. A two-hour average can hide a lot of peaks and valleys; it might be good for some general capacity planning, but it isn't good if you are trying to troubleshoot the root cause of an application performance issue. It is simply too large an interval, and you need much finer data sampling. That is where vC Ops comes in!
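To see why a two-hour rollup hides trouble, here is a small sketch (illustrative numbers, not vC Ops code) where a 5-minute CPU spike to 95% disappears into a modest two-hour average:

```python
# Sketch: averaging 20-second samples into a two-hour rollup flattens
# the short spikes that matter for root-cause troubleshooting.
samples_per_2h = 2 * 60 * 60 // 20  # 360 twenty-second samples

# Hypothetical CPU usage: mostly idle at 10%, with a 5-minute spike to 95%.
series = [10.0] * samples_per_2h
for i in range(15):  # 15 samples * 20 s = 5 minutes
    series[100 + i] = 95.0

two_hour_avg = sum(series) / len(series)
print(f"peak at 20-second granularity: {max(series):.0f}%")    # 95%
print(f"same window as a 2-hour average: {two_hour_avg:.1f}%")  # 13.5%
```

The spike that would explain an application slowdown is completely invisible in the rolled-up number.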

vCenter Operations Manager does three things differently: it keeps all the metrics, it keeps them at five-minute intervals, and it keeps them for six months. You can retroactively go back and tell an application owner whether they were having a performance problem at a certain time. That is going to give you a lot more confidence about providing relevant information to your IT business partners.

Another feature of vCenter Operations Manager is dynamic thresholds. vC Ops takes the collected metrics and looks for patterns over time. It can then use these patterns to make predictions about the future that help you proactively maintain your environment.

Under the Operations tab there are four badges for every resource: Health, Workload, Anomalies, and Faults. Health is nothing more than the aggregate of the other three badges and is scored 0 to 100, with 100 being the best score.
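VMware does not publish the exact aggregation formula, but the idea of Health as a 0-100 roll-up of the other badges can be sketched like this (purely hypothetical weighting, for illustration only):

```python
# Hypothetical sketch of a 0-100 Health score aggregated from the other
# badges (vC Ops' real weighting is internal; this only illustrates that
# Health starts at 100 and drops as the sub-badges worsen).
def health_score(workload: float, anomalies: float, faults: float) -> float:
    """Each input is 0-100 where higher means worse; output 0-100, 100 best."""
    worst = max(workload, anomalies, faults)
    avg = (workload + anomalies + faults) / 3
    # Blend the worst badge with the average so one bad badge dominates.
    return round(100 - (0.7 * worst + 0.3 * avg), 1)

print(health_score(workload=29, anomalies=5, faults=0))    # lightly loaded host
print(health_score(workload=95, anomalies=80, faults=50))  # stressed host
```

The point is only the shape of the relationship: a perfectly quiet resource scores 100, and any single bad badge pulls Health down sharply.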

Faults show alert information, such as a link state being down. When an event-triggered alert occurs, it does not automatically clear; this is by design, to ensure that someone looks at the state and takes corrective action so it doesn't happen again in the future.
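That "sticky until reviewed" behavior can be sketched as a tiny state machine (hypothetical code, not the vC Ops API): the alert ignores the condition recovering and only clears on an explicit operator action.

```python
# Sketch: an event-triggered fault alert that stays active until an
# operator cancels it, even after the underlying condition recovers.
class FaultAlert:
    def __init__(self, message: str):
        self.message = message
        self.active = True

    def on_condition_cleared(self) -> None:
        # Deliberately a no-op: recovery alone does not clear the alert.
        pass

    def acknowledge(self) -> None:
        # Only an explicit operator action clears it.
        self.active = False

alert = FaultAlert("vmnic0 link state down")
alert.on_condition_cleared()
print(alert.active)  # True: someone still has to review it
alert.acknowledge()
print(alert.active)  # False
```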

The Workload badge shows the demand for physical resources. Every resource has some level of capacity; for example, CPU capacity is the number of cores you have times the GHz per core. The virtual machines on the host use a portion of that capacity, and workload is the fraction of that capacity being demanded: workload is demand divided by capacity, expressed as a percentage.
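The formula is simple enough to work through with numbers (illustrative values, not taken from any screenshot):

```python
# Worked example of the workload formula: workload % = demand / capacity * 100
cores = 8
ghz_per_core = 2.4
cpu_capacity_ghz = cores * ghz_per_core      # 19.2 GHz total CPU capacity

vm_cpu_demand_ghz = [0.3, 0.2, 0.15, 0.12]   # hypothetical per-VM demand
total_demand = sum(vm_cpu_demand_ghz)

cpu_workload_pct = total_demand / cpu_capacity_ghz * 100
print(f"{cpu_workload_pct:.0f}%")  # 4%, like the host described below
```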

Workload shows us four resource dimensions: CPU, memory, disk I/O, and network I/O. In this case, our host is providing the resources and the demand is being generated by the virtual machines. There is some virtualization overhead for ESXi, but the demand predominantly comes from the guests. We can see that CPU workload shows 4%, which means that of the total capacity (number of cores times GHz per core), 4% is being demanded by the virtual machines on the host.

Looking at the host, we can immediately see that it isn't stressed. The Workload badge is driven by the most utilized resource; in this instance memory is the most heavily used resource at 29%, which gives the Workload badge value of 29% shown above.
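In other words, the badge is simply the maximum of the four dimensions. A one-liner makes that concrete (percentages mirror the host described above):

```python
# Sketch: the Workload badge takes the most constrained of the
# four resource dimensions.
workload = {"cpu": 4, "memory": 29, "disk_io": 2, "network_io": 1}

badge_resource = max(workload, key=workload.get)
badge_value = workload[badge_resource]
print(badge_resource, badge_value)  # memory 29
```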

The important thing to remember about workload is that it is the demand for resources, not the usage. Think of it this way: demand is how much of the resource was wanted, and usage is what was delivered. This is fundamental to performance troubleshooting; most performance problems occur because demand is higher than what is delivered. The gap between demand and usage can be considered contention. The more contention, the less likely your end user is happy with the performance. Ultimately, the definition of a performance problem is that the user is unhappy, and the definition of there not being a performance problem is that the user is happy.
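The demand/usage/contention relationship can be sketched in a few lines (illustrative numbers, not vC Ops output):

```python
# Contention is the gap between what a VM wanted and what it was delivered.
def contention(demand: float, usage: float) -> float:
    """Demand is what was wanted, usage is what was delivered;
    the gap is contention (never negative)."""
    return max(0.0, demand - usage)

# A VM wanted 2.0 GHz of CPU but was only scheduled 1.5 GHz:
print(contention(demand=2.0, usage=1.5))  # 0.5 GHz of contention
# Demand fully met: no contention, so the user is likely happy.
print(contention(demand=1.2, usage=1.2))  # 0.0
```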

If we had a virtual machine that ran a batch process and was at 100% utilization for 5 hours during the night, is that a performance problem? If the user received the expected output the next day, there isn't one. Adding additional resources might speed up the process, but it isn't really necessary. Performance problems are very subjective.

Virtual machines have approximately 250 to 300 metrics, but none of those metrics alone predicts whether the user is happy. A small fraction of them can help establish it, and those tend to be the metrics that measure contention: how much of a resource the virtual machine demanded versus how much it was actually provided.

The Workload box displays the resource utilization over the last 5 minutes for CPU, memory, disk I/O, and network I/O. Just above, we can see the host is bound by memory, and it shows us the utilization for the past 6 hours. There are some peaks and valleys, but plenty of capacity remained over that period. If we want to look further back, we need to switch from the Detail view to the All Metrics view.

We are going to finish off this post by talking about the last badge, Anomalies. Remember, earlier I mentioned that vCenter Operations Manager collects the metrics and then tries to find patterns in the numbers to produce dynamic thresholds. It tries to predict the resource utilization for the next day. If an observed value falls outside the predicted range, that becomes an anomaly. A very high anomaly count doesn't necessarily mean there is a problem, but it can be a good indicator that something is wrong based on past behavior. For instance, if you had a virtual machine that had been sitting around for months waiting for an application to be rolled out to your business users, its anomaly score is going to be low until that application goes into production. The workload may still be relatively low, but because it is outside of the predicted range the anomaly score is going to be high.
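A minimal sketch of the dynamic-threshold idea (this is not VMware's actual algorithm, just a common statistical stand-in): learn an expected range from history, then flag anything outside it.

```python
# Sketch: build a dynamic threshold as mean +/- 2 standard deviations of
# historical values, then flag today's samples outside that range.
from statistics import mean, stdev

history = [10, 12, 11, 9, 13, 10, 12, 11, 10, 12]  # e.g. past CPU %
mu, sigma = mean(history), stdev(history)
low, high = mu - 2 * sigma, mu + 2 * sigma

todays_samples = [11, 10, 48, 12]  # 48% is far outside the learned range
anomalies = [v for v in todays_samples if not (low <= v <= high)]
print(anomalies)  # [48]
```

The 48% sample is not necessarily a problem in absolute terms, but it is anomalous relative to what this workload has done before, which is exactly the signal the badge is built on.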

Every night the vCenter Operations Manager analytics virtual machine dynamically recalculates the thresholds; it looks at every metric for every virtual machine and predicts what it is going to look like tomorrow. Every single one of the predictions is different. That is why the analytics virtual machine in vC Ops requires so much disk capacity, disk I/O, CPU, and memory: every night it is analyzing millions of data points.

If we look at the right side of the Operations details, we can see the metrics that are not behaving normally, along with the expected ranges for specific time frames. In the picture above, Data Store Capacity Contention (%) is below what is expected. Red arrows pointing down show values below the dynamic threshold, red arrows pointing up show values above it, and the yellow light bulb shows an active anomaly.

If your Workload and your Anomaly badges are both red, then you definitely want to track down the issue. 

vCenter Operations Manager can be a bit daunting at first, but if you take the time to learn the tool, its built-in analytics engine can help you move from reactive problem solving to proactive maintenance.