What is an alert definition? An alert definition is a template for tracking problems. You start with a base object to monitor, you define the impact by categorizing it and rating how critical it is, and then you choose one or more symptoms that constitute the problem.
An alert definition is made up of several components:
- Symptoms
- Recommendations
- Actions
- Notifications
Alert definitions come with vRealize Operations Manager out of the box, but you can also create your own. That includes building your own symptoms, prescribing your own recommendations, and assigning actions. Alert definitions created by IT Operations are made specifically to meet your organization's business requirements and SLAs.
To access the alert definitions, from the Home screen select the Content object in the Navigation panel.
A symptom is a condition test against an object: it evaluates one or more values to determine whether the condition is true. For example, a virtual machine that has CPU Ready% greater than 10 may trigger a Health alert.
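To make the idea of a condition test concrete, here is a minimal Python sketch. It is not vRealize Operations Manager code: the `Symptom` class, the `cpu_ready_pct` key, and the sample values are hypothetical names chosen only for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, Mapping

# Comparison operators a symptom condition might use (illustrative set).
OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


@dataclass(frozen=True)
class Symptom:
    """Hypothetical model of a symptom: one condition tested against one metric."""
    name: str
    metric: str
    op: str
    threshold: float

    def is_true(self, metrics: Mapping[str, float]) -> bool:
        # A metric that is missing from the sample never triggers the symptom.
        if self.metric not in metrics:
            return False
        return OPERATORS[self.op](metrics[self.metric], self.threshold)


# The CPU Ready% example from the text, expressed as a condition.
high_cpu_ready = Symptom("High CPU Ready", "cpu_ready_pct", ">", 10.0)

print(high_cpu_ready.is_true({"cpu_ready_pct": 14.2}))  # True
print(high_cpu_ready.is_true({"cpu_ready_pct": 3.0}))   # False
```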
By default, vRealize Operations Manager comes with multiple symptom types:
- Metric/Super Metric
  - Static: compare to a specific metric value
  - Dynamic: compare to the dynamic thresholds (DTs) of the metric
- Property (example: number of CPUs)
  - Compare metrics that have been labeled as Properties by the adapter
- Message Events (example: change events)
  - Match one of the object-based events collected by the adapter
- Fault (example: host disconnected)
  - Match one of the object-based faults collected by the adapter
- Metrics Event (example: CPU is high from 3rd party adapter)
  - Match a metric-based event collected by the adapter
  - Not used by VMware adapters
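The difference between the static and dynamic flavors of a metric symptom can be sketched in a few lines of Python. This is only an illustration. The function names are hypothetical, and the dynamic-threshold band is passed in as plain numbers because how vRealize Operations Manager computes it is not covered here. Treating "outside the band" as the trigger is also an assumption.

```python
def static_symptom(value: float, threshold: float) -> bool:
    """Static: compare the metric to a fixed value (here, 'greater than')."""
    return value > threshold


def dynamic_symptom(value: float, lower: float, upper: float) -> bool:
    """Dynamic: compare the metric to its dynamic-threshold band.

    The band (lower, upper) is assumed to be supplied by the caller;
    the symptom is true when the value falls outside it.
    """
    return value < lower or value > upper


# Hypothetical sample: a metric currently reading 80.
print(static_symptom(80.0, threshold=95.0))           # False: under the fixed limit
print(dynamic_symptom(80.0, lower=20.0, upper=60.0))  # True: outside the band
```

The same reading of 80 is harmless against a fixed limit of 95 but abnormal for an object that usually sits between 20 and 60, which is the case a dynamic symptom is meant to catch.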
If you want to override the default thresholds, go to Policies on the Administration page and adjust the values with the slider bars.
Let's create a symptom. Click the green + under symptom definitions, open the Base Object Type drop-down list, and expand vCenter Adapter. Scroll down and select Host System.
Next, expand the Memory metrics; we are going to create two symptoms for host memory shortage. First, create a symptom for Swap In Rate (KBps) greater than 0 with a criticality of Immediate, and then create a symptom for Workload (%) greater than 95 with a criticality of Critical. I named my custom symptoms High Swap In Rate and High Host Memory Workload. The polling cycle for vRealize Operations Manager is 5 minutes.
Recommendations provide guidance when you are trying to resolve an alert; they can be a combination of best practices, vendor recommendations, and tribal knowledge. Recommendations are displayed with the alert and can have actions associated with them. There is a very good library of default recommendations, but you can create new recommendations if the defaults do not meet your needs.
The intention is to make recommendations actionable, so that you can respond to infrastructure issues with one-click actions. If you are experiencing host memory contention, the action might be to shut down some of the virtual machines or vMotion them to another host. These actions will be available through the vRealize Operations Manager Python Adapter and vCenter Orchestrator workflows in the near future.
With the initial release, only a limited number of actions are available.
To finish out this article, we are going to create our alert definition, taking advantage of the symptoms that we created for host memory shortage. Select Alert Definitions in the left-hand Navigation panel and then click the green + icon on the alert definitions screen.
Give the alert definition a name and a description; for my example, I used Host Memory Shortage.
For the Base Object Type, I am going to expand vCenter Adapter and select Host System.
On Alert Impact, the Impact is to Health with the Criticality of Immediate. I select the Alert Type and Subtype of Application: Performance and keep the default Wait and Cancel Cycle.
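The Wait and Cancel Cycle settings work together with the polling cycle. The Python sketch below assumes the usual reading of those settings: a condition must hold for `wait_cycle` consecutive collection cycles before it triggers, and must be clear for `cancel_cycle` consecutive cycles before it cancels. The class and parameter names are hypothetical, and the cycle counts in the example are illustrative rather than the product defaults.

```python
class CycleTracker:
    """Hypothetical tracker applying wait/cancel cycles to per-cycle condition results."""

    def __init__(self, wait_cycle: int, cancel_cycle: int) -> None:
        self.wait_cycle = wait_cycle
        self.cancel_cycle = cancel_cycle
        self.active = False
        self._streak = 0  # consecutive cycles that contradict the current state

    def update(self, condition_true: bool) -> bool:
        # Count only cycles that disagree with the current state;
        # a cycle that agrees with it resets the streak.
        if condition_true != self.active:
            self._streak += 1
        else:
            self._streak = 0
        needed = self.cancel_cycle if self.active else self.wait_cycle
        if self._streak >= needed:
            self.active = not self.active
            self._streak = 0
        return self.active


tracker = CycleTracker(wait_cycle=3, cancel_cycle=2)
samples = [True, True, True, False, True, False, False]
states = [tracker.update(s) for s in samples]
print(states)  # [False, False, True, True, True, True, False]
```

With a 5-minute polling cycle, a wait cycle of 3 means the condition has to be seen on three successive collections, so roughly 10 to 15 minutes pass before the symptom triggers. A single clear reading in the middle (the fourth sample above) does not cancel it.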
Now we add our Symptom Definitions. Earlier we created two symptom definitions, so we are going to drag over High Host Memory Workload and High Swap In Rate.
To finish our alert definition, we are going to drag over the following recommendations:
- Add more hosts to the cluster to increase capacity
- Use vMotion to migrate some virtual machines with high memory workload to other hosts that have available memory capacity
- Power off this virtual machine to allow for other virtual machines to use CPU and memory that this virtual machine is wasting
You will notice that the last recommendation has the action to power off the VM.
After saving it, we see our new alert definition.
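As a recap, the pieces assembled in this walkthrough can be pictured as one data structure. The Python sketch below is a hypothetical model, not the product's storage format or API. The field names, the metric keys, and the choice to require every symptom to be true are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AlertDefinition:
    """Hypothetical model of the alert definition built in this article."""
    name: str
    base_object_type: str
    impact: str
    criticality: str
    # symptom name -> (metric key, threshold); every condition is "greater than"
    symptoms: Dict[str, Tuple[str, float]] = field(default_factory=dict)
    recommendations: List[str] = field(default_factory=list)

    def triggered_symptoms(self, metrics: Dict[str, float]) -> List[str]:
        return [
            name
            for name, (metric, threshold) in self.symptoms.items()
            if metric in metrics and metrics[metric] > threshold
        ]

    def is_triggered(self, metrics: Dict[str, float]) -> bool:
        # Assumption: the alert fires only when every symptom is true.
        return len(self.triggered_symptoms(metrics)) == len(self.symptoms)


host_memory_shortage = AlertDefinition(
    name="Host Memory Shortage",
    base_object_type="Host System",
    impact="Health",
    criticality="Immediate",
    symptoms={
        "High Swap In Rate": ("mem_swap_in_rate_kbps", 0.0),       # hypothetical key
        "High Host Memory Workload": ("mem_workload_pct", 95.0),   # hypothetical key
    },
    recommendations=[
        "Add more hosts to the cluster to increase capacity",
        "Use vMotion to migrate virtual machines to hosts with available memory",
    ],
)

sample = {"mem_swap_in_rate_kbps": 12.0, "mem_workload_pct": 97.5}
print(host_memory_shortage.triggered_symptoms(sample))
print(host_memory_shortage.is_triggered(sample))  # True
```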
If you want to go directly to the active alerts, click the quick link.
It shows a list of all the alerts in the system, and you can drill down on a specific alert for more details. The other way to view alerts is to go directly to an object.