There are two distinct states a device can be in when storage connectivity is lost: All Paths Down (APD) or Permanent Device Loss (PDL). In both states every path to the device is down, but the two are handled differently. All Paths Down (APD) is a condition where all paths to the storage device are lost or the storage device is removed. The state arises because the change happens in an uncontrolled manner, and the VMkernel core storage stack does not know how long the loss of access to the device will last. APD is treated as a temporary (transient) condition, since the storage device might come back online; if the loss turns out to be permanent, it is referred to as a Permanent Device Loss (PDL).
Permanent Device Loss (PDL) is the permanent removal of a device. This is typically caused by a storage administrator removing a LUN at the storage array, either by unmapping or deleting it. The VMkernel core storage stack knows the device is not coming back because the storage array informs the host of a PDL state through a SCSI command response. The removal is considered permanent when all paths have the PDL error.
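The distinction between the two states can be sketched in a few lines of Python. This is only an illustrative model of the rule described above, not how the VMkernel is implemented; the `PathState` names and the `classify_device` function are hypothetical.

```python
from enum import Enum


class PathState(Enum):
    """Hypothetical per-path states used only for this illustration."""
    ACTIVE = "active"  # path is working
    DEAD = "dead"      # path is lost and the array gave no reason
    PDL = "pdl"        # array answered with a permanent-loss SCSI response


def classify_device(path_states):
    """Return 'ok', 'APD', or 'PDL' given the state of every path to a device."""
    if not path_states:
        raise ValueError("a device needs at least one path")
    if any(state is PathState.ACTIVE for state in path_states):
        return "ok"    # at least one working path, so there is no outage
    if all(state is PathState.PDL for state in path_states):
        return "PDL"   # every path reports the PDL error: treated as permanent
    return "APD"       # all paths are down, but the duration is unknown
```

For example, a device whose two paths are `DEAD` and `PDL` is still classified as APD in this model, because not every path has reported the PDL error.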
There are two variants of PDL, planned and unplanned:
- Planned PDL is when the administrator follows the recommended workflow to remove a storage device (Unmounting a LUN or detaching a datastore/storage device from multiple VMware ESXi 5.x/6.0 hosts)
- Unplanned PDL is when the storage administrator just removes a storage device (at the storage array)
A device may return from a PDL, but there is no guarantee of data consistency at that point.
When a LUN is removed from an ESXi host without preparing the host, the host can enter an All Paths Down (APD) state that impacts both the host and its virtual machines; the hosts may disconnect from vCenter Server, and the ESXi servers would then require a reboot to resolve the problem. To address this issue, vSphere 5.1 introduced an advanced setting called Misc.APDHandlingEnable, which defaults to 1. When this setting is enabled, the host continues to retry non-VM I/O commands to a storage device in the APD state for a limited period only. If the value is set to 0, the host reverts to the earlier behavior of retrying failing I/Os indefinitely.
A configurable timer with a default of 140 seconds starts when a host first detects that a datastore is in All Paths Down (APD). Hostd marks the datastore as inaccessible with an APD Started reason. The timeout parameter controls how many seconds the ESXi host retries non-VM I/O commands to a device before determining the device is unreachable.
After the timer expires, the datastore is identified as APD Timeout, pending non-VM I/O is aborted, and hostd marks the datastore as inaccessible with an APD Timeout reason. Any further non-VM I/O is fast-failed with a status of No_Connect.
This ensures hostd does not become unresponsive and prevents ESXi hosts from disconnecting from vCenter Server.
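The timer behavior can be modeled as a small state machine. The sketch below is a simplified simulation with an explicit clock (seconds passed in by the caller), not ESXi code; the `ApdTracker` class and its method names are hypothetical, and it assumes the timer counts as expired once the full timeout has elapsed.

```python
APD_TIMEOUT_DEFAULT = 140  # seconds, the default mentioned above


class ApdTracker:
    """Toy model of APD handling for one datastore (illustration only)."""

    def __init__(self, timeout=APD_TIMEOUT_DEFAULT):
        self.timeout = timeout
        self.apd_started_at = None  # None means the datastore is reachable

    def paths_lost(self, now):
        """Start the timer the first time all paths are seen down."""
        if self.apd_started_at is None:
            self.apd_started_at = now

    def paths_restored(self):
        """The device came back, so the APD condition ends."""
        self.apd_started_at = None

    def state(self, now):
        if self.apd_started_at is None:
            return "accessible"
        if now - self.apd_started_at >= self.timeout:
            return "APD Timeout"
        return "APD Started"

    def submit_non_vm_io(self, now):
        """Describe what happens to a non-VM I/O issued at time `now`."""
        state = self.state(now)
        if state == "accessible":
            return "issued"
        if state == "APD Started":
            return "retrying"      # retried until the timer expires
        return "fast-failed"       # after the timeout, fail immediately


tracker = ApdTracker()
tracker.paths_lost(now=0)
print(tracker.state(now=60), "/", tracker.submit_non_vm_io(now=60))
print(tracker.state(now=200), "/", tracker.submit_non_vm_io(now=200))
```

The point of the fast-fail branch is the same as in the text: once the timeout has passed, callers such as hostd get an immediate error instead of blocking on I/O that may never complete.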
You might want to increase the value of the timeout if storage devices connected to your ESXi host might take longer than 140 seconds to recover from a connection loss. You can enter a value between 20 and 99999 seconds.
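As a rough sketch, the range check can be wrapped in a helper that builds the command line for changing the timeout. The helper name `apd_timeout_command` is hypothetical, and the option path `/Misc/APDTimeout` is an assumption that does not come from this article, so verify it against the documentation for your ESXi version before using it. The function only builds the argument list; it does not run anything.

```python
APD_TIMEOUT_MIN = 20      # lowest accepted value, in seconds
APD_TIMEOUT_MAX = 99999   # highest accepted value, in seconds


def apd_timeout_command(seconds):
    """Return an esxcli argument list that would set the APD timeout.

    Assumption: the advanced option lives at /Misc/APDTimeout.
    """
    if isinstance(seconds, bool) or not isinstance(seconds, int):
        raise TypeError("timeout must be a whole number of seconds")
    if not APD_TIMEOUT_MIN <= seconds <= APD_TIMEOUT_MAX:
        raise ValueError(
            f"timeout must be between {APD_TIMEOUT_MIN} and {APD_TIMEOUT_MAX} seconds"
        )
    return [
        "esxcli", "system", "settings", "advanced", "set",
        "-o", "/Misc/APDTimeout",
        "-i", str(seconds),
    ]


print(" ".join(apd_timeout_command(180)))
```

Validating before building the command means an out-of-range value such as 10 is rejected locally rather than sent to the host.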
What about I/O initiated by the guest OS? The guest OS is responsible for retrying or aborting its own outstanding I/O. vSphere also has a feature to make VMX file writes more resilient: updates to the VMX config file are atomic, which mitigates corruption of the VMX file during an APD. The update is first written to a temporary file, which is then swapped with the actual VMX file.
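The write-to-temp-then-swap pattern is a general technique, and a minimal Python version looks like the sketch below. It illustrates the idea only and is not how ESXi implements VMX updates; the function name `write_config_atomically` is hypothetical.

```python
import os
import tempfile


def write_config_atomically(path, text):
    """Replace the file at `path` with `text` without exposing a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before the swap
        os.replace(tmp_path, path)  # swap the new file into place
    except BaseException:
        # On any failure, remove the temp file and leave the original untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the write is interrupted before the swap, readers still see the previous complete file, which is the property that protects the config file when storage disappears mid-update.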
Now that we have a better understanding of All Paths Down, let's create an alarm for APD Timeouts. A vCenter alarm can be defined on a particular object, and it then applies to all the child objects beneath it in the vCenter Server tree. In this instance, we are going to create an alarm on a specific event that applies to all the ESXi hosts.
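The way an alarm defined high in the tree reaches every host underneath it can be pictured with a small model. This is a conceptual sketch only, not the vCenter API; the `InventoryNode` class and the object names are made up for illustration.

```python
class InventoryNode:
    """Toy inventory object with a parent link and locally defined alarms."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.alarms = []  # alarms defined directly on this object

    def effective_alarms(self):
        """Collect alarms defined here and on every ancestor."""
        node, found = self, []
        while node is not None:
            found.extend(node.alarms)
            node = node.parent
        return found


# Hypothetical tree: vCenter Server -> datacenter -> cluster -> host
root = InventoryNode("vCenter Server")
datacenter = InventoryNode("Datacenter", parent=root)
cluster = InventoryNode("Cluster", parent=datacenter)
host = InventoryNode("esxi-01", parent=cluster)

root.alarms.append("APD Timeout")
print(host.effective_alarms())
```

Defining the alarm once at the top is what lets it cover every ESXi host without creating it on each one.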
To start, we are going to go into the vSphere Web Client, go to the Manage tab, and click on the Alarm Definitions view. In the Definitions view, we are going to click on the green + button to create a new alarm.
On the General view, we are going to give our alarm a name and description. Set the Monitor For option to "specific event occurring on this object, for example VM Power On" and then click Next.
Next we are going to add the trigger. The Event is going to be esx.problem.storage.apd.timeout with the Status of Alert.
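In effect, the trigger is a match on the event type ID paired with the status to raise. The sketch below models that pairing; it is not the vCenter API, the function name `alarm_colour` is hypothetical, and the mapping of Normal/Warning/Alert to green/yellow/red is assumed from the usual Web Client convention.

```python
APD_TIMEOUT_EVENT = "esx.problem.storage.apd.timeout"

# Assumed mapping between the wizard's status names and alarm colours.
STATUS_COLOURS = {"Normal": "green", "Warning": "yellow", "Alert": "red"}


def alarm_colour(event_type_id,
                 trigger_event=APD_TIMEOUT_EVENT,
                 trigger_status="Alert"):
    """Return the colour the alarm would move to, or None if the event does not match."""
    if event_type_id != trigger_event:
        return None
    return STATUS_COLOURS[trigger_status]


print(alarm_colour("esx.problem.storage.apd.timeout"))
```

Any other event ID returns `None` in this model, meaning the alarm is left alone.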
On the Actions view, we are going to set up the action taken when the alarm is triggered. In the image below, I set an email notification to send me an alert when the alarm goes from yellow to red. I have it configured to alert me Once. After this is completed, if you are using a distribution list, the APD Timeout alerts will go to everyone in the distribution list and you will have advance notification of a possible problem.