Details
The issue addressed in this article occurs when a rescan is issued
while an all-paths-down (APD) state exists for any LUN in the vCenter
Server cluster. As a result, a virtual machine on one LUN stops
responding (temporarily or permanently) because a different LUN in the
vCenter Server cluster is in an APD state.
These symptoms may indicate you have an all-paths-down state:
- You may see intermittent "Request timed out" responses (out of 6 or 7 successful pings) while trying to ping the virtual machine.
- When powering on a virtual machine with a raw device mapping (RDM), the progress bar stops at 50% and the virtual machine console becomes unresponsive at the VMware splash/loading/BIOS screen.
- In vSphere 4.0, virtual machines drop packets intermittently.
- Virtual machines drop ping packets.
- Network becomes temporarily unresponsive.
- In the vmkernel.log file, you see entries similar to:
NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa._______________" - failed to issue command due to Not found (APD)
NMP: nmp_DeviceUpdatePathStates: Activated path "NULL" for NMP device "naa.__________________".
Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
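As a rough illustration of how the log symptom above could be checked, the following sketch lists the devices that logged APD failover failures. The function name apd_devices is hypothetical, and the match relies only on the nmp_DeviceAttemptFailover and (APD) strings shown in the excerpt; your log lines may differ.

```shell
# Sketch only: apd_devices is a hypothetical helper, not a VMware tool.
# Usage: apd_devices <path-to-vmkernel-log>
apd_devices() {
    # Keep the failover lines that are flagged (APD), extract the quoted
    # device name from each line, and print every device once.
    grep 'nmp_DeviceAttemptFailover' "$1" \
        | grep '(APD)' \
        | sed -n 's/.*device "\([^"]*\)".*/\1/p' \
        | sort -u
}
```

Pass the path to the vmkernel log on your host as the only argument. An empty result means no matching entries were found in that file, not that the storage is healthy.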
All-Paths-Down State (APD)
The all-paths-down state is a condition where no working path exists to a storage device (LUN). These situations can cause an all-paths-down state to a LUN:
- A hardware failure (permanent or transient)
- Removal of a LUN
Rescanning
This article is concerned with the type of rescanning that can cause virtual machines on other LUNs to become unresponsive:
- Configuration changes in ESXi/ESX 4.x involving VMFS (Virtual Machine File System)
Configuration of VMFS that involves creating, deleting, or increasing datastores can cause an automated rescan. vCenter Server issues a vCenter Server-wide rescan as part of the workflow to discover storage changes. This rescan allows automatic discovery to maintain a consistent view of storage across all hosts in the cluster. For example, to perform a rescan in ESX 4.0, use this command:
esxcfg-rescan -d vmhba#
Note: The command to perform a rescan on the vmhba varies by method (for example, RCLI, vMA, or PowerCLI). For the appropriate commands, refer to the documentation.
- Removal of a LUN
The removal of a LUN through an array-based administration action followed by a manual rescan can cause virtual machines on other LUNs to also become unresponsive.
Solution
The issue is resolved in ESXi/ESX 4.1 Update 1 and the fix has also been included with ESXi 5.0. This issue is resolved in the patch release for ESX 4.0. For more information, see VMware ESX 4.0, Patch ESX400-200912401-BG: Updates vmkernel, vmklinux, tools, CIM, and perftools (1016291).
Notes:
- With ESXi/ESX 4.1 Update 1 and ESXi/ESX 4.0 Update 3, you do not need to modify the advanced setting. Virtual machines that are not associated with the APD volume(s) do not become unresponsive on a rescan.
- When unpresenting a LUN containing a datastore, follow the instructions in Removing a LUN containing a datastore from VMware ESXi/ESX 4.x (1029786). If the issue still persists, contact VMware Technical Support.
Workaround
ESXi/ESX 4.x can list all of the LUNs it detects, as well as the state of these LUNs. If none of the paths to a storage device are in the ACTIVE state, then ESXi/ESX considers the device to be in an all-paths-down state. If an all-paths-down state does exist, then this is likely the issue causing LUNs to be unresponsive, either for a limited period of time or permanently, when a rescan occurs. For more information, see Identifying disks when working with VMware ESX (1014953).
If virtual machines are not responding on an ESXi/ESX 4.0 host, determine if an all-paths-down condition exists by running the command:
# esxcfg-mpath --list-paths --device device_naa | grep state
or
# esxcfg-mpath --list-paths --device device_mpx | grep state
Where:
- device_naa is the Network Addressing Authority (NAA) unique address for the full storage device
- device_mpx is the identifier if a NAA ID is not available
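The rule described above (no path in the ACTIVE state means the device is treated as all-paths-down) can be sketched as a small filter over the path listing. This is a minimal sketch, not a VMware tool: check_apd is a hypothetical name, and it assumes each path is reported on a line containing the word "state" followed by a value such as active or dead, matched case-insensitively.

```shell
# Sketch only: check_apd is a hypothetical helper.
# It reads a path listing on stdin and classifies the device.
# Assumption: every path has a line containing "state" and its value.
check_apd() {
    # Collect the state lines; "|| true" keeps an empty match from
    # aborting the shell when errexit (set -e) is enabled.
    states=$(grep -i 'state' || true)
    if [ -z "$states" ]; then
        echo "UNKNOWN: no path state lines in input"
    elif printf '%s\n' "$states" | grep -qiw 'active'; then
        echo "OK: at least one active path"
    else
        echo "APD: no active path found"
    fi
}
```

One possible use is to pipe the listing into it, for example: esxcfg-mpath --list-paths --device device_naa | check_apd. The UNKNOWN branch exists so that an empty or unexpected listing (such as a mistyped device name) is not reported as an all-paths-down state.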
Starting with ESXi/ESX 4.0 Update 1, you can set an advanced configuration option on all hosts in the vCenter Server cluster to reduce rescan times and to prevent virtual machines from not responding. By default this option is disabled.
Caution: Not every all-paths-down condition is permanent. Some all-paths-down conditions, such as those that occur briefly during a network re-configuration, are transient. Enabling this option can cause devices in a transient all-paths-down state to become unavailable. VMware recommends disabling this option after the rescan operation completes.
To enable this option, run the command:
# esxcfg-advcfg -s 1 /VMFS3/FailVolumeOpenIfAPD
To disable and reset to the default value without requiring downtime, run the command:
# esxcfg-advcfg -s 0 /VMFS3/FailVolumeOpenIfAPD
To check the value of this option, run the command:
# esxcfg-advcfg -g /VMFS3/FailVolumeOpenIfAPD
Note: This does not apply to ESXi/ESX 4.0 Update 2 and 4.1 because the patch is integrated in these versions.
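Because the option is meant to be disabled again once the rescan completes, the enable, rescan, and disable steps could be wrapped together so that the reset is not forgotten. The following is only a sketch under stated assumptions: rescan_with_apd_option and the RUN variable are hypothetical, the rescan command is the one quoted earlier in this article for ESX 4.0, and the wrapper handles a single adapter on a single host.

```shell
# Sketch only: rescan_with_apd_option is a hypothetical wrapper.
# Usage: rescan_with_apd_option <adapter>   (for example, vmhba1)
# Set RUN=echo to print the commands instead of executing them.
rescan_with_apd_option() {
    hba=$1
    opt=/VMFS3/FailVolumeOpenIfAPD
    # Enable the option; stop here if it cannot be set.
    ${RUN:-} esxcfg-advcfg -s 1 "$opt" || return 1
    # Rescan the adapter and remember the exit status.
    ${RUN:-} esxcfg-rescan -d "$hba" && rc=0 || rc=$?
    # Always restore the default so that devices in a transient
    # all-paths-down state are not made unavailable afterwards.
    ${RUN:-} esxcfg-advcfg -s 0 "$opt"
    return "$rc"
}
```

Running it with RUN=echo first shows the three commands in order without touching the host. The option still has to be handled on every host in the cluster, which this sketch does not attempt.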
With ESX 4.1 Update 1 and ESX 4.0 Update 3, you no longer have to make the
modification to the advanced setting. Virtual machines that are not
associated with the APD volume(s) do not become unresponsive upon a
rescan.