Not too long ago, one of my XenServer hosts crashed due to a hardware failure. It was not my Pool Master. When a XenServer host crashes, the virtual machines running on the failed server can’t be controlled, so we’re unable to tell them to start up on another host. The XenServer pool still considers them running; however, they’re not listed in XenCenter. Hopefully this can help some of you get back up and running a little quicker.
The first couple of steps were taken from Citrix’s document, XenServer System Recovery Guide. It’s short and straightforward, so if you need more information on these steps, please refer to it.
These commands need to be run from a functioning member of the pool. The first thing to do is find the UUID of the failed server.
[root@XS01 ~]# xe host-list params=uuid,name-label,host-metrics-live
uuid ( RO)                 : 542926d8-a1c6-43a9-8ee3-92072214e9bb
           name-label ( RW): XS01
    host-metrics-live ( RO): true

uuid ( RO)                 : df28090c-cde8-495e-a70e-3e0b2879fdcb
           name-label ( RW): XS02
    host-metrics-live ( RO): false

uuid ( RO)                 : f2f13fcf-8ff4-4bbb-b281-af4cc5db91cb
           name-label ( RW): XS03
    host-metrics-live ( RO): true
From the above output we can see that the XS02 host is the server that died and it has a UUID of df28090c-cde8-495e-a70e-3e0b2879fdcb. But now the question is “Which VMs were running on this server?”
[root@XS01 ~]# xe vm-list is-control-domain=false resident-on=df28090c-cde8-495e-a70e-3e0b2879fdcb
uuid ( RO)           : f763254e-b77f-30a4-0a5c-8a43d880d5dc
     name-label ( RW): VMSERVER01
    power-state ( RO): running

uuid ( RO)           : 7ccbe7b2-089a-412e-8e88-af2175bb0c4a
     name-label ( RW): VMSERVER02
    power-state ( RO): running
The resident-on parameter in the above command is the UUID of the failed server. We can see that I had two VMs running on that server. Even though the pool says they are running, we should check: depending on why we lost the host, the VMs may actually still be running. To be sure, I tried to ping both VMs, and neither was up. If yours are up, you may be able to RDP to them and shut them down cleanly.
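On a pool with more than a handful of hosts, picking the dead one out of the host-list output by eye gets tedious. Here is a minimal sketch of a filter that does it for you. The function name dead_hosts is my own invention, not part of the xe CLI, and the sketch assumes the output has the same "key ( RO): value" layout shown earlier.

```shell
#!/bin/sh
# dead_hosts (hypothetical helper, not part of xe): reads the output of
#   xe host-list params=uuid,name-label,host-metrics-live
# on stdin and prints "UUID NAME" for every host whose
# host-metrics-live field is "false".
dead_hosts() {
    awk '
        # Print the record collected so far if the host is not live,
        # then clear the fields for the next record.
        function emit_record() {
            if (live == "false") print uuid, name
            uuid = name = live = ""
        }
        # A blank line ends a record.
        /^[ \t]*$/ { emit_record(); next }
        {
            key = $1                       # e.g. "uuid", "name-label"
            val = $0
            sub(/^[^:]*:[ \t]*/, "", val)  # keep only the text after the first colon
            if (key == "uuid") { emit_record(); uuid = val }
            else if (key == "name-label") name = val
            else if (key == "host-metrics-live") live = val
        }
        END { emit_record() }
    '
}

# Intended usage on a working pool member (not executed here):
#   xe host-list params=uuid,name-label,host-metrics-live | dead_hosts
```

The UUID it prints is the value you would pass as resident-on to the vm-list command. Treat the result as a hint and still confirm by hand that the host is really gone before acting on it.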
Let’s now tell the pool that the VMs are powered off.
[root@XS01 ~]# xe vm-reset-powerstate resident-on=df28090c-cde8-495e-a70e-3e0b2879fdcb --force --multiple
Caution! Incorrectly using the --multiple option could result in ALL virtual machines within the pool being reset. Be careful to use the resident-on parameter as well. Alternately, you can reset VMs individually.
This will reset the power state and we can now see the virtual machines in XenCenter.
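If the --multiple option makes you nervous, resetting the VMs one at a time is easy to script. The sketch below only prints the commands so you can review them before running anything. The function name reset_commands is made up for this example, and it assumes you feed it the comma-separated UUID list that xe prints when you add params=uuid --minimal to the vm-list command.

```shell
#!/bin/sh
# reset_commands (hypothetical helper): takes one argument, a
# comma-separated list of VM UUIDs, and prints one
# "xe vm-reset-powerstate" command per VM. Nothing is executed.
reset_commands() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r vm; do
        # Skip empty entries, e.g. when the list itself is empty.
        if [ -n "$vm" ]; then
            printf 'xe vm-reset-powerstate uuid=%s --force\n' "$vm"
        fi
    done
}

# Intended usage on a working pool member (not executed here):
#   vms=$(xe vm-list is-control-domain=false resident-on=<failed-host-uuid> params=uuid --minimal)
#   reset_commands "$vms"          # review the output first
#   reset_commands "$vms" | sh     # then run it
```

Printing first and piping to sh second is a deliberate choice: you get to see exactly which VMs will be touched before anything changes in the pool.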
Now is a good time to ping all the affected VMs again. Depending on how the host crashed, the VMs may actually still be running. If you can still ping them, try to shut them down cleanly (we don’t want to start up two copies of the same VM). Once you have verified that the VMs are actually shut down, start them back up on another host using XenCenter.
Again, in my situation, we had a hardware failure and I will not be booting this host back up into the pool. To remove the server from the pool I did the following.
[root@XS01 ~]# xe host-forget uuid=df28090c-cde8-495e-a70e-3e0b2879fdcb
Again, the UUID here is the UUID of the failed server. Now the host should no longer appear in XenCenter and it’s nice and clean… Wait a minute! What’s this?! Now the local Storage Repositories that were on the failed server appear! Ok, we need to tell the pool to forget those as well.
First we have to find the UUIDs of the SRs that were on the server.
[root@XS01 ~]# xe sr-list params=uuid,name-label,host
Look through the list. You should see some SRs where the host says “Host Unknown”, “Host not found”, or something similar. These are likely to be the SRs that were on the failed host. Verify you have the correct SRs and then tell the pool to forget them.
[root@XS01 ~]# xe sr-forget uuid=[UUID_of_Storage_Repository]
Run the above command for each abandoned SR UUID. Once done, XenCenter should be all cleaned up and we can now focus on rebuilding a replacement server.
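If the SR list is long, a small filter can narrow down the candidates. This is only a sketch: orphan_srs is a name I made up, and because I can't promise what exact text your version prints in the host field for an orphaned SR, the function takes the opposite approach. You pass it every host value you consider healthy (your remaining host names, plus whatever your pool shows for shared SRs), and it prints the SRs whose host field matches none of them.

```shell
#!/bin/sh
# orphan_srs (hypothetical helper): reads the output of
#   xe sr-list params=uuid,name-label,host
# on stdin. Each argument is a host value considered healthy.
# Prints "UUID<TAB>NAME<TAB>HOST" for every SR whose host field
# is not one of those values.
orphan_srs() {
    KNOWN_HOSTS=$(printf '%s\n' "$@") awk '
        # Print the collected SR if its host is not in the known list,
        # then clear the fields for the next record.
        function emit_record() {
            if (uuid != "" && !(host in known)) print uuid "\t" name "\t" host
            uuid = name = host = ""
        }
        BEGIN {
            # Known host values arrive one per line via the environment.
            n = split(ENVIRON["KNOWN_HOSTS"], list, "\n")
            for (i = 1; i <= n; i++) known[list[i]] = 1
        }
        /^[ \t]*$/ { emit_record(); next }
        {
            key = $1
            val = $0
            sub(/^[^:]*:[ \t]*/, "", val)  # keep only the text after the first colon
            if (key == "uuid") { emit_record(); uuid = val }
            else if (key == "name-label") name = val
            else if (key == "host") host = val
        }
        END { emit_record() }
    '
}

# Intended usage on a working pool member (not executed here):
#   xe sr-list params=uuid,name-label,host | orphan_srs XS01 XS03
```

The output is a list of candidates only. Check each one against what you know was on the failed host before handing its UUID to xe sr-forget.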