Not too long ago, one of my XenServer hosts crashed due to a hardware failure. It was not my Pool Master. When a XenServer host crashes, the virtual machines that were running on it can’t be controlled, so we’re unable to tell them to start up on another host. The XenServer pool still considers them running, but they’re no longer listed in XenCenter. Hopefully this can help some of you get back up and running a little quicker.

The first couple of steps were taken from Citrix’s XenServer System Recovery Guide. It’s short and straightforward, so if you need more information on these steps, please refer to it.

These commands need to be run from a functioning member of the pool. The first thing to do is find the UUID of the failed server.

[root@XS01 ~]# xe host-list params=uuid,name-label,host-metrics-live

uuid ( RO)                 : 542926d8-a1c6-43a9-8ee3-92072214e9bb
           name-label ( RW): XS01
    host-metrics-live ( RO): true

uuid ( RO)                 : df28090c-cde8-495e-a70e-3e0b2879fdcb
           name-label ( RW): XS02
    host-metrics-live ( RO): false

uuid ( RO)                 : f2f13fcf-8ff4-4bbb-b281-af4cc5db91cb
           name-label ( RW): XS03
    host-metrics-live ( RO): true

From the above output we can see that the XS02 host is the server that died and it has a UUID of df28090c-cde8-495e-a70e-3e0b2879fdcb. But now the question is “Which VMs were running on this server?”
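In a pool with more than a handful of hosts, picking the dead one out by eye gets tedious. As a rough sketch, the helper below filters the host-list output down to the hosts reporting host-metrics-live as false. The function name list_dead_hosts is my own invention (it is not an xe command), and it assumes the output keeps the record layout shown above, with a blank line between hosts.

```shell
# Print "UUID NAME" for every host whose host-metrics-live field is false.
# Reads the record-style output of `xe host-list` on stdin.
# list_dead_hosts is a hypothetical helper name, not part of the xe CLI.
list_dead_hosts() {
    awk '
        # Everything after the first ": " on the line is the field value.
        function val() { return substr($0, index($0, ": ") + 2) }
        # Called at the end of each record: report it if the host is dead.
        function emit() {
            if (uuid != "" && live == "false") print uuid, name
            uuid = name = live = ""
        }
        { sub(/[ \t\r]+$/, "") }                  # drop trailing whitespace
        /^uuid /                    { uuid = val() }
        /^[ \t]*name-label /        { name = val() }
        /^[ \t]*host-metrics-live / { live = val() }
        /^$/                        { emit() }    # blank line ends a record
        END                         { emit() }    # last record has no blank line
    '
}

# Usage on a working pool member (not executed here):
#   xe host-list params=uuid,name-label,host-metrics-live | list_dead_hosts
```

With the failed host's UUID in hand, the next command answers that question.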

[root@XS01 ~]# xe vm-list is-control-domain=false resident-on=df28090c-cde8-495e-a70e-3e0b2879fdcb

uuid ( RO)           : f763254e-b77f-30a4-0a5c-8a43d880d5dc
     name-label ( RW): VMSERVER01
    power-state ( RO): running

uuid ( RO)           : 7ccbe7b2-089a-412e-8e88-af2175bb0c4a
     name-label ( RW): VMSERVER02
    power-state ( RO): running

The resident-on parameter in the above command is the UUID of the failed server. We can see that I had two VMs running on that server. Even though the pool says they are running, we should check. Depending on the reason we lost the host, the VMs may actually still be running. To be sure, I tried to ping both VMs, and neither responded. If yours are still up, you may be able to RDP to them and shut them down cleanly.

Let’s now tell the pool that the VMs are powered off.

[root@XS01 ~]# xe vm-reset-powerstate resident-on=df28090c-cde8-495e-a70e-3e0b2879fdcb --force --multiple

Caution!  Incorrectly using the  --multiple option could result in ALL virtual
machines within the pool being reset. Be careful to use the  resident-on
parameter as well.  Alternately, you can reset VMs individually.

This will reset the power state and we can now see the virtual machines in XenCenter.
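If --multiple makes you nervous, the VMs can be reset one at a time instead. Here is a small sketch of that approach: it takes the comma-separated UUID list that xe prints with the --minimal flag and turns it into one vm-reset-powerstate command per VM. The helper name print_reset_commands is hypothetical, and it only prints the commands so you can read them over before running anything.

```shell
# Turn a comma-separated list of VM UUIDs (the format `xe vm-list --minimal`
# prints) into one vm-reset-powerstate command per VM.
# Commands are printed, not executed, so they can be reviewed first.
# print_reset_commands is a hypothetical helper name, not part of the xe CLI.
print_reset_commands() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r vm_uuid; do
        # Skip empty entries (for example when no VMs were found).
        if [ -n "$vm_uuid" ]; then
            printf 'xe vm-reset-powerstate uuid=%s --force\n' "$vm_uuid"
        fi
    done
}

# Usage on a working pool member (not executed here):
#   FAILED_HOST=df28090c-cde8-495e-a70e-3e0b2879fdcb
#   print_reset_commands "$(xe vm-list is-control-domain=false resident-on=$FAILED_HOST --minimal)"
```

Printing first is a deliberate choice: a wrong UUID here marks a live VM as halted, so a quick look before running the commands is cheap insurance.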

Now is a good time to ping all of the affected VMs again. Depending on how the host crashed, the VMs may actually still be running. If you can still ping them, try to shut them down cleanly (we don’t want two copies of the same VM running). Once you have verified that the VMs are actually shut down, start them back up on another host using XenCenter.

Again, in my situation, we had a hardware failure and I will not be booting this host back up into the pool. To remove the server from the pool I did the following.

[root@XS01 ~]# xe host-forget uuid=df28090c-cde8-495e-a70e-3e0b2879fdcb

Again, the UUID here is the UUID of the failed server. Now the host should no longer appear in XenCenter and it’s nice and clean… Wait a minute! What’s this?! Now the local Storage Repositories that were on the failed server appear! OK, we need to tell the pool to forget those as well.

First we have to find the UUIDs of the SRs that were on the server.

[root@XS01 ~]# xe sr-list params=uuid,name-label,host

Look through the list. You should see some SRs where the host says “Host Unknown”, “Host not found” or something similar. These are likely to be the SRs that were on the failed host. Verify you have the correct SRs and then tell the pool to forget them.
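When the SR list is long, a filter can narrow it down. The sketch below prints the UUID and name of every SR whose host field matches a pattern you pass in. The helper name list_srs_by_host is hypothetical, and the pattern is deliberately left to you: I did not record the exact wording xe prints for an orphaned SR, so check your own sr-list output first and use whatever text appears there. It also assumes sr-list uses the same record layout as the host-list output earlier.

```shell
# Print "UUID NAME" for each SR whose host field matches the given pattern
# (case-insensitive). Reads `xe sr-list params=uuid,name-label,host` on stdin.
# list_srs_by_host is a hypothetical helper name, not part of the xe CLI.
# The pattern is an assumption: use the wording your own xe output shows.
list_srs_by_host() {
    awk -v pat="$1" '
        # Everything after the first ": " on the line is the field value.
        function val() { return substr($0, index($0, ": ") + 2) }
        # Called at the end of each record: report it if the host matches.
        function emit() {
            if (uuid != "" && tolower(host) ~ tolower(pat)) print uuid, name
            uuid = name = host = ""
        }
        { sub(/[ \t\r]+$/, "") }              # drop trailing whitespace
        /^uuid /             { uuid = val() }
        /^[ \t]*name-label / { name = val() }
        /^[ \t]*host /       { host = val() }
        /^$/                 { emit() }       # blank line ends a record
        END                  { emit() }       # last record has no blank line
    '
}

# Usage on a working pool member (not executed here):
#   xe sr-list params=uuid,name-label,host | list_srs_by_host 'not found'
```

Treat the result as a shortlist only. Each SR confirmed this way can then be forgotten with the command below.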

[root@XS01 ~]# xe sr-forget uuid=[UUID_of_Storage_Repository]

Run the above command for each abandoned SR. Once done, XenCenter should be all cleaned up and we can now focus on rebuilding a replacement server.

Gregory Strike

Husband, father, IT dude & blogger wrapped up into one good looking package.