As the title pretty much tells, I’ve been working on fixing the Root-Disk-Multipathing feature of our XenServer installations. Our XenServer boot from a HA-enabled NetApp controller, however we recently noticed that during a controller fail-over some, if not all, paths would go offline and never come back. If you do a cf takeover and cf giveback in short succession, you’ll end up with a XenServer host that is unusable, as the Root-Disk would be pretty much non-responsive.

Guessing from that, there don’t seem to be that many people using XenServer with Boot-from-SAN. Otherwise Citrix/NetApp would have fixed that by now…. Anyhow, I went around digging in our XenServer’s. What I already did, was adjust the /etc/multipath.conf according to a bug report (or TR-3373). For completeness sake I’ll list it here:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
# Multipathing configuration for XenServer on NetApp ALUA
# enabled storage.
# TR-3732, revision 5

defaults {
        user_friendly_names no
        queue_without_daenon no
        flush_on_last_del yes
}

## some vendor specific modifications
devices {
        device {
                vendor "NETAPP"
                product "LUN"
                path_grouping_policy group_by_prio
                features "1 queue_if_no_path"
                getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
                prio_callout "/sbin/mpath_prio_alua /dev/%n"
                path_checker directio
                failback immediate
                hardware handler "0"
                rr_weight uniform
                rr_min_io 128
        }
}

And as it turns out, this is the reason why we’re having such difficulties with the Multipathing. The information in TR-3373 is a bunch of BS (no, not everything but a single path is wrong, the getuid_callout) and thus the whole concept of Multipathing, Failover and High-Availibility (yeah, I know - if you want HA, don’t use XenServer 😛) is gone.