Xen Cluster with Debian GNU/Linux, DRBD, Corosync and Pacemaker

Jean Baptiste FAVRE

May 2010

The French version is available here: Cluster Xen sous Debian GNU/Linux avec DRBD, Corosync et Pacemaker

Introduction

Xen is one of the most advanced open source virtualization technologies. Virtualization allows easier server deployment and, as a start, enhances application availability. Thanks to live migration, admins can easily empty a host server (AKA dom0) in order to fix hardware issues or perform updates, without having to shut down virtual machines. But, at first look, all of this has to be done manually. Since the root cause of many outages is human error, it would be great to make it automatic. Hard to do? Not so much, follow the guide...

Cluster basics

According to Wikipedia, clustering consists of techniques that group a pool of physical, independent servers and make them work together for:

Cluster size may vary a lot. In fact, clusters range from 2 servers up to thousands of them.

Xen cluster requirements

In our example, we'll use 2 physical servers as dom0. Virtual machines will be spread across both dom0. So we have to:

Xen cluster constraints

Our cluster must comply with the following requirements:

Cluster architecture

Xen cluster architecture
                    Internet
                   _____|______
                  |___Switch___|
                      |    |
               _______|    |_______
   10.0.1.1/24|                    |10.0.1.2/24
          ____|____            ____|____
         |         |          |         |
         | Dom0 A  |          | Dom0 B  |
         |_________|          |_________|
              |                    |
192.168.1.1/24|                    |192.168.1.2/24
              |____________________|
              DRBD+Cluster+Migration

Both dom0 have a double network attachment. The first link is dedicated to WAN access, the other one is used for DRBD replication, cluster management and live migration.
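For reference, here is a possible /etc/network/interfaces for dom0 A, assuming the WAN bridge is built with bridge-utils rather than Xen's network-bridge script; the interface names, the br-wan bridge name (reused later in the domU configuration) and the 10.0.1.254 gateway are assumptions to adapt to your setup.

File /etc/network/interfaces (dom0 A, sketch)
auto lo
iface lo inet loopback

# WAN bridge the domUs will attach to (bridge=br-wan in the domU config files)
auto br-wan
iface br-wan inet static
	address 10.0.1.1
	netmask 255.255.255.0
	gateway 10.0.1.254
	bridge_ports eth0
	bridge_stp off
	bridge_fd 0

# Dedicated link: DRBD replication, cluster traffic and live migration
auto eth1
iface eth1 inet static
	address 192.168.1.1
	netmask 255.255.255.0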

Cluster installation on GNU/Linux Debian

As cluster stack, we will use Corosync and Pacemaker. The first one is the cluster messaging layer, the second one is the resource manager. Both are available in a dedicated Debian repository:

File /etc/apt/sources.list.d/corosync-pacemaker.list
deb http://people.debian.org/~madkiss/ha lenny main
Corosync and pacemaker install
apt-key adv --keyserver pgp.mit.edu --recv-key 1CFA3E8CD7145E30
aptitude update
aptitude install pacemaker

Once installed, it's time to configure the cluster. First, you need to generate the key used to authenticate cluster messages and nodes.

Initial configuration for Corosync (generate the key on dom0 A, then install it on dom0 B)
corosync-keygen
scp /etc/corosync/authkey nodeB:
mv ~/authkey /etc/corosync/authkey
chown root:root /etc/corosync/authkey
chmod 400 /etc/corosync/authkey

Then, you can configure Corosync itself so that it uses the right network interface.

File /etc/corosync/corosync.conf
totem {
	version: 2
	token: 3000
	token_retransmits_before_loss_const: 10
	join: 60
	consensus: 4320
	vsftype: none
	max_messages: 20
	clear_node_high_bit: yes
 	secauth: on
 	threads: 0
 	rrp_mode: none
 	interface {
		ringnumber: 0
		bindnetaddr: 192.168.1.0
		mcastaddr: 226.94.1.1
		mcastport: 5405
	}
}
amf {
	mode: disabled
}
service {
 	ver:       0
 	name:      pacemaker
}
aisexec {
        user:   root
        group:  root
}
logging {
    fileline: off
    to_stderr: yes
    to_logfile: no
    to_syslog: yes
	syslog_facility: daemon
        debug: off
        timestamp: on
        logger_subsys {
            subsys: AMF
            debug: off
            tags: enter|leave|trace1|trace2|trace3|trace4|trace6
        }
}

Now you can start the cluster on both nodes, right after having enabled Corosync in the file /etc/default/corosync with the option START=yes.

Corosync activation and start
/etc/init.d/corosync restart

You can check the cluster status with the command crm_mon --one-shot -V.

Cluster status check
crm_mon --one-shot -V
crm_mon[7363]: 2009/07/26_22:05:40 ERROR: unpack_resources: No STONITH resources have been defined
crm_mon[7363]: 2009/07/26_22:05:40 ERROR: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
crm_mon[7363]: 2009/07/26_22:05:40 ERROR: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity


============
Last updated: Fri Nov  6 21:03:51 2009
Stack: openais
Current DC: nodeA - partition with quorum
Version: 1.0.7-54d7869bfe3691eb723b1d47810e5585d8246b58
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ nodeA nodeB ]

The cluster is telling us that it did not find any STONITH resource, and that's really bad.

STONITH means "Shoot The Other Node In The Head". This is one of the most important features of a cluster: the ability to make sure a resource is never duplicated. Consider a resource like a virtual IP address: having the same IP address on both cluster nodes is the best way to make your cluster fail. For domUs it's about the same: they must not run on both dom0 at the same time.
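For the record, here is a sketch of what a STONITH setup could look like with the external/ipmi agent, assuming servers fitted with IPMI management boards; the address and credentials below are placeholders, the location constraint keeps the fencing device away from the node it is meant to kill, and a second primitive would of course be needed for nodeB.

STONITH configuration sketch (not used in this guide)
crm configure primitive stonith-nodeA stonith:external/ipmi \
	params hostname="nodeA" ipaddr="10.0.1.11" userid="admin" passwd="secret" interface="lan" \
	op monitor interval="60s"
crm configure location stonith-nodeA-placement stonith-nodeA -inf: nodeA
crm configure property stonith-enabled=true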

Cluster configuration

Our cluster is now configured with basic options. It's time to set up a few more:

stonith
You know it by now: a production environment must have STONITH configured. That said, we won't use it here (don't be afraid, Google is your friend, even if it's somewhat evil).
quorum
It's an election mechanism which helps the cluster decide whether it can work properly or not. It's useless for a two-node cluster.
resource default stickiness
Allows resources to stay on the node they are running on instead of automatically failing back. This gives the admin enough time to make sure the issue is really fixed before putting the failed node back into the cluster pool.

To configure these options, we'll use the cluster's integrated shell, provided by Pacemaker. The shell is launched with the command crm.

Applying default cluster's options
crm
crm(live)# configure
crm(live)configure# property no-quorum-policy=ignore
crm(live)configure# property stonith-enabled=false
crm(live)configure# property default-resource-stickiness=1000
crm(live)configure# commit
crm(live)configure# bye

It's now time to install Xen.

LVM configuration

Our Xen cluster uses LVM. A summary of the main LVM commands can be found here (French only, sorry guys): LVM: Logical Volume Manager

You must create a VG named XenHosting. In this VG, create an LV named cluster-ocfs.

Both operations have to be done on both dom0.
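A minimal sketch with the standard LVM tools; the physical volume /dev/sdb and the LV size are assumptions, adapt them to your disks and to the space your domUs will need.

LVM setup (on both dom0)
pvcreate /dev/sdb
vgcreate XenHosting /dev/sdb
lvcreate -n cluster-ocfs -L 5G XenHosting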

DRBD installation

DRBD installation is detailed here (French only): Installation de DRBD > 8.3.2 sur Debian GNU/Linux Lenny

DRBD configuration

A summary of the main DRBD commands can be found here (French only): DRBD: Distributed Replicated Block Device

Here, you have to set up the global options and the cluster-ocfs resource to get drbd0 working:

File /etc/drbd.conf
global {
        usage-count yes;
        minor-count 16;
}
common {
        syncer {
                rate 33M;
        }
        protocol C;
}
resource cluster-ocfs {
  meta-disk internal;
  device /dev/drbd0;
  disk {
    fencing resource-only;
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  on nodeA {
    disk /dev/XenHosting/cluster-ocfs;
    address 192.168.1.1:7800;
  }
  on nodeB {
    disk /dev/XenHosting/cluster-ocfs;
    address 192.168.1.2:7800;
  }
  net {
    allow-two-primaries;
  }
  startup {
    become-primary-on both;
  }
}
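Once /etc/drbd.conf is identical on both nodes, the resource still has to be initialized. A minimal sketch, assuming an empty LV underneath; the forced primary only triggers the initial full sync, promotion will later be handled by Pacemaker.

DRBD resource initialization
drbdadm create-md cluster-ocfs                             # on both nodes
drbdadm up cluster-ocfs                                    # on both nodes
drbdadm -- --overwrite-data-of-peer primary cluster-ocfs   # on the first node only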

OCFS2 installation

Many other clustered filesystems exist. For this example, we'll use OCFS2.

OCFS2 installation
aptitude install ocfs2-tools

OCFS2 configuration

One word about OCFS2. In a perfect world, we would manage OCFS2 with Pacemaker. That won't be the case here (I had issues with the lock management layer, which is mandatory when Pacemaker manages OCFS2).

OCFS2 configuration and start
File /etc/ocfs2/cluster.conf
node:
        ip_port = 7777
        ip_address = 192.168.1.1
        number = 0
        name = nodeA
        cluster = XenHosting1

node:
        ip_port = 7777
        ip_address = 192.168.1.2
        number = 1
        name = nodeB
        cluster = XenHosting1

cluster:
        node_count = 2
        name = XenHosting1

File /etc/default/o2cb
#
# This is a configuration file for automatic startup of the O2CB
# driver.  It is generated by running 'dpkg-reconfigure ocfs2-tools'.
# Please use that method to modify this file.
#

# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=XenHosting1

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=31

# O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
O2CB_IDLE_TIMEOUT_MS=30000

# O2CB_KEEPALIVE_DELAY_MS: Max. time in ms before a keepalive packet is sent.
O2CB_KEEPALIVE_DELAY_MS=2000

# O2CB_RECONNECT_DELAY_MS: Min. time in ms between connection attempts.
O2CB_RECONNECT_DELAY_MS=2000

/etc/init.d/o2cb restart
/etc/init.d/ocfs2 restart
mkfs.ocfs2 /dev/drbd/by-res/cluster-ocfs

OCFS2 cluster resource configuration

We did not specify any mountpoint for the OCFS2 filesystem we just created. I did not forget anything: Pacemaker will take care of everything for us.

But let's wait a couple of minutes before going on, and think. The only goal of OCFS2 here is to let us share configuration files between nodes without having to copy them around. But DRBD replication has to be running before the filesystem is mounted: if the DRBD resource is not enabled, or not in Master state, we must not mount the filesystem.

That said, we need a DRBD resource managed by the cluster and promoted to Master on both nodes, a Filesystem resource for the OCFS2 mount cloned on both nodes, and an ordering constraint so that the mount is only attempted once DRBD has been promoted.

Here we go:

OCFS2 cluster configuration
crm configure
crm(live)configure# primitive Cluster-FS-DRBD ocf:linbit:drbd \
	params drbd_resource="cluster-ocfs" \
	operations $id="Cluster-FS-DRBD-ops" \
	op monitor interval="20" role="Master" timeout="20" \
	op monitor interval="30" role="Slave" timeout="20" \
	meta target-role="started"
crm(live)configure# ms Cluster-FS-DRBD-Master Cluster-FS-DRBD \
	meta resource-stickiness="100" master-max="2" notify="true" interleave="true"
crm(live)configure# primitive Cluster-FS-Mount ocf:heartbeat:Filesystem \
	params device="/dev/drbd/by-res/cluster-ocfs" directory="/cluster" fstype="ocfs2" \
	meta target-role="started"
crm(live)configure# clone Cluster-FS-Mount-Clone Cluster-FS-Mount \
	meta interleave="true" ordered="true"
crm(live)configure# order Cluster-FS-After-DRBD inf: \
	Cluster-FS-DRBD-Master:promote \
	Cluster-FS-Mount-Clone:start
crm(live)configure# commit

Provided you have created the /cluster directory on both nodes, your clustered FS will "automagically" be mounted.
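If the mountpoint does not exist yet, create it first:

Mountpoint creation (on both dom0)
mkdir -p /cluster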

Status check
crm configure show
node nodeA \
	attributes standby="off"
node nodeB \
	attributes standby="off"
primitive Cluster-FS-DRBD ocf:linbit:drbd \
	params drbd_resource="cluster-ocfs" \
	operations $id="Cluster-FS-DRBD-ops" \
	op monitor interval="20" role="Master" timeout="20" \
	op monitor interval="30" role="Slave" timeout="20" \
	meta target-role="started"
primitive Cluster-FS-Mount ocf:heartbeat:Filesystem \
	params device="/dev/drbd/by-res/cluster-ocfs" directory="/cluster" fstype="ocfs2" \
	meta target-role="started"
ms Cluster-FS-DRBD-Master Cluster-FS-DRBD \
	meta resource-stickiness="100" master-max="2" notify="true" interleave="true"
clone Cluster-FS-Mount-Clone Cluster-FS-Mount \
	meta interleave="true" ordered="true"
order Cluster-FS-After-DRBD inf: Cluster-FS-DRBD-Master:promote Cluster-FS-Mount-Clone:start
property $id="cib-bootstrap-options" \
	dc-version="1.0.7-54d7869bfe3691eb723b1d47810e5585d8246b58" \
	cluster-infrastructure="openais" \
	last-lrm-refresh="1266431899" \
	node-health-red="0" \
	stonith-enabled="false" \
	no-quorum-policy="ignore" \
	expected-quorum-votes="2"

Xen installation

I won't detail here the steps to get Xen working. For more information, please refer to: Xen: Virtualisation sous GNU/Debian Linux

Anyway, don't create any domU for now. Even though migrating a stand-alone virtualization setup to a clustered one is quite easy, it's simpler to start from scratch, especially if you are new to clusters.

Xen configuration

Xen dom0 configuration will stay in /etc/xen/, but domU configuration will live in /cluster.

Xen domU configuration tree
tree /cluster
/cluster
|-- CD_iso
|   |-- centos
|   |   |-- CentOS-5.4-i386-netinstall.iso
|   |   `-- CentOS-5.4-x86_64-netinstall.iso
|   `-- debian
|       |-- debian-503-amd64-netinst.iso
|       |-- debian-504-amd64-netinst.iso
|       `-- debian-504-i386-netinst.iso
|-- apt
|   |-- drbd8-module-2.6.26-2-xen-amd64_8.3.7-0+2.6.26-21lenny3_amd64.deb
|   `-- drbd8-utils_8.3.7-0_amd64.deb
`-- xen
    |-- xps-101.cfg
    |-- xps-102.cfg
    |-- xps-103.cfg
    |-- xps-104.cfg
    `-- xps-105.cfg

As you can see, I also use the cluster FS to share a few other things I need on both nodes, like ISO images and Debian packages, as well as the domU configuration files.

To let the cluster stack deal with domU management, we have to disable domU auto-start and state saving.

Let's adapt the Xen configuration

File /etc/default/xendomains
## Type: string
## Default: /var/lib/xen/save
#
# Directory to save running domains to when the system (dom0) is
# shut down. Will also be used to restore domains from if XENDOMAINS_RESTORE
# is set (see below). Leave empty to disable domain saving on shutdown
# (e.g. because you rather shut domains down).
# If domain saving does succeed, SHUTDOWN will not be executed.
#
XENDOMAINS_SAVE=

## Type: string
## Default: "--all --halt --wait"
#
# After we have gone over all virtual machines (resp. all automatically
# started ones, see XENDOMAINS_AUTO_ONLY below) in a loop and sent SysRq,
# migrated, saved and/or shutdown according to the settings above, we
# might want to shutdown the virtual machines that are still running
# for some reason or another. To do this, set this variable to
# "--all --halt --wait", it will be passed to xm shutdown.
# Leave it empty not to do anything special here.
# (Note: This will hit all virtual machines, even if XENDOMAINS_AUTO_ONLY
# is set.)
# 
XENDOMAINS_SHUTDOWN_ALL=

## Type: boolean
## Default: true
#
# This variable determines whether saved domains from XENDOMAINS_SAVE
# will be restored on system startup. 
#
XENDOMAINS_RESTORE=false

## Type: string
## Default: /etc/xen/auto
#
# This variable sets the directory where domains configurations
# are stored that should be started on system startup automatically.
# Leave empty if you don't want to start domains automatically
# (or just don't place any xen domain config files in that dir).
# Note that the script tries to be clever if both RESTORE and AUTO are 
# set: It will first restore saved domains and then only start domains
# in AUTO which are not running yet. 
# Note that the name matching is somewhat fuzzy.
#
XENDOMAINS_AUTO=

## Type: boolean
## Default: false
# 
# If this variable is set to "true", only the domains started via config 
# files in XENDOMAINS_AUTO will be treated according to XENDOMAINS_SYSRQ,
# XENDOMAINS_MIGRATE, XENDOMAINS_SAVE, XENDMAINS_SHUTDOWN; otherwise
# all running domains will be. 
# Note that the name matching is somewhat fuzzy.
# 
XENDOMAINS_AUTO_ONLY=true

Before restarting xend:

Disable xendomains service start
update-rc.d -f xendomains remove
Removing any system startup links for /etc/init.d/xendomains ...
   /etc/rc0.d/K20xendomains
   /etc/rc1.d/K20xendomains
   /etc/rc2.d/S20xendomains
   /etc/rc3.d/S20xendomains
   /etc/rc4.d/S20xendomains
   /etc/rc5.d/S20xendomains
   /etc/rc6.d/K20xendomains

You can now restart xend and then set up your first domU.
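For reference, on Debian Lenny this is simply the following (the init script is shipped with the Xen utils packages, adapt if yours differ):

Restart xend
/etc/init.d/xend restart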

Xen domU configuration

We'll set up an HVM domU.

File /cluster/xen/xps-101.cfg
#
# HVM configuration file
# Linux
#
################################
kernel       = '/usr/lib/xen-3.2-1/boot/hvmloader'
device_model = '/usr/lib/xen-3.2-1/bin/qemu-dm'
builder      = 'hvm'
memory       = '256'
vcpus        = '1'
cpus         = '1'
localtime    = 0
serial       = 'pty'
#
# Disk settings
# boot on floppy (a), hard disk (c) or CD-ROM (d) or Network (n)
boot         = 'dcn'
disk         = [ 'phy:/dev/drbd/by-res/xps-101,ioemu:hda,w',
                 'file:/cluster/CD_iso/debian-504-amd64-netinst.iso,hdc:cdrom,r' ]
#
# Network settings
vif          = [ 'bridge=br-wan,type=ioemu,mac=00:16:3E:01:01:65,ip=10.0.1.101,vifname=xps-101.eth0' ]
#
# Behavior
on_poweroff  = 'destroy'
on_reboot    = 'restart'
on_crash     = 'restart'
extra = "console=tty xencons=tty clocksource=jiffies notsc pci=noacpi"
#
# VNC settings for installation and/or recovery
vfb = [ 'type=vnc,vnclisten=127.0.0.1,vncdisplay=1' ]
keymap       = "fr"
#
# Virtual machine settings
name         = 'xps-101'
hostname     = 'xps-101.mydomain.tld'

This is our first domU config file. The domU gets a CD-ROM drive as well as a VNC console, and the CD-ROM is available at boot. You'll be able to reach the VNC console through an SSH tunnel:

SSH tunnel to connect to VNC console
ssh user@nodeA -f -N -L5901:127.0.0.1:5901

VNC console is now available on localhost, TCP port 5901.
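Any VNC client will then do; for example, assuming a viewer such as xtightvncviewer is installed on your workstation (display :1 maps to TCP port 5901):

VNC console access through the tunnel
vncviewer 127.0.0.1:1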

You can now install the domU, or you can integrate it into the cluster right away. If you install it now, you'll have to shut it down before setting up the cluster resources.

Anyway, you first have to set up the DRBD resource backing the domU. Details are here: DRBD: Distributed Replicated Block Device.
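For reference, a sketch of what this resource could look like, following the cluster-ocfs pattern above; the LV name xps-101, the device /dev/drbd1 and the port 7801 are assumptions.

File /etc/drbd.conf (additional resource)
resource xps-101 {
  meta-disk internal;
  device /dev/drbd1;
  net {
    allow-two-primaries;
  }
  startup {
    become-primary-on both;
  }
  on nodeA {
    disk /dev/XenHosting/xps-101;
    address 192.168.1.1:7801;
  }
  on nodeB {
    disk /dev/XenHosting/xps-101;
    address 192.168.1.2:7801;
  }
}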

domU integration into cluster

To get a domU integrated into the cluster, we need to: create a DRBD resource for the domU and have the cluster manage it as a dual-Master resource, define a Xen resource for the domU itself, and, last point, make sure the Xen resource only runs on a node where the DRBD resource is Master.

The last point is very important. Starting with version 8.3.2, DRBD integrates fully with a Pacemaker cluster. But DRBD can still end up in split-brain, and you will have to fix that manually (automatic split-brain recovery is really not an option in production clusters). If you try to start a Xen resource on a node where the DRBD resource is not Master, the cluster stack will register the failed attempt and prevent further starts on this node, effectively creating a constraint. To fix it, you'll have to dig into the cluster configuration and manually remove the constraint.
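A sketch of the usual recovery steps with the crm shell, using the resource and node names of this example; cleanup clears the recorded failures so that the node becomes eligible again.

Clearing a failed start attempt
crm resource failcount xps-101 show nodeA
crm resource cleanup xps-101
crm_mon --one-shot -V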

Xen domU configuration into Corosync / Pacemaker cluster
crm configure
crm(live)configure# primitive xps-101-DRBD ocf:linbit:drbd \
	params drbd_resource="xps-101" \
	operations $id="xps-101-DRBD-ops" \
	op monitor interval="20" role="Master" timeout="20" \
	op monitor interval="30" role="Slave" timeout="20" \
	meta target-role="started"
crm(live)configure# ms xps-101-MS xps-101-DRBD \
	meta resource-stickiness="100" master-max="2" notify="true" interleave="true"
crm(live)configure# primitive xps-101 ocf:heartbeat:Xen \
	params xmfile="/cluster/xen/xps-101.cfg" \
	op monitor interval="10s" \
	meta target-role="started" allow-migrate="true"
crm(live)configure# colocation xps-101-Xen-DRBD inf: xps-101 xps-101-MS:Master
crm(live)configure# commit

Now the domU should start. If you prepared the physical DRBD resource on the other node, you should be able to live migrate your domU. If you did not, you really should ;-)

Last but not least, you don't need to repeat this configuration on the second node: it has already been propagated by the cluster.

When defining the Xen resource, the attribute meta allow-migrate="true" enables domU live migration. If it's not set and you migrate the resource, the domU will be stopped on the first node and then started on the second one.

Dealing with cluster resources

Here is a summary of the main cluster management commands.

Main cluster management commands
crm resource migrate xps-101
crm node standby nodeB
crm node online nodeB
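Note that migrate pins the resource to its new node with a location constraint; once the maintenance is over, drop it with the standard crm command:

Remove the migration constraint
crm resource unmigrate xps-101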

Sources

Format

This document is available in the following formats:

About Jean Baptiste FAVRE

I'm a system engineer specialized in Linux / Unix. I mainly work on virtualization and web performance. From time to time, I find time to read some books while listening to classical music. But I always get back to my keyboard quickly.

License

This documentation is published under a Creative Commons by-nc-sa license.


Index

  1. Introduction
  2. Cluster basics
  3. Xen cluster requirements
  4. Xen cluster constraints
  5. Cluster architecture
  6. Cluster installation on GNU/Linux Debian
  7. Cluster configuration
  8. LVM configuration
  9. DRBD installation
  10. DRBD configuration
  11. OCFS2 installation
  12. OCFS2 configuration
  13. OCFS2 cluster resource configuration
  14. Xen installation
  15. Xen configuration
  16. Xen domU configuration
  17. domU integration into cluster
  18. Dealing with cluster resources
  19. Sources
  20. Format
  21. About ...
  22. License