Virtual Block Devices / Virtual Disks in Xen - HOWTO
====================================================

HOWTO for Xen 1.2

Mark A. Williamson (mark.a.williamson@intel.com)
(C) Intel Research Cambridge 2004

Introduction
------------

This document describes the new Virtual Block Device (VBD) and Virtual Disk
features available in Xen release 1.2.  First, a brief introduction to some
basic disk concepts on a Xen system:

Virtual Block Devices (VBDs):
	VBDs are the disk abstraction provided by Xen.  All XenoLinux disk accesses
	go through the VBD driver.  Using the VBD functionality, it is possible
	to selectively grant domains access to portions of the physical disks
	in the system.

	A virtual block device can also consist of multiple extents from the
	physical disks in the system, allowing them to be accessed as a single
	uniform device from the domain with access to that VBD.  The
	functionality is somewhat similar to that underpinning LVM, since
	you can combine multiple regions from physical devices into a single
	logical device, from the point of view of a guest virtual machine.

	Everyone who boots Xen / XenoLinux from a hard drive uses VBDs
	but for some uses they can almost be ignored.

Virtual Disks (VDs):
	VDs are an abstraction built on top of the functionality provided by
	VBDs.  The VD management code maintains a "free pool" of disk space on
	the system that has been reserved for use with VDs.  The tools can
	automatically allocate collections of extents from this free pool to
	create "virtual disks" on demand.

	VDs can then be used just like normal disks by domains.  VDs appear
	just like any other disk to guest domains, since they use the same VBD
	abstraction, as provided by Xen.

	Using VDs is optional, since it's always possible to dedicate
	partitions, or entire disks to your virtual machines.  VDs are handy
	when you have a dynamically changing set of virtual machines and you
	don't want to have to keep repartitioning in order to provide them with
	disk space.

	Virtual Disks are rather like "logical volumes" in LVM.

If that didn't all make sense, it doesn't matter too much ;-)  Using the
functionality is fairly straightforward and some examples will clarify things.
The text below expands a bit on the concepts involved, finishing up with a
walk-through of some simple virtual disk management tasks.


Virtual Block Devices
---------------------

Before covering VD management, it's worth discussing some aspects of the VBD
functionality that will be useful to know.

A VBD is made up of a number of extents from physical disk devices.  The
extents for a VBD don't have to be contiguous, or even on the same device.  Xen
performs address translation so that they appear as a single contiguous
device to a domain.
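
The translation can be pictured with a short Python sketch.  This is only an
illustrative model and not Xen's implementation: the Extent tuple, the
translate() function and the sector numbers are all assumptions made up for
the example.

```python
from collections import namedtuple

# Hypothetical model of an extent: a run of sectors on one physical device.
Extent = namedtuple("Extent", ["device", "start_sector", "num_sectors"])


def translate(extents, virtual_sector):
    """Map a sector offset within a VBD to (device, physical sector)."""
    if virtual_sector < 0:
        raise ValueError("sector numbers cannot be negative")
    offset = virtual_sector
    for extent in extents:
        if offset < extent.num_sectors:
            return extent.device, extent.start_sector + offset
        # Not inside this extent: skip past it and try the next one.
        offset -= extent.num_sectors
    raise ValueError("sector %d is beyond the end of the VBD" % virtual_sector)


# Two non-contiguous extents on different devices form one 300-sector VBD.
vbd = [Extent("hda", 1000, 100), Extent("sdb", 5000, 200)]
print(translate(vbd, 0))      # first sector of the first extent
print(translate(vbd, 150))    # 50 sectors into the second extent
```

The domain only ever sees sector numbers 0 to 299 in this example; which
physical device actually holds a given sector is hidden from it.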

When the VBD layer is used to give access to entire drives or entire
partitions, the VBDs simply consist of a single extent that corresponds to the
drive or partition used.  Lists of extents are usually only used when virtual
disks (VDs) are being used.

Xen 1.2 and its associated XenoLinux release support automatic registration /
removal of VBDs.  It has always been possible to add a VBD to a running
XenoLinux domain but it was then necessary to run the "xen_vbd_refresh" tool in
order for the new device to be detected.  Nowadays, when a VBD is added, the
domain it's added to automatically registers the disk, with no special action
by the user being required.

Note that it is possible to use the VBD functionality to allow multiple domains
write access to the same areas of disk.  This is almost always a bad thing!
The provided example scripts for creating domains do their best to check that
disk areas are not shared unsafely and will catch many cases of this.  Setting
the vbd_expert variable in config files for xc_dom_create.py controls how
unsafe it allows VBD mappings to be - 0 (read only sharing allowed) should be
right for most people ;-).  Level 1 attempts to allow at most one writer to any
area of disk.  Level 2 allows multiple writers (i.e. anything!).
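
One reading of the three levels is sketched below in Python.  This is not the
check performed by the example scripts: the function name mapping_allowed()
and its arguments are hypothetical, and the rules are simply transcribed from
the description above.

```python
def mapping_allowed(vbd_expert, existing_modes, new_mode):
    """Decide whether a new mapping of an area of disk is acceptable.

    existing_modes lists the modes ('r' or 'w') of the mappings that already
    cover the area; new_mode is the mode being requested.
    """
    modes = list(existing_modes) + [new_mode]
    writers = modes.count("w")
    if vbd_expert >= 2:
        return True                 # level 2: anything goes
    if vbd_expert == 1:
        return writers <= 1         # level 1: at most one writer
    # Level 0: an area may only be shared if every mapping is read-only.
    return len(modes) == 1 or writers == 0


print(mapping_allowed(0, ["r"], "r"))   # read-only sharing
print(mapping_allowed(0, ["r"], "w"))   # a writer joining a shared area
print(mapping_allowed(1, ["r"], "w"))   # same request at level 1
```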


Virtual Disk Management
-----------------------

The VD management code runs entirely in user space.  The code is written in
Python and can therefore be accessed from custom scripts, as well as from the
convenience scripts provided.  The underlying VD database is a SQLite database
in /var/db/xen_vdisks.sqlite.

Most virtual disk management can be performed using the xc_vd_tool.py script
provided in the tools/examples/ directory of the source tree.  It supports the
following operations:

initialise -	     "Formats" a partition or disk device for use storing
		     virtual disks.  This does not actually write data to the
		     specified device.  Rather, it adds the device to the VD
		     free-space pool, for later allocation.

		     You should only add devices that correspond directly to
		     physical disks / partitions - trying to use a VBD that you
		     have created yourself as part of the free space pool has
		     undefined (possibly nasty) results.

create -	     Creates a virtual disk of specified size by allocating space
		     from the free space pool.  The virtual disk is identified
		     in future by the unique ID returned by this script.

		     The disk can be given an expiry time, if desired.  For
		     most users, the best idea is to specify a time of 0 (which
		     has the special meaning "never expire") and then
		     explicitly delete the VD when finished with it -
		     otherwise, VDs will disappear if allowed to expire.

delete -	     Explicitly delete a VD.  Makes it disappear immediately!

setexpiry -	     Allows the expiry time of a (not yet expired) virtual disk
		     to be modified.  Be aware the VD will disappear when the
		     time has expired.

enlarge -            Increase the allocation of space to a virtual disk.
		     Currently this will not be immediately visible to running
		     domain(s) using it.  You can make it visible by destroying
		     the corresponding VBDs and then using xc_dom_control.py to
		     add them to the domain again.  Note: doing this to
		     filesystems that are in use may well cause errors in the
		     guest Linux, or even a crash although it will probably be
		     OK if you stop the domain before updating the VBD and
		     restart afterwards.

import -	     Allocate a virtual disk and populate it with the contents of
		     some disk file.  This can be used to import root file system
		     images or to restore backups of virtual disks, for instance.

export -	     Write the contents of a virtual disk out to a disk file.
		     Useful for creating disk images for use elsewhere, such as
		     standard root file systems and backups.

list -		     List the non-expired virtual disks currently available in the
		     system.

undelete -	     Attempts to recover an expired (or deleted) virtual disk.

freespace -	     Get the free space (in megabytes) available for allocating
		     new virtual disk extents.

The functionality provided by these scripts is also available directly from
Python functions in the XenoUtil module - you can use this functionality in
your own scripts.

Populating VDs:

Once you've created a VD, you might want to populate it from DOM0 (for
instance, to put a root file system onto it for a guest domain).  This can be
done by creating a VBD for dom0 to access the VD through - this is discussed
below.

More detail on how virtual disks work:

When you "format" a device for virtual disks, the device is logically split up
into extents.  These extents are recorded in the Virtual Disk Management
database in /var/db/xen_vdisks.sqlite.

When you use xc_vd_tool.py to create a virtual disk, some of the extents in
the free space pool are reallocated for that virtual disk and a record for that
VD is added to the database.  When VDs are mapped into domains as VBDs, the
system looks up the allocated extents for the virtual disk in order to set up
the underlying VBD.

Free space is identified by the fact that it belongs to an "expired" disk.
When the "initialise" operation of xc_vd_tool.py adds a real device to the free pool, it
actually divides the device into extents and adds them to an already-expired
virtual disk.  The allocated device is not written to during this operation -
its availability is simply recorded into the virtual disks database.

If you set an expiry time on a VD, its extents will be liable to be reallocated
to new VDs as soon as that expiry time runs out.  Therefore, be careful when
setting expiry times!  Many users will find it simplest to set all VDs to not
expire automatically, then explicitly delete them later on.

Deleted / expired virtual disks may sometimes be undeleted - currently this
only works when none of the virtual disk's extents have been reallocated to
other virtual disks, since that's the only situation where the disk is likely
to be fully intact.  You should try undeletion as soon as you realise you've
mistakenly deleted (or allowed to expire) a virtual disk.  At some point in the
future, an "unsafe" undelete which can recover what remains of partially
reallocated virtual disks may also be implemented.
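
The bookkeeping described in this section can be modelled in a few lines of
Python.  The sketch below is a toy, in-memory model only: it does not use the
real database schema or the XenoUtil functions, and the VDPool class, its
method names and the (device, index) representation of an extent are all
assumptions made for illustration.

```python
NEVER = 0      # an expiry time of 0 means "never expire"
LONG_AGO = 1   # any time in the past marks a VD as already expired


class VDPool:
    """Toy model of the VD database (not the real SQLite schema)."""

    def __init__(self):
        self.owner = {}     # extent -> ID of the VD that currently owns it
        self.extents = {}   # VD ID -> extents the VD was created with
        self.expiry = {}    # VD ID -> expiry time
        self.next_id = 1

    def _new_vd(self, extents, expiry):
        vd_id = self.next_id
        self.next_id += 1
        self.extents[vd_id] = list(extents)
        self.expiry[vd_id] = expiry
        for extent in extents:
            self.owner[extent] = vd_id
        return vd_id

    def expired(self, vd_id, now):
        when = self.expiry[vd_id]
        return when != NEVER and when <= now

    def initialise(self, device, num_extents):
        """Add a device to the free pool as an already-expired VD."""
        extents = [(device, n) for n in range(num_extents)]
        return self._new_vd(extents, LONG_AGO)

    def free_extents(self, now):
        """Free space is whatever belongs to an expired VD."""
        return sorted(e for e, vd in self.owner.items() if self.expired(vd, now))

    def create(self, num_extents, expiry, now):
        free = self.free_extents(now)
        if len(free) < num_extents:
            raise ValueError("not enough free space")
        return self._new_vd(free[:num_extents], expiry)

    def delete(self, vd_id):
        self.expiry[vd_id] = LONG_AGO

    def undelete(self, vd_id, new_expiry):
        """Succeeds only if none of the VD's extents have been reallocated."""
        intact = all(self.owner[e] == vd_id for e in self.extents[vd_id])
        if intact:
            self.expiry[vd_id] = new_expiry
        return intact


pool = VDPool()
pool.initialise("hda3", 4)
vd_a = pool.create(2, NEVER, now=100)
pool.delete(vd_a)
print(pool.undelete(vd_a, NEVER))    # nothing has been reallocated yet
pool.delete(vd_a)
vd_b = pool.create(4, NEVER, now=200)
print(pool.undelete(vd_a, NEVER))    # vd_a's extents now belong to vd_b
```

Note how deleting a VD writes nothing to disk in this model: it only changes
the expiry time, which is why prompt undeletion has a chance of succeeding.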

Security note:

The disk space for VDs is not zeroed when it is initially added to the free
space pool OR when a VD expires OR when a VD is created.  Therefore, if this is
not done manually it is possible for a domain to read a VD to determine what
was written by previous owners of its constituent extents.  If this is a
problem, users should manually clean VDs in some way either on allocation, or
just before deallocation (automated support for this may be added at a later
date).
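
As a rough illustration of manual cleaning, the Python sketch below overwrites
a device node with zeros.  It assumes the VD has already been made available
to DOM0 with write access at some device node (for instance /dev/xvda); the
zero_fill() helper is hypothetical and is not part of the Xen tools.  It
destroys all data on whatever path it is given, so check the path carefully.

```python
import os


def zero_fill(path, chunk_size=1024 * 1024):
    """Overwrite every byte of an existing file or device node with zeros.

    Returns the number of bytes written.
    """
    zeros = bytes(chunk_size)
    with open(path, "r+b") as dev:
        # Seeking to the end gives the size of a regular file or, on Linux,
        # of a block device.
        size = dev.seek(0, os.SEEK_END)
        dev.seek(0)
        remaining = size
        while remaining > 0:
            n = min(chunk_size, remaining)
            dev.write(zeros[:n])
            remaining -= n
        dev.flush()
        os.fsync(dev.fileno())
    return size


# Example (destructive!), with the VD mapped into DOM0 as /dev/xvda:
#   zero_fill("/dev/xvda")
```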


Side note: The xvd* devices
---------------------------

The examples in this document make frequent use of the xvd* device nodes for
representing virtual block devices.  It is not a requirement to use these with
Xen, since VBDs can be mapped to any IDE or SCSI device node in the system.
Changing the references to xvd* nodes in the examples below to refer to
some unused hd* or sd* node would also be valid.

They can be useful when accessing VBDs from dom0, since binding VBDs to xvd*
devices will avoid clashes with real IDE or SCSI drives.

There is a shell script provided in tools/misc/xen-mkdevnodes to create these
nodes.  Specify on the command line the directory that the nodes should be
placed under (e.g. /dev):

> cd {root of Xen source tree}/tools/misc/
> ./xen-mkdevnodes /dev


Dynamically Registering VBDs
----------------------------

The domain control tool (xc_dom_control.py) includes the ability to add and
remove VBDs to / from running domains.  As usual, the command format is:

xc_dom_control.py [operation] [arguments]

The operations (and their arguments) are as follows:

vbd_add dom uname dev mode - Creates a VBD corresponding to either a physical
		             device or a virtual disk and adds it as a
		             specified device under the target domain, with
		             either read-only ('r') or read / write ('w') access.

vbd_remove dom dev	   - Removes the VBD associated with a specified device
			     node from the target domain.

These operations are most useful when populating VDs.  VDs can't be populated
directly, since they don't correspond to real devices.  Using:

  xc_dom_control.py vbd_add 0 vd:your_vd_id /dev/whatever w

You can make a virtual disk available to DOM0.  Sensible devices to map VDs to
in DOM0 are the /dev/xvd* nodes, since that makes it obvious that they are Xen
virtual devices that don't correspond to real physical devices.

You can then format, mount and populate the VD through the nominated device
node.  When you've finished, use:

  xc_dom_control.py vbd_remove 0 /dev/whatever

To revoke DOM0's access to it.  It's then ready for use in a guest domain.
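
If you script this from Python, it is easy to forget the vbd_remove step when
something goes wrong part way through.  The sketch below wraps the two
commands shown above in a context manager so that the VBD is always removed.
The vd_in_dom0() helper and its "run" parameter are hypothetical; only the
xc_dom_control.py command lines come from this document.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def vd_in_dom0(vd_id, node="/dev/xvda", run=subprocess.check_call):
    """Attach a VD to DOM0 read / write, and detach it again afterwards."""
    run(["xc_dom_control.py", "vbd_add", "0", "vd:%s" % vd_id, node, "w"])
    try:
        yield node
    finally:
        # Runs even if populating the VD raised an exception.
        run(["xc_dom_control.py", "vbd_remove", "0", node])


# Example use (needs a running Xen system and a real VD ID):
#   with vd_in_dom0("<id>") as node:
#       ...format, mount, populate and unmount the device at `node`...
```

Passing a different "run" function (one that just prints its argument, say)
gives a dry run that shows the commands without executing them.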



You can also use this functionality to grant access to a physical device to a
guest domain - you might use this to temporarily share a partition, or to add
access to a partition that wasn't granted at boot time.

When playing with VBDs, remember that in general, it is only safe for two
domains to have access to a file system if they both have read-only access.  You
shouldn't be trying to share anything which is writable, even if only by one
domain, unless you're really sure you know what you're doing!


Granting access to real disks and partitions
--------------------------------------------

During the boot process, Xen automatically creates a VBD for each physical disk
and gives Dom0 read / write access to it.  This makes it look like Dom0 has
normal access to the disks, just as if Xen wasn't being used - in reality, even
Dom0 talks to disks through Xen VBDs.

To give another domain access to a partition or whole disk, you need to
create a corresponding VBD for that partition, for use by that domain.  As for
virtual disks, you can grant access to a running domain, or specify that the
domain should have access when it is first booted.

To grant access to a physical partition or disk whilst a domain is running, use
the xc_dom_control.py script - the usage is very similar to the case of adding
virtual disks to a running domain (described above).  Specify the device
as "phy:device", where device is the name of the device as seen from domain 0,
or from normal Linux without Xen.  For instance:

> xc_dom_control.py vbd_add 2 phy:hdc /dev/whatever r

Will grant domain 2 read-only access to the device /dev/hdc (as seen from Dom0
/ normal Linux running on the same machine - i.e. the master drive on the
secondary IDE chain), as /dev/whatever in the target domain.

Note that you can use this within domain 0 to map disks / partitions to other
device nodes within domain 0.  For instance, you could map /dev/hda to also be
accessible through /dev/xvda.  This is not generally recommended, since if you
(for instance) mount both device nodes read / write you could cause corruption
to the underlying filesystem.  It's also quite confusing ;-)

To grant a domain access to a partition or disk when it boots, the appropriate
VBD needs to be created before the domain is started.  This can be done very
easily using the tools provided.  To specify this to the xc_dom_create.py tool
(either in a startup script or on the command line) use triples of the format:

  phy:dev,target_dev,perms

Where dev is the device name as seen from Dom0, target_dev is the device you
want it to appear as in the target domain and perms is 'w' if you want to give
write privileges, or 'r' otherwise.

These may either be specified on the command line or in an initialisation
script.  For instance, to grant the same access rights as described by the
command example above, you would use the triple:

  phy:hdc,/dev/whatever,r

If you are using a config file, then you should add this triple into the
vbd_list variable, for instance using the line:

  vbd_list = [ ('phy:hdc', '/dev/whatever', 'r') ]

(Note that you need to use quotes here, since config files are really small
Python scripts.)

To specify the mapping on the command line, you'd use the -d switch and supply
the triple as the argument, e.g.:

> xc_dom_create.py [other arguments] -d phy:hdc,/dev/whatever,r

(You don't need to explicitly quote things in this case.)
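
The two forms carry the same three fields.  As an illustration, the Python
sketch below turns a command-line style triple into the tuple form used in a
config file.  The parse_triple() helper is hypothetical: it is not how
xc_dom_create.py itself parses its arguments.

```python
def parse_triple(text):
    """Turn 'uname,target_dev,perms' into a (uname, target_dev, perms) tuple."""
    parts = text.split(",")
    if len(parts) != 3:
        raise ValueError("expected three comma-separated fields: %r" % text)
    uname, target_dev, perms = parts
    if perms not in ("r", "w"):
        raise ValueError("perms must be 'r' or 'w': %r" % perms)
    return uname, target_dev, perms


vbd_list = [parse_triple("phy:hdc,/dev/whatever,r")]
print(vbd_list)    # [('phy:hdc', '/dev/whatever', 'r')]
```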


Walk-through: Booting a domain from a VD
----------------------------------------

As an example, here is a sequence of commands you might use to create a virtual
disk, populate it with a root file system and boot a domain from it.  These
steps assume that you've installed the example scripts somewhere on your PATH -
if you haven't done that, you'll need to specify a fully qualified pathname in
the examples below.  It is also assumed that you know how to use the
xc_dom_create.py tool (apart from configuring virtual disks!)

[ This example is intended only for users of virtual disks (VDs).  You don't
need to follow this example if you'll be booting a domain from a dedicated
partition, since you can create that partition and populate it, directly from
Dom0, as normal. ]

First, if you haven't done so already, you'll initialise the free space pool by
adding a real partition to it.  The details are stored in the database, so
you'll only need to do it once.  You can also use this command to add further
partitions to the existing free space pool.

> xc_vd_tool.py initialise /dev/<real partition>

Now you'll want to allocate the space for your virtual disk.  Do so using the
following, specifying the size in megabytes.

> xc_vd_tool.py create <size in megabytes>

At this point, the program will tell you the virtual disk ID.  Note it down, as
it is how you will identify the virtual device in future.

If you don't want the VD to be bootable (i.e. you're booting a domain from some
other medium and just want it to be able to access this VD), you can simply add
it to the vbd_list used by xc_dom_create.py, either by putting it in a config
file or by specifying it on the command line.  Formatting / populating of the
VD could then be done from that domain once it's started.

If you want to boot off your new VD as well then you need to populate it with a
standard Linux root filesystem.  You'll need to temporarily add the VD to DOM0
in order to do this.  To give DOM0 r/w access to the VD, use the following
command line, substituting the ID you got earlier.

> xc_dom_control.py vbd_add 0 vd:<id> /dev/xvda w

This attaches the VD to the device /dev/xvda in domain zero, with read / write
privileges - you can use other device nodes if you choose to.

Now make a filesystem on this device, mount it and populate it with a root
filesystem.  These steps are exactly the same as under normal Linux.  When
you've finished, unmount the filesystem again.

You should now remove the VD from DOM0.  This will prevent you accidentally
changing it in DOM0, whilst the guest domain is using it (which could cause
filesystem corruption, and confuse Linux).

> xc_dom_control.py vbd_remove 0 /dev/xvda

It should now be possible to boot a guest domain from the VD.  To do this, you
should specify the VD's details in some way so that xc_dom_create.py will
be able to set up the corresponding VBD for the domain to access.  If you're
using a config file, you should include:

  ('vd:<id>', '/dev/whatever', 'w')

In the vbd_list, substituting the appropriate virtual disk ID, device node and
read / write setting.

To specify access on the command line, as you start the domain, you would use
the -d switch (note that you don't need to use quote marks here):

> xc_dom_create.py [other arguments] -d vd:<id>,/dev/whatever,w

To tell Linux which device to boot from, you should either include:

  root=/dev/whatever

in your cmdline_root in the config file, or specify it on the command line,
using the -R option:

> xc_dom_create.py [other arguments] -R root=/dev/whatever
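
Put together, the relevant part of a config file might look something like
the fragment below.  Treat it as a sketch: the vbd_list and cmdline_root
variable names come from this document, but the vd_id and root_dev helper
variables are made up for the example, "<id>" is a placeholder for the ID
printed by xc_vd_tool.py, and a real config file needs other settings too.

```python
# Config files are small Python scripts, so ordinary variables can be used
# to avoid repeating the device name.
vd_id = "<id>"                 # placeholder: substitute your virtual disk ID
root_dev = "/dev/whatever"     # device node the VD appears as in the guest

# Give the guest read / write access to the VD at root_dev...
vbd_list = [("vd:%s" % vd_id, root_dev, "w")]

# ...and tell Linux to use that device as its root file system.
cmdline_root = "root=%s" % root_dev
```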

That should be it: sit back and watch your domain boot off its virtual disk!


Getting help
------------

The main source of help using Xen is the developers' e-mail list:
<xen-devel@lists.sourceforge.net>.  The developers will help with problems,
listen to feature requests and do bug fixes.  It is, however, helpful if you
can look through the mailing list archives and HOWTOs provided to make sure
your question is not answered there.  If you post to the list, please provide
as much information as possible about your setup and your problem.

There is also a general Xen FAQ, kindly started by Jan van Rensburg, which (at
time of writing) is located at: <http://xen.epiuse.com/xen-faq.txt>.

Contributing
------------

Patches and extra documentation are also welcomed ;-) and should be posted
to the xen-devel e-mail list.
