Clustering KVM with Ceph Storage

I have, for a long time, been fascinated and terrified by “Virtual SAN” solutions.

The idea of combining storage and compute seems on the surface very attractive. It allows us to scale out our storage and compute together or separately in relatively small and affordable units, helping avoid the sticker shock of the upfront cost of storage systems. And as somebody especially prone to capex-phobia, that really is a great solution.

However, on a technical level there are some major shortcomings to this type of infrastructure. For one, storage failures are by far the most feared and devastating failures that can happen to any individual or organization, and housing storage on the relatively volatile virtual host layer seems like a very bad idea. Furthermore, many of the commercial solutions have very strict requirements on the type of hardware that can be used, and very vague documentation on how to recover the system from any sort of degraded state.

In particular, Microsoft’s Storage Spaces Direct (S2D) solution has a lack of meaningful documentation, most of it coming off more as a sales pitch than as a technical document for engineers and architects. This type of marketecture seems to be quite common in this space, with VMware’s VSAN suffering the same lack of useful information to a lesser (but still irritating) degree. And of course, there’s a slew of other systems that are more or less effective and documented.

Build it yourself

“You want a good truck, you’re gonna have to build it yourself,” as my grandfather said, referring to his Chevy/Ford/fiberglass/fabricated creation. Sometimes the right system is a mixture of different off-the-shelf standard parts with a few globs of glue between them. And that goes beyond old-school farmers building their equipment out of ‘junk.’ This type of engineering is tried and true. The reason those old tractors are so reliable isn’t that we forgot how to build good-quality stuff – it’s that we forgot how to engineer simple stuff. The solution, then, is to strip away all the complexity and cruft and build a very simple cluster for hosting and managing virtual workloads.

After some research and testing, this is what I’ve come up with:

Hypervisor   Storage   Provisioning   Management
KVM          CephFS    cloud-config   virsh / virt-manager / kimchi web ui

Creating the cluster

This simple cluster needs only three nodes to start. It could be as small as two, but that may limit the ability to scale out later. More on that in a bit.

Here is the reference layout:

Roles   Node 1    Node 2    Node 3
OSD     osd1      osd2      osd3
MON     mon1      mon2      mon3
MDS     primary   standby   standby
KVM     Enabled   Enabled   Enabled

In terms of storage, each node should have a small SSD for the OS and system software, as well as an SSD dedicated to Ceph. Though Ceph can use a partition or LVM slice, it’s much better to give it raw access to a physical device. Not only will that improve performance by not layering inodes and filesystems (cough cough GlusterFS), it will also make the system much more stable overall by avoiding additional layers of complexity. After all, that’s what this is all about.
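A quick look at the block devices on each node makes it obvious which disk is the spare SSD before it gets handed to Ceph. The /dev/sdb name matches the device used in the OSD creation step later, but it is only an assumption; your spare disk may have a different name.

lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT    # the Ceph disk should show no filesystem and no mountpoint
sudo fdisk -l /dev/sdb                       # confirm the size and that there is nothing on it you care about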

Preparing the nodes

In my setup, I am using three Ubuntu Server 18.04 machines, all running on VMware Workstation. Use your favourite distro, but be aware that some of my documentation may not line up with your system.

Make sure that they’re all up to date, stable, and have hardware as close to identical as possible; Ceph places and replicates data most effectively across evenly matched servers. Also, ensure that NTP is synced on all nodes with as much precision as possible.
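On Ubuntu 18.04 the quickest way to check the clock is timedatectl, which reports whether systemd-timesyncd (or whichever NTP client you use) considers the system clock synchronized:

timedatectl status    # look for a line like "System clock synchronized: yes" on every node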

A reliable DNS resolver is also recommended, though modifying the host file is also possible. Either way, make sure it is working and that all the nodes can ping each other before proceeding.
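If you go the hosts-file route, every node (and the management machine described below) needs an entry for all three servers. A minimal sketch; the addresses are illustrative (only osd1’s 10.20.10.31 appears later in this post), so substitute whatever your nodes actually use:

# /etc/hosts (append on every node and on the management machine)
10.20.10.31  osd1
10.20.10.32  osd2
10.20.10.33  osd3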

The first stage of this process will be run on a management node, which can be a server, workstation, virtual machine or laptop. Ideally it should either be a permanent server installation (fourth node) or a virtual machine that can be backed up and archived once the process is complete.

First, install the ceph deployment tools on the management node:

wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
sudo apt-add-repository 'deb https://download.ceph.com/debian-luminous/ bionic main'
sudo apt update
sudo apt install ceph-deploy

On each server, create a cephsvc user account:

sudo useradd -d /home/cephsvc -s /bin/bash -m cephsvc
sudo passwd cephsvc

This user also needs passwordless sudo on each system:

echo "cephsvc ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/cephsvc
sudo chmod 0440 /etc/sudoers.d/cephsvc

Install the prerequisite python on each node (osd1, osd2, osd3):

sudo apt-add-repository universe
sudo apt update && sudo apt install python-minimal -y

Generate a passwordless SSH key (required for ceph-deploy) on the management node:

ssh-keygen

And copy the public key to each server:

ssh-copy-id cephsvc@osd1
ssh-copy-id cephsvc@osd2
ssh-copy-id cephsvc@osd3

Then, configure the SSH client to use this remote user and key:

~/.ssh/config

Host osd1
    Hostname osd1
    User cephsvc
Host osd2
    Hostname osd2
    User cephsvc
Host osd3
    Hostname osd3
    User cephsvc
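Before bootstrapping anything, it’s worth confirming from the management node that both the key-based SSH login and the passwordless sudo actually work, since ceph-deploy depends on both:

for node in osd1 osd2 osd3; do
    ssh $node sudo whoami    # should print "root" for each node without any password prompt
done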

Bootstrap the cluster

Create the data directory:

mkdir ~/ceph
cd ~/ceph

Specify the initial monitor nodes for the install:

ceph-deploy new osd1 osd2 osd3

In ~/ceph/ceph.conf, specify the public network of the Ceph cluster (the subnet the nodes’ addresses live in). Though some documentation indicates this is not mandatory, monitor deployment appears to fail if it isn’t specified explicitly.

public network = 10.204.10.0/24

Install the ceph packages on the nodes:

ceph-deploy install osd1 osd2 osd3

Deploy monitors and gather keys:

ceph-deploy mon create-initial

Install the ceph keys and cluster configuration to each node:

ceph-deploy admin osd1 osd2 osd3

Install the manager node:

ceph-deploy mgr create osd1

Provision storage

Create three OSDs. These will claim and overwrite any contents of the specified disk. Be careful!

ceph-deploy osd create --data /dev/sdb osd1
ceph-deploy osd create --data /dev/sdb osd2
ceph-deploy osd create --data /dev/sdb osd3

Check the health of the cluster

ssh osd1 sudo ceph health
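If everything went well, ceph health should report HEALTH_OK. For a more detailed view of the monitors, the manager, and the three OSDs:

ssh osd1 sudo ceph -s          # overall cluster status
ssh osd1 sudo ceph osd tree    # all three OSDs should be up and in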

Metadata Service

At least one metadata server (MDS) is required to use CephFS, which this cluster will depend on.

To make sure this cluster is fully redundant, all three nodes will run an MDS. Only one will be active at a time; the others act as standbys.

ceph-deploy mds create osd1 osd2 osd3

Manager Nodes

At least one manager is required, and it is recommended to have several in a cluster for high availability. The first manager was already created on osd1, so add standby managers on the remaining nodes:

ceph-deploy mgr create osd2 osd3

Storage Pools

A pool is the lowest-level unit of storage in Ceph. CephFS, RBD, and the RADOS Gateway (S3/Swift) are all ways of exposing pools to different types of clients.

Pool Type       Fault Tolerance   Usable Space
Replicated      High              Low
Erasure Coded   Low               High

When creating a pool, it’s important to pick an appropriate placement group count (the two 50s in the commands below are pg_num and pgp_num). See the official documentation on Placement Groups.

Example: Create a Replicated Pool

sudo ceph osd pool create reppool 50 50 replicated

Example: Create an Erasure Pool

The basic syntax replaces ‘replicated’ with ‘erasure’ to specify the pool type.

sudo ceph osd pool create ecpool 50 50 erasure 

Erasure-coded pools can also be tuned to balance redundancy against usable space. This is configured with the K and M values:

  • K = How many ‘chunks’ the original data will be divided into for storage. Generally, this is tied to the number of OSDs in the cluster.
  • M = How many additional coding ‘chunks’ are created to provide redundancy. The data can survive the failure of up to M chunks.

For this very small cluster, we only need one coding chunk (M=1) and two data chunks (K=2) to get the job done. This is done by creating a new profile (smallcluster), and then using that profile to provision a new storage pool.

sudo ceph osd erasure-code-profile set smallcluster \
    k=2 m=1 crush-failure-domain=host 

sudo ceph osd pool create ecpool2 50 50 erasure smallcluster
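To confirm the profile and the resulting pool look the way you expect, a couple of read-only checks can be run on the same node:

sudo ceph osd erasure-code-profile get smallcluster    # should show k=2, m=1 and the host failure domain
sudo ceph osd pool ls detail                           # lists every pool with its type, size and pg count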

More information available from the official documentation.

Special thanks to Jake Grimmett for providing a correction for the original information here.

Create a CephFS pool

This will act as a shared volume for the virtual machines running on the cluster.

  1. Create a pair of pools to store metadata and data for the cephfs cluster:

     ceph osd pool create cephfs_data 50 50 replicated
     ceph osd pool create cephfs_meta 50 50 replicated
    

Note, I will be using replicated pools because of the substantially lower chances of data loss. It also allows for more resilient pools as the number of OSDs grows.

Note that the CephFS metadata pool must always be replicated; if you do use erasure coding, use it only for the data pool.

  2. Create a CephFS system from the two pools:

     ceph fs new cephfs cephfs_meta cephfs_data 
    
  • If using an erasure coded data pool instead, enable overwrites on it first (replace my_ec_pool with your data pool’s name):

      ceph osd pool set my_ec_pool allow_ec_overwrites true 
    

More information about CephFS available from the official documentation.
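At this point the filesystem and the MDS daemons can be sanity-checked from any node:

sudo ceph fs ls      # should list cephfs with cephfs_meta as the metadata pool and cephfs_data as the data pool
sudo ceph mds stat   # should show one active MDS and two standbys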

Mount the CephFS pool

Now that the pool is created, it can be mounted on each node.

  1. Install the ceph-fuse package:

     sudo apt install ceph-fuse
    
  2. Create a mount point named after the filesystem (not required, but recommended):

     sudo mkdir -p /mnt/cephfs
    
  3. Configure the /etc/fstab file for using FUSE:

     none    /mnt/cephfs  fuse.ceph ceph.id=admin,_netdev,defaults  0 0
    

The cephfs kernel driver can also be used, but it is generally recommended to use FUSE instead.
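For reference, a kernel-driver mount entry looks something like the line below. The monitor host (osd1) is taken from this cluster, but the secret-file path is an assumption: it needs to contain the client.admin key (retrievable with sudo ceph auth get-key client.admin) and should be readable only by root.

# /etc/fstab (kernel client alternative; not needed if you use the FUSE entry above)
osd1:6789:/  /mnt/cephfs  ceph  name=admin,secretfile=/etc/ceph/admin.secret,_netdev,noatime  0 0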

After each node is ready, issue the command sudo mount -a on each and check the output of the df command. If you did everything right, you’ll see that /mnt/cephfs points to your shiny new cluster! Try adding files on one node and check from another to see if they’re there… If they’re not, troubleshoot.

Installing KVM

The next stage is installing the hypervisor on each node. KVM has been part of the mainline kernel since Linux 2.6.20, so practically every system already has it; it’s just a matter of making sure it’s enabled and configuring it.

First, check if the CPU instructions are available:

egrep -c '(vmx|svm)' /proc/cpuinfo

If they are, install cpu-checker and test each node for KVM compatibility. If kvm-ok does not report “KVM acceleration can be used”, you may have issues.

sudo apt install cpu-checker    
kvm-ok

If your systems are good, install the QEMU and libvirt packages:

sudo apt install qemu-kvm libvirt-daemon-system libvirt-clients bridge-utils
sudo systemctl enable libvirtd
sudo systemctl start libvirtd

You will also have to add your user to the ‘libvirt’ group on each system.

sudo adduser <your-user> libvirt

You will need to log out and back in for this to take effect.
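A quick way to confirm the group membership took effect and that libvirtd is reachable without sudo:

groups | grep libvirt                  # libvirt should appear in the list
virsh -c qemu:///system list --all     # should return an (empty) table rather than a permissions error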

Configure the bridge network

Each node will have to be reconfigured to use a bridge interface for networking. This allows the guest VMs to share the network connections with the host system, and adds support for multiple VLANs and even virtual switches. If your machines have multiple NICs, you can also bond them for network redundancy.

For this system, I am using the new Netplan.io method. If your system uses another network system such as ifupdown, you’ll need to configure it differently.

/etc/netplan/20-kvm-config.yaml

network:
  version: 2
  ethernets:
    ens33:
      dhcp4: no
      dhcp6: no

  bridges:
    br0:
      interfaces: [ens33]
      addresses:
        - 10.20.10.31/24
      gateway4: 10.20.10.2
      nameservers:
        search:
          - intranet.mycooldomain.com
        addresses:
          - 10.20.10.11
          - 10.20.10.12

Test and apply the configuration.

sudo netplan generate
sudo netplan apply

Check the running network configuration:

networkctl status -a

After each node is reconfigured, check that ceph is replicating and that all nodes are still reachable. While the cluster can sustain one transient node failure, multiple simultaneous failures could cause issues.
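A quick round of checks after each node comes back up on the bridge (plain pings plus the same health command used earlier):

for node in osd1 osd2 osd3; do
    ping -c 1 $node
done
ssh osd1 sudo ceph -s    # wait for HEALTH_OK before reconfiguring the next node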

Running a cloud image

Cloud images are small, pre-installed Linux disk images designed to be configured at first boot. I like to use them for lightweight VMs and for testing systems.

First, create a directory structure on the CephFS share:

sudo mkdir -p /mnt/cephfs/{templates,virtualmachines,config}

Download the cloud image:

wget https://cloud-images.ubuntu.com/bionic/current/bionic-server-cloudimg-amd64.img
qemu-img info bionic-server-cloudimg-amd64.img 

Clone the cloud image

Convert the image to a copy-on-write (qcow2) template stored on the CephFS volume:

sudo qemu-img convert -f qcow2 -O qcow2 bionic-server-cloudimg-amd64.img /mnt/cephfs/templates/bionic-server-cloudimg-amd64.img

Because qcow2 is a copy-on-write format, cloning the template into a new VM disk is very fast:

qemu-img create -f qcow2 -b /mnt/cephfs/templates/bionic-server-cloudimg-amd64.img /mnt/cephfs/virtualmachines/virt-01.img
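qemu-img info on the clone confirms that it is a thin qcow2 image backed by the template:

qemu-img info /mnt/cephfs/virtualmachines/virt-01.img
# Look for the "backing file" line pointing at the template, and a disk size far smaller than the virtual size.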

Set a root password

sudo apt install libguestfs-tools
sudo virt-customize -a /mnt/cephfs/virtualmachines/virt-01.img --root-password password:hunter2

Please don’t actually set your root password to hunter2

Generate a cloud-config

Using the cloud-config toolkit, we can create a basic desired state for this VM. Obviously, we’re scratching the surface here. Cloud-config can do a lot more than just set a hostname and import an SSH identity!

/mnt/cephfs/config/virt-01_cloudconfig.yml

#cloud-config
password: not-your-password
chpasswd: { expire: False }
ssh_pwauth: True
hostname: virt-01
ssh_authorized_keys: 
  - ssh-rsa AAAAA_My_SSH_Public_key_here

Next, the cloud-config file is packaged into a small virtual disk image. This allows us to attach it to the VM at boot time so the machine can configure itself during the provisioning stage.

sudo apt install cloud-image-utils
sudo cloud-localds /mnt/cephfs/config/virt-01_cloudconfig.img /mnt/cephfs/config/virt-01_cloudconfig.yml

Creating and Running a Virtual Machine

Finally, we can run the VM!

virt-install --name virt-01 --memory 512 --vcpus 1 \
 --disk /mnt/cephfs/virtualmachines/virt-01.img,device=disk,bus=virtio \
 --disk /mnt/cephfs/config/virt-01_cloudconfig.img,device=cdrom \
 --os-type linux --os-variant ubuntu18.04 \
 --virt-type kvm --graphics none \
 --network network=default,model=virtio --import

And in about 30 seconds the virtual machine is up and running. You can escape the VM by typing ctrl+] at the login tty.

Virtual Machine Live Migration

One of the most important parts of virtualization is the ability to keep workloads up by live migrating them between hosts. Luckily, this is very easy on KVM systems. All it requires is that tcp/22 is open for SSH, and that keys and passwords are configured correctly.

Assuming the nodes are configured correctly, all that has to be done is run the migration command to move the VM to another node in the cluster:

virsh migrate --live virt-01 qemu+ssh://my-user-name@remote/system

Check the output of virsh list --all on both the source and destination virtual hosts. The VM should now be listed on the destination machine with a status of “running”.
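The same check can be run from the management node by pointing virsh at each hypervisor over SSH (the user and host names here are just the ones used in the migration URI above):

virsh -c qemu+ssh://my-user-name@osd1/system list --all
virsh -c qemu+ssh://my-user-name@osd2/system list --all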

This is the magic of CephFS + KVM! Using off-the-shelf tech like SSH and QEMU, we are able to quickly and easily migrate production workloads in a much simpler way than systems like Hyper-V and ESXi.

Web UI

Another common feature of a virtualization system is the web UI. There are a few to choose from, but I think these are the top choices for a simple cluster:

That being said, Proxmox oversteps the role of a web UI and attempts to be a full system admin suite.

Taking it even further

Of course, virtual machines are so 2010s. Containerization is all the rage these days.

And that’s great! Kubernetes loves using CephFS for storage, and installing LXC or Docker on the cluster is a logical next step. After this system is built out fully, that’s exactly what I’ll do. And really, that is the beauty of sticking the pieces together. Since we’re not at the mercy of Dell/EMC or Microsoft’s Technical Vision(TM), there are really no restrictions on what technologies this cluster will be able to support.

As always, let me know what you think. I’m always curious what others have to say about this sort of technical project.