One major part of the new vCluster release is storage. As you may know, vCluster uses NFS extensively, so we want to make sure that it works reliably. Reliability means two things: the stored data is always available, so the server must handle the load and stay stable, and the data is written correctly and consistently to disk. We also need some form of replication for backups and possibly failover.

There are lots of systems we could use to serve NFS: Linux of course, FreeBSD, OpenIndiana and a slew of commercial solutions from companies like EMC, NetApp, HP and Dell. After looking around and testing various options we decided that OpenIndiana is the best fit for our needs. The main reason is ZFS.

Some background

ZFS is a filesystem originally developed by Sun Microsystems. It was released in late 2005 in the development builds of Solaris. When it came out it obliterated anything freely available at the time, with features that were not even on the radar of other filesystems.

Some of the features are:

  • Block level checksumming
  • Real-time block level compression
  • Cheap snapshots
  • Copy-on-write
  • Atomic operations
  • Deduplication
  • Practically infinite capacity (can address pools of 256 billion terabytes in size)
  • Intelligent and aggressive caching, making use of dedicated accelerator block devices
  • Remote replication of filesystems and snapshots
  • Blatant layer violation (I will explain this further down)

Looking at this list one could point to features that other filesystems have. Snapshots, for example, are generally available in other systems through different mechanisms, e.g. LVM on Linux, and XFS offers checksumming (of metadata only). No stable filesystem has all of them, though.
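To give a taste of how cheap snapshots and remote replication work together, here is a rough sketch. The pool, dataset and host names (tank/data, backup/data, backuphost) are just placeholders for illustration:

# take a point-in-time snapshot (nearly instant, thanks to copy-on-write)
zfs snapshot tank/data@monday

# replicate that snapshot to another machine over ssh
zfs send tank/data@monday | ssh backuphost zfs receive backup/data

# later, send only the blocks that changed since the previous snapshot
zfs snapshot tank/data@tuesday
zfs send -i tank/data@monday tank/data@tuesday | ssh backuphost zfs receive backup/data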

In my view, ZFS has two main advantages that no other system offers, both of which stem from the features above. The first is that it never needs fsck (or scandisk, for the Windows users among you). The second is simplicity.

Fsck

Fsck is one of the most annoying things a sysadmin can face. First, it is slow. If a 10TB array has to check its filesystem, a sysadmin (and of course the clients!) has to wait needlessly for hours on end. Downtime goes up, people get annoyed, and in the end fsck might still fail for a million different reasons: not enough RAM, too many errors, a disk dying while fsck is running, and so on.
ZFS does not need any kind of fsck because its transactions are atomic and every block is checksummed. This means that the on-disk data is *always* valid.
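
Instead of an offline fsck, ZFS can verify every checksummed block online with a scrub, while the pool stays in use. A minimal sketch, assuming a pool named tank:

# read every block in the pool and verify its checksum, in the background
zpool scrub tank

# check progress and see whether any errors were found (and repaired)
zpool status tank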

Simplicity

The second thing that sets ZFS apart from other filesystems is, in my view, simplicity. This comes from the dreaded “blatant layer violation”: ZFS ignores the separation between the different subsystems that other operating systems use to manage their disks and filesystems. Let me give you an example to illustrate this:

On Linux you have fdisk to partition the drives, then md to create a RAID array, then LVM to create the physical volumes that you then chop up into logical volumes, and finally mkfs to format the volumes with the desired filesystem. So if you have three drives and want to create a RAID5 array managed by LVM, you have to do the following:

fdisk -l                              # inspect the drives (run fdisk on each one to create the partitions)
# build the RAID5 array out of the three partitions
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/hdb1 /dev/hda1 /dev/hdf1
pvcreate /dev/md0                     # turn the array into an LVM physical volume
vgcreate lvm-raid /dev/md0            # create a volume group on top of it
lvcreate -l 57235 lvm-raid -n lvm0    # carve out a logical volume of 57235 extents
mkfs.ext4 /dev/lvm-raid/lvm0          # format it with ext4
mount /dev/lvm-raid/lvm0 /mnt         # and finally mount it

In the case of ZFS the equivalent commands are:

format
zpool create tank raidz c2t0d0 c2t1d0 c2t2d0    # RAID-Z pool, created, formatted and mounted at /tank in one go

End of commands. (format is a Solaris command used only to find the names of the drives, e.g. c2t0d0, the equivalent of /dev/sda on Linux.)

Really. Nothing more is needed.
The reason is that ZFS combines a volume manager with a filesystem. This eases administration, which in turn leaves room for fewer errors, leads to better optimization and, of course, makes troubleshooting easier and faster.
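And because the filesystem and the volume manager are one, day-to-day tasks stay just as short. Carving an NFS-shared, compressed filesystem out of the pool, for example, is a couple of commands; this is just a sketch, with tank/export as a placeholder dataset name:

zfs create tank/export              # new filesystem, no partitioning or mkfs needed
zfs set compression=on tank/export  # enable transparent block-level compression
zfs set sharenfs=on tank/export     # export it over NFS
zfs list                            # see the datasets and how much space they use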
So where is the catch, you might ask. There has to be some caveat somewhere.

There is one actually.

All the features of ZFS need computing power, meaning RAM and CPU. For ZFS to match the performance of other, may I say lesser, filesystems, it needs to run on a significantly bigger box than the competition. These days CPU speed and RAM are cheap, so this argument is getting thinner every day.
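If you are curious where that RAM goes, most of it ends up in the ARC, the ZFS read cache. On an illumos system such as OpenIndiana you can peek at it with kstat; a quick sketch:

kstat -p zfs:0:arcstats:size     # current size of the ARC in bytes
kstat -p zfs:0:arcstats:c_max    # the ceiling it is allowed to grow to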

In the coming days I will continue with a comparison of the different systems that support ZFS, along with some very interesting benchmarks.

Alex Bisogiannis, Storage Engineer

Alex is a storage engineer specialising in clustered and high performance storage solutions for cloud hosting requirements.