In my previous post I gave a (very) high-level overview of ZFS and why I think it is a solid foundation for vCluster. What I did not say, though, was why we chose OpenIndiana over the other operating systems that offer ZFS.
Just before Sun was bought by Oracle, Solaris came in two flavours: Solaris and OpenSolaris. As the name suggests, OpenSolaris was open source and Solaris was the closed source variant. OpenSolaris was to become proper Solaris some day, and development was to happen in the open. Because OpenSolaris was open source, many different projects were born out of it. Nexenta made a storage appliance. Belenix was a generic desktop with KDE. StormOS was a simple desktop with Xfce. FreeBSD, which had lost its edge to Linux over the years, took the chance and ported ZFS with great success. Joyent based their cloud 100% on OpenSolaris. They even offer their version, called SmartOS, free to download and use in production.
Alas, Sun was bought by Oracle, Oracle closed the Solaris source code and open development of OpenSolaris ended.
Or did it?
Just before Oracle closed the OpenSolaris source, a group of hardcore Solaris sysadmins and companies that built their business on OpenSolaris decided to create a distro of OpenSolaris, called OpenIndiana. Their aim was to collaborate, move the platform forward, and avoid Oracle and their heavy-handed policies altogether. The result was Illumos, which is the kernel, and OpenIndiana, which is the full, generic, server-grade OS that OpenSolaris was.
So what choices did we have for ZFS storage?
One obvious choice was OpenIndiana. For all intents and purposes, it is the continuation of OpenSolaris. It is a stable platform and it has features beyond ZFS that are not found in any other system (I'll talk about these later), but it has a flaw: familiarity with the platform is very low. Solaris in general, and by extension OpenIndiana, was nowhere near as popular as Linux or FreeBSD. For one person familiar with Solaris you could find fifty familiar with Linux.
Another option was NexentaStor. NexentaStor is an Illumos distro that is commercially backed by a company called Nexenta. With NexentaStor one can make a storage appliance out of practically any PC less than two years old. Familiarity with the platform is not an issue, because the user/admin interacts with the appliance through a web interface. It is a purpose-built system just for creating storage appliances.
Another option was FreeNAS. FreeNAS is practically the same as NexentaStor, but using FreeBSD underneath instead of Illumos. Just like NexentaStor, FreeNAS is a purpose-built operating system with a nice web interface on top.
The last option was plain FreeBSD. FreeBSD descends from the original Berkeley UNIX of the late ’70s, an ancestor of practically every UNIX and UNIX-like system around today. Over the years it has gained a reputation for stability that any other platform would be envious of. Just like OpenIndiana, though, not many people are familiar with FreeBSD.
So why did we choose OpenIndiana in the end?
First, let's compare the different solutions based on some general features:
| Feature | Nexenta | OpenIndiana | FreeNAS (FreeBSD) | FreeBSD 9 |
| --- | --- | --- | --- | --- |
| Vendor support | YES | NO | NO (in U.K.) | NO |
| Web GUI | YES | YES (napp-it) | YES | NO |
| HA | Yes (commercial) | Yes (difficult) | Yes (commercial) | YES (HAST) |
| ZFS send/receive | Commercial only | YES | YES | YES |
As you can see, NexentaStor and FreeNAS are almost identical in features, especially if one takes into account commercial support.
One of the things we will use extensively in vCluster is ZFS send/receive. This is a ZFS feature where a snapshot of a filesystem can be sent locally or over the network to another ZFS server, producing an identical replica of the data remotely. Note that this is not the same as rsync, because rsync syncs files, whereas ZFS syncs blocks. This is significant because ZFS sends only the blocks that changed since the previous snapshot, which means syncs are significantly faster than rsync, and they are checksummed as well.
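For readers who have not seen it before, here is a minimal sketch of what that replication looks like (pool, dataset and host names below are made up):

```sh
# take a point-in-time snapshot of a dataset
zfs snapshot tank/vms@monday

# initial full replication to another ZFS host
zfs send tank/vms@monday | ssh backup-host zfs receive backup/vms

# later, send only the blocks that changed between the two snapshots
zfs snapshot tank/vms@tuesday
zfs send -i tank/vms@monday tank/vms@tuesday | ssh backup-host zfs receive backup/vms
```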
This rules out NexentaStor, for now at least, because we are not prepared to pay for a license for such a basic ZFS feature.
With that in mind, I started evaluating FreeNAS. The system I used has a quad-core Xeon processor @ 2.5GHz, 32GB RAM, twenty-two 2.5″ 7200rpm 750GB SATA disks and two OCZ 32GB SSDs used as ZFS ZIL accelerators (or slogs, or logzillas as some ZFS engineers call them). It also has three Adaptec RAID 5805 controllers.
One thing to note is that ZFS hates RAID controllers with a passion. If you have to use a RAID controller, configure it to present the disks as JBOD, or at a minimum configure each disk as a RAID0 array with a single member. Remember that ZFS is a volume manager combined with a filesystem. It also handles RAID itself, with single-parity (raidz), double-parity (raidz2), triple-parity (raidz3), mirror and stripe modes of operation.
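To make the volume manager point concrete, this is roughly what building a pool straight on the raw disks looks like (device names are placeholders and will differ between FreeBSD and Solaris-derived systems):

```sh
# a double-parity (raidz2) pool across six whole disks
zpool create tank raidz2 da0 da1 da2 da3 da4 da5

# or a stripe of mirrored pairs
zpool create tank mirror da0 da1 mirror da2 da3
```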
So, back to FreeNAS: I configured the system, set up the network and started benchmarking. We need the system to perform well as an NFS server for Linux clients, so as the NFS client I used a system in the lab with 8GB RAM, two 250GB SATA disks and a quad-core Xeon @ 2.5GHz.
Initially I wanted to establish the speed of the FreeNAS server locally, so I ran iozone, which is included with FreeNAS, to see just how fast the system I built was. After some fiddling around with the various iozone options I ended up running the following test: iozone -az -g 2G /mnt/tank/test -b /mnt/tank/iozone.xls
This command runs iozone in automatic mode, trying block sizes from 4K up to 16384K and writing and reading a file that starts from 4K in size up to 2GB in size.
The result was the following:
Hmmm. What is happening here? Sequential writing and random writing at 4GB/s?? How is it possible to write a file randomly at the same speed as writing it sequentially?!?
The answer is that ZFS caches so aggressively that, if not told otherwise, it will eat up *all* RAM minus 1GB by default. Yep, you read that right: all RAM minus 1GB. So the graphs really show the speed at which the ZFS caching subsystem (the ARC) works. Also note the jump in performance when the benchmark reaches a 128K block size in the first writer test. That is because 128K is the default record size when you create a ZFS filesystem.
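As an aside, the ARC can be capped if you would rather ZFS did not take all that RAM, and the 128K default is easy to confirm. A quick sketch, assuming a FreeBSD/FreeNAS box, a pool called tank and an arbitrary 8GB cap:

```sh
# cap the ARC at 8GB by adding a loader tunable (value is in bytes)
echo 'vfs.zfs.arc_max="8589934592"' >> /boot/loader.conf

# confirm the default record size on the pool's root dataset
zfs get recordsize tank
```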
Ok then. How can we eliminate caching so we can see how the system really performs?
Simple: just add -o to iozone, which forces it to actually commit every write to stable storage before it continues. So what do we get now?
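For reference, the run then becomes: iozone -az -o -g 2G /mnt/tank/test -b /mnt/tank/iozone.xls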
Random sync writer
What is this? Again sequential write and random write go at the same speed?!?
If you remember, I created the ZFS pool with two ZIL accelerators (slogs from now on). These are used by ZFS precisely when something writes to the pool synchronously. These SSDs can push 75,000 IOPS, which translates to ~550MB/s reads and ~500MB/s writes. Not bad. And notice that the graphs go up almost linearly with the block size.
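If you are wondering what attaching slogs looks like, it is a one-liner; the device names below are placeholders for the two OCZ SSDs:

```sh
# add the two SSDs as a mirrored separate intent log (slog)
zpool add tank log mirror da22 da23
```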
Here are some reader stats that go off the scale. For these runs I used the SSDs as read caches instead.
Note the speed when the file size is the same as the block size. 10GB/s!! This is straight from RAM.
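Repurposing the same SSDs as a read cache (L2ARC) is just as simple; again, placeholder device names:

```sh
# add the two SSDs as L2ARC cache devices
zpool add tank cache da22 da23
```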
This is all fine you say, but you want this to be an NFS server!
OK then here it is:
Linux NFS writer performance:
Linux NFS random writer performance:
Again this makes no sense!! 2.5GB/s from the network? This is impossible! The server uses one NIC at 1Gbit/s, which translates to a practical maximum of roughly 100MB/s, not 2.5GB/s!!!
Well, you see, Linux caches too. To solve the conundrum, look at the green line lower in the graphs. This is the 2GB file being transmitted over the network. You will notice that it holds steady at one fifth of 500MB/s, which is 100MB/s! That is because the Linux client does not cache the 2GB file, so it writes it straight to stable storage, in this case the NFS export from FreeNAS.
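For completeness, the client side of such a test boils down to a plain NFS mount along these lines (server name, export path and options are illustrative):

```sh
# Linux client: mount the FreeNAS export over NFSv3
mount -t nfs -o vers=3,rsize=65536,wsize=65536 freenas:/mnt/tank/export /mnt/nfs
```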
So then, we have established that the server is more than adequate to cope with the load expected from vCluster, and that it can saturate a 1Gbit/s network link. So why did we not use FreeNAS?
The answer is xattrs. You see, in vCluster we use SELinux extensively, which relies on xattrs to do its job. It turns out that FreeBSD, and by extension FreeNAS, does not support the xattrs we need at all!
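If you want to check this for yourself, setting an xattr from a Linux client and reading it back is enough of a smoke test; paths below are hypothetical:

```sh
# set an extended attribute on a file over NFS, then read it back
touch /mnt/nfs/xattr-test
setfattr -n user.comment -v "hello" /mnt/nfs/xattr-test
getfattr -n user.comment /mnt/nfs/xattr-test
```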
In the next blog post I will continue with benchmarks and more on OpenIndiana.