Adventures with Lustre

By Mark Sutton Thursday, 26th March 2009

For the last few months we've been busy integrating, testing and tuning Lustre for use on our hosting platform, and I thought I'd share some notes. Lustre is most widely used in HPC settings, and there seem to be relatively few operations using it in conjunction with web servers and virtualisation. Reading through the wiki and mailing lists, it soon becomes clear that Lustre was not designed with small files and high metadata request rates in mind.

It's easy to upset Lustre

About the worst thing you can do on a Lustre filesystem is 'ls -l'. This seemingly innocuous command initiates a barrage of Lustre activity as the directory is enumerated file by file. Each request corresponds to a network round trip, and things soon get out of hand. Using the find command with any filtering based on file attributes amounts to the same thing.
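As a rough illustration of the difference, the sketch below counts the entries in a directory using names only, so ls has no reason to look up attributes for each file. It assumes GNU coreutils, and the helper name count_entries is made up for this example; I have not measured how much traffic this saves on any particular Lustre version.

```shell
#!/bin/sh
# Count directory entries by name only (hypothetical helper, GNU ls assumed).
#   -1  one name per line
#   -A  include dotfiles such as .htaccess, but not . and ..
#   -U  do not sort the listing
#   --color=never  no per-file attribute lookup for colouring
# Note: a file name containing a newline would be counted more than once.
count_entries() {
    command ls -1AU --color=never -- "$1" | wc -l
}
```

For example, count_entries /var/www/html prints the number of entries without the long listing that 'ls -l' would produce.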

What should you know about Lustre?

First, it is very important to understand how Lustre stores its data. Lustre splits file metadata and contents apart, storing them on separate devices using a modified ext3 filesystem. Each device can be served by a separate server, with metadata requests going to one place and data being sent to another. With enough servers, disks and system bandwidth, file data IO runs in parallel and can reach many hundreds of GB per second.

But there is a gotcha. While file data requests can be scaled to very high levels, file metadata requests cannot. For a Lustre filesystem there can only be one metadata server. This server can be protected by high availability tools such as Heartbeat and Red Hat Cluster Suite, but a metadata-heavy workload will perform much worse on Lustre because of network round trips and the concurrent load on that one server. One answer is to add more cores and memory (memory helps the metadata server to cache more), but this soon gets expensive because of another issue with how Lustre stores metadata.

In Lustre, file metadata is stored (by default) in a 4KB inode. That means each file on your Lustre filesystem costs 4KB on top of the actual data stored in the file. If your files are smaller than 4KB, the overhead of storing each file is 100% or more. Compound this with your RAID overhead and it's easy to see how storing a lot of small files is going to cost a lot of disk space on your metadata device.

This issue often trips up new Lustre users. When creating a Lustre filesystem it is very important to understand how the metadata service works and to size the device accordingly. Once the metadata device is online it is not possible to resize it without taking the filesystem offline...
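To put a number on the sizing question, here is a back-of-the-envelope sketch. It only multiplies the file count by the 4KB inode figure quoted above; the helper name and the 10 million file example are made up, and a real metadata device needs headroom beyond this (directories, journal, RAID overhead).

```shell
#!/bin/sh
# Estimate metadata space at one inode per file (hypothetical helper).
#   $1 = number of files
#   $2 = inode size in KiB (4 is the default figure quoted above)
mdt_space_kib() {
    echo $(( $1 * $2 ))
}

# Example: 10 million files at 4 KiB per inode, reported in whole GiB.
files=10000000
kib=$(mdt_space_kib "$files" 4)
echo "$files files need roughly $(( kib / 1024 / 1024 )) GiB of metadata space"
```

With these example numbers the estimate comes out at roughly 38 GiB for inodes alone, regardless of how small the files themselves are.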

General comparison with other cluster filesystems

Other cluster filesystems are guilty of some or all of the following:
  • Heavyweight, mandatory distributed locking slowing things down
  • Concurrent block device access can cause head thrashing since IO cannot be aggregated efficiently without an inline storage server
  • Susceptibility to whole cluster lockout due to loss of cluster quorum
  • Cannot easily be parallelised across spindles beyond what can be achieved using low level RAID
Lustre appears to get around all of these problems through clever design, and fills a niche quite well. As a shared network filesystem it is in many ways more comparable to NFS and Samba (in that it presents a file system) than to the typical cluster file systems that share one or more block devices on a SAN.

File locking can be disabled completely for applications which either don't need it or take care of their own locking in another way. This can make a difference to stability: in our experience cluster-wide locking appeared to cause some deadlocks even without file contention. My guess is that the Lustre locking code is less used and less tested, and certainly not with a workload like ours.

In Lustre there is no quorum as such. If a server goes away, operations on files that rely on that server simply wait for a timeout to expire and then return an error. If the application can deal with this gracefully, then as soon as the server comes back online (or is failed over to a warm standby, which can take just a few seconds) things get back to normal very quickly. We use Red Hat Cluster Suite and shared storage to take care of high availability, but in practice the Lustre servers have proved quite stable and we've only ever had manual failovers so far.

Another advantage of Lustre is that filesystems can be expanded easily, simply by adding more servers and storage. What's even better is that filesystem throughput can increase almost linearly with the number of servers added. We're still perfecting this process and plan to give it a real test in the next few days when we add more storage servers to the production cluster.

Using Lustre with Apache 2 Clusters

One of the ways we have been using Lustre is as a backing filesystem for DocumentRoot on Apache clusters. In practice the metadata overhead of storing lots of small-ish files has not been so bad when compared with the alternatives. I've already covered why we haven't used other, more traditional cluster file systems, in particular those with a shared block device. Another method used to synchronise files across a web cluster is replication, involving anything from deploy scripts to rsync to low-level NNTP replication protocols (I must find out more about that last one from the person who told me about it). Since replication involves storing a whole copy of every file on every node, the space required to store hundreds of web sites on all web nodes, together with the risk of file system incoherency, far outweighed the overhead of 4KB metadata inodes.

But there are still issues surrounding the use of Apache 2 on top of Lustre. The first issue was documented in my recent blog on .htaccess performance. While Dawid has been working on a solution to the .htaccess performance hit (in general, not just on Lustre), I have been investigating another issue that seems to be caused by the statahead feature in Lustre. It seems that under high load and with AllowOverride enabled, statahead would itself end up going into override (!) and amplify the already large number of metadata requests in flight. Eventually the webserver would get bogged down and spawn right up to its maximum number of children. At this point the upstream proxy/caching servers would also start to queue requests and send more requests to other webservers in the pool, until the whole system became completely jammed up. Internal timeouts on Lustre would then cause eviction of the overloaded nodes from the filesystem, resulting in page load errors. Shortly after, the webservers would reconnect to Lustre and continue normal operation until the next statahead runaway.

We disabled statahead on all Apache 2 clients using the following command, and the problem appears to have gone away.
echo 0 >/proc/fs/lustre/llite/fsname-ec80e800/statahead_max
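The directory name in that path includes a per-mount suffix, so it differs from client to client. A minimal sketch of applying the same setting to every Lustre mount on a machine follows; the function name and the optional base-directory argument are my own inventions for illustration, and only the /proc layout shown in the command above is assumed.

```shell
#!/bin/sh
# Write 0 to statahead_max for every Lustre client mount found under a base
# directory. The default base matches the path used in the command above;
# the function name and the optional argument are illustrative only.
disable_statahead() {
    base=${1:-/proc/fs/lustre/llite}
    for f in "$base"/*/statahead_max; do
        # Skip the unexpanded glob (no Lustre mounts) and unwritable entries.
        [ -w "$f" ] || continue
        echo 0 > "$f"
    done
}
```

Run as root, disable_statahead with no arguments touches every mounted Lustre filesystem on the client. As far as I can tell the value has to be set again after each mount, so a call like this belongs in whatever script mounts the filesystem.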

Using Lustre with OpenVZ

We use OpenVZ to create lightweight containers for running application stack components such as MySQL, mod_php and Ruby on Rails. Initially we ran a shared /vz filesystem on top of Lustre (using vzquota), and with some tuning managed to get it running reasonably well. However, even with statahead working as it should, common interactive operations like "ls" and "find" are badly affected by metadata performance.

Another issue is caching. The Lustre client-side drivers seem to handle file caching outside of VFS. Without pulling it apart, it seems to me that it would be great to use Linux's excellent VFS cache for container storage. One way we have found to do this is to use Lustre simply as backing storage for ext3 file system images that are mounted via the loop driver. According to my testing this eliminates virtually all Lustre metadata requests beyond mount/umount operations, because metadata is then cached on the client. It also reactivates the VFS cache, which seems to perform very nicely indeed.

Another advantage of using ext3 images on top of Lustre is that we can disable file locking on the Lustre mount. File locking enables applications that rely on fcntl() to create locks against files. Without it, certain applications such as MySQL will fail to start at all. With a loop-mounted ext3 image those locks are handled locally on the client, so Lustre's own locking is no longer needed. Using ext3 images and disabling Lustre locking shortens the Lustre code path considerably, reduces exposure to (perhaps) less well-tested code, and reduces lock traffic under high load. I will provide more details and some benchmarks in a future blog.
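The image-on-Lustre approach can be sketched roughly as follows. This is not our exact procedure: the paths, the image size and the function names are hypothetical, and by default the script only prints the commands it would run.

```shell
#!/bin/sh
# Print (or run) the steps for putting a container on an ext3 image that
# lives on Lustre. All paths, sizes and names here are hypothetical examples.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"    # show the command instead of executing it
    else
        "$@"
    fi
}

make_container_image() {
    image=$1         # image file on the Lustre mount
    size_mb=$2       # image size in MiB
    mountpoint=$3    # where the container's private area should appear

    # Create a sparse file of the requested size without writing any data.
    run dd if=/dev/zero of="$image" bs=1M count=0 seek="$size_mb"
    # -F lets mkfs.ext3 work on a regular file; -q keeps the output quiet.
    run mkfs.ext3 -F -q "$image"
    run mkdir -p "$mountpoint"
    # The loop driver presents the image file to the kernel as a block device.
    run mount -t ext3 -o loop "$image" "$mountpoint"
}
```

Calling make_container_image /mnt/lustre/images/101.img 10240 /vz/private/101 prints the four commands for a 10 GiB image. Setting DRY_RUN=0 and running as root executes them instead. Each image should only be mounted on one node at a time, since ext3 is not a cluster filesystem.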

One final fix for small file performance

Once everything is working well and stable there is one more thing that can help small file performance with Lustre. By default, debugging is enabled at a high level, and disabling debugging can really speed things up. This can be done with the following command:
sysctl -w 'lnet.debug=0'
All in all, despite a slightly shaky start with under-powered, under-featured switches and instability due to lack of tuning and suboptimal defaults for our workload, things are starting to take good shape. I guess we're quite lucky that Lustre is so readily tunable and Linux so flexible that we can work around the issues. I think Lustre will have a place on our clusters for some time to come.

Posted in Technology, vCluster | 3 Comments

  • Nice article, I’m interested to hear more about your Lustre exploits. For the past 6-9 months I’ve been researching & working on migrating our systems to a shared storage backend using Lustre, with Apache clients. So far as I’ve found, you’re the only person who’s blogged about it.

    It’s been a while since you’ve written about Lustre so I’m wondering how things developed for you, are you still testing? have you deployed it for production use? or have you scrapped the idea?
