Lustre is a complex cluster filesystem aimed at supercomputing clusters, scaling to many gigabytes per second of throughput and petabytes of raw storage. We’ve been testing it for some time as a base filesystem for virtualisation, and I wanted to share some notes.
Build a solid, fast network
Lustre depends heavily on a smooth-running network. That means dedicated network cards throughout and a dedicated VLAN on your switched network. The faster the network cards the better, and NIC teaming can help too if configured correctly. At a minimum you’ll want Gigabit Ethernet end to end, ideally 10 Gigabit, and on a larger network you’ll need to pay special attention to switch topology and size your inter-switch links correctly. If there are any bottlenecks or flaky switches in your storage network, Lustre will find them just as surely as iSCSI or AoE would.
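One way to keep storage traffic on the dedicated card is to tell LNET (Lustre's network layer) exactly which interface to use. This is a minimal config sketch, assuming `eth1` is the dedicated storage NIC and the classic `/etc/modprobe.conf` layout of the Lustre 1.6 era; adjust names for your own setup:

```shell
# /etc/modprobe.conf fragment (eth1 is assumed to be the dedicated storage NIC)
# Restrict LNET to the dedicated interface so Lustre traffic never
# crosses the general-purpose network.
options lnet networks="tcp0(eth1)"

# If you team NICs with the Linux bonding driver, point LNET at the
# bond device instead:
# options lnet networks="tcp0(bond0)"
```

The same `networks` option is how you would later split traffic across multiple fabrics, so it is worth getting right from the start.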
Maximise raw storage throughput
The laws of physics cannot be changed. If you need good performance from your Lustre filesystem, the raw storage bandwidth behind the servers must more or less match the bandwidth offered to clients. That means lots of spindles and lots of interconnect.
Direct-attached storage will probably be fastest, but offers no high availability if a Lustre server breaks badly. A SAN makes high availability possible, but requires more effort and expense to size the storage fabric correctly and make it reliable.
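For the SAN route, failover is declared when the targets are formatted and when clients mount. A hypothetical sketch (hostnames, NIDs and device paths are made up for illustration):

```shell
# Format an OST on shared SAN storage with a failover partner, so the
# target can be served by either OSS node.
mkfs.lustre --fsname=testfs --ost \
    --mgsnode=mgs1@tcp0 --mgsnode=mgs2@tcp0 \
    --failnode=oss2@tcp0 /dev/sdb

# Clients list both MGS nodes so the mount survives an MGS failover:
mount -t lustre mgs1@tcp0:mgs2@tcp0:/testfs /mnt/testfs
```

The actual failover (importing the shared device on the surviving node and mounting the target there) still needs an external cluster manager; Lustre only provides the redirection.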
Be careful with that locking…
As of version 1.6.6, Lustre mounts filesystems with locking disabled by default. Two locking options are available if you need it – cluster-wide locking and client-local locking.
Cluster-wide locking is enabled with the ‘flock’ option at mount time. Be careful with it, though: it not only hurts performance but also increases the complexity of the work Lustre must do to grant access to files. We found cluster locking quite unstable, leading to IO timeouts and client evictions, and ended up dropping back to local locking instead.
Even local locking appears to increase the chances of lock timeouts and evictions (at least in our testing), so consider carefully whether you need locking at all, or whether it can be handled in your application instead.
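The three behaviours correspond to three client mount invocations (MGS NID and filesystem name here are illustrative):

```shell
# Default: no flock() support exposed to applications.
mount -t lustre mgs@tcp0:/testfs /mnt/testfs

# Cluster-wide flock(): coherent across all clients, but costs extra
# lock traffic -- this is the mode we found prone to IO timeouts and
# client evictions on 1.6.6.
mount -t lustre -o flock mgs@tcp0:/testfs /mnt/testfs

# Client-local flock(): locks are coherent only between processes on the
# same node -- cheaper, but only safe if no two nodes ever need to lock
# the same file.
mount -t lustre -o localflock mgs@tcp0:/testfs /mnt/testfs
```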
Don’t expect great small file performance
There are two main drawbacks to Lustre’s small file performance. The first is that each file stored uses 4k on the metadata target (MDT) and at least 4k on an object storage target (OST). This means a file of 4k or less actually costs 8k of disk space to store!
The second: if your application opens lots of small files for IO and uses locking as well, small file performance takes the worst hit, as each file request can require several network round-trips to satisfy.
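The block-granularity half of the overhead is easy to demonstrate on any ordinary local filesystem; Lustre then charges a similar block again on the MDT:

```shell
# A 1-byte file still occupies a whole filesystem block (commonly 4 KB).
# On Lustre the metadata target uses roughly another 4 KB per file, so
# tiny files can cost about double what a local filesystem charges.
echo -n x > /tmp/tiny.txt
ls -l /tmp/tiny.txt   # apparent size: 1 byte
du -k /tmp/tiny.txt   # disk usage: typically 4 (KB) on a 4 KB-block filesystem
```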
Give yourself time to make it stable
Above all, if you’re just starting out with Lustre, take your time to design the system and explore its features. Learn how to build it, break it, and fix it again. If you are using shared storage and high availability, you’ll need to put in even more planning to configure the Lustre component filesystems correctly and tune client timeouts to acceptable levels.
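The main timeout knob in Lustre 1.6 is the filesystem-wide RPC timeout, set permanently from the MGS with `lctl conf_param`. A sketch, with an illustrative filesystem name and value – tune the number against how long your failovers actually take:

```shell
# On the MGS: raise the RPC timeout so clients ride out a failover
# instead of being evicted (300 seconds here is purely illustrative).
lctl conf_param testfs.sys.timeout=300

# On a client: inspect the value currently in effect.
lctl get_param timeout
```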
All in all, Lustre seems very promising and, despite a few teething problems, is beginning to take shape.
In future articles I’ll go into more detail on performance and configuration, and on setting Lustre up as a base filesystem for virtualisation.