OpenVZ forced umount of Lustre mount problem

By Mark Sutton Tuesday, 31st March 2009

We recently found the cause of a worrying Lustre problem that had been bugging us for some time. Every now and then, on servers running OpenVZ containers that use a Lustre filesystem, we would see the following entry in /var/log/messages:
kernel: Lustre: setting import lustre-server-MDT0000_UUID INACTIVE by administrator request
followed by a number of broken mounts and filesystem errors inside the containers running on that server. In effect, every container making heavy use of the same Lustre server would stop working properly. For example, Apache serving sites from Lustre mounts would keep spawning processes, all of them stuck trying to read data from the mounts. The only fix that worked was to unmount all the Lustre mounts and mount them again, hoping that the processes blocked on those mounts would finally get their data and carry on. In the worst case, when the processes inside the containers stayed blocked, the containers had to be restarted to restore service. The log suggested that an administrator had explicitly set the Lustre filesystem to inactive. Yet none of us could remember requesting any change of Lustre filesystem state, and none of us would ever want to take a live filesystem used by multiple containers out of service.
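The remount workaround starts with knowing which Lustre filesystems the host actually has mounted. The C sketch below is not from the original post: list_lustre_mounts() is a hypothetical helper that reads an fstab-style mount table (such as /proc/mounts) with the standard getmntent() call and prints every entry whose type is lustre.

```c
#define _GNU_SOURCE  /* glibc: keep getmntent() and POSIX 2008 stdio declared under strict -std= modes */
#include <assert.h>
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
 * Prints the mount point of every entry of type "lustre" found in an
 * fstab-style mount table (e.g. /proc/mounts) and returns how many
 * such entries were found. */
static int list_lustre_mounts(FILE *table, FILE *out)
{
    struct mntent *ent;
    int count = 0;

    /* getmntent() parses one mount entry per call and returns NULL at EOF */
    while ((ent = getmntent(table)) != NULL) {
        if (strcmp(ent->mnt_type, "lustre") == 0) {
            fprintf(out, "%s\n", ent->mnt_dir);
            count++;
        }
    }
    return count;
}
```

On a host it would be called with setmntent("/proc/mounts", "r") as the table and stdout as the output. Bind mounts of a Lustre directory, like the container ones described later, also report type lustre in /proc/mounts, so they are listed alongside the main mount; the unmount and remount themselves are left to the administrator, since the right options depend on the site.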

New trace

A couple of days ago the same issue hit us again and the same message appeared in the logs. This time, however, we managed to relate it to the last action performed on the server, which was one of the most basic ones: stopping a container. We restored the mounts, started a container and then stopped it again to see whether we could reproduce the problem. We could: every vzctl stop [container_ID] produced the same message in the logs, and the Lustre mount on the host server was lost even though it still appeared in the mount list. Any attempt to access it failed with:
# ls -l /mnt/lustre-server
ls: /mnt/lustre-server: Cannot send after transport endpoint shutdown
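The text ls prints here is glibc's message for the ESHUTDOWN error code, so a monitoring check can look for that errno directly instead of waiting for Apache to pile up blocked processes. A minimal sketch, assuming a plain stat() on the mount point is enough to hit the error (ls -l may have hit it through lstat() or while opening the directory); mount_is_shut_down() is a hypothetical name:

```c
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>

/* Hypothetical health check for a Lustre mount point.
 * Returns 0 if stat() succeeds, 1 if it fails with ESHUTDOWN (the errno
 * glibc renders as "Cannot send after transport endpoint shutdown"),
 * and -1 for any other failure, leaving errno set for the caller. */
static int mount_is_shut_down(const char *path)
{
    struct stat st;

    if (stat(path, &st) == 0)
        return 0;
    return errno == ESHUTDOWN ? 1 : -1;
}
```

A mount whose server is merely unreachable, rather than deactivated, may block in stat() instead of failing, so a real monitor would want a timeout around this call.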

Debugging

We started tracing the container shutdown process to find out exactly what it does to switch the Lustre filesystem to inactive. Working our way from a simple strace, through writing shared libraries that replace the umount() function, to reading the vzctl sources, we found that the culprit was not vzctl code itself but one of the shutdown procedures that run once a runlevel change is requested. That led us to the container's /etc/rc0.d/S01halt script, which halts a CentOS system when the runlevel changes to 0. Analysing the script, we finally found the command responsible for setting the state to inactive, which turned out to be a simple:
umount -f
We quickly verified that running umount -f /mnt/lustre-server immediately triggered:
kernel: Lustre: setting import lustre-server-MDT0000_UUID INACTIVE by administrator request
According to the Lustre documentation, umount -f tells Lustre to stop without recovery. From the manual:
To stop a server:
$ umount -f /mnt/test/ost0
The '-f' flag means "force"; force the server to stop WITHOUT RECOVERY. Without the '-f' flag, "failover" is implied, meaning the next time the server is started it goes through the recovery procedure.
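On the command line the option is just -f, but what reaches the kernel is the umount2() system call with the MNT_FORCE flag, which is what util-linux's umount passes when -f is given. The sketch below issues the same request from C; force_umount() is a hypothetical wrapper and, like umount -f, it needs root (CAP_SYS_ADMIN). Pointed at a live Lustre mount it would trigger the same INACTIVE message, so it should only ever be tried on a test system.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/* Hypothetical wrapper: ask the kernel for a forced unmount of target,
 * as `umount -f target` does. Returns 0 on success, or -1 with errno
 * set (e.g. EPERM without CAP_SYS_ADMIN, ENOENT for a missing path). */
static int force_umount(const char *target)
{
    if (umount2(target, MNT_FORCE) == 0)
        return 0;

    int saved_errno = errno;  /* fprintf() may overwrite errno */
    fprintf(stderr, "umount2(%s, MNT_FORCE): %s\n", target, strerror(saved_errno));
    errno = saved_errno;
    return -1;
}
```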

What risk does it introduce?

It is quite a dangerous feature which, in certain circumstances, allows a denial-of-service attack. Consider the following setup: an OpenVZ-enabled host server running multiple containers and mounting the Lustre filesystem:
10.0.0.1@tcp0:/lustre-sites on /mnt/sites type lustre (rw)
which stores the website document roots for the Apache servers running inside the containers. For the containers to be able to read from the sites directory, the host server needs a few bind mounts:
/mnt/sites/container111_website on /vz/root/111/var/www/container111_website type none (rw,bind)
/mnt/sites/container222_website on /vz/root/222/var/www/container222_website type none (rw,bind)
/mnt/sites/container333_website on /vz/root/333/var/www/container333_website type none (rw,bind)
The bind mounts map the site directories into the containers' root filesystems so they can be read from inside the containers. Here is what the mounts look like inside container 333:
# mount
/dev/simfs on / type simfs (rw)
10.0.0.1@tcp:/lustre-sites on /var/www/container333_website type lustre (rw,noatime)
/proc on /proc type proc (rw)
none on /dev type tmpfs (rw)
none on /dev/pts type devpts (rw)
Since the OpenVZ kernel does not restrict use of the umount syscall in any way, in such a setup anybody with root access to one of the containers can perform a DoS attack simply by running:
umount -f /var/www/container333_website
inside the container. That immediately sets the 10.0.0.1@tcp0:/lustre-sites filesystem to INACTIVE, cutting containers 111, 222 and 333 off from their site directories. Given these consequences, the OpenVZ kernel should never allow a forced umount of a Lustre mount when the request comes from inside a container.

Solution for the problem - a kernel patch

Soon after we discovered this problem, I looked at the OpenVZ kernel sources to try to fix the issue, and came up with a kernel patch that makes the attack described above impossible. In short, the patch alters the umount() syscall so that, before actually unmounting a filesystem, it checks the current execution environment to see whether the call comes from inside a container. If it does, it checks the filesystem type: for a Lustre filesystem it clears the MNT_FORCE flag and performs a standard, harmless umount instead of a forced one. It also prints the following message:
kernel: Forced umount of lustre fs is not allowed inside container (333). Overriding MNT_FORCE flag.
to the logs. The patch will be submitted to the OpenVZ developers, and hopefully it will be incorporated into one of the next kernel releases. In the meantime, the patch and installation instructions are available at: http://code.fubra.com/wiki/OpenvzUmountLustrePatch
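The patch itself is available from the link above. To make the decision it adds easier to follow, here is a userspace model of it in C. It is not the patch code: filter_umount_flags() and its ctid argument are hypothetical, the model assumes (as OpenVZ does) that container ID 0 is the host environment, and it reuses the log message quoted above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>  /* MNT_FORCE, MNT_DETACH */

/* Userspace model of the check described in the post, NOT the patch.
 * ctid is a hypothetical stand-in for the caller's container ID, with
 * 0 meaning the host environment. Returns the flags umount should use. */
static int filter_umount_flags(int flags, unsigned int ctid, const char *fstype)
{
    if (ctid != 0 && (flags & MNT_FORCE) && strcmp(fstype, "lustre") == 0) {
        fprintf(stderr, "Forced umount of lustre fs is not allowed inside "
                        "container (%u). Overriding MNT_FORCE flag.\n", ctid);
        return flags & ~MNT_FORCE;  /* degrade to a normal umount */
    }
    return flags;  /* host callers and other filesystem types are untouched */
}
```

In this model only the combination of a container caller, MNT_FORCE and a lustre filesystem is changed; other flags such as MNT_DETACH pass through as they were, and the host keeps the ability to force an unmount.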

Posted in Technology, vCluster

