NFS Server Outages
By Mark Sutton Wednesday, 2nd November 2011
What happened?
At approximately 16:10 UK time on Friday (28/10/2011), our monitoring system picked up an outage on the vCluster platform and our engineering team began to investigate.
A short while later the issue was narrowed down to a malfunctioning NFS layer causing the Edge web servers to lock up waiting for NFS IO. Further investigation showed that this behaviour was specific to one user – “apache-c1edge”. This user is critical to the operation of the Edge web servers, which in turn are critical for the fast delivery of static content and proxying to the PHP application servers.
Tracing the NFS lockup
Using the strace tool we were able to work out where the IO was hanging:
open("/usr/lib/locale/locale-archive", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=56458352, ...}) = 0
mmap2(NULL, 2097152, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7d4c000
mmap2(NULL, 212992, PROT_READ, MAP_PRIVATE, 3, 0x1391) = 0xb7d18000
mmap2(NULL, 24576, PROT_READ, MAP_PRIVATE, 3, 0x13ca) = 0xb7d12000
mmap2(NULL, 4096, PROT_READ, MAP_PRIVATE, 3, 0x13f4) = 0xb7d11000
close(3) = 0
ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(1, TIOCGWINSZ, {ws_row=48, ws_col=161, ws_xpixel=966, ws_ypixel=672}) = 0
stat64("/sites/client-example",
From this we could see that the Edge server was hanging in a stat64() call on an NFS path, waiting on the underlying NFS ACCESS request. This problem would not be fixed until Sunday night (see later).
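A hang like this can also be detected without attaching strace. The following is a minimal sketch, not the tooling we used during the incident: it stats a path in a child process and reports a hang if the call does not return in time. The function name `probe_path` and the timeout value are assumptions made for illustration.

```python
import subprocess
import sys


# Hypothetical helper, not part of the original investigation.
def probe_path(path, timeout_s=5.0):
    """Classify a stat() of `path` as 'ok', 'error', or 'hung'."""
    # Run the stat in a child process so a stuck NFS call cannot block the caller.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return_code = child.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked in uninterruptible IO may not
        # exit until the NFS server answers, so we do not wait for it here.
        child.kill()
        return "hung"
    return "ok" if return_code == 0 else "error"
```

Calling `probe_path("/sites/client-example")` from a monitoring job would distinguish a missing directory ("error") from a stalled mount ("hung"), which is the distinction the strace output made for us by hand.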
Filesystem consistency check
During the process of unmounting/remounting the storage, we found that the filesystem was overdue a consistency check.
This is because we have been concentrating on building our new vCluster platform with 2N redundancy rather than upgrading the existing system. Only having a single NFS server meant that we were unable to take the storage down to perform a file system check.
We decided to run a full filesystem check to rule out errors due to undetected corruption.
This process took around 3 hours to complete, during which time we continued to investigate the NFS IO lockup using a temporary storage mount. Upon completion the NFS hang remained. Several minor issues were fixed by the filesystem check but there were no lost inodes (files).
Edge server upgrades
To rule out a kernel bug we upgraded the kernels on the affected Edge servers. As soon as these servers were upgraded, the system came back online.
Although the upgrade certainly seemed to fix the problem, we could not find a specific bug fix in the new kernel that would explain it, so we remained skeptical that the issue was resolved. With the cluster working, our engineers continued to monitor the system closely.
What happened on Sunday afternoon?
Sure enough, on Sunday afternoon the problem reappeared. Our monitoring system picked up the first alert at 15:13 UK time and shortly afterwards our engineers confirmed that the same problem was back.
At this point we considered several possibilities:
- A network/hardware issue
- An issue with LDAP lookups
- A bug in the NFS server kernel or NFS utils
One engineer was despatched to investigate each possibility.
A network MTU issue
The first problem to be found was a network MTU issue between the NFS and LDAP servers. Our setup uses a mixture of bonding and VLANs, and the LDAP VLAN on the NFS server had inherited an incorrect MTU of 9000 from the underlying device.
The MTU was set back to the normal 1500 bytes setting and this appeared to get the cluster working again. Unfortunately the cluster soon ground back to a halt.
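An inherited MTU like this is easy to miss by eye. As a rough sketch, assuming a Linux host that exposes interface settings under /sys/class/net, the check below compares each interface's MTU with the value we expect. The interface names in the usage note are hypothetical and do not reflect our actual configuration.

```python
from pathlib import Path


def read_mtus(sys_net=Path("/sys/class/net")):
    """Read the current MTU of every interface listed under sysfs."""
    mtus = {}
    for iface in sys_net.iterdir():
        mtu_file = iface / "mtu"
        if mtu_file.is_file():  # skips non-interface entries such as bonding_masters
            mtus[iface.name] = int(mtu_file.read_text().strip())
    return mtus


def unexpected_mtus(mtus, expected):
    """Return the interfaces whose MTU differs from the expected value."""
    return {
        name: mtu
        for name, mtu in mtus.items()
        if name in expected and mtu != expected[name]
    }
```

For example, `unexpected_mtus(read_mtus(), {"bond0": 9000, "bond0.20": 1500})` would report a VLAN interface `bond0.20` that had picked up 9000 from the bond underneath it.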
NIC Errors
Further investigation found DMA errors on one of the NFS server network interfaces. Immediately we assessed our options – we could either replace the card or migrate to a new, known good server. Unfortunately the NIC was embedded on the server mainboard so to replace it would essentially mean pulling a board from our spares, which would be a lengthy process.
Fortunately we had a recently installed NFS server at hand on our development platform, running on our latest, most powerful hardware and CentOS 6. This build was ultimately destined for the next platform release but it had been running long enough to show stability.
We migrated vCluster storage to this new machine.
rpc.mountd --manage-gids issues
Part of the CatN security model requires us to use the “--manage-gids” option to rpc.mountd. The reason for this is that the credentials an NFS client sends with its requests (such as the NFS ACCESS request) can only carry auxiliary membership of up to 16 groups.
Turning on this option makes the server perform its own group lookup against LDAP instead of trusting the client-supplied list, which works around the limit.
Unfortunately it turns out that there is a subtle bug in the upstream nfs-utils code shipped in CentOS 6 which prevents managed GIDs from working when more than 100 group memberships are involved. Our development system had not yet reached this number with its test accounts, so the issue had not shown up.
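To see which accounts are affected by these two thresholds, one can count group memberships as the name service reports them. This is a minimal sketch under assumptions: the function names and result strings are invented for illustration, and `count_groups` relies on the host resolving users and groups through its normal name service (LDAP in our case).

```python
import os
import pwd

AUTH_SYS_MAX_GROUPS = 16  # most groups the client's credentials can carry
REPORTED_BUG_LEVEL = 100  # from the text: managed GIDs failed above this


def count_groups(username):
    """Number of groups the local name service reports for `username`."""
    primary_gid = pwd.getpwnam(username).pw_gid
    return len(os.getgrouplist(username, primary_gid))


def classify(group_count):
    """Describe how a membership count relates to the two thresholds."""
    if group_count <= AUTH_SYS_MAX_GROUPS:
        return "fits in client credentials"
    if group_count <= REPORTED_BUG_LEVEL:
        return "needs server-side lookup"
    return "needs server-side lookup, above the level where the bug appeared"
```

Running `classify(count_groups(name))` over the test accounts on a development system would have shown that none of them exercised the case that failed in production.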
Fortunately a patch was readily available so our engineers built a patched nfs-utils package based on the default 1.2.2 version. Here is the commit in the linux-nfs tree:
And here is the post that led to finding this fix:
http://marc.info/?l=linux-nfs&m=131537578720906&w=2
Final checks
After a brief period online we took the cluster back offline for final checks. During this time we performed a full set of tests to ensure that all layers of the vCluster platform were performing correctly.
Monday Morning
On Sunday night we installed our patched nfs-utils 1.2.2 package, and although it got the vCluster platform running we were still seeing some warnings in the log. At 4am on Monday morning the platform went down again; we believe this was the same issue.
In response we upgraded nfs-utils to version 1.2.3 plus the managed GIDs patch. This made all the warnings in the log go away and NFS started working again immediately. The patched build has since survived an intense backup load and has been stable for more than 24 hours, producing no warning messages in the log.
The follow up
- The response team will regroup after 48 hours of stability to review what happened, and see how our response could be improved in future.
- We will schedule another maintenance window to allow us some time to tidy up properly; for example our custom RPM is manually installed at the moment, but we need to put it into a proper yum repository and configure the system to track our custom package instead.
- We will continue to monitor the NFS very closely and be ready to react should any further issues crop up.
3 Comments »
[...] single point of failure with our NFS server. For the techies amongst you full details are provided here by Senior System Architect at CatN, Mark Sutton. CatN has also apologised to its customers and [...]
As a techie and former ISP sysadmin I really appreciated the in-depth technical detail you put into this post. Service failures are never welcome, but by being so open and transparent it does give me confidence that you are doing everything you can to make improvements. I’m looking forward to vCluster 2.0 – keep up the good work.
Thanks Dominic. Your words of encouragement are much appreciated. We’re all getting very excited in the office, not only because Christmas is coming but also because vCluster 2.0 will be launched in the New Year. We will of course announce it and make lots of noise when it’s ready.
Joe