An apology for this weekend's downtime

By Joe Gardiner Wednesday, 2nd November 2011

service-announcements

Following on from the post by our Senior System Architect, Mark Sutton (NFS Outages), I would like to describe our plan going forward and how we are going to respond to the issues over the weekend.

On Friday the NFS server that powers vCluster malfunctioned, causing all our web-facing servers to lock up while waiting for disk IO. The resulting downtime began at approximately 16:10 on Friday afternoon and continued intermittently throughout the weekend, with services not reliably available until Monday lunchtime.

Of course this is unacceptable, so let me begin by apologising to all our customers and vCluster users who were affected by this downtime. We have worked tirelessly throughout the weekend to restore normal service, but it is clear this is just the beginning.

To restore your faith in our service it is important to understand what went wrong from a technical perspective, and I hope that Mark’s post will give those who are technically inclined some comfort that we have identified and fixed the root cause. We also want to reassure you from a strategic perspective by showing what we have put in place to allay any concerns you might have in this area.

read more...

Posted in Service Announcements, vCluster | 3 Comments »

NFS Server Outages

By Mark Sutton Wednesday, 2nd November 2011

service-announcements

What happened?

At approximately 16:10 UK time on Friday (28/10/2011), our monitoring system picked up an outage on the vCluster platform and our engineering team began to investigate the problem.

A short while later the issue was narrowed down to a malfunctioning NFS layer that was causing the Edge web servers to lock up waiting for NFS IO. Further investigation showed that this behaviour was specific to one user: "apache-c1edge". This user is critical to the operation of the Edge web servers, which in turn are critical for the fast delivery of static content and for proxying to the PHP application servers.

Tracing the NFS lockup

Using the strace tool we were able to work out where the IO was hanging:

read more...

Posted in Service Announcements, vCluster | 3 Comments »

Scheduled downtime

By Joe Gardiner Monday, 13th September 2010

service-announcements

In response to the database failure we experienced two weeks ago, we have scheduled downtime to improve the redundancy of our infrastructure.

From 10pm on Monday 13th September, the database servers we use to host the majority of our clients will be taken offline. This will affect you if you are a vCluster customer, and/or you use xserve1.dc.fubra.net for your database hosting.

Of course we will endeavour to restore database hosting to its normal level of service as quickly as possible, and we estimate this will take no longer than 3 hours.

Let me apologise in advance for any inconvenience caused, but I assure you that this vital work will greatly improve your experience as a Fubra and CatN client.


Posted in Service Announcements | No Comments »

CatN unscheduled downtime

By Joe Gardiner Wednesday, 1st September 2010

service-announcements

CatN experienced unscheduled downtime for the majority of yesterday due to a major hardware failure.

Yesterday we suffered a major RAID card failure on a database server. Unfortunately we did not have spares in place, so the fix took much longer than it should have and the downtime was prolonged. The server has been rebuilding for most of the evening and the rebuild is concluding now. The time it takes to repair an array of the size we host is measured in 24-hour periods, not single hours, so waiting for this rebuild is not a viable restoration method.

If we experience RAID failures we need to have a plan for immediate recovery. Unfortunately our current graceful failure plan was not successful because of the catastrophic failure of the RAID controller, which left the entire RAID array in an inconsistent state.

read more...

Posted in Service Announcements | No Comments »