CatN unscheduled downtime
By Joe Gardiner Wednesday, 1st September 2010
CatN experienced unscheduled downtime for the majority of yesterday due to a major hardware failure.
Yesterday we suffered a major RAID card failure on a database server. Unfortunately we didn’t have the spares in place, so the fix took much longer than it should have and the downtime was prolonged. The server has been rebuilding for most of the evening and is finishing now. The time it takes to repair an array of the size we host is measured in 24-hour periods, not single hours, so waiting for this rebuild is not a viable restoration method.
When we experience RAID failures we need a plan for immediate recovery. Unfortunately our current graceful failure plan did not work, because the catastrophic failure of the RAID controller left the entire array in an inconsistent state.
The recovery procedure
After the failure we looked at restoring backups to a spare server as a possible solution. Two things delayed the restoration. First, all of our backups are currently remote, which is not a major delay but did slow things down a little. Second, we did not have a predetermined method for rapidly recovering backups.
We have now restored from our most recent backup in order to resume normal service. Unfortunately this backup was taken on Saturday morning (28th) between 06:30 a.m. and 07:12 a.m., which left it 48 hours behind. Clearly our planned nightly backup had failed.
As we alternate between running full and incremental backups, our system occasionally schedules a large number of full backups at once, resulting in a backlog. The hardware failure coincided with a scheduled backup run, meaning we were halfway through the process when the hardware failed. Unfortunately the backup of the failed database server had not yet run at the point of failure.
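A missed nightly backup like this one can go unnoticed unless something checks the age of the last successful run. The following is a minimal sketch of such a check, not our actual tooling: the function name `backup_is_stale`, the 26-hour threshold and the check times are all assumptions made for illustration. Only the Saturday 07:12 timestamp comes from the incident above.

```python
from datetime import datetime, timedelta

# Assumed threshold: a nightly backup plus a couple of hours of slack
# for backlogs. This value is illustrative, not a measured figure.
MAX_BACKUP_AGE = timedelta(hours=26)


def backup_is_stale(last_success: datetime, now: datetime,
                    max_age: timedelta = MAX_BACKUP_AGE) -> bool:
    """Return True if the last successful backup is older than max_age."""
    return now - last_success > max_age


# Last good backup finished on Saturday 28th at 07:12 (from the post).
last_success = datetime(2010, 8, 28, 7, 12)

# Hypothetical check times: one on Sunday morning, one on Monday morning.
sunday_check = datetime(2010, 8, 29, 8, 0)
monday_check = datetime(2010, 8, 30, 8, 0)

print(backup_is_stale(last_success, sunday_check))
print(backup_is_stale(last_success, monday_check))
```

Run on a schedule, a check of this kind would flag the missing backup on the first morning it exceeded the threshold rather than at restore time.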
Future Redundancy
To prevent a failure of similar magnitude in the future, we are reviewing our recovery procedure and will be investing in five levels of redundancy:
- RAID on all servers
- Master/Slave replication, allowing us to switch the Slave to Master in the case of a failure.
- An on-site cold spare database server, providing a full set of spares in the case of a Master and/or Slave failure.
- On-site daily backups (using a dedicated server in our Data Centre for rapid restores if Master/Slave replication fails).
- Off-site daily backups using Amazon EC2/S3.
Once the new hardware and backup solutions are in place, we will have fast recovery options for the vast majority of potential issues. If both of our servers fail, we will have a spare server on site. If we do have to resort to an archived backup, we will have more current, on-site backups available instead of only off-site backups as we have now.
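The order in which we would fall back through these layers can be sketched roughly as follows. This is one reading of the plan above under stated assumptions, not a finished runbook: the function name `choose_recovery`, its parameters and the returned descriptions are hypothetical.

```python
def choose_recovery(slave_healthy: bool,
                    onsite_backup_ok: bool,
                    offsite_backup_ok: bool) -> str:
    """Pick the fastest available recovery path after a Master failure.

    Hypothetical helper: the checks are plain booleans here, where a real
    system would query replication status and backup catalogues.
    """
    if slave_healthy:
        # Replication is the quickest path: no restore is needed.
        return "promote Slave to Master"
    if onsite_backup_ok:
        # Both database servers are gone, so use the cold spare hardware
        # with the most recent backup held in the Data Centre.
        return "restore on-site backup to cold spare"
    if offsite_backup_ok:
        # Slowest path: the backup must be fetched from off site first.
        return "restore off-site backup to cold spare"
    return "no recovery path available"


# Example: Master and Slave both failed, on-site backup is intact.
print(choose_recovery(slave_healthy=False,
                      onsite_backup_ok=True,
                      offsite_backup_ok=True))
```

Each step is only reached when every faster option above it is unavailable, which is the point of layering the redundancy.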
Finally, our InnoDB backups did not work successfully. We are currently reviewing this and will post an update here shortly.
Let me offer my sincere apologies to all our vCluster customers. We take any major failure extremely seriously, and we are working rapidly to prevent failures of this magnitude from occurring again. If you continue to experience problems, please email support@catn.com, and remember to follow us on Twitter for further updates.
Posted in Service Announcements