In the not too distant past, Russell Coker wrote about RAID Issues and referred in part to a report containing data from 1,530,000 disks running at NetApp customer sites (also available in PDF or PostScript).
Interesting reading, for sure - particularly if you run any large dataset and want to ensure it stays intact!
It stirred up from the recesses of my memory a report compiled in February 2007 by Google. The report, 'Failure Trends in a Large Disk Drive Population', was presented at the USENIX FAST '07 conference.
The Google report looked at actual hardware failures of disks Google saw over several years. The numbers were crunched (Is there anything else Google does besides crunch large datasets?). Some interesting results popped out:
- The disks studied were SATA or PATA consumer-grade disks, either 5400RPM or 7200RPM, ranging in size from 80GB to 400GB and commissioned any time after 2001. Interestingly, these are the same disks that many of us will find in our own machines. There were no Enterprise, SCSI or SAS disks in the study.
- HDDs had a higher tendency to fail at the start of their life or after more than 3 years of use.
- Low or Heavy Utilization of the HDD resulted in greater loss than 'Medium use'.
- Disks that had surface scan errors had a greater likelihood of failure over the next 60 days.
- HDDs operating at cool temperatures (15-30°C) had a much greater failure rate in the first 3 years, whilst disks older than 3 years had a greater failure rate the higher the operating temperature. The ideal range for running disks to minimise the failure rate was 30-35°C.
- A disk that spends more than 50% of its powered-on time above 40°C is a good indication of a possible problem.
- SMART data analysis revealed that it is not a reliable way to determine if a disk is about to fail: 36% of all failed disks had no SMART errors. Of the disks that did have SMART errors, the majority (~72%) were seek errors. So basically, expect to see seek errors; beyond that you appear to be running blind with SMART.
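The temperature thresholds in the list above are easy enough to check against your own readings. Here is a minimal Python sketch, assuming you already collect temperature samples per disk at even intervals while it is powered on. The function name and the shape of the result are my own invention for illustration; only the 30-35°C band and the "more than 50% of the time above 40°C" rule come from the summary above.

```python
from statistics import mean

# Thresholds taken from the summary above (degrees Celsius).
IDEAL_LOW, IDEAL_HIGH = 30.0, 35.0
HOT_LIMIT = 40.0
HOT_FRACTION_LIMIT = 0.5


def summarise_temperatures(samples):
    """Summarise evenly spaced temperature samples (in °C) for one disk.

    Hypothetical helper: assumes each sample represents an equal slice
    of the disk's powered-on time.
    """
    if not samples:
        raise ValueError("need at least one temperature sample")
    # Fraction of samples strictly above the 40°C mark.
    hot_fraction = sum(1 for t in samples if t > HOT_LIMIT) / len(samples)
    average = mean(samples)
    return {
        "average": average,
        "hot_fraction": hot_fraction,
        "in_ideal_range": IDEAL_LOW <= average <= IDEAL_HIGH,
        "possible_problem": hot_fraction > HOT_FRACTION_LIMIT,
    }
```

Feed it a day's or a month's worth of readings for one disk; a `possible_problem` of `True` is only a prompt to go and look, not a prediction of failure.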
Looking at both the Google Report and the NetApp Storage Report, some 'best practices' become apparent to help you minimise your data loss:
- HDDs are mechanical devices. Expect failure and plan for it.
- Attempt to operate disks in the 30-35°C temperature range to extend their life.
- Monitor disk temperatures. An extended period where a disk's temperature rises unexpectedly (ie: not under any more load than normal) is often an early sign that failure is close at hand.
- Attempt to purchase disks from different batches. This will avoid a common manufacturing fault taking all your disks out at the same time. (Many storage companies will do that for you as part of their service.)

- HDD failure follows the Bathtub curve. The 'right side' of the bath kicks in around 3 years. Getting life out of your HDDs beyond that is a bonus, treat it as such!
- HDDs are cheap these days. Don't be cheap -- implement RAID-6 over RAID-5 as a matter of course. Ensure you use Double-parity on your RAID-6 implementation. You'll find most recent versions of RAID-6 implement double-parity as 'standard'. (NB: If using NetApp - it's coined as RAID-DP.) Some vendors even allow you to upgrade the storage firmware online if using RAID-6 with Double Parity (NetApp for example has this feature).
- Hot-Spares in your RAID-6 array are a very good idea. For the cost of the array, consider them an insurance policy against the dreaded multiple disk failure which could potentially toast your array.
- How important is your data? Can you put a cost on it? If the cost of replacement is extreme, consider redundancy options. This could include: Implementing RAID-60 (or RAID-6+0), archive/backup solutions, or even a total Disaster Recovery (DR) solution.
- Air flow around an array unit is critical. Don't cram your arrays into a fully populated rack, as minimal air flow will ensue. This raises the temperature of the HDDs and of the storage enclosure in general. Remember you're aiming to keep your disks at 30-35°C.
- Keep a logbook of when each drive was added/replaced. You then know that the 18-month-old disk is less likely to fail than the disk that has been whirring away for 5 years. Record size, manufacturer and model/run. You may see some 'patterns' emerge in your own failure rates that will help with additional purchase decisions (ie: particular makes/models to avoid!)
- Perform regular 'scans' or 'checks' of your HDD health. Knowing the current state of your disks allows you to plan for the inevitable failure.
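The logbook idea above doesn't need anything fancy. As a rough sketch, assuming you are happy to keep the records in a small Python script, something like the following would do. The class, field names and the 90-day "early life" window are my own assumptions for illustration; only the 3-year mark comes from the reports discussed above.

```python
from dataclasses import dataclass
from datetime import date

# The 3-year figure comes from the text above; the 90-day "early life"
# window is an illustrative guess, not a number from either report.
EARLY_LIFE_DAYS = 90
WEAR_OUT_DAYS = 3 * 365


@dataclass
class DriveRecord:
    serial: str
    manufacturer: str
    model: str
    size_gb: int
    installed: date

    def age_days(self, today):
        return (today - self.installed).days

    def risk_phase(self, today):
        """Place the drive on the bathtub curve by age alone."""
        age = self.age_days(today)
        if age < EARLY_LIFE_DAYS:
            return "early-life"
        if age > WEAR_OUT_DAYS:
            return "wear-out"
        return "mid-life"


def drives_to_watch(logbook, today):
    """Return drives at either risky end of the curve, oldest first."""
    risky = [d for d in logbook if d.risk_phase(today) != "mid-life"]
    return sorted(risky, key=lambda d: d.installed)
```

Age is only one input, of course. Grouping the same records by manufacturer and model is where the 'patterns' mentioned above would start to show up.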
Other tips I've picked up over the years looking after Enterprise systems attached to large storage arrays:
- There is a reason that the ASX demands that publicly listed companies with 'mission critical' services for the public (items such as water supply, electricity/gas, and telecommunications) MUST ensure their 'mission critical' applications have full Disaster Recovery operation on hot-standby.
- Don't cut corners by skipping items like Hot-Spares and Double Parity. When you have a drive failure (not if), you'll be glad you spent that little bit extra.
- A well planned storage solution should have drive failures, it shouldn't have storage failures. (Don't tell Sun/Internode that!)

- Ensure you have clean-filtered power. Don't assume it. Power voltage fluctuations and disk writes don't play nicely.
- Don't put all your eggs in one basket. Bad things do happen (I won't mention Internode again). Spreading your data out over multiple storage arrays helps add another level of redundancy. This is a GoodThing™.
- It's never a bad idea to have spare disks in storage waiting for that inevitable failure, rather than relying on a vendor to have your disk capacity/model available. Every day that dead disk isn't replaced you are putting your array at risk.
- Look after your disks and arrays, and they generally will look after you (ie: ensuring you don't spend countless hours in the early mornings attempting to recover the unrecoverable!)
Above all, may your next disk failure not be a complete failure.
I'm interested to hear others' stories, experiences and the ideas they have put in place to keep their data nice and cosy and their HDDs whirring in a constant and reliable state. Feel free to drop me a comment.
As a side note: over the past 18 months I've been putting RAID-1 in place on my desktop machines, as disk prices are now so cheap. For the price of a few hundred dollars, why bother with the stress of a hard-disk failure? Having said that, it's no replacement for backing up your 'important' information. RAID-1 is still susceptible to the dreaded multiple disk failure issue.

Having said that, I've been lucky that that hasn't occurred. (Touch wood!)