Petter Reinholdtsen

Entries tagged "raid".

Some notes on fault tolerant storage systems
1st November 2017

If you care about how fault tolerant your storage is, you might find these articles and papers interesting. They have formed how I think of when designing a storage system.

Several of these research papers are based on data collected from hundred thousands or millions of disk, and their findings are eye opening. The short story is simply do not implicitly trust RAID or redundant storage systems. Details matter. And unfortunately there are few options on Linux addressing all the identified issues. Both ZFS and Btrfs are doing a fairly good job, but have legal and practical issues on their own. I wonder how cluster file systems like Ceph do in this regard. After all, there is an old saying, you know you have a distributed system when the crash of a computer you have never heard of stops you from getting any work done. The same holds true if fault tolerance do not work.

Just remember, in the end, it do not matter how redundant, or how fault tolerant your storage is, if you do not continuously monitor its status to detect and replace failed disks.

As usual, if you use Bitcoin and want to show your support of my activities, please send Bitcoin donations to my address 15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b.

Tags: english, raid, sysadmin.
How to figure out which RAID disk to replace when it fail
14th February 2012

Once in a while my home server have disk problems. Thanks to Linux Software RAID, I have not lost data yet (but I was close this summer :). But once a disk is starting to behave funny, a practical problem present itself. How to get from the Linux device name (like /dev/sdd) to something that can be used to identify the disk when the computer is turned off? In my case I have SATA disks with a unique ID printed on the label. All I need is a way to figure out how to query the disk to get the ID out.

After fumbling a bit, I found that hdparm -I will report the disk serial number, which is printed on the disk label. The following (almost) one-liner can be used to look up the ID of all the failed disks:

for d in $(cat /proc/mdstat |grep '(F)'|tr ' ' "\n"|grep '(F)'|cut -d\[ -f1|sort -u);
do
    printf "Failed disk $d: "
    hdparm -I /dev/$d |grep 'Serial Num'
done

Putting it here to make sure I do not have to search for it the next time, and in case other find it useful.

At the moment I have two failing disk. :(

Failed disk sdd1:       Serial Number:      WD-WCASJ1860823
Failed disk sdd2:       Serial Number:      WD-WCASJ1860823
Failed disk sde2:       Serial Number:      WD-WCASJ1840589

The last time I had failing disks, I added the serial number on labels I printed and stuck on the short sides of each disk, to be able to figure out which disk to take out of the box without having to remove each disk to look at the physical vendor label. The vendor label is at the top of the disk, which is hidden when the disks are mounted inside my box.

I really wish the check_linux_raid Nagios plugin for checking Linux Software RAID in the nagios-plugins-standard debian package would look up this value automatically, as it would make the plugin a lot more useful when my disks fail. At the moment it only report a failure when there are no more spares left (it really should warn as soon as a disk is failing), and it do not tell me which disk(s) is failing when the RAID is running short on disks.

Tags: english, raid.

RSS Feed

Created by Chronicle v4.6