Petter Reinholdtsen: Entries Tagged raid

RAID status from LSI Megaraid controllers in Debian

17th April 2024

I am happy to report that the megactl package, useful to fetch RAID status when using the LSI Megaraid controller, is now available in Debian. It passed NEW a few days ago, is now in unstable, and will probably show up in testing in a week's time. The new version should provide AppStream hardware mapping and should integrate nicely with isenkram.

As usual, if you use Bitcoin and want to show your support of my activities, please send Bitcoin donations to my address 15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b.

Tags: english, isenkram, raid.

RAID status from LSI Megaraid controllers using free software

3rd March 2024

Over the last few days I have revisited RAID setup using the LSI Megaraid controller. This is a family of controllers that Dell calls PERC, found in several old PowerEdge servers, and I recently got my hands on one of these. I had forgotten how to handle this RAID controller in Debian, so I had to take a peek at the Debian wiki page "Linux and Hardware RAID: an administrator's summary" to remember what kind of software is available to configure and monitor the disks and controller. I prefer Free Software alternatives to proprietary tools, as the latter tend to fall into disarray once the manufacturer loses interest, and often do not work with newer Linux distributions. Sadly there is no free software tool to configure the RAID setup, only to monitor it. RAID can provide improved reliability and resilience in a storage solution, but only if it is regularly checked and any broken disks are replaced in time. I thus want to ensure some automatic monitoring is available.

In the discovery process, I came across an old free software tool to monitor PERC2, PERC3, PERC4 and PERC5 controllers, which to my surprise is not present in Debian. To help change that I created a request for packaging of the megactl package, and tried to track down a usable version. The original project site is on SourceForge, but as far as I can tell that project has been dead for more than 15 years. I managed to find a more recent fork on GitHub from user hmage, but it is unclear to me if this is still being maintained. It has not seen many improvements since 2016. A more up-to-date edition is a git fork of the original GitHub fork, by user namiltd, and this newer fork seems a lot more promising. The owner of this GitHub repository has replied to change proposals within hours, and had already added some improvements and support for more hardware. Sadly he is reluctant to commit to maintaining the tool, and stated in my first pull request that he thinks a new release should be made based on the git repository owned by hmage. I perfectly understand this reluctance, as I feel the same about maintaining yet another package in Debian when I barely have time to take care of the ones I already maintain. Still, I do not really have high hopes that hmage will have time to spend on it, and hope namiltd will change his mind.

In any case, I created a draft package based on the namiltd edition and put it under the debian group on salsa.debian.org. If you own a Dell PowerEdge server with one of the PERC controllers, or any other RAID controller using the megaraid or megaraid_sas Linux kernel modules, you might want to check it out. If enough people are interested, perhaps the package will make it into the Debian archive.
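
To check whether a machine has one of these kernel modules loaded, the module list in /proc/modules can be inspected. The following is only a minimal sketch: the function name megaraid_tool is made up for this illustration, it matches the two module names mentioned in this post, and it will not notice a driver that is built into the kernel rather than loaded as a module.

```shell
# Read a module listing in /proc/modules format on stdin and print the
# tool that goes with each of the two module names mentioned in this post.
# The first field of every /proc/modules line is the module name.
megaraid_tool() {
    awk '$1 == "megaraid"     { print "megaraid: try megactl" }
         $1 == "megaraid_sas" { print "megaraid_sas: try megasasctl" }'
}

# Usage:
#   megaraid_tool < /proc/modules
```

No output means neither module is loaded.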

There are two tools provided, megactl for the megaraid Linux kernel module, and megasasctl for the megaraid_sas Linux kernel module. The simple output from the command on one of my machines looks like this (yes, I know some of the disks have problems. :).

# megasasctl 
a0       PERC H730 Mini           encl:1 ldrv:2  batt:good
a0d0       558GiB RAID 1   1x2  optimal
a0d1      3067GiB RAID 0   1x11 optimal
a0e32s0     558GiB  a0d0  online   errs: media:0  other:19
a0e32s1     279GiB  a0d1  online  
a0e32s2     279GiB  a0d1  online  
a0e32s3     279GiB  a0d1  online  
a0e32s4     279GiB  a0d1  online  
a0e32s5     279GiB  a0d1  online  
a0e32s6     279GiB  a0d1  online  
a0e32s8     558GiB  a0d0  online   errs: media:0  other:17
a0e32s9     279GiB  a0d1  online  
a0e32s10    279GiB  a0d1  online  
a0e32s11    279GiB  a0d1  online  
a0e32s12    279GiB  a0d1  online  
a0e32s13    279GiB  a0d1  online  

#

In addition to displaying a simple status report, it can also test individual drives and print the various event logs. Perhaps you too find it useful?
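
For automatic monitoring, the status report can be filtered so that only lines needing attention are left. This is a rough sketch that assumes the line layout seen in the sample output above (adapter, logical drive and disk lines, with the states "good", "optimal" and "online" meaning all is well). The function name megasas_problems is made up here, and other controllers or versions may print other fields.

```shell
# Read megasasctl status output on stdin and print only the lines that
# look unhealthy. The layout is ASSUMED from the sample output in this post.
megasas_problems() {
    awk '
        # Adapter line, e.g. "a0  PERC H730 Mini  encl:1 ldrv:2  batt:good"
        $1 ~ /^a[0-9]+$/ {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^batt:/ && $i != "batt:good")
                    print "battery: " $0
            next
        }
        # Logical drive line, e.g. "a0d0  558GiB RAID 1  1x2  optimal"
        $1 ~ /^a[0-9]+d[0-9]+$/ {
            if ($NF != "optimal")
                print "logical drive: " $0
            next
        }
        # Disk line, e.g. "a0e32s0  558GiB  a0d0  online  errs: media:0  other:19"
        $1 ~ /^a[0-9]+e[0-9]+s[0-9]+$/ {
            if ($4 != "online" || $0 ~ /(media|other):[1-9]/)
                print "disk: " $0
        }
    '
}

# Usage:
#   megasasctl | megasas_problems
```

With the output above, this would list the two disks with non-zero error counters. Run from cron, any output would normally be mailed to the administrator.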

In the packaging process I provided some patches upstream to improve installation and ensure an AppStream metainfo file is provided to list all supported hardware, to allow isenkram to propose the package on all servers with a relevant PCI card.

Tags: english, isenkram, raid.

Some notes on fault tolerant storage systems

1st November 2017

If you care about how fault tolerant your storage is, you might find these articles and papers interesting. They have shaped how I think when designing a storage system.

USENIX ;login: Redundancy Does Not Imply Fault Tolerance. Analysis of Distributed Storage Reactions to Single Errors and Corruptions by Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau
ZDNet Why RAID 5 stops working in 2009 by Robin Harris
ZDNet Why RAID 6 stops working in 2019 by Robin Harris
USENIX FAST'07 Failure Trends in a Large Disk Drive Population by Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso
USENIX ;login: Data Integrity. Finding Truth in a World of Guesses and Lies by Doug Hughes
USENIX FAST'08 An Analysis of Data Corruption in the Storage Stack by L. N. Bairavasundaram, G. R. Goodson, B. Schroeder, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau
USENIX FAST'07 Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you? by B. Schroeder and G. A. Gibson.
USENIX ;login: Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics by Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky
SIGMETRICS 2007 An analysis of latent sector errors in disk drives by L. N. Bairavasundaram, G. R. Goodson, S. Pasupathy, and J. Schindler

Several of these research papers are based on data collected from hundreds of thousands or millions of disks, and their findings are eye-opening. The short story is simple: do not implicitly trust RAID or redundant storage systems. Details matter. And unfortunately there are few options on Linux addressing all the identified issues. Both ZFS and Btrfs are doing a fairly good job, but have legal and practical issues of their own. I wonder how cluster file systems like Ceph do in this regard. After all, there is an old saying: you know you have a distributed system when the crash of a computer you have never heard of stops you from getting any work done. The same holds true if fault tolerance does not work.

Just remember, in the end, it does not matter how redundant or how fault tolerant your storage is, if you do not continuously monitor its status to detect and replace failed disks.

Tags: english, raid, sysadmin.

How to figure out which RAID disk to replace when it fails

14th February 2012

Once in a while my home server has disk problems. Thanks to Linux Software RAID, I have not lost data yet (but I was close this summer :). But once a disk starts to behave funny, a practical problem presents itself. How to get from the Linux device name (like /dev/sdd) to something that can be used to identify the disk when the computer is turned off? In my case I have SATA disks with a unique ID printed on the label. All I need is a way to query the disk to get the ID out.

After fumbling a bit, I found that hdparm -I will report the disk serial number, which is printed on the disk label. The following (almost) one-liner can be used to look up the ID of all the failed disks:

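# Find every RAID member marked as failed, (F), in /proc/mdstat
# and print its serial number. hdparm needs to run as root.
for d in $(grep '(F)' /proc/mdstat | tr ' ' '\n' | grep '(F)' | cut -d'[' -f1 | sort -u)
do
    printf 'Failed disk %s: ' "$d"
    hdparm -I "/dev/$d" | grep 'Serial Num'
done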

Putting it here to make sure I do not have to search for it the next time, and in case others find it useful.

At the moment I have two failing disks. :(

Failed disk sdd1:       Serial Number:      WD-WCASJ1860823
Failed disk sdd2:       Serial Number:      WD-WCASJ1860823
Failed disk sde2:       Serial Number:      WD-WCASJ1840589
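
Note that sdd1 and sdd2 are partitions on the same physical disk, which is why the same serial number shows up twice. A small variation can strip the partition number so each disk is listed only once. This is just a sketch: the function name failed_disks is made up for the example, and the stripping assumes sdX-style device names (it would not work for names like nvme0n1p1).

```shell
# Read /proc/mdstat content on stdin and print each whole disk that has
# at least one failed member, once. ASSUMES sdX-style names, where the
# partition number is simply the trailing digits.
failed_disks() {
    grep '(F)' | tr ' ' '\n' | grep '(F)' | cut -d'[' -f1 |
        sed 's/[0-9]*$//' | sort -u
}

# Usage (as root, since hdparm needs it):
#   for d in $(failed_disks < /proc/mdstat); do
#       printf 'Failed disk %s: ' "$d"
#       hdparm -I "/dev/$d" | grep 'Serial Num'
#   done
```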

The last time I had failing disks, I added the serial number on labels I printed and stuck on the short sides of each disk, to be able to figure out which disk to take out of the box without having to remove each disk to look at the physical vendor label. The vendor label is at the top of the disk, which is hidden when the disks are mounted inside my box.

I really wish the check_linux_raid Nagios plugin for checking Linux Software RAID in the nagios-plugins-standard Debian package would look up this value automatically, as it would make the plugin a lot more useful when my disks fail. At the moment it only reports a failure when there are no more spares left (it really should warn as soon as a disk is failing), and it does not tell me which disk(s) are failing when the RAID is running short on disks.
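
A check that warns as soon as any member is marked as failed could look something like the sketch below. It is not part of check_linux_raid; the function name check_md_failed is made up here. It only follows the usual Nagios plugin convention of one status line plus exit code 0 for OK and 1 for WARNING.

```shell
# Read /proc/mdstat content on stdin. Print one Nagios-style status line
# and return 1 (WARNING) if any member is marked (F), else 0 (OK).
check_md_failed() {
    failed=$(grep '(F)' | tr ' ' '\n' | grep '(F)' | cut -d'[' -f1 |
        sort -u | tr '\n' ' ')
    failed=${failed% }    # drop the trailing space left by tr
    if [ -n "$failed" ]; then
        echo "WARNING: failed RAID members: $failed"
        return 1
    fi
    echo "OK: no RAID members marked as failed"
    return 0
}

# Usage:
#   check_md_failed < /proc/mdstat
```

The device names in the warning can then be fed to hdparm -I to get the serial numbers.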

Tags: english, raid.
