Friday, August 31, 2018

[Info] Another 2-drive failure on RAID5

I am currently trying to fix another 2-drive failure on my mdadm RAID5. The usual solution, --assemble --force, is not working. But first, some context.

The rig is a 7-drive mdadm RAID5 built from 2TB drives of mismatched brands. Five of those drives are attached to SATA ports on the motherboard, while the other two are in a 5-disk Rosewill SATA enclosure. This enclosure is attached via a SiI 3132 PCIe eSATA card that supports port multiplication.

03:00.0 RAID bus controller: Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller (rev 01)

This card and the enclosure came as part of Rosewill's RSV-5 system.

I had previously been using the RSV-4 system, which worked fairly well. That system eventually gave me errors in dmesg that I originally attributed to a bad eSATA cable. Eventually that enclosure died because of an errant power surge.

Replacing it with the RSV-5 yielded the same dmesg errors. No amount of replacement would alleviate the errors. Eventually I figured out that the errors went away after I upgraded from Ubuntu 14.04 to 16.04. It has performed well since.

Recently I ran into similar (but not identical) errors.

I have now tried my usual trick (--assemble --force, with the drives in the correct order). This time, though, the assembly fails:

sudo mdadm --verbose --assemble --force /dev/md127 /dev/sdf1 /dev/sdc1 /dev/sdb1 /dev/sdd1 /dev/sdg1 /dev/sde1 /dev/sdh1
mdadm: looking for devices for /dev/md127
mdadm: /dev/sdf1 is identified as a member of /dev/md127, slot 0.
mdadm: /dev/sdc1 is identified as a member of /dev/md127, slot 1.
mdadm: /dev/sdb1 is identified as a member of /dev/md127, slot 2.
mdadm: /dev/sdd1 is identified as a member of /dev/md127, slot 3.
mdadm: /dev/sdg1 is identified as a member of /dev/md127, slot 4.
mdadm: /dev/sde1 is identified as a member of /dev/md127, slot 5.
mdadm: /dev/sdh1 is identified as a member of /dev/md127, slot 6.
mdadm: added /dev/sdc1 to /dev/md127 as 1
mdadm: added /dev/sdb1 to /dev/md127 as 2
mdadm: added /dev/sdd1 to /dev/md127 as 3
mdadm: added /dev/sdg1 to /dev/md127 as 4 (possibly out of date)
mdadm: added /dev/sde1 to /dev/md127 as 5
mdadm: added /dev/sdh1 to /dev/md127 as 6 (possibly out of date)
mdadm: added /dev/sdf1 to /dev/md127 as 0
mdadm: /dev/md127 assembled from 5 drives - not enough to start the array.


I thought it was because the event counters were too far apart:

         Events : 120796
         Events : 120796
         Events : 120796
         Events : 120796
         Events : 120796
         Events : 120788
         Events : 120788


For posterity, here is the --examine output from a good drive:

/dev/sdb1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : f21f5306:c8a07e60:fad3a920:52a40d5b
  Creation Time : Tue Dec 21 20:21:48 2010
     Raid Level : raid5
  Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
     Array Size : 11721071616 (11178.09 GiB 12002.38 GB)
   Raid Devices : 7
  Total Devices : 5
Preferred Minor : 127
    Update Time : Thu Aug  9 23:39:20 2018
          State : clean
 Active Devices : 5
Working Devices : 5
 Failed Devices : 2
  Spare Devices : 0
       Checksum : ce567320 - correct
         Events : 120796
         Layout : left-symmetric
     Chunk Size : 64K
      Number   Major   Minor   RaidDevice State
this     2       8       17        2      active sync   /dev/sdb1
   0     0       8       81        0      active sync   /dev/sdf1
   1     1       8       33        1      active sync   /dev/sdc1
   2     2       8       17        2      active sync   /dev/sdb1
   3     3       8       49        3      active sync   /dev/sdd1
   4     4       0        0        4      faulty removed
   5     5       8       65        5      active sync   /dev/sde1
   6     6       0        0        6      faulty removed

A drive that was kicked:

/dev/sdh1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : f21f5306:c8a07e60:fad3a920:52a40d5b
  Creation Time : Tue Dec 21 20:21:48 2010
     Raid Level : raid5
  Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
     Array Size : 11721071616 (11178.09 GiB 12002.38 GB)
   Raid Devices : 7
  Total Devices : 7
Preferred Minor : 127
    Update Time : Thu Aug  9 23:31:29 2018
          State : clean
 Active Devices : 7
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 0
       Checksum : ce567281 - correct
         Events : 120788
         Layout : left-symmetric
     Chunk Size : 64K
      Number   Major   Minor   RaidDevice State
this     6       8      113        6      active sync   /dev/sdh1
   0     0       8       81        0      active sync   /dev/sdf1
   1     1       8       33        1      active sync   /dev/sdc1
   2     2       8       17        2      active sync   /dev/sdb1
   3     3       8       49        3      active sync   /dev/sdd1
   4     4       8       97        4      active sync   /dev/sdg1
   5     5       8       65        5      active sync   /dev/sde1
   6     6       8      113        6      active sync   /dev/sdh1
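
The two dumps disagree about the array: the good member's table marks slots 4 and 6 as "faulty removed", while the kicked member's older superblock still lists all seven as active. A small sketch for pulling the faulty slots out of a dump follows. The name faulty_slots is made up here, and it assumes the 0.90 superblock table layout shown in this post (other metadata versions print a different table).

```shell
# Sketch only: faulty_slots is a hypothetical helper.
# Reads one mdadm --examine dump (0.90 metadata layout, as shown in this
# post) on stdin and prints the RaidDevice number of every table row
# whose state starts with "faulty".
faulty_slots() {
    awk '$1 ~ /^[0-9]+$/ && $6 == "faulty" { print $5 }'
}

# Example use:
#   sudo mdadm --examine /dev/sdb1 | faulty_slots
```

Run against the /dev/sdb1 dump above, this should print 4 and 6, the slots held by /dev/sdg1 and /dev/sdh1.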

I opened this thread on the Ubuntu Forums:

https://ubuntuforums.org/showthread.php?t=2399971&p=13796868#post13796868

They suggested I try --run.

This yielded:

sudo mdadm --verbose --assemble --force --run /dev/md127 /dev/sdf1 /dev/sdc1 /dev/sdb1 /dev/sdd1 /dev/sdg1 /dev/sde1 /dev/sdh1
mdadm: looking for devices for /dev/md127
mdadm: /dev/sdf1 is identified as a member of /dev/md127, slot 0.
mdadm: /dev/sdc1 is identified as a member of /dev/md127, slot 1.
mdadm: /dev/sdb1 is identified as a member of /dev/md127, slot 2.
mdadm: /dev/sdd1 is identified as a member of /dev/md127, slot 3.
mdadm: /dev/sdg1 is identified as a member of /dev/md127, slot 4.
mdadm: /dev/sde1 is identified as a member of /dev/md127, slot 5.
mdadm: /dev/sdh1 is identified as a member of /dev/md127, slot 6.
mdadm: added /dev/sdc1 to /dev/md127 as 1
mdadm: added /dev/sdb1 to /dev/md127 as 2
mdadm: added /dev/sdd1 to /dev/md127 as 3
mdadm: added /dev/sdg1 to /dev/md127 as 4 (possibly out of date)
mdadm: added /dev/sde1 to /dev/md127 as 5
mdadm: added /dev/sdh1 to /dev/md127 as 6 (possibly out of date)
mdadm: added /dev/sdf1 to /dev/md127 as 0
mdadm: failed to RUN_ARRAY /dev/md127: Input/output error
mdadm: Not enough devices to start the array.

It was suggested that the Input/output error might be coming from one of the drives, rather than from md127 itself failing to be created.
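
One way to test that idea is a plain read check on each member. The sketch below is my own (read_check is a made-up name): it reads the start of a device with dd and reports whether the read succeeded. It only reads, never writes. It also only covers the first chunk of each partition, so a clean result does not prove the whole drive is healthy.

```shell
# Sketch only: read_check is a hypothetical helper.
# Reads the first COUNT MiB (default 1024) of PATH and discards the data.
# Returns dd's exit status: 0 if the read completed, non-zero otherwise.
read_check() {
    dd if="$1" of=/dev/null bs=1048576 count="${2:-1024}" 2>/dev/null
}

# Example use (as root), assuming the members are /dev/sdb1../dev/sdh1:
#   for d in /dev/sd[b-h]1; do
#       read_check "$d" || echo "$d: read failed"
#   done
```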

Googling has not turned up anything conclusive.


I did find this possible solution:

https://ubuntuforums.org/showthread.php?t=2276699&page=2&highlight=mdadm+event+counter
sudo mdadm --stop /dev/md2
sudo mdadm --zero-superblock /dev/sd[abcdhijk]
sudo mdadm --create --assume-clean /dev/md2 /dev/sd[abcdhijk]

It is cautioned as the "nuclear option", though, so I'm saving it for when all other alternatives have been exhausted.
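
The commands quoted from that thread are for a different array (/dev/md2 on whole disks). As far as I know, mdadm --create also needs at least a RAID level and a device count, and a re-create only preserves data if the geometry and the device order match the original exactly. The sketch below only prints a candidate command and does not run it. The function name print_recreate_cmd is invented for illustration, and the geometry values are copied from the --examine output in this post (0.90 metadata, raid5, 64K chunk, left-symmetric). Treat it as a starting point to check against the mdadm man page, not a tested recovery procedure.

```shell
# Sketch only: print_recreate_cmd is a hypothetical helper.
# Prints (does NOT execute) an mdadm --create command line using the
# geometry reported by --examine in this post. The first argument is the
# md device; the remaining arguments are the members in slot order
# (slot 0 first).
print_recreate_cmd() {
    md=$1
    shift
    echo "mdadm --create $md --assume-clean --metadata=0.90 --level=5" \
         "--raid-devices=$# --chunk=64 --layout=left-symmetric $*"
}

# Example use, with the slot order mdadm reported for this array:
#   print_recreate_cmd /dev/md127 /dev/sdf1 /dev/sdc1 /dev/sdb1 \
#       /dev/sdd1 /dev/sdg1 /dev/sde1 /dev/sdh1
```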

I will keep this page updated as I try more things.