draid: add failure domains support

Currently, the only way to tolerate the failure of the whole
enclosure is to configure several draid vdevs in the pool, each
vdev having disks from different enclosures. But this essentially
degrades draid to raidz and defeats the purpose of having fast
sequential resilvering on wide pools with draid.

This patch allows configuring several groups of children in the
same row in one draid vdev. In each such group, called a failure
group, the user can configure disks belonging to different
enclosures (failure domains). For example, in case of 10 such
enclosures with 10 disks each, the user can put 1st disk from each
enclosure into 1st group, 2nd disk from each enclosure into 2nd
group, and so on. If one enclosure fails, only one disk from each
group would fail, which won't affect draid operation, and each
group would have enough redundancy to recover the stored data. Of
course, in case of draid2 - two enclosures can fail at a time, in
case of draid3 - three enclosures (provided there are no other
disk failures in each group).
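
The 10x10 arrangement above can be sketched in a few lines of
Python. This is only an illustration of the grouping idea, not code
from the patch; the function names and the (enclosure, disk) tuples
are made up for the example.

```python
def build_failure_groups(n_enclosures, disks_per_enclosure):
    """Put the d-th disk of every enclosure into failure group d.

    Disks are modelled as (enclosure, disk) tuples; this naming is an
    assumption made for the illustration only.
    """
    return [[(e, d) for e in range(n_enclosures)]
            for d in range(disks_per_enclosure)]


def failed_disks_per_group(groups, failed_enclosure):
    """Count how many disks each group loses when one enclosure fails."""
    return [sum(1 for (e, _) in group if e == failed_enclosure)
            for group in groups]


groups = build_failure_groups(10, 10)
# Losing enclosure 3 costs every group exactly one disk, which any
# draid parity level can absorb.
losses = failed_disks_per_group(groups, failed_enclosure=3)
```

Every entry of `losses` is 1 here, so each group stays within its
parity budget after a single enclosure failure.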

In order to preserve fast sequential resilvering in case of a
disk failure, the groups must share all disks between themselves,
and this is achieved by shuffling the disks between the groups.
But only i-th disks in each group are shuffled between themselves,
i.e. the disks from the same enclosures, after that they are
shuffled within each group, like it is done today in an ordinary
draid. Thus, no more than one disk from any enclosure can appear
in any failure group as a result of this shuffling.
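
A minimal sketch of this two-step shuffle, assuming each group lists
its disks in enclosure order. `random.shuffle` is only a stand-in
for whatever permutation the draid code actually uses, and
`shuffle_groups` is a hypothetical name.

```python
import random


def shuffle_groups(groups, rng):
    """Two-step shuffle that keeps one disk per enclosure in each group.

    groups[g][i] is the disk of enclosure i placed in group g
    (an assumed layout for this illustration).
    """
    result = [list(group) for group in groups]
    n_groups = len(result)
    n_children = len(result[0])

    # Step 1: shuffle the i-th disks between the groups. All of them
    # come from the same enclosure, so every group still holds exactly
    # one disk per enclosure afterwards.
    for i in range(n_children):
        column = [result[g][i] for g in range(n_groups)]
        rng.shuffle(column)
        for g in range(n_groups):
            result[g][i] = column[g]

    # Step 2: shuffle the disks within each group, as ordinary draid
    # does for its single group.
    for group in result:
        rng.shuffle(group)
    return result


# Two groups over four enclosures, modelled as (enclosure, disk) tuples.
initial = [[(e, d) for e in range(4)] for d in range(2)]
shuffled = shuffle_groups(initial, random.Random(0))
```

Whatever the random seed, each group in `shuffled` still contains
four distinct enclosures, which is the invariant described above.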

For example, here's what the pool status output looks like in
case of two `draid1:2d:4c` failure groups:

    NAME                        STATE     READ WRITE CKSUM
    pool1                       ONLINE       0     0     0
      draid1:2d:4c:8w:1s-0      ONLINE       0     0     0
        enc0d0                  ONLINE       0     0     0
        enc1d0                  ONLINE       0     0     0
        enc2d0                  ONLINE       0     0     0
        enc3d0                  ONLINE       0     0     0
        enc0d1                  ONLINE       0     0     0
        enc1d1                  ONLINE       0     0     0
        enc2d1                  ONLINE       0     0     0
        enc3d1                  ONLINE       0     0     0
    spares
      draid1-0-0                AVAIL

The number of failure groups is specified indirectly via the new
width parameter in the draid vdev configuration descriptor, which
is the total number of disks and which is a multiple of the number
of children in each group. This multiple is the number of groups
(width / children). Doing it this way lets the user see at a
glance how many disks the draid vdev has.
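
As a worked example of the width / children arithmetic, the sketch
below reads a descriptor shaped like the one in the status output
above. The parser is hypothetical and only follows the
`draid1:2d:4c:8w:1s` form shown there; it is not the parsing code
from the patch.

```python
import re

# Suffix letters as they appear in the status output above.
FIELD_NAMES = {"d": "data", "c": "children", "w": "width", "s": "spares"}


def parse_draid_descriptor(desc):
    """Split a descriptor such as 'draid1:2d:4c:8w:1s' into integers."""
    head, *fields = desc.split(":")
    match = re.fullmatch(r"draid(\d?)", head)
    if match is None:
        raise ValueError(f"not a draid descriptor: {desc!r}")
    # Assumption: a bare 'draid' means single parity.
    params = {"parity": int(match.group(1) or 1)}
    for field in fields:
        fm = re.fullmatch(r"(\d+)([dcws])", field)
        if fm is None:
            raise ValueError(f"unrecognised field: {field!r}")
        params[FIELD_NAMES[fm.group(2)]] = int(fm.group(1))
    return params


def failure_group_count(params):
    """Number of failure groups = width / children."""
    children = params["children"]
    # Assumption: without a width there is a single group.
    width = params.get("width", children)
    if width % children != 0:
        raise ValueError("width must be a multiple of children")
    return width // children


params = parse_draid_descriptor("draid1:2d:4c:8w:1s")
n_groups = failure_group_count(params)
```

For the example pool, width 8 and 4 children per group give 2
failure groups.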

Spare disks are evenly distributed among failure groups, and they
are shared by all groups. However, to support domain failure, we
cannot have more than nparity - 1 failed disks in any group, even
if they are rebuilt to draid spares (the blocks of those spares
can be mapped to the disks from the failed domain, and we cannot
tolerate more than nparity failures in any failure group).
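
The rule above can be written as a small check. This is a sketch of
the counting logic only; the state labels and the function name are
assumptions for the illustration, not identifiers from the patch.

```python
def domain_failure_tolerated(group_states, nparity):
    """Return True if every group can still lose one more disk.

    group_states holds one list of per-disk states per failure group.
    A disk counts as lost when it is 'FAULTED' or 'REBUILT' (already
    rebuilt to a draid spare), because the spare's blocks may live on
    disks of the domain that fails next. Both labels are hypothetical.
    """
    for states in group_states:
        lost = sum(1 for s in states if s in ("FAULTED", "REBUILT"))
        if lost > nparity - 1:
            return False
    return True


# draid2: one rebuilt disk in a group still leaves room for a domain
# failure, two lost disks in the same group do not.
ok = domain_failure_tolerated(
    [["ONLINE", "REBUILT", "ONLINE"], ["ONLINE", "ONLINE", "ONLINE"]], 2)
too_many = domain_failure_tolerated(
    [["FAULTED", "REBUILT", "ONLINE"], ["ONLINE", "ONLINE", "ONLINE"]], 2)
```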

The retire agent in zed is updated to not start resilvering when
a domain failure happens. Otherwise, resilvering might consume a
lot of computing and I/O bandwidth resources, only to be wasted
when the failed domain component is replaced.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
Closes #11969
Closes #18148
Andriy Tkachuk
2026-04-08 18:09:47 +01:00
committed by GitHub
parent eb5c93fa8e
commit d1b0a69825
31 changed files with 1634 additions and 157 deletions
+3 -2
@@ -426,6 +426,7 @@ tests = ['zpool_create_001_pos', 'zpool_create_002_pos',
 'zpool_create_encrypted', 'zpool_create_edom_neg', 'zpool_create_crypt_combos',
 'zpool_create_draid_001_pos', 'zpool_create_draid_002_pos',
 'zpool_create_draid_003_pos', 'zpool_create_draid_004_pos',
+'zpool_create_draid_005_pos',
 'zpool_create_features_001_pos', 'zpool_create_features_002_pos',
 'zpool_create_features_003_pos', 'zpool_create_features_004_neg',
 'zpool_create_features_005_pos', 'zpool_create_features_006_pos',
@@ -916,10 +917,10 @@ timeout = 1200
 [tests/functional/redundancy]
 tests = ['redundancy_draid', 'redundancy_draid1', 'redundancy_draid2',
-'redundancy_draid3', 'redundancy_draid_damaged1',
+'redundancy_draid3', 'redundancy_draid_width', 'redundancy_draid_damaged1',
 'redundancy_draid_damaged2', 'redundancy_draid_degraded1',
 'redundancy_draid_spare1', 'redundancy_draid_spare2',
-'redundancy_draid_spare3', 'redundancy_mirror',
+'redundancy_draid_spare3', 'redundancy_draid_spare4', 'redundancy_mirror',
 'redundancy_raidz', 'redundancy_raidz1', 'redundancy_raidz2',
 'redundancy_raidz3', 'redundancy_stripe']
 tags = ['functional', 'redundancy']