Commit Graph

235 Commits

Author SHA1 Message Date
Christos Longros 59e10e7b92 libzfs_pool: document export and initialize functions
Add brief docstrings to zpool_export(), zpool_export_force(),
zpool_initialize() and zpool_initialize_wait().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christos Longros <chris.longros@gmail.com>
Closes #18514
2026-05-12 07:50:33 -07:00
Alexander Motin d65015938e Vdev allocation bias/class change
Normal, special and dedup vdevs differ only by space allocation
bias.  Normal and special vdevs might even legally store blocks
targeted to other classes.  Dedup vdevs don't normally do it, but
there is no real reason why they can't.  Considering this, it is
not impossible to change the allocation bias for those vdevs.

This change introduces a new top-level vdev property, alloc_bias,
which reports the current bias for the vdev and allows it to be
changed.  This makes it easy to change a vdev's role in a pool,
especially if vdev removal is impossible.  To avoid complicating
the code, changes take effect only on the next pool import.

Changes to/from log vdev could also be theoretically possible, but
they are artificially blocked for now, partially due to additional
complications, and partially due to potential danger of placing
other blocks on log vdevs, that would otherwise be non-fatal.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alek Pinchuk <alek.pinchuk@connectwise.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18493
2026-05-07 09:16:39 -07:00
Tony Hutter f828a80cb6 CI/GCC: Add Fedora 44, fix build errors and threadsappend
- Add Fedora 44 to CI tests
- Fix build issues from the newer compiler. These are mostly 'char *'
  to 'const char *' conversions.
- Fix threadsappend.c test waiting for the same thread TID twice.
  This caused the test to hang on F44 (but strangely not other OSs?)

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #18478
2026-05-02 09:57:15 -07:00
Andriy Tkachuk d1b0a69825 draid: add failure domains support
Currently, the only way to tolerate the failure of the whole
enclosure is to configure several draid vdevs in the pool, each
vdev having disks from different enclosures. But this essentially
degrades draid to raidz and defeats the purpose of having fast
sequential resilvering on wide pools with draid.

This patch allows several groups of children to be configured in
the same row in one draid vdev. In each such group, let's call it
a failure group, the user can configure disks belonging to different
enclosures - failure domains. For example, in case of 10 such
enclosures with 10 disks each, the user can put 1st disk from each
enclosure into 1st group, 2nd disk from each enclosure into 2nd
group, and so on. If one enclosure fails, only one disk from each
group would fail, which won't affect draid operation, and each
group would have enough redundancy to recover the stored data. Of
course, in case of draid2 - two enclosures can fail at a time, in
case of draid3 - three enclosures (provided there are no other
disk failures in each group).
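
The 10-enclosure example above can be checked with a short sketch.
The helper names and the encNdM disk naming are hypothetical, and the
layout ignores any shuffling of disks; it only illustrates why one
failed enclosure costs each failure group a single disk.

```python
def build_failure_groups(n_enclosures, disks_per_enclosure):
    """Group i holds the i-th disk of every enclosure (illustrative names)."""
    return [
        [f"enc{e}d{i}" for e in range(n_enclosures)]
        for i in range(disks_per_enclosure)
    ]


def failed_per_group(groups, failed_enclosure):
    """Count how many disks each group loses when one enclosure fails."""
    prefix = f"enc{failed_enclosure}d"
    return [sum(d.startswith(prefix) for d in group) for group in groups]


groups = build_failure_groups(10, 10)
losses = failed_per_group(groups, failed_enclosure=3)
```

Every entry of `losses` is 1 here, which stays within the parity
budget of a draid1 group as long as no other disk in that group has
failed.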

In order to preserve fast sequential resilvering in case of a
disk failure, the groups must share all disks between themselves,
and this is achieved by shuffling the disks between the groups.
But only i-th disks in each group are shuffled between themselves,
i.e. the disks from the same enclosures, after that they are
shuffled within each group, like it is done today in an ordinary
draid. Thus, no more than one disk from any enclosure can appear
in any failure group as a result of this shuffling.

For example, here's what the pool status output looks like in
the case of two `draid1:2d:4c` failure groups:

    NAME                        STATE     READ WRITE CKSUM
    pool1                       ONLINE       0     0     0
      draid1:2d:4c:8w:1s-0      ONLINE       0     0     0
        enc0d0                  ONLINE       0     0     0
        enc1d0                  ONLINE       0     0     0
        enc2d0                  ONLINE       0     0     0
        enc3d0                  ONLINE       0     0     0
        enc0d1                  ONLINE       0     0     0
        enc1d1                  ONLINE       0     0     0
        enc2d1                  ONLINE       0     0     0
        enc3d1                  ONLINE       0     0     0
    spares
      draid1-0-0                AVAIL

The number of failure groups is specified indirectly via the new
width parameter in the draid vdev configuration descriptor, which
is the total number of disks and which is a multiple of the number
of children in each group. This multiple is the number of groups
(width / children). Doing it this way lets the user see at a
glance how many disks the draid vdev has.
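
As a rough illustration of the width / children arithmetic, the
sketch below parses a descriptor shaped like the one in the status
output above. The grammar is inferred from that single example and
the function name is hypothetical; the real parser may accept other
forms.

```python
import re

# Pattern inferred from "draid1:2d:4c:8w:1s-0"; an assumption, not the
# parser used by zpool.
_DRAID_RE = re.compile(r"draid(\d+):(\d+)d:(\d+)c:(\d+)w:(\d+)s(?:-\d+)?")


def parse_draid(descriptor):
    """Split a draid descriptor into its numeric fields and derive groups."""
    m = _DRAID_RE.fullmatch(descriptor)
    if m is None:
        raise ValueError(f"unrecognised draid descriptor: {descriptor!r}")
    parity, data, children, width, spares = map(int, m.groups())
    if width % children != 0:
        raise ValueError("width must be a multiple of children")
    return {
        "parity": parity,
        "data": data,
        "children": children,
        "width": width,
        "spares": spares,
        "groups": width // children,  # number of failure groups
    }


info = parse_draid("draid1:2d:4c:8w:1s-0")
```

For the descriptor shown, `info["groups"]` is 8 / 4 = 2, matching the
two failure groups in the example.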

Spare disks are evenly distributed among failure groups, and they
are shared by all groups. However, to support domain failure, we
cannot have more than nparity - 1 failed disks in any group, even
if they are rebuilt to draid spares (the blocks of those spares
can be mapped to the disks from the failed domain, and we cannot
tolerate more than nparity failures in any failure group).

The retire agent in zed is updated to not start resilvering when
the domain failure happens. Otherwise, it might take a lot of
computing and I/O bandwidth resources, only to be wasted when the
failed domain component is replaced.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
Closes #11969
Closes #18148
2026-04-08 10:09:47 -07:00
Christos Longros 33ed68fc24 zpool create: report which device caused failure
When zpool create fails because a vdev is already in use, the
error message now identifies the problematic device and the pool
it belongs to, e.g.:

  cannot create 'tank': device '/dev/sdb1' is part of
  active pool 'rpool'

Implementation follows the ZPOOL_CONFIG_LOAD_INFO pattern used
by zpool import:

  - Add spa_create_info to spa_t to capture error info during
    vdev_label_init(), before vdev_close() resets vdev state
  - When vdev_inuse() detects a conflict, read the on-disk
    label to extract the pool name and store it with the
    device path
  - Return the info wrapped under ZPOOL_CONFIG_CREATE_INFO
    through the ioctl zc_nvlist_dst to userspace
  - In libzfs, zpool_create_info() unwraps the nvlist and
    formats the device-specific error message

Restructure zpool_create() error handling so all switch cases
use break instead of return, eliminating duplicated cleanup
code and using the single create_failed exit path.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christos Longros <chris.longros@gmail.com>
Closes #18213
2026-04-03 14:18:24 -07:00
Tony Hutter b44a3ecf4a zpool: Change zpool offline spares policy
The zpool offline man page says that you cannot use 'zpool offline'
on spares.  However, testing found that you could in fact force fault
(zpool offline -f) spares.

Change the policy to:
1. You can never force-fault or offline dRAID spares.
2. You can only force-fault or offline traditional spares if they're
   active.
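
The policy can be written as a small truth function. This is only a
sketch of the two rules above; the function and parameter names are
hypothetical and do not correspond to libzfs symbols.

```python
def may_offline_spare(is_draid_spare, is_active):
    """Return True if a spare may be offlined or force-faulted."""
    if is_draid_spare:
        return False      # rule 1: never for dRAID spares
    return is_active      # rule 2: traditional spares only when active
```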

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #18282
2026-03-25 11:08:55 -07:00
siv0 7f65e04abd libzfs: scrub: only include start and end nv pairs if needed for scrub
This patch addresses running `zpool scrub <pool>` with ZFS 2.4 userspace
while the loaded kernel module is still 2.3, failing with:
```
cannot scrub <pool>: the loaded zfs module does not support an option
for this operation. A reboot may be required to enable this option.
```

Checking for the source of the message via `strace` showed the scrub
ioctl failing and setting errno to ZFS_ERR_IOC_ARG_UNAVAIL[0]. Given
that and the comments in `module/zfs/zfs_ioctl.c`[1], commit 894edd084
seemed a likely cause of the backward incompatibility.

The corresponding kernelspace code in `module/zfs/zfs_ioctl.c` defaults
to a setting of 0 if either parameter is not set, so not providing the
nvpairs in case both are 0 should not make a semantic difference.
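
The idea can be sketched with a plain dict standing in for the
nvlist. The key names below are placeholders chosen for illustration,
not the actual nvpair names used by the ioctl.

```python
def build_scrub_args(scan_type, start=0, end=0):
    """Build the argument list, adding the range pairs only when needed."""
    args = {"scan_type": scan_type}
    # An older module rejects keys it does not know, so leave the range
    # out entirely when both values are at their default of 0.
    if start != 0 or end != 0:
        args["scan_date_start"] = start
        args["scan_date_end"] = end
    return args
```

A plain `zpool scrub <pool>` would then send only the keys an older
module already understands.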

Tested by:
* loading zfs.ko in version 2.3.6
* running `zpool scrub testpool` with zpool from master (error occurs)
* running `zpool scrub testpool` with this patch applied (scrub is
  started)

This should help users who are still stuck on an older kernel module,
while their distribution ships newer ZFS userspace.

This was observed in the Proxmox community forum:
https://forum.proxmox.com/threads/.180467/

[0] https://github.com/openzfs/zfs/blob/d35951b18d6e12afeb0d5b0539ff2467ab4bfbcf/include/sys/fs/zfs.h#L1762
[1] https://github.com/openzfs/zfs/blob/d35951b18d6e12afeb0d5b0539ff2467ab4bfbcf/module/zfs/zfs_ioctl.c#L7799
Fixes: 894edd084 ("Add TXG timestamp database")

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Stoiko Ivanov <s.ivanov@proxmox.com>
Co-authored-by: Stoiko Ivanov <s.ivanov@proxmox.com>
Closes #18314
2026-03-12 15:06:23 -07:00
Christos Longros d35951b18d zpool clear: remove undocumented rewind flags
Remove the -F, -n, and -X flags from zpool clear.  These flags were
inherited from OpenSolaris but are not applicable in this context.
Unlike zpool import, where the pool is not yet loaded and a specific
TXG can be selected, zpool clear operates on an already imported pool
whose in-memory state is ahead of what is on disk.  Rewinding
transactions would require force-exporting the pool first.

The rewind policy passed to zpool_clear() is now always
ZPOOL_NO_REWIND.

Tested on FreeBSD 16.0-CURRENT (amd64).  Verified that -F, -n, and
-X are properly rejected as invalid options and that the usage output
reflects the change.

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christos Longros <chris.longros@gmail.com>
Closes #13825
Closes #18300
2026-03-11 15:15:45 -07:00
Christos Longros 304de7f19b libzfs: handle EDOM error in zpool_create
When creating a pool with devices that have incompatible block sizes,
the kernel returns EDOM. However, zpool_create() did not handle this
errno, falling through to zpool_standard_error() which produced a
confusing message about invalid property values.

Add a case EDOM handler in zpool_create() to return EZFS_BADDEV with
a descriptive auxiliary message, consistent with the existing EDOM
handler in zpool_vdev_add().
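
In outline, the change adds one more case before the generic
fallback. The sketch below models that dispatch; the message text is
an assumption, and the tuple return value is only a stand-in for the
libzfs error reporting.

```python
import errno
import os


def describe_create_error(err):
    """Map an errno from pool creation to (error class, message)."""
    if err == errno.EDOM:
        # Dedicated case: report a bad device with an auxiliary message.
        return ("EZFS_BADDEV", "devices have incompatible block sizes")
    # Anything else falls through to the generic description.
    return ("standard", os.strerror(err))
```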

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christos Longros <chris.longros@gmail.com>
Closes #18268
2026-03-08 12:59:10 -07:00
Ryan Moeller ac0fd40c8c Add zpool properties for allocation class space
The existing zpool properties accounting pool space (size, allocated,
fragmentation, expandsize, free, capacity) are based on the normal
metaslab class or are cumulative properties of several classes combined.

Add properties reporting the space accounting metrics for each metaslab
class individually.

Also introduce pool-wide AVAIL, USABLE, and USED properties reporting
values corresponding to FREE, SIZE, and ALLOC deflated for raidz.

Update ZTS to recognize the new properties and validate reported values.

While in zpool_get_parsable.cfg, add "fragmentation" to the list of
parsable properties.

Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Ryan Moeller <ryan.moeller@klarasystems.com>
Closes #18238
2026-03-02 15:50:23 -08:00
Alexander Motin 991fc56fae Introduce dedupused/dedupsaved pool properties
Currently there is only a dedup ratio reported via pool properties.
If dedup is enabled only for some datasets, it is impossible to say
how much space the ratio actually covers.  Fix this by introducing
dedupused/dedupsaved pool properties, similar to earlier added
block cloning ones.  Combined with work to expose allocation classes
stats, it should give user-space enough visibility to correlate
`zpool list` and `zfs list` space numbers.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ryan Moeller <ryan.moeller@klarasystems.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18245
2026-02-25 09:41:38 -05:00
Christos Longros 6a717f31e6 Improve misleading error messages for ZPOOL_STATUS_CORRUPT_POOL
When devices are missing or claimed by another subsystem (e.g.
mdadm, LVM), zpool import reports "The pool metadata is corrupted"
and suggests destroying the pool. This is misleading because the
metadata is not necessarily corrupted -- it may simply be incomplete
due to inaccessible devices.

Update the status, action, and recovery messages to acknowledge
that missing devices can trigger this status, and suggest checking
device availability before resorting to pool destruction.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Longros <chris.longros@gmail.com>
Closes #18251
Closes #8236
2026-02-23 09:41:24 -08:00
Christos Longros 040ba7a7ca libzfs: improve error message for zpool create with ENXIO
When zpool create fails because a vdev cannot be opened (ENXIO),
the error falls through to zpool_standard_error() which reports
the generic 'one or more devices is currently unavailable'. This
is misleading when the real cause is a block size mismatch or
other device open failure.

Add an explicit ENXIO case in zpool_create()'s error handling to
provide a more descriptive message.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christos Longros <chris.longros@gmail.com>
Closes #18184
Closes #11087
2026-02-10 13:19:44 -08:00
Brian Behlendorf 20176224ee mmp: claim sequence id before final import
As part of SPA_LOAD_IMPORT add an additional activity check to
detect simultaneous imports from different hosts.  This check is
only required when the timing is such that there's no activity
for the read-only tryimport check to detect.  This extra
safety check operates as follows:

1. Repeats the following MMP check 10 times:
  a. Write out an MMP uberblock with the best txg and a random
     sequence id to all primary pool vdevs.
  b. Verify a minimum number of good writes such that even if
     the pool appears degraded on the remote host it will see
     at least one of the updated MMP uberblocks.
  c. Wait for the MMP interval; this leaves a window for other
     racing hosts to make similar modifications, which can then
     be detected.
  d. Call vdev_uberblock_load() to determine the best uberblock
     to use; this should be the MMP uberblock just written.
  e. Verify the txg and random sequence number match the MMP
     uberblock written in 1a.

2. Restore the original MMP uberblocks.  This allows the check
   to be performed again if the pool fails to import for an
   unrelated reason.
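
The sequence above can be modelled in a few lines. This is a toy
simulation, not the kernel code: a dict stands in for the best
uberblock on shared storage, the optional `racer` callback plays the
part of another host writing during the wait, step 1b is not
modelled, and the choice to restore only after a clean run is an
assumption of the sketch.

```python
import random


def claim_sequence(disk, best_txg, rounds=10, racer=None):
    """Return True if no other writer was seen during any round."""
    original = dict(disk)
    for _ in range(rounds):
        seq = random.getrandbits(16)
        disk["txg"], disk["seq"] = best_txg, seq    # 1a: write the claim
        if racer is not None:                       # 1c: wait window
            racer(disk)
        # 1d/1e: reload and compare with what was just written
        if disk["txg"] != best_txg or disk["seq"] != seq:
            return False
    disk.clear()
    disk.update(original)                           # 2: restore
    return True


def other_host(disk):
    """Simulated racing host that overwrites the sequence id."""
    disk["seq"] ^= 1
```

Calling `claim_sequence(disk, 100)` on an otherwise idle dict
succeeds and leaves it unchanged, while passing `racer=other_host`
makes the first round fail.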

This change also includes some refactoring and minor improvements.

- Never try loading earlier txgs during import when the import
  fails with EREMOTEIO or EINTR.  These errors don't indicate
  the txg is damaged but instead that it's either in use on a
  remote host or the import was interactively cancelled.  No
  rewind is also performed for EBADD which can result from a
  stale trusted config when doing a verbatim import.

- Refactor the code for consistent logging of the multihost
  activity check using spa_load_note() and console messages
  indicating when the activity check was triggered and the result.

- Added MMP_*_MASK and MMP_SEQ_CLEAR() macros to allow easier
  modification of the sequence number in an uberblock.

- Added ZFS_LOAD_INFO_DEBUG environment variable which can be
  set to dump to stdout the spa_load_info nvlist returned
  during import.  This is used by the updated mmp test cases
  to determine if an activity check was run and its result.

- Standardize the mmp messages similarly to make it easier to
  find all the relevant mmp lines in the debug log.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
2026-02-09 09:36:01 -08:00
Alexander Motin b4f073b5a6 Add BRT support to zpool prefetch command
Implement BRT (Block Reference Table) prefetch functionality similar
to existing DDT prefetch.  This allows preloading BRT metadata into
ARC to improve performance for block cloning operations and frees
of earlier cloned blocks.

Make -t parameter optional.  When omitted, prefetch all supported
metadata types (both DDT and BRT now).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17890
2025-11-10 16:16:22 -08:00
Paul Dagnelie d64711c202 Detect a slow raidz child during reads
A single slow responding disk can affect the overall read
performance of a raidz group.  When a raidz child disk is
determined to be a persistent slow outlier, then have it
sit out during reads for a period of time. The raidz group
can use parity to reconstruct the data that was skipped.

Each time a slow disk is placed into a sit out period, its
`vdev_stat.vs_slow_ios` count is incremented and a zevent
class `ereport.fs.zfs.delay` is posted.

The length of the sit out period can be changed using the
`raid_read_sit_out_secs` module parameter.  Setting it to
zero disables slow outlier detection.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Contributions-by: Don Brady <don.brady@klarasystems.com>
Contributions-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #17227
2025-09-10 15:25:03 -07:00
Mariusz Zaborski 894edd084e Add TXG timestamp database
This feature enables tracking of when TXGs are committed to disk,
providing an estimated timestamp for each TXG.

With this information, it becomes possible to perform scrubs based
on specific date ranges, improving the granularity of data
management and recovery operations.
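
One way to picture the date-range lookup is a sorted list of
(timestamp, txg) pairs searched with bisection. The in-memory list
and the function name are assumptions made for this sketch; the
on-disk format of the database is not described here.

```python
import bisect


def txg_range_for_dates(log, start_ts, end_ts):
    """Return (first_txg, last_txg) committed within [start_ts, end_ts].

    `log` is a list of (timestamp, txg) pairs sorted by timestamp.
    Returns None when no recorded TXG falls inside the range.
    """
    times = [ts for ts, _ in log]
    lo = bisect.bisect_left(times, start_ts)
    hi = bisect.bisect_right(times, end_ts)
    if lo >= hi:
        return None
    return log[lo][1], log[hi - 1][1]


log = [(100, 10), (200, 20), (300, 30), (400, 40)]
span = txg_range_for_dates(log, 150, 350)   # (20, 30)
```

Because the recorded timestamps are estimates, a range obtained this
way is approximate at its edges.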

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #16853
2025-08-06 10:31:21 -07:00
Akash B b6e8db509d zpool/zfs: Add '-a|--all' option to scrub, trim, initialize
Add support for the '-a | --all' option to perform trim,
scrub, and initialize operations on all pools.
Previously, specifying a pool name was mandatory for
these operations. With this enhancement, users can now
execute these operations across all pools at once,
without needing to manually iterate over each pool
from the command line.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Akash B <akash-b@hpe.com>
Closes #17524
2025-07-29 14:50:44 -07:00
Rob Norris bf38c15071 everywhere: misc unnecessary var init/update
These are all cases where we initialise or update a variable, and then
never use it. None of them particularly matter, as the compiler should
optimise them all away during dead store elimination, but some static
analysers complain about them and they are extra work for casual readers
to follow, so worth removing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17551
2025-07-22 15:23:58 -07:00
Paul Dagnelie e845be28e7 Add no-upgrade featureflag
Adds a featureflag that is not enabled during upgrades unless listed
explicitly. This is useful for features that could cause issues unless
applied carefully; for example, a feature that could make a root pool
unbootable if bootloaders don't yet have support for it.
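
The selection rule amounts to a filter over the feature list. The
function and the feature names used in the test are made up for
illustration.

```python
def features_to_enable(all_features, no_upgrade, explicit=()):
    """Pick the features an upgrade should enable.

    Features in `no_upgrade` are skipped unless the user listed them
    explicitly; everything else is enabled as before.
    """
    explicit = set(explicit)
    return [
        f for f in all_features
        if f not in no_upgrade or f in explicit
    ]
```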

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17004
2025-07-09 14:01:59 -07:00
Alexander Motin 4e92aee233 Relax special_small_blocks restrictions
special_small_blocks is applied to blocks after compression, so it
makes no sense to demand that its value be a power of 2.  At most
it could be required to be a multiple of 512, but that would still
buy us nothing, so let's allow it to be any value within
SPA_MAXBLOCKSIZE.
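
The difference between the two rules can be sketched as follows. The
16 MiB value for SPA_MAXBLOCKSIZE and the exact shape of the old
check are assumptions of this sketch, not taken from the patch.

```python
SPA_MAXBLOCKSIZE = 1 << 24  # assumed to be 16 MiB


def old_valid(value):
    """Old rule as described: zero or a power of 2 within the limit."""
    if value == 0:
        return True
    return 0 < value <= SPA_MAXBLOCKSIZE and (value & (value - 1)) == 0


def new_valid(value):
    """Relaxed rule: any value from 0 up to the limit."""
    return 0 <= value <= SPA_MAXBLOCKSIZE
```

A threshold such as 3000 bytes, which can make sense for compressed
blocks, fails the old check and passes the new one.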

Also special_small_blocks does not really need to depend on the
set recordsize, enabled pool features or presence of special vdev.
At worst in any of those cases it will just do nothing, so we
should not complicate users' lives by artificial limitations.

While there, polish comments for recordsize and volblocksize.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17497
2025-07-02 11:11:37 -07:00
Paul Dagnelie 086105f4c4 Cause zpool scan resume commands to get logged in history
Currently, commands that resume a scrub/errorscrub from a paused state
don't get logged in the pool history. This is because resumes actually
return ECANCELED, instead of 0. This causes the tsd code in the common
ioctl logic to not think the ioctl succeeded, which causes the
log_history ioctl to fail with EPERM. However, for resuming a scrub from
a paused state, ECANCELED is success.

There are two options for how to deal with this. The first is the one
that I implemented here; I can't find a good reason for dmu_scan to
return ECANCELED on resume instead of 0, so let's just not. The only
place we check for the ECANCELED value is in zpool_scan, where we just
convert it back to zero.  However, I am aware that this is changing an
ioctl interface, which I believe is a breaking change. I don't think
it's an important change, but maybe there is someone who relies on it.

The other option that could be implemented is to either allow ECANCELED
specifically from dsl_scan in the common ioctl code, or add a generic
facility to the common ioctl code that allows each command to specify
whether or not success happened, regardless of the return values. I am
open to feedback on which option people think would be better.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #17301
2025-05-16 13:19:04 -04:00
Rob Norris 131df3bbf2 vdev_to_nvlist_iter: ignore draid parameters when matching names (#17228)
Various tools will display draid vdev names with parameters embedded in
them, but would not accept them as valid vdev names when looking them
up, making it difficult to build pipelines involving draid vdevs.

This commit makes it so that if a full draid name is offered for match,
it gets truncated at the first ':' character.
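
A minimal sketch of the truncation, assuming the check is a simple
prefix test on "draid". The function name is hypothetical, and the
sketch discards everything from the first colon onward, including any
trailing index.

```python
def normalize_vdev_name(name):
    """Strip embedded draid parameters before matching a vdev name."""
    if name.startswith("draid"):
        return name.split(":", 1)[0]   # cut at the first ':'
    return name
```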

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-14 17:10:48 -07:00
Richard Kojedzinszky 09fc7bb47e Fix memory leaks in pool properties handling
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Richard Kojedzinszky <richard@kojedz.in>
Closes #17208
2025-04-05 19:40:55 -04:00
Rob Norris eb9098ed47 SPDX: license tags: CDDL-1.0
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:56:27 -07:00
Rob Norris 779c5a5deb zpool_get_vdev_prop_value: show missing vdev userprops
If a vdev userprop is not found, present it as value '-', default
source, so it matches the output from pool userprops.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16887
2024-12-29 11:11:40 -08:00
Umer Saleem 1c9a4c8cb4 Fix user properties output for zpool list
In zpool_get_user_prop, when called from zpool_expand_proplist and
collect_pool, we often have zpool_props present in zpool_handle_t equal
to NULL. This mostly happens when only one user property is requested
using zpool list -o <user_property>. Checking for this case and
correctly initializing the zpool_props field in zpool_handle_t fixes
this issue.

Interestingly, this issue does not occur if we query any other property
like name or guid along with a user property with -o flag because while
accessing properties like guid, zpool_prop_get_int is called which
checks for this case specifically and calls zpool_get_all_props.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #16734
2024-11-11 09:46:45 -08:00
Brian Behlendorf 4319e71402 ztest: Fix scrub check in ztest_raidz_expand_check()
The scrub code may return EBUSY under several possible scenarios
causing ztest to incorrectly ASSERT when verifying the result of
a raidz expansion.  Update the test case to allow EBUSY since it
does not indicate pool damage.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #16627
2024-10-08 20:41:17 -07:00
Don Brady d4d79451cb Add DDT prune command
Requires the new 'flat' physical data which has the start
time for a class entry.

The amount to prune can be based on a target percentage of
the unique entries or based on the age (i.e., every entry
older than N days).

Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #16277
2024-09-04 14:17:02 -07:00
Mateusz Piotrowski 6be8bf5552 zpool: Provide GUID to zpool-reguid(8) with -g (#16239)
This commit extends the zpool-reguid(8) command with a -g flag, which
allows the user to specify the GUID to set.

This change also adds some general tests for zpool-reguid(8).

Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Klara, Inc.

Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2024-08-26 09:27:24 -07:00
Ameer Hamza 963e6c9f3f Fix incorrect error report on vdev attach/replace
Report the correct error message in libzfs when attaching/replacing a
vdev with a higher ashift.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #16449
2024-08-15 12:39:44 -07:00
Alan Somers 1f5bf91a85 Fix memory corruption during parallel zpool import with -o cachefile (#16419)
When importing multiple pools, the nvlist of properties given with "-o"
is shared amongst the several threads.  So no thread should modify it.
Previously, in the course of validating the cachefile property, the
zpool_valid_proplist function would temporarily modify the value, and
then change it back.  Now it will operate on a clone of the value.

Sponsored by:   Axcient
Fixes #16405
Signed-off-by: Alan Somers <asomers@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
2024-08-07 13:44:55 -07:00
Tony Hutter 02a9f7fed7 JSON: Fix class values for mirrored special vdevs
This fixes things so mirrored special vdevs report themselves as
"class=special" rather than "class=normal".

This happens due to the way the vdev nvlists are constructed:

mirrored special devices - The 'mirror' vdev has allocation bias as
"special" and its leaf vdevs are "normal"

single or RAID0 special devices - Leaf vdevs have allocation bias as
"special".

This commit adds in code to check if a leaf's parent is a "special"
vdev to see if it should also report "special".
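
The parent check can be pictured with dicts standing in for the vdev
nvlists. The `alloc_bias` key and the function name are placeholders
for this sketch, not the actual nvlist key or libzfs function.

```python
def vdev_class(vdev, parent=None):
    """Report a vdev's class, inheriting the bias from its parent."""
    bias = vdev.get("alloc_bias")
    if bias is None and parent is not None:
        # A leaf under a mirrored special vdev carries no bias itself.
        bias = parent.get("alloc_bias")
    return bias or "normal"


mirror = {"type": "mirror", "alloc_bias": "special"}
leaf = {"type": "disk"}
```

Here `vdev_class(leaf, mirror)` gives "special", while a leaf with no
biased parent still reports "normal".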

Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #16217
2024-08-06 12:47:58 -07:00
Umer Saleem 959e963c81 JSON output support for zpool status
This commit adds support for the zpool status command to display the
status of ZFS pools in JSON format using the '-j' option. Status
information is collected in an nvlist, which is later dumped to stdout
in JSON format. Existing options for zpool status work with the '-j'
flag. The man page for zpool status is updated accordingly.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #16217
2024-08-06 12:47:10 -07:00
Allan Jude 62e7d3c89e ddt: add support for prefetching tables into the ARC
This change adds a new `zpool prefetch -t ddt $pool` command which
causes a pool's DDT to be loaded into the ARC. The primary goal is to
remove the need to "warm" a pool's cache before deduplication stops
slowing write performance. It may also provide a way to reload portions
of a DDT if they have been flushed due to inactivity.

Sponsored-by: iXsystems, Inc.
Sponsored-by: Catalogics, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Will Andrews <will.andrews@klarasystems.com>
Signed-off-by: Fred Weigel <fred.weigel@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Will Andrews <will.andrews@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Closes #15890
2024-07-26 09:16:18 -07:00
Allan Jude c7ada64bb6 ddt: dedup table quota enforcement
This adds two new pool properties:
- dedup_table_size, the total size of all DDTs on the pool; and
- dedup_table_quota, the maximum possible size of all DDTs in the pool

When set, quota will be enforced by checking when a new entry is about
to be created. If the pool is over its dedup quota, the entry won't be
created, and the corresponding write will be converted to a regular
non-dedup write. Note that existing entries can be updated (i.e. their
refcounts changed), as that reuses the space rather than requiring more.

dedup_table_quota can be set to 'auto', which will set it based on the
size of the devices backing the "dedup" allocation device. This makes it
possible to limit the DDTs to the size of a dedup vdev only, such that
when the device fills, no new blocks are deduplicated.
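
The enforcement rule can be modelled with a dict as the dedup table.
This is a simplified sketch: every entry is assumed to cost the same
fixed size, a quota of 0 is taken to mean no quota, and the names are
hypothetical.

```python
def handle_dedup_write(ddt, key, quota, entry_size=1):
    """Return "dedup" or "regular" for a write of block `key`."""
    if key in ddt:
        ddt[key] += 1          # existing entry: bump the refcount
        return "dedup"
    if quota and len(ddt) * entry_size >= quota:
        return "regular"       # at quota: no new entry is created
    ddt[key] = 1
    return "dedup"
```

With a quota of two entries, a third distinct block is written as a
regular block, while a repeat of an earlier block is still
deduplicated.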

Sponsored-by: iXsystems, Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Sean Eric Fagan <sean.fagan@klarasystems.com>
Closes #15889
2024-07-25 09:47:36 -07:00
Don Brady fb6d8cf229 Add some missing vdev properties (#16346)
Sponsored-by: Klara, Inc.
Sponsored-By: Wasabi Technology, Inc.

Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2024-07-23 16:34:09 -07:00
Maxim Filimonov f07389d3ad Fix locale-specific time
In `zpool status -t`, scrub date/time is reported using the C locale,
while trim time is reported using the current one. This is inconsistent.
This patch fixes that.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Maxim Filimonov <che@bein.link>
Closes #15878
Closes #15879
2024-04-08 15:37:41 -07:00
George Wilson b1e46f869e Add ashift validation when adding devices to a pool
Currently, zpool add allows users to add top-level vdevs that have
different ashifts but doing so prevents users from being able to
perform a top-level vdev removal. Often times consumers may not realize
that they have mismatched ashifts until the top-level removal fails.

This feature adds ashift validation to the zpool add command and will
fail the operation if the sector size of the specified vdev does not
match the existing pool. This behavior can be disabled by using the -f
flag. In addition, new flags have been added to provide fine-grained
control to disable specific checks. These flags are:

--allow-in-use
--allow-ashift-mismatch
--allow-replication-mismatch

The force flag will disable all of these checks.
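
One way to picture how the force flag relates to the fine-grained flags
is a bitmask of enabled checks. The sketch below is illustrative only;
the enum values and the function name are hypothetical and do not come
from the zpool source.

```c
#include <stdbool.h>

/* Hypothetical bit per validation check performed when adding a vdev. */
enum {
	CHECK_IN_USE      = 1 << 0,
	CHECK_ASHIFT      = 1 << 1,
	CHECK_REPLICATION = 1 << 2,
	CHECK_ALL = CHECK_IN_USE | CHECK_ASHIFT | CHECK_REPLICATION
};

/* Return the set of checks that remain enabled for the given flags. */
static unsigned
add_checks_enabled(bool force, bool allow_in_use,
    bool allow_ashift_mismatch, bool allow_replication_mismatch)
{
	unsigned checks = CHECK_ALL;

	if (force)
		return (0);	/* the force flag disables every check */
	if (allow_in_use)
		checks &= ~CHECK_IN_USE;
	if (allow_ashift_mismatch)
		checks &= ~CHECK_ASHIFT;
	if (allow_replication_mismatch)
		checks &= ~CHECK_REPLICATION;
	return (checks);
}
```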

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mark Maybee <mmaybee@delphix.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #15509
2024-03-29 13:15:56 -06:00
Don Brady cbe882298e Add slow disk diagnosis to ZED
Slow disk response times can be indicative of a failing drive. ZFS
currently tracks slow I/Os (slower than zio_slow_io_ms) and generates
events (ereport.fs.zfs.delay).  However, ZED takes no action on them,
as it does for checksum or I/O errors.  This change adds slow disk
diagnosis to ZED which is opt-in using new VDEV properties:
  VDEV_PROP_SLOW_IO_N
  VDEV_PROP_SLOW_IO_T

If multiple VDEVs in a pool are undergoing slow I/Os, then ZED skips
the zpool_vdev_degrade() call.
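
A rough sketch of the decision follows. It assumes the two properties
mean "N slow I/O events within T seconds", which the text above does not
spell out, and every name in it is hypothetical rather than taken from
ZED.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if at least n of the event timestamps (in seconds, sorted
 * ascending) fall within some window of t seconds.
 */
static bool
slow_io_threshold_hit(const uint64_t *ts, size_t count, size_t n, uint64_t t)
{
	if (n == 0 || count < n)
		return (false);
	for (size_t i = 0; i + n <= count; i++) {
		/* Span covered by n consecutive events starting at i. */
		if (ts[i + n - 1] - ts[i] <= t)
			return (true);
	}
	return (false);
}

/*
 * Degrade only when the threshold is hit and this is the only vdev in
 * the pool currently seeing slow I/Os.
 */
static bool
should_degrade(bool threshold_hit, unsigned slow_vdevs_in_pool)
{
	return (threshold_hit && slow_vdevs_in_pool <= 1);
}
```

Skipping the degrade when several vdevs are slow avoids blaming one disk
for what is more likely a shared cause.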

Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #15469
2024-02-08 09:19:52 -08:00
Tony Hutter a9520e6e59 zpool: Add slot power control, print power status
Add `zpool` flags to control the slot power to drives.  This assumes
your SAS or NVMe enclosure supports slot power control via sysfs.

The new `--power` flag is added to `zpool offline|online|clear`:

    zpool offline --power <pool> <device>    Turn off device slot power
    zpool online --power <pool> <device>     Turn on device slot power
    zpool clear --power <pool> [device]      Turn on device slot power

If the ZPOOL_AUTO_POWER_ON_SLOT env var is set, then the '--power'
option is automatically implied for `zpool online` and `zpool clear`
and does not need to be passed.

zpool status also gets a --power option to print the slot power status.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mart Frauenlob <AllKind@fastest.cc>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #15662
2023-12-21 10:53:16 -08:00
Don Brady 5caeef02fa RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally.  This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).

== Initiating expansion ==

A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`.  The new device will become part of the RAIDZ group.  A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.

The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.

== During expansion ==

The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).

The expansion progress can be monitored with `zpool status`.

Data redundancy is maintained during (and after) the expansion.  If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).

The pool remains accessible during expansion.  Following a reboot or
export/import, the expansion resumes where it left off.

== After expansion ==

When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).

Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).

A RAIDZ vdev can be expanded multiple times.

After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. a 5-wide RAIDZ2 has 3 data to 2 parity), but
distributed among the larger set of disks.  New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity).  However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.

Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 10:19:41 -08:00
Rob N 4647353c8b status: report pool suspension state under failmode=continue
When failmode=continue is set and the pool suspends, both 'zpool status'
and the 'zfs/pool/state' kstat ignore it and report the normal vdev tree
state. There's no clear indicator that the pool is suspended. This is
unlike suspend in failmode=wait, or suspend due to MMP check failure,
which both report "SUSPENDED" explicitly.

This commit changes it so SUSPENDED is reported for failmode=continue
the same as for other modes.

Rationale:

The historical behaviour of failmode=continue is roughly, "press on as
though all is well". To this end, the fact that the pool had suspended
was not shown, to maintain the façade that all is well.

It's unclear why hiding this information was considered appropriate. One
possibility is that it was expected that a true pool fault would always
be reported as DEGRADED or FAULTED, and that the pool could not suspend
without these happening.

That is not necessarily true, as vdev health and suspend state are only
loosely connected, such that a pool in (apparent) good health can be
suspended for good reasons, and of course a degraded pool does not lead
to suspension. Even if that expectation were true, there's still a
difference in urgency - a degraded pool may not need to be attended to
for hours, while a suspended pool is most often unusable until an
operator intervenes.

An operator that has set failmode=continue has presumably done so
because their workload is one that can continue to operate in a useful
way when the pool suspends. In this case the operator still needs a
clear indicator that there is a problem that needs attending to.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15297
2023-09-20 16:56:45 -07:00
Serapheim Dimitropoulos 12373b0cc7 zpool_vdev_remove() should handle EALREADY error return
When the vdev properties feature was merged an extra check
was added in `spa_vdev_remove_top_check()` which checked
whether the vdev that we want to remove is already being
removed and if so return an EALREADY error.

```
static int
spa_vdev_remove_top_check(vdev_t *vd)
{
	... <snip> ...
	/*
	 * This device is already being removed
	 */
	if (vd->vdev_removing)
		return (SET_ERROR(EALREADY));
```

Before that change we'd still fail with an error but it
was a more generic one - here is the check that failed
later in the same function:
```
	/*
	 * There can not be a removal in progress.
	 */
	if (spa->spa_removing_phys.sr_state == DSS_SCANNING)
		return (SET_ERROR(EBUSY));
```

Changing the error code returned from that function changed
the behavior of the removal's library interface exposed to
userland: `zpool_vdev_remove()` now returns `EZFS_UNKNOWN`
instead of the `EZFS_EBUSY` that it returned before.

This patch adds logic to make `zpool_vdev_remove()` aware
of the new EALREADY code and propagate `EZFS_EBUSY`,
reverting to the previously established semantics of that
function.
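
The shape of the fix can be sketched as an errno-to-library-error
mapping. The enum and function below are hypothetical stand-ins, not the
libzfs definitions, and the numeric values are arbitrary.

```c
#include <errno.h>

/* Hypothetical stand-ins for the library error codes named above. */
typedef enum {
	SKETCH_EZFS_BUSY = 1,
	SKETCH_EZFS_UNKNOWN
} sketch_ezfs_err_t;

/* Map the errno returned by the removal ioctl to a library error. */
static sketch_ezfs_err_t
sketch_remove_errno_to_ezfs(int err)
{
	switch (err) {
	case EBUSY:
	case EALREADY:	/* vdev is already being removed: report as busy */
		return (SKETCH_EZFS_BUSY);
	default:
		return (SKETCH_EZFS_UNKNOWN);
	}
}
```

Handling EALREADY alongside EBUSY keeps callers that test for the busy
error working without changes.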

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #15013
Closes #15129
2023-08-01 14:47:00 -07:00
George Amanakis 482eeef804 Teach zpool scrub to scrub only blocks in error log
Add an '-e' flag to zpool scrub to scrub only the blocks in the error
log. A user can pause, resume, and cancel the error scrub by passing the
additional command-line arguments -p and -s, just like a regular scrub.
This involves adding a new flag, creating new libzfs interfaces, a new
ioctl, and the actual iteration and read-issuing logic. Error scrubbing
is executed across multiple txgs to make sure pool performance is not
affected.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Co-authored-by: TulsiJain <tulsi.jain@delphix.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #8995
Closes #12355
2023-05-18 11:59:42 -07:00
Brian Behlendorf e34e15ed6d Add the ability to uninitialize
zpool initialize functions well for touching every free byte...once.
But if we want to do it again, we're currently out of luck.

So let's add zpool initialize -u to clear it.

Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12451 
Closes #14873
2023-05-18 10:02:20 -07:00
Allan Jude 8eae2d214c Add support for zpool user properties
Usage:

    zpool set org.freebsd:comment="this is my pool" poolname

Tests are based on zfs_set's user property tests.

Also stop truncating property values at MAXNAMELEN, use ZFS_MAXPROPLEN.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Sponsored-by: Beckhoff Automation GmbH & Co. KG.
Sponsored-by: Klara Inc.
Closes #11680
2023-04-21 10:20:36 -07:00
rob-wing 3e4ed4213d Create zap for root vdev
And add it to the AVZ. This is not backwards compatible with older pools
due to an assertion in spa_sync() that verifies the number of ZAPs of
all vdevs matches the number of ZAPs in the AVZ.

Granted, the assertion only applies to #DEBUG builds - still, a feature
flag is introduced to avoid the assertion, com.klarasystems:vdev_zaps_v2

Notably, this allows getting and setting properties on the root vdev:

    % zpool set user:prop=value <pool> root-0

Before this commit, it was already possible to get/set properties on
top-level vdevs with the syntax <type>-<vdev_id> (e.g. mirror-0):

    % zpool set user:prop=value <pool> mirror-0

This syntax also applies to the root vdev, as it is of type 'root'
with a vdev_id of 0, root-0. The keyword 'root' serves as an alias for
'root-0'.

The following tests have been added:

    - zpool get all properties from root vdev
    - zpool set a property on root vdev
    - verify root vdev ZAP is created

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Sponsored-by: Seagate Technology
Submitted-by: Klara, Inc.
Closes #14405
2023-04-20 10:07:56 -07:00
Richard Yao d1807f168e nvpair: Constify string functions
After addressing coverity complaints involving `nvpair_name()`, the
compiler started complaining about dropping const. This led to a rabbit
hole where not only `nvpair_name()` needed to be constified, but also
`nvpair_value_string()`, `fnvpair_value_string()` and a few other static
functions, plus variable pointers throughout the code. The result became
a fairly big change, so it has been split out into its own patch.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14612
2023-03-14 15:25:50 -07:00
Richard Yao 47a7062772 zpool_valid_proplist() should not corrupt nvpair name string on error
The strings returned from parsing nvlists should be immutable, but to
simplify the code when we want a substring from it, we sometimes will
write a NUL byte into it and then restore the value afterward. Provided
there is no concurrent access, this is okay, unless we forget to restore
the value afterward. This was caught when constifying string functions
related to nvlists.
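
The pattern in question looks roughly like the sketch below. The
function name and the '@' separator are hypothetical choices for the
example; the point is that the overwritten character is restored on
every path, including the mismatch path.

```c
#include <string.h>

/*
 * Temporarily terminate `name` at its first '@', compare the prefix
 * with `want`, then put the '@' back.
 * Returns 1 on match, 0 on mismatch, -1 if `name` has no '@'.
 */
static int
prefix_matches(char *name, const char *want)
{
	char *at = strchr(name, '@');
	int match;

	if (at == NULL)
		return (-1);
	*at = '\0';
	match = (strcmp(name, want) == 0);
	*at = '@';	/* always restore, even when the prefix did not match */
	return (match);
}
```

Comparing with `strncmp()` over the prefix length would avoid writing to
the string at all, which is the direction a const-correct interface
pushes toward.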

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14612
2023-03-14 15:25:40 -07:00