The feature that added a failfast property to vdevs unfortunately did
not correctly set the default at creation time, so many vdevs do not
actually have the property set. In addition, when the property is
used, the failfast flag is not checked correctly, resulting in the
feature mostly not working as intended.
Set the failfast property to the default value at vdev allocation time.
The value will be read in from the ZAP as normal when the vdev metadata
is loaded. Allow the property to be set on any vdev and have it be
inherited from the root or top-level vdev.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #18410
Don't disable block cloning during dedup tests. Instead, avoid using
cp so that cloning is not triggered. Add a new test that explicitly
mixes dedup and cloning on the same file, which should be handled by DDT.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@truenas.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18520
Before this change, for blocks marked with D flag but absent in DDT
(pruned from it), zio_ddt_free() fell back to ZIO_STAGE_DVA_FREE
without trying ZIO_STAGE_BRT_FREE first. At the same time, such blocks
might be present in BRT, and not handling that would result in a
double or multiple free.
This change makes ZIO_DDT_FREE_PIPELINE include ZIO_FREE_PIPELINE,
just adding required ZIO_STAGE_ISSUE_ASYNC and ZIO_STAGE_DDT_FREE,
and moves DDT stages before BRT. This way, if the block is found
in DDT by zio_ddt_free(), the pipeline is short-circuited to
ZIO_INTERLOCK_PIPELINE, similar to what zio_brt_free() does. If
not, then BRT is checked, and if there is no match there either, the
block is freed.
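The lookup order described above can be sketched as a small model. This
is not the pipeline code itself: block_t, its fields, free_result_t and
free_block() are hypothetical stand-ins for the DDT lookup, the BRT
lookup and the final DVA free.

```c
#include <stdbool.h>

/* Hypothetical model of a block being freed; not a real ZFS structure. */
typedef struct {
	bool in_ddt;	/* block still has an entry in the dedup table */
	bool in_brt;	/* block has references in the block reference table */
} block_t;

typedef enum {
	FREED_BY_DDT,	/* DDT handled the free; pipeline short-circuits */
	FREED_BY_BRT,	/* BRT handled the free; pipeline short-circuits */
	FREED_DVA	/* neither table matched; the space is released */
} free_result_t;

/* Check DDT first, then BRT, and only then fall through to the DVA free. */
static free_result_t
free_block(const block_t *b)
{
	if (b->in_ddt)
		return (FREED_BY_DDT);
	if (b->in_brt)
		return (FREED_BY_BRT);
	return (FREED_DVA);
}
```

A block pruned from DDT but still referenced by BRT now takes the
FREED_BY_BRT path instead of going straight to the DVA free.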
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@truenas.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18520
Currently, zfs send generates a mix of nvlist encodings in DRR_BEGIN
records, some XDR and some in native byte order. The result is that
most streams currently can't be zfs received on opposite-endian systems.
zfs send generates the outer wrappers for compound streams in userspace,
and it explicitly requests NV_ENCODE_XDR format for those records. But
the BEGIN records for individual datasets are generated on the kernel
side, in dmu_send.c, where fnvlist_pack() is used for encoding. That
routine hard-wires NV_ENCODE_NATIVE format.
This PR replaces the fnvlist_pack() call with a direct call to
nvlist_pack() that specifies NV_ENCODE_XDR.
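To illustrate why the encoding matters, here is a minimal sketch of the
difference for a single 32-bit value. It does not use the libnvpair API;
the three helper functions are hypothetical. XDR stores integers in
big-endian order, so the bytes are the same on every host, while a native
encoding copies whatever layout the sending CPU uses.

```c
#include <stdint.h>
#include <string.h>

/* Store v in big-endian order, as XDR does, regardless of host order. */
static void
encode_u32_xdr(uint32_t v, uint8_t out[4])
{
	out[0] = (uint8_t)(v >> 24);
	out[1] = (uint8_t)(v >> 16);
	out[2] = (uint8_t)(v >> 8);
	out[3] = (uint8_t)v;
}

/* Rebuild the value from big-endian bytes; works on any host. */
static uint32_t
decode_u32_xdr(const uint8_t in[4])
{
	return (((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
	    ((uint32_t)in[2] << 8) | (uint32_t)in[3]);
}

/*
 * Native encoding copies host memory as-is, so the byte layout depends
 * on the sender and only reads back correctly on a same-endian host.
 */
static void
encode_u32_native(uint32_t v, uint8_t out[4])
{
	memcpy(out, &v, sizeof (v));
}
```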
Tests are included to verify that native-encoded nvlists are not
generated by any kernel path that attaches nvlists to BEGIN records.
There's also a check for XDR encoding in the outer wrapper of
replication streams in case there is ever a regression there.
There are also two tests that have a chance of triggering (and
detecting) bug #18491. Non-triggering versions of those tests are
already included here, so when that bug is more fully characterized,
the tests can be moved to a more directly relevant category. (They
are the two tests with _with_write suffixes.)
This PR adds to zstream dump an output line that shows the exact
encoding of any nvlists in BEGIN records. This feature is used by
the tests to validate streams.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Garth Snyder <garth@garthsnyder.com>
Closes #18360
Closes #18372
With one -v, the block type (parity or data) is printed (matching
the ASCII-art version); with two -v, the offset into the file is
also printed.
This also updates the man page, and adds some simple
test scripts.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Sean Fagan <sean.fagan@klarasystems.com>
Signed-off-by: Sean Fagan <sean.fagan@klarasystems.com>
Closes #18470
Normal, special and dedup vdevs differ only by space allocation
bias. Normal and special vdevs might even legally store blocks
targeted to other classes. Dedup vdevs don't normally do it, but
there is no real reason why they can't. Considering this, it is
possible to change the allocation bias for those vdevs.
This change introduces a new top-level vdev property, alloc_bias,
which reports the current bias for the vdev and allows it to be changed.
This makes it easy to change a vdev's role in a pool, especially if
vdev removal is impossible. To keep the code simple, changes
take effect only on the next pool import.
Changes to/from log vdev could also be theoretically possible, but
they are artificially blocked for now, partially due to additional
complications, and partially due to potential danger of placing
other blocks on log vdevs, that would otherwise be non-fatal.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alek Pinchuk <alek.pinchuk@connectwise.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18493
In send_reader_thread(), the PREVIOUSLY_REDACTED handler computed
file_max as MIN(dn->dn_maxblkid, range->end_blkid). dn_maxblkid is
an inclusive maximum block ID while range->end_blkid is exclusive (one
past the last block). The resulting file_max was then used as an
exclusive loop bound, causing the last block of any file (at index
dn_maxblkid) to be silently skipped when a PREVIOUSLY_REDACTED range
covered the end of the file.
The block was never written to the send stream so the receiver kept
zeros there. ZFS reported no error because the stream itself was
valid; the data was simply absent.
Fix: use dn_maxblkid + 1 so file_max is consistently exclusive.
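The off-by-one can be reproduced with a small standalone model. The
function names below are hypothetical and only mirror the arithmetic
described above, not the actual send_reader_thread() code.

```c
#include <stdint.h>

#define	MIN(a, b)	((a) < (b) ? (a) : (b))

/* Old bound: mixes an inclusive maximum with an exclusive end. */
static uint64_t
file_max_old(uint64_t dn_maxblkid, uint64_t end_blkid)
{
	return (MIN(dn_maxblkid, end_blkid));
}

/* Fixed bound: both operands are exclusive (one past the last block). */
static uint64_t
file_max_new(uint64_t dn_maxblkid, uint64_t end_blkid)
{
	return (MIN(dn_maxblkid + 1, end_blkid));
}

/* Count the blocks visited by a loop that treats file_max as exclusive. */
static uint64_t
blocks_visited(uint64_t start_blkid, uint64_t file_max)
{
	uint64_t n = 0;

	for (uint64_t blkid = start_blkid; blkid < file_max; blkid++)
		n++;
	return (n);
}
```

For a four-block file (dn_maxblkid of 3) and a range ending at block 10,
the old bound visits three blocks and the fixed bound visits all four.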
Add a regression test (redacted_max_blkid.ksh) that modifies only the
last block of a file in one clone, creates a redaction bookmark from
it, then sends an unmodified clone incrementally from that bookmark.
The PREVIOUSLY_REDACTED path must fill in the last block; the test
verifies it is not zeros and matches the original.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Manoj Joseph <manoj.joseph@delphix.com>
Closes #18477
This is the repro test from #18464, and confirms that when disabled, the
libzfs_mnttab_cache is discarded and reloaded on every lookup.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Prakash Surya <prakash.surya@perforce.com>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18466
Closes #18464
When the path argument to "zfs list -Ho name <path>" (or any caller of
zfs_path_to_zhandle()) is a symlink that crosses a mount boundary, the
wrong dataset is returned. Instead of returning the dataset that owns
the symlink's target, getextmntent() matches the dataset containing the
symlink itself.
For example, given two ZFS datasets "tank/ds1" and "tank/ds2", and a
symlink "/tank/ds1/link" pointing into "/tank/ds2":
    $ sudo zfs list -Ho name /tank/ds1/link
    tank/ds1
The expected (and previous) behavior is to return "tank/ds2", since the
symlink's target resides in that dataset.
The problem is in getextmntent(), in lib/libspl/os/linux/mnttab.c. That
function calls statx() on the caller-supplied path to obtain its mnt_id
(used to match against the mnt_id of each entry in /proc/self/mounts),
and it passes AT_SYMLINK_NOFOLLOW to that statx() call. As a result,
the mnt_id returned reflects the symlink's location rather than the
symlink target's mount, and the wrong /proc/self/mounts entry is
matched.
The same function also calls stat64() on the caller-supplied path
(used as a fallback when STATX_MNT_ID is not available, and to populate
the statbuf out-parameter). stat64() always follows symlinks, so the
statx() and stat64() calls were inconsistent: one resolved the symlink,
the other didn't. The AT_SYMLINK_NOFOLLOW behavior may be appropriate
when statx() is called on a mount entry from /proc/self/mounts (which
is always a real directory), but it is wrong for caller-supplied paths,
which may be symlinks.
This bug was introduced by 523d9d6007 ("Validate mountpoint on
path-based unmount using statx"), which added the STATX_MNT_ID code
path. However, the bug was latent: config/user-statx.m4 omitted
"#define _GNU_SOURCE" when checking for STATX_MNT_ID in <sys/stat.h>,
so HAVE_STATX_MNT_ID was never defined, and the buggy statx() path was
never compiled in. getextmntent() always fell back to the dev_t
comparison via stat64(), which correctly follows symlinks.
The fix to that autoconf check, in 2b930f63f8 ("config: fix
STATX_MNT_ID detection"), caused HAVE_STATX_MNT_ID to be properly
defined on kernels that support it, activating the broken
AT_SYMLINK_NOFOLLOW path for the first time and exposing the
regression.
The fix is to drop AT_SYMLINK_NOFOLLOW from the statx() call so that
symlinks are followed, matching the behavior of stat64() on the same
path.
Verified with a minimal reproducer: created two ZFS datasets, placed a
symlink inside the first pointing into the second, and confirmed that
"zfs list -Ho name <symlink>" returns the dataset containing the
symlink's target rather than the dataset containing the symlink.
Signed-off-by: Prakash Surya <prakash.surya@perforce.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
This PR adds a check in the mirror and raidz code for the case where
there are errors <= nparity. In that case, ZFS sets a new flag on
the zio that will be checked in zio_done. If that flag is set, when
the write IO completes, we issue a read IO for the same blkptr.
That will allow ZFS's auto-healing mechanisms and other error
recovery tools to detect the effectively-corrupt data, and handle
it accordingly. Note that because draid uses raidz's IO done function,
it also benefits from this functionality.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #18387
When copy_file_range overwrites a recent truncation, subsequent reads
can incorrectly conclude that they are reading a hole instead of reading
the cloned blocks.
This can happen when the following conditions are met:
- Truncate adds blkid to dn_free_ranges
- A new TXG is created
- copy_file_range calls dmu_brt_clone, which overrides the block pointer
and sets DB_NOFILL
- Subsequent read, given DB_NOFILL, hits dbuf_read_impl and
dbuf_read_hole
- dbuf_read_hole calls dnode_block_freed, which returns TRUE because the
truncated blkids are still in dn_free_ranges
This will not happen if the clone and truncate are in the same TXG,
because the block clone would update the current TXG's dn_free_ranges,
which is why this bug only triggers under high IO load (such as
compilation).
Fix this by skipping the dnode_block_freed call if the block is
overridden. The fix shouldn't cause an issue when the cloned block is
subsequently freed in later TXGs, as dbuf_undirty would remove the
override.
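The decision change can be modeled as follows. This is a deliberately
reduced sketch: dbuf_model_t and both functions are hypothetical and
capture only the dn_free_ranges check, not the rest of dbuf_read_hole().

```c
#include <stdbool.h>

/* Hypothetical, reduced view of the state consulted on a read. */
typedef struct {
	bool overridden;	/* block pointer was overridden by a clone */
	bool in_free_ranges;	/* blkid is still listed in dn_free_ranges */
} dbuf_model_t;

/* Old behavior: a stale free-range entry makes the block look freed. */
static bool
block_freed_old(const dbuf_model_t *db)
{
	return (db->in_free_ranges);
}

/* New behavior: an overridden block is never treated as freed. */
static bool
block_freed_new(const dbuf_model_t *db)
{
	if (db->overridden)
		return (false);
	return (db->in_free_ranges);
}
```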
This requires a dedicated test program as it is much harder to trigger
with scripts (this needs to generate a lot of I/O in a short period of
time for the bug to trigger reliably).
Assisted-by: Gemini:gemini-3.1-pro
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Gary Guo <gary@kernel.org>
Closes #18412
Closes #18421
Currently, when more than nparity disks get faulted during the
rebuild, only the first nparity disks go to the faulted state, and
all the remaining disks go to the degraded state. When a hot
spare is attached to such a degraded disk for rebuild, creating the
spare mirror, only that hot spare is rebuilt, but not the
degraded device. So when later during scrub some other attached
draid spare happens to map to that spare, it will end up with
a cksum error.
Moreover, if the user clears the errors on the degraded disk, the
data won't be resilvered to it, the hot spare will be detached almost
immediately, and the data that was resilvered only to it will be
lost.
Solution: write to all mirrored devices during rebuild, similar
to traditional/healing resilvering, but only if we can verify
the integrity of the data, or when it's the draid spare we are
writing to, in which case we are writing to a reserved spare
space, and there is no danger of overwriting any good data.
The argument that writing only to rebuilding draid spare vdev is
faster than writing to normal device doesn't hold since, at a
specific offset being rebuilt, draid spare will be mapped to a
normal device anyway.
The redundancy_draid_degraded2 automated test is also added to
cover this scenario.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andriy Tkachuk <atkachuk@wasabi.com>
Closes #18414
Currently, the only way to tolerate the failure of the whole
enclosure is to configure several draid vdevs in the pool, each
vdev having disks from different enclosures. But this essentially
degrades draid to raidz and defeats the purpose of having fast
sequential resilvering on wide pools with draid.
This patch allows several children groups to be configured in the
same row in one draid vdev. In each such group, let's call it a
failure group, the user can configure disks belonging to different
enclosures - failure domains. For example, in case of 10 such
enclosures with 10 disks each, the user can put 1st disk from each
enclosure into 1st group, 2nd disk from each enclosure into 2nd
group, and so on. If one enclosure fails, only one disk from each
group would fail, which won't affect draid operation, and each
group would have enough redundancy to recover the stored data. Of
course, in case of draid2 - two enclosures can fail at a time, in
case of draid3 - three enclosures (provided there are no other
disk failures in each group).
In order to preserve fast sequential resilvering in case of a
disk failure, the groups must share all disks between themselves,
and this is achieved by shuffling the disks between the groups.
But only i-th disks in each group are shuffled between themselves,
i.e. the disks from the same enclosures, after that they are
shuffled within each group, like it is done today in an ordinary
draid. Thus, no more than one disk from any enclosure can appear
in any failure group as a result of this shuffling.
For example, here's what the pool status output looks like in
case of two `draid1:2d:4c` failure groups:
    NAME                      STATE     READ WRITE CKSUM
    pool1                     ONLINE       0     0     0
      draid1:2d:4c:8w:1s-0    ONLINE       0     0     0
        enc0d0                ONLINE       0     0     0
        enc1d0                ONLINE       0     0     0
        enc2d0                ONLINE       0     0     0
        enc3d0                ONLINE       0     0     0
        enc0d1                ONLINE       0     0     0
        enc1d1                ONLINE       0     0     0
        enc2d1                ONLINE       0     0     0
        enc3d1                ONLINE       0     0     0
    spares
      draid1-0-0              AVAIL
The number of failure groups is specified indirectly via the new
width parameter in the draid vdev configuration descriptor, which is
the total number of disks and must be a multiple of the number of
children in each group. This multiple is the number of groups (width /
children). Doing it this way lets the user conveniently see at a
glance how many disks the draid has.
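The width arithmetic can be sketched as below. The function names are
hypothetical, and the index-to-group and index-to-domain mapping is only
inferred from the ordering in the status example (enc0d0..enc3d0 first,
then enc0d1..enc3d1); it says nothing about the shuffling applied later.

```c
/*
 * Number of failure groups for a given width, or -1 when width is not
 * a positive multiple of the per-group children count.
 */
static int
draid_failure_groups(unsigned width, unsigned children)
{
	if (children == 0 || width == 0 || width % children != 0)
		return (-1);
	return ((int)(width / children));
}

/* Assumed layout: disks are listed group by group in the vdev config. */
static unsigned
draid_group_of(unsigned disk_index, unsigned children)
{
	return (disk_index / children);
}

/* Assumed layout: the i-th disk of every group is in failure domain i. */
static unsigned
draid_domain_of(unsigned disk_index, unsigned children)
{
	return (disk_index % children);
}
```

With the 4c:8w example above, draid_failure_groups(8, 4) gives two
groups, and the sixth listed disk (index 5, enc1d1) falls in group 1,
domain 1.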
Spare disks are evenly distributed among failure groups, and they
are shared by all groups. However, to support domain failure, we
cannot have more than nparity - 1 failed disks in any group, even
if they are rebuilt to draid spares (the blocks of those spares
can be mapped to the disks from the failed domain, and we cannot
tolerate more than nparity failures in any failure group).
The retire agent in zed is updated to not start resilvering when
the domain failure happens. Otherwise, it might take a lot of
computing and I/O bandwidth resources, only to be wasted when the
failed domain component is replaced.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
Closes #11969
Closes #18148
`zfs change-key` changes the key used to encrypt a ZFS dataset. When
used programmatically, it may be useful to track some external state
related to the key in a user property. E.g. a generation number,
expiration date, or application-specific source of the key.
This can be done today by running `zfs set user:prop=value` before or
after running `zfs change-key`. However, this introduces a race
condition where the property may not be set even though the key has
changed, or vice versa (depending on the order the commands are
executed).
This can be addressed by using a channel program (`zfs program`) which
calls both `zfs.sync.change_key()` and `zfs.sync.set_prop()`, changing
the property and key atomically. However, it is nontrivial to write such
a channel program to handle error cases, and provide the new key
securely (e.g. without logging it).
This issue proposes to enhance `zfs change-key` to be able to atomically
set user properties while changing the encryption key. Currently `zfs
change-key` accepts `-o property=value` arguments, but the only valid
properties are keylocation, keyformat, and pbkdf2iters. We will enhance
this to also allow user properties, e.g. `-o user:prop=value`. User
properties will also be allowed when using `zfs change-key -i` to
inherit the key from the parent dataset.
Original-patch-by: Matthew Ahrens <matt@mahrens.org>
External-issue: https://www.illumos.org/issues/17847
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #18407
When sequentially resilvering allow a dRAID child to be read
as long as the DTLs indicate it should have a good copy of the
data and the leaf isn't being rebuilt. The previous check was
slightly too broad and would skip dRAID spare and replacing
vdevs if one of their children was being replaced. As long
as there exists enough additional redundancy this is fine, but
when there isn't this vdev must be read in order to correctly
reconstruct the missing data.
A new test case has been added which exhausts the available
redundancy, faults another device causing it to be degraded,
and then performs a sequential resilver for the degraded device.
In such a situation enough redundancy exists to perform the
replacement and a scrub should detect no checksum errors.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #18405
For now make it only evict the specified data from the dbuf cache.
Even though dbuf cache is small, this may still reduce eviction of
more useful data from there, and slightly accelerate ARC evictions
by making the blocks there evictable a bit sooner.
On FreeBSD this also adds support for POSIX_FADV_NOREUSE, since the
kernel translates it into POSIX_FADV_DONTNEED after every read/write.
This is not as efficient as it could be for ZFS, but that is the only
way the FreeBSD kernel currently allows POSIX_FADV_NOREUSE to be handled.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18399
When zpool create fails because a vdev is already in use, the
error message now identifies the problematic device and the pool
it belongs to, e.g.:
    cannot create 'tank': device '/dev/sdb1' is part of
    active pool 'rpool'
Implementation follows the ZPOOL_CONFIG_LOAD_INFO pattern used
by zpool import:
- Add spa_create_info to spa_t to capture error info during
vdev_label_init(), before vdev_close() resets vdev state
- When vdev_inuse() detects a conflict, read the on-disk
label to extract the pool name and store it with the
device path
- Return the info wrapped under ZPOOL_CONFIG_CREATE_INFO
through the ioctl zc_nvlist_dst to userspace
- In libzfs, zpool_create_info() unwraps the nvlist and
formats the device-specific error message
Restructure zpool_create() error handling so all switch cases
use break instead of return, eliminating duplicated cleanup
code and using the single create_failed exit path.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christos Longros <chris.longros@gmail.com>
Closes #18213
This implements zoned_uid - a ZFS property that delegates dataset
visibility and administration to user namespaces owned by a specific
UID, enabling rootless Podman/Docker with native ZFS storage.
Usage: zfs set zoned_uid=1000 pool/dataset
Problem solved:
- zfs zone requires an existing namespace PID
- Podman creates a new namespace on each container start
- Solution: delegate to UID, any namespace owned by that UID is
authorized
Authorization model — three-layer additive (all must pass):
L0 (auth): Namespace owner UID matches zoned_uid property
L1 (dsl_deleg): Per-operation grants via `zfs allow` (when pool
delegation is ON — the default)
L2 (cap tier): Linux capability in the namespace determines the
operation class permitted
While CAP_SYS_ADMIN is a namespaced capability (the namespace owner
always holds it within their own user namespace), granting blanket
access based solely on its presence is contrary to the Principle of
Least Privilege. This change introduces tiered capability requirements
so that non-destructive operations (create, snapshot, set property)
require only CAP_FOWNER, while destructive operations (destroy, rename,
clone) continue to require CAP_SYS_ADMIN — both of which are namespaced
capabilities scoped to the user namespace, not the init namespace.
When pool delegation is OFF (non-default), all zoned_uid write
operations are denied — delegation OFF means the pool admin has
opted out of delegating access entirely.
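The three layers combine roughly as in the sketch below. All names here
(zoned_op_t, caller_t, zoned_uid_write_allowed) are hypothetical and are
not the kernel interfaces listed later in this message; the model covers
write operations only.

```c
#include <stdbool.h>

/* Hypothetical operation classes, following the tiers described above. */
typedef enum {
	OP_CREATE, OP_SNAPSHOT, OP_SETPROP,	/* non-destructive */
	OP_DESTROY, OP_RENAME, OP_CLONE		/* destructive */
} zoned_op_t;

/* Hypothetical summary of the calling user namespace. */
typedef struct {
	unsigned owner_uid;	/* UID that owns the caller's namespace */
	bool has_cap_fowner;	/* CAP_FOWNER held in that namespace */
	bool has_cap_sys_admin;	/* CAP_SYS_ADMIN held in that namespace */
	bool has_deleg_grant;	/* `zfs allow` grant for this operation */
} caller_t;

static bool
op_is_destructive(zoned_op_t op)
{
	return (op == OP_DESTROY || op == OP_RENAME || op == OP_CLONE);
}

/* All three layers must pass for a write operation to be allowed. */
static bool
zoned_uid_write_allowed(const caller_t *c, unsigned zoned_uid,
    bool pool_delegation_on, zoned_op_t op)
{
	/* L0: the namespace owner must match the zoned_uid property. */
	if (c->owner_uid != zoned_uid)
		return (false);
	/* L1: delegation OFF denies all writes; ON needs a grant. */
	if (!pool_delegation_on || !c->has_deleg_grant)
		return (false);
	/* L2: capability tier depends on the operation class. */
	if (op_is_destructive(op))
		return (c->has_cap_sys_admin);
	return (c->has_cap_fowner);
}
```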
Security model:
- Namespace owner UID must match zoned_uid value
- Delegation root cannot be destroyed or escaped via rename
- Namespace users cannot modify zoned_uid itself (only global
zone admin can manage delegation assignments)
- Namespace users cannot modify the 'zoned' property
- Namespace users cannot override filesystem_limit or
snapshot_limit set by the global admin on the delegation root
(but can impose tighter sub-limits on child datasets)
- Multi-UID isolation: sibling delegations with different UIDs
cannot access each other's subtrees
Kernel changes:
- zone_dataset_attach_uid()/detach_uid() in SPL
- zone_dataset_admin_check() for write authorization with tiered
capabilities (CAP_FOWNER for non-destructive, CAP_SYS_ADMIN
for destructive)
- Callback registration for zoned_uid property lookup
- New zfs_secpolicy_zoned_uid_deleg() helper that calls
dsl_deleg_access_impl() directly, bypassing zfs_dozonecheck_ds()
which requires the `zoned` property that zoned_uid datasets lack
- Fix dsl_deleg_access_impl() hierarchy walk to accept zoned_uid
datasets (not just zoned=on)
- Update all 9 secpolicy call sites to require dsl_deleg grants
instead of short-circuiting on ZONE_ADMIN_ALLOWED
- Security policy hooks in zfs_secpolicy_*() functions
- Fixed inglobalzone() to use current_user_ns()
- zfs_prop_set_special() handles attach/detach as property
side-effects, eliminating the need for dedicated ioctls
- spa_import_os() restores zoned_uid delegations kernel-side
on pool import via dmu_objset_find() walk
- spa_export_os() detaches zoned_uid delegations on pool
destroy/export, preventing stale kernel state on recreate
- zoned_uid registered as PROP_INHERIT so child datasets
inherit the delegation, enabling sub-dataset creation
- zfs_get_zoned_uid() uses dsl_prop_get setpoint to identify
the true delegation root, correctly distinguishing inherited
values from locally-set ones for destroy/rename policy checks
- zone_dataset_check_list() accepts '@' and '#' separators in
addition to '/' so snapshots and bookmarks are visible from
delegated namespaces
- zfs_secpolicy_setprop() blocks ZFS_PROP_ZONED_UID from being
set within a delegated namespace, preventing self-revocation
- zfs_secpolicy_setprop() blocks filesystem_limit and
snapshot_limit changes on the delegation root from within a
namespace (uses dsl_prop_get setpoint to identify the root),
while allowing delegated users to set tighter sub-limits on
child datasets
- Use kcred (not CRED()) for zone_dataset_detach_uid/attach_uid
in destroy and rename cleanup paths, preventing stale tracking
entries when namespace users perform these operations
- Use cr parameter (not CRED()) in all secpolicy zoned_uid
delegation checks for correct credential propagation
Userspace changes:
- check_parents() defers to kernel when zoned_uid set
FreeBSD compatibility:
- include/os/freebsd/spl/sys/zone.h — Added FreeBSD stubs:
- zone_uid_op_t enum (ZONE_OP_CREATE, SNAPSHOT, CLONE, DESTROY,
RENAME, SETPROP)
- zone_admin_result_t enum (NOT_APPLICABLE, ALLOWED, DENIED)
- zone_dataset_admin_check() — static inline, always returns
ZONE_ADMIN_NOT_APPLICABLE
- zone_dataset_attach_uid() — static inline, returns ENXIO
- zone_dataset_detach_uid() — static inline, returns ENXIO
- zone_get_zoned_uid_fn_t callback typedef
- zone_register_zoned_uid_callback() — static inline no-op
- zone_unregister_zoned_uid_callback() — static inline no-op
- On FreeBSD, every zone_dataset_admin_check() call returns
ZONE_ADMIN_NOT_APPLICABLE, causing all security policy functions
to fall through to existing jail-based permission checks
- Setting zoned_uid on FreeBSD returns ENXIO since user namespace
delegation requires Linux user namespaces
Test changes:
- Add grant_deleg() calls to tests 006-022 for operations that now
require explicit dsl_deleg grants
- Add tests 023-030 validating the capability tier model
- Add test 031 validating stale zone tracking cleanup after
namespace rename+destroy
- Fix capsh lookup in test helpers for ksh -p restricted PATH
(command -v + explicit /usr/sbin fallback)
- Add mountpoint=none to tests 023-026 to avoid mount-lock issues
in user namespaces
- Fix test 026 expectations to match kernel behavior (delegation
OFF denies all writes, allows read-only)
- run_in_userns helper resolves absolute zfs path to handle
environments where PATH does not include zfs (source builds)
- Test 004 updated: zoned_uid now inherits (PROP_INHERIT), test
verifies inheritance and override behavior
- Test 013 uses within_percent with parseable byte output (-Hp)
for robust quota value comparison across environments
- Test 014: verifies grandchild dataset creation from user
namespace, confirming inherited zoned_uid delegation works
- Test 015: pool destroy/recreate with zoned_uid delegation
- Test 016: individual snapshot destroy from namespace
- Test 017: namespace user cannot modify zoned_uid property
- Test 018: clone operations from within delegated namespace
- Test 019: multi-UID isolation between sibling delegations
- Test 020: operations without zone_dataset_admin_check()
integration are denied via zfs_dozonecheck_impl()
- Test 021: 'zoned' property cannot be modified from namespace
- Test 022: delegation root limit overrides blocked from namespace
- Quoted shell variables across all test scripts for robustness
- Shellcheck SC2155 fixes across all test scripts
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Colin K. Williams / LINK ORG LLC / li-nk.social <colin@li-nk.org>
Closes #18167
While FreeBSD does not support relatime natively, it seems trivial
to implement it as a dataset property for consistency. To preserve
the status quo, change its default to off on FreeBSD. Now,
if explicitly enabled, it should actually work.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18385
The zpool offline man page says that you cannot use 'zpool offline'
on spares. However, testing found that you could in fact force fault
(zpool offline -f) spares.
Change the policy to:
1. You can never force-fault or offline dRAID spares.
2. You can only force-fault or offline traditional spares if they're
active.
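The new policy reduces to a two-input predicate. The function below is a
hypothetical illustration of the two rules, not the check added to the
vdev code.

```c
#include <stdbool.h>

/*
 * May this spare be offlined or force-faulted?
 * Rule 1: never for dRAID spares.
 * Rule 2: traditional spares only while they are active.
 */
static bool
spare_offline_allowed(bool is_draid_spare, bool is_active)
{
	if (is_draid_spare)
		return (false);
	return (is_active);
}
```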
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #18282
When using --exclude, filtering needs to take place in two places:
in zfs_main.c via the callback previously added to support the
options, and in libzfs_sendrecv.c because it generates the nvlist
during a first pass, and that results in it complaining if the
excluded dataset is not available for sending. (e.g., excluding an
encrypted dataset so you don't have to use --raw wouldn't work,
because the first pass would look at the dataset and decide you
couldn't use it.) Add send --exclude tests, including one that tests
excluding an encrypted hierarchy.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sean Eric Fagan <sef@kithrup.ie>
Closes #18278
It can be used to drop extraneous records in a send stream caused by a
corrupt dataset, as in issue #18239.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored by: ConnectWise
Closes #18275
Teach `zfs {create,clone,rename}` to accept a doubled `-p` flag (`-pp`)
to create non-existing ancestor datasets with `canmount=off`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name>
Closes #17000
When creating a pool with devices that have incompatible block sizes,
the kernel returns EDOM. However, zpool_create() did not handle this
errno, falling through to zpool_standard_error() which produced a
confusing message about invalid property values.
Add a case EDOM handler in zpool_create() to return EZFS_BADDEV with
a descriptive auxiliary message, consistent with the existing EDOM
handler in zpool_vdev_add().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christos Longros <chris.longros@gmail.com>
Closes #18268
When we clear the log, we should clear all the fields, not only
zh_log. Otherwise remaining ZIL_REPLAY_NEEDED will prevent the
vdev removal. Handle it also from the other side, when zh_log
is already cleared, while zh_flags is not.
spa_vdev_remove_log() asserts that allocated space on the removed log
device is zero. While that should hold in a perfect world, it might
not if space leaked at any point.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18277
The existing zpool properties accounting pool space (size, allocated,
fragmentation, expandsize, free, capacity) are based on the normal
metaslab class or are cumulative properties of several classes combined.
Add properties reporting the space accounting metrics for each metaslab
class individually.
Also introduce pool-wide AVAIL, USABLE, and USED properties reporting
values corresponding to FREE, SIZE, and ALLOC deflated for raidz.
Update ZTS to recognize the new properties and validate reported values.
While in zpool_get_parsable.cfg, add "fragmentation" to the list of
parsable properties.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Ryan Moeller <ryan.moeller@klarasystems.com>
Closes #18238
Added vdev property to disable the vdev scheduler.
The intention behind this property is to improve IOPS
performance when using O_DIRECT.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: MigeljanImeri <ImeriMigel@gmail.com>
Closes #17358
Rewrite of cloned and snapshotted blocks can allocate additional
space, which may be undesirable. In some cases it may make sense
to still rewrite snapshotted blocks, expecting the snapshots to
rotate with time, freeing space. In other cases rewrite of cloned
blocks may be acceptable, despite a persistent increase in space usage.
For this reason add them as separate flags to `zfs rewrite`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18179
This ensures that the in-memory state of the feature is recorded and
that `dsl_dataset_activate_feature` is not called when the feature
is already active.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Austin Wise <AustinWise@gmail.com>
Closes#18143
Closes#18144
ZFS send streams include a feature flag DMU_BACKUP_FEATURE_LARGE_BLOCKS
to indicate the presence of large blocks in the dataset. On the sending
side, this flag is included if the `-L` flag is passed to `zfs send`
and the feature is active in the dataset. On the receive side, the
stream is refused if the feature is active in the destination dataset
but the stream does not include the feature flag.
The problem is the feature is only activated when a large block is
born. If a large block has been born in the destination, but never
in the source, the send can't work. This can arise when sending streams
back and forth between two datasets.
This commit fixes the problem by always activating the large blocks
feature when receiving a stream with the large block feature flag.
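The receive-side rule and the fix described above can be sketched roughly as
follows. This is only an illustration, not the actual receive code: the struct,
the function names, and the flag's bit value are hypothetical stand-ins.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for DMU_BACKUP_FEATURE_LARGE_BLOCKS.
 * The bit value here is illustrative only.
 */
#define	SK_FEATURE_LARGE_BLOCKS	(1ULL << 0)

/* Hypothetical dataset model: tracks only whether large blocks are active. */
struct sk_dataset {
	bool large_blocks_active;
};

/*
 * Receive-side check: refuse the stream when the destination has the
 * feature active but the stream does not carry the feature flag.
 */
static bool
sk_recv_allowed(const struct sk_dataset *dst, uint64_t stream_flags)
{
	bool stream_has = (stream_flags & SK_FEATURE_LARGE_BLOCKS) != 0;

	return (!(dst->large_blocks_active && !stream_has));
}

/*
 * The fix, as modelled here: activate the feature whenever the stream
 * carries the flag, even if no large block is born by the receive.
 */
static void
sk_recv_finish(struct sk_dataset *dst, uint64_t stream_flags)
{
	if (stream_flags & SK_FEATURE_LARGE_BLOCKS)
		dst->large_blocks_active = true;
}
```

With this model, a dataset that received a flagged stream will itself send
flagged streams afterwards, so a later send back to the original side is no
longer refused.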
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Austin Wise <AustinWise@gmail.com>
Closes#18105
In #17180, we fixed an interesting bug that I believe I hit in one of my
pools, but as far as I can tell, there was no test for it.
This patch adds a regression test for #17180, minimised from my attempts
to reproduce the bug in a way that resembled the history of my pool.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Signed-off-by: delan azabani <dazabani@igalia.com>
Closes#18109
Add snapshot_019_pos to verify parallel snapshot automount operations
don't cause AVL tree panic. Regression test for commit 4ce030e025.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18035
Implement BRT (Block Reference Table) prefetch functionality similar
to existing DDT prefetch. This allows preloading BRT metadata into
ARC to improve performance for block cloning operations and frees
of earlier cloned blocks.
Make the -t parameter optional. When omitted, prefetch all supported
metadata types (both DDT and BRT now).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17890
When the default value of the xattr property was changed from 'dir' to
'sa', the code that displays the property's value was not affected. The
problem with this state of affairs is twofold: 1) user tooling that
specifically looked for 'sa' before will be confused now that the code
displays 'on' instead, and 2) users running the commands manually may be
confused about which specific type of xattr is in use unless
they are up to date on the latest zfs changes.
The fix here is to show the actual type always, rather than 'on' if we
happen to be using the default. This turns out to be easy to do, by
simply reordering the list of xattr values in the properties code. When
the property is displayed, we iterate down the table until we find a row
with a matching value, and use that row's name as the displayed
value. Reordering the rows fixes the display without affecting any
other code.
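To illustrate why row order matters, here is a minimal sketch of a first-match
table lookup. The table layout, the names, and the numeric values are
assumptions made for the example, not the real property table.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative values only; the real constants live in the ZFS headers. */
enum { SK_XATTR_OFF = 0, SK_XATTR_DIR = 1, SK_XATTR_SA = 2 };

struct sk_index_row {
	const char *name;
	uint64_t value;
};

/* Assumed old order: "on" precedes "sa" and both map to the same value. */
static const struct sk_index_row sk_xattr_old[] = {
	{ "off", SK_XATTR_OFF },
	{ "on", SK_XATTR_SA },
	{ "sa", SK_XATTR_SA },
	{ "dir", SK_XATTR_DIR },
	{ NULL, 0 }
};

/* Assumed new order: "sa" comes first, so it wins the value lookup. */
static const struct sk_index_row sk_xattr_new[] = {
	{ "off", SK_XATTR_OFF },
	{ "sa", SK_XATTR_SA },
	{ "dir", SK_XATTR_DIR },
	{ "on", SK_XATTR_SA },
	{ NULL, 0 }
};

/* Display path: the first row whose value matches supplies the name. */
static const char *
sk_value_to_name(const struct sk_index_row *tbl, uint64_t value)
{
	for (; tbl->name != NULL; tbl++) {
		if (tbl->value == value)
			return (tbl->name);
	}
	return (NULL);
}

/* Parse path: matches by name, so row order does not affect it. */
static int
sk_name_to_value(const struct sk_index_row *tbl, const char *name,
    uint64_t *value)
{
	for (; tbl->name != NULL; tbl++) {
		if (strcmp(tbl->name, name) == 0) {
			*value = tbl->value;
			return (0);
		}
	}
	return (-1);
}
```

In this sketch the value lookup returns "on" for the old table and "sa" for the
new one, while parsing "on" yields the same value with either ordering.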
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes#17801
When running zpool iostat in interval mode, it would not notice any new
pools created or imported, and would forget any that were destroyed or
exported, so it would not notice if they came back. This leads to
outputting "no pools available" every interval until killed.
It looks like this was at least intended to work; the comment above
zpool_do_iostat() indicates that it is expected to "deal with pool
creation/destruction" and that pool_list_update() would detect new
pools. That call however was removed in 3e43edd2c5, though it's unclear
if that broke this behaviour and it wasn't noticed, or if it never
worked, or if something later broke it. That said, the lack of
pool_list_update() is only part of the reason it doesn't work properly.
The fundamental problem is that the various things involved in
refreshing or updating the list of pools would aggressively ignore,
remove, skip or fail on pools that stop existing, or that already exist.
Mostly this meant that once a pool is removed from the list, it will
never be seen again. Restoring pool_list_update() to the
zpool_do_iostat() loop only partially fixes this - it would find "new"
pools again, but only in the "all pools" (no args) mode, and because its
iterator callback add_pool() would abort the iterator if it already has
a pool listed, it would only add pools if there weren't any already.
So, this commit reworks the structure somewhat. pool_list_update()
becomes pool_list_refresh(), and will ensure the state of all pools in
the list is updated. In the "all pools" mode, it will also add new
pools and remove pools that disappear, but when a fixed list of pools is
used, the list doesn't change, only the state of the pools within it.
The rest of the commit is adjusting things for this much simpler
structure. Regardless of the mode in use, pool_list_refresh() will
always do the right thing, so the driver code can just get on with the
display.
Now that pools can appear and disappear, I've made it so the header (if
enabled) is re-printed when the list changes, so that it's easier to see
what's happening if the column widths change.
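The refresh behaviour described above might be modelled like this. It is a
simplified sketch under stated assumptions: pools are plain names, the "live"
array stands in for whatever the system currently reports, and every identifier
is hypothetical rather than taken from the zpool source.

```c
#include <stdbool.h>
#include <string.h>

#define	SK_MAX_POOLS	16

struct sk_pool {
	const char *name;
	bool present;		/* does the pool currently exist? */
};

struct sk_pool_list {
	struct sk_pool pools[SK_MAX_POOLS];
	int count;
	bool all_pools;		/* true when no pool arguments were given */
};

static bool
sk_name_in(const char *name, const char *const *live, int nlive)
{
	for (int i = 0; i < nlive; i++) {
		if (strcmp(name, live[i]) == 0)
			return (true);
	}
	return (false);
}

/*
 * Refresh the list against the currently live pools. Returns true when
 * list membership changed, which is when a caller would reprint the header.
 */
static bool
sk_pool_list_refresh(struct sk_pool_list *pl, const char *const *live,
    int nlive)
{
	bool changed = false;
	int kept = 0;

	/* Update every listed pool; in "all pools" mode drop missing ones. */
	for (int i = 0; i < pl->count; i++) {
		struct sk_pool p = pl->pools[i];

		p.present = sk_name_in(p.name, live, nlive);
		if (pl->all_pools && !p.present) {
			changed = true;
			continue;
		}
		pl->pools[kept++] = p;
	}
	pl->count = kept;

	if (!pl->all_pools)
		return (changed);	/* a fixed list never grows or shrinks */

	/* In "all pools" mode, add live pools that are not listed yet. */
	for (int i = 0; i < nlive && pl->count < SK_MAX_POOLS; i++) {
		bool found = false;

		for (int j = 0; j < pl->count; j++) {
			if (strcmp(pl->pools[j].name, live[i]) == 0)
				found = true;
		}
		if (!found) {
			pl->pools[pl->count].name = live[i];
			pl->pools[pl->count].present = true;
			pl->count++;
			changed = true;
		}
	}
	return (changed);
}
```

In the fixed-list case only the present flag changes, so a pool that vanishes
and later returns is picked up again without being re-added.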
Since this is all rather complicated, I've included tests for the "all
pools" and "set of pools" modes.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17786
This change adds support for ZFS_KEYFORMAT_RAW to zdb_derive_key in
zdb.c. The implementation reads the raw key from the file specified
by the -K option which is consistent with how raw keys are handled in
the other parts of ZFS, along with a check to ensure that the keyfile
doesn't have too many bytes.
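A minimal sketch of reading a raw key with a length check is shown below. The
32-byte key length and the function name are assumptions made for the example,
and the rejection of short files is an extra assumption beyond the "too many
bytes" check mentioned above; this is not the zdb implementation.

```c
#include <stdio.h>
#include <string.h>

/* Assumed raw key length in bytes; hypothetical name. */
#define	SK_RAW_KEY_LEN	32

/*
 * Read a raw key from fp into key. One extra byte is requested so that
 * a file holding more than SK_RAW_KEY_LEN bytes can be detected.
 * Returns 0 on success, -1 if the file is too short or too long.
 */
static int
sk_read_raw_key(FILE *fp, unsigned char key[SK_RAW_KEY_LEN])
{
	unsigned char buf[SK_RAW_KEY_LEN + 1];
	size_t n = fread(buf, 1, sizeof (buf), fp);

	if (n != SK_RAW_KEY_LEN)
		return (-1);
	memcpy(key, buf, SK_RAW_KEY_LEN);
	return (0);
}
```

Reading one byte past the expected length avoids a separate stat() call and
also works for inputs whose size is not known up front.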
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Patrick Xia <patrickx@google.com>
Closes#17783
Three cases were discovered where 'zpool add' would fail to
warn when adding vdevs to a pool with a mismatched replication
level. These are:
1. When a pool contains mixed file and disk vdevs.
2. When a pool contains an active dRAID distributed spare.
3. When a pool contains an active hot spare.
The lack of warnings is caused by get_replication() assessing
the current pool configuration as inconsistent and disabling
the mismatched replication check for the new pool configuration
after 'zpool add'. This change updates get_replication() to
be slightly more tolerant in the non-fatal case.
The zpool_add_010_pos.ksh test case was split into separate
tests: zpool_add_warn_create.ksh, zpool_add_warn_degraded.ksh,
and zpool_add_warn_removal.ksh. These tests were extended to
include coverage for dRAID pools and the three scenarios
described above.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17780
The time database update math assumed that the timestamps were in
nanoseconds, but at some point in the development or review process they
changed to seconds. This PR fixes the math to use seconds instead.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes#17735
A single slow responding disk can affect the overall read
performance of a raidz group. When a raidz child disk is
determined to be a persistent slow outlier, then have it
sit out during reads for a period of time. The raidz group
can use parity to reconstruct the data that was skipped.
Each time a slow disk is placed into a sit out period, its
`vdev_stat.vs_slow_ios` count is incremented and a zevent
class `ereport.fs.zfs.delay` is posted.
The length of the sit out period can be changed using the
`raid_read_sit_out_secs` module parameter. Setting it to
zero disables slow outlier detection.
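The sit-out bookkeeping described above could be modelled as follows. This
sketch covers only the counter and the timed period; the outlier detection
itself and the zevent posting are omitted, and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-child state; models only what the text describes. */
struct sk_child {
	uint64_t slow_ios;	/* counterpart of vs_slow_ios */
	uint64_t sit_out_until;	/* in seconds; 0 means not sitting out */
};

/*
 * Start a sit-out period for a child judged to be a persistent slow
 * outlier. sit_out_secs plays the role of the module parameter; zero
 * disables the mechanism. Returns true if a period was started (the
 * real code would also post the delay zevent at this point).
 */
static bool
sk_begin_sit_out(struct sk_child *c, uint64_t now, uint64_t sit_out_secs)
{
	if (sit_out_secs == 0)
		return (false);
	c->sit_out_until = now + sit_out_secs;
	c->slow_ios++;
	return (true);
}

/* While this is true, reads skip the child and rely on parity instead. */
static bool
sk_is_sitting_out(const struct sk_child *c, uint64_t now)
{
	return (now < c->sit_out_until);
}
```

For example, starting a 600 second period at time 100 keeps the child out of
reads until time 700, after which it is read from again.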
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Contributions-by: Don Brady <don.brady@klarasystems.com>
Contributions-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17227
When attempting to debug performance problems on large systems, one of
the major factors that affect performance is free space
fragmentation. This heavily affects the allocation process, which is an
area of active development in ZFS. Unfortunately, fragmenting a large
pool for testing purposes is time consuming; it usually involves filling
the pool and then repeatedly overwriting data until the free space
becomes fragmented, which can take many hours. And even if the time is
available, artificial workloads rarely generate the same fragmentation
patterns as the natural workloads they're attempting to mimic.
This patch has two parts. First, in zdb, we add the ability to export
the full allocation map of the pool. It iterates over each vdev,
printing every allocated segment in the ms_allocatable range tree. This
can be done while the pool is online, though in that case the allocation
map may actually be from several different TXGs as new ones are loaded
on demand.
The second is a new subcommand for zhack, zhack metaslab leak (and its
supporting kernel changes). This subcommand imports a
pool and then modifies the range trees of the metaslabs, allowing the
sync process to write them out normally. It does not currently store
those allocations anywhere to make them reversible, and there is no
corresponding free subcommand (which would be extremely dangerous); this
is an irreversible process, only intended for performance testing. The
only way to reclaim the space afterwards is to destroy the pool or roll
back to a checkpoint.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes#17576
These are all the same shape: set up the pool to suspend on first write,
then perform some write+sync operation. The pool should suspend, and the
sync operation should respond according to the failmode= property.
We test fsync(), msync() and two forms of write() (open with O_SYNC, and
async with sync=always), which all take slightly different paths to
zil_commit() and back.
A helper function is included to do the write+sync sequence with mmap()
and msync(), since I didn't find a convenient tool to do that.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17398
Based on the previous commit, this implements the `zfs rewrite -P` flag,
making ZFS keep blocks' logical birth times while rewriting
files. It should exclude the rewritten blocks from incremental
sends, snapshot diffs, etc. At the same time, snapshot space usage will
reflect the additional space usage from newly allocated blocks.
Since this begins to use the new "rewrite" flag in block pointers,
this commit introduces a new read-compatible per-dataset feature,
physical_rewrite. It must be enabled for the command to not fail;
it is activated on first use and deactivated on deletion of the
last affected dataset.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17565