KekenoBSD/src

Author	SHA1	Message	Date
Paul Dagnelie	891e379d0f	Fix failfast default and usage The feature that added a failfast property to vdevs unfortunately did not correctly set the default at creation time, so many vdevs do not actually have the property set. In addition, when the property is used, the failfast flag is not checked correctly, resulting in the feature mostly not working as intended. Set the failfast property to the default value at vdev allocation time. The value will be read in from the ZAP as normal when the vdev metadata is loaded. Allow the property to be set on any vdev and have it be inherited from the root or top-level vdev. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #18410	2026-05-18 09:12:09 -07:00
Alexander Motin	f5733f6fa3	Integrate DDT and BRT tests Don't disable block cloning during dedup tests. Just don't use cp to not trigger it. Add a new test, explicitly mixing dedup and cloning on the same file, that should be handled by DDT. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@truenas.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18520	2026-05-13 07:48:14 -07:00
Alexander Motin	181e1b5227	Fix double free for blocks cloned after DDT prune Before this change, for blocks marked with D flag but absent in DDT (pruned from it), zio_ddt_free() fell back to ZIO_STAGE_DVA_FREE without trying ZIO_STAGE_BRT_FREE first. At the same time, such blocks might be present in BRT, and not handling that would result in double/multiple free. This change makes ZIO_DDT_FREE_PIPELINE include ZIO_FREE_PIPELINE, just adding required ZIO_STAGE_ISSUE_ASYNC and ZIO_STAGE_DDT_FREE, and moves DDT stages before BRT. This way, if the block is found in DDT by zio_ddt_free(), the pipeline is short-circuited to ZIO_INTERLOCK_PIPELINE, similar to what zio_brt_free() does. If not, then BRT is checked, and if also no match, the block is freed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@truenas.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18520	2026-05-13 07:47:34 -07:00
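The free ordering described in 181e1b5227 (DDT first, then BRT, then the actual free) can be sketched as a toy model. This is only an illustration under stated assumptions: `free_block` and the two dictionaries standing in for the DDT and BRT reference tables are hypothetical and are not part of the ZFS zio pipeline.

```python
def free_block(bp, ddt, brt):
    """Toy model of the free order: check DDT, then BRT, then free the DVA.

    ddt and brt are hypothetical dicts mapping a block id to a count of
    references still tracked by that table. Returns the name of the stage
    that absorbed the free.
    """
    for name, table in (("ddt", ddt), ("brt", brt)):
        if table.get(bp, 0) > 0:
            # A tracked reference exists: drop it and stop here, which
            # mirrors the short-circuit described in the commit message.
            table[bp] -= 1
            if table[bp] == 0:
                del table[bp]
            return name
    # Neither table knows the block, so the space itself is released.
    return "dva"


# A block pruned from the DDT but still cloned once in the BRT:
ddt, brt = {}, {"blk": 1}
first = free_block("blk", ddt, brt)   # absorbed by the BRT
second = free_block("blk", ddt, brt)  # only now is the space released
```

In this model, skipping the BRT check would release the space on the first call and again on the second, which is the double free the commit describes.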
Garth Snyder	eaaea55b69	Consistently encode DRR_BEGIN packed nvlist payloads with NV_ENCODE_XDR Currently, zfs send generates a mix of nvlist encodings in DRR_BEGIN records, some XDR and some in native byte order. The result is that most streams currently can't be zfs received on opposite-endian systems. zfs send generates the outer wrappers for compound streams in userspace, and it explicitly requests NV_ENCODE_XDR format for those records. But the BEGIN records for individual datasets are generated on the kernel side, in dmu_send.c, where fnvlist_pack() is used for encoding. That routine hard-wires NV_ENCODE_NATIVE format. This PR replaces the fnvlist_pack() call with a direct call to nvlist_pack() that specifies NV_ENCODE_XDR. Tests are included to verify that native-encoded nvlists are not generated by any kernel path that attaches nvlists to BEGIN records. There's also a check for XDR encoding in the outer wrapper of replication streams in case there is ever a regression there. There are also two tests that have a chance of triggering (and detecting) bug #18491. Non-triggering versions of those tests are already included here, so when that bug is more fully characterized, the tests can be moved to a more directly relevant category. (They are the two tests with _with_write suffixes.) This PR adds to zstream dump an output line that shows the exact encoding of any nvlists in BEGIN records. This feature is used by the tests to validate streams. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Garth Snyder <garth@garthsnyder.com> Closes #18360 Closes #18372	2026-05-12 07:49:55 -07:00
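To illustrate why native-order payloads in eaaea55b69 fail on opposite-endian receivers, here is a minimal sketch that encodes a single 32-bit integer. It is not the nvlist wire format; the helper names are hypothetical, and the only fact relied on is that XDR uses big-endian byte order.

```python
import struct


def encode_xdr_u32(value):
    """Encode a 32-bit unsigned integer in XDR (always big-endian) order."""
    return struct.pack(">I", value)


def decode_xdr_u32(buf):
    """Decode an XDR 32-bit unsigned integer; the result is host-independent."""
    return struct.unpack(">I", buf)[0]


def encode_native_u32(value, byteorder):
    """Encode in the sender's byte order ('little' or 'big')."""
    return value.to_bytes(4, byteorder)


def decode_native_u32(buf, byteorder):
    """Decode assuming the receiver's byte order."""
    return int.from_bytes(buf, byteorder)


# A little-endian sender and a big-endian receiver disagree on native data,
# while the XDR round trip yields the same value on both.
native_mismatch = decode_native_u32(encode_native_u32(1, "little"), "big")
xdr_roundtrip = decode_xdr_u32(encode_xdr_u32(1))
```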
Sean Eric Fagan	a2d053329c	Add some more file layout output, triggered by -v With one -v, the block type (parity or data) is printed (matching the ASCII-art version); with two -v, the offset into the file is also printed. This also updates the man page, and adds some simple test scripts. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Sean Fagan <sean.fagan@klarasystems.com> Signed-off-by: Sean Fagan <sean.fagan@klarasystems.com> Closes #18470	2026-05-07 13:22:38 -07:00
Alexander Motin	d65015938e	Vdev allocation bias/class change Normal, special and dedup vdevs differ only by space allocation bias. Normal and special vdevs might even legally store blocks targeted to other classes. Dedup vdevs don't normally do it, but there is no real reason why they can't. Considering this, it is not impossible to change the allocation bias for those vdevs. This change introduces a new top-level vdev property -- alloc_bias, reporting current bias for the vdev, and allowing to change it. This allows to easily change vdev role in a pool, especially if vdev removal is impossible. To not complicate the code, changes take effect only on next pool import. Changes to/from log vdev could also be theoretically possible, but they are artificially blocked for now, partially due to additional complications, and partially due to potential danger of placing other blocks on log vdevs, that would otherwise be non-fatal. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alek Pinchuk <alek.pinchuk@connectwise.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18493	2026-05-07 09:16:39 -07:00
Manoj Joseph	e78a51dd6f	Fix off-by-one in PREVIOUSLY_REDACTED handler that drops last block In send_reader_thread(), the PREVIOUSLY_REDACTED handler computed file_max as MIN(dn->dn_maxblkid, range->end_blkid). dn_maxblkid is an inclusive maximum block ID while range->end_blkid is exclusive (one past the last block). The resulting file_max was then used as an exclusive loop bound, causing the last block of any file (at index dn_maxblkid) to be silently skipped when a PREVIOUSLY_REDACTED range covered the end of the file. The block was never written to the send stream so the receiver kept zeros there. ZFS reported no error because the stream itself was valid; the data was simply absent. Fix: use dn_maxblkid + 1 so file_max is consistently exclusive. Add a regression test (redacted_max_blkid.ksh) that modifies only the last block of a file in one clone, creates a redaction bookmark from it, then sends an unmodified clone incrementally from that bookmark. The PREVIOUSLY_REDACTED path must fill in the last block; the test verifies it is not zeros and matches the original. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Manoj Joseph <manoj.joseph@delphix.com> Closes #18477	2026-05-01 12:03:29 -07:00
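The inclusive/exclusive mix-up fixed in e78a51dd6f can be reproduced in a few lines. This is a sketch of the bound arithmetic only; `blocks_to_send` and its `start_blkid` and `fixed` parameters are hypothetical names, not the code in send_reader_thread().

```python
def blocks_to_send(dn_maxblkid, end_blkid, start_blkid=0, fixed=True):
    """Return the block IDs visited for a range ending at end_blkid.

    dn_maxblkid is inclusive (the last valid block ID); end_blkid is
    exclusive (one past the last block in the range). The result of min()
    is used as an exclusive loop bound, so both operands must be exclusive.
    """
    last_exclusive = dn_maxblkid + 1 if fixed else dn_maxblkid
    file_max = min(last_exclusive, end_blkid)
    return list(range(start_blkid, file_max))


# A four-block file (IDs 0..3) with a range that runs past the end of the file:
before = blocks_to_send(3, 10, fixed=False)  # the last block is skipped
after = blocks_to_send(3, 10, fixed=True)    # every block is visited
```

With the old bound the list stops at block 2, so block 3 never reaches the stream; adding one to the inclusive maximum restores it.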
Rob Norris	6748e7e65e	ZTS: add libzfs_mnttab_cache test This is the repro test from #18464, and confirms that when disabled, the libzfs_mnttab_cache is discarded and reloaded on every lookup. Sponsored-by: TrueNAS Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Prakash Surya <prakash.surya@perforce.com> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18466 Closes #18464	2026-05-01 11:47:56 -07:00
Prakash Surya	4acb62930b	libspl/mnttab: follow symlinks when resolving path via statx (#18469 ) When the path argument to "zfs list -Ho name <path>" (or any caller of zfs_path_to_zhandle()) is a symlink that crosses a mount boundary, the wrong dataset is returned. Instead of returning the dataset that owns the symlink's target, getextmntent() matches the dataset containing the symlink itself. For example, given two ZFS datasets "tank/ds1" and "tank/ds2", and a symlink "/tank/ds1/link" pointing into "/tank/ds2": $ sudo zfs list -Ho name /tank/ds1/link tank/ds1 The expected (and previous) behavior is to return "tank/ds2", since the symlink's target resides in that dataset. The problem is in getextmntent(), in lib/libspl/os/linux/mnttab.c. That function calls statx() on the caller-supplied path to obtain its mnt_id (used to match against the mnt_id of each entry in /proc/self/mounts), and it passes AT_SYMLINK_NOFOLLOW to that statx() call. As a result, the mnt_id returned reflects the symlink's location rather than the symlink target's mount, and the wrong /proc/self/mounts entry is matched. The same function also calls stat64() on the caller-supplied path (used as a fallback when STATX_MNT_ID is not available, and to populate the statbuf out-parameter). stat64() always follows symlinks, so the statx() and stat64() calls were inconsistent: one resolved the symlink, the other didn't. The AT_SYMLINK_NOFOLLOW behavior may be appropriate when statx() is called on a mount entry from /proc/self/mounts (which is always a real directory), but it is wrong for caller-supplied paths, which may be symlinks. This bug was introduced by `523d9d6007` ("Validate mountpoint on path-based unmount using statx"), which added the STATX_MNT_ID code path. However, the bug was latent: config/user-statx.m4 omitted "#define _GNU_SOURCE" when checking for STATX_MNT_ID in <sys/stat.h>, so HAVE_STATX_MNT_ID was never defined, and the buggy statx() path was never compiled in. 
getextmntent() always fell back to the dev_t comparison via stat64(), which correctly follows symlinks. The fix to that autoconf check, in `2b930f63f8` ("config: fix STATX_MNT_ID detection"), caused HAVE_STATX_MNT_ID to be properly defined on kernels that support it, activating the broken AT_SYMLINK_NOFOLLOW path for the first time and exposing the regression. The fix is to drop AT_SYMLINK_NOFOLLOW from the statx() call so that symlinks are followed, matching the behavior of stat64() on the same path. Verified with a minimal reproducer: created two ZFS datasets, placed a symlink inside the first pointing into the second, and confirmed that "zfs list -Ho name <symlink>" returns the dataset containing the symlink's target rather than the dataset containing the symlink. Signed-off-by: Prakash Surya <prakash.surya@perforce.com> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>	2026-04-28 09:24:24 -07:00
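The difference between following and not following a symlink, which is the root of the bug in 4acb62930b, can be observed with a plain stat call. The sketch below assumes a POSIX system where symlinks can be created; it compares device and inode numbers as a stand-in for the mnt_id that statx() reports, since Python's os.stat does not expose mnt_id. The helper names are hypothetical.

```python
import os
import tempfile


def same_file(a, b):
    """Two stat results describe the same object if device and inode match."""
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def compare_follow_modes():
    """Stat a symlink both ways and compare each result with its target."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "target")
        link = os.path.join(d, "link")
        os.mkdir(target)
        os.symlink(target, link)

        target_st = os.stat(target)
        followed = os.stat(link)                             # like stat64()
        not_followed = os.stat(link, follow_symlinks=False)  # like AT_SYMLINK_NOFOLLOW
        return same_file(followed, target_st), same_file(not_followed, target_st)


follows_target, nofollow_target = compare_follow_modes()
```

Only the following call describes the target. When the link and its target sit on different mounts, the non-following call therefore identifies the wrong mount, which is the mismatch the commit removes.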
Paul Dagnelie	6562851406	Handle raidz errors <= nparity rather than ignoring This PR adds a check in the mirror and raidz code for the case where there are errors <= nparity. In that case, ZFS sets a new flag on the zio that will be checked in zio_done. If that flag is set, when the write IO completes, we issue a read IO for the same blkptr. That will allow ZFS's auto-healing mechanisms and other error recovery tools to detect the effectively-corrupt data, and handle it accordingly. Note that because draid uses raidz's IO done function, it also benefits from this functionality. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #18387	2026-04-21 14:17:37 -07:00
Gary Guo	1644e2ffd2	Fix read corruption after block clone after truncate When copy_file_range overwrites a recent truncation, subsequent reads can incorrectly determine that the block is a hole instead of reading the cloned blocks. This can happen when the following conditions are met: - Truncate adds blkid to dn_free_ranges - A new TXG is created - copy_file_range calls dmu_brt_clone which overrides the block pointer and sets DB_NOFILL - Subsequent read, given DB_NOFILL, hits dbuf_read_impl and dbuf_read_hole - dbuf_read_hole calls dnode_block_freed, which returns TRUE because the truncated blkids are still in dn_free_ranges This will not happen if the clone and truncate are in the same TXG, because the block clone would update the current TXG's dn_free_ranges, which is why this bug only triggers under high IO load (such as compilation). Fix this by skipping the dnode_block_freed call if the block is overridden. The fix shouldn't cause an issue when the cloned block is subsequently freed in later TXGs, as dbuf_undirty would remove the override. This requires a dedicated test program as it is much harder to trigger with scripts (this needs to generate a lot of I/O in a short period of time for the bug to trigger reliably). Assisted-by: Gemini:gemini-3.1-pro Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Gary Guo <gary@kernel.org> Closes #18412 Closes #18421	2026-04-15 14:51:53 -07:00
Andriy Tkachuk	2abf469be5	draid: fix cksum errors after rebuild with degraded disks Currently, when more than nparity disks get faulted during the rebuild, only first nparity disks would go to faulted state, and all the remaining disks would go to degraded state. When a hot spare is attached to that degraded disk for rebuild creating the spare mirror, only that hot spare is getting rebuilt, but not the degraded device. So when later during scrub some other attached draid spare happens to map to that spare, it will end up with cksum error. Moreover, if the user clears the degraded disk from errors, the data won't be resilvered to it, hot spare will be detached almost immediately and the data that was resilvered only to it will be lost. Solution: write to all mirrored devices during rebuild, similar to traditional/healing resilvering, but only if we can verify the integrity of the data, or when it's the draid spare we are writing to, in which case we are writing to a reserved spare space, and there is no danger to overwrite any good data. The argument that writing only to rebuilding draid spare vdev is faster than writing to normal device doesn't hold since, at a specific offset being rebuilt, draid spare will be mapped to a normal device anyway. redundancy_draid_degraded2 automation test is added also to cover the scenario. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andriy Tkachuk <atkachuk@wasabi.com> Closes #18414	2026-04-15 14:48:00 -07:00
Andriy Tkachuk	d1b0a69825	draid: add failure domains support Currently, the only way to tolerate the failure of the whole enclosure is to configure several draid vdevs in the pool, each vdev having disks from different enclosures. But this essentially degrades draid to raidz and defeats the purpose of having fast sequential resilvering on wide pools with draid. This patch allows configuring several children groups in the same row in one draid vdev. In each such group, let's call it failure group, the user can configure disks belonging to different enclosures - failure domains. For example, in case of 10 such enclosures with 10 disks each, the user can put 1st disk from each enclosure into 1st group, 2nd disk from each enclosure into 2nd group, and so on. If one enclosure fails, only one disk from each group would fail, which won't affect draid operation, and each group would have enough redundancy to recover the stored data. Of course, in case of draid2 - two enclosures can fail at a time, in case of draid3 - three enclosures (provided there are no other disk failures in each group). In order to preserve fast sequential resilvering in case of a disk failure, the groups must share all disks between themselves, and this is achieved by shuffling the disks between the groups. But only i-th disks in each group are shuffled between themselves, i.e. the disks from the same enclosures, after that they are shuffled within each group, like it is done today in an ordinary draid. Thus, no more than one disk from any enclosure can appear in any failure group as a result of this shuffling. 
For example, here's how the pool status output looks like in case of two `draid1:2d:4c` failure groups: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 draid1:2d:4c:8w:1s-0 ONLINE 0 0 0 enc0d0 ONLINE 0 0 0 enc1d0 ONLINE 0 0 0 enc2d0 ONLINE 0 0 0 enc3d0 ONLINE 0 0 0 enc0d1 ONLINE 0 0 0 enc1d1 ONLINE 0 0 0 enc2d1 ONLINE 0 0 0 enc3d1 ONLINE 0 0 0 spares draid1-0-0 AVAIL The number of failure groups is specified indirectly via the new width parameter in draid vdev configuration descriptor, which is the total number of disks and which is multiple of children in each group. This multiple is the number of groups (width / children). Doing it this way allows the user conveniently see how many disks draid has in an instant. Spare disks are evenly distributed among failure groups, and they are shared by all groups. However, to support domain failure, we cannot have more than nparity - 1 failed disks in any group, even if they are rebuilt to draid spares (the blocks of those spares can be mapped to the disks from the failed domain, and we cannot tolerate more than nparity failures in any failure group). The retire agent in zed is updated to not start resilvering when the domain failure happens. Otherwise, it might take a lot of computing and I/O bandwidth resources, only to be wasted when the failed domain component is replaced. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com> Closes #11969 Closes #18148	2026-04-08 10:09:47 -07:00
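The relationship between width, children, and failure groups in d1b0a69825 can be sketched as follows. This models only the initial layout implied by the status output above (group g holds disk g of every enclosure) and deliberately ignores the shuffling step. The function names and the (enclosure, disk) tuple representation are hypothetical.

```python
def failure_groups(width, children):
    """Split `width` disks into width // children failure groups.

    Each disk is an (enclosure, disk_index) tuple. Assumes one enclosure
    per child slot, so a group never holds two disks of one enclosure.
    """
    if children <= 0 or width % children != 0:
        raise ValueError("width must be a positive multiple of children")
    ngroups = width // children
    return [[(enc, grp) for enc in range(children)] for grp in range(ngroups)]


def failed_per_group(groups, failed_enclosure):
    """Count the disks each group loses when one whole enclosure fails."""
    return [sum(1 for enc, _ in group if enc == failed_enclosure)
            for group in groups]


# draid1:2d:4c:8w from the example: 8 disks wide, 4 children per group.
groups = failure_groups(8, 4)
losses = failed_per_group(groups, failed_enclosure=2)
```

Here `losses` shows one failed disk per group, which stays within what a single-parity group can tolerate.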
Rob Norris	e635d27ebc	Add ability to set user properties while changing encryption key `zfs change-key` changes the key used to encrypt a ZFS dataset. When used programmatically, it may be useful to track some external state related to the key in a user property. E.g. a generation number, expiration date, or application-specific source of the key. This can be done today by running `zfs set user:prop=value` before or after running `zfs change-key`. However, this introduces a race condition where the property may not be set even though the key has changed, or vice versa (depending on the order the commands are executed). This can be addressed by using a channel program (`zfs program`) which calls both `zfs.sync.change_key()` and `zfs.sync.set_prop()`, changing the property and key atomically. However, it is nontrivial to write such a channel program to handle error cases, and provide the new key securely (e.g. without logging it). This issue proposes to enhance `zfs change-key` to be able to atomically set user properties while changing the encryption key. Currently `zfs change-key` accepts `-o property=value` arguments, but the only valid properties are keylocation, keyformat, and pbkdf2iters. We will enhance this to also allow user properties, e.g. `-o user:prop=value`. User properties will also be allowed when using `zfs change-key -i` to inherit the key from the parent dataset. Original-patch-by: Matthew Ahrens <matt@mahrens.org> External-issue: https://www.illumos.org/issues/17847 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18407	2026-04-07 17:17:40 -07:00
Brian Behlendorf	0752cf0676	draid: allow seq resilver reads from degraded vdevs When sequentially resilvering allow a dRAID child to be read as long as the DTLs indicate it should have a good copy of the data and the leaf isn't being rebuilt. The previous check was slightly too broad and would skip dRAID spare and replacing vdevs if one of their children was being replaced. As long as there exists enough additional redundancy this is fine, but when there isn't this vdev must be read in order to correctly reconstruct the missing data. A new test case has been added which exhausts the available redundancy, faults another device causing it to be degraded, and then performs a sequential resilver for the degraded device. In such a situation enough redundancy exists to perform the replacement and a scrub should detect no checksum errors. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Andriy Tkachuk <andriy.tkachuk@seagate.com> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #18405	2026-04-07 10:48:27 -07:00
Alexander Motin	7b1682a825	Add support for POSIX_FADV_DONTNEED For now make it only evict the specified data from the dbuf cache. Even though dbuf cache is small, this may still reduce eviction of more useful data from there, and slightly accelerate ARC evictions by making the blocks there evictable a bit sooner. On FreeBSD this also adds support for POSIX_FADV_NOREUSE, since the kernel translates it into POSIX_FADV_DONTNEED after every read/write. This is not as efficient as it could be for ZFS, but that is the only way FreeBSD kernel allows to handle POSIX_FADV_NOREUSE now. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18399	2026-04-07 11:56:54 -04:00
Christos Longros	33ed68fc24	zpool create: report which device caused failure When zpool create fails because a vdev is already in use, the error message now identifies the problematic device and the pool it belongs to, e.g.: cannot create 'tank': device '/dev/sdb1' is part of active pool 'rpool' Implementation follows the ZPOOL_CONFIG_LOAD_INFO pattern used by zpool import: - Add spa_create_info to spa_t to capture error info during vdev_label_init(), before vdev_close() resets vdev state - When vdev_inuse() detects a conflict, read the on-disk label to extract the pool name and store it with the device path - Return the info wrapped under ZPOOL_CONFIG_CREATE_INFO through the ioctl zc_nvlist_dst to userspace - In libzfs, zpool_create_info() unwraps the nvlist and formats the device-specific error message Restructure zpool_create() error handling so all switch cases use break instead of return, eliminating duplicated cleanup code and using the single create_failed exit path. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Christos Longros <chris.longros@gmail.com> Closes #18213	2026-04-03 14:18:24 -07:00
li-nk.social	f203fedde8	Add zoned_uid property with additive least privilege authorization This implements zoned_uid - a ZFS property that delegates dataset visibility and administration to user namespaces owned by a specific UID, enabling rootless Podman/Docker with native ZFS storage. Usage: zfs set zoned_uid=1000 pool/dataset Problem solved: - zfs zone requires an existing namespace PID - Podman creates a new namespace on each container start - Solution: delegate to UID, any namespace owned by that UID is authorized Authorization model — three-layer additive (all must pass): L0 (auth): Namespace owner UID matches zoned_uid property L1 (dsl_deleg): Per-operation grants via `zfs allow` (when pool delegation is ON — the default) L2 (cap tier): Linux capability in the namespace determines the operation class permitted While CAP_SYS_ADMIN is a namespaced capability (the namespace owner always holds it within their own user namespace), granting blanket access based solely on its presence is contrary to the Principle of Least Privilege. This change introduces tiered capability requirements so that non-destructive operations (create, snapshot, set property) require only CAP_FOWNER, while destructive operations (destroy, rename, clone) continue to require CAP_SYS_ADMIN — both of which are namespaced capabilities scoped to the user namespace, not the init namespace. When pool delegation is OFF (non-default), all zoned_uid write operations are denied — delegation OFF means the pool admin has opted out of delegating access entirely. 
Security model: - Namespace owner UID must match zoned_uid value - Delegation root cannot be destroyed or escaped via rename - Namespace users cannot modify zoned_uid itself (only global zone admin can manage delegation assignments) - Namespace users cannot modify the 'zoned' property - Namespace users cannot override filesystem_limit or snapshot_limit set by the global admin on the delegation root (but can impose tighter sub-limits on child datasets) - Multi-UID isolation: sibling delegations with different UIDs cannot access each other's subtrees Kernel changes: - zone_dataset_attach_uid()/detach_uid() in SPL - zone_dataset_admin_check() for write authorization with tiered capabilities (CAP_FOWNER for non-destructive, CAP_SYS_ADMIN for destructive) - Callback registration for zoned_uid property lookup - New zfs_secpolicy_zoned_uid_deleg() helper that calls dsl_deleg_access_impl() directly, bypassing zfs_dozonecheck_ds() which requires the `zoned` property that zoned_uid datasets lack - Fix dsl_deleg_access_impl() hierarchy walk to accept zoned_uid datasets (not just zoned=on) - Update all 9 secpolicy call sites to require dsl_deleg grants instead of short-circuiting on ZONE_ADMIN_ALLOWED - Security policy hooks in zfs_secpolicy_*() functions - Fixed inglobalzone() to use current_user_ns() - zfs_prop_set_special() handles attach/detach as property side-effects, eliminating the need for dedicated ioctls - spa_import_os() restores zoned_uid delegations kernel-side on pool import via dmu_objset_find() walk - spa_export_os() detaches zoned_uid delegations on pool destroy/export, preventing stale kernel state on recreate - zoned_uid registered as PROP_INHERIT so child datasets inherit the delegation, enabling sub-dataset creation - zfs_get_zoned_uid() uses dsl_prop_get setpoint to identify the true delegation root, correctly distinguishing inherited values from locally-set ones for destroy/rename policy checks - zone_dataset_check_list() accepts '@' and '#' separators 
in addition to '/' so snapshots and bookmarks are visible from delegated namespaces - zfs_secpolicy_setprop() blocks ZFS_PROP_ZONED_UID from being set within a delegated namespace, preventing self-revocation - zfs_secpolicy_setprop() blocks filesystem_limit and snapshot_limit changes on the delegation root from within a namespace (uses dsl_prop_get setpoint to identify the root), while allowing delegated users to set tighter sub-limits on child datasets - Use kcred (not CRED()) for zone_dataset_detach_uid/attach_uid in destroy and rename cleanup paths, preventing stale tracking entries when namespace users perform these operations - Use cr parameter (not CRED()) in all secpolicy zoned_uid delegation checks for correct credential propagation Userspace changes: - check_parents() defers to kernel when zoned_uid set FreeBSD compatibility: - include/os/freebsd/spl/sys/zone.h — Added FreeBSD stubs: - zone_uid_op_t enum (ZONE_OP_CREATE, SNAPSHOT, CLONE, DESTROY, RENAME, SETPROP) - zone_admin_result_t enum (NOT_APPLICABLE, ALLOWED, DENIED) - zone_dataset_admin_check() — static inline, always returns ZONE_ADMIN_NOT_APPLICABLE - zone_dataset_attach_uid() — static inline, returns ENXIO - zone_dataset_detach_uid() — static inline, returns ENXIO - zone_get_zoned_uid_fn_t callback typedef - zone_register_zoned_uid_callback() — static inline no-op - zone_unregister_zoned_uid_callback() — static inline no-op - On FreeBSD, every zone_dataset_admin_check() call returns ZONE_ADMIN_NOT_APPLICABLE, causing all security policy functions to fall through to existing jail-based permission checks - Setting zoned_uid on FreeBSD returns ENXIO since user namespace delegation requires Linux user namespaces Test changes: - Add grant_deleg() calls to tests 006-022 for operations that now require explicit dsl_deleg grants - Add tests 023-030 validating the capability tier model - Add test 031 validating stale zone tracking cleanup after namespace rename+destroy - Fix capsh lookup in test helpers 
for ksh -p restricted PATH (command -v + explicit /usr/sbin fallback) - Add mountpoint=none to tests 023-026 to avoid mount-lock issues in user namespaces - Fix test 026 expectations to match kernel behavior (delegation OFF denies all writes, allows read-only) - run_in_userns helper resolves absolute zfs path to handle environments where PATH does not include zfs (source builds) - Test 004 updated: zoned_uid now inherits (PROP_INHERIT), test verifies inheritance and override behavior - Test 013 uses within_percent with parseable byte output (-Hp) for robust quota value comparison across environments - Test 014: verifies grandchild dataset creation from user namespace, confirming inherited zoned_uid delegation works - Test 015: pool destroy/recreate with zoned_uid delegation - Test 016: individual snapshot destroy from namespace - Test 017: namespace user cannot modify zoned_uid property - Test 018: clone operations from within delegated namespace - Test 019: multi-UID isolation between sibling delegations - Test 020: operations without zone_dataset_admin_check() integration are denied via zfs_dozonecheck_impl() - Test 021: 'zoned' property cannot be modified from namespace - Test 022: delegation root limit overrides blocked from namespace - Quoted shell variables across all test scripts for robustness - Shellcheck SC2155 fixes across all test scripts Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Colin K. Williams / LINK ORG LLC / li-nk.social <colin@li-nk.org> Closes #18167	2026-04-03 10:38:26 -07:00
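The three-layer additive check described in f203fedde8 can be condensed into a small decision function. This is a sketch of the stated policy, not the kernel implementation: `write_allowed`, its parameters, and the operation names are hypothetical. The commit message does not say whether CAP_SYS_ADMIN alone satisfies the lower tier, so the sketch checks exactly the capability named for each operation class.

```python
NON_DESTRUCTIVE = {"create", "snapshot", "setprop"}
DESTRUCTIVE = {"destroy", "rename", "clone"}


def write_allowed(op, ns_owner_uid, zoned_uid, grants, caps, delegation_on=True):
    """Return True only if every layer permits the write operation `op`.

    grants: operations granted through `zfs allow` (layer L1).
    caps:   capability names held inside the user namespace (layer L2).
    """
    # L0: the namespace owner must match the dataset's zoned_uid.
    if ns_owner_uid != zoned_uid:
        return False
    # Pool delegation OFF denies all zoned_uid write operations.
    if not delegation_on:
        return False
    # L1: the operation needs an explicit per-operation grant.
    if op not in grants:
        return False
    # L2: the capability tier depends on the operation class.
    if op in NON_DESTRUCTIVE:
        return "CAP_FOWNER" in caps
    if op in DESTRUCTIVE:
        return "CAP_SYS_ADMIN" in caps
    return False  # operations outside both classes are denied in this sketch


ok_snapshot = write_allowed("snapshot", 1000, 1000, {"snapshot"}, {"CAP_FOWNER"})
no_destroy = write_allowed("destroy", 1000, 1000, {"destroy"}, {"CAP_FOWNER"})
```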
Alexander Motin	16858492e6	FreeBSD: Implement relatime property While FreeBSD does not support relatime natively, it seems trivial to implement it just as dataset property for consistency. To not change the status quo, change its default to off on FreeBSD. Now, if explicitly enabled, it should actually work. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18385	2026-04-01 13:48:50 -07:00
Tony Hutter	b44a3ecf4a	zpool: Change zpool offline spares policy The zpool offline man page says that you cannot use 'zpool offline' on spares. However, testing found that you could in fact force fault (zpool offline -f) spares. Change the policy to: 1. You can never force-fault or offline dRAID spares. 2. You can only force-fault or offline traditional spares if they're active. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #18282	2026-03-25 11:08:55 -07:00
Sean Eric Fagan	06b0abfe62	Fix the send --exclude option to work with encryption When using --exclude, filtering needs to take place in two places: in zfs_main.c via the callback previously added to support the options, and in libzfs_sendrecv.c because it generates the nvlist during a first pass, and that results in it complaining if the excluded dataset is not available for sending. (eg, excluding an encrypted dataset so you don't have to use --raw wouldn't work, because the first pass would look at the dataset and decide you couldn't use it.) Add send --exclude tests, including one that tests excluding an encrypted hierarchy. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sean Eric Fagan <sef@kithrup.ie> Closes #18278	2026-03-12 15:10:28 -07:00
Alan Somers	753f1e1e21	zstream: add a drop_record subcommand It can be used to drop extraneous records in a send stream caused by a corrupt dataset, as in issue #18239. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alan Somers <asomers@gmail.com> Sponsored by: ConnectWise Closes #18275	2026-03-12 15:08:58 -07:00
Ivan Shapovalov	8531621aba	zfs_main: create, clone, rename: accept `-pp` for non-mountable parents Teach `zfs {create,clone,rename}` to accept a doubled `-p` flag (`-pp`) to create non-existing ancestor datasets with `canmount=off`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name> Closes #17000	2026-03-09 14:50:18 -07:00
Christos Longros	304de7f19b	libzfs: handle EDOM error in zpool_create When creating a pool with devices that have incompatible block sizes, the kernel returns EDOM. However, zpool_create() did not handle this errno, falling through to zpool_standard_error() which produced a confusing message about invalid property values. Add a case EDOM handler in zpool_create() to return EZFS_BADDEV with a descriptive auxiliary message, consistent with the existing EDOM handler in zpool_vdev_add(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Christos Longros <chris.longros@gmail.com> Closes #18268	2026-03-08 12:59:10 -07:00
Alexander Motin	1e1d64d665	Fix log vdev removal issues When we clear the log, we should clear all the fields, not only zh_log. Otherwise remaining ZIL_REPLAY_NEEDED will prevent the vdev removal. Handle it also from the other side, when zh_log is already cleared, while zh_flags is not. spa_vdev_remove_log() asserts that allocated space on removed log device is zero. While it should be so in perfect world, it might be not if space leaked at any point. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18277	2026-03-04 09:12:14 -05:00
Ryan Moeller	ac0fd40c8c	Add zpool properties for allocation class space The existing zpool properties accounting pool space (size, allocated, fragmentation, expandsize, free, capacity) are based on the normal metaslab class or are cumulative properties of several classes combined. Add properties reporting the space accounting metrics for each metaslab class individually. Also introduce pool-wide AVAIL, USABLE, and USED properties reporting values corresponding to FREE, SIZE, and ALLOC deflated for raidz. Update ZTS to recognize the new properties and validate reported values. While in zpool_get_parsable.cfg, add "fragmentation" to the list of parsable properties. Sponsored-by: Klara, Inc. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Ryan Moeller <ryan.moeller@klarasystems.com> Closes #18238	2026-03-02 15:50:23 -08:00
MigeljanImeri	4975430cf5	Add vdev property to disable vdev scheduler Added vdev property to disable the vdev scheduler. The intention behind this property is to improve IOPS performance when using o_direct. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: MigeljanImeri <ImeriMigel@gmail.com> Closes #17358	2026-02-23 09:34:33 -08:00
Alexander Motin	2646bd5585	Allow rewrite skip cloned and snapshotted blocks Rewrite of cloned and snapshotted blocks can allocate additional space, that may be undesired. In some cases it may have sense to still rewrite snapshotted blocks, expecting the snapshots to rotate with time, freeing space. In other cases rewrite of cloned blocks may be acceptable, despite persistent space usage increase. For this reason add them as separate flags to `zfs rewrite`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18179	2026-02-09 10:17:56 -08:00
Austin Wise	4f180e095a	Fix activating large_microzap on receive This ensures that the in-memory state of the feature is recorded and that `dsl_dataset_activate_feature` is not called when the feature is already active. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Austin Wise <AustinWise@gmail.com> Closes #18143 Closes #18144	2026-02-05 15:48:03 -08:00
Ameer Hamza	13552d754f	ZTS: Add L2ARC DWPD and parallel writes tests Add four new functional tests to validate L2ARC DWPD rate limiting and parallel write features: - l2arc_dwpd_ratelimit_pos: Verifies DWPD rate limiting with different values (0, 100, 1000, 10000) and ordering - l2arc_dwpd_reimport_pos: Verifies DWPD rate limiting persists after pool export/import - l2arc_multidev_scaling_pos: Verifies parallel write scaling ratio (dual devices achieve ~2× single device throughput) - l2arc_multidev_throughput_pos: Verifies absolute parallel write throughput scales with device count (~32MB/s per device) Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #18093	2026-02-04 10:07:16 -08:00
Austin Wise	794f1587db	When receiving a stream with the large block flag, activate feature ZFS send streams include a feature flag DMU_BACKUP_FEATURE_LARGE_BLOCKS to indicate the presence of large blocks in the dataset. On the sending side, this flag is included if the `-L` flag is passed to `zfs send` and the feature is active in the dataset. On the receive side, the stream is refused if the feature is active in the destination dataset but the stream does not include the feature flag. The problem is the feature is only activated when a large block is born. If a large block has been born in the destination, but never the source, the send can't work. This can arise when sending streams back and forth between two datasets. This commit fixes the problem by always activating the large blocks feature when receiving a stream with the large block feature flag. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Austin Wise <AustinWise@gmail.com> Closes #18105	2026-01-07 16:47:12 -08:00
shuppy	6eef5cdc94	ZTS: add regression test for #17180 In #17180, we fixed an interesting bug that i believe i hit in one of my pools, but as far as i can tell, there was no test for it. this patch adds a regression test for #17180, minimised from my attempts to reproduce the bug in a way that resembled the history of my pool. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam Moss <c@yotes.com> Signed-off-by: delan azabani <dazabani@igalia.com> Closes #18109	2026-01-06 09:33:03 -08:00
Ivan Shapovalov	dbb3f247ed	cmd/zfs: clone: accept `-u` to not mount newly created datasets Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18080	2026-01-05 12:21:56 -05:00
Chunwei Chen	0c194352b5	Fix ddtprune causing space leak In zio_ddt_free, if a pruned dde is still in ddt, it would do nothing and cause space leak. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #17982 Closes #17983	2025-12-10 10:02:14 -08:00
Ameer Hamza	48842c0a41	ZTS: Add test for snapshot automount race Add snapshot_019_pos to verify parallel snapshot automount operations don't cause AVL tree panic. Regression test for commit `4ce030e025`. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #18035	2025-12-10 09:16:45 -08:00
Alexander Motin	b4f073b5a6	Add BRT support to zpool prefetch command Implement BRT (Block Reference Table) prefetch functionality similar to existing DDT prefetch. This allows preloading BRT metadata into ARC to improve performance for block cloning operations and frees of earlier cloned blocks. Make -t parameter optional. When omitted, prefetch all supported metadata types (both DDT and BRT now). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17890	2025-11-10 16:16:22 -08:00
Paul Dagnelie	fa4d4b1f80	Fix display of default xattr to show 'sa' When the default value of the xattr property was changed from 'dir' to 'sa', the code that displays the property's value was not affected. The problem with this state of affairs is that 1) user tooling that specifically looked for 'sa' before will be confused now that the code displays 'on' instead. And 2) users may be confused when manually running the commands about which specific type of xattr is in use unless they are up to date on the latest zfs changes. The fix here is to show the actual type always, rather than 'on' if we happen to be using the default. This turns out to be easy to do, by simply reordering the list of xattr values in the properties code. When the property is displayed, we iterate down the table until we find a row with a matching value, and use that row's name as the display. Reordering the row fixes the display without affecting any other code. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17801	2025-10-01 12:14:56 -07:00
Rob Norris	f0a95e8971	zpool iostat: refresh pool list every interval When running zpool iostat in interval mode, it would not notice any new pools created or imported, and would forget any destroyed or exported, so would not notice if they came back. This leads to outputting "no pools available" every interval until killed. It looks like this was at least intended to work; the comment above zpool_do_iostat() indicates that it is expected to "deal with pool creation/destruction" and that pool_list_update() would detect new pools. That call however was removed in `3e43edd2c5`, though its unclear if that broke this behaviour and it wasn't noticed, or if it never worked, or if something later broke it. That said, the lack of pool_list_update() is only part of the reason it doesn't work properly. The fundamental problem is that the various things involved in refreshing or updating the list of pools would aggressively ignore, remove, skip or fail on pools that stop existing, or that already exist. Mostly this meant that once a pool is removed from the list, it will never be seen again. Restoring pool_list_update() to the zpool_do_iostat() loop only partially fixes this - it would find "new" pools again, but only in the "all pools" (no args) mode, and because its iterator callback add_pool() would abort the iterator if it already has a pool listed, it would only add pools if there weren't any already. So, this commit reworks the structure somewhat. pool_list_update() becomes pool_list_refresh(), and will ensure the state of all pools in the list are updated. In the "all pools" mode, it will also add new pools and remove pools that disappear, but when a fixed list of pools is used, the list doesn't change, only the state of the pools within it. The rest of the commit is adjusting things for this much simpler structure. Regardless of the mode in use, pool_list_refresh() will always do the right thing, so the driver code can just get on with the display. 
Now that pools can appear and disappear, I've made it so the header (if enabled) is re-printed when the list changes, so that it's easier to see what's happening if the column widths change. Since this is all rather complicated, I've included tests for the "all pools" and "set of pools" modes. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17786	2025-09-29 16:35:27 -07:00
patrickxia	5c38029f4b	zdb: add ZFS_KEYFORMAT_RAW support for -K option This change adds support for ZFS_KEYFORMAT_RAW to zdb_derive_key in zdb.c. The implementation reads the raw key from the file specified by the -K option which is consistent with how raw keys are handled in the other parts of ZFS, along with a check to ensure that the keyfile doesn't have too many bytes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Patrick Xia <patrickx@google.com> Closes #17783	2025-09-25 12:05:42 -07:00
Brian Behlendorf	0e1a53a8c0	Fix 'zpool add' safety check corner cases Three cases were discovered where 'zpool add' would fail to warn when adding vdevs to a pool with a mismatched replication level. These are: 1. When a pool contains mixed file and disk vdevs. 2. When a pool contains an active dRAID distributed spare. 3. When a pool contains an active hot spare. The lack of warnings is caused by get_replication() assessing the current pool configuration as inconsistent and disabling the mismatched replication check for the new pool configuration after 'zpool add'. This change updates get_replication() to be slightly more tolerant in the non-fatal case. The zpool_add_010_pos.ksh test case was split into separate tests: zpool_add_warn_create.ksh, pool_add_warn_degraded.ksh, and zpool_add_warn_removal. These tests were extended to include coverage for dRAID pools and the three scenarios described above. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17780	2025-09-25 09:32:59 -07:00
Brian Behlendorf	4b764fb01a	ZTS: Fix zfs_send_delegation_user test Correct the path in the common.run file. The zfs_send_delegation_user test is installed under cli_user not cli_root. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17740	2025-09-15 09:30:57 -07:00
Paul Dagnelie	9b772f328b	Fix time database update calculations The time database update math assumed that the timestamps were in nanoseconds, but at some point in the development or review process they changed to seconds. This PR fixes the math to use seconds instead. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #17735	2025-09-12 16:33:36 -07:00
JT Pennington	955fbc5ade	Add send:encrypted test Create tests for the new send:encrypted permission Sponsored-by: Klara, Inc. Sponsored-by: Karakun AG Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: JT Pennington <jt.pennington@klarasystems.com> Closes #17543	2025-09-12 09:53:54 -07:00
Paul Dagnelie	d64711c202	Detect a slow raidz child during reads A single slow responding disk can affect the overall read performance of a raidz group. When a raidz child disk is determined to be a persistent slow outlier, then have it sit out during reads for a period of time. The raidz group can use parity to reconstruct the data that was skipped. Each time a slow disk is placed into a sit out period, its `vdev_stat.vs_slow_ios count` is incremented and a zevent class `ereport.fs.zfs.delay` is posted. The length of the sit out period can be changed using the `raid_read_sit_out_secs` module parameter. Setting it to zero disables slow outlier detection. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Contributions-by: Don Brady <don.brady@klarasystems.com> Contributions-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17227	2025-09-10 15:25:03 -07:00
Paul Dagnelie	8f15d2e4d5	Add allocation profile export and zhack subcommand for import When attempting to debug performance problems on large systems, one of the major factors that affect performance is free space fragmentation. This heavily affects the allocation process, which is an area of active development in ZFS. Unfortunately, fragmenting a large pool for testing purposes is time consuming; it usually involves filling the pool and then repeatedly overwriting data until the free space becomes fragmented, which can take many hours. And even if the time is available, artificial workloads rarely generate the same fragmentation patterns as the natural workloads they're attempting to mimic. This patch has two parts. First, in zdb, we add the ability to export the full allocation map of the pool. It iterates over each vdev, printing every allocated segment in the ms_allocatable range tree. This can be done while the pool is online, though in that case the allocation map may actually be from several different TXGs as new ones are loaded on demand. The second is a new subcommand for zhack, zhack metaslab leak (and its supporting kernel changes). This is a zhack subcommand that imports a pool and then modifies the range trees of the metaslabs, allowing the sync process to write them out normally. It does not currently store those allocations anywhere to make them reversible, and there is no corresponding free subcommand (which would be extremely dangerous); this is an irreversible process, only intended for performance testing. The only way to reclaim the space afterwards is to destroy the pool or roll back to a checkpoint. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #17576	2025-09-10 11:13:24 -07:00
Shengqi Chen	9ae20cf03d	cmd: rename arcstat to zarcstat Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Colm Buckley <colm@tuatha.org> Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Closes #16357 Closes #17712	2025-09-10 10:45:21 -07:00
Shengqi Chen	a5571a0dd1	cmd: rename arc_summary to zarcsummary Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Colm Buckley <colm@tuatha.org> Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Closes #16357 Closes #17712	2025-09-10 10:45:13 -07:00
Rob Norris	6bb8fe5528	ZTS: stress test concurrent zvol create/destroy Sponsored-by: Klara, Inc. Sponsored-by: Railway Corporation Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17625	2025-08-19 10:05:34 -07:00
Rob Norris	1f8c39ddb2	ZTS: test response of various sync methods under different failmodes These are all the same shape: set up the pool to suspend on first write, then perform some write+sync operation. The pool should suspend, and the sync operation should respond according to the failmode= property. We test fsync(), msync() and two forms of write() (open with O_SYNC, and async with sync=always), which all take slightly different paths to zil_commit() and back. A helper function is included to do the write+sync sequence with mmap() and msync(), since I didn't find a convenient tool to do that. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17398	2025-08-08 16:42:35 -07:00
Alexander Motin	60f714e6e2	Implement physical rewrites Based on previous commit this implements `zfs rewrite -P` flag, making ZFS to keep blocks logical birth times while rewriting files. It should exclude the rewritten blocks from incremental sends, snapshot diffs, etc. Snapshots space usage same time will reflect the additional space usage from newly allocated blocks. Since this begins to use new "rewrite" flag in the block pointers, this commit introduces a new read-compatible per-dataset feature physical_rewrite. It must be enabled for the command to not fail, it is activated on first use and deactivated on deletion of the last affected dataset. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17565	2025-08-06 10:36:56 -07:00

251 Commits