KekenoBSD/src

Author	SHA1	Message	Date
Gality	37e3a260fd	dmu_direct: avoid UAF in dmu_write_direct_done() dmu_write_direct_done() passes dmu_sync_arg_t to dmu_sync_done(), which updates the override state and frees the completion context. The Direct I/O error path then still dereferences dsa->dsa_tx while rolling the dirty record back with dbuf_undirty(), resulting in a use-after-free. Save dsa->dsa_tx in a local variable before calling dmu_sync_done() and use that saved tx for the error rollback. This preserves the existing ownership model for dsa and does not change the Direct I/O write semantics. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: gality369 <gality369@example.com> Signed-off-by: ZhengYuan Huang <gality369@gmail.com> Closes #18440	2026-04-20 10:26:28 -07:00
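The ordering fix described in the dmu_direct entry above can be sketched in isolation. This is a minimal illustration with hypothetical stand-in types and functions (`sync_arg_t`, `sync_done()`, `undirty()`, `write_done()`), not the OpenZFS code: the completion callback frees its argument, so any field needed afterwards is copied to a local first.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not the OpenZFS types. */
typedef struct tx { int id; } tx_t;
typedef struct sync_arg { tx_t *tx; } sync_arg_t;

/* Records which tx the rollback used, so the behaviour is observable. */
static int rolled_back_tx_id = -1;

/* Models a completion callback that frees its context. */
static void
sync_done(sync_arg_t *arg)
{
	free(arg);
}

/* Models rolling a dirty record back in the given tx. */
static void
undirty(tx_t *tx)
{
	rolled_back_tx_id = tx->id;
}

int
write_done(sync_arg_t *arg, int error)
{
	tx_t *tx = arg->tx;	/* save before arg is freed */

	sync_done(arg);		/* arg must not be touched after this */
	if (error != 0)
		undirty(tx);	/* uses the saved pointer, not arg->tx */
	return (error);
}
```

Ownership of the context stays with the callback in this sketch; only the read of `arg->tx` moves ahead of the free.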
Joel Low	ddf19dcbe1	initramfs: fix incorrect variable rename Fixes regression introduced by `61ab032ae0`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Joel Low <joel@joelsplace.sg> Closes #18442	2026-04-20 10:21:57 -07:00
Joel Low	c214a3ae9f	initramfs: fix use of renamed variables Fixes regression introduced by `33dd57e1b4`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Joel Low <joel@joelsplace.sg> Closes #18442	2026-04-20 10:21:34 -07:00
Christos Longros	1cebe8a38e	libzfs: report invalid permission name in zfs allow zfs allow with a typo (e.g. "snapshop") produced the misleading error "operation not applicable to datasets of this type". Report "invalid permission" instead. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Christos Longros <chris.longros@gmail.com> Closes #18401 Closes #11903	2026-04-20 10:15:29 -07:00
shelvacu	4339b4eb2f	zpool-labelclear.8: Warn that it's destructive If I could go back in time, I would beg Sun engineers to pick a different name. For those of us who have not read the ZFS On-Disk Specification pdf, it is not at all obvious that clearing a "label" is such a bad thing. But changing the name would be a breaking change, so at least for now we can update the documentation. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Shelvacu <git@shelvacu.com> Closes #18347	2026-04-20 10:05:31 -07:00
Christos Longros	7fdd2bf7d4	libzfs: report permission error from umount helper Non-root callers got "unmount failed" when ZFS_MOUNT_HELPER was set because /bin/umount's exit status doesn't preserve errno. Map a non-zero helper exit to EPERM when geteuid() != 0 so the user sees "permission denied". Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Christos Longros <chris.longros@gmail.com> Closes #11740 Closes #18443	2026-04-20 10:02:26 -07:00
Brian Behlendorf	9be5431c5d	CI: tolerate missing artifacts When a VM fails to launch or is unreachable the qemu-7-prepare.sh script will fail to collect the artifacts due to the missing vm* directories. We want to collect as much diagnostic information as possible, so when a directory is missing, create it to allow the subsequent steps to proceed normally. Additionally, we don't want to fail if the /tmp/summary.txt file is missing. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #18438	2026-04-17 17:21:13 -07:00
Tony Hutter	b3623d1123	CI: Do not set scheduler in qemu-1-setup.sh We've seen some qemu-1-setup failures while trying to change the runner's block device scheduler value to 'none': We have a single 150GB block device Setting up swapspace version 1, size = 16 GiB (17179865088 bytes) no label, UUID=7a790bfe-79e5-4e38-b208-9c63fe523294 tee: '/sys/block/s*/queue/scheduler': No such file or directory Luckily, we don't need to set the scheduler anymore on modern kernels: https://github.com/openzfs/zfs/issues/9778#issuecomment-569347505 This commit just removes the code that sets the scheduler. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #18437	2026-04-16 17:56:20 -07:00
Brian Behlendorf	d88d9c91dc	Linux 7.0 compat: META Update the META file to reflect compatibility with the 7.0 kernel. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #18435	2026-04-16 16:45:20 -07:00
Brian Behlendorf	b32911b78f	ZTS: resilver_restart_001 improvements The resilver_restart_001 test case has not been entirely reliable when run under the CI. Address several small issues which may be responsible. - Configure the pool as raidz2 instead of raidz1 since the test offlines two devices. This ensures the second device is marked as OFFLINE instead of DEGRADED. - Start the zpool replace after setting SCAN_SUSPEND_PROGRESS to close any potential race where the replace finishes too quickly. - Wait for the offlined/onlined vdevs to fully transition to the expected state during the test. - Add the true flag to sync_pool to force a TXG sync to happen even if it might not otherwise be required. - During cleanup dump the zpool events history to aid debugging if the updated test case is still unreliable in the CI. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #18434	2026-04-16 16:44:21 -07:00
Gary Guo	1644e2ffd2	Fix read corruption after block clone after truncate When copy_file_range overwrites a recent truncation, subsequent reads can incorrectly determine that they are reading a hole instead of reading the cloned blocks. This can happen when the following conditions are met: - Truncate adds blkid to dn_free_ranges - A new TXG is created - copy_file_range calls dmu_brt_clone, which overrides the block pointer and sets DB_NOFILL - Subsequent read, given DB_NOFILL, hits dbuf_read_impl and dbuf_read_hole - dbuf_read_hole calls dnode_block_freed, which returns TRUE because the truncated blkids are still in dn_free_ranges This will not happen if the clone and truncate are in the same TXG, because the block clone would update the current TXG's dn_free_ranges, which is why this bug only triggers under high IO load (such as compilation). Fix this by skipping the dnode_block_freed call if the block is overridden. The fix shouldn't cause an issue when the cloned block is subsequently freed in later TXGs, as dbuf_undirty would remove the override. This requires a dedicated test program as it is much harder to trigger with scripts (this needs to generate a lot of I/O in a short period of time for the bug to trigger reliably). Assisted-by: Gemini:gemini-3.1-pro Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Gary Guo <gary@kernel.org> Closes #18412 Closes #18421	2026-04-15 14:51:53 -07:00
Christos Longros	4b4ae48f9a	deb.am: propagate build errors in native-deb targets Replace semicolons with && so build failures are not masked by the subsequent lockfile cleanup. Use trap to ensure the lockfile is removed on both success and failure. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Christos Longros <chris.longros@gmail.com> Closes #18206 Closes #18424	2026-04-15 14:50:20 -07:00
Ameer Hamza	01e2a96839	Use AVL tree lookup in zfsctl_snapdir_vget for mounted snapshots zfsctl_snapdir_vget resolves NFS file handles for snapshot directory entries by calling zfsctl_snapshot_path_objset, which iterates all snapshots via dmu_snapshot_list_next to find the matching objsetid. With many snapshots this linear scan is expensive. For snapshots that have been previously mounted, the path is already cached in the in-memory AVL tree. Check the tree first with zfsctl_snapshot_find_by_objsetid and fall back to the on-disk scan only when the entry is not found. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #18429	2026-04-15 14:49:22 -07:00
Andriy Tkachuk	2abf469be5	draid: fix cksum errors after rebuild with degraded disks Currently, when more than nparity disks get faulted during the rebuild, only first nparity disks would go to faulted state, and all the remaining disks would go to degraded state. When a hot spare is attached to that degraded disk for rebuild creating the spare mirror, only that hot spare is getting rebuilt, but not the degraded device. So when later during scrub some other attached draid spare happens to map to that spare, it will end up with cksum error. Moreover, if the user clears the degraded disk from errors, the data won't be resilvered to it, hot spare will be detached almost immediately and the data that was resilvered only to it will be lost. Solution: write to all mirrored devices during rebuild, similar to traditional/healing resilvering, but only if we can verify the integrity of the data, or when it's the draid spare we are writing to, in which case we are writing to a reserved spare space, and there is no danger to overwrite any good data. The argument that writing only to rebuilding draid spare vdev is faster than writing to normal device doesn't hold since, at a specific offset being rebuilt, draid spare will be mapped to a normal device anyway. redundancy_draid_degraded2 automation test is added also to cover the scenario. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andriy Tkachuk <atkachuk@wasabi.com> Closes #18414	2026-04-15 14:48:00 -07:00
Tony Hutter	6692b6e28a	CI: Disable ZIP file artifacts, update versions The GH artifacts action now lets you disable auto-zipping your artifacts. Previously, GH would always automatically put your artifacts in a ZIP file. This is annoying when your artifacts are already in a tarball. Also update the following action versions checkout: v4 -> v6 upload-artifact: v4 -> v7 download-artifact: v4 -> v8 Lastly, fix an issue where zfs-qemu-packages now needs to power cycle the VM. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #18411	2026-04-14 13:20:46 -07:00
Ameer Hamza	9bf1a720f6	Fix snapshot automount deadlock during concurrent zfs recv zfsctl_snapshot_mount() holds z_teardown_lock(R) across call_usermodehelper(), which spawns a mount process that needs namespace_sem(W) via move_mount. Reading /proc/self/mountinfo holds namespace_sem(R) and needs z_teardown_lock(R) via zpl_show_devname. When zfs_suspend_fs (from zfs recv or zfs rollback) queues z_teardown_lock(W), the rrwlock blocks new readers, completing the deadlock cycle. Fix by releasing z_teardown_lock(R) after gathering the dataset name and mount path, before any blocking operation. Everything after the release operates on local string copies or uses its own synchronization. The parent zfsvfs pointer remains valid because the caller holds a path reference to the automount trigger dentry. Releasing the lock allows zfs_suspend_fs to proceed concurrently with the mount helper, so dmu_objset_hold in zpl_get_tree can transiently fail with ENOENT during the clone swap. The mount helper fails, EISDIR is returned, and the VFS falls back to the ctldir stub (empty directory) until the next access retries. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #18415	2026-04-08 16:42:58 -07:00
Ameer Hamza	a616ba811c	Fix options memory leak in zfsctl_snapshot_mount Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #18415	2026-04-08 16:42:23 -07:00
Mark Johnston	943a055284	zvol: Fix uses of uninitialized variables in zvol_rename_minors_impl() Reported-by: GitHub Copilot Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #18191	2026-04-08 14:15:44 -07:00
Mark Johnston	d7b8eef9d2	zvol: Hold the zvol state writer lock when renaming Otherwise nothing serializes updates to the global zvol hash table. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #18191	2026-04-08 14:15:44 -07:00
Mark Johnston	d736868672	Make zvol_set_common() block until the operation has completed This is motivated by a FreeBSD AIO test case which creates a zvol with -o volmode=dev, then immediately tries to open the zvol device file. The open occasionally fails with ENOENT. When a zvol is created without the volmode setting, zvol_create_minors() blocks until the task is finished, at which point OS-dependent code will have created a device file. However, zvol_set_common() may cause the device file to be destroyed and re-created, at least on FreeBSD, if the voltype switches from GEOM to DEV. In this case, we do not block waiting for the operation to finish, causing the test failure. Fix the problem by making zvol_set_common() block until the operation has finished. In FreeBSD zvol code, use g_waitidle() to block until asynchronous GEOM operations are done. This fixes a secondary race where zvol_os_remove_minor() does not block until the zvol device file is removed, and the subsequent zvol_os_create_minor() fails because the (to-be-destroyed) device file already exists. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #18191	2026-04-08 14:15:27 -07:00
Mark Johnston	6de1457a2d	FreeBSD: Fix zvol teardown races zvol_geom_open() may be called to taste an orphaned provider. The test for pp->private == NULL there is racy as no locks are synchronizing the test. Use the GEOM topology lock to interlock the pp->private == NULL test with the zvol state checks. This establishes a new lock order but I believe this is necessary. Set pp->private = NULL under the GEOM topology lock instead of the per-zvol state lock. Modify zvol_os_rename_minor() to drop the zvol state lock to avoid a lock order reversal with the topology lock. Also reverse the order of tests in zvol_geom_open() and zvol_cdev_open() as at least zvol_geom_open() may race with zvol_os_remove_minor(), which sets zv->zv_zso = NULL. Testing for ZVOL_REMOVING first avoids a race which can lead to a NULL pointer dereference. Add a new OS-specific flag to handle the case where zvol_geom_open() drops all locks in order to avoid a lock order reversal when acquiring the suspend lock as the open count transitions 0->1. I don't see anything preventing zvol_os_remove_minor() from racing there. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #18191	2026-04-08 14:10:45 -07:00
Andriy Tkachuk	d1b0a69825	draid: add failure domains support Currently, the only way to tolerate the failure of the whole enclosure is to configure several draid vdevs in the pool, each vdev having disks from different enclosures. But this essentially degrades draid to raidz and defeats the purpose of having fast sequential resilvering on wide pools with draid. This patch allows configuring several children groups in the same row in one draid vdev. In each such group, let's call it failure group, the user can configure disks belonging to different enclosures - failure domains. For example, in case of 10 such enclosures with 10 disks each, the user can put 1st disk from each enclosure into 1st group, 2nd disk from each enclosure into 2nd group, and so on. If one enclosure fails, only one disk from each group would fail, which won't affect draid operation, and each group would have enough redundancy to recover the stored data. Of course, in case of draid2 - two enclosures can fail at a time, in case of draid3 - three enclosures (provided there are no other disk failures in each group). In order to preserve fast sequential resilvering in case of a disk failure, the groups must share all disks between themselves, and this is achieved by shuffling the disks between the groups. But only i-th disks in each group are shuffled between themselves, i.e. the disks from the same enclosures, after that they are shuffled within each group, like it is done today in an ordinary draid. Thus, no more than one disk from any enclosure can appear in any failure group as a result of this shuffling. 
For example, here's what the pool status output looks like in case of two `draid1:2d:4c` failure groups:

    NAME                    STATE     READ WRITE CKSUM
    pool1                   ONLINE       0     0     0
      draid1:2d:4c:8w:1s-0  ONLINE       0     0     0
        enc0d0              ONLINE       0     0     0
        enc1d0              ONLINE       0     0     0
        enc2d0              ONLINE       0     0     0
        enc3d0              ONLINE       0     0     0
        enc0d1              ONLINE       0     0     0
        enc1d1              ONLINE       0     0     0
        enc2d1              ONLINE       0     0     0
        enc3d1              ONLINE       0     0     0
    spares
      draid1-0-0            AVAIL

The number of failure groups is specified indirectly via the new width parameter in draid vdev configuration descriptor, which is the total number of disks and which is a multiple of the children in each group. This multiple is the number of groups (width / children). Doing it this way allows the user to conveniently see how many disks draid has in an instant. Spare disks are evenly distributed among failure groups, and they are shared by all groups. However, to support domain failure, we cannot have more than nparity - 1 failed disks in any group, even if they are rebuilt to draid spares (the blocks of those spares can be mapped to the disks from the failed domain, and we cannot tolerate more than nparity failures in any failure group). The retire agent in zed is updated to not start resilvering when the domain failure happens. Otherwise, it might take a lot of computing and I/O bandwidth resources, only to be wasted when the failed domain component is replaced. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com> Closes #11969 Closes #18148	2026-04-08 10:09:47 -07:00
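The width/children relationship stated in the draid entry above is simple arithmetic. A minimal sketch, assuming a hypothetical helper `draid_failure_groups()` that is not part of OpenZFS, derives the group count from the two descriptor values and rejects a width that is not a multiple of the per-group children count:

```c
/*
 * Hypothetical helper for illustration; not an OpenZFS function.
 * Returns the number of failure groups (width / children), or -1 if
 * either value is non-positive or width is not a multiple of children.
 */
int
draid_failure_groups(int width, int children)
{
	if (children <= 0 || width <= 0 || width % children != 0)
		return (-1);
	return (width / children);
}
```

For the `draid1:2d:4c:8w:1s` descriptor quoted in the commit message, the inputs are a width of 8 and 4 children per group, which gives the two failure groups the message describes.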
Brian Behlendorf	eb5c93fa8e	CI: set /etc/hostid in zloop runner ztest can enable and disable the multihost property when testing. This can result in a failure when attempting to import an existing pool when multihost=on but no /etc/hostid file exists. Update the workflow to use zgenhostid to create /etc/hostid when not present. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #18413	2026-04-08 09:15:03 -07:00
Rob Norris	e635d27ebc	Add ability to set user properties while changing encryption key `zfs change-key` changes the key used to encrypt a ZFS dataset. When used programmatically, it may be useful to track some external state related to the key in a user property. E.g. a generation number, expiration date, or application-specific source of the key. This can be done today by running `zfs set user:prop=value` before or after running `zfs change-key`. However, this introduces a race condition where the property may not be set even though the key has changed, or vice versa (depending on the order the commands are executed). This can be addressed by using a channel program (`zfs program`) which calls both `zfs.sync.change_key()` and `zfs.sync.set_prop()`, changing the property and key atomically. However, it is nontrivial to write such a channel program to handle error cases, and provide the new key securely (e.g. without logging it). This issue proposes to enhance `zfs change-key` to be able to atomically set user properties while changing the encryption key. Currently `zfs change-key` accepts `-o property=value` arguments, but the only valid properties are keylocation, keyformat, and pbkdf2iters. We will enhance this to also allow user properties, e.g. `-o user:prop=value`. User properties will also be allowed when using `zfs change-key -i` to inherit the key from the parent dataset. Original-patch-by: Matthew Ahrens <matt@mahrens.org> External-issue: https://www.illumos.org/issues/17847 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18407	2026-04-07 17:17:40 -07:00
mischivus	4fec688987	Remove forced zfs_umount() from zfs_resume_fs() bail path When zfsvfs_init() fails during zfs_resume_fs(), the bail path called zfs_umount() directly. All three callers (zfs_ioc_rollback, zfs_ioc_recv_impl, and zfs_ioc_userspace_upgrade) hold an s_active reference via getzfsvfs() at entry. This creates two bugs: 1. Deadlock: zfs_umount() -> zfsvfs_teardown() -> txg_wait_synced() blocks in uninterruptible D state. The superblock cannot tear down because s_active is pinned by the calling thread itself. Survives SIGKILL. Blocks clean reboot. Requires hard power cycle. 2. Use-after-free: if txg_wait_synced() returns, zfs_umount() calls zfsvfs_free(). The caller then dereferences the freed zfsvfs via zfs_vfs_rele(). The explicit zfs_umount() is unnecessary. z_unmounted is already set to B_TRUE before the locks are released, so all new operations return EBUSY. When the caller releases its s_active reference via zfs_vfs_rele(), the standard VFS lifecycle (deactivate_super -> generic_shutdown_super -> zpl_kill_sb -> zfs_umount) handles teardown correctly. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: mischivus <1205832+mischivus@users.noreply.github.com> Closes #18309 Closes #18310	2026-04-07 14:48:38 -07:00
mischivus	fd067644a2	Fix s_active leak in zfsvfs_hold() when z_unmounted is true When getzfsvfs() succeeds (incrementing s_active via zfs_vfs_ref()), but z_unmounted is subsequently found to be B_TRUE, zfsvfs_hold() returns EBUSY without calling zfs_vfs_rele(). This permanently leaks the VFS superblock s_active reference, preventing generic_shutdown_super() from ever firing, which blocks dmu_objset_disown() and makes the pool permanently unexportable (EBUSY). Add the missing zfs_vfs_rele() call, guarded by zfs_vfs_held() to handle the zfsvfs_create() fallback path where no VFS reference exists. This matches the existing cleanup pattern in zfsvfs_rele(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: mischivus <1205832+mischivus@users.noreply.github.com> Closes #18309 Closes #18310	2026-04-07 14:48:33 -07:00
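The leak described in the zfsvfs_hold() entry above is a reference taken on entry and not dropped on an early-return path. The following is a minimal sketch with hypothetical stand-ins (`fs_t`, `fs_ref()`, `fs_rele()`, `fs_hold()`), not the OpenZFS code; it also leaves out the zfs_vfs_held() guard the commit mentions for the fallback path that holds no VFS reference.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for illustration; not the OpenZFS zfsvfs_t. */
typedef struct fs {
	int	active_refs;	/* models the superblock s_active count */
	bool	unmounted;	/* models z_unmounted */
} fs_t;

static void
fs_ref(fs_t *fs)
{
	fs->active_refs++;
}

static void
fs_rele(fs_t *fs)
{
	fs->active_refs--;
}

/* Take a hold; on failure the caller is left with no extra reference. */
int
fs_hold(fs_t *fs)
{
	fs_ref(fs);
	if (fs->unmounted) {
		fs_rele(fs);	/* the release that the error path needs */
		return (EBUSY);
	}
	return (0);
}
```

With the release in place, a failed hold leaves the count where it started, so later teardown is not blocked by a reference nobody owns.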
Brian Behlendorf	0752cf0676	draid: allow seq resilver reads from degraded vdevs When sequentially resilvering allow a dRAID child to be read as long as the DTLs indicate it should have a good copy of the data and the leaf isn't being rebuilt. The previous check was slightly too broad and would skip dRAID spare and replacing vdevs if one of their children was being replaced. As long as there exists enough additional redundancy this is fine, but when there isn't this vdev must be read in order to correctly reconstruct the missing data. A new test case has been added which exhausts the available redundancy, faults another device causing it to be degraded, and then performs a sequential resilver for the degraded device. In such a situation enough redundancy exists to perform the replacement and a scrub should detect no checksum errors. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Andriy Tkachuk <andriy.tkachuk@seagate.com> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #18405	2026-04-07 10:48:27 -07:00
Alexander Motin	7b1682a825	Add support for POSIX_FADV_DONTNEED For now make it only evict the specified data from the dbuf cache. Even though dbuf cache is small, this may still reduce eviction of more useful data from there, and slightly accelerate ARC evictions by making the blocks there evictable a bit sooner. On FreeBSD this also adds support for POSIX_FADV_NOREUSE, since the kernel translates it into POSIX_FADV_DONTNEED after every read/write. This is not as efficient as it could be for ZFS, but that is the only way FreeBSD kernel allows to handle POSIX_FADV_NOREUSE now. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18399	2026-04-07 11:56:54 -04:00
Alek P	5cb95ad89a	fix memleak in spa_errlog.c Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Alan Somers <asomers@freebsd.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alek Pinchuk <apinchuk@axcient.com> Closes #18403	2026-04-06 15:59:30 -07:00
Alexander Motin	3599964dd2	Linux: Refactor zpl_fadvise() Similar to FreeBSD, stop issuing prefetches on POSIX_FADV_SEQUENTIAL. It should not have these semantics; it should only hint the speculative prefetcher, if access ever happens later. Instead, after POSIX_FADV_WILLNEED handling, call generic_fadvise(), if available, to do all the generic stuff, including setting f_mode in struct file, which we could later use to control the prefetcher as part of read/write operations. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18395	2026-04-06 15:57:18 -07:00
Christos Longros	1ff1f13a87	vdevprops: remove unused slow_io defaults, fix documentation Remove the unused DEFAULT_SLOW_IO_N and DEFAULT_SLOW_IO_T defines from zfs_diagnosis.c. Unlike the checksum and I/O thresholds, the slow_io_n and slow_io_t properties must be manually opted in and have no built-in defaults. The defines were misleading. Update the vdevprops man page to clarify that slow_io_n and slow_io_t must be manually set, and that the documented defaults (10 errors in 600 seconds) apply only to checksum and I/O events. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Christos Longros <chris.longros@gmail.com> Closes #18359	2026-04-06 09:30:46 -07:00
Tony Hutter	74da51695e	CI: Free 35GB of unused files on the runner Free 35GB of unused files, mostly from unused development environments. This helps with the out of disk space problems we were seeing on FreeBSD runners. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #18400	2026-04-04 13:32:39 -07:00
Christos Longros	33ed68fc24	zpool create: report which device caused failure When zpool create fails because a vdev is already in use, the error message now identifies the problematic device and the pool it belongs to, e.g.: cannot create 'tank': device '/dev/sdb1' is part of active pool 'rpool' Implementation follows the ZPOOL_CONFIG_LOAD_INFO pattern used by zpool import: - Add spa_create_info to spa_t to capture error info during vdev_label_init(), before vdev_close() resets vdev state - When vdev_inuse() detects a conflict, read the on-disk label to extract the pool name and store it with the device path - Return the info wrapped under ZPOOL_CONFIG_CREATE_INFO through the ioctl zc_nvlist_dst to userspace - In libzfs, zpool_create_info() unwraps the nvlist and formats the device-specific error message Restructure zpool_create() error handling so all switch cases use break instead of return, eliminating duplicated cleanup code and using the single create_failed exit path. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Christos Longros <chris.longros@gmail.com> Closes #18213	2026-04-03 14:18:24 -07:00
Rob Norris	cfae167754	linux/vfsops: remove zfs_mnt_t, pass directly A cleanup of opportunity. Since we already are modifying the contents of zfs_mnt_t, we've broken any API guarantee, so we might as well go the rest of the way and get rid of it, and just pass the osname and/or the vfs_t directly. It seems like zfs_mnt_t was never really needed anyway; it was added in `1c2555ef92` (March 2017) to minimise the difference to illumos, but zfs_vfsops was made platform-specific anyway in `7b4e27232d`. We also remove setting SB_RDONLY on the caller's flags when failing a read-write remount on a read-only snapshot or pool. Since `0f608aa6ca` the caller's flags have been a pointer back to fc->sb_flags, which are discarded without further ceremony when the operation fails, so the change is unnecessary and we can simplify the call further. Sponsored-by: TrueNAS Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18377	2026-04-03 12:15:55 -07:00
Rob Norris	11f8f08106	linux/super: work around kernels that enforce "forbidden" mount options Before Linux 5.8 (including RHEL8), a fixed set of "forbidden" options would be rejected outright. For those, we work around it by providing our own option parser to avoid the codepath in the kernel that would trigger it. Sponsored-by: TrueNAS Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18377	2026-04-03 12:15:49 -07:00
Rob Norris	f5a9e3a622	linux/super: implement new mount params parser Adds zpl_parse_param and wires it up to the fs_context. This uses the kernel's standard mount option parsing infrastructure to keep the work we need to do to a minimum. We simply fill in the vfs_t we attached to the fs_context in the previous commit, ready to go for the mount/remount call. Here we also document all the options we need to support, and why. It's a lot of history but in the end the implementation is straightforward. Finally, if we get SB_RDONLY on the proposed superblock flags, we record that as the readonly mount option, because we haven't necessarily seen a "ro" param and we still need to know for remount, the `readonly` dataset property, etc. Sponsored-by: TrueNAS Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18377	2026-04-03 12:15:43 -07:00
Rob Norris	2782d2a1be	linux/super: match vfs_t lifetime to fs_context vfs_t is initially just parameters for the mount or remount operation, so match them to the lifetime of the fs_context that represents that operation. When we actually execute the operation (calling .get_tree or .reconfigure), transfer ownership of those options to the associated zfsvfs_t. Sponsored-by: TrueNAS Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18377	2026-04-03 12:15:37 -07:00
Rob Norris	1419bfc7d7	linux/super: remove zpl_parse_monolithic Final bit of cleanup of the old method. Sponsored-by: TrueNAS Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18377	2026-04-03 12:15:32 -07:00
Rob Norris	5b4e8f8697	linux/vfsops: remove old options parser We're working to replace this, and it's easier to drop it outright while we get set up. To keep things compiling, the calls to zfsvfs_parse_options() are replaced with zfsvfs_vfs_alloc(), though without any option parsing at all nothing will work. That's ok, next commits are working towards it. Sponsored-by: TrueNAS Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18377	2026-04-03 12:15:24 -07:00
Rob Norris	040fb2994f	linux/vfsops: add vfs_t allocator, make public In a few commits, we're going to need to allocate and free vfs_t from zpl_super.c as well, so let's keep them uniform. Sponsored-by: TrueNAS Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18377	2026-04-03 12:15:09 -07:00
Christos Longros	a7157221db	AUTHORS: add Christos Longros Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Christos Longros <chris.longros@gmail.com> Closes #18396	2026-04-03 11:17:08 -07:00
li-nk.social	f203fedde8	Add zoned_uid property with additive least privilege authorization This implements zoned_uid - a ZFS property that delegates dataset visibility and administration to user namespaces owned by a specific UID, enabling rootless Podman/Docker with native ZFS storage. Usage: zfs set zoned_uid=1000 pool/dataset Problem solved: - zfs zone requires an existing namespace PID - Podman creates a new namespace on each container start - Solution: delegate to UID, any namespace owned by that UID is authorized Authorization model — three-layer additive (all must pass): L0 (auth): Namespace owner UID matches zoned_uid property L1 (dsl_deleg): Per-operation grants via `zfs allow` (when pool delegation is ON — the default) L2 (cap tier): Linux capability in the namespace determines the operation class permitted While CAP_SYS_ADMIN is a namespaced capability (the namespace owner always holds it within their own user namespace), granting blanket access based solely on its presence is contrary to the Principle of Least Privilege. This change introduces tiered capability requirements so that non-destructive operations (create, snapshot, set property) require only CAP_FOWNER, while destructive operations (destroy, rename, clone) continue to require CAP_SYS_ADMIN — both of which are namespaced capabilities scoped to the user namespace, not the init namespace. When pool delegation is OFF (non-default), all zoned_uid write operations are denied — delegation OFF means the pool admin has opted out of delegating access entirely. 
Security model: - Namespace owner UID must match zoned_uid value - Delegation root cannot be destroyed or escaped via rename - Namespace users cannot modify zoned_uid itself (only global zone admin can manage delegation assignments) - Namespace users cannot modify the 'zoned' property - Namespace users cannot override filesystem_limit or snapshot_limit set by the global admin on the delegation root (but can impose tighter sub-limits on child datasets) - Multi-UID isolation: sibling delegations with different UIDs cannot access each other's subtrees Kernel changes: - zone_dataset_attach_uid()/detach_uid() in SPL - zone_dataset_admin_check() for write authorization with tiered capabilities (CAP_FOWNER for non-destructive, CAP_SYS_ADMIN for destructive) - Callback registration for zoned_uid property lookup - New zfs_secpolicy_zoned_uid_deleg() helper that calls dsl_deleg_access_impl() directly, bypassing zfs_dozonecheck_ds() which requires the `zoned` property that zoned_uid datasets lack - Fix dsl_deleg_access_impl() hierarchy walk to accept zoned_uid datasets (not just zoned=on) - Update all 9 secpolicy call sites to require dsl_deleg grants instead of short-circuiting on ZONE_ADMIN_ALLOWED - Security policy hooks in zfs_secpolicy_*() functions - Fixed inglobalzone() to use current_user_ns() - zfs_prop_set_special() handles attach/detach as property side-effects, eliminating the need for dedicated ioctls - spa_import_os() restores zoned_uid delegations kernel-side on pool import via dmu_objset_find() walk - spa_export_os() detaches zoned_uid delegations on pool destroy/export, preventing stale kernel state on recreate - zoned_uid registered as PROP_INHERIT so child datasets inherit the delegation, enabling sub-dataset creation - zfs_get_zoned_uid() uses dsl_prop_get setpoint to identify the true delegation root, correctly distinguishing inherited values from locally-set ones for destroy/rename policy checks - zone_dataset_check_list() accepts '@' and '#' separators 
in addition to '/' so snapshots and bookmarks are visible from delegated namespaces - zfs_secpolicy_setprop() blocks ZFS_PROP_ZONED_UID from being set within a delegated namespace, preventing self-revocation - zfs_secpolicy_setprop() blocks filesystem_limit and snapshot_limit changes on the delegation root from within a namespace (uses dsl_prop_get setpoint to identify the root), while allowing delegated users to set tighter sub-limits on child datasets - Use kcred (not CRED()) for zone_dataset_detach_uid/attach_uid in destroy and rename cleanup paths, preventing stale tracking entries when namespace users perform these operations - Use cr parameter (not CRED()) in all secpolicy zoned_uid delegation checks for correct credential propagation Userspace changes: - check_parents() defers to kernel when zoned_uid set FreeBSD compatibility: - include/os/freebsd/spl/sys/zone.h — Added FreeBSD stubs: - zone_uid_op_t enum (ZONE_OP_CREATE, SNAPSHOT, CLONE, DESTROY, RENAME, SETPROP) - zone_admin_result_t enum (NOT_APPLICABLE, ALLOWED, DENIED) - zone_dataset_admin_check() — static inline, always returns ZONE_ADMIN_NOT_APPLICABLE - zone_dataset_attach_uid() — static inline, returns ENXIO - zone_dataset_detach_uid() — static inline, returns ENXIO - zone_get_zoned_uid_fn_t callback typedef - zone_register_zoned_uid_callback() — static inline no-op - zone_unregister_zoned_uid_callback() — static inline no-op - On FreeBSD, every zone_dataset_admin_check() call returns ZONE_ADMIN_NOT_APPLICABLE, causing all security policy functions to fall through to existing jail-based permission checks - Setting zoned_uid on FreeBSD returns ENXIO since user namespace delegation requires Linux user namespaces Test changes: - Add grant_deleg() calls to tests 006-022 for operations that now require explicit dsl_deleg grants - Add tests 023-030 validating the capability tier model - Add test 031 validating stale zone tracking cleanup after namespace rename+destroy - Fix capsh lookup in test helpers 
for ksh -p restricted PATH (command -v + explicit /usr/sbin fallback) - Add mountpoint=none to tests 023-026 to avoid mount-lock issues in user namespaces - Fix test 026 expectations to match kernel behavior (delegation OFF denies all writes, allows read-only) - run_in_userns helper resolves absolute zfs path to handle environments where PATH does not include zfs (source builds) - Test 004 updated: zoned_uid now inherits (PROP_INHERIT), test verifies inheritance and override behavior - Test 013 uses within_percent with parseable byte output (-Hp) for robust quota value comparison across environments - Test 014: verifies grandchild dataset creation from user namespace, confirming inherited zoned_uid delegation works - Test 015: pool destroy/recreate with zoned_uid delegation - Test 016: individual snapshot destroy from namespace - Test 017: namespace user cannot modify zoned_uid property - Test 018: clone operations from within delegated namespace - Test 019: multi-UID isolation between sibling delegations - Test 020: operations without zone_dataset_admin_check() integration are denied via zfs_dozonecheck_impl() - Test 021: 'zoned' property cannot be modified from namespace - Test 022: delegation root limit overrides blocked from namespace - Quoted shell variables across all test scripts for robustness - Shellcheck SC2155 fixes across all test scripts Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Colin K. Williams / LINK ORG LLC / li-nk.social <colin@li-nk.org> Closes #18167	2026-04-03 10:38:26 -07:00
Christos Longros	74504cf7fd	zinject: add numeric suffix support for -r range Parse range values with zfs_nicestrtonum() instead of strtoull() so that -r accepts human-readable suffixes (K, M, G, T, P, E). For example: zinject -r 1G,2G /pool/file Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Christos Longros <chris.longros@gmail.com> Closes #18374	2026-04-01 13:53:56 -07:00
Alexander Motin	16858492e6	FreeBSD: Implement relatime property While FreeBSD does not support relatime natively, it seems trivial to implement it just as a dataset property for consistency. To not change the status quo, change its default to off on FreeBSD. Now, if explicitly enabled, it should actually work. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18385	2026-04-01 13:48:50 -07:00
Christos Longros	0f86f244ce	ZTS: re-enable send_raw_ashift on FreeBSD The test was skipped on FreeBSD since 2023 (#14961) due to exceeding the 10-minute CI timeout on FreeBSD 14. CI runs on the fork now show the test completes well within limits: FreeBSD 14.3-RELEASE: 10 seconds FreeBSD 15.0-STABLE: 11 seconds FreeBSD 16.0-CURRENT: 14 seconds Remove the FreeBSD skip and the corresponding known skip entry. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Christos Longros <chris.longros@gmail.com> Closes #18389	2026-04-01 13:13:56 -07:00
Alexander Motin	a22b3f6700	abd: Fix stats asymmetry in case of Direct I/O abd_alloc_from_pages() does not call abd_update_scatter_stats(), since memory is not really allocated there. But abd_free_scatter() called by abd_free() does. It causes negative overflow of some ABD and possibly ARC counters. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@truenas.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18390	2026-04-01 10:28:18 -04:00
Lalufu	869d805997	Support bin-sbin merge on Fedora Starting with Fedora 42, Fedora has been working on merging /bin and /sbin directories. See https://fedoraproject.org/wiki/Changes/Unify_bin_and_sbin To support this, make sure we do not put files into /sbin directories on these distributions by respecting the distribution set value of %{_sbindir}. In addition, explicitly set `mounthelperdir`, which affects the placement of `mount.zfs`, which does not respect %{_sbindir} by default. Making it point to %{_sbindir} will allow it to work correctly on Fedora, while keeping its previous directory of `/sbin` on all other distributions. Note that files that used to reside in /sbin directories on Fedora will stay accessible under these paths, as the distribution maintains symlinks. No changes are needed to external scripts invoking these binaries. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ralf Ertzinger <ralf@skytale.net> Closes #18373	2026-03-31 16:48:33 -07:00
shelvacu	8553675dba	man: Fix checking manpages without a full build Some of the man pages (e.g. `man/man8/zed.8`) are generated from `.in` files, and `zed.8.in` was a dependency of `zed.8`, but `zed.8` was not a dependency of `mancheck-...zed.8`. This usually worked anyway because a full build had already been run; now it works regardless. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Shelvacu on fw <git@shelvacu.com> Closes #18346	2026-03-31 15:46:37 -07:00
Christos Longros	67ec68fd0d	zfs.4: document the zfs_arc_free_target parameter Document the FreeBSD-specific zfs_arc_free_target tunable which controls the number of free pages below which the ARC triggers reclaim. Note its initialization from vm.v_free_target and its distinction from the Linux-specific zfs_arc_sys_free. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Christos Longros <chris.longros@gmail.com> Closes #18350	2026-03-31 15:42:00 -07:00
Andriy Tkachuk	fc659bd6de	draid: fix import failure after disk replacements Currently, it's possible for the draid vdev asize to decrease after disk replacements when the replacement disk is slightly smaller than all other disks in the pool. In such situations, import would fail on this check in vdev_open(): /* * Make sure the allocatable size hasn't shrunk too much. */ if (asize < vd->vdev_min_asize) { vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN, VDEV_AUX_BAD_LABEL); return (SET_ERROR(EINVAL)); } Solution: fix vdev_draid_min_asize() so that it rounds up the required minimal disk capacity to VDEV_DRAID_ROWHEIGHT. This refuses replacements with disks smaller than the minimum required, which avoids the draid asize decrease. Note: we also use VDEV_DRAID_ROWHEIGHT in vdev_draid_open() when calculating asize, and that's why we need to round up min_size in vdev_draid_min_asize() to avoid asize drops. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com> Closes #18380	2026-03-31 15:41:03 -07:00
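
The rounding issue described in fc659bd6de can be illustrated with a small standalone sketch. This is not the OpenZFS code: `ROW_HEIGHT`, `usable_size()` and `replacement_ok()` are hypothetical names, and the row height value is chosen only for illustration. The sketch assumes the usable size is computed by rounding a disk's size down to whole rows, and shows why the minimum replacement size then has to be rounded up to the same granularity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for VDEV_DRAID_ROWHEIGHT; the value is illustrative. */
#define ROW_HEIGHT (1ULL << 24)

/* Round x down to a multiple of align (align must be nonzero). */
static uint64_t
round_down(uint64_t x, uint64_t align)
{
	assert(align != 0);
	return (x - (x % align));
}

/* Round x up to a multiple of align (align must be nonzero). */
static uint64_t
round_up(uint64_t x, uint64_t align)
{
	uint64_t rem;

	assert(align != 0);
	rem = x % align;
	return (rem == 0 ? x : x + (align - rem));
}

/* Space usable from one child disk: whole rows only (assumption). */
static uint64_t
usable_size(uint64_t disk_size)
{
	return (round_down(disk_size, ROW_HEIGHT));
}

/* Would a replacement disk of this size pass the minimum-size check? */
static bool
replacement_ok(uint64_t disk_size, uint64_t required, bool round_min)
{
	uint64_t min = round_min ? round_up(required, ROW_HEIGHT) : required;

	return (disk_size >= min);
}
```

For example, take a required size of ten rows minus 1 MiB and a replacement disk of ten rows minus 512 KiB. Without rounding the minimum, the disk is accepted even though `usable_size()` yields only nine rows, so the usable size shrinks. With the minimum rounded up to ten full rows, the same disk is refused.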

10725 Commits