dmu_write_direct_done() passes dmu_sync_arg_t to
dmu_sync_done(), which updates the override state and
frees the completion context. The Direct I/O error path
then still dereferences dsa->dsa_tx while rolling the
dirty record back with dbuf_undirty(), resulting in a
use-after-free.
Save dsa->dsa_tx in a local variable before calling
dmu_sync_done() and use that saved tx for the error
rollback. This preserves the existing ownership model
for dsa and does not change the Direct I/O write
semantics.
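
The ordering fix can be sketched as follows. This is a minimal model and not the actual patch: every sketch_* type and function is a hypothetical stand-in for dmu_tx_t, dmu_sync_arg_t, dmu_sync_done() and dbuf_undirty(), reduced to the ownership behavior described above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for dmu_tx_t and dmu_sync_arg_t. */
typedef struct sketch_tx { int tx_id; } sketch_tx_t;
typedef struct sketch_dsa { sketch_tx_t *dsa_tx; } sketch_dsa_t;

/* Models dmu_sync_done(): consumes and frees the completion context. */
static void
sketch_sync_done(sketch_dsa_t *dsa)
{
	free(dsa);
}

/* Models dbuf_undirty(): here it only reports which tx it rolled back. */
static int
sketch_undirty(sketch_tx_t *tx)
{
	return (tx->tx_id);
}

/*
 * Models the error path of dmu_write_direct_done(). Returns the id of
 * the tx used for the rollback, or -1 when there was no I/O error.
 */
static int
sketch_write_direct_done(sketch_dsa_t *dsa, int io_error)
{
	assert(dsa != NULL);
	sketch_tx_t *tx = dsa->dsa_tx;	/* save before dsa is freed */

	sketch_sync_done(dsa);		/* dsa must not be touched after this */

	if (io_error != 0)
		return (sketch_undirty(tx));	/* uses the saved tx, not dsa */
	return (-1);
}
```

The only point of the sketch is the order of operations: the tx pointer is copied out while dsa is still valid, so nothing reads dsa after it has been freed.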
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: gality369 <gality369@example.com>
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Closes #18440
zfs allow with a typo (e.g. "snapshop") produced the misleading
error "operation not applicable to datasets of this type". Report
"invalid permission" instead.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Christos Longros <chris.longros@gmail.com>
Closes #18401
Closes #11903
If I could go back in time, I would beg Sun engineers to pick a
different name. For those of us who have not read the ZFS On-Disk
Specification pdf, it is not at all obvious that clearing a "label" is
such a bad thing.
But changing the name would be a breaking change, so at least for now
we can update the documentation.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Shelvacu <git@shelvacu.com>
Closes #18347
Non-root callers got "unmount failed" when ZFS_MOUNT_HELPER was set
because /bin/umount's exit status doesn't preserve errno. Map a
non-zero helper exit to EPERM when geteuid() != 0 so the user sees
"permission denied".
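
A minimal sketch of that mapping, assuming a hypothetical helper name and a hypothetical SKETCH_UNKNOWN sentinel for the case the text does not cover (a root caller whose helper failed). It is not the libzfs code.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

/*
 * SKETCH_UNKNOWN is an assumption of this sketch. It means "the cause
 * is unknown, keep reporting the generic unmount failure".
 */
#define	SKETCH_UNKNOWN	(-1)

/*
 * Hypothetical helper: translate the exit status of an external umount
 * helper into an errno-style value.
 */
static int
sketch_map_helper_status(int exit_status, uid_t euid)
{
	assert(exit_status >= 0);
	if (exit_status == 0)
		return (0);		/* helper succeeded */
	if (euid != 0)
		return (EPERM);		/* non-root: report permission denied */
	return (SKETCH_UNKNOWN);	/* root: cause unknown, stay generic */
}
```

A caller would pass geteuid() as the second argument.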
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christos Longros <chris.longros@gmail.com>
Closes #11740
Closes #18443
When a VM fails to launch or is unreachable the qemu-7-prepare.sh
script will fail to collect the artifacts due to the missing vm*
directories. We want to collect as much diagnostic information as
possible, so when the directories are missing, create them to allow the
subsequent steps to proceed normally. Additionally, we don't want to fail
if the /tmp/summary.txt file is missing.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #18438
We've seen some qemu-1-setup failures while trying to change the
runner's block device scheduler value to 'none':
    We have a single 150GB block device
    Setting up swapspace version 1, size = 16 GiB (17179865088 bytes)
    no label, UUID=7a790bfe-79e5-4e38-b208-9c63fe523294
    tee: '/sys/block/s*/queue/scheduler': No such file or directory
Luckily, we don't need to set the scheduler anymore on modern kernels:
https://github.com/openzfs/zfs/issues/9778#issuecomment-569347505
This commit just removes the code that sets the scheduler.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #18437
Update the META file to reflect compatibility with the 7.0
kernel.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #18435
The resilver_restart_001 test case has not been entirely reliable
when run under the CI. Address several small issues which may be
responsible.
- Configure the pool as raidz2 instead of raidz1 since the test
offlines two devices. This ensures the second device is marked
as OFFLINE instead of DEGRADED.
- Start the zpool replace after setting SCAN_SUSPEND_PROGRESS to
close any potential race where the replace finishes too quickly.
- Wait for the offlined/onlined vdevs to fully transition to the
expected state during the test.
- Add the true flag to sync_pool to force a TXG sync to happen
even if it might not otherwise be required.
- During cleanup dump the zpool events history to aid debugging
if the updated test case is still unreliable in the CI.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #18434
When copy_file_range overwrites a recent truncation, subsequent reads
can incorrectly conclude that the range is a hole instead of reading
the cloned blocks.
This can happen when the following conditions are met:
- Truncate adds blkid to dn_free_ranges
- A new TXG is created
- copy_file_range calls dmu_brt_clone, which overrides the block pointer
and sets DB_NOFILL
- Subsequent read, given DB_NOFILL, hits dbuf_read_impl and
dbuf_read_hole
- dbuf_read_hole calls dnode_block_freed, which returns TRUE because the
truncated blkids are still in dn_free_ranges
This will not happen if the clone and truncate are in the same TXG,
because the block clone would update the current TXG's dn_free_ranges,
which is why this bug only triggers under high IO load (such as
compilation).
Fix this by skipping the dnode_block_freed call if the block is
overridden. The fix shouldn't cause an issue when the cloned block is
subsequently freed in later TXGs, as dbuf_undirty would remove the
override.
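
The decision can be sketched as below. This is a reduced model under stated assumptions: sketch_dbuf_t and both functions are hypothetical, and the real hole detection in dbuf_read_hole() checks more than these two flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of a dbuf for this sketch only. */
typedef struct sketch_dbuf {
	bool overridden;	/* block pointer overridden, e.g. by a clone */
	bool in_free_ranges;	/* blkid still listed in dn_free_ranges */
} sketch_dbuf_t;

/* Models dnode_block_freed(): consults the recorded free ranges. */
static bool
sketch_block_freed(const sketch_dbuf_t *db)
{
	return (db->in_free_ranges);
}

/* Decide whether a read of this block should be treated as a hole. */
static bool
sketch_read_is_hole(const sketch_dbuf_t *db)
{
	assert(db != NULL);
	if (db->overridden)
		return (false);	/* cloned data wins over a stale free range */
	return (sketch_block_freed(db));
}
```

The first test in sketch_read_is_hole() is the part that corresponds to the fix: an overridden block never reaches the free-range lookup.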
This requires a dedicated test program as it is much harder to trigger
with scripts (this needs to generate a lot of I/O in a short period of
time for the bug to trigger reliably).
Assisted-by: Gemini:gemini-3.1-pro
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Gary Guo <gary@kernel.org>
Closes #18412
Closes #18421
Replace semicolons with && so build failures are not masked by the
subsequent lockfile cleanup. Use trap to ensure the lockfile is
removed on both success and failure.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christos Longros <chris.longros@gmail.com>
Closes #18206
Closes #18424
zfsctl_snapdir_vget resolves NFS file handles for snapshot directory
entries by calling zfsctl_snapshot_path_objset, which iterates all
snapshots via dmu_snapshot_list_next to find the matching objsetid.
With many snapshots this linear scan is expensive.
For snapshots that have been previously mounted, the path is already
cached in the in-memory AVL tree. Check the tree first with
zfsctl_snapshot_find_by_objsetid and fall back to the on-disk scan
only when the entry is not found.
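
The lookup order can be sketched as follows. All names are hypothetical, and flat arrays stand in for both the in-memory AVL tree and the on-disk snapshot list, so this shows only the "cache first, scan on miss" control flow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: a snapshot's objset id and its path. */
typedef struct sketch_snap {
	uint64_t objsetid;
	const char *path;
} sketch_snap_t;

/* Counts fallback scans so a caller can observe which path was taken. */
static int sketch_scan_count;

/* Linear search; models both the cache lookup and the on-disk scan. */
static const char *
sketch_find(const sketch_snap_t *tab, size_t n, uint64_t objsetid)
{
	for (size_t i = 0; i < n; i++) {
		if (tab[i].objsetid == objsetid)
			return (tab[i].path);
	}
	return (NULL);
}

/* Check the cache first and scan the full list only on a miss. */
static const char *
sketch_snapshot_path(const sketch_snap_t *cache, size_t ncache,
    const sketch_snap_t *ondisk, size_t nondisk, uint64_t objsetid)
{
	assert(cache != NULL && ondisk != NULL);
	const char *path = sketch_find(cache, ncache, objsetid);

	if (path != NULL)
		return (path);		/* fast path: previously mounted */
	sketch_scan_count++;
	return (sketch_find(ondisk, nondisk, objsetid));	/* slow path */
}
```

In the real code the fast path is a tree lookup rather than a second linear search, which is where the saving comes from.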
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #18429
Currently, when more than nparity disks get faulted during the
rebuild, only the first nparity disks go to the faulted state, and
all the remaining disks go to the degraded state. When a hot
spare is attached to such a degraded disk for rebuild, creating the
spare mirror, only that hot spare gets rebuilt, but not the
degraded device. So when later, during scrub, some other attached
draid spare happens to map to that spare, it will end up with a
cksum error.
Moreover, if the user clears the degraded disk from errors, the
data won't be resilvered to it, hot spare will be detached almost
immediately and the data that was resilvered only to it will be
lost.
Solution: write to all mirrored devices during rebuild, similar
to traditional/healing resilvering, but only if we can verify
the integrity of the data, or when it's the draid spare we are
writing to, in which case we are writing to reserved spare
space, and there is no danger of overwriting any good data.
The argument that writing only to rebuilding draid spare vdev is
faster than writing to normal device doesn't hold since, at a
specific offset being rebuilt, draid spare will be mapped to a
normal device anyway.
The redundancy_draid_degraded2 automated test is also added to
cover this scenario.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andriy Tkachuk <atkachuk@wasabi.com>
Closes #18414
The GH artifacts action now lets you disable auto-zipping your
artifacts. Previously, GH would always automatically put your
artifacts in a ZIP file. This is annoying when your artifacts
are already in a tarball.
Also update the following action versions:
    checkout: v4 -> v6
    upload-artifact: v4 -> v7
    download-artifact: v4 -> v8
Lastly, fix an issue where zfs-qemu-packages now needs to power
cycle the VM.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #18411
zfsctl_snapshot_mount() holds z_teardown_lock(R) across
call_usermodehelper(), which spawns a mount process that needs
namespace_sem(W) via move_mount. Reading /proc/self/mountinfo holds
namespace_sem(R) and needs z_teardown_lock(R) via zpl_show_devname.
When zfs_suspend_fs (from zfs recv or zfs rollback) queues
z_teardown_lock(W), the rrwlock blocks new readers, completing the
deadlock cycle.
Fix by releasing z_teardown_lock(R) after gathering the dataset name
and mount path, before any blocking operation. Everything after the
release operates on local string copies or uses its own
synchronization. The parent zfsvfs pointer remains valid because the
caller holds a path reference to the automount trigger dentry.
Releasing the lock allows zfs_suspend_fs to proceed concurrently
with the mount helper, so dmu_objset_hold in zpl_get_tree can
transiently fail with ENOENT during the clone swap. The mount
helper fails, EISDIR is returned, and the VFS falls back to the
ctldir stub (empty directory) until the next access retries.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #18415
This is motivated by a FreeBSD AIO test case which creates a zvol with -o
volmode=dev, then immediately tries to open the zvol device file. The
open occasionally fails with ENOENT.
When a zvol is created without the volmode setting, zvol_create_minors()
blocks until the task is finished, at which point OS-dependent code will
have created a device file. However, zvol_set_common() may cause the
device file to be destroyed and re-created, at least on FreeBSD, if the
voltype switches from GEOM to DEV. In this case, we do not block
waiting for the operation to finish, causing the test failure.
Fix the problem by making zvol_set_common() block until the operation
has finished. In FreeBSD zvol code, use g_waitidle() to block until
asynchronous GEOM operations are done. This fixes a secondary race
where zvol_os_remove_minor() does not block until the zvol device file
is removed, and the subsequent zvol_os_create_minor() fails because the
(to-be-destroyed) device file already exists.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #18191
zvol_geom_open() may be called to taste an orphaned provider. The test
for pp->private == NULL there is racy as no locks are synchronizing the
test.
Use the GEOM topology lock to interlock the pp->private == NULL test
with the zvol state checks. This establishes a new lock order but I
believe this is necessary. Set pp->private = NULL under the GEOM
topology lock instead of the per-zvol state lock. Modify
zvol_os_rename_minor() to drop the zvol state lock to avoid a lock order
reversal with the topology lock.
Also reverse the order of tests in zvol_geom_open() and zvol_cdev_open()
as at least zvol_geom_open() may race with zvol_os_remove_minor(), which
sets zv->zv_zso = NULL. Testing for ZVOL_REMOVING first avoids a race
which can lead to a NULL pointer dereference.
Add a new OS-specific flag to handle the case where zvol_geom_open()
drops all locks in order to avoid a lock order reversal when acquiring
the suspend lock as the open count transitions 0->1. I don't see
anything preventing zvol_os_remove_minor() from racing there.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #18191
Currently, the only way to tolerate the failure of the whole
enclosure is to configure several draid vdevs in the pool, each
vdev having disks from different enclosures. But this essentially
degrades draid to raidz and defeats the purpose of having fast
sequential resilvering on wide pools with draid.
This patch allows configuring several children groups in the
same row in one draid vdev. In each such group, let's call it
failure group, the user can configure disks belonging to different
enclosures - failure domains. For example, in case of 10 such
enclosures with 10 disks each, the user can put 1st disk from each
enclosure into 1st group, 2nd disk from each enclosure into 2nd
group, and so on. If one enclosure fails, only one disk from each
group would fail, which won't affect draid operation, and each
group would have enough redundancy to recover the stored data. Of
course, in case of draid2 - two enclosures can fail at a time, in
case of draid3 - three enclosures (provided there are no other
disk failures in each group).
In order to preserve fast sequential resilvering in case of a
disk failure, the groups must share all disks between themselves,
and this is achieved by shuffling the disks between the groups.
But only i-th disks in each group are shuffled between themselves,
i.e. the disks from the same enclosures, after that they are
shuffled within each group, like it is done today in an ordinary
draid. Thus, no more than one disk from any enclosure can appear
in any failure group as a result of this shuffling.
For example, here's what the pool status output looks like in
case of two `draid1:2d:4c` failure groups:
    NAME                      STATE     READ WRITE CKSUM
    pool1                     ONLINE       0     0     0
      draid1:2d:4c:8w:1s-0    ONLINE       0     0     0
        enc0d0                ONLINE       0     0     0
        enc1d0                ONLINE       0     0     0
        enc2d0                ONLINE       0     0     0
        enc3d0                ONLINE       0     0     0
        enc0d1                ONLINE       0     0     0
        enc1d1                ONLINE       0     0     0
        enc2d1                ONLINE       0     0     0
        enc3d1                ONLINE       0     0     0
    spares
      draid1-0-0              AVAIL
The number of failure groups is specified indirectly via the new
width parameter in the draid vdev configuration descriptor, which is
the total number of disks and which is a multiple of the number of
children in each group. This multiple is the number of groups (width /
children). Doing it this way allows the user to conveniently see how
many disks draid has at a glance.
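
The arithmetic can be sketched as below. The function name is hypothetical, and returning 0 for a width that is not a whole multiple of children is an assumption of this sketch, not the validation the patch performs.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of failure groups for a draid vdev with
 * the given children per group and total width. Returns 0 when width
 * is not a positive multiple of children.
 */
static uint64_t
sketch_draid_ngroups(uint64_t children, uint64_t width)
{
	assert(children > 0);
	if (width == 0 || width % children != 0)
		return (0);	/* not a whole number of failure groups */
	return (width / children);
}
```

For the `draid1:2d:4c:8w` layout shown above this gives 8 / 4 = 2 groups, and the ten-enclosure example (10 children per group, 100 disks in total) gives 10 groups.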
Spare disks are evenly distributed among failure groups, and they
are shared by all groups. However, to support domain failure, we
cannot have more than nparity - 1 failed disks in any group, even
if they are rebuilt to draid spares (the blocks of those spares
can be mapped to the disks from the failed domain, and we cannot
tolerate more than nparity failures in any failure group).
The retire agent in zed is updated to not start resilvering when
the domain failure happens. Otherwise, it might take a lot of
computing and I/O bandwidth resources, only to be wasted when the
failed domain component is replaced.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
Closes #11969
Closes #18148
ztest can enable and disable the multihost property when testing.
This can result in a failure when attempting to import an existing
pool when multihost=on but no /etc/hostid file exists. Update the
workflow to use zgenhostid to create /etc/hostid when not present.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #18413
`zfs change-key` changes the key used to encrypt a ZFS dataset. When
used programmatically, it may be useful to track some external state
related to the key in a user property. E.g. a generation number,
expiration date, or application-specific source of the key.
This can be done today by running `zfs set user:prop=value` before or
after running `zfs change-key`. However, this introduces a race
condition where the property may not be set even though the key has
changed, or vice versa (depending on the order the commands are
executed).
This can be addressed by using a channel program (`zfs program`) which
calls both `zfs.sync.change_key()` and `zfs.sync.set_prop()`, changing
the property and key atomically. However, it is nontrivial to write such
a channel program to handle error cases, and provide the new key
securely (e.g. without logging it).
This issue proposes to enhance `zfs change-key` to be able to atomically
set user properties while changing the encryption key. Currently `zfs
change-key` accepts `-o property=value` arguments, but the only valid
properties are keylocation, keyformat, and pbkdf2iters. We will enhance
this to also allow user properties, e.g. `-o user:prop=value`. User
properties will also be allowed when using `zfs change-key -i` to
inherit the key from the parent dataset.
Original-patch-by: Matthew Ahrens <matt@mahrens.org>
External-issue: https://www.illumos.org/issues/17847
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #18407
When zfsvfs_init() fails during zfs_resume_fs(), the bail
path called zfs_umount() directly. All three callers
(zfs_ioc_rollback, zfs_ioc_recv_impl, and
zfs_ioc_userspace_upgrade) hold an s_active reference
via getzfsvfs() at entry.
This creates two bugs:
1. Deadlock: zfs_umount() -> zfsvfs_teardown() ->
txg_wait_synced() blocks in uninterruptible D state.
The superblock cannot tear down because s_active is
pinned by the calling thread itself. Survives SIGKILL.
Blocks clean reboot. Requires hard power cycle.
2. Use-after-free: if txg_wait_synced() returns,
zfs_umount() calls zfsvfs_free(). The caller then
dereferences the freed zfsvfs via zfs_vfs_rele().
The explicit zfs_umount() is unnecessary. z_unmounted is
already set to B_TRUE before the locks are released, so
all new operations return EBUSY. When the caller releases
its s_active reference via zfs_vfs_rele(), the standard
VFS lifecycle (deactivate_super -> generic_shutdown_super
-> zpl_kill_sb -> zfs_umount) handles teardown correctly.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: mischivus <1205832+mischivus@users.noreply.github.com>
Closes #18309
Closes #18310
When getzfsvfs() succeeds (incrementing s_active via
zfs_vfs_ref()), but z_unmounted is subsequently found to
be B_TRUE, zfsvfs_hold() returns EBUSY without calling
zfs_vfs_rele(). This permanently leaks the VFS superblock
s_active reference, preventing generic_shutdown_super()
from ever firing, which blocks dmu_objset_disown() and
makes the pool permanently unexportable (EBUSY).
Add the missing zfs_vfs_rele() call, guarded by
zfs_vfs_held() to handle the zfsvfs_create() fallback
path where no VFS reference exists. This matches the
existing cleanup pattern in zfsvfs_rele().
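
The reference accounting can be sketched as follows. sketch_fs_t and sketch_hold() are hypothetical and model only the counter; the has_vfs flag stands in for what zfs_vfs_held() would report.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced model of the state involved. */
typedef struct sketch_fs {
	int s_active;	/* models the superblock reference count */
	bool unmounted;	/* models z_unmounted */
	bool has_vfs;	/* false models the zfsvfs_create() fallback */
} sketch_fs_t;

/* Models zfsvfs_hold() with the missing release added. */
static int
sketch_hold(sketch_fs_t *fs)
{
	assert(fs != NULL);
	if (fs->has_vfs)
		fs->s_active++;		/* models getzfsvfs() taking a ref */

	if (fs->unmounted) {
		if (fs->has_vfs)	/* models the zfs_vfs_held() guard */
			fs->s_active--;	/* models zfs_vfs_rele() */
		return (EBUSY);
	}
	return (0);	/* success: the caller now owns the reference */
}
```

Without the guarded decrement, every failed hold on an unmounted filesystem would leave s_active one higher than before, which is the leak described above.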
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: mischivus <1205832+mischivus@users.noreply.github.com>
Closes #18309
Closes #18310
When sequentially resilvering allow a dRAID child to be read
as long as the DTLs indicate it should have a good copy of the
data and the leaf isn't being rebuilt. The previous check was
slightly too broad and would skip dRAID spare and replacing
vdevs if one of their children was being replaced. As long
as there exists enough additional redundancy this is fine, but
when there isn't this vdev must be read in order to correctly
reconstruct the missing data.
A new test case has been added which exhausts the available
redundancy, faults another device causing it to be degraded,
and then performs a sequential resilver for the degraded device.
In such a situation enough redundancy exists to perform the
replacement and a scrub should detect no checksum errors.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #18405
For now make it only evict the specified data from the dbuf cache.
Even though dbuf cache is small, this may still reduce eviction of
more useful data from there, and slightly accelerate ARC evictions
by making the blocks there evictable a bit sooner.
On FreeBSD this also adds support for POSIX_FADV_NOREUSE, since the
kernel translates it into POSIX_FADV_DONTNEED after every read/write.
This is not as efficient as it could be for ZFS, but it is currently the
only way the FreeBSD kernel allows POSIX_FADV_NOREUSE to be handled.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18399
Similar to FreeBSD, stop issuing prefetches on POSIX_FADV_SEQUENTIAL.
It should not have these semantics; it should only hint the speculative
prefetcher, if access ever happens later. Instead, after
POSIX_FADV_WILLNEED handling, call generic_fadvise(), if available, to do
all the generic work, including setting f_mode in struct file, which we
could later use to control the prefetcher as part of read/write operations.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18395
Remove the unused DEFAULT_SLOW_IO_N and DEFAULT_SLOW_IO_T defines
from zfs_diagnosis.c. Unlike the checksum and I/O thresholds, the
slow_io_n and slow_io_t properties must be manually opted in and
have no built-in defaults. The defines were misleading.
Update the vdevprops man page to clarify that slow_io_n and
slow_io_t must be manually set, and that the documented defaults
(10 errors in 600 seconds) apply only to checksum and I/O events.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Christos Longros <chris.longros@gmail.com>
Closes #18359
Free 35GB of unused files, mostly from unused development environments.
This helps with the out of disk space problems we were seeing on
FreeBSD runners.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #18400
When zpool create fails because a vdev is already in use, the
error message now identifies the problematic device and the pool
it belongs to, e.g.:
    cannot create 'tank': device '/dev/sdb1' is part of
    active pool 'rpool'
Implementation follows the ZPOOL_CONFIG_LOAD_INFO pattern used
by zpool import:
- Add spa_create_info to spa_t to capture error info during
vdev_label_init(), before vdev_close() resets vdev state
- When vdev_inuse() detects a conflict, read the on-disk
label to extract the pool name and store it with the
device path
- Return the info wrapped under ZPOOL_CONFIG_CREATE_INFO
through the ioctl zc_nvlist_dst to userspace
- In libzfs, zpool_create_info() unwraps the nvlist and
formats the device-specific error message
Restructure zpool_create() error handling so all switch cases
use break instead of return, eliminating duplicated cleanup
code and using the single create_failed exit path.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christos Longros <chris.longros@gmail.com>
Closes #18213
A cleanup of opportunity. Since we already are modifying the contents of
zfs_mnt_t, we've broken any API guarantee, so we might as well go the
rest of the way and get rid of it, and just pass the osname and/or the
vfs_t directly.
It seems like zfs_mnt_t was never really needed anyway; it was added in
1c2555ef92 (March 2017) to minimise the difference to illumos, but
zfs_vfsops was made platform-specific anyway in 7b4e27232d.
We also remove setting SB_RDONLY on the caller's flags when failing a
read-write remount on a read-only snapshot or pool. Since 0f608aa6ca
the caller's flags have been a pointer back to fc->sb_flags, which are
discarded without further ceremony when the operation fails, so the
change is unnecessary and we can simplify the call further.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18377
Before Linux 5.8 (including RHEL8), a fixed set of "forbidden" options
would be rejected outright. For those, we work around it by providing
our own option parser to avoid the codepath in the kernel that would
trigger it.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18377
Adds zpl_parse_param and wires it up to the fs_context. This uses the
kernel's standard mount option parsing infrastructure to keep the work
we need to do to a minimum. We simply fill in the vfs_t we attached to
the fs_context in the previous commit, ready to go for the mount/remount
call.
Here we also document all the options we need to support, and why. It's
a lot of history but in the end the implementation is straightforward.
Finally, if we get SB_RDONLY on the proposed superblock flags, we record
that as the readonly mount option, because we haven't necessarily seen a
"ro" param and we still need to know for remount, the `readonly` dataset
property, etc.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18377
vfs_t is initially just parameters for the mount or remount operation,
so match them to the lifetime of the fs_context that represents that
operation.
When we actually execute the operation (calling .get_tree or .reconfigure),
transfer ownership of those options to the associated zfsvfs_t.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18377
We're working to replace this, and it's easier to drop it outright while
we get set up.
To keep things compiling, the calls to zfsvfs_parse_options() are
replaced with zfsvfs_vfs_alloc(), though without any option parsing at
all nothing will work. That's ok, next commits are working towards it.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18377
In a few commits, we're going to need to allocate and free vfs_t from
zpl_super.c as well, so let's keep them uniform.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18377
This implements zoned_uid - a ZFS property that delegates dataset
visibility and administration to user namespaces owned by a specific
UID, enabling rootless Podman/Docker with native ZFS storage.
Usage: zfs set zoned_uid=1000 pool/dataset
Problem solved:
- zfs zone requires an existing namespace PID
- Podman creates a new namespace on each container start
- Solution: delegate to UID, any namespace owned by that UID is
authorized
Authorization model — three-layer additive (all must pass):
L0 (auth): Namespace owner UID matches zoned_uid property
L1 (dsl_deleg): Per-operation grants via `zfs allow` (when pool
delegation is ON — the default)
L2 (cap tier): Linux capability in the namespace determines the
operation class permitted
While CAP_SYS_ADMIN is a namespaced capability (the namespace owner
always holds it within their own user namespace), granting blanket
access based solely on its presence is contrary to the Principle of
Least Privilege. This change introduces tiered capability requirements
so that non-destructive operations (create, snapshot, set property)
require only CAP_FOWNER, while destructive operations (destroy, rename,
clone) continue to require CAP_SYS_ADMIN — both of which are namespaced
capabilities scoped to the user namespace, not the init namespace.
When pool delegation is OFF (non-default), all zoned_uid write
operations are denied — delegation OFF means the pool admin has
opted out of delegating access entirely.
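
The three layers can be sketched as a single predicate. Everything here is hypothetical (types, names, and the boolean inputs that stand in for the real capability and dsl_deleg lookups). One simplification is an assumption of the sketch: each operation class checks exactly the capability named above, and whether CAP_SYS_ADMIN alone would also satisfy the non-destructive tier is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical operation classes, following the list above. */
typedef enum {
	SKETCH_OP_CREATE, SKETCH_OP_SNAPSHOT, SKETCH_OP_SETPROP,
	SKETCH_OP_DESTROY, SKETCH_OP_RENAME, SKETCH_OP_CLONE
} sketch_op_t;

/* Hypothetical summary of what is known about the caller. */
typedef struct sketch_caller {
	uid_t ns_owner;		/* owner UID of the caller's user namespace */
	bool has_grant;		/* per-operation grant via `zfs allow` */
	bool cap_fowner;	/* CAP_FOWNER held in the namespace */
	bool cap_sys_admin;	/* CAP_SYS_ADMIN held in the namespace */
} sketch_caller_t;

static bool
sketch_op_is_destructive(sketch_op_t op)
{
	return (op == SKETCH_OP_DESTROY || op == SKETCH_OP_RENAME ||
	    op == SKETCH_OP_CLONE);
}

/* All three layers must pass for a write operation to be allowed. */
static bool
sketch_write_allowed(const sketch_caller_t *c, uid_t zoned_uid,
    bool delegation_on, sketch_op_t op)
{
	assert(c != NULL);
	if (c->ns_owner != zoned_uid)
		return (false);		/* L0: namespace owner mismatch */
	if (!delegation_on || !c->has_grant)
		return (false);		/* L1: delegation off or no grant */
	if (sketch_op_is_destructive(op))
		return (c->cap_sys_admin);	/* L2: destructive tier */
	return (c->cap_fowner);			/* L2: non-destructive tier */
}
```

The layers are additive in the sense that a failure at any one of them denies the operation, regardless of the others.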
Security model:
- Namespace owner UID must match zoned_uid value
- Delegation root cannot be destroyed or escaped via rename
- Namespace users cannot modify zoned_uid itself (only global
zone admin can manage delegation assignments)
- Namespace users cannot modify the 'zoned' property
- Namespace users cannot override filesystem_limit or
snapshot_limit set by the global admin on the delegation root
(but can impose tighter sub-limits on child datasets)
- Multi-UID isolation: sibling delegations with different UIDs
cannot access each other's subtrees
Kernel changes:
- zone_dataset_attach_uid()/detach_uid() in SPL
- zone_dataset_admin_check() for write authorization with tiered
capabilities (CAP_FOWNER for non-destructive, CAP_SYS_ADMIN
for destructive)
- Callback registration for zoned_uid property lookup
- New zfs_secpolicy_zoned_uid_deleg() helper that calls
dsl_deleg_access_impl() directly, bypassing zfs_dozonecheck_ds()
which requires the `zoned` property that zoned_uid datasets lack
- Fix dsl_deleg_access_impl() hierarchy walk to accept zoned_uid
datasets (not just zoned=on)
- Update all 9 secpolicy call sites to require dsl_deleg grants
instead of short-circuiting on ZONE_ADMIN_ALLOWED
- Security policy hooks in zfs_secpolicy_*() functions
- Fixed inglobalzone() to use current_user_ns()
- zfs_prop_set_special() handles attach/detach as property
side-effects, eliminating the need for dedicated ioctls
- spa_import_os() restores zoned_uid delegations kernel-side
on pool import via dmu_objset_find() walk
- spa_export_os() detaches zoned_uid delegations on pool
destroy/export, preventing stale kernel state on recreate
- zoned_uid registered as PROP_INHERIT so child datasets
inherit the delegation, enabling sub-dataset creation
- zfs_get_zoned_uid() uses dsl_prop_get setpoint to identify
the true delegation root, correctly distinguishing inherited
values from locally-set ones for destroy/rename policy checks
- zone_dataset_check_list() accepts '@' and '#' separators in
addition to '/' so snapshots and bookmarks are visible from
delegated namespaces
- zfs_secpolicy_setprop() blocks ZFS_PROP_ZONED_UID from being
set within a delegated namespace, preventing self-revocation
- zfs_secpolicy_setprop() blocks filesystem_limit and
snapshot_limit changes on the delegation root from within a
namespace (uses dsl_prop_get setpoint to identify the root),
while allowing delegated users to set tighter sub-limits on
child datasets
- Use kcred (not CRED()) for zone_dataset_detach_uid/attach_uid
in destroy and rename cleanup paths, preventing stale tracking
entries when namespace users perform these operations
- Use cr parameter (not CRED()) in all secpolicy zoned_uid
delegation checks for correct credential propagation
Userspace changes:
- check_parents() defers to kernel when zoned_uid set
FreeBSD compatibility:
- include/os/freebsd/spl/sys/zone.h — Added FreeBSD stubs:
- zone_uid_op_t enum (ZONE_OP_CREATE, SNAPSHOT, CLONE, DESTROY,
RENAME, SETPROP)
- zone_admin_result_t enum (NOT_APPLICABLE, ALLOWED, DENIED)
- zone_dataset_admin_check() — static inline, always returns
ZONE_ADMIN_NOT_APPLICABLE
- zone_dataset_attach_uid() — static inline, returns ENXIO
- zone_dataset_detach_uid() — static inline, returns ENXIO
- zone_get_zoned_uid_fn_t callback typedef
- zone_register_zoned_uid_callback() — static inline no-op
- zone_unregister_zoned_uid_callback() — static inline no-op
- On FreeBSD, every zone_dataset_admin_check() call returns
ZONE_ADMIN_NOT_APPLICABLE, causing all security policy functions
to fall through to existing jail-based permission checks
- Setting zoned_uid on FreeBSD returns ENXIO since user namespace
delegation requires Linux user namespaces
Test changes:
- Add grant_deleg() calls to tests 006-022 for operations that now
require explicit dsl_deleg grants
- Add tests 023-030 validating the capability tier model
- Add test 031 validating stale zone tracking cleanup after
namespace rename+destroy
- Fix capsh lookup in test helpers for ksh -p restricted PATH
(command -v + explicit /usr/sbin fallback)
- Add mountpoint=none to tests 023-026 to avoid mount-lock issues
in user namespaces
- Fix test 026 expectations to match kernel behavior (delegation
OFF denies all writes, allows read-only)
- run_in_userns helper resolves absolute zfs path to handle
environments where PATH does not include zfs (source builds)
- Test 004 updated: zoned_uid now inherits (PROP_INHERIT), test
verifies inheritance and override behavior
- Test 013 uses within_percent with parseable byte output (-Hp)
for robust quota value comparison across environments
- Test 014: verifies grandchild dataset creation from user
namespace, confirming inherited zoned_uid delegation works
- Test 015: pool destroy/recreate with zoned_uid delegation
- Test 016: individual snapshot destroy from namespace
- Test 017: namespace user cannot modify zoned_uid property
- Test 018: clone operations from within delegated namespace
- Test 019: multi-UID isolation between sibling delegations
- Test 020: operations without zone_dataset_admin_check()
integration are denied via zfs_dozonecheck_impl()
- Test 021: 'zoned' property cannot be modified from namespace
- Test 022: delegation root limit overrides blocked from namespace
- Quoted shell variables across all test scripts for robustness
- Shellcheck SC2155 fixes across all test scripts
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Colin K. Williams / LINK ORG LLC / li-nk.social <colin@li-nk.org>
Closes #18167
Parse range values with zfs_nicestrtonum() instead of strtoull()
so that -r accepts human-readable suffixes (K, M, G, T, P, E).
For example: zinject -r 1G,2G /pool/file
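The kind of parsing this enables can be sketched as below. This is not zfs_nicestrtonum() itself: parse_size() and parse_range() are hypothetical helpers that accept only an integer with an optional single-letter binary suffix, and the sketch assumes both ends of the range are given.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: parse "123" or "123K" .. "123E" (powers of 1024).
 * Returns 0 on success, -1 on malformed input or overflow.
 */
static int
parse_size(const char *s, uint64_t *out)
{
	static const char suffixes[] = "KMGTPE";
	char *end;
	unsigned long long v;

	assert(s != NULL && out != NULL);
	if (!isdigit((unsigned char)*s))
		return (-1);
	errno = 0;
	v = strtoull(s, &end, 10);
	if (errno != 0)
		return (-1);
	if (*end != '\0') {
		const char *p = strchr(suffixes, toupper((unsigned char)*end));
		unsigned int shift;

		/* Reject unknown suffixes and trailing characters. */
		if (p == NULL || end[1] != '\0')
			return (-1);
		shift = 10 * (unsigned int)(p - suffixes + 1);
		if (v > (UINT64_MAX >> shift))
			return (-1);
		v <<= shift;
	}
	*out = v;
	return (0);
}

/* Hypothetical helper: parse "start,end", e.g. "1G,2G". */
static int
parse_range(const char *arg, uint64_t *start, uint64_t *end)
{
	char buf[64];
	char *comma;
	size_t len;

	assert(arg != NULL);
	len = strlen(arg);
	if (len >= sizeof (buf))
		return (-1);
	memcpy(buf, arg, len + 1);
	comma = strchr(buf, ',');
	if (comma == NULL)
		return (-1);
	*comma = '\0';
	if (parse_size(buf, start) != 0 || parse_size(comma + 1, end) != 0)
		return (-1);
	return (0);
}
```

Under these assumptions, parse_range("1G,2G", ...) yields the byte offsets 1073741824 and 2147483648, while a plain strtoull() call would stop at the "G" and return 1.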
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christos Longros <chris.longros@gmail.com>
Closes #18374
While FreeBSD does not support relatime natively, it seems trivial
to implement it as a dataset property for consistency. To preserve
the status quo, change its default to off on FreeBSD. Now, if
explicitly enabled, it should actually work.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18385
The test was skipped on FreeBSD since 2023 (#14961) due to exceeding
the 10-minute CI timeout on FreeBSD 14. CI runs on the fork now show
the test completes well within limits:
FreeBSD 14.3-RELEASE: 10 seconds
FreeBSD 15.0-STABLE: 11 seconds
FreeBSD 16.0-CURRENT: 14 seconds
Remove the FreeBSD skip and the corresponding known skip entry.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Christos Longros <chris.longros@gmail.com>
Closes #18389
abd_alloc_from_pages() does not call abd_update_scatter_stats(),
since memory is not really allocated there. But abd_free_scatter(),
called by abd_free(), does. This imbalance makes some ABD, and
possibly ARC, counters underflow and go negative.
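The imbalance can be illustrated with a toy model. Everything here is hypothetical: model_scatter_bytes stands in for one of the affected counters, and the model_*() functions only mimic which paths touch it, not the real ABD code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one scatter statistic. */
static int64_t model_scatter_bytes;

/* Regular allocation path: accounts for the memory it allocates. */
static int64_t
model_alloc(int64_t size)
{
	assert(size > 0);
	model_scatter_bytes += size;
	return (model_scatter_bytes);
}

/* From-pages path: wraps existing pages, so it does no accounting. */
static int64_t
model_alloc_from_pages(int64_t size)
{
	assert(size > 0);
	return (model_scatter_bytes);
}

/* Free path: always subtracts, regardless of how the ABD was created. */
static int64_t
model_free(int64_t size)
{
	assert(size > 0);
	model_scatter_bytes -= size;
	return (model_scatter_bytes);
}
```

In this model an alloc/free pair leaves the counter at zero, while a from-pages/free pair leaves it at minus the buffer size, which is the drift described above.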
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@truenas.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18390
Starting with Fedora 42, Fedora has been working on merging /bin and
/sbin directories. See
https://fedoraproject.org/wiki/Changes/Unify_bin_and_sbin
To support this, make sure we do not put files into */sbin directories
on these distributions by respecting the distribution-set value of
%{_sbindir}.
In addition, explicitly set `mounthelperdir`, which affects the
placement of `mount.zfs`, which does not respect %{_sbindir} by
default. Making it point to %{_sbindir} will allow it to work correctly
on Fedora, while keeping its previous directory of `/sbin` on all other
distributions.
Note that files that used to reside in */sbin directories on Fedora will
stay accessible under these paths, as the distribution maintains
symlinks. No changes are needed to external scripts invoking these
binaries.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ralf Ertzinger <ralf@skytale.net>
Closes #18373
Some of the man pages (ex: `man/man8/zed.8`) are generated from `.in`
files, and `zed.8.in` *was* a dependency of `zed.8`, but `zed.8` was
not a dependency of `mancheck-...zed.8`. This usually worked anyway
because a full build had already been run; now it works regardless.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shelvacu on fw <git@shelvacu.com>
Closes #18346
Document the FreeBSD-specific zfs_arc_free_target tunable which
controls the number of free pages below which the ARC triggers
reclaim. Note its initialization from vm.v_free_target and its
distinction from the Linux-specific zfs_arc_sys_free.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Christos Longros <chris.longros@gmail.com>
Closes #18350
Currently, the asize of a draid vdev can decrease after a disk
replacement when the new disk is slightly smaller than the other
disks in the pool. In such situations, import would fail on this
check in vdev_open():
	/*
	 * Make sure the allocatable size hasn't shrunk too much.
	 */
	if (asize < vd->vdev_min_asize) {
		vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN,
		    VDEV_AUX_BAD_LABEL);
		return (SET_ERROR(EINVAL));
	}
Solution: fix vdev_draid_min_asize() so that it rounds the required
minimum disk capacity up to a multiple of VDEV_DRAID_ROWHEIGHT.
Replacements with disks smaller than this minimum are then refused,
which prevents the draid asize from shrinking.
Note: vdev_draid_open() also uses VDEV_DRAID_ROWHEIGHT when
calculating asize, which is why the minimum size must be rounded up
in vdev_draid_min_asize() to avoid asize drops.
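Why the round-up matters can be shown with a simplified model. This is a sketch under stated assumptions, not the actual vdev_draid code: EXAMPLE_ROWHEIGHT is an assumed 16 MiB stand-in for VDEV_DRAID_ROWHEIGHT, model_asize() assumes each child is truncated to whole rows, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value; stand-in for VDEV_DRAID_ROWHEIGHT. */
#define	EXAMPLE_ROWHEIGHT	(1ULL << 24)

/* Round x up to the next multiple of align. */
static uint64_t
round_up(uint64_t x, uint64_t align)
{
	assert(align != 0);
	return (((x + align - 1) / align) * align);
}

/* Simplified model: each child contributes only whole rows. */
static uint64_t
model_asize(uint64_t child_size, uint64_t ndisks)
{
	return ((child_size / EXAMPLE_ROWHEIGHT) * EXAMPLE_ROWHEIGHT * ndisks);
}

/* Minimum child size without rounding: plain ceiling division. */
static uint64_t
min_child_unrounded(uint64_t min_asize, uint64_t ndisks)
{
	assert(ndisks != 0);
	return ((min_asize + ndisks - 1) / ndisks);
}

/* Minimum child size rounded up to a whole number of rows. */
static uint64_t
min_child_rounded(uint64_t min_asize, uint64_t ndisks)
{
	return (round_up(min_child_unrounded(min_asize, ndisks),
	    EXAMPLE_ROWHEIGHT));
}
```

In this model, with 4 disks and a min_asize of 40 rows minus 4 bytes, the unrounded minimum admits a child that yields only 36 rows of asize, below min_asize. The rounded minimum rejects that child, and any child that passes yields at least 40 rows.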
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
Closes #18380