KekenoBSD/src

Author	SHA1	Message	Date
Alexander Motin	472ddca116	zed: Prefer spares with matching rotational and size Before this change zed tried to activate spares just in order they are stored in configuration, which is quite arbitrary. To make the result more optimal, sort the spares by their rotational status and size, so that the most fitting ones have better chances. To make it more visible, export the rotational status as a vdev property. While at it, minimally fix vdev properties reading for spare and L2ARC vdevs, having no ZAPs. To keep the rotational status for spare activation purposes when failed device is already gone, save it into the vdev config. The same is for spare vdevs asize. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18597	2026-05-28 15:14:26 -07:00
Alexander Motin	d65015938e	Vdev allocation bias/class change Normal, special and dedup vdevs differ only by space allocation bias. Normal and special vdevs might even legally store blocks targeted to other classes. Dedup vdevs don't normally do it, but there is no real reason why they can't. Considering this, it is not impossible to change the allocation bias for those vdevs. This change introduces a new top-level vdev property -- alloc_bias, reporting current bias for the vdev, and allowing to change it. This allows to easily change vdev role in a pool, especially if vdev removal is impossible. To not complicate the code, changes take effect only on next pool import. Changes to/from log vdev could also be theoretically possible, but they are artificially blocked for now, partially due to additional complications, and partially due to potential danger of placing other blocks on log vdevs, that would otherwise be non-fatal. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alek Pinchuk <alek.pinchuk@connectwise.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18493	2026-05-07 09:16:39 -07:00
Pranav P	2eee4ac1ea	Fix: draid autopkgtests fail on s390x architecture (Endianness Issue) The ioctl call to create the pool was returning -1 with errno EINVAL. Inside the module code, inside vdev_draid.c, verify_perms is calling fletcher_4_native_varsize. This in turn calls fletcher_4_scalar_native. So, implemented a fletcher_4_byteswap_varsize which makes use of the fletcher_4_scalar_byteswap in Big endian machines. Reviewed-by: Andriy Tkachuk <andriy.tkachuk@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pranav P <pranavsdreams@gmail.com> Closes #16261 Closes #18445	2026-04-22 09:53:48 -07:00
Andriy Tkachuk	d1b0a69825	draid: add failure domains support Currently, the only way to tolerate the failure of the whole enclosure is to configure several draid vdevs in the pool, each vdev having disks from different enclosures. But this essentially degrades draid to raidz and defeats the purpose of having fast sequential resilvering on wide pools with draid. This patch allows to configure several children groups in the same row in one draid vdev. In each such group, let's call it failure group, the user can configure disks belonging to different enclosures - failure domains. For example, in case of 10 such enclosures with 10 disks each, the user can put 1st disk from each enclosure into 1st group, 2nd disk from each enclosure into 2nd group, and so on. If one enclosure fails, only one disk from each group would fail, which won't affect draid operation, and each group would have enough redundancy to recover the stored data. Of course, in case of draid2 - two enclosures can fail at a time, in case of draid3 - three enclosures (provided there are no other disk failures in each group). In order to preserve fast sequential resilvering in case of a disk failure, the groups must share all disks between themselves, and this is achieved by shuffling the disks between the groups. But only i-th disks in each group are shuffled between themselves, i.e. the disks from the same enclosures, after that they are shuffled within each group, like it is done today in an ordinary draid. Thus, no more than one disk from any enclosure can appear in any failure group as a result of this shuffling. 
For example, here's what the pool status output looks like in case of two `draid1:2d:4c` failure groups: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 draid1:2d:4c:8w:1s-0 ONLINE 0 0 0 enc0d0 ONLINE 0 0 0 enc1d0 ONLINE 0 0 0 enc2d0 ONLINE 0 0 0 enc3d0 ONLINE 0 0 0 enc0d1 ONLINE 0 0 0 enc1d1 ONLINE 0 0 0 enc2d1 ONLINE 0 0 0 enc3d1 ONLINE 0 0 0 spares draid1-0-0 AVAIL The number of failure groups is specified indirectly via the new width parameter in draid vdev configuration descriptor, which is the total number of disks and which is a multiple of children in each group. This multiple is the number of groups (width / children). Doing it this way allows the user to conveniently see how many disks draid has in an instant. Spare disks are evenly distributed among failure groups, and they are shared by all groups. However, to support domain failure, we cannot have more than nparity - 1 failed disks in any group, even if they are rebuilt to draid spares (the blocks of those spares can be mapped to the disks from the failed domain, and we cannot tolerate more than nparity failures in any failure group). The retire agent in zed is updated to not start resilvering when the domain failure happens. Otherwise, it might take a lot of computing and I/O bandwidth resources, only to be wasted when the failed domain component is replaced. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com> Closes #11969 Closes #18148	2026-04-08 10:09:47 -07:00
li-nk.social	f203fedde8	Add zoned_uid property with additive least privilege authorization This implements zoned_uid - a ZFS property that delegates dataset visibility and administration to user namespaces owned by a specific UID, enabling rootless Podman/Docker with native ZFS storage. Usage: zfs set zoned_uid=1000 pool/dataset Problem solved: - zfs zone requires an existing namespace PID - Podman creates a new namespace on each container start - Solution: delegate to UID, any namespace owned by that UID is authorized Authorization model — three-layer additive (all must pass): L0 (auth): Namespace owner UID matches zoned_uid property L1 (dsl_deleg): Per-operation grants via `zfs allow` (when pool delegation is ON — the default) L2 (cap tier): Linux capability in the namespace determines the operation class permitted While CAP_SYS_ADMIN is a namespaced capability (the namespace owner always holds it within their own user namespace), granting blanket access based solely on its presence is contrary to the Principle of Least Privilege. This change introduces tiered capability requirements so that non-destructive operations (create, snapshot, set property) require only CAP_FOWNER, while destructive operations (destroy, rename, clone) continue to require CAP_SYS_ADMIN — both of which are namespaced capabilities scoped to the user namespace, not the init namespace. When pool delegation is OFF (non-default), all zoned_uid write operations are denied — delegation OFF means the pool admin has opted out of delegating access entirely. 
Security model: - Namespace owner UID must match zoned_uid value - Delegation root cannot be destroyed or escaped via rename - Namespace users cannot modify zoned_uid itself (only global zone admin can manage delegation assignments) - Namespace users cannot modify the 'zoned' property - Namespace users cannot override filesystem_limit or snapshot_limit set by the global admin on the delegation root (but can impose tighter sub-limits on child datasets) - Multi-UID isolation: sibling delegations with different UIDs cannot access each other's subtrees Kernel changes: - zone_dataset_attach_uid()/detach_uid() in SPL - zone_dataset_admin_check() for write authorization with tiered capabilities (CAP_FOWNER for non-destructive, CAP_SYS_ADMIN for destructive) - Callback registration for zoned_uid property lookup - New zfs_secpolicy_zoned_uid_deleg() helper that calls dsl_deleg_access_impl() directly, bypassing zfs_dozonecheck_ds() which requires the `zoned` property that zoned_uid datasets lack - Fix dsl_deleg_access_impl() hierarchy walk to accept zoned_uid datasets (not just zoned=on) - Update all 9 secpolicy call sites to require dsl_deleg grants instead of short-circuiting on ZONE_ADMIN_ALLOWED - Security policy hooks in zfs_secpolicy_*() functions - Fixed inglobalzone() to use current_user_ns() - zfs_prop_set_special() handles attach/detach as property side-effects, eliminating the need for dedicated ioctls - spa_import_os() restores zoned_uid delegations kernel-side on pool import via dmu_objset_find() walk - spa_export_os() detaches zoned_uid delegations on pool destroy/export, preventing stale kernel state on recreate - zoned_uid registered as PROP_INHERIT so child datasets inherit the delegation, enabling sub-dataset creation - zfs_get_zoned_uid() uses dsl_prop_get setpoint to identify the true delegation root, correctly distinguishing inherited values from locally-set ones for destroy/rename policy checks - zone_dataset_check_list() accepts '@' and '#' separators 
in addition to '/' so snapshots and bookmarks are visible from delegated namespaces - zfs_secpolicy_setprop() blocks ZFS_PROP_ZONED_UID from being set within a delegated namespace, preventing self-revocation - zfs_secpolicy_setprop() blocks filesystem_limit and snapshot_limit changes on the delegation root from within a namespace (uses dsl_prop_get setpoint to identify the root), while allowing delegated users to set tighter sub-limits on child datasets - Use kcred (not CRED()) for zone_dataset_detach_uid/attach_uid in destroy and rename cleanup paths, preventing stale tracking entries when namespace users perform these operations - Use cr parameter (not CRED()) in all secpolicy zoned_uid delegation checks for correct credential propagation Userspace changes: - check_parents() defers to kernel when zoned_uid set FreeBSD compatibility: - include/os/freebsd/spl/sys/zone.h — Added FreeBSD stubs: - zone_uid_op_t enum (ZONE_OP_CREATE, SNAPSHOT, CLONE, DESTROY, RENAME, SETPROP) - zone_admin_result_t enum (NOT_APPLICABLE, ALLOWED, DENIED) - zone_dataset_admin_check() — static inline, always returns ZONE_ADMIN_NOT_APPLICABLE - zone_dataset_attach_uid() — static inline, returns ENXIO - zone_dataset_detach_uid() — static inline, returns ENXIO - zone_get_zoned_uid_fn_t callback typedef - zone_register_zoned_uid_callback() — static inline no-op - zone_unregister_zoned_uid_callback() — static inline no-op - On FreeBSD, every zone_dataset_admin_check() call returns ZONE_ADMIN_NOT_APPLICABLE, causing all security policy functions to fall through to existing jail-based permission checks - Setting zoned_uid on FreeBSD returns ENXIO since user namespace delegation requires Linux user namespaces Test changes: - Add grant_deleg() calls to tests 006-022 for operations that now require explicit dsl_deleg grants - Add tests 023-030 validating the capability tier model - Add test 031 validating stale zone tracking cleanup after namespace rename+destroy - Fix capsh lookup in test helpers 
for ksh -p restricted PATH (command -v + explicit /usr/sbin fallback) - Add mountpoint=none to tests 023-026 to avoid mount-lock issues in user namespaces - Fix test 026 expectations to match kernel behavior (delegation OFF denies all writes, allows read-only) - run_in_userns helper resolves absolute zfs path to handle environments where PATH does not include zfs (source builds) - Test 004 updated: zoned_uid now inherits (PROP_INHERIT), test verifies inheritance and override behavior - Test 013 uses within_percent with parseable byte output (-Hp) for robust quota value comparison across environments - Test 014: verifies grandchild dataset creation from user namespace, confirming inherited zoned_uid delegation works - Test 015: pool destroy/recreate with zoned_uid delegation - Test 016: individual snapshot destroy from namespace - Test 017: namespace user cannot modify zoned_uid property - Test 018: clone operations from within delegated namespace - Test 019: multi-UID isolation between sibling delegations - Test 020: operations without zone_dataset_admin_check() integration are denied via zfs_dozonecheck_impl() - Test 021: 'zoned' property cannot be modified from namespace - Test 022: delegation root limit overrides blocked from namespace - Quoted shell variables across all test scripts for robustness - Shellcheck SC2155 fixes across all test scripts Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Colin K. Williams / LINK ORG LLC / li-nk.social <colin@li-nk.org> Closes #18167	2026-04-03 10:38:26 -07:00
Aditya Gollamudi	b481a8bbbf	Make zpool status dedup table support raw bytes -p output Check if -p flag is enabled, and if so print dedup table with raw bytes. Restructure the logic in zutil_pool to check if -p flag is enabled before printing either the bytes or raw numbers. Calls to print the data for DDT now all use zfs_nicenum_format(). Increased DDT histogram column buffers to 32 bytes to prevent truncation when -p is enabled. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Adi Gollamudi <adigollamudi@gmail.com> Closes #11626 Closes #17926	2026-03-13 09:53:56 -07:00
Rob Norris	62fa8bcb3c	abi: updates for mnttab cleanup Sponsored-by: TrueNAS Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18296	2026-03-10 13:07:07 -07:00
Ivan Shapovalov	2f3f1ab1ba	libzfs: teach zfs_create_ancestors() to accept properties This will be used to support creating non-mountable ancestors in zfs(8). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name> Closes #17000	2026-03-09 14:49:52 -07:00
Ryan Moeller	ac0fd40c8c	Add zpool properties for allocation class space The existing zpool properties accounting pool space (size, allocated, fragmentation, expandsize, free, capacity) are based on the normal metaslab class or are cumulative properties of several classes combined. Add properties reporting the space accounting metrics for each metaslab class individually. Also introduce pool-wide AVAIL, USABLE, and USED properties reporting values corresponding to FREE, SIZE, and ALLOC deflated for raidz. Update ZTS to recognize the new properties and validate reported values. While in zpool_get_parsable.cfg, add "fragmentation" to the list of parsable properties. Sponsored-by: Klara, Inc. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Ryan Moeller <ryan.moeller@klarasystems.com> Closes #18238	2026-03-02 15:50:23 -08:00
Alexander Motin	991fc56fae	Introduce dedupused/dedupsaved pool properties Currently there is only a dedup ratio reported via pool properties. If dedup is enabled only for some datasets, it is impossible to say how much space the ratio actually covers. Fix this by introducing dedupused/dedupsaved pool properties, similar to earlier added block cloning ones. Combined with work to expose allocation classes stats, it should give user-space enough visibility to correlate `zpool list` and `zfs list` space numbers. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Ryan Moeller <ryan.moeller@klarasystems.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18245	2026-02-25 09:41:38 -05:00
MigeljanImeri	4975430cf5	Add vdev property to disable vdev scheduler Added vdev property to disable the vdev scheduler. The intention behind this property is to improve IOPS performance when using O_DIRECT. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: MigeljanImeri <ImeriMigel@gmail.com> Closes #17358	2026-02-23 09:34:33 -08:00
Wolfgang Hoschek	c77f17b750	Add snapshots_changed_nsecs dataset property Add a read-only dataset property, snapshots_changed_nsecs, which exposes the nanosecond resolution version of snapshots_changed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Wolfgang Hoschek <wolfgang.hoschek@mac.com> Closes #17998 Closes #18031	2026-01-06 09:36:20 -08:00
Rob Norris	0d44b58d7f	libshare: fold into libzfs and reorg headers a little libzfs is the only user of libshare, and only internally, so there's no particular reason to build it separately, nor to export its symbols. So, pull it into libzfs proper, remove its "public" header, and hide its symbols. The bare minimum "public" API is just to count and enumerate the supported share types. These are moved to libzfs.h with the other share API. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18072	2025-12-19 19:52:33 -08:00
Ameer Hamza	88d012a1d6	Fix snapshot automount expiry cancellation deadlock A deadlock occurs when snapshot expiry tasks are cancelled while holding locks. The snapshot expiry task (snapentry_expire) spawns an umount process and waits for it to complete. Concurrently, ARC memory pressure triggers arc_prune which calls zfs_exit_fs(), attempting to cancel the expiry task while holding locks. The umount process spawned by the expiry task blocks trying to acquire locks held by arc_prune, which is blocked waiting for the expiry task to complete. This creates a circular dependency: expiry task waits for umount, umount waits for arc_prune, arc_prune waits for expiry task. Fix by adding non-blocking cancellation support to taskq_cancel_id(). The zfs_exit_fs() path calls zfsctl_snapshot_unmount_delay() to reschedule the unmount, which needs to cancel any existing expiry task. It now uses non-blocking cancellation to avoid waiting while holding locks, breaking the deadlock by returning immediately when the task is already running. The per-entry se_taskqid_lock has been removed, with all taskqid operations now protected by the global zfs_snapshot_lock held as WRITER. Additionally, an se_in_umount flag prevents recursive waits when zfsctl_destroy() is called during unmount. The taskqid is now only cleared by the caller on successful cancellation; running tasks clear their own taskqid upon completion. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #17941	2025-12-01 14:43:42 -08:00
Rob Norris	71609a9264	zfs: replace tpool with taskq They're basically the same thing; let's just carry one. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17948	2025-11-19 08:16:51 -08:00
Brian Behlendorf	6015edb374	lib: update ABI meta following libspl changes In theory they should not have resulted in a change. In practice, the way visibility is set up currently means that many of our convenience libraries will "leak through" into the available symbols in our public libraries. In this commit, we're seeing all the new symbols in libspl through libuutil, libzfs and libzfs_core. Importantly, none have been removed, so consumers of these libraries will not notice. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17861	2025-11-12 10:25:14 -08:00
Brian Behlendorf	cb6b249f8c	Update all ABI files Refresh all ABI files using the CI generated files to reflect the library interfaces to be published for the 2.4 release. Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17911	2025-11-12 09:39:00 -08:00
Mariusz Zaborski	02fdd26e51	Add knob to disable slow io notifications Introduce a new vdev property `VDEV_PROP_SLOW_IO_REPORTING` that allows users to disable notifications for slow devices. This prevents ZED and/or ZFSD from degrading the pool due to slow I/O. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mariusz Zaborski <oshogbo@FreeBSD.org> Closes #17477	2025-11-11 10:42:17 -08:00
Rob Norris	5605a6d79b	pool_iter_refresh: don't refresh pools twice In "all pools" mode, pool_iter_refresh() will call zpool_iter(), which will call zpool_refresh_stats() before calling add_pool(). If we already have the pool, this is a different handle, so we just release it and return. Back in pool_iter_refresh(), we then call zpool_stats_refresh() again for our handle on the same pool. All together, this means we're doing two ZFS_IOC_POOL_STATS calls into the kernel for every pool in the system. This isn't wrong, but it does double the pressure on global locks. Instead, we add a new function zpool_refresh_stats_from_handle() that simply copies the pool config and state from one handle to another, and use it to update our handle before we release it in add_pool(), so we only have one call per pool per interval. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17807	2025-10-03 14:39:09 -07:00
Paul Dagnelie	d64711c202	Detect a slow raidz child during reads A single slow responding disk can affect the overall read performance of a raidz group. When a raidz child disk is determined to be a persistent slow outlier, then have it sit out during reads for a period of time. The raidz group can use parity to reconstruct the data that was skipped. Each time a slow disk is placed into a sit out period, its `vdev_stat.vs_slow_ios count` is incremented and a zevent class `ereport.fs.zfs.delay` is posted. The length of the sit out period can be changed using the `raid_read_sit_out_secs` module parameter. Setting it to zero disables slow outlier detection. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Contributions-by: Don Brady <don.brady@klarasystems.com> Contributions-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17227	2025-09-10 15:25:03 -07:00
Alexander Motin	60f714e6e2	Implement physical rewrites Based on the previous commit, this implements the `zfs rewrite -P` flag, making ZFS keep blocks' logical birth times while rewriting files. It should exclude the rewritten blocks from incremental sends, snapshot diffs, etc. At the same time, snapshot space usage will reflect the additional space used by newly allocated blocks. Since this begins to use the new "rewrite" flag in the block pointers, this commit introduces a new read-compatible per-dataset feature physical_rewrite. It must be enabled for the command to not fail, it is activated on first use and deactivated on deletion of the last affected dataset. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17565	2025-08-06 10:36:56 -07:00
Mariusz Zaborski	894edd084e	Add TXG timestamp database This feature enables tracking of when TXGs are committed to disk, providing an estimated timestamp for each TXG. With this information, it becomes possible to perform scrubs based on specific date ranges, improving the granularity of data management and recovery operations. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #16853	2025-08-06 10:31:21 -07:00
Alexander Motin	f70c85086b	BRT: Fix ZAP entry endianness During original block cloning implementation a mistake was made, making BRT ZAP entries an array of 8 1-byte entries instead of 1 entry of 8 bytes. This makes the pools non-endian-safe. This commit introduces a new read-compatible pool feature "com.truenas:block_cloning_endian", fixing the endianness issue for new pools while maintaining compatibility with existing ones. The feature is automatically activated when creating the first BRT ZAP (ensuring we don't activate it on pools that already have BRT entries in the old format). When active, BRT entries are stored as single 8-byte values. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17572	2025-07-30 09:42:47 -07:00
Akash B	b6e8db509d	zpool/zfs: Add '-a|--all' option to scrub, trim, initialize Add support for the '-a | --all' option to perform trim, scrub, and initialize operations on all pools. Previously, specifying a pool name was mandatory for these operations. With this enhancement, users can now execute these operations across all pools at once, without needing to manually iterate over each pool from the command line. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Akash B <akash-b@hpe.com> Closes #17524	2025-07-29 14:50:44 -07:00
Rob Norris	cb9742e532	libspl: add API for manipulating tunables Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17537	2025-07-15 15:46:58 -07:00
Paul Dagnelie	a981cb69e4	Implement dynamic gang header sizes ZFS gang block headers are currently fixed at 512 bytes. This is increasingly wasteful in the era of larger disk sector sizes. This PR allows any size allocation to work as a gang header. It also contains supporting changes to ZDB to make gang headers easier to work with. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17004	2025-07-09 14:02:53 -07:00
Rob Norris	44e3266894	events: include zio type in IO error reports Usually the IO type can be inferred from the other fields (in particular, priority and flags), but sometimes it's not easy to see. This is just another little debug helper. May 27 2025 00:54:54.024110493 ereport.fs.zfs.data class = "ereport.fs.zfs.data" ena = 0x1f5ecfae600801 ... zio_delta = 0x0 zio_type = 0x2 [WRITE] zio_priority = 0x3 [ASYNC_WRITE] zio_objset = 0x0 Document zio_type and zio_priority. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17381	2025-05-30 10:29:29 -04:00
Rob Norris	06fa8f3f69	zfs_cmd: reorganise zfs_cmd_t to match original size `2aa3fbe761` extended zinject_record_t, and in doing so inadvertently extended zfs_cmd_t, which broke compatibility with userspace tools without the change. This fixes that by using some of the unused space in zfs_cmd_t for the extra fields. We also add an assert to trigger a compile error if the size ever changes. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17367	2025-05-27 20:01:06 -04:00
Ameer Hamza	2a91d577b1	Expose dataset encryption status via fast stat path In truenas_pylibzfs, we query the list of encrypted datasets several times, which is expensive. This commit exposes a public API zfs_is_encrypted() to get encryption status from the fast stat path without having to refresh the properties. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #17368	2025-05-26 22:11:03 -04:00
Tony Hutter	8d1489735b	nvlist: Add nvlist_snprintf() and zfs_dbgmsg_nvlist() Add nvlist_snprintf() to print a nvlist to a buffer. This is basically the snprintf() version of dump_nvlist(). Along with that, add a zfs_dbgmsg_nvlist() to print out an nvlist to dbgmsg. This will aid in debugging. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #17215	2025-04-18 09:22:16 -04:00
Ameer Hamza	6f6c504700	Show default quotas in zfs userspace tools Update zfs userspace, groupspace, and projectspace to display the default quotas when no per-ID specific quota is configured. This ensures tool outputs align with enforced limits. Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2025-04-03 10:36:45 -07:00
Ameer Hamza	2a8d9d9607	Add default user/group/project quota properties This adds default userquota, groupquota, and projectquota properties to MASTER_NODE_OBJ to make them accessible during zfsvfs_init() (regular DSL properties require dsl_config_lock, which cannot be safely acquired in this context). The zfs_fill_zplprops_impl() logic is updated to read these default properties directly from MASTER_NODE_OBJ. Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2025-04-03 10:35:22 -07:00
Mariusz Zaborski	4b4e346b9f	Add ability to scrub from last scrubbed txg Some users might want to scrub only new data because they would like to know if the new write wasn't corrupted. This PR adds the possibility to scrub only newly written data. This introduces a new `last_scrubbed_txg` property, indicating the transaction group (TXG) up to which the most recent scrub operation has checked and repaired the dataset, so users can run scrub only from the last saved point. We use scn_max_txg and scn_min_txg, which are already built into scrub, to accomplish that. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Sponsored-By: Wasabi Technology, Inc. Sponsored-By: Klara Inc. Closes #16301	2024-12-04 14:21:45 -05:00
Shengqi Chen	e8f0aa143e	Bump SONAME of libzfs and libzpool The ABI of libzfs and libzpool have breaking changes since last SONAME bump in commit `fe6babc`: * libzfs: `zpool_print_unsup_feat` removed (used by zpool cmd). * libzpool: multiple `ddt_*` symbols removed (used by zdb cmd). Bump them to avoid ABI breakage. See: https://github.com/openzfs/zfs/pull/11817 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Closes #16609	2024-10-06 14:49:33 -07:00
Rob Norris	224393a321	feature: large_microzap In `a4b21eadec` we added the zap_micro_max_size tuneable to raise the size at which "micro" (single-block) ZAPs are upgraded to "fat" (multi-block) ZAPs. Before this, a microZAP was limited to 128KiB, which was the old largest block size. The side effect of raising the max size past 128KiB is that the microZAP must be stored in a large block, requiring the large_blocks feature. Unfortunately, this means that a backup stream created without the --large-block (-L) flag to zfs send would split the microZAP block into smaller blocks and send those, as is normal behaviour for large blocks. This would be received correctly, but since microZAPs are limited to the first block in the object by definition, the entries in the later blocks would be inaccessible. For directory ZAPs, this gives the appearance of files being lost. This commit adds a feature flag, large_microzap, that must be enabled for microZAPs to grow beyond 128KiB, and which will be activated the first time that occurs. This feature is later checked when generating the stream and if active, the send operation will abort unless --large-block has also been requested. Changing the limit still requires zap_micro_max_size to be changed. The state of this flag effectively sets the upper value for this tuneable, that is, if the feature is disabled, the tuneable will be clamped to 128KiB. A stream flag is also added to ensure that the receiver also activates its own feature flag upon receiving the stream. This is not strictly necessary to _use_ the received microZAP, since it doesn't care how large its block is, but it is required to send the microZAP object on, otherwise the original problem occurs again. 
Because it's difficult to reliably distinguish a microZAP from a fatZAP from outside the ZAP code, and because it seems unlikely that most users are affected (a fairly niche tuneable combined with what should be an uncommon use of send), and for the sake of expediency, this change activates the feature the first time a microZAP grows to use a large block, and is never deactivated after that. This can be improved in the future. This commit changes nothing for existing pools that already have large microZAPs. The feature will not be retroactively applied, but will be activated the next time a microZAP grows past the limit. Don't use large_blocks feature for enable/disable tests. The large_microzap depends on large_blocks, so it gets enabled as a dependency, breaking the test. Instead use feature "longname", which has the exact same feature characteristics. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16593	2024-10-02 20:47:11 -07:00
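The two rules described in the large_microzap message (clamping the tuneable when the feature is disabled, and refusing to send without --large-block once the feature is active) can be sketched as below. This is only an illustration of the stated behaviour, not the OpenZFS code; the function names `effective_micro_max` and `send_allowed` are hypothetical.

```python
# Illustrative sketch only; names are hypothetical, not OpenZFS symbols.
LEGACY_MICROZAP_MAX = 128 * 1024  # 128KiB, the old largest block size


def effective_micro_max(zap_micro_max_size, feature_enabled):
    """Upper bound on microZAP size implied by the commit message.

    With large_microzap disabled, the tuneable is clamped to 128KiB.
    """
    if feature_enabled:
        return zap_micro_max_size
    return min(zap_micro_max_size, LEGACY_MICROZAP_MAX)


def send_allowed(feature_active, large_block_requested):
    """A send aborts if the feature is active and -L was not requested."""
    return (not feature_active) or large_block_requested
```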
Brian Behlendorf	e8cbb5952d	Update all ABI files Refresh all ABI files using the CI generated files as of commit `0cf14bf4b5`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #16592	2024-10-01 17:10:23 -07:00
Sanjeev Bagewadi	20232ecfaa	Support for longnames for files/directories (Linux part) This patch adds the ability for zfs to support file/dir names up to 1023 bytes. This number is chosen so we can support up to 255 4-byte characters. This new feature is represented by the new feature flag feature@longname. A new dataset property "longname" is also introduced to toggle longname support for each dataset individually. This property can be disabled, even if the dataset contains longname files. In such a case, new files cannot be created with long names, but existing longname files can still be looked up. Note that, to my knowledge, native Linux filesystems don't support names longer than 255 bytes. So there might be programs not able to work with longname. Note that the NFS server may need to use exportfs_get_name to reconnect dentries, and the buffer being passed is limited to NAME_MAX+1 (256). So NFS may not work when longname is enabled. Note, the FreeBSD vfs layer imposes a name length limit of 255, so even though we add code to support it here, it won't actually work. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #15921	2024-10-01 13:40:27 -07:00
Shengqi Chen	0ae4460c61	zcommon: add specialized versions of cityhash4 Specializing cityhash4 on 32-bit architectures can reduce the size of stack frames as well as instruction count. This is a tiny but useful optimization, since some callers invoke it frequently. When specializing into 1/2/3/4-arg versions, the stack usage (in bytes) on some 32-bit arches is listed as follows: - x86: 32, 32, 32, 40 - arm-v7a: 20, 20, 28, 36 - riscv: 0, 0, 0, 16 - power: 16, 16, 16, 32 - mipsel: 8, 8, 8, 24 And each actual argument (even if passing 0) contributes evenly to the number of multiplication instructions generated: - x86: 9, 12, 15, 18 - arm-v7a: 6, 8, 10, 12 - riscv / power: 12, 18, 20, 24 - mipsel: 9, 12, 15, 19 On 64-bit architectures, the tendencies are similar. But both stack sizes and instruction counts are significantly smaller thus negligible. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Closes #16131 Closes #16483	2024-09-19 15:18:59 -07:00
Brian Atkinson	a10e552b99	Adding Direct IO Support Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads. O_DIRECT support in ZFS will always ensure there is coherency between buffered and O_DIRECT IO requests. This ensures that all IO requests, whether buffered or direct, will see the same file contents at all times. Just as in other filesystems, O_DIRECT does not imply O_SYNC. While data is written directly to VDEV disks, metadata will not be synced until the associated TXG is synced. For both O_DIRECT read and write requests, the offset and request sizes, at a minimum, must be PAGE_SIZE aligned. In the event they are not, then EINVAL is returned unless the direct property is set to always (see below). For O_DIRECT writes: The request also must be block aligned (recordsize) or the write request will take the normal (buffered) write path. In the event that the request is block aligned and a cached copy of the buffer is in the ARC, then it will be discarded from the ARC, forcing all further reads to retrieve the data from disk. For O_DIRECT reads: The only alignment restriction is PAGE_SIZE alignment. In the event that the requested data is buffered (in the ARC), it will just be copied from the ARC into the user buffer. For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in the event that file contents are mmap'ed. In this case, all requests that are at least PAGE_SIZE aligned will just fall back to the buffered paths. If the request however is not PAGE_SIZE aligned, EINVAL will be returned as always, regardless of whether the file's contents are mmap'ed. Since O_DIRECT writes go through the normal ZIO pipeline, the following operations are supported just as with normal buffered writes: Checksum Compression Encryption Erasure Coding There is one caveat for the data integrity of O_DIRECT writes that is distinct for each of the OS's supported by ZFS. 
FreeBSD - FreeBSD is able to place user pages under write protection so any data in the user buffers and written directly down to the VDEV disks is guaranteed to not change. There is no concern with data integrity and O_DIRECT writes. Linux - Linux is not able to place anonymous user pages under write protection. Because of this, if the user decides to manipulate the page contents while the write operation is occurring, data integrity can not be guaranteed. However, there is a module parameter `zfs_vdev_direct_write_verify` that controls whether O_DIRECT writes to a top-level VDEV are checksum-verified before the contents of the I/O buffer are committed to disk. In the event of a checksum verification failure the write will return EIO. The number of O_DIRECT write checksum verification errors can be observed by doing `zpool status -d`, which will list all verification errors that have occurred on a top-level VDEV. Along with `zpool status`, a ZED event will be issued as `dio_verify` when a checksum verification error occurs. ZVOLs and dedup are not currently supported with Direct I/O. A new dataset property `direct` has been added with the following 3 allowable values: disabled - Accepts O_DIRECT flag, but silently ignores it and treats the request as a buffered IO request. standard - Follows the alignment restrictions outlined above for write/read IO requests when the O_DIRECT flag is used. always - Treats every write/read IO request as though it passed O_DIRECT and will do O_DIRECT if the alignment restrictions are met otherwise will redirect through the ARC. This property will not allow a request to fail. There is also a module parameter zfs_dio_enabled that can be used to force all reads and writes through the ARC. By setting this module parameter to 0, it mimics as if the direct dataset property is set to disabled. 
Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Co-authored-by: Mark Maybee <mark.maybee@delphix.com> Co-authored-by: Matt Macy <mmacy@FreeBSD.org> Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov> Closes #10018	2024-09-14 13:47:59 -07:00
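The alignment and `direct` property rules in the Direct IO message can be summarised as a small decision function. This is a hedged sketch of the rules as written there, not the ZFS implementation: the function name `dio_path`, its parameters, the default page and record sizes, and the reading of "block aligned" as "offset and size are both multiples of recordsize" are all assumptions made for illustration.

```python
# Illustrative sketch only; dio_path and its parameters are hypothetical.
def dio_path(offset, size, *, is_write, o_direct, direct="standard",
             page_size=4096, recordsize=128 * 1024, mmapped=False):
    """Return "direct", "buffered" or "EINVAL" for one I/O request."""
    if direct == "disabled":
        return "buffered"            # O_DIRECT accepted but silently ignored
    if not (o_direct or direct == "always"):
        return "buffered"            # ordinary buffered request
    page_aligned = offset % page_size == 0 and size % page_size == 0
    if not page_aligned:
        # direct=always never lets a request fail; standard returns EINVAL,
        # even when the file is mmap'ed.
        return "buffered" if direct == "always" else "EINVAL"
    if mmapped:
        return "buffered"            # O_DIRECT is ignored for mmap'ed files
    if is_write and (offset % recordsize or size % recordsize):
        return "buffered"            # writes must also be block aligned
    return "direct"
```

For example, under these assumptions a page-aligned 4KiB O_DIRECT write on a 128KiB-recordsize dataset takes the buffered path, while the same request as a read goes direct.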
Rob Norris	82ff9aafd6	value strings: pretty printers for flags and enums This adds zfs_valstr, a collection of pretty printers for bitfields and enums. These are useful in debugging, logging and other display contexts where raw values are difficult for the untrained (or even trained!) eye to decipher. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-09-05 13:40:05 -07:00
Don Brady	d4d79451cb	Add DDT prune command Requires the new 'flat' physical data which has the start time for a class entry. The amount to prune can be based on a target percentage of the unique entries or based on the age (i.e., every entry older than N days). Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@klarasystems.com> Closes #16277	2024-09-04 14:17:02 -07:00
Mateusz Piotrowski	6be8bf5552	zpool: Provide GUID to zpool-reguid(8) with -g (#16239) This commit extends the zpool-reguid(8) command with a -g flag, which allows the user to specify the GUID to set. This change also adds some general tests for zpool-reguid(8). Sponsored-by: Wasabi Technology, Inc. Sponsored-by: Klara, Inc. Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-08-26 09:27:24 -07:00
Rob Norris	db2b1fdb79	ddt: add FDT feature and support for legacy and new on-disk formats This is the supporting infrastructure for the upcoming dedup features. Traditionally, dedup objects live directly in the MOS root. While their details vary (checksum, type and class), they are all the same "kind" of thing - a store of dedup entries. The new features are more varied than that, and are better thought of as a set of related stores for the overall state of a dedup table. This adds a new feature flag, SPA_FEATURE_FAST_DEDUP. Enabling this will cause new DDTs to be created as a ZAP in the MOS root, named DDT-<checksum>. This is used as the root object for the normal type/class store objects, but will also be a place for any storage required by new features. This commit adds two new fields to ddt_t, for version and flags. These are intended to describe the structure and features of the overall dedup table, and are stored as-is in the DDT root. In this commit, flags are always zero, but the intent is that they can be used to hang optional logic or state onto for new dedup features. Version is always 1. For a "legacy" dedup table, where no DDT root directory exists, the version will be 0. ddt_configure() is expected to determine the version and flags features currently in operation based on whether or not the fast_dedup feature is enabled, and from what's available on disk. In this way, it's possible to support both old and new tables. This also provides a migration path. A legacy setup can be upgraded to FDT by creating the DDT root ZAP, moving the existing objects into it, and setting version and flags appropriately. There's no support for that here, but it would be straightforward to add later and allows the possibility that newer features could be applied to existing dedup tables. 
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15892	2024-08-16 11:58:59 -07:00
Tony Hutter	02a9f7fed7	JSON: Fix class values for mirrored special vdevs This fixes things so mirrored special vdevs report themselves as "class=special" rather than "class=normal". This happens due to the way the vdev nvlists are constructed: for mirrored special devices, the 'mirror' vdev has allocation bias "special" and its leaf vdevs are "normal"; for single or RAID0 special devices, the leaf vdevs have allocation bias "special". This commit adds in code to check if a leaf's parent is a "special" vdev to see if it should also report "special". Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Umer Saleem <usaleem@ixsystems.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #16217	2024-08-06 12:47:58 -07:00
Umer Saleem	959e963c81	JSON output support for zpool status This commit adds support for the zpool status command to display the status of ZFS pools in JSON format using the '-j' option. Status information is collected in an nvlist which is later dumped on stdout in JSON format. Existing options for zpool status work with the '-j' flag. The man page for zpool status is updated accordingly. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Umer Saleem <usaleem@ixsystems.com> Closes #16217	2024-08-06 12:47:10 -07:00
Allan Jude	62e7d3c89e	ddt: add support for prefetching tables into the ARC This change adds a new `zpool prefetch -t ddt $pool` command which causes a pool's DDT to be loaded into the ARC. The primary goal is to remove the need to "warm" a pool's cache before deduplication stops slowing write performance. It may also provide a way to reload portions of a DDT if they have been flushed due to inactivity. Sponsored-by: iXsystems, Inc. Sponsored-by: Catalogics, Inc. Sponsored-by: Klara, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Will Andrews <will.andrews@klarasystems.com> Signed-off-by: Fred Weigel <fred.weigel@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Don Brady <don.brady@klarasystems.com> Co-authored-by: Will Andrews <will.andrews@klarasystems.com> Co-authored-by: Don Brady <don.brady@klarasystems.com> Closes #15890	2024-07-26 09:16:18 -07:00
Allan Jude	c7ada64bb6	ddt: dedup table quota enforcement This adds two new pool properties: - dedup_table_size, the total size of all DDTs on the pool; and - dedup_table_quota, the maximum possible size of all DDTs in the pool When set, quota will be enforced by checking when a new entry is about to be created. If the pool is over its dedup quota, the entry won't be created, and the corresponding write will be converted to a regular non-dedup write. Note that existing entries can be updated (ie their refcounts changed), as that reuses the space rather than requiring more. dedup_table_quota can be set to 'auto', which will set it based on the size of the devices backing the "dedup" allocation device. This makes it possible to limit the DDTs to the size of a dedup vdev only, such that when the device fills, no new blocks are deduplicated. Sponsored-by: iXsystems, Inc. Sponsored-By: Klara Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Don Brady <don.brady@klarasystems.com> Co-authored-by: Don Brady <don.brady@klarasystems.com> Co-authored-by: Rob Wing <rob.wing@klarasystems.com> Co-authored-by: Sean Eric Fagan <sean.fagan@klarasystems.com> Closes #15889	2024-07-25 09:47:36 -07:00
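The quota check described in the dedup table quota message can be sketched as follows. This is an illustration of the stated behaviour only: the function name `dedup_write_action`, the use of `None` for "no quota set", and treating "over its dedup quota" as `ddt_size >= quota` are assumptions, not details taken from the commit.

```python
# Illustrative sketch only; dedup_write_action is a hypothetical name.
def dedup_write_action(ddt_size, quota, entry_exists):
    """Decide how a dedup-enabled write is handled under a DDT quota.

    ddt_size: total size of all DDTs on the pool (dedup_table_size)
    quota: dedup_table_quota in bytes, or None when no quota is set
    entry_exists: True if the block already has a DDT entry
    """
    if entry_exists:
        # Updating a refcount reuses space, so it is always permitted.
        return "dedup"
    if quota is not None and ddt_size >= quota:
        # No new entry is created; the write becomes a regular write.
        return "regular"
    return "dedup"
```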
Don Brady	fb6d8cf229	Add some missing vdev properties (#16346) Sponsored-by: Klara, Inc. Sponsored-By: Wasabi Technology, Inc. Signed-off-by: Don Brady <don.brady@klarasystems.com> Co-authored-by: Don Brady <don.brady@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-07-23 16:34:09 -07:00
Rob Norris	3974ef045e	libspl: lift backtrace into a separate file If it's going to be used directly by zdb/ztest, then it sort of doesn't make sense to carry it with the assert code. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16181	2024-05-14 09:48:45 -07:00
Alan Somers	b64afa41d5	Better control the thread pool size when mounting datasets Ever since `a10d50f999`, ZFS has mounted file systems in parallel when importing a pool. It uses a fixed size of 512 for the thread pool. But since `c183d164aa`, it has also imported pools in parallel. So the total number of threads at one time is 513 * npools + 1. That can easily exceed the system's limit on the number of threads per process, which will cause one or more pools to be unable to allocate any worker threads, forcing them to fall back to slow serial mounting. To forestall that, manage the threadpool size in /sbin/zpool, not libzfs. Use the same size (512), but divided by the number of pools. This is a backwards-incompatible change to the libzfs ABI. Sponsored by: Axcient Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Signed-off-by: Alan Somers <asomers@FreeBSD.org> Closes #16178	2024-05-14 09:36:21 -07:00
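The thread arithmetic in the thread pool message can be worked through with a short sketch. The formula `513 * npools + 1` and the base size of 512 come from the message; the function names, the integer division, and the floor of one worker per pool are assumptions for illustration.

```python
# Illustrative sketch only; function names are hypothetical.
MOUNT_THREADS = 512  # fixed mount thread pool size named in the message


def threads_before(npools):
    """Peak threads before the change: 513 per pool, plus one."""
    return (MOUNT_THREADS + 1) * npools + 1


def workers_per_pool_after(npools):
    """Per-pool worker count after the change: 512 divided by pool count.

    Integer division and the minimum of 1 are assumptions.
    """
    return max(1, MOUNT_THREADS // npools)
```

Under these assumptions, importing 4 pools used up to 2053 threads before the change, and each pool gets 128 mount workers after it.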

112 Commits