From 5a03e358fc6a8cdcbac8fdf606ff6a8cd3837b60 Mon Sep 17 00:00:00 2001 From: Brian Behlendorf Date: Mon, 20 Oct 2025 06:26:51 -0700 Subject: [PATCH 01/16] Update device removal documentation Make a minor update to the 'zpool remove' man page to clarify both raidz and draid pools do not support removal, and change sector to ashift which is what we actually care about. Update the big theory comment in vdev_removal.c to accurately reflect which types of vdevs can be removed. Furthermore, I've added some discussion for the casual reader to briefly explain the top-level vdev removal restrictions. This has been a common area of confusion and it's not intuitive where they come from without understanding the implementation details. Signed-off-by: Brian Behlendorf Reviewed-by: George Melikov Reviewed-by: Alexander Motin Closes #17847 --- man/man8/zpool-remove.8 | 4 +- module/zfs/vdev_removal.c | 80 ++++++++++++++++++++++++++++----------- 2 files changed, 60 insertions(+), 24 deletions(-) diff --git a/man/man8/zpool-remove.8 b/man/man8/zpool-remove.8 index 4d5fc431d33..483d885f10f 100644 --- a/man/man8/zpool-remove.8 +++ b/man/man8/zpool-remove.8 @@ -58,8 +58,8 @@ This command supports removing hot spare, cache, log, and both mirrored and non-redundant primary top-level vdevs, including dedup and special vdevs. .Pp Top-level vdevs can only be removed if the primary pool storage does not contain -a top-level raidz vdev, all top-level vdevs have the same sector size, and the -keys for all encrypted datasets are loaded. +a top-level raidz or draid vdev, all top-level vdevs have the same ashift size, +and the keys for all encrypted datasets are loaded. .Pp Removing a top-level vdev reduces the total amount of space in the storage pool. The specified device will be evacuated by copying all allocated space from it to diff --git a/module/zfs/vdev_removal.c b/module/zfs/vdev_removal.c index 2ce0121324a..8ea53e735a9 100644 --- a/module/zfs/vdev_removal.c +++ b/module/zfs/vdev_removal.c @@ -51,34 +51,70 @@ #include /* - * This file contains the necessary logic to remove vdevs from a - * storage pool. Currently, the only devices that can be removed - * are log, cache, and spare devices; and top level vdevs from a pool - * w/o raidz or mirrors. (Note that members of a mirror can be removed - * by the detach operation.) + * This file contains the necessary logic to remove vdevs from a storage + * pool. Note that members of a mirror can be removed by the detach + * operation. Currently, the only devices that can be removed are: * - * Log vdevs are removed by evacuating them and then turning the vdev - * into a hole vdev while holding spa config locks. + * 1) Traditional hot spare and cache vdevs. Note that draid distributed + * spares are fixed at creation time and cannot be removed. * - * Top level vdevs are removed and converted into an indirect vdev via - * a multi-step process: + * 2) Log vdevs are removed by evacuating them and then turning the vdev + * into a hole vdev while holding spa config locks. * - * - Disable allocations from this device (spa_vdev_remove_top). + * 3) Top-level singleton and mirror vdevs, including dedup and special + * vdevs, are removed and converted into an indirect vdev via a + * multi-step process: * - * - From a new thread (spa_vdev_remove_thread), copy data from - * the removing vdev to a different vdev. The copy happens in open - * context (spa_vdev_copy_impl) and issues a sync task - * (vdev_mapping_sync) so the sync thread can update the partial - * indirect mappings in core and on disk. + * - Disable allocations from this device (spa_vdev_remove_top). * - * - If a free happens during a removal, it is freed from the - * removing vdev, and if it has already been copied, from the new - * location as well (free_from_removing_vdev). + * - From a new thread (spa_vdev_remove_thread), copy data from the + * removing vdev to a different vdev. The copy happens in open context + * (spa_vdev_copy_impl) and issues a sync task (vdev_mapping_sync) so + * the sync thread can update the partial indirect mappings in core + * and on disk. * - * - After the removal is completed, the copy thread converts the vdev - * into an indirect vdev (vdev_remove_complete) before instructing - * the sync thread to destroy the space maps and finish the removal - * (spa_finish_removal). + * - If a free happens during a removal, it is freed from the removing + * vdev, and if it has already been copied, from the new location as + * well (free_from_removing_vdev). + * + * - After the removal is completed, the copy thread converts the vdev + * into an indirect vdev (vdev_remove_complete) before instructing + * the sync thread to destroy the space maps and finish the removal + * (spa_finish_removal). + * + * The following constraints currently apply primary device removal: + * + * - All vdevs must be online, healthy, and not be missing any data + * according to the DTLs. + * + * - When removing a singleton or mirror vdev, regardless of it's a + * special, dedup, or primary device, it must have the same ashift + * as the devices in the normal allocation class. Furthermore, all + * vdevs in the normal allocation class must have the same ashift to + * ensure the new allocations never includes additional padding. + * + * - The normal allocation class cannot contain any raidz or draid + * top-level vdevs since segments are copied without regard for block + * boundaries. This makes it impossible to calculate the required + * parity columns when using these vdev types as the destination. + * + * - The encryption keys must be loaded so the ZIL logs can be reset + * in order to prevent writing to the device being removed. + * + * N.B. ashift and raidz/draid constraints for primary top-level device + * removal could be slightly relaxed if it were possible to request that + * DVAs from a mirror or singleton in the specified allocation class be + * used (metaslab_alloc_dva). + * + * This flexibility would be particularly useful for raidz/draid pools which + * often include a mirrored special device. If a mistakenly added top-level + * singleton were added it could then still be removed at the cost of some + * special device capacity. This may be a worthwhile tradeoff depending on + * the pool capacity and expense (cost, complexity, time) of creating a new + * pool and copying all of the data to correct the configuration. + * + * Furthermore, while not currently supported it should be possible to allow + * vdevs of any type to be removed as long as they've never been written to. */ typedef struct vdev_copy_arg { From 783a02b5d3d866e04d25ec6c35a68143e77a95c3 Mon Sep 17 00:00:00 2001 From: Andrew Walker Date: Mon, 20 Oct 2025 08:28:57 -0500 Subject: [PATCH 02/16] Fix ZFS_READONLY implementation on Linux MS-FSCC 2.6 is the governing document for DOS attribute behavior. It specifies the following: For a file, applications can read the file but cannot write to it or delete it. For a directory, applications cannot delete it, but applications can create and delete files from the directory. Signed-off-by: Andrew Walker Reviewed-by: Alexander Motin Reviewed-by: Ameer Hamza Reviewed-by: Allan Jude Reviewed-by: Rob Norris Closes #17837 --- module/os/linux/zfs/zfs_acl.c | 2 +- module/os/linux/zfs/zfs_vnops_os.c | 5 +---- 2 files changed, 2 insertions(+), 5 deletions(-) diff --git a/module/os/linux/zfs/zfs_acl.c b/module/os/linux/zfs/zfs_acl.c index daa4b577683..934d74a112f 100644 --- a/module/os/linux/zfs/zfs_acl.c +++ b/module/os/linux/zfs/zfs_acl.c @@ -2524,7 +2524,7 @@ zfs_zaccess_common(znode_t *zp, uint32_t v4_mode, uint32_t *working_mode, * Also note: DOS R/O is ignored for directories. */ if ((v4_mode & WRITE_MASK_DATA) && - S_ISDIR(ZTOI(zp)->i_mode) && + !S_ISDIR(ZTOI(zp)->i_mode) && (zp->z_pflags & ZFS_READONLY)) { return (SET_ERROR(EPERM)); } diff --git a/module/os/linux/zfs/zfs_vnops_os.c b/module/os/linux/zfs/zfs_vnops_os.c index 6106726651a..e845ad69ad7 100644 --- a/module/os/linux/zfs/zfs_vnops_os.c +++ b/module/os/linux/zfs/zfs_vnops_os.c @@ -2033,10 +2033,7 @@ zfs_setattr(znode_t *zp, vattr_t *vap, int flags, cred_t *cr, zidmap_t *mnt_ns) goto out3; } - if ((mask & ATTR_SIZE) && (zp->z_pflags & ZFS_READONLY)) { - err = SET_ERROR(EPERM); - goto out3; - } + /* ZFS_READONLY will be handled in zfs_zaccess() */ /* * Verify timestamps doesn't overflow 32 bits. From 3ea8ca8c0f955a99ac87eb01a7201114a2d5f15f Mon Sep 17 00:00:00 2001 From: Shreshth3 <66148173+Shreshth3@users.noreply.github.com> Date: Mon, 20 Oct 2025 06:30:20 -0700 Subject: [PATCH 03/16] zdb: fix bug with -A flag Fixes #10544. According to the manpage, zdb -A should ignore all assertions. But it currently does not do that. This commit fixes this bug. Signed-off-by: Shreshth Srivastava Reviewed-by: Rob Norris Reviewed-by: Alexander Motin Closes #17825 --- cmd/zdb/zdb.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/cmd/zdb/zdb.c b/cmd/zdb/zdb.c index 1ffb8ebc70f..2560ad045db 100644 --- a/cmd/zdb/zdb.c +++ b/cmd/zdb/zdb.c @@ -9747,6 +9747,9 @@ main(int argc, char **argv) */ spa_mode_readable_spacemaps = B_TRUE; + libspl_set_assert_ok((dump_opt['A'] == 1) || (dump_opt['A'] > 2)); + zfs_recover = (dump_opt['A'] > 1); + if (dump_all) verbose = MAX(verbose, 1); @@ -9757,9 +9760,6 @@ main(int argc, char **argv) dump_opt[c] += verbose; } - libspl_set_assert_ok((dump_opt['A'] == 1) || (dump_opt['A'] > 2)); - zfs_recover = (dump_opt['A'] > 1); - argc -= optind; argv += optind; if (argc < 2 && dump_opt['R']) From adacf020ce3407b55b824519d56410ae08eec8de Mon Sep 17 00:00:00 2001 From: Andrew Walker Date: Mon, 20 Oct 2025 17:21:40 -0500 Subject: [PATCH 04/16] Fix return value for setting zvol threading We must return -1 instead of ENOENT if the special zvol threading property set function can't locate the dataset (this would typically happen with an encypted and unmounted zvol) so that the operation gets inserted properly into the nvlist for operations to set. This is because we want the property to be set once the zvol is decrypted again. Reviewed-by: Allan Jude Reviewed-by: Rob Norris Reviewed-by: Ameer Hamza Reviewed-by: Brian Behlendorf Reviewed-by: Alexander Motin Signed-off-by: Andrew Walker Closes #17836 --- module/zfs/zvol.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/module/zfs/zvol.c b/module/zfs/zvol.c index faced0db7e9..00f98168d3d 100644 --- a/module/zfs/zvol.c +++ b/module/zfs/zvol.c @@ -410,7 +410,7 @@ zvol_set_volthreading(const char *name, boolean_t value) { zvol_state_t *zv = zvol_find_by_name(name, RW_NONE); if (zv == NULL) - return (SET_ERROR(ENOENT)); + return (-1); zv->zv_threading = value; mutex_exit(&zv->zv_state_lock); return (0); From 9d50ee59dc13dbb376ec738808da9d95226b44fe Mon Sep 17 00:00:00 2001 From: Rob Norris Date: Fri, 12 Sep 2025 09:57:53 +1000 Subject: [PATCH 05/16] Linux 6.18: replace nth_page() Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris --- module/os/linux/zfs/abd_os.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/module/os/linux/zfs/abd_os.c b/module/os/linux/zfs/abd_os.c index 8a8316f63c4..18f2426fbbf 100644 --- a/module/os/linux/zfs/abd_os.c +++ b/module/os/linux/zfs/abd_os.c @@ -23,6 +23,7 @@ * Copyright (c) 2014 by Chunwei Chen. All rights reserved. * Copyright (c) 2019 by Delphix. All rights reserved. * Copyright (c) 2023, 2024, Klara Inc. + * Copyright (c) 2025, Rob Norris */ /* @@ -1109,6 +1110,14 @@ abd_return_buf_copy(abd_t *abd, void *buf, size_t n) #define ABD_ITER_PAGE_SIZE(page) (PAGESIZE) #endif +#ifndef nth_page +/* + * Since 6.18 nth_page() no longer exists, and is no longer required to iterate + * within a single SG entry, so we replace it with a simple addition. + */ +#define nth_page(p, n) ((p)+(n)) +#endif + void abd_iter_page(struct abd_iter *aiter) { From 5de4a297e7ab0f8062b550acc4e76308c675f653 Mon Sep 17 00:00:00 2001 From: Rob Norris Date: Fri, 12 Sep 2025 10:03:07 +1000 Subject: [PATCH 06/16] Linux 6.18: convert ida_simple_* calls ida_simple_get() and ida_simple_remove() are removed in 6.18. However, since 4.19 they have been simple wrappers around ida_alloc() and ida_free(), so we can just use those directly. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris --- module/os/linux/zfs/zvol_os.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/module/os/linux/zfs/zvol_os.c b/module/os/linux/zfs/zvol_os.c index 4e66bee7744..3e458e14c92 100644 --- a/module/os/linux/zfs/zvol_os.c +++ b/module/os/linux/zfs/zvol_os.c @@ -21,7 +21,7 @@ */ /* * Copyright (c) 2012, 2020 by Delphix. All rights reserved. - * Copyright (c) 2024, Rob Norris + * Copyright (c) 2024, 2025, Rob Norris * Copyright (c) 2024, 2025, Klara, Inc. */ @@ -1500,7 +1500,7 @@ zvol_os_remove_minor(zvol_state_t *zv) if (zso->use_blk_mq) blk_mq_free_tag_set(&zso->tag_set); - ida_simple_remove(&zvol_ida, MINOR(zso->zvo_dev) >> ZVOL_MINOR_BITS); + ida_free(&zvol_ida, MINOR(zso->zvo_dev) >> ZVOL_MINOR_BITS); kmem_free(zso, sizeof (struct zvol_state_os)); @@ -1655,7 +1655,7 @@ zvol_os_create_minor(const char *name) if (zvol_inhibit_dev) return (0); - idx = ida_simple_get(&zvol_ida, 0, 0, kmem_flags_convert(KM_SLEEP)); + idx = ida_alloc(&zvol_ida, kmem_flags_convert(KM_SLEEP)); if (idx < 0) return (SET_ERROR(-idx)); minor = idx << ZVOL_MINOR_BITS; @@ -1663,7 +1663,7 @@ zvol_os_create_minor(const char *name) /* too many partitions can cause an overflow */ zfs_dbgmsg("zvol: create minor overflow: %s, minor %u/%u", name, minor, MINOR(minor)); - ida_simple_remove(&zvol_ida, idx); + ida_free(&zvol_ida, idx); return (SET_ERROR(EINVAL)); } @@ -1671,7 +1671,7 @@ zvol_os_create_minor(const char *name) if (zv) { ASSERT(MUTEX_HELD(&zv->zv_state_lock)); mutex_exit(&zv->zv_state_lock); - ida_simple_remove(&zvol_ida, idx); + ida_free(&zvol_ida, idx); return (SET_ERROR(EEXIST)); } @@ -1771,7 +1771,7 @@ zvol_os_create_minor(const char *name) rw_exit(&zvol_state_lock); error = zvol_os_add_disk(zv->zv_zso->zvo_disk); } else { - ida_simple_remove(&zvol_ida, idx); + ida_free(&zvol_ida, idx); } return (error); From 39db4bda8078eb83776ad7ac90ecb6cdcbd083eb Mon Sep 17 00:00:00 2001 From: Rob Norris Date: Fri, 12 Sep 2025 10:23:28 +1000 Subject: [PATCH 07/16] Linux 6.18: block_device_operations->getgeo takes struct gendisk* Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris --- config/kernel-block-device-operations.m4 | 34 ++++++++++++++++++++++++ module/os/linux/zfs/zvol_os.c | 20 +++++++++++--- 2 files changed, 51 insertions(+), 3 deletions(-) diff --git a/config/kernel-block-device-operations.m4 b/config/kernel-block-device-operations.m4 index 4ff20b9c413..1905340a9c7 100644 --- a/config/kernel-block-device-operations.m4 +++ b/config/kernel-block-device-operations.m4 @@ -119,15 +119,49 @@ AC_DEFUN([ZFS_AC_KERNEL_BLOCK_DEVICE_OPERATIONS_REVALIDATE_DISK], [ ]) ]) +dnl # +dnl # 6.18 API change +dnl # block_device_operation->getgeo takes struct gendisk* as first arg +dnl # +AC_DEFUN([ZFS_AC_KERNEL_SRC_BLOCK_DEVICE_OPERATIONS_GETGEO_GENDISK], [ + ZFS_LINUX_TEST_SRC([block_device_operations_getgeo_gendisk], [ + #include + + static int blk_getgeo(struct gendisk *disk, struct hd_geometry *geo) + { + (void) disk, (void) geo; + return (0); + } + + static const struct block_device_operations + bops __attribute__ ((unused)) = { + .getgeo = blk_getgeo, + }; + ], [], []) +]) + +AC_DEFUN([ZFS_AC_KERNEL_BLOCK_DEVICE_OPERATIONS_GETGEO_GENDISK], [ + AC_MSG_CHECKING([whether bops->getgeo() takes gendisk as first arg]) + ZFS_LINUX_TEST_RESULT([block_device_operations_getgeo_gendisk], [ + AC_MSG_RESULT(yes) + AC_DEFINE([HAVE_BLOCK_DEVICE_OPERATIONS_GETGEO_GENDISK], [1], + [Define if getgeo() in block_device_operations takes struct gendisk * as its first arg]) + ],[ + AC_MSG_RESULT(no) + ]) +]) + AC_DEFUN([ZFS_AC_KERNEL_SRC_BLOCK_DEVICE_OPERATIONS], [ ZFS_AC_KERNEL_SRC_BLOCK_DEVICE_OPERATIONS_CHECK_EVENTS ZFS_AC_KERNEL_SRC_BLOCK_DEVICE_OPERATIONS_RELEASE_VOID ZFS_AC_KERNEL_SRC_BLOCK_DEVICE_OPERATIONS_RELEASE_1ARG ZFS_AC_KERNEL_SRC_BLOCK_DEVICE_OPERATIONS_REVALIDATE_DISK + ZFS_AC_KERNEL_SRC_BLOCK_DEVICE_OPERATIONS_GETGEO_GENDISK ]) AC_DEFUN([ZFS_AC_KERNEL_BLOCK_DEVICE_OPERATIONS], [ ZFS_AC_KERNEL_BLOCK_DEVICE_OPERATIONS_CHECK_EVENTS ZFS_AC_KERNEL_BLOCK_DEVICE_OPERATIONS_RELEASE_VOID ZFS_AC_KERNEL_BLOCK_DEVICE_OPERATIONS_REVALIDATE_DISK + ZFS_AC_KERNEL_BLOCK_DEVICE_OPERATIONS_GETGEO_GENDISK ]) diff --git a/module/os/linux/zfs/zvol_os.c b/module/os/linux/zfs/zvol_os.c index 3e458e14c92..fe939150b64 100644 --- a/module/os/linux/zfs/zvol_os.c +++ b/module/os/linux/zfs/zvol_os.c @@ -1032,12 +1032,12 @@ zvol_os_update_volsize(zvol_state_t *zv, uint64_t volsize) * tiny devices. For devices over 1 Mib a standard head and sector count * is used to keep the cylinders count reasonable. */ -static int -zvol_getgeo(struct block_device *bdev, struct hd_geometry *geo) +static inline int +zvol_getgeo_impl(struct gendisk *disk, struct hd_geometry *geo) { + zvol_state_t *zv = atomic_load_ptr(&disk->private_data); sector_t sectors; - zvol_state_t *zv = atomic_load_ptr(&bdev->bd_disk->private_data); ASSERT3P(zv, !=, NULL); ASSERT3U(zv->zv_open_count, >, 0); @@ -1057,6 +1057,20 @@ zvol_getgeo(struct block_device *bdev, struct hd_geometry *geo) return (0); } +#ifdef HAVE_BLOCK_DEVICE_OPERATIONS_GETGEO_GENDISK +static int +zvol_getgeo(struct gendisk *disk, struct hd_geometry *geo) +{ + return (zvol_getgeo_impl(disk, geo)); +} +#else +static int +zvol_getgeo(struct block_device *bdev, struct hd_geometry *geo) +{ + return (zvol_getgeo_impl(bdev->bd_disk, geo)); +} +#endif + /* * Why have two separate block_device_operations structs? * From 76c238f1ba9fbd0123cf4f93028e70ad19a0bcd2 Mon Sep 17 00:00:00 2001 From: Rob Norris Date: Fri, 12 Sep 2025 09:31:35 +1000 Subject: [PATCH 08/16] Linux 6.18: replace write_cache_pages() Linux 6.18 removed write_cache_pages() without a usable replacement. Here we implement a minimal zpl_write_cache_pages() that find the dirty pages within the mapping, gets them into the expected state and hands them off to zfs_putpage(), which handles the rest. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris --- config/kernel-writeback.m4 | 58 ++++++++++++++++++++++++++ config/kernel-writepage_t.m4 | 26 ------------ config/kernel.m4 | 4 +- module/os/linux/zfs/zpl_file.c | 74 ++++++++++++++++++++++++++++++++++ 4 files changed, 134 insertions(+), 28 deletions(-) create mode 100644 config/kernel-writeback.m4 delete mode 100644 config/kernel-writepage_t.m4 diff --git a/config/kernel-writeback.m4 b/config/kernel-writeback.m4 new file mode 100644 index 00000000000..334d65ef84b --- /dev/null +++ b/config/kernel-writeback.m4 @@ -0,0 +1,58 @@ +AC_DEFUN([ZFS_AC_KERNEL_SRC_WRITEPAGE_T], [ + dnl # + dnl # 6.3 API change + dnl # The writepage_t function type now has its first argument as + dnl # struct folio* instead of struct page* + dnl # + ZFS_LINUX_TEST_SRC([writepage_t_folio], [ + #include + static int putpage(struct folio *folio, + struct writeback_control *wbc, void *data) + { return 0; } + writepage_t func = putpage; + ],[]) +]) + +AC_DEFUN([ZFS_AC_KERNEL_WRITEPAGE_T], [ + AC_MSG_CHECKING([whether int (*writepage_t)() takes struct folio*]) + ZFS_LINUX_TEST_RESULT([writepage_t_folio], [ + AC_MSG_RESULT(yes) + AC_DEFINE(HAVE_WRITEPAGE_T_FOLIO, 1, + [int (*writepage_t)() takes struct folio*]) + ],[ + AC_MSG_RESULT(no) + ]) +]) + +AC_DEFUN([ZFS_AC_KERNEL_SRC_WRITE_CACHE_PAGES], [ + dnl # + dnl # 6.18 API change + dnl # write_cache_pages() has been removed. + dnl # + ZFS_LINUX_TEST_SRC([write_cache_pages], [ + #include + ], [ + (void) write_cache_pages(NULL, NULL, NULL, NULL); + ]) +]) + +AC_DEFUN([ZFS_AC_KERNEL_WRITE_CACHE_PAGES], [ + AC_MSG_CHECKING([whether write_cache_pages() is available]) + ZFS_LINUX_TEST_RESULT([write_cache_pages], [ + AC_MSG_RESULT(yes) + AC_DEFINE(HAVE_WRITE_CACHE_PAGES, 1, + [write_cache_pages() is available]) + ],[ + AC_MSG_RESULT(no) + ]) +]) + +AC_DEFUN([ZFS_AC_KERNEL_SRC_WRITEBACK], [ + ZFS_AC_KERNEL_SRC_WRITEPAGE_T + ZFS_AC_KERNEL_SRC_WRITE_CACHE_PAGES +]) + +AC_DEFUN([ZFS_AC_KERNEL_WRITEBACK], [ + ZFS_AC_KERNEL_WRITEPAGE_T + ZFS_AC_KERNEL_WRITE_CACHE_PAGES +]) diff --git a/config/kernel-writepage_t.m4 b/config/kernel-writepage_t.m4 deleted file mode 100644 index a82cf370c9d..00000000000 --- a/config/kernel-writepage_t.m4 +++ /dev/null @@ -1,26 +0,0 @@ -AC_DEFUN([ZFS_AC_KERNEL_SRC_WRITEPAGE_T], [ - dnl # - dnl # 6.3 API change - dnl # The writepage_t function type now has its first argument as - dnl # struct folio* instead of struct page* - dnl # - ZFS_LINUX_TEST_SRC([writepage_t_folio], [ - #include - static int putpage(struct folio *folio, - struct writeback_control *wbc, void *data) - { return 0; } - writepage_t func = putpage; - ],[]) -]) - -AC_DEFUN([ZFS_AC_KERNEL_WRITEPAGE_T], [ - AC_MSG_CHECKING([whether int (*writepage_t)() takes struct folio*]) - ZFS_LINUX_TEST_RESULT([writepage_t_folio], [ - AC_MSG_RESULT(yes) - AC_DEFINE(HAVE_WRITEPAGE_T_FOLIO, 1, - [int (*writepage_t)() takes struct folio*]) - ],[ - AC_MSG_RESULT(no) - ]) -]) - diff --git a/config/kernel.m4 b/config/kernel.m4 index 35819e4d68c..27fe7661611 100644 --- a/config/kernel.m4 +++ b/config/kernel.m4 @@ -121,7 +121,7 @@ AC_DEFUN([ZFS_AC_KERNEL_TEST_SRC], [ ZFS_AC_KERNEL_SRC_IDMAP_MNT_API ZFS_AC_KERNEL_SRC_IDMAP_NO_USERNS ZFS_AC_KERNEL_SRC_IATTR_VFSID - ZFS_AC_KERNEL_SRC_WRITEPAGE_T + ZFS_AC_KERNEL_SRC_WRITEBACK ZFS_AC_KERNEL_SRC_RECLAIMED ZFS_AC_KERNEL_SRC_REGISTER_SYSCTL_TABLE ZFS_AC_KERNEL_SRC_REGISTER_SYSCTL_SZ @@ -240,7 +240,7 @@ AC_DEFUN([ZFS_AC_KERNEL_TEST_RESULT], [ ZFS_AC_KERNEL_IDMAP_MNT_API ZFS_AC_KERNEL_IDMAP_NO_USERNS ZFS_AC_KERNEL_IATTR_VFSID - ZFS_AC_KERNEL_WRITEPAGE_T + ZFS_AC_KERNEL_WRITEBACK ZFS_AC_KERNEL_RECLAIMED ZFS_AC_KERNEL_REGISTER_SYSCTL_TABLE ZFS_AC_KERNEL_REGISTER_SYSCTL_SZ diff --git a/module/os/linux/zfs/zpl_file.c b/module/os/linux/zfs/zpl_file.c index d07317b0d91..02965ac8cbe 100644 --- a/module/os/linux/zfs/zpl_file.c +++ b/module/os/linux/zfs/zpl_file.c @@ -23,6 +23,7 @@ * Copyright (c) 2011, Lawrence Livermore National Security, LLC. * Copyright (c) 2015 by Chunwei Chen. All rights reserved. * Copyright (c) 2025, Klara, Inc. + * Copyright (c) 2025, Rob Norris */ @@ -478,6 +479,7 @@ zpl_putpage(struct page *pp, struct writeback_control *wbc, void *data) return (ret); } +#ifdef HAVE_WRITE_CACHE_PAGES #ifdef HAVE_WRITEPAGE_T_FOLIO static int zpl_putfolio(struct folio *pp, struct writeback_control *wbc, void *data) @@ -499,6 +501,78 @@ zpl_write_cache_pages(struct address_space *mapping, #endif return (result); } +#else +static inline int +zpl_write_cache_pages(struct address_space *mapping, + struct writeback_control *wbc, void *data) +{ + pgoff_t start = wbc->range_start >> PAGE_SHIFT; + pgoff_t end = wbc->range_end >> PAGE_SHIFT; + + struct folio_batch fbatch; + folio_batch_init(&fbatch); + + /* + * This atomically (-ish) tags all DIRTY pages in the range with + * TOWRITE, allowing users to continue dirtying or undirtying pages + * while we get on with writeback, without us treading on each other. + */ + tag_pages_for_writeback(mapping, start, end); + + int err = 0; + unsigned int npages; + + /* + * Grab references to the TOWRITE pages just flagged. This may not get + * all of them, so we do it in a loop until there are none left. + */ + while ((npages = filemap_get_folios_tag(mapping, &start, end, + PAGECACHE_TAG_TOWRITE, &fbatch)) != 0) { + + /* Loop over each page and write it out. */ + struct folio *folio; + while ((folio = folio_batch_next(&fbatch)) != NULL) { + folio_lock(folio); + + /* + * If the folio has been remapped, or is no longer + * dirty, then there's nothing to do. + */ + if (folio->mapping != mapping || + !folio_test_dirty(folio)) { + folio_unlock(folio); + continue; + } + + /* + * If writeback is already in progress, wait for it to + * finish. We continue after this even if the page + * ends up clean; zfs_putpage() will skip it if no + * further work is required. + */ + while (folio_test_writeback(folio)) + folio_wait_bit(folio, PG_writeback); + + /* + * Write it out and collect any error. zfs_putpage() + * will clear the TOWRITE and DIRTY flags, and return + * with the page unlocked. + */ + int ferr = zpl_putpage(&folio->page, wbc, data); + if (err == 0 && ferr != 0) + err = ferr; + + /* Housekeeping for the caller. */ + wbc->nr_to_write -= folio_nr_pages(folio); + } + + /* Release any remaining references on the batch. */ + folio_batch_release(&fbatch); + } + + return (err); +} +#endif static int zpl_writepages(struct address_space *mapping, struct writeback_control *wbc) From 8911360a416fb3a9fe055768017e003b2fc0d3bf Mon Sep 17 00:00:00 2001 From: Rob Norris Date: Mon, 29 Sep 2025 09:16:36 +1000 Subject: [PATCH 09/16] Linux 6.18: namespace type moved to ns_common The namespace type has moved from the namespace ops struct to the "common" base namespace struct. Detect this and define a macro that does the right thing for both versions. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris --- config/kernel-namespace.m4 | 31 +++++++++++ config/kernel-userns-capabilities.m4 | 79 ---------------------------- config/kernel.m4 | 2 + module/os/linux/spl/spl-zone.c | 19 ++++++- 4 files changed, 51 insertions(+), 80 deletions(-) create mode 100644 config/kernel-namespace.m4 delete mode 100644 config/kernel-userns-capabilities.m4 diff --git a/config/kernel-namespace.m4 b/config/kernel-namespace.m4 new file mode 100644 index 00000000000..9b0b12e4eab --- /dev/null +++ b/config/kernel-namespace.m4 @@ -0,0 +1,31 @@ +dnl # +dnl # 6.18 API change +dnl # ns->ops->type was moved to ns->ns.ns_type (struct ns_common) +dnl # +AC_DEFUN([ZFS_AC_KERNEL_SRC_NS_COMMON_TYPE], [ + ZFS_LINUX_TEST_SRC([ns_common_type], [ + #include + ],[ + struct user_namespace ns; + ns.ns.ns_type = 0; + ]) +]) + +AC_DEFUN([ZFS_AC_KERNEL_NS_COMMON_TYPE], [ + AC_MSG_CHECKING([whether ns_type is accessible through ns_common]) + ZFS_LINUX_TEST_RESULT([ns_common_type], [ + AC_MSG_RESULT(yes) + AC_DEFINE([HAVE_NS_COMMON_TYPE], 1, + [Define if ns_type is accessible through ns_common]) + ],[ + AC_MSG_RESULT(no) + ]) +]) + +AC_DEFUN([ZFS_AC_KERNEL_SRC_NAMESPACE], [ + ZFS_AC_KERNEL_SRC_NS_COMMON_TYPE +]) + +AC_DEFUN([ZFS_AC_KERNEL_NAMESPACE], [ + ZFS_AC_KERNEL_NS_COMMON_TYPE +]) diff --git a/config/kernel-userns-capabilities.m4 b/config/kernel-userns-capabilities.m4 deleted file mode 100644 index f4e24fb1606..00000000000 --- a/config/kernel-userns-capabilities.m4 +++ /dev/null @@ -1,79 +0,0 @@ -dnl # -dnl # 2.6.38 API change -dnl # ns_capable() was introduced -dnl # -AC_DEFUN([ZFS_AC_KERNEL_SRC_NS_CAPABLE], [ - ZFS_LINUX_TEST_SRC([ns_capable], [ - #include - ],[ - ns_capable((struct user_namespace *)NULL, CAP_SYS_ADMIN); - ]) -]) - -AC_DEFUN([ZFS_AC_KERNEL_NS_CAPABLE], [ - AC_MSG_CHECKING([whether ns_capable exists]) - ZFS_LINUX_TEST_RESULT([ns_capable], [ - AC_MSG_RESULT(yes) - ],[ - ZFS_LINUX_TEST_ERROR([ns_capable()]) - ]) -]) - -dnl # -dnl # 2.6.39 API change -dnl # struct user_namespace was added to struct cred_t as cred->user_ns member -dnl # -AC_DEFUN([ZFS_AC_KERNEL_SRC_CRED_USER_NS], [ - ZFS_LINUX_TEST_SRC([cred_user_ns], [ - #include - ],[ - struct cred cr; - cr.user_ns = (struct user_namespace *)NULL; - ]) -]) - -AC_DEFUN([ZFS_AC_KERNEL_CRED_USER_NS], [ - AC_MSG_CHECKING([whether cred_t->user_ns exists]) - ZFS_LINUX_TEST_RESULT([cred_user_ns], [ - AC_MSG_RESULT(yes) - ],[ - ZFS_LINUX_TEST_ERROR([cred_t->user_ns()]) - ]) -]) - -dnl # -dnl # 3.4 API change -dnl # kuid_has_mapping() and kgid_has_mapping() were added to distinguish -dnl # between internal kernel uids/gids and user namespace uids/gids. -dnl # -AC_DEFUN([ZFS_AC_KERNEL_SRC_KUID_HAS_MAPPING], [ - ZFS_LINUX_TEST_SRC([kuid_has_mapping], [ - #include - ],[ - kuid_has_mapping((struct user_namespace *)NULL, KUIDT_INIT(0)); - kgid_has_mapping((struct user_namespace *)NULL, KGIDT_INIT(0)); - ]) -]) - -AC_DEFUN([ZFS_AC_KERNEL_KUID_HAS_MAPPING], [ - AC_MSG_CHECKING([whether kuid_has_mapping/kgid_has_mapping exist]) - ZFS_LINUX_TEST_RESULT([kuid_has_mapping], [ - AC_MSG_RESULT(yes) - ],[ - ZFS_LINUX_TEST_ERROR([kuid_has_mapping()]) - ]) -]) - -AC_DEFUN([ZFS_AC_KERNEL_SRC_USERNS_CAPABILITIES], [ - ZFS_AC_KERNEL_SRC_NS_CAPABLE - ZFS_AC_KERNEL_SRC_HAS_CAPABILITY - ZFS_AC_KERNEL_SRC_CRED_USER_NS - ZFS_AC_KERNEL_SRC_KUID_HAS_MAPPING -]) - -AC_DEFUN([ZFS_AC_KERNEL_USERNS_CAPABILITIES], [ - ZFS_AC_KERNEL_NS_CAPABLE - ZFS_AC_KERNEL_HAS_CAPABILITY - ZFS_AC_KERNEL_CRED_USER_NS - ZFS_AC_KERNEL_KUID_HAS_MAPPING -]) diff --git a/config/kernel.m4 b/config/kernel.m4 index 27fe7661611..8484bcfb161 100644 --- a/config/kernel.m4 +++ b/config/kernel.m4 @@ -136,6 +136,7 @@ AC_DEFUN([ZFS_AC_KERNEL_TEST_SRC], [ ZFS_AC_KERNEL_SRC_TIMER ZFS_AC_KERNEL_SRC_SUPER_BLOCK_S_WB_ERR ZFS_AC_KERNEL_SRC_SOPS_FREE_INODE + ZFS_AC_KERNEL_SRC_NAMESPACE case "$host_cpu" in powerpc*) ZFS_AC_KERNEL_SRC_CPU_HAS_FEATURE @@ -256,6 +257,7 @@ AC_DEFUN([ZFS_AC_KERNEL_TEST_RESULT], [ ZFS_AC_KERNEL_TIMER ZFS_AC_KERNEL_SUPER_BLOCK_S_WB_ERR ZFS_AC_KERNEL_SOPS_FREE_INODE + ZFS_AC_KERNEL_NAMESPACE case "$host_cpu" in powerpc*) ZFS_AC_KERNEL_CPU_HAS_FEATURE diff --git a/module/os/linux/spl/spl-zone.c b/module/os/linux/spl/spl-zone.c index 45c2999a4bb..b2eae5d00b1 100644 --- a/module/os/linux/spl/spl-zone.c +++ b/module/os/linux/spl/spl-zone.c @@ -25,6 +25,10 @@ * SUCH DAMAGE. */ +/* + * Copyright (c) 2025, Rob Norris + */ + #include #include #include @@ -56,6 +60,19 @@ typedef struct zone_dataset { } zone_dataset_t; #ifdef CONFIG_USER_NS + +/* + * Linux 6.18 moved the generic namespace type away from ns->ops->type onto + * ns_common itself. + */ +#ifdef HAVE_NS_COMMON_TYPE +#define ns_is_newuser(ns) \ + ((ns)->ns_type == CLONE_NEWUSER) +#else +#define ns_is_newuser(ns) \ + ((ns)->ops != NULL && (ns)->ops->type == CLONE_NEWUSER) +#endif + /* * Returns: * - 0 on success @@ -84,7 +101,7 @@ user_ns_get(int fd, struct user_namespace **userns) goto done; } ns = get_proc_ns(file_inode(nsfile)); - if (ns->ops->type != CLONE_NEWUSER) { + if (!ns_is_newuser(ns)) { error = ENOTTY; goto done; } From 3651888182ec381f95d90efbd564a207e5e17670 Mon Sep 17 00:00:00 2001 From: Rob Norris Date: Mon, 29 Sep 2025 09:32:50 +1000 Subject: [PATCH 10/16] sha256_generic: make internal functions a little more private Linux 6.18 has conflicting prototypes for various sha256_* and sha512_* functions, which we get through a very long include chain. That's tough to fix right now; easier is just to rename our internal functions. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris --- module/icp/algs/sha2/sha2_generic.c | 41 +++++++++++++++++------------ 1 file changed, 24 insertions(+), 17 deletions(-) diff --git a/module/icp/algs/sha2/sha2_generic.c b/module/icp/algs/sha2/sha2_generic.c index d0fcca798fa..ad707341eec 100644 --- a/module/icp/algs/sha2/sha2_generic.c +++ b/module/icp/algs/sha2/sha2_generic.c @@ -77,7 +77,8 @@ static const uint32_t SHA256_K[64] = { h = g, g = f, f = e, e = d + T1; \ d = c, c = b, b = a, a = T1 + T2; -static void sha256_generic(uint32_t state[8], const void *data, size_t num_blks) +static void +icp_sha256_generic(uint32_t state[8], const void *data, size_t num_blks) { uint64_t blk; @@ -173,7 +174,8 @@ static const uint64_t SHA512_K[80] = { 0x5fcb6fab3ad6faec, 0x6c44198c4a475817 }; -static void sha512_generic(uint64_t state[8], const void *data, size_t num_blks) +static void +icp_sha512_generic(uint64_t state[8], const void *data, size_t num_blks) { uint64_t blk; @@ -226,7 +228,8 @@ static void sha512_generic(uint64_t state[8], const void *data, size_t num_blks) } } -static void sha256_update(sha256_ctx *ctx, const uint8_t *data, size_t len) +static void +icp_sha256_update(sha256_ctx *ctx, const uint8_t *data, size_t len) { uint64_t pos = ctx->count[0]; uint64_t total = ctx->count[1]; @@ -258,7 +261,8 @@ static void sha256_update(sha256_ctx *ctx, const uint8_t *data, size_t len) ctx->count[1] = total; } -static void sha512_update(sha512_ctx *ctx, const uint8_t *data, size_t len) +static void +icp_sha512_update(sha512_ctx *ctx, const uint8_t *data, size_t len) { uint64_t pos = ctx->count[0]; uint64_t total = ctx->count[1]; @@ -290,7 +294,8 @@ static void sha512_update(sha512_ctx *ctx, const uint8_t *data, size_t len) ctx->count[1] = total; } -static void sha256_final(sha256_ctx *ctx, uint8_t *result, int bits) +static void +icp_sha256_final(sha256_ctx *ctx, uint8_t *result, int bits) { uint64_t mlen, pos = ctx->count[0]; uint8_t *m = ctx->wbuf; @@ -334,7 +339,8 @@ static void sha256_final(sha256_ctx *ctx, uint8_t *result, int bits) memset(ctx, 0, sizeof (*ctx)); } -static void sha512_final(sha512_ctx *ctx, uint8_t *result, int bits) +static void +icp_sha512_final(sha512_ctx *ctx, uint8_t *result, int bits) { uint64_t mlen, pos = ctx->count[0]; uint8_t *m = ctx->wbuf, *r; @@ -461,14 +467,14 @@ SHA2Update(SHA2_CTX *ctx, const void *data, size_t len) switch (ctx->algotype) { case SHA256: - sha256_update(&ctx->sha256, data, len); + icp_sha256_update(&ctx->sha256, data, len); break; case SHA512: case SHA512_HMAC_MECH_INFO_TYPE: - sha512_update(&ctx->sha512, data, len); + icp_sha512_update(&ctx->sha512, data, len); break; case SHA512_256: - sha512_update(&ctx->sha512, data, len); + icp_sha512_update(&ctx->sha512, data, len); break; } } @@ -479,32 +485,33 @@ SHA2Final(void *digest, SHA2_CTX *ctx) { switch (ctx->algotype) { case SHA256: - sha256_final(&ctx->sha256, digest, 256); + icp_sha256_final(&ctx->sha256, digest, 256); break; case SHA512: case SHA512_HMAC_MECH_INFO_TYPE: - sha512_final(&ctx->sha512, digest, 512); + icp_sha512_final(&ctx->sha512, digest, 512); break; case SHA512_256: - sha512_final(&ctx->sha512, digest, 256); + icp_sha512_final(&ctx->sha512, digest, 256); break; } } /* the generic implementation is always okay */ -static boolean_t sha2_is_supported(void) +static boolean_t +icp_sha2_is_supported(void) { return (B_TRUE); } const sha256_ops_t sha256_generic_impl = { .name = "generic", - .transform = sha256_generic, - .is_supported = sha2_is_supported + .transform = icp_sha256_generic, + .is_supported = icp_sha2_is_supported }; const sha512_ops_t sha512_generic_impl = { .name = "generic", - .transform = sha512_generic, - .is_supported = sha2_is_supported + .transform = icp_sha512_generic, + .is_supported = icp_sha2_is_supported }; From fe8b50f09fe69d3ae672d75593ec11b6d2b3f73f Mon Sep 17 00:00:00 2001 From: Rob Norris Date: Mon, 29 Sep 2025 09:51:06 +1000 Subject: [PATCH 11/16] Linux 6.18: generic_drop_inode() and generic_delete_inode() renamed Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris --- config/kernel-drop-inode.m4 | 24 ++++++++++++++++++++++ config/kernel.m4 | 2 ++ include/os/linux/kernel/linux/vfs_compat.h | 7 +++++++ module/os/linux/zfs/zpl_super.c | 4 +++- 4 files changed, 36 insertions(+), 1 deletion(-) create mode 100644 config/kernel-drop-inode.m4 diff --git a/config/kernel-drop-inode.m4 b/config/kernel-drop-inode.m4 new file mode 100644 index 00000000000..6f2b12cadc0 --- /dev/null +++ b/config/kernel-drop-inode.m4 @@ -0,0 +1,24 @@ +dnl # +dnl # 6.18 API change +dnl # - generic_drop_inode() renamed to inode_generic_drop() +dnl # - generic_delete_inode() renamed to inode_just_drop() +dnl # +AC_DEFUN([ZFS_AC_KERNEL_SRC_INODE_GENERIC_DROP], [ + ZFS_LINUX_TEST_SRC([inode_generic_drop], [ + #include + ],[ + struct inode *ip = NULL; + inode_generic_drop(ip); + ]) +]) + +AC_DEFUN([ZFS_AC_KERNEL_INODE_GENERIC_DROP], [ + AC_MSG_CHECKING([whether inode_generic_drop() exists]) + ZFS_LINUX_TEST_RESULT([inode_generic_drop], [ + AC_MSG_RESULT(yes) + AC_DEFINE(HAVE_INODE_GENERIC_DROP, 1, + [inode_generic_drop() exists]) + ],[ + AC_MSG_RESULT(no) + ]) +]) diff --git a/config/kernel.m4 b/config/kernel.m4 index 8484bcfb161..40b7de73988 100644 --- a/config/kernel.m4 +++ b/config/kernel.m4 @@ -137,6 +137,7 @@ AC_DEFUN([ZFS_AC_KERNEL_TEST_SRC], [ ZFS_AC_KERNEL_SRC_SUPER_BLOCK_S_WB_ERR ZFS_AC_KERNEL_SRC_SOPS_FREE_INODE ZFS_AC_KERNEL_SRC_NAMESPACE + ZFS_AC_KERNEL_SRC_INODE_GENERIC_DROP case "$host_cpu" in powerpc*) ZFS_AC_KERNEL_SRC_CPU_HAS_FEATURE @@ -258,6 +259,7 @@ AC_DEFUN([ZFS_AC_KERNEL_TEST_RESULT], [ ZFS_AC_KERNEL_SUPER_BLOCK_S_WB_ERR ZFS_AC_KERNEL_SOPS_FREE_INODE ZFS_AC_KERNEL_NAMESPACE + ZFS_AC_KERNEL_INODE_GENERIC_DROP case "$host_cpu" in powerpc*) ZFS_AC_KERNEL_CPU_HAS_FEATURE diff --git a/include/os/linux/kernel/linux/vfs_compat.h b/include/os/linux/kernel/linux/vfs_compat.h index cbf14e28371..d9dc904bc32 100644 --- a/include/os/linux/kernel/linux/vfs_compat.h +++ b/include/os/linux/kernel/linux/vfs_compat.h @@ -23,6 +23,7 @@ /* * Copyright (C) 2011 Lawrence Livermore National Security, LLC. * Copyright (C) 2015 Jörg Thalheim. + * Copyright (c) 2025, Rob Norris */ #ifndef _ZFS_VFS_H @@ -262,4 +263,10 @@ zpl_is_32bit_api(void) #define zpl_generic_fillattr(user_ns, ip, sp) generic_fillattr(ip, sp) #endif +#ifdef HAVE_INODE_GENERIC_DROP +/* 6.18 API change. These were renamed, alias the old names to the new. */ +#define generic_delete_inode(ip) inode_just_drop(ip) +#define generic_drop_inode(ip) inode_generic_drop(ip) +#endif + #endif /* _ZFS_VFS_H */ diff --git a/module/os/linux/zfs/zpl_super.c b/module/os/linux/zfs/zpl_super.c index 444948d03cb..347b352506e 100644 --- a/module/os/linux/zfs/zpl_super.c +++ b/module/os/linux/zfs/zpl_super.c @@ -23,6 +23,7 @@ * Copyright (c) 2011, Lawrence Livermore National Security, LLC. * Copyright (c) 2023, Datto Inc. All rights reserved. * Copyright (c) 2025, Klara, Inc. + * Copyright (c) 2025, Rob Norris */ @@ -33,6 +34,7 @@ #include #include #include +#include /* * What to do when the last reference to an inode is released. If 0, the kernel @@ -104,7 +106,7 @@ zpl_dirty_inode(struct inode *ip, int flags) * reporting memory pressure and requests OpenZFS release some memory (see * zfs_prune()). * - * When set to 1, we call generic_delete_node(), which always returns "destroy + * When set to 1, we call generic_delete_inode(), which always returns "destroy * immediately", resulting in inodes being destroyed immediately, releasing * their associated dnodes and dbufs to the dbuf cached and the ARC to be * evicted as normal. From 3a55e76b84c35a26fbfba098bdadc541614ae71d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jean-S=C3=A9bastien=20P=C3=A9dron?= Date: Tue, 21 Oct 2025 02:04:21 +0200 Subject: [PATCH 12/16] FreeBSD: zfs_getpages: Don't zero freshly allocated pages MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Initially, `zfs_getpages()` is provided with an array of busy pages by the vnode pager. It then tries to acquire the range lock, but if there is a concurrent `zfs_write()` running and fails to acquire that range lock, it "unbusies" the pages to avoid a deadlock with `zfs_write()`. After that, it grabs the pages again and retries to acquire the range lock, and so on. Once it got the range lock, it filters out valid pages, then copy DMU data to the remaining invalid pages. The problem is that freshly allocated zero'd pages it grabbed itself are marked as valid. Therefore they are skipped by the second part of the function and DMU data is never copied to these pages. This causes mapped pages to contain zeros instead of the expected file content. This was discovered while working on RabbitMQ on FreeBSD. I could reproduce the problem easily with the following commands: git clone https://github.com/rabbitmq/rabbitmq-server.git cd rabbitmq-server/deps/rabbit gmake distclean-ct RABBITMQ_METADATA_STORE=mnesia \ ct-amqp_client t=cluster_size_3:leader_transfer_stream_send The testsuite fails because there is a sendfile(2) that can happen concurrently to a write(2) on the same file. This leads to sendfile(2) or read(2) (after the sendfile) sending/returning data with zeros, which causes a function to crash. The patch consists of not setting the `VM_ALLOC_ZERO` flag when `zfs_getpages()` grabs pages again. Then, the last page is zero'd if it is invalid, in case it would be partially filled with the end of the file content. Other pages are either valid (and will be skipped) or they will be entirely overwritten by the file content. Reviewed-by: Alexander Motin Reviewed-by: Mark Johnston Signed-off-by: Jean-Sébastien Pédron Closes #17851 --- module/os/freebsd/zfs/zfs_vnops_os.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/module/os/freebsd/zfs/zfs_vnops_os.c b/module/os/freebsd/zfs/zfs_vnops_os.c index 8a5f76351dc..25fbb77acba 100644 --- a/module/os/freebsd/zfs/zfs_vnops_os.c +++ b/module/os/freebsd/zfs/zfs_vnops_os.c @@ -4225,8 +4225,20 @@ zfs_getpages(struct vnode *vp, vm_page_t *ma, int count, int *rbehind, zfs_vmobject_wlock(object); (void) vm_page_grab_pages(object, OFF_TO_IDX(start), - VM_ALLOC_NORMAL | VM_ALLOC_WAITOK | VM_ALLOC_ZERO, + VM_ALLOC_NORMAL | VM_ALLOC_WAITOK, ma, count); + if (!vm_page_all_valid(ma[count - 1])) { + /* + * Later in this function, we copy DMU data to + * invalid pages only. The last page may not be + * entirely filled though, if the file does not + * end on a page boundary. Therefore, we zero + * that last page here to make sure it does not + * contain garbage after the end of file. + */ + ASSERT(vm_page_none_valid(ma[count - 1])); + vm_page_zero_invalid(ma[count - 1], FALSE); + } zfs_vmobject_wunlock(object); } if (blksz == zp->z_blksz) From 72f41454a60b27c25e84fa73e6f39da314f09407 Mon Sep 17 00:00:00 2001 From: Rob Norris Date: Wed, 22 Oct 2025 08:34:39 +1100 Subject: [PATCH 13/16] ZTS: fail test run if test runner crashes unexpectedly zfs-tests.sh executes test-runner.py to do the actual test work. Any exit code < 4 is interpreted as success, with the actual value describing the outcome of the tests inside. If a Python program crashes in some way (eg an uncaught exception), the process exit code is 1. Taken together, this means that test-runner.py can crash during setup, but return a "success" error code to zfs-tests.sh, which will report and exit 0. This in turn causes the CI runner to believe the test run completed successfully. This commit addresses this by making zfs-tests.sh interpret an exit code of 255 as a failure in the runner itself. Then, in test-runner.py, the "fail()" function defaults to a 255 return, and the main function gets wrapped in a generic exception handler, which prints it and calls fail(). All together, this should mean that any unexpected failure in the test runner itself will be propagated out of zfs-tests.sh for CI or any other calling program to deal with. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf Reviewed-by: Tony Hutter Signed-off-by: Rob Norris Closes #17858 --- scripts/zfs-tests.sh | 4 +++ tests/test-runner/bin/test-runner.py.in | 33 ++++++++++++++----------- 2 files changed, 23 insertions(+), 14 deletions(-) diff --git a/scripts/zfs-tests.sh b/scripts/zfs-tests.sh index 5a0a1a60944..09a15bafc27 100755 --- a/scripts/zfs-tests.sh +++ b/scripts/zfs-tests.sh @@ -797,6 +797,10 @@ msg "${TEST_RUNNER}" \ 2>&1; echo $? >"$REPORT_FILE"; } | tee "$RESULTS_FILE" read -r RUNRESULT <"$REPORT_FILE" +if [[ "$RUNRESULT" -eq "255" ]] ; then + fail "$TEST_RUNNER failed, test aborted." +fi + # # Analyze the results. # diff --git a/tests/test-runner/bin/test-runner.py.in b/tests/test-runner/bin/test-runner.py.in index d2c1185e4a9..6688b6c4beb 100755 --- a/tests/test-runner/bin/test-runner.py.in +++ b/tests/test-runner/bin/test-runner.py.in @@ -25,6 +25,7 @@ import sys import ctypes import re import configparser +import traceback from datetime import datetime from optparse import OptionParser @@ -1138,7 +1139,7 @@ def filter_tests(testrun, options): testrun.filter(failed) -def fail(retstr, ret=1): +def fail(retstr, ret=255): print('%s: %s' % (sys.argv[0], retstr)) exit(ret) @@ -1247,23 +1248,27 @@ def parse_args(): def main(): options = parse_args() - testrun = TestRun(options) + try: + testrun = TestRun(options) - if options.runfiles: - testrun.read(options) - else: - find_tests(testrun, options) + if options.runfiles: + testrun.read(options) + else: + find_tests(testrun, options) - if options.logfile: - filter_tests(testrun, options) + if options.logfile: + filter_tests(testrun, options) - if options.template: - testrun.write(options) - exit(0) + if options.template: + testrun.write(options) + exit(0) - testrun.complete_outputdirs() - testrun.run(options) - exit(testrun.summary()) + testrun.complete_outputdirs() + testrun.run(options) + exit(testrun.summary()) + + except Exception: + fail("Uncaught exception in test runner:\n" + traceback.format_exc()) if __name__ == '__main__': From 44704616b47c2c9a3334a684612535ec2457eb8c Mon Sep 17 00:00:00 2001 From: Shreshth3 <66148173+Shreshth3@users.noreply.github.com> Date: Tue, 21 Oct 2025 15:10:52 -0700 Subject: [PATCH 14/16] zpool: fix conflict with -v and -o options Right now, the -v and -o options for `zpool list` work independently, but when paired, the -v "wins out" and the -o effect is lost. This commit fixes that problem. Reviewed-by: Alexander Motin Reviewed-by: Brian Behlendorf Reviewed-by: Rob Norris Signed-off-by: Shreshth Srivastava Closes #11040 Closes #17839 --- cmd/zpool/zpool_main.c | 133 ++++++++++++++++++++++++++++------------- 1 file changed, 91 insertions(+), 42 deletions(-) diff --git a/cmd/zpool/zpool_main.c b/cmd/zpool/zpool_main.c index 1feec55c0e8..a6658a9c280 100644 --- a/cmd/zpool/zpool_main.c +++ b/cmd/zpool/zpool_main.c @@ -6975,7 +6975,6 @@ collect_vdev_prop(zpool_prop_t prop, uint64_t value, const char *str, /* * print static default line per vdev - * not compatible with '-o' option */ static void collect_list_stats(zpool_handle_t *zhp, const char *name, nvlist_t *nv, @@ -7031,48 +7030,98 @@ collect_list_stats(zpool_handle_t *zhp, const char *name, nvlist_t *nv, * 'toplevel' boolean value is passed to the print_one_column() * to indicate that the value is valid. */ - if (VDEV_STAT_VALID(vs_pspace, c) && vs->vs_pspace) { - collect_vdev_prop(ZPOOL_PROP_SIZE, vs->vs_pspace, NULL, - scripted, B_TRUE, format, cb->cb_json, props, - cb->cb_json_as_int); - } else { - collect_vdev_prop(ZPOOL_PROP_SIZE, vs->vs_space, NULL, - scripted, toplevel, format, cb->cb_json, props, - cb->cb_json_as_int); + for (zprop_list_t *pl = cb->cb_proplist; pl != NULL; + pl = pl->pl_next) { + switch (pl->pl_prop) { + case ZPOOL_PROP_SIZE: + if (VDEV_STAT_VALID(vs_pspace, c) && + vs->vs_pspace) { + collect_vdev_prop( + ZPOOL_PROP_SIZE, vs->vs_pspace, + NULL, scripted, B_TRUE, format, + cb->cb_json, props, + cb->cb_json_as_int); + } else { + collect_vdev_prop( + ZPOOL_PROP_SIZE, vs->vs_space, NULL, + scripted, toplevel, format, + cb->cb_json, props, + cb->cb_json_as_int); + } + break; + case ZPOOL_PROP_ALLOCATED: + collect_vdev_prop(ZPOOL_PROP_ALLOCATED, + vs->vs_alloc, NULL, scripted, toplevel, + format, cb->cb_json, props, + cb->cb_json_as_int); + break; + + case ZPOOL_PROP_FREE: + collect_vdev_prop(ZPOOL_PROP_FREE, + vs->vs_space - vs->vs_alloc, NULL, scripted, + toplevel, format, cb->cb_json, props, + cb->cb_json_as_int); + break; + + case ZPOOL_PROP_CHECKPOINT: + collect_vdev_prop(ZPOOL_PROP_CHECKPOINT, + vs->vs_checkpoint_space, NULL, scripted, + toplevel, format, cb->cb_json, props, + cb->cb_json_as_int); + break; + + case ZPOOL_PROP_EXPANDSZ: + collect_vdev_prop(ZPOOL_PROP_EXPANDSZ, + vs->vs_esize, NULL, scripted, B_TRUE, + format, cb->cb_json, props, + cb->cb_json_as_int); + break; + + case ZPOOL_PROP_FRAGMENTATION: + collect_vdev_prop( + ZPOOL_PROP_FRAGMENTATION, + vs->vs_fragmentation, NULL, scripted, + (vs->vs_fragmentation != ZFS_FRAG_INVALID && + toplevel), + format, cb->cb_json, props, + cb->cb_json_as_int); + break; + + case ZPOOL_PROP_CAPACITY: + cap = (vs->vs_space == 0) ? + 0 : (vs->vs_alloc * 10000 / vs->vs_space); + collect_vdev_prop(ZPOOL_PROP_CAPACITY, cap, + NULL, scripted, toplevel, format, + cb->cb_json, props, cb->cb_json_as_int); + break; + + case ZPOOL_PROP_HEALTH: + state = zpool_state_to_name(vs->vs_state, + vs->vs_aux); + if (isspare) { + if (vs->vs_aux == VDEV_AUX_SPARED) + state = "INUSE"; + else if (vs->vs_state == + VDEV_STATE_HEALTHY) + state = "AVAIL"; + } + collect_vdev_prop(ZPOOL_PROP_HEALTH, 0, state, + scripted, B_TRUE, format, cb->cb_json, + props, cb->cb_json_as_int); + break; + + case ZPOOL_PROP_NAME: + break; + + default: + collect_vdev_prop(pl->pl_prop, 0, + NULL, scripted, B_FALSE, format, + cb->cb_json, props, cb->cb_json_as_int); + + } + + } - collect_vdev_prop(ZPOOL_PROP_ALLOCATED, vs->vs_alloc, NULL, - scripted, toplevel, format, cb->cb_json, props, - cb->cb_json_as_int); - collect_vdev_prop(ZPOOL_PROP_FREE, vs->vs_space - vs->vs_alloc, - NULL, scripted, toplevel, format, cb->cb_json, props, - cb->cb_json_as_int); - collect_vdev_prop(ZPOOL_PROP_CHECKPOINT, - vs->vs_checkpoint_space, NULL, scripted, toplevel, format, - cb->cb_json, props, cb->cb_json_as_int); - collect_vdev_prop(ZPOOL_PROP_EXPANDSZ, vs->vs_esize, NULL, - scripted, B_TRUE, format, cb->cb_json, props, - cb->cb_json_as_int); - collect_vdev_prop(ZPOOL_PROP_FRAGMENTATION, - vs->vs_fragmentation, NULL, scripted, - (vs->vs_fragmentation != ZFS_FRAG_INVALID && toplevel), - format, cb->cb_json, props, cb->cb_json_as_int); - cap = (vs->vs_space == 0) ? 0 : - (vs->vs_alloc * 10000 / vs->vs_space); - collect_vdev_prop(ZPOOL_PROP_CAPACITY, cap, NULL, - scripted, toplevel, format, cb->cb_json, props, - cb->cb_json_as_int); - collect_vdev_prop(ZPOOL_PROP_DEDUPRATIO, 0, NULL, - scripted, toplevel, format, cb->cb_json, props, - cb->cb_json_as_int); - state = zpool_state_to_name(vs->vs_state, vs->vs_aux); - if (isspare) { - if (vs->vs_aux == VDEV_AUX_SPARED) - state = "INUSE"; - else if (vs->vs_state == VDEV_STATE_HEALTHY) - state = "AVAIL"; - } - collect_vdev_prop(ZPOOL_PROP_HEALTH, 0, state, scripted, - B_TRUE, format, cb->cb_json, props, cb->cb_json_as_int); if (cb->cb_json) { fnvlist_add_nvlist(ent, "properties", props); From fc519b2c1108b52aaad63e356645f63c2c167f82 Mon Sep 17 00:00:00 2001 From: Rob Norris Date: Thu, 23 Oct 2025 03:27:54 +1100 Subject: [PATCH 15/16] mailmap/AUTHORS: update with recent new contributors MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit We’re not always on the same page, but at least we’re in the same book. Reviewed-by: Brian Behlendorf Reviewed-by: George Melikov Reviewed-by: Alexander Motin Signed-off-by: Rob Norris Closes #17860 --- .mailmap | 8 ++++++++ AUTHORS | 14 ++++++++++++++ 2 files changed, 22 insertions(+) diff --git a/.mailmap b/.mailmap index e6f09c6c9d4..3397fbc3745 100644 --- a/.mailmap +++ b/.mailmap @@ -53,6 +53,7 @@ Jason Harmening Jeremy Faulkner Jinshan Xiong John Poduska +Jo Zzsi Justin Scholz Ka Ho Ng Kash Pande @@ -67,6 +68,7 @@ Michael Gmelin Olivier Mazouffre Piotr Kubaj Quentin Zdanis +Roberto Ricci Roberto Ricci Rob Norris Rob Norris @@ -83,7 +85,10 @@ Youzhong Yang # Signed-off-by: overriding Author: Alexander Ziaee Felix Schmidt +Jean-Sébastien Pédron +Konstantin Belousov Olivier Certner +Patrick Xia Phil Sutter poscat Qiuhao Chen @@ -125,6 +130,7 @@ buzzingwires <131118055+buzzingwires@users.noreply.gi Cedric Maunoury <38213715+cedricmaunoury@users.noreply.github.com> Charles Suh Chris Peredun <126915832+chrisperedun@users.noreply.github.com> +classabbyamp <5366828+classabbyamp@users.noreply.github.com> Dacian Reece-Stremtan <35844628+dacianstremtan@users.noreply.github.com> Damian Szuberski <30863496+szubersk@users.noreply.github.com> Daniel Hiepler <32984777+heeplr@users.noreply.github.com> @@ -185,6 +191,7 @@ Michael Niewöhner Michael Zhivich <33133421+mzhivich@users.noreply.github.com> MigeljanImeri <78048439+MigeljanImeri@users.noreply.github.com> Mo Zhou <5723047+cdluminate@users.noreply.github.com> +nav1s <42621369+nav1s@users.noreply.github.com> Nick Mattis omni <79493359+omnivagant@users.noreply.github.com> Pablo Correa Gómez <32678034+pablofsf@users.noreply.github.com> @@ -206,6 +213,7 @@ Samuel Wycliffe <50765275+npc203@users.noreply.github Savyasachee Jha Scott Colby Sean Eric Fagan +Shreshth Srivastava <66148173+Shreshth3@users.noreply.github.com> Spencer Kinny <30333052+Spencer-Kinny@users.noreply.github.com> Srikanth N S <75025422+nssrikanth@users.noreply.github.com> Stefan Lendl <1321542+stfl@users.noreply.github.com> diff --git a/AUTHORS b/AUTHORS index 6c34c07f39e..e496c0e8a80 100644 --- a/AUTHORS +++ b/AUTHORS @@ -154,6 +154,7 @@ CONTRIBUTORS: Chris Zubrzycki Chuck Tuffli Chunwei Chen + classabbyamp Clemens Fruhwirth Clemens Lang Clint Armstrong @@ -161,6 +162,7 @@ CONTRIBUTORS: Colin Ian King Colin Percival Colm Buckley + Cong Zhang Crag Wang Craig Loomis Craig Sanders @@ -217,6 +219,7 @@ CONTRIBUTORS: Eitan Adler Eli Rosenthal Eli Schwartz + Eric A. Borisch Eric Desrochers Eric Dillmann Eric Schrock @@ -288,6 +291,7 @@ CONTRIBUTORS: Henrik Riomar Herb Wartens Hiếu Lê + hoshinomori Huang Liu Håkan Johansson Igor K @@ -300,6 +304,7 @@ CONTRIBUTORS: ilovezfs InsanePrawn Isaac Huang + Ivan Shapovalov Ivan Volosyuk Jacek Fefliński Jacob Adams @@ -322,6 +327,7 @@ CONTRIBUTORS: Javen Wu Jaydeep Kshirsagar Jean-Baptiste Lallement + Jean-Sébastien Pédron Jeff Dike Jeremy Faulkner Jeremy Gill @@ -355,7 +361,9 @@ CONTRIBUTORS: Josh Soref Joshua M. Clulow José Luis Salvador Rufo + Jo Zzsi João Carlos Mendes Luís + JT Pennington Julian Brunner Julian Heuking jumbi77 @@ -388,6 +396,7 @@ CONTRIBUTORS: Kleber Tarcísio Kody A Kantor Kohsuke Kawaguchi + Konstantin Belousov Konstantin Khorenko KORN Andras kotauskas @@ -416,6 +425,7 @@ CONTRIBUTORS: luozhengzheng Luís Henriques Madhav Suresh + Maksym Shkolnyi manfromafar Manoj Joseph Manuel Amador (Rudd-O) @@ -482,6 +492,7 @@ CONTRIBUTORS: Nathaniel Clark Nathaniel Wesley Filardo Nathan Lewis + nav1s Nav Ravindranath Neal Gompa (ニール・ゴンパ) Ned Bass @@ -506,6 +517,7 @@ CONTRIBUTORS: Palash Gandhi Patrick Fasano Patrick Mooney + Patrick Xia Patrik Greco Paul B. Henson Paul Dagnelie @@ -605,6 +617,7 @@ CONTRIBUTORS: Shengqi Chen SHENGYI HONG Shen Yan + Shreshth Srivastava Sietse Simon Guest Simon Howard @@ -665,6 +678,7 @@ CONTRIBUTORS: Toyam Cox Trevor Bautista Trey Dockendorf + trick2011 Troels Nørgaard tstabrawa Tulsi Jain From 0455150f1160dd7089ab31fefcfc58bfa548ab81 Mon Sep 17 00:00:00 2001 From: Ryan Libby Date: Thu, 23 Oct 2025 18:23:25 -0700 Subject: [PATCH 16/16] FreeBSD zio_crypt.c: initialize uio variables before access In zio_crypt_key_wrap and zio_crypt_key_unwrap, the cuio_s variable was not initialized before the calls to zfs_uio_init, leading to uninitialized access to cuio_s.uio_offset. Initialize it to avoid gcc warnings. Similar issue as fixed in 2bf152021 ("Fix gcc uninitialized warning in FreeBSD zio_crypt.c") Signed-off-by: Ryan Libby Reviewed-by: Alexander Motin Reviewed-by: Brian Behlendorf Closes #17863 --- module/os/freebsd/zfs/zio_crypt.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/module/os/freebsd/zfs/zio_crypt.c b/module/os/freebsd/zfs/zio_crypt.c index 91cf38016e0..8562c42b322 100644 --- a/module/os/freebsd/zfs/zio_crypt.c +++ b/module/os/freebsd/zfs/zio_crypt.c @@ -437,6 +437,7 @@ zio_crypt_key_wrap(crypto_key_t *cwkey, zio_crypt_key_t *key, uint8_t *iv, ASSERT3U(crypt, <, ZIO_CRYPT_FUNCTIONS); + memset(&cuio_s, 0, sizeof (cuio_s)); zfs_uio_init(&cuio, &cuio_s); keydata_len = zio_crypt_table[crypt].ci_keylen; @@ -519,6 +520,7 @@ zio_crypt_key_unwrap(crypto_key_t *cwkey, uint64_t crypt, uint64_t version, keydata_len = zio_crypt_table[crypt].ci_keylen; rw_init(&key->zk_salt_lock, NULL, RW_DEFAULT, NULL); + memset(&cuio_s, 0, sizeof (cuio_s)); zfs_uio_init(&cuio, &cuio_s); /*