Add zoned_uid property with additive least privilege authorization

This implements zoned_uid, a ZFS property that delegates dataset
visibility and administration to user namespaces owned by a specific
UID, enabling rootless Podman/Docker with native ZFS storage.

Usage: zfs set zoned_uid=1000 pool/dataset

Problem solved:
- zfs zone requires the PID of an existing namespace
- Podman creates a new namespace on each container start
- Solution: delegate to a UID, so that any namespace owned by that
  UID is authorized

Authorization model — three-layer additive (all must pass):

  L0 (auth):     Namespace owner UID matches zoned_uid property
  L1 (dsl_deleg): Per-operation grants via `zfs allow` (when pool
                   delegation is ON — the default)
  L2 (cap tier):  Linux capability in the namespace determines the
                   operation class permitted

While CAP_SYS_ADMIN is a namespaced capability (the namespace owner
always holds it within their own user namespace), granting blanket
access based solely on its presence is contrary to the Principle of
Least Privilege. This change introduces tiered capability requirements
so that non-destructive operations (create, snapshot, set property)
require only CAP_FOWNER, while destructive operations (destroy, rename,
clone) continue to require CAP_SYS_ADMIN — both of which are namespaced
capabilities scoped to the user namespace, not the init namespace.

When pool delegation is OFF (non-default), all zoned_uid write
operations are denied — delegation OFF means the pool admin has
opted out of delegating access entirely.
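
The additive model above can be sketched as a single predicate. This
is a minimal illustration, not the code in this commit: every name
prefixed with sk_ or SK_ is hypothetical, the capability bits are
stand-ins for CAP_FOWNER and CAP_SYS_ADMIN, and the sketch assumes
each tier requires exactly its own capability (whether CAP_SYS_ADMIN
alone satisfies the lower tier is not stated here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for CAP_FOWNER and CAP_SYS_ADMIN. */
#define SK_CAP_FOWNER    (1u << 0)
#define SK_CAP_SYS_ADMIN (1u << 1)

typedef enum {
    SK_OP_CREATE, SK_OP_SNAPSHOT, SK_OP_SETPROP, /* non-destructive */
    SK_OP_DESTROY, SK_OP_RENAME, SK_OP_CLONE     /* destructive */
} sk_op_t;

typedef struct {
    unsigned int ns_owner_uid;  /* owner of the caller's user namespace */
    unsigned int ns_caps;       /* capabilities held in that namespace */
    bool         deleg_granted; /* a `zfs allow` grant covers this op */
} sk_caller_t;

typedef struct {
    unsigned int zoned_uid;          /* value of the zoned_uid property */
    bool         pool_delegation_on; /* pool-level delegation setting */
} sk_dataset_t;

static bool
sk_op_is_destructive(sk_op_t op)
{
    return op == SK_OP_DESTROY || op == SK_OP_RENAME || op == SK_OP_CLONE;
}

/* Returns true only when all three layers pass. */
static bool
sk_write_allowed(const sk_caller_t *c, const sk_dataset_t *d, sk_op_t op)
{
    assert(c != NULL && d != NULL);

    /* L0: the namespace owner UID must match zoned_uid. */
    if (c->ns_owner_uid != d->zoned_uid)
        return false;

    /* L1: delegation OFF denies all writes; ON needs a per-op grant. */
    if (!d->pool_delegation_on || !c->deleg_granted)
        return false;

    /* L2: the capability tier is chosen by operation class. */
    unsigned int need = sk_op_is_destructive(op) ?
        SK_CAP_SYS_ADMIN : SK_CAP_FOWNER;
    return (c->ns_caps & need) != 0;
}
```

With these assumptions, a caller holding only the lower-tier
capability can snapshot but not destroy, and turning pool delegation
off denies both.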

Security model:
- Namespace owner UID must match zoned_uid value
- Delegation root cannot be destroyed or escaped via rename
- Namespace users cannot modify zoned_uid itself (only global
  zone admin can manage delegation assignments)
- Namespace users cannot modify the 'zoned' property
- Namespace users cannot override filesystem_limit or
  snapshot_limit set by the global admin on the delegation root
  (but can impose tighter sub-limits on child datasets)
- Multi-UID isolation: sibling delegations with different UIDs
  cannot access each other's subtrees
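
The root-related rules in this list can be illustrated with a small
sketch. It assumes, as described under the kernel changes, that the
delegation root is the dataset whose name equals the zoned_uid
setpoint. All sk_ and SK_ names are hypothetical, and the sketch
covers only these rules, not the authorization layers or the rename
containment check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    SK_PROP_ZONED_UID,
    SK_PROP_ZONED,
    SK_PROP_FILESYSTEM_LIMIT,
    SK_PROP_SNAPSHOT_LIMIT,
    SK_PROP_OTHER
} sk_prop_t;

/*
 * A dataset is treated as the delegation root when zoned_uid was set
 * locally on it, i.e. the setpoint is the dataset's own name.
 */
static bool
sk_is_delegation_root(const char *dataset, const char *setpoint)
{
    assert(dataset != NULL && setpoint != NULL);
    return strcmp(dataset, setpoint) == 0;
}

/* Namespace users may destroy children, never the root itself. */
static bool
sk_ns_may_destroy(const char *dataset, const char *setpoint)
{
    return !sk_is_delegation_root(dataset, setpoint);
}

/* Property rules for callers inside a delegated namespace. */
static bool
sk_ns_may_setprop(const char *dataset, const char *setpoint, sk_prop_t prop)
{
    switch (prop) {
    case SK_PROP_ZONED_UID:
    case SK_PROP_ZONED:
        /* Only the global zone admin manages delegation itself. */
        return false;
    case SK_PROP_FILESYSTEM_LIMIT:
    case SK_PROP_SNAPSHOT_LIMIT:
        /* Blocked on the root, allowed as sub-limits on children. */
        return !sk_is_delegation_root(dataset, setpoint);
    default:
        return true;
    }
}
```

For a root pool/ct, the sketch rejects a snapshot_limit change on
pool/ct but accepts a filesystem_limit change on pool/ct/a.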

Kernel changes:
- zone_dataset_attach_uid()/detach_uid() in SPL
- zone_dataset_admin_check() for write authorization with tiered
  capabilities (CAP_FOWNER for non-destructive, CAP_SYS_ADMIN
  for destructive)
- Callback registration for zoned_uid property lookup
- New zfs_secpolicy_zoned_uid_deleg() helper that calls
  dsl_deleg_access_impl() directly, bypassing zfs_dozonecheck_ds()
  which requires the `zoned` property that zoned_uid datasets lack
- Fix dsl_deleg_access_impl() hierarchy walk to accept zoned_uid
  datasets (not just zoned=on)
- Update all 9 secpolicy call sites to require dsl_deleg grants
  instead of short-circuiting on ZONE_ADMIN_ALLOWED
- Security policy hooks in zfs_secpolicy_*() functions
- Fix inglobalzone() to use current_user_ns()
- zfs_prop_set_special() handles attach/detach as property
  side-effects, eliminating the need for dedicated ioctls
- spa_import_os() restores zoned_uid delegations kernel-side
  on pool import via dmu_objset_find() walk
- spa_export_os() detaches zoned_uid delegations on pool
  destroy/export, preventing stale kernel state on recreate
- zoned_uid registered as PROP_INHERIT so child datasets
  inherit the delegation, enabling sub-dataset creation
- zfs_get_zoned_uid() uses dsl_prop_get setpoint to identify
  the true delegation root, correctly distinguishing inherited
  values from locally-set ones for destroy/rename policy checks
- zone_dataset_check_list() accepts '@' and '#' separators in
  addition to '/' so snapshots and bookmarks are visible from
  delegated namespaces
- zfs_secpolicy_setprop() blocks ZFS_PROP_ZONED_UID from being
  set within a delegated namespace, preventing self-revocation
- zfs_secpolicy_setprop() blocks filesystem_limit and
  snapshot_limit changes on the delegation root from within a
  namespace (uses dsl_prop_get setpoint to identify the root),
  while allowing delegated users to set tighter sub-limits on
  child datasets
- Use kcred (not CRED()) for zone_dataset_detach_uid/attach_uid
  in destroy and rename cleanup paths, preventing stale tracking
  entries when namespace users perform these operations
- Use cr parameter (not CRED()) in all secpolicy zoned_uid
  delegation checks for correct credential propagation

Userspace changes:
- check_parents() defers to the kernel when zoned_uid is set

FreeBSD compatibility:
- Add FreeBSD stubs to include/os/freebsd/spl/sys/zone.h:
  - zone_uid_op_t enum (ZONE_OP_CREATE, SNAPSHOT, CLONE, DESTROY,
    RENAME, SETPROP)
  - zone_admin_result_t enum (NOT_APPLICABLE, ALLOWED, DENIED)
  - zone_dataset_admin_check() — static inline, always returns
    ZONE_ADMIN_NOT_APPLICABLE
  - zone_dataset_attach_uid() — static inline, returns ENXIO
  - zone_dataset_detach_uid() — static inline, returns ENXIO
  - zone_get_zoned_uid_fn_t callback typedef
  - zone_register_zoned_uid_callback() — static inline no-op
  - zone_unregister_zoned_uid_callback() — static inline no-op
- On FreeBSD, every zone_dataset_admin_check() call returns
  ZONE_ADMIN_NOT_APPLICABLE, causing all security policy functions
  to fall through to existing jail-based permission checks
- Setting zoned_uid on FreeBSD returns ENXIO since user namespace
  delegation requires Linux user namespaces
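
A rough sketch of what such stubs could look like follows. The
function parameters are guesses made for illustration, as are the
ZONE_OP_ and ZONE_ADMIN_ prefixes on the enum members that the list
above abbreviates; sk_secpolicy_sketch() is a hypothetical caller
showing the fall-through, not a function from this commit.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef enum {
    ZONE_OP_CREATE, ZONE_OP_SNAPSHOT, ZONE_OP_CLONE,
    ZONE_OP_DESTROY, ZONE_OP_RENAME, ZONE_OP_SETPROP
} zone_uid_op_t;

typedef enum {
    ZONE_ADMIN_NOT_APPLICABLE, ZONE_ADMIN_ALLOWED, ZONE_ADMIN_DENIED
} zone_admin_result_t;

/* Assumed signature: no user namespaces, so the check never applies. */
static inline zone_admin_result_t
zone_dataset_admin_check(const char *dataset, zone_uid_op_t op)
{
    (void) dataset;
    (void) op;
    return (ZONE_ADMIN_NOT_APPLICABLE);
}

/* Assumed signatures: attach and detach are unsupported. */
static inline int
zone_dataset_attach_uid(const char *dataset, unsigned int uid)
{
    (void) dataset;
    (void) uid;
    return (ENXIO);
}

static inline int
zone_dataset_detach_uid(const char *dataset, unsigned int uid)
{
    (void) dataset;
    (void) uid;
    return (ENXIO);
}

/*
 * Hypothetical caller: with the stubs above the first branch is always
 * taken, so the pre-existing (jail-based) result decides the outcome.
 */
static int
sk_secpolicy_sketch(const char *dataset, zone_uid_op_t op, int legacy_result)
{
    assert(dataset != NULL);
    if (zone_dataset_admin_check(dataset, op) == ZONE_ADMIN_NOT_APPLICABLE)
        return (legacy_result);
    return (EPERM); /* Linux-only branches are omitted from this sketch */
}
```

Because the stubs are static inline and constant, a compiler can
typically fold the check away, leaving the existing permission path
unchanged on FreeBSD.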

Test changes:
- Add grant_deleg() calls to tests 006-022 for operations that now
  require explicit dsl_deleg grants
- Add tests 023-030 validating the capability tier model
- Add test 031 validating stale zone tracking cleanup after
  namespace rename+destroy
- Fix capsh lookup in test helpers for ksh -p restricted PATH
  (command -v + explicit /usr/sbin fallback)
- Add mountpoint=none to tests 023-026 to avoid mount-lock issues
  in user namespaces
- Fix test 026 expectations to match kernel behavior (delegation
  OFF denies all writes, allows read-only)
- run_in_userns helper resolves absolute zfs path to handle
  environments where PATH does not include zfs (source builds)
- Test 004 updated: zoned_uid now inherits (PROP_INHERIT), test
  verifies inheritance and override behavior
- Test 013 uses within_percent with parseable byte output (-Hp)
  for robust quota value comparison across environments
- Test 014: verifies grandchild dataset creation from user
  namespace, confirming inherited zoned_uid delegation works
- Test 015: pool destroy/recreate with zoned_uid delegation
- Test 016: individual snapshot destroy from namespace
- Test 017: namespace user cannot modify zoned_uid property
- Test 018: clone operations from within delegated namespace
- Test 019: multi-UID isolation between sibling delegations
- Test 020: operations without zone_dataset_admin_check()
  integration are denied via zfs_dozonecheck_impl()
- Test 021: 'zoned' property cannot be modified from namespace
- Test 022: delegation root limit overrides blocked from namespace
- Quote shell variables across all test scripts for robustness
- Shellcheck SC2155 fixes across all test scripts

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Colin K. Williams / LINK ORG LLC / li-nk.social <colin@li-nk.org>
Closes #18167
li-nk.social
2026-04-03 10:38:26 -07:00
committed by GitHub
parent 74504cf7fd
commit f203fedde8
55 changed files with 5238 additions and 45 deletions
@@ -204,6 +204,7 @@ typedef enum {
ZFS_PROP_DEFAULTGROUPOBJQUOTA,
ZFS_PROP_DEFAULTPROJECTOBJQUOTA,
ZFS_PROP_SNAPSHOTS_CHANGED_NSECS,
ZFS_PROP_ZONED_UID,
ZFS_NUM_PROPS
} zfs_prop_t;
@@ -1782,6 +1783,7 @@ typedef enum {
ZFS_ERR_ASHIFT_MISMATCH,
ZFS_ERR_STREAM_LARGE_MICROZAP,
ZFS_ERR_TOO_MANY_SITOUTS,
ZFS_ERR_NO_USER_NS_SUPPORT,
} zfs_errno_t;
/*