Skip to content

Refactor mount operations to use procfd and new mount API#2094

Open
giuseppe wants to merge 22 commits into
containers:mainfrom
giuseppe:procfd-refactor
Open

Refactor mount operations to use procfd and new mount API#2094
giuseppe wants to merge 22 commits into
containers:mainfrom
giuseppe:procfd-refactor

Conversation

@giuseppe

Copy link
Copy Markdown
Member
  • Open a /proc fd early and thread it through all mount helpers, replacing hardcoded /proc/self/fd string paths with fd-relative operations. This removes the dependency on /proc being mounted during container setup.
  • Migrate mount operations to the new mount API (fsopen/fsmount/move_mount/mount_setattr).
  • Move pivot_root before container mounts and defer mount namespace creation (CLONE_NEWNS) to libcrun_do_pivot_root(), consolidating it in a single location.
  • Add OPEN_TREE_NAMESPACE support (kernel 7.0+) to replace the unshare(CLONE_NEWNS) + bind mount rootfs + pivot_root sequence entirely.

Closes #2086

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the container mount logic to utilize the new Linux mount API, enabling detached mounts and the pre-opening of file descriptors before pivot_root to maintain access to host paths. It introduces setup_mount_namespace, adds support for MS_SYNCHRONOUS and MS_DIRSYNC flags, and enhances rootless device node handling. The review feedback identifies several bugs related to empty path strings when processing the root directory and suggests using AT_FDCWD for consistent directory descriptor resolution.

Comment thread src/libcrun/linux.c Outdated
Comment on lines +3438 to +3440
if (get_private_data (container)->mount_fds)
{
struct libcrun_fd_map *mount_fds = get_private_data (container)->mount_fds;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Directly checking get_private_data (container)->mount_fds might skip pre-opening bind mounts if no non-bind mounts have been processed yet (as mount_fds is lazily initialized). Use get_fd_map (container) to ensure the map is initialized before use.

          struct libcrun_fd_map *mount_fds = get_fd_map (container);
          if (mount_fds)
            {

Comment thread src/libcrun/linux.c Outdated
if (old_root_fd >= 0 && source && source[0] == '/')
{
crun_error_release (&tmp_err);
bindfd = get_bind_mount (old_root_fd, source + 1,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If source is exactly /, source + 1 will be an empty string. get_bind_mount calls syscall_open_tree, which requires the AT_EMPTY_PATH flag to handle an empty path. Since get_bind_mount does not currently pass this flag, the call will fail. Using . instead of an empty string avoids this issue.

                      bindfd = get_bind_mount (old_root_fd, (source[1] == '\0') ? "." : source + 1,

Comment thread src/libcrun/linux.c Outdated
if (UNLIKELY (ret < 0))
return crun_make_error (err, errno, "bind mount `/sys` from the host");
crun_error_release (err);
mountfd = get_bind_mount (-1, "/sys", true, false, false, MS_PRIVATE, err);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Use AT_FDCWD instead of -1 for the directory file descriptor to ensure standard behavior for relative path resolution in get_bind_mount.

                      mountfd = get_bind_mount (AT_FDCWD, "/sys", true, false, false, MS_PRIVATE, err);

Comment thread src/libcrun/linux.c Outdated
/* Do not resolve the symlink only when src-nofollow and copy-symlink are used. */
ret = get_file_type (&src_mode, (extra_flags & (OPTION_SRC_NOFOLLOW | OPTION_COPY_SYMLINK)) ? true : false, path);
if (old_root_fd >= 0)
ret = get_file_type_at (old_root_fd, &src_mode, nofollow, path + 1);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If path is exactly /, path + 1 will be an empty string. get_file_type_at only applies the AT_EMPTY_PATH flag if the path argument is NULL. Passing an empty string will result in an ENOENT error. Suggest passing NULL when the path is the root directory.

            ret = get_file_type_at (old_root_fd, &src_mode, nofollow, (path[1] == '\0') ? NULL : path + 1);

Comment thread src/libcrun/linux.c Outdated
if (old_root_fd >= 0 && mount->source && mount->source[0] == '/')
{
bind_dirfd = old_root_fd;
bind_source = mount->source + 1;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If mount->source is exactly /, mount->source + 1 will be an empty string, which causes get_bind_mount to fail as it lacks the AT_EMPTY_PATH flag for its internal open_tree call. Using . ensures the directory itself is opened correctly.

              bind_source = (mount->source[1] == '\0') ? "." : mount->source + 1;

Comment thread src/libcrun/linux.c Outdated
{
libcrun_error_t tmp_err = NULL;

mount_fds->fds[i] = get_bind_mount (-1, def->mounts[i]->source,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Use AT_FDCWD instead of -1 for the directory file descriptor to ensure standard behavior for relative path resolution in get_bind_mount.

                  mount_fds->fds[i] = get_bind_mount (AT_FDCWD, def->mounts[i]->source,

@packit-as-a-service

Copy link
Copy Markdown

TMT tests failed. @containers/packit-build please check.

@giuseppe

Copy link
Copy Markdown
Member Author

CI fuzzing fix in #2096

@giuseppe giuseppe force-pushed the procfd-refactor branch 11 times, most recently from 683e89e to 7440a1c Compare May 21, 2026 07:58
@packit-as-a-service

Copy link
Copy Markdown

Ephemeral COPR build failed. @containers/packit-build please check.

@packit-as-a-service

Copy link
Copy Markdown

TMT tests failed. @containers/packit-build please check.

@packit-as-a-service

Copy link
Copy Markdown

Ephemeral COPR build failed. @containers/packit-build please check.

@packit-as-a-service

Copy link
Copy Markdown

TMT tests failed. @containers/packit-build please check.

@giuseppe giuseppe force-pushed the procfd-refactor branch 2 times, most recently from d0a63b0 to f5302f5 Compare May 21, 2026 11:24
@packit-as-a-service

Copy link
Copy Markdown

Ephemeral COPR build failed. @containers/packit-build please check.

@packit-as-a-service

Copy link
Copy Markdown

TMT tests failed. @containers/packit-build please check.

@packit-as-a-service

Copy link
Copy Markdown

Ephemeral COPR build failed. @containers/packit-build please check.

@packit-as-a-service

Copy link
Copy Markdown

TMT tests failed. @containers/packit-build please check.

@packit-as-a-service

Copy link
Copy Markdown

Ephemeral COPR build failed. @containers/packit-build please check.

@packit-as-a-service

Copy link
Copy Markdown

TMT tests failed. @containers/packit-build please check.

@giuseppe

Copy link
Copy Markdown
Member Author

@jnovy @kolyshkin I realize this is a big refactoring, so PTAL when you've some time :-) The CI seems happy

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors crun’s mount and rootfs setup flow to reduce reliance on /proc path lookups during early container setup, and to adopt the Linux “new mount API” (fsopen/fsmount/move_mount/mount_setattr). It also reorganizes when mount namespaces and pivot_root occur, with optional support for OPEN_TREE_NAMESPACE (Linux 7.0+) to replace the traditional unshare(CLONE_NEWNS)+pivot_root sequence.

Changes:

  • Introduces a cached proc mount FD (procfd) and threads FD-relative operations through mount helpers, reducing dependence on /proc being mounted in the container filesystem during setup.
  • Migrates more mount operations to the new mount API and uses mount_setattr() where possible (with fallbacks to mount(2) paths).
  • Reorders pivot/mount-namespace setup and adds OPEN_TREE_NAMESPACE support; expands test coverage for these behaviors (tmpfs options, overlay, no-/proc specs, masked paths).

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
tests/test_paths.py Adds masked-path regression tests, including a case intended to ensure “crun” doesn’t leak into /run output.
tests/test_mounts.py Adds new mount tests for fsopen-based tmpfs options, overlay premount behavior, no-/proc specs, and rbind option handling; updates tmpcopyup-on-/ expectation.
tests/init.c Adds nlink and isdir helper commands used by new tests.
src/libcrun/utils.h Adds get_self_fd_path() helper for procfd-relative self/fd/N paths.
src/libcrun/utils.c Adjusts safe_openat() behavior for the rootfs="/" && path=="" case to avoid needing /proc readlink checks.
src/libcrun/syscalls.h Adds missing MOUNT_ATTR_* fallback defines for portability across libc/kernel header versions.
src/libcrun/linux.h Updates libcrun_do_pivot_root() signature to allow updating the rootfs string in-place.
src/libcrun/linux.c Core refactor: procfd/old-root handling, mount_setattr translation, fsopen_mount data parsing, mount namespace setup changes, OPEN_TREE_NAMESPACE integration, notify socket path changes, and device/mount FD plumbing.
src/libcrun/container.c Moves pivot_root earlier in init flow and aligns calls with new pivot API.
src/libcrun/cgroup.h Refactors /proc/self/cgroup path composition to enable *_at() reads.
src/libcrun/cgroup-utils.h Adds libcrun_get_cgroup_process_at() API.
src/libcrun/cgroup-utils.c Factors parsing and adds an *_at() implementation for procfd-relative cgroup reads.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/libcrun/linux.c
@giuseppe

Copy link
Copy Markdown
Member Author

@jnovy @kolyshkin I realize this is a big refactoring, so PTAL when you've some time :-) The CI seems happy

had a chance to look at this PR?

@jnovy jnovy left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on this major refactoring. The architecture is sound -- procfd threading, new mount API with systematic fallback, and OPEN_TREE_NAMESPACE gating all look correct. A few suggestions below, none blocking.

Comment thread src/libcrun/linux.c Outdated
static uint64_t
ms_flags_to_mount_attr (uint64_t ms_flags)
{
uint64_t attr = ms_flags & (MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID | MOUNT_ATTR_NODEV | MOUNT_ATTR_NOEXEC);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: This works because the kernel intentionally made these MOUNT_ATTR_* values equal to their MS_* counterparts (1, 2, 4, 8), but it's non-obvious. Consider adding a comment or _Static_assert to document this assumption, similar to how the atime flags get explicit conversion below.

Comment thread src/libcrun/linux.c Outdated
/* Older kernels (seen on 4.18) fail with EINVAL if data is set when
setting MS_RDONLY. */
if (flags & (MS_REMOUNT | MS_RDONLY))
data = NULL;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: This nullifies data before the mount_setattr path (line 846). mount_setattr doesn't use data, so it's harmless there, but if mount_setattr fails and falls through to the legacy mount() call (line 857), the data is already lost. Consider moving data = NULL after the mount_setattr fallthrough, so the legacy path also benefits from the workaround but doesn't lose data prematurely.

Comment thread src/libcrun/utils.c
directory itself (rootfs="/", path=""). open("/") can
only return the root, and after setns the /proc-based
readlink may not be reachable by path yet. */
if (rootfs[0] != '/' || rootfs[1] != '\0')

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The existing comment explains why the check is skipped (after setns, /proc may not be reachable). It would be helpful to also document the security invariant: this is safe because after pivot_root, "/" can only refer to the container root, so the readlink verification is redundant.

Comment thread src/libcrun/linux.c
if (UNLIKELY (fsfd < 0))
return fsfd;

(void) syscall_fsconfig (fsfd, FSCONFIG_SET_STRING, "source", source_name ? source_name : type, 0);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Is silently ignoring fsconfig("source") failure intentional? For filesystems that require a source device (e.g., ext4), this would let the mount proceed without one, likely failing later at FSCONFIG_CMD_CREATE with a less clear error. Would a debug-level log be appropriate here?

Comment thread src/libcrun/linux.c Outdated
return dev_fds;
}

#define NUM_NEEDED_DEVS 6

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: This must be kept in sync with the needed_devs[] array (line 1927). Since needed_devs is file-scope, you could replace this with:

#define NUM_NEEDED_DEVS (sizeof (needed_devs) / sizeof (needed_devs[0]) - 1)

(minus 1 for the sentinel entry). Alternatively, a _Static_assert(NUM_NEEDED_DEVS == ...) would catch drift.

Comment thread src/libcrun/linux.c
if (recursive)
*recursive = false;

if (src_nofollow != NULL)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: This fixes a pre-existing bug -- the old code had if (src_nofollow == NULL) which would dereference NULL and skip initialization when non-NULL. Nice catch. Consider splitting this into its own commit for backport clarity, since it's a standalone bug fix independent of the refactoring.

giuseppe and others added 22 commits June 24, 2026 11:41
The check was inverted as it dereferenced src_nofollow when the
pointer was NULL instead of when it was non-NULL.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Replace setxattr via /proc/self/fd path with openat(procfd) +
fsetxattr.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Replace chmod/chown via /proc/self/fd path with fchmodat/fchownat
using the trusted procfd directory fd.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use procfd-relative paths for get_bind_mount and fstatfs instead of
/proc/self/fd string paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use get_bind_mount with the cached procfd and relative self/fd path
instead of formatting an absolute /proc/self/fd path and going through
do_mount.  Apply the extra mount flags from fstatfs via
do_mount_setattr on the detached mount before moving it into place.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Replace libcrun_get_cgroup_process() with direct read via procfd to
avoid dependency on /proc being mounted.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use mount_setattr() when available before falling back to the
mount() syscall.  mount_setattr() is fd-based and does not need
/proc/self/fd paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use mount_setattr() when available before falling back to the
mount() syscall.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Replace the slash-counting loop bound with fstat-based root
detection.  The old code counted '/' characters in the path string
to cap iterations; for relative paths like "rootfs" (zero slashes)
the loop ran only once, which could be insufficient.

Compare inode/device of each directory with its parent instead:
when ".." resolves to the same inode we have reached the filesystem
root.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Try fsopen_mount()+fs_move_mount_to() for filesystem mounts and
get_bind_mount()+fs_move_mount_to() for bind mounts before falling
back to mount().  For propagation changes, try mount_setattr()
first.

This removes the dependency on /proc/self/fd paths for the initial
mount and propagation steps when the new mount API is available.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
In user namespace containers, devices under /dev are created via bind
mounts from the host.  Pre-open the 6 standard devices (/dev/null,
/dev/zero, /dev/full, /dev/tty, /dev/random, /dev/urandom) using
open_tree(OPEN_TREE_CLONE) in the parent process after clone(), and
send them to the child via the existing sync socket fd-passing
mechanism.

This reuses the same send_mounts/receive_mounts pattern already used
for custom devices (dev_fds) and bind mount sources (mount_fds).

This is a preparatory change for OPEN_TREE_NAMESPACE support, where
host paths are not accessible after setns() into the new mount
namespace.

When open_tree() is not available or fails (e.g. rootless without
CAP_SYS_ADMIN), the fds remain -1 and the child falls back to the
existing bind mount path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Replace the openat(procfd) + fchmod/fchown approach with direct
fchmodat/fchownat calls.  These operate on the directory entry
without opening the device file, avoiding ENXIO errors on character
devices like /dev/tty that fail to open when no controlling terminal
exists.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
and rename it

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
The masked-dir-nlink test checked that st_nlink >= 2 to verify a
masked path is a directory.  This fails on btrfs where directories
have st_nlink=1 (btrfs does not count . and .. in nlink).

Add an 'isdir' command to the test init binary and use S_ISDIR to
verify the masked path is a directory, which works regardless of
filesystem type.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Verify that containers can run without /proc, /sys, or cgroup
mounts in the OCI spec.  This exercises the procfd-based code
paths that no longer depend on /proc/self/fd.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use the new mount API (fsopen/fsconfig/fsmount) to create detached
mounts for each non-bind filesystem type before pivot_root.  After
the pivot, place them via move_mount so the kernel mnt_already_visible
check is satisfied.  set_mounts then mounts fresh instances on top
with the correct OCI flags.

This fixes mounting proc/sysfs/cgroup in containers that use a user
namespace, where the kernel denies these mounts unless a mount of
the same type is already visible.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use open_tree(OPEN_TREE_NAMESPACE) + setns(CLONE_NEWNS) to replace the
traditional unshare(CLONE_NEWNS) + bind mount rootfs + pivot_root
sequence.

OPEN_TREE_NAMESPACE creates a new mount namespace with the rootfs as
the root mount.  setns() enters that namespace directly, so no bind
mount or pivot_root is needed.  The kernel automatically sets the
process root and cwd to the new namespace's root when the old root is
not reachable.

On older kernels (< 7.0) or when OPEN_TREE_NAMESPACE is not
supported, the code falls back to the traditional path.

Closes: containers#2086

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
When fsopen() is not available (returns ENOSYS), the host-side device
pre-creation in prepare_and_send_dev_mounts() cannot proceed since it
relies on the new mount API to create a detached tmpfs.

Instead of failing, send empty fds to the container process so that
libcrun_create_dev() falls back to its existing legacy path (bind
mounts or mknod) inside the container's mount namespace.

Closes: containers#2104

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use OPEN_TREE_NAMESPACE

4 participants