Refactor mount operations to use procfd and new mount API#2094
Refactor mount operations to use procfd and new mount API#2094giuseppe wants to merge 22 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the container mount logic to utilize the new Linux mount API, enabling detached mounts and the pre-opening of file descriptors before pivot_root to maintain access to host paths. It introduces setup_mount_namespace, adds support for MS_SYNCHRONOUS and MS_DIRSYNC flags, and enhances rootless device node handling. The review feedback identifies several bugs related to empty path strings when processing the root directory and suggests using AT_FDCWD for consistent directory descriptor resolution.
| if (get_private_data (container)->mount_fds) | ||
| { | ||
| struct libcrun_fd_map *mount_fds = get_private_data (container)->mount_fds; |
There was a problem hiding this comment.
Directly checking get_private_data (container)->mount_fds might skip pre-opening bind mounts if no non-bind mounts have been processed yet (as mount_fds is lazily initialized). Use get_fd_map (container) to ensure the map is initialized before use.
struct libcrun_fd_map *mount_fds = get_fd_map (container);
if (mount_fds)
{| if (old_root_fd >= 0 && source && source[0] == '/') | ||
| { | ||
| crun_error_release (&tmp_err); | ||
| bindfd = get_bind_mount (old_root_fd, source + 1, |
There was a problem hiding this comment.
If source is exactly /, source + 1 will be an empty string. get_bind_mount calls syscall_open_tree, which requires the AT_EMPTY_PATH flag to handle an empty path. Since get_bind_mount does not currently pass this flag, the call will fail. Using . instead of an empty string avoids this issue.
bindfd = get_bind_mount (old_root_fd, (source[1] == '\0') ? "." : source + 1,| if (UNLIKELY (ret < 0)) | ||
| return crun_make_error (err, errno, "bind mount `/sys` from the host"); | ||
| crun_error_release (err); | ||
| mountfd = get_bind_mount (-1, "/sys", true, false, false, MS_PRIVATE, err); |
| /* Do not resolve the symlink only when src-nofollow and copy-symlink are used. */ | ||
| ret = get_file_type (&src_mode, (extra_flags & (OPTION_SRC_NOFOLLOW | OPTION_COPY_SYMLINK)) ? true : false, path); | ||
| if (old_root_fd >= 0) | ||
| ret = get_file_type_at (old_root_fd, &src_mode, nofollow, path + 1); |
There was a problem hiding this comment.
If path is exactly /, path + 1 will be an empty string. get_file_type_at only applies the AT_EMPTY_PATH flag if the path argument is NULL. Passing an empty string will result in an ENOENT error. Suggest passing NULL when the path is the root directory.
ret = get_file_type_at (old_root_fd, &src_mode, nofollow, (path[1] == '\0') ? NULL : path + 1);| if (old_root_fd >= 0 && mount->source && mount->source[0] == '/') | ||
| { | ||
| bind_dirfd = old_root_fd; | ||
| bind_source = mount->source + 1; |
There was a problem hiding this comment.
If mount->source is exactly /, mount->source + 1 will be an empty string, which causes get_bind_mount to fail as it lacks the AT_EMPTY_PATH flag for its internal open_tree call. Using . ensures the directory itself is opened correctly.
bind_source = (mount->source[1] == '\0') ? "." : mount->source + 1;| { | ||
| libcrun_error_t tmp_err = NULL; | ||
|
|
||
| mount_fds->fds[i] = get_bind_mount (-1, def->mounts[i]->source, |
|
TMT tests failed. @containers/packit-build please check. |
|
CI fuzzing fix in #2096 |
683e89e to
7440a1c
Compare
|
Ephemeral COPR build failed. @containers/packit-build please check. |
|
TMT tests failed. @containers/packit-build please check. |
|
Ephemeral COPR build failed. @containers/packit-build please check. |
|
TMT tests failed. @containers/packit-build please check. |
d0a63b0 to
f5302f5
Compare
|
Ephemeral COPR build failed. @containers/packit-build please check. |
|
TMT tests failed. @containers/packit-build please check. |
|
Ephemeral COPR build failed. @containers/packit-build please check. |
|
TMT tests failed. @containers/packit-build please check. |
|
Ephemeral COPR build failed. @containers/packit-build please check. |
|
TMT tests failed. @containers/packit-build please check. |
|
@jnovy @kolyshkin I realize this is a big refactoring, so PTAL when you've some time :-) The CI seems happy |
There was a problem hiding this comment.
Pull request overview
This PR refactors crun’s mount and rootfs setup flow to reduce reliance on /proc path lookups during early container setup, and to adopt the Linux “new mount API” (fsopen/fsmount/move_mount/mount_setattr). It also reorganizes when mount namespaces and pivot_root occur, with optional support for OPEN_TREE_NAMESPACE (Linux 7.0+) to replace the traditional unshare(CLONE_NEWNS)+pivot_root sequence.
Changes:
- Introduces a cached proc mount FD (
procfd) and threads FD-relative operations through mount helpers, reducing dependence on/procbeing mounted in the container filesystem during setup. - Migrates more mount operations to the new mount API and uses
mount_setattr()where possible (with fallbacks tomount(2)paths). - Reorders pivot/mount-namespace setup and adds
OPEN_TREE_NAMESPACEsupport; expands test coverage for these behaviors (tmpfs options, overlay, no-/proc specs, masked paths).
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tests/test_paths.py | Adds masked-path regression tests, including a case intended to ensure “crun” doesn’t leak into /run output. |
| tests/test_mounts.py | Adds new mount tests for fsopen-based tmpfs options, overlay premount behavior, no-/proc specs, and rbind option handling; updates tmpcopyup-on-/ expectation. |
| tests/init.c | Adds nlink and isdir helper commands used by new tests. |
| src/libcrun/utils.h | Adds get_self_fd_path() helper for procfd-relative self/fd/N paths. |
| src/libcrun/utils.c | Adjusts safe_openat() behavior for the rootfs="/" && path=="" case to avoid needing /proc readlink checks. |
| src/libcrun/syscalls.h | Adds missing MOUNT_ATTR_* fallback defines for portability across libc/kernel header versions. |
| src/libcrun/linux.h | Updates libcrun_do_pivot_root() signature to allow updating the rootfs string in-place. |
| src/libcrun/linux.c | Core refactor: procfd/old-root handling, mount_setattr translation, fsopen_mount data parsing, mount namespace setup changes, OPEN_TREE_NAMESPACE integration, notify socket path changes, and device/mount FD plumbing. |
| src/libcrun/container.c | Moves pivot_root earlier in init flow and aligns calls with new pivot API. |
| src/libcrun/cgroup.h | Refactors /proc/self/cgroup path composition to enable *_at() reads. |
| src/libcrun/cgroup-utils.h | Adds libcrun_get_cgroup_process_at() API. |
| src/libcrun/cgroup-utils.c | Factors parsing and adds an *_at() implementation for procfd-relative cgroup reads. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
74306ee to
0a1ee76
Compare
had a chance to look at this PR? |
jnovy
left a comment
There was a problem hiding this comment.
Nice work on this major refactoring. The architecture is sound -- procfd threading, new mount API with systematic fallback, and OPEN_TREE_NAMESPACE gating all look correct. A few suggestions below, none blocking.
| static uint64_t | ||
| ms_flags_to_mount_attr (uint64_t ms_flags) | ||
| { | ||
| uint64_t attr = ms_flags & (MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID | MOUNT_ATTR_NODEV | MOUNT_ATTR_NOEXEC); |
There was a problem hiding this comment.
suggestion: This works because the kernel intentionally made these MOUNT_ATTR_* values equal to their MS_* counterparts (1, 2, 4, 8), but it's non-obvious. Consider adding a comment or _Static_assert to document this assumption, similar to how the atime flags get explicit conversion below.
| /* Older kernels (seen on 4.18) fail with EINVAL if data is set when | ||
| setting MS_RDONLY. */ | ||
| if (flags & (MS_REMOUNT | MS_RDONLY)) | ||
| data = NULL; |
There was a problem hiding this comment.
suggestion: This nullifies data before the mount_setattr path (line 846). mount_setattr doesn't use data, so it's harmless there, but if mount_setattr fails and falls through to the legacy mount() call (line 857), the data is already lost. Consider moving data = NULL after the mount_setattr fallthrough, so the legacy path also benefits from the workaround but doesn't lose data prematurely.
| directory itself (rootfs="/", path=""). open("/") can | ||
| only return the root, and after setns the /proc-based | ||
| readlink may not be reachable by path yet. */ | ||
| if (rootfs[0] != '/' || rootfs[1] != '\0') |
There was a problem hiding this comment.
nit: The existing comment explains why the check is skipped (after setns, /proc may not be reachable). It would be helpful to also document the security invariant: this is safe because after pivot_root, "/" can only refer to the container root, so the readlink verification is redundant.
| if (UNLIKELY (fsfd < 0)) | ||
| return fsfd; | ||
|
|
||
| (void) syscall_fsconfig (fsfd, FSCONFIG_SET_STRING, "source", source_name ? source_name : type, 0); |
There was a problem hiding this comment.
question: Is silently ignoring fsconfig("source") failure intentional? For filesystems that require a source device (e.g., ext4), this would let the mount proceed without one, likely failing later at FSCONFIG_CMD_CREATE with a less clear error. Would a debug-level log be appropriate here?
| return dev_fds; | ||
| } | ||
|
|
||
| #define NUM_NEEDED_DEVS 6 |
There was a problem hiding this comment.
suggestion: This must be kept in sync with the needed_devs[] array (line 1927). Since needed_devs is file-scope, you could replace this with:
#define NUM_NEEDED_DEVS (sizeof (needed_devs) / sizeof (needed_devs[0]) - 1)(minus 1 for the sentinel entry). Alternatively, a _Static_assert(NUM_NEEDED_DEVS == ...) would catch drift.
| if (recursive) | ||
| *recursive = false; | ||
|
|
||
| if (src_nofollow != NULL) |
There was a problem hiding this comment.
note: This fixes a pre-existing bug -- the old code had if (src_nofollow == NULL) which would dereference NULL and skip initialization when non-NULL. Nice catch. Consider splitting this into its own commit for backport clarity, since it's a standalone bug fix independent of the refactoring.
The check was inverted as it dereferenced src_nofollow when the pointer was NULL instead of when it was non-NULL. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Replace setxattr via /proc/self/fd path with openat(procfd) + fsetxattr. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Replace chmod/chown via /proc/self/fd path with fchmodat/fchownat using the trusted procfd directory fd. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use procfd-relative paths for get_bind_mount and fstatfs instead of /proc/self/fd string paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use get_bind_mount with the cached procfd and relative self/fd path instead of formatting an absolute /proc/self/fd path and going through do_mount. Apply the extra mount flags from fstatfs via do_mount_setattr on the detached mount before moving it into place. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Replace libcrun_get_cgroup_process() with direct read via procfd to avoid dependency on /proc being mounted. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use mount_setattr() when available before falling back to the mount() syscall. mount_setattr() is fd-based and does not need /proc/self/fd paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use mount_setattr() when available before falling back to the mount() syscall. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Replace the slash-counting loop bound with fstat-based root detection. The old code counted '/' characters in the path string to cap iterations; for relative paths like "rootfs" (zero slashes) the loop ran only once, which could be insufficient. Compare inode/device of each directory with its parent instead: when ".." resolves to the same inode we have reached the filesystem root. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Try fsopen_mount()+fs_move_mount_to() for filesystem mounts and get_bind_mount()+fs_move_mount_to() for bind mounts before falling back to mount(). For propagation changes, try mount_setattr() first. This removes the dependency on /proc/self/fd paths for the initial mount and propagation steps when the new mount API is available. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
In user namespace containers, devices under /dev are created via bind mounts from the host. Pre-open the 6 standard devices (/dev/null, /dev/zero, /dev/full, /dev/tty, /dev/random, /dev/urandom) using open_tree(OPEN_TREE_CLONE) in the parent process after clone(), and send them to the child via the existing sync socket fd-passing mechanism. This reuses the same send_mounts/receive_mounts pattern already used for custom devices (dev_fds) and bind mount sources (mount_fds). This is a preparatory change for OPEN_TREE_NAMESPACE support, where host paths are not accessible after setns() into the new mount namespace. When open_tree() is not available or fails (e.g. rootless without CAP_SYS_ADMIN), the fds remain -1 and the child falls back to the existing bind mount path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Replace the openat(procfd) + fchmod/fchown approach with direct fchmodat/fchownat calls. These operate on the directory entry without opening the device file, avoiding ENXIO errors on character devices like /dev/tty that fail to open when no controlling terminal exists. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
and rename it Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
The masked-dir-nlink test checked that st_nlink >= 2 to verify a masked path is a directory. This fails on btrfs where directories have st_nlink=1 (btrfs does not count . and .. in nlink). Add an 'isdir' command to the test init binary and use S_ISDIR to verify the masked path is a directory, which works regardless of filesystem type. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Verify that containers can run without /proc, /sys, or cgroup mounts in the OCI spec. This exercises the procfd-based code paths that no longer depend on /proc/self/fd. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use the new mount API (fsopen/fsconfig/fsmount) to create detached mounts for each non-bind filesystem type before pivot_root. After the pivot, place them via move_mount so the kernel mnt_already_visible check is satisfied. set_mounts then mounts fresh instances on top with the correct OCI flags. This fixes mounting proc/sysfs/cgroup in containers that use a user namespace, where the kernel denies these mounts unless a mount of the same type is already visible. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use open_tree(OPEN_TREE_NAMESPACE) + setns(CLONE_NEWNS) to replace the traditional unshare(CLONE_NEWNS) + bind mount rootfs + pivot_root sequence. OPEN_TREE_NAMESPACE creates a new mount namespace with the rootfs as the root mount. setns() enters that namespace directly, so no bind mount or pivot_root is needed. The kernel automatically sets the process root and cwd to the new namespace's root when the old root is not reachable. On older kernels (< 7.0) or when OPEN_TREE_NAMESPACE is not supported, the code falls back to the traditional path. Closes: containers#2086 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
When fsopen() is not available (returns ENOSYS), the host-side device pre-creation in prepare_and_send_dev_mounts() cannot proceed since it relies on the new mount API to create a detached tmpfs. Instead of failing, send empty fds to the container process so that libcrun_create_dev() falls back to its existing legacy path (bind mounts or mknod) inside the container's mount namespace. Closes: containers#2104 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
0a1ee76 to
8f7d971
Compare
Closes #2086