chore: Harden kernel + searcher container against AF_ALG/algif_* bug class by MoeMahhouk · Pull Request #142 · flashbots/flashbots-images

MoeMahhouk · 2026-05-04T10:28:13Z

Summary

Defense-in-depth follow-up to #138 (the kernel bump that fixes CVE-2026-31431 / copy.fail). The kernel-level vulnerability is already patched; this PR removes the AF_ALG userspace surface that the copy.fail exploit chain runs through, so any future bug in the same family doesn't have a ready entry point on this image.

Three commits, ordered by increasing scope of removal, so the last one (or two) is easy to drop if pre-merge testing surfaces an unexpected AF_ALG consumer.

Changes

Block AF_ALG in searcher container seccomp profile

Container-only. The existing socket() rule already blocks AF_VSOCK (family 40); extend the same rule to also deny AF_ALG (family 38). One rule, two AND-ed constraints per the OCI seccomp spec. No build/image impact, only runtime behaviour inside the searcher container.

Drop `CONFIG_CRYPTO_USER_API_*`

Removes the AF_ALG family from the kernel entirely (host-wide, not just inside the searcher container). Verified by source-reading that no userspace process on the image consumes AF_ALG:

tdx-init: Go stdlib crypto/hmac + crypto/sha256 (pure userspace), then shells out to cryptsetup.
cryptsetup: Debian build uses libgcrypt + libargon2 in userspace; dm-crypt uses the in-kernel skcipher API directly, not via the AF_ALG userspace surface.
Lighthouse and the rest of the image userspace: ring / aes-gcm.

The previous # For tdx-init annotation in 10-bob was inaccurate — these flags weren't actually being used. Replaced with a comment explaining the rationale.

Pin `CONFIG_CRYPTO_AUTHENCESN=n`

authencesn is the AEAD template at the heart of the copy.fail bug. Its only intended in-tree consumer is the kernel's IPsec stack when a tunnel is configured with the Extended Sequence Number option, and IPsec is fully disabled on this image (CONFIG_INET_AH/ESP/INET6_AH/INET6_ESP all "not set" in 01-sane-defaults). Pinning it off explicitly removes the algorithm even if Debian's cloud config inherits =y.
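The "pinned off" claim can be checked mechanically against a built kernel's .config. The following is a minimal sketch, not the project's tooling: the symbol names are the ones mentioned in this description, while `parse_config`, `still_enabled` and the `SAMPLE` text are hypothetical and only for illustration.

```python
import re

# Symbols this PR description says should be off. The sample .config text
# further down is hypothetical, not copied from the image.
EXPECTED_OFF = [
    "CONFIG_CRYPTO_AUTHENCESN",
    "CONFIG_INET_AH",
    "CONFIG_INET_ESP",
    "CONFIG_INET6_AH",
    "CONFIG_INET6_ESP",
]

_SET = re.compile(r"^(CONFIG_[A-Za-z0-9_]+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_[A-Za-z0-9_]+) is not set$")


def parse_config(text):
    """Map each symbol in a kernel .config to its value ('n' if not set)."""
    symbols = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            symbols[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            symbols[m.group(1)] = "n"
    return symbols


def still_enabled(text, expected_off=EXPECTED_OFF):
    """Return the expected-off symbols that are y or m in the config."""
    symbols = parse_config(text)
    # A symbol absent from .config is also disabled, so default to 'n'.
    return [s for s in expected_off if symbols.get(s, "n") in ("y", "m")]


SAMPLE = """\
# CONFIG_CRYPTO_AUTHENCESN is not set
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
CONFIG_CRYPTO_SHA256=y
"""

if __name__ == "__main__":
    print(still_enabled(SAMPLE))
```

An empty list means none of the listed symbols resolved to y or m. Running it against the real post-olddefconfig .config, rather than the snippet files, is what would catch an inherited Debian default.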

Pre-merge verification

Boot the rebuilt image; journalctl -b | grep -i 'AF_ALG\|algif' is empty (no warnings about a missing crypto API).
tdx-init set-passphrase end-to-end works (LUKS format → token import → header restore → MAC write → open → ext4 mount).
lighthouse restarts cleanly via systemctl restart lighthouse.
Searcher container init path completes (init-container.sh runs, container reaches running state, sshd listens on the published port).
Optional: strace -e trace=socket podman exec searcher-container <typical-cmd> 2>&1 | grep AF_ALG returns nothing.
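As a complement to the strace check, a direct probe can show whether a given socket family is reachable from inside the container or on the host. This is a rough sketch assuming a Linux host with Python available; `probe_family` is a hypothetical helper, and the family numbers are the ones quoted in this PR.

```python
import errno
import socket

# Linux address-family numbers quoted in this PR.
AF_ALG = 38
AF_VSOCK = 40


def probe_family(family, sock_type=socket.SOCK_SEQPACKET):
    """Try socket(family, sock_type, 0) and describe what happened."""
    try:
        s = socket.socket(family, sock_type, 0)
    except OSError as exc:
        # EAFNOSUPPORT usually means the family is not registered in the
        # kernel; EPERM is often what a seccomp errno rule returns.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    s.close()
    return "available"


if __name__ == "__main__":
    print("AF_ALG", probe_family(AF_ALG, socket.SOCK_SEQPACKET))
    print("AF_VSOCK", probe_family(AF_VSOCK, socket.SOCK_STREAM))
```

Anything other than "available" for AF_ALG inside the searcher container would be consistent with the seccomp rule taking effect, though the errno alone does not prove which layer rejected the call.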

If anything regresses, individual commits can be git revert'd cleanly without disturbing the others.

Not in scope

This PR is not the fix for CVE-2026-31431; that fix landed in #138 via the kernel bump. This PR removes the userspace entry point so the same bug class can't be reached again on this image without both a kernel CVE and a config regression.

References

CVE-2026-31431 (copy.fail): https://security-tracker.debian.org/tracker/CVE-2026-31431
Upstream fix: https://git.kernel.org/linus/a664bf3d603dc3bdcf9ae47cc21e0daec706d7a5
Initial kernel bump: Bump kernel to 6.19 + Debian snapshot 20260430 to fix CVE-2026-31431 #138

Defense in depth against the AF_ALG/algif_aead syscall surface that copy.fail (CVE-2026-31431) abuses. The existing socket() rule already blocks AF_VSOCK (family 40); extend the same rule to also block AF_ALG (family 38). Multiple args in a single seccomp rule are AND-ed per the OCI spec, so the rule now allows socket() only when arg[0] is neither 40 nor 38.

The AF_ALG userspace crypto API (algif_hash / algif_skcipher / algif_rng / algif_aead) was enabled with a '# For tdx-init' annotation, but tdx-init itself uses Go's stdlib crypto/hmac + crypto/sha256 (pure userspace) and shells out to cryptsetup, which on Debian uses libgcrypt + libargon2 for PBKDF and dm-crypt for actual block encryption -- dm-crypt talks to the in-kernel skcipher API directly, not via the AF_ALG userspace surface. Lighthouse and the rest of the image userspace use ring / aes-gcm. Removing the surface eliminates the entry point for CVE-2026-31431 (copy.fail) at the kernel level and shrinks the surface for any future algif_* CVE. Pre-merge: boot the rebuilt image and confirm 'journalctl -b' has no AF_ALG/algif_* warnings, and that 'tdx-init set-passphrase' / lighthouse restart / searcher container init paths all work end-to-end.

authencesn is an AEAD template whose only intended in-tree consumer is the IPsec/XFRM stack when an SA has the Extended Sequence Number flag set. IPsec is disabled on this image (CONFIG_INET_AH/ESP/INET6_AH/INET6_ESP all 'not set' in 01-sane-defaults), so authencesn has no in-tree user here. Pinning it off explicitly removes the algorithm even if Debian's cloud config inherits it as =y, and removes the specific code path that the copy.fail bug rearranges -- belt-and-suspenders alongside the AF_ALG removal in the previous commit.

…er seccomp profile

Defense in depth against the RxRPC and PF_KEY/XFRM kernel codepaths. The existing socket() rule already blocks AF_VSOCK (40) and AF_ALG (38); extend the same rule to also block AF_RXRPC (45) and AF_KEY (15). Numeric values verified against include/linux/socket.h (PF_RXRPC = 45, PF_KEY = 15) -- same lesson learned from copy.fail, where the rule intended to block AF_ALG was blocking AF_VSOCK because the constant was off by two.

Multiple args in a single seccomp rule are AND-ed per the OCI spec, so the rule now allows socket() only when arg[0] is none of {15, 38, 40, 45}.

The host kernel does not currently compile any of these families in (MODULES=n + CONFIG_AF_RXRPC=m / CONFIG_NET_KEY=m in the Debian base both resolve to 'not set' after olddefconfig), so socket() with these families already returns EAFNOSUPPORT. This change makes the rejection explicit at the seccomp layer, which keeps the path closed even if a future kernel-config edit re-enables one of these families.

No legitimate searcher workload uses AF_RXRPC (kernel AFS client) or AF_KEY (legacy IPsec keying interface). The container's egress firewall in init-container.sh already blocks the relevant network paths.

Three more kernel codepaths with no in-tree user on any flashbots image, joining the existing # CONFIG_INET_AH/ESP/INET6_AH/INET6_ESP/NET_KEY disables in this file:

- AF_RXRPC + RXKAD: kernel RxRPC session sockets and Kerberos security, used only by the in-kernel AFS filesystem client. No image runs an AFS client, no userspace opens AF_RXRPC sockets.
- XFRM_USER: netlink control interface for XFRM transforms (`ip xfrm`, strongSwan, libreswan). The image firewall is iptables; no IPsec daemon runs anywhere. With INET_AH/ESP/INET6_AH/INET6_ESP/NET_KEY already off, XFRM has no transforms to configure -- the netlink control interface is dead surface.

Debian's cloud-amd64 base config has CONFIG_AF_RXRPC=m, CONFIG_RXKAD=y, CONFIG_XFRM_USER=m. CONFIG_MODULES is unset on this image (00-no-modules), so olddefconfig already resolves AF_RXRPC and XFRM_USER to 'not set', and RXKAD follows because it sits inside `if AF_RXRPC` in net/rxrpc/Kconfig. RXKAD is the one to watch -- a straight `=y` in Debian, not auto-disabled by MODULES=n alone, so an explicit pin is the only thing that keeps it off if the surrounding config drifts.

Pinning the three explicitly removes the inference step and keeps the kernel attack surface small if a future Debian config or kconfig snippet edit changes a default. Mirrors the same belt-and-suspenders pattern used for AUTHENCESN and the AF_ALG family elsewhere in this branch.

Followup to the previous commit that pinned AF_RXRPC/RXKAD/XFRM_USER off. XFRM_USER is the netlink config interface; this commit pins the rest of the XFRM machinery so no XFRM code is compiled into the kernel at all.

In net/xfrm/Kconfig:

- CONFIG_XFRM (bool, no default) is selected only by transforms (INET_ESP/AH/IPCOMP, INET6_ESP/AH/IPCOMP, NET_KEY, XFRM_USER, XFRM_INTERFACE). All are 'not set' on this image (NET_KEY, INET[6]_AH/ESP/IPCOMP earlier in this file; XFRM_USER in the previous commit; XFRM_INTERFACE depends on IPV6 which is off).
- CONFIG_XFRM_ALGO (tristate, no default) is selected by the same transform protocols, all off.
- CONFIG_XFRM_ESPINTCP (bool) is the ESP-in-TCP encap glue, only meaningful with ESP, which is off.

So all three resolve to 'not set' via olddefconfig already; the explicit pin removes the inference step and stays correct if a future kconfig snippet edit selects something that pulls XFRM back in.

Functional impact: none. Verified that NET_IP_TUNNEL/NET_UDP_TUNNEL, TLS, KVM, HYPERV, VIRTIO, container runtime, dropbear, and the flashbox firewall do not depend on XFRM. NETFILTER_XT_MATCH_POLICY depends on XFRM and is the only iptables match that does -- flashbox firewall scripts do not use `-m policy` (grepped 0 hits in init-firewall.sh, toggle, and the per-image firewall-config files), so its absence is invisible.

Removes the kernel-side primitive used by the ESP-in-UDP MSG_SPLICE_PAGES no-COW page-cache writes (Copy_Fail2 / Dirty Frag's ESP path) at the strongest layer: the ESP code is not even compiled in.

…DEVMEM)

Three kernel features that auto-enable on this image despite having no consumer:

  init/Kconfig:     config IO_URING       bool "..." if EXPERT, default y
  io_uring/Kconfig: config IO_URING_ZCRX  def_bool y, depends on IO_URING + PAGE_POOL + INET + NET_RX_BUSY_POLL
  net/Kconfig:      config NET_DEVMEM     def_bool y, depends on DMA_SHARED_BUFFER + PAGE_POOL

All three are kernel zero-copy-IO paths -- io_uring is the broader async I/O subsystem; IO_URING_ZCRX (6.15+) is its receive-into-registered-memory variant; NET_DEVMEM (6.11+) is the socket-level receive-into-device-memory variant ("devmem TCP"). They share the underlying net_iov / page-pool memory-provider machinery.

Grepping the image confirms no consumer:

- lighthouse + rbuilder use tokio with the default mio/epoll reactor (no tokio-uring), confirmed by greps for liburing / io_uring / tokio_uring across the in-tree Rust sources.
- tdx-init is Go; Go runtime poller uses epoll on Linux.
- systemd / podman / runc / dropbear / chrony / iptables / conntrack do not use io_uring or devmem.
- No GPU / DRM / media drivers in the kernel snippets, so DMA_SHARED_BUFFER's selectors are not present either.

What disabling these closes:

1. An OOB heap write in io_uring/zcrx.c:io_zcrx_return_niov_freelist() (freelist[] free_count not bounds-checked; 4-byte OOB into adjacent slab). Disclosed 2026-05-06; hardening commit 770594e is in mainline 2026-04-21 but not in linux-source-6.19_6.19.13-1~bpo13+1 which we ship.
2. io_uring as a whole, which has been a recurring CVE factory (CVE-2023-21400, CVE-2024-1086, CVE-2024-26581, CVE-2024-50266, ...). KSPP and several distros (ChromeOS, parts of AWS Bottlerocket) disable it by default on production servers.
3. NET_DEVMEM, which currently resolves to n via olddefconfig because DMA_SHARED_BUFFER has no selector on this image, but pinning it explicit keeps that property stable if a future driver pull-in selects DMA_SHARED_BUFFER -- same inference gap that prompted pinning XFRM core/algo/espintcp in the previous commit.

EXPERT is already y in the Debian base config, so the `if EXPERT` prompt gate on IO_URING is non-binding -- olddefconfig respects the explicit "is not set" line. IO_URING_ZCRX would follow automatically (`depends on IO_URING`), but pinning it explicit makes the disable visible at the source.

Container-side note: the searcher container's seccomp profile (defaultAction: SCMP_ACT_ERRNO, allow-listed syscalls only) does not include any io_uring_* in its allow list, so io_uring was already blocked there by default-deny. This commit removes the code from the kernel binary entirely; an explicit io_uring_* deny rule is added in a separate commit for belt-and-suspenders.
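The dependency reasoning in this commit message can be sketched as a toy resolver. It is only an illustration of the `def_bool y` plus `depends on` logic quoted above, not a reimplementation of olddefconfig; the `resolve` helper is hypothetical, and the leaf values for PAGE_POOL, INET and NET_RX_BUSY_POLL are assumed to be y.

```python
# Dependencies as quoted in the commit message above.
DEF_BOOL_Y = {
    "IO_URING_ZCRX": ["IO_URING", "PAGE_POOL", "INET", "NET_RX_BUSY_POLL"],
    "NET_DEVMEM": ["DMA_SHARED_BUFFER", "PAGE_POOL"],
}


def resolve(leaves):
    """Toy model: a def_bool y symbol ends up y only if every dep is y.

    `leaves` maps the symbols that are set elsewhere (by a prompt, a
    default, or a select) to True/False. The values used in the demo
    below are assumptions for illustration.
    """
    values = dict(leaves)
    for sym, deps in DEF_BOOL_Y.items():
        values[sym] = all(values.get(d, False) for d in deps)
    return values


# Assumed state before this commit: IO_URING on, DMA_SHARED_BUFFER unselected.
BEFORE = {
    "IO_URING": True,
    "PAGE_POOL": True,
    "INET": True,
    "NET_RX_BUSY_POLL": True,
    "DMA_SHARED_BUFFER": False,
}

if __name__ == "__main__":
    print(resolve(BEFORE))
    # Pinning IO_URING off takes IO_URING_ZCRX with it.
    print(resolve({**BEFORE, "IO_URING": False}))
    # A future driver selecting DMA_SHARED_BUFFER would flip NET_DEVMEM on.
    print(resolve({**BEFORE, "DMA_SHARED_BUFFER": True}))
```

The third case is the inference gap the message describes: nothing about NET_DEVMEM itself changes, yet it switches on once an unrelated symbol starts selecting DMA_SHARED_BUFFER.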

Debian's cryptsetup 2.8.1 is built with KERNEL_CAPI (visible in `cryptsetup --version` flags), and libcryptsetup in this binary has no openssl/gcrypt userspace backend compiled in. It hard-fails at startup with "Cannot initialize crypto backend" if AF_ALG is unavailable. tdx-init shells out to cryptsetup for LUKS2 format/open/resize/token operations, so without AF_ALG the persistent disk can never be initialized and the image cannot boot far enough to mount /persistent.

The prior disable of all CRYPTO_USER_API_* was based on a code-path audit that under-counted what libcryptsetup actually uses at runtime. Verified on a dev image with strace + cryptsetup --debug on a loopback:

  # Running pbkdf2(sha256) benchmark.   <- algif_hash
  # Running argon2id() benchmark.       <- userspace libargon2
  # Updating keyslot area [0x8000].     <- algif_skcipher

Re-enable the minimum needed for that flow: the AF_ALG umbrella, HASH (PBKDF2 + MAC), and SKCIPHER (AES-XTS keyslot encryption).

Keep _AEAD and _RNG explicitly off as kernel attack-surface hardening:

- _AEAD: not used by the LUKS2 default flow (aes-xts-plain64). It is the most exposed AF_ALG subfamily; keeping it off removes that interface at the syscall layer.
- _RNG: cryptsetup reads /dev/urandom directly for random data; it does not open algif_rng.

Searcher container exposure is unchanged: the seccomp profile in modules/flashbox/common/mkosi.extra/etc/containers/seccomp.json blocks socket() for AF_ALG (family 38), so re-enabling _HASH and _SKCIPHER on the host kernel does not widen the container's syscall surface.
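For readers unfamiliar with what an algif_hash consumer looks like, here is a small sketch using Python's standard `socket` module on Linux. It is not how cryptsetup is implemented; it only exercises the same kernel interface (AF_ALG with type "hash") that the strace output above attributes to the PBKDF2 benchmark. `af_alg_sha256` is a hypothetical helper name.

```python
import hashlib
import socket


def af_alg_sha256(data):
    """Hash `data` through the kernel's algif_hash interface.

    Returns the digest, or None when AF_ALG (or the hash type) is not
    reachable: family compiled out, blocked by seccomp, or non-Linux host.
    """
    family = getattr(socket, "AF_ALG", None)
    if family is None:
        return None
    try:
        with socket.socket(family, socket.SOCK_SEQPACKET, 0) as alg:
            # Bind to the algorithm, then accept() an operation socket.
            alg.bind(("hash", "sha256"))
            op, _ = alg.accept()
            with op:
                op.sendall(data)
                return op.recv(32)  # SHA-256 digest is 32 bytes
    except OSError:
        return None


if __name__ == "__main__":
    digest = af_alg_sha256(b"abc")
    if digest is None:
        print("AF_ALG hash interface not reachable")
    else:
        # Cross-check the kernel result against the userspace implementation.
        print(digest == hashlib.sha256(b"abc").digest())
```

On a kernel with CRYPTO_USER_API_HASH enabled the two digests should agree; with the family removed or blocked, the helper returns None, which is the situation in which a KERNEL_CAPI-only cryptsetup build fails to initialize its backend according to the message above.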

alexhulbert · 2026-05-12T03:48:46Z

don't we need these for making sure cryptsetup is secure? or was that vulnerability patched?

MoeMahhouk · 2026-05-12T15:40:25Z

> don't we need these for making sure cryptsetup is secure? or was that vulnerability patched?

Which ones need to be there, and what vulnerability are you referring to?
Could you link it here please so I can investigate?

Replaces the previous seccomp extension attempt (commits 8f3342c and b459f19) with two changes:

1. Sync the bundled profile to moby/profiles main (currently tagged seccomp/v0.2.1):
   https://github.com/moby/profiles/blob/main/seccomp/default.json

   The relevant upstream change is dec315c (2026-04-30, "seccomp: Block AF_ALG in default socket policy"), which restructures the socket rule from a single `SCMP_CMP_NE 40` into three range-based rules:

     ALLOW arg0 < 38
     ALLOW arg0 == 39
     ALLOW arg0 > 40

   Net effect: AF_ALG (38) and AF_VSOCK (40) both block at the seccomp layer. AF_VSOCK preservation comes along for free; AF_ALG blocking is the headline copy.fail (CVE-2026-31431) mitigation. This subsumes 8f3342c and the AF_VSOCK preservation half of b459f19.

2. Append three SCMP_ACT_ERRNO rules with errnoRet=97 (EAFNOSUPPORT) for AF_KEY (15), AF_RXRPC (33), and AF_MCTP (45). These families are also pinned off at the kernel layer (CONFIG_NET_KEY=n, CONFIG_AF_RXRPC=n, CONFIG_MCTP=n -- the last one in the next commit). The seccomp rules are tripwires in case the kernel config ever drifts; errnoRet=97 keeps the seccomp block visually indistinguishable from the kernel's own "family not registered" response.

Why not just keep b459f19's socket rule? It packed four SCMP_CMP_NE conditions on arg0 into one seccomp_rule_add call, which libseccomp documents as supporting only one comparison per arg per rule (upstream issue #118, manpage clarification PR #225). On libseccomp 2.6.0 it silently produces a BPF tree where AF_KEY/AF_ALG/AF_MCTP fall through to ALLOW and the preserved AF_VSOCK block regresses. The LT/EQ/GT upstream pattern sidesteps that case because its three rules occupy different libseccomp op-priority levels; the layered EQ rules added here sit in priority-3 alongside upstream's `EQ 39` with disjoint datums and emit reachable BPF.

Verified end-to-end on podman 5.8.2 + crun: in a fresh container, AF_KEY/AF_RXRPC/AF_ALG/AF_VSOCK/AF_MCTP all block, and every other family produces identical output to `seccomp=unconfined`.
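The intended net effect of the socket() policy can be modelled in a few lines. This sketch only restates the outcome described in the commit message; it does not model how libseccomp orders or compiles the rules into BPF. The function names are hypothetical, and the assumption that the EQ tripwires win over the `arg0 < 38` and `arg0 > 40` range rules is taken from the end-to-end result reported above.

```python
import errno

# Tripwire families appended in this commit, each meant to return EAFNOSUPPORT.
TRIPWIRES = {15: "AF_KEY", 33: "AF_RXRPC", 45: "AF_MCTP"}


def upstream_allows(family):
    """The three upstream range rules: arg0 < 38, arg0 == 39, arg0 > 40."""
    return family < 38 or family == 39 or family > 40


def socket_policy(family):
    """Return 'ALLOW' or a label for the rejection the policy should produce.

    Assumption: the EQ tripwires take effect ahead of the range rules,
    matching the verified behaviour described in the commit message.
    """
    if family in TRIPWIRES:
        return errno.errorcode[errno.EAFNOSUPPORT]
    if upstream_allows(family):
        return "ALLOW"
    # No ALLOW rule matched: the profile's default errno action applies.
    return "DEFAULT_ERRNO"


if __name__ == "__main__":
    blocked = [f for f in range(0, 47) if socket_policy(f) != "ALLOW"]
    print(blocked)
```

Enumerating families 0 through 46 with this model gives the five blocked values named in the commit message (15, 33, 38, 40, 45), which makes it a convenient table to compare against an in-container probe.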

AF_MCTP (Management Component Transport Protocol) has no consumer on this image. Same rationale as the AF_RXRPC and RXKAD pins in e5c78b3: reduce kernel attack surface by not registering the family at all. Pairs with the SCMP_CMP_EQ 45 rule in the previous commit, which acts as a seccomp-layer tripwire if this kernel pin ever drifts back to =y.

alexhulbert · 2026-05-13T00:51:10Z

@MoeMahhouk This one: https://gitlab.com/cryptsetup/cryptsetup/-/issues/954

Double checked and it's not a problem 👍

MoeMahhouk added 3 commits May 4, 2026 10:12

MoeMahhouk requested a review from niccoloraspa May 4, 2026 10:28

MoeMahhouk added 5 commits May 8, 2026 09:10

shashial added 2 commits May 12, 2026 21:37

MoeMahhouk marked this pull request as ready for review May 13, 2026 09:12

MoeMahhouk wants to merge 10 commits into main from moe/copy-fail-defense-in-depth

3 participants
