chacha20: widen NEON bulk path to 8 parallel blocks #564

Merged
tarcieri merged 1 commit into RustCrypto:master from waamm:chacha20/neon-parblocks-u8
May 5, 2026

@waamm (Contributor) commented Apr 29, 2026

OpenSSL’s AArch64 ChaCha assembly uses an 8-block outer loop on its large-buffer path (see ChaCha20_512_neon in chacha-armv8.pl), whereas this crate previously used four parallel blocks.

On an Apple M4 Max, this gives ~45% (originally stated as ~30%) higher throughput on bulk apply_keystream:
[Image: benchmark screenshot, 2026-04-29 at 17:30:06]
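
For readers unfamiliar with what "parallel blocks" means here, the following is a minimal portable sketch, not the crate's NEON code. It keeps each of the 16 ChaCha state words as an array of `PAR_BLOCKS` lanes, one lane per block, so every quarter-round step operates on all blocks at once. The names `PAR_BLOCKS`, `Lanes`, and `chacha20_wide` are hypothetical and chosen for illustration; a real NEON backend would use vector registers and intrinsics instead of plain arrays.

```rust
/// Blocks produced per outer-loop iteration (8 after this PR, 4 before).
const PAR_BLOCKS: usize = 8;

/// One ChaCha state word, replicated across all parallel blocks.
type Lanes = [u32; PAR_BLOCKS];

/// "expand 32-byte k" as little-endian words.
const CONSTANTS: [u32; 4] = [0x6170_7865, 0x3320_646e, 0x7962_2d32, 0x6b20_6574];

/// Lane-wise wrapping addition.
fn add(a: Lanes, b: Lanes) -> Lanes {
    std::array::from_fn(|i| a[i].wrapping_add(b[i]))
}

/// Lane-wise XOR followed by a left rotation.
fn xor_rotl(a: Lanes, b: Lanes, n: u32) -> Lanes {
    std::array::from_fn(|i| (a[i] ^ b[i]).rotate_left(n))
}

/// The ChaCha quarter round, applied to every lane at once.
fn quarter_round(s: &mut [Lanes; 16], a: usize, b: usize, c: usize, d: usize) {
    s[a] = add(s[a], s[b]);
    s[d] = xor_rotl(s[d], s[a], 16);
    s[c] = add(s[c], s[d]);
    s[b] = xor_rotl(s[b], s[c], 12);
    s[a] = add(s[a], s[b]);
    s[d] = xor_rotl(s[d], s[a], 8);
    s[c] = add(s[c], s[d]);
    s[b] = xor_rotl(s[b], s[c], 7);
}

/// Produces PAR_BLOCKS consecutive 64-byte keystream blocks,
/// starting at block counter `counter`.
fn chacha20_wide(key: &[u32; 8], nonce: &[u32; 3], counter: u32) -> [[u8; 64]; PAR_BLOCKS] {
    // Build the initial state; only the counter word differs between lanes.
    let mut init = [[0u32; PAR_BLOCKS]; 16];
    for lane in 0..PAR_BLOCKS {
        let mut words = [0u32; 16];
        words[..4].copy_from_slice(&CONSTANTS);
        words[4..12].copy_from_slice(key);
        words[12] = counter.wrapping_add(lane as u32);
        words[13..].copy_from_slice(nonce);
        for w in 0..16 {
            init[w][lane] = words[w];
        }
    }

    // 20 rounds = 10 double rounds (column round, then diagonal round).
    let mut s = init;
    for _ in 0..10 {
        quarter_round(&mut s, 0, 4, 8, 12);
        quarter_round(&mut s, 1, 5, 9, 13);
        quarter_round(&mut s, 2, 6, 10, 14);
        quarter_round(&mut s, 3, 7, 11, 15);
        quarter_round(&mut s, 0, 5, 10, 15);
        quarter_round(&mut s, 1, 6, 11, 12);
        quarter_round(&mut s, 2, 7, 8, 13);
        quarter_round(&mut s, 3, 4, 9, 14);
    }

    // Add the initial state back in and serialize each lane as one block.
    let mut out = [[0u8; 64]; PAR_BLOCKS];
    for lane in 0..PAR_BLOCKS {
        for w in 0..16 {
            let v = s[w][lane].wrapping_add(init[w][lane]);
            out[lane][4 * w..4 * w + 4].copy_from_slice(&v.to_le_bytes());
        }
    }
    out
}
```

Changing `PAR_BLOCKS` from 4 to 8 in this sketch doubles the keystream produced per call without touching the round logic, which is the shape of the change discussed in this PR. Whether a wider loop is actually faster depends on the target CPU, so it needs to be benchmarked as done above.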

@tarcieri (Member) left a comment

Confirmed these speedups on an M1 Max. Mine weren't quite as good, closer to a 30% improvement, but it still seems worth it.

Edit: funny, because @waamm says "On an Apple M4 Max, this gives ~30% higher throughput" but the image shows 40%+?

@tarcieri tarcieri merged commit 51bb585 into RustCrypto:master May 5, 2026
32 checks passed