chacha20: widen NEON bulk path to 8 parallel blocks by waamm · Pull Request #564 · RustCrypto/stream-ciphers

waamm · 2026-04-29T16:04:56Z

OpenSSL’s AArch64 ChaCha assembly uses an 8-block outer loop on its large-buffer path (see ChaCha20_512_neon in chacha-armv8.pl), whereas this crate previously used four parallel blocks.
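To illustrate what "parallel blocks" means here, the following is a minimal scalar sketch, not the crate's NEON backend. The names `chacha20_block`, `apply_keystream`, and the const parameter `PAR` are hypothetical and chosen only for this example. It assumes the RFC 8439 layout (32-bit counter, 96-bit nonce) and shows that the number of blocks produced per outer-loop iteration changes only how the work is batched, not the keystream itself. A SIMD implementation would compute the `PAR` blocks in interleaved vector registers rather than in an inner loop.

```rust
/// One ChaCha quarter round on four words of the state.
fn quarter_round(s: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize) {
    s[a] = s[a].wrapping_add(s[b]);
    s[d] = (s[d] ^ s[a]).rotate_left(16);
    s[c] = s[c].wrapping_add(s[d]);
    s[b] = (s[b] ^ s[c]).rotate_left(12);
    s[a] = s[a].wrapping_add(s[b]);
    s[d] = (s[d] ^ s[a]).rotate_left(8);
    s[c] = s[c].wrapping_add(s[d]);
    s[b] = (s[b] ^ s[c]).rotate_left(7);
}

/// Scalar ChaCha20 block function (RFC 8439 state layout).
fn chacha20_block(key: &[u32; 8], counter: u32, nonce: &[u32; 3]) -> [u8; 64] {
    let mut init = [0u32; 16];
    // "expand 32-byte k" constants
    init[..4].copy_from_slice(&[0x6170_7865, 0x3320_646e, 0x7962_2d32, 0x6b20_6574]);
    init[4..12].copy_from_slice(key);
    init[12] = counter;
    init[13..].copy_from_slice(nonce);

    let mut s = init;
    // 20 rounds = 10 double rounds (column round + diagonal round).
    for _ in 0..10 {
        quarter_round(&mut s, 0, 4, 8, 12);
        quarter_round(&mut s, 1, 5, 9, 13);
        quarter_round(&mut s, 2, 6, 10, 14);
        quarter_round(&mut s, 3, 7, 11, 15);
        quarter_round(&mut s, 0, 5, 10, 15);
        quarter_round(&mut s, 1, 6, 11, 12);
        quarter_round(&mut s, 2, 7, 8, 13);
        quarter_round(&mut s, 3, 4, 9, 14);
    }

    // Add the initial state back in and serialize as little-endian words.
    let mut out = [0u8; 64];
    for i in 0..16 {
        out[4 * i..4 * i + 4].copy_from_slice(&s[i].wrapping_add(init[i]).to_le_bytes());
    }
    out
}

/// XOR the keystream into `buf`, handling up to `PAR` blocks per outer
/// iteration. `PAR` must be non-zero. The inner loop stands in for the
/// work a SIMD backend would perform on `PAR` blocks at once.
fn apply_keystream<const PAR: usize>(
    key: &[u32; 8],
    mut counter: u32,
    nonce: &[u32; 3],
    buf: &mut [u8],
) {
    for chunk in buf.chunks_mut(64 * PAR) {
        for block in chunk.chunks_mut(64) {
            let ks = chacha20_block(key, counter, nonce);
            for (b, k) in block.iter_mut().zip(ks.iter()) {
                *b ^= k;
            }
            counter = counter.wrapping_add(1);
        }
    }
}
```

Calling `apply_keystream::<4>` and `apply_keystream::<8>` with the same key, counter, and nonce yields identical output; in a real backend the wider batch only gives the CPU more independent work per loop iteration.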

On an Apple M4 Max, this gives ~~30%~~ ~45% higher throughput on bulk apply_keystream:

tarcieri

Confirmed these speedups on an M1 Max. Mine weren't quite as good, closer to a 30% improvement, but it still seems worth it.

Edit: funny, because @waamm says "On an Apple M4 Max, this gives ~30% higher throughput" but the image shows 40%+?

chacha20: widen NEON bulk path to 8 parallel blocks

7780df5

tarcieri approved these changes May 5, 2026

tarcieri merged commit 51bb585 into RustCrypto:master May 5, 2026
32 checks passed

tarcieri merged 1 commit into RustCrypto:master from waamm:chacha20/neon-parblocks-u8
