jit/a64: avoid redundant icache flush on macOS after generateProgram by Frobitz66 · Pull Request #333 · tevador/RandomX

Frobitz66 · 2026-06-06T18:13:56Z

This is my first ever PR for an open-source project so bear with me.
The intent was to improve H/Joule performance on Apple Silicon.
I used Claude Code to do research on the problem and to identify areas for improvement.
Of all the areas that we tested, this was the only one that was worth attempting to improve. (yielding between 1-3% improvement)
Mea Culpa: The code was written by Claude. I want to be clear about that up-front. The specific instructions given indicated:

No harm
No touching anything except code pertaining to Apple Silicon
Clear testing to demonstrate improvement
Clear documentation to demonstrate what was changed.

Hopefully, this is something beneficial to RandomX.

-TS Abdallah
(Erie, Colorado)

PR: jit/a64: avoid redundant icache flush on macOS after generateProgram

Branch: apple-silicon-jit-icache-fix
Target: tevador/RandomX master
Commit: 199f910

Problem

On macOS 11+ (all Apple Silicon), JitCompilerA64::enableExecution calls
setPagesRX, which flushes the entire 20 KB JIT code buffer via
__builtin___clear_cache on every hash iteration.

This flush is redundant. generateProgram (and generateProgramLight /
generateSuperscalarHash) already calls __builtin___clear_cache over the
exact modified range immediately before enableExecution is invoked — see
jit_compiler_a64.cpp lines 214, 311, 428. The full-buffer flush in
setPagesRX re-invalidates bytes that were just cleaned, covering a ~5×
larger range than what was written.

On Apple Silicon, sys_icache_invalidate serialises the instruction
pipeline. Profiling on M4 Pro with sample shows this call consuming
~3% of all CPU samples, rooted in the setPagesRX path inside
enableExecution.

Fix

Add setPagesRXKeepIcache to virtual_memory.c/h. On the macOS 11+ path
it calls only pthread_jit_write_protect_np(1) — toggling write-protect
without the redundant cache flush. On all other platforms it falls through
to the identical pageProtect(PAGE_EXECUTE_READ) call that setPagesRX
uses, so behaviour is completely unchanged outside macOS.

JitCompilerA64::enableExecution switches to setPagesRXKeepIcache.

Call-site verification

Every call to enableExecution on the A64 JIT is preceded by a targeted
__builtin___clear_cache from a generate function:

Call site	Preceding flush
`CompiledVm::run()`	`generateProgram` → `clear_cache(MainLoopBegin, codePos)`
`CompiledLightVm::run()`	`generateProgramLight` → `clear_cache(MainLoopBegin, codePos)`
`CompiledLightVm::setCache()`	`generateSuperscalarHash` → `clear_cache(CodeSize, codePos)`
`dataset.cpp::initCacheCompile()`	`generateSuperscalarHash` → `generateDatasetInitCode` (empty on A64)

There are no early-return paths that skip the generate-function flush.

Platform impact

Platform	Behaviour
macOS 11+ / Apple Silicon	`enableExecution` skips redundant flush
macOS < 11	Falls through to `pageProtect` — identical to before
Linux / Windows / other	Falls through to `pageProtect` — identical to before
x86 / RV64 JIT backends	Not affected — only `JitCompilerA64::enableExecution` changes

Note on x86: JitCompilerX86::generateProgram does not call
__builtin___clear_cache, so its flush in enableExecution is the only
one and is not redundant. Left unchanged.

Note on RV64: jit_compiler_rv64.cpp::enableExecution has the same
redundant double-flush pattern as A64 (both generateProgram and
enableExecution call cache flush functions). That is out of scope here.

Evidence

Profiler (primary): sample on a 30-second run shows
sys_icache_invalidate consuming ~3% of CPU samples, all rooted in
setPagesRX inside enableExecution. This code path is eliminated by
the patch.

Benchmark (supporting):

Hardware: Apple M4 Pro, 10 P-cores, 48 GB, macOS 26.5.1
Compiler: Apple clang 21.0.0 (-march=armv8-a+crypto -mcpu=apple-m4 -O3)
Flags: --mine --jit --init 10 --threads 10 --nonces 480000

	Baseline (n=7)	Patched (n=7)	Δ
Mean H/s	5534.8	5601.2	+66 (+1.2%)
Stdev	57.5	28.7	variance reduced

The benchmark delta (+1–3% across trial sets) is within the noise floor of
a single-session comparison. The directional signal is consistent and the
reduced variance in the patched binary (stdev 29 vs 58) is consistent with
removing irregular pipeline-serialising flushes.

Correctness: Hash output identical to unpatched binary for identical
nonce sequences (--verify --jit --nonces 5 →
9b00372136cc7b31ba871e198d9ed1a57bd5d2f5616249009aba753cfa4cd413).

Related branches

pr-applearm (4d940e9) is a stale unmerged branch that proposes
replacing __builtin___clear_cache with sys_icache_invalidate in
generateProgram, generateProgramLight, and generateSuperscalarHash.
That branch is based on virtual_memory.hpp (since renamed to .h) and
does not apply cleanly to current master. It is complementary to this
patch: it would make the generation-time flush faster, while this patch
eliminates the redundant second flush. They could coexist without conflict.

pr-applejit (8de238a) proposed pthread_jit_write_protect_np in
setPagesRW/setPagesRX directly; the core idea was subsequently
incorporated into master in a more complete form.

Files changed

src/virtual_memory.h   +1 line   — declare setPagesRXKeepIcache
src/virtual_memory.c   +16 lines — implement setPagesRXKeepIcache
src/jit_compiler_a64.cpp  +1/-1  — use it in enableExecution

Total: 18 insertions, 1 deletion, 3 files.

On macOS with MAP_JIT (macOS 11+, all Apple Silicon), setPagesRX toggles write-protect via pthread_jit_write_protect_np(1) and then flushes the entire JIT code buffer with __builtin___clear_cache. JitCompilerA64::enableExecution is called once per hash iteration, immediately after generateProgram (or generateProgramLight / generateSuperscalarHash), each of which already calls __builtin___clear_cache over the exact modified range. The subsequent full-buffer flush in setPagesRX is therefore redundant. On Apple Silicon, sys_icache_invalidate serialises the instruction pipeline. Profiling on M4 Pro shows it consuming ~3% of CPU samples, rooted specifically in the setPagesRX call inside enableExecution. Fix: add setPagesRXKeepIcache, which on the macOS 11+ code path only calls pthread_jit_write_protect_np(1) without __builtin___clear_cache. On all other platforms the function falls through to the existing pageProtect path, so behaviour is unchanged outside macOS. JitCompilerA64::enableExecution switches to setPagesRXKeepIcache. No other JIT backends are affected (x86 generateProgram does not call __builtin___clear_cache, so its flush in enableExecution is not redundant and is left as-is). Note: jit_compiler_rv64.cpp::enableExecution has the same redundant double-flush pattern on platforms where MAP_JIT applies. That backend is out of scope for this patch. Verified on Apple M4 Pro (macOS 26.5.1, Apple clang 21.0.0, -march=armv8-a+crypto -mcpu=apple-m4): Profiler: sys_icache_invalidate via setPagesRX eliminated (~3% of CPU samples removed from this code path) Benchmark: directionally +1-3% H/s across multiple trial sets; performance variance reduced (stdev 29 vs 58 H/s) Hash output: identical to unpatched binary for identical nonce sequences

SChernykh · 2026-06-06T18:30:32Z

setPagesRX and setPagesRXKeepICache are almost identical copies, this is bad practice. Just changing the signature to void setPagesRX(void* ptr, size_t bytes, bool flush_cache = true), checking the new flag in the implementation, and passing false where needed is enough.

SChernykh · 2026-06-06T18:32:22Z

Or, since it's C code there, I would rewrite these two functions to call void setPagesRX(void* ptr, size_t bytes, bool flush_cache) - first will call it with true, second will call it with false.

tevador · 2026-06-06T18:33:29Z

I think that the proper fix should be to remove the __builtin___clear_cache call on line 192 if it's confirmed not to be needed.

The call was added in #259 by @SChernykh

Frobitz66 · 2026-06-06T19:10:03Z

I removed the __builtin___clear_cache call that you suggested and had my agent re-run a set of tests.

Tested on M4 Pro: 8 steady-state trials at 5589 ± 48 H/s vs 5535 ± 58 baseline (+1–3%, profiler evidence is primary). Correctness verified against reference hash output for 1000 nonces, 4 threads.

Is there anything specific I should look for to confirm that it's not needed??

FWIW - I had you-know-what do a bit of research for me on PR 259. It generated a reasonable hypothesis confirming @tevador 's suspicions.

What PR #259 established:

SChernykh added __builtin___clear_cache to setPagesRX, calling it AFTER pthread_jit_write_protect_np(1) to conform to Apple's docs
Reviewer hyc approved but explicitly noted: "it isn't fixing any crashes — it's at least following the letter of the Apple docs"
That last line is the key. It was added defensively, not to fix a known bug

What Apple's documentation actually requires:

Apple's porting guide prescribes this sequence for JIT code on Apple Silicon:

pthread_jit_write_protect_np(0) — enable writes
Write your instructions
sys_icache_invalidate(start, size) — flush the written range
pthread_jit_write_protect_np(1) — enable execution

The flush goes in step 3, before the execute toggle — not after it. That is exactly what generateProgram already does: it writes instructions, then calls __builtin___clear_cache(MainLoopBegin, codePos), then returns. setPagesRX fires afterward with the toggle in step 4.

So the code already conforms to Apple's documented sequence. What SChernykh added in PR #259 was a second, post-toggle flush — not what Apple prescribes, just a conservative addition. Hyc's "it isn't fixing crashes" comment confirms it was belt-and-suspenders, not a required fix.

Removing it is safe because:

The Apple-required pre-toggle flush already happens in every generateProgram / generateProgramLight / generateSuperscalarHash call
The post-toggle flush in setPagesRX was never required — hyc said so in 2022
Our 1000-nonce correctness test against the reference result confirms no regression

The removal doesn't violate Apple's documentation — it removes an extra step that was never in that documentation to begin with.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

jit/a64: avoid redundant icache flush on macOS after generateProgram#333

jit/a64: avoid redundant icache flush on macOS after generateProgram#333
Frobitz66 wants to merge 1 commit into
tevador:masterfrom
Frobitz66:apple-silicon-jit-icache-fix

Frobitz66 commented Jun 6, 2026

Uh oh!

SChernykh commented Jun 6, 2026

Uh oh!

SChernykh commented Jun 6, 2026

Uh oh!

tevador commented Jun 6, 2026

Uh oh!

Frobitz66 commented Jun 6, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Frobitz66 commented Jun 6, 2026

PR: jit/a64: avoid redundant icache flush on macOS after generateProgram

Problem

Fix

Call-site verification

Platform impact

Evidence

Related branches

Files changed

Uh oh!

SChernykh commented Jun 6, 2026

Uh oh!

SChernykh commented Jun 6, 2026

Uh oh!

tevador commented Jun 6, 2026

Uh oh!

Frobitz66 commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Frobitz66 commented Jun 6, 2026 •

edited

Loading