Skip to content

jit/a64: avoid redundant icache flush on macOS after generateProgram#333

Open
Frobitz66 wants to merge 1 commit into
tevador:masterfrom
Frobitz66:apple-silicon-jit-icache-fix
Open

jit/a64: avoid redundant icache flush on macOS after generateProgram#333
Frobitz66 wants to merge 1 commit into
tevador:masterfrom
Frobitz66:apple-silicon-jit-icache-fix

Conversation

@Frobitz66

Copy link
Copy Markdown

This is my first ever PR for an open-source project so bear with me.
The intent was to improve H/Joule performance on Apple Silicon.
I used Claude Code to do research on the problem and to identify areas for improvement.
Of all the areas that we tested, this was the only one that was worth attempting to improve. (yielding between 1-3% improvement)
Mea Culpa: The code was written by Claude. I want to be clear about that up-front. The specific instructions given indicated:

  1. No harm
  2. No touching anything except code pertaining to Apple Silicon
  3. Clear testing to demonstrate improvement
  4. Clear documentation to demonstrate what was changed.

Hopefully, this is something beneficial to RandomX.

-TS Abdallah
(Erie, Colorado)

PR: jit/a64: avoid redundant icache flush on macOS after generateProgram

Branch: apple-silicon-jit-icache-fix
Target: tevador/RandomX master
Commit: 199f910


Problem

On macOS 11+ (all Apple Silicon), JitCompilerA64::enableExecution calls
setPagesRX, which flushes the entire 20 KB JIT code buffer via
__builtin___clear_cache on every hash iteration.

This flush is redundant. generateProgram (and generateProgramLight /
generateSuperscalarHash) already calls __builtin___clear_cache over the
exact modified range immediately before enableExecution is invoked — see
jit_compiler_a64.cpp lines 214, 311, 428. The full-buffer flush in
setPagesRX re-invalidates bytes that were just cleaned, covering a ~5×
larger range than what was written.

On Apple Silicon, sys_icache_invalidate serialises the instruction
pipeline. Profiling on M4 Pro with sample shows this call consuming
~3% of all CPU samples, rooted in the setPagesRX path inside
enableExecution.


Fix

Add setPagesRXKeepIcache to virtual_memory.c/h. On the macOS 11+ path
it calls only pthread_jit_write_protect_np(1) — toggling write-protect
without the redundant cache flush. On all other platforms it falls through
to the identical pageProtect(PAGE_EXECUTE_READ) call that setPagesRX
uses, so behaviour is completely unchanged outside macOS.

JitCompilerA64::enableExecution switches to setPagesRXKeepIcache.


Call-site verification

Every call to enableExecution on the A64 JIT is preceded by a targeted
__builtin___clear_cache from a generate function:

Call site Preceding flush
CompiledVm::run() generateProgramclear_cache(MainLoopBegin, codePos)
CompiledLightVm::run() generateProgramLightclear_cache(MainLoopBegin, codePos)
CompiledLightVm::setCache() generateSuperscalarHashclear_cache(CodeSize, codePos)
dataset.cpp::initCacheCompile() generateSuperscalarHashgenerateDatasetInitCode (empty on A64)

There are no early-return paths that skip the generate-function flush.


Platform impact

Platform Behaviour
macOS 11+ / Apple Silicon enableExecution skips redundant flush
macOS < 11 Falls through to pageProtect — identical to before
Linux / Windows / other Falls through to pageProtect — identical to before
x86 / RV64 JIT backends Not affected — only JitCompilerA64::enableExecution changes

Note on x86: JitCompilerX86::generateProgram does not call
__builtin___clear_cache, so its flush in enableExecution is the only
one and is not redundant. Left unchanged.

Note on RV64: jit_compiler_rv64.cpp::enableExecution has the same
redundant double-flush pattern as A64 (both generateProgram and
enableExecution call cache flush functions). That is out of scope here.


Evidence

Profiler (primary): sample on a 30-second run shows
sys_icache_invalidate consuming ~3% of CPU samples, all rooted in
setPagesRX inside enableExecution. This code path is eliminated by
the patch.

Benchmark (supporting):

  • Hardware: Apple M4 Pro, 10 P-cores, 48 GB, macOS 26.5.1
  • Compiler: Apple clang 21.0.0 (-march=armv8-a+crypto -mcpu=apple-m4 -O3)
  • Flags: --mine --jit --init 10 --threads 10 --nonces 480000
Baseline (n=7) Patched (n=7) Δ
Mean H/s 5534.8 5601.2 +66 (+1.2%)
Stdev 57.5 28.7 variance reduced

The benchmark delta (+1–3% across trial sets) is within the noise floor of
a single-session comparison. The directional signal is consistent and the
reduced variance in the patched binary (stdev 29 vs 58) is consistent with
removing irregular pipeline-serialising flushes.

Correctness: Hash output identical to unpatched binary for identical
nonce sequences (--verify --jit --nonces 5
9b00372136cc7b31ba871e198d9ed1a57bd5d2f5616249009aba753cfa4cd413).


Related branches

pr-applearm (4d940e9) is a stale unmerged branch that proposes
replacing __builtin___clear_cache with sys_icache_invalidate in
generateProgram, generateProgramLight, and generateSuperscalarHash.
That branch is based on virtual_memory.hpp (since renamed to .h) and
does not apply cleanly to current master. It is complementary to this
patch: it would make the generation-time flush faster, while this patch
eliminates the redundant second flush. They could coexist without conflict.

pr-applejit (8de238a) proposed pthread_jit_write_protect_np in
setPagesRW/setPagesRX directly; the core idea was subsequently
incorporated into master in a more complete form.


Files changed

src/virtual_memory.h   +1 line   — declare setPagesRXKeepIcache
src/virtual_memory.c   +16 lines — implement setPagesRXKeepIcache
src/jit_compiler_a64.cpp  +1/-1  — use it in enableExecution

Total: 18 insertions, 1 deletion, 3 files.

On macOS with MAP_JIT (macOS 11+, all Apple Silicon), setPagesRX
toggles write-protect via pthread_jit_write_protect_np(1) and then
flushes the entire JIT code buffer with __builtin___clear_cache.

JitCompilerA64::enableExecution is called once per hash iteration,
immediately after generateProgram (or generateProgramLight /
generateSuperscalarHash), each of which already calls
__builtin___clear_cache over the exact modified range. The subsequent
full-buffer flush in setPagesRX is therefore redundant.

On Apple Silicon, sys_icache_invalidate serialises the instruction
pipeline. Profiling on M4 Pro shows it consuming ~3% of CPU samples,
rooted specifically in the setPagesRX call inside enableExecution.

Fix: add setPagesRXKeepIcache, which on the macOS 11+ code path only
calls pthread_jit_write_protect_np(1) without __builtin___clear_cache.
On all other platforms the function falls through to the existing
pageProtect path, so behaviour is unchanged outside macOS.

JitCompilerA64::enableExecution switches to setPagesRXKeepIcache. No
other JIT backends are affected (x86 generateProgram does not call
__builtin___clear_cache, so its flush in enableExecution is not
redundant and is left as-is).

Note: jit_compiler_rv64.cpp::enableExecution has the same redundant
double-flush pattern on platforms where MAP_JIT applies. That backend
is out of scope for this patch.

Verified on Apple M4 Pro (macOS 26.5.1, Apple clang 21.0.0,
-march=armv8-a+crypto -mcpu=apple-m4):
  Profiler: sys_icache_invalidate via setPagesRX eliminated (~3% of
            CPU samples removed from this code path)
  Benchmark: directionally +1-3% H/s across multiple trial sets;
             performance variance reduced (stdev 29 vs 58 H/s)
  Hash output: identical to unpatched binary for identical nonce
               sequences
@SChernykh

Copy link
Copy Markdown
Collaborator

setPagesRX and setPagesRXKeepICache are almost identical copies, this is bad practice. Just changing the signature to void setPagesRX(void* ptr, size_t bytes, bool flush_cache = true), checking the new flag in the implementation, and passing false where needed is enough.

@SChernykh

Copy link
Copy Markdown
Collaborator

Or, since it's C code there, I would rewrite these two functions to call void setPagesRX(void* ptr, size_t bytes, bool flush_cache) - first will call it with true, second will call it with false.

@tevador

tevador commented Jun 6, 2026

Copy link
Copy Markdown
Owner

I think that the proper fix should be to remove the __builtin___clear_cache call on line 192 if it's confirmed not to be needed.

The call was added in #259 by @SChernykh

@Frobitz66

Frobitz66 commented Jun 6, 2026

Copy link
Copy Markdown
Author

I removed the __builtin___clear_cache call that you suggested and had my agent re-run a set of tests.

Tested on M4 Pro: 8 steady-state trials at 5589 ± 48 H/s vs 5535 ± 58 baseline (+1–3%, profiler evidence is primary). Correctness verified against reference hash output for 1000 nonces, 4 threads.

Is there anything specific I should look for to confirm that it's not needed??

FWIW - I had you-know-what do a bit of research for me on PR 259. It generated a reasonable hypothesis confirming @tevador 's suspicions.

What PR #259 established:

  • SChernykh added __builtin___clear_cache to setPagesRX, calling it AFTER pthread_jit_write_protect_np(1) to conform to Apple's docs
  • Reviewer hyc approved but explicitly noted: "it isn't fixing any crashes — it's at least following the letter of the Apple docs"
  • That last line is the key. It was added defensively, not to fix a known bug

What Apple's documentation actually requires:

Apple's porting guide prescribes this sequence for JIT code on Apple Silicon:

  1. pthread_jit_write_protect_np(0) — enable writes
  2. Write your instructions
  3. sys_icache_invalidate(start, size) — flush the written range
  4. pthread_jit_write_protect_np(1) — enable execution

The flush goes in step 3, before the execute toggle — not after it. That is exactly what generateProgram already does: it writes instructions, then calls __builtin___clear_cache(MainLoopBegin, codePos), then returns. setPagesRX fires afterward with the toggle in step 4.

So the code already conforms to Apple's documented sequence. What SChernykh added in PR #259 was a second, post-toggle flush — not what Apple prescribes, just a conservative addition. Hyc's "it isn't fixing crashes" comment confirms it was belt-and-suspenders, not a required fix.

Removing it is safe because:

  1. The Apple-required pre-toggle flush already happens in every generateProgram / generateProgramLight / generateSuperscalarHash call
  2. The post-toggle flush in setPagesRX was never required — hyc said so in 2022
  3. Our 1000-nonce correctness test against the reference result confirms no regression

The removal doesn't violate Apple's documentation — it removes an extra step that was never in that documentation to begin with.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants