jit/a64: avoid redundant icache flush on macOS after generateProgram#333
jit/a64: avoid redundant icache flush on macOS after generateProgram#333Frobitz66 wants to merge 1 commit into
Conversation
On macOS with MAP_JIT (macOS 11+, all Apple Silicon), setPagesRX
toggles write-protect via pthread_jit_write_protect_np(1) and then
flushes the entire JIT code buffer with __builtin___clear_cache.
JitCompilerA64::enableExecution is called once per hash iteration,
immediately after generateProgram (or generateProgramLight /
generateSuperscalarHash), each of which already calls
__builtin___clear_cache over the exact modified range. The subsequent
full-buffer flush in setPagesRX is therefore redundant.
On Apple Silicon, sys_icache_invalidate serialises the instruction
pipeline. Profiling on M4 Pro shows it consuming ~3% of CPU samples,
rooted specifically in the setPagesRX call inside enableExecution.
Fix: add setPagesRXKeepIcache, which on the macOS 11+ code path only
calls pthread_jit_write_protect_np(1) without __builtin___clear_cache.
On all other platforms the function falls through to the existing
pageProtect path, so behaviour is unchanged outside macOS.
JitCompilerA64::enableExecution switches to setPagesRXKeepIcache. No
other JIT backends are affected (x86 generateProgram does not call
__builtin___clear_cache, so its flush in enableExecution is not
redundant and is left as-is).
Note: jit_compiler_rv64.cpp::enableExecution has the same redundant
double-flush pattern on platforms where MAP_JIT applies. That backend
is out of scope for this patch.
Verified on Apple M4 Pro (macOS 26.5.1, Apple clang 21.0.0,
-march=armv8-a+crypto -mcpu=apple-m4):
Profiler: sys_icache_invalidate via setPagesRX eliminated (~3% of
CPU samples removed from this code path)
Benchmark: directionally +1-3% H/s across multiple trial sets;
performance variance reduced (stdev 29 vs 58 H/s)
Hash output: identical to unpatched binary for identical nonce
sequences
|
|
|
Or, since it's C code there, I would rewrite these two functions to call |
|
I think that the proper fix should be to remove the The call was added in #259 by @SChernykh |
|
I removed the __builtin___clear_cache call that you suggested and had my agent re-run a set of tests. Tested on M4 Pro: 8 steady-state trials at 5589 ± 48 H/s vs 5535 ± 58 baseline (+1–3%, profiler evidence is primary). Correctness verified against reference hash output for 1000 nonces, 4 threads. Is there anything specific I should look for to confirm that it's not needed?? FWIW - I had you-know-what do a bit of research for me on PR 259. It generated a reasonable hypothesis confirming @tevador 's suspicions. What PR #259 established:
What Apple's documentation actually requires: Apple's porting guide prescribes this sequence for JIT code on Apple Silicon:
The flush goes in step 3, before the execute toggle — not after it. That is exactly what generateProgram already does: it writes instructions, then calls __builtin___clear_cache(MainLoopBegin, codePos), then returns. setPagesRX fires afterward with the toggle in step 4. So the code already conforms to Apple's documented sequence. What SChernykh added in PR #259 was a second, post-toggle flush — not what Apple prescribes, just a conservative addition. Hyc's "it isn't fixing crashes" comment confirms it was belt-and-suspenders, not a required fix. Removing it is safe because:
The removal doesn't violate Apple's documentation — it removes an extra step that was never in that documentation to begin with. |
This is my first ever PR for an open-source project so bear with me.
The intent was to improve H/Joule performance on Apple Silicon.
I used Claude Code to do research on the problem and to identify areas for improvement.
Of all the areas that we tested, this was the only one that was worth attempting to improve. (yielding between 1-3% improvement)
Mea Culpa: The code was written by Claude. I want to be clear about that up-front. The specific instructions given indicated:
Hopefully, this is something beneficial to RandomX.
-TS Abdallah
(Erie, Colorado)
PR: jit/a64: avoid redundant icache flush on macOS after generateProgram
Branch:
apple-silicon-jit-icache-fixTarget:
tevador/RandomXmasterCommit: 199f910
Problem
On macOS 11+ (all Apple Silicon),
JitCompilerA64::enableExecutioncallssetPagesRX, which flushes the entire 20 KB JIT code buffer via__builtin___clear_cacheon every hash iteration.This flush is redundant.
generateProgram(andgenerateProgramLight/generateSuperscalarHash) already calls__builtin___clear_cacheover theexact modified range immediately before
enableExecutionis invoked — seejit_compiler_a64.cpplines 214, 311, 428. The full-buffer flush insetPagesRXre-invalidates bytes that were just cleaned, covering a ~5×larger range than what was written.
On Apple Silicon,
sys_icache_invalidateserialises the instructionpipeline. Profiling on M4 Pro with
sampleshows this call consuming~3% of all CPU samples, rooted in the
setPagesRXpath insideenableExecution.Fix
Add
setPagesRXKeepIcachetovirtual_memory.c/h. On the macOS 11+ pathit calls only
pthread_jit_write_protect_np(1)— toggling write-protectwithout the redundant cache flush. On all other platforms it falls through
to the identical
pageProtect(PAGE_EXECUTE_READ)call thatsetPagesRXuses, so behaviour is completely unchanged outside macOS.
JitCompilerA64::enableExecutionswitches tosetPagesRXKeepIcache.Call-site verification
Every call to
enableExecutionon the A64 JIT is preceded by a targeted__builtin___clear_cachefrom a generate function:CompiledVm::run()generateProgram→clear_cache(MainLoopBegin, codePos)CompiledLightVm::run()generateProgramLight→clear_cache(MainLoopBegin, codePos)CompiledLightVm::setCache()generateSuperscalarHash→clear_cache(CodeSize, codePos)dataset.cpp::initCacheCompile()generateSuperscalarHash→generateDatasetInitCode(empty on A64)There are no early-return paths that skip the generate-function flush.
Platform impact
enableExecutionskips redundant flushpageProtect— identical to beforepageProtect— identical to beforeJitCompilerA64::enableExecutionchangesNote on x86:
JitCompilerX86::generateProgramdoes not call__builtin___clear_cache, so its flush inenableExecutionis the onlyone and is not redundant. Left unchanged.
Note on RV64:
jit_compiler_rv64.cpp::enableExecutionhas the sameredundant double-flush pattern as A64 (both
generateProgramandenableExecutioncall cache flush functions). That is out of scope here.Evidence
Profiler (primary):
sampleon a 30-second run showssys_icache_invalidateconsuming ~3% of CPU samples, all rooted insetPagesRXinsideenableExecution. This code path is eliminated bythe patch.
Benchmark (supporting):
-march=armv8-a+crypto -mcpu=apple-m4 -O3)--mine --jit --init 10 --threads 10 --nonces 480000The benchmark delta (+1–3% across trial sets) is within the noise floor of
a single-session comparison. The directional signal is consistent and the
reduced variance in the patched binary (stdev 29 vs 58) is consistent with
removing irregular pipeline-serialising flushes.
Correctness: Hash output identical to unpatched binary for identical
nonce sequences (
--verify --jit --nonces 5→9b00372136cc7b31ba871e198d9ed1a57bd5d2f5616249009aba753cfa4cd413).Related branches
pr-applearm(4d940e9) is a stale unmerged branch that proposesreplacing
__builtin___clear_cachewithsys_icache_invalidateingenerateProgram,generateProgramLight, andgenerateSuperscalarHash.That branch is based on
virtual_memory.hpp(since renamed to.h) anddoes not apply cleanly to current master. It is complementary to this
patch: it would make the generation-time flush faster, while this patch
eliminates the redundant second flush. They could coexist without conflict.
pr-applejit(8de238a) proposedpthread_jit_write_protect_npinsetPagesRW/setPagesRXdirectly; the core idea was subsequentlyincorporated into master in a more complete form.
Files changed
Total: 18 insertions, 1 deletion, 3 files.