Fix the random screenshot-suite crashes behind the mac-native / iOS Metal CI flake #5226
The mac-native / iOS screenshot jobs flake with 'N of 128 screenshots
not produced': the suite stops emitting mid-run and the runner times out
waiting for CN1SS:SUITE:FINISHED. Artifact forensics show the app is not
idle when this happens -- ParparVM's SignalHandler converts SIGSEGV into
a Java NPE and returns, so a thread that faulted outside a Java try
frame re-executes the faulting instruction forever ('We had a signal 11'
spam in device-runner.log, observed from the UIKit main thread).
On suite timeout both runners now:
- 'sample' the live app process into app-hang-sample.txt. Because the
crashed thread keeps re-faulting at the same PC, the sample contains
the exact faulting stack; for genuine deadlocks it captures every
thread's wait state.
- log when the signal-handler loop signature is present in the app log.
- collect crash reports written to ~/Library/Logs/DiagnosticReports
during the run (covers the process-died-outright mode).
Diagnostics only; no behavior change on the success path.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
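
The loop-signature check described in the list above can be sketched in C as a simple occurrence count over the captured log text. This is only a model of the idea, not the runner scripts themselves: the function names and the repeat threshold are hypothetical, and only the 'We had a signal 11' string comes from the log described in this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts non-overlapping occurrences of needle in log. */
size_t countOccurrences(const char *log, const char *needle) {
    size_t n = 0, len = strlen(needle);
    if (len == 0) return 0;
    for (const char *p = strstr(log, needle); p; p = strstr(p + len, needle))
        n++;
    return n;
}

/* Returns 1 when the signal-handler message repeats at least `threshold`
 * times. The threshold is an assumption for illustration; the real
 * runners may use a different criterion. */
int looksLikeSignalLoop(const char *log, size_t threshold) {
    return countOccurrences(log, "We had a signal 11") >= threshold;
}
```

A single occurrence is compatible with an NPE that a Java try frame caught; only a long run of the same message points at the re-faulting loop, which is why the sketch takes a threshold rather than testing for mere presence.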
…ashes

A sample of a wedged Mac Catalyst screenshot run caught a dying Java
thread aborting inside libmalloc (POINTER_BEING_FREED_WAS_NOT_ALLOCATED)
under markDeadThread -> collectThreadResources ->
placeObjectInHeapCollection, with the GC, the EDT and thread-spawn all
piled up behind the critical section it still held.

Two bugs in placeObjectInHeapCollection's rarely-taken grow path:

1. Unsynchronized concurrent callers. The GC mark migration waits for a
   thread's threadActive to drop before migrating its
   pendingHeapAllocations -- but a thread that finishes runImpl drops
   threadActive through markDeadThread, which migrates the same buffer
   concurrently under the critical section the GC never takes. Both
   sides double-place the same objects and race the grow-and-free of
   allObjectsInHeap: concurrent grows double-free the old array (the
   captured abort), and a stale read of the freed array is a
   use-after-free. The GC migration now takes the critical section and
   re-checks the thread slot; if the thread died meanwhile,
   markDeadThread already migrated everything under the same lock.
2. The grow branch left pos at -1, so the placed object's __heapPosition
   was never recorded. A later reference-counted free could not null its
   slot (removeObjectFromHeapCollection returns JAVA_FALSE) yet the
   object was freed anyway, leaving a dangling pointer in
   allObjectsInHeap for the next sweep to dereference.

Also defer freeing the replaced array by one growth cycle since the
sweep and the refcount removal path read allObjectsInHeap without the
critical section.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
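
The three changes described in this commit can be illustrated with a heavily simplified C model. It is a sketch under stated assumptions, not the code in cn1_globals.m: every type and function name below (Obj, ThreadData, placeObject, removeObject, migratePending, markDead, gcMigrate, threadSlots, retiredHeap) is hypothetical, placement is append-only, and the lock-free readers that motivate the deferred free are not modeled.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the VM's structures. */
typedef struct Obj { int heapPosition; } Obj;   /* models __heapPosition */
typedef struct ThreadData { Obj *pending[8]; int pendingCount; } ThreadData;

Obj **heap;                 /* models allObjectsInHeap */
int heapSize, heapUsed;
Obj **retiredHeap;          /* array replaced by the previous grow */
ThreadData *threadSlots[4]; /* a slot is NULL once its thread has died */
pthread_mutex_t criticalSection = PTHREAD_MUTEX_INITIALIZER;

/* Append-only placement. Caller must hold criticalSection. */
void placeObject(Obj *o) {
    if (heapUsed == heapSize) {
        int newSize = heapSize ? heapSize * 2 : 2;
        Obj **grown = calloc((size_t)newSize, sizeof *grown);
        if (!grown) abort();
        if (heapUsed) memcpy(grown, heap, (size_t)heapUsed * sizeof *grown);
        free(retiredHeap);   /* replaced one growth cycle ago */
        retiredHeap = heap;  /* stays readable until the next grow */
        heap = grown;
        heapSize = newSize;
    }
    o->heapPosition = heapUsed;  /* recorded on the grow path as well */
    heap[heapUsed++] = o;
}

/* Nulls the object's slot; returns 0 if no valid position was recorded. */
int removeObject(Obj *o) {
    int pos = o->heapPosition;
    if (pos < 0 || pos >= heapUsed || heap[pos] != o) return 0;
    heap[pos] = NULL;
    o->heapPosition = -1;
    return 1;
}

/* Moves a thread's pending allocations into the heap collection.
 * Caller must hold criticalSection. */
void migratePending(ThreadData *t) {
    for (int i = 0; i < t->pendingCount; i++) placeObject(t->pending[i]);
    t->pendingCount = 0;
}

/* Dying thread: migrate its buffer and vacate its slot under the lock. */
void markDead(int slot) {
    pthread_mutex_lock(&criticalSection);
    if (threadSlots[slot]) migratePending(threadSlots[slot]);
    threadSlots[slot] = NULL;
    pthread_mutex_unlock(&criticalSection);
}

/* GC side: take the same lock, then re-check the slot before migrating. */
void gcMigrate(int slot) {
    pthread_mutex_lock(&criticalSection);
    ThreadData *t = threadSlots[slot];
    if (t) migratePending(t);  /* NULL: markDead already migrated it */
    pthread_mutex_unlock(&criticalSection);
}
```

In this model, calling markDead and then gcMigrate on the same slot places each pending object exactly once, and an object placed during a grow can later be removed because its position was recorded. Keeping the replaced array alive for one extra growth cycle trades a little memory for a window in which a reader that fetched the old pointer without the lock does not touch freed memory.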
The Phase 3 v2 mutable-image pipeline tags queued ExecutableOps with the
GLUIImage they should render into, but the target ivar was
__unsafe_unretained. The main-thread drawFrame drain dereferences it
after the EDT queued the op, so a mutable image deallocated in between
(Java-side GC finalizing the Image) left a dangling pointer.

Caught locally as -[DrawTextureAlphaMask mtlMutableTexture]:
unrecognized selector ... CN1MetalBeginMutableImageDraw / drawFrame:
when the freed GLUIImage's memory had been reused by another op; with
less lucky reuse it is a straight SIGSEGV mid-frame, matching the random
mid-suite crashes in the mac-native and iOS Metal screenshot CI jobs.

setTarget now retains (released in dealloc), exactly like the ops' image
ivars (DrawImage.img et al). The final release can now happen on the
main thread during the drain, which is the safe place for a UIKit-backed
object.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
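
The ownership change in this commit can be modeled in plain C with an explicit reference count. This is an illustrative sketch, not the Objective-C in ExecutableOp.{h,m}: Image, Op, imageCreate, imageRetain, imageRelease, opSetTarget and opDestroy are hypothetical names, and the counter is not atomic, unlike real retain/release.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Image { int refCount; } Image;  /* models GLUIImage */
typedef struct Op { Image *target; } Op;       /* models a queued ExecutableOp */

int liveImages;  /* number of images not yet freed, for illustration */

Image *imageCreate(void) {
    Image *img = malloc(sizeof *img);
    if (!img) abort();
    img->refCount = 1;  /* the creator owns the first reference */
    liveImages++;
    return img;
}

void imageRetain(Image *img) { img->refCount++; }

void imageRelease(Image *img) {
    if (--img->refCount == 0) {
        liveImages--;
        free(img);
    }
}

/* Retaining setter: take the new reference before dropping the old one,
 * so re-setting the same target cannot free it in between. */
void opSetTarget(Op *op, Image *target) {
    if (target) imageRetain(target);
    if (op->target) imageRelease(op->target);
    op->target = target;
}

/* Models dealloc: the op gives up its reference when it is destroyed. */
void opDestroy(Op *op) { opSetTarget(op, NULL); }
```

With the retaining setter, the image outlives its creator's release for as long as a queued op still points at it, and the last release happens wherever the op is destroyed. An unretained pointer in the same position would be left dangling by the creator's release.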
Compared 125 screenshots: 125 matched.
Compared 125 screenshots: 125 matched.
Compared 128 screenshots: 128 matched.
Compared 125 screenshots: 125 matched.
Compared 121 screenshots: 121 matched.
Compared 124 screenshots: 124 matched.
Compared 128 screenshots: 128 matched.
Problem
The build-mac-native and build-ios-metal jobs flake with 'FATAL: N of 128 expected screenshot(s) were not produced': the suite stops emitting mid-run at a random test, the runner times out waiting for CN1SS:SUITE:FINISHED, and every screenshot that was produced matches its golden. Observed on master and on unrelated PR branches (e.g. #5216 needed two re-runs).

Artifact forensics showed the app is not idle when this happens: device-runner.log ends in an endless 'We had a signal 11' spam from the UIKit main thread. ParparVM's SignalHandler converts SIGSEGV into a Java NPE and returns, so a thread that faults outside a Java try frame re-executes the faulting instruction forever. The flake is a random memory-corruption crash, not a hang.

Investigation
Reproduced locally by looping the Mac Catalyst screenshot suite (fails ~1-in-3). Two distinct crashes were caught:

1. A 'sample' of a wedged run: a dying Java thread aborted inside libmalloc (POINTER_BEING_FREED_WAS_NOT_ALLOCATED, under markDeadThread -> collectThreadResources -> placeObjectInHeapCollection) and took the whole VM down with it, with the GC thread blocked in codenameOneGCMark, the EDT blocked in monitorEnter mid-screenshot-encode, and a starting thread blocked in getThreadLocalData. The result was a total VM standstill.
2. A second repro terminated with '-[DrawTextureAlphaMask mtlMutableTexture]: unrecognized selector ...' under CN1MetalBeginMutableImageDraw / drawFrame: a freed GLUIImage's memory had been reused by another draw op, i.e. a dangling pointer was dereferenced on the main thread. With less lucky memory reuse this is a straight SIGSEGV mid-frame, matching the crashes seen in CI.

Fixes
- vm/ByteCodeTranslator/src/cn1_globals.m: two bugs on placeObjectInHeapCollection's rarely-taken grow path:
  - The GC mark migration waits for a thread's threadActive to drop before migrating its pendingHeapAllocations, but a thread that finishes runImpl drops threadActive through markDeadThread, which migrates the same buffer concurrently. Both sides double-place the same objects and race the grow-and-free of allObjectsInHeap (concurrent grows double-free the old array, which is the captured abort). The GC migration now takes the critical section (which markDeadThread already holds) and re-checks the thread slot.
  - The grow branch left pos at -1, so the placed object's __heapPosition was never recorded: a later reference-counted free could not null its heap slot yet freed the object anyway, leaving a dangling pointer for the next sweep to dereference. The slot index is now recorded.
  - Freeing the replaced array is also deferred by one growth cycle, since the sweep and the refcount removal path read allObjectsInHeap without the critical section.
- Ports/iOSPort/nativeSources/ExecutableOp.{h,m}: the Phase 3 v2 mutable-image pipeline tags queued ops with their target GLUIImage, but the ivar was __unsafe_unretained; the main-thread drawFrame drain dereferences it after the EDT queued the op, so a mutable image GC'd in between dangles. setTarget now retains (released in dealloc), matching the ops' image ivars (DrawImage.img et al).
- scripts/run-mac-native-ui-tests.sh, scripts/run-ios-ui-tests.sh: on suite timeout the runners now 'sample' the live app process into the artifacts (the re-faulting thread makes the sample contain the exact crashing stack), flag the signal-handler-loop signature, and collect crash reports from ~/Library/Logs/DiagnosticReports. This is what made the diagnosis possible; future flakes self-diagnose.

Verification
🤖 Generated with Claude Code