Fix wedged-acquisition freezes and prevent USB suspend from killing the stream #5

Merged
cboulay merged 5 commits into master from fix-xfer-pending-freeze on Jul 4, 2026

Conversation

@cboulay cboulay commented Jul 4, 2026

Contributor

Problem

Leaving the machine unattended while streaming killed acquisition with uldaq's error "##### error still xfer pending. mNumXferPending =28". The cause: macOS idle sleep suspends the USB bus (and App Nap throttles the libusb event thread), so the DAQ's in-flight transfers never complete and the scan becomes unrecoverable. Before this branch, a wedged device could also freeze the GUI outright.

Changes

Robustness against a wedged device (graceful failure path):

  • Fix GUI deadlock when stopping a wedged acquisition: IDevice::requestStop() breaks the getData*() polling loops before join().
  • Cap consecutive scan restarts (MAX_RESTART_ATTEMPTS = 3) so an unrecoverable USB state stops the stream instead of hammering ulAInScanStop forever.
  • Throttle overrun status reports to one per second so a recovery storm cannot flood the GUI event queue.

Root-cause prevention (keep the wedge from happening):

  • Hold an RAII PowerAssertion in StreamThread::threadFunction for the lifetime of the streaming session. On macOS this uses -[NSProcessInfo beginActivityWithOptions:reason:] with NSActivityUserInitiated | NSActivityLatencyCritical, which blocks idle system sleep, opts out of App Nap, and disables timer throttling. The assertion is released automatically on stop, error, or shutdown, and it covers both the GUI app and the CLI. Display sleep is still allowed; manual sleep (lid close) is still handled by the graceful path above. Non-Apple platforms get a no-op stub.

Test tooling:

  • scripts/audio_latency_test now runs a raw-timestamp health check (no clock sync, no dejitter) after every run, reporting negative timestamps and non-monotonic (backward) steps that the default dejittered analysis would otherwise mask. --fail-on-bad-timestamps turns it into a regression gate.

Verification

  • Release build of both MCCOutlet.app and MCCOutletCLI passes.
  • pmset -g assertions confirms the PreventUserIdleSystemSleep assertion is registered while the guard is alive and released on destruction.
  • Timestamp check validated against existing XDF recordings (flags the known ~0.04% chunk-boundary back-dating artifact; zero negative timestamps).
  • Pending: leave-the-machine-idle soak test with the DAQ attached to confirm the stream survives unattended periods.

🤖 Generated with Claude Code

cboulay and others added 5 commits June 24, 2026 18:25
When a device wedges (FIFO overrun / stalled USB transfer), the getData*()
loops spin inside their own `while (!disconnecting_)` loop and never return,
so StreamThread::threadFunction never re-checks shutdown_. StreamThread::stop()
sets shutdown_ then immediately joins, but disconnecting_ was only set later
inside disconnect() (after the join), so the worker could never exit and join()
blocked forever, freezing the GUI thread.

Add IDevice::requestStop() (sets disconnecting_) and call it in
StreamThread::stop() before join(), so the worker breaks out of getData*()
and exits cleanly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
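
The ordering fix in the commit above can be sketched in isolation. This is a minimal illustration and not the project's code: FakeDevice and Worker are hypothetical stand-ins for IDevice and StreamThread, and the wedged read is simulated with a sleep loop. What it tries to show is that the flag the inner polling loop watches must be set before join(), otherwise join() waits on a loop that can never exit.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical stand-in for a device whose read loop polls a stop flag.
class FakeDevice {
public:
    // Sets the flag the polling loop checks; safe to call from any thread.
    void requestStop() { disconnecting_.store(true); }

    // Simulates a wedged getData*(): no data ever arrives, so the only way
    // out of the loop is the disconnecting_ flag.
    bool getData() {
        while (!disconnecting_.load()) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        return false;
    }

private:
    std::atomic<bool> disconnecting_{false};
};

// Hypothetical stand-in for the streaming thread wrapper.
class Worker {
public:
    explicit Worker(FakeDevice& dev) : dev_(dev) {}
    ~Worker() { stop(); }

    void start() {
        thread_ = std::thread([this] {
            while (!shutdown_.load()) {
                if (!dev_.getData()) break;
            }
            exited_.store(true);
        });
    }

    void stop() {
        shutdown_.store(true);
        dev_.requestStop();  // break the inner polling loop first...
        if (thread_.joinable()) thread_.join();  // ...so join() can return
    }

    bool exited() const { return exited_.load(); }

private:
    FakeDevice& dev_;
    std::atomic<bool> shutdown_{false};
    std::atomic<bool> exited_{false};
    std::thread thread_;
};
```

In this sketch, removing the dev_.requestStop() call reproduces the hang: shutdown_ is set, but the worker never leaves getData() to observe it.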
restartScan() calls ulAInScanStop(), which is what makes libuldaq print
"##### error still xfer pending. mNumXferPending =N" when USB transfers are
stuck. A genuine FIFO overrun recovers in one restart, but an unrecoverable
state left the getData*() loops calling restartScan() ~20x/sec indefinitely.

Cap consecutive restarts per getData*() call (MAX_RESTART_ATTEMPTS = 3); on
exhaustion, report an error and return false so the streaming thread stops
cleanly instead of spinning.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
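
A bounded-retry loop of the kind the commit above describes might look like the following. This is a sketch under assumptions: only MAX_RESTART_ATTEMPTS = 3 comes from the PR, while ReadResult, readWithRestartCap, and the callback parameters are hypothetical names that stand in for the real getData*() and restartScan() plumbing.

```cpp
#include <functional>

// Value taken from the PR description.
constexpr int MAX_RESTART_ATTEMPTS = 3;

// Hypothetical result type for one read attempt.
enum class ReadResult { Ok, Overrun };

// Tries to read, restarting the scan after each overrun. Returns true as soon
// as a read succeeds, or false once MAX_RESTART_ATTEMPTS consecutive restarts
// have failed to recover the device. restartsUsed reports how many restarts
// this call performed.
bool readWithRestartCap(const std::function<ReadResult()>& readOnce,
                        const std::function<void()>& restartScan,
                        int& restartsUsed) {
    restartsUsed = 0;
    for (;;) {
        if (readOnce() == ReadResult::Ok) {
            return true;
        }
        if (restartsUsed >= MAX_RESTART_ATTEMPTS) {
            return false;  // unrecoverable: let the caller stop the stream
        }
        restartScan();
        ++restartsUsed;
    }
}
```

Because the counter is local to each call, a genuine overrun that recovers after one restart does not eat into the budget of later calls, which matches the "consecutive restarts per getData*() call" wording.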
restartScan() emitted a status callback on every restart. In the GUI each
callback posts a queued event via QMetaObject::invokeMethod, so a recovery
storm could flood the main thread's event queue and make the window
unresponsive. Coalesce overrun reports to at most one per second; overrun_count_
still increments for every restart so the reported count stays accurate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
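
The coalescing described in the commit above can be illustrated with a small rate limiter. This is a sketch and not the actual implementation: OverrunReporter and recordOverrun are hypothetical names, and the timestamp is passed in as a parameter so the behaviour can be exercised without waiting in real time. The counter increments on every call; only the decision to emit a report is throttled.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: counts every overrun but allows at most one status
// report per second.
class OverrunReporter {
public:
    using Clock = std::chrono::steady_clock;

    // Records one overrun. Returns true if a status report should be emitted
    // now, false if it should be suppressed because the previous report was
    // less than one second ago.
    bool recordOverrun(Clock::time_point now) {
        constexpr auto kMinInterval = std::chrono::seconds(1);
        ++overrun_count_;  // always counted, even when the report is dropped
        if (has_reported_ && now - last_report_ < kMinInterval) {
            return false;
        }
        has_reported_ = true;
        last_report_ = now;
        return true;
    }

    std::uint64_t count() const { return overrun_count_; }

private:
    std::uint64_t overrun_count_ = 0;
    bool has_reported_ = false;
    Clock::time_point last_report_{};
};
```

A caller would pass OverrunReporter::Clock::now() and invoke the status callback only when recordOverrun() returns true, including count() in the message so suppressed restarts still show up in the total.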
… the DAQ

When the Mac idles into system sleep (or App Nap throttles the libusb
event thread), the DAQ's in-flight USB transfers never complete. uldaq
then logs "##### error still xfer pending" on the next scan stop and the
acquisition cannot be recovered, so the restart cap trips and the stream
dies whenever the machine is left unattended.

Prevent the wedge at the source: StreamThread::threadFunction now holds
an RAII PowerAssertion for the lifetime of the streaming session. On
macOS this uses -[NSProcessInfo beginActivityWithOptions:reason:] with
NSActivityUserInitiated | NSActivityLatencyCritical, which blocks idle
system sleep, opts out of App Nap, and disables timer throttling; it is
released automatically on stop, error, or shutdown. Display sleep is
still permitted, and a manual sleep (lid close) is still handled by the
existing graceful-stop path. Non-Apple platforms get a no-op stub.

Verified via pmset -g assertions that the PreventUserIdleSystemSleep
assertion is registered while the guard is alive and released on
destruction.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
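
The RAII shape of the guard described above can be shown portably. The sketch below is not the project's PowerAssertion: ScopedActivity is a hypothetical name, and the platform calls are injected as callbacks so the example compiles anywhere. On macOS the begin hook would presumably wrap -[NSProcessInfo beginActivityWithOptions:reason:] and the end hook -[NSProcessInfo endActivity:] in an Objective-C++ file; on other platforms both hooks can be left empty, which gives the no-op stub.

```cpp
#include <functional>
#include <utility>

// Hypothetical RAII guard: runs `begin` on construction and `end` on
// destruction, so the activity is released on every exit path from the
// enclosing scope (normal return, early return, or exception).
class ScopedActivity {
public:
    ScopedActivity(std::function<void()> begin, std::function<void()> end)
        : end_(std::move(end)) {
        if (begin) begin();
    }

    ~ScopedActivity() {
        if (end_) end_();
    }

    // Non-copyable: exactly one release per acquisition.
    ScopedActivity(const ScopedActivity&) = delete;
    ScopedActivity& operator=(const ScopedActivity&) = delete;

private:
    std::function<void()> end_;
};
```

Declaring such a guard as a local at the top of the thread function ties the assertion to the streaming session without any explicit release call in the stop and error branches.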
The latency analysis loads XDF with dejitter_timestamps=True, which
refits timestamps onto a clean monotonic line and therefore hides any
defect in what the outlet actually emitted. Add check_timestamps(),
which loads the raw timestamps (no sync, no dejitter) and reports
negative values and non-monotonic (backward) steps with their count and
magnitude.

The check runs at the end of every live or --analyze-xdf run and is
informational by default; pass --fail-on-bad-timestamps to exit
non-zero when the MCC stream has negative or backward raw timestamps,
for use as a regression gate.

On existing recordings this surfaces the expected chunk-boundary
back-dating artifact (~0.04% backward steps, up to ~9 ms) that liblsl
produces when a chunk spans more sample-time than the wall clock
elapsed since the previous push; it is erased by any dejittering
consumer and does not affect the latency numbers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
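
The core of the raw-timestamp check is a single pass over the timestamps. The real check_timestamps() lives in the Python test script; the version below is a C++ restatement of the idea only, with hypothetical names (TimestampReport, checkTimestamps), and it assumes the raw timestamps have already been loaded into a vector of seconds.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical summary of a raw-timestamp scan.
struct TimestampReport {
    std::size_t negative = 0;   // samples with timestamp < 0
    std::size_t backward = 0;   // samples earlier than their predecessor
    double max_backward = 0.0;  // largest backward step, in seconds

    bool ok() const { return negative == 0 && backward == 0; }
};

// Scans timestamps in stream order and counts negative values and
// non-monotonic (backward) steps, tracking the largest backward jump.
TimestampReport checkTimestamps(const std::vector<double>& ts) {
    TimestampReport report;
    for (std::size_t i = 0; i < ts.size(); ++i) {
        if (ts[i] < 0.0) {
            ++report.negative;
        }
        if (i > 0 && ts[i] < ts[i - 1]) {
            ++report.backward;
            report.max_backward =
                std::max(report.max_backward, ts[i - 1] - ts[i]);
        }
    }
    return report;
}
```

A gate equivalent to --fail-on-bad-timestamps would return a non-zero exit code when ok() is false; in informational mode the counts would only be printed.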
@cboulay cboulay merged commit 9e95594 into master Jul 4, 2026
5 checks passed
@cboulay cboulay deleted the fix-xfer-pending-freeze branch July 4, 2026 18:29