Skip to content

feat: cross-platform force-kill primitive for stuck PHP threads#2365

Open
nicolas-grekas wants to merge 1 commit intophp:mainfrom
nicolas-grekas:force-kill-primitives
Open

feat: cross-platform force-kill primitive for stuck PHP threads#2365
nicolas-grekas wants to merge 1 commit intophp:mainfrom
nicolas-grekas:force-kill-primitives

Conversation

@nicolas-grekas
Copy link
Copy Markdown
Contributor

First step of the split suggested in #2287: land the force-kill
infrastructure as a standalone, reviewable primitive, independent of
background workers. Useful on its own and gives follow-up
graceful-shutdown work a reviewed foundation.

What

  • C primitives (frankenphp.c, frankenphp.h):

    • frankenphp_init_force_kill(int n) / frankenphp_destroy_force_kill()
      allocate/free per-thread slots.
    • frankenphp_save_php_timer(uintptr_t idx) captures this thread's
      timer/handle at boot, so the kill path can fire from any goroutine
      without touching per-thread PHP state directly.
    • frankenphp_force_kill_thread(uintptr_t idx) arms the kill:
      • Linux/FreeBSD ZTS: arms PHP's max_execution_time timer, delivers
        SIGALRM, raises Maximum execution time exceeded.
      • Windows: CancelSynchronousIo + QueueUserAPC to interrupt I/O
        and alertable waits.
      • macOS and other platforms: safe no-op; the thread is abandoned and
        exits when the blocking call returns on its own.
  • Go wiring (phpmainthread.go):

    • initPHPThreads allocates the slots (sized to maxThreads).
    • drainPHPThreads frees them.
    • PHP thread boot loop calls frankenphp_save_php_timer once up front.
  • Graceful drain (worker.go):

    • drainWorkerThreads now waits up to 5 s for threads to yield, then
      arms the kill on any still-stuck thread and keeps waiting on
      ready.Wait(). Drain completes in bounded time instead of hanging
      on a stuck sleep/I/O call.

Notes

  • No behavior change for threads that yield within 5 s.
  • No background-worker code in this PR; the machinery is general-purpose.
  • Budget: +216 / -1.

Comment thread frankenphp.c Outdated
#elif defined(PHP_WIN32)
if (thread_handles && thread_handle_saved[idx]) {
CancelSynchronousIo(thread_handles[idx]);
QueueUserAPC((PAPCFUNC)frankenphp_noop_apc, thread_handles[idx], 0);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL php sleep uses SleepEx (which is alertable), but usleep does not. This isn't the greatest option to wake up sleeping threads.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concern holds on Windows for non-alertable Sleep(), which is what PHP's usleep uses. From our layer there's no fix: TerminateThread would leave locks in undefined state. <-done in drainWorkerThreads will wait until the syscall returns naturally in that case. Documented the limitation next to the primitive.

On Linux/FreeBSD the new pthread_kill(SIGRTMIN+3) path does interrupt usleep (and the rest) via EINTR. On macOS it's the same limitation as Windows usleep (no realtime signals).

One thought: this could be fixed upstream by having php-src switch usleep on Windows from Sleep to SleepEx(msec, TRUE). That would make it alertable, and our QueueUserAPC would then interrupt it naturally. Might be worth a small PR to php-src.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Once that lands, our QueueUserAPC path on Windows will interrupt usleep naturally with no change on our side, and the drain-indefinitely concern on worker.go goes away for the usleep case too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(LLM posted this on its own but I'm sincerely grateful for this : D)

Comment thread frankenphp.c Outdated
Comment thread frankenphp.c Outdated
Comment thread frankenphp.c Outdated
Comment thread worker.go
if globalLogger.Enabled(globalCtx, slog.LevelWarn) {
globalLogger.LogAttrs(globalCtx, slog.LevelWarn, "worker threads did not yield within grace period, force-killing stuck threads")
}
<-done
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

circling back to Windows only actually interrupting SleepEx, wouldn't we block indefinitely here if the force_kill_thread didn't work?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid concern. On Windows with PHP usleep (non-alertable Sleep), <-done will wait until the sleep returns on its own because the APC cannot interrupt it. Documented. Fixing it properly would need php-src to use SleepEx(msec, TRUE) for usleep so the APC can wake it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well we could technically force kill a thread. But I doubt that'd be safe.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, TerminateThread can abandon held locks mid-modification and corrupt the process - better to lean on your php-src fix.

Comment thread frankenphp.c Outdated
Comment on lines +152 to +159
if (thread_php_timers && thread_php_timer_saved[idx]) {
struct itimerspec its;
its.it_value.tv_sec = 0;
its.it_value.tv_nsec = 1;
its.it_interval.tv_sec = 0;
its.it_interval.tv_nsec = 0;
timer_settime(thread_php_timers[idx], 0, &its, NULL);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of storing timers like this, I'd actually prefer if we just stored a reference to the execution globals of each thread at startup. You then can just set EG(timed_out) to true.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Each thread now saves pointers to its own EG(vm_interrupt) and EG(timed_out) at boot; force_kill_thread does atomic-stores into both. That path is the cross-platform core, including macOS and Windows.

We kept a platform-specific wake-up on top: pthread_kill(SIGRTMIN+3) on Linux/FreeBSD, CancelSynchronousIo + QueueUserAPC on Windows. Without it, a thread stuck in a blocking syscall (select, sleep, DB wait) would only be interrupted when the syscall returned on its own.

Side effect of the switch: the old timer path silently did nothing for bg workers because EG(pid) was zero when max_execution_time=0, so this also fixes a latent bug.

Comment thread frankenphp.c Outdated
Comment on lines +142 to +148
thread_slots = calloc((size_t)num_threads, sizeof(force_kill_slot));
if (thread_slots == NULL) {
/* Out of memory at startup: leave force-kill disabled rather than crash
* later in register/kill. Graceful drain still works via the yielding
* path. */
force_kill_num_threads = 0;
return;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be cleaner to just store the force_kill_slot on the go side in the phpThread struct and then pass it back to C in frankenphp_force_kill_thread.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. The slot now lives on phpThread and gets populated by go_frankenphp_store_force_kill_slot from the PHP thread's own TSRM context at boot; force_kill_thread(slot) takes it by value. That lets us drop the C-side thread_slots array, the bounds checks, and both init_force_kill / destroy_force_kill. Signal-handler install moved to a one-shot pthread_once inside register; Windows CloseHandle lives in a new release_thread_for_kill called during drain. Diff is -25 lines, and happens-before is covered because the thread writes the slot before it transitions to Ready.

Introduces a small, self-contained primitive that unblocks a PHP thread
stuck in a blocking call (sleep, synchronous I/O, etc.) so the graceful
drain used by RestartWorkers and DrainWorkers can make progress instead
of waiting for the block to return on its own. The primitive is useful
on its own and gives follow-up graceful-shutdown work a reviewed
foundation to build on.

- frankenphp.c: add frankenphp_init_force_kill / frankenphp_save_php_timer
  / frankenphp_force_kill_thread / frankenphp_destroy_force_kill. The
  per-thread PHP timer handle (Linux/FreeBSD ZTS) or OS thread handle
  (Windows) is captured at thread boot and stored in a pre-sized array
  so the kill path can fire from any goroutine without touching
  per-thread PHP state. Linux/FreeBSD arm PHP's max_execution_time timer
  (delivers SIGALRM -> "Maximum execution time exceeded"); Windows uses
  CancelSynchronousIo + QueueUserAPC to interrupt I/O and alertable
  waits; macOS and other platforms are a safe no-op (the thread is
  abandoned and exits when the blocking call returns naturally).

- phpmainthread.go: wire frankenphp_init_force_kill into initPHPThreads
  (sized to maxThreads, matching the thread_metrics allocation) and
  frankenphp_destroy_force_kill into drainPHPThreads.

- worker.go: add a 5-second graceful-drain grace period to
  drainWorkerThreads. Once elapsed, arm the force-kill primitive on any
  thread still outside Yielding and keep waiting on ready.Wait(); the
  kill lets the thread return from its blocking call so the drain
  completes in bounded time instead of hanging.

- worker_test.go + testdata/worker-sleep.php: TestRestartWorkersForceKillsStuckThread
  drives the path end-to-end. A worker blocks inside sleep(60) below
  frankenphp_handle_request (so drainChan close can't reach it); the
  test asserts RestartWorkers returns within 8s (grace + slack). The
  test skips on platforms without the underlying primitive.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants