Skip to content

Fix NPE in PendingAddOp.maybeTimeout() when clientCtx is null after recycling#4760

Merged
merlimat merged 2 commits intoapache:masterfrom
eolivelli:fix/pending-add-op-maybe-timeout-null-clientctx
Apr 29, 2026
Merged

Fix NPE in PendingAddOp.maybeTimeout() when clientCtx is null after recycling#4760
merlimat merged 2 commits intoapache:masterfrom
eolivelli:fix/pending-add-op-maybe-timeout-null-clientctx

Conversation

@eolivelli
Copy link
Copy Markdown
Contributor

@eolivelli eolivelli commented Apr 25, 2026

Motivation

monitorPendingAddOps() iterates over the pendingAddOps queue and calls
maybeTimeout() on each op. Concurrently, sendAddSuccessCallbacks() can
remove a completed op from the queue, call submitCallback() on it, and
ultimately trigger recyclePendAddOpObject() which sets clientCtx = null
(and lh = null, etc.).

If the scheduler thread still holds a live iterator reference to the same op
and then calls maybeTimeout(), dereferencing clientCtx causes a
NullPointerException:

java.lang.NullPointerException
    at org.apache.bookkeeper.client.PendingAddOp.maybeTimeout(PendingAddOp.java:157)
    at org.apache.bookkeeper.client.LedgerHandle.monitorPendingAddOps(LedgerHandle.java:2063)

The race is only triggered when addEntryQuorumTimeoutNanos > 0 (i.e. the
quorum-timeout monitor is actually scheduled). It became reliably observable
with Netty 4.1.130, which changed Recycler thread-scheduling behavior
and narrowed the window enough to expose the pre-existing race.

Closes #4759.

Changes

PendingAddOp.java

  • maybeTimeout() — snapshot clientCtx into a local.
    A naive null-check at the top of the method does not close the race:
    clientCtx could still be nulled between the guard and the subsequent
    clientCtx.getConf() dereference. volatile does not help here — it
    only guarantees visibility, not atomicity across two reads. Reading the
    field once into a local variable makes the second access race-free
    (locals are not visible to other threads).

  • timeoutQuorumWait() — bail when already recycled.
    Even after maybeTimeout() snapshots a non-null clientCtx,
    recyclePendAddOpObject() may complete before timeoutQuorumWait()
    acquires the monitor (both methods are synchronized). The pre-existing
    if (completed) return; guard does not cover this — recycling
    resets completed to false. Without an explicit null check the method
    NPEs on lh.getLedgerMetadata() and could invoke
    handleUnrecoverableErrorDuringAdd on a stale handle. Extended the
    early-return to if (completed || lh == null || clientCtx == null).

PendingAddOpTest.java

Four new unit tests:

Test What it verifies
testMaybeTimeoutReturnsFalseWhenClientCtxIsNull The exact race: clientCtx = null must not NPE
testMaybeTimeoutReturnsFalseWhenWithinQuorumTimeout Normal path — not yet timed out
testMaybeTimeoutReturnsTrueWhenQuorumTimeoutExpired Normal path — timed out, true returned
testTimeoutQuorumWaitIsNoOpWhenAlreadyRecycled Recycle wins the monitor race after maybeTimeout() already saw a non-null clientCtx: timeoutQuorumWait() must not NPE and must not invoke handleUnrecoverableErrorDuringAdd on the stale handle

…ecycling

monitorPendingAddOps() iterates the pendingAddOps queue and calls
maybeTimeout() on each op. Concurrently, sendAddSuccessCallbacks()
can remove a completed op from the queue and, via submitCallback(),
lead to recyclePendAddOpObject() which sets clientCtx = null. If the
scheduler thread still holds an iterator reference to that op and then
calls maybeTimeout(), the dereference of clientCtx causes an NPE.

The race is triggered in practice when addEntryQuorumTimeoutNanos > 0
(i.e. the quorum-timeout monitor is enabled). It became visible with
Netty 4.1.130, which changed Recycler thread-scheduling behavior and
made the narrow window between the iterator snapshot and the null
assignment observable.

Fix:
- Make clientCtx volatile so that the null write in the synchronized
  recyclePendAddOpObject() is immediately visible to the unsynchronized
  maybeTimeout() reader.
- Guard the top of maybeTimeout() with an explicit null check: if
  clientCtx is null the op has already completed and recycled, so
  there is nothing to time out and the method returns false.

Closes: apache#4759
Comment thread bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java Outdated
…t recycle

Per @merlimat's review on apache#4760, the original null-check in maybeTimeout()
still races: clientCtx may be nulled between the guard and the subsequent
getConf() dereference. Volatile only addresses visibility, not atomicity
across two reads.

- Snapshot clientCtx into a local in maybeTimeout(); the local cannot be
  mutated by another thread, so the dereference is race-free.
- Drop the volatile modifier on clientCtx (no longer load-bearing once the
  read is done once into a local).
- Extend timeoutQuorumWait()'s early-return to also short-circuit when
  lh / clientCtx are null. recyclePendAddOpObject() resets `completed` to
  false, so the prior `if (completed)` guard does not cover the
  already-recycled case; without this check timeoutQuorumWait() NPEs on
  lh.getLedgerMetadata() (and worse, could invoke
  handleUnrecoverableErrorDuringAdd on a stale handle).
- Add testTimeoutQuorumWaitIsNoOpWhenAlreadyRecycled covering the new guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@eolivelli
Copy link
Copy Markdown
Contributor Author

@merlimat can you please take another look ?

Copy link
Copy Markdown
Contributor

@merlimat merlimat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@merlimat merlimat merged commit 697bd59 into apache:master Apr 29, 2026
20 checks passed
merlimat pushed a commit that referenced this pull request Apr 29, 2026
…ecycling (#4760)

* Fix NPE in PendingAddOp.maybeTimeout() when clientCtx is null after recycling

monitorPendingAddOps() iterates the pendingAddOps queue and calls
maybeTimeout() on each op. Concurrently, sendAddSuccessCallbacks()
can remove a completed op from the queue and, via submitCallback(),
lead to recyclePendAddOpObject() which sets clientCtx = null. If the
scheduler thread still holds an iterator reference to that op and then
calls maybeTimeout(), the dereference of clientCtx causes an NPE.

The race is triggered in practice when addEntryQuorumTimeoutNanos > 0
(i.e. the quorum-timeout monitor is enabled). It became visible with
Netty 4.1.130, which changed Recycler thread-scheduling behavior and
made the narrow window between the iterator snapshot and the null
assignment observable.

Fix:
- Make clientCtx volatile so that the null write in the synchronized
  recyclePendAddOpObject() is immediately visible to the unsynchronized
  maybeTimeout() reader.
- Guard the top of maybeTimeout() with an explicit null check: if
  clientCtx is null the op has already completed and recycled, so
  there is nothing to time out and the method returns false.

Closes: #4759

* Address review: snapshot clientCtx and guard timeoutQuorumWait against recycle

Per @merlimat's review on #4760, the original null-check in maybeTimeout()
still races: clientCtx may be nulled between the guard and the subsequent
getConf() dereference. Volatile only addresses visibility, not atomicity
across two reads.

- Snapshot clientCtx into a local in maybeTimeout(); the local cannot be
  mutated by another thread, so the dereference is race-free.
- Drop the volatile modifier on clientCtx (no longer load-bearing once the
  read is done once into a local).
- Extend timeoutQuorumWait()'s early-return to also short-circuit when
  lh / clientCtx are null. recyclePendAddOpObject() resets `completed` to
  false, so the prior `if (completed)` guard does not cover the
  already-recycled case; without this check timeoutQuorumWait() NPEs on
  lh.getLedgerMetadata() (and worse, could invoke
  handleUnrecoverableErrorDuringAdd on a stale handle).
- Add testTimeoutQuorumWaitIsNoOpWhenAlreadyRecycled covering the new guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

NullPointerException in PendingAddOp.maybeTimeout when clientCtx is null after recycling

2 participants