Fix NPE in PendingAddOp.maybeTimeout() when clientCtx is null after recycling#4760
Merged
merlimat merged 2 commits intoapache:masterfrom Apr 29, 2026
Conversation
…ecycling monitorPendingAddOps() iterates the pendingAddOps queue and calls maybeTimeout() on each op. Concurrently, sendAddSuccessCallbacks() can remove a completed op from the queue and, via submitCallback(), lead to recyclePendAddOpObject() which sets clientCtx = null. If the scheduler thread still holds an iterator reference to that op and then calls maybeTimeout(), the dereference of clientCtx causes an NPE. The race is triggered in practice when addEntryQuorumTimeoutNanos > 0 (i.e. the quorum-timeout monitor is enabled). It became visible with Netty 4.1.130, which changed Recycler thread-scheduling behavior and made the narrow window between the iterator snapshot and the null assignment observable. Fix: - Make clientCtx volatile so that the null write in the synchronized recyclePendAddOpObject() is immediately visible to the unsynchronized maybeTimeout() reader. - Guard the top of maybeTimeout() with an explicit null check: if clientCtx is null the op has already completed and recycled, so there is nothing to time out and the method returns false. Closes: apache#4759
merlimat
reviewed
Apr 26, 2026
…t recycle Per @merlimat's review on apache#4760, the original null-check in maybeTimeout() still races: clientCtx may be nulled between the guard and the subsequent getConf() dereference. Volatile only addresses visibility, not atomicity across two reads. - Snapshot clientCtx into a local in maybeTimeout(); the local cannot be mutated by another thread, so the dereference is race-free. - Drop the volatile modifier on clientCtx (no longer load-bearing once the read is done once into a local). - Extend timeoutQuorumWait()'s early-return to also short-circuit when lh / clientCtx are null. recyclePendAddOpObject() resets `completed` to false, so the prior `if (completed)` guard does not cover the already-recycled case; without this check timeoutQuorumWait() NPEs on lh.getLedgerMetadata() (and worse, could invoke handleUnrecoverableErrorDuringAdd on a stale handle). - Add testTimeoutQuorumWaitIsNoOpWhenAlreadyRecycled covering the new guard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Author
|
@merlimat can you please take another look ? |
merlimat
pushed a commit
that referenced
this pull request
Apr 29, 2026
…ecycling (#4760) * Fix NPE in PendingAddOp.maybeTimeout() when clientCtx is null after recycling monitorPendingAddOps() iterates the pendingAddOps queue and calls maybeTimeout() on each op. Concurrently, sendAddSuccessCallbacks() can remove a completed op from the queue and, via submitCallback(), lead to recyclePendAddOpObject() which sets clientCtx = null. If the scheduler thread still holds an iterator reference to that op and then calls maybeTimeout(), the dereference of clientCtx causes an NPE. The race is triggered in practice when addEntryQuorumTimeoutNanos > 0 (i.e. the quorum-timeout monitor is enabled). It became visible with Netty 4.1.130, which changed Recycler thread-scheduling behavior and made the narrow window between the iterator snapshot and the null assignment observable. Fix: - Make clientCtx volatile so that the null write in the synchronized recyclePendAddOpObject() is immediately visible to the unsynchronized maybeTimeout() reader. - Guard the top of maybeTimeout() with an explicit null check: if clientCtx is null the op has already completed and recycled, so there is nothing to time out and the method returns false. Closes: #4759 * Address review: snapshot clientCtx and guard timeoutQuorumWait against recycle Per @merlimat's review on #4760, the original null-check in maybeTimeout() still races: clientCtx may be nulled between the guard and the subsequent getConf() dereference. Volatile only addresses visibility, not atomicity across two reads. - Snapshot clientCtx into a local in maybeTimeout(); the local cannot be mutated by another thread, so the dereference is race-free. - Drop the volatile modifier on clientCtx (no longer load-bearing once the read is done once into a local). - Extend timeoutQuorumWait()'s early-return to also short-circuit when lh / clientCtx are null. recyclePendAddOpObject() resets `completed` to false, so the prior `if (completed)` guard does not cover the already-recycled case; without this check timeoutQuorumWait() NPEs on lh.getLedgerMetadata() (and worse, could invoke handleUnrecoverableErrorDuringAdd on a stale handle). - Add testTimeoutQuorumWaitIsNoOpWhenAlreadyRecycled covering the new guard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
monitorPendingAddOps()iterates over thependingAddOpsqueue and callsmaybeTimeout()on each op. Concurrently,sendAddSuccessCallbacks()canremove a completed op from the queue, call
submitCallback()on it, andultimately trigger
recyclePendAddOpObject()which setsclientCtx = null(and
lh = null, etc.).If the scheduler thread still holds a live iterator reference to the same op
and then calls
maybeTimeout(), dereferencingclientCtxcauses aNullPointerException:The race is only triggered when
addEntryQuorumTimeoutNanos > 0(i.e. thequorum-timeout monitor is actually scheduled). It became reliably observable
with Netty 4.1.130, which changed
Recyclerthread-scheduling behaviorand narrowed the window enough to expose the pre-existing race.
Closes #4759.
Changes
PendingAddOp.javamaybeTimeout()— snapshotclientCtxinto a local.A naive null-check at the top of the method does not close the race:
clientCtxcould still be nulled between the guard and the subsequentclientCtx.getConf()dereference.volatiledoes not help here — itonly guarantees visibility, not atomicity across two reads. Reading the
field once into a local variable makes the second access race-free
(locals are not visible to other threads).
timeoutQuorumWait()— bail when already recycled.Even after
maybeTimeout()snapshots a non-nullclientCtx,recyclePendAddOpObject()may complete beforetimeoutQuorumWait()acquires the monitor (both methods are
synchronized). The pre-existingif (completed) return;guard does not cover this — recyclingresets
completedtofalse. Without an explicit null check the methodNPEs on
lh.getLedgerMetadata()and could invokehandleUnrecoverableErrorDuringAddon a stale handle. Extended theearly-return to
if (completed || lh == null || clientCtx == null).PendingAddOpTest.javaFour new unit tests:
testMaybeTimeoutReturnsFalseWhenClientCtxIsNullclientCtx = nullmust not NPEtestMaybeTimeoutReturnsFalseWhenWithinQuorumTimeouttestMaybeTimeoutReturnsTrueWhenQuorumTimeoutExpiredtruereturnedtestTimeoutQuorumWaitIsNoOpWhenAlreadyRecycledmaybeTimeout()already saw a non-nullclientCtx:timeoutQuorumWait()must not NPE and must not invokehandleUnrecoverableErrorDuringAddon the stale handle