docs(design): propose hotspot split M2 — migration plane#945
Conversation
Parent partial doc 2026_02_18_partial_hotspot_shard_split.md tracks M1 as shipped; M2-M4 are open. This proposal scopes M2 — the migration plane that moves a split child to a different Raft group. Scope: - SplitJob durable record under !dist|job|<id> with phase/cursor/ts fields so a leader flap resumes mid-flight. - BACKFILL / FENCE / DELTA_COPY / CUTOVER / CLEANUP state machine driven by a single migrator goroutine on the default-group leader, with per-phase recovery semantics enumerated in the matrix. - proto/distribution.proto extension: target_group_id on SplitRangeRequest; GetSplitJob / ListSplitJobs / AbandonSplitJob. - proto/internal.proto: ExportRangeVersions (server-streaming) + ImportRangeVersions (idempotent via persisted import_cursor). - store/MVCCStore.ExportVersions / ImportVersions with internal-key family dispatch and routeKey() equivalence assertion (covers !txn|, !lst|, !ddb|, !redis|, !sqs|, !s3|). - ErrRouteWriteFenced at both coordinator and FSM (defense in depth on the gRPC mesh + Raft apply path); reuses PR #927 / #932 Composed-1 alignment at CUTOVER. - 8-PR landing plan with five-lens self-review per PR. Non-goals (M3+): hotspot detection, auto-split scheduler, online merge, per-tenant policy. 7 open questions called out at the end for reviewer input — notably grace-period sizing, abandon-after-cutover policy, and backfill-vs-prepare-window semantics. Per the design-doc-first workflow in CLAUDE.md, this PR is the doc only; M2-PR1..PR7 follow once the proposal is accepted.
|
Important Review skippedReview was skipped as selected files did not have any reviewable changes. 💤 Files selected but had no reviewable changes (1)
⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughDesign doc adding a Milestone‑2 migration plane: durable SplitJob state machine and phases, cross-group route/staging semantics, streaming MVCC Export/Import contracts and RPCs, CUTOVER/promote semantics, coordinator/FSM fencing and read-fence, migrator runtime, recovery matrix, test plan, and rollout gating. ChangesMilestone 2 Migration Design
Sequence Diagram(s)sequenceDiagram
participant Coordinator
participant SourceStore
participant TargetStore
participant Catalog
Coordinator->>SourceStore: ExportRangeVersions(job_id, start_cursor, KeyFilter)
SourceStore->>Coordinator: Stream(versions[], cursor)
Coordinator->>TargetStore: ImportRangeVersions(job_id, cursor, KeyFilter)
TargetStore->>TargetStore: persist staged MVCC pages (idempotent by job_id+cursor)
TargetStore->>Catalog: stage shadow keyspace (not bound in catalog until CUTOVER)
Coordinator->>Catalog: CUTOVER (atomic catalog rebind, commit route_version)
Catalog->>Coordinator: ack CUTOVER (route_version increment)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request proposes a design document for Milestone 2 of the Hotspot Shard Split feature, focusing on the migration plane. It outlines the data model additions (such as SplitJob), state machine transitions, RPC changes, MVCC range export/import, and coordinator/FSM integration. The review feedback highlights three key areas for improvement: defining a retention or garbage collection policy for completed SplitJob records to prevent unbounded storage growth, decoupling the low-level store package from high-level routing logic by using an interceptor or delegate pattern instead of directly referencing routeKey, and addressing rolling upgrade risks where older M1 nodes might not enforce the WriteFenced state by detailing mitigation strategies such as capability gates.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| ### 3.1 SplitJob | ||
|
|
||
| A split job is durable state for the migration. Persist under reserved keys in the default group, alongside the route catalog. | ||
|
|
||
| Key layout: | ||
|
|
||
| - `!dist|meta|next_job_id` — uint64, allocator for `job_id`. | ||
| - `!dist|job|<job_id>` — encoded `SplitJob` record. |
There was a problem hiding this comment.
The proposed design persists SplitJob records under !dist|job|<job_id> but does not specify a limit or a garbage collection/retention policy for completed (DONE, FAILED, or abandoned) jobs. Over time, as the cluster performs many hotspot splits, this collection can grow unboundedly. This poses a risk of unbounded storage growth and potential OOM issues when listing or scanning jobs (e.g., via ListSplitJobs).
To prevent unbounded memory and storage growth, please define a fixed bound or a retention/cleanup policy to automatically garbage collect or archive historical job records after a certain period or count.
References
- To prevent unbounded memory growth and potential OOM issues, apply a fixed bound to collections that can grow from external requests, such as pending configuration changes. Reject new requests when the bound is reached.
| ### 6.3 Internal-key coverage | ||
|
|
||
| §9 of the parent doc enumerates the key families. The exporter scans the raw storage range `[start, end)` filtered by **routeKey** equivalence to the moving range, so internal keys (`!txn|...`, `!lst|...`, redis hash/list/etc.) land in the same shard as their logical owner. The importer dispatches by `key_family` into the matching store helper to preserve any per-family invariant (e.g., list head pointer updates). | ||
|
|
||
| A simple safety check: every exported key must map back to the moving range under `routeKey()`. The exporter asserts this on every row before adding to the chunk; an assertion failure aborts the job and surfaces in `last_error`. This protects against a future internal-key family being added that the exporter forgot to teach. |
There was a problem hiding this comment.
The proposed design has the low-level store package (store/mvcc_store.go) directly referencing and calling routeKey() to filter keys and map them back to the moving range. This introduces tight coupling between the low-level generic storage package and high-level routing/domain logic.
Per the general rules, to prevent tight coupling between low-level generic packages and high-level domain packages, you should use an interceptor or delegate pattern with interfaces (or function values) defined in the low-level package. For example, ExportVersions can accept an optional, nil-safe key-filter function or delegate interface (e.g., func(key []byte) bool) injected from the high-level package, keeping the store package generic and decoupled.
References
- To prevent tight coupling between low-level generic packages and high-level domain packages, use an interceptor or delegate pattern with interfaces defined in the low-level package. Ensure the interceptor is optional (nil-safe) so that builds or environments with the high-level feature disabled can run without modification.
| | Lens | M2-specific risk | | ||
| |---|---| | ||
| | Data loss | CUTOVER atomicity; ImportRangeVersions idempotency; the BACKFILL/DELTA_COPY commit-ts windows have no gap. | | ||
| | Concurrency | Migrator goroutine vs leadership election; FENCE visibility timing on followers; AbandonJob race with phase advance. | | ||
| | Performance | Chunk size vs gRPC frame limit; pacing default vs prod throughput; cursor lookup cost per import. | | ||
| | Data consistency | Composed-1 alignment at CUTOVER; reader at `≤ fence_ts` on source vs reader at `≥ catalog_version+1` on target; internal-key route mapping. | | ||
| | Test coverage | Property tests over chunk-then-restart sequences; FSM unit for FENCE; Jepsen split + nemesis. | |
There was a problem hiding this comment.
The design introduces a significant operational risk during rolling upgrades. If a migration job is initiated while some nodes in the cluster are still running the older M1 version, those M1 nodes will not enforce the WriteFenced state (since the FENCE rejection logic is introduced in M2-PR5). This could allow writes to be accepted on the source group during the FENCE/DELTA_COPY phases, leading to silent write loss or inconsistency.
Per the general rules, when a design document identifies a significant operational risk like rolling upgrade incompatibility, it must detail potential mitigation strategies. Please update the design (e.g., in the self-review or as a new section) to include layered mitigations, such as using admin RPC capability gates to prevent starting any SplitRange migration jobs until all nodes have upgraded and advertised M2 capability.
References
- When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
There was a problem hiding this comment.
Already addressed in §11.1 ("Rolling-upgrade compatibility") landed in commit 93de9b2: cluster-readiness gate at the SplitRange entry RPC keyed on the cap_migration_v2 capability bit in the existing distribution heartbeat, with M2-PR5 as the bit-advertise commit so the gate semantics are safe across rollout. Could you confirm this resolves the finding, or point at what's still uncovered?
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3aca90c833
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ├─ BACKFILL: right child created with state=MigratingTarget on target group; | ||
| │ source remains Active; writes still flow to source. | ||
| ├─ FENCE: source.right_child → WriteFenced; writes on the moving range |
There was a problem hiding this comment.
Avoid publishing overlapping migration routes
When a cross-group job enters BACKFILL, this plan creates the target right child while leaving the full source route Active, so the catalog snapshot contains overlapping ranges for the moving interval. The current engine does not ignore non-Active states when applying snapshots: routesFromCatalog converts every descriptor and validateRouteOrder rejects any prev.End > curr.Start with ErrEngineSnapshotRouteOverlap (distribution/engine.go:468-523). As written, watchers would fail to apply the first migration catalog snapshot, so the migration cannot progress through the proposed visible route states unless the design keeps the target out of the route snapshot or changes the overlap invariant.
Useful? React with 👍 / 👎.
|
|
||
| - `kv/fsm.go` already routes through `verifyComposed1` at apply. Extend `verifyOwnerFromSnapshot` to also reject when the resolved route's state is `WriteFenced` — even if a coordinator with a stale engine forwarded a write, the FSM closes the door. The error wraps `ErrRouteWriteFenced` so M4 retry logic differentiates from `ErrComposed1Violation`. | ||
|
|
||
| Reads are **not** fenced: `ShardStore.GetAt` continues to serve from source until CUTOVER. Snapshot reads at any `commit_ts ≤ fence_ts` remain consistent because the source's MVCC history is unchanged through DELTA_COPY. |
There was a problem hiding this comment.
Add a read fence for post-cutover stale routes
If a coordinator or source-group leader has not applied the CUTOVER catalog yet, leaving reads unfenced lets it keep routing the moved key to the source while new writes are already accepted on the target. The current read path resolves the group from the local engine and then serves/proxies the read without carrying an observed catalog version (kv/shard_store.go:38-53, kv/shard_store.go:1471-1477), while the Composed-1 guard cited here only checks mutations in verifyOwnerFromSnapshot (kv/fsm.go:753-773). During the CLEANUP grace window this can return the pre-cutover source value after a successful post-cutover target write, so the design needs a read-side route-version fence/redirect or another mechanism that stops stale source reads for the moved range.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md (2)
108-108: ⚖️ Poor tradeoffConfirm operational acceptance of one-way CUTOVER.
After CUTOVER completes, the migration cannot be rolled back without implementing a full reverse migration (explicitly out of scope for M2). This means if post-CUTOVER issues arise (e.g., target group instability, data corruption discovered later), the only recovery path is forward through CLEANUP or manual intervention.
Is this operational trade-off acceptable for M2, or should §12 Open Questions include a specific item about post-CUTOVER failure scenarios and operator runbooks?
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md` at line 108, The document currently states that CUTOVER is one-way and rollback requires a reverse migration; update the design doc to explicitly capture operational acceptance by either (A) adding a clear sentence in the CUTOVER paragraph that the team accepts the one-way behavior for M2 and lists recovery paths (CLEANUP, manual intervention, reverse migration out-of-scope), or (B) add a new item under §12 Open Questions titled "Post-CUTOVER failure scenarios and operator runbooks" that requests acceptance criteria, required runbooks, and escalation steps for issues like target-group instability or latent data corruption; reference the terms CUTOVER, PLANNED → BACKFILL, AbandonJob, and CLEANUP so reviewers can locate and verify the change.
258-261: ⚖️ Poor tradeoffVerify sufficiency of routeKey() assertion failure handling.
Lines 260-261 state that if an exported key fails the
routeKey()equivalence check, "an assertion failure aborts the job and surfaces inlast_error." Per §4 line 106, this puts the job intoFAILEDstate requiring manual operator intervention.This safety check protects against a future internal-key family being added without updating the exporter. However, silent job abortion may not provide sufficient visibility for this class of bug.
Consider:
- Is
last_errorsurfaced prominently enough (e.g., via metrics/alerts) to catch this during development/testing?- Should dev/test environments panic on this assertion to fail-fast rather than require operator log inspection?
- Should this be elevated in §12 Open Questions as a code-maintenance concern?
The defensive check is excellent; the question is whether the failure mode provides adequate feedback.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md` around lines 258 - 261, The routeKey() assertion currently aborts the exporter job and writes to last_error leaving the job in FAILED state; enhance visibility and fail-fast behavior by (1) emitting a dedicated metric/counter and structured error log whenever routeKey() mismatches occur (include routeKey, key_family, and chunk id) so monitoring/alerts can catch it, (2) adding a configurable dev-only mode on the exporter (e.g., Exporter devFailFast flag or ENV) that panics/throws immediately instead of only setting last_error to ensure fast feedback in testing, and (3) document this operational choice in the design notes (add to §12 Open Questions) so maintainers know to update alerting/ops runbooks; refer to routeKey(), exporter, last_error and the FAILED job state when making these changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md`:
- Around line 76-86: Add a language specifier to the fenced code block that
contains the migration state diagram (the block starting with "source.Active"
and the listed states BACKFILL/FENCE/DELTA_COPY/CUTOVER/CLEANUP) to satisfy
markdownlint MD040; update the opening fence from ``` to something like ```text
(or ```nohighlight) so the block is explicitly marked as plain text.
---
Nitpick comments:
In `@docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md`:
- Line 108: The document currently states that CUTOVER is one-way and rollback
requires a reverse migration; update the design doc to explicitly capture
operational acceptance by either (A) adding a clear sentence in the CUTOVER
paragraph that the team accepts the one-way behavior for M2 and lists recovery
paths (CLEANUP, manual intervention, reverse migration out-of-scope), or (B) add
a new item under §12 Open Questions titled "Post-CUTOVER failure scenarios and
operator runbooks" that requests acceptance criteria, required runbooks, and
escalation steps for issues like target-group instability or latent data
corruption; reference the terms CUTOVER, PLANNED → BACKFILL, AbandonJob, and
CLEANUP so reviewers can locate and verify the change.
- Around line 258-261: The routeKey() assertion currently aborts the exporter
job and writes to last_error leaving the job in FAILED state; enhance visibility
and fail-fast behavior by (1) emitting a dedicated metric/counter and structured
error log whenever routeKey() mismatches occur (include routeKey, key_family,
and chunk id) so monitoring/alerts can catch it, (2) adding a configurable
dev-only mode on the exporter (e.g., Exporter devFailFast flag or ENV) that
panics/throws immediately instead of only setting last_error to ensure fast
feedback in testing, and (3) document this operational choice in the design
notes (add to §12 Open Questions) so maintainers know to update alerting/ops
runbooks; refer to routeKey(), exporter, last_error and the FAILED job state
when making these changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 70d218ee-479f-4859-864a-0bdc51656929
📒 Files selected for processing (1)
docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md
… minor)
Codex P1 (line 80) — overlapping migration routes
Rewrote §3.2 state machine: catalog snapshots are kept strictly
non-overlapping throughout the job. BACKFILL/DELTA_COPY land in a
shadow keyspace (!dist|migstage|<job_id>|<raw_key>) on the target
group's MVCC store, outside the catalog. The catalog gets a NEW
WriteFenced route at FENCE-time (split out from source) and the
group-rebind happens atomically at CUTOVER. routesFromCatalog +
validateRouteOrder (distribution/engine.go:496-527) stay green.
RouteState.MigratingSource/MigratingTarget are reserved but unused
in M2.
Codex P1 (line 277) — post-CUTOVER read fence
Added §7.2: read-path now carries read_route_version (parallel to
ObservedRouteVersion on writes). New verifyReadOwner gate next to
verifyOwnerFromSnapshot returns ErrRouteMoved{new_route_version,
new_group_id} when a stale coordinator routes a read to source
after CUTOVER. Coordinator does WatcherRefresh + single retry.
CLEANUP grace window readFenceGrace=30s bounds the corner case.
The write-side Composed-1 gate is unchanged so PR #932's Jepsen
trust holds.
Gemini medium (line 51) — SplitJob unbounded growth
Added §3.1.1 + §4.2: live cap maxLiveJobs=64 enforced by
SplitRange; on terminal transition jobs move to
!dist|jobhist|<terminal_at_ms_be>|<id> with retention by
maxJobHistory=1k OR historyTTL=7d (whichever tighter); once-per-
minute sweep on default-group leader. ListSplitJobs supports
since_terminal_at_ms / phase filters and cursors.
Gemini medium (line 260) — store/routeKey tight coupling
Added store.KeyFilter delegate; ExportVersions takes the filter as
a nil-safe argument so the store package never imports kv or
knows about routing keys. routeKey() closure built in
kv/migrator_filter.go and passed in by the migrator. §6.3
rewritten with the explicit decoupling seam.
Gemini medium (line 346) — rolling-upgrade incompatibility
Added §11.1: node_capability_bitfield in the existing distribution
heartbeat; cap_migration_v2 bit gates cross-group SplitRange at
the entry RPC with ErrClusterNotReadyForMigration; in-flight
job-pause path on observed downgrade; bit is advertised only in
M2-PR5 (the commit that lands the FENCE rejection enforcement)
so the gate is safe across the rollout. Same-group split (M1)
remains unconditionally available.
CodeRabbit minor (line 86) — MD040 fenced-code-block language
All fenced blocks now carry a language tag (text/proto/go).
Replaced the bare ``` with ```text on the state-machine diagram
and consolidated to a single language across the diagram and
evolution table.
PR-plan note (§11): the rolling-upgrade gate dictates a hard
ordering between M2-PR4 (state machine) and M2-PR5 (FENCE
enforcement) — the capability bit only goes live in PR5.
|
@claude review — addressed all 6 findings from the gemini / codex / coderabbit reviews in commit 93de9b2: Codex P1 (line 80, overlapping routes) — §3.2 rewritten so catalog snapshots stay strictly non-overlapping. BACKFILL/DELTA_COPY land in a target-side shadow keyspace ( Codex P1 (line 277, post-CUTOVER read fence) — added §7.2 with Gemini medium (line 51, SplitJob unbounded growth) — added §3.1.1 / §4.2: Gemini medium (line 260, store/routeKey coupling) — Gemini medium (line 346, rolling upgrade) — added §11.1: CodeRabbit minor (line 86, MD040) — all fenced blocks now carry a language tag. |
|
Claude finished @bootjp's task in 3m 26s —— View job Review of commit 93de9b2 — addressing 6 prior findingsTodo list
Verified: all 6 prior findings are addressed
New observations on the updated designP1 — CUTOVER two-step ordering undefined (data-visibility gap)§3.2 (line 139) says:
But the CUTOVER step as described consists of two independent Raft proposals on different groups:
The design does not specify the ordering of these two. If (1) commits before (2), reads routed to the target during the rename window return empty results: the catalog says the target owns the range, but the target's MVCC only has shadow keys. The read fence in §7.2 only redirects stale source reads; it doesn't compensate for the target not yet having live data. The safe ordering is: (2) commits on the target first, then (1) executes on the default group — by the time routing switches, all data is live on the target. But:
Suggest: add a §7.5 (or a sub-bullet in §4's CUTOVER row) that specifies: "Target bulk-rename commits first ( Medium — Grace window inconsistency (§7.2.4 vs §4 vs Open Question 3)Three places cite the CLEANUP grace default and they disagree:
Pick one and make them consistent. Whichever wins should also answer Open Question 3 (or at least note the reasoning). For reference: Medium — "Wait until applied" in §7.2.2 under-specified
How is this wait implemented? Options include an applied-index condition variable, a short retry loop with a timeout, or a channel signalled by the FSM apply goroutine. The choice affects:
A one-sentence note (e.g., "implemented as a condition-variable wait on the FSM's appliedIndex, timedout by Minor —
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 93de9b2d2e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // ImportVersions writes the given versions idempotently. A second call | ||
| // with the same (job_id, cursor) MUST be a no-op past the recorded | ||
| // cursor — the importer records last-applied cursor under | ||
| // !dist|job|<id>|import_cursor so a network retry doesn't double-write. | ||
| ImportVersions(ctx context.Context, jobID uint64, versions []MVCCVersion, | ||
| cursor []byte) ([]byte, error) |
There was a problem hiding this comment.
Advance target HLC when importing versions
When the source range contains commit timestamps higher than the target group's current LastCommitTS/HLC, this import contract only writes versions and an idempotency cursor, but does not require the target to observe the max imported commit_ts. Since the existing FSM observes commit timestamps only after normal pb.Request applies, a migration import path can leave the target clock below the staged data; after CUTOVER the first target write can receive a lower timestamp than an imported older value, causing MVCC time travel where later reads resurrect the pre-cutover value. The import apply needs to atomically advance metaLastCommitTS and the target HLC (or otherwise fence new writes) to at least the max imported timestamp before the route becomes active.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit 17b0719 — added §6.2.1 "Target HLC advancement": ImportVersions now atomically advances both metaLastCommitTS and the HLC ceiling (hlc.SetPhysicalCeiling) to max(batch.commit_ts) under the same FSM apply lock as the MVCC writes. SplitJob carries a monotone max_imported_ts witness, and CUTOVER is gated on a final SetPhysicalCeiling(max_imported_ts) proposal on the TARGET group before the catalog CAS — so post-CUTOVER Next() is provably greater than every imported ts. Reuses the existing Composed-1 monotone-ceiling primitive (PR #927) — no new fence.
| | `snapshot_ts` | uint64 | HLC commit-ts pinned at BACKFILL entry. | | ||
| | `fence_ts` | uint64 | HLC commit-ts pinned at FENCE → DELTA_COPY entry. | | ||
| | `cutover_version` | uint64 | Catalog version produced by the CUTOVER write; populated at CUTOVER, used by §7.2 read-fence and §4.2 GC. | | ||
| | `cursor` | bytes | Next-key cursor for BACKFILL / DELTA_COPY resume. | |
There was a problem hiding this comment.
Make the export cursor address MVCC versions
The job cursor is defined as a next-key cursor, but an MVCC export can contain many committed versions for the same raw key. If a hot key's history exceeds chunkBytes, resuming by key alone either has to keep that entire key history in one unbounded chunk or risks skipping/duplicating the remaining versions after a restart; the API should make the cursor opaque over the encoded MVCC position (raw key plus commit timestamp) or explicitly require chunks never split a key's full version history.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit 17b0719 — added §6.1.1 "Cursor granularity": the cursor now encodes (raw_key, commit_ts) and the exporter iterates (raw_key ASC, commit_ts DESC) with exclusive-resume semantics. Chunks may contain only versions of one raw key without an all-or-nothing constraint on the history — the next chunk continues at the next-older commit_ts of the same key. The SplitJob row was updated accordingly. Opaque encoding so the store-side type stays []byte and the exporter alone owns the codec (preserves the §6.1 store/kv decoupling seam).
…x P1/P2) Codex P1 on PR #945 (line 323) — Advance target HLC when importing Source range can contain commit_ts > target group's current HLC / metaLastCommitTS. Import-only versions-and-cursor writes leave the target clock behind; after CUTOVER the first target write at Next() can collide with imported ts and resurrect pre-cutover values on a snapshot read. Fix: ImportVersions atomically advances metaLastCommitTS AND calls hlc.SetPhysicalCeiling(batchMax) under the same FSM apply lock as the MVCC writes; SplitJob carries a monotone max_imported_ts witness; CUTOVER is gated on a final SetPhysicalCeiling(max_imported_ts) proposal on the TARGET group before the catalog CAS. Reuses the existing Composed-1 / HLC monotone-ceiling primitive — no new fence. New unit + property tests pin (a) per-batch advance, (b) CUTOVER gap-closing proposal, (c) strict Next() > max_imported_ts post-CUTOVER. Codex P2 on PR #945 (line 66) — Cursor must address MVCC versions A "next-key" cursor cannot resume safely on a hot key whose version history exceeds chunkBytes. New §6.1.1 spec: opaque cursor encodes (raw_key, commit_ts), iterated in (raw_key ASC, commit_ts DESC) order with exclusive-resume semantics. Chunks may contain only versions of one raw key; the next chunk continues with the next-older version of the same key — no all-or-nothing constraint on a key's history. Idempotent ack at the importer dedups any one resume overlap. SplitJob row (§3.1) gains `max_imported_ts uint64` and the cursor field description is updated. §3.2 state-machine diagram now shows the SetPhysicalCeiling proposal as a precondition of CUTOVER. No caller-visible wire change beyond the existing M2 proto deltas.
|
@claude review — round 2:
|
|
Claude finished @bootjp's task in 2m 51s —— View job Review — Round 2Todo list
Round-2 items — all three verified
Remaining items from prior reviewP1 — CUTOVER bulk-rename ordering still implicit (unresolved)The prior review flagged that the CUTOVER step on two different Raft groups must be sequenced: target bulk-rename first, then catalog CAS — otherwise there is a window where the catalog says the target owns the range but the target's MVCC only has shadow keys, making reads to the target return empty results. The current doc has moved closer but the ordering is still not stated explicitly in the §3.2 CUTOVER description. §6.2.1 (line 367) says It does not list The safe and intended ordering is:
Steps 2 and 3 are specified, step 1 is not. Without step 1 appearing in the CUTOVER sequence, an implementer working from §3.2 alone could issue the catalog CAS before the rename is applied, creating the data-availability gap. The leader-flap resume case (new migrator checking whether rename already landed before re-proposing the catalog write) also needs to be specced. Suggested fix: add a Medium — Grace window inconsistency (still unresolved)Three locations cite the CLEANUP grace default and they still disagree after commit 17b0719:
§7.2.4 was updated with a rationale ("comfortably exceed both lease TTL and watcher tick × 2") but the value there (30 s) still disagrees with the other two (60 s). Pick one and update all three; if 30 s wins, close Open Question 3 with the chosen rationale. Minor —
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 17b0719055
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| FSM-side defense in depth: | ||
|
|
||
| - `kv/fsm.go` already routes through `verifyComposed1` at apply. Extend `verifyOwnerFromSnapshot` to also reject when the resolved route's state is `WriteFenced` — even if a coordinator with a stale engine forwarded a write, the FSM closes the door. The error wraps `ErrRouteWriteFenced` so M4 retry logic differentiates from `ErrComposed1Violation`. |
There was a problem hiding this comment.
Reject fenced writes even when the route version is unpinned
This FSM-side fence is described as an extension of verifyOwnerFromSnapshot, but today verifyComposed1 returns before that helper when ObservedRouteVersion == 0. There are still intentional zero-version write paths (for example read/write txns with ReadKeys, caller-supplied StartTS, or resolver-claimed keys in kv/sharded_coordinator.go:694-754), so a stale coordinator in those contexts can forward a write to the source during FENCE and the proposed FSM defense will not run. The fence check needs to be on an unconditional apply path, or the design needs to require all mutation flows to carry a nonzero observed route version before cross-group split is enabled.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit c16e1c4. §7.1 now introduces verifyRouteNotFenced as a SEPARATE, unconditional pre-gate ahead of verifyComposed1 — not an extension of verifyOwnerFromSnapshot. It scans mutations with routeKey() against the current catalog snapshot and rejects ErrRouteWriteFenced for any key whose route is WriteFenced, with the same Phase_ABORT bypass and NO dependency on ObservedRouteVersion. So the zero-version write flows in kv/sharded_coordinator.go:694-754 (ReadKeys, caller-supplied StartTS, resolver-claimed keys) are covered — the gate runs on every applied write request unconditionally. The two gates stay independent so a future change to one cannot regress the other.
| [ target.Active(right) ] | ||
| ``` | ||
|
|
||
| Key invariant: **catalog snapshots never contain overlapping ranges**. The target shadow keyspace lives outside the catalog (BACKFILL / DELTA_COPY land into private `!dist|migstage|<job_id>|<raw_key>` keys on the target group's MVCC store; CUTOVER promotes the staged data to live keys via a single bulk-rename FSM apply). This keeps `routesFromCatalog` + `validateRouteOrder` (`distribution/engine.go:496-527`) green throughout the migration — addresses codex P1 on overlapping routes (PR #945 review). RouteState `MigratingSource` / `MigratingTarget` are reserved but unused in M2; they remain available for a future merge / multi-stage migration design. |
There was a problem hiding this comment.
Avoid an unbounded CUTOVER bulk rename
Staging every imported row under !dist|migstage|... and then promoting the whole range via a single FSM apply makes CUTOVER proportional to the entire migrated range. For any split larger than the Raft proposal/apply budget, this defeats the chunked BACKFILL/DELTA_COPY design and can block or fail exactly when the catalog must switch atomically. The design should avoid a per-key bulk move at cutover, or make the staged-to-live promotion incremental while preserving the visibility fence.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit c16e1c4. New §6.4 "Incremental staged-to-live promotion": CUTOVER is O(1) — just a catalog version bump + route swap + set staged_visibility_active on the new target route, NO per-key work. Reads use a staged-fallback through the per-route bit so imported data is visible the instant CUTOVER lands. A leader-local promoter on the target group walks !dist|migstage|<job_id>|* in cursor-resumable chunks (same chunkBytes / pacing as BACKFILL) and proposes one bounded PromoteStaged Raft entry per batch — never a single oversized apply. CLEANUP → DONE waits for the promoter to drain AND the grace window AND the staged_visibility_active flag clear. SplitJob picks up promote_cursor + promotion_completed_ts for resumability and audit.
…945) Codex P1 on PR #945 (line 407) — FENCE must run on unconditional path verifyComposed1 returns before verifyOwnerFromSnapshot when ObservedRouteVersion == 0, and several intentional zero-version write paths exist today (read/write txns with ReadKeys, caller- supplied StartTS, resolver-claimed keys in kv/sharded_coordinator.go:694-754). The naïve "extend verifyOwnerFromSnapshot to reject WriteFenced" placement would silently skip those paths — a stale coordinator could forward a write to the source during FENCE without the FSM defense ever running. Fix in §7.1: introduce verifyRouteNotFenced as a separate, unconditional pre-gate ahead of verifyComposed1. It scans mutations with routeKey() against the current catalog snapshot and rejects ErrRouteWriteFenced for any key whose route is WriteFenced, with the same Phase_ABORT bypass and no dependency on ObservedRouteVersion. Two independent gates, write side unchanged Composed-1 trust preserved. Codex P2 on PR #945 (line 147) — CUTOVER bulk rename is unbounded Staging every imported row under !dist|migstage|... and promoting via a single FSM apply makes CUTOVER O(migrated range) — defeats chunked BACKFILL/DELTA_COPY exactly when the catalog must atomically switch. New §6.4 ships incremental promotion: 1. CUTOVER apply is O(1): catalog version bump + route swap + set staged_visibility_active on the new target route. 2. Reads use a staged-fallback through the per-route bit so imported data is visible the instant CUTOVER lands. 3. A leader-local promoter on the target group walks the staged prefix in cursor-resumable chunks, copying each version to its live position with one bounded PromoteStaged apply per batch. 4. CLEANUP → DONE precondition extends to "promoter empty AND grace window elapsed AND staged_visibility_active cleared", so route is indistinguishable from any Active route once promotion finishes. SplitJob row picks up promote_cursor (opaque) and promotion_completed_ts (diagnostic). §3.2 state-machine diagram updated: CUTOVER no longer says "bulk-rename"; the staged-to- live work is explicitly background and incremental. Both fixes preserve the §6.2.1 HLC ceiling advance and §7.2 read fence guarantees — incremental promotion is a strict superset of the chunked BACKFILL/DELTA_COPY semantics applied at a different phase.
|
@claude review — round 3 (2 codex findings on commit 17b0719):
|
|
Claude finished @bootjp's task in 3m 56s —— View job Review — Round 3Todo list
Round-3 items — both verified
Prior round-2 P1 — CUTOVER two-step ordering: resolved by §6.4The round-2 P1 flagged that "target bulk-rename must commit before the catalog CAS" to avoid a data-availability gap. §6.4 eliminates the bulk-rename entirely: CUTOVER sets The CUTOVER sequence (§3.2, lines 120–134) now correctly specifies: New P1 — §6.4 step 3 "no-op when a later live write has shadowed it" can lose MVCC history§6.4 step 3 (line 411) says the Scenario:
The "shadow" optimization is safe only for a last-write-wins (non-MVCC) store. In an MVCC store with snapshot read semantics, every version of a key must be preserved across its valid range. The correct guard is idempotency: "no-op if the staged version has already been copied to the live namespace (checked by the presence of a live version at the same commit_ts)." The fix is two parts:
The MVCC compaction GC ( Remaining from round 2 (three items still open)Medium — Grace window inconsistency (unresolved, 3rd flag)Three locations still disagree after commit c16e1c4:
§7.2.4 has a clear rationale for 30 s. If 30 s is the chosen value: update §4 line 168 to 30 s and close Open Question 3 with the chosen value + rationale. If 60 s is the chosen value: update §7.2.4 to 60 s and reconcile the rationale. Either way, one pass to make all three locations consistent and close OQ3. Minor —
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c16e1c4335
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // ExportVersions iterates committed versions in the half-open range | ||
| // [start, end) whose commit_ts is in (minCommitTS, maxCommitTS], |
There was a problem hiding this comment.
Scan internal families outside raw route bounds
For splits whose logical range is a normal user-key interval such as [foo, bar), this raw [start, end) MVCC iterator never visits keys like !redis|str|foo, !lst|...foo, or !txn|lock|foo, because those prefixes sort outside the user-key range before the routeKey() filter can accept them. That means the proposed migration can cut over without copying internal state that routes to the moved key, causing data loss or broken adapter invariants; the export needs to scan the relevant internal-family prefixes or use a route-key-indexed traversal rather than relying on a raw-range scan plus filter.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit 231e9a5. New §6.3.1 "Internal-family brackets": migration now runs MULTIPLE parallel exports per phase, one per relevant raw-key family (user-key, !txn|lock|, !txn|cmt|, !txn|rb|, !txn|meta|, !txn|int|, !lst|meta|, !lst|itm|, !redis|, !ddb|item|, !ddb|gsi|, !sqs|, !s3|), each driven by the same KeyFilter so routeKey() remains the single source of truth. PlanExportBrackets() returns the bracket list; per-bracket opaque cursors + done flags on the SplitJob allow independent resumability. Bracket count is small (~12 today) and fan-out parameter caps parallelism. A future adapter MUST add its bracket at the same PR that teaches routeKey() its family.
|
|
||
| #### 7.2.2 Source FSM rejects reads after CUTOVER | ||
|
|
||
| When a source-group FSM applies a lease read whose `read_route_version` is **strictly less than** the FSM's last-applied catalog version, and the requested key's `routeKey()` maps to a route the FSM no longer owns, the FSM returns `ErrRouteMoved{new_route_version, new_group_id}` instead of the value. The error is mapped to a retryable status (gRPC `codes.FailedPrecondition` + detail) and the coordinator retries after a `RouteEngine` refresh. |
There was a problem hiding this comment.
Reject reads when both caller and source are stale
This condition only rejects when the request version is older than the source FSM's local catalog version, so it misses the common post-CUTOVER case where both the coordinator and the source group have only applied the pre-CUTOVER/FENCE version. In that window another coordinator that has refreshed can already write the key on the target, while this stale coordinator routes a read to the stale source and the source serves the old value because read_route_version == local_version; the read fence needs an authoritative cutover/ownership check that also catches equal-stale source reads during the grace window.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit 231e9a5. §6.4 step 2 rewritten: the staged area now participates in MVCC MERGE — newest commit_ts wins. The read takes the newest live version with commit_ts ≤ read_ts AND the newest staged version with commit_ts ≤ read_ts and returns whichever has the greater commit_ts. Four read_ts cases enumerated to prove correctness (live_write.ts > max_imported_ts > staged.ts; merge correctly surfaces live for read_ts ≥ live.ts and staged for read_ts ∈ (max_imported_ts, live.ts)). Per-family adapter invariants preserved via the key_family dispatch.
|
|
||
| No iteration over the migrated key range, no bulk rename, no per-key proposals. The Raft proposal carrying this apply is `O(1)` in the migrated data size — bounded by a handful of catalog descriptors and the SplitJob record. | ||
|
|
||
| 2. **Staged keyspace stays visible through a per-route flag.** After CUTOVER the target FSM's read path checks the route's `staged_visibility_active` bit. When set, a read for key `K` in the moved range first looks up `!dist|migstage|<job_id>|<K>` and falls back to the live key only if the staged entry is absent. (Writes always land in the live keyspace — the CUTOVER bump already routed writers to the target group via the catalog, and §6.2.1's HLC ceiling advance guarantees their `commit_ts` is strictly greater than every imported `commit_ts`, so a live write shadowing a staged version is the *correct* MVCC visibility.) This gives reads access to the imported data the instant CUTOVER lands, with **no promotion work blocking CUTOVER itself**. |
There was a problem hiding this comment.
Merge staged and live MVCC versions on reads
Looking in the staged keyspace first and falling back to live only when staged is absent makes post-CUTOVER reads return stale imported versions after any live write to the same key, until promotion deletes the staged entry. For a read timestamp after the live write's commit timestamp, both staged and live versions exist and the newest live version must win, so the read path needs to merge/compare the staged and live MVCC candidates by commit timestamp instead of treating staged presence as authoritative.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit 231e9a5. §7.2.2 replaces the strict-less-than version check with an authoritative ownership check. When source has applied CUTOVER → unconditional ErrRouteMoved. When source has NOT applied CUTOVER but the heartbeat-piggybacked max_cutover_version_seen indicates a possible stale read → run GetRouteOwnership(routeKey(K), max_cutover_version_seen) against the default-group leader (cached one second). New §7.2.2a walks through the exact 7-step equal-stale scenario (t=0..6) showing the watermark + ownership-query path catches the case the original rule missed.
Codex P1 round-4 on PR #945 — three fundamental design issues in the §6.3 / §6.4 / §7.2 designs landed in c16e1c4: P1 line 310 — Scan internal families outside raw route bounds A single raw [start, end) iterator + routeKey filter misses every internal-family raw key (!txn|lock|foo, !lst|meta|foo, !redis|str|foo, !ddb|item|<tbl>|foo, !sqs|..., !s3|...) because those prefixes sort outside the user-key band — the filter rejects nothing because the iterator never visits those bytes. Without the fix, CUTOVER would leave intent locks, list metadata, and adapter-private state behind on the source. Fix in new §6.3.1 "Internal-family brackets": migration runs multiple parallel exports per phase, one per relevant raw-key family, driven by the same KeyFilter (routeKey() remains the single source of truth). The migrator computes the bracket list from the moving range's logical scope via PlanExportBrackets; each bracket gets its own opaque cursor and done flag on the SplitJob. Bracket count is small (~12 today) and bounded; the fan-out parameter caps parallelism. A future adapter MUST add its bracket at the same PR that teaches routeKey() its family — both are required for end-to-end coverage. P1 line 408 — Merge staged and live MVCC versions on reads The §6.4 step 2 "staged-first, fall back to live when staged is absent" form returned stale imported versions after any live write to the same key. After the §6.2.1 HLC advance, live writes have commit_ts > max_imported_ts; for read_ts ≥ live_write.ts the live version is the most-recent committed, but the fallback rule would still return the older staged version because the staged entry exists until promotion. Fix in §6.4 step 2: the staged area participates in MVCC merge — newest commit_ts wins. The read takes the newest live version with commit_ts ≤ read_ts AND the newest staged version with commit_ts ≤ read_ts and returns whichever has the greater commit_ts. This is the standard MVCC visibility rule applied to a second column-family-like source, not stage-shadows-live. Four read-ts cases enumerated to prove correctness; per-family adapter invariants preserved via the key_family dispatch (§6.3.1). P1 line 459 — Reject reads when both caller and source are stale The strict-less-than 'read_route_version < source_local_version' rule misses the equal-stale window: both coordinator and source pinned at the pre-CUTOVER version, while a different refreshed coordinator has already written to the target. Source's local snapshot says it still owns the key, so the read fence passes and serves the stale value. Fix in §7.2.2: replace version comparison with an authoritative ownership check scoped on whichever snapshot exists. When source has applied CUTOVER → unconditional ErrRouteMoved. When source has NOT applied CUTOVER but the heartbeat-piggybacked max_cutover_version_seen indicates a possible stale read → run GetRouteOwnership(routeKey(K), max_cutover_version_seen) against the default-group leader, cache one second. New §7.2.2a walks through the exact 7-step equal-stale scenario to demonstrate the fix catches the case the original rule missed. All three are fundamentally correctness fixes — the original wording would have produced silent data loss (310), MVCC visibility regression (408), or stale reads (459) in production. The fixes preserve the catalog-stays-non-overlapping invariant (§3.2), the §6.2.1 HLC advance, and the §7.2 read-fence scaffolding; only the under-spec'd details are tightened.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 231e9a5729
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| 1. **Per-batch advance.** On every apply of an `ImportVersions` batch, the target FSM computes `batchMax = max(versions[i].commit_ts)` and **atomically** (under the same FSM apply lock that mutates the MVCC store): | ||
| - sets `metaLastCommitTS = max(metaLastCommitTS, batchMax)`, and | ||
| - calls `hlc.SetPhysicalCeiling(batchMax)` (the same Raft-agreed primitive used by the M1 HLC renewal proposer, `kv/coordinator.go:644-669`). `SetPhysicalCeiling` is already monotone — a lower argument is a no-op — so duplicate / out-of-order batches are safe. |
There was a problem hiding this comment.
Convert HLC timestamps before setting the physical ceiling
This passes batchMax/max_imported_ts as a full HLC commit timestamp into hlc.SetPhysicalCeiling, but the current HLC stores that field as Unix milliseconds and Next() later shifts it as the physical part (kv/hlc.go). For any imported timestamp with logical bits set or even a normal ms<<16 encoding, this advances the target ceiling to an artificial far-future millisecond, causing subsequent writes to jump to roughly commit_ts<<16 and weakening the lease-ceiling fence until wall time catches up. The import path should either Observe the full HLC timestamp or pass only its physical millisecond component to SetPhysicalCeiling.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit 44e6125. §6.2.1 step 1 now passes hlcPhysicalMs(batchMax) = int64(batchMax >> 16) to SetPhysicalCeiling (the same shift HLC.Next() uses to extract the physical half, kv/hlc.go:128-145) AND calls hlc.Observe(batchMax) for the local clock's last-issued value. Step 3 (CUTOVER precondition) uses the same conversion. Both functions are monotone so duplicates/out-of-order are safe.
|
|
||
| The migrator runs the brackets **in parallel up to `--migrationExportFanout` (default 4)** within a single phase (BACKFILL or DELTA_COPY): each bracket gets its own opaque cursor (the §6.1.1 codec is extended to key on `(family, raw_key, commit_ts)`), so a bracket can be paused/resumed independently and the §9 resumability matrix applies per bracket. The migrator records per-bracket `cursor` and `done` flags on the SplitJob; the phase advances only when **every bracket** reports `done = true`. Cost is at most `fanout`× the gRPC frame budget but is bounded by the total bracket count (~12 today), not by data volume. | ||
|
|
||
| A bracket whose family doesn't actually intersect the moving range (e.g. an SQS migration on a Redis-only deployment) returns its first chunk empty and `done=true` immediately — no wasted scan work. The `keys per bracket` distribution is a §7.3 / §11.x metric so operators can spot a runaway family scan. |
There was a problem hiding this comment.
Bound sparse internal-family export scans
This claims an internal-family bracket with no matching keys returns immediately, but the proposed brackets are family-wide and the only narrowing happens after iteration through routeKey(). In a cluster with a large SQS/Redis/etc. prefix outside the moving range, an export for a range with no accepted keys must scan that whole raw prefix to prove done=true because rejected rows do not count toward chunkBytes, making a single chunk unbounded in CPU/I/O and defeating the migration pacing guarantees. The exporter needs a route-key-indexed traversal or a separate scan budget that advances the cursor over rejected rows.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit 44e6125. §6.3.1 ships two layers: (1) Scan-budget pacing — each ExportVersions call carries a separate maxScannedBytes (default 4× chunkBytes) that counts both accepted and rejected rows; iterator returns at the budget with cursor advanced over the rejected tail. (2) Per-family EncodeBracketStart hook tightens family-wide bounds to the moving-slice band where the family layout allows it. Layer (1) alone is sufficient for correctness; (2) is the optimization per adapter PR. New metrics: keys per bracket + rejected_rows per bracket.
|
@claude review — round 4 (3 codex P1 on commit c16e1c4):
|
|
Claude finished @bootjp's task in 5m 5s —— View job Claude finished @bootjp's task in 3m — View job Review — Round 4Todo list
Round-4 items — all three verified
New P1 —
|
| Location | Line | Value |
|---|---|---|
| §4 table, CLEANUP row | 168 | 60 s |
| §7.2.4 | 566 | 30 s (with rationale: "comfortably exceed both lease TTL and watcher tick × 2") |
| §12 Open Question 3 | 681 | 60 s |
§7.2.4 has the rationale for 30 s. Pick one value, update all three locations, and close Open Question 3.
Minor — !dist| prefix on target-group import cursor
Line 349 stores the import cursor as !dist|job|<id>|import_cursor in the target group's MVCC store. !dist| is the default-group catalog namespace throughout the codebase. Future operator tools and distribution/catalog.go's key parsers would treat this as a catalog entry. Rename to a target-scoped prefix: !migstage|cursor|<id>.
Medium — "Wait until applied" still a one-liner
Line 544: "the FSM treats read_route_version > source_local_version as a 'wait until applied' (bounded by lease TTL) before answering" — still unspecified. The implementation mechanism (condition variable on appliedIndex signalled by the apply goroutine? short retry loop?) and the timeout error type (does the timeout surface as ErrRouteMoved or a distinct ErrTimeout?) remain unspecified. A one-sentence elaboration would close this: e.g., "condition-variable wait on the FSM's appliedIndex signalled by the apply goroutine, timedout by min(remaining_lease_TTL, 200ms); on timeout, re-evaluate ownership — if still not owned, return ErrRouteMoved."
Summary
All three round-4 items (codex P1s on commit c16e1c4) are properly verified as resolved. One new P1 emerges from the latest codex review on commit 231e9a5: SetPhysicalCeiling(batchMax) receives a full HLC timestamp instead of Unix ms, confirmed by reading kv/hlc.go — passing a full HLC value would set the ceiling ~3.5 billion years ahead and corrupt every subsequent Next() call. The correct call is SetPhysicalCeiling(int64(batchMax >> hlcLogicalBits)).
The round-3 P1 (PromoteStaged "no-op when shadowed" drops MVCC history) remains unresolved for the fourth consecutive round — the "shadowed" clause on line 477 still exists and is still wrong. Three editorial items from round 2 (grace window inconsistency, !dist| prefix, wait-until-applied spec) and one new P2 (sparse family bracket scan unbounded) round out the remaining work before implementation can start cleanly.
…dex round-5) Codex P1 round-5 on PR #945 (line 367) — HLC SetPhysicalCeiling units hlc.SetPhysicalCeiling takes int64 Unix milliseconds (kv/hlc.go:222), but the previous spec passed the full HLC encoded commit_ts (ms<<16 | logical) as the argument. With logical bits set or any normal ms<<16 encoding that would advance the ceiling by 65 536× the intended milliseconds — weakening the lease fence until wall time catches up. Fix §6.2.1 steps 1 and 3: pass hlcPhysicalMs(batchMax) = int64(batchMax >> 16) (the same shift HLC.Next() uses to extract the physical half) AND call hlc.Observe(batchMax) so the local clock's last-issued value tracks the full HLC ts. Both calls are monotone so duplicates and out-of-order batches are safe. Codex P2 round-5 on PR #945 (line 445) — sparse internal-family scans The earlier 'returns its first chunk empty and done=true immediately' claim only holds when the iterator can short-circuit; under the §6.1 contract accept rejections do NOT count against chunkBytes, so a family-wide raw range can scan an arbitrarily large prefix to prove done=true for a moving range with no intersection. Fix §6.3.1: two layers. (1) Scan-budget pacing — each ExportVersions call carries a separate maxScannedBytes (default 4× chunkBytes) that counts both accepted and rejected rows; when the budget hits the iterator returns whatever it has accepted + the next cursor (over the rejected tail position). (2) Route-key sub-prefix indexing where the family layout allows it — per-family EncodeBracketStart() hook tightens bracket bounds from family-wide to moving-slice-within-family; adapters without the hook keep the scan-budget pacing alone, which is sufficient for correctness. New metrics: keys per bracket + rejected_rows per bracket.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 44e6125daf
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| | FENCE → DELTA_COPY Copy (snapshot_ts, fence_ts] from | ||
| | source to target shadow space. |
There was a problem hiding this comment.
Wait for source to enforce the fence before fixing fence_ts
When the catalog enters WriteFenced, the source FSM only learns that through its local route engine, which is refreshed asynchronously by the catalog watcher (distribution/watcher.go:11, distribution/watcher.go:87-98), and the proposed FSM fence checks the current local snapshot. If the migrator records fence_ts and starts copying (snapshot_ts, fence_ts] immediately, a stale source leader can still accept writes for the moved keys after fence_ts but before its watcher applies the fenced route; those versions are above the DELTA_COPY upper bound and will be left on the source at CUTOVER. The phase needs a source-side barrier/ack that the fenced route is applied, or fence_ts must be chosen after writes are actually rejected on the source.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit 9fa5931. New §3.2a 'Source-side fence-apply barrier': migrator advances FENCE → DELTA_COPY only after every voter of the source-group Raft membership has reported catalog_version_applied >= fence_catalog_version (via the cap_migration_v2 heartbeat or a per-node GetCatalogVersion RPC). Only after the unanimous voter-ACK does the migrator query source.LastCommitTS() and pin fence_ts. Unanimous (not quorum) is required for fence safety — any voter could become the leader later and re-serve writes for the moved range in the pre-FENCE state if its catalog hadn't applied. fence_ack_cursor is durable in the SplitJob for restart-safety. State machine diagram (§3.2) updated.
|
@claude review — round 5 (1 P1 + 1 P2 on commit 231e9a5):
|
|
Claude finished @bootjp's task in 4m 52s —— View job Review — Round 5Todo list
Round-5 items — both verified
New P1 —
|
| Location | Line | Value |
|---|---|---|
| §4 table, CLEANUP row | 168 | 60 s |
| §7.2.4 | 574 | 30 s (with rationale: "comfortably exceed both lease TTL and watcher tick × 2") |
| §12 Open Question 3 | 689 | 60 s |
§7.2.4 has the rationale for 30 s. Picking 30 s: update lines 168 and 689 to match, and close Open Question 3 with the §7.2.4 rationale. Five rounds of flagging — this is a one-line fix.
Minor — !dist| prefix on target-group import cursor (5th flag)
Line 350, inside the ImportVersions doc comment: !dist|job|<id>|import_cursor is stored in the target group's MVCC store. !dist| is the default-group catalog namespace throughout the codebase (distribution/catalog.go). An operator tool or the catalog's key parsers scanning the default-group catalog would misparse these as catalog entries. Rename to a target-scoped prefix such as !migstage|cursor|<id> — one-line change.
Medium — "Wait until applied" spec (4th flag, slightly improved)
Line 552 (§7.2.2 sub-case 2): "the FSM treats this as 'wait until applied' (bounded by lease TTL) before answering, then re-evaluates ownership; if still not owned, returns ErrRouteMoved."
The round-4 addition of "then re-evaluates ownership; if still not owned, returns ErrRouteMoved" is an improvement. What remains unspecified: the implementation mechanism for "wait until applied" (condition variable on appliedIndex signalled by the apply goroutine? a retry loop with time.Sleep? a channel?) and the timeout value within the lease TTL (200 ms? 500 ms?). Given the level of implementation detail elsewhere in the doc, one sentence would close this: e.g., "condition-variable wait on the FSM's appliedIndex signalled by the apply goroutine, timedout by min(remaining_lease_TTL, 200ms); on timeout, re-evaluate ownership — if still not owned, return ErrRouteMoved."
Summary
Both round-5 items are verified and properly resolved. hlcPhysicalMs(ts) = int64(ts >> 16) is correct per kv/hlc.go (confirmed hlcLogicalBits = 16, SetPhysicalCeiling takes Unix ms, Observe takes a full HLC value). The scan-budget pacing correctly bounds per-call cost for sparse family brackets.
One new P1 emerges from the codex review of commit 44e6125: fence_ts is recorded before the source-group FSM has applied the WriteFenced route — writes accepted in the watcher-lag window after fence_ts are outside DELTA_COPY's range and would be lost at CUTOVER. The fix is a source-group catalog-version poll before DELTA_COPY begins.
The round-3 P1 (PromoteStaged "no-op when shadowed") has now been flagged five consecutive rounds without resolution; it is a correctness bug that causes MVCC history versions to vanish. Three editorial items from round 2 remain (grace window, !dist| prefix, wait-until-applied).
…und-6 P1) Codex P1 round-6 on PR #945 (line 115) — fence_ts pinned before source FSM actually rejects writes The FENCE catalog write commits on the default group, but the source FSM only learns about WriteFenced through the watcher's async catalog snapshot apply (distribution/watcher.go:11, 87-98). If the migrator records fence_ts = source.LastCommitTS() and starts DELTA_COPY immediately, a stale source leader can still accept writes for the moved keys for one watcher-tick worth of time after the catalog commit. Those writes get commit_ts > fence_ts, fall outside DELTA_COPY's (snapshot_ts, fence_ts] window, and are silently left on the source at CUTOVER — write loss after CUTOVER hands the routing to the target. Fix: new §3.2a 'Source-side fence-apply barrier'. The migrator advances FENCE → DELTA_COPY only after every voter of the source-group Raft membership has reported catalog_version_applied >= fence_catalog_version (via the cap_migration_v2 heartbeat's existing catalog_version_applied field, or a per-node GetCatalogVersion RPC if the heartbeat is stale). Only then does the migrator query source.LastCommitTS() and pin that as fence_ts. Crucially: this requires UNANIMOUS voter-ACK, not quorum. Any voter could become the leader at any future moment; without local WriteFenced applied, it would re-serve writes for the moved range in the pre-FENCE state. Quorum is necessary for catalog-write durability; unanimous voter-ACK is necessary for fence safety. State machine diagram (§3.2) updated: FENCE → DELTA_COPY now explicitly waits for the source-side fence ACK before picking fence_ts. SplitJob gains a fence_ack_cursor for restart-safety during the wait. A non-responsive voter blocks the phase; the migrator logs the pending voter list and either operator intervenes or Raft removes the voter via the leader-loss path. CUTOVER is structurally guaranteed by a different mechanism (the §7.2 read fence's authoritative ownership check), so no parallel 'every voter applies CUTOVER' gate is needed — readers are the consumers there.
|
@claude review — round 6 (1 P1 on commit 44e6125):
|
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
Address 4 substantive findings flagged on rounds 13-14 that needed real design changes (not just reply). The remaining 8 inline findings on the PR are already resolved by prior commits and get inline replies pointing to the relevant sections. 1. Codex round-14 P1 line 171 -- "Close the prepare race before fencing" The step-0 drain alone is insufficient: between drain-empty and every source voter applying WriteFenced, the route is still observably Active on the source FSM. A coordinator with a fresh route cache can prepare a new transaction in that window, escape step 0, and only become rejectable after step 1's apply ACK. §3.2a adds step 0b -- a second intent-lock drain AFTER step 1's unanimous voter-ACK returns and BEFORE picking fence_ts in step 2. By step 1's completion verifyRouteNotFenced rejects new prepares synchronously at the FSM apply gate, so the second drain only waits out the small window of intents that landed between step 0 and step 1's last-voter ACK (typically <100ms Raft apply latency, converges in one txn-TTL like step 0). Net effect across both drains: every prepare is in (a) pre-FENCE accept window (BACKFILL+DELTA_COPY), (b) catalog-write window (step 0b drain), or (c) post-FENCE-apply (rejected at FSM gate). SplitJob adds post_fence_drain_completed bool for restart safety. 2. Codex round-14 P1 line 405 -- "Capture intents that resolve below the snapshot timestamp" 2PC commit_ts is allocated at prepare time, not commit-apply time. A prepare with commit_ts < snapshot_ts can have its Phase_COMMIT apply AFTER snapshot_ts -- BACKFILL's iterator misses it (not in store yet) and DELTA_COPY (snapshot_ts, fence_ts] excludes it (commit_ts <= snapshot_ts). Committed write silently dropped. New §6.1.0 -- snapshot_ts must dominate every in-flight prepare. Migrator scans intent locks, reads their recorded commit_ts, picks snapshot_ts = min(HLC.Next(), min_prepared_commit_ts - 1). Every commit with commit_ts <= snapshot_ts is already applied (its prepare resolved before the scan); every commit > snapshot_ts falls into DELTA_COPY's window. Alternative -- wait for all prepares to resolve -- is also correct but pushes the boundary down is cheaper. SplitJob adds snapshot_min_prepared_ts for restart consistency. 3. Coderabbit round-12 Major line 429 + codex round-14 P2 line 540 -- Cursor semantics mismatch between §6.1.1 (last-emitted) and §6.3.1 (advances past rejected rows for scan-budget pacing). For a sparse bracket that accepts zero rows before maxScannedBytes, "last-emitted" gives no resume point -- would infinitely rescan the rejected window. §6.1.1 redefined: cursor is (raw_key, commit_ts, scanned_position_tag) -- "last-scanned" position with a 1-bit tag distinguishing last-emitted (resume strictly past, no duplicate) from last-scanned-not-emitted (resume strictly past, no rescan). Encoding extends to include the tag byte. Preserves no-duplicate AND no-skip AND no-rescan simultaneously. 4. Codex round-14 P2 line 439 -- "Track import idempotency per bracket" With §6.3.1 brackets in parallel, (job_id, cursor) as the idempotency key collides: bracket A acks cA, bracket B overwrites with cB, retry of A no longer matches. §6.1.1 importer idempotency key changed to (job_id, bracket_id, batch_seq). bracket_id is stable for SplitJob lifetime (recorded in bracket_progress); batch_seq is monotonic per-bracket. Cursor stays as the exporter's resume point (decoupled from importer dedup). BracketProgress row extended with bracket_id + last_acked_batch_seq. 5. Codex round-14 P1 line 562 -- "Merge staged data for range reads" §6.4 step 2's staged/live merge was specified only for point reads. ScanAt/RawScanAt over the moved range during the CUTOVER-to-promotion window would walk only the live keyspace (empty for unpromoted ranges) and return empty/partial results. §6.4 adds step 2a -- range-scan merged iterator. While staged_visibility_active=true the target FSM constructs a merged iterator over live raw range [scan_start, scan_end) AND staged prefix !dist|migstage|<job_id>|<encode(...)> at read_ts, merged by (logical_key, commit_ts) with newest wins. Implementation in store/staged_merge_iterator.go (M2-PR6). DynamoDB Scan/Query, Redis SCAN, gRPC RawScan all benefit; once CLEANUP clears staged_visibility_active the merge falls back to live-only. All changes are doc-only. No semantic change to any phase that wasn't already covered by the existing fence/drain/HLC ceiling invariants; this batch closes the corner cases those invariants depended on.
|
@claude review — round 15 (4 substantive design changes from codex round-14 P1/P2 + coderabbit Major on commit 22dca9b):
SplitJob fields extended: snapshot_min_prepared_ts, post_fence_drain_completed, BracketProgress.{bracket_id, last_acked_batch_seq}. |
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 22dca9bbc2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| uint64 max_commit_ts = 3; // snapshot_ts (BACKFILL) or fence_ts (DELTA_COPY) | ||
| uint64 min_commit_ts = 4; // 0 for BACKFILL; snapshot_ts for DELTA_COPY | ||
| bytes cursor = 5; // resume token; empty on first call | ||
| uint32 chunk_bytes = 6; // soft cap, server may exceed by one row |
There was a problem hiding this comment.
Carry scan-budget limits on export RPCs
§6.3.1 relies on each export call carrying maxScannedBytes so sparse family brackets can stop after rejected rows and return a cursor, but the wire request only exposes chunk_bytes; the store sketch similarly lacks a scan-budget argument. Fresh evidence after the prior sparse-scan fix is that there is still no field in this RPC contract for the budget, so an implementation following this proto cannot enforce the bounded-scan guarantee and can still spend an unbounded call scanning a large unrelated internal-family prefix.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 2b4faf1. §5.1 ExportRangeVersionsRequest carries 'uint64 max_scanned_bytes = 9;' — the scan-budget the source-side iterator honors per §6.3.1 (sparse brackets stop after rejected rows, return cursor with the last-scanned position + tag from §6.1.1). The store-side ExportVersions signature gains 'maxScannedBytes int' alongside the existing chunkBytes.
| message ImportRangeVersionsRequest { | ||
| uint64 job_id = 1; | ||
| repeated MVCCVersion versions = 2; | ||
| bytes cursor = 3; // for ack/idempotency | ||
| } |
There was a problem hiding this comment.
Add bracket and batch identity to imports
The corrected idempotency design in §6.1.1 says the importer deduplicates by (job_id, bracket_id, batch_seq), but ImportRangeVersionsRequest still carries only job_id, versions, and cursor. Fresh evidence after the previous bracket-idempotency comment is that the wire contract has no way to send the bracket or per-bracket sequence, so parallel bracket retries can still collide or be impossible to dedupe as specified.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 2b4faf1. §5.1 ImportRangeVersionsRequest carries 'uint64 bracket_id = 4; uint64 batch_seq = 5;' — the wire fields the importer needs to dedupe by (job_id, bracket_id, batch_seq) per §6.1.1. The importer persists max_applied_batch_seq under !migstage|ack|<job_id>|<bracket_id>; bracket A's retry no longer collides with bracket B's progress.
|
Addressed both codex P2 findings on the M2 migration-plane wire contract in 2b4faf1 (doc-only, P2-A — "Carry scan-budget limits on export RPCs" (FIXED): §6.3.1's scan-budget pacing referenced a
P2-B — "Add bracket and batch identity to imports" (FIXED): §6.1.1's dedup keys on
Internal consistency verified across §3.1 ( @claude review |
|
Claude finished @bootjp's task in 3m 47s —— View job Review — Round 16Todo list
Round-16 items — both verified
P2-A analysis —
P2-B analysis — bracket identity on imports:
All prior round items (rounds 3–15) remain resolved in commit 2b4faf1. No prior fix is regressed. New observationsBoth are minor and non-blocking for M2-PR1. Minor —
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2b4faf1182
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| M2 fixes this with a **three-step gate** on the FENCE → DELTA_COPY transition (step 0 added per claude round-12 P1 on PR #945 — prepared-txn drain — see §7.1's no-`Phase_COMMIT`-bypass rationale): | ||
|
|
||
| 0. **Prepared-txn drain (resolves OQ-6).** Before transitioning `BACKFILL → FENCE` (yes, before — the FENCE catalog write itself does not run until this drain returns), the migrator waits until **no in-flight transaction intent locks remain in `[routeStart, routeEnd)`**. "No intents remaining" is observable via a scan of `!txn|lock|<encoded>[routeStart, routeEnd)` returning empty. The wait is bounded by the txn timeout — a long-lived prepared txn either commits or is aborted by the lock resolver (`kv/lock_resolver.go`) within its TTL, so the scan converges in at worst one TTL window. While the drain is pending the migrator logs `last_error = "fence drain pending: <N> intent locks in moving range"`. **Why this is required:** §7.1's `verifyRouteNotFenced` rejects every write request in the WriteFenced range and has **no `Phase_COMMIT` bypass** because the matching intent lock is excluded from BACKFILL (§6.1) and absent on the target after CUTOVER — a `Phase_COMMIT` reaching the FSM after FENCE has nothing to commit against. Letting a prepared txn into the FENCE phase would therefore strand its commit indefinitely. The drain guarantees every commit that reaches the FSM during/after FENCE is for a txn whose intent has already resolved (committed via the pre-FENCE accept window, or aborted by the lock resolver). After CUTOVER the target has no historical intent for a moved-range key, so a coordinator that retries a Phase_COMMIT against the target would also fail; the drain is the only safe ordering. |
There was a problem hiding this comment.
Filter txn-lock drains by routeKey
For adapter/internal routes whose routing key differs from the raw storage key, this raw !txn|lock|<encoded>[routeStart, routeEnd) scan can report empty while prepared locks still exist. I checked txnLockKey (kv/txn_keys.go) and routeKey (kv/shard_key.go): a DynamoDB item lock is stored as !txn|lock|!ddb|item|... but routes as !ddb|route|table|..., so it sorts outside a drain bounded by txnLockKey(routeStart)/txnLockKey(routeEnd). In that case the pre/post-fence drains (and the matching snapshot prepare scan later in the doc) miss the lock, allowing the migration to strand a prepared transaction or choose a snapshot boundary that drops its eventual committed version; the drain needs to scan the txn-lock family and apply the same route-key filter/bracket logic used for export.
Useful? React with 👍 / 👎.
|
@claude review codex P1 "Filter txn-lock drains by routeKey" — fixed (ce3865f). Verified the claim against
Fix (doc-only, reuses the doc's own §6.3.1 export mechanism): the drains/scan are now specified as a full Sections updated:
|
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ce3865f20d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| **The fix is to pick `snapshot_ts` such that no prepare with `commit_ts ≤ snapshot_ts` is still in-flight.** Concretely, the migrator does the following before opening BACKFILL: | ||
|
|
||
| 1. **Scan in-flight intents.** Run the **route-faithful lock scan** of §3.2a.0a — the `!txn|lock|` family bracket `[txnLockPrefix(), txnLockPrefixEnd())` filtered by `routeKey(lockedKey) ∈ [routeStart, routeEnd)`, **not** a raw `[txnLockKey(routeStart), txnLockKey(routeEnd))` bracket (which would miss DynamoDB/SQS/Redis/list/S3 item locks whose raw storage key sorts outside the routing-key bounds — see §3.2a.0a). For each accepted lock, decode its recorded `commit_ts` (the lock value already carries it for the existing lock-resolver; if it does not in the current implementation, the M2-PR0 wire change adds it — same Raft-applied lock-write path). Take `min_prepared_commit_ts = min(commit_ts for all in-flight prepares accepted by the filter)`. If no prepare exists in the moving range, set `min_prepared_commit_ts = ∞`. | ||
| 2. **Pick `snapshot_ts`.** `snapshot_ts = min(HLC.Next(), min_prepared_commit_ts - 1)`. This guarantees every commit with `commit_ts ≤ snapshot_ts` is **already applied** (its prepare has resolved before the scan — otherwise it would have been in the scan result), so the BACKFILL iterator sees it. Every commit with `commit_ts > snapshot_ts` falls into the DELTA_COPY `(snapshot_ts, fence_ts]` window (since `fence_ts ≥ snapshot_ts` and DELTA_COPY runs until FENCE). |
There was a problem hiding this comment.
Fence new prepares before fixing the snapshot boundary
This only scans prepares that already exist before BACKFILL, but the source route remains Active until the later FENCE transition, so a coordinator can prepare a transaction after this scan with a commit timestamp <= snapshot_ts (for example from a lagging/caller-provided HLC), then commit after the BACKFILL iterator has passed its key. That version is not in the initial scan result and is excluded from DELTA_COPY's (snapshot_ts, fence_ts] window, so the target can miss a committed write; the design needs to block/drain prepares before choosing this boundary or make post-scan resolutions get copied regardless of their commit timestamp.
Useful? React with 👍 / 👎.
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
Issue 1: correct stale line anchors in kv/sharded_coordinator.go - keyviz sampler observeMutation: 1795-1824 -> 1841-1846 - RunHLCLeaseRenewal: 1914-1953 -> 1960-1985; defaultGroup access :1915 -> :1961 Issue 2: soften companion-doc references to in-flight PR #955 form, matching the #945/#951/#953 branch-reference style (3 sites + map line). Issue 3: ground OQ-1 in the actual commitTS logic (nextStartTS/ resolveTxnCommitTS/nextTxnTSAfter all from one c.clock) and annotate §4 step 11 as deferred-pending-OQ-1 with an explicit trigger condition. Inline: fix learner LinearizableRead behaviour (engine returns ErrNotLeader, caller forwards; engine.go:1583); note shared-startTS invariant; add OQ-7 for the single-node->multi-node live cutover / rolling-upgrade strategy.
|
@claude review |
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
|
@claude review |
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
codex round-15 P1 on PR #945: the §6.1.0 snapshot prepare scan ran once before BACKFILL and was trusted as the DELTA_COPY lower bound, but the source route stays Active until the much-later FENCE transition. A coordinator can prepare a new txn after the scan with a low caller-supplied / lagging-clock commit_ts <= snapshot_ts (verified: resolveDispatchCommitTS uses caller commit_ts as-is, kv/coordinator.go; caller-supplied StartTS/CommitTS path live for S3/SQS/DynamoDB, kv/sharded_coordinator.go:638-665; NextFenced only floors the physical half, kv/hlc.go), commit after BACKFILL passed its key, and be excluded from a (snapshot_ts, fence_ts] window -> lost prepared write. Fix: fence new prepares before fixing the boundary. The pre-BACKFILL scan becomes a provisional snapshot for the early bulk copy; the authoritative DELTA_COPY lower bound delta_floor is finalised only after §3.2a step 1's unanimous verifyRouteNotFenced apply + step 0b's post-fence drain, when no new prepare can land. snapshot_min_prepared_ts is a monotone-down running minimum over both the live !txn|lock| family and the durable !txn|cmt| commit-record family (kv/txn_keys.go:50) so an inter-tick prepare-then-commit is still observed. DELTA_COPY exports (delta_floor, fence_ts]. Consistent across §3.1 (field docs + new delta_floor field), §3.2 diagram, §3.2a steps 0b/2, §4, §6.1.0, §9 matrix, §10.1/§10.2 tests, OQ-6. Status: Proposed. Doc-only.
|
@claude review Addressed codex P1 "Fence new prepares before fixing the snapshot boundary" in 722c8a6 (doc-only). Finding: §6.1.0's snapshot prepare scan ran once before BACKFILL and was trusted as the DELTA_COPY lower bound. The source route stays Chosen mechanism — fence-then-floor (reviewer remedy #1): the pre-BACKFILL scan is downgraded to a provisional snapshot for the early bulk copy; the authoritative DELTA_COPY lower bound This fits the existing phases best: the post-fence drain (§3.2a step 0b) already establishes the "no new prepare can land" point; I anchored boundary finalisation there rather than reordering BACKFILL after FENCE (which would fence writes for the whole bulk copy). Code claims verified (PR branch): caller-supplied commit_ts used as-is in Edits: §3.1 (snapshot_ts/snapshot_min_prepared_ts docs + new
|
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 722c8a689a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| **The fix is to fence new prepares before fixing the boundary, then floor DELTA_COPY below every prepare ever observed in the pre-fence window.** The single pre-BACKFILL scan becomes a *provisional* snapshot for the early bulk copy; the boundary that DELTA_COPY trusts is finalised after the fence. Concretely: | ||
|
|
||
| 1. **Provisional snapshot for BACKFILL.** Before opening BACKFILL, run the **route-faithful lock scan** of §3.2a.0a — the `!txn|lock|` family bracket `[txnLockPrefix(), txnLockPrefixEnd())` filtered by `routeKey(lockedKey) ∈ [routeStart, routeEnd)`, **not** a raw `[txnLockKey(routeStart), txnLockKey(routeEnd))` bracket (which would miss DynamoDB/SQS/Redis/list/S3 item locks whose raw storage key sorts outside the routing-key bounds — see §3.2a.0a). For each accepted lock, decode its recorded `commit_ts` (the lock value already carries it for the existing lock-resolver; if it does not in the current implementation, the M2-PR0 wire change adds it — same Raft-applied lock-write path). Take `min_prepared_commit_ts = min(commit_ts for all in-flight prepares accepted by the filter)`; `∞` if none. Pick the provisional `snapshot_ts = min(HLC.Next(), min_prepared_commit_ts - 1)` and start BACKFILL at it. Over-copy is harmless — a low `snapshot_ts` just hands more work to DELTA_COPY. | ||
| 2. **Track the running minimum until the fence blocks new prepares.** From BACKFILL through §3.2a step 0b's post-fence drain, the migrator maintains `snapshot_min_prepared_ts` = the smallest `commit_ts` of any txn whose primary resolves into the moving range and that prepared after the provisional scan (monotone-down, persisted on each advance). Because the source is still `Active` during BACKFILL and FENCE-apply, new prepares with arbitrarily low caller-supplied / lagging-clock `commit_ts` can keep appearing; the running minimum must capture every one. To avoid an inter-tick gap — a prepare that both lands and resolves between two ticks would never appear in a *live-lock* sample — the running minimum is taken over **both** durable txn families, route-faithful by `routeKey()`: |
There was a problem hiding this comment.
Track prepares for moved keys, not only moved primaries
This still misses a multi-shard transaction whose secondary key is in the moving range but whose primary key is outside it. I checked kv/fsm.go:1136-1140 and kv/txn_keys.go:50-53: the durable !txn|cmt|... record is written only when committing the primary and is keyed by primaryKey, so a secondary prepare that lands and resolves between migrator ticks leaves no commit-record row whose routeKey() falls in the moved range. If its commit_ts <= snapshot_ts after BACKFILL passed that secondary key, delta_floor is not lowered and DELTA_COPY can still skip the committed secondary version.
Useful? React with 👍 / 👎.
| bytes range_start = 1; | ||
| bytes range_end = 2; | ||
| uint64 max_commit_ts = 3; // snapshot_ts (BACKFILL) or fence_ts (DELTA_COPY) | ||
| uint64 min_commit_ts = 4; // 0 for BACKFILL; snapshot_ts for DELTA_COPY |
There was a problem hiding this comment.
Use delta_floor as DELTA_COPY lower bound
This wire contract still says DELTA_COPY sends min_commit_ts = snapshot_ts, which contradicts the later delta_floor fix in §6.1.0. If the implementation follows this proto comment, a prepare that commits after BACKFILL passed its key with commit_ts <= snapshot_ts is again excluded from (snapshot_ts, fence_ts], even though the design now computes delta_floor specifically to include that version in DELTA_COPY.
Useful? React with 👍 / 👎.
|
@claude review Addressed both codex round-16 P1s on this doc-only PR. Pushed P1 "Track prepares for moved keys, not only moved primaries" — accepted, fixed. Reviewer's argument (verified): the running-minimum's inter-tick backstop relied on the Fix (§3.1 field def, §6.1.0 step 2/3 header + body, OQ-6, §10.1/§10.2 tests):
P1 "Use delta_floor as DELTA_COPY lower bound" — accepted, fixed. Reviewer's argument (verified): the Fix (§5.2): |
|
Claude finished @bootjp's task in 8m 3s —— View job Round 17 Review — commit
|
| PR | Scope | Tests |
|---|---|---|
| M2-PR0 | Add commit_ts field to the txnLock wire value (Raft-applied lock-write path); decoder backward-compat for existing locks without the field (zero → fallback to commit-record lookup) |
kv/txn_codec_test.go round-trip + backward-compat case |
or absorb it into M2-PR1 if it fits that scope.
Minor-2 — §6.1.0 step 2: priority between lock-value commit_ts and commit-record point lookup is ambiguous
Step 2 (lines 460): "the commit_ts itself is added by the M2-PR0 wire change" and "It then resolves the txn's commit_ts by a direct point lookup of the commit record at !txn|cmt|<PrimaryKey>|<startTS>." Two sources are named for the same value; the relationship between them is not stated.
A reader is left to guess:
- (a) Read
commit_tsfrom the lock value (fast path, M2-PR0); fall back to commit-record point lookup when the field is absent (existing lock written before M2-PR0)? - (b) Always do the commit-record point lookup; the lock-value
commit_tsis mentioned as context only? - (c) Read both and take the minimum/maximum?
If (a) (which seems most likely given the M2-PR0 framing), the fallback rule should be stated explicitly — especially because backward-compat for existing locks without the field needs to be defined somewhere.
Suggested fix: one sentence in step 2 clarifying the priority, e.g.: "The migrator reads commit_ts from the lock value (M2-PR0 field); for locks written before M2-PR0 where the field is absent (zero), it falls back to a direct point lookup of !txn|cmt|<PrimaryKey>|<startTS>."
Minor-3 (carryover round-16) — !migstage|ack|<job_id>|* cleanup not in §6.4 step 5
§6.2 (line 528) documents that the importer persists (job_id, bracket_id) → max_applied_batch_seq under !migstage|ack|<job_id>|<bracket_id>. §6.4 step 5 (line 697) lists the CLEANUP→DONE preconditions as: (a) grace window elapsed, (b) no staged keys remain, (c) staged_visibility_active cleared. The ack records are not mentioned.
These records are only needed during DELTA_COPY for import dedup; retaining them after DONE is GC debt. The CLEANUP→DONE condition in step 5 should either explicitly delete !migstage|ack|<job_id>|* in the same apply that clears staged_visibility_active, or note the cleanup is deferred (and where).
Minor-4 (carryover round-16) — max_scanned_bytes = 0 translation not documented at the store layer
§5.2 proto (line 384): "0 means 'use the server default (4 × chunk_bytes, §6.3.1)'." The server handler must therefore translate 0 → 4 × chunkBytes before passing maxScannedBytes to store.ExportVersions. The ExportVersions store signature doc (lines 493-496) describes maxScannedBytes as "a hard cap" with no mention of a 0 sentinel.
If 0 reaches the store layer and is interpreted there as "no cap" (a natural reading of an unsigned zero), the proto-level default is silently discarded and the sparse-bracket scan is unbounded. The server handler translation rule should be called out explicitly in the store API doc comment (or the store layer should define the sentinel itself so the translation is in one place).
Summary
| Finding | Severity | Status |
|---|---|---|
| P1-A: secondary-key prepare tracking | P1 | ✅ Resolved (§3.1, §6.1.0 steps 1-3, §10.1/10.2) |
P1-B: delta_floor as DELTA_COPY wire lower bound |
P1 | ✅ Resolved (§5.2 field 4, doc sweep) |
| M2-PR0 absent from §11 PR table | Minor | New |
| §6.1.0 step 2 lock-value vs commit-record priority | Minor | New |
| `!migstage | ack` cleanup missing from §6.4 step 5 | Minor |
max_scanned_bytes = 0 store translation undocumented |
Minor | Carryover round-16 |
Both round-17 P1 items are fully resolved. The design is implementation-ready for M2-PR1. The four minor items above are editorial; they do not block implementation but should be addressed before or alongside the PRs that touch those sections (M2-PR0/PR3 for Minors 1-2; M2-PR7 for Minor-3; M2-PR3 for Minor-4).
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5632dc20dd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| CLEANUP deletes the moved range from source's MVCC after a configurable grace window (`readFenceGrace`, default 30 s — chosen to comfortably exceed both lease TTL and watcher tick × 2). | ||
|
|
||
| **Pre-watcher staleness window after CUTOVER (closes claude round-11 medium on PR #945).** The `CatalogWatcher` propagates the CUTOVER catalog write to the source-group leader within one polling interval (`defaultCatalogWatcherInterval = 100 ms`, `distribution/watcher.go`). Within that ~100 ms window the source FSM has neither applied the new catalog nor received the heartbeat-carried `max_cutover_version_seen`, so §7.2.2b Branch 2's watermark check has nothing to fire on. The window is bounded by the watcher polling interval rather than the heartbeat interval (~1 s) because watcher → FSM apply is the faster of the two propagation paths. A coordinator whose watcher fired right at the same tick as CUTOVER could refresh and route reads to the target while the source's watcher hasn't yet applied v+1 — a ~100 ms window during which a cross-boundary read could see source-stale data. This is consistent with the watcher-lag model the read fence operates under elsewhere (lease reads, Composed-1 read path); deployments requiring zero-staleness can replace the polling watcher with a synchronous notification from the default-group leader to source-group leaders on the CUTOVER Raft apply (OQ-17 — "synchronous CUTOVER notification to source group" — out of scope for M2). |
There was a problem hiding this comment.
Close the admitted post-cutover stale-read window
Fresh evidence in the current text is that this paragraph explicitly says the source has neither applied the CUTOVER catalog nor received max_cutover_version_seen for ~100 ms, so neither Branch 1 nor Branch 2 can reject stale reads. In that window, a refreshed coordinator can already route a write/read to the target while a stale coordinator still sends a read or cross-boundary scan to the source, which will serve pre-cutover data and violates the goal of atomic reader/writer cutover; the design needs a synchronous source-side cutover signal/barrier or must delay target visibility/writes until sources can fence reads.
Useful? React with 👍 / 👎.
Summary
docs/design/.Key design points
!dist|job|<id>so a default-group leader flap resumes mid-migration without operator action.routeKey()equivalence check on every exported row, so!txn|,!lst|,!ddb|,!redis|,!sqs|,!s3|families all land in the right shard.target_group_id == 0or equal to source short-circuits through the existing fast path.Open questions for reviewers
7 questions called out at
§12— notably: grace-period sizing, abandon-after-cutover policy, backfill-vs-prepare semantics, andMVCCVersion.key_familystrict-enum vs numeric.Test plan
*_proposed_*→*_partial_*once PR1 lands.Summary by CodeRabbit