docs(design): propose hotspot split M3 — automation by bootjp · Pull Request #951 · bootjp/elastickv

bootjp · 2026-06-10T19:49:53Z

Design-doc-only PR. No implementation code. Per the repo's design-doc-first workflow, this lands the M3 proposal for review before any implementation.

New file: docs/design/2026_06_11_proposed_hotspot_split_milestone3_automation.md (Status: Proposed, Author: bootjp, Date: 2026-06-11).

Summary

Milestone 3 of the hotspot shard split design: automation — auto-detection + an auto-split scheduler. M1 (durable versioned route catalog, manual same-group SplitRange, route watcher) has shipped; M2 (#945) proposes the cross-group migration plane. M3 composes with M2 but delivers first value standalone: auto-detection + same-group auto-split (the M1 capability) work without M2; cross-group target selection slots in behind the same SplitRange interface once M2 lands.

It closes the gap the parent doc §1 calls out: distribution/engine.go's RecordAccess/threshold-splitRange exists but is wired into nothing (only tests call it; verified by grep) and is midpoint-only + mutates the engine directly (bypassing the catalog/watcher).

Key design points

Detection signal — reuse the keyviz sampler, retire the engine counters. The keyviz MemSampler is already wired into the exact request-path functions the parent doc names (groupMutations→observeMutation, observeRead in kv/sharded_coordinator.go), is allocation-free on the hot path, splits reads vs writes, is windowed, has sub-range + Top-K split-key evidence, and bounds cardinality. Driving detection off it avoids a second parallel pipeline. The engine's dead RecordAccess/splitRange/Route.Load path is removed in M3-PR1. M3 adds zero hot-path cost — it consumes already-flushed windows off-path.
Aggregation — leader-local, no new RPC. The detector runs on the default-group leader and reads its own MemSampler.Snapshot for routes it leads (the leader serves the load it would split on). No ReportAccess RPC and no RPC piggybacking; cluster fan-out stays an admin-UI concern and can slot behind the detector interface later if needed.
Detector — score = write_ops*Ww + read_ops*Wr (4/1), 50k ops/min threshold, 3-consecutive-window up-side hysteresis + 0.8× down-side band, per-route 10-min cooldown (monotonic deadline, not wall clock), split-key from observed distribution (sub-range p50 / single-key isolation / decline when no evidence — never blind midpoint), guardrails: min route span, max routes, per-cycle split budget.
Scheduler — invokes the existing proto.Distribution.SplitRange RPC with expected_catalog_version CAS; all catalog mutations go through SplitRange (repo convention). Same-group today; once M2 lands the least-loaded-group target policy flips the target_group_id field on the same RPC.
Safety/ops — default OFF behind --autoSplit; runtime kill switch (atomic.Bool); fixed-enum metrics (no per-route/key labels); stable slog keys; detector state is leader-local and resets on election (re-earns confidence; cooldown re-seeded from parent_route_id lineage) — stated explicitly so non-replication isn't mistaken for an oversight.
PR breakdown — M3-PR1 (retire engine path + metric scaffolding), M3-PR2 (aggregation + pure detector, observe-only), M3-PR3 (scheduler + flag + kill switch), each independently shippable; M3-PR4 enables cross-group target selection post-M2.

Every current-state claim cites concrete file:line evidence, verified before writing.

Open questions

OQ-1 — Fully delete Route.Load/Engine.Stats, or keep Stats for diagnostics?
OQ-2 — Down-side hysteresis: hold the counter in the [0.8×, 1.0×) band, or decay it?
OQ-3 — No sub-range/Top-K evidence: decline to split (proposed), or allow a min/max-observed-key midpoint?
OQ-4 — minRouteSpan without per-route key counts: approximate from sampler evidence, or add a cheap key-count estimate?
OQ-5 — Leader-local scoring sufficient, or is there a heavy-follower-read workload needing cluster fan-out?
OQ-6 — Post-M2: enable cross-group target automatically, or gate behind --autoSplitCrossGroup?
OQ-7 — Durable cooldown across elections: is parent_route_id + a catalog split timestamp enough, or do we need durable last_split_at?
OQ-8 — Express threshold in ops/min (proposed) or ops-per-Step to avoid coupling sensitivity to --keyvizStep?

Test plan

Doc-only — no Go tests run. Markdown verified clean; all cited file:line references confirmed against main. The doc's own §8.2 lays out the implementation-phase test strategy (table-driven detector unit tests + pgregory.net/rapid invariants, detector→SplitRange integration in the 3-node demo, Jepsen hotspot workload deferred to M4 per the parent doc).

coderabbitai · 2026-06-10T19:50:02Z

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 19 minutes and 3 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more credits in the billing tab to continue.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eecf2f71-abd0-490c-ad9f-c4fa9a2c10a0

📥 Commits

Reviewing files that changed from the base of the PR and between ed82233 and ea5c12a.

📒 Files selected for processing (1)

docs/design/2026_06_11_proposed_hotspot_split_milestone3_automation.md

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch design/hotspot-split-m3-automation

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

bootjp · 2026-06-10T19:50:03Z

@claude review

claude · 2026-06-10T19:50:20Z

Claude finished @bootjp's task in 3m 20s —— View job

Review: M3 Automation Design Doc

Load tools
Read the M3 design doc
Read parent design doc and related source files for context
Write and post review feedback

This is a high-quality, well-structured proposal. All cited file:line references were verified against the current codebase and are accurate. The rationale for choosing the keyviz sampler over the engine's dead RecordAccess path is sound. The PR breakdown (5 independently-shippable PRs) is excellent. The following are issues worth resolving before implementation begins.

Issues

1. Score normalization formula is underspecified (§5.1)

The doc says the score is computed over "a 1-minute equivalent window (trailing flush columns summed and normalized to the flush Step)." But the exact formula is not given. When --keyvizStep ≠ 60s, "summed and normalized to the flush Step" is ambiguous. Proposed clarification:

score = (sum of write_ops over N columns / (N × Step_seconds)) × 60
      × Ww + ...

This is exactly what OQ-8 is pointing at, but the issue goes beyond units — the doc should commit to a specific formula. If the threshold stays in "ops/min", the normalization formula must be stated explicitly so unit tests in M3-PR2 can pin it. Without that, two contributors can independently implement different normalizations that both pass tests but produce different sensitivities.

Recommendation: close OQ-8 in this doc by committing to score = (sum_over_N_columns / (N × Step_s) × 60) × weights, and document that as the invariant the table-driven unit tests verify.

2. MemSampler.Snapshot takes time.Time params — the "last N columns" query needs a concrete recipe (§4.2)

MemSampler.Snapshot(from, to time.Time) []MatrixColumn (keyviz/sampler.go:1463) takes a time range, not a column count. The doc repeatedly says "last N flush columns" but never specifies how the detector computes from/to. Suggested recipe:

to   = time of last completed flush (flusher's most recent flush boundary)
from = to − N × Step

The detector needs a stable reference to the last flush time. It's worth calling out explicitly whether the detector gets this from a Flusher handle, a channel notification, or simply reads sampler.Snapshot(now − N×Step, now) and accepts that the trailing edge might capture a partially-flushed column. The latter would silently skew scores; the design should rule it out.

This is a concrete implementation question that will arise in M3-PR2 — resolving it here avoids a revision mid-PR.

3. OQ-7 needs a data-model change that isn't scoped to any PR (§7.6)

The design proposes seeding cooldown after election from "any route whose parent_route_id lineage indicates a recent split, or whose creation is recent per the catalog." But RouteDescriptor (parent doc §5) has parent_route_id uint64 and state, but no split_at or created_at timestamp field. Without a timestamp, the new leader cannot tell whether a recent-lineage split happened within the last splitCooldown window. "Recent per the catalog" has no basis in the current schema.

This means the durable-cooldown reconstruction requires adding a timestamp field to RouteDescriptor. That's a catalog schema change — it belongs in M3-PR1 (which already touches engine cleanup), not as an implicit assumption in §7.6.

Recommendation: add split_at_hlc uint64 (HLC timestamp of when the split was applied) to RouteDescriptor, set by the SplitRange handler when it writes the two child RouteDescriptors. State explicitly in §8.1 M3-PR1 or M3-PR2 that this field is added. OQ-7 then becomes "is split_at_hlc sufficient or do we also need last_split_at_hlc updated on every re-split?"

4. Single-key isolation (§5.4.2) requires two SplitRange calls per detection cycle

The proposed single-key isolation approach is: "split key = hot key, and a follow-on split at hot key's successor so the child contains only that key." This is two SplitRange calls: one to split [A, B) into [A, hotkey) and [hotkey, B), and another to split [hotkey, B) into [hotkey, hotkey+1) and [hotkey+1, B). With maxSplitsPerCycle = 1, the second call cannot be issued in the same cycle.

Options worth deciding now:

Count the pair as 1 budget unit (special-case the two calls atomically, or issue back-to-back in the same cycle despite the per-cycle limit).
Defer the second split to the next cycle: the child [hotkey, B) is now much smaller; it re-evaluates next cycle and the isolation completes. Simpler but takes 2 cycles.
Atomic two-call variant: issue both in one cycle, counting as 2 budget units; maxSplitsPerCycle = 2 minimum for single-key isolation to work in one pass.

This affects both the scheduler implementation and the acceptance criteria in §8.4. The design should commit to one approach.

5. Detector state map doesn't describe how manual splits are reconciled (§6.3)

The expected_catalog_version CAS correctly makes the scheduler fail-closed on a manual split. But the detector's in-memory state map retains the parent route entry after its catalog entry disappears. On the next evaluation cycle, the detector would try to evaluate a route that no longer exists in the engine, wasting work at minimum and potentially logging confusing output.

Recommendation: at the start of each evaluation cycle the detector should reconcile its state map against the live route set (e.g. discard state for any routeID that is no longer present in the engine's routes). This is a small but load-bearing correctness detail that should be in §5 or §6.

6. historyMu.Lock() in Snapshot contends with RunFlusher (§3.3)

§3.3 correctly notes that Snapshot "takes historyMu — never touched by Observe." However, RunFlusher also takes historyMu on each flush to append the new column. The lock is a write mutex (s.historyMu.Lock()). If the default eval interval equals --keyvizStep, the detector's Snapshot call and the flusher's append will regularly race for the same lock, adding tail latency to one or the other.

This is minor given detection is off-path, but worth noting: the five-lens performance review in M3-PR2 should benchmark or at least measure the lock contention under load (e.g. with go test -race plus a timing check). The doc could add a one-line note to the performance lens in §8.3 to check this.

Open Questions — Recommendations

OQ-1 (Route.Load / Stats fate): Recommend keeping Engine.Stats() but deleting RecordAccess/splitRange/the threshold constructor. Stats() is useful as a debug surface for the admin tool and costs nothing when unused. The comment at distribution/engine.go:377-393 shows it already returns a safe copy.

OQ-2 (down-side hysteresis shape): Hold (current proposal) is the right choice. Decaying by one per cycle introduces an implicit max-hold time that's a function of candidateWindows, making the threshold–cooldown interaction harder to reason about in unit tests. Hold is simple, deterministic, and easy to assert: "any score in [0.8×, threshold) must never change consecutiveOver."

OQ-3 (no-evidence fallback): Decline (current proposal). Allowing a min/max-observed-key midpoint as fallback re-introduces the midpoint-only problem the parent doc §3.1.3 explicitly rejects for unobserved subspaces. Better to push the operator toward enabling --keyvizKeyBucketsPerRoute via a clear metric (autosplit_skipped_no_split_key).

OQ-5 (leader-local vs cluster-wide): Leader-local is sufficient for M3. The rationale in §4.1 is well-argued (the leader serves the load it would split on). Flag this as a known limitation in the doc, not as a deferred OQ — it's already a deliberate non-goal.

OQ-6 (cross-group gate): Recommend gating behind --autoSplitCrossGroup. Cross-group splits trigger data movement (M2's SplitJob); operators should opt into that separately from range splitting. The flag cost is trivial; the blast radius of unexpected cross-group data migration is not.

Minor / Nits

§5.4.1 "p50 from sub-range buckets" — clarify which bucket boundary is chosen: the lower (lo) or upper (hi) boundary of the bucket whose cumulative load crosses 50%. SubBucketBoundsFor returns (lo, hi []byte, ok bool); the implementation needs to pick one.
§7.6 "leader-loss callback" — add a file:line pointer to the actual mechanism (presumably the same kv/lease_state.go / kv/raft_engine.go callbacks used by lease invalidation). This doc targets a reviewer unfamiliar with the lease layer; the cross-reference closes the loop.
§7.1 --autoSplit implies sampler construction — add a note about the interaction with cmd/server/demo.go and main.go: the wiring condition becomes if keyvizEnabled || autoSplit { constructSampler() }. Flag this for M3-PR3 so the flag-wiring reviewer doesn't miss it.

Summary

Fundamentally sound design. The key things to resolve before M3-PR1 begins:

Commit to a precise scoring/normalization formula (closes OQ-8 and pins unit tests).
Specify how the detector queries Snapshot for "last N columns" (concrete time-range recipe).
Add split_at_hlc to RouteDescriptor schema in M3-PR1's scope (unblocks OQ-7 reconstruction).
Decide single-key isolation budget counting (2 calls = 1 or 2 budget units).
Add detector state-map reconciliation to §5 or §6.

gemini-code-assist

Code Review

This pull request proposes a design document for Milestone 3 of the Hotspot Shard Split, focusing on automation. The design leverages the existing keyviz sampler to detect hot routes, implements a pure detector with hysteresis and cooldown, and schedules splits via the SplitRange RPC. The review feedback highlights three key design improvements: implementing a garbage-collection mechanism for the leader-local state map to prevent memory leaks, addressing the lack of creation timestamps in RouteDescriptor to enable accurate cooldown seeding on new leaders, and restricting auto-split eligibility to routes in RouteStateActive to avoid interfering with transitional migration states.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-10T19:51:16Z

+The detector's hysteresis counters and cooldown deadlines live in a **leader-local in-memory state map** — they are **not** replicated through Raft. On a default-group leadership change:
+
+- The deposed leader's detector goroutine stops (it observes the leader-loss callback, the same mechanism the lease invalidation uses).
+- The **new** leader starts its detector with an **empty** state map. Consecutive-over counters reset to zero, so a route must re-accumulate `candidateWindows` cycles before it can be promoted again. This is intentionally conservative: after an election, the new leader re-earns confidence before splitting, avoiding a split based on counters it never observed.
+- **Cooldowns are partially lost** across an election (the new leader does not know a route was recently split). To avoid a too-soon re-split, the new leader seeds cooldown from the catalog: any route whose `parent_route_id` lineage (parent doc §5, `RouteDescriptor.parent_route_id`) indicates a recent split, or whose creation is recent per the catalog, starts in cooldown. (OQ-7: is `parent_route_id` + a catalog-recorded split timestamp enough to reconstruct cooldown, or do we need a durable `last_split_at` per route?) In-flight cross-group `SplitJob`s are durable (M2) so the new leader sees them and excludes those routes (§6.3) regardless of detector state.
+
+Keeping detector state leader-local (not Raft-replicated) is a deliberate cost/safety trade: replicating per-cycle counters would add Raft traffic for what is a best-effort heuristic, and the worst case of a lost counter is a *delayed* split (safe), never an *erroneous* one. We state this explicitly so a reviewer does not mistake the non-replication for an oversight.


To prevent unbounded memory growth in the leader-local in-memory state map, the design should specify a pruning/garbage-collection mechanism. Since route IDs are deleted from the catalog after a split (and replaced by child routes), keeping their hysteresis counters and cooldown deadlines in the map indefinitely will cause a memory leak.\n\nThe detector should periodically (e.g., at the end of each evaluation cycle) compare the keys of the state map against the active routes in the current catalog snapshot and evict any route IDs that are no longer present.

Accepted. Added §7.7(a): the detector reconciles its leader-local state map against the live catalog CatalogSnapshot at the start of every evaluation cycle and evicts any RouteID not in the live set, so the map is strictly bounded by the live route count (≤ maxAutoRoutes). autosplit_tracked_routes reflects the post-GC size. This also covers the disappeared/manual-split route case. Fixed in b56950a.

gemini-code-assist · 2026-06-10T19:51:16Z

+
+- The deposed leader's detector goroutine stops (it observes the leader-loss callback, the same mechanism the lease invalidation uses).
+- The **new** leader starts its detector with an **empty** state map. Consecutive-over counters reset to zero, so a route must re-accumulate `candidateWindows` cycles before it can be promoted again. This is intentionally conservative: after an election, the new leader re-earns confidence before splitting, avoiding a split based on counters it never observed.
+- **Cooldowns are partially lost** across an election (the new leader does not know a route was recently split). To avoid a too-soon re-split, the new leader seeds cooldown from the catalog: any route whose `parent_route_id` lineage (parent doc §5, `RouteDescriptor.parent_route_id`) indicates a recent split, or whose creation is recent per the catalog, starts in cooldown. (OQ-7: is `parent_route_id` + a catalog-recorded split timestamp enough to reconstruct cooldown, or do we need a durable `last_split_at` per route?) In-flight cross-group `SplitJob`s are durable (M2) so the new leader sees them and excludes those routes (§6.3) regardless of detector state.


The proposal to seed cooldowns on a new leader by checking if a route's creation is "recent per the catalog" has a design gap: the RouteDescriptor struct in catalog.go does not contain any creation timestamp, split timestamp, or catalog version of origin. Without this metadata, a newly elected leader cannot determine whether a route with a parent_route_id was split 1 minute ago (and should be in cooldown) or 1 hour ago (cooldown expired).\n\nTo resolve OQ-7 and make this safety mechanism viable, the design should either:\n1. Add a durable split_timestamp or created_at field to the RouteDescriptor (or catalog metadata).\n2. Store a mapping of catalog_version -> timestamp in the catalog metadata so the leader can resolve the age of the route's origin version.

Accepted — you are right that the old 'creation is recent per the catalog' wording had no schema basis (RouteDescriptor at distribution/catalog.go:71-78 carries only RouteID, Start, End, GroupID, State, ParentRouteID — no timestamp). Took your option 1: added a durable RouteDescriptor.SplitAtHLC uint64 (HLC, set by the SplitRange handler at adapter/distribution_server.go:136), encoded under a bumped catalogRouteCodecVersion (catalog.go:24) with v1 records decoding to SplitAtHLC=0 for back-compat. The new leader reconstructs cooldown deterministically: now_hlc − SplitAtHLC < splitCooldown. See §7.7(b); scoped to M3-PR1 (§8.1). This resolves OQ-7. Fixed in b56950a.

gemini-code-assist · 2026-06-10T19:51:16Z

+
+The target-group selection slots **behind the same `SplitRange` interface** — the scheduler code path is identical; only the `target_group_id` field flips from zero (same-group) to a chosen group once M2 is present. M3 ships the policy hook disabled (always same-group) and a follow-on M3.x / M2-completion PR enables target selection. OQ-6 asks whether to gate target selection on an explicit `--autoSplitCrossGroup` sub-flag.
+
+### 6.3 Concurrency with in-flight splits


To ensure safety during migrations, the design should explicitly state that only routes in RouteStateActive are eligible for auto-split detection. Routes in transitional states such as RouteStateWriteFenced, RouteStateMigratingSource, or RouteStateMigratingTarget must be excluded from candidacy. This prevents concurrent automated splits from interfering with active M2 migration jobs.

Accepted. Added §5.0 as the first gate of every evaluation cycle: only RouteStateActive routes (distribution/catalog.go:51-58) are candidates; RouteStateWriteFenced/RouteStateMigratingSource/RouteStateMigratingTarget are excluded from scoring and promotion entirely. This subsumes the §6.3 in-flight-SplitJob exclusion. Asserted in §8.2 tests and §8.4 acceptance criterion 8. Fixed in b56950a.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3dc133ecb7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-10T19:51:40Z

+- **M3 reads detection input from the keyviz sampler**, not from `Route.Load`.
+- The engine's `RecordAccess` / `splitRange` / `Route.Load` path is **retired** (§3.4). No new call site is added for it.
+
+This means **auto-split requires keyviz sampling to be active**. `--autoSplit` implies the sampler is constructed and flushing (§7.1). We do *not* require the operator to also pass `--keyvizEnabled`; enabling auto-split turns on the minimum sampling auto-split needs (route-granular counts; sub-range buckets and Top-K stay opt-in via the existing keyviz flags and only sharpen split-key selection — see §5.4).


Enable split-key evidence with autoSplit defaults

With the current keyviz defaults (DefaultKeyBucketsPerRoute = 1 and --keyvizHotKeysEnabled=false), turning on only --autoSplit constructs a route-granular sampler, but §5.4.3 later says the detector declines whenever neither sub-range buckets nor Top-K are enabled. That means the default advertised --autoSplit mode can never produce a split key or satisfy the M3 acceptance criteria unless operators also discover and enable another keyviz flag; either make --autoSplit enable a split-key evidence source or document it as required configuration.

Useful? React with 👍 / 👎.

Accepted — correct catch. With bare keyviz defaults (DefaultKeyBucketsPerRoute=1 at keyviz/sampler.go:91, --keyvizHotKeysEnabled=false at main.go:205) the detector had no split-key evidence and §5.4.3 would decline every split, making 'just --autoSplit' a permanent no-op. Fixed by making --autoSplit imply the evidence source it needs: when bucketing is left at the default, auto-split constructs the sampler with KeyBucketsPerRoute=autoSplitDefaultBuckets (default 16, under the 256 cap at sampler.go:115), so §5.4.1 sub-range-p50 produces a split key out of the box. Added the implied-vs-explicit config table in §3.1, the keyvizEnabled||autoSplit wiring condition + --autoSplitDefaultBuckets flag in §7.1, and updated OQ-3. Top-K stays opt-in (only the §5.4.2 single-key path). Fixed in b56950a.

chatgpt-codex-connector · 2026-06-10T19:51:40Z

+Per parent doc §3.1.3 and §6.3, the split key comes from the **observed key distribution**, not a byte midpoint:
+
+1. **Default — sub-range p50.** When `--keyvizKeyBucketsPerRoute > 1`, use the route's sub-range bucket rows to find the bucket boundary nearest the cumulative-load median, and use that boundary key (reconstructed via `MemSampler.SubBucketBoundsFor`, `keyviz/sampler.go:559`) as the split key. This places the boundary where load is actually halved.
+2. **Single-key skew — isolate the hot key.** When Top-K is enabled (`--keyvizHotKeysEnabled`) and one key's share of the route's writes is `>= topKeyShare` (parent doc §6.3 default 0.8), split so the hot key is isolated into its own narrow child (split key = hot key, and a follow-on split at hot key's successor so the child contains only that key). This is the single-hot-key hotspot case the whole feature targets.


Exempt compound hot-key splits from cooldown

This single-key isolation plan requires two SplitRange calls: first at the hot key, then a follow-on split at its successor. However §5.3 immediately puts both children into the 10-minute cooldown after a split is scheduled, and §6.3 defaults to one split per cycle; unless the scheduler treats this as an atomic/explicitly exempt compound operation, the child containing the hot key cannot be split again until cooldown expires, so the hot key is not isolated as claimed.

Useful? React with 👍 / 👎.

Accepted — the contradiction was real: §5.4.2 needs two SplitRange calls and §5.3 + §6.3 would have locked the hot-key child in cooldown after the first. Reworked it as one atomic compound decision: §5.4.2 now emits both split keys (at H and at H·0x00) together; §6.4 executes them as a chained-CAS back-to-back pair counted as 1 budget unit (so default maxSplitsPerCycle=1 still works), with partial-completion safety; §5.3 makes the intermediate child [H,B) cooldown-exempt while applying cooldown to the three final children. So the hot key is actually isolated. Added autosplit_compound_partial_total and §8.2 assertions. Fixed in b56950a.

- §5.0: explicit RouteStateActive-only candidacy precondition; exclude WRITE_FENCED/MIGRATING_* (gemini medium #3). - §3.1/§7.1: --autoSplit implies sub-range buckets (autoSplitDefaultBuckets, default 16) + sampler-wiring condition keyvizEnabled||autoSplit, with an implied-vs-explicit config table; closes the bare-keyviz-defaults no-op trap (codex P2 #1). - §5.3/§5.4.2/§6.4: compound single-key isolation as one atomic decision (1 budget unit, 2 chained CAS SplitRange calls); intermediate child is cooldown-exempt so the hot key can actually be isolated (codex P2 #2). - §7.7a: bounded leader-local state map via per-cycle GC reconciliation against the live catalog (gemini medium #1); also reconciles disappeared/manual-split routes. - §7.7b: add durable RouteDescriptor.SplitAtHLC for deterministic post-election cooldown reconstruction; replaces the unimplementable 'recent per catalog' seeding (gemini medium #2); resolves OQ-7; scoped to M3-PR1. - §7.3 metrics normalized to bounded {reason} enum; added autosplit_compound_partial_total. §8.1/§8.4 updated for new scope + criteria.

bootjp · 2026-06-11T05:04:02Z

@claude review

claude · 2026-06-11T05:04:15Z

Claude finished @bootjp's task in 4m 53s —— View job

Review: M3 Automation Design Doc (Round 2)

Load tools
Read the updated M3 design doc (post b56950a)
Verify all previous issues were properly addressed
Check for any new issues introduced by the revisions
Post review findings

The revision addressed all five issues from the previous round via b56950a. The doc is substantially better — §5.0 (active-state gate), §6.4 (compound budget/CAS), §7.7 (state-map GC + SplitAtHLC), and the implied-bucketing table in §3.1 are all clean. Two items from the previous review remain open, and I found three new issues worth resolving before M3-PR1 begins.

Carried Over (Not Addressed)

1. Score normalization formula is still underspecified (§5.1 / OQ-8)

§5.1 says the score is "summed and normalized to the flush Step" but commits to no formula. OQ-8 is still open. Without pinning a formula, two contributors can implement different normalizations that both pass unit tests but produce different sensitivities. The formula needs to be in the doc, not left to implementation:

score = (Σ write_ops_over_N_columns / (N × Step_s) × 60) × Ww
      + (Σ read_ops_over_N_columns  / (N × Step_s) × 60) × Wr

Recommendation: close OQ-8 in this doc. The ops/min expression is more intuitive for operators setting --autoSplitThreshold. Pin N = candidateWindows (or a separate --autoSplitWindowColumns if you want to decouple them), document the formula, and reference it from the §8.2 unit test description so the table-driven tests are unambiguous.

2. MemSampler.Snapshot(from, to) time-range recipe still missing (§4.2)

The doc says "trailing N flush columns" but Snapshot(from, to time.Time) (keyviz/sampler.go:1463) takes a time range, not a column count. The doc never says how from and to are computed. A "trailing N columns" detector that calls Snapshot(now - N×Step, now) will include a partially-flushed current column — the accumulating-but-not-yet-flushed bucket — skewing scores low toward the end of each flush window.

Proposed recipe:

to   = time of the most-recently-committed flush column
from = to − N × Step

The detector needs a reference to the last flush boundary. Worth stating whether it gets this from a Flusher handle, a notification channel from RunFlusher, or simply reads Column.At from the last element of the snapshot. The partial-flush skew case should be called out explicitly so M3-PR2 implementers know to guard against it.

New Issues

3. DecodeRouteDescriptor version check breaks rolling upgrades (§7.7b)

The current decoder (distribution/catalog.go:200) is strict:

if raw[0] != catalogRouteCodecVersion {
    return RouteDescriptor{}, errors.Wrapf(ErrCatalogInvalidRouteRecord, "unsupported version %d", raw[0])
}

Bumping catalogRouteCodecVersion to 2 for SplitAtHLC means any pre-M3 node (still running v1 code) that reads a v2 record from the catalog will get ErrCatalogInvalidRouteRecord. In a rolling upgrade — the operator upgrades one node at a time — a post-M3 node writes v2 catalog entries; the not-yet-upgraded nodes try to decode them and fail. The catalog read failure (Snapshot → DecodeRouteDescriptor → error) could cascade to route-not-found errors on those nodes.

The doc says the change is "backward-compatible with existing catalog records" (true: new code reading old v1 records → SplitAtHLC=0), but that is the easy direction. The hard direction — old code reading new v2 records — is not addressed and is currently an error.

Options:

Stay at v1, append optionally. Relax the decoder to if raw[0] > catalogRouteCodecVersion { /* skip extra tail bytes */ } instead of rejecting; encode SplitAtHLC only when non-zero. Old decoders on unmodified code would still reject v2, so this requires an opt-in decoder relaxation shipped before the v2 writer.
Version bump with documented upgrade fence. Explicitly state: "all nodes must be upgraded to M3-PR1 before any M3-PR3 SplitRange write produces v2 records." Adds an ops constraint; document it.
Encode SplitAtHLC as a separate catalog key (e.g. {routePrefix}{routeID}/split_at_hlc) decoded independently. Zero migration ceremony; old nodes ignore unknown keys.

Whichever approach is chosen, it needs to be in the doc. As written, the design implies a hard upgrade fence but doesn't state it.

4. evalInterval < Step allows consecutiveOver to advance faster than intended (§5.1 / §7.1)

--autoSplitEvalInterval defaults to --keyvizStep but can be set independently. When evalInterval < Step, multiple evaluation cycles complete before a new flush column arrives. Each cycle the detector calls Snapshot(from, to) and gets the same set of N columns as the previous cycle (no new data), yet §5.2 advances consecutiveOver on every cycle the score is above threshold. With evalInterval = 5s and Step = 60s, the detector can promote a candidate in 15 seconds (3 cycles × 5s) even though only one flush window of data exists — completely defeating the "3 consecutive flush windows" intent from the parent doc §6.2.

The fix is straightforward: the detector should only advance consecutiveOver (or even run the scoring step) when at least one new flush column has appeared since the last evaluation. A cheap guard: record the At timestamp of the most-recent column at last evaluation; skip the cycle if it hasn't changed. This should be in §5.2 and §7.1.

5. Multi-leader coverage gap understated in §4.1

§4.1 says "the leader's local sampler already holds the authoritative load for routes it leads" and frames OQ-5 narrowly as "heavy follower-forwarded reads." But the gap is broader: in a multi-group cluster (--raftGroups with G1, G2, G3, default), the default-group leader is not necessarily the leader of G1/G2/G3. In Elastickv, each node runs a single ShardedCoordinator and MemSampler; a node only calls Observe for requests it receives from clients. If clients connect to the G1 leader (which is not the default-group leader), those observations never reach the default-group leader's sampler.

Concretely: in a 3-node cluster where node A leads the default group and node B leads shard group G1, node A's sampler has no data for routes in G1 that serve traffic through node B. The detector on node A cannot detect or split those routes.

This is a more fundamental limitation than OQ-5 implies. Recommend:

Rename the OQ-5 framing from "heavy follower-forwarded reads" to "multi-leader deployments where shard group leaders differ from the default-group leader."
State it explicitly as a known limitation in §4.1 (not just in the OQ): "For routes whose shard group's leader is a different node than the default-group leader, the local sampler has no load data. Those routes are invisible to the detector in M3. Cross-node scoring is deferred to a future milestone."

This is a non-goal clarification, not a design flaw — but leaving it as a narrow OQ5 means M3-PR2 reviewers may not catch the implication for multi-group deployments.

Nits (Carried)

§5.4.1 lo vs hi boundary still unspecified. SubBucketBoundsFor(routeID, bucket) (keyviz/sampler.go:559) returns (lo, hi []byte, ok bool). The p50 split key is the boundary between the p50 bucket and the next one — i.e., the hi of the bucket whose cumulative load first crosses 50%. The doc says "boundary key nearest the cumulative-load median" without distinguishing lo from hi. Add one word: "use the hi (upper boundary) of the bucket…"
§7.1 --autoSplitDefaultBuckets upper bound not stated. The flag defaults to 16 and is compared against MaxKeyBucketsPerRoute = 256, but the doc doesn't say whether --autoSplitDefaultBuckets is itself capped at MaxKeyBucketsPerRoute. Should be explicit (and validated in main.go).
§7.3 metric label overlap. no_split_key, route_cap, and budget_exhausted appear in both autosplit_splits_failed_total{reason} and autosplit_skipped_total{reason}. The semantics are unclear: does a "no split key" condition go to splits_failed_total (the call was attempted) or skipped_total (the call was not attempted)? A one-line clarification of the split between these counters ("failed = RPC was issued and rejected; skipped = RPC was not issued") would disambiguate.

Open Question Recommendations

OQ-1: Keep Engine.Stats() (delete RecordAccess/splitRange). Stats() is a useful debugging surface and costs nothing when unused.

OQ-2: Hold (current proposal). Decaying by one per cycle creates an implicit max-hold time that's a function of candidateWindows, making the threshold–cooldown interaction harder to reason about. Hold is simple and deterministic.

OQ-6: Gate behind --autoSplitCrossGroup. Cross-group splits trigger M2's SplitJob data movement; operators should opt into that separately from same-group range splitting. The blast radius of unexpected cross-group migration is not justified by the flag cost.

Summary

Prior round's five issues are cleanly resolved. Two pre-existing gaps remain open (normalization formula, Snapshot recipe). Three new issues: the codec version check breaks rolling upgrades, the evalInterval < Step case lets the consecutive-window counter advance on stale data, and the multi-leader coverage limitation is understated. The rolling-upgrade issue (§3) is the most load-bearing — it needs a concrete resolution in the doc before M3-PR1 can land the schema change.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b56950acb2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T05:05:16Z

+
+- **`SplitAtHLC uint64`** — the HLC timestamp at which the split that produced this route was applied. Set by the `SplitRange` handler (`adapter/distribution_server.go:136`) when it writes each child `RouteDescriptor`, using the leader-issued HLC (`HLC.Next()`, never the local wall clock — CLAUDE.md). It is encoded/decoded alongside `ParentRouteID` in `EncodeRouteDescriptor`/`DecodeRouteDescriptor` (`distribution/catalog.go:168-217`) under a bumped `catalogRouteCodecVersion` (currently 1 at `catalog.go:24`); the decoder treats a missing field on a v1 record as `SplitAtHLC = 0` (= "unknown / pre-upgrade", never in cooldown), so the change is backward-compatible with existing catalog records.
+
+With `SplitAtHLC` present, the new leader reconstructs cooldown deterministically: for each live route, if `now_hlc − SplitAtHLC < splitCooldown` (compared in HLC physical-ms units, monotonic per §5.3), seed that route's cooldown deadline to `SplitAtHLC + splitCooldown`. This is exact, not heuristic, and uses the durable timestamp rather than `parent_route_id` recency (which only tells you *whether* a route is a child, not *when*).


Convert HLCs before comparing cooldown durations

When reconstructing cooldown after a leader change, this formula appears to subtract raw HLC values and add splitCooldown directly to an HLC. In this repo HLCs encode physical milliseconds in the upper bits and a 16-bit logical counter in the low bits (kv/hlc.go layout), so a raw difference advances by 1<<16 per millisecond; a 10-minute cooldown would look expired after only about 9 ms, allowing immediate re-splits after elections. The design should shift/extract the physical millisecond component before duration math, then convert back only for diagnostics if needed.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-11T05:05:17Z

+
+M3 adds one durable field to `RouteDescriptor`:
+
+- **`SplitAtHLC uint64`** — the HLC timestamp at which the split that produced this route was applied. Set by the `SplitRange` handler (`adapter/distribution_server.go:136`) when it writes each child `RouteDescriptor`, using the leader-issued HLC (`HLC.Next()`, never the local wall clock — CLAUDE.md). It is encoded/decoded alongside `ParentRouteID` in `EncodeRouteDescriptor`/`DecodeRouteDescriptor` (`distribution/catalog.go:168-217`) under a bumped `catalogRouteCodecVersion` (currently 1 at `catalog.go:24`); the decoder treats a missing field on a v1 record as `SplitAtHLC = 0` (= "unknown / pre-upgrade", never in cooldown), so the change is backward-compatible with existing catalog records.


Use a fenced or committed timestamp for SplitAtHLC

This proposes writing a durable catalog field with HLC.Next(), but kv/hlc.go documents that new persistence-grade timestamp allocations must use NextFenced() so an expired physical ceiling fails closed. Since SplitAtHLC is persisted in RouteDescriptor and later drives cooldown reconstruction, use the split transaction's committed timestamp or allocate it with the fenced path; otherwise the autosplit metadata can be written with an unfenced stale-leader timestamp in the exact contexts the HLC fence is meant to prevent.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-11T05:05:17Z

+### 7.1 Default OFF behind a flag
+
+- `--autoSplit` (bool, default `false`). When false, none of the detector/scheduler goroutines start; behavior is identical to today. When true, the default-group leader runs the detector + scheduler loop, and the keyviz sampler is constructed and flushed if it is not already.
+- **Sampler-wiring condition.** The sampler is built when `keyvizEnabled || autoSplit` (today it is built only on `keyvizEnabled`; see `main.go:2005` `newKeyvizSampler` and the analogous wiring in `cmd/server/demo.go`). M3-PR3's flag-wiring change touches both `main.go` and `cmd/server/demo.go`. When auto-split forces the sampler on and the operator left `--keyvizKeyBucketsPerRoute` at its default of 1, auto-split raises it to `autoSplitDefaultBuckets` (default 16) so split-key evidence exists by default (§3.1 table). An explicit `--keyvizKeyBucketsPerRoute 1` is honored as-is (operator opts into route-granular-only → §5.4.3 decline).


Track whether key bucket count was explicitly set

This requires distinguishing an omitted --keyvizKeyBucketsPerRoute from an explicit --keyvizKeyBucketsPerRoute 1, but the current startup flags are plain flag.Int values in main.go/cmd/server/demo.go, where both cases have the same final value. Without adding explicit flag-presence tracking (for example via flag.Visit or a config wrapper), the implementation has to either always raise 1 to autoSplitDefaultBuckets and ignore the documented operator opt-out, or honor 1 and reintroduce the default no-split trap this section is trying to close.

Useful? React with 👍 / 👎.

bootjp · 2026-06-11T05:37:44Z

@claude review

Round-2 review addressed in 59990ce (doc-only; Status: Proposed unchanged). Per-finding dispositions, all verified against the tree:

codex P2 #1 — "Convert HLCs before comparing cooldown durations" — ACCEPTED

Verified against kv/hlc.go:31-49: layout is physical_ms << 16 | logical (hlcLogicalBits = 16, exported kv.HLCLogicalBits at :41). A raw HLC subtraction advances 1<<16 = 65536 per ms, so a 10-min cooldown would lapse in ~9.16 ms — exactly the bug. New §7.7(d) defines hlcPhysicalMs(v) = int64(v >> kv.HLCLogicalBits) and pins all cooldown/duration math to physical ms; §8.2 adds the 1<<16-per-ms regression test; acceptance criterion 6 updated.

codex P2 #2 — "Use a fenced or committed timestamp for SplitAtHLC" — ACCEPTED

Verified: HLC.Next() bypasses the fence (kv/hlc.go:110-128); persistence allocations must use NextFenced(). The SplitRange commit already goes through coordinator.Dispatch (adapter/distribution_server.go:229-235) → resolveTxnCommitTS (kv/sharded_coordinator.go:1091-1107) → nextTxnTSAfter (:1365-1379) → NextFenced(). §7.7(b) now defines SplitAtHLC as that transaction's fenced commit ts: the handler pre-allocates via HLC.NextFenced(), stamps it into both child descriptors, and passes it as the explicit OperationGroup.CommitTS (the coordinator honors a caller-supplied CommitTS and Observes it, :1098-1102) so the stored value provably equals the commit ts. ErrCeilingExpired → fail SplitRange closed. Documented why this is safe across elections.

codex P2 #3 — "Track whether key bucket count was explicitly set" — ACCEPTED

Verified: keyvizKeyBucketsPerRoute is a plain flag.Int (main.go:197); omitted and explicit 1 both yield 1. No flag.Visit in the repo today. §3.1 and §7.1 now specify flag.Visit after flag.Parse() to record explicit presence (keyBucketsExplicit), mirrored in cmd/server/demo.go; the implied raise to autoSplitDefaultBuckets fires only when the flag was not explicitly set. Also clamped --autoSplitDefaultBuckets to [2, MaxKeyBucketsPerRoute] (the §7.1 nit).

Claude round-2 review — incorporated

OQ-8 (RESOLVED): §5.1 pins the weighted ops/min formula (score = write_rate×Ww + read_rate×Wr, rate = Σops/(N×Step_s)×60, N = candidateWindows), making the threshold --keyvizStep-invariant; referenced from §8.2.
Snapshot(from,to) recipe (§4.2): over-fetch-then-trim anchored on the last committed MatrixColumn.At, dropping the partial current column.
evalInterval < Step (§5.2): consecutiveOver advances per new committed flush column, not per cycle (tracked via freshest At).
Multi-leader gap (§4.1 + OQ-5 reframed): stated as an explicit known limitation — routes whose group leader differs from the default-group leader are invisible to the leader-local detector; cross-node scoring deferred. Reframed from "follower-forwarded reads."
Rolling-upgrade codec (NEW, §7.7c): strict decoder (distribution/catalog.go:200-202) breaks old-reads-new. Resolved by shipping an accept-and-ignore-forward-version decoder first, then the v2 encoder, plus a documented M3-PR1 rollout fence; rejected the separate-key alternative. Acceptance criterion 10 + §8.1 codec tests added.
Nits: §5.4.1 now specifies the bucket hi boundary; §7.3 disambiguates failed (RPC issued) vs skipped (RPC not issued) and removes the overlapping reasons from the failed enum.
OQ convergence: OQ-1 (keep Stats, drop RecordAccess), OQ-2 (hold band), OQ-6 (gate behind --autoSplitCrossGroup), OQ-7, OQ-8 → RESOLVED with the reviewer's recommendations inline; OQ-3/OQ-4/OQ-5 remain open as implementer/future-milestone calls.

No rebuttals this round — all findings were valid.

claude · 2026-06-11T05:37:57Z

Claude finished @bootjp's task in 4m 13s —— View job

Review: M3 Automation Design Doc (Round 3)

Load tools
Read the updated M3 design doc (post 59990ce)
Verify all round-2 issues were addressed
Check for any new issues introduced by the revisions
Post review findings

All round-2 findings are cleanly addressed. The OperationGroup.CommitTS field exists in the codebase (kv/transcoder.go:29-31: "CommitTS optionally pins the transaction commit timestamp") and resolveTxnCommitTS already honors it (kv/sharded_coordinator.go:1098-1102), so the §7.7b fenced-ts mechanism is wired-in and sound. Three new issues, all in M3-PR1's scope:

Issue 1 — Decoder tail-bytes check at `catalog.go:213-215` also needs relaxation (§7.7c)

The design correctly describes relaxing the version-byte check at distribution/catalog.go:200 from "reject unless equal" to "accept if version ≥ minimum." But a second strict check at lines 213-215 in the same function also rejects trailing bytes:

if r.Len() != 0 {
    return RouteDescriptor{}, errors.WithStack(ErrCatalogInvalidRouteRecord)
}

This runs unconditionally after decoding End. A v2 record with the appended 8-byte SplitAtHLC tail would hit this check and return ErrCatalogInvalidRouteRecord — even after the version-byte check is relaxed. The result: a "relaxed" decoder still rejects v2 records, breaking the accept-and-ignore scheme entirely.

M3-PR1 must change this check to skip (rather than reject) unknown trailing bytes when the version is a forward version. For a v2 decoder, the fix is to conditionally read SplitAtHLC when version == 2 and discard any bytes remaining beyond that. For the "old code reads new record" direction (which is what the rolling-upgrade fence is about), the v1-era decoder shipped in M3-PR1 must skip trailing bytes rather than error.

The §8.1 codec test "v2 record decodes under the relaxed-decoder without error" would catch this failure — but the doc should name this specific line as part of the change scope so M3-PR1 implementers don't miss it.

Issue 2 — `routeDescriptorEqual` and `CloneRouteDescriptor` must include `SplitAtHLC` (§7.7b, M3-PR1 scope)

When M3-PR1 adds SplitAtHLC uint64 to RouteDescriptor, two helper functions in distribution/catalog.go will silently drop the field unless also updated:

CloneRouteDescriptor (catalog.go:368-377) — used by normalizeRoutes → prepareSave → Save. It currently copies all fields by name; a newly-added SplitAtHLC will not be copied:

func CloneRouteDescriptor(route RouteDescriptor) RouteDescriptor {
    return RouteDescriptor{
        RouteID:       route.RouteID,
        Start:         CloneBytes(route.Start),
        ...
        ParentRouteID: route.ParentRouteID,
        // SplitAtHLC silently absent — zero-valued in every clone
    }
}

Every route going through Save is cloned via normalizeRoutes — so SplitAtHLC would be zeroed before encoding, silently losing the durable timestamp.

routeDescriptorEqual (catalog.go:693-700) — used by appendUpsertRouteMutations to skip re-encoding unchanged routes. Without SplitAtHLC in the equality check, two routes that differ only in SplitAtHLC are treated as identical, and appendUpsertRouteMutations skips encoding the updated field. If a split updates SplitAtHLC on the child routes and the prior catalog entry for those route IDs already exists, the field is silently not written.

Both functions must include SplitAtHLC in M3-PR1. These are low-visibility Go pitfalls (the struct literal compiles without exhaustiveness checking) and are easy to miss in a code review; they should be called out explicitly in §7.7b or §8.1 so they land in M3-PR1's checklist.

Issue 3 — `now_hlc` source for cooldown reconstruction is underspecified (§7.7d)

§7.7d defines the cooldown check as hlcPhysicalMs(now_hlc) − hlcPhysicalMs(SplitAtHLC) < splitCooldownMs and says now_hlc is "the leader-issued HLC the detector reads at cycle start." The three candidate sources have different semantics:

HLC.Next() — unfenced, advances the clock and causes unnecessary logical-counter churn for what is a purely read operation.
HLC.NextFenced() — fenced, can return ErrCeilingExpired, requiring error handling in the detector.
HLC.Current() — returns the last issued value without advancing. Correct if the node has been active; could be stale on a newly-elected leader that hasn't yet issued a timestamp.

For cooldown reconstruction the detector only needs a current physical millisecond count — there is no ordering guarantee needed. The simplest correct option is time.Now().UnixMilli() (the wall clock), which CLAUDE.md allows for "diagnostics/metrics" — and cooldown reconstruction is purely heuristic, not ordering-sensitive. The HLC physical-ms comparison exists to convert SplitAtHLC to ms (since it's a packed HLC value), but now itself can come directly from the wall clock:

elapsedMs := time.Now().UnixMilli() - hlcPhysicalMs(SplitAtHLC)

Using HLC.Current() for now_hlc would also work but creates an implicit dependency on the HLC having been recently ticked on this node. Either choice is correct; the doc should commit to one and explain why (especially given the "no wall clock for ordering-sensitive reads" CLAUDE.md rule — clarifying that cooldown reconstruction is not ordering-sensitive removes any ambiguity about whether the wall-clock form violates the rule).

Confirmed Resolved

All round-2 and round-3 findings are addressed:

OQ-8 (normalization formula): pinned as score = write_rate × Ww + read_rate × Wr with explicit rate = Σ ops / (N × Step_s) × 60 (§5.1) ✓
Snapshot(from,to) recipe: over-fetch-then-trim anchored on last committed At (§4.2) ✓
evalInterval < Step: consecutiveOver advances per new committed flush column, not per cycle (§5.2) ✓
Multi-leader coverage gap: explicit known limitation stated in §4.1 ✓
Rolling-upgrade codec: relaxed-decoder-first + rollout fence (§7.7c) ✓ — except Issue 1 above
HLC cooldown math: physical-ms conversion via hlcPhysicalMs (§7.7d) ✓
Fenced SplitAtHLC: pre-allocated NextFenced() → OperationGroup.CommitTS (§7.7b) — confirmed OperationGroup.CommitTS exists (kv/transcoder.go:29-31) ✓
flag.Visit for bucket-count tracking: specified in §3.1 and §7.1 ✓
State-map GC: reconcile against live catalog each cycle (§7.7a) ✓
Compound budget/CAS: compound = 1 budget unit, CAS chaining, partial-completion safety (§6.4) ✓
All nits from round 2: lo/hi boundary (§5.4.1), metric failed vs skipped disambiguation (§7.3) ✓

Summary

Issues 1 and 2 are concrete correctness gaps in the M3-PR1 decoder-relaxation and struct-update story; both would be caught by the §8.1 codec round-trip tests if those tests are written first. Issue 3 is an underspecification to close before M3-PR2 (the detector). Once these three are addressed in the doc, M3-PR1 is ready to implement.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 59990ceaf6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T05:40:37Z

+
+Per parent doc §3.1.3 and §6.3, the split key comes from the **observed key distribution**, not a byte midpoint:
+
+1. **Default — sub-range p50.** When `--keyvizKeyBucketsPerRoute > 1`, use the route's sub-range bucket rows to find the bucket whose **cumulative load first crosses 50%**, and use the **`hi` (upper boundary)** of that bucket — `SubBucketBoundsFor(routeID, bucket)` returns `(lo, hi []byte, ok bool)` (`keyviz/sampler.go:559`); the split key is the `hi` of the median bucket, i.e. the boundary *between* the p50 bucket and the next one, so all of the p50 bucket's load stays in the left child. This places the boundary where load is actually halved.


Handle p50 buckets that end at the route boundary

When the median load falls in the last sub-bucket, SubBucketBoundsFor returns that bucket's hi as the route's own End (or nil for an unbounded tail), and §5.4 later requires split_key < route.End; SplitRange also rejects split keys at the boundary. In an upper-tail hotspot where the last bucket carries >50% of the load, the default --autoSplit path will therefore choose an invalid split key and drop the decision instead of splitting a common hot range. The p50 path needs an interior fallback for first/last median buckets, such as using the adjacent lower boundary when hi is the route end.

Useful? React with 👍 / 👎.

Already covered in §5.4.1 paragraph 2 ("Boundary-bucket fallback (review, codex P2)") immediately under the line you flagged: when the p50 bucket's hi equals route.End (or is nil for an unbounded tail), the detector falls back to the median bucket's lo — an interior boundary that places the dominant last bucket in the right child while satisfying route.Start < split_key < route.End. The symmetric first-bucket case (lo == route.Start) falls back to hi. §8.2 has both "p50 in the last bucket → interior lo fallback" and "p50 in the first bucket → interior hi fallback" tests. Could you confirm this resolves the finding?

chatgpt-codex-connector · 2026-06-11T05:40:37Z

+Per parent doc §3.1.3 and §6.3, the split key comes from the **observed key distribution**, not a byte midpoint:
+
+1. **Default — sub-range p50.** When `--keyvizKeyBucketsPerRoute > 1`, use the route's sub-range bucket rows to find the bucket whose **cumulative load first crosses 50%**, and use the **`hi` (upper boundary)** of that bucket — `SubBucketBoundsFor(routeID, bucket)` returns `(lo, hi []byte, ok bool)` (`keyviz/sampler.go:559`); the split key is the `hi` of the median bucket, i.e. the boundary *between* the p50 bucket and the next one, so all of the p50 bucket's load stays in the left child. This places the boundary where load is actually halved.
+2. **Single-key skew — isolate the hot key (compound split).** When Top-K is enabled (`--keyvizHotKeysEnabled`) and one key's share of the route's writes is `>= topKeyShare` (parent doc §6.3 default 0.8), split so the hot key `H` lands in its own single-key child. Isolating `H` from a route `[A, B)` takes **two boundaries** — one at `H` and one at `H`'s successor `H·0x00` — so the middle child is exactly `[H, H·0x00)` = `{H}`. The detector emits this as **one atomic compound decision** carrying both split keys, not two independent decisions across cycles. The scheduler executes it as a back-to-back pair of `SplitRange` calls (first `[A,B) → [A,H) + [H,B)`, then `[H,B) → [H, H·0x00) + [H·0x00, B)`), counted and budgeted as described in §6.4 (compound = 1 budget unit, cooldown exempt for the intermediate child). This is the single-hot-key hotspot case the whole feature targets.


Support hot keys already at the lower route boundary

This compound flow always starts by splitting at H, but when the hot key is already the route start (H == A) that first SplitRange is invalid because the existing handler rejects split keys equal to parent.Start. That case occurs naturally after a previous split at the hot key (or for a route whose lower bound is the key), and the detector's own boundary validation would drop the isolation even though a single split at H·0x00 would isolate {H}. The algorithm should special-case H == route.Start instead of requiring two boundaries unconditionally.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-11T05:40:37Z

+M3 adds one durable field to `RouteDescriptor`:
+
+- **`SplitAtHLC uint64`** — the **committed HLC timestamp of the `SplitRange` transaction that produced this route**. This must be a leader-fenced, Raft-committed timestamp, **not** an arbitrary `HLC.Next()` allocation (review correctly flagged this: `HLC.Next()` bypasses the physical-ceiling fence — `kv/hlc.go:110-128` — so a stale leader could mint a `SplitAtHLC` inside a window the ceiling fence is meant to forbid, the exact case the fence exists to prevent). The safe source is already at hand: the `SplitRange` handler commits the catalog mutation through `coordinator.Dispatch` (`adapter/distribution_server.go:229-235`), and the coordinator allocates that transaction's commit ts via the **fenced** path (`resolveTxnCommitTS`, `kv/sharded_coordinator.go:1091-1107` → `nextTxnTSAfter`, `:1365-1379` → `NextFenced()`). So `SplitAtHLC` is **defined to be that transaction's commit ts** — the same value the split mutation is durably applied at.
+  - **Mechanism (so the stored value equals the actual commit ts).** Today `CoordinateResponse` returns only `CommitIndex`, not the commit ts (`kv/coordinator.go:165-167`), so the handler cannot read the ts back after dispatch. Instead the handler **pre-allocates** the fenced commit ts via `HLC.NextFenced()` on the default-group leader, stamps it into both child `RouteDescriptor.SplitAtHLC` *before* encoding, and passes it as the explicit `OperationGroup.CommitTS` (with `StartTS = snapshot.ReadTS`). The coordinator honors a caller-supplied `CommitTS` (it `Observe`s it to keep the clock monotonic, `kv/sharded_coordinator.go:1098-1102`) rather than minting a fresh one, so the value encoded in the descriptors is provably the transaction's commit ts. If `NextFenced()` returns `ErrCeilingExpired` (the leader's physical ceiling has lapsed), the handler fails the `SplitRange` closed (`FailedPrecondition`) instead of writing an unfenced timestamp — a brief, safe rejection, not a corrupt record.


Allocate SplitAtHLC after observing the snapshot timestamp

Pre-allocating CommitTS with a single HLC.NextFenced() can still produce a timestamp <= snapshot.ReadTS after restart or leader change when the durable catalog's latest timestamp has the same physical HLC component but a higher logical counter than this process has observed. Dispatch rejects caller-supplied commit timestamps that are not greater than StartTS, so SplitRange would fail even though a valid fenced timestamp exists; mirror nextTxnTSAfter(readTS) semantics by observing readTS and retrying NextFenced() before stamping SplitAtHLC.

Useful? React with 👍 / 👎.

bootjp · 2026-06-11T06:35:02Z

@claude review

Round-3 review addressed in afc95e3b (doc-only). Per-issue dispositions — all six findings verified against the tree and accepted:

Issue 1 — second strict decoder check at catalog.go:213-215 (ACCEPTED). Confirmed: DecodeRouteDescriptor has two strict checks — the version byte at :200-202 and the unconditional trailing-bytes guard if r.Len() != 0 at :213-215. Relaxing only the version byte leaves the tail check rejecting every v2 record. §7.7c now names both checks as M3-PR1 scope and specifies the relaxed decoder precisely: version-byte → >= catalogRouteCodecVersionMin (= 1); read the 8-byte SplitAtHLC tail iff version >= 2; trailing-bytes check → drain unknown tail (io.Copy(io.Discard, r)) instead of erroring. This makes both directions safe: v2-reads-future-v3 and the load-bearing old-v1-era-reads-v2 case.

Issue 2 — CloneRouteDescriptor / routeDescriptorEqual must include SplitAtHLC (ACCEPTED). Confirmed: CloneRouteDescriptor (catalog.go:368-377) copies by field into a struct literal (every save goes through normalizeRoutes → prepareSave, so an omitted field is zeroed before encode); routeDescriptorEqual (catalog.go:693-700) is used by appendUpsertRouteMutations to skip re-encoding unchanged routes, so an SplitAtHLC-only delta would be silently dropped. Both are now explicit M3-PR1 checklist items in §7.7b and §8.1, with tests asserting clone-preserves and equal-distinguishes.

Issue 3 — now source for cooldown reconstruction (ACCEPTED — committed to wall clock). §7.7d now commits to time.Now().UnixMilli() for now, converting only SplitAtHLC via hlcPhysicalMs. Rationale spelled out: cooldown reconstruction is a non-ordering-sensitive heuristic throttle (CLAUDE.md diagnostics/metrics carve-out applies), so the wall clock is permitted; Next()/NextFenced() rejected (write/churn, can fail-closed); Current() rejected (stale on a freshly elected leader that hasn't ticked). The fence lives on SplitAtHLC, not now. §7.7b, §8.2, §8.3, §8.4#6, OQ-7 reconciled to match.

Plus three codex P2 inline findings at HEAD (all ACCEPTED):

p50 in the boundary sub-bucket (hi == route.End). SubBucketBoundsFor returns the route's own End for the last bucket (keyviz/sampler.go:570-575); validateSplitKey rejects splitKey == parent.End (distribution_server.go:370-371). §5.4.1 adds an interior fallback (median lo when hi >= route.End; symmetric hi when lo == route.Start).
Hot key already at lower boundary (H == route.Start). validateSplitKey rejects splitKey == parent.Start (distribution_server.go:358-360), so the two-boundary compound's first call is invalid. §5.4.2 special-cases H == route.Start to the single-split form (one SplitRange at H·0x00), with an already-isolated decline.
Pre-allocated SplitAtHLC can be <= snapshot.ReadTS. Confirmed: resolveTxnCommitTS rejects caller commitTS <= startTS (sharded_coordinator.go:1103-1105) without retrying. §7.7b now mirrors nextTxnTSAfter (:1365-1388) — NextFenced(), and if <= ReadTS, Observe(ReadTS) then re-allocate — guaranteeing CommitTS > StartTS.

Status stays Proposed; OQ numbering (1-8) unchanged. Codec round-trip test matrix (cases 1-7, both directions) added in §7.7c; §8.1/§8.2/§8.4 test rows updated.

claude · 2026-06-11T06:35:14Z

Claude finished @bootjp's task in 5m 43s —— View job

Review: M3 Automation Design Doc (Round 4)

Load tools
Read the updated M3 design doc (post afc95e3)
Verify all round-3 issues and codex P2 findings were addressed
Check for new issues introduced by the revisions
Post review findings

All six round-3 findings and the three codex P2 inline findings are cleanly addressed. Verified against the codebase:

Issue 1 (second strict decoder check at catalog.go:213-215) — both checks named in §7.7c, relaxed decoder precisely specified. ✓
Issue 2 (CloneRouteDescriptor/routeDescriptorEqual missing SplitAtHLC) — both named in §7.7b and §8.1 checklist. ✓
Issue 3 (now source for cooldown) — committed to time.Now().UnixMilli() in §7.7d, rationale clearly documented. ✓
Codex P2: p50 in boundary bucket — interior lo/hi fallback in §5.4.1. ✓
Codex P2: H == route.Start — single-split form in §5.4.2. ✓
Codex P2: pre-allocated SplitAtHLC <= ReadTS — Observe-and-retry mirrors nextTxnTSAfter (kv/sharded_coordinator.go:1365-1388). ✓

One new issue introduced by the §7.7c encoder spec.

Issue — `EncodeRouteDescriptor` early return will silently drop `SplitAtHLC` for `End == nil` routes (§7.7c, M3-PR1 scope)

The current encoder (distribution/catalog.go:168-193) has two separate return paths:

if route.End == nil {
    out = append(out, 0)   // tag = 0
    return out, nil        // ← early return
}
out = append(out, 1)
out = appendU64(out, uint64(len(route.End)))
out = append(out, route.End...)
return out, nil            // ← End != nil return

§7.7c says SplitAtHLC is "appended as a fixed 8-byte big-endian tail after the End block" — but "after the End block" only reads naturally as the second return, not the early return. If M3-PR1 appends appendU64(out, route.SplitAtHLC) at the bottom without first removing the early return, every route with End == nil will encode without its SplitAtHLC tail. The v2 decoder, seeing version == 2, will then try to read 8 bytes that aren't present and return ErrCatalogInvalidRouteRecord — corrupting any End == nil route in the catalog after the encoder bump lands.

This is load-bearing, not a corner case: the rightmost route in the keyspace always has End == nil. A split of that route (the most common auto-split target when the last range gets hot) produces a right child with End = parent.End = nil. That child would immediately be unreadable from the catalog.

The codec test matrix (cases 1–7) does not include an End == nil route in any case, so this encoder bug would not be caught by the listed tests.

Fix needed in §7.7c and §8.1:

Add the following to §7.7c, alongside the existing encoder bullet: "The current EncodeRouteDescriptor has an early return for End == nil (distribution/catalog.go:184-186). The v2 encoder must append SplitAtHLC on both the End == nil and End != nil paths. The idiomatic fix is to convert the early return to a conditional and append SplitAtHLC once at the end, after the conditional:

if route.End == nil {
    out = append(out, 0)
} else {
    out = append(out, 1)
    out = appendU64(out, uint64(len(route.End)))
    out = append(out, route.End...)
}
out = appendU64(out, route.SplitAtHLC)  // both paths
return out, nil

The decoder's version >= 2 branch reads the 8-byte tail after decodeRouteDescriptorEnd, which already handles both the endFlag == 0 (nil) and endFlag == 1 (present) cases."

Add to the §7.7c codec round-trip test matrix: Case 1a — v2 with End == nil, SplitAtHLC = T ≠ 0, encoded by the v2 encoder and decoded by the v2 (relaxed) decoder: SplitAtHLC == T, End == nil, no error. This is the case that catches the early-return bug.

Add to §8.1 M3-PR1 test checklist: "codec round-trip test includes at least one End == nil route, asserting SplitAtHLC round-trips correctly."

Confirmed Resolved

All prior-round findings are addressed. Once §7.7c/§8.1 are updated with the End == nil encoder guidance and test case, M3-PR1 is ready to implement.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: afc95e3b43

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T06:36:36Z

+  - **New code (v2 decoder) reading a future v3+ record** — reads `SplitAtHLC`, drains whatever v3 appended, returns a valid `RouteDescriptor`.
+  - **Old code (the v1-era decoder shipped *before* the bump) reading a v2 record** — this is the rolling-upgrade case the fence is about. The relaxed v1-era decoder, having no `version >= 2` branch, reads through `End`, then drains the 8-byte `SplitAtHLC` tail instead of erroring. It loses only the cooldown hint (equivalent to `SplitAtHLC = 0`, which is safe per (b)).
+- **Encoder.** `EncodeRouteDescriptor` writes `catalogRouteCodecVersion` (now `2`) and appends `SplitAtHLC` as a fixed 8-byte big-endian tail after the `End` block; `routeDescriptorEncodedSize` adds `catalogUint64Bytes` (`distribution/catalog.go:708`).
+- **Upgrade ordering fence (documented ops constraint).** Even with accept-and-ignore, the safe rollout is: deploy the relaxed-decoder build (the relaxation of *both* checks above) to **all** nodes first; only then does any node begin issuing `SplitRange` writes that emit v2 records. Because M3-PR1 ships the decoder relaxation and the v2 writer is only exercised once a split runs (M3-PR3 wires the scheduler; manual `SplitRange` is operator-initiated), the practical fence is "complete the M3-PR1 rollout cluster-wide before issuing any split." This is stated in the M3-PR1 PR description and the §8.4 acceptance criteria.


Gate v2 route writes during rolling upgrades

In a rolling upgrade where one M3-PR1 node is the default-group leader and an operator uses the already-existing manual SplitRange before every node has the relaxed decoder, this fence is not enforced: buildCatalogSplitOps always encodes children through distribution.EncodeRouteDescriptor (adapter/distribution_server.go:244-259), so the encoder bump in the same PR can write v2 records while current pre-M3 nodes still reject both a non-1 version byte and trailing bytes in DecodeRouteDescriptor. That can break catalog snapshots/watchers on the not-yet-upgraded nodes despite the doc calling the change rolling-upgrade safe; keep the writer on v1/gated until cluster-wide decoder rollout, or split decoder relaxation and v2 emission into separate releases.

Useful? React with 👍 / 👎.

bootjp · 2026-06-11T06:46:36Z

Addressed the remaining review issue: §7.7c encoder spec now requires appending the SplitAtHLC tail on BOTH End paths (the End==nil early return at distribution/catalog.go:185 would otherwise skip it; rightmost route / split right child always hit it), with the if/else restructure + Go snippet; added codec matrix Case 1a (v2 End==nil round-trip) and the §8.1 M3-PR1 checklist line requiring an End==nil round-trip test. Doc-only, Status: Proposed. @claude review

claude · 2026-06-11T06:46:51Z

Claude finished @bootjp's task in 5m 37s —— View job

Review: M3 Automation Design Doc (Round 5)

Load tools
Read the updated M3 design doc (post b8df8eb)
Verify round-4 issue was properly addressed
Check for new issues introduced by the revisions
Post review findings

Confirmed Resolved — Round 4

End == nil encoder early-return (round-4 Issue) — cleanly addressed. The doc now:

Identifies the exact early return at distribution/catalog.go:185 that would have silently dropped the SplitAtHLC tail for End == nil routes (the rightmost route and every split right child).
Specifies the if/else restructure with a precise Go snippet — both the End == nil and End != nil paths fall through to appendU64(out, route.SplitAtHLC) before return, so the tail is appended unconditionally.
Adds codec matrix Case 1a (v2, End == nil, SplitAtHLC = T ≠ 0 → v2 relaxed decoder round-trips without error, End == nil, SplitAtHLC == T).
Adds an §8.1 checklist line requiring "at least one End == nil route" in the round-trip test suite, so the early-return bug can never silently re-appear.

The decoder side is correctly unaffected: decodeRouteDescriptorEnd already consumes the end-flag and optional End bytes for both endFlag == 0 and endFlag == 1 before the version >= 2 tail-read, so the 8-byte tail always lands at the same offset regardless of the End == nil / non-nil case. No special-casing is needed in the decoder for Case 1a.

One Remaining Issue — Codex Comment Still Unaddressed

The codex comment posted on afc95e3b ("Gate v2 route writes during rolling upgrades") was not addressed in the round-4 trigger comment and is still open.

The concrete problem: M3-PR1 ships both the decoder relaxation and the v2 encoder bump in the same binary (catalogRouteCodecVersion bumped from 1 to 2, encoder write at distribution/catalog.go:175, decoder check at :200-202). This means:

A node upgraded to M3-PR1 gets both the relaxed decoder and the v2 encoder simultaneously.
In a rolling upgrade where some nodes are M3-PR1 and some are not, if an operator issues a manual SplitRange while the cluster is mid-upgrade, the M3-PR1 leader encodes the child RouteDescriptors with version = 2.
Pre-M3 nodes at catalog.go:200 check raw[0] != 1 → ErrCatalogInvalidRouteRecord. The Snapshot / watcher apply fails on those nodes, producing route-not-found errors until the upgrade completes.

The doc's current response (§7.7c "Upgrade ordering fence") says: "the practical fence is 'complete the M3-PR1 rollout cluster-wide before issuing any split.'" This is a documented ops constraint, not a technical one. The codex asked for technical enforcement, and the doc doesn't provide it.

Gap: the doc's round-3 revision explicitly wrote "shipping an accept-and-ignore-forward-version decoder first, then the v2 encoder" (from bootjp's round-3 dispositions), implying the two should be sequenced at the release level — but the M3-PR1 table ships both. There is an inconsistency between the intent and the plan.

Recommended fix (two options, choose one and state it explicitly):

Option A — split M3-PR1 into two PRs (technically enforces the ordering):

PR	Scope
M3-PR1a	Decoder relaxation only: relax both strict checks (version-byte `>= min`, drain unknown tail). No encoder change; no `SplitAtHLC` field; `catalogRouteCodecVersion` stays `1`. Roll this cluster-wide.
M3-PR1b	Encoder bump to `v2`, `SplitAtHLC` field in `RouteDescriptor`, `CloneRouteDescriptor` / `routeDescriptorEqual` updates, `SplitRange`-handler fenced-ts write, `hlcPhysicalMs` helper, metric scaffolding. Deploy after PR1a completes everywhere.

Once every node has PR1a, no pre-M3 decoder exists in the cluster: a PR1b-emitted v2 record is safe to decode everywhere. The "decoder first" intent is technically enforced rather than advisory.

Option B — keep M3-PR1 as one PR but document the failure mode explicitly. If the single-PR approach is retained, the "upgrade ordering fence" paragraph should explicitly state what happens if the constraint is violated: "if a manual SplitRange is issued while any node has not yet received the relaxed-decoder M3-PR1 build, that node's catalog snapshot/watcher will fail to decode the v2 child routes, causing route-not-found errors on that node until its upgrade completes. The fence is advisory; no code-level gate prevents a v2 write during a partial rollout." This gives M3-PR1 implementers and operators a clear consequence, not just an instruction.

Either option closes the gap. Option A is preferred because it makes the upgrade safe by construction; Option B is acceptable if the single-PR release boundary is a hard constraint.

Minor Nit

Codec matrix is missing Case 4a (old-reads-new direction with End == nil). The load-bearing rolling-upgrade case is Case 4 (v2 → v1-era relaxed decoder). Case 1a covers End == nil but only for the v2 decoder direction. Consider adding:

#	Encoder	Decoder	Expectation
4a	v2, `End == nil`, `SplitAtHLC = T ≠ 0` (hand-built)	v1-era relaxed decoder (no `version >= 2` branch, drains tail)	decodes without error; `End == nil`, `SplitAtHLC == 0` (tail drained); covers the rightmost-route rolling-upgrade direction

Case 4a is not strictly necessary (the individual pieces are covered by Cases 1a and 4), but it explicitly exercises the combination that matters most for the rightmost-route mid-upgrade scenario.

Summary

The round-4 issue is cleanly resolved — the if/else encoder restructure, Case 1a, and §8.1 checklist item are all correct and complete. One item remains: the codex comment on afc95e3b about technical enforcement of the decoder-before-encoder rollout ordering. Addressing it (Option A or Option B above) is the last gate before M3-PR1 is ready to implement.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b8df8eb9e4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T06:49:11Z

+N          = candidateWindows            // trailing flush columns summed (default 3, §5.2)
+Step_s     = keyvizStep.Seconds()        // flush-column width in seconds (default 60)
+write_rate = (Σ write_ops over N columns) / (N × Step_s) × 60   // ops/min
+read_rate  = (Σ read_ops  over N columns) / (N × Step_s) × 60   // ops/min
+score      = write_rate × Ww + read_rate × Wr                   // weighted ops/min


Separate scoring window from consecutive-window hysteresis

With the default candidateWindows = 3, this formula averages the last 3 columns and then §5.2 requires that averaged score to be over threshold for 3 consecutive flush columns. That no longer implements “3 consecutive windows over threshold”: a cold window can be hidden by the moving average and still advance the counter, while cold start needs 5 hot flushes before promotion because the counter does not advance until 3 columns exist. Use per-column scores for the consecutive test or split this into separate scoreWindowColumns and candidateWindows knobs so hysteresis matches the documented trigger.

Useful? React with 👍 / 👎.

…odec plan

bootjp · 2026-06-11T07:03:45Z

@claude review

Adopted Option A (the round-5 recommendation, and codex afc95e3b's "split decoder relaxation and v2 emission into separate releases"): the codec work in §7.7c is now split into two separately-deployed PRs — M3-PR1a ships the decoder relaxation alone (catalogRouteCodecVersion stays 1, no SplitAtHLC field, no encoder/helper change) and rolls out cluster-wide first; M3-PR1b then bumps the encoder to v2, adds SplitAtHLC, the fenced-ts SplitRange write, the helper updates, and hlcPhysicalMs. After PR1a is everywhere, no strict pre-relaxation decoder remains, so the first v2 record PR1b emits is decodable on every node by construction — the decoder-before-encoder ordering is now enforced at the release boundary, not by a documented ops constraint. I picked A over B (document the failure mode only) because B is advisory and does not provide the technical enforcement codex asked for. I considered the heavier runtime capability-fan-out gate (in-tree precedent: the encryption cutover's GetCapability fan-out / ErrCapabilityCheckFailed, docs/design/2026_05_25_implemented_6d6c3a_capability_fanout_wiring.md) and documented it as the fallback only if the two PRs are ever forced into one release — the two-PR split removes the unsafe window without new wire surface. §7.7c now states the enforcement mechanism, its failure mode if the deploy gate is violated (transient route-not-found on the lagging node, not data loss — the v2 record is well-formed), and the tests that pin it; the §8.1 milestone table (PR1a/PR1b rows), the codec matrix (added Case 4a for the rightmost-route End == nil old-reads-new direction; matrix authored in PR1a before the encoder exists), §8.4 acceptance criterion 10, and the §7.7d Scoping/lifecycle references were updated consistently. Status remains Proposed; doc-only. Pushed 545e718.

claude · 2026-06-11T07:04:00Z

Claude finished @bootjp's task in 4m 44s —— View job

Review: M3 Automation Design Doc (Round 6)

Load tools
Read the updated M3 design doc (post 545e718)
Verify round-5 issue (Option A decoder-before-encoder split) was addressed
Analyze new codex P2 issue (scoring window vs hysteresis window separation)
Post review findings

Confirmed Resolved — Round 5

Option A (decoder-before-encoder split, round-5 Issue) — cleanly addressed in 545e718. Verified across the doc:

§7.7c now specifies M3-PR1a (decoder relaxation only, catalogRouteCodecVersion stays 1, no v2 emission) and M3-PR1b (v2 encoder + SplitAtHLC) as two separately-deployed PRs, with the "why this is enforcement and not advice" rationale and the failure-mode-if-violated described precisely (transient route-not-found, not data loss). ✓
§8.1 PR breakdown table has explicit PR1a and PR1b rows, each with scope, tests, and Independently shippable? columns. The "deploy after PR1a is cluster-wide" note is in the PR1b row. ✓
§7.7c codec matrix has Case 4a (v2, End == nil, SplitAtHLC = T ≠ 0, hand-built → M3-PR1a relaxed decoder: decodes without error, SplitAtHLC == 0, covers rightmost-route mid-upgrade direction). ✓
§8.4 acceptance criterion 10 states the two-PR gate, the "by construction" enforcement claim, the failure mode, and explicitly names cases 4/4a in the codec matrix as the test pins. ✓
The capability-fan-out alternative (GetCapability, ErrCapabilityCheckFailed) is named as the fallback only if PR1a/PR1b are ever forced into one release. ✓

All prior-round findings are cleanly resolved. The doc is structurally complete.

Issue — Scoring Window and Hysteresis Window Are Coupled; Behavior Diverges from Parent Doc §6.2 (§5.1, §5.2, M3-PR2 scope)

This is the codex P2 comment from commit b8df8eb that was not addressed in 545e718. It is a real design defect, not a style nit.

The problem — two symptoms from one cause.

§5.1 defines score as the N-column rolling average rate, where N = candidateWindows = 3. §5.2 then advances consecutiveOver once per new flush column when score >= threshold — i.e., when the N-column rolling average is above threshold.

Symptom 1 — a cold column advances the counter. The rolling average smooths over N columns, so a single cold column can be absorbed by two flanking hot columns and still yield an above-threshold average. Example with N = 3 and threshold T:

Column	ops/min	Rolling avg (c_i-2, c_i-1, c_i)	consecutiveOver
1	2T	—	0 (< 3 cols exist)
2	2T	—	0
3	2T	avg(2T,2T,2T) = 2T ≥ T	1
4	0	avg(2T,2T,0) = 4T/3 ≥ T	2 (column 4 was cold!)
5	0	avg(2T,0,0) = 2T/3 < T	reset to 0

Column 4 was cold (0 ops) and still advanced consecutiveOver. If columns 4 and 5 were both 0.7T instead of 0, the average would be avg(2T, 0.7T, 0.7T) ≈ 1.13T ≥ T, advancing the counter for two cold columns and promoting after column 5 even though the most recent two flush windows were below threshold. This defeats the purpose of "3 consecutive windows over threshold."

Symptom 2 — cold start requires 2N−1 hot flushes, not N. §5.2 says the counter "does not advance until at least N committed columns exist." With N = candidateWindows = 3, the earliest first advance is when column 3 arrives; the third advance (promotion) is when column 5 arrives. Cold start always requires 5 hot flushes (2N−1 = 5 for N = 3), not the 3 that the parent doc §6.2 mandates ("Candidate after 3 consecutive windows over threshold").

Root cause. The score formula ties N to two distinct purposes: (a) smoothing the rate for display/logging, and (b) defining what "one window over threshold" means for the hysteresis counter. When N > 1, a single above-threshold column score is averaged with its N−1 neighbors, and the counter sees the averaged signal, not the individual-column signal. The parent doc intent is that each individual flush window either passes or fails the threshold check — not that the average of N windows passes.

Recommended fix — per-column score for the consecutive test; N-column average for display only.

Separate the two uses of N:

// Per-column score (used for the consecutiveOver decision):
per_col_write_rate = column_i.write_ops / Step_s × 60   // ops/min for this one column
per_col_read_rate  = column_i.read_ops  / Step_s × 60
per_col_score      = per_col_write_rate × Ww + per_col_read_rate × Wr

// N-column smoothed score (used for display/log only):
smoothed_score     = (Σ write_ops over N cols)/(N×Step_s)×60 × Ww + ...  (§5.1 formula, unchanged)

Advance/reset consecutiveOver based on per_col_score against threshold (and the hysteresisDownFactor band). Log and emit smoothed_score for the operator-facing output. With this:

Cold start: 3 individual hot columns → consecutiveOver = 3 → PROMOTE ✓ (matches parent doc §6.2 exactly; the "wait until N columns exist" guard can be removed because there is no rolling-average dependency on N prior columns)
Cold column: per_col_score < threshold resets consecutiveOver immediately ✓
Down-side band: per_col_score < 0.8 × threshold clears; per_col_score ∈ [0.8T, T) holds — same semantics as today, just applied per column ✓

The §5.1 note "if a future need arises to decouple them, add --autoSplitWindowColumns defaulting to candidateWindows — not done in M3" should be replaced with the explicit decoupled definition above. The --autoSplitWindowColumns flag can still exist for the smoothed display score, but the promotion decision always uses the per-column score (window = 1), matching the parent doc.

What needs to change in the doc:

§5.1 — distinguish per_col_score (hysteresis input) from smoothed_score (display). Remove or repurpose the "decouple in the future" footnote.
§5.2 — update to say consecutiveOver advances when per_col_score >= threshold (not the N-column rolling average score).
§5.2 cold-start guard — with per-column scoring the guard "don't advance until N columns exist" is no longer mathematically required (each individual column is self-contained); it can be removed, reducing cold start from 2N−1 to N hot flushes.
§8.2 tests — add the regression case the codex specifies: "N−1 hot columns followed by 1 cold column followed by N hot columns: the cold column resets the counter; total hot columns required for promotion is N, not N + (N−1) = 2N−1."

Summary

Round-5 issue is cleanly resolved — the Option A two-PR codec split is precisely specified and technically enforced. One issue remains: the codex P2 "Separate scoring window from consecutive-window hysteresis" comment on b8df8eb is unaddressed in 545e718. It is a behavioral divergence from the parent doc §6.2 requirement: the rolling-average score lets cold columns advance consecutiveOver, and cold start requires 2N−1 = 5 hot flushes instead of N = 3. The fix is to use per-column scores for the promotion decision and keep the N-column average only for display — a one-paragraph change in §5.1–5.2 plus one regression test in §8.2.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 545e71880d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T07:06:38Z

+
+### 6.3 Concurrency with in-flight splits
+
+The scheduler issues at most `maxSplitsPerCycle` (§5.5) `SplitRange` calls per cycle and waits for each to return (same-group splits are synchronous; cross-group returns a `job_id` immediately per M2 §5.1). A route with an in-flight cross-group `SplitJob` (not yet `DONE`) is excluded from candidacy until the job completes, so the detector cannot stack a second split on a route mid-migration.


Keep the per-cycle RPC cap consistent

In the single-key path, §6.4 now explicitly executes two back-to-back SplitRange RPCs while counting the compound decision as one budget unit, so this statement is no longer true under the default maxSplitsPerCycle = 1. If an implementer follows this line literally, the second split needed to isolate {H} will be blocked; if they follow §6.4, the advertised per-cycle SplitRange call cap is not actually enforced. Fresh evidence after the prior cooldown fix is that §6.3 still caps RPC calls rather than logical split decisions, so the design should either separate the logical-route budget from an RPC cap or explicitly exempt compound isolation here.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-11T07:06:39Z

+}
+```
+
+`flag.Visit` (and the equivalent `flag.CommandLine.Lookup("…")` presence check) is already the idiomatic Go answer here; no config-wrapper type is needed. The same `keyBucketsExplicit` computation is mirrored in `cmd/server/demo.go` (which has its own flag set, §7.1). With this, the implied-vs-explicit config is:


Drop Lookup as a flag-presence check

flag.CommandLine.Lookup("keyvizKeyBucketsPerRoute") is not equivalent to the flag.Visit check above: Go's flag package defines Visit as visiting only flags that have been set, while Lookup returns the named flag if it is defined, regardless of whether the operator supplied it. If the implementation follows this parenthetical, keyBucketsExplicit will always be true and bare --autoSplit will keep the default bucket count of 1, reintroducing the no-split default that this section is trying to eliminate.

Useful? React with 👍 / 👎.

bootjp · 2026-06-11T07:13:45Z

@claude review

Addressed codex P2 "Separate scoring window from consecutive-window hysteresis" (comment 3393859791) and the matching round-6 claude finding by splitting the score into two pinned formulas in §5.1: a per-column score (a single committed flush column's weighted ops/min) that is now the only input to the §5.2 hysteresis counter, and a smoothed N-column score kept for operator-facing slog/display only — it no longer feeds any decision. I chose per-column-for-hysteresis (codex's simpler option) over splitting into scoreWindowColumns/candidateWindows knobs because it restores the literal parent-doc §6.2 "N consecutive hot columns" meaning with the smallest surface: a cold column can no longer hide inside a moving average, and cold start now promotes after exactly candidateWindows = 3 hot columns instead of 2N−1 = 5. Removed the now-unnecessary cold-start "wait until N committed columns exist" guard (§5.2, §4.2). Reconciled OQ-8's resolution to cite both formulas and state that hysteresis reads the per-column score. Updated §8.2 tests (cold-start-promotes-after-exactly-3, cold-column-resets-counter, and a hot/hot/cold run whose smoothed average stays above threshold must NOT promote), the rapid invariants, acceptance criterion 3, and the §7.1 --autoSplitCandidateWindows flag note.

Also folded in the two new codex P2 comments on this HEAD:

"Drop Lookup as a flag-presence check" (3393944493, §3.1): removed flag.CommandLine.Lookup as an "equivalent" presence check — it returns a defined flag regardless of whether it was set, which would make keyBucketsExplicit always true and reintroduce the no-split-by-default trap. flag.Visit (set-only) is now the sole correct mechanism.
"Keep the per-cycle RPC cap consistent" (3393944486, §6.3): maxSplitsPerCycle now explicitly budgets logical split decisions (distinct hot routes), not raw SplitRange RPCs, so a compound single-key isolation (1 decision / 2 RPCs) is unblocked under the default of 1; the bounded RPC count is maxSplitsPerCycle typically and 2× worst case. §5.5 and the §7.1 flag note match.

Status remains Proposed; doc-only.

claude · 2026-06-11T07:14:00Z

Claude finished @bootjp's task in 4m 1s —— View job

Review: M3 Automation Design Doc (Round 7)

Load tools
Read the updated M3 design doc (post 9c528c7)
Verify round-6 issue (per-column score / hysteresis separation) was addressed
Verify codex P2 "Drop Lookup" (§3.1) was addressed
Verify codex P2 "Per-cycle RPC cap consistent" (§6.3) was addressed
Check for new issues introduced by the revisions
Post review findings

All three round-6 / codex P2 items are cleanly resolved. Two new issues in the revised §5.1 pseudocode are worth fixing before M3-PR2.

Confirmed Resolved — Round 6

"Separate scoring window from consecutive-window hysteresis" (§5.1–5.2, codex P2) — cleanly addressed. Verified across the doc:

§5.1 now pins two distinct formulas: (A) per-column score (per_col_score = per_col_write_rate × Ww + per_col_read_rate × Wr, one column's weighted ops/min, drives the §5.2 counter) and (B) smoothed score (smoothed_score = smoothed_write_rate × Ww + smoothed_read_rate × Wr, trailing-N-column average, display/slog only). The rationale — cold column hiding inside the N-column average, 2N−1 cold-start — is fully documented. ✓
§5.2 advances consecutiveOver on the per-column score, the cold-start guard is removed, and the evalInterval < Step no-op rule (skip cycles where the freshest At is unchanged) is stated. ✓
§8.2 tests adds three new regression cases: cold-start promotes after exactly candidateWindows = 3 columns (not 2N−1 = 5); cold column resets; hot/hot/cold where the 3-column smoothed average stays above threshold must not promote. ✓
Rapid invariant updated: "a route is promoted iff its per-column score cleared threshold for candidateWindows consecutive committed columns … regardless of the smoothed average." ✓
OQ-8 RESOLVED with both formulas cited, hysteresis tied to per-column score. ✓
Acceptance criterion 3 updated to say "per-column score … fewer than candidateWindows consecutive committed flush columns" and "a single cold column inside the run resets the counter even if the smoothed average is still above threshold." ✓

"Drop Lookup as a flag-presence check" (§3.1, codex P2) — cleanly addressed. §3.1 now explicitly names flag.CommandLine.Lookup as the wrong mechanism ("returns the named flag whenever it is defined … regardless of whether the operator supplied it") and states flag.Visit (set-only) is the sole correct presence check. The Go snippet uses flag.Visit. ✓

"Keep the per-cycle RPC cap consistent" (§6.3, codex P2) — cleanly addressed. maxSplitsPerCycle now explicitly budgets "logical split decisions (distinct hot routes relieved), not raw SplitRange RPCs" in §5.5, §6.3, and the §7.1 flag description. The compound isolation (1 decision / 2 RPCs) is called out as consuming 1 budget unit under the default. ✓

New Issue 1 — Formula (B) has an undefined variable: `smoothed_rate` is defined but `smoothed_write_rate`/`smoothed_read_rate` are used (§5.1, M3-PR2 scope)

Formula (A) correctly defines two variables then combines them:

per_col_write_rate = column_i.write_ops / Step_s × 60
per_col_read_rate  = column_i.read_ops  / Step_s × 60
per_col_score      = per_col_write_rate × Ww + per_col_read_rate × Wr

Formula (B) defines a single combined variable but then uses two different, undefined variables in the final expression:

smoothed_rate  = (Σ ops over the trailing N committed columns) / (N × Step_s) × 60   // ← defined here
smoothed_score = smoothed_write_rate × Ww + smoothed_read_rate × Wr                   // ← uses undefined variables

smoothed_rate (line 197) is defined but never referenced. smoothed_write_rate and smoothed_read_rate (line 198) are used but never defined. An implementer reading the pseudocode literally faces two plausible interpretations:

Wrong: use the single defined variable for both weights → smoothed_rate × (Ww + Wr), which mixes read and write ops and produces the wrong number.
Correct (by inference): treat smoothed_rate as a pattern to apply separately for writes and reads, analogous to the (A) formula.

The second interpretation is right but requires inference; the first is a natural reading of "use the defined variable." This is a notation bug introduced in this round.

Fix: restructure (B) analogously to (A):

smoothed_write_rate = (Σ write_ops over the trailing N committed cols) / (N × Step_s) × 60
smoothed_read_rate  = (Σ read_ops  over the trailing N committed cols) / (N × Step_s) × 60
smoothed_score      = smoothed_write_rate × Ww + smoothed_read_rate × Wr

Remove the combined smoothed_rate line. This makes (B) parallel in structure to (A) and leaves no undefined variables in the pseudocode.

New Issue 2 — §5.2 references `evalInterval > Step` tests that are absent from §8.2

§5.2 explicitly states: "if evalInterval > Step and several columns accumulate between cycles, the detector processes each new column once, evaluating each column's own per-column score … see §8.2 tests."

But §8.2 lists only the symmetric direction: "the evalInterval < Step case where extra cycles with no new flush column must not advance consecutiveOver." The evalInterval > Step multi-column accumulation case is referenced but not present.

This means M3-PR2 can pass all listed §8.2 tests while implementing only the "skip unchanged-At cycle" direction — missing the forward direction entirely. A detector that only no-ops when freshestAt is unchanged (but does not iterate over multiple new columns when evalInterval > Step) would pass every listed test case.

Fix: add the missing case to §8.2:

evalInterval > Step multi-column accumulation: with evalInterval = 2 × Step, candidateWindows = 3, after one evaluation cycle where 2 new flush columns have committed since lastProcessedAt, the detector evaluates each of the 2 new columns in chronological order and advances consecutiveOver twice (assuming both per-column scores are above threshold); a further single hot column on a subsequent cycle reaches consecutiveOver = 3 and promotes. This confirms the counter counts columns, not cycles, in the forward direction.

Nit

§5.1 smoothed_rate comment says "ops/min" but the variable immediately disappears from the pseudocode (replaced by the undefined smoothed_write_rate/smoothed_read_rate). Once Issue 1 is fixed this becomes "weighted ops/min" in the final expression, matching the (A) formula comment. Minor cleanup after the fix.

Summary

All three prior-round items are cleanly resolved. Two new issues — a concrete undefined-variable notation bug in the formula (B) pseudocode (Issue 1) and a missing §8.2 test case referenced by §5.2 (Issue 2) — are both in §5.1/§5.2 scope and should be addressed before M3-PR2 begins implementation. Issue 1 is load-bearing: the pseudocode as written has a plausible wrong interpretation that produces an incorrect score.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9c528c7854

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T07:17:57Z

+### 5.5 Guardrails
+
+- **Minimum route size** (`minRouteSpan`, OQ-4): a route whose observed key span (distance between min and max observed key, or sub-range extent) is below a floor is not split — splitting a tiny range yields children too small to relieve load and churns the catalog. Because the engine has no per-route key-count, this is approximated from the sub-range/Top-K evidence; when that evidence is absent the guard is conservatively "do not split" (matches §5.4.3).
+- **Max routes** (`maxAutoRoutes`, default e.g. 1024): once the catalog holds `maxAutoRoutes` routes, the detector stops proposing new splits and emits `autosplit_skipped_total{reason="route_cap"}`. Prevents unbounded catalog growth (and bounds the watcher fan-out and the Composed-1 history-ring pressure noted at `distribution/engine.go:53-58`).


Reserve route-cap headroom for compound splits

When a single-key isolation is scheduled while the catalog has maxAutoRoutes - 1 live routes, this guard still allows the decision, but the compound flow in §5.4.2 performs two splits and increases the live route count by two. That leaves the catalog at maxAutoRoutes + 1, violating the advertised cap and the bounded watcher/history pressure it is meant to protect; the detector should check liveRoutes + decisionRouteDelta <= maxAutoRoutes (delta 1 for normal splits, 2 for compound isolation) before scheduling.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-11T07:17:57Z

+### 6.2 Same-group today; least-loaded target once M2 lands
+
+- **M3 standalone (no M2):** `SplitRange` does a same-group split — both children stay in the source group. This already relieves a hot **range** by halving the key span each child serves and is the first-value deliverable. (It does not relieve a single-group CPU/IO hotspot, since both children stay on the same Raft group — that needs M2's cross-group move.)
+- **Once M2 lands:** the scheduler selects `target_group_id` for the right child via a **least-loaded-group policy** and passes it on the same `SplitRange` RPC (M2's new field). "Least-loaded" is computed from the same keyviz per-group aggregate (sum of route scores per group) the detector already has, choosing the group with the lowest aggregate score among groups in `--raftGroups`, excluding the source group and any group already targeted by an in-flight `SplitJob`. The migration mechanics (the resumable job) are entirely M2's; the scheduler just chooses the target and fires the RPC, then tracks the returned `job_id` for observability.


Keep cross-group targeting out of the first compound split

When --autoSplitCrossGroup is later enabled, applying target_group_id on the same scheduler path means the first RPC of a compound hot-key isolation ([A,B) split at H) would make the intermediate right child [H,B) the cross-group/migrating child. The compound algorithm then immediately issues a second SplitRange on [H,B), but §6.3 says cross-group splits return a job_id immediately and routes with in-flight jobs are excluded, so the second split cannot be performed back-to-back. The design should either finish the compound isolation same-group first and only migrate a final child, or explicitly defer target selection until after isolation completes.

Useful? React with 👍 / 👎.

Already covered in §6.2 ("Compound single-key isolation is same-group-only, even under --autoSplitCrossGroup (codex P2)") and §6.4 (last bullet). Both SplitRange calls of a compound isolation always pass target_group_id = 0; cross-group target selection is never applied to a compound operation in progress because the intermediate child [H,B) would become an in-flight SplitJob target that §6.3 excludes from candidacy, stranding the isolation at compound_partial on its very first execution. Cross-group relocation of any of the three FINAL children is a separate, later decision (§6.2 second bullet) — only once they are RouteStateActive with no in-flight SplitJob. Could you confirm this resolves the finding?

chatgpt-codex-connector · 2026-06-11T07:17:57Z

+
+| PR | Scope | Tests | Independently shippable? |
+|---|---|---|---|
+| **M3-PR1a** (decoder relaxation, ship **first**) | Retire engine `RecordAccess`/`splitRange`/`Route.Load` threshold path (§3.4); confirm keyviz is the sole signal; **relax both strict decoder checks** — introduce `catalogRouteCodecVersionMin` (`= 1`), accept version byte `>= catalogRouteCodecVersionMin`, and drain (don't reject) any unknown trailing tail (§7.7c). **No `SplitAtHLC` struct field, no encoder change, no helper change; `catalogRouteCodecVersion` stays `1` and the encoder still writes v1.** This is the technical enforcement of the decoder-before-encoder ordering: it must reach every node before M3-PR1b deploys (§7.7c "Upgrade ordering fence", codex `afc95e3b`). | `distribution/engine_test.go` updated (dead-code removal); **codec matrix cases authored here, before the encoder exists** — case 3 (v1 still round-trips), **case 4 and case 4a (relaxed decoder reads a *hand-built* v2 record — incl. an `End == nil` v2 record — drains the 8-byte tail, no error, `SplitAtHLC == 0`)**, case 5 (forward-version v3 drained), case 6 (version `< min` rejected), case 7 (truncated tail rejected). | Yes — pure dead-code removal + decoder loosening with no wire-format emission change; v1 records still round-trip; no v2 record can yet be written. Safe to ship cluster-wide on its own. |


Align PR1a tests with the decoder it ships

This assigns the truncated-tail rejection case to M3-PR1a, but the same row says PR1a has no SplitAtHLC field and no encoder/field change, while §7.7c describes the PR1a decoder as a v1-era relaxed decoder that simply drains the unknown v2 tail. Such a decoder cannot distinguish an 8-byte v2 tail from a 1–7 byte truncated tail, so this test will either fail or force PR1a to know the v2 layout it is supposed to precede; move case 7 to PR1b or specify that PR1a already validates version >= 2 tail length.

Useful? React with 👍 / 👎.

bootjp · 2026-06-11T07:28:28Z

@claude review

Round 7 dispositions — all in §5.1/§5.2/§8.2, doc-only, Status: Proposed unchanged. Pushed a5fa123e.

Issue 1 (load-bearing) — FIXED. §5.1 formula (B) no longer has an undefined variable. Removed the single combined smoothed_rate line and restructured (B) to mirror (A): two named rates that are then combined.

smoothed_write_rate = (Σ write_ops over the trailing N committed cols) / (N × Step_s) × 60   // ops/min
smoothed_read_rate  = (Σ read_ops  over the trailing N committed cols) / (N × Step_s) × 60
smoothed_score      = smoothed_write_rate × Ww + smoothed_read_rate × Wr                       // weighted ops/min

Every variable is now defined; the weighted ops/min composition is unambiguous and parallel to (A), so the smoothed_rate × (Ww + Wr) mis-read is no longer possible.

Issue 2 — FIXED. Added the evalInterval > Step multi-column accumulation case to §8.2 (the forward direction §5.2 references): evalInterval = 2 × Step, candidateWindows = 3; one cycle commits 2 new columns since lastProcessedAt → detector evaluates each chronologically and advances consecutiveOver twice; a further single hot column reaches consecutiveOver = 3 and promotes — confirming the counter counts columns, not cycles, in the forward direction. A detector that only no-ops on unchanged At (the existing evalInterval < Step case) now fails this case.

Nit — FIXED. The standalone smoothed_rate line (commented "ops/min") is gone; the final smoothed_score line carries the "weighted ops/min" comment, matching formula (A). The two per-rate lines keep "ops/min" to match (A)'s per-rate comments.

No newer codex/gemini inline findings since 2026-06-11T07:40Z to fold in (pulls/951/comments query returned empty).

claude · 2026-06-11T07:28:45Z

Claude encountered an error after 1s —— View job

I'll analyze this and get back to you.

bootjp · 2026-06-11T08:40:53Z

@claude review

claude · 2026-06-11T08:41:12Z

Claude finished @bootjp's task in 3m 31s —— View job

Review: M3 Automation Design Doc (Round 8)

Load tools
Read the updated M3 design doc (post a5fa123)
Verify round-7 issues were addressed
Analyze three new codex P2 comments on 9c528c7
Check for any new issues introduced by the revisions
Post review findings

Confirmed Resolved — Round 7

Issue 1 (undefined variable in formula B, §5.1) — cleanly fixed. Formula (B) now mirrors (A) exactly: smoothed_write_rate and smoothed_read_rate are each defined (Σ over trailing N columns / (N×Step_s)×60), then combined into smoothed_score. The shared-variable smoothed_rate line is gone; no undefined variable remains in the pseudocode. ✓

Issue 2 (evalInterval > Step test case missing from §8.2) — cleanly fixed. §8.2 now contains the forward-direction case: with evalInterval = 2×Step and candidateWindows = 3, a single evaluation cycle that finds 2 new committed columns advances consecutiveOver twice (evaluating each chronologically); a further single hot column promotes. The counter counts columns, not cycles, in both directions. ✓

No new issues were introduced by the §5.1/§5.2/§8.2 revisions.

Three Codex P2 Comments from `9c528c7854` — Unaddressed in `a5fa123e`

The three codex P2 inline comments posted at 2026-06-11T07:17:57Z on commit 9c528c7854 were not folded into a5fa123e (the round-7 disposition comment says "No newer codex/gemini inline findings since 2026-06-11T07:40Z to fold in" — these comments predate that timestamp but are inline review comments rather than issue comments, so they may have been missed by the pulls/951/comments query). All three are valid:

Issue 1 — Route-cap headroom is wrong for compound splits (§5.5, codex P2 line 248)

§5.5 reads: "once the catalog holds maxAutoRoutes routes, the detector stops proposing new splits." The check is liveRoutes >= maxAutoRoutes. But a §5.4.2 compound single-key isolation produces three final children from one parent (net +2), not net +1. If the catalog has maxAutoRoutes - 1 live routes when the compound decision is scheduled, the two-RPC execution in §6.4 leaves the catalog at maxAutoRoutes + 1 — breaching the advertised cap that exists to bound watcher fan-out and the Composed-1 history-ring pressure (distribution/engine.go:53-58).

The fix is straightforward: the route-cap check must account for the delta of the decision type:

// before emitting a decision:
routeDelta := 1                          // normal split: 1 parent → 2 children, net +1
if isCompoundIsolation(decision) {
    routeDelta = 2                       // compound: 1 parent → 3 children, net +2
}
if liveRoutes + routeDelta > maxAutoRoutes {
    emit autosplit_skipped_total{reason="route_cap"}
    continue
}

The §5.5 guardrail description should be updated to state "the detector checks liveRoutes + decisionDelta <= maxAutoRoutes before emitting a decision (delta = 1 for normal splits, 2 for compound single-key isolation)." The §8.2 tests should add a case: schedule a compound isolation when liveRoutes == maxAutoRoutes - 1 → skipped with route_cap; schedule a normal split at the same live count → proceeds.

Issue 2 — Cross-group targeting is incompatible with compound isolation's first call (§6.2/§6.4, codex P2 line 262)

§6.4 executes a compound single-key isolation as two back-to-back SplitRange calls:

SplitRange([A,B), split_key=H) → [A,H) + intermediate child [H,B)
SplitRange([H,B), split_key=H·0x00) → [H, H·0x00) + [H·0x00, B)

When --autoSplitCrossGroup is later enabled (M3-PR4, §6.2), the scheduler passes a target_group_id on the SplitRange call. If it passes target_group_id on call 1, the intermediate child [H,B) becomes a cross-group migration target (an in-flight SplitJob, state RouteStateMigratingSource or RouteStateMigratingTarget). §6.3 explicitly excludes routes with an in-flight SplitJob from candidacy — and SplitRange on a migrating route is rejected. Call 2 cannot proceed. The compound isolation falls into the compound_partial path on the very first execution under cross-group mode, and [H,B) is stuck: it is too small to re-promote as a compound candidate and too large to match the already-isolated intent.

This is not a corner case — it fires on every compound single-key isolation when --autoSplitCrossGroup is on, making the cross-group + compound combination permanently broken without explicit doc guidance.

Required fix in §6.2 and §6.4: state explicitly that for compound single-key isolation the first call is always same-group (no target_group_id), regardless of --autoSplitCrossGroup. The target_group_id is applied only if any of the three final isolated children are subsequently evaluated for cross-group migration in a later decision — the compound operation itself is always resolved same-group first. A one-line constraint in §6.4: "compound single-key isolation always issues both SplitRange calls with target_group_id = 0 (same-group); cross-group target selection applies to isolated children in subsequent independent decisions, never to a compound operation in progress." §6.2 should note the same restriction in the M3-PR4 hook.

Issue 3 — Codec matrix case 7 assigned to M3-PR1a cannot be implemented by the PR1a decoder (§8.1, codex P2 line 452)

The codec matrix case 7:

v2 with a truncated tail (< 8 bytes after End) | v2 (relaxed) | rejected

§8.1 assigns case 7 to M3-PR1a alongside cases 3, 4, 4a, 5, 6. But the M3-PR1a decoder is defined as a v1-era decoder with no version >= 2 branch that simply drains the unknown v2 tail (§7.7c: "drain the reader… instead of erroring"). That decoder does not know that a v2 record's tail is supposed to be exactly 8 bytes — it just reads bytes until the reader is empty. Given a v2 record with only 4 bytes after End:

PR1a sees version byte 2 (>= min → accept).
PR1a has no version >= 2 branch: after decoding End, it drains the 4-byte tail.
PR1a returns a valid RouteDescriptor with SplitAtHLC = 0. No error. Case 7 would pass (no error observed) where it is supposed to fail (reject truncated tail).

The PR that can implement case 7 is M3-PR1b, which adds the explicit if version >= 2 { read exactly 8 bytes for SplitAtHLC } branch. A read of exactly 8 bytes on a record that only has 4 produces an io.ErrUnexpectedEOF, which the decoder maps to ErrCatalogInvalidRouteRecord. Case 7 is therefore only testable and meaningful after PR1b's version >= 2 branch exists.

Required fix: move case 7 from the M3-PR1a row to the M3-PR1b row in §8.1. The "truncated tail is corruption" invariant is enforced by the PR1b version >= 2 field-read, not by the PR1a drain. The PR1a test set remains cases 3, 4, 4a, 5, 6 (all of which are provable by the drain-unknown-tail decoder without any version-layout knowledge).

Summary

Both round-7 issues are cleanly resolved with no regressions. Three codex P2 comments on 9c528c7854 were not folded into a5fa123e and remain open: route-cap delta accounting for compound splits (§5.5), cross-group + compound isolation incompatibility (§6.2/§6.4), and case 7 test placement (§8.1). Issues 2 and 3 are both in the M3-PR4 and M3-PR1b scopes respectively and do not block M3-PR1a/PR2/PR3; Issue 1 is in M3-PR2/PR3 scope (the detector/scheduler that evaluates route cap). All three should be resolved in the doc before the relevant implementation PR begins.

…ope, case 7 placement

bootjp · 2026-06-11T08:50:07Z

@claude review

Folded the three remaining codex P2 findings from a5fa123e into dc2d52d4 (doc-only, Status: Proposed).

1. Route-cap delta accounting (§5.5, codex line 248) — Accepted. The cap is now checked against the decision's actual route delta, not the bare parent count. §5.5 "Max routes" now states the detector computes routeDelta per decision — +2 for a §5.4.2 two-boundary compound isolation (one parent → three children), +1 for a normal split or the H == route.Start single-split — and only emits when liveRoutes + routeDelta <= maxAutoRoutes. The old liveRoutes >= maxAutoRoutes would have admitted a compound at maxAutoRoutes - 1, leaving the catalog at maxAutoRoutes + 1 and breaching the cap. §8.2 adds the guardrail test (compound at maxAutoRoutes - 1 skipped with route_cap; normal split at the same count proceeds).

2. Cross-group + compound isolation incompatibility (§6.2/§6.4, codex line 262) — Accepted. Resolved by making compound single-key isolation same-group-only, even under --autoSplitCrossGroup. Both SplitRange calls (§6.4 steps 1–2) always pass target_group_id = 0. Carrying a target on call 1 would make the intermediate child [H,B) a cross-group SplitJob target — a route §6.3 excludes from candidacy and on which SplitRange is rejected — so call 2 could not run and the isolation would strand at compound_partial on its first execution under cross-group mode. Cross-group relocation now applies only to the three final isolated children, as independent later decisions once each is RouteStateActive with no in-flight SplitJob. Restated in §6.2 (M3-PR4 hook), §6.4 (execution steps + budget bullet), the §8.1 M3-PR4 row, and OQ-6.

3. Case 7 placement (§8.1, codex line 452) — Accepted. Moved codec round-trip case 7 (truncated-tail reject) from the M3-PR1a row to the M3-PR1b row. The PR1a drain-unknown-tail decoder accepts a < 8-byte tail by design (no version >= 2 branch → returns SplitAtHLC = 0, no error), so it cannot detect truncation; only PR1b's exact-8-byte read can (io.ErrUnexpectedEOF → ErrCatalogInvalidRouteRecord). PR1a's authored set is now cases 3, 4, 4a, 5, 6; PR1b adds case 7. The §7.7c matrix annotation, the case 7 table row (decoder column now "v2 (PR1b version >= 2 exact-8-byte read)"), and the load-bearing-cases sentence are all updated to mark case 7 as PR1b-scope.

Doc-only, Status: Proposed unchanged. Pushed SHA dc2d52d4.

claude · 2026-06-11T08:50:21Z

Claude finished @bootjp's task in 4m 19s —— View job

Review: M3 Automation Design Doc (Round 9)

Load tools
Read the updated M3 design doc (post dc2d52d)
Verify all round-8 issues were properly addressed
Check for any new issues introduced by the revisions
Post review findings

All three round-8 findings are cleanly addressed. One minor nit worth noting before M3-PR4.

Confirmed Resolved — Round 8

Route-cap delta accounting (§5.5, codex P2) — cleanly addressed. The guard is now liveRoutes + routeDelta <= maxAutoRoutes with routeDelta = +2 for a two-boundary compound isolation (1 parent → 3 children), routeDelta = +1 for a normal split or the H == route.Start single-split. §8.2 adds the regression cases: compound at liveRoutes == maxAutoRoutes − 1 is skipped with route_cap (delta +2 would breach); normal split at the same count proceeds (delta +1 is exactly at cap). ✓

Cross-group + compound isolation incompatibility (§6.2/§6.4, codex P2) — cleanly addressed. Both calls in §6.4 carry target_group_id = 0 explicitly. §6.2 names the constraint ("Compound single-key isolation is same-group-only, even under --autoSplitCrossGroup"), §6.4 repeats it with the mechanistic rationale (call 1 carrying a target_group_id would make the intermediate [H,B) a migrating-SplitJob route that §6.3 excludes from candidacy, blocking call 2). OQ-6 RESOLVED records the same restriction. §8.1 M3-PR4 row adds "a test that a compound isolation under --autoSplitCrossGroup still issues both calls same-group and a final child is migrated only in a later independent decision." ✓

Case 7 placement (§8.1, codex P2) — cleanly addressed. Case 7 (truncated tail rejected) is in the M3-PR1b row only. The PR1a row now says "Case 7 is not authored here — the PR1a drain-unknown-tail decoder accepts a short tail by design (no version >= 2 exact-read branch)…"; the PR1b row carries case 7 with the correct decoder label ("v2 (PR1b version >= 2 exact-8-byte read)"). The codec matrix annotation and the load-bearing-cases sentence are consistent. ✓

Nit — §6.4's "does not add to the live route count" is imprecise for subsequent cross-group operations

§6.4 ends with: "The route-cap budget for the compound is +2 routes (§5.5); a later cross-group move of a final child relocates an existing route and does not add to the live route count."

The intent is correct (the +2 accounts only for the compound itself; subsequent decisions are separate), but "does not add to the live route count" is not accurate for non-trivially-small final children. M2's SplitRange always splits a route into two children before migrating the right child (§1.2: "moving the right child of a split to another Raft group"). So a subsequent cross-group SplitRange on a final child like [A, H) yields [A, splitKey) + [splitKey, H) — net +1 route — exactly like any other SplitRange call. That +1 is covered by its own routeDelta check when the later decision runs, not by the compound's budget. No correctness problem; the compound cap and the subsequent cap are both correctly enforced independently. But the sentence as written implies the later operation is route-count-neutral, which misleads an M3-PR4 implementer into thinking they don't need a route-cap check on the cross-group step.

Suggested clarification: "The route-cap budget for the compound is +2 routes (§5.5); a subsequent independent decision to cross-group-migrate a final child is its own SplitRange call with its own routeDelta = +1 check at that later decision point."

Summary

All round-8 findings are cleanly resolved with no new substantive issues. The routeDelta accounting, the cross-group compound restriction, and the case 7 PR assignment are all correct and consistent across §5.5, §6.2, §6.4, §8.1, §8.2, §8.4, and OQ-6. The one nit above is a wording clarification in a forward-looking statement about M3-PR4 and does not block M3-PR1a through M3-PR3. The doc is ready for M3-PR1a to begin implementation.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dc2d52d4a5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T08:52:58Z

+
+1. **Default — sub-range p50.** When `--keyvizKeyBucketsPerRoute > 1`, use the route's sub-range bucket rows to find the bucket whose **cumulative load first crosses 50%**, and use the **`hi` (upper boundary)** of that bucket — `SubBucketBoundsFor(routeID, bucket)` returns `(lo, hi []byte, ok bool)` (`keyviz/sampler.go:559`); the split key is the `hi` of the median bucket, i.e. the boundary *between* the p50 bucket and the next one, so all of the p50 bucket's load stays in the left child. This places the boundary where load is actually halved.
+   - **Boundary-bucket fallback (review, codex P2).** `SubBucketBoundsFor` returns the **route's own `End`** as `hi` for the last sub-bucket (and `nil` for an unbounded tail; `keyviz/sampler.go:570-575` — the single-bucket case at `:571-573` likewise returns `start, end`). When the median (p50) bucket is the **last** bucket — an upper-tail hotspot where the final bucket alone carries `> 50%` of the load — its `hi` equals `route.End`, and §5.4's `split_key < route.End` rule plus `validateSplitKey`'s boundary rejection (`adapter/distribution_server.go:359-360,370-371`: `splitKey == parent.End` → `errDistributionSplitKeyAtBoundary`) would make that key invalid, so the detector would drop the decision and never split a genuinely hot tail. The detector therefore **falls back to the median bucket's `lo` (its lower boundary) when the chosen `hi` is `>= route.End` (or `route.End == nil`)**: an interior boundary that still places the dominant last bucket in the *right* child while satisfying `route.Start < split_key < route.End`. Symmetrically, if the p50 bucket is the **first** bucket and its `lo` equals `route.Start`, the detector uses that bucket's `hi` instead (it is necessarily interior because a route always has `> 1` sub-bucket on this path). If *both* candidate boundaries coincide with a route edge (degenerate single-sub-bucket route, only possible when `--keyvizKeyBucketsPerRoute == 1`, which this path is gated out of), the detector declines per §5.4.3. The §8.2 split-key tests add the "p50 in the last bucket → interior `lo` fallback" and "p50 in the first bucket → interior `hi` fallback" cases.
+2. **Single-key skew — isolate the hot key (compound split).** When Top-K is enabled (`--keyvizHotKeysEnabled`) and one key's share of the route's writes is `>= topKeyShare` (parent doc §6.3 default 0.8), split so the hot key `H` lands in its own single-key child. Isolating `H` from a route `[A, B)` normally takes **two boundaries** — one at `H` and one at `H`'s successor `H·0x00` — so the middle child is exactly `[H, H·0x00)` = `{H}`. The detector emits this as **one atomic compound decision** carrying both split keys, not two independent decisions across cycles. The scheduler executes it as a back-to-back pair of `SplitRange` calls (first `[A,B) → [A,H) + [H,B)`, then `[H,B) → [H, H·0x00) + [H·0x00, B)`), counted and budgeted as described in §6.4 (compound = 1 budget unit, cooldown exempt for the intermediate child). This is the single-hot-key hotspot case the whole feature targets.


Require absolute load for hot-key isolation

When a route is promoted by read load, this Top-K rule can still win on write share alone: a single incidental write to H gives H 100% of route writes even though the §5.1 score may be almost entirely reads. In that scenario the scheduler burns the split budget and cooldown isolating a write key that did not cause the hotspot, instead of using the sub-range p50 path to relieve the read-heavy range. Gate compound isolation on the hot key's absolute weighted contribution to the route score, or require a minimum/dominant write volume before preferring it.

Useful? React with 👍 / 👎.

Fixed in commit ce50ceb. §5.4.2 now gates hot-key isolation on share AND an absolute weighted-contribution floor (topKeyAbsoluteFloor, default = autoSplitThreshold/2 = 25_000 weighted ops/min on H_writes*Ww). A read-promoted route where a single incidental write to H gives H 100% of route writes no longer wins isolation — the share gate fires but the absolute floor doesn't, and the detector falls through to the §5.4.1 sub-range p50 path with autosplit_isolation_declined_total{reason="absolute_floor"}. So the split budget and cooldown go to relieving the actual read-heavy range, not isolating an incidental write key.

chatgpt-codex-connector · 2026-06-11T08:52:58Z

+### 5.5 Guardrails
+
+- **Minimum route size** (`minRouteSpan`, OQ-4): a route whose observed key span (distance between min and max observed key, or sub-range extent) is below a floor is not split — splitting a tiny range yields children too small to relieve load and churns the catalog. Because the engine has no per-route key-count, this is approximated from the sub-range/Top-K evidence; when that evidence is absent the guard is conservatively "do not split" (matches §5.4.3).
+- **Max routes** (`maxAutoRoutes`, default e.g. 1024): the cap bounds the **live** route count after a decision lands, so the detector must check it against the decision's actual **route delta**, not against the parent count alone (codex P2). A normal split turns one parent into two children — net **+1** route; a §5.4.2 compound single-key isolation turns one parent into **three** children (`[A,H)` + `{H}` + `[H·0x00,B)`) — net **+2** routes (the `H == route.Start` single-split special case in §5.4.2 is two children, net **+1**). The detector therefore computes `routeDelta` per decision (`+2` for the two-boundary compound isolation, `+1` for a normal split or the single-split isolation) and **only emits a decision when `liveRoutes + routeDelta <= maxAutoRoutes`**; otherwise it skips and emits `autosplit_skipped_total{reason="route_cap"}`. Checking the bare `liveRoutes >= maxAutoRoutes` would wrongly admit a compound isolation at `liveRoutes == maxAutoRoutes - 1`, leaving the catalog at `maxAutoRoutes + 1` and breaching the cap. This bounds catalog growth (and the watcher fan-out and the Composed-1 history-ring pressure noted at `distribution/engine.go:53-58`). The §8.2 guardrail tests add: a compound isolation scheduled at `liveRoutes == maxAutoRoutes - 1` is **skipped** with `route_cap` (delta `+2` would breach), while a normal split at the same live count **proceeds** (delta `+1` is exactly at the cap).


Reserve route-cap slots across the whole cycle

This check is correct for one decision, but when an operator raises maxSplitsPerCycle above 1 the detector can emit multiple decisions against the same starting liveRoutes. For example, with liveRoutes == maxAutoRoutes - 1, two normal splits each satisfy liveRoutes + 1 <= maxAutoRoutes individually, but together leave the catalog at maxAutoRoutes + 1. Track a cumulative reserved delta while selecting decisions in a cycle, or re-check the live route count after each scheduled split.

Useful? React with 👍 / 👎.

Fixed in commit ce50ceb. §5.5 now maintains a cycle-local reservedDelta accumulator initialised to 0 at the start of each evaluation cycle and incremented by each emitted decision's routeDelta; admission is liveRoutes + reservedDelta + decision.routeDelta <= maxAutoRoutes at every decision, not the bare per-decision check. With liveRoutes == maxAutoRoutes - 1 and maxSplitsPerCycle == 2, the first normal split admits and the second skips with route_cap. §8.2 adds the test case.

chatgpt-codex-connector · 2026-06-11T08:52:58Z

+### 6.2 Same-group today; least-loaded target once M2 lands
+
+- **M3 standalone (no M2):** `SplitRange` does a same-group split — both children stay in the source group. This already relieves a hot **range** by halving the key span each child serves and is the first-value deliverable. (It does not relieve a single-group CPU/IO hotspot, since both children stay on the same Raft group — that needs M2's cross-group move.)
+- **Once M2 lands:** the scheduler selects `target_group_id` for the right child via a **least-loaded-group policy** and passes it on the same `SplitRange` RPC (M2's new field). "Least-loaded" is computed from the same keyviz per-group aggregate (sum of route scores per group) the detector already has, choosing the group with the lowest aggregate score among groups in `--raftGroups`, excluding the source group and any group already targeted by an in-flight `SplitJob`. The migration mechanics (the resumable job) are entirely M2's; the scheduler just chooses the target and fires the RPC, then tracks the returned `job_id` for observability.


Avoid targeting groups with unknown load

In multi-group deployments, §4.1 says the default-group leader's local sampler has no data for routes served through another group's leader, so this local per-group aggregate can make unobserved groups look idle. Once --autoSplitCrossGroup is enabled, the least-loaded policy can therefore migrate a hot range into an already-overloaded group whose load is simply invisible to the default leader. Target selection should either use a cluster-wide/fan-out load source or exclude groups whose load is unknown from the leader-local view.

Useful? React with 👍 / 👎.

Fixed in commit ce50ceb. §6.2 now requires positive evidence of low load, not absence of evidence. M3 ships the local-view fast path: a target group is admitted only if the default-group leader is leader of at least one of the group's routes — groups with no local route ownership are categorically excluded as candidates with autosplit_target_skipped_total{reason="load_unknown"}. The cluster-wide cap_migration_v2-heartbeat fan-out extension is recorded under OQ-12 / M3.x as the eventual default. §8.2 adds the "node A has no local route ownership in groups B and C; cross-group skipped under load_unknown" test.

3 new findings from codex round-7 that needed doc updates; 2 others (line 233 boundary buckets, line 262 compound + cross-group, line 248 single-decision route delta) were already covered in the doc and will be answered with inline replies pointing at the already-written sections. Codex P2 line 235 — hot-key isolation by share alone wastes budget Top-K share-only gate (>= 0.8) wins on a single incidental write to H when the route score is almost entirely reads. Add an absolute weighted-contribution floor: isolation only fires when H_writes*Ww >= topKeyAbsoluteFloor (default = autoSplitThreshold/2 = 25_000 weighted ops/min). Share-only-fire-but-floor-skip falls through to the §5.4.1 sub-range p50 path and counts autosplit_isolation_declined_total{reason="absolute_floor"}. Codex P2 line 248 — route cap reservation across the cycle With maxSplitsPerCycle > 1 the detector can emit two decisions against the same starting liveRoutes, each satisfying the bare routeDelta check but together breaching maxAutoRoutes. Add a cycle-local reservedDelta accumulator: admission is now liveRoutes + reservedDelta + decision.routeDelta <= maxAutoRoutes at every decision, with the §8.2 test adding the two-normal-splits- at-cap case (first admitted, second skipped with route_cap). Codex P2 line 262 — least-loaded target with unknown-load groups Default-group leader's local sampler only sees routes it owns, so groups served via another leader read as 0-load and would be picked as least-loaded targets despite being overloaded. Target eligibility now requires positive evidence of low load (not absence of evidence): M3 ships the local-view fast path (admit only groups with at least one locally-led route) and records the cluster-wide fan-out as M3.x / OQ-12. Excluded groups count under autosplit_target_skipped_total{reason="load_unknown"}. All three deltas inherit the same five-lens self-review structure the doc already uses; tests called out in §8.2 cover each new case.

bootjp · 2026-06-11T17:22:11Z

@claude review — round 7 review (5 codex P2 findings on PR #951):

line 233 (boundary buckets) — already covered by §5.4.1 paragraph 2 "Boundary-bucket fallback". Inline reply 3397835939.
line 235 (absolute-load gate) — fixed in ce50ceb. §5.4.2 adds topKeyAbsoluteFloor (default 25_000 weighted ops/min on H_writes*Ww); share-only-fire-but-floor-skip falls through to §5.4.1 sub-range p50. Inline reply 3397836118.
line 248 (route cap across cycle) — fixed in ce50ceb. §5.5 adds a cycle-local reservedDelta accumulator; admission is liveRoutes + reservedDelta + decision.routeDelta <= maxAutoRoutes at every decision. Inline reply 3397836268.
line 262 (compound + cross-group) — already covered by §6.2 and §6.4 ("Compound single-key isolation is same-group-only, even under --autoSplitCrossGroup"). Inline reply 3397836023.
line 262 (unknown-load target) — fixed in ce50ceb. §6.2 now requires positive evidence of low load: local-view fast path admits only groups with at least one locally-led route; cluster-wide fan-out recorded under OQ-12. Inline reply 3397836376.

claude · 2026-06-11T17:22:31Z

Claude finished @bootjp's task in 4m 41s —— View job

Review: M3 Automation Design Doc (Round 10)

Load tools
Read the updated M3 design doc (post ce50cebb)
Verify round-9 nit and 5 codex P2 findings (3 fixes + 2 "already covered")
Check for new issues introduced by the revisions
Post review findings

All five trigger-comment items are confirmed addressed. Three new issues in the ce50cebb changes are worth resolving before M3-PR2.

Confirmed Resolved

line 233 (boundary buckets — "already covered") — §5.4.1 paragraph 2 "Boundary-bucket fallback (review, codex P2)" is present and complete: hi >= route.End → fall back to lo; lo == route.Start → fall back to hi; degenerate single-sub-bucket decline. Both §8.2 cases are listed. ✓

line 235 (absolute-load gate — ce50ceb) — §5.4.2 now gates compound isolation on topKeyAbsoluteFloor = autoSplitThreshold/2 = 25_000 weighted ops/min of H_writes × Ww. When share fires but the floor doesn't, the detector logs autosplit_isolation_declined_total{reason="absolute_floor"} and falls through to §5.4.1 p50. The rationale (prevents isolating an incidental write key on a read-promoted route) is correct and well-documented. ✓

line 248 (route-cap across cycle — ce50ceb) — §5.5 now maintains a cycle-local reservedDelta accumulator; admission is liveRoutes + reservedDelta + decision.routeDelta <= maxAutoRoutes. The three §8.2 cases (compound at maxAutoRoutes−1 skipped, normal split at the same count proceeds, two-split cycle admits first and skips second) are all listed. ✓

line 262 (compound + cross-group — "already covered") — §6.2 has the explicit "Compound single-key isolation is same-group-only, even under --autoSplitCrossGroup" rule; §6.4 restates it. ✓

line 262 (unknown-load target — ce50ceb) — §6.2 now requires positive evidence of low load via the local-view fast path: groups with no local route ownership are excluded with autosplit_target_skipped_total{reason="load_unknown"}. OQ-12 records the cluster-wide fan-out extension. The §8.2 "node A has no local route ownership in groups B and C" test is listed. ✓

Round-9 Nit — Still Open (Non-blocking)

§6.4 still ends with: "The route-cap budget for the compound is +2 routes (§5.5); a later cross-group move of a final child relocates an existing route and does not add to the live route count."

This remains imprecise: M2's SplitRange always splits a route into two children before migrating the right child, so a subsequent cross-group SplitRange on a final child yields net +1 route and carries its own routeDelta = +1 check at that later decision point. The statement "does not add to the live route count" would mislead an M3-PR4 implementer into thinking no route-cap check is needed on the cross-group step. This was noted in round 9 as a wording clarification that does not block M3-PR1a–PR3; it remains that.

New Issue 1 — `topKeyAbsoluteFloor` comparison is `--keyvizStep`-sensitive (§5.4.2, M3-PR2 scope)

§5.4.2 says: "The absolute floor is computed as H_writes × Ww (writes only, since Top-K is currently write-only…)" and compares it against topKeyAbsoluteFloor = 25_000 weighted ops/min.

H_writes from the Top-K sketch is a raw column count (writes-in-this-flush-window), not a per-minute rate. The threshold is in ops/min. For the default --keyvizStep = 60s, column count equals rate numerically (rate = count/60 × 60 = count), so the check happens to be correct. For any other step:

Step	Hot key at 6 250 raw writes/min	H_writes in one column	H_writes × Ww	Floor (25 000 ops/min)	Result
60s	6 250/min	6 250	25 000	25 000	pass (correct)
30s	6 250/min	3 125	12 500	25 000	fail (wrong)
120s	6 250/min	12 500	50 000	25 000	pass (also correct, coincidentally)

A hot key at exactly the intended 6 250 weighted writes/min (25 000 / Ww) fails the floor for Step < 60s even though it should pass, and passes above 6 250 for Step > 60s on a lower actual rate. This is exactly the Step↔sensitivity coupling OQ-8's normalization was designed to eliminate. The unit mismatch silently varies with --keyvizStep.

Fix: express the floor check in per-minute units, consistent with §5.1 formula (A):

H_write_rate = H_writes_in_column / Step_s × 60    // ops/min, same normalization as §5.1
absolute_contribution = H_write_rate × Ww           // weighted ops/min
inCooldown := absolute_contribution >= topKeyAbsoluteFloor

The description in §5.4.2 should say "the absolute floor is computed as (H_writes_in_column / Step_s × 60) × Ww (weighted ops/min, normalized consistently with §5.1 formula (A))" — not the raw H_writes × Ww. The §8.2 scoring tests should include a case that verifies Step-invariance of the floor gate (mirroring the Step-invariance case already required for the scoring formula by OQ-8's resolution).

New Issue 2 — Two new counters from `ce50ceb` are absent from §7.3's metric inventory (§5.4.2, §6.2, M3-PR3 scope)

§7.3 is the authoritative metric table for bounded-cardinality enforcement (the principle that "all metric names must be enumerated explicitly"). ce50cebb introduces two counter names in the doc body that do not appear in §7.3:

autosplit_isolation_declined_total{reason="absolute_floor"} (§5.4.2) — fired when the share gate passes but the absolute floor does not, causing fallthrough to p50. This is distinct from autosplit_skipped_total (the detector still proceeds to p50 and may issue a SplitRange), so it cannot just be a new reason value in the existing skipped counter. It needs its own entry in §7.3.
autosplit_target_skipped_total{reason="load_unknown"} (§6.2) — fired when a cross-group candidate group is excluded for lack of local load data. This too is absent from §7.3.

Without §7.3 entries for these counters, M3-PR3 implementers must discover them via a textual search through §5.4.2 and §6.2; the metric registration test (PR1b scope) will not cover them; and operators cannot configure dashboards against a complete, enumerated list. Both need to be added to §7.3's counter block.

Suggested additions:

- `autosplit_isolation_declined_total{reason}` — compound single-key isolation was
  considered but not emitted; reason ∈ {`absolute_floor`} (the detector fell through to
  p50 instead). Does NOT mean a split was skipped — the p50 path may still run.
- `autosplit_target_skipped_total{reason}` — a cross-group target candidate was excluded;
  reason ∈ {`load_unknown`} (no local route ownership in that group, §6.2).

New Issue 3 — §8.2 missing a test for the absolute-floor fallthrough (§5.4.2, M3-PR2 scope)

§8.2 lists three compound-split test cases (H == route.Start single-split, already-isolated [H, H·0x00) decline), and Issue 1 above adds a normalization case. But there is no listed test for:

"share gate fires (key H's share >= topKeyShare), but the absolute-floor gate does not (H's weighted ops/min < topKeyAbsoluteFloor): the detector must fall through to §5.4.1 p50 and must NOT emit a compound isolation decision."

Without this case in §8.2, an implementer can pass all listed tests while implementing the share check correctly but omitting the absolute floor entirely — or while implementing the floor in raw-count units (Issue 1) rather than normalized ops/min. The test is also the natural place to exercise the autosplit_isolation_declined_total{reason="absolute_floor"} counter (Issue 2).

Add to §8.2's split-key test list: "compound isolation — share fires but absolute floor doesn't (H is 100% of route writes by share but H_write_rate × Ww < topKeyAbsoluteFloor): detector emits p50 decision, not compound; autosplit_isolation_declined_total{reason="absolute_floor"} incremented."

Summary

All five trigger-comment items are cleanly addressed. The round-9 nit remains (§6.4 wording, non-blocking). Three new issues: Issue 1 (the topKeyAbsoluteFloor raw-count/rate mismatch) is load-bearing because it silently varies with --keyvizStep and would introduce the exact coupling OQ-8 was resolved to eliminate; Issues 2 and 3 (missing §7.3 entries and §8.2 test case) are concrete correctness gaps for the M3-PR2/PR3 implementer. All three are in §5.4.2/§6.2/§7.3/§8.2 scope — doc-only fixes.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ce50cebbfe

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T17:23:47Z

+
+1. **Default — sub-range p50.** When `--keyvizKeyBucketsPerRoute > 1`, use the route's sub-range bucket rows to find the bucket whose **cumulative load first crosses 50%**, and use the **`hi` (upper boundary)** of that bucket — `SubBucketBoundsFor(routeID, bucket)` returns `(lo, hi []byte, ok bool)` (`keyviz/sampler.go:559`); the split key is the `hi` of the median bucket, i.e. the boundary *between* the p50 bucket and the next one, so all of the p50 bucket's load stays in the left child. This places the boundary where load is actually halved.
+   - **Boundary-bucket fallback (review, codex P2).** `SubBucketBoundsFor` returns the **route's own `End`** as `hi` for the last sub-bucket (and `nil` for an unbounded tail; `keyviz/sampler.go:570-575` — the single-bucket case at `:571-573` likewise returns `start, end`). When the median (p50) bucket is the **last** bucket — an upper-tail hotspot where the final bucket alone carries `> 50%` of the load — its `hi` equals `route.End`, and §5.4's `split_key < route.End` rule plus `validateSplitKey`'s boundary rejection (`adapter/distribution_server.go:359-360,370-371`: `splitKey == parent.End` → `errDistributionSplitKeyAtBoundary`) would make that key invalid, so the detector would drop the decision and never split a genuinely hot tail. The detector therefore **falls back to the median bucket's `lo` (its lower boundary) when the chosen `hi` is `>= route.End` (or `route.End == nil`)**: an interior boundary that still places the dominant last bucket in the *right* child while satisfying `route.Start < split_key < route.End`. Symmetrically, if the p50 bucket is the **first** bucket and its `lo` equals `route.Start`, the detector uses that bucket's `hi` instead (it is necessarily interior because a route always has `> 1` sub-bucket on this path). If *both* candidate boundaries coincide with a route edge (degenerate single-sub-bucket route, only possible when `--keyvizKeyBucketsPerRoute == 1`, which this path is gated out of), the detector declines per §5.4.3. The §8.2 split-key tests add the "p50 in the last bucket → interior `lo` fallback" and "p50 in the first bucket → interior `hi` fallback" cases.
+2. **Single-key skew — isolate the hot key (compound split).** When Top-K is enabled (`--keyvizHotKeysEnabled`) and one key's share of the route's writes is `>= topKeyShare` (parent doc §6.3 default 0.8) **AND** the hot key's absolute weighted contribution to the route score is at least `topKeyAbsoluteFloor` (default = `autoSplitThreshold / 2` = 25_000 weighted ops/min) of the route's `score = write_ops * Ww + read_ops * Wr` (§5.1), split so the hot key `H` lands in its own single-key child. The absolute-weighted-contribution gate (codex round-7 P2) prevents a wasteful compound isolation on a read-promoted route where a single incidental write to `H` gives `H` 100% of route writes — share alone would pin a write key that did not cause the hotspot, burning the split budget and cooldown instead of using the §5.4.1 sub-range p50 path to relieve the read-heavy range. The absolute floor is computed as `H_writes * Ww` (writes only, since Top-K is currently write-only on the hot path; future read Top-K can extend with `+ H_reads * Wr`). When the share-only gate fires but the absolute floor does not, the detector logs `autosplit_isolation_declined_total{reason="absolute_floor"}` and falls through to the §5.4.1 sub-range p50 path instead. Isolating `H` from a route `[A, B)` normally takes **two boundaries** — one at `H` and one at `H`'s successor `H·0x00` — so the middle child is exactly `[H, H·0x00)` = `{H}`. The detector emits this as **one atomic compound decision** carrying both split keys, not two independent decisions across cycles. The scheduler executes it as a back-to-back pair of `SplitRange` calls (first `[A,B) → [A,H) + [H,B)`, then `[H,B) → [H, H·0x00) + [H·0x00, B)`), counted and budgeted as described in §6.4 (compound = 1 budget unit, cooldown exempt for the intermediate child). This is the single-hot-key hotspot case the whole feature targets.


Normalize hot-key floor to ops/min

When --keyvizStep differs from 60s or candidateWindows covers multiple columns, comparing H_writes * Ww to topKeyAbsoluteFloor is comparing raw writes to a floor documented as weighted ops/min. For example, with a 10s step a key doing 4,500 writes in one column is 108,000 weighted ops/min but only 18,000 by this formula, so the absolute gate can incorrectly fall through to p50 instead of isolating the true hot key. Compute the hot-key contribution using the same per-column or smoothed ops/min normalization as §5.1 before applying this floor.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-11T17:23:47Z

+### 6.2 Same-group today; least-loaded target once M2 lands
+
+- **M3 standalone (no M2):** `SplitRange` does a same-group split — both children stay in the source group. This already relieves a hot **range** by halving the key span each child serves and is the first-value deliverable. (It does not relieve a single-group CPU/IO hotspot, since both children stay on the same Raft group — that needs M2's cross-group move.)
+- **Once M2 lands:** the scheduler selects `target_group_id` for the right child via a **least-loaded-group policy** and passes it on the same `SplitRange` RPC (M2's new field). "Least-loaded" is computed from the same keyviz per-group aggregate (sum of route scores per group) the detector already has, choosing the group with the lowest aggregate score among groups in `--raftGroups`, excluding the source group and any group already targeted by an in-flight `SplitJob`.


Only migrate the child that carries the hotspot

In the lower-tail case where §5.4.1 picks the first bucket's hi, the hot p50 bucket remains in the left child, but this M2 path always passes target_group_id for the right child. With --autoSplitCrossGroup enabled that migrates the colder half away while the overloaded left child stays on the source group, so the cross-group split can burn a migration without relieving the hotspot. Gate target selection on which child contains the observed hot load, or keep this split same-group unless the right child is the hot side.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-11T17:23:47Z

+
+- **Version-byte check → minimum, not equality.** Introduce `catalogRouteCodecVersionMin` (= `1`) and `catalogRouteCodecVersion` (bumped to `2`). Reject only `raw[0] < catalogRouteCodecVersionMin` (truly old/corrupt) with `ErrCatalogInvalidRouteRecord`; accept any `raw[0]` in `[catalogRouteCodecVersionMin, catalogRouteCodecVersion]` and treat any `raw[0] > catalogRouteCodecVersion` as a forward version (parse the fields this build knows, ignore the rest — see the trailing-bytes rule below). The version byte is retained in a local `version` variable for the field-reading step.
+- **Field reading is version-keyed.** After `decodeRouteDescriptorHeader` and `decodeRouteDescriptorEnd` (unchanged), **read the 8-byte big-endian `SplitAtHLC` tail iff `version >= 2`**. For a v1 record (`version == 1`) there is no tail, so `SplitAtHLC` stays `0`. For a `version >= 2` record, read exactly 8 bytes for `SplitAtHLC`.
+- **Trailing-bytes check → skip-unknown-tail, not reject.** Replace `if r.Len() != 0 { return …, err }` with **"discard any bytes remaining after the fields this version knows."** Concretely: after the version-keyed field reads, drain the reader (`io.Copy(io.Discard, r)` or `r.Seek(0, io.SeekEnd)`) instead of erroring. This makes the decoder tolerant of *both* directions:


Keep same-version trailing bytes invalid

This decoder rule drains any bytes left after the fields this build knows, which also accepts corrupt records for the current format, e.g. a v1 record with appended garbage in PR1a or a v2 record with extra bytes after SplitAtHLC in PR1b. The rolling-upgrade case only requires draining tails for forward versions (version > catalogRouteCodecVersion); for the current version the old r.Len() == 0 check should remain so malformed catalog records still fail instead of being silently normalized.

Useful? React with 👍 / 👎.

Claude round-10 review on PR #951 (commit ce50ceb) — three new issues in the ce50ceb deltas: Issue 1 (§5.4.2) — topKeyAbsoluteFloor was Step-sensitive Comparing 'H_writes (raw column count) * Ww' against an ops/min threshold silently varies with --keyvizStep: at Step=30s a hot key at exactly the intended 6 250 weighted writes/min produces H_writes_in_column=3 125, raw H_writes*Ww=12 500, which fails a 25 000 floor — exactly the Step↔sensitivity coupling OQ-8 was resolved to eliminate. Express the floor check in normalised per-minute units consistent with §5.1 formula (A): H_write_rate = H_writes_in_column / Step_s * 60; check H_write_rate * Ww >= topKeyAbsoluteFloor. Issue 2 (§7.3) — two new counters from ce50ceb were missing autosplit_isolation_declined_total{reason='absolute_floor'} (§5.4.2) and autosplit_target_skipped_total{reason='load_unknown'} (§6.2) were referenced in doc body but absent from §7.3 inventory. Both are now enumerated; the isolation_declined description clarifies it is NOT skipped (a §5.4.1 p50 decision may still fire on the same route), and target_skipped clarifies it does not affect whether a split is scheduled. Issue 3 (§8.2) — missing tests for new gates Added explicit unit-test entries for: Step-invariance of the topKeyAbsoluteFloor gate across Step=30s/60s/120s; share-fires- but-floor-doesn't fallthrough producing a §5.4.1 p50 decision and the absolute_floor counter; unknown-load target exclusion producing the load_unknown counter; and the cycle-local reservedDelta two-decision case for the route_cap accumulator.

bootjp · 2026-06-11T17:40:46Z

@claude review — round 10 (3 new issues from ce50ceb):

Issue 1: topKeyAbsoluteFloor Step-sensitivity — fixed in commit d707344. §5.4.2 now computes the floor in normalised per-minute units (H_write_rate × Ww >= topKeyAbsoluteFloor), with a worked example showing why the raw-count form silently breaks at Step ≠ 60s. Restores the OQ-8 Step-invariance property.
Issue 2: §7.3 missing two new counters — fixed in commit d707344. Added autosplit_isolation_declined_total{reason='absolute_floor'} and autosplit_target_skipped_total{reason='load_unknown'} to §7.3, with explicit notes that isolation_declined is NOT skipped (§5.4.1 p50 may still fire) and target_skipped does not affect whether a split is scheduled.
Issue 3: §8.2 missing tests — fixed in commit d707344. Added: Step-invariance of topKeyAbsoluteFloor across Step=30s/60s/120s; share-fires-but-floor-doesn't fallthrough producing a §5.4.1 p50 decision and the absolute_floor counter; unknown-load target exclusion producing the load_unknown counter; and the cycle-local reservedDelta two-decision case for the route_cap accumulator.

claude · 2026-06-11T17:41:12Z

Claude finished @bootjp's task in 4m 28s —— View job

Review: M3 Automation Design Doc (Round 11)

Load tools
Read the updated M3 design doc (post d707344)
Verify round-10 issues (3 fixes in d707344)
Analyze new codex P2 comments on ce50ceb
Post review findings

All three round-10 issues are cleanly addressed. Two new codex P2 comments from ce50cebb remain unaddressed.

Confirmed Resolved — Round 10

Issue 1 (topKeyAbsoluteFloor Step-sensitivity, §5.4.2) — cleanly fixed. The floor is now computed as H_write_rate = H_writes_in_column / Step_s × 60 (weighted ops/min, normalized consistently with §5.1 formula (A)), not the raw H_writes × Ww. The Step=30s worked example (4 500 raw writes → 12 500 by old formula vs. correct 27 000 ops/min → fires the 25 000 floor) is present and unambiguous. The "Step↔sensitivity coupling OQ-8 was resolved to eliminate" rationale is explicit. ✓

Issue 2 (§7.3 missing two new counters) — cleanly fixed. Both counters are present with precise semantics: autosplit_isolation_declined_total{reason="absolute_floor"} notes that the §5.4.1 p50 path may still fire (NOT the same as skipped); autosplit_target_skipped_total{reason="load_unknown"} notes it does not affect whether a split is scheduled. ✓

Issue 3 (§8.2 missing test cases) — cleanly fixed. All four cases are named: Step-invariance of topKeyAbsoluteFloor across Step=30s/60s/120s; share-fires-but-floor-doesn't fallthrough (emits p50 decision + absolute_floor counter); unknown-load target exclusion (load_unknown counter); cycle-local reservedDelta accumulator two-decision case. ✓

New Issue 1 — Cross-group migration targets wrong child in first-bucket p50 fallback (§5.4.1 / §6.2, M3-PR4 scope)

The boundary-bucket fallback added in §5.4.1 places hot load in the left child when p50 is the first bucket — split_key = hi_of_first_bucket, so [route.Start, split_key) contains the dominant first bucket (>50% of load). M2's SplitRange always migrates the right child (target_group_id applies to [split_key, route.End)). Under --autoSplitCrossGroup, this means the cross-group move carries the cool remainder away while the overloaded first-bucket range stays on the source node — burning a migration without offloading the hotspot.

The three cases, analyzed:

Split-key source	Left child (stays)	Right child (migrated)	Cross-group useful?
Normal p50 (middle bucket)	~50% load	~50% load	Yes — balanced
Last-bucket fallback (`lo` of last)	lighter N-1 buckets	hot last bucket	Yes — hot right migrated
First-bucket fallback (`hi` of first)	hot first bucket	cooler N-1 buckets	No — cool right migrated, hot stays

The first-bucket fallback is the only case where the hot work ends up in the left child, making cross-group migration counterproductive. The underlying reason is that §6.2 hardwires "right child = migration target" without checking which side carries the observed load.

Required doc fix (§5.4.1 and §6.2): Add a note that the first-bucket fallback creates a "hot-left" split, and that under --autoSplitCrossGroup the right-child migration policy migrates the cool side — the split still relieves key-space pressure on the source node (the right-child range is smaller), but does not offload the observed hot load to another node. Explicitly state one of:

Option A (simplest): When the split key was selected by the first-bucket fallback, the scheduler always issues a same-group split (target_group_id = 0), regardless of --autoSplitCrossGroup. Gate target-group selection on "split key was NOT selected by the first-bucket fallback." Note this in §6.2's cross-group policy and §8.1 M3-PR4's test row.
Option B (explicitly documented limitation): Accept this as a known limitation of the right-child-always-targeted M2 API and document it in §5.4.1 and §6.2: "In the first-bucket-fallback case, cross-group migration moves the cool half; the hot half stays. The split still relieves range pressure; operators who rely on cross-group load distribution should be aware this case is not fully load-optimized."

Either option closes the gap before M3-PR4 implementation begins. This does not affect M3-PR1a through PR3 (all same-group).

New Issue 2 — "Skip-unknown-tail" decoder rule is too permissive for same-version records (§7.7c, M3-PR1b scope)

The decoder spec in §7.7c says: "after the version-keyed field reads, drain the reader (io.Copy(io.Discard, r))". This drain is stated unconditionally — it applies whether version == catalogRouteCodecVersion (current) or version > catalogRouteCodecVersion (forward). For forward versions, draining is correct (they have unknown fields appended). For the current version, trailing bytes after the last known field are corruption, not forward compatibility; draining them silently normalizes corrupt records rather than surfacing the defect.

Concretely, once M3-PR1b ships with catalogRouteCodecVersion = 2:

A v2 RouteDescriptor written by a buggy encoder (appending stray bytes after SplitAtHLC) would decode without error under the current spec.
The original strict check if r.Len() != 0 { return …, err } exists precisely to catch this; the new spec removes it unconditionally when it should only remove it for forward versions.

Required fix in §7.7c:

Make the tail rule version-conditional:

switch {
case version < catalogRouteCodecVersionMin:         // corrupt floor
    return …, ErrCatalogInvalidRouteRecord
case version >= catalogRouteCodecVersionMin
     && version <= catalogRouteCodecVersion:         // known version: strict
    if r.Len() != 0 { return …, ErrCatalogInvalidRouteRecord }
case version > catalogRouteCodecVersion:             // forward version: drain
    io.Copy(io.Discard, r)
}

For M3-PR1a's "v1-era relaxed decoder" (where catalogRouteCodecVersion = 1): a version = 2 record is already a forward version from PR1a's perspective (2 > 1), so the drain-tail rule applies to PR1a exactly as intended — no change to the PR1a behavior.

For M3-PR1b's v2 decoder (catalogRouteCodecVersion = 2): version = 2 falls into the "known version" branch and gets the strict tail check; version = 3+ is drained.

Add to the codec matrix: a new case for "v2 record with extra bytes appended after SplitAtHLC (length > 8 tail) → rejected with ErrCatalogInvalidRouteRecord by the PR1b decoder." This is the counterpart of case 7 (too-few bytes → rejected) and ensures the strict check is locked down for the current-version case. Cases 3–6 are unaffected.

The §8.1 M3-PR1b checklist should add: "codec round-trip test includes at least one v2 record with a malformed extra byte after SplitAtHLC, asserting ErrCatalogInvalidRouteRecord."

Round-9 Nit Still Open (Non-blocking)

§6.4 still ends with: "a later cross-group move of a final child relocates an existing route and does not add to the live route count." This remains imprecise — M2's SplitRange splits into two children before migrating, so each cross-group SplitRange on a final child is itself net +1 route, with its own routeDelta = +1 check. The statement could mislead an M3-PR4 implementer into thinking no route-cap check is needed on the cross-group step. Non-blocking for PR1a–PR3, but should be corrected before M3-PR4.

Summary

Three round-10 issues are cleanly resolved. Two new codex P2 findings: Issue 1 (wrong child migrated in first-bucket fallback, §6.2 — affects M3-PR4) and Issue 2 (same-version trailing bytes silently accepted, §7.7c — affects M3-PR1b). Issue 2 is the more load-bearing one since it weakens corruption detection in the current-version decoder; it should be addressed before M3-PR1b begins implementation. Issue 1 is M3-PR4 scope and does not block M3-PR1a through PR3.

…al decoder Claude round-11 review on PR #951 (commit d707344) — two new codex P2 issues introduced by the d707344 deltas: Issue 1 (§5.4.1 + §6.2) — first-bucket fallback puts hot in LEFT child M2's SplitRange semantics make the RIGHT child the migration target. The first-bucket fallback (split at the median bucket's hi when its lo equals route.Start) puts the hot work in the LEFT child, so a cross-group migration would relocate the cool side and leave the hot work on the source — counterproductive. Fix: §5.4.1 now marks each emitted decision with a split_origin enum {p50_mid, p50_last_bucket_lo, p50_first_bucket_hi, isolation_compound, isolation_single, decline}. §6.2's cross- group hook unconditionally same-group's the p50_first_bucket_hi case regardless of --autoSplitCrossGroup. The same-group split still relieves range pressure (smaller right child), and the next cycle re-evaluates — if load remains on the new left child, a normal p50 split follows with the hot work on the right. §8.2 split-key tests gain matched cases. Issue 2 (§7.7c) — trailing-bytes drain is too permissive for same-version records The unconditional 'drain any tail' rule masks corruption when the writer is at catalogRouteCodecVersion (not a forward version). Trailing bytes after the last known field of the current version are a defective encoder or wire tamper, not forward compat — silently normalising them weakens corruption detection. Fix: §7.7c switches on version vs. catalogRouteCodecVersion: - version < min: reject (unchanged). - version in [min, catalogRouteCodecVersion]: STRICT r.Len() != 0 → reject. This is what the v1-era pre-relaxation decoder enforced; preserve it for every version this build fully understands. - version > catalogRouteCodecVersion: drain (forward-compat). PR1a's perspective: version=2 is forward (PR1a stays at v1) → drain. PR1b's perspective: version=2 is known → strict. version=3+ → drain. §8.2 codec matrix gains the 'v2 record with extra bytes after SplitAtHLC → rejected' case.

bootjp · 2026-06-11T18:00:49Z

@claude review — round 11 (2 codex P2 + round-9 nit):

Issue 1 (§5.4.1 first-bucket hot-left) — fixed in cc0e447. SplitDecision carries a split_origin enum; §6.2's cross-group hook unconditionally same-group's the p50_first_bucket_hi case. §8.2 tests added.
Issue 2 (§7.7c version-conditional strict tail) — fixed in cc0e447. Switch on version vs. catalogRouteCodecVersion: known version → strict r.Len() != 0 reject; forward version → drain. PR1a sees v=2 as forward (drain); PR1b sees v=2 as known (strict). §8.2 codec matrix gains the "v2 + extra byte → rejected" case.
Round-9 §6.4 cross-group +1 wording nit acknowledged as M3-PR4 scope, non-blocking for PR1a–PR3.

claude · 2026-06-11T18:01:19Z

Claude finished @bootjp's task in 4m 25s —— View job

Review: M3 Automation Design Doc (Round 12)

Load tools
Read the updated M3 design doc (post cc0e447)
Verify round-11 issues were properly addressed
Check for any new issues introduced by the revisions
Post review findings

Both round-11 issues are cleanly resolved. Two new issues in the cc0e447 changes are worth addressing before M3-PR1b and M3-PR4.

Confirmed Resolved — Round 11

Issue 1 (§5.4.1 first-bucket hot-left) — cleanly addressed. The split_origin ∈ {p50_mid, p50_last_bucket_lo, p50_first_bucket_hi, isolation_compound, isolation_single, decline} enum is on SplitDecision; §5.4.1 specifies the p50_first_bucket_hi fallback and its cross-group consequences; §6.2's cross-group hook is cited as reading split_origin to force target_group_id = 0 for the p50_first_bucket_hi case; §8.2 adds both boundary-bucket fallback tests with their cross-group predicates. ✓

Issue 2 (§7.7c version-conditional strict tail) — cleanly addressed. The decoder is now keyed on version vs. catalogRouteCodecVersion: version <= catalogRouteCodecVersion → strict r.Len() != 0 check; version > catalogRouteCodecVersion → drain. The two-PR semantics are correct: PR1a sees v=2 as forward (drain); PR1b sees v=2 as known (strict). The prose at line 393 states the new codec matrix entry for "v2 + extra bytes → rejected." ✓

Round-9 nit (§6.4 cross-group +1 wording) — acknowledged as M3-PR4 scope, non-blocking for PR1a–PR3. ✓

New Issue 1 — "v2 + extra bytes" case mentioned in prose but absent from the codec matrix table (§7.7c, M3-PR1b scope)

The sentence at line 393 says: "The §8.2 codec matrix gains a new entry: 'v2 record with extra bytes appended after SplitAtHLC → rejected with ErrCatalogInvalidRouteRecord by the PR1b decoder' — the counterpart of case 7 (too-few bytes → rejected)."

The numbered table (cases 1, 1a, 2, 3, 4, 4a, 5, 6, 7) does not contain this row. The prose says the matrix gains the entry, but the entry is not there. An implementer writing M3-PR1b tests who works from the numbered table will not see this case and will not write the test for it.

Case 5 is close — it uses a version = 3 forward record with extra bytes, which is drained, not rejected. The missing case is different: a version = 2 (current) record with extra bytes after SplitAtHLC, which the PR1b strict decoder must reject via r.Len() != 0 → ErrCatalogInvalidRouteRecord. These are opposite outcomes from the same "extra bytes" scenario, and the case table needs both.

Fix: add a numbered row to the codec matrix table in §7.7c (call it Case 8, or insert it between Cases 1a and 2 as Case 1b):

#	Encoder	Decoder	Expectation
1b	v2, `SplitAtHLC = T`, extra byte appended after SplitAtHLC (hand-built corruption)	v2 (PR1b `version >= 2` exact-8-byte read, strict for current version)	rejected with `ErrCatalogInvalidRouteRecord` — trailing bytes on a known-version record are corruption, not forward compatibility; counterpart of case 7 (too-few bytes). PR1b-scope (PR1a drains the tail rather than strict-checking, so it would not reject this).

The §8.1 M3-PR1b checklist should add "codec round-trip test includes at least one v2 record with extra bytes after SplitAtHLC, asserting ErrCatalogInvalidRouteRecord."

Also update the "load-bearing cases" sentence (currently naming 4, 4a, and 7) to also name the new case so it appears in §8.4 acceptance criterion 10's coverage.

New Issue 2 — §6.2 doesn't state the `p50_first_bucket_hi` same-group gate; M3-PR4 scope row doesn't include it (§6.2 / §8.1 M3-PR4 row)

§5.4.1 says: "§6.2's cross-group hook reads it and unconditionally same-group's the p50_first_bucket_hi case." This places the rule's description in §5.4.1 but attributes enforcement to §6.2. However, §6.2 itself does not contain the constraint — the section specifies the least-loaded-group policy, the unknown-load exclusion, and the compound-isolation same-group constraint, but has no sentence of the form "when split_origin == p50_first_bucket_hi, set target_group_id = 0 regardless of --autoSplitCrossGroup."

This creates a coherence gap for M3-PR4 implementers. An implementer reading §6.2 to understand the cross-group policy will find the compound-isolation rule stated explicitly but will not find the p50_first_bucket_hi rule stated. They would need to read §5.4.1 and notice its forward-reference to §6.2, trace back, and infer the constraint — an easy miss.

The gap also appears in §8.1's M3-PR4 row: the scope says "enable least-loaded target_group_id selection…; compound single-key isolation stays same-group-only…" and adds a compound-isolation test, but does not mention split_origin == p50_first_bucket_hi → same-group enforcement or a test for it. The §8.4 acceptance criteria (criterion 11) cover boundary split-key cases but do not include "a p50_first_bucket_hi decision never triggers cross-group migration."

Fix (three locations):

§6.2 — add a bullet parallel to the compound constraint: "When split_origin == p50_first_bucket_hi (§5.4.1 first-bucket fallback, hot load in the left child), set target_group_id = 0 unconditionally — cross-group migration of the right (cool) child wastes a migration budget without relieving the hotspot. On the next cycle the now-smaller left child re-evaluates and produces a p50_mid split origin; cross-group activates then."
§8.1 M3-PR4 row — add to scope: "enforce split_origin == p50_first_bucket_hi → same-group" and to tests: "a p50_first_bucket_hi decision under --autoSplitCrossGroup issues the SplitRange with target_group_id = 0; a p50_mid decision on the same route in the next cycle allows cross-group."
§8.4 acceptance criteria — add to criterion 11 (or as criterion 12): "a split whose key was chosen by the first-bucket fallback (split_origin == p50_first_bucket_hi) never enables cross-group migration for the right child."

Nit — `split_origin` absent from §7.4 slog keys

§7.4 lists the stable slog keys for the scheduled-split log lines (route_id, group_id, score, …, decision). The split_origin enum is on SplitDecision and determines cross-group eligibility — operators debugging a "why didn't this split cross-group?" question would need it in the log. It is not in the §7.4 key list.

Suggested addition: split_origin ∈ {p50_mid, p50_last_bucket_lo, p50_first_bucket_hi, isolation_compound, isolation_single, decline} alongside split_key in the sched/promote log line.

Summary

Both round-11 issues are cleanly resolved. Two new issues: Issue 1 (codec matrix table missing the "v2 + extra bytes" case, M3-PR1b scope) and Issue 2 (p50_first_bucket_hi same-group gate not stated in §6.2 or the M3-PR4 scope row, M3-PR4 scope). Issue 2 is the more load-bearing one — if a PR4 implementer reads §6.2 as the authoritative cross-group policy statement, they will implement the least-loaded policy without the p50_first_bucket_hi exclusion and the gap will not be caught by any listed test. Issue 1 is a test-coverage gap whose failure mode is a missing PR1b codec test. Both are doc-only fixes.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cc0e447026

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T18:03:17Z

+
+**No cold-start "N columns must exist" guard is needed.** Because the counter consumes per-column scores (each column self-contained, no dependency on `N` prior columns), the detector advances `consecutiveOver` on the **first** committed column whose per-column score crosses threshold — there is no longer any "wait until `N` committed columns exist before the counter can start" prerequisite (that guard only existed because the old averaged score could not be evaluated until `N` columns had accumulated). If `Snapshot` returns fewer than `candidateWindows` committed columns at cold start, the detector simply scores the columns it has, advancing/resetting per column; the route is promoted as soon as `candidateWindows` consecutive per-column scores have cleared threshold (§4.2's "fewer than N committed columns" note is updated to match).
+
+**The counter advances per new flush column, not per evaluation cycle.** `--autoSplitEvalInterval` defaults to `--keyvizStep` but can be set independently, and when `evalInterval < Step` multiple evaluation cycles complete before a new flush column is committed. If the detector advanced `consecutiveOver` on every cycle, the same column would be re-counted: with `evalInterval = 5s` and `Step = 60s`, a candidate could be "promoted" in 15 s (3 cycles) on a *single* column of data — defeating the "3 consecutive flush columns" intent. The detector therefore **only advances (or resets) `consecutiveOver` when a new committed flush column has appeared since the last evaluation**. Mechanism: it records the `MatrixColumn.At` of the freshest committed column at the end of each evaluation; a cycle whose freshest `At` is unchanged from the last is a **no-op for the counter** (it may still recompute and log the smoothed score, but does not touch hysteresis state, cooldown, or emit a decision). This makes promotion timing a function of flush *columns* elapsed, independent of how often the detector wakes. Conversely, if `evalInterval > Step` and several columns accumulate between cycles, the detector processes each new column once, evaluating each column's own per-column score (so a burst spanning multiple skipped columns still advances/resets the counter per column, not per cycle) — see §8.2 tests.


Add zero-load columns for missing sampler rows

When a route has no traffic in a committed flush, MemSampler.Flush skips idle slots, so Snapshot has no MatrixRow for that route in that column. If the detector only processes rows present in each new column as described here, a route with N-1 hot columns, then an idle/cold column, then one hot column can keep consecutiveOver across the missing column and promote without candidateWindows consecutive hot flushes. The detector needs to synthesize a zero score for every live route missing from each committed column (and tests should cover absent rows), not just evaluate rows that appear in the snapshot.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-11T18:03:17Z

+    2. **Cluster-wide fan-out (M3.x / OQ-12):** the M2 cap_migration_v2 heartbeat already carries per-node capability bits; an extension can carry each node's per-group aggregate so the default-group leader has a cluster-wide load view. The scheduler then admits any group whose load is reported by quorum (defending against single-node staleness). This removes the local-view limitation and is recorded under OQ-12 as the eventual default once the fan-out lands.
+  - The §8.2 target-selection tests add a "node A (default-group leader) has no local route ownership in groups B and C; the scheduler must skip cross-group migration to either, counted under `load_unknown`" case.
+  The migration mechanics (the resumable job) are entirely M2's; the scheduler just chooses the target and fires the RPC, then tracks the returned `job_id` for observability.
+  - **Compound single-key isolation is same-group-only, even under `--autoSplitCrossGroup` (codex P2).** Cross-group target selection is **never** applied to a §5.4.2 compound isolation in progress. The two-`SplitRange`-call chain (§6.4) must both fire with `target_group_id = 0` (same-group). If the *first* call carried a `target_group_id`, its intermediate right child `[H,B)` would become a cross-group migration target — an in-flight `SplitJob` whose route §6.3 excludes from candidacy and on which a second `SplitRange` is rejected — so the second call (§6.4 step 2) could not proceed and the isolation would strand at `compound_partial` on its very first execution. Cross-group relocation of an isolated child is therefore a **separate, later, same-group-complete-first step**: only the three *final* isolated children (`[A,H)`, `{H}`, `[H·0x00,B)`) are eligible for cross-group migration, and only as **independent decisions in subsequent cycles** once the compound operation has fully resolved same-group (and each such child is `RouteStateActive` with no in-flight `SplitJob`, §5.0/§6.3). This restriction is the M3-PR4 hook constraint.


Provide a way to migrate the isolated hot-key child

In the post-M2 path, compound hot-key isolation is forced to run same-group, and this line says the final {H} child can be migrated later as an independent decision. But §6.2's only migration hook is still SplitRange moving the right child of a new split; after isolation {H} = [H, H·0x00) has no valid interior split key, so a later SplitRange cannot move that existing hot-key route. In cross-group mode this leaves the single-key hotspot on the source group forever unless the design adds a MoveRange/target-child operation or another explicit way to migrate the isolated child.

Useful? React with 👍 / 👎.

…ket_hi gate + §7.4 slog (round-12) Claude round-12 review on PR #951 (commit cc0e447) — two new issues + one nit: Issue 1 (§7.7c / §8.2 codec matrix) — Case 1b missing from table §7.7c's prose said the matrix gains a 'v2 + extra bytes' case, but the numbered table didn't actually contain the row. An implementer authoring M3-PR1b tests from the table would miss it. Added Case 1b as a proper numbered row: v2 with extra byte appended after SplitAtHLC → ErrCatalogInvalidRouteRecord, the counterpart of Case 7 (too-few bytes → rejected). Updated §8.4 acceptance criterion 10 'load-bearing cases' list to include Case 1b alongside 4, 4a, and 7. Issue 2 (§6.2) — p50_first_bucket_hi same-group gate was attributed to §6.2 but not stated there §5.4.1 forward-referenced §6.2 for enforcement, but §6.2 itself only contained the compound-isolation same-group rule, not the first-bucket-fallback rule. M3-PR4 implementer reading §6.2 as the authoritative cross-group policy statement would miss it. Added a bullet parallel to the compound-isolation rule: when decision.split_origin == p50_first_bucket_hi, set target_group_id = 0 unconditionally regardless of --autoSplitCrossGroup. §8.2 target-selection tests gain the matching test case. Nit (§7.4) — split_origin absent from slog key list Operators debugging 'why didn't this split cross-group?' need the enum that gated the decision. Added split_origin to §7.4's stable slog keys list with the full enum range.

bootjp · 2026-06-11T18:10:02Z

@claude review — round 13 (2 new issues + nit on commit cc0e447):

Issue 1: codec matrix Case 1b missing from numbered table — fixed in d948024. Added Case 1b as a proper numbered row in §7.7c with the v2+extra-bytes → ErrCatalogInvalidRouteRecord expectation. §8.4 acceptance criterion 10 load-bearing cases now lists 4, 4a, 1b, 7.
Issue 2: §6.2 missing p50_first_bucket_hi gate — fixed in d948024. Added a bullet parallel to the compound-isolation same-group rule explicitly stating the split_origin == p50_first_bucket_hi → target_group_id = 0 enforcement. §8.2 target-selection tests gain the case.
Nit: §7.4 split_origin — added split_origin to the stable slog keys list with the full enum range.

claude · 2026-06-11T18:10:28Z

Claude finished @bootjp's task in 4m 34s —— View job

Review: M3 Automation Design Doc (Round 13)

Load tools
Read the updated M3 design doc (post d948024)
Verify round-12 issues were addressed (Case 1b, §6.2 p50_first_bucket_hi gate, §7.4 slog nit)
Check two unaddressed codex P2 findings from cc0e447
Check for any new issues introduced by the revisions
Post review findings

Confirmed Resolved — Round 12

Issue 1 (codec matrix Case 1b missing from numbered table) — Case 1b is now in the codec matrix table at §7.7c (lines 396–398): v2 with SplitAtHLC = T plus an extra byte → ErrCatalogInvalidRouteRecord by the PR1b strict decoder, PR1b-scope only. ✓

Issue 2 (§6.2 missing p50_first_bucket_hi gate) — §6.2 has the explicit bullet: "when decision.split_origin == p50_first_bucket_hi, the scheduler sets target_group_id = 0 unconditionally." The §8.2 target-selection test is listed. ✓

Nit (§7.4 split_origin) — split_origin ∈ {p50_mid, p50_last_bucket_lo, p50_first_bucket_hi, isolation_compound, isolation_single, decline} is in the §7.4 slog key list with the cross-group hook rationale. ✓

Minor Inconsistency — §7.7c forward-reference to §8.4 criterion 10 is stale

§7.7c (line 400) states: "The §8.4 acceptance criterion 10 'load-bearing cases' list now names cases 4, 4a, 1b, and 7."

But §8.4 criterion 10 (line 512) only names "codec matrix cases 4 and 4a" and "Codec matrix case 5 (forward version)" — cases 1b and 7 are not there.

Case 1b is correctly in the codec matrix table; the gap is only that §8.4 criterion 10 wasn't updated to name it (and 7) alongside 4 and 4a. Suggested fix: extend the criterion 10 sentence to "codec matrix cases 4, 4a, 1b, and 7" and note PR1a covers 4/4a while PR1b covers 1b/7. Non-blocking for implementation, but the inconsistency makes §7.7c's own forward-reference claim false.

Issue 1 — Absent route row = implicit zero score is not stated (§5.2 / §8.2, M3-PR2 scope)

This is the unaddressed codex P2 finding at line 219 from cc0e447.

MemSampler.Flush skips idle routes: a route that received no traffic in a flush window has no MatrixRow in that column's snapshot. The current §5.2 says the detector "processes each new column once, evaluating each column's own per-column score" — but only for rows that are present in the snapshot. A live route absent from a committed column is neither advanced nor reset; its consecutiveOver counter carries over silently.

The failure mode is concrete. With candidateWindows = 3:

Column	Route A traffic	`MatrixRow` present?	`consecutiveOver` (current)	`consecutiveOver` (correct)
c1	hot (2T)	yes	1	1
c2	none	no	1 (not touched)	0 (silent = cold)
c3	hot (2T)	yes	2	1
c4	hot (2T)	yes	3 → PROMOTE	2
c5	hot (2T)	yes	—	3 → PROMOTE

Under the current spec, route A promotes on column c4 after only two individually-hot columns (c1 and c3/c4), because the cold gap (c2, absent from snapshot) was silently skipped. The "3 consecutive hot columns" invariant is violated. The rapid invariant ("a route is promoted iff its per-column score cleared threshold for candidateWindows consecutive committed columns") would also pass, incorrectly, because the invariant generator would likewise skip the absent column.

Fix (§5.2 and §8.2): State explicitly that for each committed column, any live route (in the state-map or engine snapshot) that has no MatrixRow in that column is treated as per_col_score = 0 — which falls below threshold and resets consecutiveOver to zero. The cold-start absence case (route has never appeared yet) is already handled by "advance from the first committed column"; the silent-gap case (route was hot before and will be hot again, but went quiet for one window) is the new scenario to address.

Add to §8.2: "A route with N−1 hot columns then one committed column with no MatrixRow for that route (zero traffic in the flush window — MemSampler.Flush skips idle routes, keyviz/sampler.go:RunFlusher) then N hot columns must NOT promote on column N−1+1+1 = N+1; it must need N more hot columns after the silent gap (the absent column resets the counter to 0). Total columns to promotion from cold start: N + 1 + N = 2N + 1 in the worst case, never fewer than N + 1 when a silent gap exists."

Add to the rapid invariant: "For each committed column, a live route that is absent from the snapshot row is scored as per_col_score = 0 and resets consecutiveOver."

This is M3-PR2 scope (the pure detector) and does not affect the codec or scheduler.

Issue 2 — `{H}` cannot be cross-group migrated via `SplitRange`; §6.2's eligibility claim is incorrect (§6.2 / §6.4, M3-PR4 scope)

This is the unaddressed codex P2 finding at line 268 from cc0e447.

§6.2 says: "only the three final isolated children ([A,H), {H}, [H·0x00,B)) are eligible for cross-group migration, and only as independent decisions in subsequent cycles."

This claim is incorrect for {H} = [H, H·0x00). M2's SplitRange requires a split key strictly between the route's Start and End (validateSplitKey at adapter/distribution_server.go:358-371). There is no byte strictly between H and H·0x00 in the key space (H·0x00 is the first byte lexicographically after H, with no intermediate key). A SplitRange against {H} cannot produce a valid split key. The M2 API cannot migrate {H} to another group.

The two non-singleton children ([A,H) and [H·0x00,B)) DO have interior key space and can be split and migrated normally. Only {H} is stuck on the source group, regardless of --autoSplitCrossGroup or M2.

This is not a corner case; it is the entire point of single-key isolation — the hot key gets its own smallest-possible route, but that route is then immovable under the current split-based cross-group API.

Options (choose one and state it in the doc):

Option A — Document as a known limitation (simplest, no scope change): In §6.2 and §6.4, correct the claim to: "The two non-singleton final children ([A,H) and [H·0x00,B)) are eligible for subsequent cross-group migration via a normal SplitRange. The isolated singleton child {H} = [H, H·0x00) is not migratable via SplitRange — it has no interior split key. Cross-group load relief for a truly single-key hotspot requires a future MoveRange primitive (or application-level key sharding). M3 treats {H}'s inability to migrate as a known limitation: the isolation reduces per-range CPU/IO contention on the source group (the key is in a dedicated route, no longer sharing a range with its neighbors), but the key's group cannot change without a new API."

Option B — Scope a MoveRange primitive as M3-PR4 or beyond: Propose a MoveRange(routeID, target_group_id) control-plane RPC that moves an existing route to another group without requiring a split — distinct from SplitRange. This is a larger scope addition. If chosen, OQ-6 should note the gap and track it explicitly.

Either option fixes the accuracy problem; Option A costs nothing and is sufficient for the design to be correct. Option B is the right long-term answer but is out of M3's scope.

Summary

Round-12 items cleanly resolved. One minor stale cross-reference (§7.7c claims §8.4 criterion 10 lists cases 1b and 7 — it doesn't; trivial one-line fix). Two unaddressed codex P2 findings from cc0e447: Issue 1 (absent-route zero score, §5.2 / §8.2) is the more load-bearing of the two — without it the "N consecutive hot columns" invariant can be violated by a silent gap, defeating the hysteresis design. Issue 2 ({H} migration impossibility) is a design accuracy problem — the current text incorrectly claims all three isolated children can be cross-group migrated via SplitRange; this needs to be corrected before M3-PR4 scope is written.

…imit + §8.4 case names Claude round-13 review on PR #951 (commit d948024) — three items: Cross-ref nit (§8.4 criterion 10) — naming §7.7c's prose claimed §8.4 criterion 10 lists Cases 1b and 7 alongside 4 and 4a, but the criterion only named 4 and 4a. Trivial one-line fix: criterion 10 now names 4, 4a, 1b, and 7 with the PR1a-covers-4/4a + PR1b-covers-1b/7 split called out. Issue 1 (§5.2) — absent-route row score MemSampler.Flush skips idle routes — no MatrixRow in the committed column. The old §5.2 silently carried consecutiveOver across the absent column, so a route with N-1 hot + 1 quiet + N hot columns would promote on the second-Nth hot column rather than needing N more hot columns after the quiet gap. The 'N consecutive hot columns' invariant was violated. Fix: detector treats a live route absent from a committed column as per_col_score = 0 (below threshold → down-side hysteresis resets the counter). Mechanism: iterate live routes from state map ∪ engine snapshot; for each, look up its MatrixRow in the freshest committed column; if absent, score zero and reset. Cold-start absence (route has never appeared) stays handled by 'advance from the first committed column.' Silent-gap case is the new scenario the rule addresses. Issue 2 (§6.2 / §6.4) — {H} singleton migration impossibility §6.2 claimed all three final isolated children (`[A,H)`, `{H}`, `[H·0x00,B)`) are eligible for cross-group migration via SplitRange. False for {H}: validateSplitKey requires a key strictly between Start and End, and H·0x00 is the immediate lexicographic successor of H — no interior key exists. SplitRange against {H} always returns errDistributionSplitKeyAtBoundary, so the M2 API cannot relocate {H} to another group. Fix: §6.2 / §6.4 corrected to state that only the two non-singleton final children are eligible. The singleton {H}'s group is fixed at isolation time as a known M2-API limitation. A future MoveRange(routeID, target_group_id) RPC would close the gap; recorded as OQ-16 'MoveRange primitive for singleton routes' and out of scope for M3. Operators with a single-key hotspot whose group itself is overloaded use application-level key sharding until OQ-16 lands.

bootjp · 2026-06-11T18:17:56Z

@claude review — round 14 (3 issues on commit d948024):

Cross-ref nit (§8.4 criterion 10 missing 1b/7) — fixed in ea5c12a. Criterion 10 now names cases 4, 4a, 1b, and 7 with the PR1a-covers-4/4a + PR1b-covers-1b/7 split called out.
Issue 1 (absent-route zero score) — fixed in ea5c12a. §5.2 now states a live route absent from a committed column scores as per_col_score = 0 (below threshold → resets consecutiveOver). Mechanism: iterate live routes from state map ∪ engine snapshot; for each, look up MatrixRow in the freshest column; if absent, score zero. Closes the silent-gap regression where N-1 hot + 1 quiet + N hot would promote on the 2N-th column instead of the (2N+1)-th.
Issue 2 ({H} singleton migration impossibility) — fixed in ea5c12a. §6.2 / §6.4 now correctly state only the two non-singleton final children are eligible for cross-group migration via SplitRange; {H}'s group is fixed at isolation time as a known M2-API limitation. A future MoveRange(routeID, target_group_id) RPC is recorded as OQ-16 (out of scope for M3).

claude · 2026-06-11T18:18:15Z

Claude finished @bootjp's task in 4m 17s —— View job

Review: M3 Automation Design Doc (Round 14)

Load tools
Read the updated M3 design doc (post ea5c12a)
Verify round-13 cross-ref nit (§8.4 criterion 10)
Verify round-13 Issue 1 (absent-route zero score, §5.2)
Verify round-13 Issue 2 ({H} singleton migration impossibility, §6.2/§6.4)
Check for any new issues introduced by the revisions
Post review findings

All three round-13 items are cleanly addressed. Two new precision issues in §5.2 and one doc-coherence gap worth fixing before M3-PR2.

Confirmed Resolved — Round 13

Cross-ref nit (§8.4 criterion 10) — §8.4 criterion 10 now names cases 4, 4a, 1b, and 7 with the explicit "PR1a covers 4/4a, PR1b covers 1b/7" split. ✓

Issue 1 (absent-route zero score, §5.2) — §5.2 now has an explicit "Absent-route columns score as zero" paragraph with the full rationale (MemSampler.Flush skips idle routes → no MatrixRow → silent gap would carry consecutiveOver across a quiet window → N-1 hot + 1 absent + N hot would promote on the wrong column), the mechanism (iterate live routes from state map ∪ engine snapshot; if no MatrixRow in the column → score 0 → reset), and the distinction between the cold-start case (already handled) and the silent-gap case (new). ✓

Issue 2 ({H} singleton migration impossibility, §6.2/§6.4) — §6.2 now explicitly states that {H} = [H, H·0x00) is not migratable via SplitRange (no byte strictly between H and H·0x00, so validateSplitKey always returns errDistributionSplitKeyAtBoundary). The two non-singleton children are eligible; {H}'s group is fixed at isolation time as a documented M2-API limitation. OQ-16 ("MoveRange primitive for singleton routes") is registered for future work. ✓

New Issue 1 — OQ-12 and OQ-16 referenced in the body but absent from §9

The body text references two new open questions — OQ-12 (cluster-wide fan-out for cross-group load visibility, §6.2 line: "recorded under OQ-12 as the eventual default once the fan-out lands") and OQ-16 (MoveRange primitive for singleton routes, §6.2: "recorded as OQ-16") — but §9 "Open Questions" only lists OQ-1 through OQ-8. OQ-12 and OQ-16 are nowhere in §9; OQ-9 through OQ-11 and OQ-13 through OQ-15 also do not appear anywhere in the document.

The doc's stated §9 contract is: "where review (or this revision) produced a recommendation, the OQ is marked RESOLVED with the decision inline; only questions that genuinely need an implementer/reviewer call before the relevant PR remain open." An implementer reading §9 as the authoritative OQ list would conclude there are only 8 open questions (OQ-1 through OQ-8) and miss OQ-12 and OQ-16 entirely. The cross-references in §6.2 are not a substitute for formal tracking in §9.

Fix: add OQ-12 and OQ-16 as formal entries in §9 with the same structure as OQ-1–8. For example:

OQ-12 — Cluster-wide fan-out for cross-group load visibility. (OPEN — M3.x, post fan-out.) The local-view fast path (§6.2) only admits target groups that the default-group leader co-leads, excluding groups with unknown load. A cluster-wide fan-out (e.g. via M2 cap_migration_v2 heartbeat extended to carry per-node per-group aggregates) would remove the exclusion. Deferred to a future milestone once the fan-out lands.

OQ-16 — MoveRange primitive for singleton routes. (OPEN — future milestone.) The isolated singleton child {H} = [H, H·0x00) cannot be cross-group migrated via SplitRange (no interior split key). A future MoveRange(routeID, target_group_id) control-plane RPC would close this. Out of scope for M3; tracked here so operators with single-key group hotspots know what the roadmap looks like.

OQ-12 was added in ce50ceb (round 10); OQ-16 was added in ea5c12a (this round). Both predate the current review.

New Issue 2 — §5.2 absent-route mechanism says "freshest committed column" — imprecise for the `evalInterval > Step` multi-column case

§5.2's mechanism sentence reads: "Mechanism: iterate live routes from the state map ∪ engine snapshot at each evaluation; for each route, look up its MatrixRow in the freshest committed column; if absent, score as zero and apply the reset."

The word "freshest" is ambiguous. The same §5.2 paragraph above it explicitly handles the evalInterval > Step case: "if evalInterval > Step and several columns accumulate between cycles, the detector processes each new column once, evaluating each column's own per-column score." If three columns (c1, c2, c3) accumulated between cycles and c2 has no MatrixRow for a route:

A literal reading of "freshest committed column" says: check only c3 → route is present in c3 → do not reset. consecutiveOver never resets on c2. Three hot columns produce a promotion that the absent c2 should have prevented.
The intended reading says: when processing c2, route absent → reset. When processing c3, route present → hot.

A strict reading of "freshest" means "only the most recent" — exactly the wrong behavior for the multi-column case. The §8.2 evalInterval > Step test (two columns per cycle, both hot) does not cover this failure mode because both columns are hot; it would need a hot/absent/hot sequence spanning an evalInterval > Step gap to catch it.

Fix: replace "freshest committed column" with "that column" or "each column currently being evaluated":

"Mechanism: for each new committed column being evaluated (in chronological order from lastProcessedAt), iterate live routes from the state map ∪ engine snapshot; for each route, look up its MatrixRow in that column; if absent, score as zero and apply the down-side reset."

The evalInterval > Step test case in §8.2 should also add a variant: with evalInterval = 2 × Step, candidateWindows = 3, a cycle that processes two columns — first column hot (route present), second column absent (no MatrixRow) — must reset consecutiveOver to zero, not advance it to 2.

New Issue 3 — §8.2 test case (ii) says "cold column" but should say "absent from MatrixRow"

§8.2 test case (ii): "N−1 hot columns then 1 cold column then N hot columns: the cold column resets the counter, total hot columns to promotion is N, not 2N−1."

The §5.2 absent-route mechanism is specifically about routes that have no MatrixRow in a column (because MemSampler.Flush skips idle routes — they never write a row for that window). This is a distinct code path from a route that is present with a below-threshold score. The difference matters for M3-PR2 implementation:

A "present but below-threshold" row: the detector reads the row, computes per_col_score below threshold, applies the down-side reset.
An "absent row" (no MatrixRow): the detector must notice the route is absent from the snapshot, synthesize per_col_score = 0, and apply the reset — a code path §5.2 explicitly adds.

If M3-PR2 only iterates rows that are present in the snapshot (the natural implementation of a for row in snapshot.rows loop), it would silently skip absent routes and never reset their consecutiveOver. Test case (ii) as written — "cold column" — would pass because the test author could implement the "cold column" as a present-but-zero row, not a missing row. The specific §5.2 code path (the if absent, score zero branch) would never be exercised.

Fix: clarify test case (ii) to say "a committed column in which the route has no MatrixRow (MemSampler.Flush produced no row for that flush window, as it does for idle routes — keyviz/sampler.go's RunFlusher)." The test must specifically create a []MatrixColumn snapshot where the relevant column contains no row for the route under test — not a row with zero ops, but a fully absent row — to prove the §5.2 mechanism's if absent { score = 0 } branch is tested by the concrete implementation.

Round-9 Nit Still Open (Non-blocking)

§6.4 still ends with: "a later cross-group move of a final child relocates an existing route and does not add to the live route count." As noted in rounds 9 and 11, M2's SplitRange always splits a route before migrating the right child (net +1 route, own routeDelta = +1 check). M3-PR4 scope; non-blocking for PR1a–PR3.

Summary

All three round-13 items are cleanly resolved. Three new issues: Issue 1 (OQ-12/OQ-16 in body but absent from §9 — doc coherence gap, easy to add) and Issues 2–3 (the §5.2 absent-route mechanism and §8.2 test case both need one clarifying word — "each column" vs. "freshest column", and "absent from MatrixRow" vs. "cold column"). Issues 2 and 3 are tightly coupled: fixing the mechanism wording (Issue 2) naturally leads to fixing the test description (Issue 3). Both are precision issues for M3-PR2 implementers, not conceptual gaps. Once these three are addressed, the doc is ready for M3-PR1a to begin.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ea5c12a841

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T18:20:01Z

+- **CAS chaining.** The second call uses the `v1` returned by the first (not the cycle's starting `v0`), so the pair is serialized against the live catalog version. If the first call's CAS fails (a concurrent manual split landed), the scheduler abandons the compound, emits `autosplit_splits_failed_total{reason="cas_conflict"}`, and re-evaluates next cycle — it does **not** issue the second call against a stale version.
+- **Partial-completion safety.** If the first call commits but the second fails (CAS conflict or RPC error), the catalog is left in a valid two-child state `[A,H) + [H,B)` — `[H,B)` is simply a narrower route that the detector re-evaluates next cycle (it is now in cooldown only if the §5.3 rule applied it; per §5.3 the intermediate `[H,B)` is exempt, so it is eligible next cycle and the detector can retry the isolation). No invariant is broken by stopping after the first call; the worst case is the hot key takes one extra cycle to isolate. This is logged `autosplit_compound_partial`.
+- Cooldown applies to the three **final** children only, per §5.3.
+- **Same-group-only, even under `--autoSplitCrossGroup` (codex P2).** Both `SplitRange` calls always pass `target_group_id = 0`; cross-group target selection (§6.2) is **incompatible with a compound operation in progress** and is never applied to it. If call 1 carried a `target_group_id`, the intermediate child `[H,B)` would become a cross-group `SplitJob` target — a route §6.3 excludes from candidacy and on which `SplitRange` is rejected — so call 2 could not run and the isolation would strand at `compound_partial` immediately. Cross-group relocation of an isolated child is a separate, later decision (§6.2): only the three *final* children, once resolved same-group and `RouteStateActive` with no in-flight `SplitJob`, may be independently selected for cross-group migration in a subsequent cycle. The route-cap budget for the compound is `+2` routes (§5.5); a later cross-group move of a final child relocates an existing route and does not add to the live route count.


Exclude the singleton child from later migrations

Fresh evidence after the earlier singleton-migration issue is this §6.4 bullet: it says all three final children may later be selected for cross-group migration, which includes the singleton {H} = [H, H·0x00). That contradicts §6.2's corrected constraint that this singleton has no valid interior split key and validateSplitKey rejects both boundaries, so an implementer following this scheduler section can repeatedly target the singleton and only get failed SplitRange calls. Narrow this to the two non-singleton final children, or explicitly mark the singleton as non-targetable.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-11T18:20:01Z

+
+Down-side hysteresis: the detector tracks a per-route `consecutiveOver` counter in its leader-local state map; a column whose **per-column score** falls below `threshold * hysteresisDownFactor` (default 0.8) clears it immediately, while per-column scores in the `[0.8×threshold, threshold)` band hold the counter (neither increment nor reset) so a route hovering at the line does not oscillate between "almost candidate" and "reset." (OQ-2 asks whether the band should instead decay the counter.)
+
+**Absent-route columns score as zero — silent gaps reset the counter (closes claude round-13 Issue 1).** `MemSampler.Flush` skips routes with no traffic in a flush window — they have **no `MatrixRow`** in that column's snapshot (`keyviz/sampler.go`'s `RunFlusher` only writes rows for routes that observed traffic). The detector treats a live route (any route present in the leader-local state map or the engine's current snapshot) that is **absent from a committed column** as `per_col_score = 0` for that column — below threshold, so the down-side hysteresis logic above resets `consecutiveOver` to zero. Without this, the counter would carry over silently across a quiet column and the "N consecutive hot columns" invariant would be violated: a route with `N-1` hot columns followed by one absent column followed by `N` hot columns would promote on the second-`N`th hot column (counter never reset), defeating the hysteresis design. The cold-start absence case (route has never appeared yet, no entry in the state map) is already handled by "advance from the first committed column"; the **silent-gap case** (route was hot before, went quiet, will be hot again) is the new scenario this rule addresses. Mechanism: iterate live routes from the state map ∪ engine snapshot at each evaluation; for each route, look up its `MatrixRow` in the freshest committed column; if absent, score as zero and apply the reset.


Synthesize zeroes for every skipped column

Fresh evidence after the prior missing-row concern is that the mechanism only looks up a route in the freshest committed column. When evalInterval > Step, §5.2 says several new columns are processed in one evaluation; if a route is hot in column 1, absent in column 2, and hot again in the freshest column 3, checking only the freshest column preserves consecutiveOver across the silent gap and can promote without consecutive hot flushes. The detector needs to synthesize/apply the zero score for each committed column between lastProcessedAt and the newest one, in chronological order.

Useful? React with 👍 / 👎.

Issue 1: correct stale line anchors in kv/sharded_coordinator.go - keyviz sampler observeMutation: 1795-1824 -> 1841-1846 - RunHLCLeaseRenewal: 1914-1953 -> 1960-1985; defaultGroup access :1915 -> :1961 Issue 2: soften companion-doc references to in-flight PR #955 form, matching the #945/#951/#953 branch-reference style (3 sites + map line). Issue 3: ground OQ-1 in the actual commitTS logic (nextStartTS/ resolveTxnCommitTS/nextTxnTSAfter all from one c.clock) and annotate §4 step 11 as deferred-pending-OQ-1 with an explicit trigger condition. Inline: fix learner LinearizableRead behaviour (engine returns ErrNotLeader, caller forwards; engine.go:1583); note shared-startTS invariant; add OQ-7 for the single-node->multi-node live cutover / rolling-upgrade strategy.

docs(design): propose hotspot split M3 — automation

3dc133e

gemini-code-assist Bot reviewed Jun 10, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 10, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 11, 2026

View reviewed changes

docs(design): address review round 2 — hotspot split M3

59990ce

chatgpt-codex-connector Bot reviewed Jun 11, 2026

View reviewed changes

docs(design): address review round 3 — hotspot split M3

afc95e3

chatgpt-codex-connector Bot reviewed Jun 11, 2026

View reviewed changes

docs(design): cover End==nil encoder path in M3 codec spec

b8df8eb

chatgpt-codex-connector Bot reviewed Jun 11, 2026

View reviewed changes

docs(design): enforce decoder-before-encoder rollout ordering in M3 c…

545e718

…odec plan

chatgpt-codex-connector Bot reviewed Jun 11, 2026

View reviewed changes

docs(design): per-column hysteresis scoring for M3 detector

9c528c7

chatgpt-codex-connector Bot reviewed Jun 11, 2026

View reviewed changes

docs(design): fix M3 smoothed-score pseudocode + forward hysteresis test

a5fa123

docs(design): fold remaining codex P2s — route-cap delta, compound sc…

dc2d52d

…ope, case 7 placement

chatgpt-codex-connector Bot reviewed Jun 11, 2026

View reviewed changes

bootjp mentioned this pull request Jun 12, 2026

docs(design): propose scaling roadmap #954

Open


		The target-group selection slots behind the same `SplitRange` interface — the scheduler code path is identical; only the `target_group_id` field flips from zero (same-group) to a chosen group once M2 is present. M3 ships the policy hook disabled (always same-group) and a follow-on M3.x / M2-completion PR enables target selection. OQ-6 asks whether to gate target selection on an explicit `--autoSplitCrossGroup` sub-flag.

		### 6.3 Concurrency with in-flight splits


		- `SplitAtHLC uint64` — the HLC timestamp at which the split that produced this route was applied. Set by the `SplitRange` handler (`adapter/distribution_server.go:136`) when it writes each child `RouteDescriptor`, using the leader-issued HLC (`HLC.Next()`, never the local wall clock — CLAUDE.md). It is encoded/decoded alongside `ParentRouteID` in `EncodeRouteDescriptor`/`DecodeRouteDescriptor` (`distribution/catalog.go:168-217`) under a bumped `catalogRouteCodecVersion` (currently 1 at `catalog.go:24`); the decoder treats a missing field on a v1 record as `SplitAtHLC = 0` (= "unknown / pre-upgrade", never in cooldown), so the change is backward-compatible with existing catalog records.

		With `SplitAtHLC` present, the new leader reconstructs cooldown deterministically: for each live route, if `now_hlc − SplitAtHLC < splitCooldown` (compared in HLC physical-ms units, monotonic per §5.3), seed that route's cooldown deadline to `SplitAtHLC + splitCooldown`. This is exact, not heuristic, and uses the durable timestamp rather than `parent_route_id` recency (which only tells you whether a route is a child, not when).


		M3 adds one durable field to `RouteDescriptor`:

		- `SplitAtHLC uint64` — the HLC timestamp at which the split that produced this route was applied. Set by the `SplitRange` handler (`adapter/distribution_server.go:136`) when it writes each child `RouteDescriptor`, using the leader-issued HLC (`HLC.Next()`, never the local wall clock — CLAUDE.md). It is encoded/decoded alongside `ParentRouteID` in `EncodeRouteDescriptor`/`DecodeRouteDescriptor` (`distribution/catalog.go:168-217`) under a bumped `catalogRouteCodecVersion` (currently 1 at `catalog.go:24`); the decoder treats a missing field on a v1 record as `SplitAtHLC = 0` (= "unknown / pre-upgrade", never in cooldown), so the change is backward-compatible with existing catalog records.


		Per parent doc §3.1.3 and §6.3, the split key comes from the observed key distribution, not a byte midpoint:

		1. Default — sub-range p50. When `--keyvizKeyBucketsPerRoute > 1`, use the route's sub-range bucket rows to find the bucket whose cumulative load first crosses 50%, and use the `hi` (upper boundary) of that bucket — `SubBucketBoundsFor(routeID, bucket)` returns `(lo, hi []byte, ok bool)` (`keyviz/sampler.go:559`); the split key is the `hi` of the median bucket, i.e. the boundary between the p50 bucket and the next one, so all of the p50 bucket's load stays in the left child. This places the boundary where load is actually halved.


		### 6.3 Concurrency with in-flight splits

		The scheduler issues at most `maxSplitsPerCycle` (§5.5) `SplitRange` calls per cycle and waits for each to return (same-group splits are synchronous; cross-group returns a `job_id` immediately per M2 §5.1). A route with an in-flight cross-group `SplitJob` (not yet `DONE`) is excluded from candidacy until the job completes, so the detector cannot stack a second split on a route mid-migration.


		No cold-start "N columns must exist" guard is needed. Because the counter consumes per-column scores (each column self-contained, no dependency on `N` prior columns), the detector advances `consecutiveOver` on the first committed column whose per-column score crosses threshold — there is no longer any "wait until `N` committed columns exist before the counter can start" prerequisite (that guard only existed because the old averaged score could not be evaluated until `N` columns had accumulated). If `Snapshot` returns fewer than `candidateWindows` committed columns at cold start, the detector simply scores the columns it has, advancing/resetting per column; the route is promoted as soon as `candidateWindows` consecutive per-column scores have cleared threshold (§4.2's "fewer than N committed columns" note is updated to match).

		The counter advances per new flush column, not per evaluation cycle. `--autoSplitEvalInterval` defaults to `--keyvizStep` but can be set independently, and when `evalInterval < Step` multiple evaluation cycles complete before a new flush column is committed. If the detector advanced `consecutiveOver` on every cycle, the same column would be re-counted: with `evalInterval = 5s` and `Step = 60s`, a candidate could be "promoted" in 15 s (3 cycles) on a single column of data — defeating the "3 consecutive flush columns" intent. The detector therefore only advances (or resets) `consecutiveOver` when a new committed flush column has appeared since the last evaluation. Mechanism: it records the `MatrixColumn.At` of the freshest committed column at the end of each evaluation; a cycle whose freshest `At` is unchanged from the last is a no-op for the counter (it may still recompute and log the smoothed score, but does not touch hysteresis state, cooldown, or emit a decision). This makes promotion timing a function of flush columns elapsed, independent of how often the detector wakes. Conversely, if `evalInterval > Step` and several columns accumulate between cycles, the detector processes each new column once, evaluating each column's own per-column score (so a burst spanning multiple skipped columns still advances/resets the counter per column, not per cycle) — see §8.2 tests.


		Down-side hysteresis: the detector tracks a per-route `consecutiveOver` counter in its leader-local state map; a column whose per-column score falls below `threshold * hysteresisDownFactor` (default 0.8) clears it immediately, while per-column scores in the `[0.8×threshold, threshold)` band hold the counter (neither increment nor reset) so a route hovering at the line does not oscillate between "almost candidate" and "reset." (OQ-2 asks whether the band should instead decay the counter.)

		Absent-route columns score as zero — silent gaps reset the counter (closes claude round-13 Issue 1). `MemSampler.Flush` skips routes with no traffic in a flush window — they have no `MatrixRow` in that column's snapshot (`keyviz/sampler.go`'s `RunFlusher` only writes rows for routes that observed traffic). The detector treats a live route (any route present in the leader-local state map or the engine's current snapshot) that is absent from a committed column as `per_col_score = 0` for that column — below threshold, so the down-side hysteresis logic above resets `consecutiveOver` to zero. Without this, the counter would carry over silently across a quiet column and the "N consecutive hot columns" invariant would be violated: a route with `N-1` hot columns followed by one absent column followed by `N` hot columns would promote on the second-`N`th hot column (counter never reset), defeating the hysteresis design. The cold-start absence case (route has never appeared yet, no entry in the state map) is already handled by "advance from the first committed column"; the silent-gap case (route was hot before, went quiet, will be hot again) is the new scenario this rule addresses. Mechanism: iterate live routes from the state map ∪ engine snapshot at each evaluation; for each route, look up its `MatrixRow` in the freshest committed column; if absent, score as zero and apply the reset.

Conversation

bootjp commented Jun 10, 2026

Summary

Key design points

Open questions

Test plan

Uh oh!

coderabbitai Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Uh oh!

bootjp commented Jun 10, 2026

Uh oh!

claude Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review: M3 Automation Design Doc

Issues

Open Questions — Recommendations

Minor / Nits

Summary

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp commented Jun 11, 2026

Uh oh!

claude Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review: M3 Automation Design Doc (Round 2)

Carried Over (Not Addressed)

New Issues

Nits (Carried)

Open Question Recommendations

Summary

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading

claude Bot commented Jun 10, 2026 •

edited

Loading

claude Bot commented Jun 11, 2026 •

edited

Loading

claude Bot commented Jun 11, 2026 •

edited

Loading

Issue 1 — Decoder tail-bytes check at `catalog.go:213-215` also needs relaxation (§7.7c)

Issue 2 — `routeDescriptorEqual` and `CloneRouteDescriptor` must include `SplitAtHLC` (§7.7b, M3-PR1 scope)

Issue 3 — `now_hlc` source for cooldown reconstruction is underspecified (§7.7d)

claude Bot commented Jun 11, 2026 •

edited

Loading

Issue — `EncodeRouteDescriptor` early return will silently drop `SplitAtHLC` for `End == nil` routes (§7.7c, M3-PR1 scope)

claude Bot commented Jun 11, 2026 •

edited

Loading