docs(design): propose hotspot split M3 — automation#951
Conversation
|
Warning Review limit reached
More reviews will be available in 19 minutes and 3 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more credits in the billing tab to continue. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@claude review |
|
Claude finished @bootjp's task in 3m 20s —— View job Review: M3 Automation Design Doc
This is a high-quality, well-structured proposal. All cited Issues1. Score normalization formula is underspecified (§5.1) The doc says the score is computed over "a 1-minute equivalent window (trailing flush columns summed and normalized to the flush Step)." But the exact formula is not given. When This is exactly what OQ-8 is pointing at, but the issue goes beyond units — the doc should commit to a specific formula. If the threshold stays in "ops/min", the normalization formula must be stated explicitly so unit tests in M3-PR2 can pin it. Without that, two contributors can independently implement different normalizations that both pass tests but produce different sensitivities. Recommendation: close OQ-8 in this doc by committing to 2.
The detector needs a stable reference to the last flush time. It's worth calling out explicitly whether the detector gets this from a This is a concrete implementation question that will arise in M3-PR2 — resolving it here avoids a revision mid-PR. 3. The design proposes seeding cooldown after election from "any route whose This means the durable-cooldown reconstruction requires adding a timestamp field to Recommendation: add 4. Single-key isolation (§5.4.2) requires two The proposed single-key isolation approach is: "split key = hot key, and a follow-on split at hot key's successor so the child contains only that key." This is two Options worth deciding now:
This affects both the scheduler implementation and the acceptance criteria in §8.4. The design should commit to one approach. 5. Detector state map doesn't describe how manual splits are reconciled (§6.3) The Recommendation: at the start of each evaluation cycle the detector should reconcile its state map against the live route set (e.g. discard state for any 6. §3.3 correctly notes that This is minor given detection is off-path, but worth noting: the five-lens performance review in M3-PR2 should benchmark or at least measure the lock contention under load (e.g. with Open Questions — RecommendationsOQ-1 ( OQ-2 (down-side hysteresis shape): Hold (current proposal) is the right choice. Decaying by one per cycle introduces an implicit max-hold time that's a function of OQ-3 (no-evidence fallback): Decline (current proposal). Allowing a min/max-observed-key midpoint as fallback re-introduces the midpoint-only problem the parent doc §3.1.3 explicitly rejects for unobserved subspaces. Better to push the operator toward enabling OQ-5 (leader-local vs cluster-wide): Leader-local is sufficient for M3. The rationale in §4.1 is well-argued (the leader serves the load it would split on). Flag this as a known limitation in the doc, not as a deferred OQ — it's already a deliberate non-goal. OQ-6 (cross-group gate): Recommend gating behind Minor / Nits
SummaryFundamentally sound design. The key things to resolve before M3-PR1 begins:
|
There was a problem hiding this comment.
Code Review
This pull request proposes a design document for Milestone 3 of the Hotspot Shard Split, focusing on automation. The design leverages the existing keyviz sampler to detect hot routes, implements a pure detector with hysteresis and cooldown, and schedules splits via the SplitRange RPC. The review feedback highlights three key design improvements: implementing a garbage-collection mechanism for the leader-local state map to prevent memory leaks, addressing the lack of creation timestamps in RouteDescriptor to enable accurate cooldown seeding on new leaders, and restricting auto-split eligibility to routes in RouteStateActive to avoid interfering with transitional migration states.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| The detector's hysteresis counters and cooldown deadlines live in a **leader-local in-memory state map** — they are **not** replicated through Raft. On a default-group leadership change: | ||
|
|
||
| - The deposed leader's detector goroutine stops (it observes the leader-loss callback, the same mechanism the lease invalidation uses). | ||
| - The **new** leader starts its detector with an **empty** state map. Consecutive-over counters reset to zero, so a route must re-accumulate `candidateWindows` cycles before it can be promoted again. This is intentionally conservative: after an election, the new leader re-earns confidence before splitting, avoiding a split based on counters it never observed. | ||
| - **Cooldowns are partially lost** across an election (the new leader does not know a route was recently split). To avoid a too-soon re-split, the new leader seeds cooldown from the catalog: any route whose `parent_route_id` lineage (parent doc §5, `RouteDescriptor.parent_route_id`) indicates a recent split, or whose creation is recent per the catalog, starts in cooldown. (OQ-7: is `parent_route_id` + a catalog-recorded split timestamp enough to reconstruct cooldown, or do we need a durable `last_split_at` per route?) In-flight cross-group `SplitJob`s are durable (M2) so the new leader sees them and excludes those routes (§6.3) regardless of detector state. | ||
|
|
||
| Keeping detector state leader-local (not Raft-replicated) is a deliberate cost/safety trade: replicating per-cycle counters would add Raft traffic for what is a best-effort heuristic, and the worst case of a lost counter is a *delayed* split (safe), never an *erroneous* one. We state this explicitly so a reviewer does not mistake the non-replication for an oversight. |
There was a problem hiding this comment.
To prevent unbounded memory growth in the leader-local in-memory state map, the design should specify a pruning/garbage-collection mechanism. Since route IDs are deleted from the catalog after a split (and replaced by child routes), keeping their hysteresis counters and cooldown deadlines in the map indefinitely will cause a memory leak.\n\nThe detector should periodically (e.g., at the end of each evaluation cycle) compare the keys of the state map against the active routes in the current catalog snapshot and evict any route IDs that are no longer present.
There was a problem hiding this comment.
Accepted. Added §7.7(a): the detector reconciles its leader-local state map against the live catalog CatalogSnapshot at the start of every evaluation cycle and evicts any RouteID not in the live set, so the map is strictly bounded by the live route count (≤ maxAutoRoutes). autosplit_tracked_routes reflects the post-GC size. This also covers the disappeared/manual-split route case. Fixed in b56950a.
|
|
||
| - The deposed leader's detector goroutine stops (it observes the leader-loss callback, the same mechanism the lease invalidation uses). | ||
| - The **new** leader starts its detector with an **empty** state map. Consecutive-over counters reset to zero, so a route must re-accumulate `candidateWindows` cycles before it can be promoted again. This is intentionally conservative: after an election, the new leader re-earns confidence before splitting, avoiding a split based on counters it never observed. | ||
| - **Cooldowns are partially lost** across an election (the new leader does not know a route was recently split). To avoid a too-soon re-split, the new leader seeds cooldown from the catalog: any route whose `parent_route_id` lineage (parent doc §5, `RouteDescriptor.parent_route_id`) indicates a recent split, or whose creation is recent per the catalog, starts in cooldown. (OQ-7: is `parent_route_id` + a catalog-recorded split timestamp enough to reconstruct cooldown, or do we need a durable `last_split_at` per route?) In-flight cross-group `SplitJob`s are durable (M2) so the new leader sees them and excludes those routes (§6.3) regardless of detector state. |
There was a problem hiding this comment.
The proposal to seed cooldowns on a new leader by checking if a route's creation is "recent per the catalog" has a design gap: the RouteDescriptor struct in catalog.go does not contain any creation timestamp, split timestamp, or catalog version of origin. Without this metadata, a newly elected leader cannot determine whether a route with a parent_route_id was split 1 minute ago (and should be in cooldown) or 1 hour ago (cooldown expired).\n\nTo resolve OQ-7 and make this safety mechanism viable, the design should either:\n1. Add a durable split_timestamp or created_at field to the RouteDescriptor (or catalog metadata).\n2. Store a mapping of catalog_version -> timestamp in the catalog metadata so the leader can resolve the age of the route's origin version.
There was a problem hiding this comment.
Accepted — you are right that the old 'creation is recent per the catalog' wording had no schema basis (RouteDescriptor at distribution/catalog.go:71-78 carries only RouteID, Start, End, GroupID, State, ParentRouteID — no timestamp). Took your option 1: added a durable RouteDescriptor.SplitAtHLC uint64 (HLC, set by the SplitRange handler at adapter/distribution_server.go:136), encoded under a bumped catalogRouteCodecVersion (catalog.go:24) with v1 records decoding to SplitAtHLC=0 for back-compat. The new leader reconstructs cooldown deterministically: now_hlc − SplitAtHLC < splitCooldown. See §7.7(b); scoped to M3-PR1 (§8.1). This resolves OQ-7. Fixed in b56950a.
|
|
||
| The target-group selection slots **behind the same `SplitRange` interface** — the scheduler code path is identical; only the `target_group_id` field flips from zero (same-group) to a chosen group once M2 is present. M3 ships the policy hook disabled (always same-group) and a follow-on M3.x / M2-completion PR enables target selection. OQ-6 asks whether to gate target selection on an explicit `--autoSplitCrossGroup` sub-flag. | ||
|
|
||
| ### 6.3 Concurrency with in-flight splits |
There was a problem hiding this comment.
To ensure safety during migrations, the design should explicitly state that only routes in RouteStateActive are eligible for auto-split detection. Routes in transitional states such as RouteStateWriteFenced, RouteStateMigratingSource, or RouteStateMigratingTarget must be excluded from candidacy. This prevents concurrent automated splits from interfering with active M2 migration jobs.
There was a problem hiding this comment.
Accepted. Added §5.0 as the first gate of every evaluation cycle: only RouteStateActive routes (distribution/catalog.go:51-58) are candidates; RouteStateWriteFenced/RouteStateMigratingSource/RouteStateMigratingTarget are excluded from scoring and promotion entirely. This subsumes the §6.3 in-flight-SplitJob exclusion. Asserted in §8.2 tests and §8.4 acceptance criterion 8. Fixed in b56950a.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3dc133ecb7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| - **M3 reads detection input from the keyviz sampler**, not from `Route.Load`. | ||
| - The engine's `RecordAccess` / `splitRange` / `Route.Load` path is **retired** (§3.4). No new call site is added for it. | ||
|
|
||
| This means **auto-split requires keyviz sampling to be active**. `--autoSplit` implies the sampler is constructed and flushing (§7.1). We do *not* require the operator to also pass `--keyvizEnabled`; enabling auto-split turns on the minimum sampling auto-split needs (route-granular counts; sub-range buckets and Top-K stay opt-in via the existing keyviz flags and only sharpen split-key selection — see §5.4). |
There was a problem hiding this comment.
Enable split-key evidence with autoSplit defaults
With the current keyviz defaults (DefaultKeyBucketsPerRoute = 1 and --keyvizHotKeysEnabled=false), turning on only --autoSplit constructs a route-granular sampler, but §5.4.3 later says the detector declines whenever neither sub-range buckets nor Top-K are enabled. That means the default advertised --autoSplit mode can never produce a split key or satisfy the M3 acceptance criteria unless operators also discover and enable another keyviz flag; either make --autoSplit enable a split-key evidence source or document it as required configuration.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Accepted — correct catch. With bare keyviz defaults (DefaultKeyBucketsPerRoute=1 at keyviz/sampler.go:91, --keyvizHotKeysEnabled=false at main.go:205) the detector had no split-key evidence and §5.4.3 would decline every split, making 'just --autoSplit' a permanent no-op. Fixed by making --autoSplit imply the evidence source it needs: when bucketing is left at the default, auto-split constructs the sampler with KeyBucketsPerRoute=autoSplitDefaultBuckets (default 16, under the 256 cap at sampler.go:115), so §5.4.1 sub-range-p50 produces a split key out of the box. Added the implied-vs-explicit config table in §3.1, the keyvizEnabled||autoSplit wiring condition + --autoSplitDefaultBuckets flag in §7.1, and updated OQ-3. Top-K stays opt-in (only the §5.4.2 single-key path). Fixed in b56950a.
| Per parent doc §3.1.3 and §6.3, the split key comes from the **observed key distribution**, not a byte midpoint: | ||
|
|
||
| 1. **Default — sub-range p50.** When `--keyvizKeyBucketsPerRoute > 1`, use the route's sub-range bucket rows to find the bucket boundary nearest the cumulative-load median, and use that boundary key (reconstructed via `MemSampler.SubBucketBoundsFor`, `keyviz/sampler.go:559`) as the split key. This places the boundary where load is actually halved. | ||
| 2. **Single-key skew — isolate the hot key.** When Top-K is enabled (`--keyvizHotKeysEnabled`) and one key's share of the route's writes is `>= topKeyShare` (parent doc §6.3 default 0.8), split so the hot key is isolated into its own narrow child (split key = hot key, and a follow-on split at hot key's successor so the child contains only that key). This is the single-hot-key hotspot case the whole feature targets. |
There was a problem hiding this comment.
Exempt compound hot-key splits from cooldown
This single-key isolation plan requires two SplitRange calls: first at the hot key, then a follow-on split at its successor. However §5.3 immediately puts both children into the 10-minute cooldown after a split is scheduled, and §6.3 defaults to one split per cycle; unless the scheduler treats this as an atomic/explicitly exempt compound operation, the child containing the hot key cannot be split again until cooldown expires, so the hot key is not isolated as claimed.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Accepted — the contradiction was real: §5.4.2 needs two SplitRange calls and §5.3 + §6.3 would have locked the hot-key child in cooldown after the first. Reworked it as one atomic compound decision: §5.4.2 now emits both split keys (at H and at H·0x00) together; §6.4 executes them as a chained-CAS back-to-back pair counted as 1 budget unit (so default maxSplitsPerCycle=1 still works), with partial-completion safety; §5.3 makes the intermediate child [H,B) cooldown-exempt while applying cooldown to the three final children. So the hot key is actually isolated. Added autosplit_compound_partial_total and §8.2 assertions. Fixed in b56950a.
- §5.0: explicit RouteStateActive-only candidacy precondition; exclude WRITE_FENCED/MIGRATING_* (gemini medium #3). - §3.1/§7.1: --autoSplit implies sub-range buckets (autoSplitDefaultBuckets, default 16) + sampler-wiring condition keyvizEnabled||autoSplit, with an implied-vs-explicit config table; closes the bare-keyviz-defaults no-op trap (codex P2 #1). - §5.3/§5.4.2/§6.4: compound single-key isolation as one atomic decision (1 budget unit, 2 chained CAS SplitRange calls); intermediate child is cooldown-exempt so the hot key can actually be isolated (codex P2 #2). - §7.7a: bounded leader-local state map via per-cycle GC reconciliation against the live catalog (gemini medium #1); also reconciles disappeared/manual-split routes. - §7.7b: add durable RouteDescriptor.SplitAtHLC for deterministic post-election cooldown reconstruction; replaces the unimplementable 'recent per catalog' seeding (gemini medium #2); resolves OQ-7; scoped to M3-PR1. - §7.3 metrics normalized to bounded {reason} enum; added autosplit_compound_partial_total. §8.1/§8.4 updated for new scope + criteria.
|
@claude review |
|
Claude finished @bootjp's task in 4m 53s —— View job Review: M3 Automation Design Doc (Round 2)
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b56950acb2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| - **`SplitAtHLC uint64`** — the HLC timestamp at which the split that produced this route was applied. Set by the `SplitRange` handler (`adapter/distribution_server.go:136`) when it writes each child `RouteDescriptor`, using the leader-issued HLC (`HLC.Next()`, never the local wall clock — CLAUDE.md). It is encoded/decoded alongside `ParentRouteID` in `EncodeRouteDescriptor`/`DecodeRouteDescriptor` (`distribution/catalog.go:168-217`) under a bumped `catalogRouteCodecVersion` (currently 1 at `catalog.go:24`); the decoder treats a missing field on a v1 record as `SplitAtHLC = 0` (= "unknown / pre-upgrade", never in cooldown), so the change is backward-compatible with existing catalog records. | ||
|
|
||
| With `SplitAtHLC` present, the new leader reconstructs cooldown deterministically: for each live route, if `now_hlc − SplitAtHLC < splitCooldown` (compared in HLC physical-ms units, monotonic per §5.3), seed that route's cooldown deadline to `SplitAtHLC + splitCooldown`. This is exact, not heuristic, and uses the durable timestamp rather than `parent_route_id` recency (which only tells you *whether* a route is a child, not *when*). |
There was a problem hiding this comment.
Convert HLCs before comparing cooldown durations
When reconstructing cooldown after a leader change, this formula appears to subtract raw HLC values and add splitCooldown directly to an HLC. In this repo HLCs encode physical milliseconds in the upper bits and a 16-bit logical counter in the low bits (kv/hlc.go layout), so a raw difference advances by 1<<16 per millisecond; a 10-minute cooldown would look expired after only about 9 ms, allowing immediate re-splits after elections. The design should shift/extract the physical millisecond component before duration math, then convert back only for diagnostics if needed.
Useful? React with 👍 / 👎.
|
|
||
| M3 adds one durable field to `RouteDescriptor`: | ||
|
|
||
| - **`SplitAtHLC uint64`** — the HLC timestamp at which the split that produced this route was applied. Set by the `SplitRange` handler (`adapter/distribution_server.go:136`) when it writes each child `RouteDescriptor`, using the leader-issued HLC (`HLC.Next()`, never the local wall clock — CLAUDE.md). It is encoded/decoded alongside `ParentRouteID` in `EncodeRouteDescriptor`/`DecodeRouteDescriptor` (`distribution/catalog.go:168-217`) under a bumped `catalogRouteCodecVersion` (currently 1 at `catalog.go:24`); the decoder treats a missing field on a v1 record as `SplitAtHLC = 0` (= "unknown / pre-upgrade", never in cooldown), so the change is backward-compatible with existing catalog records. |
There was a problem hiding this comment.
Use a fenced or committed timestamp for SplitAtHLC
This proposes writing a durable catalog field with HLC.Next(), but kv/hlc.go documents that new persistence-grade timestamp allocations must use NextFenced() so an expired physical ceiling fails closed. Since SplitAtHLC is persisted in RouteDescriptor and later drives cooldown reconstruction, use the split transaction's committed timestamp or allocate it with the fenced path; otherwise the autosplit metadata can be written with an unfenced stale-leader timestamp in the exact contexts the HLC fence is meant to prevent.
Useful? React with 👍 / 👎.
| ### 7.1 Default OFF behind a flag | ||
|
|
||
| - `--autoSplit` (bool, default `false`). When false, none of the detector/scheduler goroutines start; behavior is identical to today. When true, the default-group leader runs the detector + scheduler loop, and the keyviz sampler is constructed and flushed if it is not already. | ||
| - **Sampler-wiring condition.** The sampler is built when `keyvizEnabled || autoSplit` (today it is built only on `keyvizEnabled`; see `main.go:2005` `newKeyvizSampler` and the analogous wiring in `cmd/server/demo.go`). M3-PR3's flag-wiring change touches both `main.go` and `cmd/server/demo.go`. When auto-split forces the sampler on and the operator left `--keyvizKeyBucketsPerRoute` at its default of 1, auto-split raises it to `autoSplitDefaultBuckets` (default 16) so split-key evidence exists by default (§3.1 table). An explicit `--keyvizKeyBucketsPerRoute 1` is honored as-is (operator opts into route-granular-only → §5.4.3 decline). |
There was a problem hiding this comment.
Track whether key bucket count was explicitly set
This requires distinguishing an omitted --keyvizKeyBucketsPerRoute from an explicit --keyvizKeyBucketsPerRoute 1, but the current startup flags are plain flag.Int values in main.go/cmd/server/demo.go, where both cases have the same final value. Without adding explicit flag-presence tracking (for example via flag.Visit or a config wrapper), the implementation has to either always raise 1 to autoSplitDefaultBuckets and ignore the documented operator opt-out, or honor 1 and reintroduce the default no-split trap this section is trying to close.
Useful? React with 👍 / 👎.
|
@claude review Round-2 review addressed in 59990ce (doc-only; codex P2 #1 — "Convert HLCs before comparing cooldown durations" — ACCEPTEDVerified against codex P2 #2 — "Use a fenced or committed timestamp for SplitAtHLC" — ACCEPTEDVerified: codex P2 #3 — "Track whether key bucket count was explicitly set" — ACCEPTEDVerified: Claude round-2 review — incorporated
No rebuttals this round — all findings were valid. |
|
Claude finished @bootjp's task in 4m 13s —— View job Review: M3 Automation Design Doc (Round 3)
All round-2 findings are cleanly addressed. The Issue 1 — Decoder tail-bytes check at
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 59990ceaf6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| Per parent doc §3.1.3 and §6.3, the split key comes from the **observed key distribution**, not a byte midpoint: | ||
|
|
||
| 1. **Default — sub-range p50.** When `--keyvizKeyBucketsPerRoute > 1`, use the route's sub-range bucket rows to find the bucket whose **cumulative load first crosses 50%**, and use the **`hi` (upper boundary)** of that bucket — `SubBucketBoundsFor(routeID, bucket)` returns `(lo, hi []byte, ok bool)` (`keyviz/sampler.go:559`); the split key is the `hi` of the median bucket, i.e. the boundary *between* the p50 bucket and the next one, so all of the p50 bucket's load stays in the left child. This places the boundary where load is actually halved. |
There was a problem hiding this comment.
Handle p50 buckets that end at the route boundary
When the median load falls in the last sub-bucket, SubBucketBoundsFor returns that bucket's hi as the route's own End (or nil for an unbounded tail), and §5.4 later requires split_key < route.End; SplitRange also rejects split keys at the boundary. In an upper-tail hotspot where the last bucket carries >50% of the load, the default --autoSplit path will therefore choose an invalid split key and drop the decision instead of splitting a common hot range. The p50 path needs an interior fallback for first/last median buckets, such as using the adjacent lower boundary when hi is the route end.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Already covered in §5.4.1 paragraph 2 ("Boundary-bucket fallback (review, codex P2)") immediately under the line you flagged: when the p50 bucket's hi equals route.End (or is nil for an unbounded tail), the detector falls back to the median bucket's lo — an interior boundary that places the dominant last bucket in the right child while satisfying route.Start < split_key < route.End. The symmetric first-bucket case (lo == route.Start) falls back to hi. §8.2 has both "p50 in the last bucket → interior lo fallback" and "p50 in the first bucket → interior hi fallback" tests. Could you confirm this resolves the finding?
| Per parent doc §3.1.3 and §6.3, the split key comes from the **observed key distribution**, not a byte midpoint: | ||
|
|
||
| 1. **Default — sub-range p50.** When `--keyvizKeyBucketsPerRoute > 1`, use the route's sub-range bucket rows to find the bucket whose **cumulative load first crosses 50%**, and use the **`hi` (upper boundary)** of that bucket — `SubBucketBoundsFor(routeID, bucket)` returns `(lo, hi []byte, ok bool)` (`keyviz/sampler.go:559`); the split key is the `hi` of the median bucket, i.e. the boundary *between* the p50 bucket and the next one, so all of the p50 bucket's load stays in the left child. This places the boundary where load is actually halved. | ||
| 2. **Single-key skew — isolate the hot key (compound split).** When Top-K is enabled (`--keyvizHotKeysEnabled`) and one key's share of the route's writes is `>= topKeyShare` (parent doc §6.3 default 0.8), split so the hot key `H` lands in its own single-key child. Isolating `H` from a route `[A, B)` takes **two boundaries** — one at `H` and one at `H`'s successor `H·0x00` — so the middle child is exactly `[H, H·0x00)` = `{H}`. The detector emits this as **one atomic compound decision** carrying both split keys, not two independent decisions across cycles. The scheduler executes it as a back-to-back pair of `SplitRange` calls (first `[A,B) → [A,H) + [H,B)`, then `[H,B) → [H, H·0x00) + [H·0x00, B)`), counted and budgeted as described in §6.4 (compound = 1 budget unit, cooldown exempt for the intermediate child). This is the single-hot-key hotspot case the whole feature targets. |
There was a problem hiding this comment.
Support hot keys already at the lower route boundary
This compound flow always starts by splitting at H, but when the hot key is already the route start (H == A) that first SplitRange is invalid because the existing handler rejects split keys equal to parent.Start. That case occurs naturally after a previous split at the hot key (or for a route whose lower bound is the key), and the detector's own boundary validation would drop the isolation even though a single split at H·0x00 would isolate {H}. The algorithm should special-case H == route.Start instead of requiring two boundaries unconditionally.
Useful? React with 👍 / 👎.
| M3 adds one durable field to `RouteDescriptor`: | ||
|
|
||
| - **`SplitAtHLC uint64`** — the **committed HLC timestamp of the `SplitRange` transaction that produced this route**. This must be a leader-fenced, Raft-committed timestamp, **not** an arbitrary `HLC.Next()` allocation (review correctly flagged this: `HLC.Next()` bypasses the physical-ceiling fence — `kv/hlc.go:110-128` — so a stale leader could mint a `SplitAtHLC` inside a window the ceiling fence is meant to forbid, the exact case the fence exists to prevent). The safe source is already at hand: the `SplitRange` handler commits the catalog mutation through `coordinator.Dispatch` (`adapter/distribution_server.go:229-235`), and the coordinator allocates that transaction's commit ts via the **fenced** path (`resolveTxnCommitTS`, `kv/sharded_coordinator.go:1091-1107` → `nextTxnTSAfter`, `:1365-1379` → `NextFenced()`). So `SplitAtHLC` is **defined to be that transaction's commit ts** — the same value the split mutation is durably applied at. | ||
| - **Mechanism (so the stored value equals the actual commit ts).** Today `CoordinateResponse` returns only `CommitIndex`, not the commit ts (`kv/coordinator.go:165-167`), so the handler cannot read the ts back after dispatch. Instead the handler **pre-allocates** the fenced commit ts via `HLC.NextFenced()` on the default-group leader, stamps it into both child `RouteDescriptor.SplitAtHLC` *before* encoding, and passes it as the explicit `OperationGroup.CommitTS` (with `StartTS = snapshot.ReadTS`). The coordinator honors a caller-supplied `CommitTS` (it `Observe`s it to keep the clock monotonic, `kv/sharded_coordinator.go:1098-1102`) rather than minting a fresh one, so the value encoded in the descriptors is provably the transaction's commit ts. If `NextFenced()` returns `ErrCeilingExpired` (the leader's physical ceiling has lapsed), the handler fails the `SplitRange` closed (`FailedPrecondition`) instead of writing an unfenced timestamp — a brief, safe rejection, not a corrupt record. |
There was a problem hiding this comment.
Allocate SplitAtHLC after observing the snapshot timestamp
Pre-allocating CommitTS with a single HLC.NextFenced() can still produce a timestamp <= snapshot.ReadTS after restart or leader change when the durable catalog's latest timestamp has the same physical HLC component but a higher logical counter than this process has observed. Dispatch rejects caller-supplied commit timestamps that are not greater than StartTS, so SplitRange would fail even though a valid fenced timestamp exists; mirror nextTxnTSAfter(readTS) semantics by observing readTS and retrying NextFenced() before stamping SplitAtHLC.
Useful? React with 👍 / 👎.
|
@claude review Round-3 review addressed in Issue 1 — second strict decoder check at Issue 2 — Issue 3 — Plus three codex P2 inline findings at HEAD (all ACCEPTED):
Status stays |
|
Claude finished @bootjp's task in 5m 43s —— View job Review: M3 Automation Design Doc (Round 4)
All six round-3 findings and the three codex P2 inline findings are cleanly addressed. Verified against the codebase:
One new issue introduced by the §7.7c encoder spec. Issue —
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: afc95e3b43
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| - **New code (v2 decoder) reading a future v3+ record** — reads `SplitAtHLC`, drains whatever v3 appended, returns a valid `RouteDescriptor`. | ||
| - **Old code (the v1-era decoder shipped *before* the bump) reading a v2 record** — this is the rolling-upgrade case the fence is about. The relaxed v1-era decoder, having no `version >= 2` branch, reads through `End`, then drains the 8-byte `SplitAtHLC` tail instead of erroring. It loses only the cooldown hint (equivalent to `SplitAtHLC = 0`, which is safe per (b)). | ||
| - **Encoder.** `EncodeRouteDescriptor` writes `catalogRouteCodecVersion` (now `2`) and appends `SplitAtHLC` as a fixed 8-byte big-endian tail after the `End` block; `routeDescriptorEncodedSize` adds `catalogUint64Bytes` (`distribution/catalog.go:708`). | ||
| - **Upgrade ordering fence (documented ops constraint).** Even with accept-and-ignore, the safe rollout is: deploy the relaxed-decoder build (the relaxation of *both* checks above) to **all** nodes first; only then does any node begin issuing `SplitRange` writes that emit v2 records. Because M3-PR1 ships the decoder relaxation and the v2 writer is only exercised once a split runs (M3-PR3 wires the scheduler; manual `SplitRange` is operator-initiated), the practical fence is "complete the M3-PR1 rollout cluster-wide before issuing any split." This is stated in the M3-PR1 PR description and the §8.4 acceptance criteria. |
There was a problem hiding this comment.
Gate v2 route writes during rolling upgrades
In a rolling upgrade where one M3-PR1 node is the default-group leader and an operator uses the already-existing manual SplitRange before every node has the relaxed decoder, this fence is not enforced: buildCatalogSplitOps always encodes children through distribution.EncodeRouteDescriptor (adapter/distribution_server.go:244-259), so the encoder bump in the same PR can write v2 records while current pre-M3 nodes still reject both a non-1 version byte and trailing bytes in DecodeRouteDescriptor. That can break catalog snapshots/watchers on the not-yet-upgraded nodes despite the doc calling the change rolling-upgrade safe; keep the writer on v1/gated until cluster-wide decoder rollout, or split decoder relaxation and v2 emission into separate releases.
Useful? React with 👍 / 👎.
|
Addressed the remaining review issue: §7.7c encoder spec now requires appending the SplitAtHLC tail on BOTH End paths (the End==nil early return at distribution/catalog.go:185 would otherwise skip it; rightmost route / split right child always hit it), with the if/else restructure + Go snippet; added codec matrix Case 1a (v2 End==nil round-trip) and the §8.1 M3-PR1 checklist line requiring an End==nil round-trip test. Doc-only, Status: Proposed. @claude review |
|
Claude finished @bootjp's task in 5m 37s —— View job Review: M3 Automation Design Doc (Round 5)
Confirmed Resolved — Round 4
The decoder side is correctly unaffected: One Remaining Issue — Codex Comment Still UnaddressedThe codex comment posted on The concrete problem: M3-PR1 ships both the decoder relaxation and the v2 encoder bump in the same binary (
The doc's current response (§7.7c "Upgrade ordering fence") says: "the practical fence is 'complete the M3-PR1 rollout cluster-wide before issuing any split.'" This is a documented ops constraint, not a technical one. The codex asked for technical enforcement, and the doc doesn't provide it. Gap: the doc's round-3 revision explicitly wrote "shipping an accept-and-ignore-forward-version decoder first, then the v2 encoder" (from bootjp's round-3 dispositions), implying the two should be sequenced at the release level — but the M3-PR1 table ships both. There is an inconsistency between the intent and the plan. Recommended fix (two options, choose one and state it explicitly): Option A — split M3-PR1 into two PRs (technically enforces the ordering):
Once every node has PR1a, no pre-M3 decoder exists in the cluster: a PR1b-emitted v2 record is safe to decode everywhere. The "decoder first" intent is technically enforced rather than advisory. Option B — keep M3-PR1 as one PR but document the failure mode explicitly. If the single-PR approach is retained, the "upgrade ordering fence" paragraph should explicitly state what happens if the constraint is violated: "if a manual Either option closes the gap. Option A is preferred because it makes the upgrade safe by construction; Option B is acceptable if the single-PR release boundary is a hard constraint. Minor NitCodec matrix is missing Case 4a (old-reads-new direction with
Case 4a is not strictly necessary (the individual pieces are covered by Cases 1a and 4), but it explicitly exercises the combination that matters most for the rightmost-route mid-upgrade scenario. SummaryThe round-4 issue is cleanly resolved — the if/else encoder restructure, Case 1a, and §8.1 checklist item are all correct and complete. One item remains: the codex comment on |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b8df8eb9e4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| N = candidateWindows // trailing flush columns summed (default 3, §5.2) | ||
| Step_s = keyvizStep.Seconds() // flush-column width in seconds (default 60) | ||
| write_rate = (Σ write_ops over N columns) / (N × Step_s) × 60 // ops/min | ||
| read_rate = (Σ read_ops over N columns) / (N × Step_s) × 60 // ops/min | ||
| score = write_rate × Ww + read_rate × Wr // weighted ops/min |
There was a problem hiding this comment.
Separate scoring window from consecutive-window hysteresis
With the default candidateWindows = 3, this formula averages the last 3 columns and then §5.2 requires that averaged score to be over threshold for 3 consecutive flush columns. That no longer implements “3 consecutive windows over threshold”: a cold window can be hidden by the moving average and still advance the counter, while cold start needs 5 hot flushes before promotion because the counter does not advance until 3 columns exist. Use per-column scores for the consecutive test or split this into separate scoreWindowColumns and candidateWindows knobs so hysteresis matches the documented trigger.
Useful? React with 👍 / 👎.
|
@claude review Adopted Option A (the round-5 recommendation, and codex |
|
Claude finished @bootjp's task in 4m 44s —— View job Review: M3 Automation Design Doc (Round 6)
Confirmed Resolved — Round 5Option A (decoder-before-encoder split, round-5 Issue) — cleanly addressed in 545e718. Verified across the doc:
All prior-round findings are cleanly resolved. The doc is structurally complete. Issue — Scoring Window and Hysteresis Window Are Coupled; Behavior Diverges from Parent Doc §6.2 (§5.1, §5.2, M3-PR2 scope)This is the codex P2 comment from commit The problem — two symptoms from one cause. §5.1 defines Symptom 1 — a cold column advances the counter. The rolling average smooths over N columns, so a single cold column can be absorbed by two flanking hot columns and still yield an above-threshold average. Example with
Column 4 was cold (0 ops) and still advanced Symptom 2 — cold start requires Root cause. The Recommended fix — per-column score for the consecutive test; N-column average for display only. Separate the two uses of Advance/reset
The §5.1 note "if a future need arises to decouple them, add What needs to change in the doc:
SummaryRound-5 issue is cleanly resolved — the Option A two-PR codec split is precisely specified and technically enforced. One issue remains: the codex P2 "Separate scoring window from consecutive-window hysteresis" comment on |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 545e71880d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| ### 6.3 Concurrency with in-flight splits | ||
|
|
||
| The scheduler issues at most `maxSplitsPerCycle` (§5.5) `SplitRange` calls per cycle and waits for each to return (same-group splits are synchronous; cross-group returns a `job_id` immediately per M2 §5.1). A route with an in-flight cross-group `SplitJob` (not yet `DONE`) is excluded from candidacy until the job completes, so the detector cannot stack a second split on a route mid-migration. |
There was a problem hiding this comment.
Keep the per-cycle RPC cap consistent
In the single-key path, §6.4 now explicitly executes two back-to-back SplitRange RPCs while counting the compound decision as one budget unit, so this statement is no longer true under the default maxSplitsPerCycle = 1. If an implementer follows this line literally, the second split needed to isolate {H} will be blocked; if they follow §6.4, the advertised per-cycle SplitRange call cap is not actually enforced. Fresh evidence after the prior cooldown fix is that §6.3 still caps RPC calls rather than logical split decisions, so the design should either separate the logical-route budget from an RPC cap or explicitly exempt compound isolation here.
Useful? React with 👍 / 👎.
| } | ||
| ``` | ||
|
|
||
| `flag.Visit` (and the equivalent `flag.CommandLine.Lookup("…")` presence check) is already the idiomatic Go answer here; no config-wrapper type is needed. The same `keyBucketsExplicit` computation is mirrored in `cmd/server/demo.go` (which has its own flag set, §7.1). With this, the implied-vs-explicit config is: |
There was a problem hiding this comment.
Drop Lookup as a flag-presence check
flag.CommandLine.Lookup("keyvizKeyBucketsPerRoute") is not equivalent to the flag.Visit check above: Go's flag package defines Visit as visiting only flags that have been set, while Lookup returns the named flag if it is defined, regardless of whether the operator supplied it. If the implementation follows this parenthetical, keyBucketsExplicit will always be true and bare --autoSplit will keep the default bucket count of 1, reintroducing the no-split default that this section is trying to eliminate.
Useful? React with 👍 / 👎.
|
@claude review Addressed codex P2 "Separate scoring window from consecutive-window hysteresis" (comment 3393859791) and the matching round-6 claude finding by splitting the score into two pinned formulas in §5.1: a per-column score (a single committed flush column's weighted ops/min) that is now the only input to the §5.2 hysteresis counter, and a smoothed N-column score kept for operator-facing slog/display only — it no longer feeds any decision. I chose per-column-for-hysteresis (codex's simpler option) over splitting into Also folded in the two new codex P2 comments on this HEAD:
Status remains Proposed; doc-only. |
|
Claude finished @bootjp's task in 4m 1s —— View job Review: M3 Automation Design Doc (Round 7)
All three round-6 / codex P2 items are cleanly resolved. Two new issues in the revised §5.1 pseudocode are worth fixing before M3-PR2. Confirmed Resolved — Round 6"Separate scoring window from consecutive-window hysteresis" (§5.1–5.2, codex P2) — cleanly addressed. Verified across the doc:
"Drop Lookup as a flag-presence check" (§3.1, codex P2) — cleanly addressed. §3.1 now explicitly names "Keep the per-cycle RPC cap consistent" (§6.3, codex P2) — cleanly addressed. New Issue 1 — Formula (B) has an undefined variable:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9c528c7854
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ### 5.5 Guardrails | ||
|
|
||
| - **Minimum route size** (`minRouteSpan`, OQ-4): a route whose observed key span (distance between min and max observed key, or sub-range extent) is below a floor is not split — splitting a tiny range yields children too small to relieve load and churns the catalog. Because the engine has no per-route key-count, this is approximated from the sub-range/Top-K evidence; when that evidence is absent the guard is conservatively "do not split" (matches §5.4.3). | ||
| - **Max routes** (`maxAutoRoutes`, default e.g. 1024): once the catalog holds `maxAutoRoutes` routes, the detector stops proposing new splits and emits `autosplit_skipped_total{reason="route_cap"}`. Prevents unbounded catalog growth (and bounds the watcher fan-out and the Composed-1 history-ring pressure noted at `distribution/engine.go:53-58`). |
There was a problem hiding this comment.
Reserve route-cap headroom for compound splits
When a single-key isolation is scheduled while the catalog has maxAutoRoutes - 1 live routes, this guard still allows the decision, but the compound flow in §5.4.2 performs two splits and increases the live route count by two. That leaves the catalog at maxAutoRoutes + 1, violating the advertised cap and the bounded watcher/history pressure it is meant to protect; the detector should check liveRoutes + decisionRouteDelta <= maxAutoRoutes (delta 1 for normal splits, 2 for compound isolation) before scheduling.
Useful? React with 👍 / 👎.
| ### 6.2 Same-group today; least-loaded target once M2 lands | ||
|
|
||
| - **M3 standalone (no M2):** `SplitRange` does a same-group split — both children stay in the source group. This already relieves a hot **range** by halving the key span each child serves and is the first-value deliverable. (It does not relieve a single-group CPU/IO hotspot, since both children stay on the same Raft group — that needs M2's cross-group move.) | ||
| - **Once M2 lands:** the scheduler selects `target_group_id` for the right child via a **least-loaded-group policy** and passes it on the same `SplitRange` RPC (M2's new field). "Least-loaded" is computed from the same keyviz per-group aggregate (sum of route scores per group) the detector already has, choosing the group with the lowest aggregate score among groups in `--raftGroups`, excluding the source group and any group already targeted by an in-flight `SplitJob`. The migration mechanics (the resumable job) are entirely M2's; the scheduler just chooses the target and fires the RPC, then tracks the returned `job_id` for observability. |
There was a problem hiding this comment.
Keep cross-group targeting out of the first compound split
When --autoSplitCrossGroup is later enabled, applying target_group_id on the same scheduler path means the first RPC of a compound hot-key isolation ([A,B) split at H) would make the intermediate right child [H,B) the cross-group/migrating child. The compound algorithm then immediately issues a second SplitRange on [H,B), but §6.3 says cross-group splits return a job_id immediately and routes with in-flight jobs are excluded, so the second split cannot be performed back-to-back. The design should either finish the compound isolation same-group first and only migrate a final child, or explicitly defer target selection until after isolation completes.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Already covered in §6.2 ("Compound single-key isolation is same-group-only, even under --autoSplitCrossGroup (codex P2)") and §6.4 (last bullet). Both SplitRange calls of a compound isolation always pass target_group_id = 0; cross-group target selection is never applied to a compound operation in progress because the intermediate child [H,B) would become an in-flight SplitJob target that §6.3 excludes from candidacy, stranding the isolation at compound_partial on its very first execution. Cross-group relocation of any of the three FINAL children is a separate, later decision (§6.2 second bullet) — only once they are RouteStateActive with no in-flight SplitJob. Could you confirm this resolves the finding?
|
|
||
| | PR | Scope | Tests | Independently shippable? | | ||
| |---|---|---|---| | ||
| | **M3-PR1a** (decoder relaxation, ship **first**) | Retire engine `RecordAccess`/`splitRange`/`Route.Load` threshold path (§3.4); confirm keyviz is the sole signal; **relax both strict decoder checks** — introduce `catalogRouteCodecVersionMin` (`= 1`), accept version byte `>= catalogRouteCodecVersionMin`, and drain (don't reject) any unknown trailing tail (§7.7c). **No `SplitAtHLC` struct field, no encoder change, no helper change; `catalogRouteCodecVersion` stays `1` and the encoder still writes v1.** This is the technical enforcement of the decoder-before-encoder ordering: it must reach every node before M3-PR1b deploys (§7.7c "Upgrade ordering fence", codex `afc95e3b`). | `distribution/engine_test.go` updated (dead-code removal); **codec matrix cases authored here, before the encoder exists** — case 3 (v1 still round-trips), **case 4 and case 4a (relaxed decoder reads a *hand-built* v2 record — incl. an `End == nil` v2 record — drains the 8-byte tail, no error, `SplitAtHLC == 0`)**, case 5 (forward-version v3 drained), case 6 (version `< min` rejected), case 7 (truncated tail rejected). | Yes — pure dead-code removal + decoder loosening with no wire-format emission change; v1 records still round-trip; no v2 record can yet be written. Safe to ship cluster-wide on its own. | |
There was a problem hiding this comment.
Align PR1a tests with the decoder it ships
This assigns the truncated-tail rejection case to M3-PR1a, but the same row says PR1a has no SplitAtHLC field and no encoder/field change, while §7.7c describes the PR1a decoder as a v1-era relaxed decoder that simply drains the unknown v2 tail. Such a decoder cannot distinguish an 8-byte v2 tail from a 1–7 byte truncated tail, so this test will either fail or force PR1a to know the v2 layout it is supposed to precede; move case 7 to PR1b or specify that PR1a already validates version >= 2 tail length.
Useful? React with 👍 / 👎.
|
@claude review Round 7 dispositions — all in §5.1/§5.2/§8.2, doc-only, Issue 1 (load-bearing) — FIXED. §5.1 formula (B) no longer has an undefined variable. Removed the single combined Every variable is now defined; the weighted ops/min composition is unambiguous and parallel to (A), so the Issue 2 — FIXED. Added the Nit — FIXED. The standalone No newer codex/gemini inline findings since 2026-06-11T07:40Z to fold in ( |
|
Claude encountered an error after 1s —— View job I'll analyze this and get back to you. |
|
@claude review |
|
Claude finished @bootjp's task in 3m 31s —— View job Review: M3 Automation Design Doc (Round 8)
Confirmed Resolved — Round 7Issue 1 (undefined variable in formula B, §5.1) — cleanly fixed. Formula (B) now mirrors (A) exactly: Issue 2 ( No new issues were introduced by the §5.1/§5.2/§8.2 revisions. Three Codex P2 Comments from
|
…ope, case 7 placement
|
@claude review Folded the three remaining codex P2 findings from 1. Route-cap delta accounting (§5.5, codex line 248) — Accepted. The cap is now checked against the decision's actual route delta, not the bare parent count. §5.5 "Max routes" now states the detector computes 2. Cross-group + compound isolation incompatibility (§6.2/§6.4, codex line 262) — Accepted. Resolved by making compound single-key isolation same-group-only, even under 3. Case 7 placement (§8.1, codex line 452) — Accepted. Moved codec round-trip case 7 (truncated-tail reject) from the M3-PR1a row to the M3-PR1b row. The PR1a drain-unknown-tail decoder accepts a Doc-only, |
|
Claude finished @bootjp's task in 4m 19s —— View job Review: M3 Automation Design Doc (Round 9)
All three round-8 findings are cleanly addressed. One minor nit worth noting before M3-PR4. Confirmed Resolved — Round 8Route-cap delta accounting (§5.5, codex P2) — cleanly addressed. The guard is now Cross-group + compound isolation incompatibility (§6.2/§6.4, codex P2) — cleanly addressed. Both calls in §6.4 carry Case 7 placement (§8.1, codex P2) — cleanly addressed. Case 7 (truncated tail rejected) is in the M3-PR1b row only. The PR1a row now says "Case 7 is not authored here — the PR1a drain-unknown-tail decoder accepts a short tail by design (no Nit — §6.4's "does not add to the live route count" is imprecise for subsequent cross-group operations§6.4 ends with: "The route-cap budget for the compound is The intent is correct (the +2 accounts only for the compound itself; subsequent decisions are separate), but "does not add to the live route count" is not accurate for non-trivially-small final children. M2's Suggested clarification: "The route-cap budget for the compound is SummaryAll round-8 findings are cleanly resolved with no new substantive issues. The |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dc2d52d4a5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| 1. **Default — sub-range p50.** When `--keyvizKeyBucketsPerRoute > 1`, use the route's sub-range bucket rows to find the bucket whose **cumulative load first crosses 50%**, and use the **`hi` (upper boundary)** of that bucket — `SubBucketBoundsFor(routeID, bucket)` returns `(lo, hi []byte, ok bool)` (`keyviz/sampler.go:559`); the split key is the `hi` of the median bucket, i.e. the boundary *between* the p50 bucket and the next one, so all of the p50 bucket's load stays in the left child. This places the boundary where load is actually halved. | ||
| - **Boundary-bucket fallback (review, codex P2).** `SubBucketBoundsFor` returns the **route's own `End`** as `hi` for the last sub-bucket (and `nil` for an unbounded tail; `keyviz/sampler.go:570-575` — the single-bucket case at `:571-573` likewise returns `start, end`). When the median (p50) bucket is the **last** bucket — an upper-tail hotspot where the final bucket alone carries `> 50%` of the load — its `hi` equals `route.End`, and §5.4's `split_key < route.End` rule plus `validateSplitKey`'s boundary rejection (`adapter/distribution_server.go:359-360,370-371`: `splitKey == parent.End` → `errDistributionSplitKeyAtBoundary`) would make that key invalid, so the detector would drop the decision and never split a genuinely hot tail. The detector therefore **falls back to the median bucket's `lo` (its lower boundary) when the chosen `hi` is `>= route.End` (or `route.End == nil`)**: an interior boundary that still places the dominant last bucket in the *right* child while satisfying `route.Start < split_key < route.End`. Symmetrically, if the p50 bucket is the **first** bucket and its `lo` equals `route.Start`, the detector uses that bucket's `hi` instead (it is necessarily interior because a route always has `> 1` sub-bucket on this path). If *both* candidate boundaries coincide with a route edge (degenerate single-sub-bucket route, only possible when `--keyvizKeyBucketsPerRoute == 1`, which this path is gated out of), the detector declines per §5.4.3. The §8.2 split-key tests add the "p50 in the last bucket → interior `lo` fallback" and "p50 in the first bucket → interior `hi` fallback" cases. | ||
| 2. **Single-key skew — isolate the hot key (compound split).** When Top-K is enabled (`--keyvizHotKeysEnabled`) and one key's share of the route's writes is `>= topKeyShare` (parent doc §6.3 default 0.8), split so the hot key `H` lands in its own single-key child. Isolating `H` from a route `[A, B)` normally takes **two boundaries** — one at `H` and one at `H`'s successor `H·0x00` — so the middle child is exactly `[H, H·0x00)` = `{H}`. The detector emits this as **one atomic compound decision** carrying both split keys, not two independent decisions across cycles. The scheduler executes it as a back-to-back pair of `SplitRange` calls (first `[A,B) → [A,H) + [H,B)`, then `[H,B) → [H, H·0x00) + [H·0x00, B)`), counted and budgeted as described in §6.4 (compound = 1 budget unit, cooldown exempt for the intermediate child). This is the single-hot-key hotspot case the whole feature targets. |
There was a problem hiding this comment.
Require absolute load for hot-key isolation
When a route is promoted by read load, this Top-K rule can still win on write share alone: a single incidental write to H gives H 100% of route writes even though the §5.1 score may be almost entirely reads. In that scenario the scheduler burns the split budget and cooldown isolating a write key that did not cause the hotspot, instead of using the sub-range p50 path to relieve the read-heavy range. Gate compound isolation on the hot key's absolute weighted contribution to the route score, or require a minimum/dominant write volume before preferring it.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit ce50ceb. §5.4.2 now gates hot-key isolation on share AND an absolute weighted-contribution floor (topKeyAbsoluteFloor, default = autoSplitThreshold/2 = 25_000 weighted ops/min on H_writes*Ww). A read-promoted route where a single incidental write to H gives H 100% of route writes no longer wins isolation — the share gate fires but the absolute floor doesn't, and the detector falls through to the §5.4.1 sub-range p50 path with autosplit_isolation_declined_total{reason="absolute_floor"}. So the split budget and cooldown go to relieving the actual read-heavy range, not isolating an incidental write key.
| ### 5.5 Guardrails | ||
|
|
||
| - **Minimum route size** (`minRouteSpan`, OQ-4): a route whose observed key span (distance between min and max observed key, or sub-range extent) is below a floor is not split — splitting a tiny range yields children too small to relieve load and churns the catalog. Because the engine has no per-route key-count, this is approximated from the sub-range/Top-K evidence; when that evidence is absent the guard is conservatively "do not split" (matches §5.4.3). | ||
| - **Max routes** (`maxAutoRoutes`, default e.g. 1024): the cap bounds the **live** route count after a decision lands, so the detector must check it against the decision's actual **route delta**, not against the parent count alone (codex P2). A normal split turns one parent into two children — net **+1** route; a §5.4.2 compound single-key isolation turns one parent into **three** children (`[A,H)` + `{H}` + `[H·0x00,B)`) — net **+2** routes (the `H == route.Start` single-split special case in §5.4.2 is two children, net **+1**). The detector therefore computes `routeDelta` per decision (`+2` for the two-boundary compound isolation, `+1` for a normal split or the single-split isolation) and **only emits a decision when `liveRoutes + routeDelta <= maxAutoRoutes`**; otherwise it skips and emits `autosplit_skipped_total{reason="route_cap"}`. Checking the bare `liveRoutes >= maxAutoRoutes` would wrongly admit a compound isolation at `liveRoutes == maxAutoRoutes - 1`, leaving the catalog at `maxAutoRoutes + 1` and breaching the cap. This bounds catalog growth (and the watcher fan-out and the Composed-1 history-ring pressure noted at `distribution/engine.go:53-58`). The §8.2 guardrail tests add: a compound isolation scheduled at `liveRoutes == maxAutoRoutes - 1` is **skipped** with `route_cap` (delta `+2` would breach), while a normal split at the same live count **proceeds** (delta `+1` is exactly at the cap). |
There was a problem hiding this comment.
Reserve route-cap slots across the whole cycle
This check is correct for one decision, but when an operator raises maxSplitsPerCycle above 1 the detector can emit multiple decisions against the same starting liveRoutes. For example, with liveRoutes == maxAutoRoutes - 1, two normal splits each satisfy liveRoutes + 1 <= maxAutoRoutes individually, but together leave the catalog at maxAutoRoutes + 1. Track a cumulative reserved delta while selecting decisions in a cycle, or re-check the live route count after each scheduled split.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit ce50ceb. §5.5 now maintains a cycle-local reservedDelta accumulator initialised to 0 at the start of each evaluation cycle and incremented by each emitted decision's routeDelta; admission is liveRoutes + reservedDelta + decision.routeDelta <= maxAutoRoutes at every decision, not the bare per-decision check. With liveRoutes == maxAutoRoutes - 1 and maxSplitsPerCycle == 2, the first normal split admits and the second skips with route_cap. §8.2 adds the test case.
| ### 6.2 Same-group today; least-loaded target once M2 lands | ||
|
|
||
| - **M3 standalone (no M2):** `SplitRange` does a same-group split — both children stay in the source group. This already relieves a hot **range** by halving the key span each child serves and is the first-value deliverable. (It does not relieve a single-group CPU/IO hotspot, since both children stay on the same Raft group — that needs M2's cross-group move.) | ||
| - **Once M2 lands:** the scheduler selects `target_group_id` for the right child via a **least-loaded-group policy** and passes it on the same `SplitRange` RPC (M2's new field). "Least-loaded" is computed from the same keyviz per-group aggregate (sum of route scores per group) the detector already has, choosing the group with the lowest aggregate score among groups in `--raftGroups`, excluding the source group and any group already targeted by an in-flight `SplitJob`. The migration mechanics (the resumable job) are entirely M2's; the scheduler just chooses the target and fires the RPC, then tracks the returned `job_id` for observability. |
There was a problem hiding this comment.
Avoid targeting groups with unknown load
In multi-group deployments, §4.1 says the default-group leader's local sampler has no data for routes served through another group's leader, so this local per-group aggregate can make unobserved groups look idle. Once --autoSplitCrossGroup is enabled, the least-loaded policy can therefore migrate a hot range into an already-overloaded group whose load is simply invisible to the default leader. Target selection should either use a cluster-wide/fan-out load source or exclude groups whose load is unknown from the leader-local view.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit ce50ceb. §6.2 now requires positive evidence of low load, not absence of evidence. M3 ships the local-view fast path: a target group is admitted only if the default-group leader is leader of at least one of the group's routes — groups with no local route ownership are categorically excluded as candidates with autosplit_target_skipped_total{reason="load_unknown"}. The cluster-wide cap_migration_v2-heartbeat fan-out extension is recorded under OQ-12 / M3.x as the eventual default. §8.2 adds the "node A has no local route ownership in groups B and C; cross-group skipped under load_unknown" test.
3 new findings from codex round-7 that needed doc updates; 2 others
(line 233 boundary buckets, line 262 compound + cross-group, line
248 single-decision route delta) were already covered in the doc
and will be answered with inline replies pointing at the
already-written sections.
Codex P2 line 235 — hot-key isolation by share alone wastes budget
Top-K share-only gate (>= 0.8) wins on a single incidental write
to H when the route score is almost entirely reads. Add an
absolute weighted-contribution floor: isolation only fires when
H_writes*Ww >= topKeyAbsoluteFloor (default = autoSplitThreshold/2
= 25_000 weighted ops/min). Share-only-fire-but-floor-skip falls
through to the §5.4.1 sub-range p50 path and counts
autosplit_isolation_declined_total{reason="absolute_floor"}.
Codex P2 line 248 — route cap reservation across the cycle
With maxSplitsPerCycle > 1 the detector can emit two decisions
against the same starting liveRoutes, each satisfying the bare
routeDelta check but together breaching maxAutoRoutes. Add a
cycle-local reservedDelta accumulator: admission is now
liveRoutes + reservedDelta + decision.routeDelta <= maxAutoRoutes
at every decision, with the §8.2 test adding the two-normal-splits-
at-cap case (first admitted, second skipped with route_cap).
Codex P2 line 262 — least-loaded target with unknown-load groups
Default-group leader's local sampler only sees routes it owns,
so groups served via another leader read as 0-load and would be
picked as least-loaded targets despite being overloaded. Target
eligibility now requires positive evidence of low load (not
absence of evidence): M3 ships the local-view fast path (admit
only groups with at least one locally-led route) and records the
cluster-wide fan-out as M3.x / OQ-12. Excluded groups count under
autosplit_target_skipped_total{reason="load_unknown"}.
All three deltas inherit the same five-lens self-review structure
the doc already uses; tests called out in §8.2 cover each new case.
|
@claude review — round 7 review (5 codex P2 findings on PR #951):
|
|
Claude finished @bootjp's task in 4m 41s —— View job Review: M3 Automation Design Doc (Round 10)
All five trigger-comment items are confirmed addressed. Three new issues in the Confirmed Resolvedline 233 (boundary buckets — "already covered") — §5.4.1 paragraph 2 "Boundary-bucket fallback (review, codex P2)" is present and complete: line 235 (absolute-load gate — ce50ceb) — §5.4.2 now gates compound isolation on line 248 (route-cap across cycle — ce50ceb) — §5.5 now maintains a cycle-local line 262 (compound + cross-group — "already covered") — §6.2 has the explicit "Compound single-key isolation is same-group-only, even under line 262 (unknown-load target — ce50ceb) — §6.2 now requires positive evidence of low load via the local-view fast path: groups with no local route ownership are excluded with Round-9 Nit — Still Open (Non-blocking)§6.4 still ends with: "The route-cap budget for the compound is This remains imprecise: M2's New Issue 1 —
|
| Step | Hot key at 6 250 raw writes/min | H_writes in one column | H_writes × Ww | Floor (25 000 ops/min) | Result |
|---|---|---|---|---|---|
| 60s | 6 250/min | 6 250 | 25 000 | 25 000 | pass (correct) |
| 30s | 6 250/min | 3 125 | 12 500 | 25 000 | fail (wrong) |
| 120s | 6 250/min | 12 500 | 50 000 | 25 000 | pass (also correct, coincidentally) |
A hot key at exactly the intended 6 250 weighted writes/min (25 000 / Ww) fails the floor for Step < 60s even though it should pass, and passes above 6 250 for Step > 60s on a lower actual rate. This is exactly the Step↔sensitivity coupling OQ-8's normalization was designed to eliminate. The unit mismatch silently varies with --keyvizStep.
Fix: express the floor check in per-minute units, consistent with §5.1 formula (A):
H_write_rate = H_writes_in_column / Step_s × 60 // ops/min, same normalization as §5.1
absolute_contribution = H_write_rate × Ww // weighted ops/min
inCooldown := absolute_contribution >= topKeyAbsoluteFloor
The description in §5.4.2 should say "the absolute floor is computed as (H_writes_in_column / Step_s × 60) × Ww (weighted ops/min, normalized consistently with §5.1 formula (A))" — not the raw H_writes × Ww. The §8.2 scoring tests should include a case that verifies Step-invariance of the floor gate (mirroring the Step-invariance case already required for the scoring formula by OQ-8's resolution).
New Issue 2 — Two new counters from ce50ceb are absent from §7.3's metric inventory (§5.4.2, §6.2, M3-PR3 scope)
§7.3 is the authoritative metric table for bounded-cardinality enforcement (the principle that "all metric names must be enumerated explicitly"). ce50cebb introduces two counter names in the doc body that do not appear in §7.3:
autosplit_isolation_declined_total{reason="absolute_floor"}(§5.4.2) — fired when the share gate passes but the absolute floor does not, causing fallthrough to p50. This is distinct fromautosplit_skipped_total(the detector still proceeds to p50 and may issue aSplitRange), so it cannot just be a newreasonvalue in the existingskippedcounter. It needs its own entry in §7.3.autosplit_target_skipped_total{reason="load_unknown"}(§6.2) — fired when a cross-group candidate group is excluded for lack of local load data. This too is absent from §7.3.
Without §7.3 entries for these counters, M3-PR3 implementers must discover them via a textual search through §5.4.2 and §6.2; the metric registration test (PR1b scope) will not cover them; and operators cannot configure dashboards against a complete, enumerated list. Both need to be added to §7.3's counter block.
Suggested additions:
- `autosplit_isolation_declined_total{reason}` — compound single-key isolation was
considered but not emitted; reason ∈ {`absolute_floor`} (the detector fell through to
p50 instead). Does NOT mean a split was skipped — the p50 path may still run.
- `autosplit_target_skipped_total{reason}` — a cross-group target candidate was excluded;
reason ∈ {`load_unknown`} (no local route ownership in that group, §6.2).
New Issue 3 — §8.2 missing a test for the absolute-floor fallthrough (§5.4.2, M3-PR2 scope)
§8.2 lists three compound-split test cases (H == route.Start single-split, already-isolated [H, H·0x00) decline), and Issue 1 above adds a normalization case. But there is no listed test for:
"share gate fires (key H's share >=
topKeyShare), but the absolute-floor gate does not (H's weighted ops/min <topKeyAbsoluteFloor): the detector must fall through to §5.4.1 p50 and must NOT emit a compound isolation decision."
Without this case in §8.2, an implementer can pass all listed tests while implementing the share check correctly but omitting the absolute floor entirely — or while implementing the floor in raw-count units (Issue 1) rather than normalized ops/min. The test is also the natural place to exercise the autosplit_isolation_declined_total{reason="absolute_floor"} counter (Issue 2).
Add to §8.2's split-key test list: "compound isolation — share fires but absolute floor doesn't (H is 100% of route writes by share but H_write_rate × Ww < topKeyAbsoluteFloor): detector emits p50 decision, not compound; autosplit_isolation_declined_total{reason="absolute_floor"} incremented."
Summary
All five trigger-comment items are cleanly addressed. The round-9 nit remains (§6.4 wording, non-blocking). Three new issues: Issue 1 (the topKeyAbsoluteFloor raw-count/rate mismatch) is load-bearing because it silently varies with --keyvizStep and would introduce the exact coupling OQ-8 was resolved to eliminate; Issues 2 and 3 (missing §7.3 entries and §8.2 test case) are concrete correctness gaps for the M3-PR2/PR3 implementer. All three are in §5.4.2/§6.2/§7.3/§8.2 scope — doc-only fixes.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ce50cebbfe
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| 1. **Default — sub-range p50.** When `--keyvizKeyBucketsPerRoute > 1`, use the route's sub-range bucket rows to find the bucket whose **cumulative load first crosses 50%**, and use the **`hi` (upper boundary)** of that bucket — `SubBucketBoundsFor(routeID, bucket)` returns `(lo, hi []byte, ok bool)` (`keyviz/sampler.go:559`); the split key is the `hi` of the median bucket, i.e. the boundary *between* the p50 bucket and the next one, so all of the p50 bucket's load stays in the left child. This places the boundary where load is actually halved. | ||
| - **Boundary-bucket fallback (review, codex P2).** `SubBucketBoundsFor` returns the **route's own `End`** as `hi` for the last sub-bucket (and `nil` for an unbounded tail; `keyviz/sampler.go:570-575` — the single-bucket case at `:571-573` likewise returns `start, end`). When the median (p50) bucket is the **last** bucket — an upper-tail hotspot where the final bucket alone carries `> 50%` of the load — its `hi` equals `route.End`, and §5.4's `split_key < route.End` rule plus `validateSplitKey`'s boundary rejection (`adapter/distribution_server.go:359-360,370-371`: `splitKey == parent.End` → `errDistributionSplitKeyAtBoundary`) would make that key invalid, so the detector would drop the decision and never split a genuinely hot tail. The detector therefore **falls back to the median bucket's `lo` (its lower boundary) when the chosen `hi` is `>= route.End` (or `route.End == nil`)**: an interior boundary that still places the dominant last bucket in the *right* child while satisfying `route.Start < split_key < route.End`. Symmetrically, if the p50 bucket is the **first** bucket and its `lo` equals `route.Start`, the detector uses that bucket's `hi` instead (it is necessarily interior because a route always has `> 1` sub-bucket on this path). If *both* candidate boundaries coincide with a route edge (degenerate single-sub-bucket route, only possible when `--keyvizKeyBucketsPerRoute == 1`, which this path is gated out of), the detector declines per §5.4.3. The §8.2 split-key tests add the "p50 in the last bucket → interior `lo` fallback" and "p50 in the first bucket → interior `hi` fallback" cases. | ||
| 2. **Single-key skew — isolate the hot key (compound split).** When Top-K is enabled (`--keyvizHotKeysEnabled`) and one key's share of the route's writes is `>= topKeyShare` (parent doc §6.3 default 0.8) **AND** the hot key's absolute weighted contribution to the route score is at least `topKeyAbsoluteFloor` (default = `autoSplitThreshold / 2` = 25_000 weighted ops/min) of the route's `score = write_ops * Ww + read_ops * Wr` (§5.1), split so the hot key `H` lands in its own single-key child. The absolute-weighted-contribution gate (codex round-7 P2) prevents a wasteful compound isolation on a read-promoted route where a single incidental write to `H` gives `H` 100% of route writes — share alone would pin a write key that did not cause the hotspot, burning the split budget and cooldown instead of using the §5.4.1 sub-range p50 path to relieve the read-heavy range. The absolute floor is computed as `H_writes * Ww` (writes only, since Top-K is currently write-only on the hot path; future read Top-K can extend with `+ H_reads * Wr`). When the share-only gate fires but the absolute floor does not, the detector logs `autosplit_isolation_declined_total{reason="absolute_floor"}` and falls through to the §5.4.1 sub-range p50 path instead. Isolating `H` from a route `[A, B)` normally takes **two boundaries** — one at `H` and one at `H`'s successor `H·0x00` — so the middle child is exactly `[H, H·0x00)` = `{H}`. The detector emits this as **one atomic compound decision** carrying both split keys, not two independent decisions across cycles. The scheduler executes it as a back-to-back pair of `SplitRange` calls (first `[A,B) → [A,H) + [H,B)`, then `[H,B) → [H, H·0x00) + [H·0x00, B)`), counted and budgeted as described in §6.4 (compound = 1 budget unit, cooldown exempt for the intermediate child). This is the single-hot-key hotspot case the whole feature targets. |
There was a problem hiding this comment.
Normalize hot-key floor to ops/min
When --keyvizStep differs from 60s or candidateWindows covers multiple columns, comparing H_writes * Ww to topKeyAbsoluteFloor is comparing raw writes to a floor documented as weighted ops/min. For example, with a 10s step a key doing 4,500 writes in one column is 108,000 weighted ops/min but only 18,000 by this formula, so the absolute gate can incorrectly fall through to p50 instead of isolating the true hot key. Compute the hot-key contribution using the same per-column or smoothed ops/min normalization as §5.1 before applying this floor.
Useful? React with 👍 / 👎.
| ### 6.2 Same-group today; least-loaded target once M2 lands | ||
|
|
||
| - **M3 standalone (no M2):** `SplitRange` does a same-group split — both children stay in the source group. This already relieves a hot **range** by halving the key span each child serves and is the first-value deliverable. (It does not relieve a single-group CPU/IO hotspot, since both children stay on the same Raft group — that needs M2's cross-group move.) | ||
| - **Once M2 lands:** the scheduler selects `target_group_id` for the right child via a **least-loaded-group policy** and passes it on the same `SplitRange` RPC (M2's new field). "Least-loaded" is computed from the same keyviz per-group aggregate (sum of route scores per group) the detector already has, choosing the group with the lowest aggregate score among groups in `--raftGroups`, excluding the source group and any group already targeted by an in-flight `SplitJob`. |
There was a problem hiding this comment.
Only migrate the child that carries the hotspot
In the lower-tail case where §5.4.1 picks the first bucket's hi, the hot p50 bucket remains in the left child, but this M2 path always passes target_group_id for the right child. With --autoSplitCrossGroup enabled that migrates the colder half away while the overloaded left child stays on the source group, so the cross-group split can burn a migration without relieving the hotspot. Gate target selection on which child contains the observed hot load, or keep this split same-group unless the right child is the hot side.
Useful? React with 👍 / 👎.
|
|
||
| - **Version-byte check → minimum, not equality.** Introduce `catalogRouteCodecVersionMin` (= `1`) and `catalogRouteCodecVersion` (bumped to `2`). Reject only `raw[0] < catalogRouteCodecVersionMin` (truly old/corrupt) with `ErrCatalogInvalidRouteRecord`; accept any `raw[0]` in `[catalogRouteCodecVersionMin, catalogRouteCodecVersion]` and treat any `raw[0] > catalogRouteCodecVersion` as a forward version (parse the fields this build knows, ignore the rest — see the trailing-bytes rule below). The version byte is retained in a local `version` variable for the field-reading step. | ||
| - **Field reading is version-keyed.** After `decodeRouteDescriptorHeader` and `decodeRouteDescriptorEnd` (unchanged), **read the 8-byte big-endian `SplitAtHLC` tail iff `version >= 2`**. For a v1 record (`version == 1`) there is no tail, so `SplitAtHLC` stays `0`. For a `version >= 2` record, read exactly 8 bytes for `SplitAtHLC`. | ||
| - **Trailing-bytes check → skip-unknown-tail, not reject.** Replace `if r.Len() != 0 { return …, err }` with **"discard any bytes remaining after the fields this version knows."** Concretely: after the version-keyed field reads, drain the reader (`io.Copy(io.Discard, r)` or `r.Seek(0, io.SeekEnd)`) instead of erroring. This makes the decoder tolerant of *both* directions: |
There was a problem hiding this comment.
Keep same-version trailing bytes invalid
This decoder rule drains any bytes left after the fields this build knows, which also accepts corrupt records for the current format, e.g. a v1 record with appended garbage in PR1a or a v2 record with extra bytes after SplitAtHLC in PR1b. The rolling-upgrade case only requires draining tails for forward versions (version > catalogRouteCodecVersion); for the current version the old r.Len() == 0 check should remain so malformed catalog records still fail instead of being silently normalized.
Useful? React with 👍 / 👎.
Claude round-10 review on PR #951 (commit ce50ceb) — three new issues in the ce50ceb deltas: Issue 1 (§5.4.2) — topKeyAbsoluteFloor was Step-sensitive Comparing 'H_writes (raw column count) * Ww' against an ops/min threshold silently varies with --keyvizStep: at Step=30s a hot key at exactly the intended 6 250 weighted writes/min produces H_writes_in_column=3 125, raw H_writes*Ww=12 500, which fails a 25 000 floor — exactly the Step↔sensitivity coupling OQ-8 was resolved to eliminate. Express the floor check in normalised per-minute units consistent with §5.1 formula (A): H_write_rate = H_writes_in_column / Step_s * 60; check H_write_rate * Ww >= topKeyAbsoluteFloor. Issue 2 (§7.3) — two new counters from ce50ceb were missing autosplit_isolation_declined_total{reason='absolute_floor'} (§5.4.2) and autosplit_target_skipped_total{reason='load_unknown'} (§6.2) were referenced in doc body but absent from §7.3 inventory. Both are now enumerated; the isolation_declined description clarifies it is NOT skipped (a §5.4.1 p50 decision may still fire on the same route), and target_skipped clarifies it does not affect whether a split is scheduled. Issue 3 (§8.2) — missing tests for new gates Added explicit unit-test entries for: Step-invariance of the topKeyAbsoluteFloor gate across Step=30s/60s/120s; share-fires- but-floor-doesn't fallthrough producing a §5.4.1 p50 decision and the absolute_floor counter; unknown-load target exclusion producing the load_unknown counter; and the cycle-local reservedDelta two-decision case for the route_cap accumulator.
|
@claude review — round 10 (3 new issues from ce50ceb):
|
|
Claude finished @bootjp's task in 4m 28s —— View job Review: M3 Automation Design Doc (Round 11)
All three round-10 issues are cleanly addressed. Two new codex P2 comments from Confirmed Resolved — Round 10Issue 1 ( Issue 2 (§7.3 missing two new counters) — cleanly fixed. Both counters are present with precise semantics: Issue 3 (§8.2 missing test cases) — cleanly fixed. All four cases are named: Step-invariance of New Issue 1 — Cross-group migration targets wrong child in first-bucket p50 fallback (§5.4.1 / §6.2, M3-PR4 scope)The boundary-bucket fallback added in §5.4.1 places hot load in the left child when p50 is the first bucket — The three cases, analyzed:
The first-bucket fallback is the only case where the hot work ends up in the left child, making cross-group migration counterproductive. The underlying reason is that §6.2 hardwires "right child = migration target" without checking which side carries the observed load. Required doc fix (§5.4.1 and §6.2): Add a note that the first-bucket fallback creates a "hot-left" split, and that under
Either option closes the gap before M3-PR4 implementation begins. This does not affect M3-PR1a through PR3 (all same-group). New Issue 2 — "Skip-unknown-tail" decoder rule is too permissive for same-version records (§7.7c, M3-PR1b scope)The decoder spec in §7.7c says: "after the version-keyed field reads, drain the reader ( Concretely, once M3-PR1b ships with
Required fix in §7.7c: Make the tail rule version-conditional: For M3-PR1a's "v1-era relaxed decoder" (where For M3-PR1b's v2 decoder ( Add to the codec matrix: a new case for "v2 record with extra bytes appended after The §8.1 M3-PR1b checklist should add: "codec round-trip test includes at least one v2 record with a malformed extra byte after Round-9 Nit Still Open (Non-blocking)§6.4 still ends with: "a later cross-group move of a final child relocates an existing route and does not add to the live route count." This remains imprecise — M2's SummaryThree round-10 issues are cleanly resolved. Two new codex P2 findings: Issue 1 (wrong child migrated in first-bucket fallback, §6.2 — affects M3-PR4) and Issue 2 (same-version trailing bytes silently accepted, §7.7c — affects M3-PR1b). Issue 2 is the more load-bearing one since it weakens corruption detection in the current-version decoder; it should be addressed before M3-PR1b begins implementation. Issue 1 is M3-PR4 scope and does not block M3-PR1a through PR3. |
…al decoder Claude round-11 review on PR #951 (commit d707344) — two new codex P2 issues introduced by the d707344 deltas: Issue 1 (§5.4.1 + §6.2) — first-bucket fallback puts hot in LEFT child M2's SplitRange semantics make the RIGHT child the migration target. The first-bucket fallback (split at the median bucket's hi when its lo equals route.Start) puts the hot work in the LEFT child, so a cross-group migration would relocate the cool side and leave the hot work on the source — counterproductive. Fix: §5.4.1 now marks each emitted decision with a split_origin enum {p50_mid, p50_last_bucket_lo, p50_first_bucket_hi, isolation_compound, isolation_single, decline}. §6.2's cross- group hook unconditionally same-group's the p50_first_bucket_hi case regardless of --autoSplitCrossGroup. The same-group split still relieves range pressure (smaller right child), and the next cycle re-evaluates — if load remains on the new left child, a normal p50 split follows with the hot work on the right. §8.2 split-key tests gain matched cases. Issue 2 (§7.7c) — trailing-bytes drain is too permissive for same-version records The unconditional 'drain any tail' rule masks corruption when the writer is at catalogRouteCodecVersion (not a forward version). Trailing bytes after the last known field of the current version are a defective encoder or wire tamper, not forward compat — silently normalising them weakens corruption detection. Fix: §7.7c switches on version vs. catalogRouteCodecVersion: - version < min: reject (unchanged). - version in [min, catalogRouteCodecVersion]: STRICT r.Len() != 0 → reject. This is what the v1-era pre-relaxation decoder enforced; preserve it for every version this build fully understands. - version > catalogRouteCodecVersion: drain (forward-compat). PR1a's perspective: version=2 is forward (PR1a stays at v1) → drain. PR1b's perspective: version=2 is known → strict. version=3+ → drain. §8.2 codec matrix gains the 'v2 record with extra bytes after SplitAtHLC → rejected' case.
|
@claude review — round 11 (2 codex P2 + round-9 nit):
|
|
Claude finished @bootjp's task in 4m 25s —— View job Review: M3 Automation Design Doc (Round 12)
Both round-11 issues are cleanly resolved. Two new issues in the cc0e447 changes are worth addressing before M3-PR1b and M3-PR4. Confirmed Resolved — Round 11Issue 1 (§5.4.1 first-bucket hot-left) — cleanly addressed. The Issue 2 (§7.7c version-conditional strict tail) — cleanly addressed. The decoder is now keyed on Round-9 nit (§6.4 cross-group +1 wording) — acknowledged as M3-PR4 scope, non-blocking for PR1a–PR3. ✓ New Issue 1 — "v2 + extra bytes" case mentioned in prose but absent from the codec matrix table (§7.7c, M3-PR1b scope)The sentence at line 393 says: "The §8.2 codec matrix gains a new entry: 'v2 record with extra bytes appended after The numbered table (cases 1, 1a, 2, 3, 4, 4a, 5, 6, 7) does not contain this row. The prose says the matrix gains the entry, but the entry is not there. An implementer writing M3-PR1b tests who works from the numbered table will not see this case and will not write the test for it. Case 5 is close — it uses a Fix: add a numbered row to the codec matrix table in §7.7c (call it Case 8, or insert it between Cases 1a and 2 as Case 1b):
The §8.1 M3-PR1b checklist should add "codec round-trip test includes at least one v2 record with extra bytes after Also update the "load-bearing cases" sentence (currently naming 4, 4a, and 7) to also name the new case so it appears in §8.4 acceptance criterion 10's coverage. New Issue 2 — §6.2 doesn't state the
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cc0e447026
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| **No cold-start "N columns must exist" guard is needed.** Because the counter consumes per-column scores (each column self-contained, no dependency on `N` prior columns), the detector advances `consecutiveOver` on the **first** committed column whose per-column score crosses threshold — there is no longer any "wait until `N` committed columns exist before the counter can start" prerequisite (that guard only existed because the old averaged score could not be evaluated until `N` columns had accumulated). If `Snapshot` returns fewer than `candidateWindows` committed columns at cold start, the detector simply scores the columns it has, advancing/resetting per column; the route is promoted as soon as `candidateWindows` consecutive per-column scores have cleared threshold (§4.2's "fewer than N committed columns" note is updated to match). | ||
|
|
||
| **The counter advances per new flush column, not per evaluation cycle.** `--autoSplitEvalInterval` defaults to `--keyvizStep` but can be set independently, and when `evalInterval < Step` multiple evaluation cycles complete before a new flush column is committed. If the detector advanced `consecutiveOver` on every cycle, the same column would be re-counted: with `evalInterval = 5s` and `Step = 60s`, a candidate could be "promoted" in 15 s (3 cycles) on a *single* column of data — defeating the "3 consecutive flush columns" intent. The detector therefore **only advances (or resets) `consecutiveOver` when a new committed flush column has appeared since the last evaluation**. Mechanism: it records the `MatrixColumn.At` of the freshest committed column at the end of each evaluation; a cycle whose freshest `At` is unchanged from the last is a **no-op for the counter** (it may still recompute and log the smoothed score, but does not touch hysteresis state, cooldown, or emit a decision). This makes promotion timing a function of flush *columns* elapsed, independent of how often the detector wakes. Conversely, if `evalInterval > Step` and several columns accumulate between cycles, the detector processes each new column once, evaluating each column's own per-column score (so a burst spanning multiple skipped columns still advances/resets the counter per column, not per cycle) — see §8.2 tests. |
There was a problem hiding this comment.
Add zero-load columns for missing sampler rows
When a route has no traffic in a committed flush, MemSampler.Flush skips idle slots, so Snapshot has no MatrixRow for that route in that column. If the detector only processes rows present in each new column as described here, a route with N-1 hot columns, then an idle/cold column, then one hot column can keep consecutiveOver across the missing column and promote without candidateWindows consecutive hot flushes. The detector needs to synthesize a zero score for every live route missing from each committed column (and tests should cover absent rows), not just evaluate rows that appear in the snapshot.
Useful? React with 👍 / 👎.
| 2. **Cluster-wide fan-out (M3.x / OQ-12):** the M2 cap_migration_v2 heartbeat already carries per-node capability bits; an extension can carry each node's per-group aggregate so the default-group leader has a cluster-wide load view. The scheduler then admits any group whose load is reported by quorum (defending against single-node staleness). This removes the local-view limitation and is recorded under OQ-12 as the eventual default once the fan-out lands. | ||
| - The §8.2 target-selection tests add a "node A (default-group leader) has no local route ownership in groups B and C; the scheduler must skip cross-group migration to either, counted under `load_unknown`" case. | ||
| The migration mechanics (the resumable job) are entirely M2's; the scheduler just chooses the target and fires the RPC, then tracks the returned `job_id` for observability. | ||
| - **Compound single-key isolation is same-group-only, even under `--autoSplitCrossGroup` (codex P2).** Cross-group target selection is **never** applied to a §5.4.2 compound isolation in progress. The two-`SplitRange`-call chain (§6.4) must both fire with `target_group_id = 0` (same-group). If the *first* call carried a `target_group_id`, its intermediate right child `[H,B)` would become a cross-group migration target — an in-flight `SplitJob` whose route §6.3 excludes from candidacy and on which a second `SplitRange` is rejected — so the second call (§6.4 step 2) could not proceed and the isolation would strand at `compound_partial` on its very first execution. Cross-group relocation of an isolated child is therefore a **separate, later, same-group-complete-first step**: only the three *final* isolated children (`[A,H)`, `{H}`, `[H·0x00,B)`) are eligible for cross-group migration, and only as **independent decisions in subsequent cycles** once the compound operation has fully resolved same-group (and each such child is `RouteStateActive` with no in-flight `SplitJob`, §5.0/§6.3). This restriction is the M3-PR4 hook constraint. |
There was a problem hiding this comment.
Provide a way to migrate the isolated hot-key child
In the post-M2 path, compound hot-key isolation is forced to run same-group, and this line says the final {H} child can be migrated later as an independent decision. But §6.2's only migration hook is still SplitRange moving the right child of a new split; after isolation {H} = [H, H·0x00) has no valid interior split key, so a later SplitRange cannot move that existing hot-key route. In cross-group mode this leaves the single-key hotspot on the source group forever unless the design adds a MoveRange/target-child operation or another explicit way to migrate the isolated child.
Useful? React with 👍 / 👎.
…ket_hi gate + §7.4 slog (round-12) Claude round-12 review on PR #951 (commit cc0e447) — two new issues + one nit: Issue 1 (§7.7c / §8.2 codec matrix) — Case 1b missing from table §7.7c's prose said the matrix gains a 'v2 + extra bytes' case, but the numbered table didn't actually contain the row. An implementer authoring M3-PR1b tests from the table would miss it. Added Case 1b as a proper numbered row: v2 with extra byte appended after SplitAtHLC → ErrCatalogInvalidRouteRecord, the counterpart of Case 7 (too-few bytes → rejected). Updated §8.4 acceptance criterion 10 'load-bearing cases' list to include Case 1b alongside 4, 4a, and 7. Issue 2 (§6.2) — p50_first_bucket_hi same-group gate was attributed to §6.2 but not stated there §5.4.1 forward-referenced §6.2 for enforcement, but §6.2 itself only contained the compound-isolation same-group rule, not the first-bucket-fallback rule. M3-PR4 implementer reading §6.2 as the authoritative cross-group policy statement would miss it. Added a bullet parallel to the compound-isolation rule: when decision.split_origin == p50_first_bucket_hi, set target_group_id = 0 unconditionally regardless of --autoSplitCrossGroup. §8.2 target-selection tests gain the matching test case. Nit (§7.4) — split_origin absent from slog key list Operators debugging 'why didn't this split cross-group?' need the enum that gated the decision. Added split_origin to §7.4's stable slog keys list with the full enum range.
|
@claude review — round 13 (2 new issues + nit on commit cc0e447):
|
|
Claude finished @bootjp's task in 4m 34s —— View job Review: M3 Automation Design Doc (Round 13)
Confirmed Resolved — Round 12Issue 1 (codec matrix Case 1b missing from numbered table) — Case 1b is now in the codec matrix table at §7.7c (lines 396–398): v2 with Issue 2 (§6.2 missing Nit (§7.4 Minor Inconsistency — §7.7c forward-reference to §8.4 criterion 10 is stale§7.7c (line 400) states: "The §8.4 acceptance criterion 10 'load-bearing cases' list now names cases 4, 4a, 1b, and 7." But §8.4 criterion 10 (line 512) only names "codec matrix cases 4 and 4a" and "Codec matrix case 5 (forward version)" — cases 1b and 7 are not there. Case 1b is correctly in the codec matrix table; the gap is only that §8.4 criterion 10 wasn't updated to name it (and 7) alongside 4 and 4a. Suggested fix: extend the criterion 10 sentence to "codec matrix cases 4, 4a, 1b, and 7" and note PR1a covers 4/4a while PR1b covers 1b/7. Non-blocking for implementation, but the inconsistency makes §7.7c's own forward-reference claim false. Issue 1 — Absent route row = implicit zero score is not stated (§5.2 / §8.2, M3-PR2 scope)This is the unaddressed codex P2 finding at line 219 from cc0e447.
The failure mode is concrete. With
Under the current spec, route A promotes on column c4 after only two individually-hot columns (c1 and c3/c4), because the cold gap (c2, absent from snapshot) was silently skipped. The "3 consecutive hot columns" invariant is violated. The rapid invariant ("a route is promoted iff its per-column score cleared threshold for Fix (§5.2 and §8.2): State explicitly that for each committed column, any live route (in the state-map or engine snapshot) that has no Add to §8.2: "A route with Add to the rapid invariant: "For each committed column, a live route that is absent from the snapshot row is scored as This is M3-PR2 scope (the pure detector) and does not affect the codec or scheduler. Issue 2 —
|
…imit + §8.4 case names Claude round-13 review on PR #951 (commit d948024) — three items: Cross-ref nit (§8.4 criterion 10) — naming §7.7c's prose claimed §8.4 criterion 10 lists Cases 1b and 7 alongside 4 and 4a, but the criterion only named 4 and 4a. Trivial one-line fix: criterion 10 now names 4, 4a, 1b, and 7 with the PR1a-covers-4/4a + PR1b-covers-1b/7 split called out. Issue 1 (§5.2) — absent-route row score MemSampler.Flush skips idle routes — no MatrixRow in the committed column. The old §5.2 silently carried consecutiveOver across the absent column, so a route with N-1 hot + 1 quiet + N hot columns would promote on the second-Nth hot column rather than needing N more hot columns after the quiet gap. The 'N consecutive hot columns' invariant was violated. Fix: detector treats a live route absent from a committed column as per_col_score = 0 (below threshold → down-side hysteresis resets the counter). Mechanism: iterate live routes from state map ∪ engine snapshot; for each, look up its MatrixRow in the freshest committed column; if absent, score zero and reset. Cold-start absence (route has never appeared) stays handled by 'advance from the first committed column.' Silent-gap case is the new scenario the rule addresses. Issue 2 (§6.2 / §6.4) — {H} singleton migration impossibility §6.2 claimed all three final isolated children (`[A,H)`, `{H}`, `[H·0x00,B)`) are eligible for cross-group migration via SplitRange. False for {H}: validateSplitKey requires a key strictly between Start and End, and H·0x00 is the immediate lexicographic successor of H — no interior key exists. SplitRange against {H} always returns errDistributionSplitKeyAtBoundary, so the M2 API cannot relocate {H} to another group. Fix: §6.2 / §6.4 corrected to state that only the two non-singleton final children are eligible. The singleton {H}'s group is fixed at isolation time as a known M2-API limitation. A future MoveRange(routeID, target_group_id) RPC would close the gap; recorded as OQ-16 'MoveRange primitive for singleton routes' and out of scope for M3. Operators with a single-key hotspot whose group itself is overloaded use application-level key sharding until OQ-16 lands.
|
@claude review — round 14 (3 issues on commit d948024):
|
|
Claude finished @bootjp's task in 4m 17s —— View job Review: M3 Automation Design Doc (Round 14)
All three round-13 items are cleanly addressed. Two new precision issues in §5.2 and one doc-coherence gap worth fixing before M3-PR2. Confirmed Resolved — Round 13Cross-ref nit (§8.4 criterion 10) — §8.4 criterion 10 now names cases 4, 4a, 1b, and 7 with the explicit "PR1a covers 4/4a, PR1b covers 1b/7" split. ✓ Issue 1 (absent-route zero score, §5.2) — §5.2 now has an explicit "Absent-route columns score as zero" paragraph with the full rationale (MemSampler.Flush skips idle routes → no MatrixRow → silent gap would carry Issue 2 ({H} singleton migration impossibility, §6.2/§6.4) — §6.2 now explicitly states that New Issue 1 — OQ-12 and OQ-16 referenced in the body but absent from §9The body text references two new open questions — OQ-12 (cluster-wide fan-out for cross-group load visibility, §6.2 line: "recorded under OQ-12 as the eventual default once the fan-out lands") and OQ-16 (MoveRange primitive for singleton routes, §6.2: "recorded as OQ-16") — but §9 "Open Questions" only lists OQ-1 through OQ-8. OQ-12 and OQ-16 are nowhere in §9; OQ-9 through OQ-11 and OQ-13 through OQ-15 also do not appear anywhere in the document. The doc's stated §9 contract is: "where review (or this revision) produced a recommendation, the OQ is marked RESOLVED with the decision inline; only questions that genuinely need an implementer/reviewer call before the relevant PR remain open." An implementer reading §9 as the authoritative OQ list would conclude there are only 8 open questions (OQ-1 through OQ-8) and miss OQ-12 and OQ-16 entirely. The cross-references in §6.2 are not a substitute for formal tracking in §9. Fix: add OQ-12 and OQ-16 as formal entries in §9 with the same structure as OQ-1–8. For example:
OQ-12 was added in ce50ceb (round 10); OQ-16 was added in ea5c12a (this round). Both predate the current review. New Issue 2 — §5.2 absent-route mechanism says "freshest committed column" — imprecise for the
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ea5c12a841
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| - **CAS chaining.** The second call uses the `v1` returned by the first (not the cycle's starting `v0`), so the pair is serialized against the live catalog version. If the first call's CAS fails (a concurrent manual split landed), the scheduler abandons the compound, emits `autosplit_splits_failed_total{reason="cas_conflict"}`, and re-evaluates next cycle — it does **not** issue the second call against a stale version. | ||
| - **Partial-completion safety.** If the first call commits but the second fails (CAS conflict or RPC error), the catalog is left in a valid two-child state `[A,H) + [H,B)` — `[H,B)` is simply a narrower route that the detector re-evaluates next cycle (it is now in cooldown only if the §5.3 rule applied it; per §5.3 the intermediate `[H,B)` is exempt, so it is eligible next cycle and the detector can retry the isolation). No invariant is broken by stopping after the first call; the worst case is the hot key takes one extra cycle to isolate. This is logged `autosplit_compound_partial`. | ||
| - Cooldown applies to the three **final** children only, per §5.3. | ||
| - **Same-group-only, even under `--autoSplitCrossGroup` (codex P2).** Both `SplitRange` calls always pass `target_group_id = 0`; cross-group target selection (§6.2) is **incompatible with a compound operation in progress** and is never applied to it. If call 1 carried a `target_group_id`, the intermediate child `[H,B)` would become a cross-group `SplitJob` target — a route §6.3 excludes from candidacy and on which `SplitRange` is rejected — so call 2 could not run and the isolation would strand at `compound_partial` immediately. Cross-group relocation of an isolated child is a separate, later decision (§6.2): only the three *final* children, once resolved same-group and `RouteStateActive` with no in-flight `SplitJob`, may be independently selected for cross-group migration in a subsequent cycle. The route-cap budget for the compound is `+2` routes (§5.5); a later cross-group move of a final child relocates an existing route and does not add to the live route count. |
There was a problem hiding this comment.
Exclude the singleton child from later migrations
Fresh evidence after the earlier singleton-migration issue is this §6.4 bullet: it says all three final children may later be selected for cross-group migration, which includes the singleton {H} = [H, H·0x00). That contradicts §6.2's corrected constraint that this singleton has no valid interior split key and validateSplitKey rejects both boundaries, so an implementer following this scheduler section can repeatedly target the singleton and only get failed SplitRange calls. Narrow this to the two non-singleton final children, or explicitly mark the singleton as non-targetable.
Useful? React with 👍 / 👎.
|
|
||
| Down-side hysteresis: the detector tracks a per-route `consecutiveOver` counter in its leader-local state map; a column whose **per-column score** falls below `threshold * hysteresisDownFactor` (default 0.8) clears it immediately, while per-column scores in the `[0.8×threshold, threshold)` band hold the counter (neither increment nor reset) so a route hovering at the line does not oscillate between "almost candidate" and "reset." (OQ-2 asks whether the band should instead decay the counter.) | ||
|
|
||
| **Absent-route columns score as zero — silent gaps reset the counter (closes claude round-13 Issue 1).** `MemSampler.Flush` skips routes with no traffic in a flush window — they have **no `MatrixRow`** in that column's snapshot (`keyviz/sampler.go`'s `RunFlusher` only writes rows for routes that observed traffic). The detector treats a live route (any route present in the leader-local state map or the engine's current snapshot) that is **absent from a committed column** as `per_col_score = 0` for that column — below threshold, so the down-side hysteresis logic above resets `consecutiveOver` to zero. Without this, the counter would carry over silently across a quiet column and the "N consecutive hot columns" invariant would be violated: a route with `N-1` hot columns followed by one absent column followed by `N` hot columns would promote on the second-`N`th hot column (counter never reset), defeating the hysteresis design. The cold-start absence case (route has never appeared yet, no entry in the state map) is already handled by "advance from the first committed column"; the **silent-gap case** (route was hot before, went quiet, will be hot again) is the new scenario this rule addresses. Mechanism: iterate live routes from the state map ∪ engine snapshot at each evaluation; for each route, look up its `MatrixRow` in the freshest committed column; if absent, score as zero and apply the reset. |
There was a problem hiding this comment.
Synthesize zeroes for every skipped column
Fresh evidence after the prior missing-row concern is that the mechanism only looks up a route in the freshest committed column. When evalInterval > Step, §5.2 says several new columns are processed in one evaluation; if a route is hot in column 1, absent in column 2, and hot again in the freshest column 3, checking only the freshest column preserves consecutiveOver across the silent gap and can promote without consecutive hot flushes. The detector needs to synthesize/apply the zero score for each committed column between lastProcessedAt and the newest one, in chronological order.
Useful? React with 👍 / 👎.
Issue 1: correct stale line anchors in kv/sharded_coordinator.go - keyviz sampler observeMutation: 1795-1824 -> 1841-1846 - RunHLCLeaseRenewal: 1914-1953 -> 1960-1985; defaultGroup access :1915 -> :1961 Issue 2: soften companion-doc references to in-flight PR #955 form, matching the #945/#951/#953 branch-reference style (3 sites + map line). Issue 3: ground OQ-1 in the actual commitTS logic (nextStartTS/ resolveTxnCommitTS/nextTxnTSAfter all from one c.clock) and annotate §4 step 11 as deferred-pending-OQ-1 with an explicit trigger condition. Inline: fix learner LinearizableRead behaviour (engine returns ErrNotLeader, caller forwards; engine.go:1583); note shared-startTS invariant; add OQ-7 for the single-node->multi-node live cutover / rolling-upgrade strategy.

Design-doc-only PR. No implementation code. Per the repo's design-doc-first workflow, this lands the M3 proposal for review before any implementation.
New file:
docs/design/2026_06_11_proposed_hotspot_split_milestone3_automation.md(Status: Proposed,Author: bootjp,Date: 2026-06-11).Summary
Milestone 3 of the hotspot shard split design: automation — auto-detection + an auto-split scheduler. M1 (durable versioned route catalog, manual same-group
SplitRange, route watcher) has shipped; M2 (#945) proposes the cross-group migration plane. M3 composes with M2 but delivers first value standalone: auto-detection + same-group auto-split (the M1 capability) work without M2; cross-group target selection slots in behind the sameSplitRangeinterface once M2 lands.It closes the gap the parent doc §1 calls out:
distribution/engine.go'sRecordAccess/threshold-splitRangeexists but is wired into nothing (only tests call it; verified by grep) and is midpoint-only + mutates the engine directly (bypassing the catalog/watcher).Key design points
MemSampleris already wired into the exact request-path functions the parent doc names (groupMutations→observeMutation,observeReadinkv/sharded_coordinator.go), is allocation-free on the hot path, splits reads vs writes, is windowed, has sub-range + Top-K split-key evidence, and bounds cardinality. Driving detection off it avoids a second parallel pipeline. The engine's deadRecordAccess/splitRange/Route.Loadpath is removed in M3-PR1. M3 adds zero hot-path cost — it consumes already-flushed windows off-path.MemSampler.Snapshotfor routes it leads (the leader serves the load it would split on). NoReportAccessRPC and no RPC piggybacking; cluster fan-out stays an admin-UI concern and can slot behind the detector interface later if needed.score = write_ops*Ww + read_ops*Wr(4/1), 50k ops/min threshold, 3-consecutive-window up-side hysteresis + 0.8× down-side band, per-route 10-min cooldown (monotonic deadline, not wall clock), split-key from observed distribution (sub-range p50 / single-key isolation / decline when no evidence — never blind midpoint), guardrails: min route span, max routes, per-cycle split budget.proto.Distribution.SplitRangeRPC withexpected_catalog_versionCAS; all catalog mutations go throughSplitRange(repo convention). Same-group today; once M2 lands the least-loaded-group target policy flips thetarget_group_idfield on the same RPC.--autoSplit; runtime kill switch (atomic.Bool); fixed-enum metrics (no per-route/key labels); stable slog keys; detector state is leader-local and resets on election (re-earns confidence; cooldown re-seeded fromparent_route_idlineage) — stated explicitly so non-replication isn't mistaken for an oversight.Every current-state claim cites concrete
file:lineevidence, verified before writing.Open questions
Route.Load/Engine.Stats, or keepStatsfor diagnostics?[0.8×, 1.0×)band, or decay it?minRouteSpanwithout per-route key counts: approximate from sampler evidence, or add a cheap key-count estimate?--autoSplitCrossGroup?parent_route_id+ a catalog split timestamp enough, or do we need durablelast_split_at?Stepto avoid coupling sensitivity to--keyvizStep?Test plan
Doc-only — no Go tests run. Markdown verified clean; all cited
file:linereferences confirmed againstmain. The doc's own §8.2 lays out the implementation-phase test strategy (table-driven detector unit tests +pgregory.net/rapidinvariants, detector→SplitRangeintegration in the 3-node demo, Jepsen hotspot workload deferred to M4 per the parent doc).