Skip to content

[SLOP(claude-opus-4-8-high)] feat(universaldb): postgres leader-resolver driver overhaul#5329

Open
MasterPtato wants to merge 1 commit into
stack/slop-claude-opus-4-8-feat-pegboard-envoy-rate-limit-envoy-ws-ingress-and-cap-get_pages-trim-unused-metrics-tnzyzqrlfrom
stack/slop-claude-opus-4-8-high-feat-universaldb-postgres-leader-resolver-driver-overhaul-xrrmomwz
Open

[SLOP(claude-opus-4-8-high)] feat(universaldb): postgres leader-resolver driver overhaul#5329
MasterPtato wants to merge 1 commit into
stack/slop-claude-opus-4-8-feat-pegboard-envoy-rate-limit-envoy-ws-ingress-and-cap-get_pages-trim-unused-metrics-tnzyzqrlfrom
stack/slop-claude-opus-4-8-high-feat-universaldb-postgres-leader-resolver-driver-overhaul-xrrmomwz

Conversation

@MasterPtato

Copy link
Copy Markdown
Contributor

No description provided.

@MasterPtato

Copy link
Copy Markdown
Contributor Author

@claude

claude Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Code Review: feat(universaldb): postgres leader-resolver driver overhaul

The overall architecture is solid — leader election via epoch-fenced CAS on udb_lease, SKIP-LOCKED group-commit batching, versioned BARE wire format, and a dedicated LISTEN connection for wakeup fanout. The design correctly mirrors several FDB invariants (monotone watermark, conflict-range OCC, versionstamp ordering). The issues below range from correctness concerns to style violations per CLAUDE.md and minor dead code.


Bug / Correctness

1. wait_for_leader polls with sleep — should use watch::changed()

engine/packages/universaldb/src/driver/postgres/commit.rs

async fn wait_for_leader(shared: &Arc<PostgresShared>) -> Result<LeaseInfo> {
    let deadline = Instant::now() + LEADER_WAIT_TIMEOUT;
    loop {
        if let Some(lease) = shared.current_lease() { return Ok(lease); }
        if Instant::now() >= deadline { ... }
        tokio::time::sleep(RESULT_POLL_INTERVAL).await;  // polling
    }
}

PostgresShared already holds a watch::Sender<Option<LeaseInfo>> (lease_tx). CLAUDE.md explicitly bans loop { check; sleep } poll patterns: "Never poll a shared-state counter with loop { if ready; sleep(Nms).await; }. Pair the counter with a … watch::channel that every … site pings." wait_for_leader should subscribe to shared.lease_rx with changed().await (plus a timeout) instead of sleeping. This is the hot path during leader transitions — every pending commit blocks on it.


2. udb_commit_requests_pending index missing epoch column — suboptimal under failover

engine/packages/universaldb/src/driver/postgres/database.rs

CREATE INDEX IF NOT EXISTS udb_commit_requests_pending
    ON udb_commit_requests (id) WHERE status = 'pending';

The drain query filters WHERE status = 'pending' AND epoch = $1. The partial index covers only status = 'pending', so Postgres scans all pending rows to filter by epoch. After a failover, stale-epoch pending rows accumulate until GC (up to 60 s), and the new leader's drain scans all of them on every batch. A composite index (epoch, id) WHERE status = 'pending' would fix this.


3. connect_and_run listener race — new subscriptions can be silently skipped

engine/packages/universaldb/src/driver/postgres/listener.rs

let poll_handle = tokio::spawn(async move { Self::poll_connection(connection, ...).await });
// ... re-LISTEN loop ...
*client.lock().await = Some(new_client);  // stored only after all LISTENs

client is set to Some(new_client) only after the re-LISTEN loop completes. If listen() is called concurrently while the loop is mid-execution, client is still None and the LISTEN is silently skipped. The subscription won't be registered until the next reconnect. Consider storing the client earlier or tracking pending channels and replaying them once the client is set.


4. table_stats CTE in handle_get_estimated_range_size is dead code

engine/packages/universaldb/src/driver/postgres/transaction_task.rs

table_stats AS (
    SELECT reltuples::bigint as total_rows
    FROM pg_class WHERE relname = 'kv' AND relkind = 'r'
)
SELECT CASE WHEN r.estimated_count = 0 THEN 0
            ELSE (r.sample_size * 100)::bigint
       END
FROM range_stats r, table_stats t

total_rows from table_stats is never referenced in the projection. The CTE and cross-join add noise and the cross-join's single-row assumption would silently corrupt results if the table were renamed or absent. Either remove table_stats entirely, or wire total_rows into the estimation logic as intended.


5. GC query deletes all rows, but comment says "terminal and orphaned"

engine/packages/universaldb/src/driver/postgres/database.rs

/// Terminal and orphaned commit-request rows older than this are garbage collected.
...
"DELETE FROM udb_commit_requests WHERE created_at < now() - ($1::bigint * interval '1 second')"

The query has no status filter — it deletes all rows (including pending) older than 60 s. The comment is misleading. While await_result handles a GC'd row gracefully (Ok(None)NotCommitted), the comment should say "all rows" or the query should filter to terminal statuses only if pending rows should be spared. At minimum the doc comment must match the implementation.


Code Style (CLAUDE.md violations)

6. anyhow! macro used instead of .context()

engine/packages/universaldb/src/driver/postgres/transaction_task.rs

let _ = response.send(Err(anyhow!("postgres transaction connection failed")));

CLAUDE.md: "Prefer .context() over the anyhow! macro." Use anyhow::Error::msg("…") or attach .context("…") to the originating error where available. There are multiple repetitions of this inside fail_receiver that should all be updated.


7. commit() and commit_ref() have identical bodies

engine/packages/universaldb/src/driver/postgres/transaction.rs

The two trait methods are word-for-word the same implementation. Extract a private async fn do_commit(&self) and have both methods delegate to it.


8. Catch-all _ => Status::Pending on string-backed status

engine/packages/universaldb/src/driver/postgres/commit.rs

let status = match status.as_str() {
    "committed" => Status::Committed,
    "conflict"  => Status::Conflict,
    _           => Status::Pending,    // silently handles unknown statuses
};

Any unexpected string (e.g., a future 'claimed' intermediate status) is silently treated as Pending, causing waiters to hang until timeout. At minimum add an explicit "pending" arm and log a tracing::warn! for the fallback. CLAUDE.md prefers exhaustive matching on closed value sets.


Performance

9. seq_high reads last_value before first nextval — off-by-one at startup

engine/packages/universaldb/src/driver/postgres/resolver/mod.rs

let seq_high: i64 = conn
    .query_one("SELECT last_value FROM udb_version_seq", &[])
    ...
Ok(durable.max(seq_high).max(0) as u64)

Postgres sequences initialized with START WITH 1 return last_value = 1 even before the first nextval (when is_called = false). On a fresh database, recovery_floor returns 1, which causes every early commit with read_version = 0 to hit the cold-window guard unnecessarily. Check is_called from pg_sequences before trusting last_value, or start the sequence at 0.


10. Unused dependencies: base64 and tokio-util in Cargo.toml

engine/packages/universaldb/Cargo.toml

Both crates appear to have been added in this PR but are not referenced in any new source files. These should be removed to avoid bloating build times.


Minor / Non-blocking

11. LEASE_TTL_SECS passed as string for interval arithmetic

now() + ($3 || ' seconds')::interval

Using $3 * interval '1 second' with $3 typed as BIGINT is more idiomatic Postgres and avoids the string cast.

12. recovery_floor opens two sequential pool connections

Both current_durable_version and seq_high reads each check out a pool connection. A single query joining udb_lease and reading udb_version_seq would reduce round-trips and pool pressure at startup.

13. Status enum missing #[derive(Debug)]

engine/packages/universaldb/src/driver/postgres/commit.rs

Small but useful to add for tracing/logging consistency.


Summary

Severity # Items
Bug / Correctness 5 #1 (poll-sleep), #2 (index), #3 (listener race), #4 (dead CTE), #5 (GC comment)
CLAUDE.md violation 3 #6 (anyhow! macro), #7 (commit duplication), #8 (catch-all string match)
Performance 2 #9 (seq_high startup), #10 (unused deps)
Minor 3 #11 (interval style), #12 (extra pool conn), #13 (missing Debug)

Highest-priority fixes: #1 (hot-path polling), #2 (index), #4 (dead CTE), #6 (anyhow! macro), and #7 (commit duplication).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant