
Phase 1.5: Stage A federated master - leader-leased idempotent billing, join-as-master v0, snapshots, sweep#47

Open
ehsan6sha wants to merge 9 commits into main from phase-1.5-stage-a


@ehsan6sha


Phase 1.5 — Stage A federated master: full stack on box #2

Implements the Stage A milestone of the federated-master roadmap: a second full master (operator-run) with double-run-proof billing, the capstone installer v0, signed pinset snapshots, a replication sweep, and a fenced failover/resync runbook. Everything is additive and flag-gated (default OFF; single-master behavior is byte-identical when the flags are dark).

What's in

  • FM-2 billing safety (flags BILLING_IDEMPOTENCY, CRON_LEADER_LEASE):
    • migration 018: partial UNIQUE (user_id, reference_id) WHERE tx_type='hourly_deduction' (CONCURRENTLY, dup pre-check, .down.sql);
    • deduction uses a deterministic hour:YYYY-MM-DDTHH key and the history INSERT is the dedup gate before any balance change;
    • blockScanner records + credits in one transaction (creditUserTx); a crash can no longer strand a recorded-but-uncredited deposit;
    • cron family gated by a Postgres advisory-lock leader lease (holder crash ⇒ standby takes over on its next tick; a partitioned ex-leader can't bill at all — the shared DB is the arbiter).
  • update-scripts/join-as-master.sh v0 (capstone installer first cut): detect-installed / adopt-or-halt (adopts the Phase-1 kubo+cluster writer; halts on a foreign postgres), ordered migrations with halt-on-error, dockerized stack (healthchecks + restart policies + label-scoped watchtower), gateway profile auto-enabled when the fula-gateway image exists (built via fula-api#30, "feat(docker): reproducible fula-gateway image for federated masters"), safeguard crons installed + first snapshot taken immediately, params persisted to .env, idempotent re-runs.
  • pinset-snapshot.sh — signed (ed25519) authoritative pinset dumps, --verify/--restore/--install-cron (early FM-3 restore path).
  • replication-sweep.sh — below-REPL_MIN detection + recover + alert log + --strict (closes the S4 "no automated sweep" gap).
  • migration 019: fixes a real fresh-install bug found by this e2e run. Post-PII linkWallet stores wallet_address=NULL, but fresh schemas keep the column NOT NULL (migration 012 missed it), so every fresh deployment rejected wallet links. Guarded .down.sql.
  • Failover/resync runbook (fenced flips; no-verification-no-flip).
  • Tests: 30 vitest unit tests + 2 live-Postgres integration tests + the e2e drill suite (tests/e2e/phase-1.5/).
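
To make the FM-2 dedup gate concrete, here is a minimal in-memory sketch of the idea, not the code in this PR. It assumes the key is simply the UTC hour prefix of an ISO timestamp; `hourReferenceId` and `DeductionLedger` are hypothetical names, and the `seen` map stands in for the partial UNIQUE index from migration 018.

```typescript
// Illustrative sketch: these names are hypothetical, not the repository's API.

/** Deterministic per-hour key in UTC, in the hour:YYYY-MM-DDTHH shape. */
function hourReferenceId(at: Date): string {
  // toISOString() is always UTC ("2026-06-11T19:19:00.000Z"); keep "YYYY-MM-DDTHH".
  return "hour:" + at.toISOString().slice(0, 13);
}

/** In-memory stand-in for the history table and its partial UNIQUE index. */
class DeductionLedger {
  private seen: Record<string, boolean> = {};
  private balances: Record<string, number> = {};

  constructor(initial: Record<string, number>) {
    for (const user of Object.keys(initial)) this.balances[user] = initial[user];
  }

  /**
   * Models "INSERT ... ON CONFLICT DO NOTHING" followed by the balance update.
   * Returns true when this call billed, false when the hour was already billed.
   */
  deduct(userId: string, at: Date, fee: number): boolean {
    const key = userId + "|" + hourReferenceId(at);
    if (this.seen[key] === true) return false; // conflict: already billed this hour
    this.seen[key] = true; // history row first: this is the dedup gate
    this.balances[userId] = (this.balances[userId] || 0) - fee;
    return true;
  }

  balanceOf(userId: string): number {
    return this.balances[userId] || 0;
  }
}
```

Because the key depends only on the user and the UTC hour, two masters that both fire within the same hour compute the same key, and only the first insert reaches the balance update.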

E2E evidence (clean Ubuntu 24.04 box, real daemons, real Postgres)

Drill suite 60-master-drills.sh — final run all green (RESULT pass=18 fail=0):

  • D1 stack healthy (postgres/API/webui; gateway profile live on :9000 with durable pin queue active)
  • D2 migration 018 present
  • D3 two webui masters → exactly one leader + one standby → exactly ONE hourly_deduction row per (user, hour)
  • D4 kill -9 the leader → standby acquires the lease → STILL exactly one row (idempotency under failover)
  • D5 live-PG integration: concurrent same-hour deductions deduct once; replayed deposit credits once, atomically (2/2)
  • D6 snapshot taken + signature verifies + tampered file rejected + unpin→--restore re-pins
  • D7 sweep clean → forced under-replication (3 of 4 peers down) detected + alerted → reconverged clean
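
The D7 check boils down to counting live replicas per pin and flagging anything below REPL_MIN. The sketch below illustrates that check only; the `PinStatus` shape, the `underReplicated` name, and the idea that liveness arrives as a list of peer IDs are all assumptions, and the real replication-sweep.sh may gather its data differently.

```typescript
// Hypothetical shapes for illustration; the real script's data source may differ.
interface PinStatus {
  cid: string;
  /** Peers that report holding this pin. */
  pinnedOn: string[];
}

/** Return the CIDs whose count of live replicas is below replMin. */
function underReplicated(pins: PinStatus[], livePeers: string[], replMin: number): string[] {
  const findings: string[] = [];
  for (const pin of pins) {
    // Only copies on peers that are currently up count toward the minimum.
    const liveCopies = pin.pinnedOn.filter((peer) => livePeers.indexOf(peer) !== -1).length;
    if (liveCopies < replMin) findings.push(pin.cid);
  }
  return findings;
}
```

With four peers and a minimum of two (an assumed value), taking three peers down leaves at most one live copy per pin, so every pin is reported until the peers return.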

Mixed-fleet/no-forced-upgrade invariant unaffected: all changes are master-side and dark by default; providers and existing data untouched (Phase 1 drills covered the provider side).

Cross-repo

Scoping note

The full FxFiles-flow upload/download fidelity suite runs with Phase 2 (which changes the upload path; this phase doesn't touch it). Stage A failover is operator-fenced per the runbook until FM-1 (bucket-root CAS, Phase 2.5) enables auto-failover.

Closes #46

🤖 Generated with Claude Code

ehsan6sha and others added 9 commits June 11, 2026 19:19
…c deposit credit, cron leader-lease

Federated masters (Phase 1.5, Stage A) groundwork. All flag-gated, default
OFF - single-master behavior is byte-identical when dark:

- migration 018: partial UNIQUE index (user_id, reference_id) WHERE
  tx_type=hourly_deduction (CONCURRENTLY; pre-checks duplicates; .down.sql)
- deductionJob (BILLING_IDEMPOTENCY=true): deterministic reference_id
  hour:YYYY-MM-DDTHH (UTC) and the history INSERT becomes the dedup gate
  (ON CONFLICT DO NOTHING) BEFORE the balance update - N masters deduct
  exactly once per (user, hour)
- blockScanner (BILLING_IDEMPOTENCY=true): deposit insert + creditUserTx +
  claimed_at in ONE transaction - a crash can no longer strand a
  recorded-but-uncredited tx; creditService gains creditUserTx (caller-owned
  transaction; creditUser delegates, zero behavior change)
- leaderLease (CRON_LEADER_LEASE=true): Postgres session advisory lock on a
  dedicated client gates every cron tick; holder crash frees the lock so a
  standby master takes over on its next tick; SIGTERM releases explicitly
- tests: hour-bucket determinism/UTC/collision, fee formula unchanged,
  flag-off no-op gate (DB-free; multi-master paths covered by Phase 1.5 e2e)

Part of #46

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
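
The leader-lease behavior described above can be pictured with a toy in-memory model. This is not Postgres client code: `LockArbiter` and `cronTick` are hypothetical names, and the arbiter only imitates the property that a session-scoped advisory lock disappears when its session ends (whether the real code uses `pg_try_advisory_lock` or another variant is not stated in this PR).

```typescript
// Toy in-memory model of a session-scoped advisory lock; not Postgres client code.
class LockArbiter {
  private holder: string | null = null;

  /** Non-blocking acquire; re-acquiring by the current holder succeeds. */
  tryAcquire(sessionId: string): boolean {
    if (this.holder === null || this.holder === sessionId) {
      this.holder = sessionId;
      return true;
    }
    return false;
  }

  /** Models an explicit release (SIGTERM) or the holder's session dying. */
  endSession(sessionId: string): void {
    if (this.holder === sessionId) this.holder = null;
  }
}

/** Run one cron tick: the job executes only if this master holds the lease. */
function cronTick(arbiter: LockArbiter, masterId: string, job: () => void): boolean {
  if (!arbiter.tryAcquire(masterId)) return false; // standby: skip this tick
  job();
  return true;
}
```

In this model a standby never runs the job while the leader's session is alive, and it takes over on its first tick after `endSession`, which mirrors the "holder crash frees the lock" behavior in the commit message.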
… replication sweep

Phase 1.5 Stage A deliverables (#46):
- docker/master: compose stack (postgres-pinning, pinning OpenAPI from
  main_postgres.go, pinning-webui with cron family) - host networking +
  127.0.0.1 binds like prod, healthchecks + restart policies + label-scoped
  watchtower; optional fula-gateway profile (auto-enabled when image exists)
- update-scripts/join-as-master.sh v0: capstone installer first cut -
  detect/adopt-or-halt (adopts the Phase-1 kubo+cluster writer; halts on a
  foreign postgres-pinning), ordered migrations with halt-on-error + marker,
  idempotent re-runs, params persisted to .env (phase-common pattern)
- update-scripts/pinset-snapshot.sh: signed (ed25519) authoritative pinset
  dumps + --verify/--restore/--install-cron (early FM-3 restore path)
- update-scripts/replication-sweep.sh: below-REPL_MIN detection + recover +
  alert log + --strict for drills (closes the S4 sweep gap)
- test seams: processUserDeduction exported; cron intervals env-overridable
  (SCANNER_INTERVAL_MS/DEDUCTION_INTERVAL_MS, defaults unchanged)
- tests: fm2-billing-integration (live-Postgres; skips cleanly without DB) -
  concurrent same-hour deduction races deduct once; replayed deposit credits
  once and leaves no recorded-but-uncredited state

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
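
The "no recorded-but-uncredited state" property that the integration test checks can be sketched with an all-or-nothing update over an in-memory state. This is a model of the invariant, not the blockScanner code; `State`, `inTransaction`, and `recordAndCredit` are hypothetical names, and the `crashBeforeCredit` switch exists only to simulate a failure between the two writes.

```typescript
// Toy in-memory model of "record + credit in one transaction"; names are hypothetical.
interface State {
  deposits: Record<string, number>; // txHash -> amount recorded
  balances: Record<string, number>; // userId -> credits
}

/** Apply fn to a copy and commit only if it returns normally (all-or-nothing). */
function inTransaction(state: State, fn: (draft: State) => void): State {
  const draft = JSON.parse(JSON.stringify(state)) as State;
  try {
    fn(draft);
  } catch (e) {
    return state; // rollback: neither the deposit row nor the credit survives
  }
  return draft; // commit
}

/** Record a deposit and credit the user together; a replayed txHash is a no-op. */
function recordAndCredit(
  state: State,
  txHash: string,
  userId: string,
  amount: number,
  crashBeforeCredit = false,
): State {
  return inTransaction(state, (draft) => {
    if (draft.deposits[txHash] !== undefined) return; // replay: already processed
    draft.deposits[txHash] = amount;
    if (crashBeforeCredit) throw new Error("simulated crash");
    draft.balances[userId] = (draft.balances[userId] || 0) + amount;
  });
}
```

A simulated crash leaves the state untouched, so the next scan sees the deposit as new and credits it once; a replay after success changes nothing.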
…ver, snapshots, sweep

D1 stack health, D2 migration-018 presence, D3 two webui masters -> one
leader + one standby + exactly one hourly_deduction per (user, hour),
D4 kill -9 leader -> standby acquires lease, STILL one row (idempotency
under failover), D5 live-Postgres vitest integration, D6 snapshot
take/verify/tamper-reject/unpin+restore, D7 sweep clean -> forced
under-replication detected (--strict) + alerted -> reconverged clean.

Part of #46

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…_PORT default)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… auto-generate + persist

Found by the live installer run (webui FATALs without them; container
crash-looped silently). join-as-master.sh now generates both once
(openssl rand) and persists to .env; compose fails fast with a clear
message if absent; drill webuis receive them too.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
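
The "generate once, then persist" rule is what keeps installer re-runs idempotent. A minimal sketch of that rule follows, written in TypeScript for consistency even though the installer itself is a shell script. `ensureEnvVar` is a hypothetical helper, the key name in the usage is a placeholder (the real variable names are not shown here), and the generator is injected rather than calling `openssl rand`.

```typescript
/**
 * Return .env text that is guaranteed to define `key`.
 * An existing value is kept as-is; otherwise generate() is called exactly once.
 */
function ensureEnvVar(envText: string, key: string, generate: () => string): string {
  for (const line of envText.split("\n")) {
    if (line.indexOf(key + "=") === 0) return envText; // already persisted: keep it
  }
  // Make sure the new entry starts on its own line.
  const needsNewline = envText.length > 0 && envText.charAt(envText.length - 1) !== "\n";
  const prefix = needsNewline ? envText + "\n" : envText;
  return prefix + key + "=" + generate() + "\n";
}
```

Calling it twice with the same key returns the first result unchanged, so a secret that containers already depend on is never rotated by accident on a re-run.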
Found by the live Phase 1.5 e2e run:
- migration 019: user_wallets.wallet_address DROP NOT NULL - post-PII
  linkWallet stores hash-only (wallet_address=NULL) so EVERY fresh install
  rejected wallet links; 012 relaxed user_email but missed this column
  (guarded .down.sql). Real fresh-deploy bug, not test-only.
- compose: mount fula-gateway-state at /var/lib/fula-gateway - the gateway
  durable state paths are hardcoded there; without the volume the S2 pin
  queue silently degrades to fire-and-forget and the bucket registry resets
  on restart.
- drills: D5 runner no longer swallows vitest exit (pipefail + explicit
  2-passed check); D6 takes the FIRST cid from the snapshot stream
  (was capturing a multiline list) and polls 60s for the restore.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
join-as-master.sh now installs the pinset-snapshot (6h) and
replication-sweep (30min) cron entries and takes the first snapshot
immediately - the restore path exists from minute one, per the
safeguards invariant (S4/S6 must be scheduled, not manual).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…pass count

The tests passed (2/2 on live Postgres) but the colored output broke the
literal match - strip escapes, then assert.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
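
The failure mode here is easy to reproduce: color escapes sit between the count and the word "passed", so a literal match on the raw output fails. The sketch below shows the strip-then-assert idea in TypeScript (the drill itself is a shell script); `stripAnsi` and `hasPassCount` are hypothetical names, and the regex covers only SGR color sequences, which is an assumption about what the runner emits.

```typescript
/** Remove ANSI SGR color sequences such as "\u001b[32m" from terminal output. */
function stripAnsi(text: string): string {
  return text.replace(/\u001b\[[0-9;]*m/g, "");
}

/** True when the cleaned output contains "<expected> passed" as a whole number. */
function hasPassCount(rawOutput: string, expected: number): boolean {
  // The leading guard stops "12 passed" from matching an expected count of 2.
  const pattern = new RegExp("(^|[^0-9])" + expected + " passed");
  return pattern.test(stripAnsi(rawOutput));
}
```

Stripping first and matching second keeps the assertion independent of whether the runner decides to colorize its output.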

Development

Successfully merging this pull request may close these issues.

Phase 1.5: Stage A federated master - full stack on box #2 (leader-lease, billing idempotency, snapshots, sweep, join-as-master v0)
