fix: DEVOPS-364 governance proposal creation 504 + stuck publish spinner#777

Merged
frankmeds merged 3 commits into main from fix/governance-proposal-504-and-spinner on Jun 4, 2026

@frankmeds frankmeds commented Jun 4, 2026

Problem

Users could not create proposals on the governance portal. Two distinct bugs, found by reproducing on staging and tracing the GCP logs:

  1. Backend 504. POST /api/message calls getLiquidity(), which downloads the entire gZIL ZRC2 balances map (~5 MB) from api.zilliqa.com on every proposal. From the GKE cluster this takes ~30 s, exceeding the gateway's 30 s backend_timeout -> 504. Verified on staging:

    OPTIONS /api/message 200 (0.38s)
    POST    /api/message 504 (30.37s, backend_timeout)
    # app log: "Signature verified" -> 30s gap -> "Zilliqa liquidity fetched"
    
  2. Frontend infinite spinner. client.request never settled its promise when the error body was not JSON (e.g. a 504 gateway HTML page) or the request was CORS-blocked. The error path, e.json().then(json => reject(json)), only called reject once the body had parsed as JSON, so any failure before that point left the outer promise pending. The publish spinner then spun forever with no error shown.

Changes

governance-api - lib/zilliqa/custom-fetch.ts, lib/routes/message.ts, cd/overlays/{staging,production}/backendpolicy.yaml

  • Fix the 504 by raising the backend timeout, not by shrinking the fetch. The full balances map is fetched on purpose (GetSmartContractSubState(..., "balances", [])): it is pinned to IPFS as the whole-electorate voter-scoring snapshot that governance-snapshot reads via get-scores.ts (proposal.balances[voter]). GCPBackendPolicy.spec.default.timeoutSec is raised to 90 s (staging + production) so the ~30 s fetch is no longer killed by GKE's 30 s default.
  • Normalise the submitter's lookup key to lowercase 0x... (Scilla state key format) for the MIN_BALANCE gate; also fixes a latent checksum mismatch for bech32/ZilPay addresses that previously made the gate read undefined.
  • Seed the submitter's key before LP parsing so ZilSwap/XCAD liquidity is still credited toward the 30-gZIL gate when the user has no direct balance.
  • Guard null RPC results; default missing balances to "0" (fail-closed gate).
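
The gate-related bullets above (lowercase key, seeding, null guard, fail-closed default) can be illustrated with a minimal sketch. This is not the code in custom-fetch.ts: the names toStateKey, seedOwner, readBalance and meetsMinimum are hypothetical, the input is assumed to already be a base16 address (bech32 decoding is left out), and balances are assumed to be non-negative integer strings in the token's smallest unit.

```typescript
// Assumed shape of the "balances" sub-state: lowercase 0x address -> integer string.
type Balances = Record<string, string>;

/** Normalise a base16 address to the lowercase 0x form used as the lookup key. */
function toStateKey(address: string): string {
  const hex = address.trim().toLowerCase();
  return hex.startsWith("0x") ? hex : `0x${hex}`;
}

/** Ensure the submitter has an entry so later LP credits have a key to add to. */
function seedOwner(balances: Balances, owner: string): Balances {
  const key = toStateKey(owner);
  return key in balances ? balances : { ...balances, [key]: "0" };
}

/** Read the submitter's balance; a null result or missing entry becomes "0". */
function readBalance(
  result: { balances?: Balances } | null | undefined,
  owner: string
): string {
  return result?.balances?.[toStateKey(owner)] ?? "0";
}

/** Fail-closed comparison of decimal integer strings (no precision loss). */
function meetsMinimum(balance: string, minimum: string): boolean {
  if (!/^\d+$/.test(balance) || !/^\d+$/.test(minimum)) return false;
  const a = balance.replace(/^0+(?=\d)/, "");
  const b = minimum.replace(/^0+(?=\d)/, "");
  return a.length !== b.length ? a.length > b.length : a >= b;
}
```

Comparing the balances as strings avoids rounding large token amounts through a floating-point number; anything that does not look like a non-negative integer is treated as failing the gate.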

governance-snapshot - src/helpers/client.ts

  • Rewrite request() to always settle: AbortController (95 s timeout, just above the gateway's 90 s so a real 504 surfaces instead of the client aborting first), defensive JSON parsing, structured rejects ({code, error_description} or fallback). A 504/non-JSON now shows an error toast instead of hanging.
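
A rough sketch of the "always settle" shape described above, for readers who have not opened client.ts. It is an illustration under assumptions, not the merged code: the FetchLike and ResponseLike types, the parseJson helper, the "timeout" and "network" codes and the fallback message wording are all hypothetical. Only the AbortController timeout, defensive JSON parsing and {code, error_description} rejects come from the description.

```typescript
// Minimal structural types so the sketch can be driven by a fake fetch in tests.
interface ResponseLike {
  ok: boolean;
  status: number;
  text(): Promise<string>;
}
type FetchLike = (
  url: string,
  init: { method: string; body: string | undefined; signal: AbortSignal }
) => Promise<ResponseLike>;

interface RequestError {
  code: number | string;
  error_description: string;
}

/** Returns the parsed value, null for an empty body, undefined if not JSON. */
function parseJson(text: string): unknown {
  try {
    return text ? JSON.parse(text) : null;
  } catch {
    return undefined;
  }
}

async function request(
  fetchImpl: FetchLike,
  url: string,
  body?: unknown,
  timeoutMs = 95000
): Promise<unknown> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(url, {
      method: body === undefined ? "GET" : "POST",
      body: body === undefined ? undefined : JSON.stringify(body),
      signal: controller.signal,
    });
    const parsed = parseJson(await res.text());
    if (res.ok) return parsed ?? null;
    // Preserve a structured server error; otherwise fall back to the status.
    if (parsed && typeof parsed === "object" && "error_description" in parsed) {
      throw parsed;
    }
    const fallback: RequestError = {
      code: res.status,
      error_description: `Request failed with status ${res.status}`,
    };
    throw fallback;
  } catch (err) {
    if (controller.signal.aborted) {
      throw { code: "timeout", error_description: `No response within ${timeoutMs} ms` };
    }
    // Network and CORS failures surface as Error instances from fetch.
    if (err instanceof Error) {
      throw { code: "network", error_description: err.message };
    }
    throw err; // already a structured reject from the try block
  } finally {
    clearTimeout(timer);
  }
}
```

Because every path either returns or throws, the caller's promise cannot stay pending. In a browser or recent Node runtime the global fetch should be usable as the first argument, e.g. request(fetch, "/api/message", payload).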

Note on the approach (PR review C1): an earlier iteration scoped the balances fetch to the submitter ([ownerKey], ~109 B / ~1.8 s) to fix the 504. That was reverted: the same balances object is pinned to IPFS and consumed by the frontend as the whole-electorate scoring oracle, so scoping it would have collapsed every new proposal's vote tally to the submitter alone. The 504 is addressed by the raised timeout instead, keeping the full snapshot intact. Caching the snapshot server-side is a sensible future optimisation as the holder set grows.

Testing

Framework-free node:assert regression tests (the packages have no test runner), wired as npm test in both packages:

  • custom-fetch.test.ts: fullMapFetchTest asserts the balances RPC fetches the full map (params[2] == []) and that the electorate snapshot is retained (a second holder survives), guarding C1; the submitter's balance is read back via the normalized lowercase key; LP-only holder credited via the seeded key; null result -> "0" without throwing.
  • client.test.ts: rejects (no hang) on non-JSON 504; resolves on success; rejects with timeout on abort; preserves server JSON error; success-empty-body resolves.
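
For reference, a framework-free test in this style might look roughly like the sketch below. It only mirrors the full-map assertion described above: fetchBalances and the Rpc type are hypothetical stand-ins rather than the exports of custom-fetch.ts, and the stubbed addresses and amounts are made-up fixture values.

```typescript
import * as assert from "node:assert";

// Hypothetical RPC signature; the real module talks to api.zilliqa.com.
type Rpc = (
  method: string,
  params: unknown[]
) => Promise<{ balances?: Record<string, string> } | null>;

// Stand-in for the fetch under test: requests the whole "balances" map
// by passing an empty index list as the third parameter.
async function fetchBalances(rpc: Rpc, contract: string): Promise<Record<string, string>> {
  const result = await rpc("GetSmartContractSubState", [contract, "balances", []]);
  return result?.balances ?? {};
}

async function fullMapFetchTest(): Promise<void> {
  const calls: unknown[][] = [];
  const stub: Rpc = async (_method, params) => {
    calls.push(params);
    return { balances: { "0xaaa": "40", "0xbbb": "5" } };
  };

  const balances = await fetchBalances(stub, "0xcontract");

  // The index list must stay empty so the full electorate is fetched.
  assert.deepStrictEqual(calls[0]?.[2], []);
  // A second holder survives, i.e. the snapshot is not scoped to one address.
  assert.strictEqual(Object.keys(balances).length, 2);

  // A null RPC result must not throw.
  const empty = await fetchBalances(async () => null, "0xcontract");
  assert.deepStrictEqual(empty, {});
}
```

A runner-less suite like this is typically executed by calling each test function from a small entry script that npm test invokes; a rejected promise or thrown AssertionError then fails the process.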

Both suites pass (npm test in each package); both packages typecheck clean.

Verify on staging

  • ZilPay with >=30 gZIL -> proposal succeeds (201); proposal creation completes within the 90 s backend timeout (no 504).
  • MetaMask on the gZIL space -> fast 400 MIN_BALANCE with a visible toast (no infinite spinner).

Follow-ups (intentionally out of scope)

  1. Pre-existing: proposal() / vote() are called without await and their early-return responses are discarded (message.ts:229); malformed proposals still hit the slow path and can double-send the response.
  2. Product decision: EVM 0x addresses hold no gZIL on Zilliqa, so MetaMask proposals on the gZIL space always hit MIN_BALANCE - decide on address mapping vs. restricting to ZilPay.
  3. Logging: pino level is not mapped to Cloud Logging severity; 404s are logged at error level.
  4. Performance: proposal creation re-fetches the full ~5 MB holder snapshot (~30 s) on every submission; consider caching/indexing it server-side as the holder set grows (the 90 s timeout is headroom, not a permanent fix). The client.ts 95 s ceiling is also global; a per-call timeout would keep fast-fail for metadata GETs.
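
As a starting point for follow-up 4, one possible shape for server-side caching is a small time-bounded memoiser around the snapshot fetch. This is only a sketch: cachedLoader is a hypothetical helper, and whether a snapshot that is up to ttlMs old is acceptable for vote scoring is a product decision that this PR does not make.

```typescript
/**
 * Wrap an async loader so concurrent and repeated calls within ttlMs share
 * one result. The clock is injectable to keep the helper testable.
 */
function cachedLoader<T>(
  load: () => Promise<T>,
  ttlMs: number,
  now: () => number = Date.now
): () => Promise<T> {
  let entry: { value: Promise<T>; expires: number } | undefined;

  return () => {
    const t = now();
    if (entry && t < entry.expires) return entry.value;

    const fresh = { value: load(), expires: t + ttlMs };
    entry = fresh;
    // Do not keep a failed load around; the next call retries.
    fresh.value.catch(() => {
      if (entry === fresh) entry = undefined;
    });
    return fresh.value;
  };
}
```

Caching the promise rather than the resolved value means several proposals submitted while a fetch is in flight would wait on the same download instead of each starting their own.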

frankmeds added 2 commits June 4, 2026 15:49
getLiquidity downloaded the entire gZIL ZRC2 balances map (~5MB) on every proposal just to read one address, taking ~30s from the cluster and exceeding the gateway's 30s backend_timeout, returning 504.

- Fetch only the submitter's entry (index [ownerKey]) instead of the full map.
- Normalise the key to lowercase 0x (Scilla state key format; also fixes a latent checksum mismatch for bech32/ZilPay addresses).
- Seed the submitter's key so ZilSwap/XCAD LP balances are still credited with no direct balance.
- Guard null RPC results; default missing balances to '0' (fail-closed gate).
- Add framework-free regression tests.

client.request left its promise pending when an error response was not JSON (a 504 gateway HTML page) or the request was CORS-blocked, leaving the publish spinner stuck forever.

- Rewrite with async/await; always resolve or reject.
- AbortController with a 45s timeout.
- Parse error bodies defensively; reject with {code,error_description} or a fallback.
- Add framework-free regression tests.
@frankmeds frankmeds changed the title fix: governance proposal creation 504 + stuck publish spinner fix: DEVOPS-364 governance proposal creation 504 + stuck publish spinner Jun 4, 2026
…timeout

PR #777 review (C1): scoping the balances fetch to the submitter also shrank the IPFS-pinned snapshot that the frontend uses as the whole-electorate voter-scoring oracle (get-scores.ts reads proposal.balances[voter] before any live fallback), so proposals created after deploy would have counted only the submitter's vote.

- custom-fetch.ts: keep fetching the FULL holder map (index []) for the pinned snapshot; retain the lowercase-key gate lookup, null guards and LP seeding.
- backendpolicy.yaml (staging + production): GCPBackendPolicy timeoutSec=90 so the ~30s full-map fetch is not killed by GKE's 30s default (the actual 504 fix).
- client.ts: raise request timeout to 95s (above the gateway) so a real 504 surfaces instead of the client aborting first.
- test: fullMapFetchTest now asserts the FULL map is fetched + electorate retained (guards C1); wire 'npm test' in both packages (M1).
@frankmeds frankmeds merged commit c340291 into main Jun 4, 2026
3 checks passed
@frankmeds frankmeds deleted the fix/governance-proposal-504-and-spinner branch June 4, 2026 13:46