Skip to content

Remove bootstrap nodes and replace with operator peers#3909

Open
lrsaturnino wants to merge 31 commits into
mainfrom
feature/decouple-firewall-allowlist
Open

Remove bootstrap nodes and replace with operator peers#3909
lrsaturnino wants to merge 31 commits into
mainfrom
feature/decouple-firewall-allowlist

Conversation

@lrsaturnino
Copy link
Copy Markdown
Member

@lrsaturnino lrsaturnino commented Mar 25, 2026

Problem

The network relies on centrally-managed bootstrap nodes for initial peer discovery. Their embedded public keys bypass firewall IsRecognized() staking checks via the AllowList, meaning an unstaked or slashed embedded peer retains permanent network access. Operators hardcoding bootstrap addresses in --network.peers will lose connectivity when bootstrap infrastructure is decommissioned.

Solution

  • Replace bootstrap node entries with operator peers (/dns4/ or /ip4/ format)
  • Decouple firewall AllowList from discovery peers — pass firewall.EmptyAllowList so all peers pass IsRecognized() staking checks
  • Remove dead ExtractPeersPublicKeys function
  • Deprecate --network.bootstrap flag with runtime warning
  • Rename connected_bootstrap_count metric to connected_wellknown_peers_count

Tests

  • TestValidate_EmptyAllowList_RecognizedPeerAccepted — recognized peer passes via IsRecognized path
  • TestValidate_EmptyAllowList_UnrecognizedPeerRejected — unrecognized peer rejected with no AllowList bypass
  • TestValidate_EmptyAllowList_PreviouslyAllowlistedPeerMustPassIsRecognized — previously allowlisted peer no longer bypasses checks
  • TestResolvePeers — updated expectations for new operator peer entries (mainnet and testnet)
  • TestNetworkBootstrapFlagDescription_ContainsDeprecationNotice — flag description includes deprecation text
  • TestIsBootstrap — returns correct boolean value
  • TestConnectedWellknownPeersCountMetricName — metric constant has correct value
  • TestObserveConnectedWellknownPeersCount_Callable — renamed function exists and executes without panic

Summary by CodeRabbit

  • Deprecations

    • The --network.bootstrap CLI flag is now deprecated and scheduled for removal in v3.0.0.
  • Network Configuration

    • Updated peer node addresses for mainnet and testnet connectivity.
    • Updated Electrum service endpoint for testnet.
  • Metrics

    • Enhanced network connectivity metrics for peer monitoring.

Review Change Stack

Replace all Boar bootstrap node entries with curated DNS-backed operator
peers to complete the bootstrap infrastructure decommission. This is the
final phase — Staked nodes were removed in v2.5.1.

Peer list changes:
- Mainnet: 2 Boar entries replaced with 5 operator peers at keep-nodes.io
- Testnet: 1 Boar entry replaced with 2 operator peers at test.keep-nodes.io
- All entries use /dns4/ format for IP change resilience

Security — AllowList decoupling:
- Pass firewall.EmptyAllowList instead of extracting embedded peer keys
- All peers (including embedded operators) now pass IsRecognized() staking
  checks with no firewall bypass
- Remove dead ExtractPeersPublicKeys function and its tests

Deprecations and renames:
- Deprecate --network.bootstrap flag with runtime warning
- Rename connected_bootstrap_count metric to connected_wellknown_peers_count

Documentation:
- Add operator migration guide covering Boar address removal,
  --network.peers override behavior, and monitoring updates
@lrsaturnino lrsaturnino changed the title Remove Boar bootstrap nodes and replace with operator peers Remove bootstrap nodes and replace with operator peers Mar 25, 2026
Operator migration guidance will be distributed separately from the
code release.
Replace placeholder mainnet entries with the beta staker node
(143.198.18.229:3919) as the sole embedded peer for initial testing.
Testnet placeholders remain until operator coordination is complete.
@lrsaturnino lrsaturnino marked this pull request as ready for review March 25, 2026 04:53
gotestsum's default `./...` package pattern only applies when no args
are passed after `--`. With `-- -timeout 15m`, it forwards args directly
to `go test`, which defaults to `.` (root package only). The root
package has no test files, so CI has been silently running 0 tests.
Replace placeholder hostnames with actual values from config/_peers/testnet.
Remove TestConnectedWellknownPeersCountMetricName and TestMetricConstants
which only assert that string constants equal themselves. The compiler
already ensures rename safety. Keep the callable integration test which
validates the function exists and executes without panicking.
…d var

Change EmptyAllowList from an exported mutable package-level var to an
exported function returning the package-level singleton. This prevents
external code from accidentally mutating the shared empty allowlist.
Add a note that connected_wellknown_peers_count was previously named
connected_bootstrap_count, so operators can update Prometheus queries
and Grafana dashboards accordingly.
Specify concrete removal version so the deprecated flag does not linger
indefinitely.
Add a TODO comment noting that at least one additional mainnet peer
across a different operator/ASN should be added before production
rollout to avoid a single point of failure for initial peer discovery.
Go 1.24 vet rejects non-constant format strings in fmt.Errorf. This
pre-existing issue was hidden because CI was not running tests.
lionakhnazarov
lionakhnazarov previously approved these changes Mar 25, 2026
Copy link
Copy Markdown
Collaborator

@lionakhnazarov lionakhnazarov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

piotr-roslaniec and others added 2 commits March 26, 2026 09:40
## Summary

- CI has been silently running **0 tests** because `gotestsum --
-timeout 15m` (without `./...`) only tests the root package, which has
no test files
- Fix: add explicit `./...` so all subpackages are tested
- Also fix `peers_test.go` placeholder hostnames to match actual
`config/_peers/testnet` values — this test failure was hidden by the
above bug

## Root cause

`gotestsum`'s default `./...` package pattern only applies when **no
args** are passed after `--`. With `-- -timeout 15m`, gotestsum forwards
args directly to `go test`, which defaults to `.` (current directory
only). CI log confirms: `DONE 0 tests in 8.107s`.

## Test plan

- [ ] CI should now run all Go tests (expect 100+ tests instead of 0)
- [ ] `TestResolvePeers/sepolia_network` should pass with corrected
hostnames


🤖 Generated with [Claude Code](https://claude.com/claude-code)
piotr-roslaniec and others added 10 commits April 2, 2026 07:57
gotestsum's default `./...` package pattern only applies when no args
are passed after `--`. With `-- -timeout 15m`, it forwards args directly
to `go test`, which defaults to `.` (root package only). The root
package has no test files, so CI has been silently running 0 tests.
Replace placeholder hostnames with actual values from config/_peers/testnet.
Remove TestConnectedWellknownPeersCountMetricName and TestMetricConstants
which only assert that string constants equal themselves. The compiler
already ensures rename safety. Keep the callable integration test which
validates the function exists and executes without panicking.
…d var

Change EmptyAllowList from an exported mutable package-level var to an
exported function returning the package-level singleton. This prevents
external code from accidentally mutating the shared empty allowlist.
Add a note that connected_wellknown_peers_count was previously named
connected_bootstrap_count, so operators can update Prometheus queries
and Grafana dashboards accordingly.
Specify concrete removal version so the deprecated flag does not linger
indefinitely.
Add a TODO comment noting that at least one additional mainnet peer
across a different operator/ASN should be added before production
rollout to avoid a single point of failure for initial peer discovery.
Go 1.24 vet rejects non-constant format strings in fmt.Errorf. This
pre-existing issue was hidden because CI was not running tests.
## Summary

- CI has been silently running **0 tests** because `gotestsum --
-timeout 15m` (without `./...`) only tests the root package, which has
no test files
- Fix: add explicit `./...` so all subpackages are tested
- Also fix `peers_test.go` placeholder hostnames to match actual
`config/_peers/testnet` values — this test failure was hidden by the
above bug

## Root cause

`gotestsum`'s default `./...` package pattern only applies when **no
args** are passed after `--`. With `-- -timeout 15m`, gotestsum forwards
args directly to `go test`, which defaults to `.` (current directory
only). CI log confirms: `DONE 0 tests in 8.107s`.

## Test plan

- [ ] CI should now run all Go tests (expect 100+ tests instead of 0)
- [ ] `TestResolvePeers/sepolia_network` should pass with corrected
hostnames


🤖 Generated with [Claude Code](https://claude.com/claude-code)
@piotr-roslaniec piotr-roslaniec force-pushed the feature/decouple-firewall-allowlist branch from 16ace1b to 5a6bb93 Compare April 2, 2026 06:04
piotr-roslaniec
piotr-roslaniec previously approved these changes Apr 2, 2026
The count() loop could call close(watcher.channel) on consecutive ticks
before the WatchBlocks cleanup goroutine removed the cancelled watcher
from the list, causing a "close of closed channel" panic. Use sync.Once
to guarantee the channel is closed exactly once.
piotr-roslaniec
piotr-roslaniec previously approved these changes Apr 3, 2026
@pdyraga
Copy link
Copy Markdown
Member

pdyraga commented Apr 7, 2026

I strongly advise against this change.

Bootstrap nodes in the tBTC client are not just an address book. They work in a relay mode, improving the dissemination of network messages. Given the specificity of the tECDSA algorithm used by tBTC nodes, the delivery of messages on time is essential, and messages are dropped by nodes when CPU load is high executing tECDSA steps. This is how libp2p floodsub works, and this is how we designed message delivery buffers in the client. Another symptom that is clearly visible in metrics is the number of connections being dropped when executing CPU-intensive tECDSA steps.

Bootstrap nodes work on a high-availability infrastructure, do not participate in tECDSA, and provide a stable entrance point to the network. If several operator nodes lose their in-memory address books for any reason (could be, for example, a high CPU load leading to a restart or panic in the algorithm), they will find HA bootstrap immediately and recover the peer list in a matter of seconds.

Eliminating bootstrap nodes should be preceded by changing the network layer design, probably moving to gossip mode, and performing load tests on a larger network resembling mainnet conditions. We made an attempt at this architecture change in the past, and it was not a trivial endeavour.

@nkuba
Copy link
Copy Markdown
Member

nkuba commented Apr 7, 2026

I agree with @pdyraga here.

The current network architecture has been battle-tested over a long period of time. It went through months of iterative testing on testnet and then further validation in production. This setup is not accidental, it reflects a lot of learnings around reliability and behavior under load.

Bootstrap nodes seem to play a role beyond just peer discovery, especially when it comes to message propagation and overall network stability during heavy tECDSA workloads. Removing them without a proven alternative introduces a real risk.

I strongly recommend against changing such a critical part of the system without thorough validation. At minimum, this should be backed by extensive testing and observation in controlled environments and then gradual rollout with production monitoring.

It’s also worth noting that the testnet is no longer maintained and has effectively been shut down, which makes it even harder to properly validate changes like this before exposing them to mainnet conditions.

In short, I’m in favor of the security improvements, but removing bootstrap infrastructure at this stage feels premature without stronger evidence and testing.

- Updated Electrum URL from WebSocket to TCP.
- Replaced existing peer addresses with new IP-based entries for testnet.
Replace the single mainnet peer (a documented single point of failure for initial peer discovery) with the full set of staying-operator nodes. Peer IDs were verified against live node diagnostics and each address confirmed reachable on its libp2p port; operators advertising stable DNS names use /dns4/ so entries survive host/IP changes, and non-default ports are preserved. Resolves the in-file TODO to add peers across different operators/ASNs before production rollout.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 29, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 69043df4-c783-4784-90ce-045af8435697

📥 Commits

Reviewing files that changed from the base of the PR and between 42c97f3 and 4497643.

📒 Files selected for processing (4)
  • go.mod
  • pkg/firewall/firewall.go
  • pkg/firewall/firewall_test.go
  • pkg/tbtc/node_test.go
✅ Files skipped from review due to trivial changes (1)
  • go.mod
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/tbtc/node_test.go
  • pkg/firewall/firewall_test.go

📝 Walkthrough

Walkthrough

This PR deprecates bootstrap-based peer networking in favor of well-known peers, refactors the firewall allow-list API, updates metrics naming, removes obsolete bootstrap extraction logic, and refreshes peer configuration. Network initialization no longer derives firewall policy from bootstrap peer keys.

Changes

Bootstrap deprecation and well-known peers migration

Layer / File(s) Summary
Bootstrap CLI flag and isBootstrap() deprecation
cmd/flags.go, cmd/start.go, cmd/start_test.go
network.bootstrap flag marked deprecated with removal scheduled for v3.0.0; isBootstrap() emits a warning when enabled. Tests verify deprecation notice in flag usage and isBootstrap() behavior for enabled/disabled states.
EmptyAllowList API and network firewall initialization
pkg/firewall/firewall.go, pkg/firewall/firewall_test.go, cmd/start.go
Exported EmptyAllowList variable converted to EmptyAllowList() function returning a private singleton. Network initialization constructs firewall policy using EmptyAllowList() instead of extracting bootstrap peer public keys. Eight existing firewall tests updated to invoke EmptyAllowList() as a function, and three new tests validate that recognized peers pass, unrecognized peers fail, and previously allowlisted peers must pass IsRecognized checks.
Well-known peers metrics observation
pkg/clientinfo/metrics.go, pkg/clientinfo/metrics_test.go, cmd/start.go
Metric constant renamed from ConnectedBootstrapCountMetricName to ConnectedWellknownPeersCountMetricName; ObserveConnectedBootstrapCount replaced with ObserveConnectedWellknownPeersCount method. Observer wired during client info initialization. Test validates the method is callable without panicking.
Remove ExtractPeersPublicKeys helper
pkg/net/libp2p/libp2p.go, pkg/net/libp2p/libp2p_test.go
Exported ExtractPeersPublicKeys function removed along with three unit tests; reflect import dropped.
Peer lists and Electrum endpoint updates
config/_peers/mainnet, config/_peers/testnet, config/_electrum_urls/testnet, config/peers_test.go, config/electrum_test.go
Mainnet peer list replaced with 20 direct IPv4/TCP multiaddrs; testnet peer list replaced with three IP-based multiaddrs on ports 3919–3921; testnet Electrum endpoint switched from WebSocket to TCP. Test vectors updated to match new configurations.
Coordination layer test result handling
pkg/tbtc/node_test.go
Mock coordination results now include window field; result consumer skips nil values; post-run assertions keyed by window.coordinationBlock instead of array position, accepting either the 3600 block action or (if absent) the 4500 moved-funds sweep due to asynchronous processing.
Go module formatting
go.mod
Blank line added after toolchain directive for consistency.

🎯 3 (Moderate) | ⏱️ ~25 minutes

A bootstrap fades to history's hall,
Well-known peers now answer the call,
The firewall stands stronger today,
With metrics that measure the way,
No more extracted keys—just peers all! 🐰

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 15.38% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: removing bootstrap nodes and replacing them with operator peers, which is the central theme across all modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/decouple-firewall-allowlist

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/flags.go (1)

205-219: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update network.peers help text to match the well-known peers migration.

The deprecation messaging here is clear, but the network.peers description still says “bootstrap nodes,” which is now misleading for operators.

Suggested text update
 	cmd.Flags().StringSliceVar(
 		&cfg.LibP2P.Peers,
 		"network.peers",
 		[]string{},
-		"Addresses of the network bootstrap nodes.",
+		"Addresses of well-known network peers.",
 	)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/flags.go` around lines 205 - 219, The help text for the "network.peers"
flag (registered via cmd.Flags().StringSliceVar and stored in cfg.LibP2P.Peers)
still refers to "bootstrap nodes"; update that description to reference the
well-known peers migration instead (e.g., "Addresses of peers to connect to
after the well-known peers migration" or similar) so operators are not misled by
the old "bootstrap nodes" wording. Ensure you only change the flag description
string passed to cmd.Flags().StringSliceVar for "network.peers".
🧹 Nitpick comments (1)
pkg/chain/local_v1/blockcounter.go (1)

18-22: ⚡ Quick win

Consider adding documentation for closeOnce field.

The closeOnce field prevents a subtle race where multiple count() iterations might attempt to close the same watcher's channel after its context is cancelled but before the removal goroutine removes it from the slice. A brief comment would help future maintainers understand this synchronization.

📝 Suggested documentation
 type watcher struct {
 	ctx       context.Context
 	channel   chan uint64
+	// closeOnce ensures channel is closed exactly once even if multiple
+	// count() iterations detect ctx is done before removal goroutine runs.
 	closeOnce sync.Once
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/chain/local_v1/blockcounter.go` around lines 18 - 22, Add a short comment
to the watcher struct explaining the purpose of the closeOnce field: it is used
to ensure the watcher's channel is closed exactly once to avoid a race where
concurrent count() iterations (or the context-cancel path) may attempt to close
the same channel before the removal goroutine removes the watcher from the
slice. Mention the relevant symbols watcher, closeOnce, count(), and the removal
goroutine so future maintainers understand the synchronization intent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@config/_electrum_urls/testnet`:
- Line 1: The testnet Electrum endpoint file currently contains a single
hard-coded entry which creates a single point-of-failure; update the testnet
list (config/_electrum_urls/testnet) to include multiple Electrum endpoints
(prefer DNS-backed hostnames where available) so
resolveElectrum/readElectrumUrls in config/electrum.go can failover; add several
geographically and provider-diverse entries (mix of tcp://host:port and
tls://host:port or DNS seeds) to provide redundancy and ensure
auto-configuration succeeds if one host is down.

In `@config/_peers/testnet`:
- Around line 1-3: The three testnet peer multiaddresses
(/ip4/143.198.69.177/tcp/3919/ipfs/16Uiu2HAmNqFJzYacvQJhTm116LmpNhTXbEMinuCXibKxTEj2W8iu,
/ip4/143.198.69.177/tcp/3920/ipfs/16Uiu2HAmNqFJzYacvQJhTm116LmpNhTXbEMinuCXibKxTEj2W8iu,
/ip4/143.198.69.177/tcp/3921/ipfs/16Uiu2HAm3yEkd3vXaCnSkxU5ViP2pDvV7fLN6V3fwPZrMYbH6PV3,
etc.) all use the same IP creating a SPOF; replace duplicate /ip4/143.198.69.177
entries with peers on distinct IPs or convert to DNS-based multiaddresses (using
/dnsaddr/host.example.com/tcp/<port>/ipfs/<peerID>) so you can update endpoints
without client changes, and ensure each peer entry references a different
host/IP and correct peer ID before committing.

In `@config/peers_test.go`:
- Around line 20-22: The test's expectedPeers slice only includes one mainnet
endpoint but the source list in config/_peers/mainnet contains 20 entries;
update the test (the expectedPeers value in peers_test.go / the test that loads
peers) to include all 20 mainnet multiaddrs from config/_peers/mainnet so the
assertion matches the actual list, or alternatively change the test to
explicitly assert the expected length (20) and/or perform a documented
spot-check rather than requiring exact equality; modify the expectedPeers
variable or the test assertions accordingly (look for expectedPeers and the test
that compares loaded peers).
- Around line 26-27: The sepolia test expectations in TestResolvePeers (the
sepolia_network case in config/peers_test.go) expect DNS-based multiaddrs that
the resolver does not produce; update the test so the expected peers for the
sepolia_network case match the resolver output (replace the two
"/dns4/keep-operator-*.test.keep-nodes.io/tcp/3920/ipfs/..." entries with the
actual resolved "/ip4/143.198.69.177/tcp/{3919,3920,3921}/ipfs/..." multiaddrs
and correct peer IDs/ports), or alternatively change the resolver/config that
builds ResolvePeers to emit the DNS multiaddrs—locate the sepolia case in the
TestResolvePeers table/expected slice and make the expected slice match the real
resolved addresses.

---

Outside diff comments:
In `@cmd/flags.go`:
- Around line 205-219: The help text for the "network.peers" flag (registered
via cmd.Flags().StringSliceVar and stored in cfg.LibP2P.Peers) still refers to
"bootstrap nodes"; update that description to reference the well-known peers
migration instead (e.g., "Addresses of peers to connect to after the well-known
peers migration" or similar) so operators are not misled by the old "bootstrap
nodes" wording. Ensure you only change the flag description string passed to
cmd.Flags().StringSliceVar for "network.peers".

---

Nitpick comments:
In `@pkg/chain/local_v1/blockcounter.go`:
- Around line 18-22: Add a short comment to the watcher struct explaining the
purpose of the closeOnce field: it is used to ensure the watcher's channel is
closed exactly once to avoid a race where concurrent count() iterations (or the
context-cancel path) may attempt to close the same channel before the removal
goroutine removes the watcher from the slice. Mention the relevant symbols
watcher, closeOnce, count(), and the removal goroutine so future maintainers
understand the synchronization intent.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 1be4a938-bded-4b31-8380-7382f768a852

📥 Commits

Reviewing files that changed from the base of the PR and between 66b187e and cec1932.

📒 Files selected for processing (18)
  • .github/workflows/client.yml
  • cmd/flags.go
  • cmd/start.go
  • cmd/start_test.go
  • config/_electrum_urls/testnet
  • config/_peers/mainnet
  • config/_peers/testnet
  • config/peers_test.go
  • go.mod
  • pkg/chain/local_v1/blockcounter.go
  • pkg/clientinfo/metrics.go
  • pkg/clientinfo/metrics_test.go
  • pkg/firewall/firewall.go
  • pkg/firewall/firewall_test.go
  • pkg/net/libp2p/libp2p.go
  • pkg/net/libp2p/libp2p_test.go
  • pkg/tbtc/node_test.go
  • pkg/tbtcpg/internal/test/marshaling.go
💤 Files with no reviewable changes (2)
  • pkg/net/libp2p/libp2p.go
  • pkg/net/libp2p/libp2p_test.go

Comment thread config/_electrum_urls/testnet Outdated
@@ -1 +1 @@
wss://electrum.testnet.boar.network:443/QxbJgaSLUHqrgAa9BW7bDpnGPxrlhnCa
tcp://134.199.227.217:50001
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid shipping only one hard-coded testnet Electrum endpoint.

With a one-entry list here, default Electrum resolution has no failover path (see config/electrum.go, readElectrumUrls / resolveElectrum, Lines 12-75). If this host is unavailable, auto-configuration fails for all fresh clients. Please provide multiple endpoints (preferably DNS-backed where possible) to remove this SPOF.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@config/_electrum_urls/testnet` at line 1, The testnet Electrum endpoint file
currently contains a single hard-coded entry which creates a single
point-of-failure; update the testnet list (config/_electrum_urls/testnet) to
include multiple Electrum endpoints (prefer DNS-backed hostnames where
available) so resolveElectrum/readElectrumUrls in config/electrum.go can
failover; add several geographically and provider-diverse entries (mix of
tcp://host:port and tls://host:port or DNS seeds) to provide redundancy and
ensure auto-configuration succeeds if one host is down.

Comment thread config/_peers/testnet Outdated
Comment on lines +1 to +3
/ip4/143.198.69.177/tcp/3919/ipfs/16Uiu2HAmNqFJzYacvQJhTm116LmpNhTXbEMinuCXibKxTEj2W8iu
/ip4/143.198.69.177/tcp/3920/ipfs/16Uiu2HAm3yEkd3vXaCnSkxU5ViP2pDvV7fLN6V3fwPZrMYbH6PV3
/ip4/143.198.69.177/tcp/3921/ipfs/16Uiu2HAm8WXhNRasZvRpt5SJxuY4B7jkkKT8PYYqw4pWbMZyYJ2s No newline at end of file
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

All testnet peers share the same IP address, creating a single point of failure.

All three peer multiaddresses point to 143.198.69.177. If this IP becomes unreachable (network partition, host failure, DDoS), the entire testnet peer discovery fails. This contradicts the high-availability relay concerns raised by reviewers (pdyraga, nkuba) and increases the SPOF risk documented in the PR commits.

Consider distributing peers across multiple IP addresses or reverting to DNS-based multiaddresses that can be updated to point to different IPs without client code changes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@config/_peers/testnet` around lines 1 - 3, The three testnet peer
multiaddresses
(/ip4/143.198.69.177/tcp/3919/ipfs/16Uiu2HAmNqFJzYacvQJhTm116LmpNhTXbEMinuCXibKxTEj2W8iu,
/ip4/143.198.69.177/tcp/3920/ipfs/16Uiu2HAmNqFJzYacvQJhTm116LmpNhTXbEMinuCXibKxTEj2W8iu,
/ip4/143.198.69.177/tcp/3921/ipfs/16Uiu2HAm3yEkd3vXaCnSkxU5ViP2pDvV7fLN6V3fwPZrMYbH6PV3,
etc.) all use the same IP creating a SPOF; replace duplicate /ip4/143.198.69.177
entries with peers on distinct IPs or convert to DNS-based multiaddresses (using
/dnsaddr/host.example.com/tcp/<port>/ipfs/<peerID>) so you can update endpoints
without client changes, and ensure each peer entry references a different
host/IP and correct peer ID before committing.

Comment thread config/peers_test.go
Comment on lines 20 to 22
expectedPeers: []string{
"/dns4/bst-a01.tbtc.boar.network/tcp/5001/ipfs/16Uiu2HAmAmCrLuUmnBgpavU8y8JBUN6jWAQ93JwydZy3ABRyY6wU",
"/dns4/bst-b01.tbtc.boar.network/tcp/5001/ipfs/16Uiu2HAm4w5HdJQxBnadGRepaiGfWVvtMzhdAGZVcrf9i71mv69V",
"/ip4/143.198.18.229/tcp/3919/ipfs/16Uiu2HAmDP4Z6LCogRMictJ6deGs4DRo99A5JTz5u3CLMg7URxC6",
}},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Compare actual mainnet peer count with test expectations

echo "=== Actual mainnet peers count ==="
cat config/_peers/mainnet | grep -v '^#' | grep -v '^$' | wc -l

echo -e "\n=== Test expects this many mainnet peers ==="
rg -A 3 '"mainnet network"' config/peers_test.go | rg '/ip4/' | wc -l

echo -e "\n=== All mainnet peers ==="
cat config/_peers/mainnet

Repository: threshold-network/keep-core

Length of output: 1982


Align mainnet peer test expectations with the full mainnet list

config/_peers/mainnet contains 20 peer endpoints, but config/peers_test.go currently expects only 1 mainnet peer (/ip4/143.198.18.229/tcp/3919/ipfs/...). If the PR intent is to replace mainnet peers with 20 endpoints, update the test to cover all 20 (or adjust the PR/logic to explicitly document that this test is only a spot-check).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@config/peers_test.go` around lines 20 - 22, The test's expectedPeers slice
only includes one mainnet endpoint but the source list in config/_peers/mainnet
contains 20 entries; update the test (the expectedPeers value in peers_test.go /
the test that loads peers) to include all 20 mainnet multiaddrs from
config/_peers/mainnet so the assertion matches the actual list, or alternatively
change the test to explicitly assert the expected length (20) and/or perform a
documented spot-check rather than requiring exact equality; modify the
expectedPeers variable or the test assertions accordingly (look for
expectedPeers and the test that compares loaded peers).

Comment thread config/peers_test.go Outdated
The testnet peer list and its test referenced stale identities: the embedded peers carried re-keyed peer IDs, and the test expected DNS names (keep-operator-N.test.keep-nodes.io) that no longer resolve. Refresh to the live testnet4 operator cohort on 143.198.69.177 (ports 3919-3921), peer IDs verified against each node's diagnostics endpoint, and align peers_test.go to match. Host and ports are unchanged; only the peer IDs are corrected.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
config/peers_test.go (1)

26-27: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Include the full sepolia peer set in expectedPeers.

Given the PR objective says testnet peers were refreshed to ports 3919–3921, the sepolia network expectations on Line 26–Line 27 validate only two peers. This weakens the regression check for peer resolution completeness.

Suggested update
 		"sepolia network": {
 			network: network.Testnet,
 			expectedPeers: []string{
 				"/ip4/143.198.69.177/tcp/3919/ipfs/16Uiu2HAkvjus5MH3y2tJBC6Bt1Ff9tiSowxGCw8J4FzLonnfDeG2",
 				"/ip4/143.198.69.177/tcp/3920/ipfs/16Uiu2HAmSBn6CgZ4r7HnC4RVMMFMe5vfkLvykUUfS3MnKiHLSuPD",
+				"/ip4/143.198.69.177/tcp/3921/ipfs/<peer-id-for-3921>",
 			},
 		},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@config/peers_test.go` around lines 26 - 27, The sepolia expectedPeers slice
in peers_test.go only contains two peer entries (ports 3919 and 3920) but the
testnet peers were refreshed to include ports 3919–3921; update the
expectedPeers value used in the Test (the expectedPeers variable/fixture in
peers_test.go) to include the full sepolia peer set (add the missing peer for
port 3921 with the corresponding /ipfs/ multiaddr) so the assertion validates
all three peers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@config/peers_test.go`:
- Around line 26-27: The sepolia expectedPeers slice in peers_test.go only
contains two peer entries (ports 3919 and 3920) but the testnet peers were
refreshed to include ports 3919–3921; update the expectedPeers value used in the
Test (the expectedPeers variable/fixture in peers_test.go) to include the full
sepolia peer set (add the missing peer for port 3921 with the corresponding
/ipfs/ multiaddr) so the assertion validates all three peers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 7fb4666b-3874-4b6e-84c3-8e12676aad1d

📥 Commits

Reviewing files that changed from the base of the PR and between cec1932 and 5ea17b7.

📒 Files selected for processing (2)
  • config/_peers/testnet
  • config/peers_test.go
✅ Files skipped from review due to trivial changes (1)
  • config/_peers/testnet

…oint

The testnet Electrum URL was switched to the self-hosted testnet4 ElectrumX endpoint (tcp://134.199.227.217:50001), but electrum_test.go still expected the legacy Boar endpoint. Update the testnet expectation to match the embedded URL file; the endpoint was verified live (ElectrumX 1.19.0, protocol 1.4) and is actively used by the testnet4 operator cohort. Mainnet electrum is unchanged.
Resolve conflicts from concurrent firewall and CI changes on main:

- pkg/firewall/firewall_test.go: combine main's removal of the positive
  result cache (positive recognition is now rechecked on each Validate
  call) with this branch's EmptyAllowList()-as-function refactor. Drop
  the positiveResultCache field from all test policies, adopt the renamed
  TestValidate_PeerRecognized_Rechecked assertion, and keep the new
  EmptyAllowList decoupling tests, all calling EmptyAllowList().
- pkg/tbtcpg/internal/test/marshaling.go: take main's errors.New fix for
  the non-constant format string, superseding this branch's fmt.Errorf("%s").
- .github/workflows/client.yml: take main's Go test step, which already
  runs ./... and adds coverage reporting, superseding this branch's
  standalone ./... addition.
The testnet Electrum URL had been pointed at a self-hosted server running
Bitcoin testnet4, while keep-core's testnet stack and the embedded
config-validation integration tests target testnet3. A testnet4 server
cannot serve the testnet3 transactions, blocks, and Merkle proofs the
integration tests expect, so the electrum integration suite failed with
'not found' and 'out of range' errors.

Restore the testnet3 Electrum endpoint, which is live, on testnet3, and at
the chain tip the tests expect, and align the resolveElectrum expectation.
@lrsaturnino
Copy link
Copy Markdown
Member Author

lrsaturnino commented May 30, 2026

Thanks @pdyraga, @nkuba — I take this seriously and agree it shouldn't be a hard cutover. Let me be precise about scope, share what's been validated so far, and lay out how I'd roll it out.

Scope. This retires the dedicated bst-*.tbtc.boar.network bootstrap hosts and leans on the staying-operator nodes for discovery. Boar's operator node stays a well-known peer; what's being removed is the dedicated, always-on, non-tECDSA bootstrap role.

The tradeoff — it's the one you're flagging: the bootstrap hosts are HA and don't run tECDSA, so they never restart under signing load and give a stable re-entry point; the operator peers that now carry discovery do run tECDSA. We're trading a dedicated non-tECDSA anchor for redundancy across the ~20 operator peers. That's a real change, which is why I'd stage and monitor it rather than cut over.

Validation completed so far (controlled environment — Bitcoin testnet4 + Sepolia L1, 3-operator cohort, client v2.5.3-rc.1 @ ac7e62e):

Scenario Result
Zero-bootstrap cold start, rolling restart, partial outage (S-1–S-3) PASS — mesh converges and recovers to full peering with no bootstrap node present
24h stability (17 ticks) PASS — peer count stable, 0 panic/fatal, flat CPU ~40% / mem ~272 MB, no version drift
DKG (S-5a) PASS on-chain — 2 wallets, M8 ≤5 (2 and 1 attempts); wallet #1 operator-log capture was the one PARTIAL sub-gate, covered by wallet #2
Signing + sweep + SPV + mint (S-5b) PASS — full deposit→sweep→Sepolia mint, incl. strict e2e-deposit.ts + on-chain TBTC mint receipt
DKG + signing under CPU stress (S-5c) PASS — stress-ng ~70% + a 5h run; DKG no retry elevation; Redemption + DepositSweep signing both complete with SPV proofs
Stressed vs unloaded (S-5d) DKG retries Δ0; signing +37s (Redemption) / +50s (DepositSweep) — no material degradation
Deprecation warning (S-4) & rollback rehearsal PASS — --network.bootstrap warns; image swap + rollback recovers full mesh in ~10s

These were validated on a Sepolia dev deployment (not the canonical Threshold Sepolia contracts) plus Bitcoin testnet4 — reproducible below:

Artifact Deployment Block Reference (tx / event)
DKG WalletCreated (3 wallets) Sepolia devWalletRegistry 0x872d27463e7FE6aC843d7F7AE5f401BC5547c51B 10833304 / 10840352 / 10846422 on-chain events
Deposit reveal Sepolia devBridge 0xd48ae449332a75bc8a28678345d82379bf767656 10839195 0x84c2c749dcc20104a18579c997dcab3442241ca78e4beeb591865b80bb8668a5
Mint settlement (TBTC Transfer(0x0→…)) Sepolia devBridge 0xd48ae449332a75bc8a28678345d82379bf767656 10850702 0x24c9228b507fccd918766cb3b41df34a4a5db5530b3869244c6e7a4749c6bc15
BTC sweep Bitcoin testnet4 4fd7f08a57b7bede9c7f385802262aee9bcbfa3ee6bfe64bb1ad5ac462eb9b0b (mempool.space/testnet4)

Scope: 3-operator, group-size-3 cohort — it validates functional correctness and behavior under CPU stress at small scale; it does not reproduce mainnet TSS message volume (~100-of-N groups). A few items remain open (a re-staking rerun, refreshing the stale Sepolia verification pointer, and restoring SPV diagnostics coverage). That gap is exactly why the mainnet step is staged and monitored rather than released directly.

Rollout:

  • Staged, reversible: canary on a couple of operator nodes first, then widen in waves. I'd keep the bst-* bootstrap hosts running until the canary proves operator-peer discovery holds — so if discovery or message delivery degrades, we re-add the bootstrap entries and roll back before going wider. The hosts are decommissioned only once the wider rollout is stable. No single fleet-wide cutover.
  • Message-delivery monitoring: you're right that floodsub drops messages when a peer's outbound queue backs up under tECDSA CPU. The current monitoring watches connection counts, not delivery — I'll add a message-drop/round-volume metric so we have production signal on exactly that before any bootstrap goes offline.
  • Relay: reading the client, the only topic.Relay() forwarder is the Random Beacon's (opt-in, off by default), and --network.bootstrap mode skips beacon/tBTC init so it subscribes to no application topic — so in that mode the bst-* hosts wouldn't relay tBTC traffic at the app layer. You'd know exactly how they're deployed, so correct me if they run as full subscribing nodes.
  • Longer term I agree the deeper answer is the network-layer change you mentioned (we're on floodsub, not gossipsub); I'd scope that separately.

@lrsaturnino
Copy link
Copy Markdown
Member Author

lrsaturnino commented May 30, 2026

Thanks @nkuba, and agreed on the security improvements. On testnet — you're right the public testnet is effectively gone; the validation above ran on a privately-operated testnet4 cohort against Sepolia. We won't overstate it (group size 3, three operator nodes run by one party, so it doesn't reproduce mainnet TSS volume) — which is exactly why we're leaning on the staged mainnet rollout with production monitoring as the real validation vehicle rather than presenting testnet as proof of mainnet safety.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants