
Scope schema discovery to target host's cluster in multi-cluster CHI #1965

Open
lukas-pfannschmidt-tr wants to merge 1 commit into Altinity:0.27.0 from lukas-pfannschmidt-tr:fix/schema-discovery-cluster-scope


lukas-pfannschmidt-tr commented Apr 22, 2026

Summary

Fixes #1964.

HostCreateTables used api.ClickHouseInstallation{} as the scope for Names(NameFQDNs, ...) when building the endpoint list passed to QueryUnzip2Columns / QueryAny. In a CHI that defines multiple clusters, this walked hosts from every cluster and allowed QueryAny to pick a source node outside the target cluster.

The SQL does filter by cluster name via clusterAllReplicas('<target>', system.tables), but sqlCreateTableReplicated joins against the executing node's local system.databases:

FROM clusterAllReplicas('<cluster>', system.tables) tables
LOCAL JOIN system.databases databases on (databases.name = tables.database)
WHERE database NOT IN (...) AND databases.engine IN ('Ordinary','Atomic','Memory','Lazy')

So the returned set of CREATE statements depends on which cluster's node answered first — resulting in missing or incorrect schemas on newly added replicas in multi-cluster CHIs.
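The executing-node dependency can be illustrated with a small model. This is only a sketch: the `table` and `node` types and the `discoverTables` function are hypothetical stand-ins invented for illustration, not operator or ClickHouse code. It assumes the join keeps a row only when the executing node's local database list contains the table's database.

```go
package main

import "fmt"

// table and node are hypothetical stand-ins, not operator or ClickHouse types.
type table struct {
	database string
	name     string
}

type node struct {
	fqdn           string
	localDatabases map[string]bool // models this node's local system.databases
}

// discoverTables models the query shape: rows come from the target cluster's
// system.tables, but a row survives only if the executing node knows its database.
func discoverTables(executing node, targetClusterTables []table) []string {
	var found []string
	for _, t := range targetClusterTables {
		if executing.localDatabases[t.database] {
			found = append(found, t.database+"."+t.name)
		}
	}
	return found
}

func main() {
	// Tables that exist in the target cluster.
	targetTables := []table{{"events", "clicks"}, {"events", "views"}}

	// One node from the target cluster, one from a different cluster of the same CHI.
	sameCluster := node{"host-a", map[string]bool{"events": true}}
	otherCluster := node{"host-b", map[string]bool{"metrics": true}}

	fmt.Println(discoverTables(sameCluster, targetTables))
	fmt.Println(discoverTables(otherCluster, targetTables))
}
```

In this model the same-cluster node reports both tables, while the node from the other cluster reports none, because its local database list does not contain `events`.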

Timeline (for context)

  • The CHI-wide scoping in schema discovery dates back to CreatePodFQDNsOfCHI(host.GetCHI()) pre-2021 and was preserved through the 2021 unification (6b946799d) and the 2024 schemer refactor (b64a6241d).
  • The LOCAL JOIN system.databases form landed in d49187d0b (first in release-0.23.6), which made the executing-node dependency stricter.
  • The 0.26 "rework k8s DNS" commit 6d625de69 added a trailing dot to patternNamespaceDomain (%s.svc.cluster.local.). That fixed slow or failing DNS resolution under ndots:5, but it also removed an accidental failure mode that had been masking the scoping bug. Before 0.26, cross-cluster endpoints in the CHI-wide list could fail DNS quickly, and QueryAny would fall through to a same-cluster endpoint. After 0.26, every CHI endpoint resolves reliably, so QueryAny returns the result from whichever endpoint is first in the slice, which with CHI-wide scoping can be a node from a different cluster.

The 0.26 DNS change didn't cause this bug — it exposed it. The root cause is the schema-discovery endpoint scope.
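The masking effect described in the timeline can be sketched as follows. This is a simplified model, not the operator's actual QueryAny: the `queryAny` and `failingFor` functions and the endpoint names are hypothetical, and the sketch assumes only what the description states, namely that endpoints are tried in slice order and the first one that answers wins.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// queryAny is a simplified, hypothetical model: try endpoints in order and
// return the first one for which run succeeds.
func queryAny(endpoints []string, run func(endpoint string) error) (string, error) {
	for _, ep := range endpoints {
		if err := run(ep); err == nil {
			return ep, nil
		}
	}
	return "", errors.New("no endpoint answered")
}

// failingFor returns a probe that fails for endpoints starting with prefix.
// An empty prefix means every endpoint answers.
func failingFor(prefix string) func(string) error {
	return func(endpoint string) error {
		if prefix != "" && strings.HasPrefix(endpoint, prefix) {
			return errors.New("lookup failed: " + endpoint)
		}
		return nil
	}
}

func main() {
	// CHI-wide list: a cluster "b" endpoint happens to come first.
	endpoints := []string{"chi-demo-b-0-0", "chi-demo-a-0-0"}

	// Pre-0.26 model: cross-cluster lookups fail fast, so the call falls through.
	before, _ := queryAny(endpoints, failingFor("chi-demo-b"))

	// Post-0.26 model: every endpoint resolves, so the first one answers.
	after, _ := queryAny(endpoints, failingFor(""))

	fmt.Println("before:", before)
	fmt.Println("after:", after)
}
```

With identical input, the model picks the same-cluster endpoint when cross-cluster lookups fail and the cross-cluster endpoint once all lookups succeed, which is why the endpoint list itself has to be scoped.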

Change

Use api.Cluster{} scope in getReplicatedObjectsSQLs and getDistributedObjectsSQLs, so schema-discovery endpoints are restricted to the target host's own cluster. This matches the scoping already used by shouldCreateReplicatedObjects / shouldCreateDistributedObjects for the related gating logic.

Six call sites updated:

  • pkg/model/chi/schemer/replicated.go (databases, tables, functions)
  • pkg/model/chi/schemer/distributed.go (databases, tables, functions)
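The effect of narrowing the scope can be sketched with a toy host list. The `host` type and the `chiScopedFQDNs` / `clusterScopedFQDNs` helpers below are hypothetical and only mimic the idea of selecting FQDNs by scope; they are not the operator's Names(NameFQDNs, ...) implementation.

```go
package main

import "fmt"

// host is a hypothetical stand-in for a CHI host entry.
type host struct {
	fqdn    string
	cluster string
}

// chiScopedFQDNs models the old behaviour: every host in the installation.
func chiScopedFQDNs(hosts []host) []string {
	var out []string
	for _, h := range hosts {
		out = append(out, h.fqdn)
	}
	return out
}

// clusterScopedFQDNs models the new behaviour: only hosts of the given cluster.
func clusterScopedFQDNs(hosts []host, cluster string) []string {
	var out []string
	for _, h := range hosts {
		if h.cluster == cluster {
			out = append(out, h.fqdn)
		}
	}
	return out
}

func main() {
	// Illustrative two-cluster installation; names are made up.
	hosts := []host{
		{"chi-demo-a-0-0", "a"},
		{"chi-demo-b-0-0", "b"},
		{"chi-demo-a-0-1", "a"},
	}

	fmt.Println("CHI scope:    ", chiScopedFQDNs(hosts))
	fmt.Println("cluster scope:", clusterScopedFQDNs(hosts, "a"))
}
```

When every host belongs to the same cluster, both helpers return the same list, so a single-cluster installation sees no change in this model.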

Test plan

  • go build ./...
  • Manual verification on a multi-cluster CHI: scale one cluster and confirm new replicas receive the correct schemas sourced from the same cluster.

Single-cluster CHIs are unaffected: cluster-scoped FQDNs equal CHI-scoped FQDNs in that case.

HostCreateTables used api.ClickHouseInstallation{} as the scope for
Names(NameFQDNs, ...) when building the endpoint list passed to
QueryUnzip2Columns/QueryAny. In a CHI that defines multiple clusters,
this walked hosts from every cluster and allowed QueryAny to pick a
source node outside the target cluster. The SQL filters by cluster name
via clusterAllReplicas, but sqlCreateTableReplicated joins against the
executing node's local system.databases, so the returned set of CREATE
statements depends on which cluster's node answered first — leading to
missing or wrong schemas on newly added replicas.

Use api.Cluster{} scope in getReplicatedObjectsSQLs and
getDistributedObjectsSQLs so schema-discovery endpoints are restricted
to the target host's own cluster, matching the scoping already used by
shouldCreateReplicatedObjects / shouldCreateDistributedObjects.

Fixes Altinity#1964

Signed-off-by: Lukas Pfannschmidt <lukas.pfannschmidt@traderepublic.com>

Labels

planned for review This feature is planned for review
