
direct: add resilience against eventual consistency + fix tests#5694

Open
denik wants to merge 5 commits into main from denik/eventual-consistency

denik commented Jun 23, 2026

Contributor

Changes

  • Add a deterministic eventual-consistency simulation to the testserver's dashboard backend: the first GET always returns a stale response, and later GETs return the correct one.
  • Update the direct engine to retry on 404 when we know the resource should exist (e.g. after a create or update).
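
As a rough illustration of the "stale once" behaviour described above, here is a minimal Python sketch of an in-memory store whose first GET after each write returns the previous view. This is not the testserver code (which is not shown in this PR page); the class and method names are hypothetical, and the choice to return the previous value after an update (rather than a 404) is an assumption.

```python
class StaleOnceStore:
    """Toy store: the first GET after each write sees the pre-write state."""

    def __init__(self):
        self._current = {}   # key -> latest written value
        self._previous = {}  # key -> value before the latest write (None if absent)
        self._stale = set()  # keys whose next GET must return the stale view

    def put(self, key, value):
        # Remember what a stale reader would still see, then apply the write.
        self._previous[key] = self._current.get(key)
        self._current[key] = value
        self._stale.add(key)

    def get(self, key):
        """Return a (status, body) pair."""
        if key in self._stale:
            # Deterministic: exactly one stale response per write.
            self._stale.discard(key)
            old = self._previous.get(key)
            return (404, None) if old is None else (200, old)
        if key not in self._current:
            return (404, None)
        return (200, self._current[key])
```

Because the staleness is consumed by exactly one read, a test that uses such a store fails the same way on every run instead of flaking.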

Why

We've seen the dashboard API behave in an eventually consistent way, which causes cloud tests to fail.

Tests

  • Update tests to avoid reading stale values (e.g. parse the output of the PUT instead of doing a follow-up GET).
  • In some cases, retry the GET request if we can see the response is stale (it still carries the old etag value).
  • A new script, retry.py, retries based on a substring in the response.
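
The substring-based retry can be pictured roughly as follows. This is a hedged sketch, not the contents of retry.py: the function name, its parameters, and the assumption that the substring marks a stale response (for example the old etag) are all hypothetical.

```python
import time


def retry_while_contains(fetch, stale_marker, attempts=10, interval=0.0, sleep=time.sleep):
    """Call fetch() until its text no longer contains stale_marker.

    Returns the last response body, which may still be stale if every
    attempt was used up.
    """
    body = fetch()
    for _ in range(attempts - 1):
        if stale_marker not in body:
            break
        sleep(interval)
        body = fetch()
    return body
```

A test would pass the old etag as `stale_marker`, so the loop ends as soon as the backend serves the updated resource.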

…ngine

The testserver now returns 404 on the first dashboard GET after a create
(eventual-consistency token), and the direct engine retries reads on 404
when it knows the resource should exist (has an ID on record).

Co-authored-by: Isaac
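
The retry rule in this commit message (retry a 404 only when an ID is already on record) might look like the sketch below. It is written in Python for illustration only; `NotFound`, `read_with_retry`, and its parameters are hypothetical names, not the direct engine's API.

```python
import time


class NotFound(Exception):
    """Stands in for an HTTP 404 from the backend."""


def read_with_retry(read, resource_id, expect_exists, attempts=5, interval=0.0, sleep=time.sleep):
    """Read a resource, retrying on NotFound only if it should exist."""
    for attempt in range(attempts):
        try:
            return read(resource_id)
        except NotFound:
            # A 404 for a resource we never created is a real answer, so
            # propagate it. Also give up once the attempts are exhausted.
            if not expect_exists or attempt == attempts - 1:
                raise
            sleep(interval)
```

Gating on `expect_exists` keeps a genuinely missing resource from costing the full retry budget.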

github-actions Bot commented Jun 23, 2026

Contributor

Approval status: pending

/acceptance/bundle/ - needs approval

4 files changed
Suggested: @pietern
Also eligible: @janniklasrose, @shreyas-goenka, @andrewnester, @anton-107, @lennartkats-db

/bundle/ - needs approval

5 files changed
Suggested: @pietern
Also eligible: @janniklasrose, @shreyas-goenka, @andrewnester, @anton-107, @lennartkats-db

General files (require maintainer)

7 files changed
Based on git history:

  • @pietern -- recent work in libs/testserver/, bundle/direct/, bundle/direct/dresources/

Any maintainer (@andrewnester, @anton-107, @pietern, @shreyas-goenka, @simonfaltum, @renaudhartert-db) can approve all areas.
See OWNERS for ownership rules.

@denik denik temporarily deployed to test-trigger-is June 23, 2026 19:02 — with GitHub Actions Inactive

eng-dev-ecosystem-bot commented Jun 23, 2026

Collaborator

Integration test report

Commit: f553647

Run: 28090280059

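| Status | Env | ❌ FAIL | 🔄 flaky | 💚 RECOVERED | 🙈 SKIP | ✅ pass | 🙈 skip | Time |
|---|---|---|---|---|---|---|---|---|
| 💚 | aws linux | | | 7 | 13 | 244 | 1024 | 4:43 |
| 💚 | aws windows | | | 7 | 13 | 246 | 1022 | 5:23 |
| 💚 | aws-ucws linux | | | 7 | 13 | 334 | 940 | 5:16 |
| 💚 | aws-ucws windows | | | 7 | 13 | 336 | 938 | 5:46 |
| 🔄 | azure linux | | 3 | | 15 | 245 | 1022 | 11:00 |
| ❌ | azure windows | 2 | 4 | | 15 | 244 | 1020 | 11:38 |
| 🔄 | azure-ucws linux | | 3 | | 15 | 337 | 936 | 6:43 |
| ❌ | azure-ucws windows | 3 | | 1 | 15 | 338 | 934 | 8:33 |
| 💚 | gcp linux | | | 1 | 15 | 246 | 1024 | 4:12 |
| 💚 | gcp windows | | | 1 | 15 | 248 | 1022 | 4:56 |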
27 interesting tests: 13 SKIP, 6 RECOVERED, 5 flaky, 3 FAIL
| Status | Test Name | aws linux | aws windows | aws-ucws linux | aws-ucws windows | azure linux | azure windows | azure-ucws linux | azure-ucws windows | gcp linux | gcp windows |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 🔄 | TestAccept | 💚R | 💚R | 💚R | 💚R | 🔄f | 🔄f | 🔄f | 💚R | 💚R | 💚R |
| 🔄 | TestAccept/bundle/deployment/bind/alert | 🙈s | 🙈s | 🙈s | 🙈s | ✅p | 🔄f | ✅p | ✅p | ✅p | ✅p |
| 🔄 | TestAccept/bundle/deployment/bind/alert/DATABRICKS_BUNDLE_ENGINE=terraform | | | | | ✅p | 🔄f | ✅p | ✅p | ✅p | ✅p |
| 🔄 | TestAccept/bundle/generate/alert | ✅p | ✅p | ✅p | ✅p | 🔄f | ✅p | 🔄f | ✅p | ✅p | ✅p |
| 🔄 | TestAccept/bundle/generate/alert/DATABRICKS_BUNDLE_ENGINE=direct | ✅p | ✅p | ✅p | ✅p | 🔄f | ✅p | 🔄f | ✅p | ✅p | ✅p |
| 🙈 | TestAccept/bundle/invariant/no_drift | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 | TestAccept/bundle/resources/permissions | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 💚 | TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions | 💚R | 💚R | 💚R | 💚R | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 💚 | TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 💚R | 💚R | 💚R | 💚R | | | | | | |
| 💚 | TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 💚R | 💚R | 💚R | 💚R | | | | | | |
| 💚 | TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions | 💚R | 💚R | 💚R | 💚R | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 💚 | TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 💚R | 💚R | 💚R | 💚R | | | | | | |
| 💚 | TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 💚R | 💚R | 💚R | 💚R | | | | | | |
| 🙈 | TestAccept/bundle/resources/postgres_branches/basic | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 | TestAccept/bundle/resources/postgres_branches/recreate | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 | TestAccept/bundle/resources/postgres_branches/replace_existing | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 | TestAccept/bundle/resources/postgres_branches/update_protected | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 | TestAccept/bundle/resources/postgres_branches/without_branch_id | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 | TestAccept/bundle/resources/postgres_endpoints/basic | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 | TestAccept/bundle/resources/postgres_projects/update_display_name | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 | TestAccept/bundle/resources/synced_database_tables/basic | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 | TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 | TestAccept/bundle/resources/vector_search_indexes/recreate/embedding_dimension | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 | TestAccept/ssh/connection | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| ❌ | TestFetchRepositoryInfoAPI_FromRepo | ✅p | ✅p | ✅p | ✅p | ✅p | ❌F | ✅p | ❌F | ✅p | ✅p |
| ❌ | TestFetchRepositoryInfoAPI_FromRepo/root | ✅p | ✅p | ✅p | ✅p | ✅p | ❌F | ✅p | ❌F | ✅p | ✅p |
| ❌ | TestFetchRepositoryInfoAPI_FromRepo/subdir | ✅p | ✅p | ✅p | ✅p | ✅p | 🔄f | ✅p | ❌F | ✅p | ✅p |
Top 9 slowest tests (at least 2 minutes):
| Duration | Env | Test name |
|---|---|---|
| 3:23 | aws-ucws windows | TestAccept |
| 3:20 | aws windows | TestAccept |
| 3:15 | gcp windows | TestAccept |
| 3:08 | azure-ucws windows | TestAccept |
| 2:56 | azure windows | TestAccept/bundle/generate/auto-bind/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:35 | azure windows | TestAccept/bundle/deployment/bind/job/generate-and-bind/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:04 | azure windows | TestAccept/bundle/deploy/mlops-stacks/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:03 | azure windows | TestAccept/bundle/deployment/bind/job/generate-and-bind/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:03 | azure linux | TestAccept/bundle/deployment/bind/job/generate-and-bind/DATABRICKS_BUNDLE_ENGINE=direct |

These just delegated to DoRead with no readiness polling. The post-create
eventual-consistency read is already handled by refreshRemoteState, which
retries on 404 via retryOnTransientOrMissing.

Co-authored-by: Isaac
@denik denik temporarily deployed to test-trigger-is June 24, 2026 04:46 — with GitHub Actions Inactive
The matrix DATABRICKS_BUNDLE_ENGINE value is only set on the CLI subprocess
env, so reading it via env.Get(t.Context()) in PrepareServerAndClient returned
"" and the EC token was never selected -- the simulation was dead in tests.

Thread the per-variant env into PrepareServerAndClient and gate EC on an
explicit TESTS_STALE_ONCE=1 (direct engine only). Enable it for the dashboards
tests and the no_drift invariant; migrate/continue_293 invoke terraform or the
old CLI which do not retry, so they are left out.

With EC genuinely on, WaitAfterCreate is required again to consume the
post-create stale inside deploy; a 404 retry is expected and logged at debug
(not warn). Retry interval is set to 1ms for acceptance to avoid 15s sleeps.

Co-authored-by: Isaac
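
The gating described in this commit can be read as a predicate over the per-variant environment, that is, the variables handed to the CLI subprocess rather than those of the test process itself. The Python sketch below only illustrates that idea; the function name and the dict-based interface are hypothetical, and only the two variable names come from the commit message.

```python
def stale_once_enabled(variant_env):
    """Decide whether to simulate eventual consistency for one test variant.

    variant_env is the environment passed to the CLI subprocess. Looking up
    DATABRICKS_BUNDLE_ENGINE in the test process's own environment would
    find nothing, which is the failure mode this commit describes.
    """
    return (
        variant_env.get("TESTS_STALE_ONCE") == "1"
        and variant_env.get("DATABRICKS_BUNDLE_ENGINE") == "direct"
    )
```

Requiring an explicit opt-in keeps variants that run terraform or an older CLI, neither of which retries, away from the simulated stale reads.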
@denik denik temporarily deployed to test-trigger-is June 24, 2026 05:21 — with GitHub Actions Inactive
It was committed as 100644, so on CI (which has no local +x bit) the script
failed with "Permission denied" and the etag replacement never registered.

Co-authored-by: Isaac
@denik denik temporarily deployed to test-trigger-is June 24, 2026 07:55 — with GitHub Actions Inactive
The retry interval was set globally, which would also apply on cloud where the
real propagation delay needs the real interval. Inject it in the runner only
when the testserver simulates eventual consistency (StaleOnceEnabled) and the
run is local, leaving cloud and non-EC tests on the default interval.

Co-authored-by: Isaac
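
A small sketch of the selection rule in this commit: the short retry interval is injected only for local runs where the testserver simulates eventual consistency. The function and the environment variable name `HYPOTHETICAL_RETRY_INTERVAL` are placeholders, since the PR page does not show the real setting; the `1ms` value comes from the earlier commit message.

```python
def retry_interval_overrides(stale_once_enabled, is_local):
    """Return extra env vars for the CLI subprocess, or none at all.

    An empty dict means the default interval stays in effect, which is what
    cloud runs and non-EC tests need.
    """
    if stale_once_enabled and is_local:
        # Placeholder variable name; 1ms avoids long sleeps in acceptance tests.
        return {"HYPOTHETICAL_RETRY_INTERVAL": "1ms"}
    return {}
```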
@denik denik temporarily deployed to test-trigger-is June 24, 2026 09:53 — with GitHub Actions Inactive