direct: add resilience against eventual consistency + fix tests#5694
Open
denik wants to merge 5 commits into
Open
direct: add resilience against eventual consistency + fix tests#5694denik wants to merge 5 commits into
denik wants to merge 5 commits into
Conversation
…ngine The testserver now returns 404 on the first dashboard GET after a create (eventual-consistency token), and the direct engine retries reads on 404 when it knows the resource should exist (has an ID on record). Co-authored-by: Isaac
Contributor
Approval status: pending
|
Collaborator
Integration test reportCommit: f553647
27 interesting tests: 13 SKIP, 6 RECOVERED, 5 flaky, 3 FAIL
Top 9 slowest tests (at least 2 minutes):
|
These just delegated to DoRead with no readiness polling. The post-create eventual-consistency read is already handled by refreshRemoteState, which retries on 404 via retryOnTransientOrMissing. Co-authored-by: Isaac
The matrix DATABRICKS_BUNDLE_ENGINE value is only set on the CLI subprocess env, so reading it via env.Get(t.Context()) in PrepareServerAndClient returned "" and the EC token was never selected -- the simulation was dead in tests. Thread the per-variant env into PrepareServerAndClient and gate EC on an explicit TESTS_STALE_ONCE=1 (direct engine only). Enable it for the dashboards tests and the no_drift invariant; migrate/continue_293 invoke terraform or the old CLI which do not retry, so they are left out. With EC genuinely on, WaitAfterCreate is required again to consume the post-create stale inside deploy; a 404 retry is expected and logged at debug (not warn). Retry interval is set to 1ms for acceptance to avoid 15s sleeps. Co-authored-by: Isaac
It was committed as 100644, so on CI (which has no local +x bit) the script failed with "Permission denied" and the etag replacement never registered. Co-authored-by: Isaac
The retry interval was set globally, which would also apply on cloud where the real propagation delay needs the real interval. Inject it in the runner only when the testserver simulates eventual consistency (StaleOnceEnabled) and the run is local, leaving cloud and non-EC tests on the default interval. Co-authored-by: Isaac
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Changes
Why
We've seen the dashboard API being eventually consistent which causes cloud tests to fail.
Tests