branch-4.1: (cloud) Hold table write lock across first-time dynamic partition setup to prevent CREATE MV race #62755 #62863
Open
github-actions[bot] wants to merge 1 commit into branch-4.1 from
…up to prevent CREATE MV race (#62755)

In cloud mode, `InternalCatalog.createTable` releases `db.writeLock` right after writing the `OP_CREATE_TABLE` edit log, and only then invokes `DynamicPartitionScheduler.executeDynamicPartitionFirstTime` to create the first batch of dynamic partitions for the new table. There is no lock guarding the gap between these two steps, which opens a race window:

    Thread A (CREATE TABLE with dynamic_partition)
      -> db.writeLock
      -> db.createTableWithoutLock()       # writes OP_CREATE_TABLE (idToPartition is empty)
      -> db.writeUnlock                    # <- race window opens here
      -> executeDynamicPartitionFirstTime
           -> for each partition: addPartition()
                -> olapTable.readLock
                -> checkNormalStateForAlter()

    Thread B (CREATE MATERIALIZED VIEW from another client, concurrent)
      -> olapTable.writeLockOrDdlException # succeeds, A's db lock does not block this
      -> checkNormalStateForAlter          # passes, state is still NORMAL
      -> for (Partition p : olapTable.getPartitions())  # snapshots leader's half-built state
           mvJob.addMVIndex(partitionId, ...)
      -> olapTable.setState(ROLLUP)
      -> logAlterJob(OP_ALTER_JOB_V2)      # journals a rollup that references partitions
                                           # which have not (and never will) appear as
                                           # OP_ADD_PARTITION entries in the journal

    Thread A resumes on the next addPartition
      -> checkNormalStateForAlter throws "state(ROLLUP) not NORMAL"
      -> CREATE TABLE returns ERR, but OP_CREATE_TABLE and OP_ALTER_JOB_V2 are
         already durably on disk, leaving a permanent inconsistency in the journal.

In cloud mode, have two clients fire the following two statements against the same table within the same second:

    CREATE TABLE IF NOT EXISTS t ( ... )
    PARTITION BY RANGE(ts) ()
    PROPERTIES (
        "dynamic_partition.enable" = "true",
        "dynamic_partition.time_unit" = "DAY",
        "dynamic_partition.start" = "-7",
        "dynamic_partition.end" = "1",
        "dynamic_partition.create_history_partition" = "true"
    );

    CREATE MATERIALIZED VIEW mv AS SELECT ... FROM t GROUP BY ...;

The new regression test `test_create_table_and_create_mv_race.groovy` uses the debug point `FE.createOlapTable.beforeFirstTimeDynamicPartition` (param `sleepMs`) to widen the race window and reproduce it deterministically.

Once the bad journal entry is persisted, any FE replaying it hits:

    NullPointerException: Cannot invoke DataProperty.getStorageMedium()
      because the return value of PartitionInfo.getDataProperty(long) is null
        at RollupJobV2.addTabletToInvertedIndex(RollupJobV2.java:762)
        at RollupJobV2.replayCreateJob(RollupJobV2.java:745)
        at EditLog.loadJournal(EditLog.java:939)

`EditLog.loadJournal:1448` calls `System.exit(-1)`, so the FE JVM exits immediately. Consequences observed in production:

- All followers that replicate the bad entry crash on replay; supervisor restarts them and they crash again on the same entry, entering a …

Extend the lifetime of `olapTable.writeLock` inside `InternalCatalog.createTable`:

1. After `OP_CREATE_TABLE` has been written (i.e. `result.second == false`, the table was newly registered), acquire `olapTable.writeLock()` before releasing `db.writeLock`.
2. Wrap everything that used to run after `db.writeUnlock` (colocate persist, `executeDynamicPartitionFirstTime`, `registerOrRemoveDynamicPartitionTable`, `createOrUpdateRuntimeInfo`) in a new try/finally and release the table lock in the finally according to the `holdTableLock` flag.

With this, Thread A holds the table write lock across the whole first-time dynamic partition setup. Any concurrent CREATE MV / SCHEMA CHANGE blocks on `olapTable.writeLockOrDdlException` until A releases the lock, at which point `olapTable.getPartitions()` reflects the full partition set and the rollup job B constructs only references partitions that have matching `OP_ADD_PARTITION` entries in the journal. The inconsistency is gone.

The lock is scoped to this one new table, so other tables in the same database are unaffected.
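The lock handoff described in steps 1 and 2 can be sketched in isolation. What follows is a minimal, self-contained simulation under stated assumptions, not the Doris code: `LockHandoffSketch`, `partitionsSeenByConcurrentMv`, and the plain `ReentrantReadWriteLock` instances are hypothetical stand-ins for `db.writeLock`, `olapTable.writeLock`, and the real DDL paths, and the latch exists only to make the interleaving deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical simulation of the lock handoff; none of these names are Doris APIs.
class LockHandoffSketch {

    // Returns how many partitions the simulated CREATE MV thread observes.
    static int partitionsSeenByConcurrentMv(int partitionCount) {
        ReentrantReadWriteLock dbLock = new ReentrantReadWriteLock();    // stands in for db.writeLock
        ReentrantReadWriteLock tableLock = new ReentrantReadWriteLock(); // stands in for olapTable.writeLock
        List<String> partitions = new ArrayList<>();                     // guarded by tableLock
        CountDownLatch tableVisible = new CountDownLatch(1);
        int[] seen = new int[1];

        // "Thread B": waits until the table exists, then takes the table write lock.
        Thread mv = new Thread(() -> {
            try {
                tableVisible.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            tableLock.writeLock().lock();
            try {
                seen[0] = partitions.size(); // snapshot a rollup job would be built from
            } finally {
                tableLock.writeLock().unlock();
            }
        });
        mv.start();

        // "Thread A": CREATE TABLE followed by first-time dynamic partition setup.
        boolean holdTableLock = false;
        dbLock.writeLock().lock();
        try {
            // Table registration (the OP_CREATE_TABLE step) would happen here.
            tableLock.writeLock().lock(); // step 1: take the table lock before dropping the db lock
            holdTableLock = true;
            tableVisible.countDown();
        } finally {
            dbLock.writeLock().unlock();
        }
        try {
            // Step 2: the work that previously ran unguarded after db.writeUnlock.
            for (int i = 0; i < partitionCount; i++) {
                partitions.add("p" + i);
            }
        } finally {
            if (holdTableLock) {
                tableLock.writeLock().unlock();
            }
        }

        try {
            mv.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(partitionsSeenByConcurrentMv(5));
    }
}
```

Because the simulated Thread A acquires the table lock before it releases the database lock, the second thread cannot take its snapshot until every partition has been added, so the method returns `partitionCount`. Moving the `tableLock` acquisition to after `dbLock.writeLock().unlock()` would reopen the window described earlier.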
The new table has no user traffic yet, so the extra lock hold time is effectively free.

The fix also introduces the debug point `FE.createOlapTable.beforeFirstTimeDynamicPartition` (param `sleepMs`), used only by the regression test to widen the race window. It is disabled by default in production.

- Regression: `regression-test/suites/cloud_p0/partition/test_create_table_and_create_mv_race.groovy` runs CREATE TABLE and CREATE MV concurrently and asserts that MV completes no earlier than CREATE TABLE. Without the fix, MV either returns during the injected sleep (assertion fails) or CREATE TABLE throws a ROLLUP-state error (future.get() re-throws before the assertion). With the fix, MV blocks on the table lock and the test passes.

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

### Release note

None

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
    - [x] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason? -->

- Behavior changed:
    - [x] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [x] No.
    - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 -->

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->
Contributor
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
Contributor
run buildall
Contributor
FE Regression Coverage Report
Increment line coverage
Cherry-picked from #62755