branch-4.1: (cloud) Hold table write lock across first-time dynamic partition setup to prevent CREATE MV race #62755 #62863

Open

github-actions[bot] wants to merge 1 commit into branch-4.1 from auto-pick-62755-branch-4.1

Conversation

@github-actions (Contributor)

Cherry-picked from #62755

(cloud) Hold table write lock across first-time dynamic partition setup to prevent CREATE MV race (#62755)

In cloud mode, `InternalCatalog.createTable` releases `db.writeLock`
right after writing the `OP_CREATE_TABLE` edit log, and only then
invokes `DynamicPartitionScheduler.executeDynamicPartitionFirstTime` to
create the first batch of dynamic partitions for the new table. There is
no lock guarding the gap between these two steps, which opens a race
window:

    Thread A (CREATE TABLE with dynamic_partition)
      -> db.writeLock
      -> db.createTableWithoutLock()     # writes OP_CREATE_TABLE (idToPartition is empty)
      -> db.writeUnlock                  # <- race window opens here
      -> executeDynamicPartitionFirstTime
         -> for each partition:
              addPartition()
                -> olapTable.readLock
                -> checkNormalStateForAlter()

    Thread B (CREATE MATERIALIZED VIEW from another client, concurrent)
      -> olapTable.writeLockOrDdlException   # succeeds, A's db lock does not block this
      -> checkNormalStateForAlter            # passes, state is still NORMAL
      -> for (Partition p : olapTable.getPartitions())   # snapshots leader's half-built state
           mvJob.addMVIndex(partitionId, ...)
      -> olapTable.setState(ROLLUP)
      -> logAlterJob(OP_ALTER_JOB_V2)        # journals a rollup that references partitions
                                             # which have not (and never will) appear as
                                             # OP_ADD_PARTITION entries in the journal

    Thread A resumes on the next addPartition
      -> checkNormalStateForAlter throws "state(ROLLUP) not NORMAL"
      -> CREATE TABLE returns ERR,
         but OP_CREATE_TABLE and OP_ALTER_JOB_V2 are already durably on disk,
         leaving a permanent inconsistency in the journal.
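
For orientation, here is a minimal sketch of the pre-fix locking shape of
`InternalCatalog.createTable`, reconstructed from the trace above. Argument
lists and error handling are elided, so this shows only the lock order, not
the literal Doris source:

    // Pre-fix shape (sketch): the db lock is dropped before the first-time
    // dynamic partitions are created, and nothing is held in between.
    db.writeLock();
    try {
        // writes OP_CREATE_TABLE to the edit log while idToPartition is empty
        db.createTableWithoutLock(olapTable);
    } finally {
        db.writeUnlock();  // <- race window opens here
    }
    // A concurrent CREATE MV can now take olapTable.writeLockOrDdlException,
    // pass checkNormalStateForAlter, and journal OP_ALTER_JOB_V2 before the
    // partitions below exist.
    dynamicPartitionScheduler.executeDynamicPartitionFirstTime(dbId, tableId);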

To reproduce it in cloud mode, have two clients fire the following two
statements against the same table within the same second:

    CREATE TABLE IF NOT EXISTS t (
        ...
    ) PARTITION BY RANGE(ts) () PROPERTIES (
        "dynamic_partition.enable" = "true",
        "dynamic_partition.time_unit" = "DAY",
        "dynamic_partition.start" = "-7",
        "dynamic_partition.end" = "1",
        "dynamic_partition.create_history_partition" = "true"
    );
    CREATE MATERIALIZED VIEW mv AS SELECT ... FROM t GROUP BY ...;

The new regression test `test_create_table_and_create_mv_race.groovy`
uses the debug point
`FE.createOlapTable.beforeFirstTimeDynamicPartition` (param `sleepMs`)
to widen the race window and reproduce it deterministically.

Once the bad journal entry is persisted, any FE replaying it hits:

    NullPointerException: Cannot invoke DataProperty.getStorageMedium()
    because the return value of PartitionInfo.getDataProperty(long) is null
        at RollupJobV2.addTabletToInvertedIndex(RollupJobV2.java:762)
        at RollupJobV2.replayCreateJob(RollupJobV2.java:745)
        at EditLog.loadJournal(EditLog.java:939)

`EditLog.loadJournal:1448` calls `System.exit(-1)`, so the FE JVM exits
immediately. Consequences observed in production:

- All followers that replicate the bad entry crash on replay; the supervisor
  restarts them and they crash again on the same entry, entering a crash loop.

Extend the lifetime of `olapTable.writeLock` inside
`InternalCatalog.createTable`:

1. After `OP_CREATE_TABLE` has been written (i.e. `result.second == false`,
   meaning the table was newly registered), acquire `olapTable.writeLock()`
   before releasing `db.writeLock`.
2. Wrap everything that used to run after `db.writeUnlock` (colocate persist,
   `executeDynamicPartitionFirstTime`, `registerOrRemoveDynamicPartitionTable`,
   `createOrUpdateRuntimeInfo`) in a new try/finally, and release the table
   lock in the finally according to the `holdTableLock` flag, as sketched
   below.
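
A minimal sketch of the resulting shape, using the names from the description
above (`result`, `holdTableLock`); the real method interleaves more
bookkeeping, so treat this as the lock order only:

    // Post-fix shape (sketch, not the literal source): the table write lock
    // is taken before the db lock is released, closing the race window.
    boolean holdTableLock = false;
    db.writeLock();
    try {
        Pair<Boolean, Boolean> result = db.createTableWithoutLock(olapTable);
        if (!result.second) {
            // Table was newly registered: take the table lock while the db
            // lock is still held, so no unguarded gap exists.
            olapTable.writeLock();
            holdTableLock = true;
        }
    } finally {
        db.writeUnlock();
    }
    try {
        // colocate persist, executeDynamicPartitionFirstTime,
        // registerOrRemoveDynamicPartitionTable, createOrUpdateRuntimeInfo
        dynamicPartitionScheduler.executeDynamicPartitionFirstTime(dbId, tableId);
    } finally {
        if (holdTableLock) {
            olapTable.writeUnlock();
        }
    }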

With this, Thread A holds the table write lock across the whole
first-time dynamic partition setup. Any concurrent CREATE MV / SCHEMA
CHANGE blocks on `olapTable.writeLockOrDdlException` until A releases
the lock, at which point `olapTable.getPartitions()` reflects the full
partition set, and the rollup job that B constructs references only
partitions with matching `OP_ADD_PARTITION` entries in the journal. The
inconsistency is gone.

The lock is scoped to this one new table, so other tables in the same
database are unaffected. The new table has no user traffic yet, so the
extra lock hold time is effectively free.

The fix also introduces the debug point
`FE.createOlapTable.beforeFirstTimeDynamicPartition` (param `sleepMs`),
used only by the regression test to widen the race window. It is
disabled by default in production.
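
As an illustration, here is a self-contained sketch of how such a sleep-style
debug point can work. This is a hypothetical helper, not the actual Doris
`DebugPointUtil` API; only the point name and the `sleepMs` parameter come
from the fix:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical stand-in for a debug-point registry: a named hook that
    // sleeps for `sleepMs` only when a test has enabled it, and costs one
    // map lookup when disabled (the production default).
    final class DebugPoints {
        private static final ConcurrentHashMap<String, Map<String, String>> ENABLED =
                new ConcurrentHashMap<>();

        static void enable(String name, Map<String, String> params) {
            ENABLED.put(name, params);
        }

        // Called at the hook site, e.g. just before executeDynamicPartitionFirstTime.
        static void maybeSleep(String name) {
            Map<String, String> params = ENABLED.get(name);
            if (params == null) {
                return; // disabled by default
            }
            long sleepMs = Long.parseLong(params.getOrDefault("sleepMs", "0"));
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

The test enables `FE.createOlapTable.beforeFirstTimeDynamicPartition` with a
large `sleepMs`, parking Thread A inside the race window long enough for the
concurrent CREATE MV to run.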

- Regression:
  `regression-test/suites/cloud_p0/partition/test_create_table_and_create_mv_race.groovy`
  runs CREATE TABLE and CREATE MV concurrently and asserts that CREATE MV
  completes no earlier than CREATE TABLE. Without the fix, CREATE MV either
  returns during the injected sleep (the assertion fails) or CREATE TABLE
  throws a ROLLUP-state error (`future.get()` re-throws it before the
  assertion is reached). With the fix, CREATE MV blocks on the table lock and
  the test passes; a sketch of the test's shape follows below.
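
For readers who want the shape of that test without reading the Groovy, a
hedged Java analogue; `runSql` is a hypothetical placeholder for executing a
statement against the FE:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Java analogue of the Groovy suite's structure, not the test itself.
    class CreateTableAndMvRace {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(2);
            Future<Long> createTable = pool.submit(() -> {
                runSql("CREATE TABLE t (/* ... */)");  // dynamic_partition properties set
                return System.nanoTime();
            });
            Future<Long> createMv = pool.submit(() -> {
                runSql("CREATE MATERIALIZED VIEW mv AS SELECT /* ... */");
                return System.nanoTime();
            });
            long tableDone = createTable.get();  // pre-fix: re-throws "state(ROLLUP) not NORMAL"
            long mvDone = createMv.get();
            // Post-fix: CREATE MV blocked on the table write lock, so it
            // cannot finish before first-time partition setup completes.
            if (mvDone < tableDone) {
                throw new AssertionError("CREATE MV returned before CREATE TABLE");
            }
            pool.shutdown();
        }

        private static void runSql(String sql) { /* submit to FE; elided */ }
    }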

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

### Release note

None

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
    - [x] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason?  -->

- Behavior changed:
    - [x] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [x] No.
    - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 -->

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->
@github-actions bot requested a review from yiguolei as a code owner, April 27, 2026 07:58
@Thearas (Contributor) commented Apr 27, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@dataroaring reopened this Apr 27, 2026
@Thearas (Contributor) commented Apr 27, 2026

run buildall

@hello-stephen (Contributor)

FE Regression Coverage Report

Increment line coverage 43.18% (19/44) 🎉
Increment coverage report
Complete coverage report
