
fix(fe): clean DynamicPartitionScheduler.runtimeInfos on DROP TABLE#62884

Open
horus-leonardo wants to merge 1 commit into apache:master from horus-leonardo:fix/dynamic-partition-scheduler-runtimeinfos-leak

Conversation


horus-leonardo commented Apr 27, 2026

What problem does this PR solve?

Issue Number: close #62883

Related PR: none

Problem Summary:

DynamicPartitionScheduler.runtimeInfos accumulates entries indefinitely. The map is keyed by tableId and gets a new entry every time the scheduler runs against a table with dynamic_partition.enable=true or partitionRetentionCount > 0.

removeRuntimeInfo(long tableId) is called in exactly one place: ShowDynamicPartitionCommand.doRun(), which fires only when a user issues SHOW DYNAMIC PARTITION, and even then only for tables still present in the catalog that have lost their dynamic_partition property. No catalog mutation path calls it: DROP TABLE, DROP DATABASE, and tables that turn off dynamic_partition or zero out partitionRetentionCount all leave permanent entries. In automated ETL workloads where nobody runs SHOW DYNAMIC PARTITION, the map grows without bound.
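To make the mechanism concrete, here is a minimal self-contained model of the leak. This is plain Java, not the Doris source; only the field name runtimeInfos mirrors the real code, everything else is a stand-in:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the leak: runtimeInfos is keyed by tableId and repopulated
// on every scheduler pass, but no catalog mutation path ever removes an entry.
class LeakModel {
    // stands in for DynamicPartitionScheduler.runtimeInfos
    static final Map<Long, Map<String, String>> runtimeInfos = new ConcurrentHashMap<>();

    static void schedulerPass(long tableId) {
        runtimeInfos.computeIfAbsent(tableId, k -> new ConcurrentHashMap<>())
                .put("lastUpdateTime", "now");
    }

    static void dropTable(long tableId) {
        // the catalog unregisters the table, but runtimeInfos is untouched,
        // so the entry for tableId survives for the lifetime of the FE
    }

    public static void main(String[] args) {
        // a CREATE/DROP loop, as in a high-DDL-churn ETL workload
        for (long id = 0; id < 1000; id++) {
            schedulerPass(id);
            dropTable(id);
        }
        System.out.println(runtimeInfos.size()); // prints 1000: every entry leaked
    }
}
```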

This patch wires removeRuntimeInfo() into the three canonical cleanup points:

  1. InternalCatalog.unprotectDropTable() — alongside db.unregisterTable().
  2. executeDynamicPartition() db == null branch — after iterator.remove().
  3. executeDynamicPartition() olapTable invalid/lost-properties branch — after iterator.remove().
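The three call sites above can be sketched against a toy model. This is illustrative only; the real patch touches InternalCatalog and DynamicPartitionScheduler, which are not reproduced here, and the sets below merely stand in for the catalog and the scheduler's scheduling set:

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of where the patch wires in removeRuntimeInfo(); all classes
// here are stand-ins, not the actual Doris FE code.
class SchedulerSketch {
    final Map<Long, Map<String, String>> runtimeInfos = new ConcurrentHashMap<>();
    final Set<Long> dynamicPartitionTables = new HashSet<>(); // scheduling set
    final Set<Long> catalogTables = new HashSet<>();          // live catalog

    void removeRuntimeInfo(long tableId) {
        runtimeInfos.remove(tableId);
    }

    // cleanup point 1: the DROP TABLE path (unprotectDropTable in the real code)
    void dropTable(long tableId) {
        catalogTables.remove(tableId); // alongside db.unregisterTable()
        removeRuntimeInfo(tableId);    // <-- added by this patch
    }

    // cleanup points 2 and 3: the scheduler's own stale-table sweep
    void executeDynamicPartition() {
        Iterator<Long> it = dynamicPartitionTables.iterator();
        while (it.hasNext()) {
            long tableId = it.next();
            if (!catalogTables.contains(tableId)) { // db/table gone or properties lost
                it.remove();
                removeRuntimeInfo(tableId);         // <-- added after iterator.remove()
                continue;
            }
            runtimeInfos.computeIfAbsent(tableId, k -> new ConcurrentHashMap<>())
                    .put("lastSchedulerTime", "now");
        }
    }

    public static void main(String[] args) {
        SchedulerSketch s = new SchedulerSketch();
        s.catalogTables.add(1L);
        s.dynamicPartitionTables.add(1L);
        s.executeDynamicPartition();               // records runtime info for table 1
        s.dropTable(1L);                           // entry cleared at drop time
        System.out.println(s.runtimeInfos.size()); // prints 0
    }
}
```

With this wiring, an entry is removed either eagerly at DROP time or lazily on the next scheduler sweep, so no stale tableId can outlive both paths.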

Found via heap dump analysis after an FE OOM on 4.0.5-rc01 today (2026-04-27) in a high-DDL-churn ETL workload. The map had reached ~1.5M entries / 554 MB retained heap. We are rolling out a patched build to production now and will follow up on the issue thread with steady-state retention numbers after a week of uptime.

Full bug report and heap dump details in #62883.

Release note

Fix FE memory leak in DynamicPartitionScheduler.runtimeInfos for tables that are dropped, lose their dynamic_partition.enable property, or have partitionRetentionCount reset to 0.

Check List (For Author)

  • Test
    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:

Manual test: heap dump analysis on a 4.0.5-rc01 FE that OOMed under an ETL workload doing ~24K DDL/hour against dynamic_partition tables. The dump showed runtimeInfos holding ~1M–1.5M stale entries (2,097,152-bucket ConcurrentHashMap$Node[], 554 MB retained on DynamicPartitionScheduler, 17% of live heap post-GC walk). The patched build is being deployed today; I will report steady-state heap numbers in the issue thread after a week of production uptime.

A unit test reproducing the leak would need to drive the dynamic-partition scheduler against a synthetic catalog and assert runtimeInfos.size() after DROP. Happy to add one if maintainers prefer that over the production validation.
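For reference, such a test could take roughly this shape. The FakeScheduler below is an illustrative stand-in; the real test would drive the actual DynamicPartitionScheduler against a test catalog and use the project's test harness rather than plain assertions:

```java
// Illustrative shape of the proposed unit test, not a Doris test fixture.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class DropTableCleansRuntimeInfosTest {
    static class FakeScheduler {
        final Map<Long, Map<String, String>> runtimeInfos = new ConcurrentHashMap<>();

        void recordRuntimeInfo(long tableId) { // what a scheduler pass does
            runtimeInfos.put(tableId, new ConcurrentHashMap<>());
        }

        void onDropTable(long tableId) { // what unprotectDropTable() now triggers
            runtimeInfos.remove(tableId);
        }
    }

    public static void main(String[] args) {
        FakeScheduler scheduler = new FakeScheduler();
        scheduler.recordRuntimeInfo(42L);
        if (scheduler.runtimeInfos.size() != 1) {
            throw new AssertionError("expected one entry before DROP");
        }
        scheduler.onDropTable(42L);
        if (!scheduler.runtimeInfos.isEmpty()) {
            throw new AssertionError("runtimeInfos leaked after DROP");
        }
        System.out.println("ok");
    }
}
```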

  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

DynamicPartitionScheduler.runtimeInfos accumulates entries indefinitely
when tables are dropped or lose their dynamic_partition properties.
removeRuntimeInfo(tableId) is called from ShowDynamicPartitionCommand
but only opportunistically: it requires a user to issue
SHOW DYNAMIC PARTITION and only catches tables still present in the
catalog that have lost their dynamic_partition property. No catalog
mutation path calls it.

Fix:
- Call removeRuntimeInfo() in InternalCatalog.unprotectDropTable() so
  the entry is cleared when a table is dropped.
- Call removeRuntimeInfo() in executeDynamicPartition() at the two
  cleanup points where the iterator removes a table from the scheduling
  set (db gone, olapTable null/MTMV/no-dynamic-partition).

In a high-DDL-churn workload (CREATE/DROP loops on tables with
dynamic_partition.enable=true or partitionRetentionCount > 0) this map
can grow unbounded and cause FE OOM after extended uptime.

Closes apache#62883

Signed-off-by: Leonardo Constanski <leonardo@horusbi.com.br>

Closes: [Bug] (dynamic-partition) DynamicPartitionScheduler.runtimeInfos leaks entries on DROP TABLE, causing FE OOM (#62883)