fix(fe): clean DynamicPartitionScheduler.runtimeInfos on DROP TABLE #62884
Open

horus-leonardo wants to merge 1 commit into apache:master from
Conversation
DynamicPartitionScheduler.runtimeInfos accumulates entries indefinitely when tables are dropped or lose their dynamic_partition properties. removeRuntimeInfo(tableId) is called from ShowDynamicPartitionCommand but only opportunistically: it requires a user to issue SHOW DYNAMIC PARTITION, and it only catches tables still present in the catalog that have lost their dynamic_partition property. No catalog mutation path calls it.

Fix:
- Call removeRuntimeInfo() in InternalCatalog.unprotectDropTable() so the entry is cleared when a table is dropped.
- Call removeRuntimeInfo() in executeDynamicPartition() at the two cleanup points where the iterator removes a table from the scheduling set (db gone; olapTable null, MTMV, or no dynamic_partition property).

In a high-DDL-churn workload (CREATE/DROP loops on tables with dynamic_partition.enable=true or partitionRetentionCount > 0) this map can grow unbounded and cause FE OOM after extended uptime.

Closes apache#62883

Signed-off-by: Leonardo Constanski <leonardo@horusbi.com.br>
What problem does this PR solve?
Issue Number: close #62883
Related PR: none
Problem Summary:
`DynamicPartitionScheduler.runtimeInfos` accumulates entries indefinitely. The map is keyed by `tableId` and gets a new entry every time the scheduler runs against a table with `dynamic_partition.enable=true` or `partitionRetentionCount > 0`. `removeRuntimeInfo(long tableId)` is called in exactly one place: `ShowDynamicPartitionCommand.doRun()`, which only fires when a user issues `SHOW DYNAMIC PARTITION`, and only for tables still present in the catalog that have lost their `dynamic_partition` property. No catalog mutation path calls it: DROP TABLE, DROP DATABASE, and tables that turn off dynamic_partition or zero out `partitionRetentionCount` all leave permanent entries. In automated ETL workloads where nobody runs `SHOW`, the map grows unbounded.

This patch wires `removeRuntimeInfo()` into the three canonical cleanup points:

- `InternalCatalog.unprotectDropTable()`, alongside `db.unregisterTable()`.
- The `executeDynamicPartition()` `db == null` branch, after `iterator.remove()`.
- The `executeDynamicPartition()` olapTable invalid/lost-properties branch, after `iterator.remove()`.

Found via heap dump analysis after an FE OOM on 4.0.5-rc01 today (2026-04-27) in a high-DDL-churn ETL workload. The map had reached ~1.5M entries / 554 MB retained heap. We are rolling out a patched build to production now and will follow up on the issue thread with steady-state retention numbers after a week of uptime.
Full bug report and heap dump details in #62883.
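For reviewers unfamiliar with the scheduler, here is a minimal sketch of the leak pattern and of the fix's cleanup pairing. The class and method names mirror the PR description but are simplified stand-ins, not the actual Doris source:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RuntimeInfosSketch {
    // Keyed by tableId, as described above; gains one entry per scheduled table.
    static final Map<Long, Map<String, String>> runtimeInfos = new ConcurrentHashMap<>();

    // Runs on every scheduler pass for a dynamic-partition table.
    static void createOrUpdateRuntimeInfo(long tableId, String key, String value) {
        runtimeInfos.computeIfAbsent(tableId, k -> new ConcurrentHashMap<>()).put(key, value);
    }

    // The only cleanup hook; before this patch, no DROP path called it.
    static void removeRuntimeInfo(long tableId) {
        runtimeInfos.remove(tableId);
    }

    // Stand-in for InternalCatalog.unprotectDropTable(): with the patch,
    // unregistering the table and clearing its runtime info happen together.
    static void unprotectDropTable(long tableId, boolean patched) {
        // db.unregisterTable(...) would happen here
        if (patched) {
            removeRuntimeInfo(tableId);
        }
    }

    public static void main(String[] args) {
        // Unpatched CREATE/DROP churn: every dropped table leaves a stale entry.
        for (long tableId = 0; tableId < 10_000; tableId++) {
            createOrUpdateRuntimeInfo(tableId, "LAST_UPDATE_TIME", "now");
            unprotectDropTable(tableId, false);
        }
        System.out.println("unpatched stale entries: " + runtimeInfos.size()); // 10000

        runtimeInfos.clear();

        // Patched churn: cleanup is paired with the drop, nothing is retained.
        for (long tableId = 0; tableId < 10_000; tableId++) {
            createOrUpdateRuntimeInfo(tableId, "LAST_UPDATE_TIME", "now");
            unprotectDropTable(tableId, true);
        }
        System.out.println("patched stale entries: " + runtimeInfos.size()); // 0
    }
}
```

The point of the pairing is that cleanup no longer depends on a user happening to run `SHOW DYNAMIC PARTITION`; it rides on the catalog mutation itself.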
Release note
Fix FE memory leak in `DynamicPartitionScheduler.runtimeInfos` for tables that are dropped, lose their `dynamic_partition.enable` property, or have `partitionRetentionCount` reset to 0.

Check List (For Author)
Manual test: heap dump analysis on a 4.0.5-rc01 FE that OOMed under an ETL workload doing ~24K DDL/hour against `dynamic_partition` tables. The dump showed `runtimeInfos` holding ~1M–1.5M stale entries (a 2,097,152-bucket `ConcurrentHashMap$Node[]`, 554 MB retained on `DynamicPartitionScheduler`, 17% of live heap post-GC walk). The patched build is being deployed today; I will report steady-state heap numbers in the issue thread after a week of production uptime.

A unit test reproducing the leak would need to drive the dynamic-partition scheduler against a synthetic catalog and assert `runtimeInfos.size()` after DROP. Happy to add one if maintainers prefer that over the production validation.

Behavior changed:
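The unit test described in the checklist above could take roughly this shape: a synthetic catalog, a fake scheduler pass, and an assertion that DROP leaves `runtimeInfos` empty. All names here are illustrative stand-ins, not the real Doris scheduler or test harness:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RuntimeInfoCleanupTest {
    static final Map<Long, Map<String, String>> runtimeInfos = new ConcurrentHashMap<>();
    static final Map<Long, String> catalog = new HashMap<>(); // tableId -> table name

    // Fake scheduler pass: record runtime info for every catalog table.
    static void schedulerPass() {
        for (long tableId : catalog.keySet()) {
            runtimeInfos.computeIfAbsent(tableId, k -> new ConcurrentHashMap<>())
                        .put("STATE", "NORMAL");
        }
    }

    // DROP path with the fix applied: unregister the table, clear its entry.
    static void dropTable(long tableId) {
        catalog.remove(tableId);
        runtimeInfos.remove(tableId);
    }

    public static void main(String[] args) {
        catalog.put(1L, "t1");
        schedulerPass();
        if (runtimeInfos.size() != 1) {
            throw new AssertionError("expected one entry after scheduler pass");
        }
        dropTable(1L);
        if (!runtimeInfos.isEmpty()) {
            throw new AssertionError("stale runtimeInfos entry after DROP");
        }
        System.out.println("cleanup-after-drop: ok");
    }
}
```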
Does this need documentation?
Check List (For Reviewer who merge this PR)