HIVE-28755: Statistics Management Task #6438
DanielZhu58 wants to merge 4 commits into apache:master from
Conversation
Pull request overview
Adds a new Metastore background task intended to automatically delete expired column statistics, and wires it into the metastore task thread list. The PR also introduces a benchmark hook and a new unit test for the statistics-management behavior.
Changes:
- Introduce `StatisticsManagementTask` (a `MetastoreTaskThread`) that deletes column stats older than a configured retention window.
- Add new metastore configuration knobs for stats-management frequency/retention/enablement and register the task in the task thread list.
- Add a micro-benchmark entry and a new unit test class for statistics management.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| standalone-metastore/metastore-tools/metastore-benchmarks/src/main/java/org/apache/hadoop/hive/metastore/tools/HMSBenchmarks.java | Adds a benchmark for StatisticsManagementTask (currently simulates/validates the wrong thing). |
| standalone-metastore/metastore-tools/metastore-benchmarks/src/main/java/org/apache/hadoop/hive/metastore/tools/BenchmarkTool.java | Wires the new benchmark into the benchmark suite (one suite name is inconsistent). |
| standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestStatisticsManagement.java | New unit tests for stats auto-deletion (currently missing required statsDesc on the stats object). |
| standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java | New background task implementation (contains multiple correctness/compilation/runtime issues). |
| standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/conf/MetastoreConf.java | Adds new conf vars and registers the task; docstrings currently don’t match how the feature is enabled. |
| if (excludeVal != null) { | ||
| LOG.info("Skipping auto deletion of stats for table {}.{} due to STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY property being set on the table.", dbName, tblName); |
The exclude check treats any non-null table parameter value as excluded; if the property is present but set to "false" it will still skip deletion. Check the value (e.g., equalsIgnoreCase("true")) instead of only nullness.
| if (excludeVal != null) { | |
| LOG.info("Skipping auto deletion of stats for table {}.{} due to STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY property being set on the table.", dbName, tblName); | |
| if ("true".equalsIgnoreCase(excludeVal)) { | |
| LOG.info("Skipping auto deletion of stats for table {}.{} due to STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY property being set to true on the table.", dbName, tblName); |
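The value-sensitive check in the suggestion can be isolated into a plain helper, which makes the null/"false"/"true" behavior easy to verify outside the metastore. A minimal sketch (the class and method names are illustrative, not part of the PR):

```java
public class ExcludeCheck {
    // Only an explicit "true" (any case) opts the table out of stats deletion;
    // a null or "false" property value leaves deletion enabled.
    static boolean isExcluded(String excludeVal) {
        return "true".equalsIgnoreCase(excludeVal);
    }

    public static void main(String[] args) {
        System.out.println(isExcluded(null));    // false
        System.out.println(isExcluded("false")); // false
        System.out.println(isExcluded("TRUE"));  // true
    }
}
```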
| ColumnStatistics cs = new ColumnStatistics(); | ||
| ColumnStatisticsDesc desc = new ColumnStatisticsDesc(true, db, tbl); | ||
| desc.setCatName("hive"); | ||
| cs.addToStatsObj(obj); | ||
| client.updateTableColumnStatistics(cs); | ||
| } |
ColumnStatistics is missing its statsDesc assignment. updateTableColumnStatistics dereferences statsObj.getStatsDesc() and will NPE unless you call cs.setStatsDesc(desc) (and typically set the engine if needed).
| final StatisticsManagementTask statsTask = new StatisticsManagementTask(); | ||
| try { | ||
| client.getHadoopConf().set("hive.metastore.uris", client.getServerURI().toString()); | ||
| client.getHadoopConf().set("metastore.statistics.management.database.pattern", dbName); |
This sets a config key (metastore.statistics.management.database.pattern) that doesn't appear to be defined in MetastoreConf or read by StatisticsManagementTask, so it currently has no effect. Either wire this config into the task or drop it from the benchmark setup to avoid misleading results.
| client.getHadoopConf().set("metastore.statistics.management.database.pattern", dbName); |
| // check if the stats are deleted | ||
| for (int i = 0; i < tableCount; i++) { | ||
| String tableName = tableNamePrefix + "_" + i; | ||
| List<Partition> partitions = client.listPartitions(dbName, tableName); | ||
| for (Partition partition : partitions) { | ||
| Map<String, String> params = partition.getParameters(); | ||
| if (params.containsKey("lastAnalyzed")) { | ||
| throw new AssertionError("Partition stats not deleted for table: " + tableName); |
The post-run verification checks whether the partition parameters still contain lastAnalyzed, but the task deletes column statistics and won't remove this partition parameter. The assertion will stay true even when stats are deleted; verify via getPartitionColumnStatistics/getTableColumnStatistics (or direct TAB_COL_STATS query) instead.
| // check if the stats are deleted | |
| for (int i = 0; i < tableCount; i++) { | |
| String tableName = tableNamePrefix + "_" + i; | |
| List<Partition> partitions = client.listPartitions(dbName, tableName); | |
| for (Partition partition : partitions) { | |
| Map<String, String> params = partition.getParameters(); | |
| if (params.containsKey("lastAnalyzed")) { | |
| throw new AssertionError("Partition stats not deleted for table: " + tableName); | |
| // check if the partition column stats are deleted | |
| for (int i = 0; i < tableCount; i++) { | |
| String tableName = tableNamePrefix + "_" + i; | |
| List<Partition> partitions = client.listPartitions(dbName, tableName); | |
| List<String> partitionNames = partitions.stream() | |
| .map(Partition::getValues) | |
| .map(values -> "part_col=" + values.get(0)) | |
| .collect(Collectors.toList()); | |
| if (!partitionNames.isEmpty()) { | |
| PartitionsStatsRequest request = new PartitionsStatsRequest(); | |
| request.setDbName(dbName); | |
| request.setTblName(tableName); | |
| request.setColNames(Collections.singletonList("col2")); | |
| request.setPartNames(partitionNames); | |
| if (!client.getPartitionsStatisticsReq(request).getPartStats().isEmpty()) { | |
| throw new AssertionError("Partition column stats not deleted for table: " + tableName); |
| 7, TimeUnit.DAYS, "Frequency at which timer task runs to do automatic statistics management for tables\n" + | ||
| "with table property 'statistics.auto.deletion'='true'. Statistics management include 2 configs. \n" + | ||
| "One is 'statistics.auto.deletion', and the other is 'statistics.retention.period'. \n" + | ||
| "When 'statistics.auto.deletion'='true' is set, statistics management will look for tables which their\n " + | ||
| "column statistics are over the retention period, and then delete the column stats. \n"), | ||
| STATISTICS_RETENTION_PERIOD("metastore.statistics.retention.period", | ||
| "metastore.statistics.retention.period", 365, TimeUnit.DAYS, "The retention period " + | ||
| "that we want to keep the stats for each table, which means if the stats are older than this period\n" + | ||
| "of time, the stats will be automatically deleted. \n"), | ||
| STATISTICS_AUTO_DELETION("statistics.auto.deletion", "statistics.auto.deletion", true, | ||
| "Whether table/partition column statistics will be auto deleted after retention period"), |
The config docs describe enabling via a table property 'statistics.auto.deletion'='true', but STATISTICS_AUTO_DELETION is a metastore/server config (ConfVars) and there is no corresponding table property in this change (only an exclude property exists). Please clarify the docs to match the actual enablement mechanism and use the correct config keys (e.g., metastore.statistics.retention.period vs statistics.retention.period).
| 7, TimeUnit.DAYS, "Frequency at which timer task runs to do automatic statistics management for tables\n" + | |
| "with table property 'statistics.auto.deletion'='true'. Statistics management include 2 configs. \n" + | |
| "One is 'statistics.auto.deletion', and the other is 'statistics.retention.period'. \n" + | |
| "When 'statistics.auto.deletion'='true' is set, statistics management will look for tables which their\n " + | |
| "column statistics are over the retention period, and then delete the column stats. \n"), | |
| STATISTICS_RETENTION_PERIOD("metastore.statistics.retention.period", | |
| "metastore.statistics.retention.period", 365, TimeUnit.DAYS, "The retention period " + | |
| "that we want to keep the stats for each table, which means if the stats are older than this period\n" + | |
| "of time, the stats will be automatically deleted. \n"), | |
| STATISTICS_AUTO_DELETION("statistics.auto.deletion", "statistics.auto.deletion", true, | |
| "Whether table/partition column statistics will be auto deleted after retention period"), | |
| 7, TimeUnit.DAYS, "Frequency at which timer task runs to do automatic statistics management.\n" + | |
| "Statistics management is controlled by the metastore/server config 'statistics.auto.deletion'\n" + | |
| "and uses 'metastore.statistics.retention.period' as the retention period. When\n" + | |
| "'statistics.auto.deletion' is enabled, statistics management will look for table/partition\n" + | |
| "column statistics that are older than the retention period and then delete them.\n"), | |
| STATISTICS_RETENTION_PERIOD("metastore.statistics.retention.period", | |
| "metastore.statistics.retention.period", 365, TimeUnit.DAYS, "The retention period " + | |
| "that we want to keep the stats for each table, which means if the stats are older than this period\n" + | |
| "of time, the stats will be automatically deleted. \n"), | |
| STATISTICS_AUTO_DELETION("statistics.auto.deletion", "statistics.auto.deletion", true, | |
| "Metastore/server config that controls whether table/partition column statistics are auto deleted after the retention period"), |
| Query q = null; | ||
| try { | ||
| q = pm.newQuery(MTableColumnStatistics.class); | ||
| q.setFilter(filter); | ||
| q.declareParameters(paramStr); | ||
| // only fetch required fields, avoid loading heavy MTable objects | ||
| q.setResult( | ||
| "table.database.name, " + | ||
| "table.tableName, " + | ||
| "partitionName, " + | ||
| "table.parameters.get(\"" + STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY + "\")" | ||
| ); | ||
| @SuppressWarnings("unchecked") | ||
| List<Object[]> rows = (List<Object[]>) q.execute(lastAnalyzedThreshold); | ||
| for (Object[] row : rows) { | ||
| String dbName = (String) row[0]; | ||
| String tblName = (String) row[1]; | ||
| String partName = (String) row[2]; // can be null for table-level stats | ||
| String excludeVal = (String) row[3]; // can be null | ||
| // exclude check uses projected param value | ||
| if (excludeVal != null) { | ||
| LOG.info("Skipping auto deletion of stats for table {}.{} due to STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY property being set on the table.", dbName, tblName); | ||
| continue; | ||
| } | ||
| DeleteColumnStatisticsRequest request = new DeleteColumnStatisticsRequest(dbName, tblName); | ||
| request.setEngine("hive"); | ||
| // decide tableLevel based on whether this stat row is table-level or partition-level | ||
| // avoids loading table partition keys / MTable | ||
| request.setTableLevel(partName == null); | ||
| msc.deleteColumnStatistics(request); | ||
| } | ||
| } finally { | ||
| if (q != null) { | ||
| q.closeAll(); | ||
| } |
The PersistenceManager obtained from ObjectStore is never closed, which can leak resources/DB connections in a periodically running task. Wrap the PM in try/finally (or try-with-resources if supported) and close it after the query completes.
| Query q = null; | |
| try { | |
| q = pm.newQuery(MTableColumnStatistics.class); | |
| q.setFilter(filter); | |
| q.declareParameters(paramStr); | |
| // only fetch required fields, avoid loading heavy MTable objects | |
| q.setResult( | |
| "table.database.name, " + | |
| "table.tableName, " + | |
| "partitionName, " + | |
| "table.parameters.get(\"" + STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY + "\")" | |
| ); | |
| @SuppressWarnings("unchecked") | |
| List<Object[]> rows = (List<Object[]>) q.execute(lastAnalyzedThreshold); | |
| for (Object[] row : rows) { | |
| String dbName = (String) row[0]; | |
| String tblName = (String) row[1]; | |
| String partName = (String) row[2]; // can be null for table-level stats | |
| String excludeVal = (String) row[3]; // can be null | |
| // exclude check uses projected param value | |
| if (excludeVal != null) { | |
| LOG.info("Skipping auto deletion of stats for table {}.{} due to STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY property being set on the table.", dbName, tblName); | |
| continue; | |
| } | |
| DeleteColumnStatisticsRequest request = new DeleteColumnStatisticsRequest(dbName, tblName); | |
| request.setEngine("hive"); | |
| // decide tableLevel based on whether this stat row is table-level or partition-level | |
| // avoids loading table partition keys / MTable | |
| request.setTableLevel(partName == null); | |
| msc.deleteColumnStatistics(request); | |
| } | |
| } finally { | |
| if (q != null) { | |
| q.closeAll(); | |
| } | |
| try { | |
| Query q = null; | |
| try { | |
| q = pm.newQuery(MTableColumnStatistics.class); | |
| q.setFilter(filter); | |
| q.declareParameters(paramStr); | |
| // only fetch required fields, avoid loading heavy MTable objects | |
| q.setResult( | |
| "table.database.name, " + | |
| "table.tableName, " + | |
| "partitionName, " + | |
| "table.parameters.get(\"" + STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY + "\")" | |
| ); | |
| @SuppressWarnings("unchecked") | |
| List<Object[]> rows = (List<Object[]>) q.execute(lastAnalyzedThreshold); | |
| for (Object[] row : rows) { | |
| String dbName = (String) row[0]; | |
| String tblName = (String) row[1]; | |
| String partName = (String) row[2]; // can be null for table-level stats | |
| String excludeVal = (String) row[3]; // can be null | |
| // exclude check uses projected param value | |
| if (excludeVal != null) { | |
| LOG.info("Skipping auto deletion of stats for table {}.{} due to STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY property being set on the table.", dbName, tblName); | |
| continue; | |
| } | |
| DeleteColumnStatisticsRequest request = new DeleteColumnStatisticsRequest(dbName, tblName); | |
| request.setEngine("hive"); | |
| // decide tableLevel based on whether this stat row is table-level or partition-level | |
| // avoids loading table partition keys / MTable | |
| request.setTableLevel(partName == null); | |
| msc.deleteColumnStatistics(request); | |
| } | |
| } finally { | |
| if (q != null) { | |
| q.closeAll(); | |
| } | |
| } | |
| } finally { | |
| pm.close(); |
| // simulate the partitions of each table which its stats has an old "lastAnalyzed" | ||
| List<Partition> partitions = client.listPartitions(dbName, tableName); | ||
| for (Partition partition : partitions) { | ||
| Map<String, String> params = partition.getParameters(); | ||
| // to manually change the "lastAnalyzed" to an old time, ex. 400 days | ||
| params.put("lastAnalyzed", String.valueOf(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(400))); | ||
| } | ||
| client.alterPartitions(dbName, tableName, partitions); |
The benchmark simulates expiration by mutating partition parameters lastAnalyzed, but StatisticsManagementTask deletes rows from TAB_COL_STATS based on the column-stats lastAnalyzed field (not partition parameters). This setup won't exercise the task correctly; create/update real column statistics (table/partition) and age their lastAnalyzed instead.
| .add("PartitionManagementTask" + "." + howMany, | ||
| () -> benchmarkPartitionManagement(bench, bData, howMany)); | ||
| () -> benchmarkPartitionManagement(bench, bData, howMany)) | ||
| .add("PartitionStatisticsTask" + "." + howMany, |
Suite entry name is inconsistent: it registers benchmarkStatisticsManagement under "PartitionStatisticsTask" here. Consider renaming to "StatisticsManagementTask" to match the task/class name and the earlier suite entry.
| .add("PartitionStatisticsTask" + "." + howMany, | |
| .add("StatisticsManagementTask" + "." + howMany, |
| q.setResult( | ||
| "table.database.name, " + | ||
| "table.tableName, " + | ||
| "partitionName, " + | ||
| "table.parameters.get(\"" + STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY + "\")" | ||
| ); |
MTableColumnStatistics/TAB_COL_STATS doesn't have a partitionName field/column; this JDOQL result projection is likely to fail at runtime. If partition-level stats need handling, query MPartitionColumnStatistics and project partition.partitionName (or otherwise join to MPartition) instead of referencing partitionName directly.
| DeleteColumnStatisticsRequest request = new DeleteColumnStatisticsRequest(dbName, tblName); | ||
| request.setEngine("hive"); | ||
| // decide tableLevel based on whether this stat row is table-level or partition-level | ||
| // avoids loading table partition keys / MTable | ||
| request.setTableLevel(partName == null); | ||
| msc.deleteColumnStatistics(request); | ||
| } |
For partition-level stats (partName != null), the request does not set part_names. On the server side, an empty part_names causes delete_column_statistics_req to delete stats for all partitions of the table. Add the specific partition name(s) to the request (and dedupe/group by table+partition to avoid repeated deletes per column-stats row).
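The dedupe/group-by step suggested above can be prototyped independently of the Hive client. The sketch below groups projected rows by db.table so each table would receive a single delete request carrying all of its expired partition names; the row layout mirrors the projection (dbName, tblName, partName), and all names are illustrative:

```java
import java.util.*;

public class GroupStatDeletes {
    // Group partition names per db.table; an empty set means only
    // table-level stats rows were seen for that table.
    static Map<String, Set<String>> groupByTable(List<String[]> rows) {
        Map<String, Set<String>> partsByTable = new LinkedHashMap<>();
        for (String[] row : rows) {
            String key = row[0] + "." + row[1];             // dbName.tblName
            Set<String> parts =
                partsByTable.computeIfAbsent(key, k -> new LinkedHashSet<>());
            if (row[2] != null) {                           // null => table-level stats
                parts.add(row[2]);
            }
        }
        return partsByTable;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[]{"db1", "t1", "p=1"},
            new String[]{"db1", "t1", "p=2"},
            new String[]{"db1", "t1", "p=1"},  // second column-stats row, same partition
            new String[]{"db1", "t2", null});
        System.out.println(groupByTable(rows));
    }
}
```

Issuing one request per group (with `setPartNames` populated for partitioned groups) avoids both the repeated deletes and the all-partitions wipe described in the comment.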
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
| @Override | ||
| public void setConf(Configuration configuration) { | ||
| // we modify conf in setupConf(), so we make a copy | ||
| this.conf = configuration; |
setConf says it makes a copy because the task may mutate the configuration, but it currently assigns the passed instance directly. This diverges from PartitionManagementTask (which does new Configuration(configuration)) and can cause unexpected cross-thread/shared-conf mutations; make a defensive copy here as well (or update the comment if no mutation happens).
| this.conf = configuration; | |
| this.conf = new Configuration(configuration); |
| ColumnStatistics cs = new ColumnStatistics(); | ||
| ColumnStatisticsDesc desc = new ColumnStatisticsDesc(true, db, tbl); | ||
| desc.setCatName("hive"); | ||
| cs.addToStatsObj(obj); | ||
| client.updateTableColumnStatistics(cs); | ||
| } |
ColumnStatisticsDesc desc is created but never attached to the ColumnStatistics object. updateTableColumnStatistics expects the stats descriptor (db/table/cat/lastAnalyzed/etc.) to be set (see other tests like TestHiveMetaStore); call cs.setStatsDesc(desc) before updating, otherwise the update can fail or write incomplete stats.
| // simulate the partitions of each table which its stats has an old "lastAnalyzed" | ||
| List<Partition> partitions = client.listPartitions(dbName, tableName); | ||
| for (Partition partition : partitions) { | ||
| Map<String, String> params = partition.getParameters(); | ||
| // to manually change the "lastAnalyzed" to an old time, ex. 400 days | ||
| params.put("lastAnalyzed", String.valueOf(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(400))); | ||
| } | ||
| client.alterPartitions(dbName, tableName, partitions); | ||
| } |
The benchmark simulates expired stats by updating the partition parameter lastAnalyzed, but StatisticsManagementTask currently queries MTableColumnStatistics.lastAnalyzed (TAB_COL_STATS) instead of partition/table parameters. As a result, this setup (and the post-run assertion that lastAnalyzed is removed from partition parameters) is not exercising the code path being benchmarked; update the benchmark to expire the same metadata the task uses (or update the task to use the same source of truth).
| + "org.apache.hadoop.hive.metastore.ReplicationMetricsMaintTask" + "," | ||
| + "org.apache.hadoop.hive.metastore.StatisticsManagementTask", |
Adding org.apache.hadoop.hive.metastore.StatisticsManagementTask to METASTORE_TASK_THREADS_ALWAYS means it will start in all deployments. Given the new configs currently default to enabling deletion, this can change runtime behavior unexpectedly after upgrade; ensure this is gated behind an explicit opt-in (or keep it out of the always-start list until explicitly configured).
| + "org.apache.hadoop.hive.metastore.ReplicationMetricsMaintTask" + "," | |
| + "org.apache.hadoop.hive.metastore.StatisticsManagementTask", | |
| + "org.apache.hadoop.hive.metastore.ReplicationMetricsMaintTask", |
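If the task does stay on the always-start list, an explicit opt-in gate inside the task keeps it a no-op by default. This standalone sketch uses `java.util.Properties` in place of the Hadoop `Configuration`; the config key is only illustrative of the pattern, not a key defined by this PR:

```java
import java.util.Properties;

public class OptInGate {
    // Defaulting to false means deployments that never configured the
    // feature see no behavior change, even though the task thread starts.
    static boolean shouldRun(Properties conf) {
        return Boolean.parseBoolean(
            conf.getProperty("metastore.statistics.auto.deletion", "false"));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(shouldRun(conf));  // false: not configured
        conf.setProperty("metastore.statistics.auto.deletion", "true");
        System.out.println(shouldRun(conf));  // true: explicit opt-in
    }
}
```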
| import java.util.List; | ||
| import java.util.Map; | ||
| import java.util.concurrent.TimeUnit; |
java.util.Map is imported but not used in this class; please remove the unused import to keep the file clean (and to satisfy builds that fail on unused imports/checkstyle).
| Util.TableBuilder tableBuilder = new Util.TableBuilder(dbName, tableName) | ||
| .withType(TableType.MANAGED_TABLE) | ||
| .withColumns(createSchema(Arrays.asList("col1:string", "col2:int"))) | ||
| .withPartitionKeys(createSchema(Collections.singletonList("part_col"))) | ||
| .withParameter("columnStatsAccurate", "true"); | ||
Table/partition stats accuracy is tracked under the COLUMN_STATS_ACCURATE parameter (see StatsSetupConst.COLUMN_STATS_ACCURATE), but the benchmark sets columnStatsAccurate. If this parameter is intended to make the metastore think column stats are present/accurate, the current key likely won't be recognized; use the correct constant/key.
| String excludeVal = (String) row[3]; // can be null | ||
| // exclude check uses projected param value | ||
| if (excludeVal != null) { |
The exclusion check treats any non-null value of statistics.auto.deletion.exclude as excluded (including "false"). This makes it impossible to explicitly set the property to false and still have deletion run; compare the value to "true" (case-insensitive) instead of only checking for presence.
| if (excludeVal != null) { | |
| if ("true".equalsIgnoreCase(excludeVal)) { |
| // Compute an old timestamp in seconds, here we use 400 days ago. | ||
| long oldSeconds = (System.currentTimeMillis() - TimeUnit.DAYS.toMillis(400)) / 1000; | ||
| // NOTE: exact JDO classes/field paths sometimes vary; adjust filter if needed based on MTableColumnStatistics mapping |
This test includes a TODO-style note about JDO field paths varying ("adjust filter if needed"). Since this is committed test code, it should be deterministic against the actual model mapping in this repo; please either remove the note or replace it with a concrete explanation of why this query/filter is correct here.
| // NOTE: exact JDO classes/field paths sometimes vary; adjust filter if needed based on MTableColumnStatistics mapping | |
| // In this repository's JDO model, MTableColumnStatistics points to MTable via the | |
| // "table" field, and MTable exposes the table and database names as "tableName" and | |
| // "database.name". This filter therefore selects exactly the statistics rows for the | |
| // target table in the target database. |
| * If some table or partition column statistics are older than the configured retention interval | ||
| * (MetastoreConf.ConfVars.STATISTICS_RETENTION_PERIOD), they are deleted when this metastore task runs periodically. | ||
| */ | ||
| public class StatisticsManagementTask implements MetastoreTaskThread { |
What if this class extended ObjectStore, since we need the internal PersistenceManager (pm)?


