HIVE-29413: Avoid code duplication by updating getPartCols method for iceberg tables #6413
ramitg254 wants to merge 7 commits into apache:master
Conversation
@ramitg254 please take a look: 9e7535c. I would suggest following a similar approach.
But here we are creating a separate method getEffectivePartCols() and leaving getPartCols() as it is. As per our discussion on that closed PR, we shouldn't do that; we should only go ahead with updating getPartCols().
Where did I say that? The ask was to keep the original method unchanged. Same here.
Oh, I got confused due to this comment: #6337 (comment), in which getSupportedPartCols() was just a separate method, similar to getEffectivePartCols().
I am fine with that earlier approach as well, but recently I saw this one: https://issues.apache.org/jira/browse/HIVE-29525. So I thought we should have a unified getPartCols() and getCols() that give results similar to native Hive tables, as a first step towards solving this; the plan logic can be taken care of later, when that ticket is addressed. Please share your thoughts on this idea.
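To make the unification idea concrete, here is a minimal standalone sketch of what a single getPartCols() entry point might look like. The class, field, and method names below are hypothetical stand-ins, not the actual Hive API: `handlerPartCols` stands in for what the storage handler would return for a non-native (e.g. Iceberg) table, and `nativePartCols` for the Thrift table's partition keys.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a unified getPartCols(): for tables with non-native
// partition support (e.g. Iceberg) the partition keys come from the storage
// handler; for native tables they come from the Thrift table object.
class TableSketch {
  private final List<String> nativePartCols;   // stands in for tTable.getPartitionKeys()
  private final List<String> handlerPartCols;  // stands in for storageHandler.getPartitionKeys(this)
  private final boolean nonNativePartitionSupport;

  TableSketch(List<String> nativePartCols, List<String> handlerPartCols,
              boolean nonNativePartitionSupport) {
    this.nativePartCols = nativePartCols;
    this.handlerPartCols = handlerPartCols;
    this.nonNativePartitionSupport = nonNativePartitionSupport;
  }

  // Single entry point: callers no longer need to know whether the table is native.
  List<String> getPartCols() {
    if (nonNativePartitionSupport) {
      return handlerPartCols == null ? Collections.emptyList() : handlerPartCols;
    }
    return nativePartCols == null ? new ArrayList<>() : nativePartCols;
  }
}
```

The point of the sketch is only the dispatch: every call site keeps calling one method, and the native/non-native branching lives in exactly one place.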
-    List<String> partialPvals = MetaStoreUtils.getPvals(tbl.getPartCols(), partialPartSpec);
+    List<String> partialPvals = MetaStoreUtils.getPvals(tbl.getEffectivePartCols(), partialPartSpec);
     if (tbl.getDataLocation() != null) {
       Path partPath = new Path(tbl.getDataLocation(),
-          Warehouse.makePartName(tbl.getPartCols(),
+          Warehouse.makePartName(tbl.getEffectivePartCols(),
     ArrayList<ColumnInfo> partitionColumns = new ArrayList<ColumnInfo>();
-    for (FieldSchema part_col : viewTable.getPartCols()) {
-      colName = part_col.getName();
+    for (FieldSchema partCol : viewTable.getEffectivePartCols()) {
+      colName = partCol.getName();
-    List<String> pvals = new ArrayList<String>();
-    for (FieldSchema field : tbl.getPartCols()) {
+    List<String> pvals = new ArrayList<>();
+    for (FieldSchema field : tbl.getEffectivePartCols()) {
Do we have tests for that? Non-native tables use DummyPartition, don't they?
-    List<String> pvals = new ArrayList<String>();
-    for (FieldSchema field : table.getPartCols()) {
+    List<String> pvals = new ArrayList<>();
+    for (FieldSchema field : table.getEffectivePartCols()) {
  /**
   * These fields are all cached fields. The information comes from tTable.
   */
  private List<FieldSchema> cachedPartCols;
- maybe rename it to simply partitionCols, since it's not actually a cache?
- can we reuse tTable? t.setPartitionKeys?
- Yes, it can be renamed to partitionCols. It was added because, for Iceberg tables, getStorageHandler().getPartitionKeys() calls convertToIceberg, so too many metastore calls were made for a given running query; those calls sometimes led to timed-out exceptions, and other exceptions due to an outdated conf. It was added to avoid that, so it is not really a cache.
- I think we shouldn't setPartitionKeys on tTable for non-native tables, as partition evolution and other features are supported.
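The pattern described above, computing the partition keys once per Table object instead of hitting the metastore on every call, can be sketched roughly as follows. The names and the returned column list are simplified illustrations, not the real Hive code; the counter only exists to make the "computed at most once" behavior observable.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified sketch of memoizing an expensive partition-key lookup. In the
// real code the expensive call is storageHandler.getPartitionKeys(), which
// may convert the table to an Iceberg table and talk to the metastore.
class PartColsMemo {
  static final AtomicInteger expensiveCalls = new AtomicInteger();
  private List<String> partitionCols;  // computed at most once per instance

  private List<String> loadPartitionKeys() {
    expensiveCalls.incrementAndGet();  // stands in for a metastore round trip
    return List.of("event_date", "region");
  }

  List<String> getEffectivePartCols() {
    if (partitionCols == null) {
      partitionCols = loadPartitionKeys();
    }
    return partitionCols;
  }
}
```

Repeated calls to getEffectivePartCols() perform the expensive lookup only once per instance, which is the behavior the "not really a cache" field provides; a fresh Table object recomputes, so partition evolution is still picked up.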
    return cachedPartCols;
  }

  private boolean isTableTypeSet() {
-    f_list.addAll(getCols());
-    f_list.addAll(getPartCols());
-    return f_list;
+    ArrayList<FieldSchema> allCols = new ArrayList<>(getCols());
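For reference, the refactor in the hunk above amounts to concatenating the data columns and the partition columns into one list (presumably what getAllCols() returns). A standalone illustration, with made-up column names and String in place of FieldSchema:

```java
import java.util.ArrayList;
import java.util.List;

// Standalone illustration of getAllCols(): data columns followed by
// partition columns, in that order. Column names are made up.
class AllColsDemo {
  static List<String> getCols()     { return List.of("id", "name"); }
  static List<String> getPartCols() { return List.of("event_date"); }

  static List<String> getAllCols() {
    ArrayList<String> allCols = new ArrayList<>(getCols());
    allCols.addAll(getPartCols());
    return allCols;
  }
}
```

Having one getAllCols() lets call sites that previously built this list by hand reuse a single implementation, which is the deduplication this PR is after.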
     return hasNonNativePartitionSupport() ? getStorageHandler().isPartitioned(this) :
-        CollectionUtils.isNotEmpty(getPartCols());
+        CollectionUtils.isNotEmpty(getEffectivePartCols());
       org.apache.hadoop.hive.metastore.api.Partition tp) {

-    List<FieldSchema> fsl = getPartCols();
+    List<FieldSchema> fsl = getEffectivePartCols();
Do we need to change this? Tests? Does it duplicate IcebergTableUtil.getPartitionSpec?
     Table tab = cppCtx.getParseContext().getViewProjectToTableSchema().get(op);
-    List<FieldSchema> fullFieldList = new ArrayList<FieldSchema>(tab.getCols());
-    fullFieldList.addAll(tab.getPartCols());
+    List<FieldSchema> fullFieldList = new ArrayList<>(tab.getAllCols());
No need to wrap it in yet another list.
   private static List<PrimitiveTypeInfo> extractPartColTypes(Table tab) {
-    List<FieldSchema> pCols = tab.getPartCols();
+    List<FieldSchema> pCols = tab.getEffectivePartCols();
       usePartitionColumns(properties, partColNames);
     } else {
-      List<FieldSchema> partCols = table.getPartCols();
+      List<FieldSchema> partCols = table.getEffectivePartCols();
     }
     queryStr.append(',');
-    appendCols(targetTable.getPartCols(), alias, null, FieldSchema::getName);
+    appendCols(targetTable.getEffectivePartCols(), alias, null, FieldSchema::getName);
I don't think we need this; it might duplicate the columns.
   public void appendAcidSelectColumns(Operation operation) {
     queryStr.append("ROW__ID,");
-    for (FieldSchema fieldSchema : targetTable.getPartCols()) {
+    for (FieldSchema fieldSchema : targetTable.getEffectivePartCols()) {
It's definitely not needed for native tables.
   @Override
   public List<String> getDeleteValues(Operation operation) {
-    List<String> deleteValues = new ArrayList<>(1 + targetTable.getPartCols().size());
+    List<String> deleteValues = new ArrayList<>(1 + targetTable.getEffectivePartCols().size());
     //insert into newTableName select * from ts <where partition spec>
     StringBuilder rewrittenQueryStr = generateExportQuery(
-        newTable.getPartCols(), tokRefOrNameExportTable, (ASTNode) tokRefOrNameExportTable.parent, newTableName);
+        newTable.getEffectivePartCols(),
This is ACID; we don't need to touch it.
I did this because of #6413 (comment); if you think it can break things, I will switch it back to the old one.
     this.specType = SpecType.STATIC_PARTITION;
     this.partitions = partitions;
-    List<FieldSchema> partCols = this.tableHandle.getPartCols();
+    List<FieldSchema> partCols = this.tableHandle.getEffectivePartCols();
     if (isPartitionStats) {
       if (partTransformSpec == null) {
-        for (FieldSchema fs : tbl.getPartCols()) {
+        for (FieldSchema fs : tbl.getEffectivePartCols()) {
I don't think it's needed; partition columns are already part of the column list. Tests?
     {
       // check partitioning column order and types
-      List<FieldSchema> existingTablePartCols = table.getPartCols();
+      List<FieldSchema> existingTablePartCols = table.getEffectivePartCols();
     this.onClause = onClause;
     allTargetTableColumns.addAll(targetTable.getCols());
-    allTargetTableColumns.addAll(targetTable.getPartCols());
+    allTargetTableColumns.addAll(targetTable.getEffectivePartCols());
I don't think we need to change this; also, we can simplify it to allTargetTableColumns.addAll(targetTable.getAllCols()).
   private static int calculatePartPrefix(Table tbl, Set<String> partSpecKeys) {
     int partPrefixToDrop = 0;
-    for (FieldSchema fs : tbl.getPartCols()) {
+    for (FieldSchema fs : tbl.getEffectivePartCols()) {
Any tests covering this for Iceberg?
I am not aware of any; I did this because of #6413 (comment).
     } else {
       // partition spec is not specified but column schema can have partitions specified
-      for (FieldSchema f : targetTable.getPartCols()) {
+      for (FieldSchema f : targetTable.getEffectivePartCols()) {
Do we really need this? Tests?
     List<String> cols = new ArrayList<String>();
     if (qbp.getAnalyzeRewrite() != null) {
-      List<FieldSchema> partitionCols = tab.getPartCols();
+      List<FieldSchema> partitionCols = tab.getEffectivePartCols();
We don't even enter here; see the if above: !tab.hasNonNativePartitionSupport().
       }
     } else {
-      partColSchema.addAll(tbl.getPartCols());
+      partColSchema.addAll(tbl.getEffectivePartCols());
so many
@deniskuzZ I was replacing getPartCols() with getEffectivePartCols() in most places, as we should eventually move to this generic common method.
@ramitg254 I like the idea of having a single
Since you've already identified them, why not apply the
I was planning to, but updating getCols() alone will cause test failures for every q file that has a describe command for Iceberg tables. Query plans will also be affected, since the stats logic currently takes getCols() into account, and with around 90+ occurrences of it in the code it would lead to breakage as well. So I thought it would be better to take care of it as a separate change.
I guess that was the main intent: to integrate Iceberg partition handling into the existing code with minimal workarounds/code duplication. Maybe I'm missing something, but unfortunately I don't see much value in the current state of this PR, sorry. Let's see what Krisztian thinks about it.
What changes were proposed in this pull request?
Added getEffectivePartCols() and used it in most places possible to avoid code duplication.
Why are the changes needed?
getPartCols() does not have support for iceberg tables.
Does this PR introduce any user-facing change?
No
How was this patch tested?
CI tests and a local build.