[AURON #2217] Support Iceberg _spec_id metadata in native scan #2218

Open
weimingdiit wants to merge 1 commit into apache:master from weimingdiit:feat/support-iceberg-spec-id-metadata

@weimingdiit
Contributor

Which issue does this PR close?

Closes #2217

Rationale for this change

Native Iceberg scan currently supports projecting the _file metadata column, but falls back when _spec_id is requested. _spec_id is a file-level Iceberg metadata column and can be materialized as a per-file constant value in the native scan, similar to _file.

What changes are included in this PR?

This PR adds native scan support for the Iceberg _spec_id metadata column.

  • Allows _spec_id in Iceberg native scan metadata column validation.
  • Materializes _spec_id from FileScanTask.file().specId() as a per-file partition value.
  • Adds integration tests for _spec_id projection and mixed data/metadata projection with _file and _spec_id.

Are there any user-facing changes?

Yes. Queries that project Iceberg _spec_id can now use the native Iceberg scan path instead of falling back to Spark.

How was this patch tested?

CI.

Signed-off-by: weimingdiit <weimingdiit@gmail.com>
@weimingdiit weimingdiit marked this pull request as ready for review April 28, 2026 02:34
@slfan1989 slfan1989 requested a review from Copilot April 28, 2026 07:20
Contributor

Copilot AI left a comment

Pull request overview

Adds native Iceberg scan support for projecting the _spec_id metadata column (a file-level constant), avoiding Spark fallback when _spec_id is requested.

Changes:

  • Allow _spec_id as a supported Iceberg metadata column in native-scan validation.
  • Materialize _spec_id per file from FileScanTask.file().specId() via native partition values.
  • Add integration tests for _spec_id projection and mixed projections with _file.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| thirdparty/auron-iceberg/src/main/scala/org/apache/spark/sql/auron/iceberg/IcebergScanSupport.scala | Marks _spec_id as a supported metadata column so native planning doesn't fall back. |
| thirdparty/auron-iceberg/src/main/scala/org/apache/spark/sql/execution/auron/plan/NativeIcebergTableScanExec.scala | Builds a per-file _spec_id mapping and emits it as a constant partition value alongside _file. |
| thirdparty/auron-iceberg/src/test/scala/org/apache/auron/iceberg/AuronIcebergIntegrationSuite.scala | Adds integration coverage for native projection of _spec_id (alone and with _file). |

Comment on lines +129 to +132
case name if name == MetadataColumns.SPEC_ID.name() =>
  NativeConverters
    .convertExpr(Literal.create(fileSpecIds(filePath), IntegerType))
    .getLiteral

Copilot AI Apr 28, 2026

fileSpecIds(filePath) uses Map.apply, which will throw a NoSuchElementException with little context if the key isn’t present (e.g., due to path normalization differences between Iceberg DataFile.location() and Spark PartitionedFile.filePath.toString). Consider using getOrElse and throwing an IllegalStateException with a clear message, and/or normalizing the key the same way in both buildFileSpecIds and metadataPartitionValues to guarantee consistent lookups.
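
A minimal sketch of the getOrElse variant suggested above, written against a plain Map[String, Int] so it runs without Spark or Iceberg on the classpath. The object name SpecIdLookup, the method name specIdFor, and the wording of the error message are hypothetical and not taken from the PR; path normalization is deliberately left out.

```scala
// Hypothetical helper, not part of the PR: looks up a file's spec id and
// fails with a descriptive message instead of a bare NoSuchElementException.
object SpecIdLookup {
  def specIdFor(fileSpecIds: Map[String, Int], filePath: String): Int =
    // The default argument of getOrElse is by-name, so the exception is
    // only constructed and thrown when the key is actually missing.
    fileSpecIds.getOrElse(
      filePath,
      throw new IllegalStateException(
        s"No Iceberg spec id recorded for file '$filePath'; " +
          s"${fileSpecIds.size} file path(s) are known to this scan"))
}
```

In the match arm quoted above, the call `fileSpecIds(filePath)` would become something like `SpecIdLookup.specIdFor(fileSpecIds, filePath)`. If the two sides can produce differently formatted paths, the same normalization would still have to be applied when building the map and when looking it up.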

Comment on lines +231 to +235
fileTasks
  .map(task => task.file().location() -> task.file().specId())
  .groupBy(_._1)
  .mapValues(_.head._2)
  .toMap

Copilot AI Apr 28, 2026

buildFileSpecIds currently does map(...).groupBy(...).mapValues(_.head...), which allocates a full grouped collection of all tasks. Since duplicates are expected only because a single file can be split into multiple ranges, this can be built in a single pass (and optionally validated for consistency when the same path appears multiple times) to reduce memory and CPU overhead on large scans.
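
One way the single-pass construction described above could look, as a sketch only. It takes (location, specId) pairs as a stand-in for `task.file().location()` and `task.file().specId()` so that it is self-contained; the object name FileSpecIds, the method name build, and the decision to throw on conflicting ids are assumptions, not the PR's code.

```scala
import scala.collection.mutable

// Hypothetical single-pass builder, not part of the PR.
object FileSpecIds {
  // Each entry stands in for (task.file().location(), task.file().specId()).
  def build(entries: Iterable[(String, Int)]): Map[String, Int] = {
    val result = mutable.HashMap.empty[String, Int]
    entries.foreach { case (location, specId) =>
      result.get(location) match {
        case Some(existing) if existing != specId =>
          // The same file should never be reported with two different spec ids.
          throw new IllegalStateException(
            s"Conflicting spec ids for file '$location': $existing vs $specId")
        case Some(_) => // same file split into another range; nothing to add
        case None => result.update(location, specId)
      }
    }
    result.toMap
  }
}
```

With this shape the call site would pass something like `fileTasks.map(task => task.file().location() -> task.file().specId())`. Only one map entry per distinct file is kept, so no intermediate grouping of every task is allocated.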

Development

Successfully merging this pull request may close these issues.

Support Iceberg _spec_id metadata column in native scan