[AURON #2217] Support Iceberg _spec_id metadata in native scan by weimingdiit · Pull Request #2218 · apache/auron

weimingdiit · 2026-04-27T12:45:43Z

Which issue does this PR close?

Rationale for this change

Native Iceberg scan currently supports projecting the _file metadata column, but falls back when _spec_id is requested. _spec_id is a file-level Iceberg metadata column and can be materialized as a per-file constant value in the native scan, similar to _file.

What changes are included in this PR?

This PR adds native scan support for the Iceberg _spec_id metadata column.

Allows _spec_id in Iceberg native scan metadata column validation.
Materializes _spec_id from FileScanTask.file().specId() as a per-file partition value.
Adds integration tests for _spec_id projection and mixed data/metadata projection with _file and _spec_id.
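
As a rough illustration of what "per-file constant value" means here, the sketch below resolves the requested metadata columns to one constant per file. It is not the code from this PR: `FileInfo`, `MetadataConstants`, and `constantsFor` are hypothetical names, and the real change works against Iceberg's `FileScanTask` and Auron's native partition values rather than plain Scala values.

```scala
// Hypothetical stand-in for the per-file information a file scan task exposes.
final case class FileInfo(path: String, specId: Int)

object MetadataConstants {
  // Names of the two Iceberg metadata columns discussed in this PR.
  val FilePathColumn = "_file"
  val SpecIdColumn = "_spec_id"

  // For one file, resolve each requested metadata column to a constant value.
  // Any column outside the supported set yields Left, mirroring a fallback decision.
  def constantsFor(
      file: FileInfo,
      requested: Seq[String]): Either[String, Seq[(String, Any)]] =
    requested.foldLeft[Either[String, Vector[(String, Any)]]](Right(Vector.empty)) {
      case (Right(acc), FilePathColumn) => Right(acc :+ (FilePathColumn -> file.path))
      case (Right(acc), SpecIdColumn) => Right(acc :+ (SpecIdColumn -> file.specId))
      case (Right(_), other) => Left(s"unsupported metadata column: $other")
      case (left, _) => left
    }
}
```

Every row read from a given file would carry the same values, which is why the columns can be attached once per file instead of being decoded from the data.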

Are there any user-facing changes?

Yes. Queries that project Iceberg _spec_id can now use the native Iceberg scan path instead of falling back to Spark.

How was this patch tested?

CI.

Signed-off-by: weimingdiit <weimingdiit@gmail.com>

Copilot

Pull request overview

Adds native Iceberg scan support for projecting the _spec_id metadata column (a file-level constant), avoiding Spark fallback when _spec_id is requested.

Changes:

Allow _spec_id as a supported Iceberg metadata column in native-scan validation.
Materialize _spec_id per file from FileScanTask.file().specId() via native partition values.
Add integration tests for _spec_id projection and mixed projections with _file.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| thirdparty/auron-iceberg/src/main/scala/org/apache/spark/sql/auron/iceberg/IcebergScanSupport.scala | Marks `_spec_id` as a supported metadata column so native planning doesn’t fall back. |
| thirdparty/auron-iceberg/src/main/scala/org/apache/spark/sql/execution/auron/plan/NativeIcebergTableScanExec.scala | Builds per-file `_spec_id` mapping and emits it as a constant partition value alongside `_file`. |
| thirdparty/auron-iceberg/src/test/scala/org/apache/auron/iceberg/AuronIcebergIntegrationSuite.scala | Adds integration coverage for native projection of `_spec_id` (alone and with `_file`). |


Copilot · 2026-04-28T07:24:39Z

+        case name if name == MetadataColumns.SPEC_ID.name() =>
+          NativeConverters
+            .convertExpr(Literal.create(fileSpecIds(filePath), IntegerType))
+            .getLiteral


fileSpecIds(filePath) uses Map.apply, which will throw a NoSuchElementException with little context if the key isn’t present (e.g., due to path normalization differences between Iceberg DataFile.location() and Spark PartitionedFile.filePath.toString). Consider using getOrElse and throwing an IllegalStateException with a clear message, and/or normalizing the key the same way in both buildFileSpecIds and metadataPartitionValues to guarantee consistent lookups.
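
One possible shape for the suggested lookup is sketched below. It assumes the mapping is a plain `Map[String, Int]` keyed by file path; `SpecIdLookup` and `specIdFor` are hypothetical names, not helpers from the PR, and path normalization is left out.

```scala
object SpecIdLookup {
  // Looks up the spec id for a file path and fails with a descriptive message
  // instead of the bare NoSuchElementException that Map.apply would throw.
  def specIdFor(fileSpecIds: Map[String, Int], filePath: String): Int =
    fileSpecIds.getOrElse(
      filePath,
      // The default argument of getOrElse is by-name, so this only runs on a miss.
      throw new IllegalStateException(
        s"Missing _spec_id for file '$filePath' (${fileSpecIds.size} files known)"))
}
```

If path normalization turns out to be the cause of a miss, the message at least names the offending path, which makes the mismatch easier to spot.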

Copilot · 2026-04-28T07:24:39Z

+    fileTasks
+      .map(task => task.file().location() -> task.file().specId())
+      .groupBy(_._1)
+      .mapValues(_.head._2)
+      .toMap


buildFileSpecIds currently does map(...).groupBy(...).mapValues(_.head...), which allocates a full grouped collection of all tasks. Since duplicates are expected only because a single file can be split into multiple ranges, this can be built in a single pass (and optionally validated for consistency when the same path appears multiple times) to reduce memory and CPU overhead on large scans.
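
A single-pass variant along these lines might look like the following sketch. It assumes the tasks have already been reduced to `(location, specId)` pairs; `FileSpecIds.build` is a hypothetical name, and the consistency check is the optional validation mentioned above rather than behavior confirmed in the PR.

```scala
object FileSpecIds {
  // Builds path -> specId in one pass without an intermediate grouped collection.
  // `tasks` stands in for (file location, spec id) pairs taken from the scan tasks.
  def build(tasks: Iterable[(String, Int)]): Map[String, Int] =
    tasks.foldLeft(Map.empty[String, Int]) { case (acc, (path, specId)) =>
      acc.get(path) match {
        // The same file split into several ranges must report the same spec id.
        case Some(existing) if existing != specId =>
          throw new IllegalStateException(
            s"Conflicting _spec_id values for '$path': $existing vs $specId")
        case Some(_) => acc
        case None => acc + (path -> specId)
      }
    }
}
```

For very large scans a mutable builder could avoid the intermediate immutable maps; the fold is used here only to keep the sketch short.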

[AURON apache#2217] Support Iceberg _spec_id metadata in native scan

282d504

Signed-off-by: weimingdiit <weimingdiit@gmail.com>

github-actions Bot added the thirdparty-iceberg label Apr 27, 2026

weimingdiit marked this pull request as ready for review April 28, 2026 02:34

slfan1989 requested a review from Copilot April 28, 2026 07:20

Copilot started reviewing on behalf of slfan1989 April 28, 2026 07:21

Copilot AI reviewed Apr 28, 2026

weimingdiit wants to merge 1 commit into apache:master from weimingdiit:feat/support-iceberg-spec-id-metadata
