
added support for MapFromEntries #21720

Open
athlcode wants to merge 19 commits into apache:main from athlcode:support/MapFromEntries

Conversation

@athlcode
Contributor

@athlcode athlcode commented Apr 18, 2026

Which issue does this PR close?

Closes # (none: prerequisite for apache/datafusion-comet#2706; follow-up to #17779 and #19274).

Rationale for this change

Spark's default mapKeyDedupPolicy is EXCEPTION, raising SparkRuntimeException with error class DUPLICATED_MAP_KEY on duplicate map keys. The existing Spark map_from_entries / map_from_arrays UDFs silently kept the last occurrence, forcing downstream engines like datafusion-comet to fall back to Spark.
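The semantic difference between the two policies can be sketched in plain Rust (a minimal illustration using `std::collections::HashMap`, not DataFusion's actual Arrow-based implementation; the names `DedupPolicy` and `build_map` are hypothetical):

```rust
use std::collections::HashMap;

/// Hypothetical sketch of Spark's two map-key dedup policies.
#[derive(PartialEq)]
enum DedupPolicy {
    Exception, // Spark default: error on duplicate keys
    LastWin,   // previous DataFusion behavior: last occurrence wins
}

fn build_map(
    entries: &[(i32, &str)],
    policy: DedupPolicy,
) -> Result<HashMap<i32, String>, String> {
    let mut map = HashMap::new();
    for (k, v) in entries {
        // `insert` returns `Some` when the key was already present.
        if map.insert(*k, v.to_string()).is_some() && policy == DedupPolicy::Exception {
            // Mirrors Spark's DUPLICATED_MAP_KEY error message format.
            return Err(format!(
                "[DUPLICATED_MAP_KEY] Duplicate map key {k} was found"
            ));
        }
    }
    Ok(map)
}
```

With entries `[(1, "a"), (1, "b")]`, `Exception` returns the error above, while `LastWin` silently yields `{1: "b"}` — the behavior DataFusion had before this PR.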

What changes are included in this PR?

Duplicate map keys in Spark map_from_entries, map_from_arrays, and str_to_map now raise [DUPLICATED_MAP_KEY] Duplicate map key {key} was found, matching Spark's default behaviour and error class.

Are these changes tested?

Yes, via sqllogictest assertions covering the new error across the affected Spark map UDFs.

Are there any user-facing changes?

Yes. Duplicate keys now raise [DUPLICATED_MAP_KEY] under the default policy instead of silently collapsing to the last occurrence. No new config keys, no API changes.

@github-actions github-actions Bot added common Related to common crate spark labels Apr 18, 2026
@github-actions github-actions Bot added sqllogictest SQL Logic Tests (.slt) and removed common Related to common crate labels Apr 18, 2026
Contributor

@nuno-faria nuno-faria left a comment


Thanks @athlcode. Since this changes the current behavior, I think it would be a good opportunity to add a config that allows switching between exception and last key wins (e.g., datafusion.execution.mapKeyDedupPolicy). Otherwise existing code relying on the last key wins policy won't be able to adapt.

cc: @Jefffrey and @comphead since you both reviewed the original map_from_entries.

Comment thread datafusion/spark/src/function/map/utils.rs Outdated
Contributor

@coderfender coderfender left a comment


The PR looks good @athlcode. Left some comments, and I would like to see some more tests covering various edge cases, on both the Rust side and the SLT side, to add more confidence.

Comment thread datafusion/spark/src/function/map/utils.rs Outdated
Comment thread datafusion/sqllogictest/test_files/spark/map/map_from_entries.slt
@athlcode
Contributor Author

The PR looks good @athlcode. Left some comments, and I would like to see some more tests covering various edge cases, on both the Rust side and the SLT side, to add more confidence.

Yep, fixing the comments and will add tests.

@athlcode
Contributor Author

Thanks @athlcode. Since this changes the current behavior, I think it would be a good opportunity to add a config that allows switching between exception and last key wins (e.g., datafusion.execution.mapKeyDedupPolicy). Otherwise existing code relying on the last key wins policy won't be able to adapt.

cc: @Jefffrey and @comphead since you both reviewed the original map_from_entries.

Thanks @nuno-faria, I am looking into this as well.

@Jefffrey
Contributor

Yes I think it would be a good idea to make this configurable instead of just switching the behaviour (otherwise we're in a loop of having TODO item to support the other behaviour)

@coderfender
Contributor

I agree with @Jefffrey, unless you have an immediate follow-up, @athlcode?

@athlcode
Contributor Author

@coderfender I agree with @Jefffrey as well, thank you. The way to do this would be to use datafusion.execution.map_key_dedup_policy config as @nuno-faria mentioned earlier.

@coderfender
Contributor

Sure @athlcode, please take a look at the current config naming conventions to keep them consistent.

@github-actions github-actions Bot added the common Related to common crate label Apr 27, 2026
@coderfender
Contributor

@athlcode , it seems like there are test failures here.

@athlcode
Contributor Author

@athlcode , it seems like there are test failures here.

fixing the issues. Will tag once I fix and push, thank you.

Comment thread datafusion/common/src/config.rs Outdated
/// `false` — ANSI SQL mode is disabled by default.
pub enable_ansi_mode: bool, default = false

/// Policy for handling duplicate keys in Spark-compatible map-construction
Contributor


I wonder if we should consider having a namespace for Spark specific configs, cc @andygrove @comphead

Contributor


This makes sense to me. A few weeks ago I was proposing semantic layering in DataFusion: essentially Spark functions callable under the same names as in DataFusion, Spark-level config, ANSI support, and potentially Spark-friendly SQL. This should help users migrating from Spark move their existing logic directly to DataFusion.

Contributor Author


@coderfender @Jefffrey thank you, updated to namespace

Comment thread datafusion/common/src/config.rs Outdated
/// - `"LAST_WIN"`: keep the last occurrence of each duplicate key.
///
/// Values are case-insensitive. Only affects functions under `datafusion/spark`.
pub map_key_dedup_policy: String, default = "EXCEPTION".to_string()
Contributor


Would prefer having the enum value here along with the parsing, instead of only being a string

Reference

Contributor Author


Thank you, added a MapKeyDedupPolicy enum with Exception (default) and LastWin.
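A minimal sketch of what such an enum with case-insensitive parsing might look like (illustrative only; the PR's actual definition lives in datafusion/common/src/config.rs and may differ in detail):

```rust
use std::str::FromStr;

/// Hypothetical sketch of the config enum discussed above.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
enum MapKeyDedupPolicy {
    /// Spark's default: raise DUPLICATED_MAP_KEY on duplicates.
    #[default]
    Exception,
    /// Keep the last occurrence of each duplicate key.
    LastWin,
}

impl FromStr for MapKeyDedupPolicy {
    type Err = String;

    /// Case-insensitive parsing, matching the documented config values.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_uppercase().as_str() {
            "EXCEPTION" => Ok(Self::Exception),
            "LAST_WIN" => Ok(Self::LastWin),
            other => Err(format!("invalid map_key_dedup_policy: {other}")),
        }
    }
}
```

Carrying the parsed enum through the config (rather than a raw `String`) pushes validation to the point the option is set, so the map functions never see an invalid value.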

)
.unwrap();

let map = as_map(&result);
Contributor


Suggested change
let map = as_map(&result);
let map = result.as_map();

Using https://docs.rs/arrow/latest/arrow/array/trait.AsArray.html#method.as_map

Contributor Author


Thank you for the suggestion, pushed the changes.

@github-actions github-actions Bot added the auto detected api change Auto detected API change label May 7, 2026
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label May 7, 2026
@github-actions github-actions Bot removed the auto detected api change Auto detected API change label May 7, 2026

Labels

common Related to common crate documentation Improvements or additions to documentation spark sqllogictest SQL Logic Tests (.slt)
