added support for MapFromEntries #21720
Conversation
nuno-faria
left a comment
Thanks @athlcode. Since this changes the current behavior, I think it would be a good opportunity to add a config that allows switching between exception and last key wins (e.g., datafusion.execution.mapKeyDedupPolicy). Otherwise existing code relying on the last key wins policy won't be able to adapt.
cc: @Jefffrey and @comphead since you both reviewed the original map_from_entries.
coderfender
left a comment
The PR looks good @athlcode. Left some comments, and I would like to see more tests covering various edge cases on both the Rust side and the SLT side to add more confidence.
yep fixing comments and will add tests

thanks @nuno-faria. I am looking into this as well

Yes, I think it would be a good idea to make this configurable instead of just switching the behaviour (otherwise we're in a loop of having a TODO item to support the other behaviour)

@coderfender I agree with @Jefffrey as well, thank you. The way to do this would be to use the datafusion.execution.map_key_dedup_policy config, as @nuno-faria mentioned earlier.

Sure @athlcode, please take a look at the current config naming conventions to keep them consistent

@athlcode, it seems like there are test failures here.

fixing the issues. Will tag once I fix and push, thank you.
/// `false` — ANSI SQL mode is disabled by default.
pub enable_ansi_mode: bool, default = false

/// Policy for handling duplicate keys in Spark-compatible map-construction
I wonder if we should consider having a namespace for Spark specific configs, cc @andygrove @comphead
This makes sense to me. A few weeks ago, I was proposing semantic layering in DF: essentially calling Spark functions with the same names as DataFusion's, a Spark-level config, ANSI support, and potentially Spark-friendly SQL. This should help users migrating from Spark move their existing logic directly to DataFusion.
@coderfender @Jefffrey thank you, updated to namespace
/// - `"LAST_WIN"`: keep the last occurrence of each duplicate key.
///
/// Values are case-insensitive. Only affects functions under `datafusion/spark`.
pub map_key_dedup_policy: String, default = "EXCEPTION".to_string()
Would prefer having the enum value here along with the parsing, instead of only being a string
thank you, added MapKeyDedupPolicy enum with Exception (default) + LastWin
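For illustration, a minimal sketch of what such an enum could look like, assuming the variant and error-message details (which are not shown in this thread); the actual PR code may differ:

```rust
use std::str::FromStr;

/// Hypothetical sketch of the dedup-policy enum (names assumed from the discussion).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MapKeyDedupPolicy {
    /// Raise an error on duplicate map keys (Spark's default).
    Exception,
    /// Keep the last occurrence of each duplicate key.
    LastWin,
}

impl Default for MapKeyDedupPolicy {
    fn default() -> Self {
        MapKeyDedupPolicy::Exception
    }
}

impl FromStr for MapKeyDedupPolicy {
    type Err = String;

    /// Case-insensitive parsing, mirroring the config doc
    /// ("Values are case-insensitive").
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_uppercase().as_str() {
            "EXCEPTION" => Ok(MapKeyDedupPolicy::Exception),
            "LAST_WIN" => Ok(MapKeyDedupPolicy::LastWin),
            other => Err(format!("invalid map_key_dedup_policy: {other}")),
        }
    }
}
```

Parsing at config-read time (rather than passing the raw string around) keeps invalid values from surfacing deep inside the UDF implementations.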
)
.unwrap();

let map = as_map(&result);
- let map = as_map(&result);
+ let map = result.as_map();
Using https://docs.rs/arrow/latest/arrow/array/trait.AsArray.html#method.as_map
thank you for the suggestion, pushed the changes
Which issue does this PR close?
Closes # (none: prerequisite for apache/datafusion-comet#2706; follow-up to #17779 and #19274).
Rationale for this change
Spark's default mapKeyDedupPolicy is EXCEPTION, raising SparkRuntimeException with error class DUPLICATED_MAP_KEY on duplicate map keys. The existing Spark map_from_entries / map_from_arrays UDFs silently kept the last occurrence, forcing downstream engines like datafusion-comet to fall back to Spark.
What changes are included in this PR?
Duplicate map keys in Spark map_from_entries, map_from_arrays, and str_to_map now raise [DUPLICATED_MAP_KEY] Duplicate map key {key} was found, matching Spark's default behaviour and error class.
Are these changes tested?
Yes, via sqllogictest assertions covering the new error across the affected Spark map UDFs.
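An assertion of this shape could exercise the new behaviour; this is a hypothetical sketch (the actual test file, query text, and exact error-message regex are assumptions, not taken from the PR):

```
# hypothetical SLT assertion; the real tests may word the query and pattern differently
query error Duplicate map key
SELECT map_from_arrays([1, 1], ['a', 'b']);
```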
Are there any user-facing changes?
Yes. Duplicate keys now raise [DUPLICATED_MAP_KEY] under the default policy instead of silently collapsing to the last occurrence. No new config keys, no API changes.