[SPARK-50110][SQL] Fix CSV parsing when numeric values have surrounding whitespace #56787
Open
aviyehuda wants to merge 2 commits into
…ng whitespace

Retry integral, boolean, and decimal parsing after trim when the first parse attempt fails, aligning behavior with float/double and fixing `from_csv`/`read.csv` for common inputs like `"1, 1"`.
HyukjinKwon (Member) reviewed on Jun 25, 2026 and left a comment:
0 blocking, 3 non-blocking, 1 nit.
Behaviorally correct for the default locale and a clean alignment with float/double parsing; the substantive follow-ups are the non-US-locale decimal gap and the exception-on-hot-path approach.
Design / architecture (1)
- UnivocityParser.scala:182: whitespace tolerance is implemented as a per-value exception retry; the motivating `"1, 1"` input throws and catches a `NumberFormatException` for every integral/boolean/US-decimal cell. An up-front `value.trim` is behaviorally identical without the exception (see inline).
Correctness (1)
- UnivocityParser.scala:232: non-US-locale decimals aren't fixed, because `cannotParseDecimalError` (a `SparkRuntimeException`) isn't caught by `retryWithTrim` (see inline).
Suggestions (1)
- Tests cover only leading whitespace under the default US locale; add a trailing-whitespace case (`"1 ,1"`) and a non-US-locale decimal case. The latter would have caught the Correctness finding above.
Nits: 1 minor item (see inline comments).
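The trade-off behind the Design finding can be sketched in plain Scala. This is a minimal illustration under stated assumptions, not Spark's code: `retryWithTrim` and `trimFirst` below are hypothetical stand-ins for the two approaches, written against the JDK parsers that Scala's `toInt`, `toBoolean`, and `toDouble` delegate to.

```scala
object TrimSketch {
  // Hypothetical stand-in for the retry approach: parse the raw cell first and
  // only trim after a failure. NumberFormatException is a subclass of
  // IllegalArgumentException, so one case covers both.
  def retryWithTrim[T](value: String)(parse: String => T): T =
    try parse(value)
    catch {
      case _: IllegalArgumentException => parse(value.trim)
    }

  // Hypothetical stand-in for the suggested alternative: trim once up front,
  // so a padded cell such as " 1" never raises an exception at all.
  def trimFirst[T](value: String)(parse: String => T): T =
    parse(value.trim)

  def main(args: Array[String]): Unit = {
    val cell = " 1" // second field of the motivating input "1, 1"
    println(retryWithTrim(cell)(_.toInt)) // one throw and catch, then success
    println(trimFirst(cell)(_.toInt))     // same result, no exception
    println(cell.toDouble)                // Double parsing already ignores the padding
  }
}
```

Both helpers return the same value for padded input; they differ only in whether an exception is constructed and caught for each padded cell.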
…erics

Fix code review feedback on the whitespace parsing change:
- Replace `retryWithTrim` with up-front trim for integral, boolean, and decimal converters to avoid per-cell exceptions on the common `"1, 1"` path
- Trim before `decimalParser` so non-US locales (e.g. de-DE) also handle surrounding whitespace, since `DecimalFormat` throws `SparkRuntimeException` rather than `NumberFormatException`

Add `UnivocityParserSuite` coverage for spaced decimals under multiple locales.
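To illustrate why trimming before the locale-aware decimal path matters, here is a minimal sketch that uses only `java.text.DecimalFormat`. It is an assumption-laden approximation rather than Spark's `decimalParser`: the `parseDecimal` helper, the `"#,##0.###"` pattern, and the `Option` return type are hypothetical, with `None` standing in for the error Spark raises when a value cannot be parsed.

```scala
import java.math.{BigDecimal => JBigDecimal}
import java.text.{DecimalFormat, DecimalFormatSymbols, ParsePosition}
import java.util.Locale

object LocaleDecimalSketch {
  // Hypothetical helper: trim first, then require that the locale-aware
  // parser consumes the entire remaining string.
  def parseDecimal(raw: String, locale: Locale): Option[JBigDecimal] = {
    val s = raw.trim
    val format = new DecimalFormat("#,##0.###", new DecimalFormatSymbols(locale))
    format.setParseBigDecimal(true)
    val pos = new ParsePosition(0)
    format.parse(s, pos) match {
      // Accept only a full-length parse with no recorded error position.
      case d: JBigDecimal if pos.getIndex == s.length && pos.getErrorIndex == -1 =>
        Some(d)
      // null (parse failure), a partial parse, or a non-BigDecimal result.
      case _ => None
    }
  }

  def main(args: Array[String]): Unit = {
    println(parseDecimal(" 1,5 ", Locale.GERMANY)) // de-DE uses ',' as the decimal separator
    println(parseDecimal(" 1.5 ", Locale.US))
    println(parseDecimal("abc", Locale.GERMANY))
  }
}
```

Because the full-length check runs against the already-trimmed string, surrounding whitespace cannot cause a rejection, which is the behavior the commit message describes for non-US locales.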
What changes were proposed in this pull request?
Add `retryWithTrim` in `UnivocityParser` for integral, boolean, and decimal value converters. When the first parse attempt fails with `NumberFormatException` or `IllegalArgumentException`, the parser retries after trimming leading/trailing whitespace.

This aligns behavior with float/double parsing (which already accepts surrounding whitespace via Java's parsers) for both `from_csv` and `spark.read.csv`, since they share `UnivocityParser`.

Reproduction example: