GH-41488: [C++][Python] Apply timestamp_parsers as fallback when parsing CSV date and time columns#50146
GH-41488: [C++][Python] Apply timestamp_parsers as fallback when parsing CSV date and time columns#50146pearu wants to merge 1 commit into
Conversation
|
Two out-of-scope discoveries made while working on this, recorded here rather than folded into the PR to keep it minimal:
|
|
|
|
|
78ca3cb to
e0af29e
Compare
|
A third discovery for the list above: this |
e0af29e to
e75ecec
Compare
|
|
|
The single CI failure ( |
There was a problem hiding this comment.
Pull request overview
This PR extends the CSV reader’s ConvertOptions::timestamp_parsers behavior so that, when columns are explicitly typed as date32/date64/time32/time64, the reader first attempts the existing ISO-8601 parsing and then falls back to the user-provided timestamp parsers (with flooring/extracting semantics consistent with casting). It also keeps type inference strict (ISO-only) to avoid silent truncation.
Changes:
- Add a new C++ CSV date/time value decoder that falls back to
timestamp_parsersafter ISO parsing and applies flooring/time-of-day extraction. - Ensure CSV type inference for date/time remains ISO-only even when
timestamp_parsersare configured. - Update C++/Python docs and add C++/Python tests; improve vendored Windows
strptimesupport for C-locale day/month names.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| python/pyarrow/tests/test_csv.py | Adds Python coverage for date/time typed columns using timestamp_parsers fallback and inference guard. |
| python/pyarrow/_csv.pyx | Documents the new fallback behavior in Python ConvertOptions docstring. |
| docs/source/cpp/csv.rst | Adds a “Date and time parsing” section documenting fallback + semantics. |
| cpp/src/arrow/vendored/musl/strptime.c | Adds a C-locale nl_langinfo fallback table for Windows/testing to support %b/%B/%p/.... |
| cpp/src/arrow/csv/options.h | Documents fallback semantics for timestamp_parsers in C++ API docs. |
| cpp/src/arrow/csv/inference_internal.h | Ensures date/time inference ignores configured timestamp_parsers. |
| cpp/src/arrow/csv/converter.cc | Implements fallback decoder + converter factory changes for date/time types. |
| cpp/src/arrow/csv/converter_test.cc | Adds C++ tests for date/time fallback parsing behavior and edge cases. |
| # Month names are parsed case-insensitively | ||
| rows = b"a\n15-OCT-15\n18-Jun-90\n" | ||
| opts = ConvertOptions(column_types={'a': pa.date32()}, | ||
| timestamp_parsers=['%d-%b-%y']) | ||
| table = self.read_bytes(rows, convert_options=opts) | ||
| assert table.to_pydict() == { | ||
| 'a': [date(2015, 10, 15), date(1990, 6, 18)], | ||
| } |
There was a problem hiding this comment.
Thanks — I checked this and the tests are deterministic as written, so I've left them unchanged.
Neither the test binary nor pyarrow adopts the environment's LC_TIME: there is no setlocale(LC_ALL/LC_TIME, "") anywhere in Arrow's C++ (outside vendored code) or in pyarrow, and CPython coerces only LC_CTYPE at startup, never LC_TIME. So a process started with a non-English LC_TIME in the environment still runs strptime in the C locale. On glibc there is a second reason: strptime's %b/%B keeps the C-locale (English) month names as a fallback even under a non-English locale, so English abbreviations parse regardless (setlocale(LC_TIME, "fr_FR.UTF-8") followed by parsing "15-JUL-15" still succeeds).
Minimal check: CSV %b parsing is independent of the environment locale
Each child process below is started with a non-English LC_ALL/LC_TIME in its environment — exactly the scenario in the comment — and still parses the English month abbreviation correctly:
import os
import subprocess
import sys
CHILD = r"""
import os
import pyarrow as pa
from pyarrow import csv
from datetime import date
data = b"a\n15-JUL-15\n" # English abbreviated month name "JUL"
opts = csv.ConvertOptions(column_types={"a": pa.date32()},
timestamp_parsers=["%d-%b-%y"])
got = csv.read_csv(pa.py_buffer(data), convert_options=opts).to_pydict()
assert got == {"a": [date(2015, 7, 15)]}, got
print("OK with LC_ALL=%-14r ->" % os.environ.get("LC_ALL"), got)
"""
for lc in ("C", "fr_FR.UTF-8", "de_DE.UTF-8"):
env = dict(os.environ, LC_ALL=lc, LC_TIME=lc)
subprocess.run([sys.executable, "-c", CHILD], env=env, check=True)Output (identical whether or not those locales are actually installed):
OK with LC_ALL='C' -> {'a': [datetime.date(2015, 7, 15)]}
OK with LC_ALL='fr_FR.UTF-8' -> {'a': [datetime.date(2015, 7, 15)]}
OK with LC_ALL='de_DE.UTF-8' -> {'a': [datetime.date(2015, 7, 15)]}
A genuinely locale-dependent case does remain — an application that itself calls setlocale(LC_ALL, "") under a non-English locale, on a libc whose strptime lacks that English fallback — but that is pre-existing (it affects timestamp columns too) and out of scope here. Happy to file a follow-up issue if that is worth tracking.
…n parsing CSV date and time columns CSV columns explicitly typed as date32, date64, time32 or time64 could only be parsed from strict ISO-8601 strings; ConvertOptions::timestamp_parsers was consulted only for timestamp columns. Make the user-defined timestamp parsers act as a fallback for these column types: the built-in ISO-8601 parser is tried first (preserving existing behavior), then each configured parser in order. A timestamp produced by a fallback parser is floored to the day boundary for dates and reduced to the time of day for times, consistent with casting a timestamp to a date or time type. Type inference of date and time columns is deliberately unaffected: inference keeps using strict ISO-8601 parsing, otherwise a value with a time-of-day part could be inferred as a date and silently truncated. Also provide C-locale name tables to the vendored musl strptime used on Windows, where nl_langinfo() is unavailable: this makes %a/%A/%b/%B/%h/ %p/%c/%r/%x/%X work on Windows (matching musl's C locale), so that the month-name formats from the original issue reports parse on all platforms. Closes apacheGH-28303. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
e75ecec to
fcf4f95
Compare
|
@jorisvandenbossche Do you want to take a look at this PR? |
Rationale for this change
CSV columns explicitly typed as
date32,date64,time32ortime64can only be parsed from strict ISO-8601 strings:ConvertOptions::timestamp_parsersis consulted only fortimestampcolumns. Reading e.g.15-OCT-15into adate32column fails even withtimestamp_parsers=["%d-%b-%y"], and7:55:00(non-zero-padded hour) fails fortime32[s]. Users currently work around this by declaring such columns astimestamp, reading, then casting back to the date/time type.Effect on the issues collected in #41488:
%d-%b-%y).date32column withtimestamp_parsers=["%Y/%m/%d"]— now works as requested).ArrowInvalid: CSV conversion error to date32[day]: invalid value '2000-01-01 00:00:00' #37180 is addressed but not auto-closed: it asks for ISO timestamp strings (2000-01-01 00:00:00) in adate32column to convert by default. With this PR that works by opting in viatimestamp_parsers=[ISO8601]; the no-parsers default still errors, deliberately, so that time-of-day truncation only happens when the user asked for it. Whether the remaining default-behavior ask should be implemented or declined is left to that issue.%b/%Bfail on Windows) and [C++][Python] strptime fails to parse with %p on Windows #31971 (%pfails on Windows) are also resolved for alltimestamp_parsersusers;%zremains unsupported on Windows (kStrptimeSupportsZone, unchanged).What changes are included in this PR?
DateTimeWithParsersValueDecoderincsv/converter.cc, used for date32/date64/time32/time64 columns whentimestamp_parsersis non-empty. It tries the built-in ISO-8601 parser first (preserving all existing behavior), then each configured parser in order. A timestamp produced by a fallback parser is floored to the day boundary for dates and reduced to the time of day for times, consistent with casting a timestamp to a date or time type. Values carrying a zone offset are rejected, as for zone-less timestamp columns. When no parsers are configured, the pre-existing decoder is used unchanged.timestamp_parserscleared, so inference keeps strict ISO-8601 semantics (otherwise a value with a time-of-day part could be inferred as a date and silently truncated). The existingtest_timestamp_parsersPython test pins this behavior.ConvertOptions::timestamp_parsers(C++ and Python docstrings) and a new "Date and time parsing" section in the C++ CSV user guide.strptimeused on Windows, wherenl_langinfo()is unavailable. Previously the%a/%A/%b/%B/%h/%p/%c/%r/%x/%Xspecifiers were compiled out on Windows, so the month-name formats from the original issue reports (%d-%b-%y) could not work there for any column type. The tables match musl's C locale, and name matching is case-insensitive as on glibc/musl/BSD. The fallback path is compiled and verified on Linux via theARROW_TEST_FALLBACK_LANGINFOhook.Are these changes tested?
Yes:
Date32Conversion.UserDefinedParsers,Date64Conversion.UserDefinedParsers,Time32Conversion.UserDefinedParsers,Time64Conversion.UserDefinedParsers) covering custom formats, mixed ISO + custom values in one column (backward compatibility of ISO values when parsers are set), pre-epoch flooring with a time-of-day component (distinguishes floor from truncating division), time-of-day extraction from pre-epoch timestamps, zone-offset rejection, and error cases.Are there any user-facing changes?
Yes:
ConvertOptions::timestamp_parsersnow also applies, as a fallback after ISO-8601, to columns explicitly typed as date32/date64/time32/time64 (previously such values always errored). No breaking changes: behavior withouttimestamp_parsersis untouched, ISO values keep parsing when parsers are set, and type inference is unchanged. All language bindings gain the behavior without API changes.AI usage disclosure
This PR was developed with AI assistance (Claude Code): the decoder, tests and documentation were AI-generated under my direction, then reviewed line-by-line and iterated on by me (design decisions: fallback-after-ISO semantics, silent flooring, inference isolation, and several implementation details adjusted during review). I own and can debug these changes.
🤖 Generated with Claude Code