GH-41488: [C++][Python] Apply timestamp_parsers as fallback when parsing CSV date and time columns by pearu · Pull Request #50146 · apache/arrow

pearu · 2026-06-10T09:42:17Z

Rationale for this change

CSV columns explicitly typed as date32, date64, time32 or time64 can only be parsed from strict ISO-8601 strings: ConvertOptions::timestamp_parsers is consulted only for timestamp columns. Reading e.g. 15-OCT-15 into a date32 column fails even with timestamp_parsers=["%d-%b-%y"], and 7:55:00 (non-zero-padded hour) fails for time32[s]. Users currently work around this by declaring such columns as timestamp, reading, then casting back to the date/time type.

Effect on the issues collected in #41488:

Closes [C++] Unable to read date64 or date32 in specific format from CSV #28303 (date32/date64 with a custom format such as %d-%b-%y).
Closes Allow ConvertOptions.timestamp_parsers for date types #33357 (its reproducer — date32 column with timestamp_parsers=["%Y/%m/%d"] — now works as requested).
[Python] ArrowInvalid: CSV conversion error to date32[day]: invalid value '2000-01-01 00:00:00' #37180 is addressed but not auto-closed: it asks for ISO timestamp strings (2000-01-01 00:00:00) in a date32 column to convert by default. With this PR that works by opting in via timestamp_parsers=[ISO8601]; the no-parsers default still errors, deliberately, so that time-of-day truncation only happens when the user asked for it. Whether the remaining default-behavior ask should be implemented or declined is left to that issue.
[C++] Use timestamp parsers for date32() CSV parsing #26224 (closed as not-planned when consolidated into CSV reader cannot parse dates or times #41488) asked for exactly the date32 behavior implemented here.
As a side effect of the vendored strptime fix, [C++][R] strptime fails to parse with %b or %B on Windows #31816 (%b/%B fail on Windows) and [C++][Python] strptime fails to parse with %p on Windows #31971 (%p fails on Windows) are also resolved for all timestamp_parsers users; %z remains unsupported on Windows (kStrptimeSupportsZone, unchanged).

What changes are included in this PR?

A new DateTimeWithParsersValueDecoder in csv/converter.cc, used for date32/date64/time32/time64 columns when timestamp_parsers is non-empty. It tries the built-in ISO-8601 parser first (preserving all existing behavior), then each configured parser in order. A timestamp produced by a fallback parser is floored to the day boundary for dates and reduced to the time of day for times, consistent with casting a timestamp to a date or time type. Values carrying a zone offset are rejected, as for zone-less timestamp columns. When no parsers are configured, the pre-existing decoder is used unchanged.
Type inference is deliberately unaffected: the Date/Time inference stages now explicitly use options with timestamp_parsers cleared, so inference keeps strict ISO-8601 semantics (otherwise a value with a time-of-day part could be inferred as a date and silently truncated). The existing test_timestamp_parsers Python test pins this behavior.
Documentation of the fallback and flooring semantics in ConvertOptions::timestamp_parsers (C++ and Python docstrings) and a new "Date and time parsing" section in the C++ CSV user guide.
C-locale name tables for the vendored musl strptime used on Windows, where nl_langinfo() is unavailable. Previously the %a/%A/%b/%B/%h/%p/%c/%r/%x/%X specifiers were compiled out on Windows, so the month-name formats from the original issue reports (%d-%b-%y) could not work there for any column type. The tables match musl's C locale, and name matching is case-insensitive as on glibc/musl/BSD. The fallback path is compiled and verified on Linux via the ARROW_TEST_FALLBACK_LANGINFO hook.
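
To make the decoder's ordering concrete, here is a minimal pure-Python model of the behavior described above. It is a sketch, not the C++ implementation: `decode_date` and `_parse_iso_date` are hypothetical names, Python's `datetime.strptime` stands in for Arrow's strptime-based parsers, and treating a zone-carrying match as a failed attempt (rather than an immediate error) is an assumption.

```python
import re
from datetime import date, datetime

_ISO_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def _parse_iso_date(text):
    """Strict YYYY-MM-DD only; returns None when the text is anything else."""
    match = _ISO_DATE.fullmatch(text)
    if match is None:
        return None
    try:
        return date(*(int(part) for part in match.groups()))
    except ValueError:  # e.g. month 13
        return None


def decode_date(text, formats=()):
    """Model of the fallback order: ISO-8601 first, then each format in order."""
    iso = _parse_iso_date(text)
    if iso is not None:
        return iso
    for fmt in formats:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format does not match, try the next one
        if parsed.tzinfo is not None:
            continue  # assumption: a value carrying a zone offset is not accepted
        # Dropping the time of day of a naive datetime floors to the day boundary.
        return parsed.date()
    raise ValueError(f"invalid value {text!r}")
```

With no formats configured, `decode_date("2000-01-01 00:00:00")` raises, mirroring the no-parsers default; passing `["%Y-%m-%d %H:%M:%S"]` opts in to dropping the time of day.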

Are these changes tested?

Yes:

New C++ tests (Date32Conversion.UserDefinedParsers, Date64Conversion.UserDefinedParsers, Time32Conversion.UserDefinedParsers, Time64Conversion.UserDefinedParsers) covering custom formats, mixed ISO + custom values in one column (backward compatibility of ISO values when parsers are set), pre-epoch flooring with a time-of-day component (distinguishes floor from truncating division), time-of-day extraction from pre-epoch timestamps, zone-offset rejection, and error cases.
New Python tests with the reproducers from [C++] Unable to read date64 or date32 in specific format from CSV #28303 and CSV reader cannot parse dates or times #41488, plus an inference-unchanged guard.
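
For readers wondering why the pre-epoch cases matter, the arithmetic below shows how flooring and truncating division disagree for a timestamp one second before the epoch. It is an illustration in plain Python with hypothetical helper names, not code from the PR.

```python
SECONDS_PER_DAY = 86_400


def days_floor(seconds):
    """Floor division: rounds toward negative infinity (the date semantics above)."""
    return seconds // SECONDS_PER_DAY


def days_truncating(seconds):
    """Division rounding toward zero, as C++ integer division does."""
    quotient = abs(seconds) // SECONDS_PER_DAY
    return quotient if seconds >= 0 else -quotient


def time_of_day(seconds):
    """Seconds since midnight; non-negative even for pre-epoch timestamps."""
    return seconds % SECONDS_PER_DAY


# 1969-12-31 23:59:59 is one second before the epoch.
ts = -1
print(days_floor(ts), days_truncating(ts), time_of_day(ts))
```

Flooring puts the value on day -1 (1969-12-31) with a time of day of 86399 seconds (23:59:59), whereas truncation would wrongly report day 0 (1970-01-01).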

Are there any user-facing changes?

Yes: ConvertOptions::timestamp_parsers now also applies, as a fallback after ISO-8601, to columns explicitly typed as date32/date64/time32/time64 (previously such values always errored). No breaking changes: behavior without timestamp_parsers is untouched, ISO values keep parsing when parsers are set, and type inference is unchanged. All language bindings gain the behavior without API changes.

AI usage disclosure

This PR was developed with AI assistance (Claude Code): the decoder, tests and documentation were AI-generated under my direction, then reviewed line-by-line and iterated on by me (design decisions: fallback-after-ISO semantics, silent flooring, inference isolation, and several implementation details adjusted during review). I own and can debug these changes.

🤖 Generated with Claude Code

GitHub Issue: CSV reader cannot parse dates or times #41488

pearu · 2026-06-10T09:42:45Z

Two out-of-scope discoveries made while working on this, recorded here rather than folded into the PR to keep it minimal:

MultipleParsersTimestampValueDecoder::Decode (pre-existing, csv/converter.cc) declares its zone_offset_present flag once outside the parser loop. The built-in parsers happen to write the out-parameter on every call (the strptime parser unconditionally, the ISO-8601 parser resets it to false before scanning), so this is currently harmless — but TimestampParser is a public interface, and a user-implemented parser that only writes the flag when an offset is found could observe a stale value from a previous loop iteration. The new decoder in this PR declares the flag per-iteration; the timestamp decoder could get the same two-line treatment as a MINOR follow-up.
The "Timestamp inference/parsing" section of the C++ CSV user guide (docs/source/cpp/csv.rst) does not mention ConvertOptions::timestamp_parsers at all — custom timestamp parsing was undocumented in the user guide before the date/time subsection added here. A short paragraph there could be a docs follow-up.
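
The stale-flag hazard from the first item can be modeled in a few lines of Python. Everything here is hypothetical and only mimics the shape of the C++ loop: `Flag` stands in for the `bool*` out-parameter, and `lazy_prefix_parser` plays the role of a user-implemented parser that never writes the flag.

```python
from datetime import datetime


class Flag:
    """Stand-in for a C++ bool* out-parameter."""

    def __init__(self):
        self.value = False


def offset_aware_parser(text, flag):
    """Writes the flag on every call, like the built-in parsers."""
    flag.value = False
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d%z")
    except ValueError:
        return None
    flag.value = True
    return parsed.replace(tzinfo=None)


def lazy_prefix_parser(text, flag):
    """Hypothetical user parser: reads the date prefix and never touches the flag."""
    try:
        return datetime.strptime(text[:10], "%Y-%m-%d")
    except ValueError:
        return None


def decode_shared_flag(text, parsers):
    flag = Flag()  # declared once, outside the loop
    for parser in parsers:
        parsed = parser(text, flag)
        if parsed is not None and not flag.value:
            return parsed
    return None


def decode_fresh_flag(text, parsers):
    for parser in parsers:
        flag = Flag()  # fresh flag for every parser
        parsed = parser(text, flag)
        if parsed is not None and not flag.value:
            return parsed
    return None
```

For the input `"2000-01-01+0100"` with the parsers in the order `[offset_aware_parser, lazy_prefix_parser]`, the shared-flag loop rejects the second parser's result because it still sees `True` from the first iteration, while the per-iteration loop accepts it.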

github-actions · 2026-06-10T09:46:54Z

⚠️ GitHub issue #41488 has been automatically assigned in GitHub to PR creator.

pearu · 2026-06-10T10:08:50Z

CI triage of the first run (3 failing jobs, all Windows): all three shared one root cause — the tests used the %d-%b-%y format from the original issue reproducer, but month names (%b) are not supported by the vendored musl strptime used on Windows: cpp/src/arrow/vendored/musl/strptime.c force-undefines HAVE_LANGINFO on _WIN32, which compiles out the %a/%A/%b/%B/%h/%c/%p/%r/%x/%X cases entirely. The feature code is unaffected; numeric-format and time tests passed on Windows.

Fixed (amended) by making numeric formats the primary test coverage and keeping the month-name reproducer guarded to non-Windows (#ifndef _WIN32 in C++, sys.platform != "win32" in Python, following the existing kStrptimeSupportsZone / test_strftime precedents).
UPDATE: sorry for the noise. Claude was too eager to post comments; it is better restrained now.

A third discovery for the list above: this %b limitation is pre-existing and applies equally to timestamp columns with timestamp_parsers on Windows — it just had no CI coverage because the existing timestamp tests only use numeric formats. Could deserve its own issue (either implementing C-locale month names in the vendored strptime, or documenting the limitation in timestamp_parsers docs).
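
As a rough idea of what the first option involves, the sketch below matches an abbreviated month name case-insensitively against a fixed C-locale table. It is written in Python for brevity with hypothetical names; the vendored code is C, and its actual table layout and matching rules are not shown in this thread.

```python
# English abbreviated month names, as used by the C locale.
C_LOCALE_ABMON = (
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
)


def match_abbreviated_month(text, pos=0):
    """Return (month_number, next_pos) if a month name starts at pos, else None."""
    candidate = text[pos:pos + 3].lower()
    for number, name in enumerate(C_LOCALE_ABMON, start=1):
        if candidate == name.lower():
            return number, pos + 3
    return None
```

For example, `match_abbreviated_month("15-OCT-15", 3)` yields `(10, 6)`, and the mixed-case `"18-Jun-90"` matches month 6 in the same way.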

pearu · 2026-06-10T14:16:23Z

The single CI failure (AMD64 Conda C++ AVX2) is unrelated to this PR: Gandiva's TestTime.TestCastTimestampWithTZ fails identically on main since this morning (passing at 4e25461, failing from ca47cd1 on — see e.g. this main run). castTIMESTAMP_utf8 returns 0 for the Canada/Pacific tz name, pointing at tz-database resolution in the CI conda environment — code this PR does not touch.

Copilot

Pull request overview

This PR extends the CSV reader’s ConvertOptions::timestamp_parsers behavior so that, when columns are explicitly typed as date32/date64/time32/time64, the reader first attempts the existing ISO-8601 parsing and then falls back to the user-provided timestamp parsers (with flooring/extracting semantics consistent with casting). It also keeps type inference strict (ISO-only) to avoid silent truncation.

Changes:

Add a new C++ CSV date/time value decoder that falls back to timestamp_parsers after ISO parsing and applies flooring/time-of-day extraction.
Ensure CSV type inference for date/time remains ISO-only even when timestamp_parsers are configured.
Update C++/Python docs and add C++/Python tests; improve vendored Windows strptime support for C-locale day/month names.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file

| File | Description |
| --- | --- |
| python/pyarrow/tests/test_csv.py | Adds Python coverage for date/time typed columns using `timestamp_parsers` fallback and inference guard. |
| python/pyarrow/_csv.pyx | Documents the new fallback behavior in Python `ConvertOptions` docstring. |
| docs/source/cpp/csv.rst | Adds a “Date and time parsing” section documenting fallback + semantics. |
| cpp/src/arrow/vendored/musl/strptime.c | Adds a C-locale `nl_langinfo` fallback table for Windows/testing to support `%b/%B/%p/...`. |
| cpp/src/arrow/csv/options.h | Documents fallback semantics for `timestamp_parsers` in C++ API docs. |
| cpp/src/arrow/csv/inference_internal.h | Ensures date/time inference ignores configured `timestamp_parsers`. |
| cpp/src/arrow/csv/converter.cc | Implements fallback decoder + converter factory changes for date/time types. |
| cpp/src/arrow/csv/converter_test.cc | Adds C++ tests for date/time fallback parsing behavior and edge cases. |

pearu · 2026-06-22T18:05:31Z

+        # Month names are parsed case-insensitively
+        rows = b"a\n15-OCT-15\n18-Jun-90\n"
+        opts = ConvertOptions(column_types={'a': pa.date32()},
+                              timestamp_parsers=['%d-%b-%y'])
+        table = self.read_bytes(rows, convert_options=opts)
+        assert table.to_pydict() == {
+            'a': [date(2015, 10, 15), date(1990, 6, 18)],
+        }


Thanks — I checked this and the tests are deterministic as written, so I've left them unchanged.

Neither the test binary nor pyarrow adopts the environment's LC_TIME: there is no setlocale(LC_ALL/LC_TIME, "") anywhere in Arrow's C++ (outside vendored code) or in pyarrow, and CPython coerces only LC_CTYPE at startup, never LC_TIME. So a process started with a non-English LC_TIME in the environment still runs strptime in the C locale. On glibc there is a second reason: strptime's %b/%B keeps the C-locale (English) month names as a fallback even under a non-English locale, so English abbreviations parse regardless (setlocale(LC_TIME, "fr_FR.UTF-8") followed by parsing "15-JUL-15" still succeeds).

Minimal check: CSV %b parsing is independent of the environment locale

Each child process below is started with a non-English LC_ALL/LC_TIME in its environment — exactly the scenario in the comment — and still parses the English month abbreviation correctly:

import os
import subprocess
import sys

CHILD = r"""
import os
import pyarrow as pa
from pyarrow import csv
from datetime import date

data = b"a\n15-JUL-15\n"  # English abbreviated month name "JUL"
opts = csv.ConvertOptions(column_types={"a": pa.date32()},
                          timestamp_parsers=["%d-%b-%y"])
got = csv.read_csv(pa.py_buffer(data), convert_options=opts).to_pydict()
assert got == {"a": [date(2015, 7, 15)]}, got
print("OK with LC_ALL=%-14r ->" % os.environ.get("LC_ALL"), got)
"""

for lc in ("C", "fr_FR.UTF-8", "de_DE.UTF-8"):
    env = dict(os.environ, LC_ALL=lc, LC_TIME=lc)
    subprocess.run([sys.executable, "-c", CHILD], env=env, check=True)

Output (identical whether or not those locales are actually installed):

OK with LC_ALL='C'            -> {'a': [datetime.date(2015, 7, 15)]}
OK with LC_ALL='fr_FR.UTF-8'  -> {'a': [datetime.date(2015, 7, 15)]}
OK with LC_ALL='de_DE.UTF-8'  -> {'a': [datetime.date(2015, 7, 15)]}

A genuinely locale-dependent case does remain — an application that itself calls setlocale(LC_ALL, "") under a non-English locale, on a libc whose strptime lacks that English fallback — but that is pre-existing (it affects timestamp columns too) and out of scope here. Happy to file a follow-up issue if that is worth tracking.

…n parsing CSV date and time columns

CSV columns explicitly typed as date32, date64, time32 or time64 could only be parsed from strict ISO-8601 strings; ConvertOptions::timestamp_parsers was consulted only for timestamp columns.

Make the user-defined timestamp parsers act as a fallback for these column types: the built-in ISO-8601 parser is tried first (preserving existing behavior), then each configured parser in order. A timestamp produced by a fallback parser is floored to the day boundary for dates and reduced to the time of day for times, consistent with casting a timestamp to a date or time type.

Type inference of date and time columns is deliberately unaffected: inference keeps using strict ISO-8601 parsing, otherwise a value with a time-of-day part could be inferred as a date and silently truncated.

Also provide C-locale name tables to the vendored musl strptime used on Windows, where nl_langinfo() is unavailable: this makes %a/%A/%b/%B/%h/%p/%c/%r/%x/%X work on Windows (matching musl's C locale), so that the month-name formats from the original issue reports parse on all platforms.

Closes apacheGH-28303.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

pitrou · 2026-06-23T09:55:00Z

@jorisvandenbossche Do you want to take a look at this PR?

github-actions Bot added the awaiting review Awaiting review label Jun 10, 2026

github-actions Bot added Component: C++ Component: Python Component: Documentation labels Jun 10, 2026

pearu force-pushed the pearu/fix-csv-date-time-parsers branch from 78ca3cb to e0af29e Compare June 10, 2026 10:08

pearu force-pushed the pearu/fix-csv-date-time-parsers branch from e0af29e to e75ecec Compare June 10, 2026 13:19

pearu marked this pull request as ready for review June 10, 2026 14:13

pearu requested review from AlenkaF, raulcd and rok as code owners June 10, 2026 14:13

This was referenced Jun 11, 2026

[C++][R] strptime fails to parse with %b or %B on Windows #31816

Open

[C++][Python] strptime fails to parse with %p on Windows #31971

Open

[C++] Strptime issues umbrella #31324

Open

thisisnic requested a review from Copilot June 22, 2026 15:09

Copilot started reviewing on behalf of thisisnic June 22, 2026 15:10

Copilot AI reviewed Jun 22, 2026

github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 22, 2026

pearu force-pushed the pearu/fix-csv-date-time-parsers branch from e75ecec to fcf4f95 Compare June 22, 2026 18:17

pearu requested a review from pitrou as a code owner June 22, 2026 18:17

pearu wants to merge 1 commit into apache:main from pearu:pearu/fix-csv-date-time-parsers

3 participants