GH-43574: [Python] do not add partition columns from file path when reading single file #49853
Open
bkurtz wants to merge 1 commit into apache:main from
Conversation
Fixes #43574 (I think: it fixes the aspect of the issue that I ran into, but I haven't checked the OP's original reproduction).
Reverts a small portion of bd44410
Rationale for this change
This reverts a change made in pyarrow 17 under which reading a single file returns different results when that file happens to live in a path containing x=y segments (i.e. segments that look like hive partition columns) than when it doesn't. This is especially confusing given the way some higher-level calls wrap this functionality, e.g. by opening a file before passing it to ParquetDataset, which can lead to results that differ between local and remote filesystems. For example, for single-file local reads, pandas.read_parquet opens a file handle to pass to pyarrow, while for remote reads it passes a single-file path plus a filesystem, so the same code can behave differently when tested on a local filesystem than on the deployed cloud filesystem.

The original change was introduced in #39438, and there was a discussion thread about it (sorry; GitHub's links to resolved discussions don't always work well). The gist of that thread is that the PR author thought this code path was unused, when in fact the subsequent issue shows that it is used.
Screenshot of the original discussion thread (attached in the PR description on GitHub) to help you find it.
What changes are included in this PR?
Restores the special "single file" handling for single-file paths passed to the ParquetDataset constructor, analogous to the existing handling for an open file handle. As a result, the loaded dataset no longer parses the full file path for hive partition columns, which changes the set of columns returned.
Are these changes tested?
Added a new unit test. Verified that it fixes the issue I had been observing (and commented on in #43574), though I don't have a working reproduction to verify that it fixes the original issue reported there.
Are there any user-facing changes?
This PR includes breaking changes to public APIs. In particular, it changes the columns returned by single-file calls to pyarrow.parquet.read_table(...), bringing the results back in line with pyarrow < 17. While technically a breaking change, note that the original PR that introduced this behavior in pyarrow 17 did not call it out as breaking either. However, some time has passed since then, and it's plausible that some applications have developed dependencies on the current behavior.