Skip to content

[Fix] Full update of data_dir in DatasetInfo with update_data_dir()#5297

Open
youliangtan wants to merge 1 commit intotensorflow:masterfrom
youliangtan:master
Open

[Fix] Full update of data_dir in DatasetInfo with update_data_dir()#5297
youliangtan wants to merge 1 commit intotensorflow:masterfrom
youliangtan:master

Conversation

@youliangtan
Copy link
Copy Markdown

Description

The previous DatasetInfo.update_data_dir() only updates the datadir in each split, but not its DatasetIdentity, making the update incomplete. The datadir property doesn't have a setter too, thus, this adds a single liner to update the identity when the update_data_dir() is called.

@google-cla
Copy link
Copy Markdown

google-cla Bot commented Mar 4, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@fineguy fineguy self-requested a review March 6, 2024 09:04
@fineguy
Copy link
Copy Markdown
Collaborator

fineguy commented Mar 6, 2024

Thanks @youliangtan for your interest in TFDS!

Do you have an example when the update is incomplete causing issues?

We use this function only once in:

# The generated DatasetInfo contains references to `tmp_data_dir`
self.info.update_data_dir(self.data_dir)
to fix data_dir in splits, but the identity should already have the right data_dir after dataset builder was initialized:
if self.data_path.exists():
self.info.read_from_directory(self._data_dir)
else: # Use the code version (do not restore data)
self.info.initialize_from_bucket()

@fineguy fineguy self-assigned this Mar 6, 2024
@youliangtan
Copy link
Copy Markdown
Author

Thanks for the reply, I'm using this method to update a new data_dir, after DatasetInfo is constructed

The code example is here. This script is mainly to merge or reshard the rlds data, maybe there's better way of doing this too. 🤷‍♂️

@fineguy
Copy link
Copy Markdown
Collaborator

fineguy commented Mar 7, 2024

I think DatasetInfo.update_data_dir should be a private method and only used internally.

For your code I suggest to consider these options :

@youliangtan
Copy link
Copy Markdown
Author

Thanks for the suggestion. I think I will keep the current code as light as it is for now without depending on external lib.

Also, could the update_data_dir() be a public method? else it should better to be named as _update_data_dir 🤔

@fineguy
Copy link
Copy Markdown
Collaborator

fineguy commented Mar 8, 2024

I agree, that would be a nice change.

@tomvdw
Copy link
Copy Markdown
Collaborator

tomvdw commented Mar 13, 2024

Have you looked into builder_from_directories? (see

def builder_from_directories(
) This merges multiple datasets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants