Add support for partitioned datasets by copybara-service[bot] · Pull Request #4610 · tensorflow/datasets

copybara-service · 2023-01-04T15:27:40Z

Add support for partitioned datasets

For some datasets it makes sense to partition the data. For example, the wikipedia dataset has a language dimension and a snapshot date dimension. Another example would be a production pipeline that creates a new dataset periodically (e.g. using TFX). Currently, there is no native support for this in TFDS, which makes adding new partitions cumbersome.

This change adds native support for partitioned datasets to TFDS.

Notes:

* Although the wikipedia config names are backwards compatible, loading them would fail because it expects the data to be stored in a nested way, e.g. `DATA_DIR/wikipedia/20220620/nl/...`. I'm including it here to get comments, but this is something that needs to be fixed before submitting.
* It is not ideal that a partition can be loaded with `tfds.load` in two ways: `tfds.load("wikipedia/20220620.nl")` and `tfds.load("wikipedia<snapshot=20220620,language=nl>")`. I personally think that the `<>` syntax is nice because it is more explicit, the ordering doesn't matter, and it can be extended to more powerful syntax such as wildcards or lists to load multiple partitions at once.
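To make the comparison between the two spellings concrete, here is a minimal sketch of how both could be parsed into the same result. It is an illustration only, not the parser in this change: `parse_partitioned_name`, `parse_legacy_name`, and the `partition_keys` argument are hypothetical names. It shows why the `<>` form is order-independent while the `name/config` form needs the key order supplied from elsewhere.

```python
import re

# Hypothetical helpers for illustration; these are NOT part of the TFDS API.
_PARTITIONED = re.compile(r"^(?P<name>[\w\-]+)<(?P<parts>[^<>]*)>$")


def parse_partitioned_name(spec):
    """Split 'name<k1=v1,k2=v2>' into (name, {'k1': 'v1', 'k2': 'v2'})."""
    match = _PARTITIONED.match(spec)
    if match is None:
        raise ValueError(f"Not a partitioned dataset spec: {spec!r}")
    partitions = {}
    for item in match.group("parts").split(","):
        key, sep, value = item.partition("=")
        key, value = key.strip(), value.strip()
        if not sep or not key or not value:
            raise ValueError(f"Malformed partition {item!r} in {spec!r}")
        if key in partitions:
            raise ValueError(f"Duplicate partition key {key!r} in {spec!r}")
        partitions[key] = value
    return match.group("name"), partitions


def parse_legacy_name(spec, partition_keys):
    """Split 'name/v1.v2' using an externally supplied key order (assumed)."""
    name, sep, config = spec.partition("/")
    values = config.split(".")
    if not sep or len(values) != len(partition_keys):
        raise ValueError(f"Cannot map {spec!r} onto keys {partition_keys!r}")
    return name, dict(zip(partition_keys, values))
```

Because the explicit form carries its own keys, `wikipedia<language=nl,snapshot=20220620>` parses to the same mapping as `wikipedia<snapshot=20220620,language=nl>`, whereas the legacy form only works if the caller knows that the first value is the snapshot and the second is the language.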

In future work, we can add a richer load API, e.g. `tfds.load("wikipedia<snapshot=20221122,language=*>")`.
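As a rough sketch of what such wildcard selection might look like, the snippet below filters a list of known partitions against a pattern using shell-style matching from the standard library. `select_partitions` and the in-memory list of partitions are assumptions made for illustration; the change itself does not define how wildcards would be resolved.

```python
from fnmatch import fnmatchcase


def select_partitions(available, pattern):
    """Return the partitions whose values match every key in `pattern`.

    `available` is a list of dicts such as {"snapshot": ..., "language": ...};
    `pattern` maps keys to shell-style patterns, where "*" matches any value.
    Hypothetical helper, not part of the TFDS API.
    """
    return [
        partition
        for partition in available
        if all(
            key in partition and fnmatchcase(partition[key], value_pattern)
            for key, value_pattern in pattern.items()
        )
    ]


# Made-up partition listing, used only to demonstrate the matching.
available = [
    {"snapshot": "20221122", "language": "nl"},
    {"snapshot": "20221122", "language": "en"},
    {"snapshot": "20220620", "language": "nl"},
]
selected = select_partitions(available, {"snapshot": "20221122", "language": "*"})
```

With the listing above, `selected` contains the two `20221122` partitions and skips the `20220620` one.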

PiperOrigin-RevId: 465993640

copybara-service[bot] force-pushed the test_465993640 branch 3 times, most recently from daf1d38 to 77c8a6f, on January 5, 2023 08:29

copybara-service[bot] force-pushed the test_465993640 branch from 77c8a6f to 0032104 on January 5, 2023 08:39

copybara-service[bot] wants to merge 1 commit into master from test_465993640
