Add support for partitioned datasets#4610

Open
copybara-service[bot] wants to merge 1 commit into master from test_465993640

Conversation

@copybara-service
Contributor

Add support for partitioned datasets

For some datasets it makes sense to partition the data. For example, the wikipedia dataset has a language dimension and a snapshot-date dimension. Another example is a production pipeline that periodically creates a new dataset (e.g. using TFX). Currently TFDS has no native support for this, which makes adding new partitions cumbersome.

This change adds native support for partitioned datasets to TFDS.

Notes:

  • Although the wikipedia config names are backwards compatible, loading them would fail because the loading code expects the data to be stored in a nested layout, e.g. `DATA_DIR/wikipedia/20220620/nl/...`. I'm including it here to get comments, but this needs to be fixed before submitting.

  • It is not ideal that a partition can be loaded with `tfds.load` in two ways: `tfds.load("wikipedia/20220620.nl")` and `tfds.load("wikipedia<snapshot=20220620,language=nl>")`. I personally think the `<>` syntax is nice because it is more explicit, the ordering doesn't matter, and it can be extended to more powerful syntax, such as wildcards or lists, to load multiple partitions at once.
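To make the two notes above concrete, here is a minimal sketch of how both name forms could resolve to the same partition and to the nested directory layout. It is not the code in this change: `parse_name`, `partition_dir`, and the `DIMENSIONS` order are hypothetical names I made up for illustration, and the positional form is assumed to list values in `snapshot.language` order.

```python
import posixpath
import re

# Hypothetical dimension order, assumed here only to interpret the
# positional "dataset/value.value" form.
DIMENSIONS = ("snapshot", "language")

# Matches the explicit form: dataset<key=value,key=value>
_BRACKET_RE = re.compile(r"^(?P<dataset>[^<>/]+)<(?P<args>[^<>]*)>$")


def parse_name(name, dimensions=DIMENSIONS):
    """Return (dataset, partitions) for either naming form (illustrative only)."""
    match = _BRACKET_RE.match(name)
    if match:
        partitions = {}
        for item in match.group("args").split(","):
            key, sep, value = item.partition("=")
            if not sep:
                raise ValueError(f"Expected key=value, got {item!r}")
            partitions[key.strip()] = value.strip()
        return match.group("dataset"), partitions

    # Positional form: the order of the values carries the meaning.
    dataset, sep, config = name.partition("/")
    if not sep:
        return dataset, {}
    values = config.split(".")
    if len(values) != len(dimensions):
        raise ValueError(f"Expected {len(dimensions)} values, got {config!r}")
    return dataset, dict(zip(dimensions, values))


def partition_dir(data_dir, dataset, partitions, dimensions=DIMENSIONS):
    """Build the nested directory for a partition, one level per dimension."""
    return posixpath.join(data_dir, dataset, *(partitions[d] for d in dimensions))
```

With these assumptions, `parse_name("wikipedia/20220620.nl")` and `parse_name("wikipedia<language=nl,snapshot=20220620>")` return the same partition mapping, which shows why ordering doesn't matter in the `<>` form while the dotted form depends on a fixed dimension order.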

In future work, we can add a richer load API, e.g. `tfds.load("wikipedia<snapshot=20221122,language=*>")`.
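As a rough illustration of what wildcard support might look like (purely a sketch; nothing like this exists in this change), the pattern for each dimension could be matched against the available partitions with the standard-library `fnmatch` module. `match_partitions` and the sample partition list are hypothetical.

```python
from fnmatch import fnmatchcase


def match_partitions(available, pattern):
    """Select partitions whose values match a glob pattern per dimension.

    `available` is a list of {dimension: value} dicts and `pattern` is a
    {dimension: glob} dict. Both shapes are assumptions for this sketch.
    """
    return [
        partition
        for partition in available
        if all(
            key in partition and fnmatchcase(partition[key], glob)
            for key, glob in pattern.items()
        )
    ]


# Hypothetical set of partitions present on disk.
available = [
    {"snapshot": "20221122", "language": "nl"},
    {"snapshot": "20221122", "language": "en"},
    {"snapshot": "20220620", "language": "nl"},
]

# Corresponds to "wikipedia<snapshot=20221122,language=*>".
selected = match_partitions(available, {"snapshot": "20221122", "language": "*"})
```

Here `selected` holds the two `20221122` partitions (`nl` and `en`). A list syntax could be handled the same way by testing membership instead of a glob.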

@copybara-service copybara-service bot force-pushed the test_465993640 branch 3 times, most recently from daf1d38 to 77c8a6f on January 5, 2023 08:29
PiperOrigin-RevId: 465993640