
NIH Deeplesion, auto download version #1224

Closed
zl190 wants to merge 46 commits into tensorflow:master from zl190:jasonfix

Conversation


@zl190 zl190 commented Nov 24, 2019

correct user.name and user.email in the commit history

@googlebot googlebot added the cla: yes Author has signed CLA label Nov 24, 2019
@zl190 zl190 changed the title Jasonfix NIH Deeplesion, auto download version Nov 24, 2019
Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
@Conchylicultor Conchylicultor added the dataset request Request for a new dataset to be added label Nov 27, 2019
@cyfra cyfra added the kokoro:run Run Kokoro tests label Dec 4, 2019
@kokoro-team kokoro-team removed the kokoro:run Run Kokoro tests label Dec 4, 2019
Contributor

cyfra commented Dec 4, 2019

The tests are failing with:
E tensorflow.python.framework.errors_impl.NotFoundError: Could not find directory /tmpfs/src/github/datasets/tensorflow_datasets/testing/test_data/fake_examples/deeplesion/zipfile03/Images_png

@cyfra cyfra self-assigned this Dec 4, 2019
Author

zl190 commented Dec 19, 2019

The tests are failing with:
E tensorflow.python.framework.errors_impl.NotFoundError: Could not find directory /tmpfs/src/github/datasets/tensorflow_datasets/testing/test_data/fake_examples/deeplesion/zipfile03/Images_png

fixed

Comment thread tensorflow_datasets/image/__init__.py Outdated
Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
Author

zl190 commented Jan 8, 2020

Thanks for the review! @cyfra

Contributor

cyfra commented Jan 9, 2020

Hey @jason-zl190 - unfortunately this dataset turns out to be quite large, so in such cases we'd like it to use the 'iter_archive' approach rather than 'download_and_extract'.

iter_archive makes it more efficient, as the data is 'read' only once.
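For context, a minimal sketch of what an iter_archive-based builder might look like; the class name, URL, and feature spec below are illustrative placeholders, not this PR's actual code:

```python
# A rough sketch, not this PR's code: _ZIP_URL, the class name, and the
# feature spec are placeholders.
import tensorflow_datasets as tfds

_ZIP_URL = "https://example.com/Images_png_01.zip"  # placeholder URL


class DeeplesionSketch(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("1.0.0")

  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        description="Sketch only.",
        features=tfds.features.FeaturesDict({
            "image": tfds.features.Image(encoding_format="png"),
            "file_name": tfds.features.Text(),
        }),
    )

  def _split_generators(self, dl_manager):
    # download() keeps the archive on disk as-is; iter_archive() streams its
    # members later, so the data is read once and never extracted into many
    # small files.
    archive_path = dl_manager.download(_ZIP_URL)
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={"archive": dl_manager.iter_archive(archive_path)},
        ),
    ]

  def _generate_examples(self, archive):
    for fname, fobj in archive:  # (path inside the zip, file object)
      if fname.endswith(".png"):
        yield fname, {"image": fobj, "file_name": fname}
```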

Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
Author

zl190 commented Jan 10, 2020

Hey @jason-zl190 - unfortunately this dataset turns out to be quite large, so in such cases we'd like it to use the 'iter_archive' approach rather than 'download_and_extract'.

iter_archive makes it more efficient, as the data is 'read' only once.

Hi @cyfra, thanks for the review and the heads-up. My concern is that the whole set of zip archives would be iterated three times if they are not extracted, because the archives aren't packaged per split: I would need to iterate over all of them once to collect the training data, once for the validation data, and once for the test data. However, I can refactor the code to use iter_archive if you still think that's the better solution. Any suggestions for avoiding iterating over the whole archives three times are also welcome.

Another concern is that I plan to provide contextual images in the next version, but I got a ResourceExhaustedError when trying to include all of a series' images in a single record. The reason is that a series (Images_png/<series_folder>/xxx.png) might include more than 200 images. Although most of them show no lesions, as @Ouwen suggested they are surrounding slices and might carry contextual information that is critical for medical diagnosis. I'm thinking of including only the image paths in the records and letting users decide how many contextual images they need to read; that does mean the data would be read frequently. Let me know if the paths-only plan is not a good idea. I'd appreciate any advice on how to include these potentially large sets of surrounding images.
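As an illustration of the paths-only idea, a hedged sketch of what such a feature spec might look like; the field names here are hypothetical, not taken from this PR:

```python
# Illustrative only: one way to expose surrounding slices as paths instead
# of embedded images, so users decide how much context to load themselves.
# The field names are hypothetical, not from this PR.
import tensorflow_datasets as tfds

features = tfds.features.FeaturesDict({
    "image": tfds.features.Image(encoding_format="png"),  # the key slice
    "file_name": tfds.features.Text(),
    # Paths of the neighbouring slices in the same series; readers can load
    # as many of them as they need at training time.
    "context_file_names": tfds.features.Sequence(tfds.features.Text()),
})
```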

Contributor

Ouwen commented Jan 10, 2020

@jason-zl190 does NIH predefine a validation/test split?

Author

zl190 commented Jan 11, 2020

@jason-zl190 does NIH predefine a validation/test split?

Hi @Ouwen. They do. They provide a CSV file named "DL_info.csv" that assigns each key slice to a split. However, the zip files they provide are packaged in sequential order rather than by split, so the images of each split are scattered across the archives.
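For illustration, a rough sketch of grouping key slices by split from DL_info.csv; the column names ("File_name", "Train_Val_Test") and the 1/2/3 split encoding are assumptions about the CSV layout, not taken from this PR:

```python
# A rough sketch of grouping key slices by split using DL_info.csv.
# The column names ("File_name", "Train_Val_Test") and the 1/2/3 encoding
# are assumptions about the CSV layout, not taken from this PR.
import collections
import csv

def read_split_index(csv_path):
  splits = collections.defaultdict(list)  # split name -> key-slice file names
  with open(csv_path, newline="") as f:
    for row in csv.DictReader(f):
      split = {"1": "train", "2": "validation", "3": "test"}[row["Train_Val_Test"]]
      splits[split].append(row["File_name"])
  return splits
```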

Contributor

cyfra commented Jan 20, 2020

@jason-zl190 - first, thanks a lot for your contribution - this is quite a large dataset, so it is bound to be challenging ;-)

What I'd suggest is:
a) let's start with reading the data 3 times. The other approach (download and extract) still has to read the data twice (and write it once) - but it has to read from many small files, which is usually slower (especially on distributed storage systems). That's why we prefer the iterate approach for larger datasets.

b) for the contextual images part: I'd tackle it in a separate PR. As you've mentioned, there are a couple of challenges (including the size of the records). One thing that you could do now is to make sure that you include the 'series' name somewhere in the record (currently you include only the file name, from what I see).
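As an illustration of point b), one hypothetical way to recover the series name from an archive member path such as Images_png/<series_folder>/<slice>.png; the layout is assumed from the discussion above, not verified against the archives:

```python
# Hypothetical helper: derive the series name from an archive member path
# such as "Images_png/<series_folder>/<slice>.png".  The exact layout is
# assumed from the discussion above, not verified against the archives.
import posixpath

def series_name(member_path):
  # e.g. "Images_png/004408_01_02/088.png" -> "004408_01_02"
  return posixpath.basename(posixpath.dirname(member_path))
```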

@zl190 zl190 requested a review from cyfra February 3, 2020 16:45
Author

zl190 commented Feb 3, 2020

@jason-zl190 - first, thanks a lot for your contribution - this is quite a large dataset, so it is bound to be challenging ;-)

What I'd suggest is:
a) let's start with reading the data 3 times. The other approach (download and extract) still has to read the data twice (and write it once) - but it has to read from many small files, which is usually slower (especially on distributed storage systems). That's why we prefer the iterate approach for larger datasets.

b) for the contextual images part: I'd tackle it in a separate PR. As you've mentioned, there are a couple of challenges (including the size of the records). One thing that you could do now is to make sure that you include the 'series' name somewhere in the record (currently you include only the file name, from what I see).

Hi @cyfra, thanks for the suggestions. I changed the code to read the data directly from the zip files. However, I didn't use iter_archive, because the zip files are not packaged per split; instead I look up and read only the necessary files with the zipfile module. I also changed the comment pattern. I kept the structure of the example because file_name already contains the series information. Please let me know if there is anything else I need to change.
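For illustration, a minimal sketch of the look-up-and-read approach with the standard zipfile module; the function and its arguments are placeholders, not this PR's code:

```python
# A minimal sketch of the "look up members directly" approach with the
# standard zipfile module; the archive paths and wanted names are
# placeholders, not this PR's code.
import zipfile

def read_images(zip_paths, wanted_file_names):
  """Yields (member_name, png_bytes) for the requested members only."""
  wanted = set(wanted_file_names)
  for zip_path in zip_paths:
    with zipfile.ZipFile(zip_path) as zf:
      # Only the zip's central directory is scanned; member data is read on
      # demand, so unrelated images are never decompressed.
      for name in zf.namelist():
        if name in wanted:
          yield name, zf.read(name)
```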

@cyfra cyfra added the tfds:is_reviewing TFDS team: PTAL label Feb 8, 2020
Author

zl190 commented Mar 31, 2020

@cyfra Thanks for your suggestions and review of this PR. However, this PR is a little outdated because of the long time window. I created a new PR #1769 that contains the most recent changes, including different builder configs and better data filtering rules. Would you mind taking a look at it? Thanks a lot.

Contributor

cyfra commented Apr 8, 2020

@jason-zl190 will do.

@cyfra cyfra closed this Apr 8, 2020