
NIH Deeplesion, auto download version #1224

Closed
zl190 wants to merge 46 commits into tensorflow:master from zl190:jasonfix

Conversation


@zl190 zl190 commented Nov 24, 2019

correct user.name and user.email in the commit history

@googlebot googlebot added the cla: yes Author has signed CLA label Nov 24, 2019
@zl190 zl190 changed the title Jasonfix NIH Deeplesion, auto download version Nov 24, 2019
Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
@Conchylicultor Conchylicultor added the dataset request Request for a new dataset to be added label Nov 27, 2019
@cyfra cyfra added the kokoro:run Run Kokoro tests label Dec 4, 2019
@kokoro-team kokoro-team removed the kokoro:run Run Kokoro tests label Dec 4, 2019
Contributor

cyfra commented Dec 4, 2019

The tests are failing with:
E tensorflow.python.framework.errors_impl.NotFoundError: Could not find directory /tmpfs/src/github/datasets/tensorflow_datasets/testing/test_data/fake_examples/deeplesion/zipfile03/Images_png

@cyfra cyfra self-assigned this Dec 4, 2019
Author

zl190 commented Dec 19, 2019

The tests are failing with:
E tensorflow.python.framework.errors_impl.NotFoundError: Could not find directory /tmpfs/src/github/datasets/tensorflow_datasets/testing/test_data/fake_examples/deeplesion/zipfile03/Images_png

fixed

Comment thread tensorflow_datasets/image/__init__.py Outdated
Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
Author

zl190 commented Jan 8, 2020

Thanks for the review! @cyfra

Contributor

cyfra commented Jan 9, 2020

Hey @jason-zl190 - unfortunately this dataset turns out to be quite large, so in such cases we'd like it to use the 'iter_archive' approach rather than 'download_and_extract'.

iter_archive makes it more efficient, as the data is 'read' only once.
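For context, a minimal sketch of what an iter_archive-based builder might look like; the class name, URL, and feature spec below are illustrative placeholders, not this PR's actual code:

```python
# A rough sketch, not this PR's code: _ZIP_URL, the class name, and the
# feature spec are placeholders.
import tensorflow_datasets as tfds

_ZIP_URL = "https://example.com/Images_png_01.zip"  # placeholder URL


class DeeplesionSketch(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("1.0.0")

  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        description="Sketch only.",
        features=tfds.features.FeaturesDict({
            "image": tfds.features.Image(encoding_format="png"),
            "file_name": tfds.features.Text(),
        }),
    )

  def _split_generators(self, dl_manager):
    # download() keeps the archive on disk as-is; iter_archive() streams its
    # members later, so the data is read once and never extracted into many
    # small files.
    archive_path = dl_manager.download(_ZIP_URL)
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={"archive": dl_manager.iter_archive(archive_path)},
        ),
    ]

  def _generate_examples(self, archive):
    for fname, fobj in archive:  # (path inside the zip, file object)
      if fname.endswith(".png"):
        yield fname, {"image": fobj, "file_name": fname}
```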

Comment thread tensorflow_datasets/object_detection/deeplesion.py Outdated
Author

zl190 commented Jan 10, 2020

Hey @jason-zl190 - unfortunately this dataset turns out to be quite large, so in such cases we'd like it to use the 'iter_archive' approach rather than 'download_and_extract'.

iter_archive makes it more efficient, as the data is 'read' only once.

Hi @cyfra, thanks for the review and the heads-up. My concern is that the whole set of zip archives would be iterated three times if they are not extracted, because the archives aren't packaged per split: I would need to iterate over all of them once to collect the training data, once for the validation data, and once for the test data. However, I can refactor the code to use iter_archive if you still think that's the better solution. Any suggestions for avoiding iterating over the whole archives three times are also welcome.

Another concern is that I plan to provide contextual images in the next version, but I got a ResourceExhaustedError when trying to include all of a series' images in a single record. The reason is that a series (Images_png/<series_folder>/xxx.png) might include more than 200 images. Although most of them show no lesions, as @Ouwen suggested they are surrounding slices and might carry contextual information that is critical for medical diagnosis. I'm thinking of including only the image paths in the records and letting users decide how many contextual images they need to read; that does mean the data would be read frequently. Let me know if the paths-only plan is not a good idea. I'd appreciate any advice on how to include these potentially large sets of surrounding images.
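As an illustration of the paths-only idea, a hedged sketch of what such a feature spec might look like; the field names here are hypothetical, not taken from this PR:

```python
# Illustrative only: one way to expose surrounding slices as paths instead
# of embedded images, so users decide how much context to load themselves.
# The field names are hypothetical, not from this PR.
import tensorflow_datasets as tfds

features = tfds.features.FeaturesDict({
    "image": tfds.features.Image(encoding_format="png"),  # the key slice
    "file_name": tfds.features.Text(),
    # Paths of the neighbouring slices in the same series; readers can load
    # as many of them as they need at training time.
    "context_file_names": tfds.features.Sequence(tfds.features.Text()),
})
```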

Contributor

Ouwen commented Jan 10, 2020

@jason-zl190 does NIH predefine a validation/test split?

Author

zl190 commented Jan 11, 2020

@jason-zl190 does NIH predefine a validation/test split?

Hi @Ouwen. They do. They provide a CSV file named "DL_info.csv" that assigns each key slice to a split. However, the zip files they provide are packaged in sequential order rather than by split, so the images of each split are scattered across the archives.
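For illustration, a rough sketch of grouping key slices by split from DL_info.csv; the column names ("File_name", "Train_Val_Test") and the 1/2/3 split encoding are assumptions about the CSV layout, not taken from this PR:

```python
# A rough sketch of grouping key slices by split using DL_info.csv.
# The column names ("File_name", "Train_Val_Test") and the 1/2/3 encoding
# are assumptions about the CSV layout, not taken from this PR.
import collections
import csv

def read_split_index(csv_path):
  splits = collections.defaultdict(list)  # split name -> key-slice file names
  with open(csv_path, newline="") as f:
    for row in csv.DictReader(f):
      split = {"1": "train", "2": "validation", "3": "test"}[row["Train_Val_Test"]]
      splits[split].append(row["File_name"])
  return splits
```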

Contributor

cyfra commented Jan 20, 2020

@jason-zl190 - first, thanks a lot for your contribution - this is quite a large dataset, so it is bound to be challenging ;-)

What I'd suggest is:
a) let's start with reading the data 3 times. The other approach (download and extract) still has to read the data twice (and write it once) - but it has to read from many small files, which is usually slower (especially on distributed storage systems). That's why we prefer the iterate approach for larger datasets.

b) for the contextual images part: I'd tackle it in a separate PR. As you've mentioned, there are a couple of challenges (including the size of the records). One thing that you could do now is to make sure that you include the 'series' name somewhere in the record (currently you include only the file name, from what I see).
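As an illustration of point b), one hypothetical way to recover the series name from an archive member path such as Images_png/<series_folder>/<slice>.png; the layout is assumed from the discussion above, not verified against the archives:

```python
# Hypothetical helper: derive the series name from an archive member path
# such as "Images_png/<series_folder>/<slice>.png".  The exact layout is
# assumed from the discussion above, not verified against the archives.
import posixpath

def series_name(member_path):
  # e.g. "Images_png/004408_01_02/088.png" -> "004408_01_02"
  return posixpath.basename(posixpath.dirname(member_path))
```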

@zl190 zl190 requested a review from cyfra February 3, 2020 16:45
Author

zl190 commented Feb 3, 2020

@jason-zl190 - first, thanks a lot for your contribution - this is quite a large dataset, so it is bound to be challenging ;-)

What I'd suggest is:
a) let's start with reading the data 3 times. The other approach (download and extract) still has to read the data twice (and write it once) - but it has to read from many small files, which is usually slower (especially on distributed storage systems). That's why we prefer the iterate approach for larger datasets.

b) for the contextual images part: I'd tackle it in a separate PR. As you've mentioned, there are a couple of challenges (including the size of the records). One thing that you could do now is to make sure that you include the 'series' name somewhere in the record (currently you include only the file name, from what I see).

Hi @cyfra, thanks for the suggestions. I changed the code to read the data directly from the zip files. However, I didn't use iter_archive, because the zip files are not packaged per split; instead I look up and read only the necessary files with the zipfile module. I also changed the comment pattern. I kept the structure of the example because file_name already contains the series information. Please let me know if there is anything else I need to change.
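For illustration, a minimal sketch of the look-up-and-read approach with the standard zipfile module; the function and its arguments are placeholders, not this PR's code:

```python
# A minimal sketch of the "look up members directly" approach with the
# standard zipfile module; the archive paths and wanted names are
# placeholders, not this PR's code.
import zipfile

def read_images(zip_paths, wanted_file_names):
  """Yields (member_name, png_bytes) for the requested members only."""
  wanted = set(wanted_file_names)
  for zip_path in zip_paths:
    with zipfile.ZipFile(zip_path) as zf:
      # Only the zip's central directory is scanned; member data is read on
      # demand, so unrelated images are never decompressed.
      for name in zf.namelist():
        if name in wanted:
          yield name, zf.read(name)
```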

@cyfra cyfra added the tfds:is_reviewing TFDS team: PTAL label Feb 8, 2020
Author

zl190 commented Mar 31, 2020

@cyfra Thanks for your suggestions and review of this PR. However, this PR is a little outdated because of the long time window. I created a new PR #1769 that contains the most recent changes, including different builder configs and better data filtering rules. Would you mind taking a look at it? Thanks a lot.

Contributor

cyfra commented Apr 8, 2020

@jason-zl190 will do.

@cyfra cyfra closed this Apr 8, 2020