I am working on a project with histopathological images (Whole-Slide Images). Each of these images is ~1 GB, so they are really hard to handle.
In particular, I struggle when I use DataLoader(num_workers=N) with N > 1, because PyTorch starts loading and preprocessing (slight data augmentation in our case) multiple batches in RAM at once, and the RAM fills up fast.
I wanted to know whether other people are working on an alternative DataLoader mechanism that would allow multiple workers to work on the same batch.
I would also like to know if you have any suggestions regarding this topic.
For example, I was wondering whether the best option would be to let worker processes spawn child processes that each process a single element of the batch (this would make it easier to support a hybrid approach, with workers cooperating on the same batch or working on different batches). The alternative would be to have the main process gather processed samples (from the same batch) from the workers.
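To make the second option concrete, here is a minimal, framework-agnostic sketch (not PyTorch's DataLoader, and not a proposed implementation). It assumes a thread pool from the standard library stands in for the workers; `iter_batches` and `fake_augment` are hypothetical names made up for illustration. All workers cooperate on one batch, and the main thread gathers the results before moving on to the next batch.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence, TypeVar

S = TypeVar("S")
T = TypeVar("T")


def iter_batches(
    samples: Sequence[S],
    process: Callable[[S], T],
    batch_size: int,
    num_workers: int,
) -> Iterator[List[T]]:
    """Yield batches whose samples were processed in parallel by one shared pool."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for start in range(0, len(samples), batch_size):
            chunk = samples[start:start + batch_size]
            # All workers cooperate on this one batch; pool.map preserves order.
            # list() blocks until the whole batch is done, so only one batch
            # of processed samples is being assembled at any time.
            yield list(pool.map(process, chunk))


def fake_augment(slide_id: int) -> int:
    # Hypothetical stand-in for loading and augmenting one slide.
    return slide_id * 10


batches = list(iter_batches(range(5), fake_augment, batch_size=2, num_workers=4))
print(batches)
```

The trade-off in this sketch is that nothing is prefetched: memory stays bounded to roughly one batch, but the consumer waits while each batch is built. Threads are used here only to keep the example self-contained; CPU-bound augmentation would likely need processes instead, which is where the worker shutdown and handling concerns come in.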
Since I have never opened a PyTorch PR, and since I noticed that worker shutdown/handling is a very delicate matter, do you think I could open a draft PR so that someone could provide suggestions and support?
Have a look at WebDataset: Efficient PyTorch I/O library for Large Datasets, Many Files, Many GPUs | PyTorch
Vitaly and team are working on a more performant and more flexible dataset API update.
Read the RFC here: RFC-0009: DataLoader architecture updates by VitalyFedyunin · Pull Request #15 · pytorch/rfcs · GitHub
Another thread worth reading might be: [RFC] Add tar-based IterableDataset implementation to PyTorch · Issue #38419 · pytorch/pytorch · GitHub
I would also love that feature!
Multiple workers really speed up my Mask R-CNN project, but they also multiply my CPU RAM footprint,
so I usually have to use a lower number than I would like; otherwise I exceed my 32 GB of RAM.
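A rough way to see why the footprint scales with the worker count: with num_workers > 0, each DataLoader worker loads prefetch_factor batches in advance (the default is 2), so about num_workers * prefetch_factor batches can be buffered at once. The sketch below is only a back-of-the-envelope estimate. The helper name `peak_prefetch_bytes` is made up, the sample size is an assumption taken from the ~1 GB figure in the original post, and the estimate ignores per-worker copies of the dataset object and any other overhead.

```python
GIB = 1024 ** 3


def peak_prefetch_bytes(
    num_workers: int, prefetch_factor: int, batch_size: int, sample_bytes: int
) -> int:
    """Approximate bytes held by prefetched batches across all workers."""
    buffered_batches = num_workers * prefetch_factor
    return buffered_batches * batch_size * sample_bytes


# Assumed numbers: 8 workers, default prefetch_factor=2, batch of 4, ~1 GiB per sample.
estimate = peak_prefetch_bytes(
    num_workers=8, prefetch_factor=2, batch_size=4, sample_bytes=1 * GIB
)
estimate_gib = estimate / GIB
print(f"~{estimate_gib:.0f} GiB of prefetched samples")
```

Under these assumptions, lowering num_workers, prefetch_factor, or batch_size each shrinks the estimate proportionally, which matches the workaround of running fewer workers than one would like.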