## Section 2: Making a Dataloader
In this section, we first introduce the two main abstract entities in PyDmed:

- Big chunk
- Small chunk

The below video illustrates the concept:

- **Big chunk**: a relatively big chunk of data from an instance. It can be, e.g., a 5000x5000 patch from a huge whole-slide image. In the above video, `BigChunk`s are the big patches that are slowly loaded from the hard disk.
- **Small chunk**: a small data chunk collected from a big chunk. It can be, e.g., a 224x224 patch cropped from a 5000x5000 big chunk. In the above video, `SmallChunk`s are the red patches that are pushed to queues 1-5 and are then collected for the GPU(s).
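The relationship between the two entities can be sketched in plain NumPy (this is a conceptual illustration, not PyDmed code): a big chunk is a large array loaded once from slow storage, and small chunks are cheap random crops taken from it in memory.

```python
import numpy as np

# A stand-in for a "big chunk": a large 5000x5000 RGB array,
# as if it had just been read from a whole-slide image on disk.
rng = np.random.default_rng(0)
big_chunk = rng.integers(0, 256, size=(5000, 5000, 3), dtype=np.uint8)

def random_small_chunk(big, h=224, w=224):
    """Crop a random h x w "small chunk" from a big chunk (hypothetical helper)."""
    H, W = big.shape[0], big.shape[1]
    y = int(rng.integers(0, H - h))
    x = int(rng.integers(0, W - w))
    return big[y:y + h, x:x + w, :]

small = random_small_chunk(big_chunk)
print(small.shape)  # (224, 224, 3)
```

Loading the big chunk is the expensive, infrequent operation; collecting small chunks from it is fast and can be repeated many times, which is exactly the split PyDmed's two entities encode.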
### Making a BigChunkLoader

To specify how to load a `BigChunk`, you should make a subclass of `lightdl.BigChunkLoader` and implement the abstract method `extract_bigchunk`. Here is an example of making a `BigChunkLoader`:
```python
class SampleBigchunkLoader(BigChunkLoader):
    def extract_bigchunk(self, arg_msg):
        '''
        Extract and return a bigchunk.
        Inputs:
            - `arg_msg`: we won't need this argument for now.
        In this function you have access to `self.patient` and some
        other functions and fields to be covered later.
        '''
        record_HandE = self.patient.dict_records["H&E"]
        fname_hande = record_HandE.rootdir + record_HandE.relativedir
        osimage = openslide.OpenSlide(fname_hande)
        w, h = 5000, 5000
        W, H = osimage.dimensions
        # pick a random top-left corner for the big patch
        rand_x, rand_y = np.random.randint(0, W - w), np.random.randint(0, H - h)
        pil_bigchunk = osimage.read_region(location=(rand_x, rand_y),
                                           level=0,
                                           size=(w, h))
        np_bigchunk = np.array(pil_bigchunk)[:, :, 0:3]  # drop the alpha channel
        bigchunk = BigChunk(
            data=np_bigchunk,
            dict_info_of_bigchunk={"x": rand_x, "y": rand_y},
            patient=self.patient)
        return bigchunk
```
### Making a SmallChunkCollector

To specify how to collect a smallchunk from a bigchunk, you should make a subclass of `SmallChunkCollector` and implement the abstract method `extract_smallchunk`. Here is an example of making a `SmallChunkCollector`:
```python
class SampleSmallchunkCollector(SmallChunkCollector):
    def extract_smallchunk(self, call_count, bigchunk, last_message_fromroot):
        '''
        Extract and return a smallchunk.
        Note that in this function you have access to
        `self.bigchunk`, `self.patient`, and `self.const_global_info`.
        Inputs:
            - `call_count`: we won't need this argument for now.
            - `bigchunk`: the bigchunk that we just extracted.
            - `last_message_fromroot`: we won't need this argument for now.
        In this function you have access to `self.patient` and some
        other functions and fields to be covered later on.
        '''
        np_bigchunk = bigchunk.data
        W, H = np_bigchunk.shape[1], np_bigchunk.shape[0]
        w, h = 224, 224
        # pick a random top-left corner for the small patch
        rand_x, rand_y = np.random.randint(0, W - w), np.random.randint(0, H - h)
        np_smallchunk = np_bigchunk[rand_y:rand_y + h, rand_x:rand_x + w, :]
        # wrap in a SmallChunk
        smallchunk = SmallChunk(
            data=np_smallchunk,
            dict_info_of_smallchunk={"x": rand_x, "y": rand_y},
            dict_info_of_bigchunk=bigchunk.dict_info_of_bigchunk,
            patient=bigchunk.patient)
        return smallchunk
```
### Making a Dataloader

The last step is to make an instance of `pydmed.lightdl.LightDL` as follows:

```python
const_global_info = pydmed.lightdl.get_default_constglobinf()  # to be covered later
dataloader = LightDL(
    dataset=dataset,
    type_bigchunkloader=SampleBigchunkLoader,
    type_smallchunkcollector=SampleSmallchunkCollector,
    const_global_info=const_global_info,
    batch_size=10,
    tfms=torchvision.transforms.ToTensor())
```
Now we can get data from the dataloader as follows:

```python
dataloader.start()
time.sleep(10)  # wait for the dataloader to load the initial BigChunks
while True:
    x, list_patients, list_smallchunks = dataloader.get()
    '''
    `x` is now a tensor of shape [`batch_size` x 3 x 224 x 224].
    `list_patients` is a list of length `batch_size`.
    TODO: You can use these values to, e.g., update model parameters.
    '''
    if flag_finish_running:
        dataloader.pause_loading()
        break
```
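The batch shape `[batch_size x 3 x 224 x 224]` above follows from stacking `batch_size` small chunks after a channels-first transform like `ToTensor`. A plain-NumPy sketch of that stacking (not PyDmed code, just an illustration of where the shape comes from):

```python
import numpy as np

# batch_size small chunks, each HxWx3 as cropped by the collector above
batch_size, h, w = 10, 224, 224
small_chunks = [np.zeros((h, w, 3), dtype=np.uint8) for _ in range(batch_size)]

# move channels first (as torchvision's ToTensor does), then stack into a batch
batch = np.stack([np.transpose(c, (2, 0, 1)) for c in small_chunks])
print(batch.shape)  # (10, 3, 224, 224)
```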