PyTorch DataLoader class

A data loader in PyTorch feeds the data represented by the Dataset class into a neural network model. PyTorch provides the DataLoader class to do just this. In this chapter of the PyTorch tutorial, you will learn about the PyTorch DataLoader class, how it works, and how you can use it to feed data to a neural network.

The DataLoader class

Importing the DataLoader class

You can import the DataLoader class from the torch.utils.data module:

from torch.utils.data import DataLoader

Using the DataLoader class

To create a data loader, you create an instance of the DataLoader class. You pass the following arguments to the DataLoader constructor:

dataset – the dataset for which you want to create the data loader. This must be an object of the Dataset class or one of its subclasses.

batch_size – the size of each batch that the data loader will feed to the model. If you do not specify the batch size, the default value is 1.

shuffle – whether the data needs to be shuffled while being fed to the model. If you do not specify the shuffle parameter, the default value is False. Generally, shuffle is set to True during training, but False during validation and testing.

Example

In the example below, a data loader sample_dataloader is created to feed the dataset sample_dataset to the model in batches of 32 items without shuffling them.

sample_dataloader = DataLoader(sample_dataset, batch_size=32, shuffle=False)

Note – You can also pass other arguments to the DataLoader constructor. These include sampler, which is an object of the Sampler class or one of its subclasses and is used to draw samples from the dataset in a particular desired fashion, and num_workers, which decides how many worker processes are used to load the data in parallel. Other arguments are used for much more advanced cases. We will discuss the PyTorch Sampler class later in this PyTorch tutorial.
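As a quick illustration, here is a minimal sketch of a data loader that uses both of these arguments, assuming sample_dataset already exists as a Dataset object. The name sampled_dataloader and the num_workers value of 4 are arbitrary choices for this example. Note that sampler and shuffle are mutually exclusive, so shuffle is left at its default of False here.

from torch.utils.data import DataLoader, RandomSampler

# RandomSampler draws items from sample_dataset in random order,
# which takes the place of passing shuffle=True
sampler = RandomSampler(sample_dataset)

# num_workers=4 loads batches using 4 worker processes in parallel
sampled_dataloader = DataLoader(sample_dataset, batch_size=32, sampler=sampler, num_workers=4)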


Understanding how DataLoader works

Instantiating the DataLoader class returns an iterable. This iterable runs over the dataset and returns a batch of the required number of items in each iteration. Each batch is then fed to the neural network. The data loader assembles batches by internally calling the __getitem__() or __iter__() method defined in the dataset class. Therefore, every subclass of the Dataset class must implement one of these methods: map-style datasets implement __getitem__() (and usually __len__()), while iterable-style datasets implement __iter__(). A runnable sketch follows the list below.

In the above example, the important things to notice are:

  1. We create sample_dataloader to feed sample_dataset to the model.
  2. sample_dataloader is an iterable that will run over sample_dataset.
  3. In each iteration of sample_dataloader, a batch of 32 items is returned. This batch will then be fed to the model.
  4. sample_dataloader internally makes use of the __getitem__() or __iter__() methods defined in the class from which sample_dataset is created.
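To make this concrete, here is a minimal, self-contained sketch of a map-style dataset fed through a data loader. The dataset contents and the name SquaresDataset are illustrative inventions for this example, not part of any real API.

import torch
from torch.utils.data import Dataset, DataLoader

class SquaresDataset(Dataset):
    # A map-style dataset: implements __getitem__() and __len__()
    def __init__(self, n):
        self.data = torch.arange(n, dtype=torch.float32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # Return an (input, target) pair for the given index
        x = self.data[index]
        return x, x ** 2

sample_dataset = SquaresDataset(100)
sample_dataloader = DataLoader(sample_dataset, batch_size=32, shuffle=False)

for inputs, targets in sample_dataloader:
    # Each iteration yields one batch: tensors of shape [32]
    # (the last batch holds the remaining 4 items)
    print(inputs.shape, targets.shape)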

Creating Data Loaders

Separate data loaders are created for feeding data to the model during the training, validation, and testing loops. This is not only required to run these different loops, but also lets you feed data from different locations and in different ways in each process. For example, the training, validation, and test datasets may be stored in separate directories, in which case separate data loaders are necessary. Similarly, shuffling the data is useful when training the model but is usually unnecessary during validation and testing. You can also choose a different sampler or a different number of worker processes for each of these loops by creating separate data loaders, as shown below.

Example

train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=32, shuffle=False)
test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=False)
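As a usage sketch, the three loaders plug into the training and validation loops roughly as follows. The names model, loss_fn, optimizer, and num_epochs are assumed to already be defined; they are hypothetical placeholders, not part of the DataLoader API.

import torch

for epoch in range(num_epochs):
    model.train()
    for inputs, targets in train_dataloader:   # shuffled training batches
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():                      # no gradients needed for validation
        for inputs, targets in val_dataloader:  # unshuffled validation batches
            val_loss = loss_fn(model(inputs), targets)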