A data loader in Pytorch is used to feed the data represented by a Dataset object into a neural network model, and Pytorch provides the DataLoader class for exactly this purpose. In this chapter of the Pytorch Tutorial, you will learn about the DataLoader class, explore how it works, and see how you can use it to feed data to a neural network.
The DataLoader class
Importing the DataLoader class
You can import the DataLoader class from the torch.utils.data module:
from torch.utils.data import DataLoader
Using the DataLoader class
To create a data loader, you create an instance of the DataLoader class and pass the following arguments to its constructor:
dataset – the dataset for which you want to create the data loader. This must be an object of the Dataset class or one of its subclasses (sample_dataset in the example below).
batch_size – the size of each batch that the data loader will feed to the model. If you do not specify the batch size, it defaults to 1.
shuffle – whether the data should be shuffled before being fed to the model. If you do not specify the shuffle parameter, it defaults to False. Generally, shuffle is set to True during training, but to False during validation and testing.
Example
In the example below, a data loader sample_dataloader is created to feed the dataset sample_dataset to the model in batches of 32 items without shuffling them.
sample_dataloader = DataLoader(sample_dataset, batch_size=32, shuffle=False)
Note – You can also pass other arguments to the DataLoader constructor. These include sampler, an object of the Sampler class or one of its subclasses, which is used to draw items from the dataset in a particular desired fashion, and num_workers, which sets how many worker processes are used to load the data in parallel. Other arguments are used for much more advanced cases. We will discuss the Pytorch Sampler class later in this Pytorch tutorial.
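As a sketch of how these arguments fit together, assuming the sample_dataset object from above (note that sampler and shuffle are mutually exclusive, so shuffle is omitted here):

from torch.utils.data import DataLoader, RandomSampler

# RandomSampler draws items from the dataset in random order; a sampler
# replaces shuffle, so the two must not be passed together.
sampler = RandomSampler(sample_dataset)
sampled_dataloader = DataLoader(sample_dataset, batch_size=32, sampler=sampler, num_workers=2)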
Understanding how DataLoader works
Instantiating the DataLoader class returns an iterable. This iterable runs over the dataset and yields a batch of the required number of items in each iteration; each batch is then fed to the neural network. The data loader builds these batches internally using the __getitem__() method (for map-style datasets) or the __iter__() method (for iterable-style datasets) defined in the dataset class. Every subclass of the Dataset class must therefore implement one of these methods.
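To make this concrete, here is a minimal sketch of iterating over sample_dataloader from the example above. It assumes each item of sample_dataset is a (features, label) pair, which is not stated in the original example:

# Each iteration yields one batch; with batch_size=32, the first dimension
# of each returned tensor is (up to) 32.
for features, labels in sample_dataloader:
    print(features.shape, labels.shape)
    break  # stop after inspecting the first batch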
In the above example, the important things to notice are:
- We create sample_dataloader to feed sample_dataset to the model. sample_dataloader is an iterator that will run over sample_dataset.
- In each iteration of sample_dataloader, a batch of 32 items is returned. This batch will then be fed to the model.
- sample_dataloader internally makes use of the __getitem__() or __iter__() method defined in the class from which sample_dataset is created; a minimal sketch of such a class follows this list.
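For reference, here is a minimal sketch of a map-style dataset class that sample_dataset could be created from. The class name, the random tensors, and the sizes are illustrative assumptions, not taken from the original example:

import torch
from torch.utils.data import Dataset

class SampleDataset(Dataset):
    # A map-style dataset implements __getitem__() and __len__().
    def __init__(self, num_items=100):
        self.features = torch.randn(num_items, 10)       # 10 random features per item (illustrative)
        self.labels = torch.randint(0, 2, (num_items,))  # random binary labels (illustrative)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

sample_dataset = SampleDataset()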
Creating Data Loaders
Separate data loaders are created to feed data to the model during the training, validation, and testing loops. Not only is this required to run the three loops separately, but it also lets you feed data from different locations and in different ways in each process. For example, the training, validation, and test datasets may live in separate directories, which makes separate data loaders necessary. Similarly, shuffling the data is useful when training the model but is usually unnecessary during validation and testing. By creating separate data loaders, you can also use a different sampler or a different number of worker processes in each of these processes, according to your needs.
Example
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=32, shuffle=False)
test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=False)
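As a sketch of how these three loaders are then consumed, again assuming each dataset item is a (features, labels) pair; the loop bodies here are placeholders, not a full training implementation:

# Shuffled batches for training; fixed order for validation and testing.
for features, labels in train_dataloader:
    ...  # forward pass, loss computation, backward pass, optimizer step

for features, labels in val_dataloader:
    ...  # forward pass and metric computation only, no parameter updates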