Imagine you trained a deep learning model on some dataset. A few days later you want to reproduce the same experiment, but unless you were careful, you may never be able to reproduce it exactly, even with the same architecture, the same dataset, and the same machine!
The underlying reason for this behavior is that deep learning training processes are stochastic in nature. This randomness is often acceptable and indeed desirable.
In this post, we will go over the steps necessary to ensure that you can reproduce a training experiment in PyTorch, at least when using the same PyTorch version and the same platform (OS, etc.).
Sources of Randomness in Training
In the process of training a neural network, there are multiple stages where randomness comes into play, for example:
- random initialization of the network's weights before training starts (demonstrated in the sketch after this list).
- regularization techniques such as dropout, which randomly drop nodes of the network during training.
- stochastic optimizers such as stochastic gradient descent (SGD), RMSProp, or Adam, whose updates depend on randomly sampled mini-batches of the data.
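As a quick illustration of the first point, creating the same layer twice produces different initial weights unless the random number generator is seeded first. Here is a minimal sketch (the layer sizes are arbitrary):

```python
import torch

# Two identically-shaped layers get different random initial weights
layer_a = torch.nn.Linear(4, 2)
layer_b = torch.nn.Linear(4, 2)
print(torch.equal(layer_a.weight, layer_b.weight))  # False (almost surely)

# Seeding the generator before each creation makes them identical
torch.manual_seed(0)
layer_c = torch.nn.Linear(4, 2)
torch.manual_seed(0)
layer_d = torch.nn.Linear(4, 2)
print(torch.equal(layer_c.weight, layer_d.weight))  # True
```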
Effect of Randomness on a Toy Example
Let us now see the effect of randomness on training by implementing a simple neural network with just one hidden layer that fits a line y = mx to some given data points, and observing how convergence varies across training runs.
Of course, this problem can be solved with a single neuron; we use it simply to demonstrate how the outcome can change because of randomness in the training process.
If you are new to building neural network models in PyTorch, we encourage you to use the PyTorch docs as a reference for building models. You can run this code either on a CPU or a GPU.
```python
# Train a model to fit a line y = mx using given data points
import torch

## Uncomment the two lines below to make the training reproducible.
#seed = 3
#torch.manual_seed(seed)

# set device to CUDA if available, else to CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Device:', device)

# N         - number of data points
# n_inputs  - number of input variables
# n_hidden  - number of units in the hidden layer
# n_outputs - number of outputs
N, n_inputs, n_hidden, n_outputs = 8, 1, 100, 1

# Input 8 pairs of (x, y) values
x = torch.tensor([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0]],
                 device=device)
y = torch.tensor([[0.0], [10.0], [20.0], [30.0], [40.0], [50.0], [60.0], [70.0]],
                 device=device)

# Build a network with one hidden layer between the input and output layers
model = torch.nn.Sequential(
    torch.nn.Linear(n_inputs, n_hidden),
    torch.nn.ReLU(),
    torch.nn.Linear(n_hidden, n_outputs)
)

# Move the model to the device
model.to(device)

# Define the loss function to be the mean squared error loss
loss_fn = torch.nn.MSELoss(reduction='sum')

# Forward pass through the data points, compute the loss, compute gradients
# via backpropagation, and update the weights using the gradients.
learning_rate = 1e-4
for t in range(1000):
    y_out = model(x)
    loss = loss_fn(y_out, y)
    if t % 100 == 99:
        print(t, loss.item())
        print(y_out)

    # Zero the gradients prior to the backward pass
    model.zero_grad()
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
```
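As an aside, the manual parameter update above is just plain gradient descent written by hand. The same loop could be expressed with torch.optim instead; here is a sketch that reuses model, loss_fn, x, y, and learning_rate from the block above:

```python
# Equivalent training loop using torch.optim.SGD instead of a manual update
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

for t in range(1000):
    y_out = model(x)
    loss = loss_fn(y_out, y)
    optimizer.zero_grad()  # clear old gradients
    loss.backward()        # compute new gradients
    optimizer.step()       # update parameters in place
```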
If we run the above code multiple times, we get different outputs. For example, the following are the losses obtained in two different runs.
First run’s output:
```
Device: cuda
99 9.77314567565918
199 3.7914605140686035
299 2.302948474884033
399 1.6840213537216187
499 1.2992208003997803
599 1.0251753330230713
699 0.8185980916023254
799 0.6595200896263123
899 0.5369465351104736
999 0.44230467081069946
```
Second run’s output:
```
Device: cuda
99 4.503087997436523
199 2.5806190967559814
299 1.71985924243927
399 1.2397112846374512
499 0.9274320006370544
599 0.7048821449279785
699 0.5397034287452698
799 0.41564038395881653
899 0.32327836751937866
999 0.2524281442165375
```
Now let us fix the seed for the random number generator by uncommenting the following commands (immediately after the `import torch` statement at the top of the code) and running it again.
```python
seed = 3
torch.manual_seed(seed)
```
You can set the seed to any fixed value.
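To see why this works, note that torch.manual_seed resets PyTorch's global random number generator, so every random draw that follows is repeated exactly. A minimal sketch:

```python
import torch

torch.manual_seed(3)
print(torch.randn(3))  # some random tensor

torch.manual_seed(3)
print(torch.randn(3))  # the exact same tensor again
```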
Now, even if we run the code multiple times, we get the following fixed loss values.
Device: cuda 99 10.655608177185059 199 3.6195263862609863 299 1.653144359588623 399 0.9989959001541138 499 0.712784469127655 599 0.5509689450263977 699 0.44407185912132263 799 0.368024617433548 899 0.3116675019264221 999 0.2681158781051636
So finally, we have been able to reproduce the exact same training process for our model!
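Rather than eyeballing the loss printouts, you can also verify reproducibility programmatically by running the training twice and comparing the learned parameters. Here is a sketch, assuming the training code above is wrapped in a hypothetical train_model(seed) function that seeds the generator, builds the model, trains it, and returns it:

```python
import torch

def params_equal(model_a, model_b):
    """Check that two models have bitwise-identical parameters."""
    return all(torch.equal(p, q)
               for p, q in zip(model_a.parameters(), model_b.parameters()))

# train_model() is a hypothetical wrapper around the training loop above
model_1 = train_model(seed=3)
model_2 = train_model(seed=3)
print(params_equal(model_1, model_2))  # True if training is reproducible
```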
Reproducible Training on GPU Using CuDNN
Our previous model was a simple one, so the torch.manual_seed(seed) command was sufficient to make the process reproducible. But for models involving convolutional layers, e.g. the one in this PyTorch tutorial, torch.manual_seed(seed) alone is not enough. Since CuDNN is used to accelerate the GPU operations, we also need the two CuDNN settings below: deterministic = True forces CuDNN to use only deterministic algorithms, and benchmark = False disables the autotuner that would otherwise pick (possibly non-deterministic) algorithms based on runtime measurements. All four commands together make the training process reproducible.
```python
seed = 3
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```
Let us add these lines to the PyTorch image classification tutorial, make the changes necessary to train on a GPU, and then run it on the GPU multiple times. We will see that the training process becomes consistent, with a fixed loss pattern, even across multiple runs.
```python
## Image Classification on the CIFAR-10 dataset
import torch
import torchvision
import torchvision.transforms as transforms

# set device to CUDA if available, else to CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# fix the seed and make cudnn deterministic
seed = 3
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Define the normalization transform
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

# Get the training set ready
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

# Get the test set ready
testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)

# Define the classes
classes = ('plane', 'car', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck')
```
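One caveat: with shuffle=True and multiple workers, the DataLoader is itself a source of randomness. torch.manual_seed covers the shuffling here, but if your dataset or transforms use NumPy or Python randomness inside the worker processes, it is safer to seed the workers explicitly. Here is a sketch following the pattern from the PyTorch reproducibility notes, reusing trainset and seed from the code above:

```python
import random
import numpy as np

def seed_worker(worker_id):
    # Derive a per-worker seed from the main process's torch seed
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(seed)

trainloader = torch.utils.data.DataLoader(
    trainset, batch_size=4, shuffle=True, num_workers=2,
    worker_init_fn=seed_worker, generator=g)
```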
Let us now define the network and move the network to our GPU device.
```python
# Define the network
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()
net.to(device)
```
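A quick way to confirm that the seeded initialization is identical across runs is to fingerprint the freshly-initialized parameters. A minimal sketch, reusing net from above; with the same seed, the printed checksum should be identical on every run:

```python
# Sum all parameter values as a cheap fingerprint of the initialization
with torch.no_grad():
    param_sum = sum(p.sum().item() for p in net.parameters())
print('Initial parameter checksum:', param_sum)
```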
Next, we define the loss function and the optimizer, then iterate through the training data doing the forward pass, the backward pass, and the parameter update. The loss is printed every 2000 mini-batches. Note that the inputs and labels need to be moved to the GPU before the forward pass.
```python
# Perform the optimization and training
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

print('Device:', device)

for epoch in range(2):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:  # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')
```
Now, every time we run the above training, we get the same training losses in the respective mini-batches.
```
Device: cuda
[1,  2000] loss: 2.192
[1,  4000] loss: 1.823
[1,  6000] loss: 1.610
[1,  8000] loss: 1.534
[1, 10000] loss: 1.471
[1, 12000] loss: 1.432
[2,  2000] loss: 1.382
[2,  4000] loss: 1.317
[2,  6000] loss: 1.292
[2,  8000] loss: 1.298
[2, 10000] loss: 1.263
[2, 12000] loss: 1.257
Finished Training
```
So far, we have seen how to get reproducible training while building two different deep learning models. Sometimes the algorithm also uses randomness from Python itself or from NumPy. In that case, we need to seed those random number generators too, and the whole seeding block looks like the following:
```python
import random

import numpy as np
import torch

seed = 3
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```
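For convenience, you can wrap the whole block in a small helper and call it once at the top of every training script. A sketch; seed_everything is a hypothetical name, not a PyTorch API:

```python
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    """Seed Python, NumPy, and PyTorch RNGs and make CuDNN deterministic."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # also seeds the CUDA generators on all devices
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(3)
```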