Imagine you trained a deep learning model on some dataset. A few days later you want to reproduce the experiment, but if you were not careful, you may never be able to reproduce it exactly, even with the same architecture, the same dataset, and the same machine!
The underlying reason for this behavior is that deep learning training processes are stochastic in nature. This randomness is usually acceptable, and often even desirable, but it gets in the way when you need identical results.
In this post, we will go over the steps necessary to make a training experiment reproducible in PyTorch, at least on the same PyTorch version and the same platform (OS, hardware, etc.).
Sources of Randomness in Training
In the process of training a neural network, there are multiple stages where randomness is used, for example:
- random initialization of the weights of the network before training starts (see the sketch after this list);
- regularization, e.g. dropout, which involves randomly dropping nodes in the network during training;
- optimizers like stochastic gradient descent, RMSProp or Adam, which operate on randomly shuffled mini-batches of the data.
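To isolate the first of these sources, here is a minimal sketch (the layer shapes are arbitrary) showing that two freshly constructed layers start from different random weights:
import torch

layer_a = torch.nn.Linear(4, 2)
layer_b = torch.nn.Linear(4, 2)

# Without a fixed seed, each construction draws fresh random weights,
# so the two layers will (almost certainly) not match.
print(torch.allclose(layer_a.weight, layer_b.weight))  # False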
Effect of Randomness on a Toy Example
Let us now see the effect of randomness on training by implementing a simple neural network with just one hidden layer, fitting a line y=mx to some given data points, and observing how convergence varies across training runs.
Of course, this problem can easily be solved with a single neuron; we use it simply to demonstrate how results can change because of randomness in the training process.
If you are new to building neural network models in PyTorch, we encourage you to use the PyTorch docs for reference. You can run this code on either a CPU or a GPU.
# Train a model to fit a line y=mx using given data points
import torch
## Uncomment the two lines below to make the training reproducible.
#seed = 3
#torch.manual_seed(seed)
# set device to CUDA if available, else to CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Device:', device)
# N - number of data points
# n_inputs - number of input variables
# n_hidden - number of units in the hidden layer
# n_outputs - number of outputs
N, n_inputs, n_hidden, n_outputs = 8, 1, 100, 1
# Input 8 pairs of (x, y) values
x = torch.tensor([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0]], device=device)
y = torch.tensor([[0.0], [10.0], [20.0], [30.0], [40.0], [50.0], [60.0], [70.0]], device=device)
# Make a 3-layer neural network with an input layer, a hidden layer and an output layer
model = torch.nn.Sequential(
    torch.nn.Linear(n_inputs, n_hidden),
    torch.nn.ReLU(),
    torch.nn.Linear(n_hidden, n_outputs)
)
# Move the model to the device
model.to(device)
# Define the loss function to be the mean squared error loss
loss_fn = torch.nn.MSELoss(reduction='sum')
# Forward pass through the data points, compute the loss, backpropagate to get gradients, and update the weights using the gradients.
learning_rate = 1e-4
for t in range(1000):
    y_out = model(x)
    loss = loss_fn(y_out, y)
    if t % 100 == 99:
        print(t, loss.item())
        # print(y_out)  # uncomment to inspect the predictions as well
    # Zero the gradients before the backward pass
    model.zero_grad()
    loss.backward()
    # Update the weights using gradient descent
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
If we run the above code multiple times, we get different outputs. For example, the following are the losses obtained in two different runs.
First run’s output:
Device: cuda
99 9.77314567565918
199 3.7914605140686035
299 2.302948474884033
399 1.6840213537216187
499 1.2992208003997803
599 1.0251753330230713
699 0.8185980916023254
799 0.6595200896263123
899 0.5369465351104736
999 0.44230467081069946
Second run’s output:
Device: cuda
99 4.503087997436523
199 2.5806190967559814
299 1.71985924243927
399 1.2397112846374512
499 0.9274320006370544
599 0.7048821449279785
699 0.5397034287452698
799 0.41564038395881653
899 0.32327836751937866
999 0.2524281442165375
Now let us fix the seed for the random number generator by uncommenting the following commands in the above code, immediately after the import torch line at the top, and run it again.
seed = 3
torch.manual_seed(seed)
You can set the seed to any fixed value.
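What the seed buys us is repeatable draws from PyTorch's random number generator; here is a minimal sketch:
import torch

torch.manual_seed(3)
a = torch.randn(3)
torch.manual_seed(3)
b = torch.randn(3)

# Re-seeding replays the exact same sequence of random draws
print(torch.equal(a, b))  # True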
Now, even if we run the code multiple times, we get the following fixed loss values.
Device: cuda
99 10.655608177185059
199 3.6195263862609863
299 1.653144359588623
399 0.9989959001541138
499 0.712784469127655
599 0.5509689450263977
699 0.44407185912132263
799 0.368024617433548
899 0.3116675019264221
999 0.2681158781051636
So finally, we have been able to reproduce the exact same training process for our model!
Reproducible Training on GPU Using cuDNN
Our previous model was a simple one, so the torch.manual_seed(seed) command was sufficient to make the process reproducible. But when we work with models involving convolutional layers, e.g. in this PyTorch tutorial, torch.manual_seed(seed) alone will not be enough. Since cuDNN will be involved to accelerate GPU operations, we will need to add all four of the commands below to make the training process reproducible.
seed = 3
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
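As a side note, newer PyTorch versions (1.8 and later) also offer a stricter global switch that errors out on nondeterministic operations; a sketch, assuming such a version is available:
import torch

torch.manual_seed(3)
# Raises a RuntimeError if any operation without a deterministic
# implementation is invoked (available in PyTorch >= 1.8)
torch.use_deterministic_algorithms(True)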
Let us add these four commands to the PyTorch image classification tutorial, make the necessary changes to do the training on a GPU, and then run it on the GPU multiple times. We will see that the training process becomes consistent, with a fixed loss pattern, even if we run the training multiple times.
## Image Classification on CIFAR-10 dataset
import torch
import torchvision
import torchvision.transforms as transforms
# set device to CUDA if available, else to CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# fix the seed and make cudnn deterministic
seed = 3
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
# Define normalization transform
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

# Get the training set ready
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

# Get the test set ready
testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)

# Define the classes
classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
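One caveat worth flagging: the loaders above use num_workers=2, and each worker process maintains its own random state. For strict reproducibility of data loading, PyTorch's reproducibility notes recommend seeding the workers and the shuffling generator explicitly; a sketch along those lines (reusing trainset and seed from above):
import random
import numpy as np
import torch

def seed_worker(worker_id):
    # Derive per-worker seeds from the base seed PyTorch assigns to each worker
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(seed)

trainloader = torch.utils.data.DataLoader(
    trainset, batch_size=4, shuffle=True, num_workers=2,
    worker_init_fn=seed_worker, generator=g)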
Let us now define the network and move the network to our GPU device.
# Define the network
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()
net.to(device)
Next, we define the loss function and the optimizer, then iterate through the training data to do the forward pass, the backward pass and the parameter updates. The loss is printed every 2000 mini-batches. Note that the inputs and labels need to be moved to the GPU before the forward pass.
# Perform the optimization and training
import torch.optim as optim
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
print('Device:', device)
for epoch in range(2):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)
        # zero the parameter gradients
        optimizer.zero_grad()
        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:  # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')
Now, every time we run the above training, we get the same training losses in the respective mini-batches.
Device: cuda
[1, 2000] loss: 2.192
[1, 4000] loss: 1.823
[1, 6000] loss: 1.610
[1, 8000] loss: 1.534
[1, 10000] loss: 1.471
[1, 12000] loss: 1.432
[2, 2000] loss: 1.382
[2, 4000] loss: 1.317
[2, 6000] loss: 1.292
[2, 8000] loss: 1.298
[2, 10000] loss: 1.263
[2, 12000] loss: 1.257
Finished Training
So far, we have seen how to get reproducible training while building two different deep learning models. Sometimes the algorithm also uses randomness from Python itself or from NumPy; in that case, we need to seed the corresponding random number generators too. The whole seeding block, including the necessary imports, would then look like the following:
import random
import numpy as np
import torch

seed = 3
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
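For convenience, all of these commands can be wrapped in a single helper that is called once at the start of every script; a minimal sketch (the name set_seed is our own choice):
import random
import numpy as np
import torch

def set_seed(seed):
    # Seed the Python, NumPy and PyTorch RNGs, and make cuDNN deterministic
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(3)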