The GPU trains this network in about 16 seconds. The CPU in about 13 seconds. (I am uncommenting/commenting appropriate lines to do the test). Can anyone see what's wrong with my code or pytorch installation? (I have already checked that the GPU is available, and that there is sufficient memory available on the GPU.
from os import path
from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
accelerator = 'cu80' if path.exists('/opt/bin/nvidia-smi') else 'cpu'
print(accelerator)
!pip install -q http://download.pytorch.org/whl/{accelerator}/torch-0.4.0-{platform}-linux_x86_64.whl torchvision
print("done")
#########################
import torch
from datetime import datetime
startTime = datetime.now()
dtype = torch.float
device = torch.device("cpu") # Comment this to run on GPU
# device = torch.device("cuda:0") # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1024, 128, 8
# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
t = torch.randn(N, D_out, device=device, dtype=dtype)
# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)
w3 = torch.randn(D_out, D_out, device=device, dtype=dtype, requires_grad=True)
learning_rate = 1e-9
for i in range(10000):
    y_pred = x.mm(w1).clamp(min=0).mm(w2).clamp(min=0).mm(w3)
    loss = (y_pred - t).pow(2).sum()
    if i % 1000 == 0:
        print(i, loss.item())
    loss.backward()
    # Manually update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()
print(datetime.now() - startTime)
I see you're timing things you shouldn't be timing (definition of dtype, device, ...). What's interesting to time here is the creation of the input, output and weight tensors.
startTime = datetime.now()
# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
t = torch.randn(N, D_out, device=device, dtype=dtype)
torch.cuda.synchronize()
print(datetime.now()-startTime)
# Create random Tensors for weights.
startTime = datetime.now()
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)
w3 = torch.randn(D_out, D_out, device=device, dtype=dtype, requires_grad=True)
torch.cuda.synchronize()
print(datetime.now()-startTime)
and the training loop
startTime = datetime.now()
for i in range(10000):
    y_pred = x.mm(w1).clamp(min=0).mm(w2).clamp(min=0).mm(w3)
    loss = (y_pred - t).pow(2).sum()
    if i % 1000 == 0:
        print(i, loss.item())
    loss.backward()
    # Manually update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()
torch.cuda.synchronize()
print(datetime.now() - startTime)
I run it on my machine with a GTX1080 and a very good CPU, so the absolute timing is lower, but the explanation should still be valid. If you open a Jupyter notebook and run it on the CPU:
0:00:00.001786 time to create input/output tensors
0:00:00.003359 time to create weight tensors
0:00:04.030797 time to run training loop
Now you set device to cuda and we call this "cold start" (nothing has been previously run on the GPU in this notebook)
0:00:03.180510 time to create input/output tensors
0:00:00.000642 time to create weight tensors
0:00:03.534751 time to run training loop
You see that the time to run the training loop is reduced by a small amount, but there is an overhead of 3 seconds because you need to move the tensors from CPU to GPU RAM.
If you run it again without closing the Jupyter notebook:
0:00:00.000421 time to create input/output tensors
0:00:00.000733 time to create weight tensors
0:00:03.501581 time to run training loop
The overhead disappears, because Pytorch uses a caching memory allocator to speed things up.
You can notice that the speedup you get on the training loop is very small, this is because the operations you're doing are on tensors of pretty small size. When dealing with small architectures and data I always run a quick test to see if I actually gain anything by running it on GPU.
For example if I set N, D_in, H, D_out = 64, 5000, 5000, 8, the training loop runs in 3.5 seconds on the GTX1080 and in 85 seconds on the CPU.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With