I'm using PyTorch to run calculations on my GPU (RTX 3000, CUDA 11.1). One step involves calculating the distance between one point and an array of points. For kicks I tested two functions to see which is faster:
import datetime as dt
import functools
import timeit
import torch
import numpy as np
device = torch.device("cuda:0")
# define functions for calculating distance
def dist_geom(a, b):
    dist = (a - b)**2
    dist = dist.sum(axis=1)**0.5
    return dist
def dist_linalg(a, b):
    dist = torch.linalg.norm(a - b, axis=1)
    return dist
   
# create dummy data
a = np.random.randint(0, 100000, (100000, 10, 10)).astype(np.float64)
b = np.random.randint(0, 100000, (1, 10)).astype(np.float64)
# send data to GPU
a = torch.from_numpy(a).to(device)
b = torch.from_numpy(b).to(device)
# test runtime of each
iterations = 1000
t = timeit.Timer(functools.partial(dist_linalg, a, b))
linalg_delta = t.timeit(number=iterations) / iterations
print("Linear algebra time: ", linalg_delta, " seconds per iteration")
t = timeit.Timer(functools.partial(dist_geom, a, b))
geom_delta = t.timeit(number=iterations) / iterations
print("Geometry time: ", geom_delta, " seconds per iteration")
print("linear algebra:geometry ratio: ", linalg_delta / geom_delta)
This gives the following output:
Linear algebra time:  0.000743145  seconds per iteration
Geometry time:  0.001446731  seconds per iteration
linear algebra:geometry ratio:  0.5136718574496572
So the linear algebra function is ~2x faster. But if I call the geometry function first:
t = timeit.Timer(functools.partial(dist_geom, a, b))
geom_delta = t.timeit(number=iterations) / iterations
print("Geometry time: ", geom_delta, " seconds per iteration")
t = timeit.Timer(functools.partial(dist_linalg, a, b))
linalg_delta = t.timeit(number=iterations) / iterations
print("Linear algebra time: ", linalg_delta, " seconds per iteration")      
print("linear algebra:geometry ratio: ", linalg_delta / geom_delta)
I get this output:
Geometry time:  0.001213497  seconds per iteration
Linear algebra time:  0.001136769  seconds per iteration
linear algebra:geometry ratio:  0.9367711663069623
The dist_geom time is close to the initial run, but the dist_linalg time is now about 1.5x longer!
I've tested this multiple ways and the result is always the same: the call order seems to matter...a lot. I think I'm missing a fundamental point here, so any help in understanding what is going on will be appreciated (and I suspect it will be so simple I'll feel foolish).
I then created two separate sets of tensors, one per function. The following yields the same runtime regardless of call order:
# create 2 tensors for geometry test
a1 = np.random.randint(0, 100000, (100000, 10, 10)).astype(np.float64)
b1 = np.random.randint(0, 100000, (1, 10)).astype(np.float64)
a1 = torch.from_numpy(a1).to(device)
b1 = torch.from_numpy(b1).to(device)
t = timeit.Timer(functools.partial(dist_geom, a1, b1))
geom_delta = t.timeit(number=iterations) / iterations
print("Geometry time: ", geom_delta, " seconds per iteration")
# create 2 different tensors for the linalg function
a2 = np.random.randint(0, 100000, (100000, 10, 10)).astype(np.float64)
b2 = np.random.randint(0, 100000, (1, 10)).astype(np.float64)
a2 = torch.from_numpy(a2).to(device)
b2 = torch.from_numpy(b2).to(device)
t = timeit.Timer(functools.partial(dist_linalg, a2, b2))
linalg_delta = t.timeit(number=iterations) / iterations
print("Linear algebra time: ", linalg_delta, " seconds per iteration")      
print("linear algebra:geometry ratio: ", linalg_delta / geom_delta)
Geometry time:  0.0012010019999999998  seconds per iteration
Linear algebra time:  0.0007349769999999999  seconds per iteration
linear algebra:geometry ratio:  0.6119698385181707
That said, if I define both a1/b1 and a2/b2 before the function calls, I see the difference in times again. Initially I thought this was caused by memory load times, but that does not really fit, right?
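One way to probe the memory hypothesis is to look at what PyTorch's caching allocator is holding before and after a call. The sketch below is only an illustration: the helper name `allocator_snapshot` and the smaller tensor size are assumptions, and it does not by itself establish that cached memory explains the timing difference. It falls back to the CPU (reporting zeros) when no GPU is available.

```python
import torch

def allocator_snapshot():
    """Bytes in live tensors vs. bytes held by PyTorch's caching allocator.

    Hypothetical helper for illustration; returns zeros when CUDA is absent.
    """
    if not torch.cuda.is_available():
        return {"allocated": 0, "reserved": 0}
    return {
        "allocated": torch.cuda.memory_allocated(),
        "reserved": torch.cuda.memory_reserved(),
    }

def dist_geom(a, b):
    # Same computation as in the question, written with dim= instead of axis=.
    return ((a - b) ** 2).sum(dim=1) ** 0.5

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Smaller than the question's data so the sketch also runs quickly on a CPU.
a = torch.rand(10000, 10, 10, dtype=torch.float64, device=device)
b = torch.rand(1, 10, dtype=torch.float64, device=device)

print("before call:      ", allocator_snapshot())
dist_geom(a, b)  # temporaries are freed on return, but their blocks may stay cached
print("after call:       ", allocator_snapshot())
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # hand unused cached blocks back to the driver
print("after empty_cache:", allocator_snapshot())
```

On a GPU, "reserved" staying above "allocated" after the call would show that freed temporaries are still cached, which is the state `torch.cuda.empty_cache()` resets.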
You can just add a call that releases PyTorch's unused cached GPU memory between the two timing runs:
torch.cuda.empty_cache()
All code:
import datetime as dt
import functools
import timeit
import torch
import numpy as np
device = torch.device("cuda:0")
# define functions for calculating distance
def dist_geom(a, b):
    dist = (a - b)**2
    dist = dist.sum(axis=1)**0.5
    return dist
def dist_linalg(a, b):
    dist = torch.linalg.norm(a - b, axis=1)
    return dist
   
# create dummy data
a = np.random.randint(0, 100000, (100000, 10, 10)).astype(np.float64)
b = np.random.randint(0, 100000, (1, 10)).astype(np.float64)
# send data to GPU
a = torch.from_numpy(a).to(device)
b = torch.from_numpy(b).to(device)
# test runtime of each
iterations = 1000
t = timeit.Timer(functools.partial(dist_linalg, a, b))
linalg_delta = t.timeit(number=iterations) / iterations
print("Linear algebra time: ", linalg_delta, " seconds per iteration")
torch.cuda.empty_cache()
t = timeit.Timer(functools.partial(dist_geom, a, b))
geom_delta = t.timeit(number=iterations) / iterations
print("Geometry time: ", geom_delta, " seconds per iteration")
print("linear algebra:geometry ratio: ", linalg_delta / geom_delta)
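A related caveat when timing GPU code with a wall clock: CUDA kernels are launched asynchronously, so a timer can stop before the queued work has finished unless the host waits with `torch.cuda.synchronize()`. The sketch below shows one way to time both functions with a warm-up phase and explicit synchronization. The helper name `timed`, the iteration counts, and the reduced tensor size are assumptions made for illustration, and nothing here is a claim about what actually causes the order dependence above.

```python
import time
import torch

def dist_geom(a, b):
    return ((a - b) ** 2).sum(dim=1) ** 0.5

def dist_linalg(a, b):
    return torch.linalg.norm(a - b, dim=1)

def timed(fn, a, b, iterations=100, warmup=10):
    """Average seconds per call, waiting for queued GPU work before reading the clock."""
    on_gpu = a.is_cuda
    for _ in range(warmup):          # warm-up calls are not timed
        fn(a, b)
    if on_gpu:
        torch.cuda.synchronize()     # make sure the warm-up work has finished
    start = time.perf_counter()
    for _ in range(iterations):
        fn(a, b)
    if on_gpu:
        torch.cuda.synchronize()     # wait for all timed kernels to complete
    return (time.perf_counter() - start) / iterations

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Smaller than the question's data so the sketch also runs quickly on a CPU.
a = torch.rand(10000, 10, 10, dtype=torch.float64, device=device)
b = torch.rand(1, 10, dtype=torch.float64, device=device)

# Time geometry twice, around linalg, to see whether call order still matters.
for name, fn in [("Geometry", dist_geom),
                 ("Linear algebra", dist_linalg),
                 ("Geometry (again)", dist_geom)]:
    print(name, "time:", timed(fn, a, b), "seconds per iteration")
```

If the per-call times stay stable across the repeated geometry run, the original order effect was more likely a measurement artifact than a real difference between the two functions.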