I am trying to calculate geodesic distance from a dataframe which consists of four columns of latitude and longitude data with around 3 million rows. I used the apply lambda method to do it but it took 18 minutes to finish the task. Is there a way to use Vectorization with NumPy arrays to speed up the calculation? Thank you for answering.
My code using apply and lambda method:
from geopy import distance
df['geo_dist'] = df.apply(lambda x: distance.distance(
                              (x['start_latitude'], x['start_longitude']),
                              (x['end_latitude'], x['end_longitude'])).miles, axis=1)
Updates:
I am trying this code, but it gives me the error: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). I would appreciate any help.
df['geo_dist'] = distance.distance(
                          (df['start_latitude'].values, df['start_longitude'].values),
                          (df['end_latitude'].values, df['end_longitude'].values)).miles
The Haversine formula calculates the great-circle distance between two locations on a sphere from their latitudes and longitudes. Because it treats the earth as a sphere, it only approximates the ellipsoidal geodesic distance that geopy's distance.distance returns; the difference is typically well under one percent.
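As a rough illustration, the Haversine formula can be written directly on NumPy arrays so that all rows are processed in one call. This is a minimal sketch, not geopy's method: the function name haversine_miles and the constant EARTH_RADIUS_MILES are made up for this example, and the mean earth radius of 3958.8 miles is an assumed value.

```python
import numpy as np

# Assumed mean Earth radius in miles (hypothetical constant for this sketch)
EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles.

    Inputs are in degrees and may be scalars or NumPy arrays of equal shape.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine of the central angle between the two points
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))


# Possible use with the question's columns (df is the asker's dataframe):
# df['geo_dist'] = haversine_miles(df['start_latitude'].to_numpy(),
#                                  df['start_longitude'].to_numpy(),
#                                  df['end_latitude'].to_numpy(),
#                                  df['end_longitude'].to_numpy())
```

The result is a spherical approximation, so it will differ slightly from the ellipsoidal value that geopy reports.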
Vectorized operations in NumPy apply optimized, pre-compiled functions and mathematical operations to whole arrays at once. This is usually much faster than the equivalent non-vectorized operations that loop over the elements in Python.
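A small sketch of the idea, comparing a Python-level loop with the vectorized sum method on a NumPy array. The array contents and variable names are made up for illustration.

```python
import numpy as np

values = np.arange(100_000, dtype=np.float64)

# Non-vectorized: the interpreter handles one element per iteration
loop_total = 0.0
for v in values:
    loop_total += v

# Vectorized: a single call that runs in pre-compiled code
vector_total = values.sum()
```

Both approaches give the same total; the vectorized call just avoids the per-element interpreter overhead.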
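A simple way to calculate the great-circle distance is to find the central angle between the two points and multiply it by the radius of the earth. With both points expressed as unit vectors, the formula is: angle = arccos(point1 · point2), distance = angle * radius, where the angle is in radians (for an angle in degrees, distance = angle * pi * radius / 180).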
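The central-angle approach above could be sketched with NumPy as follows, taking the angle in radians so that distance = angle * radius. The helper names to_unit_vectors and central_angle_distance are hypothetical, and the default radius of 6371.0088 km is an assumed mean earth radius.

```python
import numpy as np


def to_unit_vectors(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to 3-D unit vectors."""
    lat = np.radians(lat_deg)
    lon = np.radians(lon_deg)
    return np.stack([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)], axis=-1)


def central_angle_distance(lat1, lon1, lat2, lon2, radius=6371.0088):
    """Great-circle distance in the units of `radius` (km by default)."""
    p1 = to_unit_vectors(lat1, lon1)
    p2 = to_unit_vectors(lat2, lon2)
    # Row-wise dot product; clip guards against rounding just outside [-1, 1]
    dot = np.clip(np.sum(p1 * p2, axis=-1), -1.0, 1.0)
    return np.arccos(dot) * radius
```

Note that arccos loses precision when the two points are very close together, which is one reason the Haversine form is often preferred for short distances.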
Install mpu via pip install mpu --user and use it like this to get the haversine distance:
import mpu
# Point one
lat1 = 52.2296756
lon1 = 21.0122287
# Point two
lat2 = 52.406374
lon2 = 16.9251681
# What you were looking for
dist = mpu.
I think you might consider using geopandas for this; it's an extension of pandas (and therefore numpy) designed to do these types of calculations very quickly.
Specifically, it has a method for calculating the distance between sets of points in a GeoSeries, which can be a column of a GeoDataFrame. The method works element-wise on whole columns, so no Python-level loop is needed. Note that it measures planar distance in the units of the coordinate reference system, so for EPSG:4326 the result is in degrees rather than miles.
It should look something like this, where you convert your data to (at least) two GeoSeries that you can use for the origin and destination points. This should return a pandas Series of distances:
import geopandas as gpd
# One point per row for each end of the trip; x is longitude, y is latitude
start = gpd.GeoSeries(gpd.points_from_xy(df['start_longitude'], df['start_latitude']), crs='EPSG:4326')
end = gpd.GeoSeries(gpd.points_from_xy(df['end_longitude'], df['end_latitude']), crs='EPSG:4326')
# Element-wise planar distance in CRS units (degrees for EPSG:4326)
distances = start.distance(end)
The answer to your question: You cannot do what you want to do with geopy. I am not familiar with this package but the error traceback shows that this function and possibly all other functions in this package were not written/designed with vectorized computations in mind.
Now, if you can make do with great-circle distances, then I would suggest that you experiment with the astropy.coordinates package, which may be able to compute separations between points in a vectorized way.
Here is an example based on my answer to a different question: Finding closest point:
import astropy.units as u  # needed for u.rad in the conversion further down
from astropy.units import Quantity
from astropy.coordinates import SkyCoord, EarthLocation
from astropy.constants import R_earth
import numpy as np
lon1 = Quantity([-71.312796, -87.645307, -87.640426, -87.635513,
                 -87.630629, -87.625793 ], unit='deg')
lat1 = Quantity([41.49008, 41.894577, 41.894647, 41.894713,
                 41.894768, 41.894830], unit='deg')
lon2 = Quantity([-81.695391, -87.645307 + 0.5, -87.640426, -87.635513 - 0.5,
                 -87.630629 + 1.0, -87.625793 - 1.0], unit='deg')
lat2 = Quantity([41.499498, 41.894577 - 0.5, 41.894647, 41.894713 - 0.5,
                 41.894768 - 1.0, 41.894830 + 1.0], unit='deg')
pts1 = SkyCoord(EarthLocation.from_geodetic(lon1, lat1, height=R_earth).itrs, frame='itrs')
pts2 = SkyCoord(EarthLocation.from_geodetic(lon2, lat2, height=R_earth).itrs, frame='itrs')
Then, distances between the two sets of points can be computed as:
>>> dist = pts2.separation(pts1)
>>> print(dist)
<Angle [ 7.78350849, 0.62435354, 0., 0.62435308, 1.25039805, 1.24353876] deg>
Approximate conversion to distance:
>>> np.deg2rad(pts2.separation(pts1)) * R_earth / u.rad
<Quantity [ 866451.17527216,  69502.31527953,      0.        ,
             69502.26348614, 139192.86680148, 138429.29874024] m>
Compare the first value with what you would get from the geopy's example:
>>> distance.distance((41.49008, -71.312796), (41.499498, -81.695391)).meters
866455.4329098687
EDIT: Quite possibly this actually gives you the geodesic distance that you are after, but make sure to check the description of EarthLocation.