Why is this polars filtering so much slower than my pandas equivalent?

Question

I'm trying a function in polars, and it is significantly slower than my pandas equivalent.

My pandas function is the following:

import pandas as pd
import time
import numpy as np

target_value = 0.5
data = np.random.rand(1000,100)
df = pd.DataFrame(data)

run_times = []
for i in range(100):
    st = time.perf_counter()
    df_filtered = df.loc[(df[0] - target_value).abs() == (df[0] - target_value).abs().min()]
    run_time = time.perf_counter() - st
    run_times.append(run_time)
print(f"avg pandas run: {sum(run_times)/len(run_times)}")

The polars version is the following:

import polars as pl
import time
import numpy as np

target_value = 0.5
data = np.random.rand(1000,100)
df = pl.DataFrame(data)

run_times = []
for i in range(100):
    st = time.perf_counter()
    df = df.with_columns(abs_diff = (pl.col('column_0')-target_value).abs())
    df_filtered = df.filter(pl.col('abs_diff') == df['abs_diff'].min())
    run_time = time.perf_counter() - st
    run_times.append(run_time)
print(f"avg polars run: {sum(run_times)/len(run_times)}")

My real datasets have 1,000 to 10,000 rows and 100 columns, and I need to filter many different datasets. For one example with df shape (1_000, 100), my pandas version is about six times faster (0.0006s for pandas versus 0.0037s for polars), which was unexpected. Is there a more efficient way to write my polars query? Or is it expected for pandas to outperform polars on datasets this small?

One thing to note: when I test with 2 columns, polars is faster, and the more columns I add, the slower polars gets. On the other hand, polars begins to outperform pandas at around 500_000 rows with 100 columns.

Additionally, in my real use case I need to return multiple rows that match the closest value.

Not sure if this is important, but for additional context, I'm running Python on a Linux server.

Lewis · Accepted Answer

I tested your function with pandas, polars, and numpy:

import pandas as pd
import time
import numpy as np
import polars as pl


def test(func, argument):
    run_times = []
    for i in range(100):
        st = time.perf_counter()
        df = func(argument)
        run_time = time.perf_counter() - st
        run_times.append(run_time)
    return np.mean(run_times)

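def f_pandas(df):
    # Equality filter: keeps every row whose absolute difference equals the minimum.
    min_abs_diff = (df[0] - target_value).abs().min()
    return df.loc[(df[0] - target_value).abs() == min_abs_diff]

def f_pandas_vectorized(df):
    # Single-row lookup by the label that idxmin() returns.
    return df.loc[(df[0] - target_value).abs().idxmin()]

def f_polars(df):
    # Eager Series arithmetic, then an equality filter as in f_pandas.
    min_abs_diff = (df["column_0"] - target_value).abs().min()
    return df.filter((df["column_0"] - target_value).abs() == min_abs_diff)

def f_numpy(data):
    # Works on the raw array; argmin() selects a single row index.
    abs_diff = np.abs(data[:, 0] - target_value)
    min_idx = np.argmin(abs_diff)
    return pd.DataFrame(data[[min_idx]])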


target_value = 0.5
data = np.random.rand(100000, 1000)
df = pd.DataFrame(data)
df_pl = pl.DataFrame(data)

print(f"average pandas runtime: {test(f_pandas, df)}")
print(f"average pandas runtime with idxmin(): {test(f_pandas_vectorized, df)}")
print(f"average polars runtime: {test(f_polars, df_pl)}")
print(f"average numpy runtime: {test(f_numpy, data)}")
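Note that f_numpy above uses np.argmin(), which selects a single row, while the question mentions needing every row that matches the closest value. A minimal sketch of a tie-preserving numpy variant follows; the name f_numpy_all_ties and the small demo array are hypothetical, and this variant was not part of the timings reported here.

```python
import numpy as np

def f_numpy_all_ties(data, target, col=0):
    # Absolute distance of every value in the chosen column from the target.
    abs_diff = np.abs(data[:, col] - target)
    # Boolean mask that is True for each row tied for the smallest distance.
    mask = abs_diff == abs_diff.min()
    return data[mask]

# Demo values are exactly representable floats, so the tie is exact.
demo = np.array([[0.25, 1.0], [1.0, 2.0], [0.75, 3.0], [0.0, 4.0]])
tied = f_numpy_all_ties(demo, 0.5)
```

The exact equality test is safe here because both sides come from the same computation; comparing against a value computed some other way would usually call for a tolerance such as np.isclose().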

I got these results (average seconds per call, three separate runs) in a Jupyter Notebook on a Linux machine:

average pandas runtime: 0.00989325414002451
average pandas runtime with idxmin(): 0.005005129760029377
average polars runtime: 0.006758741329904296
average numpy runtime: 0.004175669220221607

average pandas runtime: 0.009967705049803044
average pandas runtime with idxmin(): 0.005097740050114225
average polars runtime: 0.006972378070222476
average numpy runtime: 0.004102102290034964

average pandas runtime: 0.010020545769948512
average pandas runtime with idxmin(): 0.004993948210048984
average polars runtime: 0.007027968560159934
average numpy runtime: 0.004024256040174805

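As you can see, polars is faster than your pandas code here, but replacing the equality filter with idxmin() in pandas is faster still, at least in this case, and numpy is often the fastest for this type of numerical work. Two caveats: this benchmark uses a 100000 x 1000 array, which is much larger than the 1000 x 100 frame in the question, and idxmin() and argmin() return only the first closest row, whereas the equality filters return every row tied for the minimum.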

Tags: performance, python, pandas, dataframe, python-polars

Raymond Han
