 

How Spark Dataframe is better than Pandas Dataframe in performance? [closed]

Can anyone please explain how Spark DataFrames are better in terms of execution time than Pandas DataFrames? I'm dealing with a moderate volume of data and applying transformations powered by Python functions.

For example, I have a column with numbers from 1 to 100,000 in my dataset and want to perform a basic numeric operation: creating a new column that is the cube of the existing numeric column.

from datetime import datetime
import numpy as np
import pandas as pd

def cube(num):
    return num**3

array_of_nums = np.arange(1, 100001)  # 1 to 100,000, matching the description

dataset = pd.DataFrame(array_of_nums, columns = ["numbers"])

start_time = datetime.now() 
# Some complex transformations...
dataset["cubed"] = [cube(x) for x in dataset.numbers]
end_time = datetime.now() 

print("Time taken :", (end_time-start_time))

The output is

Time taken : 0:00:00.109349

If I use a Spark DataFrame with 10 worker nodes, can I expect the following result (which is 1/10th of the time taken by the Pandas DataFrame)?

Time taken : 0:00:00.010935
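(For reference, most of the measured time in the snippet above comes from calling `cube` once per row in Python. The same column can be computed with a vectorized expression, which avoids the per-row function-call overhead entirely; a minimal sketch of the alternative:)

```python
from datetime import datetime

import numpy as np
import pandas as pd

dataset = pd.DataFrame(np.arange(1, 100001), columns=["numbers"])

start_time = datetime.now()
# Vectorized: the cube is computed over the whole column at once in
# compiled NumPy code, with no per-row Python function calls.
dataset["cubed"] = dataset["numbers"] ** 3
end_time = datetime.now()

print("Time taken :", end_time - start_time)
```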
asked Dec 22 '25 by komandurikc


1 Answer

1) A pandas DataFrame is not distributed, whereas Spark's DataFrame is. Hence you won't get the benefit of parallel processing with a pandas DataFrame, and its processing speed will suffer for large amounts of data.

2) A Spark DataFrame assures fault tolerance (it's resilient), while a pandas DataFrame does not. If your processing is interrupted or fails partway through, Spark can regenerate the failed result set from the lineage recorded in its DAG. Fault tolerance is not supported in pandas; you would need to implement your own framework to assure it.
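For completeness, a hypothetical PySpark version of the question's cube example might look like the sketch below (assuming `pyspark` is installed and a local or cluster `SparkSession` can be created). Note that for only 100,000 rows, Spark's scheduling and serialization overhead typically makes it slower than pandas, so the 1/10th expectation does not hold at this scale; the distributed benefit appears on much larger data.

```python
# A sketch, assuming pyspark is installed and a SparkSession is available.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cube-example").getOrCreate()

# Transformations are lazy: these lines only build the lineage (DAG).
df = spark.range(1, 100001).withColumnRenamed("id", "numbers")
n = F.col("numbers")
df = df.withColumn("cubed", n * n * n)  # keeps integer type, unlike F.pow

# An action (show/count/collect) triggers distributed execution; if a
# partition fails, Spark recomputes it from the recorded lineage.
df.show(5)
```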

answered Dec 24 '25 by MIKHIL NAGARALE


