Imagine you are loading a large dataset via the SparkContext and Hive, so the dataset is distributed across your Spark cluster: for instance, observations (values + timestamps) for thousands of variables.
Now you would use some map/reduce operations or aggregations to organize and analyze the data, for instance grouping by variable name.
Once grouped, you could get all observations (values) for each variable as a time-series DataFrame. Can you now use DataFrame.toPandas, roughly like this (pseudocode)?
    def myFunction(data_frame):
        return data_frame.toPandas()

    df = sc.load....
    df.groupBy('var_name').mapValues(_.toDF).map(myFunction)
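To make the intent concrete, this is what I want to end up with for each variable, as a plain local pandas time series (a sketch with made-up names and values):

    import pandas as pd

    # Desired per-variable result: observations as a local time-series frame.
    observations = [(1, 20.0), (2, 21.5), (3, 19.8)]  # (timestamp, value), made-up data
    ts = pd.DataFrame(observations, columns=["timestamp", "value"]).set_index("timestamp")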
There is nothing special about a Pandas DataFrame in this context:

- If the DataFrame is created by calling the toPandas method on a pyspark.sql.dataframe.DataFrame, this collects the data and creates a local Python object on the driver.
- If a pandas.core.frame.DataFrame is created inside an executor process (for example in mapPartitions), you simply get an RDD[pandas.core.frame.DataFrame]. Inside an executor thread there is no distinction between a Pandas object and, say, a tuple.
- Finally, a pyspark.sql.dataframe.DataFrame (I assume this is what you mean by _.toDF) cannot be created inside an executor thread, since the Spark DataFrame API is only available on the driver.
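As a minimal sketch of the second case, assuming the data has already been keyed as (var_name, (timestamp, value)) pairs (the variable names and values here are illustrative):

    import pandas as pd

    # sc: an existing SparkContext. Illustrative (var_name, (timestamp, value)) pairs.
    rdd = sc.parallelize([
        ("temp", (1, 20.0)), ("temp", (2, 21.5)),
        ("humidity", (1, 55.0)), ("humidity", (2, 57.0)),
    ])

    def to_pandas(observations):
        # Runs inside an executor process; builds an ordinary local Python object.
        return pd.DataFrame(list(observations), columns=["timestamp", "value"])

    # Result: an RDD of (var_name, pandas.core.frame.DataFrame) pairs.
    frames = rdd.groupByKey().mapValues(to_pandas)
    frames.mapValues(lambda df: df["value"].mean()).collect()

Each pandas frame above is just an ordinary Python value living in an executor; nothing is sent to the driver until you call an action such as collect.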