I wonder how I can convert a Spark DataFrame to a Polars DataFrame.
Let's say I have this code in PySpark:
df = spark.sql('''select * from tmp''')
I can easily convert it to a pandas DataFrame using .toPandas().
Is there something similar in Polars? I need a Polars DataFrame for further processing.
PySpark uses Arrow to convert to pandas. Polars is an abstraction over Arrow memory, so we can hijack the API that Spark uses internally to create the Arrow data and use it to build the Polars DataFrame.
Given a SparkSession named spark, we can write:
import pyarrow as pa
import polars as pl

data = [("James", [1, 2])]
spark_df = spark.createDataFrame(data=data, schema=["name", "properties"])

# _collect_as_arrow() is a private PySpark method that returns the
# collected data as a list of Arrow RecordBatches.
df = pl.from_arrow(pa.Table.from_batches(spark_df._collect_as_arrow()))
print(df)
shape: (1, 2)
┌───────┬────────────┐
│ name ┆ properties │
│ --- ┆ --- │
│ str ┆ list[i64] │
╞═══════╪════════════╡
│ James ┆ [1, 2] │
└───────┴────────────┘
This will actually be faster than the toPandas() provided by Spark itself, because it saves an extra copy.
toPandas() leads to this serialization/copy path:
spark-memory -> arrow-memory -> pandas-memory
With the snippet above we have:
spark-memory -> arrow/polars-memory