I want to convert a big Spark DataFrame (more than 1,000,000 rows) to a Pandas DataFrame. I tried the following code:
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
result.toPandas()
But I got this error:
TypeError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py in toPandas(self)
1949 import pyarrow
-> 1950 to_arrow_schema(self.schema)
1951 tables = self._collectAsArrow()
/usr/local/lib/python3.6/dist-packages/pyspark/sql/types.py in to_arrow_schema(schema)
1650 fields = [pa.field(field.name, to_arrow_type(field.dataType), nullable=field.nullable)
-> 1651 for field in schema]
1652 return pa.schema(fields)
/usr/local/lib/python3.6/dist-packages/pyspark/sql/types.py in <listcomp>(.0)
1650 fields = [pa.field(field.name, to_arrow_type(field.dataType), nullable=field.nullable)
-> 1651 for field in schema]
1652 return pa.schema(fields)
/usr/local/lib/python3.6/dist-packages/pyspark/sql/types.py in to_arrow_type(dt)
1641 else:
-> 1642 raise TypeError("Unsupported type in conversion to Arrow: " + str(dt))
1643 return arrow_type
TypeError: Unsupported type in conversion to Arrow: VectorUDT
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
<ipython-input-138-4e12457ff4d5> in <module>()
1 spark.conf.set("spark.sql.execution.arrow.enabled", "true")
----> 2 result.toPandas()
/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py in toPandas(self)
1962 "'spark.sql.execution.arrow.enabled' is set to true. Please set it to false "
1963 "to disable this.")
-> 1964 raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
1965 else:
1966 pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
RuntimeError: Unsupported type in conversion to Arrow: VectorUDT
Note: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this.
With Arrow enabled it doesn't work; if I set the Arrow flag to false, it works, but it's very slow. Any ideas?
Arrow supports only a small set of types, and Spark UserDefinedTypes, including the ml and mllib VectorUDT, are not among the supported ones.
If you really want to use Arrow, you'll have to convert your data to a format that is supported. One possible solution is to expand the Vectors into columns (see How to split Vector into columns - using PySpark); a sketch follows below.
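For example, here is a minimal sketch of that expansion, assuming Spark 3.0+ (where pyspark.ml.functions.vector_to_array is available), a vector column named "features" on the result DataFrame from the question, and a known vector length of 3 (all of these are illustrative assumptions):

from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

# Replace the VectorUDT column with a plain ArrayType(DoubleType) column,
# then expand the array into one scalar column per element.
expanded = (result
    .withColumn("features", vector_to_array("features"))
    .select(
        *[c for c in result.columns if c != "features"],
        *[col("features")[i].alias(f"features_{i}") for i in range(3)]
    ))

pdf = expanded.toPandas()  # plain numeric columns convert cleanly with Arrow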
You can also serialize the output using the to_json function:
from pyspark.sql.functions import to_json

# Replace the vector column with its JSON string representation.
df.withColumn("your_vector_column", to_json("your_vector_column"))
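After collecting, each cell of that column holds a JSON string; a small sketch of decoding it back on the pandas side, using the same placeholder column name as above:

import json

pdf = df.withColumn("your_vector_column", to_json("your_vector_column")).toPandas()
# Parse each JSON string back into a Python object.
pdf["your_vector_column"] = pdf["your_vector_column"].map(json.loads)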
But if the data is large enough for toPandas to be a serious bottleneck, I would reconsider collecting the data like this at all.