I want to convert a big spark data frame to Pandas with more than 1000000 rows. I tried to convert a spark data Frame to Pandas data frame using the following code:
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
result.toPandas()
But, I got the error:
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py in toPandas(self)
   1949                 import pyarrow
-> 1950                 to_arrow_schema(self.schema)
   1951                 tables = self._collectAsArrow()
/usr/local/lib/python3.6/dist-packages/pyspark/sql/types.py in to_arrow_schema(schema)
   1650     fields = [pa.field(field.name, to_arrow_type(field.dataType), nullable=field.nullable)
-> 1651               for field in schema]
   1652     return pa.schema(fields)
/usr/local/lib/python3.6/dist-packages/pyspark/sql/types.py in <listcomp>(.0)
   1650     fields = [pa.field(field.name, to_arrow_type(field.dataType), nullable=field.nullable)
-> 1651               for field in schema]
   1652     return pa.schema(fields)
/usr/local/lib/python3.6/dist-packages/pyspark/sql/types.py in to_arrow_type(dt)
   1641     else:
-> 1642         raise TypeError("Unsupported type in conversion to Arrow: " + str(dt))
   1643     return arrow_type
TypeError: Unsupported type in conversion to Arrow: VectorUDT
During handling of the above exception, another exception occurred:
RuntimeError                              Traceback (most recent call last)
<ipython-input-138-4e12457ff4d5> in <module>()
      1 spark.conf.set("spark.sql.execution.arrow.enabled", "true")
----> 2 result.toPandas()
/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py in toPandas(self)
   1962                     "'spark.sql.execution.arrow.enabled' is set to true. Please set it to false "
   1963                     "to disable this.")
-> 1964                 raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
   1965         else:
   1966             pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
RuntimeError: Unsupported type in conversion to Arrow: VectorUDT
Note: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this.
It's not working, but if I set arrow to false, it works. But It's so slow... Any idea?
Arrow supports only a small set of types, and Spark UserDefinedTypes, including ml and mllib VectorUDTs are not among supported ones. 
If you rally want to use arrow you'll have to convert your data to a format that it is supported. One possible solution is to expand Vectors into columns - How to split Vector into columns - using PySpark 
You can also serialize output using to_json method:
from pyspark.sql.functions import to_json
 df.withColumn("your_vector_column", to_json("your_vector_column"))
but if data is large enough for toPandas to be a serious bottleneck, then I would reconsider collecting data like this. 
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With