Anyone know how to display a pandas dataframe in Databricks?

Question

Previously I had a pandas dataframe that I could display as a table in Databricks using:

df.display()

pandas was updated to v2.0.0 today, and I now get the following error when I run df.display():

AttributeError: 'DataFrame' object has no attribute 'iteritems'

Anyone know how I can resolve this?

I tried running df.display (without parentheses). It produces output, but I am looking for output in tabular form.

Zee · Accepted Answer

As a workaround, downgrade to pandas v1.5:

%pip install --upgrade pandas==1.5
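
After the downgrade it can help to confirm which pandas version the notebook actually sees. The sketch below is only an illustration and not part of the original answer: `iteritems_removed` and `installed_pandas_version` are hypothetical helper names, and the check assumes nothing beyond the fact that `iteritems` was removed in pandas 2.0.0.

```python
from importlib import metadata


def iteritems_removed(version: str) -> bool:
    """Return True if this pandas version string is 2.x or later."""
    major = int(version.split(".")[0])
    return major >= 2


def installed_pandas_version():
    """Return the installed pandas version string, or None if pandas is absent."""
    try:
        return metadata.version("pandas")
    except metadata.PackageNotFoundError:
        return None


version = installed_pandas_version()
if version is None:
    print("pandas is not installed in this environment")
elif iteritems_removed(version):
    print(f"pandas {version}: DataFrame.iteritems has been removed")
else:
    print(f"pandas {version}: DataFrame.iteritems is still available")
```

If the message still reports a 2.x version after running the %pip command, the downgrade did not take effect in the current Python session.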

The other answers posted so far worked until 3 April 2023.

As of 4 April 2023, with pandas 2.0.0, you can no longer convert a pandas DataFrame to a Spark DataFrame using the command:

spark.createDataFrame(df)

Using the above command leads to the error mentioned in the question:

AttributeError: 'DataFrame' object has no attribute 'iteritems'

The iteritems method was removed in pandas 2.0.0. From the pandas 2.0.0 changelog:

Removed deprecated Series.iteritems(), DataFrame.iteritems(), use obj.items instead

Meanwhile, the Spark code that converts a pandas DataFrame to a Spark DataFrame still calls iteritems:

/databricks/spark/python/pyspark/sql/pandas/conversion.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    308                     warnings.warn(msg)
    309                     raise
--> 310         data = self._convert_from_pandas(data, schema, timezone)
    311         return self._create_dataframe(data, schema, samplingRatio, verifySchema)
    312 

/databricks/spark/python/pyspark/sql/pandas/conversion.py in _convert_from_pandas(self, pdf, schema, timezone)
    340                             pdf[field.name] = s
    341             else:
--> 342                 for column, series in pdf.iteritems():
    343                     s = _check_series_convert_timestamps_tz_local(series, timezone)
    344                     if s is not series:

It looks like we will have to wait for a fix on the Spark side before pandas 2.0.0 can be used here.
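
Where downgrading is not an option, one stopgap is to restore the removed name as an alias of `items` before calling spark.createDataFrame. This is a sketch and not part of the original answer: `add_iteritems_alias` is a hypothetical helper, and it assumes the Spark conversion code needs nothing from `iteritems` beyond what `items` already provides (in line with the changelog's "use obj.items instead"). Any other pandas 2.0 incompatibilities are not addressed.

```python
def add_iteritems_alias(cls):
    """Attach `iteritems` as an alias of `items` if cls does not already have it."""
    if not hasattr(cls, "iteritems"):
        cls.iteritems = cls.items
    return cls


try:
    import pandas as pd
except ImportError:  # keeps the sketch importable where pandas is absent
    pd = None

if pd is not None:
    # On pandas < 2.0 these calls change nothing; on 2.x they restore the old name.
    add_iteritems_alias(pd.DataFrame)
    add_iteritems_alias(pd.Series)
```

The alias would need to be applied once per notebook session, before the first spark.createDataFrame(df) call.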

Alex Ott · Answer

You just need to call the display function with the pandas DataFrame as its argument, rather than calling display as a method of the DataFrame:

display(pdf)

Or you can simply put the name of the variable holding the pandas DataFrame on the last line of a cell; it will then be rendered using pandas' built-in representation:

import pyspark.sql.functions as F

pdf = spark.range(10).withColumn("rnd", F.rand()).toPandas()
pdf

Tags: python, pandas, apache-spark, pyspark, databricks

ibarbo
