Anyone know how to display a pandas dataframe in Databricks?

Previously I had a pandas dataframe that I could display as a table in Databricks using:

df.display()

Pandas was updated to v2.0.0 today, and I now get the following error when I run df.display():

AttributeError: 'DataFrame' object has no attribute 'iteritems'

Anyone know how I can resolve this?

I tried running df.display (without parentheses). It produces output, but not in the tabular form I am looking for.

ibarbo asked Nov 15 '25 21:11


2 Answers

As a workaround, downgrade to pandas 1.5:

%pip install --upgrade pandas==1.5
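To confirm which pandas version the notebook actually picked up after the install, a quick check along these lines may help. This is only a sketch: lacks_iteritems is a hypothetical helper name, and it assumes that every 2.x release lacks iteritems, which matches the 2.0.0 changelog quoted below.

```python
from importlib.metadata import PackageNotFoundError, version


def lacks_iteritems(pandas_version: str) -> bool:
    """Return True for pandas 2.x and later, where iteritems was removed.

    Hypothetical helper for illustration; it only inspects the major version.
    """
    major = int(pandas_version.split(".")[0])
    return major >= 2


try:
    installed = version("pandas")
    print(f"pandas {installed}, iteritems removed: {lacks_iteritems(installed)}")
except PackageNotFoundError:
    print("pandas is not installed in this environment")
```

If this still reports a 2.x version after the downgrade, the environment is probably still using the old install.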

The answers posted so far worked until April 3, 2023.

As of April 4, with pandas 2.0.0 installed, you can no longer convert a pandas DataFrame to a Spark DataFrame using:

spark.createDataFrame(df)

Using the above command leads to the error mentioned in the question:

AttributeError: 'DataFrame' object has no attribute 'iteritems'

The iteritems method was removed in pandas 2.0.0. From the pandas 2.0.0 changelog:

Removed deprecated Series.iteritems(), DataFrame.iteritems(), use obj.items instead

However, the PySpark code that converts a pandas DataFrame to a Spark DataFrame still calls iteritems:

/databricks/spark/python/pyspark/sql/pandas/conversion.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    308                     warnings.warn(msg)
    309                     raise
--> 310         data = self._convert_from_pandas(data, schema, timezone)
    311         return self._create_dataframe(data, schema, samplingRatio, verifySchema)
    312 

/databricks/spark/python/pyspark/sql/pandas/conversion.py in _convert_from_pandas(self, pdf, schema, timezone)
    340                             pdf[field.name] = s
    341             else:
--> 342                 for column, series in pdf.iteritems():
    343                     s = _check_series_convert_timestamps_tz_local(series, timezone)
    344                     if s is not series:

Looks like we will have to wait for a fix to use Pandas 2.0.0.
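If downgrading is not an option, one interim workaround is to alias the removed method back to items before calling spark.createDataFrame. The sketch below assumes that the missing iteritems attribute is the only incompatibility in the conversion path, which I have not verified for every Spark version. restore_iteritems is a hypothetical helper name, not part of pandas or PySpark.

```python
def restore_iteritems(cls):
    """Alias the removed `iteritems` to `items` on `cls` if it is missing.

    Hypothetical helper, not part of pandas or PySpark. Classes that
    still define `iteritems` (pandas 1.x) are left untouched.
    """
    if not hasattr(cls, "iteritems") and hasattr(cls, "items"):
        cls.iteritems = cls.items
    return cls


# Intended use in a notebook running pandas 2.x (assumed, not run here):
#   import pandas as pd
#   restore_iteritems(pd.DataFrame)
#   restore_iteritems(pd.Series)
#   sdf = spark.createDataFrame(pdf)
```

This patches the pandas classes for the current Python session only, so it has to run before the conversion in every new session, and it should be dropped once the Spark side no longer calls iteritems.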

Zee answered Nov 17 '25 10:11


Pass the pandas DataFrame to the display function as an argument, rather than calling display as a method of the DataFrame:

display(pdf)


Alternatively, put the variable holding the pandas DataFrame on the last line of a cell; it is then rendered using pandas' built-in representation:

import pyspark.sql.functions as F

pdf = spark.range(10).withColumn("rnd", F.rand()).toPandas()
pdf  # last expression in the cell, so the notebook renders it


Alex Ott answered Nov 17 '25 12:11


