I have a big pyspark dataframe which I am performing a number of transformations on and joining with other dataframes. I would like to check whether the transformations and joins succeed and whether the dataframe looks as intended, but how can I show a small subset of the dataframe?
I have tried numerous things e.g.
df.show(5)
and
df.limit(5).show()
but everything I try kicks off a large number of jobs, resulting in slow performance. I could spin up a very large cluster, but is there a way of getting only a small subset of the dataframe, fast?
Try take() on the RDD underlying the dataframe:
rdd_df = df.rdd
rdd_df.take(5)
Or try printing the dataframe schema, which inspects metadata only and triggers no job at all:
df.printSchema()