
show() subset of big dataframe pyspark

I have a big pyspark dataframe on which I am performing a number of transformations and joins with other dataframes. I would like to check whether the transformations and joins succeed and whether the dataframe looks as intended, but how can I show a small subset of the dataframe?

I have tried numerous things e.g.

df.show(5)

and

df.limit(5).show()

but everything I try triggers a large number of Spark jobs, resulting in slow performance. I could spin up a very large cluster, but is there a way of getting only a small subset of the dataframe, fast?

Martin Petri Bagger asked Nov 14 '25 19:11

1 Answer

Try the RDD equivalent of the dataframe:

 rdd_df = df.rdd
 rdd_df.take(5)

Or, try printing the dataframe schema:

 df.printSchema()
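As a minimal sketch of both suggestions (assuming pyspark is installed and a local session can be started; the example dataframe and the `subset-demo` app name are illustrative):

 from pyspark.sql import SparkSession

 # Local session for demonstration; on a real cluster this
 # would already exist
 spark = (SparkSession.builder
          .master("local[1]")
          .appName("subset-demo")
          .getOrCreate())

 # A stand-in for the big dataframe
 df = spark.createDataFrame(
     [(i, f"name_{i}") for i in range(100)],
     ["id", "name"],
 )

 # take(5) on the underlying RDD pulls only a handful of rows
 # back to the driver
 rows = df.rdd.take(5)
 print(rows)

 # printSchema() inspects column names and types without
 # scanning any data
 df.printSchema()

 spark.stop()

Note that `printSchema()` only confirms the structure of the result, not its contents, so it is a cheap first check before fetching any rows.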
Seif answered Nov 17 '25 09:11


