I have a big pyspark dataframe which I am performing a number of transformations on and joining with other dataframes. I would like to check whether the transformations and joins succeed and whether the dataframe looks as intended, but how can I show a small subset of the dataframe?
I have tried numerous things e.g.
df.show(5)
and
df.limit(5).show()
but everything I try kicks off a large number of jobs, resulting in slow performance. I could spin up a very large cluster, but is there a way of getting only a small subset of the dataframe, fast?
Try take() on the RDD underlying the dataframe:
rdd_df = df.rdd
rdd_df.take(5)
Or try printing the dataframe schema, which inspects metadata only and triggers no job at all:
df.printSchema()