There are quite a lot of posts about how to partition a dataframe/RDD to improve performance. My question is much simpler: what's the most direct way to show the partitioner of a dataframe? Judging by the name, I'd guess df.rdd.partitioner would return the partitioner; however, it always returns None:
df = spark.createDataFrame([("A", 1), ("B", 2), ("A", 3), ("C", 1)], ["k", "v"]).repartition("k")
df.rdd.partitioner  # None
One way I've found to see the partitioner is to read the output of df.explain(). However, that prints quite a lot of other info (the whole physical plan). Is there a more direct way to show just the partitioner of a dataframe/RDD?
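For reference, the workaround looks something like this (the helper name show_partitioning is my own; it just filters the text that df.explain() prints, it is not a dedicated API):

import contextlib, io

def show_partitioning(df):
    # Capture the plan that df.explain() prints to stdout and keep only
    # the lines that mention partitioning (e.g. Exchange hashpartitioning(...)).
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain()
    for line in buf.getvalue().splitlines():
        if "partitioning" in line.lower():
            print(line.strip())

show_partitioning(df)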
As suggested in the comment above (by mayank agrawal), we can use the queryExecution object to get some insights.
If we do not have a table, we can use:
df._jdf.queryExecution().executedPlan().prettyJson()  # the full physical plan, as JSON
df._jdf.queryExecution().sparkPlan().outputPartitioning().prettyJson()  # just the plan's output partitioning
whichever fits our goal.
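If JSON is too verbose, the same py4j-wrapped objects can also be rendered with a plain toString() call (a generic Java method exposed through py4j, not a dedicated PySpark API); a minimal sketch:

# toString() gives a one-line summary of the output partitioning,
# something like "hashpartitioning(k#7, 200)" after df.repartition("k").
print(df._jdf.queryExecution().sparkPlan().outputPartitioning().toString())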
Or, if we have a Hive table, we could also do something like this:
# The qualified table name comes from the logical plan;
# listColumns expects (tableName, dbName).
table = df._jdf.queryExecution().logical().tableName()  # e.g. "mydb.mytable"
db_name, table_name = table.split(".")
for col in spark.catalog.listColumns(table_name, db_name):
    if col.isBucket:
        print(f"bucketed by {col.name}")