Spark cache() doesn't work when used with repartition()

The Spark cache() function, when used together with repartition(), doesn't cache the DataFrame. Can anyone explain why this happens?

Edit:

df.repartition(1000).cache()
df.count()

I have tried doing these on separate lines, and that works.

Edit:

df2 = df1.repartition(1000)
df2.cache()
df2.count()

I expected the DataFrame to be cached, but I can't see it under Storage in the Spark UI.

asked Sep 03 '25 by Goutham Panneeru

1 Answer

DataFrames are immutable, just like RDDs. Although you call repartition on df, you never assign the result to a new variable, so df itself does not change.
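A minimal sketch of that immutability (the DataFrame df and the partition counts here are illustrative assumptions):

val df2 = df.repartition(1000)    // repartition returns a NEW DataFrame
println(df.rdd.getNumPartitions)  // original df is unchanged
println(df2.rdd.getNumPartitions) // 1000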

df.repartition(1000).cache()
df.count()

The above won't work: cache() is lazy and marks the new, repartitioned DataFrame, but that DataFrame is never referenced again, and df.count() executes the original plan.

df.repartition(1000)
df.cache()
df.count()

For the above code, if you check the Storage tab, it won't show 1000 cached partitions. Storage will show the number of cached partitions as df.rdd.getNumPartitions (which will not be 1000), because the repartitioned DataFrame from the first line is discarded and df.cache() caches the original partitioning.
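To make this concrete, here is a self-contained sketch (the SparkSession setup and the row count are assumptions for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cache-vs-repartition")
  .master("local[4]")
  .getOrCreate()

val df = spark.range(1000000).toDF("id")

df.repartition(1000)  // result is discarded
df.cache()            // caches the ORIGINAL df, with its original partitioning
df.count()            // materializes the cache

// Storage will report this many cached partitions, not 1000:
println(df.rdd.getNumPartitions)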

So try this.

val df1 = df.repartition(1000).cache()
df1.count()

This should work: df1 references the repartitioned DataFrame, so count() materializes exactly that plan into the cache, and Storage will show 1000 partitions.
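As a quick check (a sketch, reusing df1 from the snippet above), you can confirm both the partition count and the storage level:

println(df1.rdd.getNumPartitions) // 1000
println(df1.storageLevel)         // MEMORY_AND_DISK, the default for cache()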

answered Sep 05 '25 by msrv499