Spark cache() doesn't work when used with repartition()

The Spark cache() function, when used together with repartition(), doesn't cache the DataFrame. Can anyone explain why this happens?

Edit:

df.repartition(1000).cache()
df.count()

I have tried doing these on separate lines, and that works.

Edit:

df2 = df1.repartition(1000)
df2.cache()
df2.count()

I expected the DataFrame to be cached, but I can't see it under Storage in the Spark UI.

asked Sep 03 '25 by Goutham Panneeru

1 Answer

DataFrames are immutable, just like RDDs. Although you call repartition on df, you never assign the result to a new variable, so df itself does not change.
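A minimal sketch of that immutability (the DataFrame df and the partition counts here are illustrative assumptions):

val df2 = df.repartition(1000)    // repartition returns a NEW DataFrame
println(df.rdd.getNumPartitions)  // original df is unchanged
println(df2.rdd.getNumPartitions) // 1000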

df.repartition(1000).cache()
df.count()

The above won't work: cache() is lazy and marks the new, repartitioned DataFrame, but that DataFrame is never referenced again, and df.count() executes the original plan.

df.repartition(1000)
df.cache()
df.count()

For the above code, if you check the Storage tab, it won't show 1000 cached partitions. Storage will show the number of cached partitions as df.rdd.getNumPartitions (which will not be 1000), because the repartitioned DataFrame from the first line is discarded and df.cache() caches the original partitioning.
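To make this concrete, here is a self-contained sketch (the SparkSession setup and the row count are assumptions for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cache-vs-repartition")
  .master("local[4]")
  .getOrCreate()

val df = spark.range(1000000).toDF("id")

df.repartition(1000)  // result is discarded
df.cache()            // caches the ORIGINAL df, with its original partitioning
df.count()            // materializes the cache

// Storage will report this many cached partitions, not 1000:
println(df.rdd.getNumPartitions)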

So try this.

val df1 = df.repartition(1000).cache()
df1.count()

This should work: df1 references the repartitioned DataFrame, so count() materializes exactly that plan into the cache, and Storage will show 1000 partitions.
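As a quick check (a sketch, reusing df1 from the snippet above), you can confirm both the partition count and the storage level:

println(df1.rdd.getNumPartitions) // 1000
println(df1.storageLevel)         // MEMORY_AND_DISK, the default for cache()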

answered Sep 05 '25 by msrv499