I am studying Spark and I have some doubts regarding the Executor memory split. Specifically, the Apache Spark documentation (here) states that:
Java Heap space is divided in to two regions Young and Old. The Young generation is meant to hold short-lived objects while the Old generation is intended for objects with longer lifetimes.
But for the Spark Executor there is another, abstract split of the memory, as stated in the Apache Spark documentation (here):
Memory usage in Spark largely falls under one of two categories: execution and storage. Execution memory refers to that used for computation in shuffles, joins, sorts and aggregations, while storage memory refers to that used for caching and propagating internal data across the cluster. In Spark, execution and storage share a unified region (M).
I don't understand how the Young Gen/Old Gen split overlaps with the storage/execution memory split, because in the same documentation (again, here) it is stated that:
spark.memory.fraction expresses the size of M as a fraction of the (JVM heap space - 300MiB) (default 0.6). The rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.
Here spark.memory.fraction represents the execution/storage memory share of the Java heap.
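To make sure I have the arithmetic right, here is my understanding of that split for a hypothetical 4 GiB executor heap (just a sketch, the numbers are made up):

```scala
// Back-of-the-envelope split for a hypothetical 4 GiB executor heap (sizes are illustrative).
val heapMiB     = 4096L                // spark.executor.memory = 4g
val reservedMiB = 300L                 // fixed reserved memory
val fraction    = 0.6                  // spark.memory.fraction (default)

val unifiedMiB = ((heapMiB - reservedMiB) * fraction).toLong  // execution + storage (M) ~= 2277 MiB
val userMiB    = heapMiB - reservedMiB - unifiedMiB           // user memory ~= 1519 MiB
println(s"unified (M): $unifiedMiB MiB, user: $userMiB MiB")
```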
But the same documentation also says:
If the OldGen is close to being full, reduce the amount of memory used for caching by lowering spark.memory.fraction; it is better to cache fewer objects than to slow down task execution.
This seems to suggest that the OldGen is in fact the User Memory, but the following statement seems to contradict my hypothesis:
If the OldGen is close to being full, alternatively, consider decreasing the size of the Young generation.
What am I not seeing?
How is the Young Gen/Old Gen split related to the Spark memory fraction / User Memory?
The short answer is that they're not really related beyond both having to do with the JVM heap.
The better way to think of this is that there are four buckets (numbered in no significant order):

1. Spark memory that currently lives in the Young generation
2. Spark memory that currently lives in the Old generation
3. User memory that currently lives in the Young generation
4. User memory that currently lives in the Old generation
(Technically there's also some system memory that's neither Spark nor User, but it's typically small enough not to worry about; it too can be either old or young.)
Whether an object is classed as Spark or User is decided by Spark (I actually don't know if this is an eternal designation or if objects can change their categorization in this respect).
As for old vs. young, this is managed by the garbage collector and the GC can and will promote objects from young to old. In some GC algorithms, the sizes of the generations are dynamically adjusted (or they use fixed size regions and a given region can be old or young).
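If you want to watch that promotion happening on your executors, one option is to turn on GC logging through the executor JVM options. This is only a sketch; the flags below assume a pre-Java 9 JVM (on newer JVMs -Xlog:gc* replaces them):

```scala
import org.apache.spark.SparkConf

// GC logging makes young -> old promotions visible in the executor logs.
// These flags are for pre-Java 9 JVMs; on Java 9+ use -Xlog:gc* instead.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
```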
You have control over the aggregate capacity of 1+2, 3+4, 1+3, and 2+4, but you don't really have, and probably don't really want, control over the capacity of 1, 2, 3, or 4 individually, because there's a lot of benefit in being able to use excess space in one category to temporarily get more space in another.
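To make that concrete, here is a sketch of which knob moves which aggregate (the sizes are placeholders, not tuning advice): spark.executor.memory sets the whole heap (1+2+3+4), spark.memory.fraction splits Spark vs. user memory (1+2 vs. 3+4), and the young-generation size, e.g. -Xmn passed through spark.executor.extraJavaOptions, splits young vs. old (1+3 vs. 2+4):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: sizes are placeholders, not recommendations.
val spark = SparkSession.builder()
  .appName("memory-buckets-sketch")
  .config("spark.executor.memory", "4g")                // whole executor heap: buckets 1+2+3+4
  .config("spark.memory.fraction", "0.6")               // Spark share of (heap - 300 MiB): 1+2 vs 3+4
  .config("spark.executor.extraJavaOptions", "-Xmn1g")  // young-generation size: 1+3 vs 2+4
  .getOrCreate()
```

In practice these are usually passed with spark-submit --conf rather than hard-coded, and note that -Xmx must not go into spark.executor.extraJavaOptions; Spark derives the maximum heap size from spark.executor.memory.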