 

What is overhead of Java Native Memory Tracking in "summary" mode?

Tags:

java

memory

jvm

I'm wondering what the real/typical overhead is when NMT is enabled via -XX:NativeMemoryTracking=summary (the full set of options I'm after is -XX:+UnlockDiagnosticVMOptions -XX:NativeMemoryTracking=summary -XX:+PrintNMTStatistics).

I could not find much information anywhere, whether on SO, in blog posts, or in the official docs. The docs say:

Note: Enabling NMT causes a 5%-10% performance overhead.

But they do not say which mode is expected to have this performance overhead (both summary and detail?) or what this overhead really is (CPU, memory, ...). In the Native Memory Tracking guide they claim:

Enabling NMT will result in a 5-10 percent JVM performance drop, and memory usage for NMT adds 2 machine words to all malloc memory as a malloc header. NMT memory usage is also tracked by NMT.

But again, is this true for both summary and detail mode?

What I'm after is basically whether it's safe to add -XX:NativeMemoryTracking=summary permanently for a production app (similar to continuous JFR recording) and what the potential costs are. So far, when testing this on our app, I didn't spot a difference, but it's difficult to be sure.

Is there an authoritative source of information containing more details about this performance overhead? Does somebody have experience with enabling this permanently for production apps?

Juraj Martinka asked Dec 07 '25 01:12


1 Answer

Disclaimer: I base my answer on JDK 18. Most of what I write should be valid for older releases. When in doubt, you need to measure for yourself.

Background:

NMT tracks hotspot VM memory usage as well as memory allocated via direct ByteBuffers. Basically, it hooks into malloc/free and mmap/munmap calls and does accounting.

It does not track other JDK memory usage (outside the hotspot VM) or usage by third-party libraries. That matters here, because it makes NMT behavior somewhat predictable: hotspot tries to avoid fine-grained allocations via malloc and instead relies on custom memory managers like Arenas, the code heap, or Metaspace.

Therefore, for most applications, malloc/mmap calls from hotspot are not that "hot", and the total number of mallocs is not that large.

I'll focus on malloc/free for the rest of this write-up, since they far outnumber mmap/munmap calls:

Memory cost:

  1. (detail+summary): 2 words per malloc()
  2. (detail only): Malloc site table
  3. (detail+summary): mapping lists for vm mappings and thread stacks

Here, (1) completely dwarfs (2) and (3). Therefore the memory overhead between summary and detail mode is not significant.

Note that even (1) may not matter that much, since the underlying libc allocator already dictates a minimum allocation size that may be larger than (pure allocation size + 16-byte malloc header). For a sense of scale: on 64-bit, two machine words are 16 bytes, so one million live mallocs add at most ~16 MB of headers. How much of the NMT memory overhead actually translates into RSS increase therefore needs to be measured.

How much total memory overhead this means cannot be answered in general, since JVM memory costs are usually dominated by heap size, so comparing RSS with NMT cost is almost meaningless. But just to give an example: for spring petclinic with a 1 GB pretouched heap, the NMT memory overhead is about 0.5%.

In my experience, NMT memory overhead only matters in pathological situations or in corner cases that cause the JVM to do tons of fine-grained allocations, e.g. an insane amount of class loading. But often these are exactly the cases where you want NMT enabled, to see what's going on.

Performance cost:

NMT does need some synchronization. In summary mode, it atomically increments counters on each malloc/free.

In detail mode, it does a lot more:

  • capture the call stack at the malloc site
  • look up that call stack in a hash map
  • increment the hash map entry's counters
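The two modes' bookkeeping can be sketched roughly like this (a simplified Java analogy for illustration; the real implementation is C++ inside hotspot, and all names here are made up):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Simplified analogy of NMT bookkeeping, not hotspot's actual code.
public class NmtBookkeepingSketch {

    // summary mode: a pair of atomic counters (per memory category)
    static final LongAdder mallocBytes = new LongAdder();
    static final LongAdder mallocCount = new LongAdder();

    // detail mode: additionally a concurrent site table keyed by call stack
    static final Map<String, LongAdder> mallocSites = new ConcurrentHashMap<>();

    // summary mode work per malloc: two atomic updates
    static void trackSummary(long size) {
        mallocBytes.add(size);
        mallocCount.increment();
    }

    // detail mode work per malloc: summary accounting plus capturing the
    // call stack (expensive) and updating its entry in the site table
    static void trackDetail(long size) {
        trackSummary(size);
        String callStack = Arrays.toString(new Throwable().getStackTrace());
        mallocSites.computeIfAbsent(callStack, k -> new LongAdder()).add(size);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1_000; i++) {
            trackDetail(64); // pretend we malloced 64 bytes
        }
        System.out.println(mallocCount.sum() + " mallocs, "
                + mallocBytes.sum() + " bytes, "
                + mallocSites.size() + " distinct sites");
    }
}
```

Even in this toy version you can see why detail mode costs more per call: the stack walk and the site-table update dominate the two plain counter increments.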

This requires more cycles. The hash map is lock-free, but it is still modified with atomic operations. That looks expensive, especially if hotspot does many mallocs from different threads. How bad is it really?

Worst case example

Micro-benchmark: 64 million malloc allocations (via Unsafe.allocateMemory()) done by 100 concurrent threads on a 24-core machine:

NMT off:        6   secs
NMT summary:    34  secs
NMT detail:     46  secs

That looks insane. However, it may not matter much in practice, since this is not a real-life workload.
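A scaled-down sketch of such a microbenchmark might look like the following. It drives allocations through ByteBuffer.allocateDirect(), which is malloced by the JVM and tracked by NMT, instead of Unsafe.allocateMemory() (which needs extra access flags on modern JDKs); the class name and parameters are made up, and absolute numbers will differ from the ones above:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hammer native allocations from many threads and time the total.
// Run once per NMT setting and compare the printed times:
//   java NmtBench
//   java -XX:NativeMemoryTracking=summary NmtBench
//   java -XX:NativeMemoryTracking=detail NmtBench
public class NmtBench {
    static final int THREADS = 8;                 // scale to your core count
    static final int ALLOCS_PER_THREAD = 100_000;
    static final int ALLOC_SIZE = 64;             // bytes per allocation

    public static long runMillis() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        long start = System.nanoTime();
        List<Future<?>> futures = new ArrayList<>();
        for (int t = 0; t < THREADS; t++) {
            futures.add(pool.submit(() -> {
                for (int i = 0; i < ALLOCS_PER_THREAD; i++) {
                    // each direct buffer is a native malloc under the hood
                    ByteBuffer.allocateDirect(ALLOC_SIZE);
                }
            }));
        }
        for (Future<?> f : futures) f.get();      // propagate any failure
        pool.shutdown();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("elapsed: " + runMillis() + " ms");
    }
}
```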

Spring petclinic bootup, average of ten runs:

NMT off:               3.79 secs
NMT summary:           3.79 secs  (+0%)
NMT detail:            3.91 secs  (+3%)

So, here, not so bad: the cost of summary mode disappeared into test noise.

Renaissance, philosophers benchmark

An interesting example, since this benchmark does a lot of synchronization, which leads to many object monitors being inflated, and inflated monitors are malloced:

Average benchmark scores:

NMT off:               4697
NMT summary:           4599  (-2%)
NMT detail:            4190  (-11%)

Somewhat in the middle between the two other examples.
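The kind of contended synchronization that inflates monitors is easy to reproduce; here is a minimal sketch (class name made up) that forces lock contention and hence heavyweight, natively allocated ObjectMonitors:

```java
// Contended synchronized blocks force hotspot to inflate the lock into a
// heavyweight ObjectMonitor, which is allocated natively and therefore
// contributes to what NMT tracks. With NMT enabled, you can compare
// "jcmd <pid> VM.native_memory summary" output before and after.
public class MonitorInflationDemo {
    static final Object lock = new Object();
    static long counter = 0;

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 100_000; j++) {
                    synchronized (lock) {  // contention inflates the monitor
                        counter++;
                    }
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        System.out.println("counter = " + counter); // 4 * 100_000 = 400000
    }
}
```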

Conclusion

There is no clear answer.

Both the memory and the performance cost depend on how many allocations the JVM does.

This number is small for normal, well-behaved applications, but it may be large in pathological situations (e.g. JVM bugs) as well as in corner cases caused by user programs, e.g. lots of class loading or synchronization. To be really sure, you need to measure for yourself.

Thomas Stuefe answered Dec 10 '25 23:12