 

Mongos memory usage in constant augmentation

Tags:

mongodb

We chose to deploy the mongos router in the same VM as our applications, but we're running into some issues where the application gets OOM Killed because the mongos eats up a lot more RAM than we'd expect / want to.

After a reboot, the mongos footprint is a bit under 2GB, but from there it constantly requires more memory, about 500MB per week. It went up to 4.5+GB.

[Graph: mongos memory usage over the past two weeks]

These are the stats for one of our mongos over the past 2 weeks, and it clearly looks like it's leaking memory...

So my question is: how do we investigate such behavior? We haven't really been able to find explanations for why the router might require more RAM, how to diagnose the behavior, or even how to set a memory usage limit on the mongos.

With db.serverStatus() on the mongos we can see the allocations:

    "tcmalloc" : {
        "generic" : {
            "current_allocated_bytes" : 536925728,
            "heap_size" : NumberLong("2530185216")
        },
        "tcmalloc" : {
            "pageheap_free_bytes" : 848211968,
            "pageheap_unmapped_bytes" : 213700608,
            "max_total_thread_cache_bytes" : NumberLong(1073741824),
            "current_total_thread_cache_bytes" : 819058352,
            "total_free_bytes" : 931346912,
            "central_cache_free_bytes" : 108358128,
            "transfer_cache_free_bytes" : 3930432,
            "thread_cache_free_bytes" : 819058352,
            "aggressive_memory_decommit" : 0,
            "pageheap_committed_bytes" : NumberLong("2316484608"),
            "pageheap_scavenge_count" : 35286,
            "pageheap_commit_count" : 64660,
            "pageheap_total_commit_bytes" : NumberLong("28015460352"),
            "pageheap_decommit_count" : 35286,
            "pageheap_total_decommit_bytes" : NumberLong("25698975744"),
            "pageheap_reserve_count" : 513,
            "pageheap_total_reserve_bytes" : NumberLong("2530185216"),
            "spinlock_total_delay_ns" : NumberLong("38522661660"),
            "release_rate" : 1
        }
    },
------------------------------------------------
MALLOC:      536926304 (  512.1 MiB) Bytes in use by application
MALLOC: +    848211968 (  808.9 MiB) Bytes in page heap freelist
MALLOC: +    108358128 (  103.3 MiB) Bytes in central cache freelist
MALLOC: +      3930432 (    3.7 MiB) Bytes in transfer cache freelist
MALLOC: +    819057776 (  781.1 MiB) Bytes in thread cache freelists
MALLOC: +     12411136 (   11.8 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   2328895744 ( 2221.0 MiB) Actual memory used (physical + swap)
MALLOC: +    213700608 (  203.8 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   2542596352 ( 2424.8 MiB) Virtual address space used
MALLOC:
MALLOC:         127967              Spans in use
MALLOC:             73              Thread heaps in use
MALLOC:           4096              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.

But I can't say it's really helpful, at least to me.
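To make those numbers a bit more digestible, here is a minimal shell snippet (field names taken from the serverStatus output above) that pulls out the figures that matter: what mongos itself has allocated, what tcmalloc is merely caching in its freelists, and what has already been handed back to the OS. A large, growing gap between heap_size and current_allocated_bytes points at allocator caching rather than a true leak.

    // Run against the mongos. Pulls the relevant tcmalloc counters out of
    // serverStatus; a big gap between heapSize and allocatedByApp means the
    // memory is sitting in allocator caches, not actively used by mongos.
    var t = db.serverStatus().tcmalloc;
    printjson({
        allocatedByApp: t.generic.current_allocated_bytes,
        heapSize:       t.generic.heap_size,
        freeInCaches:   t.tcmalloc.total_free_bytes,
        unmappedToOS:   t.tcmalloc.pageheap_unmapped_bytes
    });

    // Optionally ask tcmalloc to hand freed pages back to the OS more eagerly.
    // tcmallocAggressiveMemoryDecommit is a documented setParameter, but check
    // that it is available on your exact version before relying on it.
    db.adminCommand({ setParameter: 1, tcmallocAggressiveMemoryDecommit: 1 });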

In the server stats we can also see that the number of calls to killCursors is quite high (2909015), but I'm not sure how that would explain the steady increase in memory usage: cursors are automatically killed after about 30 seconds, and the number of calls made to the mongos stays pretty much steady throughout the period.

So yeah, any idea on how to diagnose this / where to look / what to look for?

Mongos version: 4.0.19

Edit: it seems our monitoring is based on virtual rather than resident memory, so the graph might not be very relevant. However, we still ended up with 4+ GB of resident memory at some point.
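For what it's worth, the mongos can report its own resident and virtual sizes, which sidesteps the monitoring question; a minimal check using the serverStatus mem section (values in MB):

    // Resident vs. virtual size as reported by the mongos itself, in MB.
    // "resident" is what the OOM killer acts on; "virtual" also counts pages
    // tcmalloc has released back to the OS, so it overstates real pressure.
    var m = db.serverStatus().mem;
    print("resident: " + m.resident + " MB, virtual: " + m.virtual + " MB");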

Asked Sep 01 '25 by Tiller


1 Answer

Why would the router require more memory?

If any query in the sharded cluster requires a scatter-gather, the merging of results from the shards is done by the mongos itself.

For example, say I am running the query db.collectionName.find({ something: 1 }).

If the something field is not the shard key itself, then by default the query does a scatter-gather; use explain() to check the query plan. It scatter-gathers because the mongos consults the config server and realizes it has no routing information for that field. (This applies to sharded collections.)
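As a sketch of what that check looks like (collectionName and something are the placeholder names from the example above):

    // Explain the query; if "something" is not (part of) the shard key, the
    // winning plan fans out to every shard and merges on the mongos.
    db.collectionName.find({ something: 1 }).explain("queryPlanner");

    // In the output, look for:
    //   queryPlanner.winningPlan.stage == "SHARD_MERGE"  -> mongos merges results
    //   one entry per shard under winningPlan.shards     -> query was broadcast
    // A query that includes the shard key shows "SINGLE_SHARD" instead.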

To make things worse, if you have sort operations where an index cannot be used, the sort also has to be done on the mongos itself. A sort has to hold on to memory until it has gathered all the pages for the result set before it can actually sort them (think of the best possible Big O for a sorting operation over that volume of data), and until it finishes, that memory stays blocked for the operation.
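If that is what you find, an index on the sort field is usually the cleanest fix, since each shard can then return its results already in order and the mongos only performs a cheap streaming merge (again using the placeholder names from above):

    // Let the shards do the sorting: with an index on the sort field, each
    // shard streams sorted results and the merge on the mongos stays cheap.
    db.collectionName.createIndex({ something: 1 });

    // Verify the new plan no longer needs an in-memory sort:
    db.collectionName.find({}).sort({ something: 1 }).explain("queryPlanner");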

What should you do?

Based on your slowms setting (the default is 100 ms), check the logs and look at the slow queries in the system. If you see a lot of SHARD_MERGE stages and in-memory sorts taking place, then you have your culprit right there.
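As a hedged sketch of how to tune that threshold on the mongos itself: you cannot enable the profiler on a mongos, but from 4.0 onward it should accept profiling level 0 with a slowms value, which only controls which operations get logged as slow (verify against the docs for your exact version):

    // mongos does not support the profiler, but level 0 with slowms should
    // (from 4.0 on) adjust the slow-operation logging threshold.
    db.setProfilingLevel(0, { slowms: 100 });

    // Confirm the threshold currently in effect:
    db.getProfilingStatus();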

And as a quick fix, increase the available swap and make sure your settings are appropriate.

All the best.

Answered Sep 03 '25 by Gandalf the White