oom-killer killed my app when there was 9GB available memory

One of our production servers experienced an oom-killer event at 2020-12-26 20:05:30, and I cannot figure out the cause. The Zabbix graph showed about 9 GB of available memory at the moment the oom-killer killed my Java process. Please help me figure out why this happened, thank you. [Zabbix screenshot: available-memory graph]

Here is the log from /var/log/messages (there were a lot of "kernel: TCP: too many orphaned sockets" messages before this):

Dec 26 20:05:30 myhost kernel: systemd invoked oom-killer: gfp_mask=0x3000d0, order=2, oom_score_adj=0
Dec 26 20:05:30 myhost kernel: systemd cpuset=/ mems_allowed=0
Dec 26 20:05:30 myhost kernel: CPU: 4 PID: 1 Comm: systemd Not tainted 3.10.0-1062.9.1.el7.x86_64 #1
Dec 26 20:05:30 myhost kernel: Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/01/2014
Dec 26 20:05:30 myhost kernel: Call Trace:
Dec 26 20:05:30 myhost kernel: [<ffffffffae37ac23>] dump_stack+0x19/0x1b
Dec 26 20:05:30 myhost kernel: [<ffffffffae375ce9>] dump_header+0x90/0x229
Dec 26 20:05:30 myhost kernel: [<ffffffffadd06142>] ? ktime_get_ts64+0x52/0xf0
Dec 26 20:05:30 myhost kernel: [<ffffffffaddc1714>] oom_kill_process+0x254/0x3e0
Dec 26 20:05:30 myhost kernel: [<ffffffffadd32e71>] ? cpuset_mems_allowed_intersects+0x21/0x30
Dec 26 20:05:30 myhost kernel: [<ffffffffaddc11bd>] ? oom_unkillable_task+0xcd/0x120
Dec 26 20:05:30 myhost kernel: [<ffffffffaddc1266>] ? find_lock_task_mm+0x56/0xc0
Dec 26 20:05:30 myhost kernel: [<ffffffffaddc1f66>] out_of_memory+0x4b6/0x4f0
Dec 26 20:05:30 myhost kernel: [<ffffffffaddc8a6f>] __alloc_pages_nodemask+0xacf/0xbe0
Dec 26 20:05:30 myhost kernel: [<ffffffffadc98f6d>] copy_process+0x1dd/0x1a50
Dec 26 20:05:30 myhost kernel: [<ffffffffadc9a991>] do_fork+0x91/0x330
Dec 26 20:05:30 myhost kernel: [<ffffffffae388a26>] ? trace_do_page_fault+0x56/0x150
Dec 26 20:05:30 myhost kernel: [<ffffffffadc9acb6>] SyS_clone+0x16/0x20
Dec 26 20:05:30 myhost kernel: [<ffffffffae38e2b4>] stub_clone+0x44/0x70
Dec 26 20:05:30 myhost kernel: [<ffffffffae38dede>] ? system_call_fastpath+0x25/0x2a
Dec 26 20:05:30 myhost kernel: Mem-Info:
Dec 26 20:05:30 myhost kernel: active_anon:13196525 inactive_anon:589 isolated_anon:35#012 active_file:950 inactive_file:1072 isolated_file:93#012 unevictable:0 dirty:0 writeback:30 unstable:0#012 slab_reclaimable:209382 slab_unreclaimable:237991#012 mapped:612 shmem:724 pagetables:36600 bounce:0#012 free:1721704 free_pcp:226 free_cma:0
Dec 26 20:05:30 myhost kernel: Node 0 DMA free:15904kB min:16kB low:20kB high:24kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Dec 26 20:05:30 myhost kernel: lowmem_reserve[]: 0 2810 63223 63223
Dec 26 20:05:30 myhost kernel: Node 0 DMA32 free:242744kB min:3000kB low:3748kB high:4500kB active_anon:1444208kB inactive_anon:24kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3111608kB managed:2878104kB mlocked:0kB dirty:0kB writeback:0kB mapped:20kB shmem:40kB slab_reclaimable:571840kB slab_unreclaimable:419836kB kernel_stack:65344kB pagetables:2092kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Dec 26 20:05:30 myhost kernel: lowmem_reserve[]: 0 0 60412 60412
Dec 26 20:05:30 myhost kernel: Node 0 Normal free:6628168kB min:64560kB low:80700kB high:96840kB active_anon:51341892kB inactive_anon:2332kB active_file:3800kB inactive_file:4288kB unevictable:0kB isolated(anon):140kB isolated(file):372kB present:62914560kB managed:61862736kB mlocked:0kB dirty:0kB writeback:120kB mapped:2428kB shmem:2856kB slab_reclaimable:265688kB slab_unreclaimable:532128kB kernel_stack:167552kB pagetables:144308kB unstable:0kB bounce:0kB free_pcp:904kB local_pcp:104kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Dec 26 20:05:30 myhost kernel: lowmem_reserve[]: 0 0 0 0
Dec 26 20:05:30 myhost kernel: Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15904kB
Dec 26 20:05:30 myhost kernel: Node 0 DMA32: 8570*4kB (UEM) 6220*8kB (UEM) 1155*16kB (UEM) 528*32kB (UEM) 647*64kB (UEM) 268*128kB (UEM) 76*256kB (UEM) 31*512kB (EM) 12*1024kB (UEM) 0*2048kB 0*4096kB = 242744kB
Dec 26 20:05:30 myhost kernel: Node 0 Normal: 1280224*4kB (EM) 188481*8kB (EM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6628744kB
Dec 26 20:05:30 myhost kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Dec 26 20:05:30 myhost kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Dec 26 20:05:30 myhost kernel: 3295 total pagecache pages
Dec 26 20:05:30 myhost kernel: 0 pages in swap cache
Dec 26 20:05:30 myhost kernel: Swap cache stats: add 0, delete 0, find 0/0
Dec 26 20:05:30 myhost kernel: Free swap  = 0kB
Dec 26 20:05:30 myhost kernel: Total swap = 0kB
Dec 26 20:05:30 myhost kernel: 16510540 pages RAM
Dec 26 20:05:30 myhost kernel: 0 pages HighMem/MovableOnly
Dec 26 20:05:30 myhost kernel: 321353 pages reserved
Dec 26 20:05:30 myhost kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Dec 26 20:05:30 myhost kernel: [  532]     0   532    24185      100      53        0             0 systemd-journal
Dec 26 20:05:30 myhost kernel: [  555]     0   555    11277      235      22        0         -1000 systemd-udevd
Dec 26 20:05:30 myhost kernel: [  788]     0   788    13882      114      27        0         -1000 auditd
Dec 26 20:05:30 myhost kernel: [  820]     0   820    58999     1724      26        0             0 CmsGoAgent.linu
Dec 26 20:05:30 myhost kernel: [  836]   999   836   153093     1667      59        0             0 polkitd
Dec 26 20:05:30 myhost kernel: [  837]    81   837    14597      203      32        0          -900 dbus-daemon
Dec 26 20:05:30 myhost kernel: [  885]     0   885     6703      199      18        0             0 systemd-logind
Dec 26 20:05:30 myhost kernel: [  892]     0   892     6477       53      18        0             0 atd
Dec 26 20:05:30 myhost kernel: [  894]     0   894    31593      202      20        0             0 crond
Dec 26 20:05:30 myhost kernel: [  896]     0   896    27527       34      10        0             0 agetty
Dec 26 20:05:30 myhost kernel: [  898]     0   898    27527       34      11        0             0 agetty
Dec 26 20:05:30 myhost kernel: [  948]     0   948   832809    11158     156        0             0 exe
Dec 26 20:05:30 myhost kernel: [ 1512]     0  1512    25724      515      48        0             0 dhclient
Dec 26 20:05:30 myhost kernel: [ 1587]     0  1587    73789     4313      97        0             0 salt-minion
Dec 26 20:05:30 myhost kernel: [ 1588]     0  1588   143550     3328      95        0             0 tuned
Dec 26 20:05:30 myhost kernel: [ 1592]     0  1592    29699     3019      20        0             0 node_exporter
Dec 26 20:05:30 myhost kernel: [ 1595]     0  1595   259453      450     342        0             0 rsyslogd
Dec 26 20:05:30 myhost kernel: [ 1756]     0  1756    28230      261      56        0         -1000 sshd
Dec 26 20:05:30 myhost kernel: [ 1825]     0  1825   285329    12268     162        0             0 salt-minion
Dec 26 20:05:30 myhost kernel: [ 1827]     0  1827    95073     5869     100        0             0 salt-minion
Dec 26 20:05:30 myhost kernel: [ 2149]  1001  2149  1572340   335768     806        0             0 java
Dec 26 20:05:30 myhost kernel: [11492]   997 11492    20486      213      40        0             0 zabbix_agentd
Dec 26 20:05:30 myhost kernel: [11493]   997 11493    20512      534      40        0             0 zabbix_agentd
Dec 26 20:05:30 myhost kernel: [11494]   997 11494    20542      261      41        0             0 zabbix_agentd
Dec 26 20:05:30 myhost kernel: [11495]   997 11495    20542      266      41        0             0 zabbix_agentd
Dec 26 20:05:30 myhost kernel: [11496]   997 11496    20542      261      41        0             0 zabbix_agentd
Dec 26 20:05:30 myhost kernel: [11497]   997 11497    20542      261      41        0             0 zabbix_agentd
Dec 26 20:05:30 myhost kernel: [11498]   997 11498    20542      279      41        0             0 zabbix_agentd
Dec 26 20:05:30 myhost kernel: [11499]   997 11499    20542      261      41        0             0 zabbix_agentd
Dec 26 20:05:30 myhost kernel: [11500]   997 11500    20542      261      41        0             0 zabbix_agentd
Dec 26 20:05:30 myhost kernel: [11501]   997 11501    20542      261      41        0             0 zabbix_agentd
Dec 26 20:05:30 myhost kernel: [11502]   997 11502    20542      262      41        0             0 zabbix_agentd
Dec 26 20:05:30 myhost kernel: [11503]   997 11503    20542      261      41        0             0 zabbix_agentd
Dec 26 20:05:30 myhost kernel: [11504]   997 11504    20520      241      41        0             0 zabbix_agentd
Dec 26 20:05:30 myhost kernel: [ 8595]     0  8595     8132      372      19        0             0 AliYunDunUpdate
Dec 26 20:05:30 myhost kernel: [ 9080]     0  9080    44295    11441      88        0             0 AliYunDun
Dec 26 20:05:30 myhost kernel: [58297]    38 58297     7491      158      19        0             0 ntpd
Dec 26 20:05:30 myhost kernel: [21537]  1001 21537    28726      472      12        0             0 app_watchdog.sh
Dec 26 20:05:30 myhost kernel: [21600]  1001 21600  3817967   615094    1819        0             0 java
Dec 26 20:05:30 myhost kernel: [29546]  1001 29546    28726      471      12        0             0 app_watchdog.sh
Dec 26 20:05:30 myhost kernel: [32310]  1001 32310    28726      471      12        0             0 app_watchdog.sh
Dec 26 20:05:30 myhost kernel: [32360]  1001 32360  4532386  1934665    5006        0             0 java
Dec 26 20:05:30 myhost kernel: [37926]  1001 37926    28726      471      13        0             0 app_watchdog.sh
Dec 26 20:05:30 myhost kernel: [37965]  1001 37965  3686958   387894    1320        0             0 java
Dec 26 20:05:30 myhost kernel: [40686]  1001 40686    28726      471      11        0             0 app_watchdog.sh
Dec 26 20:05:30 myhost kernel: [40735]  1001 40735  4597914  1960388    5065        0             0 java
Dec 26 20:05:30 myhost kernel: [41223]  1001 41223    28726      471      13        0             0 app_watchdog.sh
Dec 26 20:05:30 myhost kernel: [41468]  1001 41468  4596501  1862812    4877        0             0 java
Dec 26 20:05:30 myhost kernel: [43679]  1001 43679    28726      471      13        0             0 app_watchdog.sh
Dec 26 20:05:30 myhost kernel: [43836]  1001 43836  4596108  1924911    4999        0             0 java
Dec 26 20:05:30 myhost kernel: [45896]  1001 45896    28726      471      13        0             0 app_watchdog.sh
Dec 26 20:05:30 myhost kernel: [45988]  1001 45988  4595346  1896739    4951        0             0 java
Dec 26 20:05:30 myhost kernel: [56357]  1001 56357  4535046  1940786    5046        0             0 java
Dec 26 20:05:30 myhost kernel: [ 3720]     0  3720   201068     1471      13        0             0 aliyun-service
Dec 26 20:05:30 myhost kernel: [ 3822]     0  3822     4472      122      13        0             0 assist_daemon
Dec 26 20:05:30 myhost kernel: [28130]     0 28130    45614      263      46        0             0 crond
Dec 26 20:05:30 myhost kernel: [37497]  1001 37497    28325       91      11        0             0 app_log_bak.sh
Dec 26 20:05:30 myhost kernel: [37502]  1001 37502    28325       93      11        0             0 app_log_bak.sh
Dec 26 20:05:30 myhost kernel: [37506]  1001 37506    43958      188      41        0             0 scp
Dec 26 20:05:30 myhost kernel: [37507]  1001 37507    44682      252      43        0             0 ssh
Dec 26 20:05:30 myhost kernel: [54640]  1001 54640    28325       95      11        0             0 app_log_bak.sh
Dec 26 20:05:30 myhost kernel: [54645]  1001 54645    28325       97      11        0             0 app_log_bak.sh
Dec 26 20:05:30 myhost kernel: [54649]  1001 54649    43958      188      40        0             0 scp
Dec 26 20:05:30 myhost kernel: [54650]  1001 54650    44682      262      41        0             0 ssh
Dec 26 20:05:30 myhost kernel: [ 6703]  1001  6703    28325       99      11        0             0 app_log_bak.sh
Dec 26 20:05:30 myhost kernel: [ 6708]  1001  6708    28325      101      11        0             0 app_log_bak.sh
Dec 26 20:05:30 myhost kernel: [ 6711]  1001  6711    43958      200      42        0             0 scp
Dec 26 20:05:30 myhost kernel: [ 6713]  1001  6713    44682      253      43        0             0 ssh
Dec 26 20:05:30 myhost kernel: [11353]  1001 11353    28325      101      11        0             0 app_log_bak.sh
Dec 26 20:05:30 myhost kernel: [11358]  1001 11358    28325      103      11        0             0 app_log_bak.sh
Dec 26 20:05:30 myhost kernel: [11361]  1001 11361    43958      188      41        0             0 scp
Dec 26 20:05:30 myhost kernel: [11362]  1001 11362    44682      248      44        0             0 ssh
Dec 26 20:05:30 myhost kernel: [23855]     0 23855    45614      263      46        0             0 crond
Dec 26 20:05:30 myhost kernel: [23858]  1001 23858    28325       88      12        0             0 app_log_bak.sh
Dec 26 20:05:30 myhost kernel: Out of memory: Kill process 40735 (java) score 121 or sacrifice child
Dec 26 20:05:30 myhost kernel: Killed process 40735 (java), UID 1001, total-vm:18391656kB, anon-rss:7841552kB, file-rss:0kB, shmem-rss:0kB

host info: Linux myhost 3.10.0-1062.9.1.el7.x86_64 #1

free -lm gives:
              total        used        free      shared  buff/cache   available
Mem:          63239       41399         354           2       21484       21317
Low:          63239       62884         354
High:             0           0           0
Swap:             0           0           0

Swap is disabled intentionally on this host.

asked by Allen Xu


1 Answer

It seems that your system was suffering from heavy memory fragmentation, probably because you're running an old kernel without any swap.

The important lines are as follows:

systemd invoked oom-killer: gfp_mask=0x3000d0, order=2, oom_score_adj=0
...
Node 0 Normal: 1280224*4kB (EM) 188481*8kB (EM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6628744kB

The last digit of gfp_mask is zero, so this allocation has to be satisfied from the Normal memory zone. The order=2 tells us that the amount to be allocated is pagesize * 2^order, that is 4096 bytes * 2^2 = 16 kB.

The "buddyinfo" line (you can check the current status with cat /proc/buddyinfo) that tells how much free memory you have with different block sizes. It turns out that you had 1280224 blocks of 4 kB each, 188481 blocks of 8 kB each and zero bigger blocks. The request was for 16 kB block and you had zero available so the system entered OOM mode. (Your system is probably running some software that causes heavy fragmentation because this is not the usual state with so much total free memory.)

In addition, the total free RAM in the Normal zone was 6628744 kB (roughly 6.3 GiB), not the 9 GB you were led to believe. The difference is probably caused by your memory-statistics logger, which cannot properly capture short spikes in memory usage.

If you had even a small amount of swap, I think the kernel could have used it to swap pages around and merge some of those small fragments into bigger blocks, but you removed that possibility by having no swap at all. I'd recommend a small zram swap area so that performance is not limited by an SSD or HDD if swapping does happen.
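
A minimal sketch of setting up such a zram swap device, assuming your kernel ships the zram module (a stock CentOS 7 kernel may not; the 2G size and lz4 algorithm here are just example choices):

modprobe zram num_devices=1
echo lz4 > /sys/block/zram0/comp_algorithm   # fall back to lzo if lz4 is not listed
echo 2G > /sys/block/zram0/disksize          # uncompressed size of the swap area
mkswap /dev/zram0
swapon -p 100 /dev/zram0                     # high priority: use zram before any disk-backed swap

On newer distributions a packaged solution such as zram-generator or zram-tools does the same thing at boot.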

As an alternative, you could try running a more modern kernel that has support for memory compaction (CONFIG_COMPACTION=y), and you could configure your system as follows:

- /proc/sys/vm/extfrag_threshold: at least 500, preferably more
- /proc/sys/vm/watermark_scale_factor: at least 100, preferably more
- /proc/sys/vm/zone_reclaim_mode: at least 1, and I'd recommend 3 or 7 (note that if you're running hard realtime you cannot change this, but then you shouldn't be running out of RAM either)
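
As a sketch, those values could be applied and persisted with sysctl roughly like this (the numbers are just the starting points suggested above, the file name is an example, and vm.watermark_scale_factor only exists on kernels 4.6 and newer):

# apply immediately
sysctl -w vm.extfrag_threshold=500
sysctl -w vm.watermark_scale_factor=100
sysctl -w vm.zone_reclaim_mode=3

# persist across reboots
cat > /etc/sysctl.d/90-vm-fragmentation.conf <<'EOF'
vm.extfrag_threshold = 500
vm.watermark_scale_factor = 100
vm.zone_reclaim_mode = 3
EOF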

You could also consider setting /proc/sys/vm/watermark_boost_factor to a value greater than zero but read the documentation first.

The virtual memory manager configuration options are documented at https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/sysctl/vm.rst (ignore the header that claims it documents version 2.6.29 only; it actually describes the latest Linux kernel).

Or, if this only happens after the system has been running for a long time, the fragmentation could be kept down by a cron job that periodically runs, as root, echo 1 > /proc/sys/vm/compact_memory.
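
For example, a root cron entry along these lines would trigger compaction once an hour (the file name and schedule are arbitrary):

# /etc/cron.d/compact-memory
0 * * * * root echo 1 > /proc/sys/vm/compact_memory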

If the software you're running supports huge pages, you probably want to set /proc/sys/vm/nr_overcommit_hugepages so that up to about 90% of your RAM can be handed out as huge pages. In practice, on Linux you can simply run

echo "$(getconf _PHYS_PAGES) * 0.9 / 512" | bc > /proc/sys/vm/nr_overcommit_hugepages

because huge pages are 2 MB each, which is 512 times bigger than the default 4 kB page, and you want to set the limit to 90% of the physical pages available. Use of huge pages for processes that need lots of shared RAM (e.g. PostgreSQL) reduces memory fragmentation and reduces the RAM wasted on page tables. Note that this is a runtime change, so you have to redo it after each reboot, before starting the services that use huge pages.
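
To check afterwards that the limit took effect and that huge pages are actually being handed out, something like this works; the field names come from /proc/meminfo, and surplus pages only show up while a process is really using them:

cat /proc/sys/vm/nr_overcommit_hugepages
grep -E 'HugePages_(Total|Free|Surp)|Hugepagesize' /proc/meminfo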

answered by Mikko Rantalainen


