One Hour One Life Forums

a multiplayer game of parenting and civilization building

#1 2019-04-08 23:08:05

jasonrohrer
Administrator
Registered: 2017-02-13
Posts: 4,805

Help with Linux Kernel OOM-killer issue

The main web server Linode has been up for 361 days now.

Over the past few days, MySQL keeps getting killed.  Tracing this through the syslog shows it's a kernel issue.  The "out of memory" killer is invoked by some other process (probably when a malloc fails), and then the "worst offender" is killed off to free up memory.  The worst offender is MySQL.
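
For reference, pulling these events out of the syslog is just a grep (the path may differ on other distros):

grep -i oom-killer /var/log/syslog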

Here's the relevant portion of the syslog:

Apr  8 07:08:23 localhost kernel: php-fpm5.6 invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Apr  8 07:08:23 localhost kernel: php-fpm5.6 cpuset=/ mems_allowed=0
Apr  8 07:08:23 localhost kernel: CPU: 0 PID: 21202 Comm: php-fpm5.6 Not tainted 4.15.12-x86_64-linode105 #1
Apr  8 07:08:23 localhost kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
Apr  8 07:08:23 localhost kernel: Call Trace:
Apr  8 07:08:23 localhost kernel: dump_stack+0x5c/0x7c
Apr  8 07:08:23 localhost kernel: dump_header+0x78/0x28f
Apr  8 07:08:23 localhost kernel: ? do_try_to_free_pages+0x2c8/0x340
Apr  8 07:08:23 localhost kernel: oom_kill_process+0x217/0x430
Apr  8 07:08:23 localhost kernel: ? oom_badness+0xeb/0x160
Apr  8 07:08:23 localhost kernel: out_of_memory+0xf9/0x4a0
Apr  8 07:08:23 localhost kernel: __alloc_pages_nodemask+0xe7b/0xfa0
Apr  8 07:08:23 localhost kernel: filemap_fault+0x3d8/0x4e0
Apr  8 07:08:23 localhost kernel: ext4_filemap_fault+0x2c/0x40
Apr  8 07:08:23 localhost kernel: __do_fault+0x20/0x60
Apr  8 07:08:23 localhost kernel: __handle_mm_fault+0x529/0xb10
Apr  8 07:08:23 localhost kernel: handle_mm_fault+0xf1/0x1b0
Apr  8 07:08:23 localhost kernel: __do_page_fault+0x222/0x440
Apr  8 07:08:23 localhost kernel: ? async_page_fault+0x2f/0x50
Apr  8 07:08:23 localhost kernel: async_page_fault+0x45/0x50
Apr  8 07:08:23 localhost kernel: RIP: 0000:          (null)
Apr  8 07:08:23 localhost kernel: RSP: 0008:00007f2b63ea0c80 EFLAGS: 00000008
Apr  8 07:08:23 localhost kernel: Mem-Info:
Apr  8 07:08:23 localhost kernel: active_anon:796978 inactive_anon:159595 isolated_anon:0
Apr  8 07:08:23 localhost kernel: active_file:164 inactive_file:444 isolated_file:0
Apr  8 07:08:23 localhost kernel: unevictable:0 dirty:0 writeback:4 unstable:0
Apr  8 07:08:23 localhost kernel: slab_reclaimable:3905 slab_unreclaimable:13480
Apr  8 07:08:23 localhost kernel: mapped:2059 shmem:13278 pagetables:9311 bounce:0
Apr  8 07:08:23 localhost kernel: free:6813 free_pcp:171 free_cma:0
Apr  8 07:08:23 localhost kernel: Node 0 active_anon:3187912kB inactive_anon:638380kB active_file:656kB inactive_file:1776kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:8236kB dirty:0kB writeback:16kB shmem:53112kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Apr  8 07:08:23 localhost kernel: Node 0 DMA free:15700kB min:28kB low:40kB high:52kB active_anon:200kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Apr  8 07:08:23 localhost kernel: lowmem_reserve[]: 0 2958 3918 3918
Apr  8 07:08:23 localhost kernel: Node 0 DMA32 free:9660kB min:6036kB low:9064kB high:12092kB active_anon:2716344kB inactive_anon:259360kB active_file:616kB inactive_file:540kB unevictable:0kB writepending:0kB present:3129208kB managed:3035628kB mlocked:0kB kernel_stack:256kB pagetables:16276kB bounce:0kB free_pcp:344kB local_pcp:0kB free_cma:0kB
Apr  8 07:08:23 localhost kernel: lowmem_reserve[]: 0 0 959 959
Apr  8 07:08:23 localhost kernel: Node 0 Normal free:1892kB min:1956kB low:2936kB high:3916kB active_anon:471368kB inactive_anon:379020kB active_file:172kB inactive_file:1504kB unevictable:0kB writepending:16kB present:1048576kB managed:982204kB mlocked:0kB kernel_stack:3456kB pagetables:20968kB bounce:0kB free_pcp:340kB local_pcp:36kB free_cma:0kB
Apr  8 07:08:23 localhost kernel: lowmem_reserve[]: 0 0 0 0
Apr  8 07:08:23 localhost kernel: Node 0 DMA: 1*4kB (U) 2*8kB (UM) 2*16kB (UM) 1*32kB (M) 2*64kB (U) 1*128kB (U) 2*256kB (UM) 1*512kB (M) 2*1024kB (UM) 2*2048kB (UM) 2*4096kB (ME) = 15700kB
Apr  8 07:08:23 localhost kernel: Node 0 DMA32: 139*4kB (UME) 31*8kB (UME) 7*16kB (UME) 12*32kB (UME) 2*64kB (UE) 2*128kB (ME) 1*256kB (M) 2*512kB (UE) 1*1024kB (E) 3*2048kB (UME) 0*4096kB = 10132kB
Apr  8 07:08:23 localhost kernel: Node 0 Normal: 95*4kB (UME) 63*8kB (UME) 25*16kB (ME) 18*32kB (ME) 6*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2244kB
Apr  8 07:08:23 localhost kernel: 15338 total pagecache pages
Apr  8 07:08:23 localhost kernel: 1397 pages in swap cache
Apr  8 07:08:23 localhost kernel: Swap cache stats: add 827133, delete 825718, find 256168518/256297980
Apr  8 07:08:23 localhost kernel: Free swap  = 0kB
Apr  8 07:08:23 localhost kernel: Total swap = 262140kB
Apr  8 07:08:23 localhost kernel: 1048444 pages RAM
Apr  8 07:08:23 localhost kernel: 0 pages HighMem/MovableOnly
Apr  8 07:08:23 localhost kernel: 40009 pages reserved
Apr  8 07:08:23 localhost kernel: [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Apr  8 07:08:23 localhost kernel: [ 2001]     0  2001    10490       12   114688      187         -1000 systemd-udevd
Apr  8 07:08:23 localhost kernel: [ 2262]   100  2262    25081        0    98304       69             0 systemd-timesyn
Apr  8 07:08:23 localhost kernel: [ 3357]   104  3357    64099      238   135168      203             0 rsyslogd
Apr  8 07:08:23 localhost kernel: [ 3368]     0  3368     7252       30   106496       43             0 cron
Apr  8 07:08:23 localhost kernel: [ 3371]   106  3371    10726       49   131072       68          -900 dbus-daemon
Apr  8 07:08:23 localhost kernel: [ 3387]     0  3387    68988       28   180224      247             0 accounts-daemon
Apr  8 07:08:23 localhost kernel: [ 3390]     0  3390     7149       25    94208       68             0 systemd-logind
Apr  8 07:08:23 localhost kernel: [ 3458]     0  3458     4868       21    86016       35             0 irqbalance
Apr  8 07:08:23 localhost kernel: [ 3470]     0  3470    16380       30   159744      153         -1000 sshd
Apr  8 07:08:23 localhost kernel: [ 3487]     0  3487     3985        0    81920       39             0 agetty
Apr  8 07:08:23 localhost kernel: [32394]    33 32394     6314        0    94208       65             0 fcgiwrap
Apr  8 07:08:23 localhost kernel: [17069]     0 17069    80728       31   499712     1371             0 php-fpm5.6
Apr  8 07:08:23 localhost kernel: [18162]     0 18162    31276       13   204800      353             0 nginx
Apr  8 07:08:23 localhost kernel: [18163]    33 18163    31486      326   217088      242             0 nginx
Apr  8 07:08:23 localhost kernel: [18164]    33 18164    31452      281   217088      248             0 nginx
Apr  8 07:08:23 localhost kernel: [20792]     0 20792     3938        0    73728       37             0 agetty
Apr  8 07:08:23 localhost kernel: [18908]  1000 18908    11640        0   131072      535             0 systemd
Apr  8 07:08:23 localhost kernel: [18909]  1000 18909    73579        1   327680      803             0 (sd-pam)
Apr  8 07:08:23 localhost kernel: [15755]  1000 15755     6823        0    90112      152             0 screen
Apr  8 07:08:23 localhost kernel: [15756]  1000 15756     5686        2    94208      510             0 bash
Apr  8 07:08:23 localhost kernel: [ 5907]     0  5907     8819      751   106496       35             0 systemd-journal
Apr  8 07:08:23 localhost kernel: [28873]     0 28873    80793     7710   274432     3783             0 fail2ban-server
Apr  8 07:08:23 localhost kernel: [18206]    33 18206    81309      410   491520     1107             0 php-fpm5.6
Apr  8 07:08:23 localhost kernel: [18216]   109 18216   392175    61919   847872        0             0 mysqld
Apr  8 07:08:23 localhost kernel: [18228]    33 18228    81309      426   491520     1107             0 php-fpm5.6
Apr  8 07:08:23 localhost kernel: [18349]    33 18349    81309      404   491520     1112             0 php-fpm5.6
Apr  8 07:08:23 localhost kernel: [21202]    33 21202   100166     2182   516096      752             0 php-fpm5.6
Apr  8 07:08:23 localhost kernel: [21507]    33 21507    81334     1722   507904      804             0 php-fpm5.6
Apr  8 07:08:23 localhost kernel: [21537]    33 21537    81599     1772   507904      916             0 php-fpm5.6
Apr  8 07:08:23 localhost kernel: [21755]     0 21755   139449    23246   577536        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21756]     0 21756   153627    23078   577536        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21757]   108 21757    16716      181   167936        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21758]   108 21758    16716      181   159744        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21759]     0 21759   150708    20202   569344        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21760]     0 21760   150355    19851   552960        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21761]     0 21761   151130    20601   565248        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21762]     0 21762   151785    21263   569344        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21763]   108 21763    16716      184   172032        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21764]   108 21764    16716      183   163840        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21765]   108 21765    16716      183   167936        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21766]   108 21766    16716      182   167936        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21767]     0 21767   133974    19850   548864        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21768]     0 21768   152204    21670   589824        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21769]   108 21769    16716      182   167936        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21770]     0 21770   151789    21273   557056        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21771]     0 21771   147800    17267   536576        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21772]     0 21772   132580    18458   540672        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21773]   108 21773    16716      181   163840        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21774]   108 21774    16716      185   163840        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21775]     0 21775   152993    20439   557056        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21776]     0 21776   131888    17735   540672        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21777]     0 21777   148967    18462   528384        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21778]     0 21778   115787    18045   528384        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21779]   108 21779    16716      180   159744        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21780]     0 21780   148964    18458   540672        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21781]     0 21781   148269    17735   540672        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21782]     0 21782   126594    12472   491520        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21783]     0 21783   148964    18458   548864        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21784]     0 21784   132171    18048   532480        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21785]   108 21785    16716      181   167936        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21786]   108 21786    16716      181   167936        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21787]     0 21787   117587    19847   544768        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21788]     0 21788   133655    17484   528384        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21789]     0 21789   148269    17757   536576        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21790]   108 21790    16716      182   159744        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21791]     0 21791   117587    19848   561152        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21797]     0 21797   116196    18455   544768        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21798]     0 21798   148272    17739   544768        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21799]     0 21799   126594    12470   491520        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21800]     0 21800   132171    18047   536576        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21801]     0 21801   148555    18049   552960        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21802]   108 21802    16716      181   163840        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21803]   108 21803    16716      182   163840        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21804]   108 21804    16716      182   159744        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21805]   108 21805    16716      184   163840        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21806]   108 21806    16716      180   167936        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21807]     0 21807   147798    17264   524288        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21808]     0 21808   148269    17735   536576        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21809]     0 21809   148555    18051   536576        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21810]     0 21810   115030    17263   524288        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21811]     0 21811   147990    17480   536576        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21812]     0 21812   148964    18458   532480        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21813]     0 21813   150355    19851   557056        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21814]     0 21814   131416    17266   516096        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21815]     0 21815   148269    17736   540672        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21816]   108 21816    16716      180   159744        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21817]   108 21817    16716      182   163840        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21818]   108 21818    16716      182   167936        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21819]   108 21819    16716      182   163840        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21820]   108 21820    16716      181   167936        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21821]   108 21821    16716      181   163840        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21822]   108 21822    16716      182   163840        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21823]     0 21823   131606    17476   532480        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21824]     0 21824   147798    17264   520192        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21825]   108 21825    16716      181   172032        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21826]   108 21826    16716      183   163840        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21827]   108 21827    16716      182   159744        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21828]   108 21828    16716      181   163840        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21829]     0 21829   148555    18050   544768        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21830]     0 21830   148557    18052   544768        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21831]     0 21831   147990    17481   540672        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21832]   108 21832    16716      182   155648        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21833]   108 21833    16716      181   167936        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21834]   108 21834    16716      183   163840        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21835]   108 21835    16716      180   163840        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21836]     0 21836   133463    17267   528384        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21837]   108 21837    16716      181   163840        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21838]   108 21838    16716      180   163840        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21839]   108 21839    16716      182   167936        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21840]   108 21840    16716      181   159744        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21846]   108 21846    16716      183   163840        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21847]   108 21847    16716      182   159744        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21848]     0 21848   116196    18455   540672        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21849]   108 21849    16716      180   163840        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21850]   108 21850    16716      180   163840        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21851]   108 21851    16716      182   159744        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21852]   108 21852    16716      182   172032        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21858]   108 21858    16716      180   163840        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21869]   108 21869    16716      182   172032        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21880]     0 21880   148555    18049   544768        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21886]   108 21886    16716      179   167936        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21917]     0 21917   146217    15684   528384        0             0 sshd
Apr  8 07:08:23 localhost kernel: [21933]   108 21933    16716      180   163840        0             0 sshd
Apr  8 07:08:23 localhost kernel: [22084]     0 22084    80726       32   438272     1366             0 php-fpm5.6
Apr  8 07:08:23 localhost kernel: Out of memory: Kill process 18216 (mysqld) score 57 or sacrifice child
Apr  8 07:08:23 localhost kernel: Killed process 18216 (mysqld) total-vm:1568700kB, anon-rss:247676kB, file-rss:0kB, shmem-rss:0kB
Apr  8 07:08:23 localhost kernel: oom_reaper: reaped process 18216 (mysqld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Now, the strange thing here is that "top" shows over 3GB of physical RAM free:

top - 23:05:54 up 361 days, 17:47,  4 users,  load average: 0.94, 0.86, 1.15
Tasks: 103 total,   2 running,  57 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.0 us,  0.5 sy,  0.0 ni, 89.3 id,  3.0 wa,  0.0 hi,  0.3 si,  5.9 st
KiB Mem :  4033740 total,  2986724 free,   480340 used,   566676 buff/cache
KiB Swap:   262140 total,     3032 free,   259108 used.  3375656 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND      
 4174 mysql     20   0 1567668 240608  15152 S   1.3  6.0   2:12.68 mysqld       
 4435 root      20   0  645116 131496   9676 S   0.0  3.3   0:00.67 sshd         
 4464 jcr15     20   0  645116 125392   3572 S   0.0  3.1   0:00.67 sshd         
 5959 www-data  20   0  400732  25040  20164 S   0.0  0.6   0:07.50 php-fpm5.6   
 5891 www-data  20   0  400644  24900  20096 S   0.3  0.6   0:09.92 php-fpm5.6   
28873 root      20   0  327524  24344   1728 S   0.0  0.6   6:07.68 fail2ban-se+ 
 5960 www-data  20   0  400648  24124  19832 S   1.0  0.6   0:07.95 php-fpm5.6   
 4194 www-data  20   0  325236  10808   9632 S   0.0  0.3   0:00.99 php-fpm5.6   
 4190 www-data  20   0  325236  10072   8804 S   0.0  0.2   0:01.09 php-fpm5.6   
 4177 www-data  20   0  325236   9800   8584 S   0.0  0.2   0:01.09 php-fpm5.6   
 4465 jcr15     20   0   22756   5392   3336 S   0.0  0.1   0:00.12 bash         
18163 www-data  20   0  125972   5084   3904 S   0.0  0.1   1088:45 nginx        
32761 root      20   0  526832   5048   1532 S   0.0  0.1   0:04.82 sshd         
18164 www-data  20   0  125856   4956   3948 S   0.0  0.1   1087:49 nginx        
    1 root      20   0  194088   4548   2632 S   0.0  0.1  35:45.52 systemd      
  516 root      20   0   22992   4216   2780 S   0.0  0.1   0:00.83 bash         
 4923 jcr15     20   0   41880   3664   3108 R   0.0  0.1   0:04.39 top          
 5907 root      20   0   35276   3664   3528 S   0.0  0.1   1:20.91 systemd-jou+ 
17069 root      20   0  322912   1996   1888 S   0.0  0.0  39:54.32 php-fpm5.6   
 3390 root      20   0   28596   1944   1780 S   0.0  0.0   7:20.14 systemd-log+ 
  466 root      20   0   38104   1812      0 S   0.0  0.0   0:00.14 systemd      
  467 root      20   0  302560   1748      0 S   0.0  0.0   0:00.00 (sd-pam)     
 3387 root      20   0  275952   1548   1016 S   0.0  0.0  27:58.69 accounts-da+ 
 3357 syslog    20   0  256396   1484    572 S   0.0  0.0  28:48.69 rsyslogd     
 3470 root      20   0   65520   1404   1288 S   0.0  0.0   8:15.21 sshd         
 3368 root      20   0   29008   1152   1036 S   0.0  0.0   7:58.40 cron         
 3371 message+  20   0   42904    912    692 S   0.0  0.0  16:59.82 dbus-daemon  
 2001 root      20   0   41960    436    388 S   0.0  0.0   0:43.46 systemd-ude+ 
18908 jcr15     20   0   46560    252      0 S   0.0  0.0   0:00.18 systemd      
 3458 root      20   0   19472     84      0 S   0.0  0.0  15:22.29 irqbalance   

I'm no expert in reading through the OOM-killer log.... just digging into it myself.  Anyone see anything in there that's strange?  At first glance, it appears to be complaining that it's out of swap.... but I can't figure out why it's out of physical RAM, or what's consuming the most space there.
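
For anyone poking at the same question, the quick checks I know of are something like this (GNU ps, sorting by resident set size):

free -m
ps aux --sort=-rss | head -20

The first splits out swap vs. physical, and the second lists the biggest resident processes.  Neither explains the mismatch above, though.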

#2 2019-04-08 23:14:55

BlueDiamondAvatar
Member
Registered: 2018-11-19
Posts: 322

Re: Help with Linux Kernel OOM-killer issue

I'm not a programmer, but....

It's a bit suspicious that you tried to change the way family names are passed down, and 1) it isn't working, 2) it's causing problems with cursing people after a last name is lost, and 3) something terrible is going on with the database system.

Is there any possibility that there's a connection between the way you changed the last name records and this mysql error???

Or can we assume it is some error that was introduced by the most recent update?


--Blue Diamond

I aim to leave behind a world that is easier for people to live in than it was before I got there.

#3 2019-04-08 23:20:22

jasonrohrer
Administrator
Registered: 2017-02-13
Posts: 4,805

Re: Help with Linux Kernel OOM-killer issue

No, I don't think this has anything to do with that.

The last name change was in the server code, which is per-game-server, and not even running on this main web server.

#4 2019-04-08 23:20:58

jasonrohrer
Administrator
Registered: 2017-02-13
Posts: 4,805

Re: Help with Linux Kernel OOM-killer issue

The fact that there are 95 sshd processes running during this out-of-memory event is troubling...  it seems like a DDoS attack of some kind.

There are normally 4 sshd processes running.
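
For reference, a quick way to count them live (pgrep ships with procps basically everywhere):

pgrep -c sshd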

#5 2019-04-09 00:22:50

jasonrohrer
Administrator
Registered: 2017-02-13
Posts: 4,805

Re: Help with Linux Kernel OOM-killer issue

Yeah, so every time this happens, there's an overload of sshd processes, and I'm pretty sure this is what's causing the server to run out of RAM.

I'm implementing a work-around so that I can still connect via ssh, but other people won't be able to overload it with spam connection requests like this.

#6 2019-04-09 00:33:50

BlueDiamondAvatar
Member
Registered: 2018-11-19
Posts: 322

Re: Help with Linux Kernel OOM-killer issue

Ok, sounds like you are on a path to a solution.  Glad to hear it!


Hope it works out quickly, and you have an awesome vacation with your family.


--Blue Diamond

I aim to leave behind a world that is easier for people to live in than it was before I got there.

#7 2019-04-09 02:03:40

jasonrohrer
Administrator
Registered: 2017-02-13
Posts: 4,805

Re: Help with Linux Kernel OOM-killer issue

Yes, the sshd overload was the problem.  When someone opens an ssh connection, sshd will just hang there for up to 120 seconds while they try to log in.

If enough bots do this, and leave the connections hanging, these connections build up, with one sshd process each, until the system runs out of RAM.

The fix is pretty simple:  the sshd setting MaxStartups, which is described here:

https://stackoverflow.com/questions/311 … shd-config

I'm using a setting of 10:30:30, which means:

After 10 unauthenticated connections are hanging, start dropping 30% of new connections at random, increasing the drop rate linearly until 100% of new connections are dropped once 30 unauthenticated connections are hanging (the hard limit).

The probabilistic dropping (as opposed to just a hard cap) lets the real people (me) have a chance at getting through, even during a temporary overload.

And we'll never get anywhere near 95 sshd processes.  30 max (not including the authenticated ones that are really me).
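
For anyone who wants to replicate this, it's a one-line change in /etc/ssh/sshd_config, plus a reload.  (LoginGraceTime is a separate knob, shown here for completeness: it shortens the 120-second default that an unauthenticated connection is allowed to hang.)

MaxStartups 10:30:30
LoginGraceTime 60

# apply without dropping existing sessions; the service is called "ssh" on Debian/Ubuntu
sudo systemctl reload sshd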

The OOM-killer was also killing my open ssh connections, so I kept getting disconnected from my own server while I was trying to debug this issue.

#8 2019-04-09 02:04:41

jasonrohrer
Administrator
Registered: 2017-02-13
Posts: 4,805

Re: Help with Linux Kernel OOM-killer issue

And for reference, the last OOM-killer event was here:

Apr  9 00:43:03 localhost kernel: emacs invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0

Here's hoping it was the last one.

#9 2019-04-09 06:07:07

jasonrohrer
Administrator
Registered: 2017-02-13
Posts: 4,805

Re: Help with Linux Kernel OOM-killer issue

Ugh... even with this limit on sshd, it happened again:

Apr  9 04:24:25 localhost kernel: php-fpm5.6 invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0

During this OOM incident, MySQL was killed (again), and there were 60 sshd processes running, which is the limit (30 max incoming ssh connections, two processes each).

HOWEVER, I'm wondering if something else is going on here, and the sshd overflow is just a symptom, rather than a cause.  Maybe the sshd connections pile up WHEN the system is out of RAM, as a side-effect.

Maybe this php module that is triggering the OOM is allocating too much RAM for some reason.  Maybe there's a giant forum post that's being fetched or something.

However, going back in the syslog to the very first OOM event:

Apr  5 22:33:08 localhost kernel: sshd invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0

So it seems like sshd itself can trigger it too....

I've turned down the max ssh connections to 10 (20 processes max, instead of 60), and we'll see if that helps.
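
In config terms, that's the third MaxStartups field; something along these lines (the first two numbers here are placeholders, not the exact values I used):

MaxStartups 5:30:10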

#10 2019-04-09 14:16:33

jasonrohrer
Administrator
Registered: 2017-02-13
Posts: 4,805

Re: Help with Linux Kernel OOM-killer issue

Well, after reducing the sshd cap to 10 (down from 30), we made it through the night without any OOM-killer events.

However, it seems like memory fragmentation is likely the real issue here.  We are not actually out of physical memory, but we are out of swap, mostly due to fragmentation.  This is a 64-bit machine, but this might be relevant anyway:

https://serverfault.com/questions/56406 … -situation

So maybe.... after 361 days of uptime.... it's time for a reboot?
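
A related check that doesn't need debugfs: /proc/buddyinfo lists, per zone, how many free blocks of each order are left (order 0 = 4kB, up to order 10 = 4MB).  Lots of free order-0 pages but zeros at the high orders is the classic fragmentation signature, and it matches the "Node 0 Normal" line in the OOM dump above, which shows nothing free above 64kB:

cat /proc/buddyinfo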

#11 2019-04-09 14:23:46

jasonrohrer
Administrator
Registered: 2017-02-13
Posts: 4,805

Re: Help with Linux Kernel OOM-killer issue

sudo cat /sys/kernel/debug/extfrag/extfrag_index
Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 
Node 0, zone    DMA32 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.979 
Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.884 0.935 0.961 0.974

Can anyone make any sense of these values?  I can't find any documentation about this.

#12 2019-04-09 14:37:48

jasonrohrer
Administrator
Registered: 2017-02-13
Posts: 4,805

Re: Help with Linux Kernel OOM-killer issue

Okay, post-reboot, no swap is used, at all, which is a huge improvement:

top - 14:35:16 up 7 min,  1 user,  load average: 0.66, 0.75, 0.44
Tasks: 130 total,   3 running,  62 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.4 us,  2.1 sy,  0.0 ni, 75.1 id,  7.0 wa,  1.4 hi,  0.3 si, 11.7 st
KiB Mem :  4024420 total,  2886728 free,   453736 used,   683956 buff/cache
KiB Swap:   262140 total,   262140 free,        0 used.  3296208 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                             
  849 mysql     20   0 1567912 201028  17824 S   6.0  5.0   0:13.20 mysqld                                                                              
 1688 root      20   0  645116 131852  10032 S   0.0  3.3   0:00.66 sshd                                                                                
  900 root      20   0  114340  36096   9372 S   0.0  0.9   0:00.54 linode-longview                                                                     
 1306 www-data  20   0  401696  28120  19408 S   0.3  0.7   0:01.75 php-fpm5.6                                                                          
  788 root      20   0  322912  27996  22420 S   0.0  0.7   0:00.10 php-fpm5.6                                                                          
 1321 www-data  20   0  400688  27212  19392 S   2.6  0.7   0:01.84 php-fpm5.6                                                                          
 1675 www-data  20   0  326648  24236  17000 S   0.0  0.6   0:00.82 php-fpm5.6                                                                          
  905 root      20   0  294484  18760   7028 S   0.0  0.5   0:51.56 fail2ban-server                                                                     
  897 www-data  20   0  325236  16768  11104 S   0.0  0.4   0:00.14 php-fpm5.6                                                                          
  896 www-data  20   0  325236  15692  10036 S   0.0  0.4   0:00.15 php-fpm5.6                                                                          
  878 www-data  20   0  125844   7348   5172 R   0.3  0.2   0:00.60 nginx                                                                               
  877 www-data  20   0  125952   7256   5096 S   0.0  0.2   0:00.76 nginx                                                                               
  810 root      20   0  275864   6332   5572 S   0.0  0.2   0:00.03 accounts-daemon                                                                     
  823 root      20   0   65520   5984   5276 S   0.0  0.1   0:00.03 sshd                                                                                
    1 root      20   0   37836   5884   3952 S   0.0  0.1   0:00.97 systemd                                                                             
 1714 root      20   0   22584   5096   3224 S   0.0  0.1   0:00.04 bash                                                                                
 1695 root      20   0   36788   4716   3908 S   0.0  0.1   0:00.04 systemd       

Also, frag index looks like this now:

sudo cat /sys/kernel/debug/extfrag/extfrag_index
Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 
Node 0, zone    DMA32 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 
Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000

So perhaps it wasn't an sshd issue after all, but that definitely may have pushed it over the top, eventually.  Anyway, it should be stable for now.... good enough for me to go on vacation.  My ride to the airport is getting here in.... 30 minutes!

Hopefully, this week, you guys can do some research here and figure out how this problem can be prevented in the future, over the next 361 days of uptime....

#13 2019-04-09 18:12:15

Chard
Moderator
Registered: 2018-03-04
Posts: 125

Re: Help with Linux Kernel OOM-killer issue

jasonrohrer wrote:
sudo cat /sys/kernel/debug/extfrag/extfrag_index
Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 
Node 0, zone    DMA32 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.979 
Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.884 0.935 0.961 0.974

Can anyone make any sense of these values?  I can't find any documentation about this.

kernel.org wrote:

extfrag_threshold

This parameter affects whether the kernel will compact memory or direct
reclaim to satisfy a high-order allocation. The extfrag/extfrag_index file in
debugfs shows what the fragmentation index for each order is in each zone in
the system. Values tending towards 0 imply allocations would fail due to lack
of memory, values towards 1000 imply failures are due to fragmentation and -1
implies that the allocation will succeed as long as watermarks are met.

The kernel will not compact memory in a zone if the
fragmentation index is <= extfrag_threshold. The default value is 500.

Based on that, those values don't seem troubling, but I don't have much to compare them to.
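
One other piece that might help with the raw file: if I'm reading it right, the eleven columns are allocation orders 0 through 10 (4kB blocks up to 4MB blocks), left to right, and the debugfs file prints the index divided by 1000, which is why the docs' "towards 1000" shows up here as values just under 1.000.  So the -1.000 entries mean those sizes will allocate fine as long as watermarks are met, while the 0.9xx values say the biggest contiguous allocations would fail mainly due to fragmentation.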

#14 2019-04-10 03:08:23

Starknight_One
Member
Registered: 2018-10-15
Posts: 347

Re: Help with Linux Kernel OOM-killer issue

Chard wrote:
jasonrohrer wrote:
sudo cat /sys/kernel/debug/extfrag/extfrag_index
Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 
Node 0, zone    DMA32 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.979 
Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.884 0.935 0.961 0.974

Can anyone make any sense of these values?  I can't find any documentation about this.

kernel.org wrote:

extfrag_threshold

This parameter affects whether the kernel will compact memory or direct
reclaim to satisfy a high-order allocation. The extfrag/extfrag_index file in
debugfs shows what the fragmentation index for each order is in each zone in
the system. Values tending towards 0 imply allocations would fail due to lack
of memory, values towards 1000 imply failures are due to fragmentation and -1
implies that the allocation will succeed as long as watermarks are met.

The kernel will not compact memory in a zone if the
fragmentation index is <= extfrag_threshold. The default value is 500.

Based on that those values don't seem troubling, but I don't have much to compare to.

I suspect that 1.000 is fully fragmented, 0.000 is normal fragmentation, and -1.000 is unfragmented.  The documentation seems to be missing the decimal point (or is written Euro-style, with 1,000 meaning the same as 1.000 in the US); if the positive values go up to 1000, the negative value should be on a similar scale.

#15 2019-04-10 04:45:44

lionon
Member
Registered: 2018-11-19
Posts: 532

Re: Help with Linux Kernel OOM-killer issue

I have only two stories to tell on this:

Back in the day I was developing embedded devices running a Linux kernel, where the OOM-killer would sometimes kick in under heavy use of the web interface, and its scoring algorithm would kill the "central control daemon".  That put the device into an unusable state, and since the daemon stopped servicing the watchdog, the device would reboot soon after.  I replaced the default scoring algorithm in the kernel so that it would always hit the web stuff first, since that's least essential, and, whatever else it did, hit the control daemon last (well, last before init).  (Conclusion: you can give the OOM-killer custom scorings.)
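
(A side note: these days you don't need a kernel patch for that; the same effect is available from userspace via /proc/<pid>/oom_score_adj, which runs from -1000, never kill, to +1000, kill first.  A rough sketch with made-up daemon names, assuming each has a single pid and you are root:

echo -1000 > /proc/$(pidof controld)/oom_score_adj   # never pick the critical daemon
echo 500 > /proc/$(pidof webd)/oom_score_adj         # pick the web stuff first

The -1000 entries for sshd and systemd-udevd in the process table above are this exact mechanism.)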

Second story: I once had a server that kept OOM-killing the backup daemon... and I spent quite some time digging into it, since the RAM strangely wasn't actually depleted.  What I learned is that (at least on that architecture) there are actually different kinds of RAM, and only one of them ran out.  I never fully understood what was going on there; I believe it was because it was a very old machine, and modern ones seem to have a simpler RAM layout.  I fiddled around with some kernel parameters, like swappiness, OOM-killer sensitivity, etc., and it mostly stopped killing the daemon.  It still does every other month, but I put the daemon on auto-restart, so the damage is minimal.  Since I only see this behavior on that old machine, it will go away once the machine is replaced as a whole, and I've stopped wanting to invest more time in it.  (Conclusion: RAM is more complicated than one might believe, and there are parameters to fiddle with.)
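
(The kind of knobs I mean, for anyone searching; these take effect immediately when run as root, and belong in /etc/sysctl.conf to persist.  The values are illustrative, not recommendations:

sysctl vm.swappiness=10                 # prefer dropping page cache over swapping program memory out
sysctl vm.oom_kill_allocating_task=1    # kill the task whose allocation failed, instead of scanning for a "worst" process
)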

Last edited by lionon (2019-04-10 04:48:03)

#16 2019-04-10 04:49:09

lionon
Member
Registered: 2018-11-19
Posts: 532

Re: Help with Linux Kernel OOM-killer issue

jasonrohrer wrote:

Can anyone make any sense of these values?  I can't find any documentation about this.

Yes, that was the same problem I was having in "story 2".  This is poorly documented, especially the memory kinds: "Normal", "DMA", "DMA32"...

Last edited by lionon (2019-04-10 04:49:49)

#17 2019-04-16 04:04:42

BladeWoods
Member
Registered: 2018-08-11
Posts: 476

Re: Help with Linux Kernel OOM-killer issue

I don't know what any of those logs mean, but there's been a server-wide "connection lost" happening occasionally lately.  Everyone on the server gets connection lost at the same time; it lasts about 10 seconds, then we can reconnect.

#18 2019-04-17 16:05:20

jasonrohrer
Administrator
Registered: 2017-02-13
Posts: 4,805

Re: Help with Linux Kernel OOM-killer issue

Can I get some more info about server-wide connection loss?

This is on bigserver2 (default server)?

What time of day?  I can look in the logs and verify.

#19 2019-04-17 16:17:46

jasonrohrer
Administrator
Registered: 2017-02-13
Posts: 4,805

Re: Help with Linux Kernel OOM-killer issue

Also, it seems like the OOM issue is solved for now.  I don't think it ever happened again while I was out of town, and swap usage has remained low on the main web server (below 2 MB all week).

Will probably need to reboot the main web server once a year or so to deal with this problem in the future.
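
For anyone who wants to catch the slow swap creep before the OOM-killer does, a rough sketch (untested, threshold made up, assumes a working "mail" command):

# /etc/crontab entry: hourly warning when swap is more than 80% used
0 * * * * root [ $(awk '/SwapTotal/{t=$2} /SwapFree/{f=$2} END{print int(100*(t-f)/t)}' /proc/meminfo) -gt 80 ] && echo swap high | mail -s "swap warning" root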

#20 2019-04-17 18:45:32

Grim_Arbiter
Member
Registered: 2018-12-30
Posts: 943

Re: Help with Linux Kernel OOM-killer issue

jasonrohrer wrote:

Can I get some more info about server-wide connection loss?

This is on bigserver2 (default server)?

What time of day?  I can look in the logs and verify.

No idea, but people have been getting kicked to server 1 and they don't even realize it.  There were like 20 to 30 people on there today, and nobody knew why they were there.


--Grim
I'm flying high. But the worst is never first, and there's a person that'll set you straight. Cancelling the force within my brain. For flying high. The simulator has been disengaged.

#21 2019-04-17 21:12:01

BladeWoods
Member
Registered: 2018-08-11
Posts: 476

Re: Help with Linux Kernel OOM-killer issue

jasonrohrer wrote:

Can I get some more info about server-wide connection loss?

This is on bigserver2 (default server)?

What time of day?  I can look in the logs and verify.


It is happening on bigserver2.  The time of day varies; it's not a consistent time.

#22 2019-04-17 21:15:35

Thaulos
Member
Registered: 2019-02-19
Posts: 456

Re: Help with Linux Kernel OOM-killer issue

I didn't get any forum "mysql not found" errors last week, but I have been getting the "login failed" one quite regularly while trying to get born.

#23 2019-04-17 21:44:20

Grim_Arbiter
Member
Registered: 2018-12-30
Posts: 943

Re: Help with Linux Kernel OOM-killer issue

Thaulos wrote:

I didn't get any forum "mysql not found" error last week, but I have been getting the "login failed" quite regularly while trying to get born.

I've been getting them too, but not regularly, more like occasionally.


--Grim
I'm flying high. But the worst is never first, and there's a person that'll set you straight. Cancelling the force within my brain. For flying high. The simulator has been disengaged.
