The main web server (a Linode) has been up for 361 days now.
Over the past few days, MySQL keeps getting killed. Tracing it through the syslog, it's the kernel doing the killing: the "out of memory" (OOM) killer gets invoked on behalf of some other process (whatever process's memory allocation fails, php-fpm in this case), and then the "worst offender" is killed off to free up memory. The worst offender is MySQL.
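For reference, here's a rough sketch of how to see which process the kernel currently considers the "worst offender" — each process's oom_score is the badness value the OOM killer compares:

# list the ten processes with the highest OOM badness score right now
for p in /proc/[0-9]*; do
    printf "%s\t%s\n" "$(cat $p/oom_score 2>/dev/null)" "$(cat $p/comm 2>/dev/null)"
done | sort -rn | head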
Here's the relevant portion of the syslog:
Apr 8 07:08:23 localhost kernel: php-fpm5.6 invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Apr 8 07:08:23 localhost kernel: php-fpm5.6 cpuset=/ mems_allowed=0
Apr 8 07:08:23 localhost kernel: CPU: 0 PID: 21202 Comm: php-fpm5.6 Not tainted 4.15.12-x86_64-linode105 #1
Apr 8 07:08:23 localhost kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
Apr 8 07:08:23 localhost kernel: Call Trace:
Apr 8 07:08:23 localhost kernel: dump_stack+0x5c/0x7c
Apr 8 07:08:23 localhost kernel: dump_header+0x78/0x28f
Apr 8 07:08:23 localhost kernel: ? do_try_to_free_pages+0x2c8/0x340
Apr 8 07:08:23 localhost kernel: oom_kill_process+0x217/0x430
Apr 8 07:08:23 localhost kernel: ? oom_badness+0xeb/0x160
Apr 8 07:08:23 localhost kernel: out_of_memory+0xf9/0x4a0
Apr 8 07:08:23 localhost kernel: __alloc_pages_nodemask+0xe7b/0xfa0
Apr 8 07:08:23 localhost kernel: filemap_fault+0x3d8/0x4e0
Apr 8 07:08:23 localhost kernel: ext4_filemap_fault+0x2c/0x40
Apr 8 07:08:23 localhost kernel: __do_fault+0x20/0x60
Apr 8 07:08:23 localhost kernel: __handle_mm_fault+0x529/0xb10
Apr 8 07:08:23 localhost kernel: handle_mm_fault+0xf1/0x1b0
Apr 8 07:08:23 localhost kernel: __do_page_fault+0x222/0x440
Apr 8 07:08:23 localhost kernel: ? async_page_fault+0x2f/0x50
Apr 8 07:08:23 localhost kernel: async_page_fault+0x45/0x50
Apr 8 07:08:23 localhost kernel: RIP: 0000: (null)
Apr 8 07:08:23 localhost kernel: RSP: 0008:00007f2b63ea0c80 EFLAGS: 00000008
Apr 8 07:08:23 localhost kernel: Mem-Info:
Apr 8 07:08:23 localhost kernel: active_anon:796978 inactive_anon:159595 isolated_anon:0
Apr 8 07:08:23 localhost kernel: active_file:164 inactive_file:444 isolated_file:0
Apr 8 07:08:23 localhost kernel: unevictable:0 dirty:0 writeback:4 unstable:0
Apr 8 07:08:23 localhost kernel: slab_reclaimable:3905 slab_unreclaimable:13480
Apr 8 07:08:23 localhost kernel: mapped:2059 shmem:13278 pagetables:9311 bounce:0
Apr 8 07:08:23 localhost kernel: free:6813 free_pcp:171 free_cma:0
Apr 8 07:08:23 localhost kernel: Node 0 active_anon:3187912kB inactive_anon:638380kB active_file:656kB inactive_file:1776kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:8236kB dirty:0kB writeback:16kB shmem:53112kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Apr 8 07:08:23 localhost kernel: Node 0 DMA free:15700kB min:28kB low:40kB high:52kB active_anon:200kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Apr 8 07:08:23 localhost kernel: lowmem_reserve[]: 0 2958 3918 3918
Apr 8 07:08:23 localhost kernel: Node 0 DMA32 free:9660kB min:6036kB low:9064kB high:12092kB active_anon:2716344kB inactive_anon:259360kB active_file:616kB inactive_file:540kB unevictable:0kB writepending:0kB present:3129208kB managed:3035628kB mlocked:0kB kernel_stack:256kB pagetables:16276kB bounce:0kB free_pcp:344kB local_pcp:0kB free_cma:0kB
Apr 8 07:08:23 localhost kernel: lowmem_reserve[]: 0 0 959 959
Apr 8 07:08:23 localhost kernel: Node 0 Normal free:1892kB min:1956kB low:2936kB high:3916kB active_anon:471368kB inactive_anon:379020kB active_file:172kB inactive_file:1504kB unevictable:0kB writepending:16kB present:1048576kB managed:982204kB mlocked:0kB kernel_stack:3456kB pagetables:20968kB bounce:0kB free_pcp:340kB local_pcp:36kB free_cma:0kB
Apr 8 07:08:23 localhost kernel: lowmem_reserve[]: 0 0 0 0
Apr 8 07:08:23 localhost kernel: Node 0 DMA: 1*4kB (U) 2*8kB (UM) 2*16kB (UM) 1*32kB (M) 2*64kB (U) 1*128kB (U) 2*256kB (UM) 1*512kB (M) 2*1024kB (UM) 2*2048kB (UM) 2*4096kB (ME) = 15700kB
Apr 8 07:08:23 localhost kernel: Node 0 DMA32: 139*4kB (UME) 31*8kB (UME) 7*16kB (UME) 12*32kB (UME) 2*64kB (UE) 2*128kB (ME) 1*256kB (M) 2*512kB (UE) 1*1024kB (E) 3*2048kB (UME) 0*4096kB = 10132kB
Apr 8 07:08:23 localhost kernel: Node 0 Normal: 95*4kB (UME) 63*8kB (UME) 25*16kB (ME) 18*32kB (ME) 6*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2244kB
Apr 8 07:08:23 localhost kernel: 15338 total pagecache pages
Apr 8 07:08:23 localhost kernel: 1397 pages in swap cache
Apr 8 07:08:23 localhost kernel: Swap cache stats: add 827133, delete 825718, find 256168518/256297980
Apr 8 07:08:23 localhost kernel: Free swap = 0kB
Apr 8 07:08:23 localhost kernel: Total swap = 262140kB
Apr 8 07:08:23 localhost kernel: 1048444 pages RAM
Apr 8 07:08:23 localhost kernel: 0 pages HighMem/MovableOnly
Apr 8 07:08:23 localhost kernel: 40009 pages reserved
Apr 8 07:08:23 localhost kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Apr 8 07:08:23 localhost kernel: [ 2001] 0 2001 10490 12 114688 187 -1000 systemd-udevd
Apr 8 07:08:23 localhost kernel: [ 2262] 100 2262 25081 0 98304 69 0 systemd-timesyn
Apr 8 07:08:23 localhost kernel: [ 3357] 104 3357 64099 238 135168 203 0 rsyslogd
Apr 8 07:08:23 localhost kernel: [ 3368] 0 3368 7252 30 106496 43 0 cron
Apr 8 07:08:23 localhost kernel: [ 3371] 106 3371 10726 49 131072 68 -900 dbus-daemon
Apr 8 07:08:23 localhost kernel: [ 3387] 0 3387 68988 28 180224 247 0 accounts-daemon
Apr 8 07:08:23 localhost kernel: [ 3390] 0 3390 7149 25 94208 68 0 systemd-logind
Apr 8 07:08:23 localhost kernel: [ 3458] 0 3458 4868 21 86016 35 0 irqbalance
Apr 8 07:08:23 localhost kernel: [ 3470] 0 3470 16380 30 159744 153 -1000 sshd
Apr 8 07:08:23 localhost kernel: [ 3487] 0 3487 3985 0 81920 39 0 agetty
Apr 8 07:08:23 localhost kernel: [32394] 33 32394 6314 0 94208 65 0 fcgiwrap
Apr 8 07:08:23 localhost kernel: [17069] 0 17069 80728 31 499712 1371 0 php-fpm5.6
Apr 8 07:08:23 localhost kernel: [18162] 0 18162 31276 13 204800 353 0 nginx
Apr 8 07:08:23 localhost kernel: [18163] 33 18163 31486 326 217088 242 0 nginx
Apr 8 07:08:23 localhost kernel: [18164] 33 18164 31452 281 217088 248 0 nginx
Apr 8 07:08:23 localhost kernel: [20792] 0 20792 3938 0 73728 37 0 agetty
Apr 8 07:08:23 localhost kernel: [18908] 1000 18908 11640 0 131072 535 0 systemd
Apr 8 07:08:23 localhost kernel: [18909] 1000 18909 73579 1 327680 803 0 (sd-pam)
Apr 8 07:08:23 localhost kernel: [15755] 1000 15755 6823 0 90112 152 0 screen
Apr 8 07:08:23 localhost kernel: [15756] 1000 15756 5686 2 94208 510 0 bash
Apr 8 07:08:23 localhost kernel: [ 5907] 0 5907 8819 751 106496 35 0 systemd-journal
Apr 8 07:08:23 localhost kernel: [28873] 0 28873 80793 7710 274432 3783 0 fail2ban-server
Apr 8 07:08:23 localhost kernel: [18206] 33 18206 81309 410 491520 1107 0 php-fpm5.6
Apr 8 07:08:23 localhost kernel: [18216] 109 18216 392175 61919 847872 0 0 mysqld
Apr 8 07:08:23 localhost kernel: [18228] 33 18228 81309 426 491520 1107 0 php-fpm5.6
Apr 8 07:08:23 localhost kernel: [18349] 33 18349 81309 404 491520 1112 0 php-fpm5.6
Apr 8 07:08:23 localhost kernel: [21202] 33 21202 100166 2182 516096 752 0 php-fpm5.6
Apr 8 07:08:23 localhost kernel: [21507] 33 21507 81334 1722 507904 804 0 php-fpm5.6
Apr 8 07:08:23 localhost kernel: [21537] 33 21537 81599 1772 507904 916 0 php-fpm5.6
Apr 8 07:08:23 localhost kernel: [21755] 0 21755 139449 23246 577536 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21756] 0 21756 153627 23078 577536 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21757] 108 21757 16716 181 167936 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21758] 108 21758 16716 181 159744 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21759] 0 21759 150708 20202 569344 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21760] 0 21760 150355 19851 552960 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21761] 0 21761 151130 20601 565248 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21762] 0 21762 151785 21263 569344 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21763] 108 21763 16716 184 172032 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21764] 108 21764 16716 183 163840 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21765] 108 21765 16716 183 167936 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21766] 108 21766 16716 182 167936 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21767] 0 21767 133974 19850 548864 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21768] 0 21768 152204 21670 589824 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21769] 108 21769 16716 182 167936 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21770] 0 21770 151789 21273 557056 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21771] 0 21771 147800 17267 536576 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21772] 0 21772 132580 18458 540672 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21773] 108 21773 16716 181 163840 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21774] 108 21774 16716 185 163840 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21775] 0 21775 152993 20439 557056 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21776] 0 21776 131888 17735 540672 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21777] 0 21777 148967 18462 528384 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21778] 0 21778 115787 18045 528384 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21779] 108 21779 16716 180 159744 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21780] 0 21780 148964 18458 540672 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21781] 0 21781 148269 17735 540672 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21782] 0 21782 126594 12472 491520 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21783] 0 21783 148964 18458 548864 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21784] 0 21784 132171 18048 532480 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21785] 108 21785 16716 181 167936 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21786] 108 21786 16716 181 167936 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21787] 0 21787 117587 19847 544768 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21788] 0 21788 133655 17484 528384 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21789] 0 21789 148269 17757 536576 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21790] 108 21790 16716 182 159744 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21791] 0 21791 117587 19848 561152 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21797] 0 21797 116196 18455 544768 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21798] 0 21798 148272 17739 544768 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21799] 0 21799 126594 12470 491520 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21800] 0 21800 132171 18047 536576 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21801] 0 21801 148555 18049 552960 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21802] 108 21802 16716 181 163840 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21803] 108 21803 16716 182 163840 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21804] 108 21804 16716 182 159744 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21805] 108 21805 16716 184 163840 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21806] 108 21806 16716 180 167936 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21807] 0 21807 147798 17264 524288 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21808] 0 21808 148269 17735 536576 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21809] 0 21809 148555 18051 536576 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21810] 0 21810 115030 17263 524288 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21811] 0 21811 147990 17480 536576 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21812] 0 21812 148964 18458 532480 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21813] 0 21813 150355 19851 557056 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21814] 0 21814 131416 17266 516096 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21815] 0 21815 148269 17736 540672 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21816] 108 21816 16716 180 159744 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21817] 108 21817 16716 182 163840 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21818] 108 21818 16716 182 167936 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21819] 108 21819 16716 182 163840 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21820] 108 21820 16716 181 167936 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21821] 108 21821 16716 181 163840 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21822] 108 21822 16716 182 163840 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21823] 0 21823 131606 17476 532480 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21824] 0 21824 147798 17264 520192 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21825] 108 21825 16716 181 172032 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21826] 108 21826 16716 183 163840 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21827] 108 21827 16716 182 159744 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21828] 108 21828 16716 181 163840 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21829] 0 21829 148555 18050 544768 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21830] 0 21830 148557 18052 544768 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21831] 0 21831 147990 17481 540672 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21832] 108 21832 16716 182 155648 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21833] 108 21833 16716 181 167936 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21834] 108 21834 16716 183 163840 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21835] 108 21835 16716 180 163840 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21836] 0 21836 133463 17267 528384 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21837] 108 21837 16716 181 163840 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21838] 108 21838 16716 180 163840 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21839] 108 21839 16716 182 167936 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21840] 108 21840 16716 181 159744 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21846] 108 21846 16716 183 163840 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21847] 108 21847 16716 182 159744 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21848] 0 21848 116196 18455 540672 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21849] 108 21849 16716 180 163840 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21850] 108 21850 16716 180 163840 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21851] 108 21851 16716 182 159744 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21852] 108 21852 16716 182 172032 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21858] 108 21858 16716 180 163840 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21869] 108 21869 16716 182 172032 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21880] 0 21880 148555 18049 544768 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21886] 108 21886 16716 179 167936 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21917] 0 21917 146217 15684 528384 0 0 sshd
Apr 8 07:08:23 localhost kernel: [21933] 108 21933 16716 180 163840 0 0 sshd
Apr 8 07:08:23 localhost kernel: [22084] 0 22084 80726 32 438272 1366 0 php-fpm5.6
Apr 8 07:08:23 localhost kernel: Out of memory: Kill process 18216 (mysqld) score 57 or sacrifice child
Apr 8 07:08:23 localhost kernel: Killed process 18216 (mysqld) total-vm:1568700kB, anon-rss:247676kB, file-rss:0kB, shmem-rss:0kB
Apr 8 07:08:23 localhost kernel: oom_reaper: reaped process 18216 (mysqld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Now, the strange thing here is that "top" shows over 3GB of physical RAM free:
top - 23:05:54 up 361 days, 17:47, 4 users, load average: 0.94, 0.86, 1.15
Tasks: 103 total, 2 running, 57 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.0 us, 0.5 sy, 0.0 ni, 89.3 id, 3.0 wa, 0.0 hi, 0.3 si, 5.9 st
KiB Mem : 4033740 total, 2986724 free, 480340 used, 566676 buff/cache
KiB Swap: 262140 total, 3032 free, 259108 used. 3375656 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4174 mysql 20 0 1567668 240608 15152 S 1.3 6.0 2:12.68 mysqld
4435 root 20 0 645116 131496 9676 S 0.0 3.3 0:00.67 sshd
4464 jcr15 20 0 645116 125392 3572 S 0.0 3.1 0:00.67 sshd
5959 www-data 20 0 400732 25040 20164 S 0.0 0.6 0:07.50 php-fpm5.6
5891 www-data 20 0 400644 24900 20096 S 0.3 0.6 0:09.92 php-fpm5.6
28873 root 20 0 327524 24344 1728 S 0.0 0.6 6:07.68 fail2ban-se+
5960 www-data 20 0 400648 24124 19832 S 1.0 0.6 0:07.95 php-fpm5.6
4194 www-data 20 0 325236 10808 9632 S 0.0 0.3 0:00.99 php-fpm5.6
4190 www-data 20 0 325236 10072 8804 S 0.0 0.2 0:01.09 php-fpm5.6
4177 www-data 20 0 325236 9800 8584 S 0.0 0.2 0:01.09 php-fpm5.6
4465 jcr15 20 0 22756 5392 3336 S 0.0 0.1 0:00.12 bash
18163 www-data 20 0 125972 5084 3904 S 0.0 0.1 1088:45 nginx
32761 root 20 0 526832 5048 1532 S 0.0 0.1 0:04.82 sshd
18164 www-data 20 0 125856 4956 3948 S 0.0 0.1 1087:49 nginx
1 root 20 0 194088 4548 2632 S 0.0 0.1 35:45.52 systemd
516 root 20 0 22992 4216 2780 S 0.0 0.1 0:00.83 bash
4923 jcr15 20 0 41880 3664 3108 R 0.0 0.1 0:04.39 top
5907 root 20 0 35276 3664 3528 S 0.0 0.1 1:20.91 systemd-jou+
17069 root 20 0 322912 1996 1888 S 0.0 0.0 39:54.32 php-fpm5.6
3390 root 20 0 28596 1944 1780 S 0.0 0.0 7:20.14 systemd-log+
466 root 20 0 38104 1812 0 S 0.0 0.0 0:00.14 systemd
467 root 20 0 302560 1748 0 S 0.0 0.0 0:00.00 (sd-pam)
3387 root 20 0 275952 1548 1016 S 0.0 0.0 27:58.69 accounts-da+
3357 syslog 20 0 256396 1484 572 S 0.0 0.0 28:48.69 rsyslogd
3470 root 20 0 65520 1404 1288 S 0.0 0.0 8:15.21 sshd
3368 root 20 0 29008 1152 1036 S 0.0 0.0 7:58.40 cron
3371 message+ 20 0 42904 912 692 S 0.0 0.0 16:59.82 dbus-daemon
2001 root 20 0 41960 436 388 S 0.0 0.0 0:43.46 systemd-ude+
18908 jcr15 20 0 46560 252 0 S 0.0 0.0 0:00.18 systemd
3458 root 20 0 19472 84 0 S 0.0 0.0 15:22.29 irqbalance
I'm no expert at reading OOM-killer logs.... just digging into it myself. Anyone see anything strange in there? At first glance, it appears to be complaining that it's out of swap.... but I can't figure out why it's out of physical RAM, or what's consuming the most space there.
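One way to answer my own "what's consuming the most space" question from the dump itself: the rss column in the kernel's process table above is in 4 kB pages, so summing it per process name ranks who was holding memory at the moment of the kill. A sketch (the syslog path is an assumption, and if the log contains several OOM dumps this lumps them together):

# rank processes by resident memory at OOM time; rss sits five fields
# from the end of each process-table line and is counted in 4 kB pages
awk '$5 == "kernel:" && $6 ~ /^\[[0-9]+\]$/ { rss[$NF] += $(NF-4) }
     END { for (name in rss) printf "%8.1f MB  %s\n", rss[name] * 4 / 1024, name }' \
    /var/log/syslog | sort -rn | head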
I'm not a programmer, but....
It's a bit suspicious that you tried to change the way family names are passed down, and 1) it isn't working, 2) it's causing problems with cursing people after a last name is lost, and 3) something terrible is going on with the database system.
Is there any possibility that there's a connection between the way you changed the last-name records and this MySQL error?
Or can we assume it's some error that was introduced by the most recent update?
--Blue Diamond
I aim to leave behind a world that is easier for people to live in than it was before I got there.
No, I don't think this has anything to do with that.
The last name change was in the server code, which is per-game-server, and not even running on this main web server.
The fact that there are 95 sshd processes running during this out-of-memory event is troubling... it looks like a DDoS attack of some kind.
There are normally 4 sshd processes running.
Yeah, so every time this happens, there's an overload of sshd processes, and I'm pretty sure this is what's causing the server to run out of RAM.
I'm implementing a work-around so that I can still connect via ssh, but other people won't be able to overload it with spam connection requests like this.
Ok, sounds like you are on a path to a solution. Glad to hear it!
Hope it works out quickly, and you have an awesome vacation with your family.
--Blue Diamond
Yes, the sshd overload was the problem. When someone opens an ssh connection, sshd will just hang there for up to 120 seconds while they try to log in.
If enough bots do this, and leave the connections hanging, these connections build up, with one sshd process each, until the system runs out of RAM.
The fix is pretty simple: the sshd setting MaxStartups, which is described here:
https://stackoverflow.com/questions/311 … shd-config
I'm using a setting of 10:30:30
Which means:
After 10 unauthenticated connections are pending, new connections are dropped with 30% probability, and the drop probability increases linearly until it reaches 100% at a hard limit of 30 pending unauthenticated connections.
The probabilistic dropping (as opposed to just a hard cap) lets the real people (me) have a chance at getting through, even during a temporary overload.
And we'll never get anywhere near 95 sshd processes. 30 max (not including the authenticated ones that are really me).
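Concretely, the whole change is a couple of lines in /etc/ssh/sshd_config plus a restart. A sketch of what this looks like; the LoginGraceTime tweak and the restart command are extras I'd expect to work here, so adjust for your distro:

# /etc/ssh/sshd_config
# start dropping 30% of new unauthenticated connections once 10 are pending,
# ramping linearly up to dropping 100% at 30 pending
MaxStartups 10:30:30
# optionally also shrink the 120-second window a hanging login gets
LoginGraceTime 30

# then restart sshd (service name is "ssh" or "sshd" depending on distro)
sudo service ssh restart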
The OOM-killer was also killing my open ssh connections, so I kept getting disconnected from my own server while I was trying to debug this issue.
And for reference, the last OOM-killer event was here:
Apr 9 00:43:03 localhost kernel: emacs invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Here's hoping it was the last one.
Ugh... even with this limit on sshd, it happened again:
Apr 9 04:24:25 localhost kernel: php-fpm5.6 invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
During this oom incident, mysql was killed (again) and there were 60 sshd processes running, which is the limit (30 max incoming ssh connections, two processes each).
HOWEVER, I'm wondering if something else is going on here, and the sshd overflow is just a symptom rather than a cause. Maybe the sshd connections pile up WHEN the system is already out of RAM, as a side effect.
Maybe the PHP code that's triggering the OOM is allocating too much RAM for some reason. Maybe there's a giant forum post being fetched or something.
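If it does turn out to be PHP, the usual knobs are the per-request memory limit and the number of FPM workers. A sketch only; the paths and service name below are what a php5.6 FPM install usually uses on Ubuntu, so treat them as assumptions:

# /etc/php/5.6/fpm/php.ini -- cap what a single request can allocate
memory_limit = 128M

# /etc/php/5.6/fpm/pool.d/www.conf -- cap the number of workers, so the
# pool's total footprint is bounded at roughly max_children * per-child usage
pm.max_children = 5

# then reload the pool
sudo service php5.6-fpm restart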
However, if I go back in time in the syslog to find the first oom event ever:
Apr 5 22:33:08 localhost kernel: sshd invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0
So it seems like sshd can trigger it....
I've turned down the max ssh connections to 10 (20 processes max, instead of 60), and we'll see if that helps.
Well, after reducing sshd cap to 10 (down from 30), we made it through the night without any OOM-killer events.
However, it seems like memory fragmentation is likely the real issue here. We are not actually out of physical memory, but out of swap, mostly due to fragmentation. This is a 64-bit machine, but this might be relevant anyway:
https://serverfault.com/questions/56406 … -situation
So maybe.... after 361 days of uptime.... it's time for a reboot?
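For the record, a few things worth checking (or trying) short of a full reboot. A sketch, assuming the kernel was built with compaction support:

# per-zone free-page counts by order; long runs of zeros on the right
# mean there's free memory, but only in small fragments
cat /proc/buddyinfo

# drop clean page cache, dentries and inodes (safe; they get refilled later)
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# ask the kernel to compact each zone's free memory into larger blocks
echo 1 | sudo tee /proc/sys/vm/compact_memory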
sudo cat /sys/kernel/debug/extfrag/extfrag_index
Node 0, zone DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone DMA32 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.979
Node 0, zone Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.884 0.935 0.961 0.974
Can anyone make any sense of these values? I can't find any documentation about this.
Okay, post-reboot, no swap is used, at all, which is a huge improvement:
top - 14:35:16 up 7 min, 1 user, load average: 0.66, 0.75, 0.44
Tasks: 130 total, 3 running, 62 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.4 us, 2.1 sy, 0.0 ni, 75.1 id, 7.0 wa, 1.4 hi, 0.3 si, 11.7 st
KiB Mem : 4024420 total, 2886728 free, 453736 used, 683956 buff/cache
KiB Swap: 262140 total, 262140 free, 0 used. 3296208 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
849 mysql 20 0 1567912 201028 17824 S 6.0 5.0 0:13.20 mysqld
1688 root 20 0 645116 131852 10032 S 0.0 3.3 0:00.66 sshd
900 root 20 0 114340 36096 9372 S 0.0 0.9 0:00.54 linode-longview
1306 www-data 20 0 401696 28120 19408 S 0.3 0.7 0:01.75 php-fpm5.6
788 root 20 0 322912 27996 22420 S 0.0 0.7 0:00.10 php-fpm5.6
1321 www-data 20 0 400688 27212 19392 S 2.6 0.7 0:01.84 php-fpm5.6
1675 www-data 20 0 326648 24236 17000 S 0.0 0.6 0:00.82 php-fpm5.6
905 root 20 0 294484 18760 7028 S 0.0 0.5 0:51.56 fail2ban-server
897 www-data 20 0 325236 16768 11104 S 0.0 0.4 0:00.14 php-fpm5.6
896 www-data 20 0 325236 15692 10036 S 0.0 0.4 0:00.15 php-fpm5.6
878 www-data 20 0 125844 7348 5172 R 0.3 0.2 0:00.60 nginx
877 www-data 20 0 125952 7256 5096 S 0.0 0.2 0:00.76 nginx
810 root 20 0 275864 6332 5572 S 0.0 0.2 0:00.03 accounts-daemon
823 root 20 0 65520 5984 5276 S 0.0 0.1 0:00.03 sshd
1 root 20 0 37836 5884 3952 S 0.0 0.1 0:00.97 systemd
1714 root 20 0 22584 5096 3224 S 0.0 0.1 0:00.04 bash
1695 root 20 0 36788 4716 3908 S 0.0 0.1 0:00.04 systemd
Also, frag index looks like this now:
sudo cat /sys/kernel/debug/extfrag/extfrag_index
Node 0, zone DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone DMA32 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
So perhaps it wasn't an sshd issue after all, but that definitely may have pushed it over the top, eventually. Anyway, it should be stable for now.... good enough for me to go on vacation. My ride to the airport is getting here in.... 30 minutes!
Hopefully, this week, you guys can do some research here and figure out how this problem can be prevented in the future, over the next 361 days of uptime....
jasonrohrer wrote:
sudo cat /sys/kernel/debug/extfrag/extfrag_index
Node 0, zone DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone DMA32 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.979
Node 0, zone Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.884 0.935 0.961 0.974
Can anyone make any sense of these values? I can't find any documentation about this.
kernel.org wrote:
extfrag_threshold
This parameter affects whether the kernel will compact memory or direct reclaim to satisfy a high-order allocation. The extfrag/extfrag_index file in debugfs shows what the fragmentation index for each order is in each zone in the system. Values tending towards 0 imply allocations would fail due to lack of memory, values towards 1000 imply failures are due to fragmentation and -1 implies that the allocation will succeed as long as watermarks are met. The kernel will not compact memory in a zone if the fragmentation index is <= extfrag_threshold. The default value is 500.
Based on that, those values don't seem troubling, but I don't have much to compare to.
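As a rough sanity check based only on my reading of the doc above (so treat it as a sketch), this flags the orders whose index is past the default threshold, i.e. the orders where a failed allocation would be blamed on fragmentation rather than on a lack of memory. The file appears to print the index divided by 1000, so 0.5 here corresponds to the default extfrag_threshold of 500:

sudo cat /sys/kernel/debug/extfrag/extfrag_index | awk '
  { for (i = 5; i <= NF; i++)           # per-order values start at field 5
      if ($i + 0 > 0.5)
        printf "zone %-8s order %2d: %s\n", $4, i - 5, $i }'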
I suspect the debugfs file is just printing the index divided by 1000: values near 1.000 mean a failed allocation at that order would be due to fragmentation, values near 0.000 mean it would be due to an actual lack of memory, and -1.000 means the allocation would succeed as long as the watermarks are met. The documentation's "1000" seems to be missing the scaling (or the decimal point), since the file clearly tops out around 1.000 rather than 1000.
Steam name: starkn1ght
The Berry Bush Song
The Compost Cycle
Gobble-uns!
Only two stories I can tell here:
Back in the day, I was developing embedded devices running a Linux kernel, where the OOM-killer would sometimes kick in under heavy use of the web interface, and its scoring algorithm would kill the "central control daemon". That put the device into an unusable state, and since the daemon also stopped servicing the watchdog, the device would reboot soon after. I changed the default scoring in the kernel so it would always hit the web stuff first, since that's the least essential, and hit the control daemon last (well, last before init). (Conclusion: you can give the OOM-killer custom scoring.)
Second story: I once had a server that kept OOM-killing the backup daemon, and I spent quite some time digging into it, since the RAM strangely wasn't actually depleted. What I learned is that (at least on that architecture) there are different kinds of RAM, and only one of them was running out. I never fully understood what was going on there; I believe it was because the machine was very old, and modern ones seem to have a simpler memory layout. I fiddled with some kernel parameters (swappiness, OOM-killer sensitivity, etc.) and it mostly stopped killing the daemon. It still does every other month, but I put the daemon on auto-restart, so the damage is minimal. Since I only see this behavior on that old machine, the problem will go away once the machine is replaced, and I've stopped wanting to invest more time in it. (Conclusion: RAM is more complicated than one might believe, and there are parameters to fiddle with.)
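For what it's worth, on a modern kernel the knobs I mean look roughly like this. A sketch only: the -500 value and the mysql.service unit name are assumptions about Jason's box, not something I have tested there.

# nudge the OOM killer away from mysqld for the currently running process
# (range is -1000..1000; -1000 exempts a process entirely)
echo -500 | sudo tee /proc/$(pidof mysqld)/oom_score_adj

# the persistent version, as a systemd drop-in for the service
# (assuming the unit is called mysql.service):
#   [Service]
#   OOMScoreAdjust=-500

# and make the kernel less eager to push anonymous pages out to swap
sudo sysctl vm.swappiness=10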
jasonrohrer wrote:
Can anyone make any sense of these values? I can't find any documentation about this.
Yes, that's the same problem I was having in "story 2". This is poorly documented, especially the memory "kinds" ("Normal", "DMA", "DMA32"...).
I don't know what any of those logs mean, but there have been server-wide connection losses happening occasionally lately. Everyone on the server gets "connection lost" at the same time, it lasts about 10 seconds, then we can reconnect.
Can I get some more info about server-wide connection loss?
This is on bigserver2 (default server)?
What time of day? I can look in the logs and verify.
Also, it seems like the OOM issue is solved for now. I don't think it ever happened again while I was out of town, and swap has remained low on the main web server (below 2 MB all week).
Will probably need to reboot the main web server once a year or so to deal with this problem in the future.
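To catch it earlier next time, a cron-able one-liner that complains when swap use climbs could help (a sketch; the 200 MB threshold is arbitrary):

# warn when swap usage crosses ~200 MB (free -k reports kB; field 3 is "used")
free -k | awk '/^Swap:/ && $3 > 200000 {
    print "swap is at " $3 " kB -- time to schedule that reboot" }'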
jasonrohrer wrote:
Can I get some more info about server-wide connection loss?
This is on bigserver2 (default server)?
What time of day? I can look in the logs and verify.
No idea, but people have been getting kicked to server1 and they don't even realize it. There were like 20 to 30 people on there today and nobody knew why they were there.
--Grim
I'm flying high. But the worst is never first, and there's a person that'll set you straight. Cancelling the force within my brain. For flying high. The simulator has been disengaged.
jasonrohrer wrote:
Can I get some more info about server-wide connection loss?
This is on bigserver2 (default server)?
What time of day? I can look in the logs and verify.
It is happening on bigserver2. The time of day varies. Not a consistent time.
I didn't get any forum "mysql not found" errors last week, but I have been getting "login failed" quite regularly while trying to get born.
I've been getting them too, but not regularly, more like occasionally.
--Grim