Hello,
I’m currently running Endless OS 5.1.2, and I’ve found the OOM killer to be overly aggressive for my workload. The system has 8 GiB of RAM and the default zram swap. The system reports plenty of “available” memory and swap is barely used, yet my application frequently gets terminated by the OOM killer.
I have explored some VM tunables (watermark ratios, swappiness, and so on), but none of them stopped this behaviour.
The workaround I now have is to run a script that drops the kernel page caches every 30s. This feels like a hack, but it actually works to keep more memory free and the OOM killer at bay.
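It is essentially just this loop (a simplified sketch of what the script does):
#!/bin/bash
# Simplified sketch of the workaround: drop the clean page cache every 30 seconds.
# Writing 1 to /proc/sys/vm/drop_caches frees the page cache only
# (2 would free reclaimable slab objects such as dentries and inodes, 3 both).
# Must run as root.
while true; do
    echo 1 > /proc/sys/vm/drop_caches
    sleep 30
done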
I wonder if something can be done to tune the OOM killer in Endless OS.
Alright, I’ve already run echo 'n' | sudo tee /sys/kernel/mm/lru_gen/enabled, and I have tried different applications, but the same problem still exists: none of them start, or they take a long time to start. I performed the test without rebooting the system. Here is the new diagnostic generated after running that command: eos-diagnostic-240416_151945_UTC-0300.txt (1006.2 KB)
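In case it helps, the setting can be double-checked by reading the same file back; after that command it should report 0x0000 (all components disabled), though the exact value shown may vary by kernel version:
$ cat /sys/kernel/mm/lru_gen/enabled
0x0000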
Thanks @pauldoo.
Endless OS has a component called “psi-monitor” which monitors memory pressure. When it detects that your system is really struggling to allocate memory, to the point where it would be hard to even use the UI to close an app, it is supposed to step in and kill a process via the OOM killer.
What you are seeing here is that mechanism misfiring: as you describe, it is triggering a kill when the system does not appear to be in any trouble at all.
Based on your feedback we have re-reviewed the thresholds used here, and as a result we are going to try quadrupling the threshold from 10% to 40% in the next EOS 6.0 beta release. Indeed it was too sensitive at that level. We will also make it easier to log the pressure values and to adjust the threshold.
What is still unexplained and weird is why your system is reporting >10% memory pressure when it is relatively idle. Yours is the only report we have of this on EOS 5.1. It is interesting that this might be related to multi-gen LRU.
While we work on that new beta release, if you want to stop this killing from happening until the next reboot, the command is: systemctl stop psi-monitor
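You can confirm it is no longer running with:
$ systemctl status psi-monitor
which should report the unit as inactive (dead); the service will start again automatically on the next boot.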
If you are curious, you can use this command to watch memory pressure:
$ cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.00 total=37884
full avg10=1.11 avg60=0.00 avg300=0.00 total=36783
The “full avg10” value is the one we monitor. In the above fictional example it says that all running processes were prevented from doing useful work, because the kernel was busy doing memory management, for 1.11% of the last 10 seconds. If that value exceeds 10.0 (or, soon, 40.0), psi-monitor will request an OOM kill.
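Purely to illustrate that logic (this is not the actual psi-monitor code, just a sketch of the check described above), an equivalent loop would look roughly like this:
#!/bin/bash
# Sketch only: poll the "full avg10" memory pressure value and report
# when it crosses the threshold at which psi-monitor would request an OOM kill.
THRESHOLD=10.0   # becomes 40.0 in the EOS 6.0 beta
while true; do
    avg10=$(awk '/^full/ { sub("avg10=", "", $2); print $2 }' /proc/pressure/memory)
    if awk -v a="$avg10" -v t="$THRESHOLD" 'BEGIN { exit !(a > t) }'; then
        echo "full avg10=$avg10 exceeds $THRESHOLD - an OOM kill would be requested here"
    fi
    sleep 5
done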