OOM killer is overly aggressive (Endless OS 5.1.2 - Kernel 6.5)

pauldoo · April 14, 2024, 7:18pm

Hello,
I’m currently running Endless OS 5.1.2, and I’ve found the OOM killer to be overly aggressive for my workload. The system has 8 GiB of RAM and the default zram swap. The system reports plenty of “available” memory and swap is barely used, yet my application is frequently terminated by the OOM killer.

I have explored some VM tunables, including watermark ratios, swappiness, etc., but none of them stopped this behaviour.

The workaround I now have is to run a script that drops the kernel page caches every 30s. This feels like a hack, but it’s actually working to keep more memory free and the OOM killer at bay.

I wonder if something can be done to tune the OOM killer in Endless OS.

pauldoo · April 15, 2024, 8:07am

I managed to find a better workaround than my script. It appears that the issue is resolved if I disable MGLRU:

echo 'n' | sudo tee /sys/kernel/mm/lru_gen/enabled
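
To confirm the current state before and after that write, the same sysfs file can be read back. This is a minimal sketch, assuming the kernel reports the setting as a hexadecimal bitmask such as 0x0007 (zero meaning fully disabled); mglru_state is a hypothetical helper name, not an Endless OS tool. Note that a value written to sysfs this way lasts only until the next reboot.

```shell
#!/bin/sh
# Hypothetical helper: interpret the value read from lru_gen/enabled.
# Assumption: the value is a hex bitmask (for example 0x0007); zero means off.
mglru_state() {
    if [ "$(( $1 ))" -eq 0 ]; then
        echo disabled
    else
        echo enabled
    fi
}

LRU_GEN_FILE=/sys/kernel/mm/lru_gen/enabled

if [ -r "$LRU_GEN_FILE" ]; then
    # Reading the file does not need root; only writing does.
    mglru_state "$(cat "$LRU_GEN_FILE")"
else
    echo "MGLRU control file not found"
fi
```

Running it once before and once after the tee command above should show the state change from enabled to disabled.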

wjt · April 16, 2024, 9:12am

Thanks for this report. We are investigating something similar in 6.0.0 on this thread:

Can you disable your script, reproduce the problem, create a diagnostic file by running eos-diagnostics, and then attach it here? Thanks!

James_Martinez · April 16, 2024, 4:23pm

Sorry, I don’t know how to disable the script. Please give me the instructions and I will send a new diagnostic.

wjt · April 16, 2024, 5:04pm

@James_Martinez I was referring to the script that @pauldoo described above, which they had created to work around the problem.

James_Martinez · April 16, 2024, 6:20pm

Alright, I’ve already run echo 'n' | sudo tee /sys/kernel/mm/lru_gen/enabled, and I have tried different applications, but the same problem still exists: they either do not start or take a long time to start. I performed the test without rebooting the system. I will send the new diagnostic taken after running that command.
eos-diagnostic-240416_151945_UTC-0300.txt (1006,2 KB)

Daniel · April 18, 2024, 6:29pm

Thanks @pauldoo.
Endless OS has a component called “psi-monitor” which monitors memory pressure. When it detects that your system is really struggling to allocate memory, to the point where it would be hard to even use the UI to close an app, it is supposed to step in and kill a process via the OOM killer.

What you are seeing here is this mechanism misfiring: as you describe, it triggers a kill when the system does not appear to be in any trouble at all.

Based on your feedback we have re-reviewed the thresholds used here, and as a result we are going to try quadrupling the threshold from 10% to 40% in the next EOS 6.0 beta release. It was indeed too sensitive. We will also make it easier to log what the pressure is, and to adjust the threshold.

What is still unexplained and weird is why your system is reporting >10% memory pressure when it is relatively idle. Yours is the only report we have of this on EOS 5.1. It is interesting that this might be related to multi-gen LRU.

While we work on that new beta release, in the meantime, if you want to stop this killing from happening until the next reboot, the command is:

systemctl stop psi-monitor

If curious you can use this command to watch memory pressure:

$ cat /proc/pressure/memory 
some avg10=0.00 avg60=0.00 avg300=0.00 total=37884
full avg10=1.11 avg60=0.00 avg300=0.00 total=36783

The “full avg10” value is the one we monitor. In the fictional example above, it says that for 1.11% of the last 10 seconds all running processes were prevented from doing useful work because the kernel was busy doing memory management. If that value exceeds 10.0 (or soon, 40.0), psi-monitor will request an OOM kill.
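
The comparison described above can be sketched in shell. This is not psi-monitor's actual code, only an illustration of reading the “full avg10” field and comparing it with a threshold; the function names full_avg10 and exceeds_threshold are hypothetical, and the sample data is the fictional output shown earlier.

```shell
#!/bin/sh
# Print the avg10 figure from the "full" line of PSI memory data on stdin.
full_avg10() {
    awk '$1 == "full" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^avg10=/) { sub(/^avg10=/, "", $i); print $i }
    }'
}

# Exit 0 when value ($1) is strictly greater than threshold ($2).
# awk does the comparison because POSIX sh has no floating-point arithmetic.
exceeds_threshold() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# Fictional sample, copied from the example output above.
sample='some avg10=0.00 avg60=0.00 avg300=0.00 total=37884
full avg10=1.11 avg60=0.00 avg300=0.00 total=36783'

threshold=10.0
value=$(printf '%s\n' "$sample" | full_avg10)

if exceeds_threshold "$value" "$threshold"; then
    echo "full avg10=$value is above $threshold"
else
    echo "full avg10=$value is at or below $threshold"
fi
# prints: full avg10=1.11 is at or below 10.0
```

On a live system the sample can be replaced with the real data, for example value=$(full_avg10 < /proc/pressure/memory).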

Daniel · April 26, 2024, 3:22am

Please retest for this app-killing issue with the beta2 release (it should be available as an automatic update).

wjt · May 2, 2024, 8:47am

@pauldoo have you had a chance to upgrade to the beta2 release, remove your workaround, and see whether the system works better?

system · May 30, 2024, 8:47am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.