Memory Pressure Handling in systemd

When the system is under memory pressure (i.e. some component of the OS requires memory allocation but there is only very little or none available), it can attempt various things to make more memory available again (“reclaim”):

The latter is what we want to focus on in this document: how to ensure userspace processes can detect mounting memory pressure early and release memory back to the kernel as it happens, relieving the memory pressure before it becomes too critical.

The effects of memory pressure during runtime generally are growing latencies during operation: when a program requires memory but the system is busy writing out memory to (relatively slow) disks in order to make some available, this generally surfaces in scheduling latencies, and applications and services will slow down until memory pressure is relieved. Hence, to ensure stable service latencies it is essential to release unneeded memory back to the kernel early on.

On Linux, the Pressure Stall Information (PSI) kernel interface is the primary way to determine whether the system or a part of it is under memory pressure. PSI makes available to userspace a poll()-able file descriptor that gets notifications whenever memory pressure latencies for the system or a control group grow beyond some level.
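To make the PSI mechanism a bit more concrete, here is a minimal sketch of registering a trigger on a PSI file. It assumes the trigger format described in the kernel's PSI documentation ("some" or "full", followed by a stall threshold and a window, both in microseconds); the helper names psi_format_trigger() and psi_open_trigger() are made up for this illustration and are not part of any library.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Format a PSI trigger string: "<some|full> <stall usec> <window usec>".
 * Returns the string length, or -1 if the buffer is too small. */
int psi_format_trigger(char *buf, size_t size, const char *kind,
                       unsigned long stall_usec, unsigned long window_usec) {
        assert(buf && kind);

        int n = snprintf(buf, size, "%s %lu %lu", kind, stall_usec, window_usec);
        return (n < 0 || (size_t) n >= size) ? -1 : n;
}

/* Register the trigger on a PSI file such as /proc/pressure/memory or a
 * control group's memory.pressure file. Returns a file descriptor to
 * poll() for POLLPRI on, or -1 on failure. */
int psi_open_trigger(const char *path, const char *trigger) {
        assert(path && trigger);

        int fd = open(path, O_RDWR | O_NONBLOCK | O_CLOEXEC);
        if (fd < 0)
                return -1;

        /* The kernel documentation's example writes the trailing NUL too. */
        if (write(fd, trigger, strlen(trigger) + 1) < 0) {
                close(fd);
                return -1;
        }

        return fd;
}
```

Services managed by systemd normally do not need to build the trigger string themselves, since the service manager passes the data to write via the protocol described below.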

systemd itself makes use of PSI, and helps applications to do so too. Specifically:

Memory Pressure Service Protocol

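If memory pressure handling for a specific service is enabled via MemoryPressureWatch=, the memory pressure service protocol is used to tell the service code about this. Specifically, the service manager sets two environment variables, $MEMORY_PRESSURE_WATCH and $MEMORY_PRESSURE_WRITE, which are typically consumed by the service. The former carries the file system path to watch, the latter the Base64-encoded data to write into it.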

When a service initializes, it should hence look for $MEMORY_PRESSURE_WATCH. If set, it should try to open the specified path. If it detects that the path refers to a regular file, it should assume it refers to a PSI kernel file. If so, it should write the data from $MEMORY_PRESSURE_WRITE into the file descriptor (after Base64-decoding it, and only if the variable is set) and then watch for POLLPRI events on it. If it detects that the path refers to a FIFO inode, it should open it, write the $MEMORY_PRESSURE_WRITE data into it (as above) and then watch for POLLIN events on it. Whenever POLLIN is seen, it should read and discard any data queued in the FIFO. If the path refers to an AF_UNIX socket in the file system, the application should connect() a stream socket to it, write $MEMORY_PRESSURE_WRITE into it (as above) and watch for POLLIN, discarding any data it might receive.
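Since the contents of $MEMORY_PRESSURE_WRITE must be Base64-decoded before use and C has no standard decoder, a service not linking against a suitable library needs a small helper. The following is a minimal sketch, assuming the standard Base64 alphabet with "=" padding and no embedded whitespace; b64_decode() is a hypothetical name chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Map one Base64 character to its 6-bit value, or -1 if it is invalid. */
static int b64_value(char c) {
        if (c >= 'A' && c <= 'Z') return c - 'A';
        if (c >= 'a' && c <= 'z') return c - 'a' + 26;
        if (c >= '0' && c <= '9') return c - '0' + 52;
        if (c == '+') return 62;
        if (c == '/') return 63;
        return -1;
}

/* Decode the NUL-terminated Base64 string 'in' into 'out' (capacity
 * 'out_size'). Returns the number of decoded bytes, or -1 on malformed
 * input or insufficient space. */
long b64_decode(const char *in, unsigned char *out, size_t out_size) {
        assert(in && out);

        size_t len = strlen(in), n = 0;
        if (len % 4 != 0)
                return -1;

        for (size_t i = 0; i < len; i += 4) {
                int v[4], pad = 0;

                for (int j = 0; j < 4; j++) {
                        if (in[i + j] == '=') {
                                /* Padding may only end the final group. */
                                if (i + 4 != len || j < 2)
                                        return -1;
                                v[j] = 0;
                                pad++;
                        } else {
                                if (pad > 0) /* data after padding */
                                        return -1;
                                v[j] = b64_value(in[i + j]);
                                if (v[j] < 0)
                                        return -1;
                        }
                }

                unsigned long triple = (unsigned long) v[0] << 18 | (unsigned long) v[1] << 12 |
                                       (unsigned long) v[2] << 6 | (unsigned long) v[3];
                size_t produce = 3 - (size_t) pad;
                if (n + produce > out_size)
                        return -1;
                for (size_t k = 0; k < produce; k++)
                        out[n++] = (unsigned char) ((triple >> (16 - 8 * k)) & 0xff);
        }

        return (long) n;
}
```

A service would typically call this on the result of getenv("MEMORY_PRESSURE_WRITE"), skipping the write step entirely if the variable is unset.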

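To summarize:

If $MEMORY_PRESSURE_WATCH points to a regular file: open it and watch for POLLPRI.

If $MEMORY_PRESSURE_WATCH points to a FIFO: open it and watch for POLLIN, reading and discarding any incoming data.

If $MEMORY_PRESSURE_WATCH points to an AF_UNIX socket: connect to it and watch for POLLIN, reading and discarding any incoming data.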

(And in each case, immediately after opening/connecting to the path, write the decoded $MEMORY_PRESSURE_WRITE data into it.)
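The three cases of the protocol could be sketched in C roughly as follows. This is an illustration under stated assumptions, not the implementation used by systemd: the function name memory_pressure_open() is hypothetical, the payload is expected to be Base64-decoded already, and the inode type is determined with a plain stat() before opening, which a hardened implementation might want to avoid.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <unistd.h>

/* Open (or connect to) the path taken from $MEMORY_PRESSURE_WATCH, write the
 * decoded $MEMORY_PRESSURE_WRITE payload if there is one, and store the
 * poll() event to wait for in *ret_events. Returns the fd, or -1 on error. */
int memory_pressure_open(const char *path, const void *data, size_t size, short *ret_events) {
        struct stat st;
        int fd;

        assert(path && ret_events);

        if (stat(path, &st) < 0)
                return -1;

        if (S_ISSOCK(st.st_mode)) {
                /* AF_UNIX socket: connect a stream socket, watch for POLLIN. */
                struct sockaddr_un sa = { .sun_family = AF_UNIX };

                if (strlen(path) >= sizeof(sa.sun_path))
                        return -1;
                strcpy(sa.sun_path, path);

                fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
                if (fd < 0)
                        return -1;
                if (connect(fd, (struct sockaddr *) &sa, sizeof(sa)) < 0) {
                        close(fd);
                        return -1;
                }
                *ret_events = POLLIN;
        } else if (S_ISREG(st.st_mode) || S_ISFIFO(st.st_mode)) {
                /* Regular file (assumed PSI): POLLPRI. FIFO: POLLIN. */
                fd = open(path, (size > 0 ? O_RDWR : O_RDONLY) | O_NONBLOCK | O_CLOEXEC);
                if (fd < 0)
                        return -1;
                *ret_events = S_ISREG(st.st_mode) ? POLLPRI : POLLIN;
        } else
                return -1;

        /* In each case, write the payload right after opening/connecting. */
        if (size > 0 && write(fd, data, size) != (ssize_t) size) {
                close(fd);
                return -1;
        }

        return fd;
}
```

The returned file descriptor and event mask can then be handed to poll() or to whatever event loop the service already uses.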

Whenever a POLLPRI/POLLIN event is seen the service is under memory pressure. It should use this as a hint to release suitable redundant resources, for example:

Which actions precisely to take depends on the service in question. Note that the notifications are delivered when memory allocation latency has already degraded beyond some point. Hence, when deciding which resources to keep and which to discard, keep in mind that latencies incurred recovering discarded resources at a later point are typically acceptable, given that latencies are already affected negatively.
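As one possible illustration of such a release operation, the sketch below drops a purely hypothetical in-process cache (struct cache and cache_release() are invented for this example) and then asks the allocator to return freed memory to the kernel. It assumes glibc, whose malloc_trim() is a non-standard extension; other C libraries may not provide it.

```c
#include <assert.h>
#include <malloc.h>
#include <stdlib.h>

/* Hypothetical cache of heap-allocated entries that can be rebuilt later. */
struct cache {
        void **entries;
        size_t n_entries;
};

/* Drop all cached entries and hand freed heap memory back to the kernel.
 * Returns the number of entries that were released. */
size_t cache_release(struct cache *c) {
        assert(c);

        size_t released = c->n_entries;

        for (size_t i = 0; i < c->n_entries; i++)
                free(c->entries[i]);
        free(c->entries);
        c->entries = NULL;
        c->n_entries = 0;

        /* glibc-specific: release free memory from the top of the heap. */
        malloc_trim(0);

        return released;
}
```

Rebuilding the cache later costs some latency, which is the trade-off discussed above.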

In case the path supplied via $MEMORY_PRESSURE_WATCH points to a PSI kernel API file, or to an AF_UNIX socket, opening it multiple times is safe and reliable, and should deliver notifications to each of the opened file descriptors. This is specifically useful for services that consist of multiple processes, and where each of them shall be able to release resources on memory pressure.

The POLLPRI/POLLIN conditions will be triggered every time memory pressure is detected, but not continuously. It is thus safe to keep poll()-ing on the same file descriptor, and to execute resource release operations whenever the file descriptor triggers, without having to worry about overloading the process.
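A single iteration of such a poll() loop might look like the following sketch. The function name memory_pressure_wait() is hypothetical; it takes the file descriptor and the event mask (POLLPRI or POLLIN) appropriate for the inode type, and for POLLIN it reads and discards queued data as the protocol requires.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Wait up to timeout_ms (-1 for no timeout) for one notification.
 * Returns 1 if memory pressure was signalled, 0 on timeout, -1 on error. */
int memory_pressure_wait(int fd, short events, int timeout_ms) {
        struct pollfd p = { .fd = fd, .events = events };
        int r;

        assert(events == POLLPRI || events == POLLIN);

        do
                r = poll(&p, 1, timeout_ms);
        while (r < 0 && errno == EINTR);
        if (r <= 0)
                return r;

        if (p.revents & events & POLLIN) {
                /* FIFO or socket: read and discard whatever is queued. If more
                 * than one buffer is pending, the next poll() fires again. */
                char buf[4096];

                if (read(fd, buf, sizeof buf) < 0 && errno != EAGAIN)
                        return -1;
                return 1;
        }

        if (p.revents & events & POLLPRI)
                return 1; /* PSI file: nothing to read. */

        return -1; /* POLLERR, POLLHUP or POLLNVAL: the event source is gone. */
}
```

A service would call this in a loop, running its release operations each time 1 is returned.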

(Currently, the protocol defined here only allows configuration of a single “degree” of memory pressure; no distinction is made on how strong the pressure is. In the future, if it becomes apparent that there’s a clear need to extend this, we might add different degrees, most likely by adding additional environment variables such as $MEMORY_PRESSURE_WRITE_LOW and $MEMORY_PRESSURE_WRITE_HIGH or similar, which may contain different settings for lower or higher memory pressure thresholds.)

Service Manager Settings

The service manager provides two per-service settings that control the memory pressure handling:

The /etc/systemd/system.conf file provides two settings that may be used to select the default values for the above settings. If the threshold is configured via neither the per-service nor the system-wide option, it defaults to 100ms.
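As a rough illustration, the fragment below shows how these settings might be written. The setting names other than MemoryPressureWatch= are taken from the systemd.resource-control(5) and systemd-system.conf(5) man pages rather than from this text, and the unit name foo.service as well as all values are merely examples.

```ini
# /etc/systemd/system/foo.service.d/memory-pressure.conf (hypothetical drop-in)
[Service]
MemoryPressureWatch=on
MemoryPressureThresholdSec=150ms

# /etc/systemd/system.conf (system-wide defaults)
[Manager]
DefaultMemoryPressureWatch=auto
DefaultMemoryPressureThresholdSec=150ms
```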

When memory pressure monitoring is enabled for a service via MemoryPressureWatch= this primarily does three things:

Memory Pressure Events in sd-event

The sd-event event loop library provides two API calls that encapsulate the functionality described above:

When implementing a service using sd-event, for automatic memory pressure handling, it’s typically sufficient to add a line such as:

(void) sd_event_add_memory_pressure(event, NULL, NULL, NULL);

– right after allocating the event loop object event.

Other APIs

Other programming environments might have native APIs to watch memory pressure/low memory events. Most notable is probably GLib’s GMemoryMonitor. It currently uses the per-system Linux PSI interface as the backend, but operates differently from the above: memory pressure events are picked up by a system service, which then propagates this through D-Bus to the applications. This is typically less than ideal, since this means each notification event has to traverse three processes before being handled. This traversal creates additional latencies at a time when the system is already experiencing adverse latencies. Moreover, it focuses on system-wide PSI events, even though service-local ones are generally the better approach.