Node being killed by OOM Killer

Beginning early AM on Dec 1, my operator node began to cycle, being repeatedly killed by the OOM killer. This happened with the Nov 29 build and continued with the Dec 01 build.

Dec 02 20:10:06 mina-mainnet-vt-1 systemd[852]: subspace-node.service: A process of this unit has been killed by the OOM killer.
Dec 02 20:10:07 mina-mainnet-vt-1 systemd[852]: subspace-node.service: Main process exited, code=killed, status=9/KILL
Dec 02 20:10:07 mina-mainnet-vt-1 systemd[852]: subspace-node.service: Failed with result 'oom-kill'.
Dec 02 20:10:07 mina-mainnet-vt-1 systemd[852]: subspace-node.service: Consumed 3min 25.311s CPU time.
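
For reference, the same events can be pulled back out of the journal; a rough sketch assuming the unit name above (add or drop --user depending on whether it runs under the system or user manager):

# OOM-related entries for the node service around the crash window
journalctl -u subspace-node.service --since "2023-12-02 20:00" | grep -iE 'oom|killed'
# kernel-side record of the kill (victim process and its memory at the time)
sudo dmesg -T | grep -iE 'oom|out of memory'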

Intel(R) Core™ i9-10900K CPU @ 3.70GHz
32 GB RAM
Running as Operator

ExecStart=/usr/local/bin/subspace-node \
--chain gemini-3g \
--name jkwh1-mmvt1-acli \
--listen-addr /ip4/0.0.0.0/tcp/30501 \
--dsn-listen-on /ip4/0.0.0.0/tcp/30502 \
--dsn-listen-on /ip4/0.0.0.0/udp/30502/quic-v1 \
--rpc-port 9949 \
-- \
--domain-id 1 \
--chain gemini-3g \
--operator-id 36 \
--keystore-path /keys/subspace-keystore
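
Not a fix, but as a stopgap I may try capping the unit's memory so the box stays usable while this is investigated; a sketch of [Service] additions, with values picked arbitrarily for a 32 GB host:

# soft cap: the kernel starts reclaiming/throttling the service above this
MemoryHigh=24G
# hard cap: the service gets OOM-killed inside its own cgroup instead of taking the host down
MemoryMax=28G
# restart automatically after a kill
Restart=on-failure
RestartSec=10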

Before starting subspace:

minar@mina-mainnet-vt-1:~/subspace/subspace-utils$ free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi       600Mi        30Gi        44Ki       113Mi        30Gi
Swap:          976Mi        17Mi       959Mi
minar@mina-mainnet-vt-1:~/subspace/subspace-utils$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             16G     0   16G   0% /dev
tmpfs           3.2G  1.2M  3.2G   1% /run
/dev/nvme0n1p2  1.8T  986G  753G  57% /
tmpfs            16G     0   16G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/nvme0n1p1  511M  5.9M  506M   2% /boot/efi
tmpfs           3.2G     0  3.2G   0% /run/user/1000
/dev/nvme1n1    3.6T  3.3T  128G  97% /mnt/farm1

Rapid memory allocation is visible immediately after starting:

minar@mina-mainnet-vt-1:~/subspace/subspace-utils$ free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi       3.5Gi        26Gi       160Ki       1.7Gi        27Gi
Swap:          976Mi        17Mi       959Mi
minar@mina-mainnet-vt-1:~/subspace/subspace-utils$ free -h; date;
               total        used        free      shared  buff/cache   available
Mem:            31Gi       5.9Gi        17Gi       160Ki       8.6Gi        25Gi
Swap:          976Mi        17Mi       959Mi
Sat Dec  2 08:45:57 PM EST 2023
minar@mina-mainnet-vt-1:~/subspace/subspace-utils$ free -h; date;
               total        used        free      shared  buff/cache   available
Mem:            31Gi       6.3Gi       239Mi       160Ki        25Gi        24Gi
Swap:          976Mi        17Mi       959Mi
Sat Dec  2 08:46:01 PM EST 2023
minar@mina-mainnet-vt-1:~/subspace/subspace-utils$ free -h; date;
               total        used        free      shared  buff/cache   available
Mem:            31Gi       6.6Gi       254Mi       160Ki        24Gi        24Gi
Swap:          976Mi        17Mi       959Mi
Sat Dec  2 08:46:05 PM EST 2023
minar@mina-mainnet-vt-1:~/subspace/subspace-utils$ free -h; date;
               total        used        free      shared  buff/cache   available
Mem:            31Gi        20Gi       257Mi       156Ki        10Gi        10Gi
Swap:          976Mi        44Mi       932Mi
Sat Dec  2 08:48:05 PM EST 2023
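
To keep capturing this instead of re-running free by hand, a minimal sampling loop (just a sketch; the interval and log name are arbitrary):

# sample overall memory plus the node's own RSS every 5 seconds
while true; do date; free -h; ps -o rss=,comm= -C subspace-node; sleep 5; done | tee -a mem-usage.log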

@ved do you have any idea what could be causing this high RAM usage?

For fun, here is a video of the node being killed by the OOM killer once the domain starts processing things, then restarting and quickly eating memory again as the node starts back up.

Nothing interesting in the logs, but I will post them as well.

The OOM killer takes action at 0:40.

Logs available here:

@jim-counter @jrwashburn I’ll try to reproduce this behavior on my end and will check back with you if I need more information. Thanks for reporting!
