Tuning the system for Subspace node/farmer operation

This topic is dedicated to finding, collecting and sharing information and experience on tuning the system to maximise the performance of the Subspace node/farmer. This concerns not only tuning of the OS itself, but also optimisation of the executables (compile-time and post-compilation optimisations, for example).

Mostly aimed at Linux users.

As of kernel 6.6, the default scheduler is now EEVDF instead of CFS. It is supposed to provide much lower latency (compared to CFS), with comparable throughput.

For timekeepers, low latency and high performance are very important. It is possible to improve these by using the Linux kernel’s ability to disable periodic ticks on individual cores/threads (NO_HZ_FULL). Only a single task can run on such cores, and since the node allows you to explicitly specify the cores to be used by the timekeeper, this should work.

To use this feature, you must build a kernel with CONFIG_NO_HZ_FULL. Some distributions already ship kernels with it. You can verify with the following command:

zcat /proc/config.gz | grep CONFIG_NO_HZ_FULL

And if you see a line like this:

CONFIG_NO_HZ_FULL=y

That means it is enabled.

Note: The kernel configuration may be in other places, depending on the distribution you are using. It may be, for example, /boot/config or /boot/config-$(uname -r).
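
In that case you can check the file directly instead, for example:

grep CONFIG_NO_HZ_FULL /boot/config-$(uname -r)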

Now you need to specify the threads that will be isolated and on which periodic ticks will be disabled. These threads will be ignored by the scheduler, and therefore no other tasks should be launched on them.

You can do this by setting the kernel parameters nohz_full and isolcpus=nohz,domain,.... For example, nohz_full=0,1 and isolcpus=nohz,domain,0,1. If you are using hyperthreading, you must disable ticks on the sibling hyperthreads as well. If you have 12 cores and 24 threads, then CPU 0 will be the main thread of the first physical core and CPU 12 its hyperthread sibling.
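
You can check which logical CPUs share a physical core via sysfs, for example:

cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

On a 12-core/24-thread CPU the output will typically be something like 0,12.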

isolcpus removes every CPU in the list from the scheduler’s domains, meaning that the kernel will not do things like run the load balancer for them, and it also disables the scheduler tick (that is what the nohz flag is for). nohz_full= disables the tick as well (yes, there is some overlap in the in-kernel flags, which is why you need both options) and also offloads RCU callbacks and other miscellaneous items.
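
Putting this together for a 12-core/24-thread CPU where the timekeeper should own physical core 0 (logical CPUs 0 and 12), the boot parameters might look like this (a sketch; adjust the CPU numbers to your own topology):

nohz_full=0,12 isolcpus=nohz,domain,0,12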

If you are using GRUB, you can add the boot parameters by editing the GRUB_CMDLINE_LINUX_DEFAULT variable in /etc/default/grub. Just add the desired parameters there. Afterwards, regenerate the GRUB configuration:

grub-mkconfig -o /boot/grub/grub.cfg

And reboot the machine.

I assume that specifying two threads of the same physical core (hyperthreading) will be sufficient for the timekeeper, but I hope to be corrected if I am wrong.

After a reboot, simply instruct the timekeeper to use those threads.
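
For example, assuming the node exposes flags such as --timekeeper and --timekeeper-cpu-cores for this (check the node’s --help output for the exact names), the relevant part of the command might look like:

subspace-node ... --timekeeper --timekeeper-cpu-cores 0,12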

I have not tested this configuration myself yet; I will try it later.

I built a kernel with CONFIG_NO_HZ_FULL and gave it a try. There are some nuances. If using systemd, you should set the CPUAffinity parameter in /etc/systemd/system.conf so that services stay off the cores with disabled ticks. For example, if you have 12 cores and 24 threads and you disable ticks on the twelfth physical core (CPUs 11 and 23), then CPUAffinity should be set to 0-10,12-22. For the node process itself you should then set CPUAffinity=0-23, so it can still reach the isolated CPUs. This can be done with taskset or systemd services.
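
As a sketch, the systemd side of this could look like the following (the node service name is hypothetical; adjust it to however you run the node):

/etc/systemd/system.conf — keep ordinary services off the isolated CPUs:

[Manager]
CPUAffinity=0-10,12-22

/etc/systemd/system/subspace-node.service.d/override.conf — allow the node to use all CPUs, including the isolated ones:

[Service]
CPUAffinity=0-23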

It may also be worthwhile to use isolcpus=managed_irq....
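
For example, extending the parameters above (again, adjust the CPU list to your own topology):

isolcpus=nohz,domain,managed_irq,11,23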

I tried running some benchmarks using kernel 6.6.0 with the sched-ext patchset. Sched-ext adds the ability to run other task schedulers from userspace via eBPF programs. You can read an overview of the patchset here and here, and a detailed description here. On the sched-ext page on Reddit you can find a build guide and other useful information. You can build a kernel with sched-ext using the source code from the Github repository. If you use Arch Linux, you can use PKGBUILDs from AUR: kernel, headers, schedulers. Or you can build CachyOS kernels, which have a large number of other patches/fixes available besides sched-ext. Ebuild files for CachyOS kernels are also available for Gentoo.

Some of the advantages of this approach are the ability to experiment with different schedulers without having to rebuild the kernel, and the ability to place complex logic in the userspace. The latter opens up many possibilities, including task scheduling using neural networks.

My experiments were limited to a single scheduler – scx_rusty. It is one of the few production-ready sched-ext schedulers available and is used in production at Meta. scx_rusty is much better adapted to modern processors with their complex L3 cache topologies.

The tests were performed on two machines: an AMD Ryzen 9 3900 (12 cores, 24 threads) and an Intel Xeon W-2145 (8 cores, 16 threads). I used optimised kernels from CachyOS tuned for server workloads. I compared both plotting benchmark results and real-world performance with and without farming. scx_rusty was used with default settings.


Intel Xeon W-2145:
The performance difference in plotting benchmarks between scx_rusty disabled and enabled was around 5% in favour of the latter.

Without scx_rusty, plotting performance with farming enabled was ~9 sectors per hour. At the time of the tests, ~350 sectors were completed. Farming was not bringing any rewards. There were reports of insufficient farming performance on the node side.

With scx_rusty, plotting performance with farming enabled was ~10-11 sectors per hour. At the time of the tests, ~365 sectors were completed. Farming was bringing rewards, ~2-4 per hour (I mean messages about successful signing of the rewards hash).

With scx_rusty, plotting performance with farming turned off was a steady 11 sectors per hour.


AMD Ryzen 9 3900:
The performance difference in plotting benchmarks between scx_rusty turned off and turned on was around 17% in favour of the latter.

I haven’t made a performance comparison in real world conditions yet.

Could you do some standard PoS and farmer’s auditing benchmark with and without scx_rusty? If you can post benchmark results here as they are printed that’d be ideal.

Intel Xeon W-2145:

scx_rusty:
chia/table/parallel time: [325.33 ms 326.20 ms 327.17 ms]
audit/plot/sync/single time: [28.843 ms 29.094 ms 29.340 ms]
audit/plot/sync/rayon time: [27.652 ms 28.039 ms 28.425 ms]
audit/plot/monoio time: [26.596 ms 27.155 ms 27.838 ms]

default:
chia/table/parallel time: [353.02 ms 354.13 ms 355.25 ms]
audit/plot/sync/single time: [28.834 ms 29.176 ms 29.557 ms]
audit/plot/sync/rayon time: [27.931 ms 28.293 ms 28.660 ms]
audit/plot/monoio time: [26.808 ms 27.434 ms 28.145 ms]

I don’t have access to an AMD Ryzen 9 3900 right now, so I can’t run benchmarks on it.

That is a very good improvement just from changing the CPU scheduler!

I will experiment with pinning threads to CPU cores; that might help with performance if these numbers are to be believed, since we have a fixed-size thread pool and don’t need the scheduler’s help when we engage all the cores at the same time.

Sounds great. If threads are pinned to cores, it will be possible to use nohz_full for plotting, which can lead to additional performance improvements and reduced latency.

This can be used to isolate threads allocated to virtual machines to minimise overhead from the host task scheduling.

So that others can try sched-ext with scx_rusty for themselves, I have built deb packages of kernel 6.6.1 with patches from CachyOS, sched-ext and BORE.

I built them in an Ubuntu 20.04 environment, so they should work there, and will probably work in Ubuntu 22.04. Since I don’t use Debian-based Linux distributions and haven’t dealt with their build scripts before, I had to look into it. I found the kernel package build scripts in Ubuntu to be confusing and inflexible, not even allowing me to simply change the compiler from GCC to Clang. Using LLVM/Clang>=16 is essential for building the additional schedulers (including scx_rusty). After some “dancing with tambourines” and editing the build scripts, I was able to build the deb packages. If you want to build them yourself (e.g. due to an understandable distrust of my builds), I can provide the set of steps (dances with tambourines). Use these packages at your own risk. Also note that if you use any external kernel modules (e.g. Nvidia drivers), you will need to use their DKMS versions. I am not responsible for this. The link will be valid for a week, but I can provide a new link after that if needed.

This kernel mostly uses the Ubuntu configuration, but with a few differences (besides the previously mentioned patchsets).

The kernel is built with Clang 17;
SCHED_BORE (BORE) is used;
SCHED_CLASS_EXT (sched-ext) is used;
The kernel is built with Full LTO;
BBRv3 and FQ (Fair Queue) are used by default;
PER_VMA_LOCK is used;
Kernel timer frequency of 100 (instead of 250) is used;
Performance governor is used by default;
Periodic kernel ticks (HZ_PERIODIC) are enabled;
Kernel preemption is disabled (server);
The kernel is built with -O3;
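
After installing and booting the new kernel (steps below), you can double-check some of these options, for example:

grep -E 'CONFIG_SCHED_BORE|CONFIG_SCHED_CLASS_EXT|CONFIG_HZ=|CONFIG_PREEMPT' /boot/config-$(uname -r)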

Here is a general set of steps to install this kernel on Ubuntu 20.04.

Install LLVM 17:

wget https://apt.llvm.org/llvm.sh 
chmod +x llvm.sh 
sudo ./llvm.sh 17

Download the archive with the deb packages:

wget --content-disposition https://oshi.at/SnUL

It should save to a file called linux-kernel-cachyos-6.6.1-sched-ext.tar.

Extract the archive:

tar -xf linux-kernel-cachyos-6.6.1-sched-ext.tar

Navigate to the directory with the deb packages:

cd deb

And install them:

sudo apt install ./*.deb

You may not need all the packages there, but if you don’t know for sure, install everything.

Install the DKMS versions of the external modules if you haven’t done so before.

After installing the packages, the GRUB configuration should be updated automatically (if you are using GRUB), but you can do it manually just in case:

sudo update-grub

The output of this command should mention the 6.6.1+cachyos-060601 kernel. If it does, you can reboot the machine and select that kernel from the GRUB menu.

You can always select a different kernel from the same menu if you have problems with this one.

After booting into the system, make sure you are actually running that particular kernel:

uname -r

If this is the case, you can try using scx_rusty:

sudo /usr/lib/linux-tools-6.6.1+cachyos-060601/scx_rusty

If lines with scheduler statistics start appearing and the system doesn’t crash, you can use it. The scx_rusty scheduler can cause system crashes on hybrid processors (e.g., i9 12900K).

You can run scx_rusty manually (using screen, for example) or through a systemd service.

Create the service file:

EDITOR=nano sudo -e /etc/systemd/system/scx_rusty.service

And paste the following contents into it:

[Unit]
Description=Rusty task scheduler

[Service]
User=root
ExecStart=/usr/lib/linux-tools-6.6.1+cachyos-060601/scx_rusty
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Start the service:

sudo systemctl start scx_rusty

And check its status:

sudo systemctl status scx_rusty

If everything works, you can enable it:

sudo systemctl enable scx_rusty

EDIT: This only works on Ubuntu 20.04 and Debian-based distributions with a similar package base. It does not work on Ubuntu 22.04 due to unsatisfied dependencies.

scx_rusty also has quite a few parameters to configure. The full list and their descriptions can be obtained by running it with the --help flag. Here are some that are worth paying attention to:


--slice-us: In case of processor-bound tasks, a longer slice might be beneficial to reduce context-switching overhead. You could increase this from the default 20000 microseconds to a higher value, experimenting to find the optimal balance.

--interval: With a mix of IO and CPU-bound tasks, a shorter interval might help with more responsive load balancing.

--tune-interval: A more frequent tuning interval might be advantageous for dynamic adjustments. Consider reducing it from the default 0.1 seconds.

--load-decay-factor: A lower decay factor might be better for a server with a mix of IO and CPU-bound tasks, as it makes the load calculation more sensitive to recent changes. Consider values lower than the default 0.5.

I tried to evaluate the performance of the farmer when using scx_rusty with the following parameters:

scx_rusty --slice-us 30000 --interval 1.0 --tune-interval 0.05 --greedy-threshold 2 --load-decay-factor 0.4

As a result, the audit and plotting benchmarks showed changes within the margin of error compared to the default scx_rusty settings. But under real conditions the replotting speed increased by around 10%: comparing the number of sectors produced over the last 6 hours, it was 55 for the default settings versus 62 for the settings above.

Farming does not seem to have been affected in any way, and it continues to work very well.

I set the replotting thread pool size to all threads in the system, and likewise for farming. For the W-2145 that is 8 cores/16 threads.

It has been noticed that when using scx_rusty the node and farmer may exit with the error specified in this post. If you use a supervisor (like systemd), there should be no problem, as the services will just restart and keep running.
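
For example, a small drop-in like the following (the service name here is hypothetical; adjust it to however you run the farmer) is enough for systemd to restart the process automatically:

/etc/systemd/system/subspace-farmer.service.d/restart.conf:

[Service]
Restart=on-failure
RestartSec=10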

I haven’t tested with Subspace, but potentially scx_rusty can cause problems when using docker containers. See this post.

New link to download deb packages of the Linux 6.6.1 kernel with sched-ext: https://oshi.at/fuVL.

It will be valid for about a month.

Based on learning that CPU scheduling impacts performance, I decided to try the simplest thing: pinning parallel threads to CPU cores, which prevents the kernel from moving those threads between cores.
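
For illustration, here is a minimal sketch of the idea in Rust, assuming a rayon thread pool and the core_affinity crate (the farmer’s actual implementation may differ):

use rayon::ThreadPoolBuilder;

fn main() {
    // Enumerate the logical CPUs once; each worker pins itself on startup.
    let core_ids = core_affinity::get_core_ids().expect("failed to query core IDs");

    let pool = ThreadPoolBuilder::new()
        .num_threads(core_ids.len())
        .start_handler(move |thread_index| {
            // Pin worker `thread_index` to a fixed core so the kernel
            // scheduler never migrates it between CPUs.
            core_affinity::set_for_current(core_ids[thread_index]);
        })
        .build()
        .expect("failed to build thread pool");

    pool.install(|| {
        // plotting/encoding work runs on the pinned workers here
    });
}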

Results of a simple benchmark are positive:

Default:
chia/table/parallel     time:   [206.19 ms 207.68 ms 209.06 ms]
Threads pinned to CPU cores:
chia/table/parallel     time:   [196.82 ms 198.04 ms 199.18 ms]

The difference is not massive, but I’ll take it.

Will try to incorporate it with some other changes I’m planning to do soon.

Turns out that when the whole farmer is put together, pinning threads to cores, at least on my machine, actually results in lower performance, not higher.

All threads are underutilized, but it is not clear why.

Specifically, with non-pinned threads it took 2m31s to plot one sector and with pinning 4m7s, which is a massive slow-down.