NUMA support is coming

For a long time farmers have been saying that plotting is slow on large CPUs; now it is time to change that!

I’ve been hacking on NUMA support that should make things much better, and I need folks to test it and provide feedback to confirm it is actually a positive change.

Please read this post to the very end before replying!

What is changing

Several behaviors of the farmer will change.

Global thread pools

Previously, plotting/replotting thread pools were created for each farm separately, even though only a configured number of them can be used at a time (by default just one). With the upcoming changes there is a thread pool manager that creates the necessary number of thread pools, which are then allocated to the farms that are currently plotting/replotting.
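
For the curious, here is a rough sketch of the idea (illustrative only, not the actual farmer code; the `ThreadPoolManager` name and its methods are made up for this example, assuming the rayon crate):

```rust
use rayon::{ThreadPool, ThreadPoolBuilder};
use std::sync::Mutex;

struct ThreadPoolManager {
    // A fixed set of pools created up-front and handed out to whichever
    // farms are currently plotting/replotting, instead of one pool per farm.
    idle_pools: Mutex<Vec<ThreadPool>>,
}

impl ThreadPoolManager {
    /// Create `pool_count` pools with `threads_per_pool` threads each.
    fn new(pool_count: usize, threads_per_pool: usize) -> Self {
        let pools = (0..pool_count)
            .map(|_| {
                ThreadPoolBuilder::new()
                    .num_threads(threads_per_pool)
                    .build()
                    .expect("failed to build thread pool")
            })
            .collect();
        Self { idle_pools: Mutex::new(pools) }
    }

    /// Take a pool for the duration of one sector's plotting,
    /// then give it back so another farm can use it.
    fn acquire(&self) -> Option<ThreadPool> {
        self.idle_pools.lock().unwrap().pop()
    }

    fn release(&self, pool: ThreadPool) {
        self.idle_pools.lock().unwrap().push(pool);
    }
}

fn main() {
    // E.g. 2 pools with 4 threads each on an 8-core machine.
    let manager = ThreadPoolManager::new(2, 4);
    let pool = manager.acquire().expect("a pool is available");
    let sum: u64 = pool.install(|| (0..1_000u64).sum());
    println!("sum = {sum}");
    manager.release(pool);
}
```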

Thread pinning

When a thread pool is created, it is assigned to a set of CPU cores and will only be able to use those cores. Pinning doesn’t map threads to cores 1:1, because I noticed that makes plotting A LOT slower, at least on my machine; instead the OS is free to move threads between cores, but only within the CPU cores allocated to the thread pool. This ensures that plotting of a particular sector only happens on a particular CPU/NUMA node.
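
On Linux, this "pin to a set of cores" behavior can be reproduced with an affinity mask that covers the whole set. Here is a minimal Linux-only sketch (assuming the rayon and libc crates; the core IDs are just an example, and this is not the farmer's actual code):

```rust
use rayon::ThreadPoolBuilder;

// Give the current thread an affinity mask covering the whole core set:
// the OS may still move it between these cores, but not outside of them.
#[cfg(target_os = "linux")]
fn pin_current_thread_to(cores: &[usize]) {
    unsafe {
        let mut set: libc::cpu_set_t = std::mem::zeroed();
        libc::CPU_ZERO(&mut set);
        for &core in cores {
            libc::CPU_SET(core, &mut set);
        }
        // pid 0 means "the calling thread"
        libc::sched_setaffinity(0, std::mem::size_of::<libc::cpu_set_t>(), &set);
    }
}

#[cfg(not(target_os = "linux"))]
fn pin_current_thread_to(_cores: &[usize]) {}

fn main() {
    // Example: dedicate cores 0-7 (say, one NUMA node) to this pool.
    let cores: Vec<usize> = (0..8).collect();
    let pool = ThreadPoolBuilder::new()
        .num_threads(cores.len())
        // Every worker gets the same mask, so there is no 1:1 pinning.
        .start_handler(move |_thread_index| pin_current_thread_to(&cores))
        .build()
        .unwrap();

    pool.install(|| {
        // Anything running in this pool stays on cores 0-7.
    });
}
```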

NUMA support

On Linux and Windows the farmer will detect NUMA systems and create a number of thread pools corresponding to the number of NUMA nodes. This means the default behavior will change for large CPUs and will consume more memory as a result, but it can be reverted to the previous behavior with the familiar CLI options if desired.

NOTE: You will have to enable NUMA in the BIOS of your motherboard for the farmer to know it exists. This option is definitely present on motherboards for Threadripper/Epyc processors, but might exist on others too. If you don’t enable it, both the OS and the farmer will think you have a single UMA processor and will not be able to apply optimizations!
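
To double-check what the OS actually sees after changing the BIOS setting, you can count the node directories the Linux kernel exposes under /sys (a quick standalone sketch; the farmer itself detects topology differently):

```rust
use std::fs;

/// Count NUMA nodes exposed by the Linux kernel as
/// /sys/devices/system/node/node0, node1, ...
fn numa_node_count() -> usize {
    let Ok(entries) = fs::read_dir("/sys/devices/system/node") else {
        return 1; // no NUMA information exposed; assume a single UMA node
    };
    entries
        .filter_map(Result::ok)
        .filter(|entry| {
            let name = entry.file_name();
            let name = name.to_string_lossy();
            name.len() > 4
                && name.starts_with("node")
                && name[4..].chars().all(|c| c.is_ascii_digit())
        })
        .count()
        .max(1)
}

fn main() {
    println!("NUMA nodes visible to the OS: {}", numa_node_count());
}
```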

Experimental NUMA-aware memory allocator

The mimalloc allocator we are using apparently has opt-in NUMA-aware allocation support, which is now also exposed in the Subspace farmer.
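
For reference, this is how mimalloc is typically wired up as the global allocator in a Rust binary (via the mimalloc crate); the NUMA-aware behavior is an opt-in inside the allocator itself and, in the farmer, is controlled by an environment variable described in the testing steps below:

```rust
use mimalloc::MiMalloc;

// Route all heap allocations in this binary through mimalloc.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    let v = vec![0u8; 1024];
    println!("allocated {} bytes via mimalloc", v.len());
}
```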

What and how to test

For testing purposes I have created a test build here: Snapshot build · subspace/subspace@1e88a23 · GitHub
You’ll need a GitHub account to see the build artifacts at the bottom of that page; for container images, the gemini-3g-backport-numa-support tag can be used.

To confirm the changes are positive, I’d like you to test the following scenarios:

  1. the last release (dec-22) with the default configuration and no tweaks to thread pools, concurrent encodings, etc.; just stock behavior
  2. if you have done some CLI tweaks, test with them too
  3. this experimental build with defaults (don’t change CLI options around thread pools, concurrent encodings, etc.; none of that should be necessary anymore)
  4. the same as 3., but with the environment variable NUMA_ALLOCATOR=1 set, to use the experimental NUMA-aware memory allocator that might further improve performance by keeping both compute and memory mostly within the same NUMA node

Results

Please post results in the following format:

  0. CPU AMD Epyc 7302 x2, RAM 16G DDR4 x16 (meaning the motherboard has two Epyc 7302 processors installed and 16 memory modules, 16G each)
  1. 5m10s per sector (one sector is encoded at a time by default)
  2. 6m0s per sector, 4 sectors at a time (meaning the number of downloaded and encoded sectors was manually increased)
  3. 4m30s per sector, 8 sectors at a time (meaning 8 NUMA nodes)
  4. 3m30s per sector, 8 sectors at a time (meaning 8 NUMA nodes)

Where 0 is information about your system and 1…4 correspond to the tests described above.

Important remarks

First of all, these changes only benefit NUMA systems, AND only if multiple drives are used with the same farmer application so that concurrent encoding is possible.

Please keep this thread low-bandwidth and only use it to post results.
If you don’t know how to test or how to set environment variables, or you see errors in the process, etc., ask in Discord and someone will help you (please don’t tag me directly).

It seems like a good optimization, but in my tests, if the number of threads matches the number of threads in a NUMA node, the system scheduler will automatically schedule them in accordance with NUMA binding; the number of NUMA misses does not appear to increase rapidly.

Can you tell me whether the PieceOffset in sector encoding relies on the result of the previous iteration each time (piece_offset, record)? Is this combined iterator only capable of serial execution? If it can be modified for parallel execution, I would like to try adapting it to utilize other hardware offloading features like Intel VAC or Intel Phi for parallel computing.

Windows 10 AMD Ryzen 9 7950X / 96GB 6400

gemini-3g-2023-dec-22
--sector-downloading-concurrency 2
--sector-encoding-concurrency 2
CPU 71% Replotting sector 3.40/2 = 1.70m

executables-windows-x86_64-skylake-gemini-3g-backport-numa-support
--sector-downloading-concurrency 2
--sector-encoding-concurrency 2
CPU 46% Replotting sector 4.34/2 = 2.17m

@Nacho-Neko I asked to keep this thread for testing results. If you want to ask questions about how things work, create a separate topic.

@nonstoper thanks for the results, but that is not quite how/what I asked to test; please check the description carefully. Also, ideally you’d test initial plotting rather than replotting.

Before this, I solved the problem by making two separate farms and running two farmer instances, allocating a NUMA node to each with /NODE, in the advanced version. It worked fine, but this is definitely good news; I will try to test it ASAP. Also, it will be awesome if the memory leak on Windows gets fixed!

  0. Xeon E5-2680v2 x 2, 8G DDR3 1066 x 24
  1. 6m34s
  2. 7m17s, 2 at a time (3:38.6 each) - two farmer processes
  3. 7m13s, 2 at a time (3:36.5 each)
  4. 7m35s, 2 at a time (3:47.8 each)

Notes:

  • all tests reliable; added an additional disk with a large plot file for consistent data
  • there is drift between CPUs in tests 2-4; after 1h+, single blocks appear minutes apart
  • the time difference between tests 3 and 4 is consistent with previous observations of numactl -C vs. numactl --cpunodebind=X --membind=X, numactl -C being slightly faster
  • pretty good results on test 3, congratulations.

I’d seriously reconsider whether a NUMA-aware farmer should be a priority. Turnout here suggests otherwise. It can also be forced as in test 2, with minimal effort.

  0. CPU Intel E5-2696 V3 x2, RAM 32G DDR4 x8
  1. 15m per sector
  2. 35m per sector, 8 sectors at a time

This is the analysis result with NUMA-aware allocation:

Collection and Platform Info
Application Command Line:
Operating System: 6.5.0-14-generic DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=23.10
DISTRIB_CODENAME=mantic
DISTRIB_DESCRIPTION="Ubuntu 23.10"
Computer Name: disk
Result Size: 3.5 MB
Collection start time: 04:15:17 02/01/2024 UTC
Collection stop time: 04:30:24 02/01/2024 UTC
Collector Type: Driverless Perf per-process counting
Finalization mode: Fast. If the number of collected samples exceeds the threshold, this mode limits the number of processed samples to speed up post-processing.
CPU
Name: Intel(R) Xeon(R) Processor code named Skylake
Frequency: 1.9 GHz
Logical CPU Count: 48
LLC size: 34.6 MB
Cache Allocation Technology
Level 2 capability: not detected
Level 3 capability: available

Elapsed Time: 906.410s
IPC: 1.706
SP GFLOPS: 0.000
DP GFLOPS: 0.000
x87 GFLOPS: 0.002
Average CPU Frequency: 2.1 GHz

[profiler screenshots]

This is the performance analysis result using gemini-3g-2023-dec-22:

Elapsed Time: 903.192s
IPC: 1.196
SP GFLOPS: 0.000
DP GFLOPS: 0.000
x87 GFLOPS: 0.002
Average CPU Frequency: 2.1 GHz

[profiler screenshot]

It seems that NUMA awareness indeed reduces the probability of remote NUMA accesses caused by concurrent threads being allocated and recycled across NUMA nodes, while also improving IPC. Therefore, this should be an effective optimization strategy for machines with multiple NUMA nodes.

That is great info, thanks!
Though it is not quite what I requested and not in the format I requested.

I’m looking for short messages in the format defined in the very first post. This topic is not for general discussion purposes.

Although I understand that you are requesting feedback in the prescribed format on the plotting and replotting speed improvements brought about by NUMA awareness, this profiling information is more direct. The improvement in plotting speed depends on reasonable parameters and the number of disks, so timings alone do not show a substantial effect.

CPU 8272CL x2, RAM 16G DDR4, 2 NUMA nodes

default:
two sectors in 5-6 minutes

2024-01-02T09:54:28.148502Z  INFO single_disk_farm{disk_farm_index=1}: subspace_farmer::single_disk_farm::plotting: Plotting sector (14.15% complete) sector_index=991
2024-01-02T09:54:36.474762Z  INFO single_disk_farm{disk_farm_index=3}: subspace_farmer::single_disk_farm::plotting: Plotting sector (57.58% complete) sector_index=4032
2024-01-02T09:57:14.527826Z  INFO single_disk_farm{disk_farm_index=1}: subspace_farmer::reward_signing: Successfully signed reward hash 0xf931c3e1372881a34ad2d455af2e92aa4a76e3eddb38f90172eab295ce31d42a
2024-01-02T10:00:18.181608Z  INFO single_disk_farm{disk_farm_index=0}: subspace_farmer::single_disk_farm::plotting: Plotting sector (17.72% complete) sector_index=1241
2024-01-02T10:00:19.565399Z  INFO single_disk_farm{disk_farm_index=2}: subspace_farmer::single_disk_farm::plotting: Plotting sector (19.29% complete) sector_index=1351
2024-01-02T10:05:54.119420Z  INFO single_disk_farm{disk_farm_index=1}: subspace_farmer::single_disk_farm::plotting: Plotting sector (14.17% complete) sector_index=992
2024-01-02T10:06:02.773497Z  INFO single_disk_farm{disk_farm_index=3}: subspace_farmer::single_disk_farm::plotting: Plotting sector (57.60% complete) sector_index=4033

--sector-downloading-concurrency 4 --sector-encoding-concurrency 4

2024-01-02T10:11:05.249944Z  INFO single_disk_farm{disk_farm_index=0}: subspace_farmer::single_disk_farm::plotting: Plotting sector (17.72% complete) sector_index=1241
2024-01-02T10:11:05.250028Z  INFO single_disk_farm{disk_farm_index=3}: subspace_farmer::single_disk_farm::plotting: Plotting sector (57.60% complete) sector_index=4033
2024-01-02T10:11:05.256351Z  INFO single_disk_farm{disk_farm_index=2}: subspace_farmer::single_disk_farm::plotting: Plotting sector (19.29% complete) sector_index=1351
2024-01-02T10:11:05.256364Z  INFO single_disk_farm{disk_farm_index=1}: subspace_farmer::single_disk_farm::plotting: Plotting sector (14.17% complete) sector_index=992
2024-01-02T10:23:55.842171Z  INFO single_disk_farm{disk_farm_index=1}: subspace_farmer::single_disk_farm::plotting: Plotting sector (14.18% complete) sector_index=993
2024-01-02T10:24:11.583385Z  INFO single_disk_farm{disk_farm_index=3}: subspace_farmer::single_disk_farm::plotting: Plotting sector (57.61% complete) sector_index=4034
2024-01-02T10:26:09.866234Z  INFO single_disk_farm{disk_farm_index=0}: subspace_farmer::single_disk_farm::plotting: Plotting sector (17.74% complete) sector_index=1242
2024-01-02T10:26:13.309618Z  INFO single_disk_farm{disk_farm_index=2}: subspace_farmer::single_disk_farm::plotting: Plotting sector (19.31% complete) sector_index=1352

In my routine operation, I run four farmer processes and complete a sector every seven minutes per process, a rate that matches the expected performance with default parameters. Yet setting the concurrency parameters to 4 leads to abnormally slow speed, which I suspect is because my server has only two NUMA nodes.

The peak performance of this server is expected to be four plotted sectors every seven minutes.

Can you put your results into the requested format, please? It is hard to analyze them with a bunch of extra information and some of the requested information missing. See the examples above from erdnapa and PuNkYsHuNgRy.

  0. CPU 8272CL x2, RAM 16G DDR4, 2 NUMA nodes
  1. 5-6m per sector
  2. 5-6m per sector, 2 sectors at a time
  3. 6-7m per sector, 2 sectors at a time
  4. 13m per sector, 4 sectors at a time

NUMA support was released in Release gemini-3g-2024-jan-03 · subspace/subspace · GitHub

Thanks everyone for testing!

  0. TR 3960X, 32G DDR4 3600 x 8
  1. 6m10s
  2. not tested due to download breaks; old data ~1m18s-1m20s/sector (numactl)
  3. 10m30.5s, 8 at a time (1m18.8s each)
  4. 10m43.7s, 8 at a time (1m20.5s each)

Notes:
For some reason I can’t convince the BIOS to use values other than 1 (no NUMA) or 8 NUMA nodes.

Also, I found some issues with the NUMA plotter; where should I report those?

Create a separate forum thread first
