Attempts to optimise the Subspace node at compile time

This thread was transferred to the forum from an existing Discord thread.

It is mostly dedicated to finding optimal compilation parameters for building a node/farmer.

I ran some PoS benchmarks on two different platforms, Zen 3 Ryzen and Skylake Xeon.

Here are the results for the main branch. The value in each column is the difference from the best result obtained (the top row). The first column is single-threaded performance, the second column is multi-threaded performance.

Zen 3 Ryzen
skylake, generic, no-avx, no-avx2, no-fma: 0.0%, 0.0%
skylake, generic, no-avx, no-avx2: -0.5%, -1.5%
skylake, core-avx-i: -6.0%, -1.5%
skylake, core-avx2: -2.6%, -1.5%
skylake: -2.5%, -1.5%
native: -2.6%, -14%
x86-64: -0.5%, -11%

Skylake Xeon
skylake, core-avx-i: 0.0%, 0.0%
skylake, core-avx2: -0.0%, -0.2%
skylake, native: -0.0%, -1.2%
skylake, generic, no-avx, no-avx2, no-fma: -0.0%, -1.5%
skylake: -1.2%, -1.7%
native: -9.7%, -1.7%

The results for gemini-3f-maintenance show approximately the same results.

Upgrading from the Rust version nightly-2023-07-21 to nightly-2023-10-05 gives a 35-36% performance gain in single-threaded mode and 15-16% performance gain in multi-threaded mode for PoS benchmarks on Skylake Xeon, but some regression seems to show up on Zen 3 Ryzen (performance dropped by the same values).

It seems that the LLVM argument -inline-threshold=500 can noticeably improve performance when used on nightly-2023-10-05. On a Skylake Xeon, there was a performance gain for PoS benchmarks of ~35% for single-threaded and ~25% for multi-threaded.

On Zen 3 Ryzen, the gains were as high as ~50% and ~44% respectively.

It also seems to almost completely level out the previously mentioned regressions, although performance on nightly-2023-07-21 is still the smallest bit higher.
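For anyone who wants to try the same options, here is a minimal sketch of how they can be combined into one RUSTFLAGS value. rustc selects the CPU with `-C target-cpu=...` and forwards LLVM options through `-C llvm-args=...`. The helper name `rustflags` is made up for illustration, and this is not necessarily the exact command line used for the numbers in this thread.

```rust
/// Builds a RUSTFLAGS string from a target CPU and optional LLVM arguments.
/// `rustflags` is a hypothetical helper for illustration only.
fn rustflags(target_cpu: &str, llvm_args: &[&str]) -> String {
    let mut flags = vec![format!("-C target-cpu={target_cpu}")];
    for arg in llvm_args {
        // This sketch emits one `-C llvm-args=` flag per LLVM option.
        flags.push(format!("-C llvm-args={arg}"));
    }
    flags.join(" ")
}

fn main() {
    // Values taken from the thread; swap in the CPU you are building for.
    let flags = rustflags("skylake", &["-inline-threshold=512"]);
    println!("RUSTFLAGS=\"{flags}\"");
}
```

The printed value can then be exported as the RUSTFLAGS environment variable before building or benchmarking.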

I've noticed that while there is a noticeable change in PoS benchmark performance, it has little effect on the performance of plotting directly. At best, the impact is less than 1%. Only optimisations explicitly or implicitly related to register optimisation help noticeably. Using LLVM parameters -inline-threshold=2048 and -polly + -polly-register-tiling increased performance by a few percent.

Neither profiling, BOLT, nor Polly (except -polly-register-tiling) helped.

Values of -inline-threshold above ~500 only hurt PoS benchmark performance, but improve overall plotting performance.

The main calculations during plotting are performed by functions from the blst library, and since they are mostly written in assembly, almost no compilation parameters will help here.

I can't replicate the compiler upgrade numbers on a 13900K (with -C target-cpu=raptorlake).

nightly-2023-07-21:

chia/table/parallel     time:   [105.89 ms 106.68 ms 107.89 ms]

nightly-2023-10-11:

chia/table/parallel     time:   [102.65 ms 103.87 ms 105.44 ms]

So it is a bit faster, but nowhere near 15%.

PoS was already quite heavily optimized and is meant to be mostly memory-bound, so it is no wonder that it is not easy to accelerate with compilation options alone. The operations used there are quite simple and efficient, and most of the "optimizations" I have tried there lately only decreased performance.

And while yes, blst is quite heavily optimized, the KZG library is not so much. I still believe there are ways to improve it, but they are very non-trivial and not fixable with compiler options, I'm afraid: Fr FFT doesn't really use all the cores fully · Issue #227 · sifraitech/rust-kzg · GitHub

It'd be great if you could post before/after results of Criterion benches rather than a single number. Make sure to look at absolute numbers, not just the percentage difference it shows since the last run (or create baseline results and compare against those rather than the last run).

I'm not registering any performance difference with inline-threshold=500 on -C target-cpu=raptorlake.

It'd be great if you provided the exact commands you were using when benchmarking as well.

@nazar-pc It's probably my mistake. I don't remember how I came to this conclusion, but I am now seeing exceptional performance regressions when switching to nightly-2023-10-05 on both machines. But these regressions only show up when building with the production profile, meaning LTO is most likely the cause. Using -inline-threshold=512 levels out the regressions. In plotting benchmarks, -inline-threshold=2048 improves performance by a few percent.

Rust nightly-2023-07-21 + release:

chia/table/single       time:   [222.52 ms 222.80 ms 223.14 ms]
chia/table/parallel     time:   [44.682 ms 44.831 ms 44.997 ms]

Rust nightly-2023-10-05 + release:

chia/table/single       time:   [221.86 ms 222.29 ms 222.78 ms]
chia/table/parallel     time:   [45.037 ms 45.232 ms 45.450 ms]

Rust nightly-2023-07-21 + production:

chia/table/single       time:   [209.31 ms 210.98 ms 213.21 ms]
chia/table/parallel     time:   [41.975 ms 42.134 ms 42.315 ms]

Rust nightly-2023-10-05 + production:

chia/table/single       time:   [286.80 ms 287.36 ms 288.08 ms]
chia/table/parallel     time:   [54.406 ms 54.655 ms 54.915 ms]

Rust nightly-2023-07-21 + production + inline-threshold=512:

chia/table/single       time:   [205.71 ms 205.98 ms 206.32 ms]
chia/table/parallel     time:   [41.638 ms 41.814 ms 42.009 ms]

Rust nightly-2023-10-05 + production + inline-threshold=512:

chia/table/single       time:   [205.48 ms 206.09 ms 206.94 ms]
chia/table/parallel     time:   [42.016 ms 42.155 ms 42.316 ms]
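
To make the comparison above easier to read, here is a small sketch that computes the relative change between two Criterion estimates and checks whether their reported intervals overlap, which is a rough way to tell a real shift from noise. The `Estimate` type and the helper names are assumptions made for illustration and are not part of Criterion. The values are the `chia/table/single` production results quoted above.

```rust
/// One Criterion result line: [lower bound, point estimate, upper bound], in ms.
/// `Estimate` is a hypothetical type for this sketch, not part of Criterion.
#[derive(Clone, Copy)]
struct Estimate {
    low: f64,
    mid: f64,
    high: f64,
}

/// Relative change of the point estimate in percent; positive means slower.
fn relative_change(before: Estimate, after: Estimate) -> f64 {
    (after.mid - before.mid) / before.mid * 100.0
}

/// True if the two intervals overlap, i.e. the difference may be just noise.
fn intervals_overlap(a: Estimate, b: Estimate) -> bool {
    a.low <= b.high && b.low <= a.high
}

fn main() {
    // chia/table/single, production profile, values from this thread.
    let old = Estimate { low: 209.31, mid: 210.98, high: 213.21 };
    let new = Estimate { low: 286.80, mid: 287.36, high: 288.08 };
    let new_inline = Estimate { low: 205.48, mid: 206.09, high: 206.94 };

    println!("07-21 -> 10-05: {:+.1}%", relative_change(old, new));
    println!(
        "07-21 -> 10-05 + inline-threshold=512: {:+.1}%",
        relative_change(old, new_inline)
    );
    println!("intervals overlap: {}", intervals_overlap(old, new));
}
```

With these inputs the production build on nightly-2023-10-05 comes out roughly 36% slower in single-threaded mode and the intervals do not overlap, while the release-profile results for the two compilers do overlap.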

I also found this article.

I found that codegen-units=1 decreases performance, asked about this on Zulip: https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/Is.20codegen-units.3D1.20supposed.20to.20ever.20be.20slower.3F/near/396012925

So from your tests it doesn't seem like there are massive changes either way.

I have since optimized rust-kzg (the library the author talks about) significantly. It is not necessarily OpenMP vs rayon. And yes, I read that article quite some time ago.

Also, if you are interested in performance, you might find https://www.youtube.com/watch?v=r-TLSBdHe1A interesting. Looking at all the benches you have above, many of them are not actually meaningfully slower or faster than each other. There is a lot of noise in there, and benchmarking in a clear way is fairly difficult.

Is it possible to achieve a performance improvement by compiling from source for one's own CPU? If so, approximately how much improvement can be obtained?

It is possible, but the difference will be a few % at best, and I believe it will be even smaller in the next Gemini version. You'll have to try different options and see how each performs; compiling for native doesn't always lead to the fastest code, for example.