Attempts to optimise the Subspace node at compile time

SeryogaLeshii · October 10, 2023, 8:39am

This is an existing thread transferred to the forum from Discord.

It is mostly dedicated to finding optimal compilation parameters for building a node/farmer.

SeryogaLeshii · October 10, 2023, 8:43am

I ran some PoS benchmarks on two different platforms, Zen 3 Ryzen and Skylake Xeon.

Here are the results for the main branch. The value in each column is the difference from the best result obtained (listed at the very top). The first column is single-threaded performance, the second column is multi-threaded performance.

Zen 3 Ryzen
skylake, generic, no-avx, no-avx2, no-fma: 0.0%, 0.0%
skylake, generic, no-avx, no-avx2: -0.5%, -1.5%
skylake, core-avx-i: -6.0%, -1.5%
skylake, core-avx2: -2.6%, -1.5%
skylake: -2.5%, -1.5%
native: -2.6%, -14%
x86-64: -0.5%, -11%

Skylake Xeon
skylake, core-avx-i: 0.0%, 0.0%
skylake, core-avx2: -0.0%, -0.2%
skylake, native: -0.0%, -1.2%
skylake, generic, no-avx, no-avx2, no-fma: -0.0%, -1.5%
skylake: -1.2%, -1.7%
native: -9.7%, -1.7%

The results for gemini-3f-maintenance are approximately the same.

SeryogaLeshii · October 10, 2023, 8:46am

Upgrading Rust from nightly-2023-07-21 to nightly-2023-10-05 gives a 35-36% performance gain in single-threaded mode and a 15-16% gain in multi-threaded mode for PoS benchmarks on Skylake Xeon, but a regression seems to show up on Zen 3 Ryzen (performance dropped by the same amounts).

SeryogaLeshii · October 10, 2023, 8:50am

It seems that the LLVM argument -inline-threshold=500 can noticeably improve performance when used on nightly-2023-10-05. On a Skylake Xeon, there was a performance gain for PoS benchmarks of ~35% for single-threaded and ~25% for multi-threaded.

On Zen 3 Ryzen, the gains were as high as ~50% and ~44% respectively.

It also seems to almost completely level out the previously mentioned regressions, although performance on nightly-2023-07-21 is still the smallest bit higher.

SeryogaLeshii · October 10, 2023, 8:54am

I’ve noticed that while there is a noticeable change in PoS benchmark performance, it has little effect on the performance of plotting directly. At best, the impact is less than 1%. Only optimisations explicitly or implicitly related to register optimisation help noticeably. Using LLVM parameters -inline-threshold=2048 and -polly + -polly-register-tiling increased performance by a few percent.

Neither profiling, BOLT, nor Polly (except -polly-register-tiling) helped.

All -inline-threshold values above ~500 only hurt PoS benchmark performance, but improve overall plotting performance.

SeryogaLeshii · October 10, 2023, 8:57am

The main calculations during plotting are performed by functions from the blst library, and since they are mostly written in assembly, almost no compilation parameters will help here.

nazar-pc · October 11, 2023, 1:10am

I can’t replicate the compiler upgrade numbers on a 13900K (with -C target-cpu=raptorlake).

nightly-2023-07-21:

chia/table/parallel     time:   [105.89 ms 106.68 ms 107.89 ms]

nightly-2023-10-11:

chia/table/parallel     time:   [102.65 ms 103.87 ms 105.44 ms]

So it is a bit faster, but nowhere near 15%.

nazar-pc · October 11, 2023, 1:20am

PoS was already quite heavily optimized and is meant to be mostly memory-bound, so it is no wonder that it is not easy to accelerate with compilation options alone. The operations used there are quite simple and efficient; most of the “optimizations” I have tried there lately only decreased performance.

And while yes, blst is quite heavily optimized, the KZG library is not so much, and I still believe there are ways to improve it, but they are very non-trivial and, I’m afraid, not fixable with compiler options: Fr FFT doesn't really use all the cores fully · Issue #227 · sifraitech/rust-kzg · GitHub

It’d be great if you could post before/after results of Criterion benches rather than a single number. Make sure to look at absolute numbers, not just percentage difference it shows since last run (or create baseline results and compare against that rather than last run).

nazar-pc · October 11, 2023, 3:27am

I’m not registering any performance difference with inline-threshold=500 on -C target-cpu=raptorlake.

It’d be great if you provided exact commands you were using when benchmarking as well.

SeryogaLeshii · October 11, 2023, 11:21am

@nazar-pc It’s probably my mistake. I don’t remember how I came to this conclusion, but I am now seeing exceptional performance regressions when switching to nightly-2023-10-05 on both machines. But these regressions only show up when building with the production profile, meaning LTO is most likely the cause. Using -inline-threshold=512 levels out the regressions. In plotting benchmarks -inline-threshold=2048 improves performance by a few percent.

SeryogaLeshii · October 11, 2023, 11:24am

Rust nightly-2023-07-21 + release:

chia/table/single       time:   [222.52 ms 222.80 ms 223.14 ms]
chia/table/parallel     time:   [44.682 ms 44.831 ms 44.997 ms]

Rust nightly-2023-10-05 + release:

chia/table/single       time:   [221.86 ms 222.29 ms 222.78 ms]
chia/table/parallel     time:   [45.037 ms 45.232 ms 45.450 ms]

Rust nightly-2023-07-21 + production:

chia/table/single       time:   [209.31 ms 210.98 ms 213.21 ms]
chia/table/parallel     time:   [41.975 ms 42.134 ms 42.315 ms]

Rust nightly-2023-10-05 + production:

chia/table/single       time:   [286.80 ms 287.36 ms 288.08 ms]
chia/table/parallel     time:   [54.406 ms 54.655 ms 54.915 ms]

Rust nightly-2023-07-21 + production + inline-threshold=512:

chia/table/single       time:   [205.71 ms 205.98 ms 206.32 ms]
chia/table/parallel     time:   [41.638 ms 41.814 ms 42.009 ms]

Rust nightly-2023-10-05 + production + inline-threshold=512:

chia/table/single       time:   [205.48 ms 206.09 ms 206.94 ms]
chia/table/parallel     time:   [42.016 ms 42.155 ms 42.316 ms]
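To put the production numbers above into percentages, here is a minimal sketch that compares the Criterion point estimates (the middle value of each interval). The helper name `percent_change` is made up for illustration and is not part of Criterion or the Subspace code base.

```rust
/// Relative change of `new_ms` versus `baseline_ms`, in percent.
/// Positive means the new build took longer (slower); negative means faster.
fn percent_change(baseline_ms: f64, new_ms: f64) -> f64 {
    (new_ms - baseline_ms) / baseline_ms * 100.0
}

fn main() {
    // Point estimates in milliseconds, copied from the results above.
    // Each row compares nightly-2023-07-21 (baseline) with nightly-2023-10-05.
    let rows = [
        ("production, single", 210.98, 287.36),
        ("production, parallel", 42.134, 54.655),
        ("production + inline-threshold=512, single", 205.98, 206.09),
        ("production + inline-threshold=512, parallel", 41.814, 42.155),
    ];

    for (label, baseline, new) in rows {
        println!("{label}: {:+.1}%", percent_change(baseline, new));
    }
}
```

With these inputs the plain production build on nightly-2023-10-05 comes out roughly 36% slower single-threaded and roughly 30% slower in parallel, while with inline-threshold=512 the two toolchains are within about 1% of each other.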

SeryogaLeshii · October 11, 2023, 4:10pm

I also found this article.

nazar-pc · October 11, 2023, 4:34pm

I found that codegen-units=1 decreases performance, asked about this on Zulip: https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/Is.20codegen-units.3D1.20supposed.20to.20ever.20be.20slower.3F/near/396012925

So from your tests it doesn’t seem like there are massive changes either way.

I have optimized rust-kzg (the library the author talks about) significantly since then. It is not necessarily OpenMP vs rayon. And yes, I read that article quite some time ago.

Also, if you are interested in performance, you might find https://www.youtube.com/watch?v=r-TLSBdHe1A interesting. Among all the benches you have above, not that many are meaningfully slower or faster than each other; there is a lot of noise in there, and benching in a clear way is fairly difficult.
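One rough way to screen for noise is to check whether the intervals Criterion prints for two runs overlap. The sketch below does this for two pairs of results posted earlier in the thread. The `Interval` type and `overlaps` function are hypothetical helpers, and interval overlap is only a coarse heuristic, not a proper statistical test.

```rust
/// A Criterion-style result: [lower bound, point estimate, upper bound], in ms.
#[derive(Clone, Copy, Debug)]
struct Interval {
    low: f64,
    estimate: f64,
    high: f64,
}

/// True when the two intervals share at least one point.
fn overlaps(a: Interval, b: Interval) -> bool {
    a.low <= b.high && b.low <= a.high
}

fn report(label: &str, a: Interval, b: Interval) {
    println!(
        "{label}: {:.2} ms vs {:.2} ms, intervals overlap: {}",
        a.estimate,
        b.estimate,
        overlaps(a, b)
    );
}

fn main() {
    // chia/table/single with the release profile, old vs new nightly.
    let release_0721 = Interval { low: 222.52, estimate: 222.80, high: 223.14 };
    let release_1005 = Interval { low: 221.86, estimate: 222.29, high: 222.78 };

    // chia/table/single with the production profile, old vs new nightly.
    let production_0721 = Interval { low: 209.31, estimate: 210.98, high: 213.21 };
    let production_1005 = Interval { low: 286.80, estimate: 287.36, high: 288.08 };

    report("release", release_0721, release_1005);
    report("production", production_0721, production_1005);
}
```

For the release profile the intervals overlap, so that difference is hard to tell apart from noise; for the production profile they are far apart. Even a non-overlapping pair can still be caused by system state (thermal throttling, background load), so repeated runs against a saved baseline remain the safer comparison.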

z_W · October 15, 2023, 5:58pm

Is it possible to improve performance by compiling the source for one’s own CPU? If so, approximately how much improvement can be expected?

nazar-pc · October 15, 2023, 6:55pm

It is possible, but the difference will be a few % at best, and I believe it will be even smaller in the next Gemini version. You’ll have to try different options and see how each performs; compiling for native doesn’t always lead to the fastest code, for example.
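When trying different options, it can help to see which SIMD features the machine reports at run time and which ones the current build was allowed to assume. The following is a small standalone sketch using the standard library's `is_x86_feature_detected!` macro and `cfg!(target_feature = ...)`; the function name `detected_features` and the choice of features are only illustrative.

```rust
/// Selected x86 SIMD features reported by the CPU this program runs on.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn detected_features() -> Vec<&'static str> {
    let mut found = Vec::new();
    if std::is_x86_feature_detected!("avx") {
        found.push("avx");
    }
    if std::is_x86_feature_detected!("avx2") {
        found.push("avx2");
    }
    if std::is_x86_feature_detected!("fma") {
        found.push("fma");
    }
    found
}

/// On other architectures this sketch reports nothing.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn detected_features() -> Vec<&'static str> {
    Vec::new()
}

fn main() {
    // What the hardware supports, checked at run time.
    println!("runtime-detected: {:?}", detected_features());

    // What the compiler was told it may assume for this build,
    // for example via -C target-cpu=... or -C target-feature=...
    println!("compiled with avx:  {}", cfg!(target_feature = "avx"));
    println!("compiled with avx2: {}", cfg!(target_feature = "avx2"));
    println!("compiled with fma:  {}", cfg!(target_feature = "fma"));
}
```

A feature that is detected at run time but not enabled at compile time is one the chosen target settings leave unused. Whether enabling it actually helps still has to be measured.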
