Extreme audit algorithm inefficiency (CPU-bound)

nazar-pc · October 19, 2023, 10:36am

I have been trying to tune auditing performance and turned out it is severely CPU-bound. Why? Quite naturally actually.

For every sector we pick an s-bucket and check all of the audit chunks of every encoded chunk.

Let’s assume that an average s-bucket size is 40kiB and all chunks are encoded. In that case we get 40*1024/8=5120 individual audit chunks to check. This is just one sector.

Consider 1TB plot (not farm), then we will have 5_120_000 audit chunks to check. For each we need to do 3 blake3 hashes and a few mathematical operations and other checks. It is not hard to see why peope are struggling HARD with large farms.

I think we need to somehow decrease amount of audited contents. Specifically we might want to consider some subset of s-bucket to be valid or something like that.

On my high-end consumer CPU Intel Core i9 13900K I can audit ~90TiB per farm before I run out of CPU. Probably with two farms I can get more in total, but this is bottlenecked by CPU and the protocol is meant to be energy-efficient.

dariolina · October 19, 2023, 12:52pm

While I agree with the fact that it’s inefficient, I think your math is slightly off. In our current sectors, one s-bucket should have ~500 32-byte chunks, or 500*4=2000 8-byte audit chunks and be 16KiB in size. For a TB plot if we take 1000 sectors it’s 2 000 000 audit chunks.
Which means the current audit algorithm throughput is even worse

dragonP · October 19, 2023, 5:04pm

I just add what I’m observing.

I’m running 25 Sep Windows version. I’m not sure how the farmer is coded. If I run 1 farm 2TB, the CPU (10900) load is at 10%-15%. But if I run 2 farm 2TB each, the CPU load will be 3x to be 30% - 45% on average. And if I run 3 farm 2TB each, the CPU will be 80% all the way to 100% at its peak very frequently. Not sure if this is from the inefficient Input/Output bottle neck. I really hope we can fix this in future.

Otherwise Windows user can’t farm much.

dariolina · October 24, 2023, 8:47am

We have considered several options in the last few days:

Option of auditing fewer sectors each seconds was discarded as it gives attackers more time to plot sectors on the fly - so we should strive to audit as many sectors as possible each second.
One low-hanging fruit is to reduce the number of audit chunks. We currently audit 4x8-byte chunks in the 32-byte chunk. We can choose only to audit one, however, that still requires 500k hashes per terabyte per second.
Tempted to reduce or altogether remove the hashing with challenge during the audit, we should recall the initial reason hashing was employed - to remove the possibility of compressing plots, specifically to make sure every single bit of a chunk plays in determining whether it can win. To achieve the same property, we are exploring the possibility to use cyclic convolution of challenge and chunk. We are yet to see whether it is more efficient than blake3 in practice, given blake3’s optimal implementations.

nazar-pc · October 24, 2023, 2:42pm

Cyclic convolution ended up ~50% slower than Blake3 hash due to relatively large number of SIMD instructions (40ns vs 60ns).

dariolina · October 24, 2023, 3:22pm

As an intermediate step as we figure out what (if anything) we can replace blake3 with, I propose implementing a simple improvement right away:

Hash the 32-byte chunk with the challenge
Truncate to 8 bytes
Check if within solution range

This should bring a 4x improvement right away. @nazar-pc I will update the spec

nazar-pc · October 25, 2023, 3:54pm

Change plot auditing to reduce amount of computation needed by nazar-pc · Pull Request #2164 · subspace/subspace · GitHub implements this optimization

Chen_Feng_2023 · October 30, 2023, 12:50pm

We have explored three alternative options:

Cyclic convolution
wyhash
AES-based hash

It turns out that cyclic convolution is secure but inefficient. wyhash is efficient but insecure. AES-based hash is still under investigation.

To see this, let us consider a toy example where A = (1, 0, x, y) and B = (1, 0, 1, 1). The cyclic convolution of A and B is the following: C = (1 + x, x+y, 1 + x + y, 1 + y). Suppose that x and y are independent uniform random variables. Then, every bit of C is uniform at random over (0, 1) and every pair of C is uniform at random over (00, 01, 10, 11) except for the pair (x + y, 1 + x + y). Intuitively, these desirable properties imply security against compression attacks.

On the other hand, wyhash is based on integer multiplication and XOR operations. Since A is between 8 and 11, B is 11. We know that A \times B is between 88 and 121. Also, we know that the first two bits of A XOR B are 00. These are not good. With a more detailed analysis, we can show that wyhash is insecure.

We will continue exploring AES-based hash along this line of reasoning.

shashank · November 4, 2023, 1:53am

Haraka v2 could be a good candidate. It is a fast hashing algorithm based on AES. It was published several years ago in IACR Transactions on Symmetric Cryptology, a reputed journal. Haraka has been used to instantiate SPHINCS+, which was recently selected as a NIST post-quantum signature standard.

Haraka comes in two variants, 256 and 512-bit inputs, and produces a 256-bit output. Haraka-256 consists of 5 rounds (recommended) and each round itself has 2 AES rounds and a mixing function, so a total of 10 AES rounds overall (see Figure 2(b) here).

An implementation of Haraka in C and Python is available on GitHub. The AES-NI optimized implementation can hash under 1 cycle per byte on Skylake (0.72 to be precise) and can be made even faster when processing several blocks in parallel.

Chen_Feng_2023 · November 5, 2023, 8:29am

Thanks @shashank! To @nazar-pc: Is it possible for us to quickly compare this hash with Blake3?

nazar-pc · November 5, 2023, 3:56pm

Not as quickly as if it was in Rust, but I’ll try

nazar-pc · November 6, 2023, 3:37pm

I was lazy, so looked for benchmarks.

According to GitHub - BLAKE3-team/BLAKE3: the official Rust and C implementations of the BLAKE3 cryptographic hash function blake3 is ~7.8x faster than blake2s and according to https://www.researchgate.net/publication/369056653_Areion_Highly-Efficient_Permutations_and_Its_Applications_to_Hash_Functions_for_Short_Input Haraha v2 256 is ~1.93/4.74x faster depending on input.

As such blake3 is still the fastest and I see no reason to spend time to bench it myself.

Topic		Replies	Views
Redefining s-buckets Research	6	227	April 25, 2023
Slight change to sector contents Research research , consensus , sector , plot	17	265	April 16, 2023
S-buckets sizes difference Research consensus	3	225	May 15, 2023
Please change how the benchmark run at start - Mar 18 build Feedback & Suggestions	28	181	April 26, 2024
Prove benchmark result - how to interpret Support	3	364	November 12, 2023

Extreme audit algorithm inefficiency (CPU-bound)

Related Topics