Extreme audit algorithm inefficiency (CPU-bound)

I have been trying to tune auditing performance and it turns out it is severely CPU-bound. Why? Quite naturally, actually.

For every sector we pick an s-bucket and check all of the audit chunks of every encoded chunk.

Let’s assume that an average s-bucket size is 40 KiB and all chunks are encoded. In that case we get 40*1024/8 = 5120 individual audit chunks to check. And this is just one sector.

Now consider a 1 TB plot (not farm): that gives us 5_120_000 audit chunks to check. For each of them we need to do 3 blake3 hashes plus a few mathematical operations and other checks. It is not hard to see why people are struggling HARD with large farms.
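For reference, a quick back-of-the-envelope in Rust reproducing these numbers (the 40 KiB s-bucket size and ~1000 sectors per TB are the assumptions from this post, not protocol constants):

```rust
// Back-of-the-envelope audit cost per slot, using the assumptions above:
// ~40 KiB per s-bucket, 8-byte audit chunks, ~1000 sectors per 1 TB plot.
fn main() {
    let s_bucket_bytes = 40 * 1024; // assumed average s-bucket size
    let audit_chunk_bytes = 8;
    let audit_chunks_per_sector = s_bucket_bytes / audit_chunk_bytes;
    assert_eq!(audit_chunks_per_sector, 5_120);

    let sectors_per_tb = 1_000; // rough sector count for a 1 TB plot
    let audit_chunks_per_tb = audit_chunks_per_sector * sectors_per_tb;
    assert_eq!(audit_chunks_per_tb, 5_120_000);

    // ~3 blake3 hashes per audit chunk with the current algorithm.
    println!("~{} hashes per TB per audit", audit_chunks_per_tb * 3);
}
```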

I think we need to somehow decrease the amount of audited content. Specifically, we might want to consider only some subset of the s-bucket to be valid, or something like that.

On my high-end consumer CPU (Intel Core i9-13900K) I can audit ~90 TiB per farm before I run out of CPU. With two farms I can probably get more in total, but this is bottlenecked by the CPU, and the protocol is meant to be energy-efficient.

While I agree that it’s inefficient, I think your math is slightly off. In our current sectors, one s-bucket should have ~500 32-byte chunks, or 500*4 = 2000 8-byte audit chunks, and be ~16 KiB in size. For a 1 TB plot, if we take 1000 sectors, that’s 2,000,000 audit chunks.
Which means the current audit algorithm throughput is even worse :slight_smile:
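For concreteness, the corrected numbers as a quick Rust sketch (the ~500 chunks per s-bucket and 1000 sectors per TB are the rough figures above, not exact protocol parameters):

```rust
// Corrected estimate: ~500 32-byte chunks per s-bucket, 4 audit chunks each,
// ~1000 sectors per 1 TB plot.
fn main() {
    let chunks_per_s_bucket = 500;
    let s_bucket_bytes = chunks_per_s_bucket * 32;
    assert_eq!(s_bucket_bytes, 16_000); // ~16 KiB per s-bucket

    // Four 8-byte audit chunks per 32-byte chunk.
    let audit_chunks_per_sector = chunks_per_s_bucket * 4;
    assert_eq!(audit_chunks_per_sector, 2_000);

    let sectors_per_tb = 1_000;
    assert_eq!(audit_chunks_per_sector * sectors_per_tb, 2_000_000);
}
```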

I’ll just add what I’m observing.

I’m running the 25 Sep Windows version and I’m not sure how the farmer is coded. If I run one 2 TB farm, the CPU (10900) load is at 10%-15%. But if I run two farms of 2 TB each, the CPU load roughly triples to 30%-45% on average. And if I run three farms of 2 TB each, the CPU sits at 80% and very frequently peaks all the way to 100%. I’m not sure if this comes from an inefficient input/output bottleneck. I really hope we can fix this in the future.

Otherwise Windows users can’t farm much.

We have considered several options in the last few days:

  • The option of auditing fewer sectors each second was discarded, as it gives attackers more time to plot sectors on the fly - so we should strive to audit as many sectors as possible each second.
  • One low-hanging fruit is to reduce the number of audit chunks. We currently audit four 8-byte audit chunks in each 32-byte chunk. We could choose to audit only one; however, that still requires 500k hashes per terabyte per second.
  • Tempted as we are to reduce or altogether remove the hashing with the challenge during the audit, we should recall the initial reason hashing was employed: to remove the possibility of compressing plots, specifically to make sure every single bit of a chunk plays a role in determining whether it can win. To achieve the same property, we are exploring the possibility of using a cyclic convolution of the challenge and the chunk. We have yet to see whether it is more efficient than blake3 in practice, given blake3’s highly optimized implementations.

Cyclic convolution ended up ~50% slower than a Blake3 hash due to the relatively large number of SIMD instructions involved (40 ns vs 60 ns).

As an intermediate step while we figure out what (if anything) we can replace blake3 with, I propose implementing a simple improvement right away:

  1. Hash the 32-byte chunk with the challenge
  2. Truncate to 8 bytes
  3. Check if within solution range

This should bring a 4x improvement right away. @nazar-pc, I will update the spec.
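A minimal sketch of what that audit step could look like in Rust. The exact domain separation, endianness, and the precise definition of "within solution range" below are assumptions for illustration, not the actual implementation:

```rust
// Sketch of the proposed audit: one blake3 hash of the challenge together with
// the 32-byte chunk, truncated to 8 bytes and checked against the solution range.
// The concatenation order and the distance-based range check are assumptions.
fn chunk_audits_successfully(
    chunk: &[u8; 32],
    global_challenge: &[u8; 32],
    solution_range: u64,
) -> bool {
    let mut hasher = blake3::Hasher::new();
    hasher.update(global_challenge);
    hasher.update(chunk);
    let hash = hasher.finalize();

    // Truncate the 32-byte digest to 8 bytes.
    let audit_value = u64::from_le_bytes(hash.as_bytes()[..8].try_into().unwrap());

    // Assumed check: wrapping distance to a challenge-derived target is within
    // half of the solution range.
    let target = u64::from_le_bytes(global_challenge[..8].try_into().unwrap());
    let distance = audit_value
        .wrapping_sub(target)
        .min(target.wrapping_sub(audit_value));
    distance <= solution_range / 2
}
```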


Pull Request #2164 in subspace/subspace on GitHub, "Change plot auditing to reduce amount of computation needed", implements this optimization.

We have explored three alternative options:

  1. Cyclic convolution
  2. wyhash
  3. AES-based hash

It turns out that cyclic convolution is secure but inefficient. wyhash is efficient but insecure. AES-based hash is still under investigation.

To see this, let us consider a toy example where A = (1, 0, x, y) and B = (1, 0, 1, 1). The cyclic convolution of A and B over GF(2) is C = (1 + x, x + y, 1 + x + y, 1 + y). Suppose that x and y are independent uniform random bits. Then every bit of C is uniform at random over {0, 1}, and every pair of bits of C is uniform at random over {00, 01, 10, 11} except for the pair (x + y, 1 + x + y). Intuitively, these desirable properties imply security against compression attacks.
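Here is a tiny sketch that verifies this toy example by enumerating x and y (plain cyclic convolution over GF(2), nothing protocol-specific):

```rust
// Verify the toy example: cyclic convolution over GF(2) of
// A = (1, 0, x, y) and B = (1, 0, 1, 1), for all values of x and y.
// C[k] = XOR over i of A[i] & B[(k - i) mod n].
fn cyclic_convolution_gf2(a: &[u8], b: &[u8]) -> Vec<u8> {
    let n = a.len();
    (0..n)
        .map(|k| {
            (0..n)
                .map(|i| a[i] & b[(k + n - i) % n])
                .fold(0, |acc, bit| acc ^ bit)
        })
        .collect()
}

fn main() {
    for x in 0..=1u8 {
        for y in 0..=1u8 {
            let a = [1, 0, x, y];
            let b = [1, 0, 1, 1];
            let c = cyclic_convolution_gf2(&a, &b);
            // Expected from the post: (1 + x, x + y, 1 + x + y, 1 + y) over GF(2).
            assert_eq!(c, vec![1 ^ x, x ^ y, 1 ^ x ^ y, 1 ^ y]);
            println!("x={x} y={y} -> C={c:?}");
        }
    }
}
```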

On the other hand, wyhash is based on integer multiplication and XOR operations. Interpreting the same toy example as integers, A is between 8 and 11 and B is 11, so we know that A × B is between 88 and 121. We also know that the first two bits of A XOR B are 00. These are not good properties. With a more detailed analysis, we can show that wyhash is insecure.
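And a similar sketch for the same toy values viewed as integers. This only illustrates the multiplication and XOR leakage described above; it is not an implementation of wyhash itself:

```rust
// The toy example as integers: A = (1, 0, x, y) is 8..=11 and B = 11.
// Multiplication confines the result to a narrow range, and XOR leaks the
// two leading bits, which is what makes such a construction compressible.
fn main() {
    let b: u32 = 0b1011; // 11
    for a in 0b1000u32..=0b1011 {
        let product = a * b;
        let xored = a ^ b;
        assert!((88..=121).contains(&product));
        assert_eq!(xored >> 2, 0); // the first two of the four bits are always 00
        println!("A={a:04b}  A*B={product}  A XOR B={xored:04b}");
    }
}
```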

We will continue exploring AES-based hash along this line of reasoning.

Haraka v2 could be a good candidate. It is a fast hashing algorithm based on AES. It was published several years ago in IACR Transactions on Symmetric Cryptology, a reputable journal. Haraka has been used to instantiate SPHINCS+, which was recently selected as a NIST post-quantum signature standard.

Haraka comes in two variants, with 256-bit and 512-bit inputs, and produces a 256-bit output. Haraka-256 consists of 5 rounds (the recommended number), and each round itself has 2 AES rounds and a mixing function, for a total of 10 AES rounds overall (see Figure 2(b) here).

An implementation of Haraka in C and Python is available on GitHub. The AES-NI optimized implementation can hash at under 1 cycle per byte on Skylake (0.72 to be precise) and can be made even faster when processing several blocks in parallel.


Thanks @shashank! To @nazar-pc: Is it possible for us to quickly compare this hash with Blake3?

Not as quickly as if it were in Rust, but I’ll try.

I was lazy, so I looked for benchmarks instead.

According to GitHub - BLAKE3-team/BLAKE3 (the official Rust and C implementations of the BLAKE3 cryptographic hash function), blake3 is ~7.8x faster than blake2s, and according to https://www.researchgate.net/publication/369056653_Areion_Highly-Efficient_Permutations_and_Its_Applications_to_Hash_Functions_for_Short_Input, Haraka v2 256 is ~1.93x/4.74x faster depending on the input.

As such, blake3 is still the fastest, and I see no reason to spend time benchmarking it myself.