Performance of chia proof of space re-implementation in Rust

Still not working quite correctly, but the re-implementation of chiapos in Rust is already faster.
Specifically, ~175ms for k17, a few ms faster than just the first phase in C++ chiapos, and we don't do the other phases at all. k16 goes down to ~91ms.

Perf says it is quite heavily memory-bound, typically 20-24% (I suspect it would have been worse with larger k):

     4.003866159           1 000,43 msec task-clock:u                     #    1,000 CPUs utilized             
     4.003866159                  0      context-switches:u               #    0,000 /sec                      
     4.003866159                  0      cpu-migrations:u                 #    0,000 /sec                      
     4.003866159             38 984      page-faults:u                    #   38,967 K/sec                     
     4.003866159      5 296 564 666      cpu_core/cycles/u                #    5,294 G/sec                     
     4.003866159      <not counted>      cpu_atom/cycles/u                                                       (0,00%)
     4.003866159     14 431 722 077      cpu_core/instructions/u          #   14,426 G/sec                     
     4.003866159      <not counted>      cpu_atom/instructions/u                                                 (0,00%)
     4.003866159      1 985 520 012      cpu_core/branches/u              #    1,985 G/sec                     
     4.003866159      <not counted>      cpu_atom/branches/u                                                     (0,00%)
     4.003866159         20 956 461      cpu_core/branch-misses/u         #   20,948 M/sec                     
     4.003866159      <not counted>      cpu_atom/branch-misses/u                                                (0,00%)
     4.003866159     31 753 616 184      cpu_core/slots:u/                #   31,740 G/sec                     
     4.003866159     13 324 066 398      cpu_core/topdown-retiring/u      #     42,0% Retiring                 
     4.003866159      2 864 051 655      cpu_core/topdown-bad-spec/u      #      9,0% Bad Speculation          
     4.003866159        996 191 880      cpu_core/topdown-fe-bound/u      #      3,1% Frontend Bound           
     4.003866159     14 569 306 249      cpu_core/topdown-be-bound/u      #     45,9% Backend Bound            
     4.003866159        249 047 970      cpu_core/topdown-heavy-ops/u     #      0,8% Heavy Operations          #     41,2% Light Operations         
     4.003866159      2 864 051 655      cpu_core/topdown-br-mispredict/u #      9,0% Branch Mispredict         #      0,0% Machine Clears           
     4.003866159        498 095 940      cpu_core/topdown-fetch-lat/u     #      1,6% Fetch Latency             #      1,6% Fetch Bandwidth          
     4.003866159      7 595 963 087      cpu_core/topdown-mem-bound/u     #     23,9% Memory Bound              #     22,0% Core Bound               

The code could still benefit from some SIMD instructions for bit shifts, but the gains look marginal: the shifts operate on small, cache-friendly data structures, so there should not be much room left there.
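To illustrate the kind of scalar bit manipulation in question, here is a minimal sketch of shifting a small big-endian byte buffer left by a few bits, with a manual carry chain. The function name and layout are illustrative only, not taken from the actual codebase:

```rust
/// Shift a small big-endian byte buffer left by `n` bits (n < 8),
/// carrying bits across byte boundaries. Illustrative sketch, not the
/// real implementation.
fn shl_bits(bytes: &mut [u8], n: u32) {
    debug_assert!(n < 8);
    if n == 0 {
        return;
    }
    let mut carry = 0u8;
    // Walk from the least significant (last) byte to the most significant,
    // propagating the bits that overflow each byte into the next one up.
    for byte in bytes.iter_mut().rev() {
        let new_carry = *byte >> (8 - n);
        *byte = (*byte << n) | carry;
        carry = new_carry;
    }
}

fn main() {
    // 0x0180 << 1 == 0x0300 when viewed as one 16-bit big-endian value.
    let mut buf = [0b0000_0001u8, 0b1000_0000u8];
    shl_bits(&mut buf, 1);
    assert_eq!(buf, [0b0000_0011u8, 0b0000_0000u8]);
}
```

Since each call touches only a handful of bytes that are already in cache, the loop is dominated by cheap ALU work, which is why vectorizing it is unlikely to move the needle much.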

So far parallelism only hurts: the timings are very narrow relative to the number of elements being processed, at least for the k values we have.
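A sketch of why that happens (illustrative, using `std::thread::scope` rather than whatever the actual code uses): each spawn/join pair costs on the order of tens of microseconds, so for small work items the overhead can exceed the work itself:

```rust
use std::thread;

/// Sum a non-empty slice across `threads` scoped threads. Illustrative of
/// the chunked-parallelism pattern; for small inputs the per-thread
/// spawn/join overhead dominates, matching the observation above.
fn sum_parallel(data: &[u64], threads: usize) -> u64 {
    let chunk = (data.len() + threads - 1) / threads;
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().sum::<u64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let data: Vec<u64> = (1..=1000).collect();
    // Correct either way, but at this size the single-threaded path is
    // typically faster once spawn/join overhead is counted.
    assert_eq!(sum_parallel(&data, 4), 500_500);
    assert_eq!(sum_parallel(&data, 1), 500_500);
}
```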


Found some ways to significantly improve performance; for now, here is an early comparison with the chiapos C++ implementation:

chia/table                time:   [262.84 ms 263.50 ms 264.42 ms]
chia2/table               time:   [171.15 ms 172.83 ms 174.47 ms]

chia/quality/no-solution  time:   [56.601 ns 56.745 ns 57.000 ns]
chia2/quality/no-solution time:   [20.777 ns 20.827 ns 20.864 ns]

chia/quality/solution     time:   [122.24 µs 122.32 µs 122.47 µs]
chia2/quality/solution    time:   [121.94 ns 122.10 ns 122.35 ns]

chia/proof                time:   [556.02 µs 556.67 µs 557.44 µs]
chia2/proof               time:   [477.94 ns 478.27 ns 478.79 ns]

chia/verification         time:   [34.580 µs 34.800 µs 34.959 µs]
chia2/verification        time:   [12.331 µs 12.333 µs 12.334 µs]

chia is chiapos, chia2 is the Rust re-implementation.

The results chiapos produces are actually different, but we have not yet found out why. The re-implementation is internally consistent, so we likely missed one more thing that chiapos implemented differently from the paper.

A massive improvement can be observed, and even more optimizations are possible: in particular, quality search in bulk can be even faster, and table creation can be tuned further, both for specific values of k and in a generic way.

With some parallelism, table creation on my machine went down to this:

chia/table/parallel     time:   [25.994 ms 26.368 ms 27.029 ms]