Slight change to sector contents

Initially I read the specification incorrectly and there was growing confusion between what I thought I understood and what I was actually implementing.
Eventually we talked with @dariolina and were able to understand each other: I understood what the specification was saying and she understood what I had actually implemented and why.

TL;DR: I don’t think the specification as written can be audited efficiently, but I have redesigned it slightly in such a way that it is possible to efficiently construct and audit, and I will describe it below (it should also be easier to visualize and implement).

The reason the audit is not possible to do efficiently is that each s-bucket has a probabilistic size, so we can’t just do a quick lookup by offset; we’d have to store some kind of mapping that tells us where each s-bucket starts. This also hurts data retrievability.

Proposed sector construction

As an input for the sector we have a set of records extracted from pieces using verifiable sector construction.
Each record contains 2^15 32-byte chunks.

Essentially we have a grid, where 1299 records are rows and 2^15 chunks are columns:

record_0_chunk_0 record_0_chunk_1 … record_0_chunk_(2^15-1)
record_1_chunk_0 record_1_chunk_1 … record_1_chunk_(2^15-1)
…
record_1299_chunk_0 record_1299_chunk_1 … record_1299_chunk_(2^15-1)

Then we do the following transformation for each record individually (a sketch in code follows the list):

  • apply erasure coding to get 2^16 chunks out of each record
  • generate a PoSpace table
  • allocate a bitfield with 2^16 bits (one for each erasure coded chunk)
  • treat each chunk index as a challenge index and query the PoSpace table
  • keep the first 2^15 chunks that have PoSpace proofs and xor them with the PoSpace quality (encoding)
  • in case there aren’t enough proofs to collect 2^15 chunks (from empirical tests this seems to happen for ~0.8% of records), append chunks to those collected in the previous step using the following selection algorithm:
    • out of the 2^16 erasure coded chunks take as many chunks without PoSpace proofs as necessary to get to 2^15 chunks total
  • assemble the chunks back into a “record” (same size, but not real records anymore; most chunks are encoded)
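
To make the list above concrete, here is a minimal sketch of the per-record transformation; `erasure_code_record`, `generate_pos_table`, `find_quality` and `xor_with_quality` are hypothetical placeholders for the real erasure-coding and PoSpace APIs, not actual crate functions:

```rust
// Sketch only: helper functions below are hypothetical, not the real farmer API.
const CHUNK_SIZE: usize = 32;
const SOURCE_CHUNKS: usize = 1 << 15; // 2^15 chunks per source record
const ERASURE_CODED_CHUNKS: usize = 1 << 16; // 2^16 chunks after erasure coding

type Chunk = [u8; CHUNK_SIZE];

/// Transform one record into a same-sized "record" of mostly encoded chunks
/// plus a 2^16-bit bitfield saying which erasure coded chunks were kept.
fn transform_record(record: &[Chunk], pos_seed: &[u8; 32]) -> (Vec<Chunk>, Vec<bool>) {
    // 2^15 source chunks -> 2^16 erasure coded chunks (hypothetical helper).
    let erasure_coded: Vec<Chunk> = erasure_code_record(record, ERASURE_CODED_CHUNKS);
    // PoSpace table for this record (hypothetical helper).
    let table = generate_pos_table(pos_seed);

    let mut bitfield = vec![false; ERASURE_CODED_CHUNKS];
    let mut encoded = Vec::with_capacity(SOURCE_CHUNKS);
    let mut without_proof = Vec::new();

    for (index, chunk) in erasure_coded.iter().enumerate() {
        // Treat the chunk index as a challenge index.
        match table.find_quality(index as u32) {
            Some(quality) if encoded.len() < SOURCE_CHUNKS => {
                bitfield[index] = true;
                encoded.push(xor_with_quality(chunk, &quality));
            }
            // Already collected 2^15 encoded chunks; extra provable chunks are unused.
            Some(_) => {}
            None => without_proof.push(*chunk),
        }
    }

    // ~0.8% of records: not enough proofs, so pad with chunks that had no proof
    // (encoded chunks always come before unencoded ones).
    let missing = SOURCE_CHUNKS - encoded.len();
    encoded.extend(without_proof.into_iter().take(missing));

    (encoded, bitfield)
}
```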

As a result we have two things:

  • a “record” of 2^15 32-byte chunks with most chunks encoded
  • a 2^16-bit bitfield that allows us to understand what the above “record” actually contains and how to decode it afterwards

Now that we have done the encoding, we essentially have the same set of “records” as before, arranged exactly the same way: “records” are rows and chunks are columns. The bitfields form an identical table, except its columns are 1 bit in size rather than 32 bytes.

Each s-bucket in this case consists of the chunks and bitfield bits in the same column.

When auditing, each s-bucket has exactly 1300 32-byte chunks, so it is easy to read, and the bitfields tell us which of those chunks are eligible for solution creation (i.e. correspond to an encoded chunk) and which are not.

Similarly, for data retrieval we take one chunk and one bitfield bit at the same offset from each s-bucket, and we know which chunks need to be decoded and how to arrange the chunks so that they can be erasure-decoded.
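
For illustration, here is a hedged sketch of that retrieval step under the construction above; `pos_quality` and `erasure_decode` are hypothetical helpers, and the ordering (encoded chunks mapped to the set bits, unencoded padding mapped to the first unset bits) is an assumption consistent with the description:

```rust
// Sketch only: `pos_quality` (re-deriving the PoSpace quality for a chunk index)
// and `erasure_decode` are hypothetical helpers.
type Chunk = [u8; 32];

/// Recover one source record from the 2^15 chunks read at the same offset from
/// every s-bucket plus that record's 2^16-bit bitfield row.
fn recover_record(row_chunks: &[Chunk], bitfield: &[bool]) -> Vec<Chunk> {
    let mut positions: Vec<Option<Chunk>> = vec![None; bitfield.len()];
    let mut chunks = row_chunks.iter();

    // Encoded chunks are stored first and correspond to the set bits, in index
    // order; decode them by xor-ing with the PoSpace quality again.
    for index in (0..bitfield.len()).filter(|&index| bitfield[index]) {
        if let Some(chunk) = chunks.next() {
            positions[index] = Some(xor_with_quality(chunk, &pos_quality(index)));
        }
    }
    // Any remaining (unencoded) chunks correspond to the first unset bits.
    for index in (0..bitfield.len()).filter(|&index| !bitfield[index]) {
        match chunks.next() {
            Some(chunk) => positions[index] = Some(*chunk),
            None => break,
        }
    }

    // 2^15 known positions out of 2^16 are enough to erasure-decode the record.
    erasure_decode(&positions)
}
```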


The mechanism overall seems to be similar in nature to what is described in the paper; I’m wondering if there is something I’m missing that would cause issues. If nothing problematic is found, we should update the specification accordingly (I have actually already updated it partially due to the mentioned misunderstanding).

For now I’m continuing with the implementation as described above.


You mean 1300, right? The index starts from 0.

This “bitfield table” is a great way to visualize what is going on!

Yes, 1300 records, indexed from 0 to 1299. If we want 32k s-buckets (I don’t think it matters too much), we can do 1000 records; it is just one parameter, easy to change.

According to the previous description, each s-bucket has at most 1300 32-byte chunks. Sometimes, an s-bucket may have fewer than 1000 32-byte chunks.

Since different s-buckets may have different sizes, we need some data structure to maintain such information and facilitate efficient auditing (of any given s-bucket). @nazar-pc What data structure are you planning to use?

The way I described it, every s-bucket is always exactly the same size and no fancy data structure is needed: we just write s-buckets one after another and use offsets to read a specific s-bucket.
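
As a sketch, assuming the 1300-record, 32-byte-chunk layout from above, the lookup is just:

```rust
// Sketch: with fixed-size s-buckets, reading a specific one is plain arithmetic.
const RECORDS_PER_SECTOR: u64 = 1300;
const CHUNK_SIZE: u64 = 32;
// Every s-bucket occupies exactly the same number of bytes.
const S_BUCKET_SIZE: u64 = RECORDS_PER_SECTOR * CHUNK_SIZE;

/// Byte offset of s-bucket `index` when s-buckets are written one after another.
fn s_bucket_offset(index: u64) -> u64 {
    index * S_BUCKET_SIZE
}
```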

I see. I misunderstood your description. Let me use the following example to double-check my new understanding.

Consider a toy example of 4 records each with 2 chunks.

record_0_chunk_0 record_0_chunk_1
record_1_chunk_0 record_1_chunk_1
record_2_chunk_0 record_2_chunk_1
record_3_chunk_0 record_3_chunk_1

Let us do the transformation for record_0

  • After erasure coding, we obtain
    record_0_chunk_0 record_0_chunk_1 record_0_chunk_2 record_0_chunk_3

  • After generating and querying PoSpace table, we obtain the following bitfield
    [0, 1, 1, 0]
    So, we keep encoding(record_0_chunk_1) and encoding(record_0_chunk_2), where encoding(record_0_chunk_1) is record_0_chunk_1 \xor PoSpace quality at index 1. For ease of notation, we use encode_0_chunk_1 to denote encoding(record_0_chunk_1).

  • After assembling chunks, we have
    encode_0_chunk_1 encode_0_chunk_2

Now we have finished all the steps for record_0. We are ready to jump to the end with two outputs:

  1. bitfield table
    0, 1, 1, 0
    1, 0, 1, 0
    0, 0, 0, 1
    1, 1, 0, 0

  2. transformed records
    encode_0_chunk_1 encode_0_chunk_2
    encode_1_chunk_0 encode_1_chunk_2
    record_2_chunk_0 encode_2_chunk_3
    encode_3_chunk_0 encode_3_chunk_1

Finally, we can specify s-buckets. For example, the first s-bucket is
record_0_chunk_0
encode_1_chunk_0
record_2_chunk_0
encode_3_chunk_0

and its column bitfield is

0
1
0
1

Is the above correct?


Almost; the transformed records are these:

encode_0_chunk_1 encode_0_chunk_2
encode_1_chunk_0 encode_1_chunk_2
encode_2_chunk_3 record_2_chunk_0
encode_3_chunk_0 encode_3_chunk_1

The difference is in the third line.

encode_* always go before record_*.

This solution has a subtle issue. Say, we have two s-buckets. The first one stores the first column, which is
encode_0_chunk_1
encode_1_chunk_0
encode_2_chunk_3
encode_3_chunk_0.
The second one stores the second column.

Suppose that a farmer would like to prove that encode_1_chunk_2 is a “winning ticket” of the second s-bucket. Unlike our original solution, this proof is not easy at all. Basically, the farmer has to show that encode_1_chunk_2 is indeed contained in the second s-bucket. In particular, the farmer must prove that encode_1_chunk_1 doesn’t exist, because otherwise encode_1_chunk_1 would be in the second s-bucket.

As one way to fix this issue, we can do the following:

s-bucket 1 stores [encode_1_chunk_0, record_2_chunk_0, encode_3_chunk_0].
s-bucket 2 stores [encode_0_chunk_1, encode_3_chunk_1].
s-bucket 3 stores [encode_0_chunk_2, encode_1_chunk_2].
s-bucket 4 stores [encode_2_chunk_3].

In total, we have 4 s-buckets storing 8 chunks. Then, we maintain an array [3, 2, 2, 1] indicating the size of each s-bucket. Clearly, this array can help us find out the start position and end position of a specific s-bucket.
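
For illustration, a small sketch of the lookup such a size array enables (a simple prefix sum; the function name is just illustrative):

```rust
/// Given per-bucket sizes (in chunks), return the start position of each bucket.
fn bucket_start_positions(sizes: &[u32]) -> Vec<u32> {
    let mut starts = Vec::with_capacity(sizes.len());
    let mut position = 0u32;
    for &size in sizes {
        starts.push(position);
        position += size;
    }
    starts
}

// For sizes [3, 2, 2, 1] this yields [0, 3, 5, 7]: for example, the third
// s-bucket starts right after the 3 + 2 chunks of the first two buckets.
```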

Alternatively, we can use five s-buckets:

s-bucket 1 stores [encode_1_chunk_0, encode_3_chunk_0].
s-bucket 2 stores [encode_0_chunk_1, encode_3_chunk_1].
s-bucket 3 stores [encode_0_chunk_2, encode_1_chunk_2].
s-bucket 4 stores [encode_2_chunk_3].
s-bucket 5 stores [record_2_chunk_0].

Clearly, the first four s-buckets correspond to the four columns of the bitfield table
0, 1, 1, 0
1, 0, 1, 0
0, 0, 0, 1
1, 1, 0, 0
and the last s-bucket stores the uncoded chunk [record_2_chunk_0].

I see now how the structure Nazar proposed is difficult to prove. Indeed, in the original scheme each encoded chunk’s respective index and the bucket index align.
I second @Chen_Feng_2023’s idea:

Then, we maintain an array [3, 2, 2, 1] indicating the size of each s-bucket. Clearly, this array can help us find out the start position and end position of a specific s-bucket.

Or we could use a length-prefixed encoding to write the buckets, like SCALE or something more appropriate, so you know how many chunks each bucket contains. Then the correspondence between chunk and record can be restored from the record bitfield. Would that work for you @nazar-pc ?

I oppose the idea of storing all erasure coded chunks (both provable and not) because that somewhat defeats the idea of masking: the record can now be restored purely from unencoded chunks.

That works if we want to potentially read the whole sector just to audit it; otherwise we’d really have to store some external mapping to do it efficiently.

I oppose the idea of storing all erasure coded chunks (both provable and not) because that somewhat defeats the idea of masking: the record can now be restored purely from unencoded chunks.

Well, we can make them all encoded then. Let’s say we want to store the first 2^15 chunks only (I can’t recall why we are even erasure coding them, maybe we don’t need to). Then we can add a seed that the farmer can iterate, and instead of storing a bitfield of encoded chunks, we’ll store bytes, each of which indicates the number of the table used for the corresponding chunk.

So we’ll generate the first PoSpace table and use its proofs to encode some chunks. If some chunks don’t have proofs, we’ll generate a second table and encode the remaining chunks with those proofs, and so on until all 2^15 chunks are encoded.

To occupy just one byte per chunk we’ll have to limit the number of seeds to 255 and use an unencoded chunk in case it still wasn’t possible to encode after 255 attempts, which from my quick check means a ~1.82e-99 probability of being unable to encode a particular chunk (~0.41 per attempt); that should be sufficient for our needs. In case we want to cut storage to 4 bits per chunk, the probability will be ~0.000_001_555, which is still decent: most of the chunks will be encoded and participate in farming.
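
As a quick sanity check of those numbers (taking the ~0.41 per-attempt figure above at face value):

```rust
// Sanity check of the probabilities quoted above, assuming a ~0.41 chance per
// table that a given chunk has no PoSpace proof (empirical figure from the post).
fn main() {
    let p_no_proof: f64 = 0.41;
    println!("{:e}", p_no_proof.powi(255)); // one byte per chunk: ~1.8e-99
    println!("{:e}", p_no_proof.powi(15)); // 4 bits per chunk: ~1.6e-6
}
```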

If we want to bring down the number of tables that need to be generated per sector, we can reuse tables across multiple records (let’s say one table per pair of records).
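
Putting the idea together, a rough sketch of this seed-iteration encoding under the one-byte-per-chunk option (helper names are hypothetical, not the actual implementation):

```rust
// Sketch only: `generate_pos_table`, `find_quality` and `xor_with_quality` are
// hypothetical helpers. Table number 0 is reserved for "left unencoded".
const MAX_TABLES: u8 = 255;

/// Encode chunks in place, returning one byte per chunk with the number of the
/// table whose proof was used (0 means the chunk stayed unencoded).
fn encode_with_multiple_tables(chunks: &mut [[u8; 32]], base_seed: &[u8; 32]) -> Vec<u8> {
    let mut table_numbers = vec![0u8; chunks.len()];

    for table_number in 1..=MAX_TABLES {
        if table_numbers.iter().all(|&n| n != 0) {
            // Every chunk is encoded, no more tables needed.
            break;
        }
        // Derive a fresh PoSpace table from the base seed and the table number.
        let table = generate_pos_table(base_seed, table_number);
        for (index, chunk) in chunks.iter_mut().enumerate() {
            if table_numbers[index] == 0 {
                if let Some(quality) = table.find_quality(index as u32) {
                    let encoded = xor_with_quality(chunk, &quality);
                    *chunk = encoded;
                    table_numbers[index] = table_number;
                }
            }
        }
    }

    table_numbers
}
```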

I’d really like to see a construction where each sector is exactly the same size and has deterministic shape.

An issue I see with this approach is that it potentially gives malicious farmers a legal way to create multiple auditable chunks per source chunk. If a farmer can create up to X tables per record, it gives them a chance to potentially magnify their storage by 0.63X. Say they have encoded chunk 0 with the first table, but it also happened to have a proof in the second table. The farmer can store both versions and farm both, since there is no way to check whether a chunk did not have a proof at all previous seeds. The cost of querying a table for proofs for all chunks is negligible compared to the cost of creating it.

With the original buckets, the sectors are all the same size, 2^15 by 1300, but not the same shape, that is true. However, once encoded, the shape of each sector becomes fixed.

we’d really have to store some external mapping to do it efficiently.

Can we precompute and store the offset for each bucket from their sizes? If bucket sizes were [1000, 789,…] then you can read the 3rd bucket from start+1000+789

Just FYI: I decided to go with the original design in the end; there are various issues with this proposal.

I find Nazar’s proposal appealing, and I also have some comments on the subsequent replies.

1 )

@nazar-pc can you confirm that this is what you had in mind?
Please note when record is used and when encode is.
You already replied that you meant pushing (unencoded) record chunks to the end, but I think that we agree that s-buckets should be arranged index-wise.

I’m asking because in the subsequent reply by Chen, he seems to consider a different s-bucket.

2 )

@dariolina, what’s the problem with what you write? Masking, to the best of my understanding, is done in order to 1) make the “plot” unique; 2) discourage not following the protocol by generating on-demand plots.

3 ) There’s a potential issue with storing unprovable chunks (raw chunks that don’t give “lottery tickets”), which are roughly 37% of all chunks. Basically farmers with lots of storage (hence storing the same archived record several times) may prefer to store the raw record once instead of several scattered 37% fragments.

No, I meant what I wrote above. I didn’t plan to store s-buckets index-wise because we don’t have all the indices, which makes s-buckets of different sizes, which in turn was the reason I started this discussion.

That is exactly the problem we were talking about. 37% is way too much. We effectively waste 37% of space because that space can’t farm.