Calculating fraction of missing pieces

I am writing to discuss the fraction of missing pieces under various scenarios. We have explored two primary approaches:

  1. With a fixed number of pieces (Static).
  2. When the number of pieces increases over time.

In the first experiment, we assume a total of 1,000,000 pieces, with 100,000 farmers, where each farmer can store 1000 pieces. Given this setup, the replication factor is 100, resulting in no missing data.

However, when we reduce the number of farmers to 10,000, the replication factor drops to 10. Consequently, we observed that the fraction of missing data becomes 116/1,000,000.

For the dynamic scenario where \( n \) increases progressively, we follow the given scheme: height grows by n_0 plus a uniformly random value within the range ([min, multiplier * n_0]). In this setup, every farmer joins the system at n = 1000. The objective is to devise a strategy that reduces the likelihood of data loss.
We are assuming n_max = 1,000,000 and min = 100

The process for each farmer is to continuously compute the next height until the maximum is reached.

Multiplier = 4

With a multiplier set to 4, and assuming #farmers = 100,000 with each capable of storing 1000 pieces, the fraction of missing pieces comes to 9876/1,000,000.

The figure below is a histogram of the latest height of farmers.

When the number of farmers is reduced to 10,000, the fraction becomes 87794/1,000,000. This is a logical outcome given the decrease in the replication factor from 100 to 10.

Multiplier = 2

By setting the multiplier to 2, we essentially offer each farmer a broader scope for selecting pieces. This methodology appears to be both more realistic and practical.

For #farmers = 100,000, the fraction of missing pieces is 0.0064%. You can see which pieces weren’t selected by any farmers.

However, with a reduced count of #farmers = 10,000, the fraction of missing pieces rises to 0.0583%.

In conclusion, the results suggest that a multiplier of 2 is preferable. In a subsequent discussion, I will delve into potential modifications to the selection rule to further mitigate data loss.

You can also find the code here:

1 Like

Thanks @saeid!

For the static case, it is a standard balls-and-bins problem and so the fraction of missing pieces is about 1/{e^{replication factor}}. For the dynamic case with multiplier = 4, the fraction of missing pieces is about 1/replication factor. For the dynamic case with multiplier = 2, the fraction of missing pieces is about 1/(2 \times replication factor).

That is, our current sector expiry policy, together with uniform piece selection, leads to a relatively high fraction of missing pieces. This issue can be mitigated in two ways: (1) using better parameters for our sector expiry policy; (2) modifying our uniform piece selection. Saeid will report more on this tomorrow.

For mitigating the mentioned issue, high fraction of missing pieces, we designed four scenarios as follows. Replication factor in these scenarios is 10.

Scenario 1

The maximum number of pieces is 2,000,000. We have two stages:

  1. 10,000 farmers join based on uniform distribution from range [0,1M]
  2. Another 10,000 famers join based on uniform distribution from range [1M,2M].

Then, based on the formula of calculating the heights of each farmer, n_0 + [min, multiplier*n_0], we calculated the latest height of each farmer. Then, each farmer can select 1,000 pieces.

Scenario 2

In this scenario, we assumed history is not important. The setup is similar to scenario 1 but the first 10,000 farmers join the system at a constant height, let’s say 1000.

Scenario 3

In this scenario, we went a step further. The maximum number of pieces is 3,000,000. The setup is like scenario 1. The third batch of farmers join in range [2M,3M] based on uniform distribution.
We set 11 checkpoints from [2M,3M] to track the number of missing data at each stage. These checkpoints selected at heights: {2M,2.1M,2.2M,...,3M}

Scenario 4

We assumed after a specific height, all of the farmers [oldest,…,newest], select their pieces from scratch. For example, after height of 3,000,000 all of the 30,000 farmers select their pieces from range of [0,3M]
This scenario because of overhead is not efficient. However, it gives us clearer insight.

You can see the results of these scenarios with different values of multiplier:

For scenario 3, here is outcome of checkpoints when multiplier = 2.

CheckPoint 0: 3201
CheckPoint 1: 6224
CheckPoint 2: 9400
CheckPoint 3: 12421
CheckPoint 4: 15716
CheckPoint 5: 19814
CheckPoint 6: 26393
CheckPoint 7: 38224
CheckPoint 8: 60043
CheckPoint 9: 100999
CheckPoint 10: 176230

Also, you can find the code for all of the scenarios in this repo: