Recovering missing piece failed

Interesting, what happens on a new setup if you use just a single farm? I’m starting to suspect it may be a function of the number of farms.

I think a lot of users run Subspace on an internet connection provided by their ISP with default networking gear (modem/router/firewall, etc.). Subspace is heavy on the network; even one of my Cisco Meraki firewalls seems to struggle to the point that latency increases…

The reason the circulated tweaks (lowering in/out peers and connections, etc.) may work is that they take load off your network connection.

I have just adjusted all my farmers to use fewer connections and noticed the network itself responding better; for example, the SSH terminals to my servers started to respond faster.
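For context, a hedged sketch of the kind of reduction I mean, using the farmer connection flags that also appear later in this thread; the placeholders are illustrative, not recommendations:

"--in-connections", "<lower than the default>",
"--out-connections", "<lower than the default>"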

I’m actually starting to see the effect in the firewall monitoring as well: packet loss is decreasing, and latency too, without affecting the overall in/out data volume…

What’s surprising is that within just 30 days, 5 farmers in total uploaded 5.4 TB and downloaded 10.7 TB of data… Below is a breakdown by client.

So Subspace, as a proof of archival space and time, uses a lot of CPU while plotting/replotting and farming, and generates a lot of internet traffic.

At the moment I’m thinking my failed recovery of missing pieces has to do with overloading the network link to the internet.

I get these on a regular basis; can they be ignored? I am on the sep-21 release right now. If they can’t be ignored, what can be changed to reduce how often these errors occur?

2023-09-22T04:25:46.969464Z ERROR yamux::connection: 60b70350: maximum number of streams reached
2023-09-22T04:25:50.088984Z ERROR yamux::connection: 71cb0501: maximum number of streams reached

I have only seen this once over the last few weeks. You can generally ignore it, but it is something for us to fix at the protocol level (cc @shamil).

What exactly is this error reporting has reached its limit? Is there a parameter among the additional CLI parameters that can be increased to resolve this error?

I found that this error rarely appeared when I ran one node and one farm, but after starting more farms it appears very frequently.

This must be related to weaker networking or a weaker router. Generally, the limit set in the code should be high enough for this not to happen at all, but something is clearly off. It is a non-fatal error, though.

@xorinox Regarding “maximum number of streams reached”: I would appreciate additional info about this error.

  • What is your configuration (OS, VM, docker, hardware, etc…)?
  • Is there a farmer-machine configuration where you don’t get such errors?
  • Did you try running smaller plots and/or a single plot on your machine? Is there a difference in error frequency?
  • Did you try changing the out- and in-connection CLI arguments separately?

Please tell me, should I increase or decrease this parameter? I found that there are many connection-related parameters. Which one specifically should I adjust?

@shamil This is a physical machine: AMD 64-core (7702), Fedora 37. I am using pretty much the same configuration on all machines, yet only one other machine has the same errors. I tried smaller and larger plots, but I didn’t look for differences in error frequency. That is an interesting question, but it takes quite a bit of time & organization to test. What would you suggest changing the in/out parameters to? I am using the defaults on this machine.

@xorinox You mentioned that you sometimes run custom builds. Try increasing this constant (YAMUX_MAX_STREAMS): https://github.com/subspace/subspace/blob/0caf3bf63b4109b31bbae7ec85db3666f7710c2d/crates/subspace-networking/src/constructor.rs#L77

We use the values recommended by the upstream library authors (the libp2p team). However, if the change removes this type of error on your machine, we’ll reconsider the default value.
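For illustration, a minimal sketch of what that custom-build change looks like; only the constant name and file come from the link above, while the type and the exact value are assumptions (512 is the experimental bump tried later in this thread, not the shipped default):

// crates/subspace-networking/src/constructor.rs (sketch; type and value assumed)
// Upper limit on concurrent Yamux substreams per connection.
// Rebuild the node/farmer after changing it.
const YAMUX_MAX_STREAMS: usize = 512;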

I don’t think you need to change the dsn-in- parameters in most cases. Try increasing the dsn-out- parameters 2x or 4x if you use a data center router; this could affect your regular experience on a home router, though (browsing, YouTube, etc.). For nodes there is a temporary parameter "--dsn-sync-parallelism-level"; try slowly increasing it as well. It will be removed next week in favor of automatic scaling based on the dsn-out- and dsn-pending-out- connection parameters.
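As a concrete sketch of that advice for the node: the flag names --dsn-out-connections and --dsn-pending-out-connections are assumptions derived from the dsn-out-/dsn-pending-out- prefixes mentioned above, and the placeholders are not recommended values:

"--dsn-out-connections", "<2x-4x your current value>",
"--dsn-pending-out-connections", "<2x-4x your current value>",
"--dsn-sync-parallelism-level", "<increase gradually>"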

@shamil I have increased the YAMUX max to 512 and haven’t seen the error so far. But I see the lines below and wonder if they are important. I have DEBUG logging enabled.

These lines keep coming, and plotting appears to never start.

2023-09-29T16:44:58.041482Z DEBUG subspace_farmer::utils::farmer_piece_getter: Cannot acquire piece: all methods yielded empty result. piece_index=9952 connected_peers=16
2023-09-29T16:44:59.567084Z DEBUG subspace_farmer::utils::farmer_piece_getter: Cannot acquire piece: all methods yielded empty result. piece_index=10435 connected_peers=16
2023-09-29T16:45:00.224672Z DEBUG subspace_farmer::utils::farmer_piece_getter: Cannot acquire piece: all methods yielded empty result. piece_index=10348 connected_peers=16
2023-09-29T16:45:00.369840Z DEBUG subspace_farmer::utils::farmer_piece_getter: Cannot acquire piece: all methods yielded empty result. piece_index=10496 connected_peers=16
2023-09-29T16:45:01.151665Z DEBUG subspace_farmer::utils::farmer_piece_getter: Cannot acquire piece: all methods yielded empty result. piece_index=10979 connected_peers=16
2023-09-29T16:45:01.152496Z DEBUG subspace_farmer::utils::farmer_piece_getter: Cannot acquire piece: all methods yielded empty result. piece_index=10716 connected_peers=16
2023-09-29T16:45:01.154001Z DEBUG subspace_farmer::utils::farmer_piece_getter: Cannot acquire piece: all methods yielded empty result. piece_index=9830 connected_peers=16

Thanks for the experiment. We continue to look for the root cause.

I checked our L2 cache. All pieces mentioned in the log are present in the DSN and accessible. Please double-check your configuration; it seems your farmer can’t reach the outside network.

Shamil, could you explain under which conditions the farmer falls into the ‘recovering missing piece’ loop? Sometimes it lasts for more than a day, and the farmer hardly plots anything in this state; worse, the CPU load rises very high.

I’m wondering: if a sector has missing pieces, rather than trying to recover the pieces for that sector one by one, can we discard all of its pieces and replot the sector from scratch?

I’ve observed this happening to many serious farmers with very powerful CPUs, huge RAM, good routers and good internet lines. Recovering pieces sometimes even happens during replotting.

There are multiple possible reasons for “missing pieces”: no network connection (or an unstable network), incorrect connection-related settings, and “remote” or “DSN” reasons like exceeding the incoming-connection limit on remote peers. The most common reason for multiple “recovery processes” is connection-settings related. The latest code changes we released should decrease the probability of these “recoveries”, as we implemented an automatic rate limiter for parallel processes, which was the root cause of this behavior.

Many people in China experience this issue. Is it because China’s firewall imposes certain restrictions on accessing foreign sites?

Also, after encountering this issue, I raised the connection limits of the farm to 3000, and this resolved the problem. Why is that?
My configuration is as follows:

"--in-connections", "300",
"--out-connections", "3000",
"--pending-in-connections", "3000",
"--pending-out-connections", "3000",
"--target-connections", "3000"

While this approach can resolve the issue, it places a significant burden on the network.

It is possible. We have had previous reports of networking issues when running the apps from China.

Only two parameters should be changed, and only if the router is performant enough (like a data center router):

  • "--out-connections"
  • "--pending-out-connections"

These parameters affect the parallelism level of data acquisition, and the default values are set to protect commodity routers.
Changing "--target-connections" will likely worsen overall performance. The other parameters should be left at their defaults as well.
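To make that concrete, here is a hedged variant of the configuration quoted earlier, adjusted per this advice; the placeholders are illustrative only, and the right multiple depends on how much your router can handle:

"--out-connections", "<2x-4x the default>",
"--pending-out-connections", "<2x-4x the default>"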

I’ve observed that farms with different connection counts exhibit different plotting speeds. Farms with a higher number of connections appear to plot faster. Could this be attributed to those two parameters?

Absolutely. Until (and unless) pieces are cached locally, they have to be retrieved from the network, and those parameters affect the retrieval success rate.