Timed out without ping from plotter

Farmer: Snapshot build #364
Node: Snapshot build #364

  • Sectors are not completing successfully. After about seven hours of running, zero sectors completed.

WARN: Farmer

2024-07-10T23:29:46.198480Z  WARN {farm_index=0}:{sector_index=745}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Received only 787387567680 sector bytes out of 1056896064 expected bytes
2024-07-10T23:33:32.841258Z  WARN {farm_index=9}:{sector_index=632}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Received only 667958312448 sector bytes out of 1056896064 expected bytes
2024-07-10T23:34:33.330491Z  WARN {farm_index=8}:{sector_index=101}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Received only 106746502464 sector bytes out of 1056896064 expected bytes

WARN: Plotter

2024-07-10T23:34:38.755287Z  WARN {public_key=f4187d5e8a8b7ef59edbeac2ce703b44efc3e40fa8a44a91dd73d56fbbb87c1c sector_index=1213}: subspace_farmer::cluster::plotter: Response sending ended early
2024-07-10T23:35:24.851414Z  WARN {public_key=0aecf94b58132d95ade20797fb108ddc1d184349701bc13a1d4dfaa2831ed173 sector_index=835}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01J2FENM64MFHYN5JB0GEJWAPQ expected_index=138 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress
2024-07-10T23:35:28.732111Z  WARN {public_key=0aecf94b58132d95ade20797fb108ddc1d184349701bc13a1d4dfaa2831ed173 sector_index=835}: subspace_farmer::cluster::plotter: Response sending ended early
2024-07-10T23:35:33.629623Z  WARN {public_key=de2367fb79835aec8185508b9c01d0fbca68c12d82e4f1efe890f785bff1d225 sector_index=101}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01J2FEPDW9QCPG5H4W4PR5Q0E8 expected_index=137 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress
2024-07-10T23:35:37.028791Z  WARN {public_key=de2367fb79835aec8185508b9c01d0fbca68c12d82e4f1efe890f785bff1d225 sector_index=101}: subspace_farmer::cluster::plotter: Response sending ended early
2024-07-10T23:35:55.395375Z  WARN {public_key=e0848ac804c9f05b1a28638fdb11ae67ad35790066dcf00f102cffd1abb5bd5f sector_index=860}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01J2FEQHQHWETM1NQAQ2Z5MJRN expected_index=136 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress
2024-07-10T23:35:59.038212Z  WARN {public_key=e0848ac804c9f05b1a28638fdb11ae67ad35790066dcf00f102cffd1abb5bd5f sector_index=860}: subspace_farmer::cluster::plotter: Response sending ended early

System Specs

Role: NATS, Node (nats.log node.log)
Dual Socket 2680v3 24 cores (48 threads)
Link: 20Gbit

Role: Cache, Controller, Plotter, Farmer (cache1.log controller1.log plotter.log farmer1.log)
Dual Socket 7742 128 cores (256 threads)
Cache: 100GiB
Plots: 84T
Link: 100Gbit

Role: Cache, Controller (cache2.log controller2.log)
EPYC 7F72 24 cores (48 threads)
Plots: 0
Link: 100Gbit

Role: Farmer (farmer3.log)
Dual Socket 2687v4 24 cores (48 threads)
Plots: 190T
Link: 20Gbit

Role: Farmer (farmer4.log)
Dual Socket 2697A 32 cores (64 threads)
Plots: 190T
Link: 20Gbit

Role: Farmer (farmer5.log)
Dual Socket 2680v3 24 cores (48 threads)
Plots: 113TB
Link: 20Gbit


Logs (249.2MiB)

  Cache 1
  Cache 2
  Controller 1
  Controller 2
  Farmer 1
  Farmer 3
  Farmer 4
  Farmer 5
  NATS
  Node
  Plotter

My apologies, there was indeed another bug. I tested the fix for it and it does work for me with a cluster: Snapshot build · subspace/subspace@a1f9bb0 · GitHub


Farmer: Snapshot build #365
Node: Snapshot build #365

  • It appears there’s an ongoing issue with sectors not completing as expected. Previously, I had no trouble keeping up with replotting, but since the latest release, the process has been falling further behind. Some metrics indicate that it’s now taking over an hour per sector, compared to the previous 10-15 minute range. Additionally, logs suggest that some completed sectors are still being discarded (Failed to plot sector).

WARN: Farmer

2024-07-11T17:15:35.079785Z  WARN {farm_index=46}:{sector_index=698}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-07-11T17:15:35.248195Z  WARN {farm_index=47}:{sector_index=824}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-07-11T17:15:35.248362Z  WARN {farm_index=38}:{sector_index=365}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-07-11T17:15:35.250592Z  WARN {farm_index=3}:{sector_index=488}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-07-11T16:59:02.581434Z  WARN {farm_index=7}:{sector_index=446}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Received only 471641773020 sector bytes out of 1056896064 expected bytes
2024-07-11T16:59:00.678658Z  WARN {farm_index=22}:{sector_index=1760}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Received only 1860733502416 sector bytes out of 1056896064 expected bytes
2024-07-11T16:59:00.800198Z  WARN {farm_index=45}:{sector_index=1340}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Received only 1416663511424 sector bytes out of 1056896064 expected bytes

WARN: Plotter

2024-07-11T17:15:57.347296Z  WARN {public_key=70b1b79e84ec95460534825e0b2d2aad08dc47873d7ceb4fbc9ec248a27f0973 sector_index=138}: subspace_farmer::cluster::plotter: Response sending ended early
2024-07-11T17:15:58.115156Z  WARN {public_key=2cebdd6938352d560b4aa3a3dd0d6838d388d5379e7a479edf1caaba5e96d60c sector_index=34}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01J2HCAKK03FA7VT22JXVM6V06 expected_index=35 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress
2024-07-11T17:15:58.115216Z  WARN {public_key=2cebdd6938352d560b4aa3a3dd0d6838d388d5379e7a479edf1caaba5e96d60c sector_index=34}: subspace_farmer::cluster::plotter: Response sending ended early
2024-07-11T17:15:58.699581Z  WARN {public_key=42f869b2329e967cfd6324d59fde439f82c38b84dd0aa2cf7a71f24e967a6e3a sector_index=1766}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01J2HCAM4Q4EDN4VFA7ZQ9MFAJ expected_index=35 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress

System Specs

Role: NATS, Node (nats.log node.log)
Dual Socket 2680v3 24 cores (48 threads)
Link: 20Gbit

Role: Cache, Controller, Plotter, Farmer (cache1.log controller1.log plotter.log farmer1.log)
Dual Socket 7742 128 cores (256 threads)
Cache: 100GiB
Plots: 84T
Link: 100Gbit

Role: Cache, Controller (cache2.log controller2.log)
EPYC 7F72 24 cores (48 threads)
Plots: 0
Link: 100Gbit

Role: Farmer (farmer3.log)
Dual Socket 2687v4 24 cores (48 threads)
Plots: 190T
Link: 20Gbit

Role: Farmer (farmer4.log)
Dual Socket 2697A 32 cores (64 threads)
Plots: 190T
Link: 20Gbit

Role: Farmer (farmer5.log)
Dual Socket 2680v3 24 cores (48 threads)
Plots: 113TB
Link: 20Gbit


Logs (413.6MiB)

  Cache 1
  Cache 2
  Controller 1
  Controller 2
  Farmer 1
  Farmer 3
  Farmer 4
  Farmer 5
  NATS
  Node
  Plotter

Those warnings have the same source: the NATS connection breaks, so sectors are not written fully and time out. More detailed logs from the NATS server also show that the client simply stopped reading messages, so the server waited for a bit and then closed the dead connection.

I still do not see an obvious reason why this would happen, but after reading through a lot of upstream tickets I think rt: make the LIFO slot in the multi-threaded scheduler stealable · Issue #4941 · tokio-rs/tokio · GitHub might be what we’re hitting here under load. Assuming that is the case, we can try the workaround introduced in rt: add unstable option to disable the LIFO slot by carllerche · Pull Request #4936 · tokio-rs/tokio · GitHub, which this test build enables: Snapshot build · subspace/subspace@ceec8a4 · GitHub
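For anyone experimenting locally: that PR exposes the workaround on tokio's runtime builder behind the unstable cfg flag. A minimal configuration sketch (illustrative only, not the actual subspace_farmer setup) looks roughly like this; it only compiles when the whole build uses `RUSTFLAGS="--cfg tokio_unstable"`:

```rust
// Configuration sketch only; requires building with
// RUSTFLAGS="--cfg tokio_unstable" to enable the unstable builder option.
fn build_runtime() -> tokio::runtime::Runtime {
    tokio::runtime::Builder::new_multi_thread()
        // Opt out of the per-worker LIFO slot so the most recently
        // spawned task can be stolen by other workers under load.
        .disable_lifo_slot()
        .enable_all()
        .build()
        .expect("failed to build tokio runtime")
}
```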

Fingers crossed that it finally works as expected.

Farmer: Snapshot build #367
Node: Snapshot build #367

  • Performance seems improved over Snapshot #365, but plotting is still slower than it was on the jul-05 release.

  • I haven’t had enough expired sectors to notice this before, as I have quite a backlog at the moment. However, it seems that some AboutToExpire sectors are taking precedence over expired sectors. Over approximately 13 hours of logging, the count of AboutToExpire sectors decreased while a backlog of expired sectors remained to replot. This happened with all farmers. The only plausible explanation, aside from some odd logic, is that AboutToExpire sectors are progressing to expired before I can replot them.

WARN: Farmer

2024-07-12T04:59:56.174909Z  WARN {farm_index=36}:{sector_index=1558}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-07-12T04:59:56.284208Z  WARN {farm_index=37}:{sector_index=1710}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-07-12T04:59:56.380516Z  WARN {farm_index=44}:{sector_index=354}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter
2024-07-12T04:59:58.308986Z  WARN {farm_index=9}:{sector_index=431}: subspace_farmer::single_disk_farm::plotting: Failed to plot sector, retrying in 1s error=Low-level plotting error: Timed out without ping from plotter

WARN: Plotter

2024-07-12T05:00:18.264442Z  WARN {public_key=1e884369f35785d13dc6b04267da4a685655865fe773c00de7cff3975bca3726 sector_index=1367}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01J2JMQ14SC9R7AVH4ECHQY71P expected_index=26 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress
2024-07-12T05:00:18.431777Z  WARN {public_key=3cb688f4d9967a3f4ac8e985226d3b763ec2134f6fb7320bee218d1e653c8e48 sector_index=116}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01J2JMCYK40TBQDXB3F55TRBXM expected_index=60 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress
2024-07-12T05:00:19.386284Z  WARN {public_key=d0256843881b3abf31825f25c2c13ef510d12dd43138896891ed713e1650d67a sector_index=260}: subspace_farmer::cluster::nats_client: Acknowledgement wait timed out response_subject=stream-response.01J2JMVMQCKCN0KB2PJXW3BXJ2 expected_index=11 request_type=subspace_farmer::cluster::plotter::ClusterPlotterPlotSectorRequest response_type=subspace_farmer::cluster::plotter::ClusterSectorPlottingProgress

System Specs

Role: NATS, Node (nats.log node.log)
Dual Socket 2680v3 24 cores (48 threads)
Link: 20Gbit

Role: Cache, Controller, Plotter, Farmer (cache1.log controller1.log plotter.log farmer1.log)
Dual Socket 7742 128 cores (256 threads)
Cache: 100GiB
Plots: 84T
Link: 100Gbit

Role: Cache, Controller (cache2.log controller2.log)
EPYC 7F72 24 cores (48 threads)
Plots: 0
Link: 100Gbit

Role: Farmer (farmer3.log)
Dual Socket 2687v4 24 cores (48 threads)
Plots: 190T
Link: 20Gbit

Role: Farmer (farmer4.log)
Dual Socket 2697A 32 cores (64 threads)
Plots: 190T
Link: 20Gbit

Role: Farmer (farmer5.log)
Dual Socket 2680v3 24 cores (48 threads)
Plots: 113TB
Link: 20Gbit


Logs (1.33GiB)

  Cache 1
  Cache 2
  Controller 1
  Controller 2
  Farmer 1
  Farmer 3
  Farmer 4
  Farmer 5
  NATS
  Node
  Plotter

Sector expiration is independent between farms. So if you have an expired sector in one farm and an about-to-expire sector in another, there isn’t really a specific ordering between them; ordering only exists within the same farm right now.

The timeouts look exactly the same, so the conclusion is that the workaround didn’t really help. I’ll have to think about this some more; I’m not yet sure what else to try.

For now I’d probably recommend moving the NATS server to the same machine as the plotter; that way you avoid extra round trips over external network interfaces (all messages flow through the NATS server, even between components on the same machine).

I’ll be looking for solutions to this in the meantime with upstream developers.


I should have been clearer: this issue is occurring on a per-farmer basis and has been observed across all farmers. It’s entirely possible that sectors are expiring (AboutToExpire > Expired) faster than I can clear the backlog at the moment. However, the counts suggest this may not necessarily be the case, which is why I shared this information.

Will do, thanks.

I was talking about individual farms, not farmers. Until a farm has finished one loop of replotting it doesn’t discover more expired or about-to-expire sectors; at least the current implementation doesn’t.
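To make the per-farm ordering concrete, here is a rough sketch (illustrative types only, not the actual subspace_farmer implementation): within one farm, expired sectors are replotted before about-to-expire ones, but no such queue spans multiple farms.

```rust
// Illustrative only; these are not the actual subspace_farmer types.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum ExpirationState {
    Expired,       // replotted first within a farm
    AboutToExpire, // replotted after all expired sectors in the same farm
}

#[derive(Debug, Clone, Copy)]
struct SectorToReplot {
    sector_index: u16,
    state: ExpirationState,
}

// Ordering applies only to the sectors of a single farm; each farm sorts
// its own replotting queue independently of every other farm.
fn farm_replot_order(mut sectors: Vec<SectorToReplot>) -> Vec<SectorToReplot> {
    sectors.sort_by_key(|sector| sector.state);
    sectors
}
```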

This makes sense and clarifies the current conditions I am experiencing. Thank you for the additional explanation.