Metrics for plotting times - reduce number of samples being averaged

Hello,

I’ve lately been working closely with Wolfrage on developing his tool, and we’ve been analyzing the metrics output in detail. We’ve come to the conclusion that the following metric values average over too many sectors, which badly distorts the reported times for any remaining plots once another plot finishes.

subspace_farmer_sector_plotting_time_seconds_sum{farm_id="01HS66MQMW86T6DFS0S9KS63ZQ"} 50217.142811497004
subspace_farmer_sector_plotting_time_seconds_count{farm_id="01HS66MQMW86T6DFS0S9KS63ZQ"} 126
subspace_farmer_sector_plotting_time_seconds_sum{farm_id="01HSDTEKF42JM2XEMSD3ZC8545"} 50616.17119767302
subspace_farmer_sector_plotting_time_seconds_count{farm_id="01HSDTEKF42JM2XEMSD3ZC8545"} 129

In that metrics output, the plot ending in ID 545 is only half plotted, while the one ending in 3ZQ has recently finished being plotted and replotted.

The true sector time for this CPU is 3m17s, but those metric values make it appear as if the sector time for the one remaining plot is 6m29s. That’s because for the preceding 129 sectors, the remaining plot was getting a sector every 6m34s while the CPU’s attention was split with a second plot. It no longer is: it’s the only plot remaining and is getting a sector every 3m17s. But the metric is still averaging over the last 129 sectors. At over 3m each, that means it’ll take at least 7 hours before the number adjusts back to normal - possibly much more, if the number of sectors being averaged keeps growing. As it stands, in his tool, the average farm sector time is veerrrrrry slowly, verrrrrry gradually, over many hours, dropping from 6m30s back down until eventually it will become the correct 3m17s again.
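To illustrate the lag, here is a rough sketch of the arithmetic. The sum/count values come from the metrics output above; the 3m17s "true" per-sector time is the observed figure from the post, not something the metric reports:

```python
# Cumulative values from the metrics output above for the half-plotted
# farm (ID ending in 545).
plot_sum = 50616.17    # ..._plotting_time_seconds_sum
plot_count = 129       # ..._plotting_time_seconds_count

# Apparent average sector time derived from the cumulative metric.
apparent = plot_sum / plot_count          # ~392 s, i.e. about 6m32s

# Assumed true per-sector time now that only one plot remains.
true_time = 3 * 60 + 17                   # 3m17s = 197 s

# After roughly 7 more hours of plotting at the true rate (another
# 129 sectors), the cumulative average has still only dropped partway.
after = (plot_sum + 129 * true_time) / (plot_count + 129)   # ~295 s, ~4m55s
```

So even after doubling the sector count at the correct speed, the reported average is still nearly a minute and a half too high, which matches the "at least 7 hours - possibly much more" estimate above.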

Averaging that many sectors seems to be of dubious benefit, when plotting times for a given CPU are generally pretty consistent with little variation. It seems like even just a “last sector time” would be more useful, and would adjust far more quickly when conditions change (such as the number of plots being worked on) than an “average sector time” over such a large number of sectors does.
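A small "last N sectors" window - a hypothetical sketch, not the farmer's actual code - would track a change in conditions almost immediately, while the cumulative average barely moves:

```python
from collections import deque

# Hypothetical rolling window over the last few sector times, versus a
# cumulative average over every sector since start.
WINDOW = 3
recent = deque(maxlen=WINDOW)
total_sum, total_count = 0.0, 0

def record(sector_seconds):
    """Record one finished sector in both trackers."""
    global total_sum, total_count
    recent.append(sector_seconds)
    total_sum += sector_seconds
    total_count += 1

# 129 sectors while two plots share the CPU (~6m34s = 394 s each)...
for _ in range(129):
    record(394)
# ...then conditions change: one plot left, true time 3m17s = 197 s.
for _ in range(3):
    record(197)

windowed = sum(recent) / len(recent)       # 197 s: correct after 3 sectors
cumulative = total_sum / total_count       # ~389 s: still nearly double
```

After just three sectors under the new conditions the windowed value is exact, whereas the since-start average is still reporting almost twice the real time.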

If changing the existing plotting metric to max out at around a count of 1-3, rather than 129, is unacceptable for some reason, then could we get an additional simple “last sector time” as a separate metric instead?


Oops. Seems that just restarting the farmer is enough to get it to start tracking from 1 sector again. Should’ve thought to try that.

I still don’t think there’s a use to tracking every sector since restart, and lowering the count would remove the need to restart to correct the value the metrics are providing when an initial plot or a replot finishes, but consider the priority of such a request drastically reduced.

(Perhaps have it automatically reset the metric whenever an Initial Plotting Complete or Replotting Complete occurs, i.e. whenever the number of drives being plotted changes?)

The metric is a histogram; you can read about it in detail here: Histograms and summaries | Prometheus

Yes, it has count and sum, but those are not the only values: you also have the time distribution in buckets (from which percentiles can be derived), and the buckets can be customized in the software if they don’t match expectations. That will give you much more information about what is happening and whether there is an issue.
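To sketch what the bucketed distribution adds over a single average - the bucket values below are made up for illustration, not taken from a real farmer:

```python
# A Prometheus histogram exposes cumulative bucket counters: for each
# upper bound `le`, the number of observations at or below it.
buckets = {            # le (upper bound, seconds) -> cumulative count
    180: 0,
    240: 95,
    480: 129,
    float("inf"): 129, # +Inf bucket always equals the total count
}

# Because buckets are cumulative, subtracting adjacent entries gives the
# count in a range - information a single sum/count average hides.
in_range = buckets[480] - buckets[240]   # 34 sectors took between 4m and 8m
fast = buckets[240] - buckets[180]       # 95 sectors took between 3m and 4m
```

In this made-up example the distribution is clearly bimodal (most sectors fast, a tail of slow ones), which an overall average of the two groups would completely obscure.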

Does it help, or would you still need last sector time? You need to understand that metrics are for monitoring purposes, meaning showing what is happening and being able to predict/detect issues. They are not for real-time dashboards, even though I appreciate that you’re able to repurpose them for that as well.

While the CLI tool is great, expanding on Subspace | Grafana Labs and using that to introspect the farmer’s operation might be more appropriate, and it will also allow you to see historical data, set up alerts, etc. Right now the dashboard is quite basic, but it is a good start and can be improved.

So I just collected some good data showing the effects of the distortion I’m talking about.

These screenshots will focus on Gungnir, which had 4 completed plots and 1 half done plot when a replot started. This is what the average times showed in wolfrage’s tool (which is extremely accurate when replots aren’t distorting it) at the beginning. 3m21s is accurate.

Within a few minutes of the replots starting, avg sector time for Gungnir became 40s. You can see this wildly distorted every other farm as well.

After two of the four replots completed, avg sector time at 2m31s.

After the entire replot was completely finished, 5m12s.

Note that this is all with code we added yesterday to ignore metrics sent by plots that had already fully completed (replot finished). Without that code ignoring metrics from no-longer-plotting farms, it’s way worse - 7m39s.

After a Gungnir farmer restart, sector time fixed. All other farms not restarted (not done replotting yet) and still greatly distorted.

If we had last sector time, the numbers would probably still be terrible during the replot, but it would at least fix itself quickly once the replot finished, without needing to restart every farm after every replot. With the current metric, the distortion very slowly decreases over time but never fully goes away until the farmer is restarted.

As stated previously, another option would be for the metrics service to reset the plotting time metrics every time an initial plot or replot completed, as if the farmer were restarted.

(Note, btw, that Gungnir is concurrency 1 and I have --replotting-thread-pool-size set so there is no actual difference in plotting times between initial plotting and replotting.)

We will see if we can find these and work with them.

Can you be specific as to the metric name you’re suggesting? The only things I’m finding that could be related are:

subspace_farmer_sector_plotting_counter_sectors
subspace_farmer_sector_plotted_counter_sectors

The latter seems to be the count of plotted sectors since the last restart. The former says “Number of sectors being plotted”, but it’s giving me a result of 9, and I don’t see what that could mean for a concurrency 1 farm that only has one plot to write to. I can’t see how we can use either to our advantage.

I wasn’t talking about those; I was talking about subspace_farmer_sector_plotting_time_seconds, which you had initially. Simply search for it and you will see many sub-metrics in there. The documentation about histograms explains what they are and how to use them, and PromQL also has functions for rendering them in convenient ways.
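For what it’s worth, the usual way to get a recent average out of a cumulative histogram is to divide the increase of `_sum` by the increase of `_count` over a window - in PromQL, `rate(subspace_farmer_sector_plotting_time_seconds_sum[5m]) / rate(subspace_farmer_sector_plotting_time_seconds_count[5m])` - which sidesteps the since-restart averaging entirely. A rough sketch of the same idea from two successive scrapes (the scrape values here are illustrative):

```python
# Two successive scrapes of the cumulative sum/count. The delta between
# them gives the average over just that interval, regardless of how much
# history has accumulated since the farmer started.
prev_sum, prev_count = 50616.17, 129
curr_sum, curr_count = 51010.17, 131      # two more sectors plotted

recent_avg = (curr_sum - prev_sum) / (curr_count - prev_count)
# ~197 s here: the current per-sector time, unaffected by the old history.
```

This is essentially what Prometheus computes for you server-side, so the "last sector time" behaviour you want may already be achievable from the existing metric without any farmer changes.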

Okay, I found the bucket histogram information (wolfrage was initially filtering it out as he didn’t see a use for it) and took a long look at it, but I’m not really seeing a way to use it to resolve this issue. As to using it to determine whether there is a problem, I think the screenshots I posted prove the issue exists beyond doubt.

The stats recovered quicker than I expected following the replot, in just a few hours (3 or 4), but they’re still distorted (larger) by a few seconds each, which can throw off ETA significantly. It would still be nice to not have the distortion caused by replots persist until the next restart, so some sort of rotation/pruning of older metric data would be great.

It will not give you the information you have specifically requested in this issue; what it does give you is an industry-standard way of collecting metrics from apps. By looking at whether sector times fall into specific buckets, and at their distribution over time, you can understand what is going on with the farmer, whether it performs as expected performance-wise, etc.

I will repeat: metrics are for monitoring and alerting purposes; they are not for building custom real-time dashboards around use-case requests. It would be worth investing time into Grafana dashboards, though, which would be more scalable and functional over time.