Two farms slowed to a crawl after system reboots; now one won't plot and one won't win rewards

This is going to be a long post; there's a lot of information to share. First, let me give you my setup. I have 4 machines involved in Subspace.

#1: 10900 CPU, Ubuntu 20.04, running node and 4TB farm (single drive). When working correctly, 5 minute sectors.
#2: 10850k CPU, Ubuntu 20.04, running 2x2TB farm (two 2TB drives) connected via RPC to the node on #1 (see the command sketch after this list). When working correctly, 5 minute sectors.
#3: 10850k, effectively identical to #2 in every respect.
#4: 7950x, Windows 10, running 3TB farm (one 2TB plot and one 1TB plot on the same drive) connected via RPC to the node on #1. When working correctly, 3m20s sectors.
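For reference, the remote boxes run the farmer against the node on #1 roughly like this; the IP, paths, and reward address below are placeholders rather than my exact values, and flags may differ slightly between builds (check `--help`):

```
# Box #2/#3 style invocation: two 2TB farms pointed at the node on box #1.
# IP, paths, and reward address are illustrative placeholders.
./subspace-farmer farm \
  --node-rpc-url ws://192.168.1.10:9944 \
  --reward-address <REWARD_ADDRESS> \
  path=/mnt/subspace1,size=2T \
  path=/mnt/subspace2,size=2T
```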

Okay. Here’s my full experience with Subspace since Gemini 2a. A week before 3g, I wanted to get reacquainted with the process, so I practiced by setting up 3f. My experience was terrible. On all four farms, it would take me nearly an hour to go from “Synchronizing piece cache” to “Finished piece cache synchronization”. Plotting a sector started out taking over 3 hours and very slowly improved to maybe one every 30 minutes. CPU usage was always low, in the 3-5% range at most.
And any time I would control-c a farm, I’d get hundreds of error messages (“piece_provider: get providers returned an error” for hundreds of piece indexes).

After a week or so of this frustration, 3g launched, and it worked great! I got all the times described above as “working correctly”, 5 minute sector times at worst, and very consistent. The piece cache would synchronize almost instantly. I never got an error message when control-c’ing a farm; I’d always get the “SIGINT received” trap and a clean shutdown. It had been working like a dream.

Until yesterday (November 14th).

This morning I updated to the November 14 builds as soon as the announcement came out. They worked great; node syncing is definitely improved, by the way. No issues at all with the Nov 14 build on either the node or the farmers.

But then later in the day, for reasons totally unrelated to Subspace (they actually had to do with chia), I needed to physically switch 13 HDDs between the #2 and #3 10850k boxes. I also wanted to label the drives, so I did about a dozen reboots of each box, unplugging one HDD at a time to identify it so I could label it. That’s ALL I did (well, that and the necessary /etc/fstab updates, obviously). Eventually all the HDDs were labeled and switched, and all was good in that respect. Everything went fine as far as my chia setup was concerned.
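(For completeness, the labeling itself was nothing exotic; per drive it was roughly the following, with made-up label and mount point names:)

```
# Identify the newly plugged-in partition, then label it.
# "chia07" and /mnt/chia07 are illustrative names, not my real ones.
sudo blkid /dev/sdb1
sudo e2label /dev/sdb1 chia07     # ext4; use the equivalent tool for other filesystems

# Matching /etc/fstab entry so the mount follows the label, not the device name:
# LABEL=chia07  /mnt/chia07  ext4  defaults,nofail  0  2
```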

But then when I went to restart the farms (which I did shut down properly with control-c before my first reboot, and didn’t attempt to restart until everything was done), they were behaving just like they did under 3f. I am now also seeing a few “missing piece” messages I don’t recall seeing under 3f, but that’s about the only difference, and it’s possible I missed them previously or forgot about them (I’m not seeing a lot of them even now, and not consistently).

I ran scrub on both farms on the #3 box (which does NOTHING else), and scrub reported both farms were fine, no errors.
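(To be precise about what I ran: the farmer’s built-in scrub against each farm directory while the farmer was stopped, roughly like this, with illustrative paths:)

```
# Farmer stopped first; paths are illustrative.
./subspace-farmer scrub /mnt/subspace1
./subspace-farmer scrub /mnt/subspace2
```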

After a few reboots and restarts of the #3 farm, the piece cache now synchronizes immediately. But I no longer sign hashes successfully. Ever. There are no error messages on the node or the farm; I just haven’t signed a hash in over 8 hours. Even the #2 box, which is taking forever to do anything (its piece cache sync was at 59.52% eleven hours ago), is still farming properly and signing hashes, but the #3 box will not.

I should note that moving the HDDs from #2 to #3 does mean that I am now farming nossd/chia plots across a Samba network share, where previously the #2 box was farming those drives locally. So this could be causing a lot more network traffic between the two boxes. The problem with that theory is that the issues persist even when I completely shut nossd/chia farming down; it has been shut off for 2 hours now and has made no difference on either box.

The #2 box, by the way, I have not shut down or rebooted since the problem started. I’m afraid to at this point: while it can’t replot, it can at least win rewards, which is better than #3. After several reboots and restarts, #3 now appears fine in all respects except a new inability to actually win anything (previously it was winning rewards like #2; rebooting and restarting fixed the piece cache issues but killed the wins).

For the record, boxes #1 (where the node and a farm live) and #4 have worked fine throughout; their farms can be restarted with no issues, and I have not rebooted either of them.

Please help re: boxes #2 and #3.

Incidentally, what I am trying now is syncing up a node on box #3 and connecting the farm to it locally rather than to the node on #1, to see if that helps get rewards back. I will report results when the sync is complete.

Can you share all the impacted plot IDs on these 2 boxes?

Are you able to provide logs for the problematic boxes, please? Text files or pastebin would be best.

I have a 3.2 MB farm log file (a ton of those “get provider” errors from whenever I control-c’d, remember). I can’t upload it directly because .txt and .log aren’t valid upload options, and pastebin seems to have a 512k limit. What do you suggest?

Once we get the method sorted, I’ll try to clip the relevant time frame from my node log too, and also from the farm log for box #2.
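If it’s useful, I can pre-process the logs before sending them; this is roughly what I had in mind (file names and timestamps are just examples, adjusted to however the logs are actually formatted):

```
# Pull only the window around the incident out of the big log.
sed -n '/2023-11-14T18:00/,/2023-11-15T02:00/p' farmer.log > farmer-clip.log

# Or compress the whole file so it fits under the upload limit.
gzip -k farmer.log                        # produces farmer.log.gz
# split -b 400k farmer.log farmer.part.   # alternative: chunks small enough for pastebin
```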

By the way, attaching the #3 farm to a new local node running on box #3 doesn’t appear to be helping. It’s been about an hour now with no wins. That’s still plausibly bad luck, but if there are no wins in another couple of hours it will no longer be plausible; assume, unless I say otherwise, that it isn’t going to work.


I seem to have the same issue on 3 of my farmers: 1 on Windows, where it happened during a Windows update, and 2 on Ubuntu after resizing plots down.

Are those farms on local nodes, or are you connecting them to a remote node via --node-rpc-url?

I am beginning to suspect that using --node-rpc-url to connect to a remote node is a prerequisite for the issue, and that it can be worked around by running a node on every box that has a farmer. The main thing I see whenever things break is that “get_provider” error, which does say ‘Disconnected’ and makes me think the farmer is getting disconnected from the node. Note that these get_provider errors do not appear until you control-c the farmer; no errors appear until then, things just stop working.

I am currently testing this by creating nodes on every box (rough sketch below), and will update when I’ve learned more.
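Concretely, on each farmer box I’m standing up a local node and then pointing the farmer at localhost instead of at box #1. Something along these lines; the chain name, path, and node name are illustrative, and the exact node flags vary by release, so check --help for your build:

```
# Local node on the farmer box (flags are illustrative for a Gemini 3g era build).
./subspace-node --chain gemini-3g --farmer --base-path /mnt/subspace-node --name "box3"

# The farmer command stays the same as before, just pointed at the local node:
#   --node-rpc-url ws://127.0.0.1:9944
```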

As an update on efforts so far:

Box #2: After rebooting and restarting the farm on box #2, it too went through some piece cache churn but seemed to have mostly recovered. Then overnight it kept farming but stopped replotting, and when I control-c’d it, I got a few (not the usual hundreds) of the get_provider errors in the farm log. I checked the remote node log around the time this must have happened, but there was no hint that anything had gone wrong there. After a restart and 3 minutes of syncing the piece cache, it seems mostly back to normal. I am now syncing a node on this box and will have the farm connect to it rather than to the node on box #1, and see whether or not the problem recurs.

Box #3: This box had seemed to recover, except that it was hardly ever winning. On a 4TB farm, it signed hashes only about 4 times in 8 hours, maybe 10% or less of what I was getting on my other 4TB farms. Switching it to a local node does seem a bit better (maybe a hit or two every hour), but this box is still getting far fewer hits than my others. It is possible that this box has significantly more LAN traffic going on, as it is hosting about 200TB of plots for a remote nossd chia farmer (which happens to be on box #2).

Box #4: This is my Windows box. It too got hit by a Windows update and reboot killing the process. It took about 70 minutes to sync the cache (and it did keep farming and winning while that was going on), and it now seems back to normal, winning and replotting as expected. I will eventually put this farm on a local node as well, but not right away.

Box #1: Has had absolutely no issues. It is the one box that was always farming with a local node.

Additional: Something occurred to me. Remember when I said that everything was going perfectly smoothly until November 14th? Correlation is not causation, but it should be noted that just prior to these problems starting, both boxes #2 and #3 finished initial plotting and started replotting. The same is true of trouble-free box #1 and mostly trouble-free Windows box #4, though (which actually finished initial plotting several days ago).

Running with --node-rpc-url.

It seems to start the piece cache sync, then stop after a short time.
I tried sacrificing a plot by deleting piece_cache.bin and resizing it to 200GiB. It then finished the piece cache sync at least, but still doesn't seem to work.
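In case the details matter, this is roughly what that looked like; the path is illustrative, and I'm assuming piece_cache.bin sits inside the farm directory on this build:

```
# Stop the farmer first (control-c), then drop the cache for that one farm.
rm /mnt/subspace1/piece_cache.bin

# Restart with that farm shrunk to 200 GiB (other farms/flags unchanged).
./subspace-farmer farm \
  --node-rpc-url ws://<node-ip>:9944 \
  --reward-address <REWARD_ADDRESS> \
  path=/mnt/subspace1,size=200GiB
```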

A plot moved to one of these computers doesn't seem to work either.

Reverting the farmer to the Nov 9 build for a short while to sync, then going back to the 16th, seems to fix it for me.

How long after going back to the Nov 16 farmer did it take for you to sync? I tried the same thing with the Nov 13 farmer, but when I went back to the 16th it again hung for 15 minutes and I gave up on it.

A few seconds to sync on the 16th; it took some minutes to sync on the 9th before swapping.

I let it stay on the 9th until it started plotting, then swapped to the 16th.