Large Farm Optimizations

This is a follow-up to a post I made in the general-chat channel on Discord earlier today.

While the minimum system requirements are very clear, it is not clear how to scale up a large farm. Based on my personal experience, along with Discord questions and answers, I have been able to compile a few bits of information:

  • Large plots take up tons of RAM, at least initially. 16 TB of plots (four 4 TB SSDs) takes up 222 GB of RAM on my server initially, then settles at about 80-82 GB used after 20-30 minutes.

  • Internet usage is massive. Today I saw the receive rate on my NIC stay above 100 Mbps for a good 20 minutes, even hitting 200 Mbps at one point. Checking my usage with my ISP, it looks like I used an extra 3 TB of data during my last testing phase, which lasted a few weeks.

  • It appears the node process is largely single-threaded and is limited by the speed of the core it is running on. It takes a considerable amount of time to sync up; I am getting about 2-4 blocks per second.

  • The farmer has very brief periods of heavy CPU usage and is multi-threaded to some extent. It usually hovers around 20% of the processor and will only use one processor in a multi-processor system.

  • Following posted recommendations, I am not using RAIDed SSDs. SATA SSD drives appear to work fine, as I have never seen usage higher than a few hundred MB/s. Most of the time the drives sit idle while plotting (and farming). Every once in a while I will see all of them with a few hundred kB/s of usage.

  • I have yet to determine what the bottlenecks are, other than the initial RAM usage when the farmer process starts. Most of the time (i.e. 95% or more), I see plenty of available RAM, CPU, disk I/O, and network bandwidth.
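For context, the bandwidth numbers above are easy to sanity-check with some back-of-envelope arithmetic (the rates and durations below are just the ones I observed, not precise measurements):

```python
# Back-of-envelope check of the observed network usage.
# Note: Mbps is megabits per second; divide by 8 to get megabytes.

def data_transferred_tb(rate_mbps: float, seconds: float) -> float:
    """Total data moved at a sustained rate, in decimal terabytes."""
    bytes_total = rate_mbps / 8 * 1e6 * seconds
    return bytes_total / 1e12

# A 20-minute burst sustained at ~100 Mbps:
burst = data_transferred_tb(100, 20 * 60)
print(f"{burst * 1000:.0f} GB in a 20-minute burst")  # ~15 GB

# Average rate needed to move 3 TB in two weeks:
avg_mbps = 3e12 * 8 / (14 * 24 * 3600) / 1e6
print(f"~{avg_mbps:.1f} Mbps average to use 3 TB in 14 days")  # ~19.8 Mbps
```

So the occasional 100-200 Mbps bursts and the 3 TB monthly-ish total are consistent with each other rather than contradictory.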

As I mentioned earlier in this post, I am interested in determining how to scale things up, and I have a ton of questions:

  • Would it be best to add more SSDs to an existing PC or to run more PCs?

  • Based on my observations, right now we are limited by RAM. I have a few older workstations with 512 GB of RAM and Ivy Bridge processors; could I run 32 TB on them?

  • Is it always better to run one node process and multiple farmers?

  • Is it better to use the largest plot sizes you can fit on an SSD, or should you split them into different files?

  • Right now I am not CPU limited, but I have less than 1% of 16 TB plotted. Are CPU or RAM requirements going to go up the farther I get?

  • Is there anything I can do to increase my plotting speed?

  • When a network is brand new there is a huge advantage in getting in early. With such a slow plotting speed, does it make sense to run as many different farmers as possible with smaller (or fewer) drives to be able to plot quicker?

  • How is internet bandwidth being used here? Are larger plots taking up more bandwidth?

Tons of questions here, and I can understand if they can't all be answered right away. But now that a date has been announced for the incentivized testnet, I am sure there are many people interested in the answers. I would appreciate some guidance from the developers here.

Thank you!


I do not believe this is actually the case. Memory usage reporting on Windows is odd; if you look at the farming process itself, it uses much less memory, and if other apps need RAM, you'll notice nothing crashes with out-of-memory errors. This was discussed on Discord, but now that it is on the forum, hopefully more people will be able to find it.

Previous networks worked very differently and shouldn't be compared to Gemini 3f directly. High network usage is expected during plotting; did you already finish that process? Once plotting is done, your farmer will help the rest of the network sync (both nodes and farmers now sync from plots), but it should eventually settle at lower bandwidth usage.

Node sync could indeed be faster; this is likely an upstream Substrate issue that I'll get to once I have more time (see sc_consensus_slots::check_equivocation is not guaranteed to catch equivocation · Issue #1302 · paritytech/polkadot-sdk · GitHub for upstream discussion on a related topic). There are some things we control that are heavily multi-threaded, but they happen relatively infrequently (archiving of blockchain history).

The ~20% usage is during auditing, and the plan is to get it even lower; auditing is supposed to be space/I/O-heavy, not CPU-heavy. When your farmer does find a solution, though, it has to generate a proof, which is indeed a heavily multi-threaded and computationally expensive process. How often you find solutions depends on the size of the network and the amount of space pledged. So this is expected behavior.

Yep, SATA is expected to be perfectly fine. Auditing is basically a lot of small random I/O, but when someone requests something from you, higher reads can be observed. However, if you have a lot of RAM, the OS will likely keep caches in RAM and you'll see fewer actual disk reads anyway.

The goal of the protocol is to be bound by the amount of "fast enough" space. There is no goal to burn a lot of energy or destroy disks with unnecessary writes or anything like that; in fact, we implement things in a way that minimizes those (ask people who ran previous networks how high write amplification was compared to the latest iteration, for example).

Given how much RAM you have, and the fact that I don't think you use even 10% of it in practice, the only thing that matters is space. But we don't recommend that people buy hardware, as they might see negative ROI in that case. The choice is yours, of course.

As long as the farmer is able to generate a solution in time you're good, but those CPUs are quite old and probably not energy-efficient by today's standards.

There should be no need to run multiple nodes and farmers, just use multiple farms if you have multiple physical disks.

One physical disk - one farm. No RAID, no fancy file systems. In the future the farmer will likely be able to work with raw disks directly, with no file system on them at all, to improve efficiency.
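With the `subspace-farmer` CLI, running one farmer over several physical disks looks something like the following sketch. The mount points, sizes, and reward address are placeholders, and the exact flags can differ between releases, so check `--help` for your version:

```shell
# One farmer process, one farm per physical disk.
# REWARD_ADDR and the /mnt/ssd* paths below are placeholders.
subspace-farmer farm \
  --reward-address "$REWARD_ADDR" \
  path=/mnt/ssd1,size=3.9T \
  path=/mnt/ssd2,size=3.9T \
  path=/mnt/ssd3,size=3.9T \
  path=/mnt/ssd4,size=3.9T
```

Each `path=…,size=…` pair defines an independent farm, while the single process shares one node connection and one plotting pipeline across all of them.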

This is mostly an implementation inefficiency. Make sure to upgrade to the latest releases; we're tackling plotting bottlenecks one by one. The goal is to be either CPU- or network-bound (whichever is weaker) until plotting is finished, and this is not quite the case right now. Also, the farmer uses little memory and tries to be very deliberate with it; we might introduce optional flags that would allow accelerating plotting while using more memory, which in your case would be very helpful given how much RAM you have available.

Eventually it should make a difference. In the meantime it can help with plotting and will result in proportionally larger memory usage.

To a degree. Right now blockchain history is still small, so your farmer will eventually cache everything locally and stop reaching out to the network to pull pieces, since it'll have everything it needs. That will not be the case once we have a lot of data on mainnet: there you'll eventually have to download as much data as the plot size, though it'll probably take some time before we have terabytes of blockchain history.
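To get a feel for what "download as much data as the plot size" would mean in practice, here is a rough estimate (the plot size and link speed are illustrative values taken from the numbers earlier in this thread, not measurements):

```python
def days_to_download(plot_bytes: float, rate_mbps: float) -> float:
    """Days needed to pull plot_bytes of pieces at a sustained link rate.

    rate_mbps is megabits per second, so multiply bytes by 8 first.
    """
    seconds = plot_bytes * 8 / (rate_mbps * 1e6)
    return seconds / 86400

# Filling 16 TB of plots at a sustained 200 Mbps:
print(f"{days_to_download(16e12, 200):.1f} days")  # ~7.4 days
```

In other words, even in the worst case where every plotted byte has to come over the network, a typical broadband link keeps up with the plotting speeds discussed above.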

Thanks for asking here on the forum, where people will be able to find these answers afterwards. These are great questions; the only thing I'd change is to ask independent questions in separate topics, so that discussions can evolve around them separately.

I hope this helps.