Sync issues on multiple nodes with different configurations

I have tried to run a few nodes, all set up slightly differently, and the last couple of days have been brutal.

First:

  • Windows GUI node v0.6.7 (upgraded the other day upon release)
  • 1.5TB allocated for plots
  • Plots are held on a RAID5 array of a few 600G 10K disks. It’s just for testing.
  • Lots of RAM available, though only a single E5-2430L CPU.
  • No external ports mapped for this node.
  • WAN is stable, with at least 100 Mbps of downstream bandwidth available.
  • This is my oldest node, set up maybe a week ago. It took days to sync and plot to the current state. I believe the plots are finally done (IO was a continuous bottleneck towards the later stages of plotting).
  • The GUI is reporting that it’s synced, but based on the size of the chain DB (51GB) I know it can’t be.
  • The node portion appears to have been stuck at max height 222211 for a day:
2022-06-16T19:47:31.406030Z  INFO substrate: 💤 Idle (0 peers), best: #222211 (0xf69c…aa54), finalized #0 (0x9ee8…ccf0), ⬇ 0 ⬆ 0
2022-06-16T19:47:36.660508Z  INFO substrate: 💤 Idle (0 peers), best: #222211 (0xf69c…aa54), finalized #0 (0x9ee8…ccf0), ⬇ 0 ⬆ 0
2022-06-16T19:47:41.907423Z  INFO substrate: 💤 Idle (0 peers), best: #222211 (0xf69c…aa54), finalized #0 (0x9ee8…ccf0), ⬇ 0 ⬆ 0
2022-06-16T19:47:47.179542Z  INFO substrate: 💤 Idle (0 peers), best: #222211 (0xf69c…aa54), finalized #0 (0x9ee8…ccf0), ⬇ 0 ⬆ 0
2022-06-16T19:47:52.440648Z  INFO substrate: 💤 Idle (0 peers), best: #222211 (0xf69c…aa54), finalized #0 (0x9ee8…ccf0), ⬇ 0 ⬆ 0
2022-06-16T19:47:57.691756Z  INFO substrate: 💤 Idle (0 peers), best: #222211 (0xf69c…aa54), finalized #0 (0x9ee8…ccf0), ⬇ 0 ⬆ 0
2022-06-16T19:48:02.955230Z  INFO substrate: 💤 Idle (0 peers), best: #222211 (0xf69c…aa54), finalized #0 (0x9ee8…ccf0), ⬇ 0 ⬆ 0
2022-06-16T19:48:08.203127Z  INFO substrate: 💤 Idle (0 peers), best: #222211 (0xf69c…aa54), finalized #0 (0x9ee8…ccf0), ⬇ 0 ⬆ 0
2022-06-16T19:48:13.469426Z  INFO substrate: 💤 Idle (0 peers), best: #222211 (0xf69c…aa54), finalized #0 (0x9ee8…ccf0), ⬇ 0 ⬆ 0
2022-06-16T19:48:18.732251Z  INFO substrate: 💤 Idle (0 peers), best: #222211 (0xf69c…aa54), finalized #0 (0x9ee8…ccf0), ⬇ 0 ⬆ 0
2022-06-16T19:48:23.989859Z  INFO substrate: 💤 Idle (0 peers), best: #222211 (0xf69c…aa54), finalized #0 (0x9ee8…ccf0), ⬇ 0 ⬆ 0
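
(For anyone else hitting the "GUI says synced but clearly isn’t" situation: since these nodes are Substrate-based, you can ask the node itself rather than trusting the UI. This is just a sketch using the standard Substrate JSON-RPC methods; the HTTP RPC port varies per setup (my CLI nodes print the bound address at startup, e.g. 127.0.0.1:9933 or a random port), so substitute whatever yours reports:

curl -s -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","id":1,"method":"system_syncState","params":[]}' http://127.0.0.1:9933
curl -s -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","id":1,"method":"system_health","params":[]}' http://127.0.0.1:9933

system_syncState reports currentBlock vs highestBlock, and system_health reports the peer count plus an isSyncing flag, which together make it obvious when the UI’s "synced" claim doesn’t match reality.)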

Second node:

  • CLI version on Windows (tried this because the GUI kept saying synced when it wasn’t, and the log wasn’t being rotated)
  • subspace-node-windows-x86_64-gemini-1b-2022-jun-13.exe
  • Same WAN as first node
  • Same storage as first node (different path).
  • Won’t sync / find peers.
  • Startup command:

.\subspace-node-windows-x86_64-gemini-1b-2022-jun-13.exe --base-path E:\subspace-cli-data\node `
  --chain gemini-1 --execution wasm `
  --pruning archive --validator `
  --no-mdns `
  --reserved-nodes "/dns/bootstrap-0.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWF9CgB8bDvWCvzPPZrWG3awjhS7gPFu7MzNPkF9F9xWwc" `
  --reserved-nodes "/dns/bootstrap-1.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWLrpSArNaZ3Hvs4mABwYGDY1Rf2bqiNTqUzLm7koxedQQ" `
  --reserved-nodes "/dns/bootstrap-2.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWNN5uuzPtDNtWoLU28ZDCQP7HTdRjyWbNYo5EA6fZDAMD" `
  --reserved-nodes "/dns/bootstrap-3.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWM47uyGtvbUFt5tmWdFezNQjwbYZmWE19RpWhXgRzuEqh" `
  --reserved-nodes "/dns/bootstrap-4.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWNMEKxFZm9mbwPXfQ3LQaUgin9JckCq7TJdLS2UnH6E7z" `
  --reserved-nodes "/dns/bootstrap-5.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWFfEtDmpb8BWKXoEAgxkKAMfxU2yGDq8nK87MqnHvXsok" `
  --reserved-nodes "/dns/bootstrap-6.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWHSeob6t43ukWAGnkTcQEoRaFSUWphGDCKF1uefG2UGDh" `
  --reserved-nodes "/dns/bootstrap-7.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWKwrGSmaGJBD29agJGC3MWiA7NZt34Vd98f6VYgRbV8hH" `
  --reserved-nodes "/dns/bootstrap-8.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWCXFrzVGtAzrTUc4y7jyyvhCcNTAcm18Zj7UN46whZ5Bm" `
  --reserved-nodes "/dns/bootstrap-9.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWNGxWQ4sajzW1akPRZxjYM5TszRtsCnEiLhpsGrsHrFC6" `
  --reserved-nodes "/dns/bootstrap-10.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWNGf1qr5411JwPHgwqftjEL6RgFRUEFnsJpTMx6zKEdWn" `
  --reserved-nodes "/dns/bootstrap-11.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWM7Qe4rVfzUAMucb5GTs3m4ts5ZrFg83LZnLhRCjmYEJK" `
  --out-peers 100 `
  --in-peers 100

  • See log below. I removed identifiers because I’ve restarted this one from scratch a few times (completely removing the data dir to get a new identity and a fresh start each time), and it’s been the same every time:
2022-06-16 15:52:48 Subspace
2022-06-16 15:52:48 ✌️  version 0.1.0-3c3a48f865c
2022-06-16 15:52:48 ❤️  by Subspace Labs <https://subspace.network>, 2021-2022
2022-06-16 15:52:48 📋 Chain specification: Subspace Gemini 1
2022-06-16 15:52:48 🏷  Node name: [....]
2022-06-16 15:52:48 👤 Role: AUTHORITY
2022-06-16 15:52:48 💾 Database: ParityDb at [....]
2022-06-16 15:52:48 ⛓  Native runtime: subspace-2 (subspace-0.tx0.au0)
2022-06-16 15:52:48 [PrimaryChain] Starting archiving from genesis
2022-06-16 15:52:49 [PrimaryChain] Archiving already produced blocks 0..=0
2022-06-16 15:52:49 [PrimaryChain] 🏷  Local node identity is: [...]
2022-06-16 15:52:49 [PrimaryChain] 🧑‍🌾 Starting Subspace Authorship worker
2022-06-16 15:52:49 [PrimaryChain] 💻 Operating system: windows
2022-06-16 15:52:49 [PrimaryChain] 💻 CPU architecture: x86_64
2022-06-16 15:52:49 [PrimaryChain] 💻 Target environment: msvc
2022-06-16 15:52:49 [PrimaryChain] 📦 Highest known block at #0
2022-06-16 15:52:49 [PrimaryChain] 〽️ Prometheus exporter started at 127.0.0.1:9615
2022-06-16 15:52:49 [PrimaryChain] Running JSON-RPC HTTP server: addr=127.0.0.1:56073, allowed origins=Some(["http://localhost:*", "http://127.0.0.1:*", "https://localhost:*", "https://127.0.0.1:*", "https://polkadot.js.org"])
2022-06-16 15:52:49 [PrimaryChain] Running JSON-RPC WS server: addr=127.0.0.1:9944, allowed origins=Some(["http://localhost:*", "http://127.0.0.1:*", "https://localhost:*", "https://127.0.0.1:*", "https://polkadot.js.org"])
2022-06-16 15:52:54 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 15:52:54 [PrimaryChain] 💤 Idle (0 peers), best: #0 ([...]), finalized #0 ([...]), ⬇ 0 ⬆ 0
2022-06-16 15:52:59 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 15:52:59 [PrimaryChain] 💤 Idle (0 peers), best: #0 ([...]), finalized #0 ([...]), ⬇ 0 ⬆ 0
2022-06-16 15:53:04 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 15:53:04 [PrimaryChain] 💤 Idle (0 peers), best: #0 ([...]), finalized #0 ([...]), ⬇ 0 ⬆ 0
2022-06-16 15:53:09 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 15:53:09 [PrimaryChain] 💤 Idle (0 peers), best: #0 ([...]), finalized #0 ([...]), ⬇ 0 ⬆ 0
2022-06-16 15:53:14 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 15:53:14 [PrimaryChain] 💤 Idle (0 peers), best: #0 ([...]), finalized #0 ([...]), ⬇ 0 ⬆ 0
2022-06-16 15:53:19 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 15:53:19 [PrimaryChain] 💤 Idle (0 peers), best: #0 ([...]), finalized #0 ([...]), ⬇ 0 ⬆ 0
2022-06-16 15:53:24 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 15:53:24 [PrimaryChain] 💤 Idle (0 peers), best: #0 ([...]), finalized #0 ([...]), ⬇ 0 ⬆ 0
2022-06-16 15:53:29 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 15:53:29 [PrimaryChain] 💤 Idle (0 peers), best: #0 ([...]), finalized #0 ([...]), ⬇ 0 ⬆ 0
2022-06-16 15:53:34 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 15:53:34 [PrimaryChain] 💤 Idle (0 peers), best: #0 ([...]), finalized #0 ([...]), ⬇ 0 ⬆ 0
2022-06-16 15:53:39 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 15:53:39 [PrimaryChain] 💤 Idle (0 peers), best: #0 ([...]), finalized #0 ([...]), ⬇ 0 ⬆ 0
2022-06-16 15:53:44 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 15:53:44 [PrimaryChain] 💤 Idle (0 peers), best: #0 ([...]), finalized #0 ([...]), ⬇ 0 ⬆ 0
2022-06-16 15:53:49 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 15:53:49 [PrimaryChain] 💤 Idle (0 peers), best: #0 ([...]), finalized #0 ([...]), ⬇ 0 ⬆ 0
2022-06-16 15:53:54 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 15:53:54 [PrimaryChain] 💤 Idle (0 peers), best: #0 ([...]), finalized #0 ([...]), ⬇ 0 ⬆ 0
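
(Aside for anyone else debugging a 0-peer node on Windows: a quick way to rule out basic connectivity problems is to check TCP reachability to one of the bootstrap nodes from PowerShell, e.g.

Test-NetConnection bootstrap-0.gemini-1b.subspace.network -Port 30333

If TcpTestSucceeded comes back False, the issue is DNS/firewall rather than the node itself; it doesn’t prove the libp2p handshake works, but it narrows things down.)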

Third node:

  • Different system. Ubuntu 20.04 LTS
  • 2x E5-2697 v2, 22632MB RAM (running in a VM with 2x8 cores allocated)
  • Same WAN, but port 30333 mapped to it
  • A day on and the node’s data dir is still <500MB.
  • The sync target keeps jumping all over the place. At the time of writing it’s down to "target=#33581".
  • Node startup command:

./subspace-node-ubuntu-x86_64-gemini-1b-2022-jun-13 \
  --base-path /mnt/subspace/node0dat/ \
  --chain gemini-1 \
  --pruning archive \
  --reserved-nodes "/dns/bootstrap-0.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWF9CgB8bDvWCvzPPZrWG3awjhS7gPFu7MzNPkF9F9xWwc" \
  --reserved-nodes "/dns/bootstrap-1.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWLrpSArNaZ3Hvs4mABwYGDY1Rf2bqiNTqUzLm7koxedQQ" \
  --reserved-nodes "/dns/bootstrap-2.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWNN5uuzPtDNtWoLU28ZDCQP7HTdRjyWbNYo5EA6fZDAMD" \
  --reserved-nodes "/dns/bootstrap-3.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWM47uyGtvbUFt5tmWdFezNQjwbYZmWE19RpWhXgRzuEqh" \
  --reserved-nodes "/dns/bootstrap-4.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWNMEKxFZm9mbwPXfQ3LQaUgin9JckCq7TJdLS2UnH6E7z" \
  --reserved-nodes "/dns/bootstrap-5.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWFfEtDmpb8BWKXoEAgxkKAMfxU2yGDq8nK87MqnHvXsok" \
  --reserved-nodes "/dns/bootstrap-6.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWHSeob6t43ukWAGnkTcQEoRaFSUWphGDCKF1uefG2UGDh" \
  --reserved-nodes "/dns/bootstrap-7.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWKwrGSmaGJBD29agJGC3MWiA7NZt34Vd98f6VYgRbV8hH" \
  --reserved-nodes "/dns/bootstrap-8.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWCXFrzVGtAzrTUc4y7jyyvhCcNTAcm18Zj7UN46whZ5Bm" \
  --reserved-nodes "/dns/bootstrap-9.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWNGxWQ4sajzW1akPRZxjYM5TszRtsCnEiLhpsGrsHrFC6" \
  --reserved-nodes "/dns/bootstrap-10.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWNGf1qr5411JwPHgwqftjEL6RgFRUEFnsJpTMx6zKEdWn" \
  --reserved-nodes "/dns/bootstrap-11.gemini-1b.subspace.network/tcp/30333/p2p/12D3KooWM7Qe4rVfzUAMucb5GTs3m4ts5ZrFg83LZnLhRCjmYEJK" \
  --out-peers 100 \
  --in-peers 100 \
  --validator \
  --name fnsuj2as

  • This is what the log looks like (lots of peers, 0 bps):
user@subspacearchival0:~$ ./node.sh
2022-06-16 06:44:13 Subspace
2022-06-16 06:44:13 ✌️  version 0.1.0-3c3a48f865c
2022-06-16 06:44:13 ❤️  by Subspace Labs <https://subspace.network>, 2021-2022
2022-06-16 06:44:13 📋 Chain specification: Subspace Gemini 1
2022-06-16 06:44:13 🏷  Node name: fnsuj2as
2022-06-16 06:44:13 👤 Role: AUTHORITY
2022-06-16 06:44:13 💾 Database: ParityDb at /mnt/subspace/node0dat/chains/subspace_gemini_1b/paritydb/full
2022-06-16 06:44:13 ⛓  Native runtime: subspace-2 (subspace-0.tx0.au0)
2022-06-16 06:44:15 [PrimaryChain] Last archived block 4
2022-06-16 06:44:15 [PrimaryChain] Archiving already produced blocks 5..=5
2022-06-16 06:44:15 [PrimaryChain] 🏷  Local node identity is: [....]
2022-06-16 06:44:15 [PrimaryChain] 🧑‍🌾 Starting Subspace Authorship worker
2022-06-16 06:44:15 [PrimaryChain] 💻 Operating system: linux
2022-06-16 06:44:15 [PrimaryChain] 💻 CPU architecture: x86_64
2022-06-16 06:44:15 [PrimaryChain] 💻 Target environment: gnu
2022-06-16 06:44:15 [PrimaryChain] 💻 CPU: Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
2022-06-16 06:44:15 [PrimaryChain] 💻 CPU cores: 16
2022-06-16 06:44:15 [PrimaryChain] 💻 Memory: 22632MB
2022-06-16 06:44:15 [PrimaryChain] 💻 Kernel: 5.4.0-120-generic
2022-06-16 06:44:15 [PrimaryChain] 💻 Linux distribution: Ubuntu 20.04.4 LTS
2022-06-16 06:44:15 [PrimaryChain] 💻 Virtual machine: yes
2022-06-16 06:44:15 [PrimaryChain] 📦 Highest known block at #105
2022-06-16 06:44:15 [PrimaryChain] 〽️ Prometheus exporter started at 127.0.0.1:9615
2022-06-16 06:44:15 [PrimaryChain] Running JSON-RPC HTTP server: addr=127.0.0.1:9933, allowed origins=Some(["http://localhost:*", "http://127.0.0.1:*", "https://localhost:*", "https://127.0.0.1:*", "https://polkadot.js.org"])
2022-06-16 06:44:15 [PrimaryChain] Running JSON-RPC WS server: addr=127.0.0.1:9944, allowed origins=Some(["http://localhost:*", "http://127.0.0.1:*", "https://localhost:*", "https://127.0.0.1:*", "https://polkadot.js.org"])
2022-06-16 06:44:15 [PrimaryChain] creating instance on iface [...]
2022-06-16 06:44:15 [PrimaryChain] 🔍 Discovered new external address for our node:[...]
2022-06-16 06:44:20 [PrimaryChain] ⚙️  Syncing, target=#226643 (38 peers), best: #105 (0x3ddf…ff09), finalized #0 (0x9ee8…ccf0), ⬇ 4.4MiB/s ⬆ 29.2kiB/s
2022-06-16 06:44:25 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#226643 (45 peers), best: #105 (0x3ddf…ff09), finalized #0 (0x9ee8…ccf0), ⬇ 7.7MiB/s ⬆ 8.5kiB/s
2022-06-16 06:44:30 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#226643 (50 peers), best: #105 (0x3ddf…ff09), finalized #0 (0x9ee8…ccf0), ⬇ 9.3MiB/s ⬆ 13.6kiB/s
2022-06-16 06:44:35 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#226643 (50 peers), best: #105 (0x3ddf…ff09), finalized #0 (0x9ee8…ccf0), ⬇ 9.3MiB/s ⬆ 22.9kiB/s
2022-06-16 06:44:35 [PrimaryChain] ❌ Error while dialing /dns/telemetry.subspace.network/tcp/443/x-parity-wss/%2Fsubmit%2F: Custom { kind: Other, error: Timeout }
2022-06-16 06:44:35 [PrimaryChain] Reserved peer 12D3KooWHSeob6t43ukWAGnkTcQEoRaFSUWphGDCKF1uefG2UGDh disconnected
2022-06-16 06:44:35 [PrimaryChain] Reserved peer 12D3KooWHSeob6t43ukWAGnkTcQEoRaFSUWphGDCKF1uefG2UGDh disconnected
2022-06-16 06:44:35 [PrimaryChain] Reserved peer 12D3KooWHSeob6t43ukWAGnkTcQEoRaFSUWphGDCKF1uefG2UGDh disconnected
2022-06-16 06:44:35 [PrimaryChain] Reserved peer 12D3KooWHSeob6t43ukWAGnkTcQEoRaFSUWphGDCKF1uefG2UGDh disconnected
2022-06-16 06:44:35 [PrimaryChain] Reserved peer 12D3KooWHSeob6t43ukWAGnkTcQEoRaFSUWphGDCKF1uefG2UGDh disconnected
2022-06-16 06:44:35 [PrimaryChain] Reserved peer 12D3KooWHSeob6t43ukWAGnkTcQEoRaFSUWphGDCKF1uefG2UGDh disconnected
2022-06-16 06:44:36 [PrimaryChain] Reserved peer 12D3KooWHSeob6t43ukWAGnkTcQEoRaFSUWphGDCKF1uefG2UGDh disconnected
2022-06-16 06:44:36 [PrimaryChain] Reserved peer 12D3KooWHSeob6t43ukWAGnkTcQEoRaFSUWphGDCKF1uefG2UGDh disconnected
2022-06-16 06:44:36 [PrimaryChain] Reserved peer 12D3KooWHSeob6t43ukWAGnkTcQEoRaFSUWphGDCKF1uefG2UGDh disconnected
2022-06-16 06:44:36 [PrimaryChain] Reserved peer 12D3KooWNN5uuzPtDNtWoLU28ZDCQP7HTdRjyWbNYo5EA6fZDAMD disconnected
2022-06-16 06:44:36 [PrimaryChain] Reserved peer 12D3KooWNGxWQ4sajzW1akPRZxjYM5TszRtsCnEiLhpsGrsHrFC6 disconnected
2022-06-16 06:44:40 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#226643 (50 peers), best: #105 (0x3ddf…ff09), finalized #0 (0x9ee8…ccf0), ⬇ 9.8MiB/s ⬆ 19.0kiB/s
[.......]
2022-06-16 20:06:05 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 20:06:06 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#33581 (191 peers), best: #119 (0xb36d…f739), finalized #0 (0x9ee8…ccf0), ⬇ 468.2kiB/s ⬆ 2.2kiB/s
2022-06-16 20:06:10 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 20:06:11 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#33581 (192 peers), best: #119 (0xb36d…f739), finalized #0 (0x9ee8…ccf0), ⬇ 360.6kiB/s ⬆ 3.6kiB/s
2022-06-16 20:06:15 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 20:06:16 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#33581 (192 peers), best: #119 (0xb36d…f739), finalized #0 (0x9ee8…ccf0), ⬇ 645.0kiB/s ⬆ 2.0kiB/s
2022-06-16 20:06:20 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 20:06:21 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#33581 (192 peers), best: #119 (0xb36d…f739), finalized #0 (0x9ee8…ccf0), ⬇ 435.6kiB/s ⬆ 2.2kiB/s
2022-06-16 20:06:25 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 20:06:26 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#33581 (191 peers), best: #119 (0xb36d…f739), finalized #0 (0x9ee8…ccf0), ⬇ 461.2kiB/s ⬆ 2.5kiB/s
2022-06-16 20:06:30 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 20:06:31 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#33581 (190 peers), best: #119 (0xb36d…f739), finalized #0 (0x9ee8…ccf0), ⬇ 466.4kiB/s ⬆ 2.1kiB/s
2022-06-16 20:06:35 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 20:06:36 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#33581 (191 peers), best: #119 (0xb36d…f739), finalized #0 (0x9ee8…ccf0), ⬇ 562.0kiB/s ⬆ 3.3kiB/s
2022-06-16 20:06:40 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 20:06:41 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#33581 (190 peers), best: #119 (0xb36d…f739), finalized #0 (0x9ee8…ccf0), ⬇ 395.4kiB/s ⬆ 2.4kiB/s
2022-06-16 20:06:45 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 20:06:46 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#33581 (190 peers), best: #119 (0xb36d…f739), finalized #0 (0x9ee8…ccf0), ⬇ 422.2kiB/s ⬆ 2.2kiB/s
2022-06-16 20:06:50 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 20:06:51 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#33581 (190 peers), best: #119 (0xb36d…f739), finalized #0 (0x9ee8…ccf0), ⬇ 387.2kiB/s ⬆ 2.8kiB/s
2022-06-16 20:06:55 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 20:06:56 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#33581 (189 peers), best: #119 (0xb36d…f739), finalized #0 (0x9ee8…ccf0), ⬇ 499.7kiB/s ⬆ 2.4kiB/s
2022-06-16 20:07:00 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 20:07:01 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#33581 (189 peers), best: #119 (0xb36d…f739), finalized #0 (0x9ee8…ccf0), ⬇ 587.9kiB/s ⬆ 2.2kiB/s
2022-06-16 20:07:05 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 20:07:06 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#33581 (189 peers), best: #119 (0xb36d…f739), finalized #0 (0x9ee8…ccf0), ⬇ 338.9kiB/s ⬆ 2.3kiB/s
2022-06-16 20:07:10 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 20:07:11 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#33581 (189 peers), best: #119 (0xb36d…f739), finalized #0 (0x9ee8…ccf0), ⬇ 417.8kiB/s ⬆ 2.0kiB/s
2022-06-16 20:07:15 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-16 20:07:16 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#33581 (189 peers), best: #119 (0xb36d…f739), finalized #0 (0x9ee8…ccf0), ⬇ 405.7kiB/s ⬆ 2.8kiB/s
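
(Side note: since the node exposes a Prometheus endpoint on 127.0.0.1:9615, the best/target heights can also be watched without scraping the log. The metric names are whatever this Substrate-based node happens to export, so treat this as a sketch rather than gospel:

curl -s http://127.0.0.1:9615/metrics | grep block_height

That prints the best, finalized and sync-target heights as plain gauges, which makes it easy to graph how stuck the node really is.)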

Really frustrating experience so far.

Sorry about the formatting. Still getting used to this forum.

What I’ve been trying to achieve:

  • Sync the node with 1.5TB allocated to run as a temp farmer
  • Sync another node to act as a central archival node on my LAN for other nodes to bootstrap from, to optimise my setup and reduce strain on WAN.

(This being an incentivised testnet, the sync issues are extra annoying, because they now also induce FOMO. I wish the team had waited until the sync issues were somewhat resolved before switching to the incentivised testnet. I am far from the only one struggling with sync and plotting. Just my personal opinion, though.)

First off, thank you for all the help and support you have been providing over the past weeks.

Also thank you for the very detailed post.

  1. First computer: when it got stuck, did you attempt a full restart of the program? (When you close it, it keeps running in the background, similar to Discord.) A chain DB of 50-60GB is about right (this is the node DB, NOT your plot).

  2. Second computer: what is the farmer log showing during this time? Did it make a proper connection and start plotting?

  3. Third computer: same as the second, what is the farmer showing?

I’m here to help you get this sorted and/or get bugs reported to the dev team!

In regards to your feedback, I definitely agree that it’s sub-optimal to have these issues during the incentivized testnet. We did our best to force these issues out earlier with the three-month, 20,000+ node stress test we ran prior to launching the incentivized testnet; of course, that faced almost no issues compared to this version. It’s partly a matter of circumstance, but also to be expected with an early launch. We are doing our best to get all of these issues resolved and improved as quickly as we can. At least it’s better that they arise now, prior to mainnet launch; I like to think of this as the trial run for mainnet. While there are rewards, they aren’t 1:1 with mainnet, so I hope this helps with any FOMO. I appreciate the great feedback :slight_smile:

I replied in-line:

Thanks for your omnipresence @ImmaZoni. I think you should have a look at the Discord chats from the past few days. Some of the common questions should be moved over here, if this is the preferred place to address them.

Again, sorry if the formatting of my post is messed up. Still figuring out this forum editor.

Hours later, one node is still stuck on 222211 and the other on 33581 (target).

I also discovered that the max height shown on the telemetry portal is totally wrong. This should have been made clearer, IMO. But I am nowhere near the height shown on the correct source anyway.

I changed the max peers on the Linux node to 200/150 and restarted it, and it’s slowly advancing at a rate of about one block every hour or two, with the target stuck in the 30k range.
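
(Concretely, that just meant bumping the peer limits in the startup command and leaving everything else unchanged, roughly:

--out-peers 200 --in-peers 150

in place of the 100/100 values shown in the commands above.)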

(I think I’m ready to put this experiment aside until there’s a major announcement addressing the sync issues, because between the Discord and here, it doesn’t seem to be a pressing matter. The way things are right now, it seems like those lucky enough to have discovered “good” peers are being rewarded, while the rest of us are troubleshooting and trying to find solutions. Fair enough, but my time has some value too.)

No worries on the formatting! I appreciate you trying; most people just throw it all in one text block :laughing:

I appreciate all of your troubleshooting and testing during this process. I see you already saw our announcement about sunsetting phase 1 of Gemini. We will continue to look back and see what we can improve, and implement the feedback that you and many other community members have provided, making the experience better and better as we go.

(Also thank you for the poll you posted, lots of good feedback :slight_smile: )

Thank you for your comment, and for allowing the poll.
The poll/survey was just a quick mock-up to get a bit of feedback, and it was a source of validation for me (e.g. regarding the docs); I’m glad you found it at least somewhat valuable too. Hopefully I’ll get more responses, because it would be useful to see an aggregate from more respondents.

Just a small update so as not to leave this hanging:
I stopped actively wrangling with the nodes and kept just two running for now, but I did also try a few small things here and there.

Currently, one node’s sync is sitting around #25664 (small node / Ubuntu 20.04 / VM / set up as an archival node, 150GB allocated to bootstrap it)

and the other at around #24256 (larger node, "--plot-size 6000G", lots of RAM, nothing else running on the host, Windows CLI versions of the node and farmer, 6x 1.2TB 10K SFF disks in RAID0 behind a 2G FBWC). The CPUs are 2x E5-2660 v1 and they get almost maxed out during the sporadic plotting. Once I’ve moved this to a Linux host I will compile some metrics/stats, but from what I’ve seen so far on this node the IO hasn’t been a bottleneck yet, while the CPU was probably not a great choice for plotting.
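
(The metrics plan is nothing fancy: once it’s on the Linux host I’ll just watch the plot array with sysstat while plotting runs, along the lines of

iostat -xm 5

keeping an eye on %util and the await columns for the array device. A sketch of intent, not a finished monitoring setup.)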

(Oddly - and this may be coincidence/voodoo - when I switch the port mapping between the two nodes, I get a burst of syncing where it advances for a while, then starts idling again. I suspect that the switch makes it lose some peers and replace them with alternative ones, but I could be way off here.)

I’ve tried to add a --reserved-nodes entry for a fully synced node (without --reserved-only set) that another farmer kindly allowed me to use, but this did not resolve the sporadic sync.
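
(For anyone who wants to try the same: it’s just one more --reserved-nodes flag appended to the startup command, pointing at the synced node instead of a bootstrap node. The multiaddr below is a made-up placeholder, not the real peer:

--reserved-nodes "/ip4/203.0.113.10/tcp/30333/p2p/12D3KooW...peer-id-of-the-synced-node..."

Without --reserved-only the node still talks to everyone else; the reserved peer is simply always kept connected.)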

Update: today both nodes unexpectedly started to sync more consistently, and they have really been flying for the past hour or so.

The larger node is now bottlenecked by the disk IO.
The smaller node is chugging along with still fairly high IO strain.

I have not upgraded these nodes to gemini-1b-2022-jun-18 yet.

Edit: Also, the WAN utilisation has dropped by a factor of 10.

Awesome, I’m glad it finally kicked into gear for you! Did you change anything, or was this just out of the blue?

Disk IO is where we expect the bottleneck to be for the time being, which is a good sign. I’m also happy to see that WAN utilization has dropped as well.

I did not make any changes this time. The port mapping is still pointed at the Linux node. Interestingly, the firewall is reporting that only 155.1 MiB of traffic has passed through port 30333 since I added the entry (over a day ago). This is in contrast to the overall ~4TB of traffic generated by my numerous Subspace node tests.

The Linux node seems to have come to life around block 31681; it ramped up very quickly after that, with a few breaks where it sat reporting “Idle”. In the span of a few hours it has now reached #190390+.

The other node is far behind, but at least it’s making the disks work. It has definitely confirmed that plotting 1TB+ to anything with lower IOPS than a decent SSD is a non-starter. I am thinking of stopping this node and enabling a 100G PrimoCache cache, then trying to finish the plotting/sync with that. It’s not really a solution even if it helps, but it would be interesting to compare.

2022-06-21 19:20:54 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-21 19:20:57 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#44397 (157 peers), best: #32340 (0xec18…6e07), finalized #0 (0x9ee8…ccf0), ⬇ 354.0kiB/s ⬆ 103.7kiB/s
2022-06-21 19:20:59 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-21 19:21:02 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#44450 (158 peers), best: #32340 (0xec18…6e07), finalized #0 (0x9ee8…ccf0), ⬇ 423.5kiB/s ⬆ 43.8kiB/s
2022-06-21 19:21:04 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-21 19:21:07 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#44450 (158 peers), best: #32340 (0xec18…6e07), finalized #0 (0x9ee8…ccf0), ⬇ 427.7kiB/s ⬆ 30.7kiB/s
2022-06-21 19:21:09 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-21 19:21:12 [PrimaryChain] ⚙️  Syncing  0.0 bps, target=#44450 (158 peers), best: #32340 (0xec18…6e07), finalized #0 (0x9ee8…ccf0), ⬇ 501.5kiB/s ⬆ 14.4kiB/s
2022-06-21 19:21:14 [PrimaryChain] Waiting for farmer to receive and acknowledge archived segment
2022-06-21T23:20:33.245010Z  INFO subspace_farmer::archiving: Plotted segment segment_index=102745
2022-06-21T23:21:23.578636Z  INFO subspace_farmer::archiving: Plotted segment segment_index=102746
2022-06-21T23:21:51.958432Z  INFO subspace_farmer::archiving: Plotted segment segment_index=102747
2022-06-21T23:22:12.408517Z  INFO subspace_farmer::archiving: Plotted segment segment_index=102748
2022-06-21T23:22:44.246283Z  INFO subspace_farmer::archiving: Plotted segment segment_index=102749
2022-06-21T23:23:29.286070Z  INFO subspace_farmer::archiving: Plotted segment segment_index=102750
2022-06-21T23:24:05.782727Z  INFO subspace_farmer::archiving: Plotted segment segment_index=102751
2022-06-21T23:24:25.097130Z  INFO subspace_farmer::archiving: Plotted segment segment_index=102752
2022-06-21T23:24:43.375411Z  INFO subspace_farmer::archiving: Plotted segment segment_index=102753
2022-06-21T23:25:21.584633Z  INFO subspace_farmer::archiving: Plotted segment segment_index=102754
2022-06-21T23:25:45.819822Z  INFO subspace_farmer::archiving: Plotted segment segment_index=102755

I’m still not ready to try a node on my beefier gear, but we’ll see what happens once these two reach the finish line. I am hoping that my next test node can be bootstrapped from the Linux node, especially if I use more power-hungry hardware.

Alright, final update:
The small Linux node synced and I’ve earned a bit of TSSC.
The larger node was still nowhere near a full sync today, so I scrapped it.

Also - not directly related - the latest node software won’t load without an OpenCL runtime available, and I didn’t see any warning about that on the release page. Maybe I missed it. I added a bit of info here: gemini-1b-2022-jun-18 silently exits on Windows without doing anything · Issue #611 · subspace/subspace · GitHub
Edit: that particular GitHub issue turned out to be unrelated - oops.
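
(If anyone else trips over the OpenCL requirement: on Linux, the clinfo utility from the clinfo package lists the available OpenCL platforms, e.g.

sudo apt install clinfo
clinfo | grep -i 'number of platforms'

A platform count of 0 means no runtime is installed; on Windows the runtime normally ships with the GPU driver. General pointer only, not Subspace-specific guidance.)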