The farmer hangs completely after running for a while.

Issue Report

Environment

  • Operating System: Debian 10.2.1-6 (Kernel 5.10.0-26-amd64)
  • Pulsar/Advanced CLI/Docker: subspace-farmer-ubuntu-x86_64-skylake-gemini-3h-2024-mar-29 farm --reward-address xxxxx path=/data/ssd1/farmer,size=900G

Motherboard: PRIME X570-PRO

CPU: AMD 5900X

Problem

The farmer hangs completely after running for a while. The kernel log from the time of the hang:

Apr  1 13:34:52 nas kernel: [76806.563962] ------------[ cut here ]------------
Apr  1 13:34:52 nas kernel: [76806.563969] mem_cgroup_update_lru_size(00000000dd4624d0, 0, -1): lru_size -1
Apr  1 13:34:52 nas kernel: [76806.563984] WARNING: CPU: 27 PID: 51750 at mm/memcontrol.c:1402 mem_cgroup_update_lru_size+0x8d/0xa0
Apr  1 13:34:52 nas kernel: [76806.563985] Modules linked in: xt_nat veth nf_conntrack_netlink xfrm_user xfrm_algo br_netfilter overlay bridge stp llc nft_chain_nat xt_MASQUERADE nf_nat xt_addrtype xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter xt_CHECKSUM xt_tcpudp nft_compat nf_tables nfnetlink nls_ascii nls_cp437 vfat fat snd_hda_codec_realtek edac_mce_amd snd_hda_codec_generic ledtrig_audio kvm snd_hda_intel snd_intel_dspcfg soundwire_intel soundwire_generic_allocation snd_soc_core snd_compress soundwire_cadence snd_hda_codec irqbypass snd_hda_core ghash_clmulni_intel snd_hwdep soundwire_bus sg snd_pcm eeepc_wmi aesni_intel snd_timer asus_wmi ccp battery snd sp5100_tco libaes soundcore crypto_simd rng_core sparse_keymap cryptd watchdog k10temp rfkill glue_helper video rapl efi_pstore pcspkr wmi_bmof evdev acpi_cpufreq drm sunrpc fuse configfs efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 hid_generic raid10 raid456 async_raid6_recov async_memcpy async_pq usbhid async_xor async_tx
Apr  1 13:34:52 nas kernel: [76806.564043]  xor hid uas usb_storage raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod xhci_pci xhci_hcd ahci libahci libata atlantic usbcore nvme igb scsi_mod mxm_wmi nvme_core crc32_pclmul crc32c_intel macsec i2c_piix4 dca t10_pi ptp crc_t10dif usb_common i2c_algo_bit crct10dif_generic pps_core crct10dif_pclmul crct10dif_common wmi button
Apr  1 13:34:52 nas kernel: [76806.564072] CPU: 27 PID: 51750 Comm: plotting-1.7 Not tainted 5.10.0-26-amd64 #1 Debian 5.10.197-1
Apr  1 13:34:52 nas kernel: [76806.564074] Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 3603 03/20/2021
Apr  1 13:34:52 nas kernel: [76806.564076] RIP: 0010:mem_cgroup_update_lru_size+0x8d/0xa0
Apr  1 13:34:52 nas kernel: [76806.564078] Code: 00 eb c9 e9 c5 8e 93 00 89 f1 48 89 fa 41 89 d8 48 c7 c6 00 a2 a3 b7 48 c7 c7 61 96 cf b7 c6 05 37 8f 57 01 01 e8 93 d5 5f 00 <0f> 0b eb c8 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44
Apr  1 13:34:52 nas kernel: [76806.564079] RSP: 0018:ffffba3b0396bcd0 EFLAGS: 00010086
Apr  1 13:34:52 nas kernel: [76806.564080] RAX: 0000000000000000 RBX: ffffffffffffffff RCX: ffff8e48ef0e0908
Apr  1 13:34:52 nas kernel: [76806.564080] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff8e48ef0e0900
Apr  1 13:34:52 nas kernel: [76806.564082] RBP: ffff8e3a41958078 R08: 0000000000000000 R09: ffffba3b0396baf0
Apr  1 13:34:52 nas kernel: [76806.564082] R10: ffffba3b0396bae8 R11: ffff8e492f279fe8 R12: ffffe8b8a3fefcc8
Apr  1 13:34:52 nas kernel: [76806.564083] R13: ffff8e492f2d6000 R14: ffffe8b8a3fefcc0 R15: 0000000000000246
Apr  1 13:34:52 nas kernel: [76806.564085] FS:  00007fb9515ea700(0000) GS:ffff8e48ef0c0000(0000) knlGS:0000000000000000
Apr  1 13:34:52 nas kernel: [76806.564086] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr  1 13:34:52 nas kernel: [76806.564086] CR2: 0000028348e0f000 CR3: 0000000118c2a000 CR4: 0000000000750ee0
Apr  1 13:34:52 nas kernel: [76806.564087] PKRU: 55555554
Apr  1 13:34:52 nas kernel: [76806.564087] Call Trace:
Apr  1 13:34:52 nas kernel: [76806.564099]  ? __warn+0x80/0x100
Apr  1 13:34:52 nas kernel: [76806.564100]  ? mem_cgroup_update_lru_size+0x8d/0xa0
Apr  1 13:34:52 nas kernel: [76806.564103]  ? report_bug+0x9e/0xc0
Apr  1 13:34:52 nas kernel: [76806.564108]  ? handle_bug+0x35/0x80
Apr  1 13:34:52 nas kernel: [76806.564108]  ? exc_invalid_op+0x14/0x70
Apr  1 13:34:52 nas kernel: [76806.564111]  ? asm_exc_invalid_op+0x12/0x20
Apr  1 13:34:52 nas kernel: [76806.564114]  ? mem_cgroup_update_lru_size+0x8d/0xa0
Apr  1 13:34:52 nas kernel: [76806.564118]  release_pages+0x284/0x460
Apr  1 13:34:52 nas kernel: [76806.564122]  tlb_finish_mmu+0x7a/0x1a0
Apr  1 13:34:52 nas kernel: [76806.564124]  zap_page_range+0x116/0x170
Apr  1 13:34:52 nas kernel: [76806.564129]  do_madvise.part.0+0x69f/0xb60
Apr  1 13:34:52 nas kernel: [76806.564134]  ? do_user_addr_fault+0x1cc/0x400
Apr  1 13:34:52 nas kernel: [76806.564135]  __x64_sys_madvise+0x58/0x70
Apr  1 13:34:52 nas kernel: [76806.564137]  do_syscall_64+0x33/0x80
Apr  1 13:34:52 nas kernel: [76806.564139]  entry_SYSCALL_64_after_hwframe+0x62/0xc7
Apr  1 13:34:52 nas kernel: [76806.564141] RIP: 0033:0x7fb9b9ca4267
Apr  1 13:34:52 nas kernel: [76806.564142] Code: ff ff ff ff c3 66 0f 1f 44 00 00 48 8b 15 21 8c 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 0f 1f 44 00 00 b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f9 8b 0d 00 f7 d8 64 89 01 48
Apr  1 13:34:52 nas kernel: [76806.564143] RSP: 002b:00007fb9515e8388 EFLAGS: 00000206 ORIG_RAX: 000000000000001c
Apr  1 13:34:52 nas kernel: [76806.564144] RAX: ffffffffffffffda RBX: 00007fb9515e83a7 RCX: 00007fb9b9ca4267
Apr  1 13:34:52 nas kernel: [76806.564144] RDX: 0000000000000004 RSI: 0000000001ff0000 RDI: 0000028350010000
Apr  1 13:34:52 nas kernel: [76806.564145] RBP: 0000000001ff0000 R08: fffffffffffff000 R09: 0000000000000001
Apr  1 13:34:52 nas kernel: [76806.564146] R10: 0000000000010000 R11: 0000000000000206 R12: 0000028350010000
Apr  1 13:34:52 nas kernel: [76806.564146] R13: 0000028350010000 R14: 0000028350000000 R15: 0000028350000158
Apr  1 13:34:52 nas kernel: [76806.564148] ---[ end trace 5801886cd81cb9b1 ]---
Apr  1 13:34:52 nas kernel: [76806.564182] list_add corruption. next is NULL.
Apr  1 13:34:52 nas kernel: [76806.564197] ------------[ cut here ]------------
Apr  1 13:34:52 nas kernel: [76806.564197] kernel BUG at lib/list_debug.c:25!
Apr  1 13:34:52 nas kernel: [76806.564206] invalid opcode: 0000 [#1] SMP NOPTI
Apr  1 13:34:52 nas kernel: [76806.564208] CPU: 27 PID: 51750 Comm: plotting-1.7 Tainted: G        W         5.10.0-26-amd64 #1 Debian 5.10.197-1
Apr  1 13:34:52 nas kernel: [76806.564211] Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 3603 03/20/2021
Apr  1 13:34:52 nas kernel: [76806.564218] RIP: 0010:__list_add_valid.cold+0x59/0x5b
Apr  1 13:34:52 nas kernel: [76806.564220] Code: c7 c7 10 2b d2 b7 e8 02 0e ff ff 0f 0b 4c 89 c1 48 c7 c7 b8 2a d2 b7 e8 f1 0d ff ff 0f 0b 48 c7 c7 90 2a d2 b7 e8 e3 0d ff ff <0f> 0b 48 89 fe 48 c7 c7 a0 2b d2 b7 e8 d2 0d ff ff 0f 0b 48 c7 c7
Apr  1 13:34:52 nas kernel: [76806.564224] RSP: 0018:ffffba3b0396bbe0 EFLAGS: 00010046
Apr  1 13:34:52 nas kernel: [76806.564226] RAX: 0000000000000022 RBX: ffff8e492f2d7210 RCX: ffff8e48ef0e0908
Apr  1 13:34:52 nas kernel: [76806.564228] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff8e48ef0e0900
Apr  1 13:34:52 nas kernel: [76806.564230] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffba3b0396ba08
Apr  1 13:34:52 nas kernel: [76806.564232] R10: ffffba3b0396ba00 R11: ffff8e492f279fe8 R12: ffffe8b8a3fefcc8
Apr  1 13:34:52 nas kernel: [76806.564235] R13: ffff8e492f2d7140 R14: ffffe8b8a3fefcc0 R15: 0000000000000000
Apr  1 13:34:52 nas kernel: [76806.564237] FS:  00007fb9515ea700(0000) GS:ffff8e48ef0c0000(0000) knlGS:0000000000000000
Apr  1 13:34:52 nas kernel: [76806.564239] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr  1 13:34:52 nas kernel: [76806.564241] CR2: 0000028348e0f000 CR3: 0000000118c2a000 CR4: 0000000000750ee0
Apr  1 13:34:52 nas kernel: [76806.564243] PKRU: 55555554
Apr  1 13:34:52 nas kernel: [76806.564244] Call Trace:
Apr  1 13:34:52 nas kernel: [76806.564250]  ? __die_body.cold+0x1a/0x1f
Apr  1 13:34:52 nas kernel: [76806.564254]  ? die+0x2b/0x50
Apr  1 13:34:52 nas kernel: [76806.564258]  ? do_trap+0x91/0x110
Apr  1 13:34:52 nas kernel: [76806.564259]  ? __list_add_valid.cold+0x59/0x5b
Apr  1 13:34:52 nas kernel: [76806.564262]  ? do_error_trap+0x64/0xa0
Apr  1 13:34:52 nas kernel: [76806.564264]  ? __list_add_valid.cold+0x59/0x5b
Apr  1 13:34:52 nas kernel: [76806.564266]  ? exc_invalid_op+0x4e/0x70
Apr  1 13:34:52 nas kernel: [76806.564268]  ? __list_add_valid.cold+0x59/0x5b
Apr  1 13:34:52 nas kernel: [76806.564270]  ? asm_exc_invalid_op+0x12/0x20
Apr  1 13:34:52 nas kernel: [76806.564272]  ? __list_add_valid.cold+0x59/0x5b
Apr  1 13:34:52 nas kernel: [76806.564275]  __free_one_page+0x3a5/0x450
Apr  1 13:34:52 nas kernel: [76806.564278]  ? handle_bug+0x35/0x80
Apr  1 13:34:52 nas kernel: [76806.564280]  free_pcppages_bulk+0x219/0x2e0
Apr  1 13:34:52 nas kernel: [76806.564282]  free_unref_page_list+0x16e/0x1e0
Apr  1 13:34:52 nas kernel: [76806.564285]  release_pages+0x3d8/0x460
Apr  1 13:34:52 nas kernel: [76806.564288]  tlb_finish_mmu+0x7a/0x1a0
Apr  1 13:34:52 nas kernel: [76806.564290]  zap_page_range+0x116/0x170
Apr  1 13:34:52 nas kernel: [76806.564293]  do_madvise.part.0+0x69f/0xb60
Apr  1 13:34:52 nas kernel: [76806.564295]  ? do_user_addr_fault+0x1cc/0x400
Apr  1 13:34:52 nas kernel: [76806.564297]  __x64_sys_madvise+0x58/0x70
Apr  1 13:34:52 nas kernel: [76806.564299]  do_syscall_64+0x33/0x80
Apr  1 13:34:52 nas kernel: [76806.564301]  entry_SYSCALL_64_after_hwframe+0x62/0xc7
Apr  1 13:34:52 nas kernel: [76806.564303] RIP: 0033:0x7fb9b9ca4267
Apr  1 13:34:52 nas kernel: [76806.564305] Code: ff ff ff ff c3 66 0f 1f 44 00 00 48 8b 15 21 8c 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 0f 1f 44 00 00 b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f9 8b 0d 00 f7 d8 64 89 01 48
Apr  1 13:34:52 nas kernel: [76806.564308] RSP: 002b:00007fb9515e8388 EFLAGS: 00000206 ORIG_RAX: 000000000000001c
Apr  1 13:34:52 nas kernel: [76806.564310] RAX: ffffffffffffffda RBX: 00007fb9515e83a7 RCX: 00007fb9b9ca4267
Apr  1 13:34:52 nas kernel: [76806.564312] RDX: 0000000000000004 RSI: 0000000001ff0000 RDI: 0000028350010000
Apr  1 13:34:52 nas kernel: [76806.564314] RBP: 0000000001ff0000 R08: fffffffffffff000 R09: 0000000000000001
Apr  1 13:34:52 nas kernel: [76806.564316] R10: 0000000000010000 R11: 0000000000000206 R12: 0000028350010000
Apr  1 13:34:52 nas kernel: [76806.564317] R13: 0000028350010000 R14: 0000028350000000 R15: 0000028350000158
Apr  1 13:34:52 nas kernel: [76806.564320] Modules linked in: xt_nat veth nf_conntrack_netlink xfrm_user xfrm_algo br_netfilter overlay bridge stp llc nft_chain_nat xt_MASQUERADE nf_nat xt_addrtype xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter xt_CHECKSUM xt_tcpudp nft_compat nf_tables nfnetlink nls_ascii nls_cp437 vfat fat snd_hda_codec_realtek edac_mce_amd snd_hda_codec_generic ledtrig_audio kvm snd_hda_intel snd_intel_dspcfg soundwire_intel soundwire_generic_allocation snd_soc_core snd_compress soundwire_cadence snd_hda_codec irqbypass snd_hda_core ghash_clmulni_intel snd_hwdep soundwire_bus sg snd_pcm eeepc_wmi aesni_intel snd_timer asus_wmi ccp battery snd sp5100_tco libaes soundcore crypto_simd rng_core sparse_keymap cryptd watchdog k10temp rfkill glue_helper video rapl efi_pstore pcspkr wmi_bmof evdev acpi_cpufreq drm sunrpc fuse configfs efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 hid_generic raid10 raid456 async_raid6_recov async_memcpy async_pq usbhid async_xor async_tx
Apr  1 13:34:52 nas kernel: [76806.564349]  xor hid uas usb_storage raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod xhci_pci xhci_hcd ahci libahci libata atlantic usbcore nvme igb scsi_mod mxm_wmi nvme_core crc32_pclmul crc32c_intel macsec i2c_piix4 dca t10_pi ptp crc_t10dif usb_common i2c_algo_bit crct10dif_generic pps_core crct10dif_pclmul crct10dif_common wmi button
Apr  1 13:34:52 nas kernel: [76806.564380] ---[ end trace 5801886cd81cb9b2 ]---
Apr  1 13:34:52 nas kernel: [76809.538748] RIP: 0010:__list_add_valid.cold+0x59/0x5b
Apr  1 13:34:52 nas kernel: [76809.538757] Code: c7 c7 10 2b d2 b7 e8 02 0e ff ff 0f 0b 4c 89 c1 48 c7 c7 b8 2a d2 b7 e8 f1 0d ff ff 0f 0b 48 c7 c7 90 2a d2 b7 e8 e3 0d ff ff <0f> 0b 48 89 fe 48 c7 c7 a0 2b d2 b7 e8 d2 0d ff ff 0f 0b 48 c7 c7
Apr  1 13:34:52 nas kernel: [76809.538761] RSP: 0018:ffffba3b0396bbe0 EFLAGS: 00010046
Apr  1 13:34:52 nas kernel: [76809.538765] RAX: 0000000000000022 RBX: ffff8e492f2d7210 RCX: ffff8e48ef0e0908
Apr  1 13:34:52 nas kernel: [76809.538766] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff8e48ef0e0900
Apr  1 13:34:52 nas kernel: [76809.538767] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffba3b0396ba08
Apr  1 13:34:52 nas kernel: [76809.538769] R10: ffffba3b0396ba00 R11: ffff8e492f279fe8 R12: ffffe8b8a3fefcc8
Apr  1 13:34:52 nas kernel: [76809.538770] R13: ffff8e492f2d7140 R14: ffffe8b8a3fefcc0 R15: 0000000000000000
Apr  1 13:34:52 nas kernel: [76809.538771] FS:  00007fb9515ea700(0000) GS:ffff8e48ef0c0000(0000) knlGS:0000000000000000
Apr  1 13:34:52 nas kernel: [76809.538772] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr  1 13:34:52 nas kernel: [76809.538773] CR2: 0000028348e0f000 CR3: 0000000118c2a000 CR4: 0000000000750ee0
Apr  1 13:34:52 nas kernel: [76809.538774] PKRU: 55555554

This looks like a hardware issue or some other problem below the application, such as a kernel bug. Unprivileged userspace software like Subspace should not be able to cause kernel errors.
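To show what the farmer was doing when the kernel hit the BUG, the userspace register dump in the trace can be decoded by hand. The sketch below does that in Python. It assumes the standard Linux x86_64 syscall convention (syscall number in ORIG_RAX, arguments in RDI, RSI, RDX); the helper names and the small lookup tables are made up for illustration and cover only what this trace needs. It does not prove a hardware fault. It only shows that the call in flight was an ordinary `madvise`, which matches `__x64_sys_madvise` in the call trace.

```python
import re

# Register lines copied from the userspace part of the trace in the report.
TRACE = """\
RSP: 002b:00007fb9515e8388 EFLAGS: 00000206 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 00007fb9515e83a7 RCX: 00007fb9b9ca4267
RDX: 0000000000000004 RSI: 0000000001ff0000 RDI: 0000028350010000
"""

# Illustrative subsets only: Linux x86_64 syscall numbers and madvise advice values.
SYSCALLS = {28: "madvise"}
MADVISE_ADVICE = {
    0: "MADV_NORMAL",
    1: "MADV_RANDOM",
    2: "MADV_SEQUENTIAL",
    3: "MADV_WILLNEED",
    4: "MADV_DONTNEED",
}


def parse_registers(text):
    """Return {register name: value} for every 64-bit register in the dump."""
    pairs = re.findall(r"\b([A-Z_0-9]+): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def describe_syscall(regs):
    """Render the in-flight syscall as a readable call (hypothetical helper)."""
    name = SYSCALLS.get(regs["ORIG_RAX"], f"syscall_{regs['ORIG_RAX']}")
    advice = MADVISE_ADVICE.get(regs["RDX"], str(regs["RDX"]))
    return f"{name}(addr={regs['RDI']:#x}, length={regs['RSI']}, advice={advice})"


if __name__ == "__main__":
    print(describe_syscall(parse_registers(TRACE)))
```

Run against the registers above, this reports a `madvise` call with `MADV_DONTNEED` over a region of roughly 32 MiB, which is a routine request to release memory (allocators commonly issue it) and needs no special privileges.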