Kernel crash caused by farmer

Issue Report

A kernel crash was caused by a Subspace farming process (farming-5.18); the trace is below.
I checked the 6th disk (farming-5), an NVMe drive, but found no errors in the NVMe error log and no SMART errors. xfs_repair reported no issues either.
The server was running the dec-11 farmer, just to test whether the files work in preparation for the NUMA test.
I have never before seen a Subspace crash take down the entire server.
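
For anyone who wants to repeat the drive checks described above, here is a minimal Python sketch that wraps the usual read-only tools (nvme-cli, smartctl, xfs_repair in no-modify mode). The function names and any device paths passed in are hypothetical and must be adapted to the system; xfs_repair -n expects the filesystem to be unmounted.

```python
import shutil
import subprocess


def drive_check_commands(controller, filesystem):
    """Return read-only health checks as argv lists.

    controller: NVMe device node (hypothetical example: /dev/nvme5)
    filesystem: block device holding the XFS filesystem
                (hypothetical example: /dev/nvme5n1)
    """
    return [
        ["nvme", "error-log", controller],   # controller error log entries
        ["nvme", "smart-log", controller],   # SMART / health log page
        ["smartctl", "-a", controller],      # full SMART report
        ["xfs_repair", "-n", filesystem],    # -n: check only, modify nothing
    ]


def run_checks(commands):
    """Run each command and collect (argv, exit code, combined output).

    The exit code is None when the tool is not installed.
    """
    results = []
    for argv in commands:
        if shutil.which(argv[0]) is None:
            results.append((argv, None, "tool not installed"))
            continue
        proc = subprocess.run(argv, capture_output=True, text=True)
        results.append((argv, proc.returncode, proc.stdout + proc.stderr))
    return results
```

Calling run_checks(drive_check_commands(...)) as root with the real device paths returns one tuple per tool, so a non-zero exit code or unexpected output is easy to spot.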

Environment

Ubuntu server 22.04
CLI
Farmer DEC-11 version
kernel: 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Problem

Dec 31 20:07:19 subtrx kernel: [2170950.307914] BUG: Bad page state in process farming-5.18  pfn:11f4624
Dec 31 20:07:19 subtrx kernel: [2170950.307972] page:00000000e2c9ace2 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x11f4624
Dec 31 20:07:19 subtrx kernel: [2170950.307975] flags: 0x97ffffc2000000(idle|node=2|zone=2|lastcpupid=0x1fffff)
Dec 31 20:07:19 subtrx kernel: [2170950.307979] raw: 0097ffffc2000000 dead000000000100 dead000000000122 0000000000000000
Dec 31 20:07:19 subtrx kernel: [2170950.307980] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
Dec 31 20:07:19 subtrx kernel: [2170950.307981] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag(s) set
Dec 31 20:07:19 subtrx kernel: [2170950.307982] Modules linked in: nfnetlink cpuid tls nvme_fabrics binfmt_misc xfs nls_iso8859_1 intel_rapl_msr intel_rapl_common iwlmvm edac_mce_amd mac80211 snd_hda_intel btusb snd_usb_audio snd_intel_dspcfg btrtl snd_intel_sdw_acpi btbcm snd_hda_codec kvm_amd snd_usbmidi_lib btintel libarc4 kvm bluetooth iwlwifi snd_rawmidi snd_hda_core snd_seq_device ecdh_generic mc snd_hwdep rapl wmi_bmof gigabyte_wmi input_leds joydev ecc cfg80211 snd_pcm snd_timer snd ccp soundcore plx_dma k10temp mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear radeon hid_generic drm_ttm_helper ttm drm_kms_helper syscopyarea usbhid sysfillrect sysimgblt hid fb_sys_fops cec rc_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd mxm_wmi igb cryptd atlantic ahci drm
Dec 31 20:07:19 subtrx kernel: [2170950.308045]  dca libahci macsec i2c_algo_bit xhci_pci nvme xhci_pci_renesas i2c_piix4 nvme_core wmi
Dec 31 20:07:19 subtrx kernel: [2170950.308053] CPU: 7 PID: 1313160 Comm: farming-5.18 Not tainted 5.15.0-89-generic #99-Ubuntu
Dec 31 20:07:19 subtrx kernel: [2170950.308055] Hardware name: Gigabyte Technology Co., Ltd. TRX40 AORUS MASTER/TRX40 AORUS MASTER, BIOS F5k 09/25/2020
Dec 31 20:07:19 subtrx kernel: [2170950.308056] Call Trace:
Dec 31 20:07:19 subtrx kernel: [2170950.308058]  <TASK>
Dec 31 20:07:19 subtrx kernel: [2170950.308061]  show_stack+0x52/0x5c
Dec 31 20:07:19 subtrx kernel: [2170950.308067]  dump_stack_lvl+0x4a/0x63
Dec 31 20:07:19 subtrx kernel: [2170950.308071]  dump_stack+0x10/0x16
Dec 31 20:07:19 subtrx kernel: [2170950.308072]  bad_page.cold+0x63/0x94
Dec 31 20:07:19 subtrx kernel: [2170950.308076]  check_new_page_bad+0x6d/0x80
Dec 31 20:07:19 subtrx kernel: [2170950.308080]  rmqueue_bulk+0x45f/0x770
Dec 31 20:07:19 subtrx kernel: [2170950.308082]  ? nvme_queue_rq+0x13c/0x1e1 [nvme]
Dec 31 20:07:19 subtrx kernel: [2170950.308087]  rmqueue+0x5a6/0xbb0
Dec 31 20:07:19 subtrx kernel: [2170950.308090]  ? kmem_cache_alloc+0x1ab/0x2f0
Dec 31 20:07:19 subtrx kernel: [2170950.308092]  ? xas_alloc+0xa7/0xd0
Dec 31 20:07:19 subtrx kernel: [2170950.308095]  get_page_from_freelist+0xdf/0x540
Dec 31 20:07:19 subtrx kernel: [2170950.308097]  ? __mod_memcg_lruvec_state+0x63/0xe0
Dec 31 20:07:19 subtrx kernel: [2170950.308100]  __alloc_pages+0x17e/0x330
Dec 31 20:07:19 subtrx kernel: [2170950.308103]  alloc_pages+0x9e/0x1e0
Dec 31 20:07:19 subtrx kernel: [2170950.308105]  __page_cache_alloc+0x7e/0x90
Dec 31 20:07:19 subtrx kernel: [2170950.308108]  page_cache_ra_unbounded+0xac/0x210
Dec 31 20:07:19 subtrx kernel: [2170950.308111]  force_page_cache_ra+0xe6/0x150
Dec 31 20:07:19 subtrx kernel: [2170950.308113]  page_cache_sync_ra+0x40/0xe0
Dec 31 20:07:19 subtrx kernel: [2170950.308115]  filemap_get_pages+0xde/0x3f0
Dec 31 20:07:19 subtrx kernel: [2170950.308117]  ? atime_needs_update+0x104/0x180
Dec 31 20:07:19 subtrx kernel: [2170950.308121]  filemap_read+0xbc/0x3e0
Dec 31 20:07:19 subtrx kernel: [2170950.308123]  ? uprobe_notify_resume+0x10/0x390
Dec 31 20:07:19 subtrx kernel: [2170950.308125]  ? xfs_file_buffered_read+0xb1/0xc0 [xfs]
Dec 31 20:07:19 subtrx kernel: [2170950.308190]  ? xfs_file_read_iter+0xb3/0x1c0 [xfs]
Dec 31 20:07:19 subtrx kernel: [2170950.308242]  generic_file_read_iter+0xe5/0x150
Dec 31 20:07:19 subtrx kernel: [2170950.308244]  ? down_read+0x13/0xa0
Dec 31 20:07:19 subtrx kernel: [2170950.308247]  xfs_file_buffered_read+0xa1/0xc0 [xfs]
Dec 31 20:07:19 subtrx kernel: [2170950.308298]  xfs_file_read_iter+0xb3/0x1c0 [xfs]
Dec 31 20:07:19 subtrx kernel: [2170950.308347]  new_sync_read+0x10d/0x190
Dec 31 20:07:19 subtrx kernel: [2170950.308351]  vfs_read+0x103/0x1a0
Dec 31 20:07:19 subtrx kernel: [2170950.308353]  __x64_sys_pread64+0x96/0xc0
Dec 31 20:07:19 subtrx kernel: [2170950.308354]  do_syscall_64+0x5c/0xc0
Dec 31 20:07:19 subtrx kernel: [2170950.308357]  ? irqentry_exit+0x1d/0x30
Dec 31 20:07:19 subtrx kernel: [2170950.308359]  ? common_interrupt+0x55/0xa0
Dec 31 20:07:19 subtrx kernel: [2170950.308360]  entry_SYSCALL_64_after_hwframe+0x62/0xcc
Dec 31 20:07:19 subtrx kernel: [2170950.308362] RIP: 0033:0x7fe169ec759f
Dec 31 20:07:19 subtrx kernel: [2170950.308365] Code: 08 89 3c 24 48 89 4c 24 18 e8 6d e4 f7 ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 48 8b 74 24 08 8b 3c 24 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 04 24 e8 ad e4 f7 ff 48 8b
Dec 31 20:07:19 subtrx kernel: [2170950.308366] RSP: 002b:00007fdaee9f0cf0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
Dec 31 20:07:19 subtrx kernel: [2170950.308368] RAX: ffffffffffffffda RBX: 00007fe169ec7540 RCX: 00007fe169ec759f
Dec 31 20:07:19 subtrx kernel: [2170950.308370] RDX: 0000000000004d80 RSI: 00007fdb4003f000 RDI: 00000000000000e8
Dec 31 20:07:19 subtrx kernel: [2170950.308371] RBP: 00000000000000e8 R08: 0000000000000000 R09: 00007fdb4003f000
Dec 31 20:07:19 subtrx kernel: [2170950.308372] R10: 0000015940eb5aa0 R11: 0000000000000293 R12: 0000000000004d80
Dec 31 20:07:19 subtrx kernel: [2170950.308373] R13: 7fffffffffffffff R14: 0000015940eb5aa0 R15: 00007fdb4003f000
Dec 31 20:07:19 subtrx kernel: [2170950.308375]  </TASK>
Dec 31 20:07:19 subtrx kernel: [2170950.308376] Disabling lock debugging due to kernel taint

The server came to a total standstill about 5 minutes later. The latest log entry anywhere is the Subspace node log at 2023-12-31T20:12:13.284700Z (there are no further kernel or syslog entries).

This is not a kernel crash caused by the farmer; it is a kernel crash that happened while the farmer was running.

The cause is probably hardware-related; I see NVMe functions in the backtrace.

Check whether the NVMe SSD has adequate cooling and look for hardware issues. If the server is stable, there is nothing the farmer can possibly do to crash the whole server; at worst it can crash itself if there is a bug in the code.

As far as I understand it, the backtrace is mostly memory and disk code. I’m not sure which is the cause and which is the effect here.
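
One thing that may help when reading the trace: frames the kernel prints with a leading "?" are addresses it found on the stack but could not confirm as part of the actual call chain. The nvme_queue_rq frame is one of those, so it may just be left over from earlier I/O. The confirmed frames run from pread64 through the XFS buffered read and page cache readahead into the page allocator, where the check on a page taken from the free list failed. Here is a minimal Python sketch that separates the two kinds of frames, run on a few lines copied from the trace above (the name split_frames and the regex are only illustrative):

```python
import re

# Matches "... ]  [? ]symbol+0xOFF/0xLEN [module]" at the end of a log line.
FRAME = re.compile(
    r"\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?\s*$"
)


def split_frames(log_lines):
    """Return (confirmed, unconfirmed) frame names from kernel call trace lines."""
    confirmed, unconfirmed = [], []
    for line in log_lines:
        match = FRAME.search(line)
        if not match:
            continue  # not a stack frame (header, registers, ...)
        name = match.group(2)
        if match.group(3):
            name += f" [{match.group(3)}]"
        # A leading "?" marks a frame the unwinder could not verify.
        (unconfirmed if match.group(1) else confirmed).append(name)
    return confirmed, unconfirmed


SAMPLE = """\
Dec 31 20:07:19 subtrx kernel: [2170950.308076]  check_new_page_bad+0x6d/0x80
Dec 31 20:07:19 subtrx kernel: [2170950.308080]  rmqueue_bulk+0x45f/0x770
Dec 31 20:07:19 subtrx kernel: [2170950.308082]  ? nvme_queue_rq+0x13c/0x1e1 [nvme]
Dec 31 20:07:19 subtrx kernel: [2170950.308087]  rmqueue+0x5a6/0xbb0
Dec 31 20:07:19 subtrx kernel: [2170950.308247]  xfs_file_buffered_read+0xa1/0xc0 [xfs]
""".splitlines()

confirmed, unconfirmed = split_frames(SAMPLE)
print("confirmed:  ", confirmed)
print("unconfirmed:", unconfirmed)
```

On this sample the only unconfirmed frame is nvme_queue_rq, which fits the reading that the failure was detected in the memory allocator rather than in the NVMe driver. That alone does not say what corrupted the page.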

I found nothing wrong on the NVMe, and its temperature is fine too.
The system is otherwise stable; the last reboot was ~25 days ago.
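
The uptime can be cross-checked against the trace itself: the bracketed number on each kernel line is roughly the number of seconds since boot. A quick Python sketch of the arithmetic, using the timestamp from the "BUG: Bad page state" line:

```python
from datetime import timedelta

# Bracketed printk timestamp from the "BUG: Bad page state" line,
# taken here as seconds since boot.
PRINTK_SECONDS = 2170950.307914

uptime = timedelta(seconds=PRINTK_SECONDS)
print(uptime.days, "days,", uptime.seconds // 3600, "hours")  # 25 days, 3 hours
```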

My guess is some random error, probably memory (it is non-ECC).

Thanks a lot for looking & a happy new year!

Anything can happen; hopefully this is a one-off.