Disk Error When Farming

Issue Report

Environment

  • Operating System: Ubuntu 22.04
  • Pulsar/Advanced CLI/Docker: Docker

Problem

I am running 4 NVMe drives on an ASUS Hyper M.2 x16 card. Farming stops because one of the disks seems to get unmounted and disappears. The disk's health seems fine, and rebooting the server makes it appear again. Looking into dmesg I see:

[28474.862927] sd 13:0:7:0: [sdp] tag#5390 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[28474.862944] sd 13:0:7:0: [sdp] tag#5390 Sense Key : Illegal Request [current] [descriptor] 
[28474.862946] sd 13:0:7:0: [sdp] tag#5390 Add. Sense: Logical block address out of range
[28474.862948] sd 13:0:7:0: [sdp] tag#5390 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
[28474.862949] blk_update_request: critical target error, dev sdp, sector 7813992704 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
[34938.577890] nvme nvme2: I/O 749 QID 1 timeout, aborting
[34938.605607] nvme nvme2: Abort status: 0x0
[34969.298215] nvme nvme2: I/O 749 QID 1 timeout, reset controller
[35090.499486] nvme nvme2: Device not ready; aborting reset, CSTS=0x1
[35090.523768] blk_update_request: I/O error, dev nvme2n1, sector 6721074240 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 0
[35090.523783] blk_update_request: I/O error, dev nvme2n1, sector 6837601672 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 0
[35090.523792] blk_update_request: I/O error, dev nvme2n1, sector 7909380144 op 0x0:(READ) flags 0x80700 phys_seg 2 prio class 0
[35090.523799] blk_update_request: I/O error, dev nvme2n1, sector 7819571808 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 0
[35090.523805] blk_update_request: I/O error, dev nvme2n1, sector 7330300872 op 0x0:(READ) flags 0x80700 phys_seg 3 prio class 0
[35090.523811] blk_update_request: I/O error, dev nvme2n1, sector 6998007544 op 0x0:(READ) flags 0x80700 phys_seg 5 prio class 0
[35090.523818] blk_update_request: I/O error, dev nvme2n1, sector 6967402344 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 0
[35090.523824] blk_update_request: I/O error, dev nvme2n1, sector 7042126872 op 0x0:(READ) flags 0x80700 phys_seg 5 prio class 0
[35090.523830] blk_update_request: I/O error, dev nvme2n1, sector 7072336728 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 0
[35090.523836] blk_update_request: I/O error, dev nvme2n1, sector 7882618296 op 0x0:(READ) flags 0x80700 phys_seg 5 prio class 0
[35163.860310] INFO: task kcompactd0:212 blocked for more than 120 seconds.
[35163.860318]       Not tainted 5.15.0-94-generic #104-Ubuntu
[35163.860320] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35163.860322] task:kcompactd0      state:D stack:    0 pid:  212 ppid:     2 flags:0x00004000

I also see the following in dmesg later on, and it repeats periodically:

[35284.693706]  </TASK>
[35284.693777] INFO: task farming-2.0:8023 blocked for more than 120 seconds.
[35284.693779]       Not tainted 5.15.0-94-generic #104-Ubuntu
[35284.693780] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35284.693781] task:farming-2.0     state:D stack:    0 pid: 8023 ppid:  7782 flags:0x00004220
[35284.693783] Call Trace:
[35284.693784]  <TASK>
[35284.693785]  __schedule+0x24e/0x590
[35284.693786]  schedule+0x69/0x110
[35284.693787]  __submit_bio+0xc4/0x220
[35284.693790]  ? wait_woken+0x70/0x70
[35284.693792]  submit_bio_noacct+0xc0/0x120
[35284.693793]  submit_bio+0x4a/0x130
[35284.693794]  ext4_mpage_readpages+0x403/0xb20
[35284.693796]  ? wait_on_page_bit_common+0x3b2/0x3d0
[35284.693798]  ext4_readpage+0x3f/0x90
[35284.693799]  filemap_read_page+0x3b/0x100
[35284.693801]  filemap_update_page+0x20c/0x290
[35284.693802]  filemap_get_pages+0x276/0x3f0
[35284.693804]  filemap_read+0xbc/0x3e0
[35284.693806]  ? blk_queue_exit+0x1a/0x50
[35284.693807]  ? __blk_mq_free_request+0x96/0xc0
[35284.693809]  ? blk_update_request+0x2af/0x540
[35284.693810]  generic_file_read_iter+0xe5/0x150
[35284.693812]  ext4_file_read_iter+0x5b/0x190
[35284.693814]  ? aa_file_perm+0x127/0x2a0
[35284.693816]  new_sync_read+0x10d/0x190
[35284.693819]  vfs_read+0x103/0x1a0
[35284.693821]  __x64_sys_pread64+0x96/0xc0
[35284.693822]  do_syscall_64+0x5c/0xc0
[35284.693824]  ? irqentry_exit+0x1d/0x30
[35284.693826]  ? common_interrupt+0x55/0xa0
[35284.693827]  entry_SYSCALL_64_after_hwframe+0x62/0xcc
[35284.693829] RIP: 0033:0x7efea0486c6f
[35284.693831] RSP: 002b:00007efce03d8720 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
[35284.693832] RAX: ffffffffffffffda RBX: 00007efea0486c10 RCX: 00007efea0486c6f
[35284.693833] RDX: 0000000000004f40 RSI: 000003f8cc8cf000 RDI: 0000000000000044
[35284.693834] RBP: 0000000000000044 R08: 0000000000000000 R09: 0000000000000000
[35284.693834] R10: 0000000115e46540 R11: 0000000000000293 R12: 0000000000004f40
[35284.693835] R13: 7fffffffffffffff R14: 0000000115e46540 R15: 000003f8cc8cf000
[35284.693836]  </TASK>

smartctl report:

=== START OF INFORMATION SECTION ===
Model Number:                       TEAM TM8FP4004T
Serial Number:                      1B2310270295584
Firmware Version:                   VB421D65
PCI Vendor/Subsystem ID:            0x10ec
IEEE OUI Identifier:                0x00e04c
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          4,096,805,658,624 [4.09 TB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Mon Feb 12 06:16:58 2024 MST
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0014):     DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x02):         Cmd_Eff_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     100 Celsius
Critical Comp. Temp. Threshold:     110 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.00W       -        -    0  0  0  0   230000   50000
 1 +     4.00W       -        -    1  1  1  1     4000   50000
 2 +     3.00W       -        -    2  2  2  2     4000  250000
 3 -     0.50W       -        -    3  3  3  3     4000    8000
 4 -   0.0090W       -        -    4  4  4  4     8000   30000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        32 Celsius
Available Spare:                    100%
Available Spare Threshold:          32%
Percentage Used:                    0%
Data Units Read:                    9,149,567 [4.68 TB]
Data Units Written:                 3,368,493 [1.72 TB]
Host Read Commands:                 201,123,805
Host Write Commands:                13,278,098
Controller Busy Time:               0
Power Cycles:                       6
Power On Hours:                     666
Unsafe Shutdowns:                   3
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 8 of 8 entries)
No Errors Logged

Nothing Subspace-specific here. Your kernel says disk errors are happening. It could be unstable CPU/RAM, a defective SSD, signal integrity issues from having the SSD installed in an expansion card far from the CPU, or something else.
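
If you want to narrow it down before swapping hardware, a few generic checks can help. This is only a minimal sketch; device names, package names, and the memtester size/iterations are examples, adjust for your system:

# Watch kernel messages live for further NVMe timeouts/resets
sudo journalctl -k -f | grep -i nvme

# Check whether the controller is still visible after a drop
sudo nvme list
lspci | grep -i "non-volatile"

# Basic RAM stress test (memtester package); size and iteration count are examples
sudo memtester 4G 1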

There is nothing Subspace can do about it; the issue is on the hardware side. I'd suggest trying that same SSD in a USB adapter, or installing it in a native M.2 slot on the motherboard first, to make sure the expansion card is not the cause here.
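
If the problem follows the drive, its built-in self-test can help confirm a defective SSD (the smartctl output above lists Self_Test among the optional admin commands). A rough sketch assuming nvme-cli is installed; /dev/nvme2 is just the device from the dmesg output above, adjust as needed:

# Start the extended device self-test
sudo nvme device-self-test /dev/nvme2 -s 2

# Check the self-test progress and result
sudo nvme self-test-log /dev/nvme2

# Re-read the controller's error and SMART logs afterwards
sudo nvme error-log /dev/nvme2
sudo nvme smart-log /dev/nvme2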