Issue Report
Environment
- Operating System: Ubuntu 22.04
- Pulsar/Advanced CLI/Docker: Docker
Problem
I am running four NVMe drives on an ASUS Hyper M.2 x16 card. Farming stops because one drive appears to get unmounted and then disappears from the system. The drive's health looks fine, and rebooting the server makes it appear again. dmesg shows:
[28474.862927] sd 13:0:7:0: [sdp] tag#5390 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[28474.862944] sd 13:0:7:0: [sdp] tag#5390 Sense Key : Illegal Request [current] [descriptor]
[28474.862946] sd 13:0:7:0: [sdp] tag#5390 Add. Sense: Logical block address out of range
[28474.862948] sd 13:0:7:0: [sdp] tag#5390 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
[28474.862949] blk_update_request: critical target error, dev sdp, sector 7813992704 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
[34938.577890] nvme nvme2: I/O 749 QID 1 timeout, aborting
[34938.605607] nvme nvme2: Abort status: 0x0
[34969.298215] nvme nvme2: I/O 749 QID 1 timeout, reset controller
[35090.499486] nvme nvme2: Device not ready; aborting reset, CSTS=0x1
[35090.523768] blk_update_request: I/O error, dev nvme2n1, sector 6721074240 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 0
[35090.523783] blk_update_request: I/O error, dev nvme2n1, sector 6837601672 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 0
[35090.523792] blk_update_request: I/O error, dev nvme2n1, sector 7909380144 op 0x0:(READ) flags 0x80700 phys_seg 2 prio class 0
[35090.523799] blk_update_request: I/O error, dev nvme2n1, sector 7819571808 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 0
[35090.523805] blk_update_request: I/O error, dev nvme2n1, sector 7330300872 op 0x0:(READ) flags 0x80700 phys_seg 3 prio class 0
[35090.523811] blk_update_request: I/O error, dev nvme2n1, sector 6998007544 op 0x0:(READ) flags 0x80700 phys_seg 5 prio class 0
[35090.523818] blk_update_request: I/O error, dev nvme2n1, sector 6967402344 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 0
[35090.523824] blk_update_request: I/O error, dev nvme2n1, sector 7042126872 op 0x0:(READ) flags 0x80700 phys_seg 5 prio class 0
[35090.523830] blk_update_request: I/O error, dev nvme2n1, sector 7072336728 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 0
[35090.523836] blk_update_request: I/O error, dev nvme2n1, sector 7882618296 op 0x0:(READ) flags 0x80700 phys_seg 5 prio class 0
[35163.860310] INFO: task kcompactd0:212 blocked for more than 120 seconds.
[35163.860318] Not tainted 5.15.0-94-generic #104-Ubuntu
[35163.860320] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35163.860322] task:kcompactd0 state:D stack: 0 pid: 212 ppid: 2 flags:0x00004000
Later in dmesg I also see the following, which repeats periodically:
[35284.693706] </TASK>
[35284.693777] INFO: task farming-2.0:8023 blocked for more than 120 seconds.
[35284.693779] Not tainted 5.15.0-94-generic #104-Ubuntu
[35284.693780] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35284.693781] task:farming-2.0 state:D stack: 0 pid: 8023 ppid: 7782 flags:0x00004220
[35284.693783] Call Trace:
[35284.693784] <TASK>
[35284.693785] __schedule+0x24e/0x590
[35284.693786] schedule+0x69/0x110
[35284.693787] __submit_bio+0xc4/0x220
[35284.693790] ? wait_woken+0x70/0x70
[35284.693792] submit_bio_noacct+0xc0/0x120
[35284.693793] submit_bio+0x4a/0x130
[35284.693794] ext4_mpage_readpages+0x403/0xb20
[35284.693796] ? wait_on_page_bit_common+0x3b2/0x3d0
[35284.693798] ext4_readpage+0x3f/0x90
[35284.693799] filemap_read_page+0x3b/0x100
[35284.693801] filemap_update_page+0x20c/0x290
[35284.693802] filemap_get_pages+0x276/0x3f0
[35284.693804] filemap_read+0xbc/0x3e0
[35284.693806] ? blk_queue_exit+0x1a/0x50
[35284.693807] ? __blk_mq_free_request+0x96/0xc0
[35284.693809] ? blk_update_request+0x2af/0x540
[35284.693810] generic_file_read_iter+0xe5/0x150
[35284.693812] ext4_file_read_iter+0x5b/0x190
[35284.693814] ? aa_file_perm+0x127/0x2a0
[35284.693816] new_sync_read+0x10d/0x190
[35284.693819] vfs_read+0x103/0x1a0
[35284.693821] __x64_sys_pread64+0x96/0xc0
[35284.693822] do_syscall_64+0x5c/0xc0
[35284.693824] ? irqentry_exit+0x1d/0x30
[35284.693826] ? common_interrupt+0x55/0xa0
[35284.693827] entry_SYSCALL_64_after_hwframe+0x62/0xcc
[35284.693829] RIP: 0033:0x7efea0486c6f
[35284.693831] RSP: 002b:00007efce03d8720 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
[35284.693832] RAX: ffffffffffffffda RBX: 00007efea0486c10 RCX: 00007efea0486c6f
[35284.693833] RDX: 0000000000004f40 RSI: 000003f8cc8cf000 RDI: 0000000000000044
[35284.693834] RBP: 0000000000000044 R08: 0000000000000000 R09: 0000000000000000
[35284.693834] R10: 0000000115e46540 R11: 0000000000000293 R12: 0000000000004f40
[35284.693835] R13: 7fffffffffffffff R14: 0000000115e46540 R15: 000003f8cc8cf000
[35284.693836] </TASK>
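For anyone triaging a longer capture of this log, the dmesg output above can be condensed into per-device counts. The following is a minimal sketch, not an official tool: the function name `summarize` and the three regular expressions are illustrative assumptions that match only the message shapes pasted above (kernel 5.15), and other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns written against the exact message shapes in the pasted log.
# They are assumptions for this sketch, not a stable kernel interface.
IO_ERROR = re.compile(
    r"blk_update_request: (?P<kind>[^,]+), dev (?P<dev>\w+), "
    r"sector (?P<sector>\d+) op \S+:\((?P<op>\w+)\)"
)
NVME_TIMEOUT = re.compile(r"nvme (?P<ctrl>nvme\d+): I/O \d+ QID \d+ timeout")
HUNG_TASK = re.compile(r"INFO: task (?P<task>[^:]+):(?P<pid>\d+) blocked for more than")


def summarize(lines):
    """Count block-layer errors, NVMe timeouts, and hung tasks in dmesg lines."""
    io_errors = Counter()   # keyed by (device, operation, error kind)
    timeouts = Counter()    # keyed by NVMe controller name
    hung_tasks = Counter()  # keyed by task name
    for line in lines:
        if m := IO_ERROR.search(line):
            io_errors[(m["dev"], m["op"], m["kind"])] += 1
        elif m := NVME_TIMEOUT.search(line):
            timeouts[m["ctrl"]] += 1
        elif m := HUNG_TASK.search(line):
            hung_tasks[m["task"]] += 1
    return io_errors, timeouts, hung_tasks


# A few lines copied from the log above, used as sample input.
SAMPLE = [
    "[28474.862949] blk_update_request: critical target error, dev sdp, sector 7813992704 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0",
    "[34938.577890] nvme nvme2: I/O 749 QID 1 timeout, aborting",
    "[34969.298215] nvme nvme2: I/O 749 QID 1 timeout, reset controller",
    "[35090.523768] blk_update_request: I/O error, dev nvme2n1, sector 6721074240 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 0",
    "[35163.860310] INFO: task kcompactd0:212 blocked for more than 120 seconds.",
]

if __name__ == "__main__":
    io_errors, timeouts, hung_tasks = summarize(SAMPLE)
    print(dict(io_errors))
    print(dict(timeouts))
    print(dict(hung_tasks))
```

Keeping the DISCARD failure on sdp separate from the READ failures on nvme2n1 makes it easier to see that the log shows two different devices, and that only nvme2 reports controller timeouts.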
smartctl report:
=== START OF INFORMATION SECTION ===
Model Number: TEAM TM8FP4004T
Serial Number: 1B2310270295584
Firmware Version: VB421D65
PCI Vendor/Subsystem ID: 0x10ec
IEEE OUI Identifier: 0x00e04c
Controller ID: 1
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 4,096,805,658,624 [4.09 TB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Mon Feb 12 06:16:58 2024 MST
Firmware Updates (0x02): 1 Slot
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0014): DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x02): Cmd_Eff_Lg
Maximum Data Transfer Size: 32 Pages
Warning Comp. Temp. Threshold: 100 Celsius
Critical Comp. Temp. Threshold: 110 Celsius
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.00W       -        -    0  0  0  0   230000   50000
 1 +     4.00W       -        -    1  1  1  1     4000   50000
 2 +     3.00W       -        -    2  2  2  2     4000  250000
 3 -     0.50W       -        -    3  3  3  3     4000    8000
 4 -   0.0090W       -        -    4  4  4  4     8000   30000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 32 Celsius
Available Spare: 100%
Available Spare Threshold: 32%
Percentage Used: 0%
Data Units Read: 9,149,567 [4.68 TB]
Data Units Written: 3,368,493 [1.72 TB]
Host Read Commands: 201,123,805
Host Write Commands: 13,278,098
Controller Busy Time: 0
Power Cycles: 6
Power On Hours: 666
Unsafe Shutdowns: 3
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Error Information (NVMe Log 0x01, 8 of 8 entries)
No Errors Logged
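As a sanity check, the failed READ sectors on nvme2n1 can be compared against the namespace size from the smartctl report. This is a rough sketch under two assumptions: that the smartctl report above belongs to the drive behind nvme2 (the report does not say which device it was run against), and that the helper names `total_sectors` and `out_of_range` are purely illustrative. The kernel block layer reports sector numbers in 512-byte units, which here matches the formatted LBA size.

```python
SECTOR_SIZE = 512                    # "Namespace 1 Formatted LBA Size: 512"
NAMESPACE_BYTES = 4_096_805_658_624  # "Namespace 1 Size/Capacity"

# Sectors from the nvme2n1 READ errors in the dmesg output above.
FAILED_READ_SECTORS = [
    6721074240, 6837601672, 7909380144, 7819571808, 7330300872,
    6998007544, 6967402344, 7042126872, 7072336728, 7882618296,
]


def total_sectors(capacity_bytes, sector_size=SECTOR_SIZE):
    """Number of addressable sectors for a given capacity."""
    return capacity_bytes // sector_size


def out_of_range(sectors, capacity_bytes, sector_size=SECTOR_SIZE):
    """Return the sectors that fall at or beyond the end of the namespace."""
    limit = total_sectors(capacity_bytes, sector_size)
    return [s for s in sectors if s >= limit]


if __name__ == "__main__":
    print("total sectors:", total_sectors(NAMESPACE_BYTES))
    print("out of range:", out_of_range(FAILED_READ_SECTORS, NAMESPACE_BYTES))
```

If the assumption about the device holds, every failed read falls inside the namespace. That would suggest the nvme2n1 I/O errors follow from the controller timeout and failed reset rather than from out-of-range accesses like the DISCARD reported on sdp.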