I have two Samsung 970 EVO Plus 1TB NVMe SSDs on a generic NVMe PCIe adapter
(from Amazon) that have been running for less than
150 days in a ZFS stripe.
Already I’m getting read and checksum failures in ZFS and SMART isn’t happy
either. Is this expected?
The data isn’t critical (monitoring databases, temporary storage) and I have
backups of it anyway, but the issues are affecting my usage.
zpool status:
pool: shasta
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://zfsonlinux.org/msg/ZFS-8000-8A
scan: scrub repaired 0B in 0 days 00:17:43 with 1 errors on Sun Jan 29 14:32:08 2023
remove: Removal of /dev/mapper/shasta0_crypt canceled on Sun Jan 29 14:13:49 2023
config:
    NAME             STATE     READ WRITE CKSUM
    shasta           DEGRADED     0     0     0
      shasta0_crypt  DEGRADED     6     0    38  too many errors
      shasta1_crypt  ONLINE       0     0     0
errors: Permanent errors have been detected in the following files:
<snip>
And here’s smartctl for nvme0:
smartctl -a /dev/nvme0
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-135-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 970 EVO Plus 1TB
Serial Number: <redacted>
Firmware Version: 3B2QEXM7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 6
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,000,204,886,016 [1.00 TB]
Namespace 1 Utilization: 380,888,137,728 [380 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 5b1140c3a7
Local Time is: Fri Feb 3 04:49:54 2023 PST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0057): Comp Wr_Unc DS_Mngmt Sav/Sel_Feat Timestmp
Maximum Data Transfer Size: 128 Pages
Warning Comp. Temp. Threshold: 82 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 7.54W - - 0 0 0 0 0 0
1 + 7.54W - - 1 1 1 1 0 200
2 + 7.54W - - 2 2 2 2 0 1000
3 - 0.0500W - - 3 3 3 3 2000 1200
4 - 0.0050W - - 4 4 4 4 500 9500
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 36 Celsius
Available Spare: 95%
Available Spare Threshold: 10%
Percentage Used: 2%
Data Units Read: 54,168,819 [27.7 TB]
Data Units Written: 47,512,583 [24.3 TB]
Host Read Commands: 260,828,947
Host Write Commands: 684,198,598
Controller Busy Time: 2,284
Power Cycles: 10
Power On Hours: 2,820
Unsafe Shutdowns: 4
Media and Data Integrity Errors: 36
Error Information Log Entries: 36
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 36 Celsius
Temperature Sensor 2: 39 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 36 1 0x00ae 0x4502 0x000 915751488 1 -
1 35 5 0x01a6 0x4502 0x000 915751312 1 -
2 34 2 0x00f4 0x4502 0x000 915750928 1 -
3 33 4 0x028f 0x4502 0x000 915742112 1 -
4 32 4 0x02be 0xc502 0x000 232311792 1 -
5 31 3 0x007f 0x4502 0x000 232311792 1 -
6 30 1 0x00a2 0x4502 0x000 232311792 1 -
7 29 8 0x0278 0x4502 0x000 219233800 1 -
8 28 8 0x0277 0x4502 0x000 915751488 1 -
9 27 1 0x0083 0x4502 0x000 915751312 1 -
10 26 3 0x0043 0x4502 0x000 915750928 1 -
11 25 6 0x02b6 0x4502 0x000 915742112 1 -
12 24 1 0x00b7 0xc502 0x000 232311792 1 -
13 23 3 0x005f 0x4502 0x000 232311792 1 -
14 22 6 0x02ae 0x4502 0x000 915751488 1 -
15 21 1 0x00b2 0x4502 0x000 232311664 1 -
... (20 entries not shown)
And for nvme1:
smartctl -a /dev/nvme1
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-135-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 970 EVO Plus 1TB
Serial Number: <snip>
Firmware Version: 3B2QEXM7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 6
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,000,204,886,016 [1.00 TB]
Namespace 1 Utilization: 340,720,836,608 [340 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 5b1140c2bf
Local Time is: Fri Feb 3 04:49:56 2023 PST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0057): Comp Wr_Unc DS_Mngmt Sav/Sel_Feat Timestmp
Maximum Data Transfer Size: 128 Pages
Warning Comp. Temp. Threshold: 82 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 7.54W - - 0 0 0 0 0 0
1 + 7.54W - - 1 1 1 1 0 200
2 + 7.54W - - 2 2 2 2 0 1000
3 - 0.0500W - - 3 3 3 3 2000 1200
4 - 0.0050W - - 4 4 4 4 500 9500
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 46 Celsius
Available Spare: 96%
Available Spare Threshold: 10%
Percentage Used: 3%
Data Units Read: 60,734,021 [31.0 TB]
Data Units Written: 54,941,078 [28.1 TB]
Host Read Commands: 287,178,400
Host Write Commands: 728,180,577
Controller Busy Time: 2,680
Power Cycles: 10
Power On Hours: 3,180
Unsafe Shutdowns: 4
Media and Data Integrity Errors: 10
Error Information Log Entries: 10
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 46 Celsius
Temperature Sensor 2: 56 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 10 3 0x03e5 0xc502 0x000 176284544 1 -
1 9 5 0x033b 0x4502 0x000 176284544 1 -
2 8 8 0x0149 0xc502 0x000 176284544 1 -
3 7 1 0x00ba 0x4502 0x000 176284544 1 -
4 6 7 0x00c4 0xc502 0x000 176284544 1 -
5 5 1 0x008d 0x4502 0x000 176284544 1 -
6 4 5 0x033c 0xc502 0x000 176284544 1 -
7 3 2 0x0133 0x4502 0x000 176284544 1 -
8 2 1 0x00a0 0xc502 0x000 176284544 1 -
9 1 2 0x0111 0x4502 0x000 176284544 1 -
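In case it helps anyone read the Status column in the error logs above, here is a minimal Python sketch that splits those values into fields. It assumes smartctl 7.1 prints the raw 16-bit status word from the NVMe error log entry, with the phase tag in bit 0 and the completion status field in bits 15:1; that is my reading of the NVMe spec layout, not something the output itself confirms. The function name decode_nvme_status is made up for illustration.

```python
def decode_nvme_status(raw):
    """Split a 16-bit NVMe error-log status word into its fields.

    Assumption: bit 0 is the phase tag and bits 15:1 hold the status
    field, laid out as SC (8 bits), SCT (3 bits), CRD (2 bits),
    More (1 bit), DNR (1 bit).
    """
    status = raw >> 1  # drop the phase tag in bit 0
    return {
        "sc": status & 0xFF,                   # status code
        "sct": (status >> 8) & 0x7,            # status code type
        "crd": (status >> 11) & 0x3,           # command retry delay
        "more": bool((status >> 13) & 0x1),    # more info in the error log
        "dnr": bool((status >> 14) & 0x1),     # do not retry
    }


# The two Status values that appear in both drives' error logs.
for raw in (0x4502, 0xC502):
    f = decode_nvme_status(raw)
    print(f"0x{raw:04x}: SCT=0x{f['sct']:x} SC=0x{f['sc']:02x} "
          f"More={f['more']} DNR={f['dnr']}")
```

Under that assumption, both values decode to status code type 0x2 (media and data integrity errors) with status code 0x81 (unrecovered read error). The only difference is that 0xc502 also has the "do not retry" bit set.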