SSD shows DMA errors at boot, but smartctl reports no errors

I installed an OCZ-ARC100 in a Dell Poweredge T105. When the system (CentOS 7) boots, the kernel logs BMDMA errors for the drive:

jun 25 15:40:21 myhost kernel: ata4.00: ATA-8: OCZ-ARC100, 1.01, max UDMA/133
jun 25 15:40:21 myhost kernel: ata4.00: 234441648 sectors, multi 1: LBA48 NCQ (depth 0/32)
jun 25 15:40:21 myhost kernel: ata4.00: configured for UDMA/133
jun 25 15:40:21 myhost kernel: scsi 3:0:0:0: Direct-Access     ATA      OCZ-ARC100       1.01 PQ: 0 ANSI: 5
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] 234441648 512-byte logical blocks: (120 GB/111 GiB)
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] Write Protect is off
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] Mode Sense: 00 3a 00 00
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
jun 25 15:40:21 myhost kernel:  sda: sda1 sda2 sda3
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] Attached SCSI disk
jun 25 15:40:21 myhost kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
jun 25 15:40:21 myhost kernel: ata4.00: BMDMA stat 0x5
jun 25 15:40:21 myhost kernel: ata4.00: failed command: READ DMA
jun 25 15:40:21 myhost kernel: ata4.00: cmd c8/00:08:00:4b:f9/00:00:00:00:00/ed tag 0 dma 4096 in
                                                        res 51/04:08:00:4b:f9/00:00:00:00:00/ed Emask 0x1 (device error)
jun 25 15:40:21 myhost kernel: ata4.00: status: { DRDY ERR }
jun 25 15:40:21 myhost kernel: ata4.00: error: { ABRT }
jun 25 15:40:21 myhost kernel: ata4.00: configured for UDMA/133
jun 25 15:40:21 myhost kernel: ata4: EH complete
...
jun 25 15:40:22 myhost kernel: ata4.00: configured for UDMA/133
jun 25 15:40:22 myhost kernel: ata4: EH complete
jun 25 15:40:22 myhost kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
jun 25 15:40:22 myhost kernel: ata4.00: BMDMA stat 0x5
jun 25 15:40:22 myhost kernel: ata4.00: failed command: READ DMA
jun 25 15:40:22 myhost kernel: ata4.00: cmd c8/00:08:d0:47:f9/00:00:00:00:00/ed tag 0 dma 4096 in
                                                        res 51/04:08:d0:47:f9/00:00:00:00:00/ed Emask 0x1 (device error)
jun 25 15:40:22 myhost kernel: ata4.00: status: { DRDY ERR }
jun 25 15:40:22 myhost kernel: ata4.00: error: { ABRT }
jun 25 15:40:22 myhost kernel: ata4.00: configured for UDMA/133
jun 25 15:40:22 myhost kernel: ata4: EH complete
jun 25 15:40:22 myhost kernel: ata4.00: limiting speed to UDMA/100:PIO4
jun 25 15:40:22 myhost kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
jun 25 15:40:22 myhost kernel: ata4.00: BMDMA stat 0x5
jun 25 15:40:22 myhost kernel: ata4.00: failed command: READ DMA
jun 25 15:40:22 myhost kernel: ata4.00: cmd c8/00:08:f8:47:f9/00:00:00:00:00/ed tag 0 dma 4096 in
                                                        res 51/04:08:f8:47:f9/00:00:00:00:00/ed Emask 0x1 (device error)
jun 25 15:40:22 myhost kernel: ata4.00: status: { DRDY ERR }
jun 25 15:40:22 myhost kernel: ata4.00: error: { ABRT }
jun 25 15:40:22 myhost kernel: ata4: hard resetting link
jun 25 15:40:22 myhost kernel: ata4: nv: skipping hardreset on occupied port
jun 25 15:40:22 myhost kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
jun 25 15:40:22 myhost kernel: ata4.00: configured for UDMA/100
jun 25 15:40:22 myhost kernel: ata4: EH complete
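
To make the `cmd`/`res` lines above easier to read, here is a minimal Python sketch that decodes them. It assumes the byte order libata uses when printing these lines (command or status, feature or error, sector count, three LBA bytes, five "hob" bytes, device) and that the command uses 28-bit LBA, as READ DMA (0xc8) does. The function name `decode_taskfile` is hypothetical, and only the register bits that appear in the kernel's own `status:`/`error:` lines are named.

```python
import re

# Subset of ATA status/error register bits, matching what the kernel log names.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

# Twelve hex bytes separated by '/' or ':', preceded by "cmd" or "res".
TASKFILE_RE = re.compile(r"(cmd|res) ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")


def flags(value, names):
    """Return the names of all bits from `names` that are set in `value`."""
    return [name for bit, name in names.items() if value & bit]


def decode_taskfile(line):
    """Decode one libata 'cmd ...' or 'res ...' line (28-bit LBA assumed)."""
    m = TASKFILE_RE.search(line)
    if m is None:
        raise ValueError("no cmd/res taskfile found in line")
    kind = m.group(1)
    regs = [int(x, 16) for x in re.split(r"[/:]", m.group(2))]
    first, second, nsect, lbal, lbam, lbah = regs[:6]
    device = regs[11]
    # LBA28: bits 27..24 live in the low nibble of the device register.
    lba = ((device & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
    out = {"kind": kind, "sectors": nsect, "lba28": lba}
    if kind == "cmd":
        out["command"] = first
        out["feature"] = second
    else:
        out["status"] = flags(first, STATUS_BITS)
        out["error"] = flags(second, ERROR_BITS)
    return out


if __name__ == "__main__":
    print(decode_taskfile(
        "ata4.00: cmd c8/00:08:00:4b:f9/00:00:00:00:00/ed tag 0 dma 4096 in"))
    print(decode_taskfile(
        "res 51/04:08:00:4b:f9/00:00:00:00:00/ed Emask 0x1 (device error)"))
```

For the first failing command this gives an 8-sector read at LBA 234441472, which is inside the 234441648 sectors the drive reports, and the result decodes to DRDY + ERR with ABRT, matching the kernel's own summary lines.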

I then connected the OCZ to a SATA-USB2 adapter on another machine and ran smartctl:

smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.4.6-gentoo-nvidia] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     OCZ-ARC100
Serial Number:    A22L0061518000567
LU WWN Device Id: 5 e83a97 100061d69
Firmware Version: 1.01
User Capacity:    120.034.123.776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Jun 25 15:28:55 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (    0) seconds.
Offline data collection
capabilities:            (0x1d) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Abort Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x00) Error logging NOT supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   0) minutes.
Extended self-test routine
recommended polling time:    (   0) minutes.

SMART Attributes Data Structure revision number: 18
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0000   000   000   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0000   100   100   000    Old_age   Offline      -       252
 12 Power_Cycle_Count       0x0000   100   100   000    Old_age   Offline      -       84
171 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       39711824
174 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       10
195 Hardware_ECC_Recovered  0x0000   100   100   000    Old_age   Offline      -       0
196 Reallocated_Event_Count 0x0000   100   100   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0000   100   100   000    Old_age   Offline      -       0
208 Unknown_SSD_Attribute   0x0000   100   100   000    Old_age   Offline      -       5
210 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
224 Unknown_SSD_Attribute   0x0000   100   100   000    Old_age   Offline      -       1
233 Media_Wearout_Indicator 0x0000   100   100   000    Old_age   Offline      -       100
241 Total_LBAs_Written      0x0000   100   100   000    Old_age   Offline      -       92
242 Total_LBAs_Read         0x0000   100   100   000    Old_age   Offline      -       221
249 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       3316691

SMART Error Log Version: 1
No Errors Logged

Warning! SMART Self-Test Log Structure error: invalid SMART checksum.
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported

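There are apparently no signs of errors here. I did not pay much attention to the BMDMA errors at first and simply assumed the drive was dying, but now I doubt that this is the correct diagnosis. I was also misled by the fact that a replacement drive (a new Western Digital Blue 500GB) works without errors. The difference, however, is that the OCZ is much faster by comparison.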
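
The "no signs of errors" reading can also be checked mechanically. Below is a minimal Python sketch, assuming the ten-column attribute table layout shown in the smartctl output above. The function names and the choice of watched IDs (5, 196, 197) are my own assumptions for illustration, not a smartctl feature.

```python
def parse_smart_attributes(text):
    """Return {id: (name, raw_value)} from smartctl attribute table rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have ten columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                raw = int(fields[9])
            except ValueError:
                continue  # skip rows whose raw value is not a plain integer
            attrs[int(fields[0])] = (fields[1], raw)
    return attrs


# Assumption: these reallocation/pending counters are the ones worth flagging.
WATCHED = (5, 196, 197)


def suspicious(attrs, watched=WATCHED):
    """Return the watched attributes whose raw value is non-zero."""
    return {i: attrs[i] for i in watched if i in attrs and attrs[i][1] != 0}


if __name__ == "__main__":
    sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0000   000   000   000    Old_age   Offline      -       0
196 Reallocated_Event_Count 0x0000   100   100   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0000   100   100   000    Old_age   Offline      -       0
"""
    print(suspicious(parse_smart_attributes(sample)))
```

With the values from the output above, nothing is flagged, which is consistent with what I read off the table by eye.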

How can the errors above (apparently DMA errors) be explained, and how can I fix this problem? For example, by flashing the OCZ firmware? By using specific kernel parameters?

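By the way, the BIOS forces the ATA bus option for the SATA disks; I cannot change it to AHCI, for example. I suspect this is due to the CD/DVD drive attached to the SATA bus, or to the Fusion MPT hardware RAID adapter. Either way, I (literally) have no choice here, but at least it does not seem to be a problem for the WD drive.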


Edit: I ran a drive self-test on the server itself, with the following result:

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-514.21.1.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 18
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0000   000   000   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0000   100   100   000    Old_age   Offline      -       253
 12 Power_Cycle_Count       0x0000   100   100   000    Old_age   Offline      -       85
171 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       39711824
174 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       10
195 Hardware_ECC_Recovered  0x0000   100   100   000    Old_age   Offline      -       0
196 Reallocated_Event_Count 0x0000   100   100   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0000   100   100   000    Old_age   Offline      -       0
208 Unknown_SSD_Attribute   0x0000   100   100   000    Old_age   Offline      -       5
210 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
224 Unknown_SSD_Attribute   0x0000   100   100   000    Old_age   Offline      -       1
233 Media_Wearout_Indicator 0x0000   100   100   000    Old_age   Offline      -       100
241 Total_LBAs_Written      0x0000   100   100   000    Old_age   Offline      -       92
242 Total_LBAs_Read         0x0000   100   100   000    Old_age   Offline      -       222
249 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       3316768

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       253         -

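Moreover, going by a hint I found, smartctl has the drive test itself internally, so I think it is safe to assume that the drive is not defective. I will investigate a bit further...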
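
For repeated checks, here is a small Python sketch that reads the self-test log rows from the smartctl output above. It assumes the column layout shown there (columns separated by two or more spaces); the names `parse_selftest_log` and `all_passed` are hypothetical.

```python
import re

# One log row: "# 1  Short offline  Completed without error  00%  253  -"
ROW_RE = re.compile(
    r"^#\s*(\d+)\s{2,}(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def parse_selftest_log(text):
    """Return one dict per self-test log row found in `text`."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line.strip())
        if m:
            num, desc, status, remaining, hours, lba = m.groups()
            rows.append({
                "num": int(num),
                "description": desc,
                "status": status,
                "remaining_percent": int(remaining),
                "lifetime_hours": int(hours),
                "lba_of_first_error": lba,
            })
    return rows


def all_passed(rows):
    """True only if at least one test is logged and every one completed cleanly."""
    return bool(rows) and all(
        r["status"] == "Completed without error" for r in rows
    )


if __name__ == "__main__":
    log = "# 1  Short offline       Completed without error       00%       253         -"
    print(all_passed(parse_selftest_log(log)))
```

An empty log counts as "not passed" here on purpose, since no logged test is not the same as a passed test.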

