mdadm DegradedArray: software problem or hardware failure?

2024-06-27

My dedicated server at my hosting provider sent me the following email for each of the RAID arrays md0 / md1 / md2.

> This is an automatically generated mail message from mdadm running on example.com
> 
> A DegradedArray event had been detected on md device /dev/md/2.
> 
> Faithfully yours, etc.
> 
> P.S. The /proc/mdstat file currently contains the following:
> 
> Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
> md2 : active raid1 nvme0n1p3[0]
>       903479616 blocks super 1.2 [2/1] [U_]
>       bitmap: 7/7 pages [28KB], 65536KB chunk
> 
> md0 : active raid1 nvme0n1p1[0]
>       33520640 blocks super 1.2 [2/1] [U_]
> 
> md1 : active raid1 nvme0n1p2[0]
>       523264 blocks super 1.2 [2/1] [U_]
> 
> unused devices: <none>
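In the quoted /proc/mdstat text, `[2/1]` says each array expects two members but has only one active, and `[U_]` marks the second slot as missing. As a rough sketch (not part of the original post), this status could be extracted programmatically. The function name `degraded_arrays` and the sample constant are made up for illustration, and the parser only handles the simple layout shown in the mail.

```python
import re

# Sample copied from the mail above; on a live system the text would
# come from reading /proc/mdstat instead.
MDSTAT_SAMPLE = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md2 : active raid1 nvme0n1p3[0]
      903479616 blocks super 1.2 [2/1] [U_]
      bitmap: 7/7 pages [28KB], 65536KB chunk

md0 : active raid1 nvme0n1p1[0]
      33520640 blocks super 1.2 [2/1] [U_]

md1 : active raid1 nvme0n1p2[0]
      523264 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

ARRAY_RE = re.compile(r"^(md\S+)\s*:")                       # e.g. "md2 : active raid1 ..."
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")     # e.g. "[2/1] [U_]"


def degraded_arrays(mdstat_text):
    """Return {array: (expected, active, flags)} for arrays missing members."""
    degraded = {}
    current = None
    for raw in mdstat_text.splitlines():
        line = raw.strip()
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                degraded[current] = (expected, active, status.group(3))
            current = None
    return degraded


if __name__ == "__main__":
    for name, (expected, active, flags) in degraded_arrays(MDSTAT_SAMPLE).items():
        print(f"{name}: {active} of {expected} members active, flags [{flags}]")
```

For the sample from the mail, all three arrays are reported as running on one of two members.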

I can't tell whether this is a RAID synchronization problem or whether one of the drives is actually defective. I'm hoping a Linux expert can help.

Two Samsung NVMe devices are running in an mdadm software RAID.

$ lsblk
nvme1n1     259:0    0 894.3G  0 disk
├─nvme1n1p1 259:2    0    32G  0 part
├─nvme1n1p2 259:3    0   512M  0 part
└─nvme1n1p3 259:4    0 861.8G  0 part
nvme0n1     259:1    0 894.3G  0 disk
├─nvme0n1p1 259:5    0    32G  0 part
│ └─md0       9:0    0    32G  0 raid1 [SWAP]
├─nvme0n1p2 259:6    0   512M  0 part
│ └─md1       9:1    0   511M  0 raid1 /boot
└─nvme0n1p3 259:7    0 861.8G  0 part
  └─md2       9:2    0 861.6G  0 raid1 /

As the listing shows, nvme1n1 and its partitions are not part of any RAID array. The operating system evidently still recognizes nvme1n1.

$ dmesg 
[ 7664.380493] pcieport 0000:00:1b.4: AER: Corrected error received: 0000:00:1b.4
[ 7664.380514] pcieport 0000:00:1b.4: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 7664.380795] pcieport 0000:00:1b.4: AER:   device [8086:a32c] error status/mask=00000001/00002000
[ 7664.381066] pcieport 0000:00:1b.4: AER:    [ 0] RxErr
[ 7664.780438] pcieport 0000:00:1b.4: AER: Corrected error received: 0000:00:1b.4
[ 7664.780459] pcieport 0000:00:1b.4: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 7664.780739] pcieport 0000:00:1b.4: AER:   device [8086:a32c] error status/mask=00000001/00002000
[ 7664.781011] pcieport 0000:00:1b.4: AER:    [ 0] RxErr
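The kernel log above repeats the same corrected Physical Layer error on PCIe port 0000:00:1b.4. A minimal sketch for tallying such lines is shown here, assuming the message layout seen in this log. The function name `summarize_aer` is hypothetical, and the output does not tell you which drive sits behind the port; a PCI tree listing such as `lspci -t` would be needed for that.

```python
import re
from collections import Counter

# Sample copied from the dmesg output above.
DMESG_SAMPLE = """\
[ 7664.380493] pcieport 0000:00:1b.4: AER: Corrected error received: 0000:00:1b.4
[ 7664.380514] pcieport 0000:00:1b.4: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 7664.380795] pcieport 0000:00:1b.4: AER:   device [8086:a32c] error status/mask=00000001/00002000
[ 7664.381066] pcieport 0000:00:1b.4: AER:    [ 0] RxErr
[ 7664.780438] pcieport 0000:00:1b.4: AER: Corrected error received: 0000:00:1b.4
[ 7664.780459] pcieport 0000:00:1b.4: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 7664.780739] pcieport 0000:00:1b.4: AER:   device [8086:a32c] error status/mask=00000001/00002000
[ 7664.781011] pcieport 0000:00:1b.4: AER:    [ 0] RxErr
"""

# Matches only the "PCIe Bus Error" summary line of each report.
AER_RE = re.compile(
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): (?:AER: )?PCIe Bus Error: "
    r"severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+)"
)


def summarize_aer(dmesg_text):
    """Count AER reports per (PCI address, severity, error layer)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = AER_RE.search(line)
        if match:
            key = (match.group("dev"), match.group("severity"), match.group("layer"))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    for (dev, severity, layer), n in summarize_aer(DMESG_SAMPLE).items():
        print(f"{dev}: {n} x {severity} ({layer})")
```

A steadily growing count of corrected errors on one port may be worth including in the report to the provider, although it does not prove which component is at fault.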

lspci shows both NVMe devices:

$ lspci
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
03:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983

For example, here are the mdadm details for md0:

$ mdadm -D /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Sat Aug  7 19:34:45 2021
        Raid Level : raid1
        Array Size : 33520640 (31.97 GiB 34.33 GB)
     Used Dev Size : 33520640 (31.97 GiB 34.33 GB)
      Raid Devices : 2
     Total Devices : 1
       Persistence : Superblock is persistent

       Update Time : Fri Mar  4 17:42:37 2022
             State : clean, degraded
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

              Name : rescue:0
              UUID : 2e61cb41:dee3a004:b12de575:72c13ed0
            Events : 46

    Number   Major   Minor   RaidDevice State
       0     259        2        0      active sync   /dev/nvme0n1p1
       -       0        0        1      removed

/dev/nvme1n1p1 is not listed here. What does that mean for me?
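In the device table above, slot 1 shows `removed`, meaning the array currently has no member in that slot, while `Failed Devices : 0` means no attached member is marked faulty. The sketch below pulls the slot states out of `mdadm -D` output; the function name `member_slots` is invented for illustration and the regular expression assumes the table layout shown here. Running `mdadm --examine /dev/nvme1n1p1` would be one way to see whether that partition still carries an md superblock.

```python
import re

# The member table copied from the `mdadm -D /dev/md0` output above.
DETAIL_SAMPLE = """\
    Number   Major   Minor   RaidDevice State
       0     259        2        0      active sync   /dev/nvme0n1p1
       -       0        0        1      removed
"""

# Number, Major, Minor, RaidDevice, then the state words and optional device path.
SLOT_RE = re.compile(
    r"^\s*(?P<number>\d+|-)\s+(?P<major>\d+)\s+(?P<minor>\d+)"
    r"\s+(?P<slot>\d+|-)\s+(?P<rest>.+)$"
)


def member_slots(detail_text):
    """Return [(raid_slot, state, device_or_None)] from an `mdadm -D` table."""
    slots = []
    for line in detail_text.splitlines():
        match = SLOT_RE.match(line)
        if not match:
            continue  # header line or unrelated "Key : value" line
        words = match.group("rest").split()
        device = words[-1] if words[-1].startswith("/dev/") else None
        state = " ".join(words[:-1] if device else words)
        slots.append((match.group("slot"), state, device))
    return slots


if __name__ == "__main__":
    for slot, state, device in member_slots(DETAIL_SAMPLE):
        print(f"slot {slot}: {state} ({device or 'no device'})")
```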

My mdadm.conf file:

# mdadm.conf
#
# !NB! Run update-initramfs -u after updating this file.
# !NB! This will ensure that initramfs has an uptodate copy.
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default (built-in), scan all partitions (/proc/partitions) and all
# containers for MD superblocks. alternatively, specify devices to scan, using
# wildcards if desired.
#DEVICE partitions containers

# automatically tag new arrays as belonging to the local system
HOMEHOST <system>

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

# definitions of existing MD arrays
ARRAY /dev/md/0  metadata=1.2 UUID=2e61cb41:dee3a004:b12de575:72c13ed0 name=rescue:0
ARRAY /dev/md/1  metadata=1.2 UUID=455ba7de:599eb665:202c1fe8:33c709f4 name=rescue:1
ARRAY /dev/md/2  metadata=1.2 UUID=c1f88478:e4ed5e8d:56f296cc:38e97b8c name=rescue:2
ARRAY /dev/md/0  metadata=1.2 UUID=e8c8f0cb:91007124:62e03226:94a707dc name=rescue:0
ARRAY /dev/md/1  metadata=1.2 UUID=a335efb7:cc52634c:3221294c:e7feb748 name=rescue:1
ARRAY /dev/md/2  metadata=1.2 UUID=f2a13b49:17f5e812:8e7c5adf:3114a929 name=rescue:2

# This configuration was auto-generated on Sat, 07 Aug 2021 19:35:14 +0200 by mkconf
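The file above defines each of /dev/md/0, /dev/md/1 and /dev/md/2 twice with different UUIDs; the `mdadm -D` output earlier reports UUID 2e61cb41:dee3a004:b12de575:72c13ed0 for md0, which matches the first entry. Why the second set exists is not stated in the post. As a sketch, such conflicting entries could be spotted like this. The helper names are hypothetical, and the parser ignores abbreviated keywords and continuation lines that mdadm.conf(5) also permits.

```python
from collections import defaultdict

# ARRAY lines copied from the mdadm.conf above.
CONF_SAMPLE = """\
HOMEHOST <system>
MAILADDR root
ARRAY /dev/md/0  metadata=1.2 UUID=2e61cb41:dee3a004:b12de575:72c13ed0 name=rescue:0
ARRAY /dev/md/1  metadata=1.2 UUID=455ba7de:599eb665:202c1fe8:33c709f4 name=rescue:1
ARRAY /dev/md/2  metadata=1.2 UUID=c1f88478:e4ed5e8d:56f296cc:38e97b8c name=rescue:2
ARRAY /dev/md/0  metadata=1.2 UUID=e8c8f0cb:91007124:62e03226:94a707dc name=rescue:0
ARRAY /dev/md/1  metadata=1.2 UUID=a335efb7:cc52634c:3221294c:e7feb748 name=rescue:1
ARRAY /dev/md/2  metadata=1.2 UUID=f2a13b49:17f5e812:8e7c5adf:3114a929 name=rescue:2
"""


def uuids_per_array(conf_text):
    """Map each ARRAY device name to the UUIDs listed for it, in file order."""
    found = defaultdict(list)
    for line in conf_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0].upper() != "ARRAY":
            continue  # comments, blank lines, other keywords
        device = fields[1]
        for field in fields[2:]:
            key, _, value = field.partition("=")
            if key.lower() == "uuid":
                found[device].append(value)
    return dict(found)


def conflicting_arrays(conf_text):
    """Return only the device names that are listed with more than one UUID."""
    return {
        device: uuids
        for device, uuids in uuids_per_array(conf_text).items()
        if len(set(uuids)) > 1
    }


if __name__ == "__main__":
    for device, uuids in conflicting_arrays(CONF_SAMPLE).items():
        print(device, "->", ", ".join(uuids))
```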

I hope you can help me.

Best answer 1

This is a hardware-level error. Since this is a hosted server, ask the provider to replace the faulty hardware. Don't try to fix it yourself; this is what you are paying them for.

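You will need to schedule downtime with your hosting provider.
Make sure you are 100% certain which disk device has failed. (I have had good disks replaced before by vendors who should have known better.)
If possible, take a backup in case something goes wrong. You need backups anyway, so take an extra copy.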
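To be certain which physical drive to name in the replacement request, it can help to record the model and serial number of each NVMe controller. The sketch below reads them from sysfs. It assumes the kernel exposes `model`, `serial` and `address` attributes under `/sys/class/nvme/<controller>`; attributes that are absent come back as `None`. The function name `nvme_controllers` is invented for illustration.

```python
from pathlib import Path


def nvme_controllers(sysfs_root="/sys/class/nvme"):
    """Return one dict per NVMe controller with name, model, serial, address.

    The attribute names are an assumption about the running kernel;
    anything that is not present is reported as None.
    """
    controllers = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return controllers  # no NVMe class directory on this system
    for ctrl in sorted(root.iterdir()):
        info = {"name": ctrl.name}
        for attr in ("model", "serial", "address"):
            path = ctrl / attr
            info[attr] = path.read_text().strip() if path.is_file() else None
        controllers.append(info)
    return controllers


if __name__ == "__main__":
    for info in nvme_controllers():
        print(info)
```

Quoting the serial number of the drive that dropped out of the arrays (nvme1 in the listing above) leaves less room for the wrong disk to be pulled.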
