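I need to call a script every time an EDAC error occurs on the system.
To do that, I created the udev rule below. I want /var/tmp/test.sh to run whenever ce_count changes.
I reloaded the rules with udevadm control --reload-rules && udevadm trigger, ran udevadm monitor, and injected an error with mce-inject. The error occurred, but the script was not executed.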
# cat /etc/udev/rules.d/98-edac.rules
ACTION=="change", ATTR{ce_count}, KERNEL=="mc0", RUN+="/var/tmp/test.sh"
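To tell "the rule never matched" apart from "the script ran but failed", it can help to have the script do nothing except record that it was called. The following is a minimal sketch, not the actual contents of /var/tmp/test.sh (which are not shown here); the function name log_udev_event and the log path in the usage note are hypothetical. It relies on udev exporting ACTION and DEVPATH into the environment of RUN programs.

```shell
#!/bin/sh
# Hypothetical stand-in for the body of /var/tmp/test.sh.
# Appends one line per invocation so it is obvious whether udev ran the script.
# ACTION and DEVPATH are set by udev for RUN programs; they fall back to
# "unknown" so the function can also be called by hand.
log_udev_event() {
    logfile=$1
    printf '%s action=%s devpath=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" \
        "${ACTION:-unknown}" \
        "${DEVPATH:-unknown}" >> "$logfile"
}
```

Calling log_udev_event /var/tmp/test.log as the last line of the script would leave one entry per udev invocation. If the log stays empty after an injected error, the rule did not fire at all.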
# udevadm info -ap /sys/devices/system/edac/mc/mc0
Udevadm info starts with the device specified by the devpath and then
walks up the chain of parent devices. It prints for every device
found, all possible attributes in the udev rules key format.
A rule to match, can be composed by the attributes of the device
and the attributes from one single parent device.
looking at device '/devices/system/edac/mc/mc0':
KERNEL=="mc0"
SUBSYSTEM=="mc0"
DRIVER==""
ATTR{ce_count}=="21"
ATTR{ce_noinfo_count}=="0"
ATTR{max_location}=="channel 7 slot 2 "
ATTR{mc_name}=="Broadwell Socket#0"
ATTR{seconds_since_reset}=="5223"
ATTR{size_mb}=="65536"
ATTR{ue_count}=="0"
ATTR{ue_noinfo_count}=="0"
looking at parent device '/devices/system/edac/mc':
KERNELS=="mc"
SUBSYSTEMS=="edac"
DRIVERS==""
looking at parent device '/devices/system/edac':
KERNELS=="edac"
SUBSYSTEMS==""
DRIVERS==""
To trigger the EDAC/MCE error, I used mce-inject as follows:
./mce-inject ./basic-inject.txt
# cat basic-inject.txt
CPU 0 BANK 8
STATUS corrected
ADDR 0x12345125
MCGCAP 0x7000c16
APICID 0
MCGSTATUS 0
SOCKETID 0
MISC 0x50683286
STATUS 0x8c00004000010090
After injecting the error, the following entries appear in the kernel syslog / dmesg:
[ +4.436747] Starting machine check poll CPU 0
[ +0.000013] mce: [Hardware Error]: Machine check events logged
[ +0.000008] Machine check poll done on CPU 0
[ +0.000030] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ +0.000002] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: 8c00004000010090
[ +0.000001] EDAC sbridge MC0: TSC 0
[ +0.000002] EDAC sbridge MC0: ADDR 12345100
[ +0.000000] EDAC sbridge MC0: MISC 50683286
[ +0.000002] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1593625089 SOCKET 0 APIC 0
[ +0.000005] EDAC DEBUG: get_memory_error_data: SAD interleave package: 0 = CPU socket 0, HA 0, shiftup: 1
[ +0.000005] EDAC DEBUG: get_memory_error_data: TAD#0: address 0x0000000012345100 < 0x000000007fffffff, socket interleave 0, channel interleave 2 (offset 0x00000000), index 0, base ch: 2, ch mask: 0x04
[ +0.000007] EDAC DEBUG: get_memory_error_data: RIR#0, limit: 31.999 GB (0x00000007ffffffff), way: 4
[ +0.000002] EDAC DEBUG: get_memory_error_data: RIR#0: channel address 0x091a2880 < 0x7ffffffff, RIR interleave 2, index 1
[ +0.000002] EDAC DEBUG: sbridge_mce_output_error: area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:4 rank:4
[ +0.000007] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#1 (channel:2 slot:1 page:0x12345 offset:0x100 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:4 rank:4)
[Jul 1 17:41] perf: interrupt took too long (3923 > 3920), lowering kernel.perf_event_max_sample_rate to 50000
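udev rules react to uevents, not to a sysfs attribute changing its value. If udevadm monitor shows no change event for mc0 when the error is injected, a rule with ACTION=="change" has nothing to match, even though ce_count goes up. In that case, one possible workaround is to poll the counter instead. The sketch below assumes the counter lives at /sys/devices/system/edac/mc/mc0/ce_count, as the udevadm info output suggests; the function name watch_counter and its arguments are hypothetical.

```shell
#!/bin/sh
# Poll a counter file and run a command whenever its value changes.
# Hypothetical helper; the name and arguments are illustrative only.
# usage: watch_counter FILE COMMAND [INTERVAL_SECONDS] [MAX_POLLS]
watch_counter() {
    file=$1
    cmd=$2
    interval=${3:-5}       # seconds between reads
    max_polls=${4:-0}      # 0 means poll forever
    last=$(cat "$file") || return 1
    polls=0
    while [ "$max_polls" -eq 0 ] || [ "$polls" -lt "$max_polls" ]; do
        sleep "$interval"
        current=$(cat "$file") || return 1
        if [ "$current" != "$last" ]; then
            # Pass the old and new values to the command as arguments.
            "$cmd" "$last" "$current"
            last=$current
        fi
        polls=$((polls + 1))
    done
}
```

For example, watch_counter /sys/devices/system/edac/mc/mc0/ce_count /var/tmp/test.sh 5 would check the counter every five seconds and run the script with the previous and current values whenever they differ. Polling trades latency for simplicity: an error is noticed at most one interval late, and several errors within one interval trigger a single call.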