NSFW 2.0 Memory Errors

Hey all,
I keep getting kernel panics and reboots on my 2.0 server due to random memory issues. It seems that once I sit close to 100% memory usage for a few days, I hit a bad memory module and the server crashes. Any help would be much appreciated! For reference, here’s my build: Anniversary SNAFU 2.0 Example Build (Virtualization)

I currently have 128GB installed, a mix of two kinds of DIMMs:

  1. Samsung 64GB (8x8GB) 2Rx4 PC3-10600R DDR3 ECC REG Server Memory (M393B1K70DH0)
  2. Samsung M393B1K70CHD-YH9 8GB PC3L-10600R Server DIMM

I’m currently running Proxmox on this host with ZFS datasets for the data and boot disks. As memory usage climbs and sits near 100% for a few days, I get the following errors and the host locks up until I reboot it. Here’s what I can see:

# Memory errors
kernel: [137160.613217] EDAC MC1: 2 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1bbefd6 offset:0x700 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:1 ha:0 channel_mask:1 rank:1)
kernel: [137160.697675] EDAC MC1: 2 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x1bbefd6 offset:0x740 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:1 ha:0 channel_mask:2 rank:1)
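
Side note: the running per-DIMM tallies also live in sysfs (the exact layout varies by kernel version; older ones expose csrow directories, newer ones per-DIMM files). Grepping the counter files prints filename:count pairs, so the totals are visible at a glance:

# running CE totals (path layout depends on kernel version)
$ grep . /sys/devices/system/edac/mc/mc*/csrow*/ch?_ce_count
$ grep . /sys/devices/system/edac/mc/mc*/dimm*/dimm_ce_count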

$ ras-mc-ctl --guess-labels
memory stick 'P1_DIMMA1' is located at 'Node0_Bank0'
memory stick 'P1_DIMMA2' is located at 'Node0_Bank0'
memory stick 'P1_DIMMB1' is located at 'Node0_Bank0'
memory stick 'P1_DIMMB2' is located at 'Node0_Bank0'
memory stick 'P1_DIMMC1' is located at 'Node0_Bank0'
memory stick 'P1_DIMMC2' is located at 'Node0_Bank0'
memory stick 'P1_DIMMD1' is located at 'Node0_Bank0'
memory stick 'P1_DIMMD2' is located at 'Node0_Bank0'
memory stick 'P2_DIMME1' is located at 'Node1_Bank0'
memory stick 'P2_DIMME2' is located at 'Node1_Bank0'
memory stick 'P2_DIMMF1' is located at 'Node1_Bank0'
memory stick 'P2_DIMMF2' is located at 'Node1_Bank0'
memory stick 'P2_DIMMG1' is located at 'Node1_Bank0'
memory stick 'P2_DIMMG2' is located at 'Node1_Bank0'
memory stick 'P2_DIMMH1' is located at 'Node1_Bank0'
memory stick 'P2_DIMMH2' is located at 'Node1_Bank0'

Maybe these map like so?

CPU_SrcID#1_Ha#0_Chan#0_DIMM#0
CPU_SrcID#1_Ha#0_Chan#1_DIMM#0

memory stick 'P1_DIMMA1' is located at 'Node0_Bank0'
memory stick 'P2_DIMME2' is located at 'Node1_Bank0'
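
For completeness, ras-mc-ctl can also print running error totals per DIMM location, and, if rasdaemon has been running, a history of the logged events:

$ ras-mc-ctl --error-count   # CE/UE totals per DIMM location
$ ras-mc-ctl --errors        # logged events (needs rasdaemon's database)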

Here’s an example of one whole memory error chunk out of the kernel log:

kernel: [137160.612524] mce: [Hardware Error]: Machine check events logged
kernel: [137160.613203] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
kernel: [137160.613204] EDAC sbridge MC1: CPU 8: Machine Check Event: 0 Bank 7: cc00008000010092
kernel: [137160.613205] EDAC sbridge MC1: TSC 4ac7411cc9a75
kernel: [137160.613205] EDAC sbridge MC1: ADDR 1bbefd6700
kernel: [137160.613206] EDAC sbridge MC1: MISC 424c8086
kernel: [137160.613207] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1619851803 SOCKET 1 APIC 20
kernel: [137160.613217] EDAC MC1: 2 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1bbefd6 offset:0x700 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:1 ha:0 channel_mask:1 rank:1)
kernel: [137160.697023] mce: [Hardware Error]: Machine check events logged
kernel: [137160.697661] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
kernel: [137160.697663] EDAC sbridge MC1: CPU 8: Machine Check Event: 0 Bank 7: cc00008000010092
kernel: [137160.697663] EDAC sbridge MC1: TSC 4ac74191fdc4f
kernel: [137160.697664] EDAC sbridge MC1: ADDR 1bbefd6740
kernel: [137160.697664] EDAC sbridge MC1: MISC 425a3a86
kernel: [137160.697665] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1619851803 SOCKET 1 APIC 20

I’ve tried paging through the output of dmidecode -t memory and searching for that memory address, but it’s not all that helpful.
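
One avenue that can work, depending on the board: DMI type 20 records (“Memory Device Mapped Address”) give a starting/ending physical address per device, which would let you match the faulting address to a stick directly, if the firmware actually populates them:

# per-device address ranges, if the firmware fills in type 20 records
$ dmidecode -t 20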

Is there any simple way to identify the bad module from this information? I found a pretty helpful article (“need help decoding EDAC memory error messages” on the CentOS forums), but I’m struggling to compute the hex memory ranges that this error is occurring within.
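
For anyone attempting the same math: the EDAC page field is a 4 KiB page frame number, so shifting it left by 12 bits and adding the offset reproduces the raw ADDR from the MCE record (the “ADDR 1bbefd6700” line above). Here that lands just under 111 GiB into physical address space, which, with channel interleaving, still doesn’t pin down a stick by itself:

$ printf '0x%x\n' $(( (0x1bbefd6 << 12) + 0x700 ))
0x1bbefd6700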

I’d prefer not to reboot and run memtest86 for eons just to confirm the memory is bad; I already know there is a bad memory module, I just need to figure out which one.

This is what I’m thinking I should try to swap out, but I’d like a second set of eyes…

Any ideas?

You mixed ECC and non-ECC?

Did I? I’m looking into that now. If so, that would explain it. lol

The memory I bought only differs in the last few characters. The “ECC” one is M393B1K70DH0CHD-YH9 1128; the other stuff I bought is M393B1K70DH0CHD-YH9 1108 / M393B1K70DH0-CH9

Naw, those are both ECC RDIMMs; they should be fine. My guess is A1/B1, but unfortunately you’re probably going to have to binary-search to find the flaky DIMM. Also remember it could be the slot (though if it only happens when memory is near full, it’s probably not the slot). And memtest isn’t perfect; passing it isn’t a guarantee. Sorry for the bad news.

Do you have any tips on performing a binary search like that? I’ve never done one before. Are you talking about finding the address range of each memory stick and seeing where the fault lies within that range?

Do you have any tutorials on that?

Re. memtest, that’s kinda why I’ve been avoiding it. It happens so randomly after days and days of being online.

I just meant isolating the fault to the granularity of a single stick. Something like a Latin square sequence: half in, half out; whichever half still throws errors, split it again.
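
Concretely, with 16 sticks that’s at most four pull-and-retest rounds (log2 of 16): pull half, run your usual workload until the error window has passed, keep whichever half still throws CEs, and halve again. Between rounds, watching the counters beats waiting for a lockup:

# poll the running CE totals while the box soaks under normal load
$ watch -n 600 'ras-mc-ctl --error-count'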

Here are the results of the memtest run… Not sure how something can have errors but pass…


[screenshot: memtest86 run completed with result PASS]

Any suggestions?

I think I might have figured out what’s up…

It turns out that IPMI has been logging all the memory issues, which is pretty neat!

Here’s what is produced:

 228 | 05/08/2021 | 20:19:14 | Memory | Correctable ECC (@DIMMG1(CPU2)) | Asserted
 229 | 05/08/2021 | 20:19:14 | Memory | Correctable ECC (@DIMMG1(CPU2)) | Asserted
 22a | 05/08/2021 | 20:19:15 | Memory | Correctable ECC (@DIMMG1(CPU2)) | Asserted
 22b | 05/08/2021 | 20:19:15 | Memory | Correctable ECC (@DIMMG1(CPU2)) | Asserted
 22c | 05/08/2021 | 20:19:15 | Memory | Correctable ECC (@DIMMG1(CPU2)) | Asserted
 22d | 05/08/2021 | 20:19:15 | Memory | Correctable ECC (@DIMMG1(CPU2)) | Asserted
 22e | 05/08/2021 | 20:19:15 | Memory | Correctable ECC (@DIMMG1(CPU2)) | Asserted
 22f | 05/08/2021 | 20:19:15 | Memory | Correctable ECC (@DIMMG1(CPU2)) | Asserted
 230 | 05/08/2021 | 20:19:16 | Memory | Correctable ECC (@DIMMG1(CPU2)) | Asserted
 231 | 05/08/2021 | 20:19:16 | Memory | Correctable ECC (@DIMMG1(CPU2)) | Asserted
 232 | 05/08/2021 | 20:19:43 | Memory | Correctable ECC (@DIMMG1(CPU2)) | Asserted
 233 | 05/08/2021 | 20:19:43 | Memory | Correctable ECC (@DIMMG1(CPU2)) | Asserted
 234 | 05/08/2021 | 20:19:43 | Memory | Correctable ECC (@DIMMG1(CPU2)) | Asserted
 235 | 05/08/2021 | 20:19:44 | Memory | Correctable ECC (@DIMMG1(CPU2)) | Asserted
 236 | 05/08/2021 | 20:19:44 | Memory | Correctable ECC (@DIMMG1(CPU2)) | Asserted
 237 | 05/08/2021 | 20:19:44 | Memory | Correctable ECC (@DIMMG1(CPU2)) | Asserted
 238 | 05/08/2021 | 20:19:44 | Memory | Correctable ECC (@DIMMG1(CPU2)) | Asserted
 239 | 05/08/2021 | 20:19:44 | Memory | Correctable ECC (@DIMMG1(CPU2)) | Asserted
 23a | 05/08/2021 | 20:19:44 | Memory | Correctable ECC (@DIMMG1(CPU2)) | Asserted
 23b | 05/11/2021 | 07:21:02 | Memory | Correctable ECC (@DIMMG1(CPU2)) | Asserted
 23c | 05/11/2021 | 07:35:07 | Memory | Correctable ECC (@DIMMG1(CPU2)) | Asserted
 23d | 05/11/2021 | 08:04:53 | Memory | Correctable ECC (@DIMMG1(CPU2)) | Asserted

I invert-matched DIMMG1 to make sure there weren’t any other DIMMs that were alerting, and it seems to just be DIMMG1.
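
(For the curious, the invert-match was just grep -v stacked on the same sel list command shown below:)

$ ipmitool -I lanplus -H <ipmi-ip-address> -U <user> -P '<password>' sel list \
    | grep -i 'Correctable ECC' | grep -v 'DIMMG1'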

Here’s how I got this output:

ipmitool -I lanplus -H <ipmi-ip-address> -U <user> -P '<password>' sel list
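
Plan from here: swap out DIMMG1, then clear the SEL so any fresh assertions stand out immediately (the EDAC counters in sysfs reset on reboot by themselves):

$ ipmitool -I lanplus -H <ipmi-ip-address> -U <user> -P '<password>' sel clear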