RAM – Checking for errors
Run BreakIn
It can be difficult to tell whether a memory error is caused by hardware or software. To help determine this, we suggest running the ACT BreakIn utility, which helps rule out software-related causes.
Run memtest86+
memtest86+ is a free utility that tests writing to and reading from the system's RAM. If your system does not already have memtest86+ as a boot option, you can add it in CentOS by running the following as root:
$ yum install memtest86+
$ memtest-setup
This installs memtest86+ and runs the initial setup that adds it to the GRUB boot options. When you are ready to run the test, reboot the machine and select the Memtest86+ entry in the GRUB boot menu.
Check system logs
Memory-related errors can appear in many different ways. The following log files and commands are good places to look for them.
$ less /var/log/messages
$ cat /var/log/mcelog
$ dmesg
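Scanning these logs by eye can be slow, so a small filter may help. The following is a minimal sketch, not an official tool: the function name scan_mem_errors is hypothetical, and the keyword list is an assumption that will not catch every memory-related message.

```shell
# Hypothetical helper: read log text on stdin and print only the lines
# that contain common memory-error keywords (case-insensitive).
# The keyword list is an assumption and is not exhaustive.
scan_mem_errors() {
    grep -i -E 'machine check|mce:|edac|ecc error|hardware error|memory error|out of memory'
}
```

For example, "dmesg | scan_mem_errors" or "scan_mem_errors < /var/log/messages" prints only the matching lines. Reading these logs may require root.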
If your DIMMs support ECC, the edac-util program can read information from the EDAC (Error Detection and Correction) drivers in the kernel, using the files these drivers export to record corrected and uncorrected errors. This can also help narrow down which DIMM the errors are coming from.
$ edac-util -v
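If edac-util is not installed, the per-controller counters can usually be read directly from the files the EDAC drivers export. The sketch below assumes the standard sysfs location /sys/devices/system/edac/mc and the ce_count and ue_count files under each memory controller. The function name edac_counts is hypothetical, and the layout may differ on your kernel.

```shell
# Hypothetical helper: print corrected (ce) and uncorrected (ue) error
# counts for each EDAC memory controller found under the given directory.
# The default path is an assumption based on the usual EDAC sysfs layout.
edac_counts() {
    base="${1:-/sys/devices/system/edac/mc}"
    for mc in "$base"/mc*; do
        # Skip the unexpanded glob when no controllers are present.
        [ -d "$mc" ] || continue
        ce=$(cat "$mc/ce_count" 2>/dev/null || echo "?")
        ue=$(cat "$mc/ue_count" 2>/dev/null || echo "?")
        echo "$(basename "$mc"): corrected=$ce uncorrected=$ue"
    done
}
```

Running edac_counts with no arguments reads the default path. It prints nothing if no EDAC driver is loaded.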
If you are unsure about any of the output from the utilities above, send it to support@advancedclustering.com and we will gladly look it over for you.