If you have identified a failed or failing disk, you can replace it using the MegaCli utility. The example below covers replacing a failed disk in a RAID 5 array that has three disks in total.
First, check the status of the RAID 5 array:
[root@raid log]# MegaCli64 -ldinfo -lALL -aALL
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :
RAID Level : Primary-5, Secondary-0, RAID Level Qualifier-3
Size : 929.458 GB
Parity Size : 464.729 GB
State : Degraded
Strip Size : 64 KB
Number Of Drives : 3
Span Depth : 1
Default Cache Policy: WriteBack, ReadAheadNone, Cached, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Cached, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None
Is VD Cached: Yes
Cache Cade Type : Read Only
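If this check is run from a script rather than read by eye, the State line is the field to extract. The following is a minimal sketch, not part of MegaCli: `vd_states` is a hypothetical helper name, and it assumes the output has the "Virtual Drive:" and "State :" lines shown in the example output above.

```shell
# Hypothetical helper (not a MegaCli feature).
# Reads `MegaCli64 -ldinfo -lALL -aALL` output on stdin and prints one line
# per virtual drive in the form "<virtual drive number> <state>".
# Assumes the "Virtual Drive:" and "State :" line layout shown above.
vd_states() {
    awk '
        { sub(/[ \t\r]+$/, "") }
        /^[ \t]*Virtual Drive:/ { vd = $3 }
        /^[ \t]*State[ \t]*:/ {
            sub(/^[ \t]*State[ \t]*:[ \t]*/, "")
            print vd, $0
        }
    '
}

# Usage (requires MegaCli64 and a supported controller):
#   MegaCli64 -ldinfo -lALL -aALL | vd_states
```

For the example array this prints `0 Degraded`. Any state other than the healthy one is a reason to look at the individual disks.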
The example above shows the state of the array as 'State : Degraded'. This means that at least one disk has failed or is not present in the array. Next, list all of the physical disks:
[root@raid log]# MegaCli64 -pdlist -aALL
The output of that command is quite long. In our example, the key fields for each disk are:
Enclosure Device ID: 252
Slot Number: 0
....
Firmware state: Online, Spun Up

Enclosure Device ID: 252
Slot Number: 1
....
Firmware state: Online, Spun Up

Enclosure Device ID: 252
Slot Number: 2
....
Firmware state: Online, Spun Up

Enclosure Device ID: 252
Slot Number: 3
....
Firmware state: Offline <==== This is what to look for
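Because the full listing is long, it can help to filter it down to the disks that are not online. The sketch below is one way to do that with awk. `non_online_drives` is a hypothetical helper name, and it assumes each disk's block contains the three field names shown above, in that order.

```shell
# Hypothetical helper (not a MegaCli feature).
# Reads `MegaCli64 -pdlist -aALL` output on stdin and prints
# "[enclosure:slot] <firmware state>" for every disk whose firmware state
# does not start with "Online".
# Assumes the field names and ordering shown in the listing above.
non_online_drives() {
    awk -F': *' '
        { sub(/[ \t\r]+$/, "") }
        /^[ \t]*Enclosure Device ID:/ { encl = $2 }
        /^[ \t]*Slot Number:/         { slot = $2 }
        /^[ \t]*Firmware state:/ {
            if ($2 !~ /^Online/) printf "[%s:%s] %s\n", encl, slot, $2
        }
    '
}

# Usage (requires MegaCli64 and a supported controller):
#   MegaCli64 -pdlist -aALL | non_online_drives
```

For the example listing this prints `[252:3] Offline`. The filter reports every state that is not Online, so hot spares and unconfigured disks would be listed too; it narrows the output but does not decide which disk has failed.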
In our example the failed disk is shown as 'Enclosure Device ID: 252' and 'Slot Number: 3', so in MegaCli syntax this drive is referenced as [252:3] in the examples below. Now that we know the enclosure ID (EID) and slot number of each drive, we can go ahead and remove the failed drive.
First, set the original disk offline, if an error has not already caused the controller to set it offline:
[root@raid log]# MegaCli64 -pdoffline -physdrv[252:3] -a0
Adapter: 0: EnclId-252 SlotId-3 state changed to OffLine.
Exit Code: 0x00
Mark the failed disk as missing:
[root@raid log]# MegaCli64 -pdmarkmissing -physdrv[252:3] -aAll
EnclId-252 SlotId-3 is marked Missing.
Exit Code: 0x00
Mark the failed disk as prepared for removal:
[root@raid log]# MegaCli64 -pdprprmv -physdrv[252:3] -a0
Prepare for removal Success
Exit Code: 0x00
Now you can replace the faulty disk. It may help to use the locate command to identify the disk physically:
[root@raid log]# MegaCli64 -pdlocate -start -physdrv[252:3] -a0
Adapter: 0: Device at EnclId-252 SlotId-3 -- PD Locate Start Command was successfully sent to Firmware
Exit Code: 0x00
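The offline, mark-missing, prepare-for-removal and locate commands all take the same [enclosure:slot] reference, so a typo in one of them is easy to make. As a rough sketch, the hypothetical helper below only prints the four commands for a given drive so they can be reviewed before being run by hand; it does not call MegaCli64 itself. It passes the same adapter number to every command, whereas the mark-missing example above uses -aAll.

```shell
# Hypothetical helper (not a MegaCli feature): a dry run that prints the
# removal commands from the steps above for one drive, without running them.
# Arguments: ENCL:SLOT (for example 252:3) and an optional adapter number
# (defaults to 0).
removal_commands() {
    drive="$1"
    adapter="${2:-0}"
    # Accept only <digits>:<digits>, to catch a mistyped drive reference.
    if ! printf '%s' "$drive" | grep -Eq '^[0-9]+:[0-9]+$'; then
        echo "usage: removal_commands ENCL:SLOT [ADAPTER]" >&2
        return 1
    fi
    for action in "-pdoffline" "-pdmarkmissing" "-pdprprmv" "-pdlocate -start"; do
        printf 'MegaCli64 %s -physdrv[%s] -a%s\n' "$action" "$drive" "$adapter"
    done
}

# Usage:
#   removal_commands 252:3 0
```

Skip the first printed command if the controller has already set the disk offline.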
Depending on your setup, there are two options:
If you use hot spares and the original hot spare has already been put into the RAID array, set the new disk to replace the hot spare that just went into service:
[root@raid log]# MegaCli64 -PDHSP -Set -PhysDrv[<enclosure#>:<disk#>] -a<adapter#>
If you don't use hot spares, you will need to add the disk to the array and start the rebuild manually:
[root@raid log]# MegaCli64 -PdReplaceMissing -PhysDrv[252:3] -Array0 -row0 -a0
[root@raid log]# MegaCli64 -PDRbld -Start -PhysDrv[252:3] -a0
Optional: you can watch the rebuild progress. Depending on the size of the array, this may take a considerable amount of time. The RAID array remains usable during the rebuild, but expect reduced performance until it completes.
[root@raid log]# MegaCli64 -PDRbld -ShowProg -PhysDrv[252:3] -a0
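Since the rebuild can run for hours, polling the progress command from a small loop may be more convenient than rerunning it by hand. The sketch below rests on an assumption that this article does not confirm: that the progress output contains a phrase of the form "Completed 47%". Check that against the output of your MegaCli version before relying on it. Both function names are hypothetical.

```shell
# Hypothetical helpers (not MegaCli features).
# ASSUMPTION: the -ShowProg output contains "Completed <N>%"; verify this
# against your own MegaCli version.

# Reads rebuild-progress text on stdin and prints the first percentage found,
# or nothing if no "Completed <N>%" phrase is present.
rebuild_percent() {
    sed -n 's/.*Completed \([0-9][0-9]*\)%.*/\1/p' | head -n 1
}

# Polls the progress command until it stops reporting a percentage.
# Arguments: ENCL:SLOT, optional adapter (default 0), optional interval in
# seconds (default 60).
wait_for_rebuild() {
    drive="$1"
    adapter="${2:-0}"
    interval="${3:-60}"
    while :; do
        pct=$(MegaCli64 -PDRbld -ShowProg "-PhysDrv[$drive]" "-a$adapter" | rebuild_percent)
        [ -n "$pct" ] || break
        echo "Rebuild of [$drive] at ${pct}%"
        sleep "$interval"
    done
    echo "No rebuild progress reported for [$drive]"
}

# Usage (requires MegaCli64 and a supported controller):
#   wait_for_rebuild 252:3 0 300
```

The loop ends as soon as no percentage is reported, which covers a finished rebuild but also a rebuild that never started, so confirm the final array state with the -ldinfo command used at the beginning.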