You can skip to the bottom of this article: at this point it seems the issue was corrected with a RAID card firmware update. Thanks to HighPoint for being so quick with support on this issue.
A couple of months ago I colocated my new server, and I have been slowly migrating all the data from my old server to the new one. All this time, the new server has been solid. That was until last night, when my RAID card decided to drop a couple of drives. That made the system unresponsive and resulted in about 6 hours of downtime while my datacenter guy and I tried to diagnose the problem and get it back online. For anyone thinking about putting WD Green drives in a server, you might want to rethink that.
Before I get into my detailed discussion, I think it's good to give everyone some background on my hardware. The server is a Supermicro 1U rack server with dual Intel Xeon 5570 CPUs, 48GB of RAM, dual Gbit ports, and 8 2.5" hot-swap hard drive trays. I am running 6 x Samsung 840 SSD drives (250GB each) and 2 x WD Green drives (2TB each). The RAID card is a HighPoint RocketRaid 4520, configured with RAID 6 for the SSD drives (any 2 drives can fail and I am still alive) and JBOD for the WD Green drives.
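For anyone wondering what that RAID 6 layout costs in space, here is a quick back-of-the-envelope sketch. It assumes the usual RAID 6 rule of thumb (two drives' worth of capacity goes to parity) and ignores controller and filesystem overhead; the function name is just something I made up for illustration.

```python
def raid6_usable_gb(drive_count, drive_size_gb):
    """Rough usable capacity of a RAID 6 array.

    RAID 6 stores two independent parity blocks per stripe, so the
    capacity of two drives is given up in exchange for surviving any
    two simultaneous drive failures.
    """
    if drive_count < 4:
        # RAID 6 is normally defined for four or more drives.
        raise ValueError("RAID 6 typically needs at least 4 drives")
    return (drive_count - 2) * drive_size_gb


# The SSD array described above: 6 drives of 250GB each.
print(raid6_usable_gb(6, 250))  # 1000
```

So the six 250GB SSDs give roughly 1TB of usable space, with the other 500GB spent on being able to lose any two drives.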
I have my RAID card set up to email me notifications if something goes wrong.
At 6:15AM I got an email from my server:
hptiop: Disk 'WDC WD20NPVT-00Z2TT0-WD-WX21AB2F7090' at Controller1-Channel4 failed.
At 9:33AM, after a reboot, I got another email from my server:
hptiop: Disk 'WDC WD20NPVT-00Z2TT0-WD-WXB1E92PDS94' at Controller1-Channel3 failed.
hptiop: Disk 'WDC WD20NPVT-00Z2TT0-WD-WX21AB2F7090' at Controller1-Channel4 failed.
hptiop: An error occured on the disk at 'WDC WD20NPVT-00Z2TT0-WD-WX21AB2F7090' at Controller1-Channel4.
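If you collect these notification emails over time, a small script can tally which channels keep failing. This is only a sketch: the pattern is based purely on the messages quoted above, and other firmware versions may word them differently. The names `FAILED_RE` and `parse_failures` are mine, not anything from HighPoint.

```python
import re
from collections import Counter

# Pattern inferred only from the notification lines quoted in this post.
FAILED_RE = re.compile(
    r"hptiop: Disk '(?P<disk>[^']+)' at "
    r"Controller(?P<controller>\d+)-Channel(?P<channel>\d+) failed\."
)


def parse_failures(lines):
    """Return (disk, controller, channel) for every 'failed' notification."""
    failures = []
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            failures.append((
                match.group("disk"),
                int(match.group("controller")),
                int(match.group("channel")),
            ))
    return failures


# The four messages from the two emails above.
SAMPLE = [
    "hptiop: Disk 'WDC WD20NPVT-00Z2TT0-WD-WX21AB2F7090' at Controller1-Channel4 failed.",
    "hptiop: Disk 'WDC WD20NPVT-00Z2TT0-WD-WXB1E92PDS94' at Controller1-Channel3 failed.",
    "hptiop: Disk 'WDC WD20NPVT-00Z2TT0-WD-WX21AB2F7090' at Controller1-Channel4 failed.",
    "hptiop: An error occured on the disk at 'WDC WD20NPVT-00Z2TT0-WD-WX21AB2F7090' at Controller1-Channel4.",
]

failures = parse_failures(SAMPLE)
per_channel = Counter(channel for _, _, channel in failures)
print(per_channel)  # channel 4 failed twice, channel 3 once
```

The last message is an error report rather than a failure, so the pattern deliberately skips it.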
My OS runs off the RAID 6 SSD array. The only time the WD drives are used is when backups are being stored. So it didn't make sense that the loss of the WD drives would cause the whole system to lock up. But that is exactly what happened.
The WD Green drives are not enterprise drives. They are consumer drives, meant to store your precious files and use very little energy. They ARE NOT meant to be run in high-throughput environments. RAID, NAS, DB servers: these are not where you want to see WD Green drives. I knew this when I built my server, but I didn't think using them as standalone drives for data backups would be an issue. I was wrong.
Here is what happens with consumer hard drives. If the drive detects a bad sector, it will attempt to recover the data and move it to a spare sector on the drive. It then records the move in its SMART data, under the Reallocated Sectors Count attribute. This might be a good feature for consumers, who don't tend to have any fault tolerance in their hardware, but in a server environment you don't want a hard drive to hit the brakes while it attempts a deep recovery of some data. Everything in your server will come to a standstill while the hard drive attempts to repair itself.
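If you want to keep an eye on reallocated sectors yourself, one option is to parse the attribute table that the smartmontools `smartctl -A` command prints. The sketch below works on a saved copy of that output. The sample values are made up for illustration and are not readings from my drives, and whether SMART data is reachable at all for drives sitting behind a RAID card depends on the controller and driver, so treat this as a starting point only.

```python
# Illustrative sample in the layout smartctl -A uses for ATA drives.
# These values are invented for the example, not real readings.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       2160
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
"""


def read_raw_attributes(text):
    """Map attribute name -> raw value for rows of a SMART attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric attribute ID and have 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # skip raw values that are not plain integers
    return attrs


def needs_attention(attrs):
    """True if any sectors were remapped or are waiting to be remapped."""
    return (attrs.get("Reallocated_Sector_Ct", 0) > 0
            or attrs.get("Current_Pending_Sector", 0) > 0)


attrs = read_raw_attributes(SAMPLE)
print(attrs["Reallocated_Sector_Ct"], needs_attention(attrs))
```

A raw value above zero on either attribute does not mean the drive is dead, but it is a sign that the drive has been doing the kind of recovery work described above.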
WD enterprise-grade hard drives have a feature called TLER (Time Limited Error Recovery). This feature means that the hard drive will only spend 7 seconds attempting to recover bad sectors before giving up and letting the system continue with business as usual. Without TLER, the hard drive is free to hold up the system for minutes or even hours, depending on how long it takes to attempt a deep repair of the data. During this time the RAID card will see that the drive is not responding and will drop it out of the RAID, causing your system to slow down as it tries to rebuild your array. However, even if your drive is not in RAID, it can still hold up the entire system while it attempts to repair itself.
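To make the timing argument concrete, here is a toy model of it. The 7 second TLER limit comes from the paragraph above; the 8 second controller timeout is purely an assumption for illustration, since I don't know what timeout the RocketRaid 4520 actually uses.

```python
TLER_LIMIT_S = 7.0          # recovery time limit described above
CONTROLLER_TIMEOUT_S = 8.0  # ASSUMPTION: hypothetical controller timeout


def unresponsive_seconds(recovery_needed_s, tler_enabled):
    """How long the drive stays busy with error recovery."""
    if tler_enabled:
        # With TLER the drive gives up once the limit is reached.
        return min(recovery_needed_s, TLER_LIMIT_S)
    # Without TLER the drive keeps trying for as long as it takes.
    return recovery_needed_s


def controller_drops_drive(recovery_needed_s, tler_enabled,
                           controller_timeout_s=CONTROLLER_TIMEOUT_S):
    """True if the drive is silent for longer than the controller tolerates."""
    return unresponsive_seconds(recovery_needed_s, tler_enabled) > controller_timeout_s


# A bad sector that would take two minutes of deep recovery:
print(controller_drops_drive(120, tler_enabled=True))   # False
print(controller_drops_drive(120, tler_enabled=False))  # True
```

The point of the model is simply that TLER keeps the drive's silence shorter than the controller's patience, so the controller never has a reason to drop it.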
This was what happened to me. The whole system seemed locked up. Pings were responding, but I couldn't connect to the server via FTP, Remote Desktop, or HTTP. At the same time, my connection attempts did not time out. It was as if the system was too busy doing something else. After a hard reboot, the system booted back up and I was able to get back in, though I noticed my RAID card had dropped both WD drives. After some DB repairs and Windows updates, I tried clearing one of the WD drives on the RAID card so it would come back to life. But after the reboot the system did not come back. I had my datacenter guy look at the system, and he said it was acting like it was about to boot up just after Windows Updates, but it had been stalled for over 30 minutes. I told him to pull the 2 Green drives from the machine and then reboot. He did that, and the system came back up within 2 minutes.
WD Green, Blue and Black drives do not support TLER. WD Red drives and RE drives do. So I have ordered a few WD Red drives to replace the Green drives. Hopefully this will be the last time I have to deal with this issue. Interestingly, NewEgg.com no longer sells the 2.5" WD Green drives in the 2TB size. I can only imagine they had too many returns with them.
Once I get the drives back from the datacenter I will test them and report back what I find.
I contacted HighPoint (the manufacturer of the RocketRaid 4520 RAID card) about the issue I was having. They got back to me within 24 hours and provided a BIOS update that they suggested might solve the issue. My RAID card is currently running on the v1.3 BIOS with Firmware v188.8.131.52. However, the BIOS update they provided was BIOS v1.3 with Firmware 184.108.40.206. I found it interesting that the firmware update on HighPoint's own website was only v.220.127.116.11. It seems they are holding back on releasing this new firmware except in special circumstances.
Taking a look at the Revision History in the included Readme file showed the following:
4. Revision History
Fixed the poor performance issue with SAS devices behind the SAS expander.
Fixed compatibility issue with Sans Digital SAS expander.
Add support Marvell 9715 chipset PM.
Fixed a compatibility issue with WD 4TB SAS HDD.
Nowhere did I see anything that indicated my issue would be resolved, but it was worth a try. Using the HighPoint RAID management Web Admin, I updated the firmware. Afterward it prompted me to reboot the server. I rebooted, and the system came back on just fine. The next step was to contact my datacenter guy and have him put the 2TB WD Green drives back into the server. This was in case the drives were OK, and the hiccup I experienced earlier was a result of the RAID card.
I will update this blog later once I get a chance to continue this project, and will report back.
Just to update everyone: I still haven't put the old Green drives back into the server. I have been running remote backups in the meantime. I dropped off 3 x WD 1TB Red drives with my datacenter guy to take to the datacenter next time he is there. This way, in case the Green drives are the problem, I have spares on hand. This whole incident is ironic, as in my old server I have been running a 1TB WD Black drive with no issues for nearly 5 years. It's hard to believe that 2 Green drives would die after just 3 months. But on my old server my Black drives were directly connected to the SATA ports on the motherboard, and not running through the RAID controller. I will report back once I have more info.
Just another quick update. I should have my old server back with the 2TB green drives within the next week. Once I have those back I will do some analysis on the drives and report back.
I have been delayed in getting my old server back, along with the 2TB Green drives. After an update to the RAID card's firmware, I decided to take a chance and have the datacenter put my 2TB Green drives back into my new server. So far so good. Both drives seem to be responding fine. It's possible this issue was just a glitch in the RAID card's firmware. My first backup is tonight at 2AM. I will report back how it goes.
Sorry for the delay in updating this blog. It's been a few weeks now, and the 2TB drives are working fine. No errors or any other issues. At this point I would say the problem definitely was tied to the RAID card, considering that both drives stalled within minutes of each other. If anything changes I will update this blog. Until then you can assume it's safe to run the Green drives on this RAID card in JBOD. I would still recommend using RE drives or Red drives if you are going to run any level of RAID (1, 0, 5, 10, etc.).