Normally when a web server goes down, you would think it's from something major, like a power outage or a disk failure. In my case it was a failing 2.5" 2TB WD Green drive. In case anyone else has a failing server, this might help you.
Servers go down. That is just a fact of life. And when it happens, it brings a world of stress for server administrators. A quiet day quickly turns into dozens of inbound calls and texts from clients wanting to know if the server is down. After all, many of their businesses come to a complete standstill when they can't send email.
This recently happened to me. The server went down around 12:30 AM, about 30 minutes after nightly backups started. In the morning I was greeted with texts and missed calls from clients letting me know their sites were down. I tried to remote in, and nothing, even though a ping test showed the server reachable and responding.
I called the datacenter and requested a hard reboot, which in itself is scary, as any write operation interrupted by a hard reboot can cause corruption issues. 15 minutes later I watched my pings become unreachable as the server rebooted. Then it was accessible again. I logged in and looked for anything that would indicate an application error, or even a security breach. Nothing.
I decided that it might be a good time to do a full system backup, just in case, since for the past 2 months my backups had all been incremental. So I started up the backup software, and 2 hours later the server locked up again. Another call to the datacenter for a hard reboot, then a quick repair of some corruption issues on the DB server, and finally back to diagnosing the cause. I also went ahead and started an offsite backup just in case. While that backup was running, I continued to search for the cause.
I was pretty sure it was related to the backup software putting the system under stress, but the backup software had been running perfectly for over 6 months. This led me to suspect a hardware issue. My server has a 2TB WD Green SATA drive for nightly backups. When I looked at the SMART values for that drive, it showed over 2000 reallocated sectors, and the management software seemed to think the drive was on its last legs. I have moved my backups to a secondary drive in the meantime.
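If you want to check a drive the same way, smartmontools' `smartctl -A` prints the drive's SMART attribute table. Here is a minimal Python sketch that pulls the raw Reallocated_Sector_Ct value out of such a table; the sample report below is a made-up illustration in smartctl's format, not my drive's actual output:

```python
# Hypothetical snippet of `smartctl -A` output; the column layout
# (ID#, name, flags, ..., RAW_VALUE last) follows smartmontools' table format.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   001   001   140    Pre-fail  Always   FAILING_NOW 2104
"""

def reallocated_sectors(report: str) -> int:
    """Return the raw Reallocated_Sector_Ct value, or 0 if absent."""
    for line in report.splitlines():
        fields = line.split()
        # Attribute name is the 2nd column, raw value is the 10th.
        if len(fields) >= 10 and fields[1] == "Reallocated_Sector_Ct":
            return int(fields[9])
    return 0

count = reallocated_sectors(sample)
print(count)  # 2104
```

A raw value climbing into the hundreds or thousands, like mine did, is a strong sign the drive should be replaced.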
Here is how this dying drive crashes the server.
My primary drive is a RAID 6 array with SSDs, which can easily read at 500 MB/sec.
The backup drive is a single 2TB 5400 RPM SATA drive.
The backup software collects data, stages it in memory, and then flushes it from memory to the drive.
Since the server can read faster than it can write, data queues up in memory. This usually isn't a problem, as the system tries to limit how much is queued by slowing down reads.
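That throttling is just backpressure on a bounded buffer: once the buffer is full, the fast reader blocks until the slow writer drains some of it. A toy Python sketch of the idea (not the backup software's actual implementation):

```python
import queue
import threading
import time

# Bounded buffer: put() blocks when full, which throttles the reader.
buf = queue.Queue(maxsize=4)
written = []

def reader():
    # Fast producer: reads 8 chunks as quickly as the buffer allows.
    for i in range(8):
        buf.put(f"chunk-{i}")

def writer():
    # Slow consumer: simulates a 5400 RPM drive draining the buffer.
    for _ in range(8):
        written.append(buf.get())
        time.sleep(0.01)

t_read = threading.Thread(target=reader)
t_write = threading.Thread(target=writer)
t_read.start(); t_write.start()
t_read.join(); t_write.join()

print(written)  # all 8 chunks, in order
```

The key point is that the reader can never get more than 4 chunks ahead, so memory use stays flat no matter how slow the writer is.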
Everything is fine until the backup drive hits a bad sector. The drive then attempts a sector reallocation, which stops any writes from happening. This stalls the drive for several seconds, sometimes as long as 30 seconds. In the meantime, the backup software continues to queue up data in memory, waiting for it to be written. Eventually the server breaks the 50% memory threshold and performance starts to degrade, which just makes things worse. Other requests get backed up and push memory usage even higher. Once the system is out of memory, it locks up. This is what happened to my server.
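Some back-of-envelope arithmetic with the numbers above shows how fast one stall can eat RAM, assuming reads are not throttled while the drive is stalled:

```python
# Figures from the setup described above.
read_rate_mb_s = 500   # RAID 6 SSD array read speed
stall_seconds = 30     # worst-case reallocation stall on the backup drive

# Data that piles up in memory during a single stall.
queued_mb = read_rate_mb_s * stall_seconds
print(queued_mb)  # 15000 MB, i.e. ~15 GB queued in RAM from one 30-second stall
```

A few stalls like that in a row, on top of normal workload, and the server runs out of memory and locks up.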
Just wanted to share this with anyone else suffering from similar issues with a Windows server.