OMGN: Online Movies & Games Network

Outage Information: September 16th - 17th

Gamer of Darqness; Oct. 3, 2010; By Robert F. Ludwick
Type: News
OMGN's outage on September 16th-17th... sucked.

From September 16th at about 10:15pm Pacific Time through September 17th at about 12:45pm, OMGN and all websites hosted on the DarqFlare Enterprises server in Las Vegas, NV were inaccessible. I apologize for taking so long to post about what happened; I meant to within a couple of days.

Occasionally we'll perform operating system and software upgrades on our server. This usually doesn't cause any issues for the websites hosted on the DFE server, aside from a couple of minutes of downtime during required reboots, such as when we install a new kernel.

Our server is a custom-built server running Linux (Gentoo, specifically). We've had remarkably excellent uptime until this incident. We were performing software upgrades on the server that evening and when we had it reboot, it never came back up.

I was going to go to bed on it and fiddle with it in the morning, but it was eating away at me and I couldn't sleep. So at the bright hour of 11:00pm, I swapped out my pajamas for clothes and drove off to the datacenter where the server is housed. Yeah, see, we don't host the server in the confines of a residential property such as my own. It's co-located here in Las Vegas with great network and power backups.

I arrived, got into the facility and hooked up a crash cart (monitor, keyboard and mouse) to the server, only to find that the kernel wasn't seeing the hard drives in the system. Upon reboot, the system would load the kernel and start detecting devices, but it couldn't see the hard drives to build the RAID 1 array, so the boot would stop there.

The interesting thing here to me was that it was able to load the kernel and filesystem, which meant the filesystem wasn't damaged. That was one of my worries, as I had to update Grub, the bootloader we've chosen to use to load our different kernel versions. So I got to strike that one off the list.

My next guess was that the new kernel we installed had issues and wasn't configured properly. Well, the server has a 0-second wait on selecting the kernel to load; this is because if we're rebooting the server, the shorter the downtime the better. So after a couple tries I managed to get it to load the previous kernel version we were running. No dice, the same issue arose.

After fiddling around a bit and even going back one more kernel version, it still wasn't detecting the hard drives. I had unfortunately forgotten to bring my disc pack with me so I carted myself off home to grab my Gentoo LiveCD so I could load an environment from disc to see if it could detect the hard drives.

Round trip completed, I got back to the facility and loaded up the disc. The kernel on disc saw the hard drives and was able to assemble the RAID 1 array with no issues at all. I was able to mount the drives properly and start messing around in there. Unfortunately, even after switching between different kernel versions, I couldn't make any progress. It was getting late and I had to work at 7 in the morning. I do have a day job outside of OMGN, much to my chagrin; I'd much rather work on this site for a living.

That day at work I asked if one of my sysadmins would be so kind as to lend me some of his lunch hour to pore over the system and find the issue. I'm pretty decent with Linux these days but this was confounding me. He agreed and at about 11:30 we hopped over to the datacenter together to assess the issue.

I gave him a rundown of what was going on, and after a while we tried rebuilding udev, since a udev problem was his best guess as well as mine. Unfortunately, udev wasn't building properly. It was complaining about using a 64-bit compiler on a 32-bit system. It seems I nabbed the wrong Gentoo LiveCD from my house. Thankfully, my sysadmin friend was able to burn off a 32-bit LiveCD.

We loaded it up and recompiled udev, then noticed something that I wish I'd noticed the first time I was doing these software upgrades. The install complained about 3 Linux kernel flags being set that shouldn't be. The flags were support for older, legacy filesystems. We noted them down and went and recompiled the kernel without those flags, then recompiled udev again. We rebooted the machine in the hope that it would come up again.
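For anyone who wants to check for this kind of thing before rebooting, here is a minimal sketch in Python of comparing a kernel configuration against a list of options that should not be enabled. The three option names below are hypothetical placeholders, since the exact flags involved aren't named here; the only thing assumed about the input is the usual kernel `.config` layout, where an enabled option looks like `CONFIG_FOO=y` (or `=m`) and a disabled one appears as the comment `# CONFIG_FOO is not set`.

```python
# Hypothetical option names, used only for illustration.
# The post does not say which three flags were actually involved.
UNWANTED = {
    "CONFIG_EXAMPLE_LEGACY_A",
    "CONFIG_EXAMPLE_LEGACY_B",
    "CONFIG_EXAMPLE_LEGACY_C",
}


def enabled_options(config_text):
    """Return the set of options set to y (built in) or m (module)."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Blank lines and comments are skipped; disabled options are
        # written as comments ("# CONFIG_FOO is not set").
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and value in ("y", "m"):
            enabled.add(name)
    return enabled


def offending_options(config_text, unwanted=UNWANTED):
    """Return the unwanted options that are enabled, sorted by name."""
    return sorted(enabled_options(config_text) & set(unwanted))


# Small made-up config fragment to exercise the functions.
SAMPLE = """\
CONFIG_EXAMPLE_LEGACY_A=y
# CONFIG_EXAMPLE_LEGACY_B is not set
CONFIG_EXAMPLE_LEGACY_C=m
CONFIG_EXT3_FS=y
"""

if __name__ == "__main__":
    for name in offending_options(SAMPLE):
        print("should not be set:", name)
```

On a real machine you would read the text from the kernel source tree's `.config` file (on Gentoo that is conventionally under `/usr/src/linux`, though your layout may differ) and swap in whatever option names the installer actually complains about.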

And it did.

After that, it was just a little bit of housekeeping and the DFE server was back up again. So here's a lesson for you folks: pay very, very close attention to any output messages from operating system and software installations in Linux. I didn't see those messages the first time and it cost DFE about 14 1/2 hours of downtime.
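One way to make those messages harder to miss is to save the installer output to a file and scan it afterwards. The following is a rough sketch in Python, not a description of how the DFE server is actually maintained. The keyword list and the sample log text are assumptions made up for illustration; real installer output varies, so the pattern would need tuning.

```python
import re

# Assumed keywords; adjust to match what your installer actually prints.
PATTERN = re.compile(
    r"\b(warning|error|deprecated|should not be set)\b",
    re.IGNORECASE,
)


def flagged_lines(log_text):
    """Return (line_number, text) pairs for lines that look like warnings."""
    return [
        (number, line.strip())
        for number, line in enumerate(log_text.splitlines(), start=1)
        if PATTERN.search(line)
    ]


# Made-up log excerpt, only to show the shape of the result.
SAMPLE_LOG = """\
>>> Installing example-package-1.0
 * WARNING: kernel option EXAMPLE_FLAG should not be set
>>> Completed installing example-package-1.0
"""

if __name__ == "__main__":
    for number, text in flagged_lines(SAMPLE_LOG):
        print(f"line {number}: {text}")
```

A scan like this is only a safety net. It catches the messages it has been told to look for, so reading the output yourself is still the real fix.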
