qzjul

Administrator
Game Development
Jul 30th 2010, 0:42:41

Hi all, sorry about the major disaster we had there; here's what happened, FYI.

I'm planning on moving the server on the weekend, so in preparation for that I wanted to do a quick test to make sure that the server would, in fact, come back after I moved it.

Unfortunately, as you can see, it did not.

Why not?

I forgot some important little details to do with RAID arrays. A while back we had a disk failure, and I hot-swapped it for a new drive; no problem, the game kept on running and hardly anybody noticed anything ;) Then later I added a 3rd drive to the array, to make sure we had an extra mirror just in case.

However, I forgot to rebuild the mdadm.conf file and update the initramfs, which are basically what tell the kernel which drives belong to which array at boot.
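For anyone who wants to catch this kind of drift before a reboot does, here is a minimal Python sketch, not what I actually ran. It assumes you have the text of mdadm.conf and the output of `mdadm --detail --scan` as strings, and that each ARRAY entry sits on one line. The function names, the UUID, and the sample lines are all made up for illustration.

```python
def parse_array_lines(text):
    """Return {device: {key: value}} for every ARRAY line in the text."""
    arrays = {}
    for raw in text.splitlines():
        # Simplification: treat everything after '#' as a comment.
        fields = raw.split("#", 1)[0].split()
        if len(fields) < 2 or fields[0].upper() != "ARRAY":
            continue
        attrs = {}
        for field in fields[2:]:
            key, sep, value = field.partition("=")
            if sep:
                attrs[key.lower()] = value.lower()
        arrays[fields[1]] = attrs
    return arrays


def stale_arrays(conf_text, scan_text):
    """List running arrays whose config entry is missing or different."""
    conf = parse_array_lines(conf_text)
    live = parse_array_lines(scan_text)
    return sorted(dev for dev in live if conf.get(dev) != live[dev])


# Hypothetical data: the config still describes a 2-disk mirror,
# while the running array now has 3 disks.
old_conf = "ARRAY /dev/md0 level=raid1 num-devices=2 UUID=00000000:11111111:22222222:33333333\n"
live_scan = "ARRAY /dev/md0 level=raid1 num-devices=3 UUID=00000000:11111111:22222222:33333333\n"

print(stale_arrays(old_conf, live_scan))
```

An empty list means the config matches what is running. Note this only checks the file on disk; the copy baked into the initramfs can still be out of date.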


So when I rebooted, it was expecting different partitions on different drives, and got totally confused. This was at 11pm my time. I suspect I might have figured it out last night, except for the fact that it took nearly 10 minutes each time I tried to boot for it to fail and drop me to a BusyBox prompt. This necessarily protracted the amount of time needed to test things.... I ended up booting to a LiveCD about 10 times, verifying the RAID was good, and looking stuff up online trying to figure out why the heck the boot sequence couldn't figure out what the drives were =/

Anyway, I went to bed at 4am, got up at 7am to go to work, looked up a few things there, and built a list of commands to try; got home at 6pm, and as I was booting thought of the solution, fixed it, and here we are.... Well, and it also forced me to do a check of all the drives in the system, as there had been 240 days without a check (the system had been online for 192 days); that took 30 or 40 minutes.



So the lesson of the day:

If you ever change a RAID array (especially a hot-swap).... update your mdadm.conf AND initramfs RIGHT THEN AND THERE, because otherwise the next reboot will leave everything totally fubared.
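As a rough illustration of the "update your mdadm.conf" half of that lesson, here is a small Python sketch. It assumes one ARRAY entry per line, and it only builds the new file contents: it keeps every non-ARRAY line from the old config and swaps in the ARRAY lines from `mdadm --detail --scan`. The function names and sample contents are hypothetical, and it doesn't touch any files itself.

```python
def is_array_line(line):
    """True if the line is an ARRAY entry (one entry per line assumed)."""
    fields = line.split()
    return bool(fields) and fields[0].upper() == "ARRAY"


def rebuild_conf(old_conf_text, scan_text):
    """Keep non-ARRAY lines from the old config, append fresh ARRAY lines."""
    kept = [l for l in old_conf_text.splitlines() if not is_array_line(l)]
    fresh = [l.strip() for l in scan_text.splitlines() if is_array_line(l)]
    return "\n".join(kept + fresh) + "\n"


# Hypothetical contents, for illustration only.
old_conf = (
    "DEVICE partitions\n"
    "MAILADDR root\n"
    "ARRAY /dev/md0 level=raid1 num-devices=2 UUID=00000000:11111111:22222222:33333333\n"
)
live_scan = "ARRAY /dev/md0 level=raid1 num-devices=3 UUID=00000000:11111111:22222222:33333333\n"

print(rebuild_conf(old_conf, live_scan), end="")
```

The initramfs half is a separate step after writing the file. On Debian-style systems that is typically `update-initramfs -u`, but the exact command and the location of mdadm.conf depend on your distro, so treat both as assumptions to check for your own setup.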


Addendum:

To adjust for the unexpected downtime, all servers except express have been extended for a full day. All market packages will stay on the public market 19 hours longer than usual. All countries have had 19 hours added to the last time they played. For most countries, the downtime will not be significant.

There were around 30 countries that logged in before the db was updated. These countries will display a negative "Last Played" time on the portal. All that means is that these countries won't gain turns until 19 or so hours have passed.
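To see where the negative "Last Played" comes from, here is a small Python sketch of the arithmetic. It is an illustration only, not the actual game code: the function names and the example timestamps are made up, and the only number taken from above is the 19 hours.

```python
from datetime import datetime, timedelta

DOWNTIME_CREDIT = timedelta(hours=19)


def credit_downtime(last_played):
    """Shift a country's last-played time forward by the credited downtime."""
    return last_played + DOWNTIME_CREDIT


def time_since_played(last_played, now):
    """Elapsed time since last play; negative if last_played is in the future."""
    return now - last_played


# Hypothetical timestamps for two countries.
db_update = datetime(2010, 7, 30, 1, 0)
played_before_outage = datetime(2010, 7, 28, 20, 0)
played_after_outage = datetime(2010, 7, 30, 0, 50)   # logged in before the db update

normal = time_since_played(credit_downtime(played_before_outage), db_update)
early = time_since_played(credit_downtime(played_after_outage), db_update)

print(normal)                  # positive: the outage hours are no longer counted
print(early < timedelta(0))    # last-played now lies in the future
```

For the country that logged in early, the shifted time lands in the future, so the elapsed time stays negative until roughly 19 hours have gone by.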

We realize that this fix is not perfect, but it appears to be the fairest solution. Thank you for your understanding, and sorry for the downtime.

Edited By: Slagpit on Jul 30th 2010, 1:06:10
Finally did the signature thing.