I know I’m not the first person to say this (2 people advised me of this today), but don’t use Windows Server Backup. Get a real backup solution and save yourself some serious headaches. I know I will be after my latest fun with WSB. *Caveat – my experience is with W2K8r2 servers, so perhaps it’s been improved in newer iterations.
I’ve been managing server hardware for a couple of small businesses for about 10 years now, and I’ve always just used Windows Server Backup (in conjunction with Azure backup, as of a couple of years ago) to back up said servers. Today was one of those moments where you have fleeting thoughts of “Am I going to get fired? Did I just lose all of the data for this organization?” I’d like to believe that I have better safeguards in place than to allow that to happen (and it turns out, I was able to get it working again), but man does it scare the crap out of you when you can’t get things working.
Considering my server is in a RAID 5 configuration, you might think “What is the problem? Swap out the failed drive and move on.” That’s what I normally do when I encounter a degraded virtual disk, but this time it didn’t work because a second disk failed during the rebuild process (different from the disk that caused the degradation in the first place). The system was fine until a several-hours-long power outage depleted my battery backup’s power, causing the system to go down in the dead of night. As a side effect, my virtual disk was degraded, and two of the physical disks comprising the virtual disk reported problems. The virtual disk kept failing to rebuild, and pulling out the drive that was keeping it from rebuilding would have rendered the machine useless, since it wasn’t the hot spare.
I woke up extra early the next morning to come in and restore the backup image that was taken a few hours before that. The goal was to replace all of the failed physical disks, delete the virtual disk and create a new one with the same parameters as the old, and then restore the image – all before business started that day.
I guess I forgot how much of a PITA Windows Server Backup was to use. Here are the instructions I left myself from the last time I had a multi-disk failure:
- A Windows System Image backup must be performed before attempting to restore.
- Verify this before you nuke your virtual disk
- Since the server has a Dell PERC s300 RAID controller, Windows needs the appropriate drivers to work with the RAID controller when restoring the image. Go to the Dell Support website and enter the Service Tag for the server.
- Find the Windows Server 2008 R2 RAID / PERC s300 driver in the list of available drivers.
- Download the “Hard Drive” format file (an EXE) and run it on the computer where you downloaded it
- Running the EXE will extract the drivers to a folder of your choosing on that computer
- Place the folder of drivers from the previous step onto an external drive (or burn to CD)
- Check the settings of your RAID configuration – verify size and caching options so you can use them for your new image
- Power off the server
- Swap out any bad hardware for new drives
- Restart the server and enter the RAID configuration menu (Ctrl + R).
- Find the virtual disk and delete it (This can also be done in OpenManage Server Administrator prior to rebooting).
- Initialize any new physical disks for use in the virtual disk.
- Create a new virtual disk, matching the configuration to what it was previously (usually all available space).
- Make sure to swap the virtual disk into virtual disk slot 1 (the only bootable one), if you have multiple virtual disks in your array.
- Insert the Windows Server OS disc into the DVD tray
- Continue to boot
- When Windows Setup loads, choose the language and hit the next arrow.
- Choose the “Repair your computer” link.
- Click the “Load Drivers” button
- Plug the external drive with the downloaded drivers into the server and choose that drive (or the optical drive, if you burned the drivers to CD) when Windows prompts you for the driver location.
- Plug the external drive with the backup into the server
- Choose the Restore from System image option and hit next
- Windows Server Backup should now detect any backups from the external drive
- Note, this can take a reallllly long time (> 30 minutes)
- Choose next and Finish. The image should restore after a few hours.
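The first two notes in that list (make sure a system image actually exists before you nuke the virtual disk) can be checked from a script instead of by eye. Here is a minimal Python sketch of that check. It assumes `wbadmin get versions` is available on the server and that its output contains `Backup time:`, `Version identifier:` and `Can recover:` lines; the exact wording may differ between Windows versions, and the function names are my own, not part of any tool.

```python
import subprocess


def parse_versions(text):
    """Parse the text printed by `wbadmin get versions` into a list of dicts.

    Assumption: each backup is described by 'Key: value' lines and a new
    entry starts at every 'Backup time:' line.
    """
    versions = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # banner or blank line, nothing to record
        key = key.strip().lower()
        if key == "backup time":
            versions.append({})
        if versions:
            versions[-1][key] = value.strip()
    return versions


def bare_metal_versions(text):
    """Return the version identifiers that claim bare metal recovery support."""
    return [
        v.get("version identifier", "")
        for v in parse_versions(text)
        if "bare metal recovery" in v.get("can recover", "").lower()
    ]


def check_server():
    """Run wbadmin locally (Windows only, from an elevated prompt)."""
    result = subprocess.run(
        ["wbadmin", "get", "versions"],
        capture_output=True, text=True, check=True,
    )
    return bare_metal_versions(result.stdout)
```

If `check_server()` comes back with an empty list, that is the cue to stop and take a fresh image before touching the RAID configuration.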
One would think that with these detailed instructions, it should be fairly easy to restore the image. The answer is no. For one thing, Windows Recovery seemed to really struggle to recognize my external backup drive. This is especially disconcerting and leads you down some incorrect paths (starting to think the data is corrupted and trying to fix problems with the disk). Eventually, I got it to work through trial and error, but it took far too long and interrupted business operations far longer than it should have. Some problems I encountered:
- The “Load Drivers” step is essential if you are restoring your image over an existing drive (and not starting from scratch). It may also be essential if you are starting from scratch – it never worked for me without loading those drivers, so I think it’s a good idea anyway (assuming you’re working with a RAID controller).
- Feedback with the System Image Recovery is extremely poor. You have no idea what is going on most of the time. You are simply stuck with progress bars that never end. You really just have to wait and hope that it completes.
- My first attempt at creating the virtual disk resulted in a just slightly smaller size than “all available space”, so during the image restore process, I received the helpful error “A data disk is currently set as active in BIOS. Set some other disk as active or use the DiskPart utility to clean the data disk, and then retry the restore operation. (0x80042406).” I opened up the diskpart utility and cleaned the data disk, but as I suspected, the problem was really something else. The new virtual disk has to be at least as large as the disk the image was taken from.
- At one point, I loaded Windows as a fresh install and tried to restore from Windows Server Backup within the OS. The only problem with that is that you can’t do a bare metal restore – you can only restore files and drives. While this was better than nothing, I didn’t like the idea of trying to figure out how to reconfigure all the software on that machine. I’m glad I stuck with the image restore.
- Windows Server Backup has always been incredibly slow, as well. For the most part, I use the command line when running backups with it, but the user interface is ungodly slow. Just opening it and getting to the initial view is really laggy.
- It seemed like it mattered when I plugged my external drive in. If I had the drive plugged in when I started Windows Recovery, the image restore never found it. But when I plugged it in after loading the drivers, it seemed to be okay.
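The size mismatch behind that 0x80042406 error is easy to guard against if the old virtual disk's settings are written down before it gets deleted. Below is a small Python sketch of the comparison. The `VirtualDisk` record, its fields, and any numbers you feed it are hypothetical; the values would have to be copied by hand from the RAID configuration screen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualDisk:
    """Hand-recorded virtual disk settings (hypothetical structure)."""
    raid_level: int
    size_mb: int
    write_cache: bool


def config_differences(original, replacement):
    """List how `replacement` differs from `original`.

    Only the size shortfall is known (from the experience above) to block
    the image restore; the other differences are flagged for review.
    """
    differences = []
    if replacement.size_mb < original.size_mb:
        shortfall = original.size_mb - replacement.size_mb
        differences.append(
            f"new virtual disk is {shortfall} MB smaller than the original"
        )
    if replacement.raid_level != original.raid_level:
        differences.append("RAID level differs from the original")
    if replacement.write_cache != original.write_cache:
        differences.append("caching option differs from the original")
    return differences
```

An empty list means the new virtual disk matches or exceeds the old one on every recorded setting, so the restore should at least not fail on size.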
In general, unresponsive software is a huge pet peeve of mine. Let me know that something is going on. I don’t know whether something is hanging or just takes forever. Also, no software should be this picky – especially something involved in a potentially mission critical application.
If you’ve encountered problems trying to restore your Win2K8r2 server and you’re using RAID 5, these steps might help. As I discovered today, the internet is full of reports of problems restoring from WSB (most posts are older because Server 2008 R2 is pretty old now, in OS years).
This episode did help me reflect on some problems though – namely, that I need to have more resilient processes in place for problems like this. For example, what if the motherboard failed on this server? What do I do then? Have Dell overnight a motherboard for this? Redundancy is non-existent here. Furthermore, this is an area where you need to practice – I’m sure if I had done restores more than 2-3 times in my entire life, it wouldn’t have been so bad. Developing some kind of a failure scenario like the infamous “Chaos Monkey” would help make things much more resilient. At the very least, better documentation on the services and applications installed on the server now will help in the event of a catastrophic loss or even just a migration.
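For the documentation piece, even a crude snapshot of what is installed would beat trying to remember it after a failure. Here is a rough Python sketch that turns the output of `sc query state= all` into a JSON inventory of services. It assumes the usual `SERVICE_NAME:`, `DISPLAY_NAME:` and `STATE :` lines in that output; the function names and the output file are made up for illustration, and it does not cover installed applications at all.

```python
import json
import subprocess


def parse_sc_query(text):
    """Turn `sc query` output into a list of {name, display_name, state} dicts."""
    services = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # lines without a colon carry nothing we record
        key, value = key.strip(), value.strip()
        if key == "SERVICE_NAME":
            services.append({"name": value})
        elif services and key == "DISPLAY_NAME":
            services[-1]["display_name"] = value
        elif services and key == "STATE":
            # The value looks like "4  RUNNING"; keep the textual state.
            parts = value.split()
            services[-1]["state"] = parts[-1] if parts else ""
    return services


def snapshot_services(path):
    """Write the local service list to `path` as JSON (Windows only)."""
    output = subprocess.run(
        ["sc", "query", "state=", "all"],
        capture_output=True, text=True, check=True,
    ).stdout
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(parse_sc_query(output), fh, indent=2)
```

Running `snapshot_services` on a schedule and keeping the file somewhere other than the server itself would give you a starting point for a rebuild or a migration.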
As a developer, though, the correct answer is really to just virtualize all the things so I don’t have to worry about disk failures anymore 🙂