Mar 20, 2015

Until recently, Mozy Pro was used to back up offsite nightly.  The backup included all administrative data, user profiles, SQL databases, etc.  The SQL databases were automatically backed up to the server nightly and then pushed offsite via Mozy, a huge improvement over the unreliable DLT autoloader we had previously used.

The document attachment feature of our production software has helped us realize the benefit of transitioning to a paperless file system.  However, once the SQL attachment database alone grew past 40GB, uploading the nightly backup was taking longer than 48 hours!  Even worse, Mozy uploading data from our SQL server during business hours was causing noticeable delays at point of sale terminals!  Although the 4 NICs in the server are bonded together into a 4Gbps Ethernet connection to the core switch, the SQL databases are stored on a RAID5 volume, which also affects database performance.
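
As a rough check on those numbers, the sketch below works out the average throughput implied by a 40GB upload that takes 48 hours. It assumes decimal units (1 GB = 1000 MB) and a steady transfer rate; neither detail comes from actual measurements, and the function names are only illustrative.

```python
def average_rate_mb_per_s(size_gb: float, hours: float) -> float:
    """Average throughput in megabytes per second, assuming 1 GB = 1000 MB."""
    return (size_gb * 1000) / (hours * 3600)


def hours_to_transfer(size_gb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move size_gb at a steady rate given in megabytes per second."""
    return (size_gb * 1000) / rate_mb_per_s / 3600


if __name__ == "__main__":
    rate = average_rate_mb_per_s(40, 48)
    # Multiply by 8 to express the same rate in megabits per second.
    print(f"{rate:.3f} MB/s, about {rate * 8:.2f} Mbit/s")
```

The same helper can be used the other way around, for example to estimate how long a given backup set would take at whatever upload rate the offsite link sustains.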

In looking for ideas to improve the backup scheme, I turned to the Google group moderated by our software vendor. Another IT Manager recommended Backup Assist (BA) for easily scheduled SQL backups, in lieu of doing something in PowerShell or changing settings in SQL Server Management Studio.  We are now running BA 7.5 on our Windows 2003 server.  I would strongly recommend this software, as it has an easy-to-navigate GUI, an excellent feature set, and very useful add-on modules.  Compared to some other enterprise-class backup management software, BA is a good value.  The SQL backup module automates creating a daily backup of SQL databases as well as incremental backups every 15 minutes.  This is now a crucial part of our disaster recovery protocol.  Previously, a catastrophic failure could have resulted in the loss of all transactions and data modified on the day of the failure. Now, even in the event of a catastrophic failure, at most about 15 minutes of transactions should be lost, and those should be possible to recreate manually!
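
To make the recovery window concrete, here is a minimal sketch of picking which backups would be needed to restore to a given moment. It assumes each 15-minute backup holds only the changes since the previous one, so the whole sequence after the daily full backup has to be replayed in order; Backup Assist's actual restore procedure may differ. The function name and the 06:00 full-backup time are hypothetical.

```python
from datetime import datetime, timedelta


def restore_chain(fulls, incrementals, target):
    """Return (full, [incrementals]) needed to restore to the latest point at or before target."""
    usable_fulls = [t for t in fulls if t <= target]
    if not usable_fulls:
        raise ValueError("no full backup exists at or before the target time")
    base = max(usable_fulls)
    # Every incremental taken after the chosen full, up to the target, in time order.
    chain = sorted(t for t in incrementals if base < t <= target)
    return base, chain


if __name__ == "__main__":
    full = datetime(2015, 3, 20, 6, 0)  # hypothetical time of the daily full backup
    increments = [full + timedelta(minutes=15 * n) for n in range(1, 49)]
    failure = datetime(2015, 3, 20, 14, 7)
    base, chain = restore_chain([full], increments, failure)
    last_good = chain[-1] if chain else base
    print("restore to", last_good, "- work to recreate:", failure - last_good)
```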

There are three multi-mode fiber links between the buildings at work, and internal network data wiring has been optimized to eliminate two switches from the physical network topology.  Ethernet media converters are being replaced with fiber optic transceivers in the more advanced layer-2 switches, which will allow link aggregation on trunk ports.  This is important to the new backup scheme, as the additional traffic generated by Backup Assist when transferring the SQL backups to the back building is separated from other network traffic both physically and logically (VLANs), using VLAN access rules on the core layer-3 switch stack.

I prefer rsync as a network transfer agent, but I will need to do some more work in setting up multiple rsync daemons between the local storage servers in the back building and the offsite server.  The 2MBps upload across the hardware VPN keeps a steady stream of traffic flowing between the local storage servers and the offsite server.  This has caused a few incomplete backups, with the rsync daemon getting confused when multiple operations run in the same shell.  For now, SMB is working with Backup Assist to transfer the SQL backups to the local storage server in the back building.  Because I am using a non-NTFS (ext4) file system on the offsite server, the current Backup Assist reports throw some warnings about NTFS permissions and mount points being stripped from files and folders in the copy job.  I don’t see this as a problem given the nature of what we are accomplishing with our offsite backup.  Most restores are done from the shadow-copied files and folders on the local servers.  The offsite backups are more for full restores in the event of catastrophic failure or intentional deletion.
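
One common way to keep overlapping rsync runs from stepping on each other is to take an exclusive lock before starting a transfer and skip the cycle if a previous run still holds it. The following is a minimal sketch of that idea, not the setup actually in use here. It assumes a Linux host (the `fcntl` module is Unix-only); the lock file path, the function names, and the 300 second timeout are hypothetical.

```python
import fcntl
import subprocess
from contextlib import contextmanager

LOCK_PATH = "/tmp/offsite-rsync.lock"  # hypothetical lock file location


@contextmanager
def single_instance(lock_path=LOCK_PATH):
    """Yield True if this run holds the lock, False if another run already has it."""
    handle = open(lock_path, "w")
    try:
        try:
            # Non-blocking exclusive lock: fails immediately if already held.
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        handle.close()  # closing the file also releases the lock


def build_rsync_command(source, destination):
    """rsync over SSH; --partial keeps interrupted files so a later run can resume them."""
    return ["rsync", "--archive", "--partial", "--timeout=300",
            "-e", "ssh", source, destination]


def run_backup(source, destination):
    """Run one transfer, or return None if a previous transfer is still running."""
    with single_instance() as acquired:
        if not acquired:
            return None
        return subprocess.run(build_rsync_command(source, destination)).returncode
```

Called from an hourly scheduler, `run_backup` would simply skip an hour when the previous transfer has not finished, rather than starting a second copy of the same job.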

There are two backup sets running for SQL.  The production software's file attachment database and main database are archived in full each business day and backed up incrementally every 15 minutes for the remainder of the day.  Each hour, these files are copied across the fiber link to a local storage backup location, where they are moved through a series of six folders based on the date/time stamp of the backup files, so that approximately two weeks' worth of backups are available on the local network.  Another backup job makes a full backup of all additional databases in the production environment and, at the end of the day, transfers all of the SQL backups across the fiber link to a different location on the local storage server.  All the SQL backups are then archived daily to an alternate location that keeps one week's worth of SQL backups in folders labeled by day.  These are then transferred to the offsite server across the VPN with rsync over SSH.  On Sundays, all of the other network data (user profiles, documents, files, pictures, etc.) is copied to the local backup servers, and from there it is transferred to the offsite server by the rsync daemon, which runs hourly and pushes any new content offsite.
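
The one-week archive in day-labeled folders could be sketched roughly as follows. This is an illustration under assumptions rather than the job as configured: it assumes the folders are named after weekdays and that each day's copy simply replaces the one from a week earlier. The function names and paths are hypothetical.

```python
import shutil
from datetime import date
from pathlib import Path


def weekday_folder(archive_root, day):
    """Folder for the given date, named after its weekday (e.g. <archive_root>/Friday)."""
    return Path(archive_root) / day.strftime("%A")


def archive_daily(backup_dir, archive_root, day=None):
    """Copy the day's backups into the weekday folder, replacing last week's copy."""
    day = day or date.today()
    target = weekday_folder(archive_root, day)
    if target.exists():
        shutil.rmtree(target)  # drop the copy made on the same weekday last week
    shutil.copytree(backup_dir, target)
    return target
```

Because each weekday folder is overwritten seven days later, the archive never holds more than one week of backups and needs no separate cleanup job.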

Because I wanted to keep the remote system headless, and because I had some existing components hanging around the office, here is the equipment I used at the remote location:
SonicWALL TZ-100W
  • Inexpensive firewall that allows an easy to manage hardware VPN with the main office.
APC Masterswitch AP9225
  • Initially, I think I would have preferred the AP9211. The installed firmware offers a web-GUI option to ‘restart’ individual outlets.
  • The AP9225 has matching serial ports for pairing UPS communication with each outlet.   It may be possible to integrate some environmental probes for temperature/humidity, and RS-232 communication with an Arduino board for more advanced household automation applications.
  • I haven’t looked at scripting or automation on these yet, so time will tell which ultimately works better.  I want an iOS app that will give me push button access to these controls so that I can simplify procedural troubleshooting that may require restarting devices based on an error condition (e.g. turning the cable modem off and on if the VPN drops for x minutes and Google’s DNS cannot be pinged).
PowerConnect 2816
  • An 8 port switch would have been fine, but the 16 was available.  SNMP allows network traffic data to be graphed which is useful for planning and scheduling backup transfers.
PogoPlug E02
  • ARM-powered ‘cloud storage’ device marketed to consumers that is easily reconfigured to run Arch Linux.
8GB SanDisk USB Flash drive
  • This is partitioned as an ext3 volume and serves as the root volume for the Linux OS.
Sabrent USB 3.0 to SATA Dual Bay External Hard Drive Docking Station
  • Easy access to media, and adaptable as backup requirements change.  Disk cloning is a nice perk.
2 x WD RE4 2TB Enterprise Hard Drives
  • Target Storage media for offsite data
APC Back-UPS 750VA
  • Provides battery power to network components until generator is online in the event of power failure
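
The modem-restart idea from the Masterswitch notes above (power-cycle the cable modem when the VPN has been down for x minutes and Google's DNS cannot be pinged) comes down to a small decision rule. Below is a minimal sketch of that rule, assuming a Linux host where `ping -W` takes seconds. The VPN peer address, the five-minute threshold, and `power_cycle_outlet` are placeholders: I have not scripted the Masterswitch yet, so the outlet control is deliberately left unimplemented.

```python
import subprocess
import time

GOOGLE_DNS = "8.8.8.8"   # Google's public DNS resolver
VPN_PEER = "10.0.0.1"    # hypothetical address on the far side of the VPN


def host_reachable(host, timeout_s=2):
    """True if a single ping to host succeeds (Linux ping syntax assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def should_power_cycle(vpn_down_since, dns_reachable, now, threshold_s):
    """Restart only if the VPN has been down past the threshold AND public DNS is unreachable."""
    if vpn_down_since is None:
        return False
    return (now - vpn_down_since) >= threshold_s and not dns_reachable


def power_cycle_outlet(outlet):
    """Placeholder: the real call depends on how the Masterswitch ends up being controlled."""
    raise NotImplementedError("outlet control is not scripted yet")


def watch(threshold_s=300, interval_s=30, modem_outlet=1):
    """Poll the VPN peer and power-cycle the modem outlet when the rule above is met."""
    vpn_down_since = None
    while True:
        now = time.monotonic()
        if host_reachable(VPN_PEER):
            vpn_down_since = None
        elif vpn_down_since is None:
            vpn_down_since = now
        if should_power_cycle(vpn_down_since, host_reachable(GOOGLE_DNS), now, threshold_s):
            power_cycle_outlet(modem_outlet)
            vpn_down_since = None  # give the modem a full threshold to recover
        time.sleep(interval_s)
```

Checking the public DNS server as well as the VPN peer keeps the script from rebooting the modem when only the main office end of the tunnel is down.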

DIY Mozy

I am working on setting up scripts to automatically correct and report common issues with the rsync portion of the offsite backup.
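
As a starting point for those scripts, here is a rough sketch that retries an rsync run when it fails with an exit code that usually points to a transient network problem, and flags everything else for a report. The exit code meanings are taken from the rsync man page; which codes count as retryable, the retry count, and the delay are my own assumptions.

```python
import subprocess
import time

# rsync exit codes that usually indicate a transient network problem.
RETRYABLE = {
    10: "error in socket I/O",
    12: "error in rsync protocol data stream",
    30: "timeout in data send/receive",
    35: "timeout waiting for daemon connection",
}


def classify(returncode):
    """Map an rsync exit code to 'ok', 'warn', 'retry', or 'report'."""
    if returncode == 0:
        return "ok"
    if returncode == 24:  # partial transfer due to vanished source files
        return "warn"
    if returncode in RETRYABLE:
        return "retry"
    return "report"


def run_with_retries(command, attempts=3, delay_s=60,
                     run=subprocess.run, sleep=time.sleep):
    """Run command, retrying transient failures; return (status, exit codes seen)."""
    codes = []
    for attempt in range(attempts):
        code = run(command).returncode
        codes.append(code)
        status = classify(code)
        if status != "retry":
            return status, codes
        if attempt < attempts - 1:
            sleep(delay_s)
    # Still failing after every attempt: hand it to a human.
    return "report", codes
```

The `run` and `sleep` parameters exist only so the retry logic can be exercised without a real transfer; in normal use the defaults are left alone.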

Now for the $64 question:

Who has fully implemented a disaster recovery protocol and purposely failed over to backup systems to test it?  I have not done a SQL restore from the offsite data yet, so I cannot call this project complete until I have done so.  Since Windows Server 2003 reaches end of life this July, I will have to replace the server with one running Windows Server 2012.  Given the age of the existing system, coupled with hardware limitations (based on the server's system board chipset, I would have to add another CPU to increase the memory), I plan to deploy new hardware.  Once this is operational, the plan is to house the second server offsite and keep its software updated to match the production server (SQL version and production software version), provided the operating system and software licensing allow it.

This should allow for greater flexibility in recovery.  In a case where we lose the production server, the SQL database could be restored offsite, and production software could be accessed across the VPN.  Alternatively, the backup server could be physically moved, allowing IT to react to any number of recovery scenarios.