
Contingency Planning: What if the building catches fire?
Disaster Recovery, High Availability Systems, Compliance Policies

"What if the building catches fire?"
One of my favorite managers used to ask that when reviewing roadmaps. It always got a chuckle from some stressed-out analyst responsible for making sure everything was accounted for, but it wasn't just a joke.
If the building burns down, you should never lose more than a day's worth of work (preferably not even that), but it's nothing worth dying over.
Contingency plans do not need to foresee every possibility, only account for them.
Forecasting can only go so far; focus on event results rather than details: "What was the outcome?" instead of "What happened?"
What happened: A database error caused the server cluster to fail, bringing the system down for 4 hours.
Outcome: The system was down for 4 hours.
With contingency planning, it's less important to prepare for each component failure than to ensure a highly available system or failover cluster can maintain primary functionality. If a server goes down, the details of why matter less than having accounted for the possibility.
Conversely: focus on what you'll need instead of what needs to be done.
What needs to be done: Reconfigure the database structure to accommodate server transfers.
What will be needed: Good DBAs.

How will the budget be affected? Will the schedule be delayed? How can pieces be moved and temporarily redirected to avoid interruptions to the production environment?

A production environment is a sacred place. It should never come down, but things do happen outside the realm of reasonable control. Short of a power outage, there should always be a way to get back up immediately. Failsafes don't always need to support full functionality; they need to correctly and quickly follow a sequence that recovers primary functionality.

A recovery option should have automatic procedures for every scenario. Hard disk failure is easily handled through a SAN (storage area network) or a cluster. Local power outage? Generators (don't laugh; they save money in the long run by preventing expensive downtime and only have to kick in when needed). Network failure is covered by HA (high availability) systems and remote servers. What if the entire local ISP goes down? Have a server cluster in a remote location ready to assume the master role. Of course, that whole business gets rather complicated, hence the popularity of cloud systems; a lot of companies leave the hard stuff to third-party experts. Hardliners and those with the available funds will prefer to store data themselves. Data warehousing and database engineering are areas where experts should always be employed; no company should ever skimp in this regard. Not everybody finds database people easy to talk to, but what's important is knowing what they're talking about. And that they know what they're talking about.
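The "remote cluster assumes the master role" idea can be sketched as a simple health-check loop. This is a minimal illustration, not a production failover system: the hostnames and endpoints are hypothetical, and a real setup would add fencing and service discovery to avoid split-brain.

```python
import time
import urllib.request

# Hypothetical endpoints for illustration only.
PRIMARY_HEALTH = "http://db-primary.internal:8080/health"
STANDBY_PROMOTE = "http://db-standby.internal:8080/promote"

FAILURES_BEFORE_FAILOVER = 3  # require consecutive failures to avoid flapping
CHECK_INTERVAL_SECONDS = 5

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def monitor() -> None:
    """Watch the primary; promote the standby after repeated failures."""
    failures = 0
    while True:
        if is_healthy(PRIMARY_HEALTH):
            failures = 0
        else:
            failures += 1
            if failures >= FAILURES_BEFORE_FAILOVER:
                # A real system must also fence the old primary here
                # so two nodes never both act as master.
                urllib.request.urlopen(STANDBY_PROMOTE, data=b"")
                return
        time.sleep(CHECK_INTERVAL_SECONDS)
```

Managed databases and cloud load balancers do exactly this under the hood, which is a big part of why companies hand the problem to third parties.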
Process recovery is more flexible, so long as the right teams are trusted to be adaptable. Mostly it's important for the division of responsibility to be clear in a way that encourages cooperation rather than conflict. It's also helpful to get the right department heads involved in any planning or transition process, to promote pride of ownership.
When it comes to development, it's always harder to map out contingencies that keep delays from affecting other work. It takes a deep understanding of each task and objective to stay on top of each one's status. Understanding the resources assigned, the task difficulty, the estimated time to completion, and the connected tasks is required to plan properly for delays. If a programmer assigned a feature gets stuck, is there somebody else who can assist, or is it better for one person to take extra time?
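"Connected tasks" are what make delays dangerous, and a small dependency walk shows why. This is a toy sketch with made-up task names: given a map of which tasks depend on which, it lists everything downstream of a slipped task.

```python
from collections import deque

# Illustrative task graph: each task maps to the tasks that depend on it.
# Task names are invented for the example.
dependents = {
    "db-schema": ["api-layer", "reporting"],
    "api-layer": ["web-ui", "mobile-ui"],
    "reporting": [],
    "web-ui": [],
    "mobile-ui": [],
}

def affected_by_delay(task: str) -> set[str]:
    """Return every downstream task a delay in `task` could push back."""
    seen: set[str] = set()
    queue = deque(dependents.get(task, []))
    while queue:
        t = queue.popleft()
        if t not in seen:
            seen.add(t)
            queue.extend(dependents.get(t, []))
    return seen

# A slip in the schema work ripples through everything built on it.
print(sorted(affected_by_delay("db-schema")))
# ['api-layer', 'mobile-ui', 'reporting', 'web-ui']
```

Tasks with a large downstream set are the ones worth pairing a second programmer on; a stuck task with no dependents can usually just take extra time.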

Not all delays can be managed completely; it's important to know when an issue should be escalated. Sometimes an issue or decision will require the full project's attention, which should only happen when the effects are project-wide or there is a drastic shift in development focus. These cases should not happen, but if they do, it's just as important that they are handled properly.
So what should one plan for? Employees or resources being temporarily or permanently removed. Directives from project owners, beyond dispute, adjusting project objectives. Required external functionality becoming unavailable.
And what do you do? You'll need shifts in responsibilities and a reassessment of allocations. You'll need objectives that are not fully interdependent; since disruptions like these almost always happen early in a project, the early tasks should have been base functionality that supports a wide range of features. You'll need to be ready to cut processes that can be bypassed without altering other parts of the workflow (if designers sent output through editors or directors for printing, be ready with a quick option for them to perform that same necessary function themselves, meaning security roles need to be flexible as well).
If the building catches fire, you should already have all work up to the current day backed up on the servers, with a redundant, compressed backup image saved to a remote server. Upon hearing the alarms, all programs should be shut down instantly and Dropbox accounts synced; today's work files were opened from configured folders, so they were automatically synced upon being closed. Being super diligent, you also copied the most essential files to your USB 3.0 key (nothing over 50 GB, it is a fire for Christ's sake). If it's within your vicinity or authority, shut down the physical servers to save their current state and unplug the boxes to protect them from electrical surges. Finally, give it all a good-luck pat on your way out.
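The "compressed backup image saved elsewhere" step can be sketched in a few lines. This is a minimal illustration using the standard library; the function name is mine, and in practice the destination would be a mounted remote share or an object-store upload rather than a local path.

```python
import shutil
import tempfile
from pathlib import Path

def backup_work(work_dir: str, backup_dir: str) -> Path:
    """Compress work_dir into a zip archive stored in backup_dir.

    A local backup_dir stands in here for a remote share or
    object store; the compression step is the same either way.
    """
    backup_root = Path(backup_dir)
    backup_root.mkdir(parents=True, exist_ok=True)
    # shutil.make_archive appends ".zip" to the base name it is given.
    archive = shutil.make_archive(
        str(backup_root / Path(work_dir).name), "zip", work_dir
    )
    return Path(archive)

# Example: back up a scratch folder to a second temp directory.
work = tempfile.mkdtemp()
(Path(work) / "report.txt").write_text("today's work")
dest = tempfile.mkdtemp()
print(backup_work(work, dest).name)  # prints the archive name, ending in .zip
```

Run on a schedule (or on file close, as with the Dropbox-style sync above), this is the difference between losing a building and losing a day.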