Avo Reid on CTO: Business Continuity, Disaster Recovery, Resiliency

“Court disaster long enough and it will accept your proposal.” Mason Cooley

Independent Software Vendors (ISVs), like any organization, must engage in Business Continuity Planning (BCP), also called Business Continuity and Resiliency Planning (BCRP). This is especially critical for ISVs that provide Software as a Service (SaaS) solutions to their customers. The process identifies exposure to internal and external threats that can disrupt or, worse, interrupt the operations that are the lifeblood of the business. Once these risks are identified, a recovery plan is developed to return the business to full operations. With a recovery plan in place and the risks known, the business can evaluate what hard and soft assets can be applied to prevent a disruption from occurring in the first place, improving the resiliency of the business.

Some objectives for the BCRP we listed in our last planning cycle:

Identify risks, critical production components and the impacts of their failure.
Establish systems to monitor the health of these critical production components.
Document recovery procedures to restore critical production components in the event of failure, within a time frame that does not breach the customer End User License Agreement (EULA) or Service Level Agreement (SLA). These recovery procedures also help avoid the confusion experienced during an outage.
Identify personnel that must be notified in the event of an outage.
Create a plan to communicate with key people during the recovery and an escalation procedure.
Establish a testing procedure to validate the recovery plans.
Establish a process in which the plan can be maintained, updated and tested periodically.
Serve as a guide for the IT or Network Services Team.

The BCRP process should be considered cyclical, something that should be executed at least once a year. The BCRP Cycle is composed of 3 main phases (the 3 R's): Risk Analysis, Recovery or Solution Design, and Resiliency or Maintenance. At a more detailed level, we can define the BCRP Cycle using the diagram below.

Risk Analysis
This phase should identify exposure to internal and external threats that can disrupt or, worse, interrupt the operations of the company. Establishing 'Severity Levels' is useful in this phase. For example:

Level 1 Disaster Recovery - Severe Outage
This level is assigned to those risk scenarios where the disruptions are, as the name implies, disastrous: they interrupt operations completely, cannot be fixed at the Production Site, and require operations to be moved to a new location. These disruptions result in instant escalation (a chain of notification and approval) to move the Production Site to the Disaster Recovery site, a significant declaration. Think of a meteorite hitting your data center or, as actually happened this year at a local data center here in DC, a backhoe cutting your data center's internet trunk. Here the potential for EULA/SLA breach is high, depending on the failover time to the new site.
Level 2 Operational Recovery - Outage
This level is assigned to those risk scenarios where the disruptions affect the availability of any component that interrupts operations completely but can be fixed at the Production Site. The recovery plan for these risk scenarios should complete well within the time that might lead to a breach of any customer EULA/SLA. There will be an escalation procedure in place that will move a Level 2 risk to a Level 1 risk should recovery take longer than expected.
Level 3 Offline Recovery - Redundant Outage
This level is assigned to those risk scenarios where the disruption has no effect on operations, e.g. a pooled web server goes down and the load balancer automatically takes it out of circulation. The minimal impact of these outages is usually due to built-in resiliency; nevertheless, the outage must be addressed to bring the system back to complete health.
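
The three levels and the Level 2 to Level 1 escalation rule can be written down as a small decision procedure. The following is only a minimal sketch: the names Severity, Scenario, classify and escalate are hypothetical, and the two yes/no questions it asks are one possible reading of the definitions above.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import IntEnum


class Severity(IntEnum):
    DISASTER_RECOVERY = 1     # Level 1: operations move to the Disaster Recovery site
    OPERATIONAL_RECOVERY = 2  # Level 2: outage, but fixable at the Production Site
    OFFLINE_RECOVERY = 3      # Level 3: redundancy absorbed the failure


@dataclass
class Scenario:
    # Hypothetical description of one risk scenario from the risk analysis.
    name: str
    interrupts_operations: bool
    fixable_at_production_site: bool


def classify(scenario: Scenario) -> Severity:
    """Assign a severity level using the definitions above."""
    if not scenario.interrupts_operations:
        return Severity.OFFLINE_RECOVERY
    if scenario.fixable_at_production_site:
        return Severity.OPERATIONAL_RECOVERY
    return Severity.DISASTER_RECOVERY


def escalate(level: Severity, elapsed: timedelta, expected: timedelta) -> Severity:
    """Move a Level 2 outage to Level 1 once recovery runs past the expected time."""
    if level == Severity.OPERATIONAL_RECOVERY and elapsed > expected:
        return Severity.DISASTER_RECOVERY
    return level


if __name__ == "__main__":
    trunk_cut = Scenario("internet trunk cut", True, False)
    print(classify(trunk_cut).name)
```

In practice the expected recovery time passed to escalate would come from the documented recovery procedure for that scenario, chosen to leave room before any EULA/SLA breach.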

Recovery / Solution Design
First it's important to define the two types of recovery planning that were outlined in the Severity Level definitions above:

Disaster Recovery
The process of establishing procedures to recover operations in a location other than the primary production facility after a declaration of disaster.
Operational Recovery
The process of establishing procedures to recover production in the same location, without requiring a declaration of disaster.

This phase should produce procedures and identify the soft and hard assets needed to bring the business back to full execution after a disruption, in both the Disaster Recovery and Operational Recovery scenarios.

It is also important to define two numbers which in the end will have a significant impact on the solution design:

Recovery Point Objective (RPO) - the acceptable amount of data, measured as a window of time before the failure, that will not be recovered (usually driven by transaction volume and speed).
Recovery Time Objective (RTO) - the acceptable amount of time to restore operations (usually driven by EULA/SLA financial impact).
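
To see how these two numbers constrain a design, here is a minimal sketch that checks a backup schedule and a measured recovery time against them. It assumes periodic backups where each backup captures data as of the moment it starts; the function names and the example figures are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta) -> timedelta:
    """Longest window of data that could be lost.

    Assumption: a backup reflects data as of its start time. If a failure
    strikes just before the next backup finishes, everything since the
    start of the last completed backup is lost.
    """
    return backup_interval + backup_duration


def meets_objectives(backup_interval: timedelta,
                     backup_duration: timedelta,
                     measured_recovery: timedelta,
                     rpo: timedelta,
                     rto: timedelta) -> dict:
    """Compare a candidate design against the RPO and RTO."""
    return {
        "rpo_met": worst_case_data_loss(backup_interval, backup_duration) <= rpo,
        "rto_met": measured_recovery <= rto,
    }


if __name__ == "__main__":
    result = meets_objectives(
        backup_interval=timedelta(hours=1),
        backup_duration=timedelta(minutes=10),
        measured_recovery=timedelta(hours=3),
        rpo=timedelta(hours=1),
        rto=timedelta(hours=4),
    )
    print(result)
```

With these example figures the RTO is met but the RPO is not, because hourly backups that take ten minutes can lose up to 70 minutes of data. A tighter RPO generally pushes the design toward more frequent backups or replication.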

Implementation
This phase should establish the necessary monitors, documentation, communication protocols, testing and failover environments to successfully execute and test the Recovery / Solution Design. Disaster Recovery will require that a failover operations site be put in place at a different location. This site might be a good candidate for testing and validating the Recovery / Solution Design rather than jeopardizing production.

Testing and Validation
This phase should execute the Recovery / Solution Design through forced disruptions (on the failover or test platform) which, if successful, will validate the recovery plans and provide metrics for impact on customer EULA/SLA. Part of the validation process is to benchmark your Disaster Recovery failover site to make sure its performance and throughput are acceptable. Remember, you're recovering for all your customers, not just a select few.
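
One way to make "acceptable" concrete when benchmarking the failover site is to compare it against a production baseline with an agreed tolerance. The sketch below is illustrative only: the metric names, the 10% tolerance and the function failover_shortfalls are assumptions, and every metric is treated as "higher is better".

```python
def failover_shortfalls(production: dict, failover: dict,
                        tolerance: float = 0.10) -> dict:
    """Return the metrics where the failover site falls short of production.

    A metric counts as a shortfall when the failover measurement is more
    than `tolerance` (as a fraction) below the production baseline.
    Metrics missing from the failover results are treated as 0.0.
    """
    shortfalls = {}
    for metric, baseline in production.items():
        measured = failover.get(metric, 0.0)
        if measured < baseline * (1.0 - tolerance):
            shortfalls[metric] = (baseline, measured)
    return shortfalls


if __name__ == "__main__":
    # Hypothetical benchmark figures, measured under full customer load.
    production = {"requests_per_second": 1000.0, "reports_per_minute": 50.0}
    failover = {"requests_per_second": 950.0, "reports_per_minute": 30.0}
    for metric, (baseline, measured) in failover_shortfalls(production, failover).items():
        print(f"{metric}: production {baseline}, failover {measured}")
```

The baseline should be taken with the full customer load in mind, since the failover site has to carry everyone.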

Maintenance and Resiliency
This phase is a post mortem. What did you learn from the Testing and Validation phase? What parts of the Recovery / Solution Design had to be modified because they didn't work? This information should update the BCP. Since you have now become so educated on your operations and the risks that can impact them, it's also useful to use this knowledge to identify production components that may be candidates for resiliency improvements. Where these improvements include capital expenses, they should be included in the next budgeting cycle, using your BCP to make the business case for the expenditure.

Good Luck!

Wednesday, October 23, 2013
