Simplify Business Continuity
I’ve been consulting for more than 18 years and have worked with dozens of organizations of all sizes, shapes and even nationalities. Most of the organizations I’ve consulted with are in the ‘small to medium enterprise’ space, which we categorized as having between 1,000 and 2,500 employees. One characteristic most of them shared was non-existent or poorly implemented Business Continuity (BC) and Disaster Recovery (DR) practices.
You will note that I didn’t say that they all lacked BC/DR planning. Many of them, when faced with an audit, could pull some thick binder off the shelf, blow the proverbial dust off the cover, proudly announce, “Here it is,” and receive the coveted passing grade from the auditor. Some even ‘tested’ their BC/DR plans. By ‘poorly implemented’ I mean that these plans were often not socialized well, not accessible to key personnel, not explicitly mapped to the greatest risks to the company, impractical, inconsistently tested and sometimes lacking in basic common sense.
Usually, when I speak to customers or other colleagues about ‘fixing’ business continuity, I get feedback like, “Yes, that’s a priority for <insert any year as long as it isn’t the current one>.” I think a lot of people in my position fall victim to the ‘Engineer Fallacy’, i.e. “If I can’t make it perfect, I’m not going to do it at all.” I like to reflect on one of my favorite quotes by George Patton, who said, “A good plan violently executed now is better than a perfect plan executed next week.”
Simple BC Planning
Let’s consider one approach to creating a ‘good’ business continuity plan ‘now’:
- Define your terms – Many IT professionals, to say nothing of your average executive, get confused about the terminology of BC and DR. I define business continuity planning as “the architecture, processes, and procedures that allow the business to function when its systems are interrupted by an unplanned event.” Notice that I’m focused on business systems, not facilities, supply chains, etc. Those, too, are part of an overall business continuity strategy but outside my current scope of responsibility. Disaster recovery planning sits at the far end of your business continuity planning exercise and addresses the more extreme risks. Write down your definition and socialize it with your executive team.
- (Literally) Table your risks – Create an actual table of the risks your business systems face. I’d recommend, however, that you not start with the table; start with a brainstorming session. Get a group of knowledgeable folks in a conference room, order a pizza and write every risk you can think of on a whiteboard. Include business stakeholders outside of IT. Here are a few ideas to get you started: hardware failure, power outage, misconfiguration, malicious insider, lightning strike, fire, ransomware attack, service provider outage, asbestos found in the building, etc. Order your risks into a list from most to least likely to occur and circle the ones you think could reasonably happen in the next 12-18 months.
- Quantify your most important business systems – This step is often known as creating a ‘Business Impact Analysis’ or BIA. I know large consulting organizations that charge six figures to help a business create its BIA. The traditional process for creating one is long and involved and not without merit. I’d contend, however, that if you don’t have one, something is better than nothing. If you simply repeat the process you followed to identify risks above to generate a prioritized list of business systems and applications, you will be a significant step closer to a functional BIA. In this case, draw a circle around those systems that would cripple the ability of the business to generate or realize revenue if they were unavailable for more than 4 hours. Again, here is a starting point for your brainstorming session: core network, phone system, ERP, cell towers, Internet connection, e-mail, core router(s), core switch(es), CRM, etc. We call this most essential class of systems ‘Tier 1’.
- Create a 'BC Plan Table' and list your mitigation strategies, in order of efficacy, for each risk scenario. By way of illustration:
Figure 1: Example BC Plan Table
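If a spreadsheet isn’t your thing, the steps above can also be sketched in a few lines of Python. This is only a rough illustration of the shape of a BC Plan Table, not a reproduction of the figure: the class names, the 1-to-5 likelihood score, and every sample entry are hypothetical placeholders, while the 4-hour threshold and the 12-18 month window come from the steps above.

```python
from dataclasses import dataclass

# Threshold from the text: Tier 1 systems cripple revenue if down > 4 hours.
TIER1_OUTAGE_HOURS = 4


@dataclass
class Risk:
    name: str
    likelihood: int         # hypothetical score: 1 (rare) to 5 (expected)
    within_18_months: bool  # True if the risk was "circled" on the whiteboard


@dataclass
class System:
    name: str
    tolerable_outage_hours: float  # hypothetical estimate from brainstorming


def circled_risks(risks):
    """Order risks from most to least likely, keeping only the circled ones."""
    ranked = sorted(risks, key=lambda r: r.likelihood, reverse=True)
    return [r for r in ranked if r.within_18_months]


def tier1_systems(systems):
    """Keep systems that cannot tolerate an outage longer than the threshold."""
    return [s for s in systems if s.tolerable_outage_hours <= TIER1_OUTAGE_HOURS]


def build_plan_table(systems, risks, mitigations):
    """One row per (Tier 1 system, circled risk). Mitigations keep the order
    they were given in, which is assumed to be order of efficacy."""
    return [
        {
            "system": s.name,
            "risk": r.name,
            "mitigations": mitigations.get((s.name, r.name), []),
        }
        for s in tier1_systems(systems)
        for r in circled_risks(risks)
    ]


# Illustrative entries only; replace them with your own whiteboard results.
risks = [
    Risk("Lightning strike", 1, False),
    Risk("Hardware failure", 4, True),
    Risk("Ransomware attack", 3, True),
]
systems = [
    System("ERP", 2),
    System("Intranet wiki", 48),
    System("Core network", 0.5),
]
mitigations = {
    ("ERP", "Hardware failure"): ["Active/passive HA pair", "Restore from backup"],
}

for row in build_plan_table(systems, risks, mitigations):
    print(row["system"], "|", row["risk"], "|", row["mitigations"] or "TODO")
```

Rows that print TODO are the useful part: they show you exactly where a Tier 1 system faces a likely risk with no mitigation strategy written down yet.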
BC Plan Table: Objections
So - you have the bones of a BC plan in your BC Plan Table above. This will, I’m sure, generate objections from those of you looking for ways to fill the binder your auditor is expecting. This will, perhaps, be a good time to clear up a few things:
- “But this isn’t what my auditor is looking for!” - Auditors hang you with the rope you give them. It is up to you to match your business requirements and obligations (including compliance requirements) against your budget. Would you like, say, Bank of America’s BC/DR budget to address your own requirements? Sure; however, you have to work within the budget you’ve been given or you will need to obtain a larger budget. If you are concerned about how your auditor (board of directors, executives, etc.) will react to a simple ‘BC Plan Table’, all you need to do is scope it in the context of your available resources and real-world objectives. If you have agreement on either budget or objectives, you can work toward clarifying the remaining areas of ambiguity.
- “But even this simple table requires a lot of work to create and maintain!” - Yes. It is faster and requires less work than traditional approaches, but if you are looking for a free lunch, this isn’t it.
- “But this still requires a lot of testing!” - Yes. See below.
BC Plan Table: Testing and Validation
A BC Plan that isn’t regularly tested is worse than no plan at all. Why? Because if you fail to create a plan, you may just be too busy, resource-strapped, etc. If you create a plan and don’t test it, you have again woven a rope with which to hang yourself when things go off the rails, and you have given the organization a false sense of security. Imagine the conversation between you and your CEO: “Yes sir, that is one of the scenarios in the BC plan you approved last year... No, we’ve been too busy to test the mitigation strategies... Yes, sir, I’m familiar with dice.com....”
My recommendation is that you set the expectation on testing and validation coincident with the plan rollout. Here are a couple of other pointers for you to consider:
- Define a PROCESS, not just a point-in-time test - There is no greater waste of time than going through this exercise in year 1 and re-inventing the wheel in year 4. It’s tremendously more effective to roll out a standing test and validation process as an assumption from day 1.
- Start with testing the first mitigation strategy for each of your primary systems - Don’t try to test everything at once; take a deliberately iterative approach to testing. And, YES, test your active/passive HA pairs, redundant hardware components, etc. Verifying that these components actually deliver the high availability they are supposed to provide is EXACTLY the point. Some of the items that made it into your primary mitigation strategy don’t have an easy way to test (e.g. diverse power feeds into a building). Identify the first mitigation strategy that you can test and go from there.
- Create recurring change requests - Determine the appropriate cadence for testing critical systems and build the testing and validation into your standard change control process. Ideally, your CR process has the concept of pre-approved changes that you can execute in a standing change window; if so, this will be an easy step to implement.
- Link your testing and change documentation directly from your BC Plan table - This will allow you to present one package to the folks that need access to the BC plan and allow you to continue to flesh out your plan. It will also give your auditors something to do.
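To make the pointers above a little more concrete, here is a minimal Python sketch of two of them: picking the first mitigation strategy you can actually test, and laying out recurring test dates that could become standing change requests. The `Mitigation` class, the function names, the sample mitigations, the start date and the 90-day cadence are all hypothetical assumptions chosen for illustration; pick whatever cadence your change control process supports.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Mitigation:
    name: str
    testable: bool  # hypothetical flag: can we exercise this in a change window?


def first_testable(mitigations):
    """Return the first mitigation, in order of efficacy, that can be tested.

    Returns None when nothing in the list is testable."""
    for m in mitigations:
        if m.testable:
            return m
    return None


def recurring_test_dates(start, cadence_days, count):
    """Lay out `count` test dates, `cadence_days` apart, beginning at `start`."""
    return [start + timedelta(days=cadence_days * i) for i in range(count)]


# Illustrative entries only. The text notes that diverse power feeds are hard
# to test; the UPS/generator entry is an assumed second-line mitigation.
power_outage_mitigations = [
    Mitigation("Diverse power feeds into the building", testable=False),
    Mitigation("UPS and generator failover", testable=True),
]

target = first_testable(power_outage_mitigations)
schedule = recurring_test_dates(date(2025, 1, 6), cadence_days=90, count=4)

if target is not None:
    for test_date in schedule:
        print(f"{test_date.isoformat()}: test '{target.name}'")
```

Each generated date maps naturally to one recurring change request, and the results of each run are what you would link back to from the BC Plan Table.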
Tying It All Together
Following this outline will get you to the place where you can say with confidence and integrity that you have a business continuity plan in place and processes implemented to regularly test and validate its essential components.
There are, however, several choices we made along this journey that you will need to track. When you complete this process and want to iterate and improve, you may need to broaden your parameters:
- We chose to only address those risks we deemed likely to occur in 12-18 months. Disasters usually fall outside of this likelihood estimate and if you wish to extend this process to DR, then you will want to broaden your scope.
- We are only looking at ‘Tier 1’ systems. In the spirit of iteration, your next cut may not be all of ‘Tier 2’ but you may, for instance, want to explicitly include those systems that teetered on the fence between Tier 1 and 2.
- You started your testing and validation by exercising the first mitigation strategy for your most critical systems and greatest risk. Don’t stop there. You need to address mitigation strategies for all of the risks you deemed possible within 12-18 months.
So - if you have been putting off creating or re-working your business continuity implementation because you were looking for a few spare man-years of time to invest, perhaps it’s time to give Patton’s method a try. Make a good plan and exercise it now. Heck, make a halfway decent plan and exercise it now! Don’t let the perfect be the enemy of your good.