Preparing for the apocalypse

Robert McCann

Published Jul 26, 2018

NOTE - Below is my personal opinion - no actual DR Events were experienced or executives harmed in the making of this post... That I personally know about anyway :)

Executives have a lot on their minds - not only are they trying to run a business, make sure that a raft of stakeholders are happy, plan the next '5 year plan', and balance the various needs of the business - but they are also responsible for ensuring that their business is protected in the event of anything BAD that may happen to it.

This may include but is certainly not limited to:

A fire that burns down your Head Office building
A plane crash that takes out all of your Senior Leadership Team
A Cyber Attack (Hacking, Ransomware or Fraud)
A Natural Disaster (Flood, Fire, Earthquake)

Generally all of those events are covered under what is termed Disaster Recovery (DR) or Business Continuity Planning (BCP) - but note that while those two terms are used interchangeably, they are not always the same thing. As I'm in IT, I'm going to use what I consider the definition of each of those terms. For me, DR is focused on the recovery of systems and applications that allow the business to function in the event that they become unavailable (generally when multiple systems and/or the whole site where they are located fail). BCP, to me, is more focused on continuing the actual running of the business, which may include recovering the IT systems but is much broader in scope (for example, if the office everyone works in catches fire and burns down).

While this article is more focused on the impact on IT systems, my argument is that both IT and the business need to work together to get the best outcome and ensure that nothing critical is missed in the process. My hope is that this gets IT people thinking more about including the business in such planning and that the business people start thinking that they need to be speaking to their IT department more about what they currently have in place and ensuring that such planning meets their current and future needs.

In any event, the person who generally has to shoulder the responsibility for ensuring that IT systems are available in both a DR and a BCP Event, and who may be legally responsible if such systems don't function, is the Chief Information Officer (or CIO) - who generally works with, or has as a direct report, the IT Disaster Recovery Manager (or ITDRM for short).

From an IT standpoint it may be tempting to just look at all the IT systems, classify them and then set the expectation with the business that system y will be available after x period of time (the old stand-in for this is 24 hours) - however doing this is dangerous. The issue is that systems IT believes are important may actually not be that important, and the business may have a totally different view of what it considers to be critical systems (even if it doesn't know what those systems actually are). For example, IT may think that not having access to email for a day is not very impactful, but to the business this may be unacceptable, as there are processes that depend on email being available and a failure there might lead to financial, legal or regulatory penalties.

The ITDRM role should not be concerned with just IT systems. Instead, they need to be more concerned with engaging the business to ensure that any IT DR planning will cover what the business stakeholders consider to be important. Doing this ensures that IT understands the business processes that MUST BE available to the business and that the business understands which systems IT considers to be critical (and both can then discuss and realign things from there as needed).

Once this work has been done IT can start to sit down and plan out what the order of recovery of systems will actually look like - because it's very, very rare that a single business process or function will include just a single system.

Planning should include things like how long the business can function without being able to execute a process, what processes the business must execute (legally or because of regulations), who can activate a BCP or DR plan and what the expectation is for the time to recover systems. Ideally such planning will also include failback once the event is finished, workarounds in the interim while systems are coming up, and the point at which things will be considered to be back in normal operation.

Doing this ensures that it's not just IT taking on all the responsibility, risk, and time investment in this work. It also has the benefit of providing transparency into the IT world and shines a light on how complex creating and executing such a plan really is. The business will also be better prepared to work with IT when implementing new processes and systems, as they will be better able to understand why it's important to get the design of a system right from the beginning. IT, in turn, will be more aware of the questions they need to ask the business to get the full picture in future projects.

Let me give you an example of why I think this is so important and where expectations on how complex something is to recover become misaligned. Consider a simple process like paying an employee. From a business standpoint it's very easy - they have one application that they use and understand. To them it's just one of a few things they use to get their job done. But say you have a DR Event, your primary datacenter goes down, and you need to recover this application.

Has the ITDRM communicated to the business the complexity of what will need to be done to restore just this one application? Does the business know what other things need to be looked at first before this application will be available and work completely again?

From a business perspective this application:

Is being Accessed in some manner
Is Running somewhere
Is Storing data in some place
Is sending out a Notification to payroll and users

So there are 4 'things' involved in the application's function (access, running, storage and notification) that they most likely think are all just one or two systems. Simple, right?

Let's look at this system from an IT perspective. To them this application:

Is running on one or more servers
Uses Active Directory as an authentication source for users to be able to access it
Is being protected using a firewall
Requires internet access (to send banking information)
Uses email to send out notifications (to payroll users and employees)

In a DR Event, before the payroll system can be recovered, IT will first need to recover (or verify that they are working) the lower level systems:

Switches\Routers
Physical Servers
Physical Storage
Physical Firewalls

Once this is completed, IT systems running on top of this infrastructure will be looked at:

Hypervisor
Operating System
Database
Authentication provider
Email
Internet Access

Finally the application itself will be restored, and from there users may be able to use it (they will need to test it first).
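To make that ordering concrete, here is a minimal sketch in Python that works out a valid recovery order from a dependency map. The system names come from the lists above, but the specific "depends on" links between them are my own assumptions for illustration - your environment will almost certainly look different. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+).

```python
from graphlib import TopologicalSorter

# ASSUMPTION: an illustrative dependency map, not a real environment.
# Each key depends on every system listed in its set.
DEPENDS_ON = {
    "Switches/Routers": set(),
    "Physical Servers": {"Switches/Routers"},
    "Physical Storage": {"Switches/Routers"},
    "Physical Firewalls": {"Switches/Routers"},
    "Hypervisor": {"Physical Servers", "Physical Storage"},
    "Operating System": {"Hypervisor"},
    "Database": {"Operating System"},
    "Authentication provider": {"Operating System"},
    "Email": {"Operating System", "Authentication provider"},
    "Internet Access": {"Physical Firewalls"},
    "Payroll application": {
        "Database",
        "Authentication provider",
        "Email",
        "Internet Access",
    },
}


def recovery_order(depends_on):
    """Return the systems in an order where every dependency comes first."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, system in enumerate(recovery_order(DEPENDS_ON), start=1):
        print(f"{step:2}. {system}")
```

The exact order of systems within a layer may vary, but the network always comes first and the payroll application always comes last. If someone accidentally records a circular dependency, `TopologicalSorter` raises a `CycleError`, which is a useful prompt to go back and talk to the system owners.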

So what the business thought was just recovering the one system - the payroll application - turns out to require IT to verify and recover at least 10 other systems just so payroll can pay your employees! Can IT meet the business expectations in doing all of the above so that the business can pay their people? Does IT have the systems in place so that they can quickly recover? Is the business aware of how long it will take them? What happens if the recovery process fails? What trade-offs are there while in the DR Event? Are there any risks that IT or the business need to be aware of?
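One way to start answering the "how long will it take" question is to add up recovery times along the dependency chain. The sketch below is a rough illustration only: the systems, the links between them, every duration and the 8 hour business expectation are all made-up numbers, and it optimistically assumes that systems with no dependency on each other can be recovered in parallel.

```python
from graphlib import TopologicalSorter

# ASSUMPTION: a simplified, hypothetical dependency map and made-up
# recovery durations (in hours) purely for illustration.
DEPENDS_ON = {
    "Network": set(),
    "Servers and storage": {"Network"},
    "Database": {"Servers and storage"},
    "Authentication provider": {"Servers and storage"},
    "Payroll application": {"Database", "Authentication provider"},
}
RECOVERY_HOURS = {
    "Network": 2,
    "Servers and storage": 4,
    "Database": 3,
    "Authentication provider": 1,
    "Payroll application": 2,
}


def earliest_finish(depends_on, hours):
    """Earliest time each system can be back, if independent work runs in parallel."""
    finish = {}
    for system in TopologicalSorter(depends_on).static_order():
        # A system can only start once its slowest dependency is back.
        start = max((finish[d] for d in depends_on.get(system, ())), default=0.0)
        finish[system] = start + hours[system]
    return finish


if __name__ == "__main__":
    business_expectation_hours = 8  # hypothetical figure agreed with the business
    finish = earliest_finish(DEPENDS_ON, RECOVERY_HOURS)
    payroll_ready = finish["Payroll application"]
    print(f"Payroll back after roughly {payroll_ready} hours")
    if payroll_ready > business_expectation_hours:
        print("This does not meet the business expectation - time for a conversation.")
```

With these made-up numbers, payroll is back after about 11 hours against an expectation of 8, and that is before allowing for testing or anything going wrong. The point is not the arithmetic but that the gap only becomes visible once IT and the business put their numbers side by side.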

This is why it's important to plan, plan and plan and then to plan some more around disaster recovery. Just having the conversation with the business on a consistent basis will ensure that they are aware of the complexity, and should also help IT get the funding and resources to simplify it so that it doesn't become something that ultimately brings down the business. It will also hopefully bring people together and help everyone get a better night's sleep, knowing that if the worst should happen people will be better prepared to deal with it.

I'm sure I could go on further (and I could - it's a deep and complex topic) but really I hope I've made the point that this stuff is important and very complex. It shouldn't be the job of just one person (or maybe even a few people) to deal with; it needs to be understood and created by everyone.

Rome wasn't built in a day and it certainly wasn't built by just the one person!

If you have any questions or feedback, feel free to drop me a line - I'm no expert, but would be interested to hear people's thoughts on the subject and any experiences they've had in this area.
