21 October 2008 Forum: Disaster Recovery

The topic for our forum on Tuesday, 21st October was 'Disaster Recovery'.

IT disasters can take many forms, and potentially cost a business a considerable amount in loss of working time and data. Consequently, having contingency plans for disaster recovery (DR) is becoming an increasingly important investment. There are numerous prevention and recovery methods and applications, which all vary in cost and function. The aim of the forum was to discuss the extent of such planning, and the options and restrictions involved in the companies the attendees represented.

In order to give a general overview, both of the DR debate and of the state of DR planning in the businesses of those attending, we designed a questionnaire that attendees filled out and returned prior to the forum date. There were 12 questions, each relating to a different aspect of DR. We collated the answers to help us structure the debate. At the forum we went through the questions (and the responses) in order. The points that came up in discussion are listed below, under the answers from those who completed the questionnaire.

1. Likely Cause of a Disaster

My office gauges the most likely cause of a disaster to be:

4 A natural disaster
1 An external security breach
1 An internal security breach
0 Human error
3 Equipment failure/network failure
1 Other
1 No answer

The most common answer to this question was 'A natural disaster', a category in which respondents included problems resulting from flooding or overheating due to failures in environmental control equipment.

Symantec, in a recent IT Disaster Recovery survey, suggested that hardware and software failure is the most frequent cause of system breakdowns (and hence the need for DR plans). The next most common answers were external security threats, then power-related reasons.

2. Disaster Prevention

Does your office have disaster prevention methods other than a DR plan? In other words, does your office do security scans, have comprehensive fire protection and observe physical security measures?

0 Comprehensive
9 Some
2 None

'Some' was the most common answer in this case. Following the earlier discussion of potential threats, we suggested several options for circumventing them, especially in terms of security. This aspect was noted as important, as demonstrated by a couple of case studies in which significant amounts of hardware had been stolen from offices.

Our suggestions included investment in environmental monitoring systems, the use of webcams to allow for remote monitoring of server areas, regular security checks including external port scanning, and the installation of an internal intrusion detection system (IDS) such as Tripwire. Simple security measures such as locking the server room and securing laptops overnight were recommended as inexpensive and effective ways of helping to avoid disasters.
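As a minimal sketch of the kind of port check mentioned above, the following Python function reports which TCP ports on a host accept connections. It is an illustration only, not a substitute for a dedicated scanning tool: the function name, the host and the port list are hypothetical, and a genuinely external scan would need to be run from outside the office network, against systems you are authorised to test.

```python
import socket


def open_ports(host, ports, timeout=0.5):
    """Return the ports on `host` that accept a TCP connection.

    `host`, `ports` and `timeout` are illustrative parameters chosen
    for this sketch; they do not come from the forum discussion.
    """
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            # connect_ex returns 0 on success and an error code otherwise
            if s.connect_ex((host, port)) == 0:
                found.append(port)
    return found


# Hypothetical usage: check a few common service ports on this machine.
print(open_ports("127.0.0.1", [22, 80, 443]))
```

Any port that appears in the result but is not expected to be reachable is a candidate for closing or firewalling.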

3. Costing DR

Has your office analysed both the cost of a disaster and the cost of having and maintaining a DR plan?

2 Yes
9 No

The most common answer by far to this question was 'No'.

The benefit of costing lost productivity and potential loss of work is that it can be used to help determine a budget for equipment, services and time required for achieving an acceptable downtime period. Backup and off-site synchronisation costs were considered to be the largest direct cost associated with having a DR capacity.
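As a rough illustration of the costing exercise described above, the following sketch estimates the cost of lost working time for a given outage. The function name, its parameters and all of the figures are hypothetical, chosen for illustration only; they are not drawn from the questionnaire responses.

```python
def downtime_cost(staff, hourly_cost, hours_down, productivity_lost=1.0):
    """Estimate the cost of lost working time during an outage.

    All inputs are assumptions supplied by the office doing the costing:
    staff             -- number of people affected
    hourly_cost       -- average cost per person per working hour
    hours_down        -- working hours for which systems are unavailable
    productivity_lost -- fraction of work that cannot be done (0.0 to 1.0)
    """
    return staff * hourly_cost * hours_down * productivity_lost


# Hypothetical office: 40 staff at 50 per hour, five 8-hour days down,
# with three quarters of normal work impossible without IT systems.
outage = downtime_cost(staff=40, hourly_cost=50, hours_down=5 * 8,
                       productivity_lost=0.75)
print(outage)
```

With these made-up figures the outage comes to 60,000 in lost working time alone, before any cost of lost data or replacement hardware. A figure of this kind can be set against the yearly cost of backup and off-site synchronisation when deciding on a budget.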

4. DR Plan

Does your office have a DR plan in place?

4 Yes
6 No
1 No answer

The most common answer to this question was 'No'.

Discussion of why many respondents did not have a DR plan raised several issues. Two of the most prominent were the difficulty of preparing a plan that can cover all situations, and the fact that much server software licensing is tied to specific hardware.

Nevertheless, the reasons for having a DR plan are compelling. Having a plan suggests that a cost model and budget have been set, that an office is ready to make preparations and test their efficacy, and that specific people will be named and trained. Recovery from a disaster is difficult to achieve without a plan.

5. Data loss window

What is the period of data loss for which your office is prepared?

3 Less than 12 hours
5 Less than 24 hours
2 Less than 2 days
1 More than 2 days

The most common answer to this question was 'less than 24 hours', although it was unclear whether this bore any relation to what could actually be achieved. It was suggested that the question be reworded as 'how old would your data be on recovery from the latest point before the next backup?'

In discussion it became clear that in the case of a major disaster, many attendees would have difficulty in retrieving data that was less than five days old.

We suggested that off-site data replication services should be used to reduce the data loss window.
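The reworded question can be made concrete with a small calculation. The sketch below uses a deliberately simplified model, which is an assumption of this example rather than something agreed at the forum: in the worst case a disaster strikes just before the next backup is safely off-site, so the newest recoverable data is as old as the backup interval plus the time a backup takes to run plus any delay in getting it off-site. The function name and the figures are hypothetical.

```python
def worst_case_data_age_hours(backup_interval, backup_duration=0.0,
                              offsite_delay=0.0):
    """Worst-case age (in hours) of the newest recoverable data.

    Simplified, assumed model: the disaster happens just before the next
    backup becomes usable, so the previous backup is the one restored.
    backup_interval -- hours between the start of successive backups
    backup_duration -- hours a backup takes to complete
    offsite_delay   -- hours before a completed backup is held off-site
    """
    return backup_interval + backup_duration + offsite_delay


# Hypothetical comparison: nightly tape backup taken off-site the next day,
# versus off-site replication running every hour.
nightly_tape = worst_case_data_age_hours(24, backup_duration=6,
                                         offsite_delay=24)
hourly_replication = worst_case_data_age_hours(1)
print(nightly_tape, hourly_replication)
```

Under these assumed figures the tape routine leaves a worst-case window of 54 hours, against roughly one hour for hourly replication, which is the effect the off-site replication suggestion is aiming at.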

6. Up-and-running

How long will it take for your IT systems to be up-and-running (to the agreed capacity stated in your DR plan) after the destruction of your primary site?

2 Less than 1 day
3 Less than 3 days
2 Less than 5 days
4 More than 5 days

The most common answer was 'more than five days', which accounts for the time it would take to repurchase hardware and restore backups from tape.

In the case of a serious disaster, the benefits of having a secondary site, both for digital storage and as temporary working space for staff, were established. This in turn led to a discussion of costs, and the availability of alternatives for those companies who could not afford virtualisation or offsite synchronisation.

To reduce the period between the disaster and the point at which systems are once more up-and-running, it is clear that having a secondary site available is beneficial, as is having a live or close-to-live copy of essential data.

Attendees discussed ideas such as having inter-office agreements for sharing each other's sites as a DR measure, and using staff's home computers and team-based mini-servers to help teams to start functioning again at homes or hotels.

Since the time taken to recover lost data and the time taken to get essential systems up-and-running should be added together, the total (adjusted) median DR time required was estimated by the respondents and attendees of the forum at 10 days. However, this does not include the time required to restore IT services to all office staff.
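The addition described above can be sketched as follows. Each office's two estimates are summed and the median of the totals is taken. The per-office figures here are invented for illustration and are not the forum's data, so the result is not meant to reproduce the 10-day estimate; the function name is likewise hypothetical.

```python
from statistics import median


def median_total_dr_days(estimates):
    """Median of (data loss window + up-and-running time) across offices.

    `estimates` is a list of (data_loss_days, up_and_running_days) pairs,
    one pair per office. Both values are that office's own estimates.
    """
    totals = [data_loss + up_and_running
              for data_loss, up_and_running in estimates]
    return median(totals)


# Hypothetical estimates for three offices, in days.
offices = [(1, 3), (2, 5), (5, 7)]
print(median_total_dr_days(offices))
```

With these invented figures the totals are 4, 7 and 12 days, giving a median of 7 days.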

7. Critical data

Does your DR plan differentiate between critical and non-critical data?

6 Yes
5 No

The discussion focused first on what constituted critical data. It was suggested that in design firms, data should be prioritised first by project. Attendees then considered whether IT personnel or other staff ought to be responsible for deciding the status of data. The distinction between critical and non-critical functions such as email was also discussed.

Differentiating critical data can help in reducing up-and-running times.

8. Recovery for specific staff

Does your DR plan prioritise between staff members?

5 Yes
6 No

The possibility of backing up the accounts of individual users was debated.

It was noted that it is preferable to enable workers who are important to the business to be back in action with minimum delay. Senior staff and, in particular, project managers should ideally be prioritised.

9. Implementation Responsibilities

Does your DR plan name specific individuals as responsible for implementation?

5 Yes
6 No

Naming individuals in advance is useful for dispelling confusion about who is responsible for disaster recovery. It also means that members of staff are involved in the practical aspects of improving the DR plan, and in ensuring that members of the company who are not part of the IT team are cognisant of the details of the plan.

10. Testing

How often do you test your DR plan?

0 Monthly
1 Quarterly
0 Yearly
4 Ad hoc
6 Never

The most common answer to this question was 'Never'.

The difficulties of testing DR were discussed. The ideal way of doing DR is to have a "hot" DR site (that is, a second site with a live, up-to-date copy of the office's data). For many the cost of such a site is prohibitive, unless two offices of the company are located close to each other.

11. Plan Updates

How often do you update your DR plan?

1 Quarterly
1 Yearly
7 Ad hoc
2 Never

Most respondents (7 of 11) answered 'Ad hoc'. This was alarming, given that a DR plan should be closely regulated. However, following the discussion, we accepted that this was a result of constant changes, for example in personnel and software, in the companies represented by the respondents.

It was generally agreed that DR plans should be regularly updated.

12. Importance of DR

How important does your office's management rate DR?

3 Very
1 Quite
6 Of some importance
1 Not important
0 Irrelevant

The most common response by far was 'Of some importance'. Following discussion of the potential costs to a business of a disaster, it was agreed that management should rate DR as more important than this. Methods of persuading senior staff to take the subject more seriously were debated.

The most direct way of engaging management is to cost the DR time, as set out in Question 6.

Conclusion

Attendees of the forum agreed to exchange notes on what tools they were using to help with DR, particularly those concerned with data synchronisation.

Attendees:
  • Jevon Tucker, 3DReid
  • Ben Stratton-Woodward, Exposure
  • James Tansley, Hamiltons
  • Chris Poulton and Hugh Fernando, Wilkinson Eyre
  • Luke Ritchie, Dixon Jones
  • Warren Binns, Reardon Smith
  • Bevan Badenhorst, Steffian
  • Craig Barrett, ORMS
  • John Milsom, Fletcher Priest
  • Rory Campbell-Lange, Mark Adams and Tim Whiteley, Campbell-Lange Workshop