Disaster Recovery
The name already suggests it: disaster recovery is about serious situations, namely the handling of a crisis. More precisely, when a critical IT incident occurs, disaster recovery deals with the restart and recovery of IT infrastructures, IT systems and applications that have failed or been damaged by a disruption. A structured restart or recovery is only possible if the corresponding measures have been planned and organised in advance. The planning and organisation of these measures is nowadays referred to as IT Service Continuity Management / Disaster Recovery Management. A brief clarification of the terms: the term disaster recovery dates back to the 1970s and is no longer quite up to date, because it focused primarily on the operational management of an IT emergency. Today the term "IT Service Continuity Management" (ITSCM for short) is used instead, and it is also used in the rest of this article.
Disaster recovery - definition of the crisis case
The emergency scenarios for IT service continuity management follow a worst-case approach (data centre failure, failure of the WAN or of the link between data centres, data centre inaccessibility). They cover every critical situation whose damaging effect can cause IT infrastructures, IT systems and applications to fail or data to be lost. Conceivable causes for the failure of a data centre are seismic or climatic natural disasters such as earthquakes, floods, storms and hurricanes, but also fire, a failure of the power supply or criminal acts. It is then the task of ITSCM to restart the affected IT services and to restore the data.
Where exactly are these data and functions located?
Nowadays, almost all of a company's business processes require functioning IT. Disaster recovery or ITSCM, run as an ongoing management system, is therefore very important for a company's ability to operate. ITSCM-relevant systems are part of daily operations: typical IT infrastructures, IT systems and applications are, for example, data networks (LAN), storage area networks (SAN), mainframes, server systems, databases, middleware, application software and also telephone systems.
Aim of Disaster Recovery/ITSCM
In the end, disaster recovery or ITSCM serves the overarching goal of minimising the potential damage to the company affected by a critical IT incident or outage.
Disaster Recovery as distinct from Business Continuity Management
In German, disaster recovery is best translated as IT emergency management. Although the two terms are often used as synonyms, disaster recovery should not be confused with business continuity management (BCM). BCM covers a much broader area: it deals with the continuation and maintenance of all time-critical business processes in an emergency scenario and thus primarily ensures the continuity of business operations. Disaster recovery, on the other hand, is limited to the restart and recovery of IT infrastructures, IT systems and applications and their data in IT emergencies, failures and disruptions. Disaster recovery or ITSCM therefore covers only a subset of the processes and systems addressed by BCM.
What does a sensible disaster recovery concept look like?
Like BCM, disaster recovery/ITSCM also has the task of securing the relevant systems for an IT emergency, so of course planning must be done in advance. The creation of a disaster recovery concept or so-called disaster recovery plans is indispensable at this point.
The Disaster Recovery Plan (DRP)
The disaster recovery plan (DRP for short) includes all measures and regulations that enable a successful restart in the event of a disaster. Conversely, this means that all business-critical IT infrastructures, IT systems and applications must be identified beforehand.
Measures of the Disaster Recovery Plan
The disaster recovery plan contains all measures and regulations that are required in a failure scenario to get IT infrastructures, IT systems and applications up and running again as quickly as possible. These measures include, for example, the provision of replacement hardware (in the event of hardware malfunction or failure) and the steps needed to restore data. In addition, the disaster recovery plan names the persons responsible in an emergency and lays down the step-by-step procedure for implementing the measures, so that the recovery runs smoothly.
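The elements named above (affected system, responsible person, ordered steps) can be captured in a simple data structure. The following is only a minimal sketch; the class names RecoveryMeasure and DisasterRecoveryPlan, the method names and the example values are hypothetical and not taken from any standard or tool.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveryMeasure:
    """One entry of the plan (hypothetical structure for illustration)."""
    system: str        # IT system or application the measure applies to
    responsible: str   # person or role in charge during the emergency
    steps: list[str] = field(default_factory=list)  # ordered recovery steps


@dataclass
class DisasterRecoveryPlan:
    """Collects the measures for all business-critical systems."""
    measures: dict[str, RecoveryMeasure] = field(default_factory=dict)

    def add(self, measure: RecoveryMeasure) -> None:
        # One measure per system; a later entry replaces an earlier one.
        self.measures[measure.system] = measure

    def runbook(self, system: str) -> list[str]:
        # Numbered step-by-step procedure including the responsible person.
        m = self.measures[system]
        return [
            f"{i}. {step} (responsible: {m.responsible})"
            for i, step in enumerate(m.steps, start=1)
        ]


plan = DisasterRecoveryPlan()
plan.add(RecoveryMeasure(
    system="database",
    responsible="DBA on call",
    steps=["Provision replacement hardware", "Restore latest backup"],
))
for line in plan.runbook("database"):
    print(line)
```

Keeping the steps as an ordered list per system mirrors the step-by-step procedure that the plan is meant to document.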
Putting it to the test: Disaster recovery test
The effectiveness of the DRP has to be verified in practice, so the plan should be tested regularly and run through with the actual involvement of all responsible employees. This shows what already works in the recovery and restart of IT infrastructures, IT systems and applications and what can still be improved. Such tests need to be planned strategically, with a test schedule drawn up for each year.
Important key figures for successful disaster recovery
Certain key figures set the targets for disaster recovery/ITSCM and thus shape the design of the emergency measures. The most important ones are the recovery time objective and the recovery point objective. They are defined individually for each time-critical business process and must be met by ITSCM.
Recovery Time Objective (RTO)
Time factor: for how long, at most, may a time-critical business process and the IT that supports it be unavailable? The RTO is the maximum tolerable time between the failure and the restart. This key figure varies widely and depends on the importance of the system.
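As a small illustration, the RTO can be checked against a measured outage, for example after a disaster recovery test. This is a minimal sketch; the function name meets_rto and the example times are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def meets_rto(outage_start: datetime, service_restored: datetime,
              rto: timedelta) -> bool:
    """Return True if the measured downtime stays within the RTO."""
    downtime = service_restored - outage_start
    return downtime <= rto


# Hypothetical example: outage at 08:00, service restored at 11:30.
start = datetime(2024, 1, 1, 8, 0)
restored = datetime(2024, 1, 1, 11, 30)

print(meets_rto(start, restored, rto=timedelta(hours=4)))  # downtime 3.5 h, RTO 4 h
print(meets_rto(start, restored, rto=timedelta(hours=3)))  # downtime 3.5 h, RTO 3 h
```

The first call reports that a four-hour RTO was met, the second that a three-hour RTO was missed.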
Recovery Point Objective (RPO)
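Loss factor: how much data loss is acceptable? The RPO describes the maximum period for which data may be lost, that is, the data created between the last backup and the failure. The lower the RPO, the less data may be lost and the more frequently backups have to be made.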
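The relationship between backup times and the RPO can be sketched as follows. The function names data_loss_window and meets_rpo and the example times are assumptions made for illustration, and the sketch simplifies by treating each backup as completed at a single point in time.

```python
from datetime import datetime, timedelta


def data_loss_window(backup_times: list[datetime],
                     failure_time: datetime) -> timedelta:
    """Time span between the last backup before the failure and the failure."""
    earlier = [t for t in backup_times if t <= failure_time]
    if not earlier:
        raise ValueError("no backup exists before the failure")
    return failure_time - max(earlier)


def meets_rpo(backup_times: list[datetime], failure_time: datetime,
              rpo: timedelta) -> bool:
    """Return True if the data lost by the failure stays within the RPO."""
    return data_loss_window(backup_times, failure_time) <= rpo


# Hypothetical example: backups every six hours, failure at 17:00.
backups = [datetime(2024, 1, 1, h, 0) for h in (0, 6, 12, 18)]
failure = datetime(2024, 1, 1, 17, 0)

print(data_loss_window(backups, failure))                # time since the 12:00 backup
print(meets_rpo(backups, failure, timedelta(hours=4)))   # 5 h of data lost, RPO 4 h
```

In this example five hours of data are lost, so an RPO of four hours would require a shorter backup interval.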