Business Continuity

Business Continuity in an Age of Terror

By Eldad Galker.

September, 2001

The author is the owner and General Manager of the Chief Group, a group of companies founded in the 1980s that has dealt ever since with data survivability in computerized systems.

One of the highest goals in computing is service continuity (business continuity). The money and other resources now invested in backup, survival and recovery systems are enormous. According to IDC estimates, based on the yearly sales revenues of the companies supplying backup and recovery solutions, yearly investment in this market segment exceeds $2.7 billion and will exceed $4.7 billion in 2005. The recent terrorist attacks in the USA (9/11) will certainly affect those estimates in both the short and the long term.

The fact that Morgan Stanley, whose offices were located in the New York World Trade Center, managed to return to operation in less than 24 hours is the direct result of building a suitable preventive backup system according to a correct risk assessment. That risk assessment rested on the understanding that an alternative backup infrastructure must be prepared for the possible collapse of the computer systems.
Although Morgan Stanley’s monthly running expenses on those backup systems exceed $100,000, suitable solutions can be found for the more modest budgets of medium-sized and even small companies.

To understand how an accurate risk assessment should be done, we must note the difference between the computer technician’s understanding of the computer system and the end user’s understanding of it, and, more generally, the different meanings each of them gives to the term “system”.

The computer in itself is a tool, and as such its mission is to aid in the creation, storage, location and quick retrieval of information when needed.
The computer, like a simple pencil, makes it possible to transfer ideas, thoughts, data and general information from the human mind to a written medium. Like the pencil, it allows the written data to be erased and rewritten. The computer, its operating system and the software installed in it have no importance by themselves without a human being transferring his thoughts, exactly as the pencil has no importance by itself.

Continuing the analogy, the computer allows documents to be filed as in a bookcase, in different file holders, and sorted by name, date of creation, etc., so that the files can be located and opened when needed. The computer does so in different ways through the operating system, the hardware component adaptors (drivers), and specific software programs.

For the computer technician, the computer system is worthless until the central processing unit (CPU), which determines the capacities and performance of the system, has been installed. The technician can configure the computer according to his experience and understanding by changing the BIOS settings. In the eyes of the systems technician, the computer hardware is worthless until the operating system and the utility software have been installed. Those, as well, can be adapted to different needs through changes to the configuration files or the Registry.

In each case, computer technicians assume that the computer’s purpose is to function properly, and they invest their best efforts to achieve this.
Technicians believe that the highest goal is the correct installation of the hardware, the operating system and the utility software, so that all the hardware components function optimally without interfering with each other. Beyond this, they rarely show additional interest in the system. For technicians, a System is a hardware complex with a sound, properly working operating system, which allows them to apply their judgment and experience to affect performance through the changes they introduce.

In contrast, for the end user, System means being able to click on an icon with the mouse and thereby get instant access to the mechanism through which they can transfer information from their mind to the databases in the computer, or quickly retrieve it when needed. A smoothly working system is the basis for all their computer related activities. Since the one and only purpose of the computer is to let the user transfer information to and from the system, any system that complies with all the standards and conditions posed by the technical experts, but does not allow end users to handle it according to their needs, experience, knowledge or skills, is definitely neither a sound nor an effective system.

Computer systems can be seen as binary systems, always composed of two components: SYSTEM and DATA. Every system, through its user interface, allows data created or collected by human beings to be defined, modified or processed.

When we examine the way huge databases are handled, even worse confusion can be found: in the eyes of the system technicians responsible for backing up the database, the whole database is DATA that has to be backed up. They see no difference between the database system and the data stored in it, which was created by the users. As a result, they sometimes cause deficient service or even service interruptions for their customers, the end users.

The existing backup and recovery methods also distinguish between the treatment of the data and the treatment of the systems.

Backup Solutions Versus Fault Tolerant Solutions.

To protect the system, methods called Fault Tolerance are used. These methods allow the system to keep functioning even after a hardware error or even a system failure.

Fault Tolerance solutions are neither expected nor able to cope with software errors or data problems. Among the more popular solutions of this kind are RAID and mirroring. With these methods, if one of the hard drives in the system stops working for any reason, the system’s function and service continue unaffected. The system generally cannot cope with the collapse of more than one hard drive at a time. Systems of this kind are completely “indifferent” to the type and content of the data stored in them: the data can be totally deleted, contaminated with viruses or scrambled in any possible way without any alert or protection from the computer’s system.

To protect Data, the Backup method is used. Backup means keeping a copy of the previous information in a different way. A different way means transferring the information to another location, another computer, another hard drive, or another kind of media, such as magnetic tape or optical media, or even printing it on paper. The more historical versions of the information are kept, the better the chances and the quality of its retrieval.

Every day we meet organizations in which damaged data has been backed up with no awareness of the damage that made the data useless. Keeping a number of historical versions of the data can often help in reconstructing the desired information.

The difference between these two methods is clear and sharp. The survival of the system is ensured only by Fault Tolerant solutions, while the data is protected only by Backup solutions. It is of course possible to defend data from damage through Fault Tolerant solutions. However, we must be aware that this kind of solution protects the availability of the data, but not its contents or its validity. It is also possible, of course, to protect the system through backup solutions, as long as we are aware that this method allows the definition files to be retrieved only to a sound system.

Protecting data by means of a fault tolerant solution is not effective, because any damage to the original data instantly affects the alternative copy in the same way. As said before, fault tolerant solutions are completely “indifferent” to the data content, and any act performed on the data, such as modification or complete deletion, is perfectly legitimate as far as this type of solution is concerned.
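
The contrast between availability and validity can be illustrated with a small toy model. This is only a sketch: the class names `MirroredStore` and `VersionedBackup` are hypothetical, and Python dictionaries stand in for drives and backup media.

```python
class MirroredStore:
    """Toy model of a two-drive mirror: every write reaches both drives."""

    def __init__(self):
        self.drives = [{}, {}]

    def _healthy(self):
        healthy = [d for d in self.drives if d is not None]
        if not healthy:
            raise IOError("all drives have failed")
        return healthy

    def write(self, name, content):
        # The mirror does not inspect the content; good and bad data
        # are copied to every healthy drive alike.
        for drive in self._healthy():
            drive[name] = content

    def fail_drive(self, index):
        self.drives[index] = None  # simulate a hardware failure

    def read(self, name):
        return self._healthy()[0][name]

    def contents(self):
        return dict(self._healthy()[0])


class VersionedBackup:
    """Toy model of a backup that keeps every historical generation."""

    def __init__(self):
        self.generations = []

    def take(self, store):
        self.generations.append(store.contents())

    def restore(self, name, generation=-1):
        return self.generations[generation][name]


store = MirroredStore()
backup = VersionedBackup()

store.write("report", "good")
backup.take(store)                  # generation 0 holds the sound data
store.write("report", "corrupted")  # the damage reaches both drives at once
store.fail_drive(0)                 # a single drive failure is survived...

print(store.read("report"))         # ...but only the damaged data is available
print(backup.restore("report", 0))  # the older generation is still sound
```

The mirror keeps the service running after the drive failure, yet what it serves is the corrupted version; only the historical generation in the backup returns the sound data.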

Protecting the system by means of a backup solution is ineffective to the same degree, since a backup procedure merely copies files and information. Such copies do not allow the system to boot after a crash; the whole system must first be reinstalled, together with the proper backup software, before the files can be retrieved from the backup.

Fault Tolerant solutions run mostly on hard drives. Additional solutions provide alternative computer systems on line, and even alternative locations containing everything needed to continue the corporate operation. Lately, some manufacturers have made it possible to actively back up the system to magnetic tape in a way that allows it to be retrieved without an operating system or any backup and recovery software, much like the method still used in minicomputers.

Backup solutions were traditionally implemented on magnetic tapes, which can be carried off the backup site or stored in safes.

These backups are available and convenient, but not always reliable. Therefore, they have to be created according to strict work procedures: regular tape renewal, tape head cleaning, quality tests of the data on tape, and optimal storage conditions. These solutions offer a relatively simple backup, but the retrieval process is often slow, complicated or inconvenient.

Data can also be backed up to floppy disks, CDs or DVDs. In all these options the data volume is an obstacle. Lately, some manufacturers of backup systems have also made it possible to use hard drives as part of the backup process, though not always as part of the retrieval process.

Combined solutions exist which offer the virtues of all methods, such as RAIT (RAID on Tapes), which records to a number of tapes simultaneously and thus significantly increases both the tape read/write speed and the survival chances of the data.

In database backup the considerations have to be the same as in the backup of servers: the database system has to be backed up to a sound and functional copy which can be activated immediately in case of damage, as in Fault Tolerance. Separately, the data accumulated in it has to be preventively backed up for the probable case in which the database has to be rebuilt and the last data version retrieved.

Backed up data should preferably be kept in the simplest possible form for data recovery or rescue. Keeping backed up data compressed, encrypted, or in a non-standard format complicates, delays and raises the cost of the whole recovery process.

Down Time

When taking into consideration the corporate backup processes, there are a number of critical factors affecting the corporate decisions related to those processes:
    1. DT: Down Time, the expected duration (in hours) of a service interruption.
    2. DT$: the all-inclusive cost of each DT hour.*
    3. T: the expected time (in hours) elapsed between consecutive DT events.
    4. Q: the quality of the retrieved data (percentage of the lost data retrieved) after the service interruption.
    5. $: the all-inclusive hourly cost (derived from the yearly cost) of the corporate data survival protective measures.

*According to the 2001 survey by Contingency Planning Research (http://www.contingencyplanningresearch.com), 46% of the companies reported that the cost of one DT hour can reach up to $50,000, and 28% of the companies estimated the same cost at between $51,000 and $250,000. The survey also shows that 40% of the companies are not in existential danger within the range of 72 DT hours, and 21% of them within 48 DT hours.

The optimal monetary investment ($) should increase the time (T) elapsed between DT events, decrease the DT value, and increase the quality (Q) of the data available at the end of each DT event. It should do so at minimal cost, with the degree of flexibility the organization needs in each parameter, and as the result of a cost-effectiveness analysis of the financial investment needed to prevent the direct and indirect damage caused by a probable service interruption.
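
One rough way to put the factors DT, DT$, T and $ side by side is sketched below. The formula is an illustrative assumption, not taken from the survey or from any standard: it spreads the cost of one DT event over the T hours between events and adds the hourly protection cost. Q is left out, since the value of lost data differs for every organization, and all the figures are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: the service is needed around the clock


def expected_downtime_cost_per_hour(dt_hours, dt_cost_per_hour, t_hours):
    """Cost of one DT event, averaged over the T hours between events."""
    return dt_hours * dt_cost_per_hour / t_hours


def total_cost_per_hour(dt_hours, dt_cost_per_hour, t_hours,
                        protection_cost_per_hour):
    """Hourly protection cost ($) plus the averaged downtime cost."""
    return protection_cost_per_hour + expected_downtime_cost_per_hour(
        dt_hours, dt_cost_per_hour, t_hours
    )


# Hypothetical figures, for illustration only: one crash every 4 years,
# each DT hour costing 50,000.
no_protection = total_cost_per_hour(
    dt_hours=48, dt_cost_per_hour=50_000,
    t_hours=4 * HOURS_PER_YEAR, protection_cost_per_hour=0.0,
)
with_backup = total_cost_per_hour(
    dt_hours=4, dt_cost_per_hour=50_000,
    t_hours=4 * HOURS_PER_YEAR, protection_cost_per_hour=2.0,
)

print(f"no protection: {no_protection:.2f} per hour")
print(f"with backup:   {with_backup:.2f} per hour")
```

Under these assumed numbers the protected configuration is cheaper per hour even though $ is no longer zero, which is the kind of comparison a cost-effectiveness analysis is meant to surface.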

Theoretically, an ideal situation in which no computer system mishap ever happens is impossible to reach. Mathematically expressed, when Q=100%, DT=0, and T=∞, then $=∞. Therefore, an ideal solution is a utopia, and only an optimally focused solution is practical. Optimal solutions do not always require financial investments. Mostly, they require first of all the investment of serious thought and attention to the users’ needs, in order to find the issues on which it is possible to compromise so as to reach optimal performance.

Examples:

        A. An isolated RAID system with no backups, in which a single hard drive has collapsed. The system does not interrupt the supply of services, so DT=0 and Q=100%. In the same system, when two hard drives happen to collapse and the interruption of services is total, DT=∞ (“infinite”) and Q=0. The only way out of such a situation is to submit the system to a data recovery laboratory in order to decrease DT and increase Q. In this kind of system, it is compulsory to add backup to magnetic tapes.

        B. Assume that the average lifetime of a computer system is about 4 years (between crashes), and that the organization can tolerate DT=1 hour once in 4 years, with Q equal to the last sound data backup, up to 24 hours old. In such a case, a crash would make the organization lose all the new data created or updated during the last workday. The proper advice here is to install a tape backup system to back up the important data daily. The restrictions, however, will be:
     1. No individual tape will be used more than 20 times, and in any case it will not remain in rotation for over 6 months.
     2. Tape heads will be cleaned once every 20 backup sessions, and at least once a month.
     3. The backed up data will undergo a sample recovery test from each tape at least once in the lifetime of the tape, etc.
     4. Once a quarter, a system crash and recovery simulation test will be carried out.
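
The four restrictions above lend themselves to simple automated checks. The sketch below is one possible encoding; the names (`Tape`, `tape_must_retire`, and so on) are hypothetical, and the day counts used for "6 months", "a month" and "a quarter" are assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_TAPE_USES = 20                     # rule 1: at most 20 sessions per tape
MAX_TAPE_AGE = timedelta(days=183)     # rule 1: about 6 months (assumed)
CLEAN_EVERY_SESSIONS = 20              # rule 2: clean heads every 20 sessions
CLEAN_EVERY = timedelta(days=30)       # rule 2: about a month (assumed)
SIMULATION_EVERY = timedelta(days=91)  # rule 4: about a quarter (assumed)


@dataclass
class Tape:
    label: str
    first_used: date
    uses: int = 0
    restore_tested: bool = False  # rule 3: sample recovery test done


def tape_must_retire(tape, today):
    """Rule 1: retire after 20 uses or 6 months, whichever comes first."""
    return (tape.uses >= MAX_TAPE_USES
            or today - tape.first_used >= MAX_TAPE_AGE)


def heads_need_cleaning(sessions_since_cleaning, last_cleaned, today):
    """Rule 2: clean every 20 sessions, and at least once a month."""
    return (sessions_since_cleaning >= CLEAN_EVERY_SESSIONS
            or today - last_cleaned >= CLEAN_EVERY)


def tapes_missing_restore_test(tapes):
    """Rule 3: tapes that have never passed a sample recovery test."""
    return [t.label for t in tapes if not t.restore_tested]


def simulation_due(last_simulation, today):
    """Rule 4: a crash and recovery simulation once a quarter."""
    return today - last_simulation >= SIMULATION_EVERY


tapes = [
    Tape("MON-1", date(2001, 1, 8), uses=20, restore_tested=True),
    Tape("TUE-1", date(2001, 6, 5), uses=4),
]
today = date(2001, 9, 1)
print([t.label for t in tapes if tape_must_retire(t, today)])
print(tapes_missing_restore_test(tapes))
```

A check of this kind can run before every backup session, so that a worn tape or an overdue head cleaning is noticed before the backup rather than during a recovery.
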
The most common solution to decrease DT and increase Q is the addition of magnetic tape for data backup. This increases the $ factor. In this situation DT equals a number of hours, and Q falls short by the gap between the last sound backup session and the latest data created or updated since then.

An additional solution to the RAID problem is to install two RAID 5 arrays which mirror each other (a layout usually called RAID 5+1 or RAID 51, and sometimes loosely referred to as RAID 10). This solution doubles the expenses (2X$) but decreases DT to 0 while raising Q to 100%. Yet this holds only as long as drive failures have not brought down both RAID 5 arrays at the same time. In such a case, which we have already witnessed, the situation is the same as in the former structure.

In organizations where a 24-hour-old Q represents real direct or indirect damage and the accumulated loss of work hours is relatively high, much more creative combinations must be considered, such as systems that combine the backup and security elements without overloading the organization’s data traffic.

The first element to identify in the organizational risk analysis is the hourly price of service interruption for each one of the systems.

Personal Down Time (PDT)

In any survival and recovery process, one of the sometimes unnoticed components is a concept for which we propose a new term: Personal Down Time (PDT). The term refers to the time lost by one single corporate user. For instance, an employee who invests great effort in creating a file and later erroneously deletes or scrambles it causes himself PDT equal to the time needed to redo all the work, plus as many hours as his other tasks are delayed.

The damage seems insignificant, but closer analysis shows that on average, in a typical organization, one employee in 100 suffers 3 hours of PDT at least once a day. In a 100-employee organization working 23 days a month, the time lost would be 69 hours a month, or 8.6 workdays, which represent 37% of a full-time job. In any organization with 266 employees, this means paying a full salary to one extra (virtual) employee called PDT. This statement can easily be checked. Has it not happened to each of us to load a file, change it and then save it with “save” instead of “save as”, so that many hours were later needed to rewrite the original file? Doesn’t this kind of insignificant human error happen to us at least two or three times a year (approximately once in 100 days)?
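
The arithmetic in this paragraph can be reproduced with a few lines. This is a sketch under one stated assumption: an 8-hour workday, which is what the figure of 8.6 workdays for 69 hours implies. The function names are hypothetical.

```python
HOURS_PER_WORKDAY = 8.0  # assumption implied by 69 hours = 8.6 workdays


def pdt_hours_per_month(employees, incidents_per_100_per_day=1.0,
                        hours_per_incident=3.0, workdays_per_month=23):
    """Monthly hours lost to PDT across the whole organization."""
    incidents_per_day = employees / 100 * incidents_per_100_per_day
    return incidents_per_day * hours_per_incident * workdays_per_month


def pdt_job_fraction(employees, workdays_per_month=23):
    """PDT loss expressed as a fraction of one full-time job."""
    hours = pdt_hours_per_month(employees,
                                workdays_per_month=workdays_per_month)
    return hours / (HOURS_PER_WORKDAY * workdays_per_month)


hours = pdt_hours_per_month(100)
workdays = hours / HOURS_PER_WORKDAY
fraction = pdt_job_fraction(100)
break_even = 100 / fraction  # headcount at which PDT equals one full salary

print(f"{hours:.0f} hours a month")        # 69 hours a month
print(f"{workdays:.3f} workdays a month")  # 8.625, quoted as 8.6
print(f"{fraction:.1%} of a job")          # 37.5%, quoted as 37%
print(f"{break_even:.1f} employees")       # 266.7, quoted as 266
```

Changing `hours_per_incident` or `incidents_per_100_per_day` to match an organization’s own experience shows how quickly the “virtual employee” grows or shrinks.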

When we compare the cost of that cumulative harm with the damage caused by one DT event on the main server once in four years, it turns out that the PDT damage accumulated during those 4 years is many times more significant. To our surprise, we see that corporate investment in PDT prevention, or in improving Q at recovery from PDT, is close to 0.

In addition, we clearly see that conventional backup solutions are not designed to solve the PDT problem. In most cases, recovering a single lost file from tape takes almost as long as re-creating that file, if not longer.

It should be emphasized that this cost calculation does not take into account the total loss of creative work, or delays in scheduled projects or in service supply, all of which might be caused by PDT.

In organizations where the creative work component is high, such as software houses, graphic departments, law offices or accounting offices, the loss of one day’s work can cause average cumulative damage of three days. In this kind of organization it is compulsory to install backup solutions which copy the data every 30 to 60 minutes, in order to reduce the damage to a tolerable minimum.

Given that tape backup solutions are not designed to provide backup sessions as frequently as needed, the solution required in most organizations is the backup of data versions on hard drives. The hard drive is a component whose cost keeps decreasing while its available capacity keeps increasing. Its write and read speed is higher than that of any other comparable storage method, and the location and extraction of data are immediate.
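
As a minimal sketch of what a versions backup on a hard drive can look like, the code below copies a working folder into a new timestamped folder on each run and keeps a fixed number of generations. It is an illustration only, not a description of any particular product: the function names and the default of 48 generations are hypothetical, the backup folder is assumed to lie outside the source folder, and scheduling the call every 30 to 60 minutes is left to the operating system.

```python
import shutil
from datetime import datetime
from pathlib import Path


def take_snapshot(source: Path, backup_root: Path, keep: int = 48) -> Path:
    """Copy `source` into a new timestamped folder under `backup_root`.

    Only the newest `keep` generations are retained; older ones are
    deleted, oldest first. `backup_root` must not be inside `source`.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    backup_root.mkdir(parents=True, exist_ok=True)
    # Microsecond resolution keeps folder names unique and sortable.
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
    target = backup_root / stamp
    shutil.copytree(source, target)  # raises if the target already exists

    generations = sorted(
        (p for p in backup_root.iterdir() if p.is_dir()),
        key=lambda p: p.name,
    )
    for old in generations[:-keep]:
        shutil.rmtree(old)
    return target


def find_versions(backup_root: Path, relative_path: str) -> list:
    """Return every stored generation of one file, newest first."""
    snapshots = sorted(
        (p for p in backup_root.iterdir() if p.is_dir()),
        key=lambda p: p.name,
        reverse=True,
    )
    return [s / relative_path for s in snapshots
            if (s / relative_path).is_file()]
```

Recovering from the “save” instead of “save as” mistake then amounts to calling `find_versions` for the overwritten file and copying back the generation taken before the error. Full copies are the simplest form to restore from, at the price of disk space; this matches the preference for keeping backed up data in the simplest possible form.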

In conclusion:

Backup is not our highest goal, just as a computer system that is completely sound but unavailable to the user is not. The backup is only an aid to service continuity. In order to provide service continuity it is mandatory to combine quick recovery solutions, not merely quick backup solutions.

Backup and survival solutions must suit the organization’s needs, capacity, and the reasonable threats to its service continuity. Survival arrays have to fit the technical skills of the computer support staff and the agreed work regulations within the organization. Work regulations that are correct but that the staff responsible for the computers cannot fully carry out are unacceptable. In such a case, the regulations have to be reconsidered and matched to what is attainable.

In addition to the various modern threats to service continuity, such as system crashes, backup system crashes and power supply failures, the different terrorism threats must today be added to the list of reasonable, possible and probable threats and included in the corporate risk analysis.

For a small organization which can build a small alternative system in an alternative location (in some cases the owner’s own home), it is recommended to perform a complete backup to tape every day and to restore, daily, the full content of the tape to the alternative system.

By this method, the tape condition and the quality of the backed up data are tested daily, while an intact, up-to-date alternative system is kept ready for service in case the original computer system at the company’s premises is down or missing.
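
The daily test described here can be made explicit by comparing the restored files with the originals. The sketch below uses SHA-256 checksums for the comparison; this is an assumption chosen for illustration, not a method named in this article, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of one file, read in chunks to keep memory use low."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root: Path) -> dict:
    """Map each file path (relative to `root`) to its checksum."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in root.rglob("*")
        if p.is_file()
    }


def compare_restore(original: Path, restored: Path):
    """Return (missing, changed): files absent from the restored copy,
    and files whose content differs from the original."""
    src = tree_digests(original)
    dst = tree_digests(restored)
    missing = sorted(set(src) - set(dst))
    changed = sorted(name for name in src.keys() & dst.keys()
                     if src[name] != dst[name])
    return missing, changed
```

Files that users modified after the backup was taken will also appear as changed, so the comparison is most meaningful when run right after the restore, or against a list of checksums recorded at backup time.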

These kinds of simple and relatively inexpensive solutions are also used, although on a different scale, in bigger, more branched and more complex organizations.

Under the work conditions of modern organizations, in which the backup time window keeps shrinking and the importance of recovery quality keeps growing, matching backup and recovery systems must be built to allow high frequency data backup, every few hours or even minutes, and immediate recovery of the different data generations, to and from hard drives. In addition, the data has to be backed up to magnetic tapes that are taken off site or stored in safes, to guard against fire, robbery, or any other kind of total damage. This combination is today the only complete solution which allows focused or complete data recovery according to needs, and at low cost.

Comments on this article are welcome at eldad@chief-group.com

A backup and recovery solution for part of the problems presented in this article can be freely downloaded from the Chief Group site http://www.bos.co.il

More sites of interest are:

http://www.idc.com

http://www.contingencyplanningresearch.com

http://www.ironmountain.com

http://www.snwonline.com

http://www.infosecuritymag.com/articles/may00/departments1_note.shtml