Business Continuity in an Age of Terror
By Eldad Galker.
September, 2001
The author is the owner and General Manager of the Chief Group, a group of companies founded in the 1980s that has dealt ever since with data survivability in computerized systems.
One of the highest goals in computing is service continuity (business continuity). The money and other resources now invested in backup, survival and recovery systems are enormous. According to IDC estimates, based on the yearly sales revenues of the companies supplying backup and recovery solutions, the yearly investment in this market segment exceeds $2.7 billion and will exceed $4.7 billion in 2005. The recent terrorist attacks in the USA (9/11) will certainly affect those estimates in both the short and the long term.
The fact that Morgan Stanley, whose offices were located in the New York World Trade Center, managed to resume operation in less than 24 hours is solely the result of building a suitable preventive backup system, designed according to a correct risk assessment. That risk assessment was based on understanding the need to prepare an alternative backup infrastructure for the possible collapse of the computer systems.
Although Morgan Stanley's monthly running expenses on those backup systems exceed $100,000, it is possible to find suitable solutions for the more modest budgets of medium-sized and even small companies.
To understand how an accurate risk assessment should be done, one must note the difference between the computer technician’s understanding of the computer system and the end user’s understanding of the computer, and also between those in general and the term “system” in particular.
The computer in itself is a tool, and as such its mission is to aid in the creation, storage, finding and quick retrieval of information when needed.
The computer, like a simple pencil, makes it possible to transfer ideas, thoughts, data and general information from the human mind to written media. Like the pencil, it allows the written data to be erased and rewritten. The computer, its operating system and the software installed in it have no importance by themselves without a human being transferring his thoughts, exactly as the pencil has no importance by itself.
Continuing the analogy, the computer allows documents to be filed as in a bookcase, in different file holders, and sorted by name, date of creation, etc., so that the files can be located and opened when needed. The computer does so in different ways through the operating system, the hardware drivers, and specific software programs.
For the computer technician, the computer system is worthless until the central processing unit (CPU), which determines the capacities and performance of the system, has been installed. The technician can configure the computer according to his experience and understanding through changes to the BIOS settings. In the eyes of the systems technician, the computer hardware is worthless until the operating system and the utility software have been installed. Those, as well, can be adapted to different needs by changes to the configuration files or the Registry.
In each case, computer technicians assume that the computer’s purpose is to function properly, and they invest their best efforts to achieve this.
Technicians believe that the correct installation of the hardware, the operating system, and the utility software, in such a way that all the hardware components function optimally without interfering with each other, is the highest goal. Beyond this, they rarely show additional interest in the system. For the technicians, “system” means a hardware complex with a sound and properly working operating system, which allows them to apply their judgment and experience to affect performance through the changes they introduce.
In contrast, for the end user, “system” means being able to click on an icon with the mouse and thereby get instant access to the mechanism through which they can transfer information from their mind to the databases in the computer, or quickly retrieve it when needed. A smoothly working system is the basis for all their computer-related activities.
Since the one and only purpose of the computer is to let the user transfer information to and from the system, any system that complies with all the standards and conditions posed by the technical experts but does not allow the end user to handle it according to his needs, experience, knowledge or skills is definitely neither a sound nor an effective system.
Computer systems are two-part systems, always composed of two components: SYSTEM and DATA. Every system by itself allows the user, through the user interface, to define, modify or process data created or collected by human beings.
When examining how huge databases are handled, even worse confusion can be found: in the eyes of the system technicians responsible for backing up the database, the whole database represents DATA, which has to be backed up. They make no distinction between the database system and the data stored in it, which was created by the users. As a result, they sometimes cause deficient service or even service interruptions for their customers: the end users.
The existing backup and recovery methods also distinguish between the treatment of the data and the treatment of the systems.
Backup Solutions Versus Fault Tolerant Solutions
To protect the system, methods called Fault Tolerance are used. These methods allow the system to keep functioning even after a hardware error or even a system failure.
Fault Tolerance solutions are neither expected nor able to cope with software errors or data problems. Among the more popular solutions of this kind are RAID and mirroring. With these methods, if one of the hard drives in the system stops working for any reason, the system’s function and service go on unaltered. The system generally cannot cope with the collapse of more than one hard drive at a time. Systems of this kind are completely “indifferent” to the sort and content of the data stored in them: it is possible to delete it totally, contaminate it with viruses or scramble it in any possible way without getting any alert or protection from the computer system.
To protect data, the Backup method is used. Backup means keeping a copy of the previous information in a different way. A different way means transferring the information to another location, another computer, another hard drive, or another kind of media such as magnetic tape or optical media, or even printing it on paper. The more historical versions of the information are kept, the better the chances and the quality of its retrieval.
We meet organizations daily where damaged data has been backed up without any awareness of the damage that made the data useless. Keeping a number of historical versions of the data can often help in reconstructing the desired information.
The difference between these two methods is clear and sharp. The survival of the system is ensured only by Fault Tolerant solutions, while the data is protected only by Backup solutions. It is of course possible to defend data from damage through Fault Tolerant solutions. However, we must be aware that this kind of solution protects the availability of the data, but not its contents or its validity. It is of course also possible to protect the system through backup solutions, while being aware that this kind of method allows the definition files to be retrieved only to a sound system.
Protecting data by means of a fault tolerant solution is not effective, because any damage to the original data will instantly affect the alternative data in the same way. As said before, fault tolerant solutions are completely “indifferent” to the data content, and any act performed on the data, such as modification or complete deletion, is perfectly legal as far as this type of solution is concerned.
Protecting the system by means of a backup solution is ineffective to the same degree, given that a backup procedure means copying the files and the information. Copies of this kind do not make it possible to boot the system after a crash; the whole system must first be reinstalled, along with the proper backup software, before the files can be retrieved from the backup.
Fault Tolerant solutions run mostly on hard drives. Additional solutions provide alternative computer systems on line, and even alternative locations containing everything needed to continue corporate operation. Lately, some manufacturers make it possible to actively back up the system to magnetic tape in a way that allows its retrieval without an operating system or any backup and recovery software, similar to the method still used in minicomputers.
Backup solutions were traditionally implemented on magnetic tapes, which can be carried off the backup site or into safes. These backups are available and convenient, but not always reliable. Therefore, they have to be created according to strict work procedures of regular tape replacement, tape head cleaning, quality tests of the data on tape, and optimal storage conditions. These solutions provide a relatively simple backup, but the retrieval process is often slow, complicated or inconvenient.
Data can also be backed up to floppy disks, CD or DVD. In all these options the data volume is an obstacle. Lately, some manufacturers of backup systems also make it possible to use hard drives as a part of the backup process, but not always as a part of the retrieval process.
Combined solutions exist which offer the virtues of all methods, such as RAIT (RAID on tapes), which records to a number of tapes simultaneously and thus significantly increases the tape read/write speed and the survival chances of the data.
In database backup the considerations have to be the same as in the backup of servers: the database system has to be backed up to a sound and functional copy which can be activated immediately in case of damage, as in Fault Tolerance. Separately, the data accumulated in it has to be backed up preventively for the probable case in which the database has to be rebuilt and the last data version retrieved.
Data backup should preferably be kept in the simplest possible form for data recovery or rescue. Keeping backed-up data compressed, encrypted, or in a non-standard format complicates, delays and raises the cost of the whole recovery process.
Down Time
When taking into consideration the corporate
backup processes, there are a number of critical factors affecting the corporate
decisions related to those processes:
1. DT: Down Time, the period of time (in hours) of expected service interruption.
2. DT$: All-inclusive cost of each DT hour.*
3. T: Expected time (in hours) elapsed between consecutive DT events.
4. Q: Quality of the retrieved data (percentage retrieved from the lost data) after the service interruption.
5. $: All-inclusive cost per hour (derived from the yearly cost) of the corporate data survival protective measures.
*According to the 2001 survey by Contingency Planning Research (http://www.contingencyplanningresearch.com), 46% of the companies reported that the cost of one DT hour can reach up to $50,000, and 28% of the companies estimated the same cost at between $51,000 and $250,000. The survey also shows that 40% of the companies are not in existential danger within the range of 72 DT hours, and 21% of them within 48 DT hours.
The optimal monetary investment ($) should increase the time (T) elapsed between DT events, decrease the DT value, and increase the quality (Q) of the available data at the end of each DT event. It should do so at minimal cost, with a degree of flexibility suited to the organization's needs according to the different parameters, and as the result of a cost-effectiveness analysis of the financial investment needed to prevent the direct and indirect damage caused by probable service interruption.
Theoretically, it could be said that an ideal situation in which no computer system mishap ever happens is impossible to reach. Mathematically expressed, when Q=100, DT=0, and T=∞, then $=∞. Therefore, an ideal solution is a utopia, and only an optimally focused solution is practical. Optimal solutions do not always require financial investment. Mostly they require, first of all, serious thought and attention to the users' needs, so as to find the issues on which it is possible to compromise in order to reach optimal performance.
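The relation between the factors can be illustrated with a small calculation. The sketch below is only one possible reading, not a formula given in this article: it assumes that DT events are evenly spread, so that the average downtime cost per hour of operation is DT × DT$ / T, and it compares the saving with the hourly protection cost $. It ignores Q, and all function and parameter names are hypothetical.

```python
def hourly_downtime_cost(dt_hours: float, dt_cost_per_hour: float,
                         t_hours: float) -> float:
    """Average downtime cost per hour of operation: DT * DT$ / T.

    Assumption (not from the article): one DT event lasting `dt_hours`
    occurs every `t_hours`, and its cost is spread evenly over T.
    """
    if t_hours <= 0:
        raise ValueError("T must be positive")
    return dt_hours * dt_cost_per_hour / t_hours


def protection_pays_off(protection_cost_per_hour: float,
                        cost_before: float, cost_after: float) -> bool:
    """True if the hourly saving in downtime cost covers the hourly cost ($)."""
    return (cost_before - cost_after) >= protection_cost_per_hour


HOURS_PER_YEAR = 365 * 24

# Illustrative figures only: one crash every 4 years, DT$ = 50,000 per hour,
# DT reduced from 48 hours to 1 hour by the protective measures.
unprotected = hourly_downtime_cost(48, 50_000, 4 * HOURS_PER_YEAR)
protected = hourly_downtime_cost(1, 50_000, 4 * HOURS_PER_YEAR)
print(round(unprotected, 2), round(protected, 2))
print(protection_pays_off(20.0, unprotected, protected))
```

Under these assumptions, a protective measure is worth its price only while its hourly cost stays below the hourly downtime cost it removes.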
Examples:
A. An isolated RAID system with no backups, in which a single hard drive has collapsed. The system does not interrupt the supply of services, so DT=0 and Q=100%. In the same system, when two hard drives collapse and the interruption of services is total, DT=∞ (“infinite”) and Q=0. The only way out of such a situation is to submit the system to a data recovery laboratory in order to decrease DT and increase Q. In this kind of system, it is compulsory to add backup to magnetic tape.
B. Assume that the average lifetime of a computer system is about 4 years (between crashes), and that the organization can tolerate DT=1 hour once in 4 years, with Q equal to the last sound data backup, at most 24 hours old. In such a case, a crash would make the organization lose all the new data created or updated during the last workday. The proper advice here is to install a tape backup system to back up the important data daily. The restrictions, however, will be:
1. No individual tape unit will be used more than 20 times, and in any case none will stay in rotation for more than 6 months.
2. Tape heads will be cleaned once every 20 backup sessions, and at least once a month.
3. The backed-up data has to undergo a sample recovery test from each of the tape units at least once in the lifetime of the unit, etc.
4. Once a quarter, a system crash and recovery simulation test will be carried out.
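The first two restrictions can be checked automatically. The following is a minimal sketch, not a tool described in this article: the names are hypothetical, and the day counts standing in for "6 months" and "a month" are assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_USES = 20                      # rule 1: at most 20 backup sessions per tape
MAX_AGE = timedelta(days=183)      # rule 1: about 6 months in rotation (assumed day count)
CLEAN_EVERY_SESSIONS = 20          # rule 2: clean heads every 20 sessions
CLEAN_EVERY = timedelta(days=30)   # rule 2: and at least once a month (assumed 30 days)


@dataclass
class Tape:
    label: str
    first_used: date
    uses: int = 0


def must_retire(tape: Tape, today: date) -> bool:
    """A tape leaves rotation after 20 uses or about 6 months, whichever comes first."""
    return tape.uses >= MAX_USES or today - tape.first_used > MAX_AGE


def heads_need_cleaning(sessions_since_cleaning: int,
                        last_cleaned: date, today: date) -> bool:
    """Heads are cleaned after 20 sessions or a month, whichever comes first."""
    return (sessions_since_cleaning >= CLEAN_EVERY_SESSIONS
            or today - last_cleaned >= CLEAN_EVERY)
```

For instance, a tape first used on 1 January with only 5 sessions recorded would still be flagged for retirement in August, because the age limit applies regardless of the use count.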
The most common solution to decrease DT and increase Q is the addition of magnetic tape for data backup. This increases the $ factor. In this situation DT equals a number of hours, and Q falls short by the gap between the last sound backup session and the latest data created or updated since then.
An additional solution to the RAID problem is to install RAID 10 (a system composed of two RAID 5 arrays which mirror each other). This solution doubles the expenses, 2X$, but decreases DT to 0 while raising Q to 100. Yet this statement holds only as long as two hard drives, one from each RAID 5 array, have not collapsed simultaneously. In such a case, which we have already witnessed, the situation is the same as in the former structure.
In organizations where a 24-hour-old Q represents real direct or indirect damage and the accumulation of lost work hours is relatively high, much more creative combinations have to be considered, such as systems that combine the backup and security elements without overloading the data traffic in the organization.
The first element to identify in the
organizational risk analysis is the hourly price of service interruption for
each one of the systems.
Personal Down Time (PDT)
In any survival and recovery process, one of the sometimes unnoticed components is the concept expressed by a new term we coin here: Personal Down Time (PDT). This term refers to the time lost by one single corporate user. For instance, an employee who invests great effort in creating a file and later erroneously deletes or scrambles it causes himself PDT equal to the time needed to redo all the work, plus the hours of delay in his other tasks. The damage seems insignificant, but on closer analysis it turns out that, on average, on any given workday one of every 100 employees in a typical organization suffers 3 hours of PDT. In a 100-employee organization working 23 days a month, the time loss would be 69 hours a month, or 8.6 workdays, which represent 37% of a full-time job. In an organization with 266 employees, this amounts to paying a full salary to one extra (virtual) employee called PDT. This statement can be easily checked. Hasn’t each of us at some point loaded a file, changed it and then saved it with “save” instead of “save as”, and then needed many hours to rewrite the original file? Doesn’t this kind of insignificant human error happen to us at least two or three times a year? (Approximately once in 100 days.)
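The arithmetic behind these figures can be reproduced in a few lines. This is a sketch using the numbers quoted above; the 8-hour workday is an assumption, inferred from 69 hours being given as 8.6 workdays, and the function names are hypothetical.

```python
HOURS_PER_WORKDAY = 8  # assumption: implied by 69 hours being about 8.6 workdays


def pdt_hours_per_month(employees: int,
                        workdays_per_month: int = 23,
                        affected_share_per_day: float = 1 / 100,
                        pdt_hours_per_incident: float = 3.0) -> float:
    """Monthly hours lost to Personal Down Time across the organization."""
    incidents_per_day = employees * affected_share_per_day
    return incidents_per_day * pdt_hours_per_incident * workdays_per_month


def pdt_full_time_equivalents(employees: int,
                              workdays_per_month: int = 23) -> float:
    """Lost hours expressed as a fraction of one full-time job."""
    job_hours = workdays_per_month * HOURS_PER_WORKDAY
    return pdt_hours_per_month(employees, workdays_per_month) / job_hours


for staff in (100, 266):
    hours = pdt_hours_per_month(staff)
    print(staff, round(hours, 1), round(hours / HOURS_PER_WORKDAY, 1),
          round(pdt_full_time_equivalents(staff), 3))
```

With 100 employees the loss comes to 69 hours, a little over a third of one job; with 266 employees it comes to almost exactly one full-time position.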
Comparing the cost of that cumulative harm with the damage caused by one DT event on the main server once in four years, it turns out that the PDT damage accumulated during those 4 years is many times more significant. To our surprise, we see that corporate investment in PDT prevention, or in improving Q at recovery from PDT, is close to 0.
In addition, we clearly see that conventional backup solutions are not designed to solve the PDT problem. In most cases, recovering a single lost file from tape takes almost as long as recreating that file, if not longer.
It should be emphasized that this cost calculation does not take into account the total loss of creative work, or delays in scheduled projects or in service supply, all of which may be caused by PDT.
In organizations where the creative work component is high, such as software houses, graphic departments, law offices or accounting offices, the loss of one day’s work can cause average cumulative damage of three days. In this kind of organization it is compulsory to install backup solutions which copy the data every 30 to 60 minutes, in order to reduce the damage to a tolerable minimum.
Given that tape backup solutions are not designed to provide backup sessions as frequent as needed, the solution required in most organizations is backing up data versions to hard drives. The cost of hard drives decreases continuously, while their available capacity continuously increases. Their write and read speed is higher than in any other comparable storage method, and the location and extraction of data are immediate.
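As an illustration of version backup to a hard drive, here is a minimal sketch that keeps timestamped plain copies of a file and prunes the oldest ones. It is not the author's product or any specific tool: the function name, the naming scheme and the default of 48 kept versions are assumptions, and a real deployment would call it from a scheduler every 30 to 60 minutes.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def save_version(source: Path, version_dir: Path, keep: int = 48,
                 now: Optional[datetime] = None) -> Path:
    """Copy `source` into `version_dir` under a timestamped name.

    Copies are stored uncompressed and unencrypted, so any version can be
    recovered with an ordinary file copy. Only the newest `keep` versions
    (keep >= 1) are retained.
    """
    now = now or datetime.now()
    version_dir.mkdir(parents=True, exist_ok=True)
    stamp = now.strftime("%Y%m%d-%H%M%S")
    target = version_dir / f"{source.name}.{stamp}"
    shutil.copy2(source, target)  # copy2 also preserves file timestamps

    # Timestamped names sort chronologically, so the oldest come first.
    versions = sorted(version_dir.glob(f"{source.name}.*"))
    for old in versions[:-keep]:
        old.unlink()
    return target
```

Recovering from a mistaken "save" is then a matter of copying the wanted version back over the working file, which takes seconds rather than the hours needed to locate a single file on tape.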
In conclusion:
Backup is not our highest goal, just as a computer system that is completely sound but unavailable to the user is not. Backup is only an aid to service continuity. To provide service continuity it is mandatory to include quick recovery solutions, and not necessarily quick backup solutions.
Backup and survival solutions must suit the organization according to its needs, capacity, and reasonable threats to service continuity. Survival arrays have to fit the technical skills of the computer support staff and the agreed work regulations within the organization. Correct work regulations which the staff responsible for the computers cannot fully carry out are unacceptable. In such a case, the regulations have to be reconsidered and matched to what is attainable.
In addition to the various modern threats to service continuity, such as system crash, backup system crash and power supply failures, the different terrorism threats must today be added to the list of reasonable, possible and probable threats and included in the corporate risk analysis.
For a small organization which can build a small alternative system in an alternative location (in some cases the owner’s own home), it is recommended to perform a complete backup to tape every day and to restore, daily, the full content of the tape to the alternative system.
By this method, the tape condition and the quality of the backed-up data are tested daily, while an intact, up-to-date alternative system is kept ready for service in case the original computer system at the company’s premises is down or missing.
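The daily restore can double as a data quality test if the restored files are compared with the originals. The sketch below shows one way to do that with checksums; it is an assumption about how such a check might be done, not a procedure taken from this article, and the function names are hypothetical. It reads each file fully into memory, which is acceptable for a sketch but not for very large files.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def tree_digests(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to `root`) to its SHA-256 digest."""
    digests = {}
    for path in root.rglob("*"):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def restore_differences(original: Path, restored: Path) -> List[str]:
    """Return the files that are missing or differ between the two trees."""
    a, b = tree_digests(original), tree_digests(restored)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))
```

An empty result means every file came back from the tape intact; any listed path points at a damaged tape, a failed restore, or a file changed since the backup was taken.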
These kinds of simple and relatively inexpensive solutions are also used, although at a different scale, in bigger, more branched and complex organizations.
Under the work conditions of modern organizations, in which the backup time window continuously shrinks and the importance of recovery quality continuously grows, matching backup and recovery systems have to be built that allow high-frequency data backup, every few hours or even minutes, and immediate recovery of the different data generations, to and from hard drives. In addition, the data has to be backed up to magnetic tapes to be taken off site or put into safes, as protection against fire, robbery, or any other kind of total damage. This combination is today the only complete solution which allows focused or complete data recovery according to needs, and at low cost.
Comments to this article will be welcome at eldad@chief-group.com
A backup and recovery solution for part of the problems presented in this article can be freely downloaded from the Chief Group site http://www.bos.co.il
More sites of interest are:
http://www.contingencyplanningresearch.com
http://www.infosecuritymag.com/articles/may00/departments1_note.shtml