can be at increased risk of unplanned outages. Transitions
from normal modes of operation to maintenance
modes and from maintenance modes to normal modes
of operation tend to be the times of greatest risk to
operations (Figure 3).
The transition from a maintenance activity back to
normal operations represents the time when risk is highest. This can
be due to personnel fatigue after a lengthy maintenance
task, an attempt to meet a deadline, or inadequate step-by-step
instructions and training.
Within BICSI 009, the value and considerations
of interval, reactive, and condition-based maintenance
methodologies are discussed. Interval-based maintenance
is when a series of maintenance tasks is
completed at defined time intervals. Reactive-based
maintenance is when components are repaired or
replaced on failure or when an obvious fault
is detected, either by a routine patrol or by a monitoring
system. Condition-based maintenance is a form of
reactive-based maintenance, but it uses advanced
monitoring techniques to carry out maintenance tasks
on components and equipment before they fail and
only when required. When selecting a maintenance
method for a specific component, considerations
and inquiries include:
• Does the component have a known or predictable
failure pattern or does it have a random failure
pattern?
• Can the condition of the component be determined
and quantified through advanced real-time
monitoring?
FIGURE 3: Risk during normal versus maintenance modes. (Phases shown: normal operations, transition to maintenance, maintenance activities, transition to normal, normal operations.)
• Can components be replaced without introducing
risk to IT services?
• Is the component within a redundant system?
• Does the business have black-out windows
where business processes dictate no changes
to operations due to risk associated with changes
or maintenance activities?
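The questions above can be captured as a simple screening aid. The following Python sketch is illustrative only: the field names, the `suggest_method` and `may_schedule` functions, and above all the order in which the answers are weighed are assumptions made for this example. They are not decision rules taken from BICSI 009, and the example values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Method(Enum):
    INTERVAL = "interval-based"
    REACTIVE = "reactive-based"
    CONDITION = "condition-based"


@dataclass
class ComponentProfile:
    """Answers to the selection questions for one component (hypothetical fields)."""
    predictable_failure_pattern: bool
    condition_monitorable: bool
    replaceable_without_it_risk: bool
    in_redundant_system: bool


def suggest_method(c: ComponentProfile) -> Method:
    """Suggest a starting point for discussion, not a final decision."""
    # Assumption: if condition can be quantified in real time, act on it.
    if c.condition_monitorable:
        return Method.CONDITION
    # Assumption: a predictable failure pattern suits fixed time intervals.
    if c.predictable_failure_pattern:
        return Method.INTERVAL
    # Assumption: repair-on-failure is tolerable only when redundancy and
    # risk-free replacement limit the impact on IT services.
    if c.in_redundant_system and c.replaceable_without_it_risk:
        return Method.REACTIVE
    # Assumption: otherwise fall back to scheduled maintenance.
    return Method.INTERVAL


def may_schedule(day: date, blackout_windows: list[tuple[date, date]]) -> bool:
    """Return True when the day falls outside every business black-out window."""
    return not any(start <= day <= end for start, end in blackout_windows)


# Hypothetical component and black-out window, for illustration only.
example_component = ComponentProfile(
    predictable_failure_pattern=True,
    condition_monitorable=False,
    replaceable_without_it_risk=True,
    in_redundant_system=True,
)
print(suggest_method(example_component).value)  # interval-based

year_end_freeze = [(date(2024, 12, 20), date(2025, 1, 2))]
print(may_schedule(date(2024, 12, 24), year_end_freeze))  # False
```

In practice each answer would be weighed by the operations team for the specific site; the value of writing the questions down this way is that every component gets asked the same questions and the black-out check is never skipped.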
Emergency Operating Procedures
Emergency operating procedures (EOPs) are
developed for specific events that could occur in
the following areas:
• Within the data center
• External to the data center but on the property
• On adjacent properties that could have an impact
on data center operations
• Within the local region that could prevent operations
personnel from performing their roles
or critical services from being provided
EOPs are probably the area where data centers have
the fewest established procedures and policies.
This should not be a surprise: SOPs are part of everyday
operations and MOPs are part of regular activities on
equipment, while EOPs are, ideally, never required.
A lack of EOPs can quickly cripple operations when
neither a methodical response to an event nor a response
procedure for a life-threatening emergency has been defined.
Consider the following real-world emergency events,
in which well-defined EOPs ensured
a smooth transition, while a lack of EOPs adversely
impacted operations.
A data center with a defined shutdown procedure
ensured a smooth transition when the facility had to be
shut down and sealed because smoke from a forest fire outside
the local area was inundating the community.
The corrosive smoke, which was entering the data center
through the fresh air supply required to maintain
positive pressure within the computer room, presented
a risk to the IT systems.
A data center located outside of a flood plain
experienced short-term overland flooding that
restricted access to the site for 48 hours.
Personnel on site could not get home at the end of their
shifts, and personnel at home could not get to the site