May/June 2019
HUMAN ERROR
Data center operations need to consider quantitative
characteristics and metrics, such as availability,
reliability and MTBF, as well as the indeterminable
effect of human error.
IBM conducted a study to evaluate the ability of
technicians to resolve failed drives within a redundant
array of independent disks (RAID).3 Five technically
savvy personnel were tasked with a basic repair:
replacing a drive within a RAID array. The technicians
were to complete this task multiple times on up to three
different OS environments. All technicians were trained
on how to perform the repair and given printed
step-by-step instructions for each OS environment.
They completed the tasks in a low-stress environment,
void of alarms, angry customers or supervisors. A total
of 99 repairs were attempted. Human error caused
mistakes in 8 to 23 percent of the attempts, depending on
the OS environment. Human error cannot be predicted,
but policies and procedures certainly can help to reduce
or eliminate it.
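To put the study's range in operational terms, the short Python sketch below estimates how likely it is that at least one repair goes wrong across several attempts. It assumes each repair fails independently at a fixed per-attempt rate, which is a simplification and not something the study states. The function name and the attempt counts are illustrative only.

```python
def prob_at_least_one_error(per_attempt_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` repairs contains a human error.

    Assumes every attempt fails independently at the same rate
    (a simplifying assumption, not a finding of the study).
    """
    if not 0.0 <= per_attempt_rate <= 1.0:
        raise ValueError("per_attempt_rate must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # P(at least one error) = 1 - P(no errors in any attempt)
    return 1.0 - (1.0 - per_attempt_rate) ** attempts


# Low and high per-attempt error rates reported for the OS environments.
LOW_RATE, HIGH_RATE = 0.08, 0.23

if __name__ == "__main__":
    for n in (1, 5, 10):  # illustrative numbers of repairs
        low = prob_at_least_one_error(LOW_RATE, n)
        high = prob_at_least_one_error(HIGH_RATE, n)
        print(f"{n:2d} repairs: {low:.0%} to {high:.0%} chance of at least one error")
```

Under these assumptions, even a modest per-attempt error rate compounds quickly: across five repairs, the chance of at least one error is roughly 34 to 73 percent.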
BICSI 009-2019 STANDARD
BICSI 009 is focused on data center operations and
complements the ANSI/BICSI 002 standard, which
comprehensively covers data center design. The sections
of BICSI 009 that are most relevant to data center
operations include:
• Standard Operating Procedures
• Maintenance Operating Procedures
• Emergency Operating Procedures
• Management
Standard Operating Procedures
BICSI 009 provides guidance regarding standard
operating procedures (SOPs). SOPs are developed
for all personnel working within the data center or
for those responsible for providing data center services.
A data center’s SOPs are written to address safety
requirements, personnel code of conduct, quality
of work, and defined processes for work order requests,
approval and implementation. The SOPs are general
policies and procedures to which all personnel
must adhere.
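The work order process mentioned above (request, approval, implementation) can be pictured as a small state machine that refuses to skip a step. The Python sketch below is purely illustrative: the state names, the REJECTED state and the `WorkOrder` class are assumptions made for this example and are not defined by BICSI 009.

```python
from enum import Enum


class WorkOrderState(Enum):
    REQUESTED = "requested"
    APPROVED = "approved"
    IMPLEMENTED = "implemented"
    REJECTED = "rejected"  # assumed outcome; not named in the article


# Hypothetical transition table: work may only be implemented after approval.
ALLOWED_TRANSITIONS = {
    WorkOrderState.REQUESTED: {WorkOrderState.APPROVED, WorkOrderState.REJECTED},
    WorkOrderState.APPROVED: {WorkOrderState.IMPLEMENTED},
    WorkOrderState.IMPLEMENTED: set(),
    WorkOrderState.REJECTED: set(),
}


class WorkOrder:
    """A work order that starts as a request and moves through defined steps."""

    def __init__(self, description: str) -> None:
        self.description = description
        self.state = WorkOrderState.REQUESTED

    def advance(self, new_state: WorkOrderState) -> None:
        """Move to `new_state`, or raise if the process would be bypassed."""
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(
                f"cannot go from {self.state.value} to {new_state.value}"
            )
        self.state = new_state


if __name__ == "__main__":
    order = WorkOrder("Replace failed drive in RAID array")
    order.advance(WorkOrderState.APPROVED)
    order.advance(WorkOrderState.IMPLEMENTED)
    print(order.description, "->", order.state.value)
```

Encoding the approval step as a hard precondition is one way to make a written procedure difficult to bypass by accident.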
Maintenance Operating Procedures
Maintenance operating procedures (MOPs) are
developed for the specific data center technicians who
are responsible for specific components or systems.
Human error is not predictable, and it is the
leading contributor to unplanned downtime. Under
normal operating conditions, the data center responds
to various internal and external conditions (e.g., utility
power, outdoor temperature, humidity) without the need
for any human interaction. As technology continues
to be developed, automation within data centers
is increasing with the implementation of more
sophisticated control systems through machine learning
and other artificial intelligence technologies. The clear
boundary that used to exist between data center facilities
and data center IT no longer exists. Common protocols
are being developed that enable compute systems, storage
systems, network systems, power systems, and cooling
systems to communicate with each other. This ultimately
creates one critical infrastructure ecosystem that
integrates both IT and facility systems, thereby enabling
the critical infrastructure to respond to IT requirements
in real time. With this increased interaction between
facility and IT systems, a human error made while
working on either facility or IT systems can have
cascading effects.
Most human interaction with the data center is during
maintenance activities, which is a time when the systems