To simplify, availability is the ratio of the time
a system or component is functional (uptime) within
a specified time interval to the length of that interval
(uptime plus downtime).
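Expressed as a formula:

Availability = uptime ÷ (uptime + downtime)

For example, a system that accumulates 8.76 hours of downtime over a year (8,760 hours) has an availability of 8,751.24 ÷ 8,760, or 99.9 percent.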
As established earlier, while this is the most common
metric used within the data center industry to measure
past success, it does not provide a complete picture.
When looking forward, availability represents the
probability that a solution will be functional at a given
moment in time. In contrast, reliability is the probability
that a solution will be functional without disruption
over the entire defined time interval. This is an
important distinction that is illustrated in Figure 2, and
it should be understood when evaluating failure rates to
make decisions regarding future solutions or processes
to be implemented.
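Suppose, for example, that a system suffers one 90-second disruption every day. Over a 30-day interval it accumulates only 2,700 seconds of downtime out of 2,592,000, for an availability of roughly 99.9 percent; yet its reliability over that same interval, the probability of operating all 30 days without any disruption, is essentially zero.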
It is worth reiterating that the primary driver of all
data center operation policies and procedures is to
reduce unplanned disruptions in the data center IT
services. Therefore, which is more important: reducing
the number of disruptions or reducing the length of each
disruption? Obviously, both are important. High
reliability equates to fewer disruptions and high
availability equates to shorter disruptions when they
occur. There is no automatic correlation between
availability and reliability. High availability does not
inherently indicate high reliability; likewise, high
reliability does not inherently indicate high availability.
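Consider a hypothetical system that fails only once a year but requires a full week (168 hours) to restore: it is highly reliable, yet its availability is only about 8,592 ÷ 8,760, or 98.1 percent, well below typical data center targets.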
EQUIPMENT FAILURE RATES
The expressions for availability and reliability both
include MTBF. When creating maintenance procedures
and policies, it is important to understand what MTBF
is and what it is not. MTBF is typically expressed in units
of hours, which implies that it is solely time related.
This can lead to MTBF being confused with
end-of-life estimates or, even worse, being used to set
maintenance intervals that could be wholly incorrect.
To help define MTBF, it is best to first understand failure rate.
Consider the following scenario: A manufacturer
wants to know the failure rate of a particular widget
it produces. The failure rate is established by analyzing
how many widgets fail during a defined time interval,
using a large number of widgets as the sample set.
To illustrate, the manufacturer operates 1,000 widgets for 1,680
hours (10 weeks) and records the number of widgets that
fail. If 3 widgets fail during the 10-week test, the failure
rate is:
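Failure rate = 3 failures ÷ (1,000 widgets × 1,680 hours) = 3 ÷ 1,680,000 widget-hours ≈ 0.0000018 failures per widget-hour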
How does this relate to MTBF? MTBF is the inverse
of failure rate:
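MTBF = 1 ÷ failure rate = 1,680,000 widget-hours ÷ 3 failures = 560,000 widget-hours per failure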
The MTBF units are actually widget*time per failure.
However, the industry has dropped the “widget” and the
“per failure” parts to simplify (or confuse) the expression.
MTBF, although expressed in hours, does not predict
life cycle or maintenance cycle. MTBF is relevant when
comparing alternative processes or solutions to help guide
decisions regarding the best option, but it is not meant
to characterize a single implementation. Rather, it
characterizes a large quantity of implementations. In
other words, if a data center has 10,000 drives, MTBF can
help quantify how many spare drives should be kept
on hand to support them, but it does not help predict
when any one of them may fail.
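As a rough illustration, assume those 10,000 drives carry a published MTBF of 1,000,000 hours (a purely hypothetical figure). Operating every drive for a year (8,760 hours) accumulates 87,600,000 drive-hours, so roughly 87,600,000 ÷ 1,000,000, or about 88 failures, can be expected across the population that year. That figure is a useful guide for stocking spares, but it says nothing about which drive will fail or when.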
FIGURE 2: Availability versus reliability.