----------------------------------------------------------------
Information systems require maintenance
----------------------------------------------------------------

----------------------------------------------------------------
Reliability theory

corrective maintenance
----------------------
- Repair of failed units
  - where units can be repaired and their failures do not adversely
    affect a whole system

Provision of redundant units
----------------------------
system reliability can be improved by providing redundant and spare units

# Maintenance of units before failure

"it is not wise to maintain units with unnecessary frequency" [why?]

Age Replacement
---------------
preventive replacement for units without repair

- failure during operation may be costly
- repair may be expensive
- repair might not be practical if it causes additional down-time
- after a catastrophic failure, repair might not be possible
- if preventive maintenance is not possible, preventive replacement may work

in the case of degraded failure, performance deteriorates over time and
maintenance costs increase over time. At some point, it will be easier to
replace the unit than to attempt repair.

if performance deteriorates at a predictable rate, you should replace the
unit after a certain amount of time.

types of age:
- continuous time (calendar time since installation)
- discrete time (number of cycles)

more on discrete time:
discrete time could be measured in cycles or failures.
if a discrete cycle imposes significant wear or damage (in other words, an
increasing chance of failure), then it is best to measure age in cycles.

age replacement of parallel or redundant systems:
some systems with parallel units may only fail if multiple units fail.
It is important to replace bad units before others fail.
for high availability systems it may be important to replace units with an
offset, to reduce the likelihood of multiple age failures at once.

silent failure:
some systems may fail silently.
unit failure may only be evident after a catastrophic system failure.
if routine inspection is not viable, routine replacement is necessary.

A hybrid approach may be effective where units suffer great deterioration
with both age and use.

----------------------------------------------------------------
vs preventive maintenance:
preventive maintenance is for units with repair, on a specific schedule
----------------------------------------------------------------

Reliability Metrics
-------------------
reliability quantities of repairable units such as
- mean time to failure
- availability (probability the system will work at any given time)
- expected number of failures
- probability that it will perform

how long a unit can operate without failure

reliability

failure rate (also called):
- hazard rate
- risk rate
- force of mortality

availability

repair limit policy
-------------------
(Repair vs replace)

If the repair of a failed unit takes a long time, it may be better to
replace it than to repair it. This policy is achieved by stopping the repair
if it is not completed within a specified time, and replacing the failed
unit with a new one. This is called a repair limit policy, and it stands in
striking contrast to the preventive maintenance policy.

inspection policy
-----------------
check units, such as standby and storage units, whose failures can be
detected only through inspection

For example, consider the case where a standby unit may fail. It may be
catastrophic and dangerous if the standby unit has already failed when the
original unit fails. To avoid such a situation, we should check the standby
unit to see whether it is good. If a failure is detected, the maintenance
suitable for the unit should be done immediately.

(test backups)

random replacement/inspection policy
------------------------------------
???
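
To make the inspection policy above concrete, the sketch below estimates how much of the time a standby unit would sit failed and unnoticed between periodic checks. It assumes several things these notes do not state: the standby fails at a constant (exponential) rate, inspections are instantaneous and always find a failure, and a failed unit is restored to as-good-as-new at the moment it is inspected. The function name and the numbers are hypothetical.

```python
import math


def mean_undetected_failure_probability(failure_rate, inspection_interval):
    """Time-averaged probability that a standby unit is failed but not yet detected.

    Illustrative assumptions (not from the notes): constant failure rate,
    perfect and instantaneous inspection, unit renewed at each inspection.
    """
    if failure_rate <= 0 or inspection_interval <= 0:
        raise ValueError("failure_rate and inspection_interval must be positive")
    x = failure_rate * inspection_interval
    # Average of 1 - exp(-failure_rate * t) over one inspection interval.
    # -expm1(-x) equals 1 - exp(-x) and stays accurate when x is small.
    return 1.0 - (-math.expm1(-x)) / x


if __name__ == "__main__":
    # Hypothetical standby unit: one failure per 1000 hours on average.
    for interval in (10, 100, 1000):
        p = mean_undetected_failure_probability(0.001, interval)
        print(f"inspect every {interval:>4} h: silently failed {p:.2%} of the time")
```

Under these assumptions a shorter interval always lowers the chance of an undetected failure, but each inspection has its own cost, so the interval is a trade-off rather than "as often as possible".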
Failure
-------
- failure modes
- types of failure:
  - intermittent failure
  - extended failure
  - complete failure
  - partial failure
- "catastrophic failure" is sudden + complete
- "degraded failure" is partial + gradual

"Time to failure"
- operating time
- calendar time
- number of cycles to failure
- mean time to failure
- availability
- expected cost per unit of time

Repair Limit Policy
-------------------
repair a failed unit if the repair time is short, or replace it if the
repair time is long. This is achieved by stopping the repair if it is not
completed within a repair limit time; the unit is then replaced.
This policy is optimum over both deterministic and random repair limit time
policies.

Similar repair limit problems can be applied to army vehicles. When a unit
requires repair, it is first inspected and its repair cost is estimated. If
the estimated cost exceeds a certain amount, the unit is not repaired but is
replaced.

Standby System with Spare Units
-------------------------------

redundant systems
-----------------
For the analysis of redundant systems, it is of great importance to know the
behavior of system failure; i.e., the probability that the system will be in
system failure, the mean time to system failure, and the expected number of
system failures. For instance, if the system failure is catastrophic, we
have to make the time to system failure as long as possible, by doing
preventive maintenance (PM) and providing standby units.
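
The quantities named for redundant systems can be illustrated with a small sketch. It assumes things the notes do not specify: units are identical and fail independently at a constant (exponential) rate, there is no repair, and for the standby case the switch-over is perfect and spares do not fail while waiting. All function names and numbers are hypothetical.

```python
def parallel_reliability(unit_reliability, n_units):
    """Probability that at least one of n independent units still works."""
    if not 0.0 <= unit_reliability <= 1.0:
        raise ValueError("unit_reliability must be between 0 and 1")
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    # The system fails only if every unit has failed.
    return 1.0 - (1.0 - unit_reliability) ** n_units


def parallel_mttf(failure_rate, n_units):
    """Mean time to system failure with n units all running in parallel."""
    if failure_rate <= 0 or n_units < 1:
        raise ValueError("failure_rate must be positive and n_units at least 1")
    # With exponential units this is (1/rate) * (1 + 1/2 + ... + 1/n).
    return sum(1.0 / k for k in range(1, n_units + 1)) / failure_rate


def cold_standby_mttf(failure_rate, n_units):
    """Mean time to system failure when spares start only after a failure."""
    if failure_rate <= 0 or n_units < 1:
        raise ValueError("failure_rate must be positive and n_units at least 1")
    # Unit lifetimes are used one after another, so the means add up.
    return n_units / failure_rate


if __name__ == "__main__":
    rate = 0.01  # hypothetical: one failure per 100 hours on average
    for n in (1, 2, 3):
        print(n, parallel_reliability(0.9, n), parallel_mttf(rate, n),
              cold_standby_mttf(rate, n))
```

Under these assumptions each extra parallel unit adds less to the mean time to system failure than the one before it, while an idealized cold standby spare adds a full unit lifetime.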