Failure Analysis and Evaluation of Reinforced Computer Equipment

Drawing on an overview of failure analysis and fault tree analysis, this paper presents the component structure of reinforced computer equipment and a typical fault tree, applies failure analysis and fault tree analysis systematically to a practical fault, and carries out mechanism analysis, corrective measures and probability evaluation. The results help improve the capability of reinforced computer equipment and the related information systems.


Introduction
Failure analysis comprises the technical and management activities of identifying product failure modes, determining failure mechanisms and causes, and proposing countermeasures to prevent similar failures from recurring. It is of practical importance to reliability engineering, safety engineering and maintenance engineering [1][2].
Fault tree analysis is a deductive method that searches for the various causes of a fault and identifies potential design defects in hardware and software so that corrective measures can be taken. It can also be used to compare design alternatives from a safety standpoint and to provide a basis for establishing operation, testing and maintenance procedures.
Reinforced computer equipment is developed to perform business data processing and information transfer. It is widely used in various information systems and plays an important role in system operation. Failure analysis and fault tree analysis are therefore valuable for the performance and quality of reinforced computers.

Overview
The reinforced computer mainly carries out data processing and information transfer, and plays an increasingly important role in the routine operation of the related information systems. Throughout the scheme, primary, normal and customization phases of equipment development, failure analysis is emphasized to ensure equipment availability and effectiveness; moreover, fault tree analysis is applied to good effect in the equipment failure analysis process [3][4][5].
Fault tree analysis comprises qualitative analysis and quantitative analysis. Qualitative analysis looks for the causes, and the combinations of causes, that lead to the top event, and identifies all fault modes, namely all the minimal cutsets, so as to reveal latent faults and improve equipment design. Quantitative analysis computes the occurrence probability of the top event from the occurrence probabilities of the bottom events and the logic relations of the fault tree. The occurrence probability of the top event is usually described by (1) and (2).
Here P(t) is the occurrence probability of the top event; Kj is the jth minimal cutset; Nk is the number of minimal cutsets; P[Kj(t)] is the occurrence probability of the jth minimal cutset during time t; and Fi(t) is the occurrence probability of the ith component of the jth minimal cutset during time t.
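Equations (1) and (2) are not reproduced in this text. A reconstruction that is consistent with the symbol definitions above, assuming the standard minimal-cutset formulation with statistically independent bottom events, would read as follows; it is a sketch under that assumption rather than the authors' exact notation.

```latex
\begin{align}
P(t) &= P\bigl[K_1(t) \cup K_2(t) \cup \dots \cup K_{N_k}(t)\bigr] \tag{1}\\
P[K_j(t)] &= \prod_{i \in K_j} F_i(t) \tag{2}
\end{align}
```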
Constructing a fault tree consists of system investigation and fault analysis, choice of the top event, drawing of the fault tree, and simplification of the fault tree. The key points are as follows:
• Boundary conditions should be explicit;
• Fault events should be strictly defined;
• The fault tree should be constructed step by step, from top to bottom;
• A gate should not be connected directly to another gate;
• Events should be described concretely.
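The qualitative step, deriving minimal cutsets from a tree built by these rules, can be sketched as below. This is a minimal illustration and not the procedure used in the paper: the tree `TREE` and the event names `TOP`, `G1`, `A`, `B` and `C` are hypothetical. Every gate belongs to a named event, so no gate feeds another gate directly.

```python
from itertools import product

# Hypothetical tree: each intermediate event maps to (gate type, input events).
# The names TOP, G1, A, B, C are illustrative only.
TREE = {
    "TOP": ("OR", ["A", "G1"]),
    "G1": ("AND", ["B", "C"]),
}


def cut_sets(event, tree):
    """Return the cut sets of `event` as a set of frozensets of bottom events."""
    if event not in tree:  # bottom event: one single-element cut set
        return {frozenset([event])}
    gate, inputs = tree[event]
    child_sets = [cut_sets(child, tree) for child in inputs]
    if gate == "OR":  # any cut set of any input causes the event
        return set().union(*child_sets)
    if gate == "AND":  # one cut set from every input must occur together
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate type: {gate}")


def minimal_cut_sets(event, tree):
    """Drop every cut set that strictly contains another cut set."""
    sets = cut_sets(event, tree)
    return {s for s in sets if not any(other < s for other in sets)}
```

For the hypothetical tree above, `minimal_cut_sets("TOP", TREE)` yields the two cutsets {A} and {B, C}.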

Equipment structure
The reinforced computer is composed of a power supply module, a computation module and a connection module.
The power supply module includes the connector plug, filter, backboard and AC-DC power supply; the computation module includes the mainboard, hard disk and mainboard backboard; and the connection module includes the 20-pin connection lines that supply the computation module with 12 V, 5 V and 3.3 V power from the power supply module. The connection structure is shown in Fig. 1.

Problem description
The reinforced computer equipment (JJ3) developed a fault during the system customization test. After it had been switched on for a period of time, the power indicator was lit and the operating system progress bar advanced normally; however, the equipment rebooted automatically before loading had finished, and this repeated over and over. Functional and deductive analysis of the fault was carried out accordingly, and the equipment fault tree was established as shown in Fig. 2. The following labels are adopted: X1 - OS abnormality; X2 - mainboard malfunction; X3 - hard disk malfunction; X4 - copper pour of the backboard PCB falling short of the routing requirement; X5 - power supply malfunction; X6 - inappropriate choice of the 20-pin connection lines; X7 - bad contact of the 20-pin connection lines.
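The bottom events can be recorded in a small lookup table. The sketch below additionally assumes that the top event (repeated automatic reboot) is a plain OR over all seven bottom events. The structure of Fig. 2 is not reproduced here, so this is an assumption; under it, every bottom event is a single-event minimal cutset.

```python
# Bottom-event labels as listed in the text.
BOTTOM_EVENTS = {
    "X1": "OS abnormality",
    "X2": "mainboard malfunction",
    "X3": "hard disk malfunction",
    "X4": "backboard PCB copper pour short of routing requirement",
    "X5": "power supply malfunction",
    "X6": "20-pin connection lines inappropriately chosen",
    "X7": "bad contact of 20-pin connection lines",
}


def single_event_cut_sets(events):
    """ASSUMPTION: the top event is an OR over all bottom events, so each
    bottom event alone forms a minimal cutset (Fig. 2 may differ)."""
    return [frozenset([label]) for label in sorted(events)]
```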

Problem investigation
To examine whether improper connection of the cable to the power supply backboard or the mainboard backboard caused the malfunction, we re-seated the cable of the suspect equipment in the connectors of the two backboards. After unplugging and re-plugging the 20-pin connection lines at the connectors of the power supply and mainboard backboards, we restarted the equipment and the operating system ran normally; a restart test run continuously for six hours did not reproduce the fault. We then replaced the original 20-pin connection lines with spare ones and tested the equipment continuously for thirty hours, again with no abnormality. We removed the 20-pin pins from the faulty equipment and confirmed by inspection that they were not damaged, so the manufacturing process can be ruled out; that is, the crimping and jointing process of the cable pins is sound. We also tested new 20-pin pin connections five times and found nothing wrong with the pins, so we can deduce that the malfunction does not result from use of the connector; the cable pins show no abnormality under normal use of the connector.

Fault Mechanism Analysis
The mainboard and power supply are composed of the units shown in Fig. 3. P17C9X is the PCIE-to-PCI bridge chip that provides the PCI expansion function; it is powered by the 12 V supply, and when the 12 V supply is faulty the system cannot realize the PCI expansion function. The mainboard uses the serial interface function of the SuperIO, which is powered by the 3.3 V supply; when the 3.3 V supply is faulty the system cannot realize the serial interface function. The CPU, PCH, memory and other chips of the mainboard are powered by the 5 V supply; when the 5 V supply is faulty the mainboard fails to function and the system loses power and shuts off. Consequently, the equipment fault most probably results from an abnormality of the 5 V supply at the connector.
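The elimination argument in this paragraph can be written down as a lookup from supply rail to the functions that depend on it. The sketch below is illustrative only: the dictionary restates the dependencies given in the text, while `CORE_LOADS` and the function name are hypothetical.

```python
# Rail-to-function dependencies as described in the text.
RAIL_LOADS = {
    "12V": ["PCI expansion (P17C9X bridge)"],
    "3.3V": ["serial interface (SuperIO)"],
    "5V": ["CPU", "PCH", "memory"],
}

# Hypothetical set of loads whose loss stops the whole mainboard.
CORE_LOADS = {"CPU", "PCH", "memory"}


def rails_explaining_shutdown(rail_loads, core_loads):
    """Return the rails whose failure takes down at least one core load."""
    return sorted(
        rail for rail, loads in rail_loads.items() if core_loads & set(loads)
    )
```

With these inputs only the 5 V rail is returned. This matches the observation that a 12 V or 3.3 V fault would disable a peripheral function but would not shut the system down.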
When the cable is joined to the backboard connector, the contact point presents a contact resistance and generates heat when current flows. The amount of heat produced depends on both the current and the contact resistance; the smaller the contact area, the greater the contact resistance, and hence the more heat is produced at the same current. The wire is a positive-temperature-coefficient material, so its resistance increases as temperature rises. The load on the 5 V supply keeps increasing after a period of operation, and when the maximum current of the cable interface is reached, the excess heat generated consumes energy needed by the load; the 5 V power delivered to the mainboard then falls short of the normal supply demand, and the system shuts down automatically. The mainboard adopts AT power mode, so the AC-DC power supply keeps producing output as long as the AC 220 V input has not been switched off after the fault; the mainboard reboots automatically as soon as it detects power input, although the 5 V supply still cannot meet the normal demand, and thus the system repeatedly shows the progress bar, shuts down and reboots.
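The relations behind this mechanism are Joule heating at the contact (P = I²R), a linear temperature dependence of resistance, and the voltage drop across the contact (V − IR). The sketch below expresses them as small helper functions. The function names and all numeric values in the usage note are hypothetical; no measured currents or resistances are given in the text.

```python
def resistance_at(temp_c, r_ref, alpha, temp_ref_c=20.0):
    """Linear positive-temperature-coefficient model:
    R = R_ref * (1 + alpha * (T - T_ref))."""
    return r_ref * (1.0 + alpha * (temp_c - temp_ref_c))


def contact_power(current_a, resistance_ohm):
    """Joule heating at the contact in watts: P = I^2 * R."""
    return current_a ** 2 * resistance_ohm


def load_voltage(supply_v, current_a, resistance_ohm):
    """Voltage left for the load after the drop across the contact: V - I * R."""
    return supply_v - current_a * resistance_ohm
```

As an illustration with assumed numbers, at 10 A a contact whose resistance rises from 0.01 Ω to 0.05 Ω dissipates 5 W instead of 1 W and leaves 4.5 V rather than 4.9 V on a nominal 5 V rail.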
To sum up, the above analysis indicates that the 5 V connection at the connector had become loose.

Validation and Evaluation
We fitted the 20-pin cable of the faulty equipment into spare equipment and manually touched the related connection pins, whereupon the fault reappeared; we then replaced the faulty 20-pin cable with a new one and tested for 30 hours, with no problem in the equipment. We then removed the faulty 20-pin cable, refitted it into the connector, and glued the connector ends so as to ensure that the cable connector plugs make good contact with the power sockets of the mainboard and power backboards. We tested continuously for 30 hours, during which the equipment was rebooted repeatedly, and the equipment worked properly.
The process was supplemented with gluing of the connector ends to ensure that the cable connector plugs make good contact with the power sockets of the mainboard and power backboards.
We identified all products that include the 20-pin power supply connection and applied connector-end gluing to these similar products; moreover, we added the requirement for gluing of joint positions to the product design files and revised the related gluing procedure.
According to the equipment operating conditions and the related indexes shown in Table 1, the probability of the equipment rebooting automatically over and over can be evaluated [1,8].
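A minimal sketch of such an evaluation is given below. It computes the probability of the union of minimal cutsets by inclusion-exclusion, assuming statistically independent bottom events. The function names are hypothetical and the probabilities in the usage note are placeholders, not the values of Table 1.

```python
from itertools import combinations
from math import prod


def cut_set_probability(cut_set, component_prob):
    """P[K_j] as the product of its components' probabilities
    (independent bottom events assumed)."""
    return prod(component_prob[c] for c in cut_set)


def top_event_probability(cut_sets, component_prob):
    """P(K_1 or ... or K_N) by inclusion-exclusion over the minimal cutsets."""
    cut_sets = [frozenset(k) for k in cut_sets]
    total = 0.0
    for r in range(1, len(cut_sets) + 1):
        sign = 1.0 if r % 2 else -1.0
        for combo in combinations(cut_sets, r):
            # Several cutsets occur together exactly when every component
            # in their union has failed.
            joint = frozenset().union(*combo)
            total += sign * cut_set_probability(joint, component_prob)
    return total
```

For two single-event cutsets with placeholder probabilities 0.1 and 0.2, the result is 0.1 + 0.2 - 0.1 × 0.2 = 0.28. The number of terms grows as 2^N, which remains small for the seven bottom events considered here.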

Conclusions
To address the fault that occurred during testing of the reinforced computer equipment, a fault tree was set up and a detailed analysis presented; the fault was located precisely and its mechanism made clear. Further, the equipment was effectively rectified, the process was improved, and similar products were fully corrected. Failure analysis and fault tree analysis are of practical use to the capability and stability of reinforced computers and information systems, and it is necessary to keep investigating equipment faults, uncovering hidden mechanisms and improving related system performance.

Figure 2. Equipment fault tree of reinforced computer equipment.

Figure 3. Equipment mainboard and power supply structure.

Table 1. The probability of minimal cutsets.

We shall pay more attention to improper wire contact, mainboard malfunction and power supply malfunction in equipment development work.