Non-linear failure rate: A comparison of the Bayesian and frequentist approaches to estimation

Abstract. In this article, a new generalization of the linear failure rate, called the non-linear failure rate, is developed, analyzed, and applied to a real dataset. We compare the Bayesian and frequentist approaches to the estimation of the parameters and reliability characteristics of the non-linear failure rate. The maximum likelihood estimators are obtained using the cross-entropy method to optimize the log-likelihood function, and the Bayes estimators of the parameters and reliability characteristics are obtained via the Markov chain Monte Carlo method. A simulation study compares the proposed Bayes estimators with the maximum likelihood estimators on the basis of their biases and mean squared errors. We demonstrate that the proposed model fits a well-known dataset better than other mixture models.


Introduction
The Weibull failure rate equals 0 at time t = 0. This failure rate model may be appropriate only for physical systems that do not suffer from random shocks. A model whose initial failure rate is negligible is inappropriate for physical systems that require a positive initial failure rate, that is, systems that suffer from random shocks as well as from wear-out failures. The linear failure rate (LFR) provides a partial remedy for this problem.
The LFR was originally proposed in [1], and has been studied in [2] as a special case of the polynomial failure rate model under type II censoring. It generalizes the exponential distribution (b = 0) and the Rayleigh distribution (a = 0); the LFR model can also be viewed as a mixture of the failure rates of an exponential distribution and a Rayleigh distribution. Because the Rayleigh distribution, like the LFR itself, is not flexible enough to represent the most common types of failure rates, new generalizations of the LFR must be developed.
In our study, we introduce a new non-linear failure rate (NLFR) model, a generalized version of the LFR, constructed as a mixture of the failure rates of the exponential and Weibull distributions.

The model

Non-linear failure rate model
Let the failure rate function of the system be in either of the following two states: (1) initially, it experiences a constant failure rate; (2) when the system wears out, it also experiences a wear-out failure mode. That is, h(t) = a + b k t^(k−1), t ≥ 0. (3) This model allows a positive initial failure rate, h(0) = a, whereas h(0) = 0 for most other increasing failure rate functions, such as that of the Weibull distribution. As mentioned in [2] for the LFR model, this type of situation arises when failures result both at random and from wear-out or deterioration.
Figure 1 shows that the non-linear failure rate in equation (3) characterizes three of the most common types of failure rate functions, which have been described in [4]: 1. Increasing failure rate (IFR): the instantaneous failure rate increases as a function of time. We expect to see an increasing number of failures in a given period of time.
2. Decreasing failure rate (DFR): the instantaneous failure rate decreases as a function of time. We expect to see a decreasing number of failures in a given period of time.
3. Constant failure rate (CFR): the instantaneous failure rate is constant over the observed lifetime. We expect to see a relatively constant number of failures in a given period of time.
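As a minimal sketch, the three regimes can be checked numerically from the hazard form described above, h(t) = a + b k t^(k−1), the sum of an exponential and a Weibull failure rate; the exact parameterization is an assumption here, since the displayed equation is not reproduced in this extraction:

```python
def nlfr_hazard(t, a, b, k):
    """Non-linear failure rate (assumed form): a constant random-shock
    part `a` plus a Weibull-type wear-out part b*k*t**(k-1)."""
    return a + b * k * t ** (k - 1)

# k > 1 adds an increasing wear-out term (IFR), k < 1 a decreasing one
# (DFR), and k = 1 collapses to the constant hazard a + b (CFR).
```

For k ≥ 1 the hazard satisfies h(0) = a, the positive initial failure rate that distinguishes this model from the Weibull.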

Characteristics of the lifetime distribution
Using the relationship between h(t) and R(t), we have R(t) = exp(-(a t + b t^k)). The probability density function is then given by f(t) = (a + b k t^(k−1)) exp(-(a t + b t^k)), and the cumulative failure rate is H(t) = a t + b t^k. The mean time to failure (MTTF) is given by MTTF = ∫₀^∞ R(t) dt, an integral that can be evaluated using suitable numerical methods.
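Assuming the cumulative failure rate H(t) = a t + b t^k implied by the mixture construction (an assumption, since the displayed equations are not reproduced here), the MTTF integral can be approximated with a simple numerical rule:

```python
import numpy as np

def reliability(t, a, b, k):
    # Assumed survival function: R(t) = exp(-(a*t + b*t**k))
    return np.exp(-(a * t + b * t ** k))

def mttf(a, b, k, t_max=200.0, n=400001):
    """Approximate MTTF = integral of R(t) over [0, inf) by truncating
    at t_max and applying the trapezoidal rule on a fine grid."""
    t = np.linspace(0.0, t_max, n)
    y = reliability(t, a, b, k)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(t)) / 2.0)
```

Sanity check: with b = 0 the model reduces to an exponential with rate a, so the MTTF should be 1/a.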

Estimation of parameters and reliability characteristics
Let D : t_1, ..., t_n be the random failure times of n devices under test, whose failure time distribution is given as in equation (5), and let θ = (a, b, k). If there is no censoring, the likelihood function takes the general form L(θ | D) = ∏_{i=1}^n f(t_i), and the log-likelihood function follows by taking logarithms. If some observations are right censored, we have to adjust this expression. For an observed failure, we enter the pdf as above; but for a right-censored observation, we enter the reliability function, indicating that the observation is known to exceed a particular value. The likelihood function in general then takes the form L(θ | D) = ∏_{i=1}^n f(t_i)^{δ_i} R(t_i)^{1−δ_i}. This expression means that when t_i is an observed failure, the censoring indicator is δ_i = 1, and we enter a pdf factor; when t_i is a censored observation, δ_i = 0, and we enter a reliability factor [5].
ITM Web of Conferences 20, 03001 (2018) https://doi.org/10.1051/itmconf/20182003001 ICM 2018
The log-likelihood function for the censored case is ℓ(θ | D) = Σ_{i=1}^n [δ_i log f(t_i) + (1 − δ_i) log R(t_i)].
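The censored log-likelihood just described can be sketched as follows; the hazard and cumulative hazard used inside are the assumed NLFR forms h(t) = a + b k t^(k−1) and H(t) = a t + b t^k:

```python
import numpy as np

def log_likelihood(theta, t, delta):
    """Censored log-likelihood: sum of delta_i*log f(t_i)
    + (1 - delta_i)*log R(t_i). Since f(t) = h(t)*R(t), this
    simplifies to sum of delta_i*log h(t_i) - H(t_i)."""
    a, b, k = theta
    t = np.asarray(t, dtype=float)
    delta = np.asarray(delta, dtype=float)
    h = a + b * k * t ** (k - 1)   # failure rate
    H = a * t + b * t ** k         # cumulative failure rate
    return float(np.sum(delta * np.log(h) - H))
```

An observed failure contributes log h(t_i) − H(t_i), while a right-censored observation contributes only −H(t_i), exactly as the pdf/reliability factorization above prescribes.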

The cross-entropy method for continuous multi-extremal optimization
The main idea of the cross-entropy (CE) method for optimization can be stated as follows. Suppose we wish to maximize some "performance" function S(x) over all elements/states x in some set X ⊂ R^n, and denote the maximum by γ*. To proceed with CE, we first randomize our deterministic problem by defining a family of pdfs { f(·; v), v ∈ V } on the set X. Next, we associate with equation (12) the estimation of ℓ(γ) = P_u(S(X) ≥ γ) = E_u I_{S(X) ≥ γ} (13), the so-called associated stochastic problem (ASP). Here, X is a random vector with pdf f(·; u) for some u ∈ V (for example, X could be a normal random vector) and γ is a known or unknown parameter. Note that there are in fact two possible estimation problems associated with equation (13): for a given γ we can estimate ℓ, or alternatively, for a given ℓ we can estimate γ, the root of equation (13). Consider the problem of estimating ℓ for a certain γ close to γ*. Then {S(X) ≥ γ} is typically a rare event, and estimating ℓ is a non-trivial problem. The CE method solves this efficiently by making adaptive changes to the probability density function according to the Kullback-Leibler cross-entropy, creating a sequence f(·; u), f(·; v_1), f(·; v_2), ... of pdfs that are "steered" in the direction of the theoretically optimal density f(·; v*) corresponding to the degenerate density at an optimal point. In fact, the CE method generates a sequence of tuples {(γ_t, v_t)}, which converges quickly to a small neighborhood of the optimal tuple (γ*, v*). More specifically, we initialize by setting v_0 = u, choose a not very small quantity ε, say ε = 10^−2, and then proceed as follows.
Adaptive updating of γ_t. For a fixed v_{t−1}, let γ_t be the (1 − ε)-quantile of S(X) under v_{t−1}. A simple estimator of γ_t, denoted γ̂_t, can be obtained by drawing a random sample X_1, ..., X_N from f(·; v̂_{t−1}) and evaluating the sample (1 − ε)-quantile of the performances as γ̂_t = S_(⌈(1−ε)N⌉). (16)
Adaptive updating of v_t. For fixed γ_t and v_{t−1}, derive v_t from the solution of the program (17). The stochastic counterpart of equation (17) is as follows: for fixed γ̂_t and v̂_{t−1} (the estimate of v_{t−1}), derive ṽ_t from the corresponding stochastic program (18). Instead of updating the parameter vector v directly via the solution of equation (18), we use the smoothed version v̂_t = α ṽ_t + (1 − α) v̂_{t−1} (19), where ṽ_t is the parameter vector obtained from the solution of equation (18), and α is called the smoothing parameter, with 0.7 < α ≤ 1. The resulting algorithm is:
1. Initialize v̂_0 = u and set t = 1.
2. Generate a sample X_1, ..., X_N from the density f(·; v̂_{t−1}) and compute the sample (1 − ε)-quantile γ̂_t of the performances according to equation (16).
3. Use the same sample X_1, ..., X_N and solve the stochastic program (18); denote the solution by ṽ_t.
4. Apply the smoothed update (19) to obtain v̂_t and increase t by 1.
5. Repeat steps 2-4 until a pre-specified stopping criterion is met.
When normal updating is used, the samples X_1, ..., X_N are drawn from an n-dimensional multivariate normal distribution with independent components, N(μ̂_{t−1}, σ̂²_{t−1}). While the CE algorithm runs, the mean vector μ̂_t should converge to x* and the vector of standard deviations σ̂_t should converge to the zero vector. In short, we should obtain a degenerate pdf with all mass concentrated in the vicinity of the point x*. More detail on this can be found in [6, 7], and a short introduction can also be found in [8].
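A compact sketch of CE with normal updating, the elite fraction ε and the smoothing parameter α as described above; the function and variable names, as well as the default values, are illustrative rather than the paper's:

```python
import numpy as np

def cross_entropy_maximize(score, mu0, sigma0, eps=0.1, alpha=0.8,
                           n_samples=200, n_iter=100, rng=None):
    """CE maximization with independent-normal sampling N(mu, sigma^2).
    score: maps a parameter vector to a scalar performance S(x).
    Returns the final mean vector as the optimizer estimate."""
    rng = np.random.default_rng(rng)
    mu, sigma = np.array(mu0, float), np.array(sigma0, float)
    n_elite = max(1, int(eps * n_samples))
    for _ in range(n_iter):
        x = rng.normal(mu, sigma, size=(n_samples, mu.size))
        s = np.array([score(xi) for xi in x])
        elite = x[np.argsort(s)[-n_elite:]]  # samples above the (1-eps)-quantile
        # smoothed update of mu and sigma toward the elite sample moments
        mu = alpha * elite.mean(axis=0) + (1 - alpha) * mu
        sigma = alpha * elite.std(axis=0) + (1 - alpha) * sigma
    return mu
```

Applied to the log-likelihood in equation (9) or (11), `score` would be the log-likelihood and the returned mean vector the MLE θ̂; the standard deviations shrink toward zero as the sampling density degenerates around the optimum.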

Maximum likelihood estimation
In our study, we maximize the log-likelihood function given in equation (9) in the case of non-censored data, or in equation (11) in the case of censored data, using the CE algorithm, and obtain the maximizer θ̂ = (â, b̂, k̂). By the invariance property of MLEs:
1. The MLE of R(t), say R̂(t), is obtained by substituting θ̂ into R(t).
2. The MLE of h(t), say ĥ(t), is obtained by substituting θ̂ into h(t).
3. The MLE of H(t), say Ĥ(t), is obtained by substituting θ̂ into H(t).
4. The MLE of the MTTF is obtained by substituting θ̂ into formula (7) and integrating.

The Metropolis-Hastings algorithm
The Metropolis-Hastings algorithm is the most popular MCMC method. The basic idea of the algorithm is to generate data from some generic distribution, the so-called target distribution. Suppose that our target distribution has density π with respect to some reference measure (usually d-dimensional Lebesgue measure). Then, given X_n, a "proposed value" Y_{n+1} is generated from some pre-specified density q(X_n, y), and then accepted with probability α(x, y) = min{1, [π(y) q(y, x)] / [π(x) q(x, y)]}. If the proposed value is accepted, we set X_{n+1} = Y_{n+1}; otherwise, we set X_{n+1} = X_n. The function α(x, y) is chosen, of course, precisely to ensure that the Markov chain X_0, X_1, ... is reversible with respect to the target density π(y), so that the target density is stationary for the chain. If the proposal density is symmetric, this reduces to α(x, y) = min{1, π(y)/π(x)}. This particular case of the Metropolis-Hastings algorithm is called the Metropolis algorithm. It has long been recognized that the choice of the proposal density q(x, y) is crucial to the success (e.g. rapid convergence) of the Metropolis-Hastings algorithm. The most common case is the symmetric random-walk Metropolis algorithm (RWM), in which the proposed value is given by Y_{n+1} = X_n + Z_{n+1}, where the increments {Z_n} are i.i.d. from some fixed symmetric distribution (e.g. N(0, σ² I_d)). In this case, the crucial issue becomes how to scale the proposal (i.e. how to choose σ): too small and the chain will move too slowly; too large and the proposals will usually be rejected. We must avoid both extremes [9].
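A minimal sketch of the symmetric random-walk Metropolis update just described, assuming Gaussian increments N(0, σ²I_d) and working with log densities for numerical stability:

```python
import numpy as np

def rwm(log_target, x0, sigma, n, rng=None):
    """Symmetric random-walk Metropolis: propose Y = X + N(0, sigma^2 I_d),
    accept with probability min(1, pi(Y)/pi(X))."""
    rng = np.random.default_rng(rng)
    x = np.array(x0, float)
    lp = log_target(x)
    chain = np.empty((n, x.size))
    for i in range(n):
        y = x + rng.normal(0.0, sigma, size=x.size)
        lp_y = log_target(y)
        if np.log(rng.uniform()) < lp_y - lp:  # Metropolis acceptance step
            x, lp = y, lp_y
        chain[i] = x
    return chain
```

Tuning σ trades off the two failure modes mentioned above: tiny steps are almost always accepted but explore slowly, while huge steps are almost always rejected.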

Adaptive MCMC
The search for improved proposal distributions is often done manually, through trial and error, though this can be difficult, especially in high dimensions. An alternative approach is adaptive MCMC, which asks the computer to automatically "learn" better parameter values "on the fly", that is, while the algorithm runs.
Suppose {P_γ}_{γ∈Y} is a family of Markov chain kernels, each having stationary distribution π. (For example, perhaps P_γ corresponds to the RWM algorithm with increment distribution N(0, γ² I_d).) An adaptive MCMC algorithm randomly updates the value of γ at each iteration, in an attempt to find the best value.
Let Γ_n be the kernel choice at the nth iteration, so that X_n is updated using P_{Γ_n}, for n = 0, 1, 2, .... Here the {Γ_n} are updated according to some adaptive algorithm. In principle, the choice of Γ_n could depend on the entire history X_{n−1}, ..., X_0, Γ_{n−1}, ..., Γ_0, though in practice it is often the case that the paired process {(X_n, Γ_n)}_{n=0}^∞ is Markovian. In general such algorithms are quite easy to implement, requiring only moderate amounts of extra computer programming [9].
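The paper does not spell out its adaptation rule, so the sketch below uses one common choice as an assumption: a diminishing Robbins-Monro update of the log step size toward a fixed target acceptance rate, so that the kernel settles down as the chain runs.

```python
import numpy as np

def adaptive_rwm(log_target, x0, n, target_accept=0.234, rng=None):
    """RWM with on-the-fly scale adaptation: the log step size is nudged
    up on acceptance and down on rejection, with a decaying adaptation
    rate (diminishing adaptation) so the kernel stabilizes."""
    rng = np.random.default_rng(rng)
    x = np.array(x0, float)
    lp = log_target(x)
    log_sigma = 0.0
    chain = np.empty((n, x.size))
    for i in range(n):
        y = x + rng.normal(0.0, np.exp(log_sigma), size=x.size)
        lp_y = log_target(y)
        accept = np.log(rng.uniform()) < lp_y - lp
        if accept:
            x, lp = y, lp_y
        # Robbins-Monro style step; the (i+1)**-0.6 factor decays to zero
        log_sigma += (float(accept) - target_accept) / (i + 1) ** 0.6
        chain[i] = x
    return chain
```

The target acceptance rate 0.234 is the classical high-dimensional RWM guideline; the scale Γ_n = exp(log_sigma) here plays the role of the adapted kernel parameter γ.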

Bayesian estimation
For our mixture model, the Bayesian model is constructed by specifying a prior distribution for θ = (a, b, k) and then multiplying it by the likelihood function to obtain the posterior distribution.
Denote the prior distribution of θ by π(θ); the posterior distribution of θ given D : t_1, ..., t_n is then π(θ | D) = π(θ) L(θ | D) / ∫ π(θ) L(θ | D) dθ (27). Because the denominator in equation (27) is a normalizing constant, Bayes' theorem is often expressed as π(θ | D) ∝ π(θ) L(θ | D). As mentioned in [11], the prior distribution is given beforehand, usually based on prior information about the parameters, such as historical data, previous experience, expert suggestions, even wholly subjective suppositions, or simply mathematical convenience.
In our study, we use a diffuse prior and adaptive MCMC sampling to generate samples from the posterior distribution.

Monte Carlo simulations
A Monte Carlo simulation study is conducted to compare the performance of the CE and MCMC estimators for the parameters of the non-linear failure rate. For each of the following choices of parameters, we simulate 1000 datasets with sample sizes n = 25, 50, 100 and 200, respectively, and based on each dataset we compute the CE and MCMC estimators of the model parameters. The datasets are generated from the failure distribution in equation (5) using the accept-reject method; a clear explanation and proof of this method can be found in [13]. To obtain the MCMC estimators, we implemented the adaptive MCMC algorithm to construct a Markov chain of length 20,000 with a burn-in of 1000, and reduced the autocorrelation by retaining only every 5th iteration of the chain, obtaining 3801 samples. For prior information we used the generalized uniform distribution on R+. Tables 1-3 list the results of the simulation study. Bias is calculated as the mean of the 1000 estimates minus the true value, and the mean squared error (MSE) is calculated as the mean of the squared differences between the 1000 estimates and the true value. Figures 2-7 visualize the results from Tables 1-3. From these figures, we see that the biases and MSEs of a and b are very small, so our main interest is in the bias and MSE of the shape parameter k. For all three special cases of the shape parameter k:
• In most cases, the CE estimator seems to have smaller biases than the MCMC estimator (see right panels of Figures 2, 4 and 6).
• In the case of a decreasing failure rate, i.e. k < 1, the MSEs of the MCMC estimator seem to be always smaller than those of the CE estimator (see right panel of Figure 3).
• In the case of an increasing failure rate, i.e. k > 1, the MSEs of the CE estimator seem to be always smaller than those of the MCMC estimator (see right panels of Figures 5 and 7).
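The study above generates NLFR data by accept-reject. As an alternative sketch, the competing-risks interpretation (random shock plus wear-out) allows direct simulation, since a hazard that is the sum of an exponential hazard and a Weibull hazard is the hazard of the minimum of the two lifetimes; this substitute technique, not the paper's accept-reject scheme, is what the code implements:

```python
import numpy as np

def sample_nlfr(a, b, k, size, rng=None):
    """Draw NLFR lifetimes via the competing-risks representation:
    T = min(E, W), with E ~ Exp(rate=a) and W a Weibull lifetime with
    cumulative hazard b*t**k, so P(T > t) = exp(-(a*t + b*t**k))."""
    rng = np.random.default_rng(rng)
    e = rng.exponential(1.0 / a, size)                  # random-shock times
    w = (rng.exponential(1.0, size) / b) ** (1.0 / k)   # wear-out times
    return np.minimum(e, w)
```

The empirical survival of the sample can be checked directly against exp(-(a t + b t^k)) at any fixed t.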

Illustrative examples
We consider the aircraft windshield failure data given in Table 4. The data consist of 153 observations. Among them, 88 are classified as failed windshields, and the remaining 65 are censored times, meaning the windshields had not failed at the time of observation. The unit of measurement is 1000 hours [14].
In this study, we use both the CE and MCMC methods to estimate the parameters and reliability characteristics. To obtain the MCMC estimators, we implemented the adaptive MCMC algorithm to construct a Markov chain of length 50,000 with a burn-in of 20,000, and reduced the autocorrelation by retaining only every 5th iteration of the chain, obtaining 6001 samples. In fact, such a long burn-in of 20,000 is not strictly necessary because the adaptive MCMC rapidly converges to the stationary distribution. For prior information we used the generalized uniform distribution on R+. Tables 7-8 report the Akaike Information Criterion (AIC) values and the MLEs of the parameters for the mixture models, respectively. The AIC values and MLEs for models 1 to 7 were obtained in [14]. In comparison with these results, Table 7 shows that the non-linear failure rate has the smallest AIC value; the model with the minimum AIC is considered the best approximating model among a set of alternative models.
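For reference, the AIC used in the comparison penalizes the maximized log-likelihood by the number of free parameters:

```python
def aic(log_lik_max, n_params):
    """Akaike Information Criterion: AIC = 2p - 2*log(L-hat).
    Among candidate models, the smallest AIC is preferred."""
    return 2 * n_params - 2 * log_lik_max
```

The NLFR model has p = 3 parameters (a, b, k); the selection in Table 7 simply picks the model with the minimum AIC.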
For this right-censored dataset, both the CE and MCMC methods give almost the same results. The MLEs are easily obtained using the CE method, thanks to its stochastic search. To improve the result of the CE algorithm, one can choose the initial value(s) of the (vector) standard deviation σ̂_0 (sometimes changing the initial value(s) of the (vector) mean μ̂_0 is also needed) by trial and error until the best optimum value, which yields the optimal estimator(s), is obtained. Bayes credible intervals, on the other hand, are easily obtained using the MCMC method.

Conclusions
The benchmark dataset, the aircraft windshield failure data, shows that the proposed non-linear failure rate may be more appropriate than some mixtures of distributions for modeling data that may originate from more than one failure mode.
We recommend using the CE and MCMC methods as complementary tools for parameter estimation. MLEs are easily obtained using the CE method, which is less sensitive to the choice of initial values than alternative optimization methods. Bayes credible intervals, in turn, are easily obtained using the MCMC method.
In our study, we consider only the sum of two components in the non-linear failure rate, which results from the assumption of two independent competing risks, i.e. random shock and wear-out. The case of dependent competing risks will be considered in future work. Moreover, we used only diffuse priors on the parameters for Bayesian estimation, so this study can also be continued with alternative priors.

Figures 2-7 visualize the results from Tables 1-3.

Figure 2. Biases of CE and MCMC estimators for a, b and k when k = 0.5

Figure 7. MSEs of CE and MCMC estimators for a, b and k when k = 3

Table 1. Comparison of CE and MCMC when a = 0.03, b = 0.07, and k = 0.5

Table 2. Comparison of CE and MCMC when a = 0.03, b = 0.07, and k = 2

Table 4. Aircraft windshield failure data

Table 5. Point estimates and two-sided 90% and 95% HPD intervals for a, b, k and MTTF

Table 5 shows our MCMC point estimates and two-sided 90% and 95% highest posterior density (HPD) intervals for a, b, k and MTTF, and Table 6 shows our CE point estimates and two-sided 90% and 95% bootstrap percentile confidence intervals (PERC) for a, b, k and MTTF. Figure 8 shows the posterior distributions and trace plots of each parameter of the Bayesian model obtained by the MCMC algorithm, and Figures 9-11 show the time courses of all relevant functions obtained by both the CE and MCMC methods, i.e. the reliability, failure rate and cumulative failure rate functions.

Table 6. Point estimates and two-sided 90% and 95% bootstrap percentile confidence intervals for a, b, k and MTTF

Table 8. MLEs of parameters for nine mixture models