Statistical modelling of extreme rainfall in Peninsular Malaysia

Flash floods are a common natural disaster and have cost billions of Ringgit Malaysia over the years. A reliable extreme rainfall model supports the prediction of flash floods and the planning of preventive measures. This paper compares four probability distributions, namely the exponential, generalized extreme value (GEV), gamma, and Weibull distributions, using rainfall data from 10 stations in Peninsular Malaysia for the period 1975 to 2008. The comparison is based on the descriptive and predictive analytics of the models. The most suitable model is identified using the Kolmogorov-Smirnov, Anderson-Darling, and chi-square tests. The results show that the GEV distribution is the preferred extreme rainfall model for Peninsular Malaysia.


Introduction
In Peninsular Malaysia, massive floods frequently occur because of extreme rainfall during the monsoon seasons, especially the northeast monsoon season. Hence, understanding the characteristics of extreme rainfall is vital to disaster risk management. An efficient model will enable the responsible Malaysian ministry to estimate the probability of a flood occurrence.
The probability distribution of extreme rainfall plays a vital role in predicting flash floods. Several existing studies address the best-fit distribution of extreme rainfall. Syafrina and Norzaida [1] tested and compared the performance of the gamma and Weibull distributions in a weather generator model. The advanced weather generator (AWE-GEN) model was employed to model rainfall at an hourly scale. They concluded that the gamma distribution provided a better result for the hourly rainfall data. On the other hand, a study by Kumar et al. [2] conducted in Uttarakhand, India showed different results. The best-fit distribution was selected by comparing the candidates with goodness-of-fit tests. The Weibull distribution outperformed the other distributions, while the chi-square and log-Pearson distributions were the next best. The use of Theil-Sen's slope estimator, the Mann-Kendall (MK) test and the modified Mann-Kendall (MMK) test was discussed by Prabhakar et al. [3]. These methods were applied to the trend analysis of long-term rainfall data. Change points in the long-term rainfall time series were investigated using the standard normal homogeneity test (SNHT) and the Mann-Whitney-Pettitt (MWP) test in Odisha, India. The results showed a decreasing trend in rainfall after the year 1945.
*Corresponding author: tanwl@utar.edu.my
Extreme rainfall events were ranked according to Weibull's method in the study by Sabarish et al. [4], while the chi-square and Kolmogorov-Smirnov tests were used to investigate the suitability of the distributions in Tiruchirappalli City, South India. The results showed that the log-Pearson type III distribution is suitable for estimating rainfall amounts at various probability levels. A daily rainfall disaggregation model was adopted by Paola et al. [5] to evaluate the intensity-duration-frequency (IDF) curves of rainfall. The IDF curves were obtained using the Gumbel distribution, and rainfall data for durations shorter than 24 hours were obtained by applying two different disaggregation models to historical rainfall data from Africa. The results showed that climate change affects the frequency of extreme events.
Ten commonly used probability distributions for extreme rainfall were considered in the study by Nguyen and Nguyen [5], and a further investigation was carried out by Nguyen et al. [6] with the same distributions tested in Ontario, Canada. The results showed that the generalized extreme value (GEV), generalized normal and Pearson type III (PE3) distributions are the best overall, providing the best goodness-of-fit and robust quantile extrapolations. Among these three, the GEV distribution is preferred. Kar et al. [7] used a regional approach based on L-moments for hourly rainfall frequency estimation and goodness-of-fit measurement on Jeju Island, Korea. The study showed that the Gumbel and GEV distributions are the more reliable and successful models for the studied area. It also showed that the model is suitable for, and can be applied to, other areas with similar characteristics, namely limited rainfall data and steep land slopes.
Smith [8] applied a weighted least-squares regression to measure Barker's rainfall trends in Southeast Texas. The results did not demonstrate that annual rainfall events are now less frequent and more extreme than in the past. Mehr et al. [9] developed and applied a novel classification-forecasting model, namely binary GP (BGP), for teleconnection studies between sea surface temperature (SST) variations and maximum monthly rainfall (MMR) events in the northwest of Iran. Several limitations were identified in that study. One is that the model is only suitable for maximum monthly rainfall forecasting; another is that binary classification issues arise when genetic programming is used.
Generally, the appropriate distribution depends on the rainfall characteristics of the area under study. Hence, it is necessary to analyze the extreme rainfall characteristics of a region to determine the best-fit distribution for its extreme rainfall.

Data
This study focuses on 10 rainfall stations (refer to Table 1) in Peninsular Malaysia for the period 1975 to 2008. Only the northeast monsoon season, from November until February, is considered in this study. All the historical rainfall data were obtained from the Department of Irrigation and Drainage Malaysia. A peak-over-threshold (POT) approach with thresholds at the 90th and 95th percentiles is applied to obtain the extreme rainfall data, which are then fitted to the selected probability distributions.

Methodology
The POT method is first applied to the rainfall data. POT is one of several methods used in extreme value analysis; it selects the values in the given data that exceed a particular threshold. Before the threshold is applied, all zero-rainfall records are removed from the data. Then, the rainfall amounts that exceed the threshold are retained for modelling. The thresholds used in this study are the 90th percentile and the 95th percentile. The resulting extreme rainfall data are fitted to four probability distributions.
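The POT step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the paper does not state how the percentile is computed or whether it is taken over wet days only, so the sketch assumes linear interpolation between order statistics and a threshold computed after the zero-rainfall records are removed. The function names and the sample values are hypothetical.

```python
import math

def percentile(values, q):
    """q-th percentile (0-100), linear interpolation between order statistics."""
    xs = sorted(values)
    if not xs:
        raise ValueError("no data")
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def peaks_over_threshold(rainfall, q):
    """Drop zero-rainfall records, then keep the values above the q-th percentile."""
    wet = [r for r in rainfall if r > 0]       # assumption: threshold uses wet days only
    threshold = percentile(wet, q)
    return threshold, [r for r in wet if r > threshold]

# Hypothetical daily totals in mm (illustration only, not station data).
daily = [0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 0.0, 11.0]
threshold, extremes = peaks_over_threshold(daily, 90)
print(threshold, extremes)
```

The same call with `q=95` would give the second threshold used in the study.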

Probability distribution function
In this study, the rainfall data are filtered using POT to obtain the extreme rainfall data, which are then fitted to four probability distributions: the exponential, GEV, gamma, and Weibull distributions.

Exponential distribution
The probability density function (pdf) of the exponential distribution is as follows:
$$f(x; \lambda) = \lambda e^{-\lambda x}, \quad x \ge 0,$$
where $\lambda$ represents the rate. The maximum likelihood estimator (MLE) of $\lambda$ is given by
$$\hat{\lambda} = \frac{1}{\bar{x}},$$
where $\bar{x}$ denotes the sample mean; that is, the MLE is the reciprocal of the sample mean.
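As a quick numerical check of the statement that the exponential MLE is the reciprocal of the sample mean, the sketch below evaluates the log-likelihood $n \log \lambda - \lambda \sum x_i$ at the estimate and at nearby rates. The function names and the sample values are hypothetical and serve only as an illustration.

```python
import math

def fit_exponential(sample):
    """MLE of the exponential rate: the reciprocal of the sample mean."""
    return len(sample) / sum(sample)

def exponential_loglik(sample, rate):
    """Log-likelihood n*log(rate) - rate*sum(x) of the exponential model."""
    return len(sample) * math.log(rate) - rate * sum(sample)

sample = [12.0, 30.0, 18.0, 60.0]          # hypothetical exceedances in mm
rate_hat = fit_exponential(sample)          # sample mean is 30, so the rate is 1/30

# The log-likelihood at the estimate is not exceeded by nearby rates.
for factor in (0.8, 0.9, 1.1, 1.2):
    assert exponential_loglik(sample, rate_hat) >= exponential_loglik(sample, factor * rate_hat)
```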

Generalized extreme value distribution
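The pdf of the GEV distribution [10] is as follows:
$$f(x; \mu, \sigma, \kappa) = \frac{1}{\sigma}\left[1 - \kappa\,\frac{x-\mu}{\sigma}\right]^{1/\kappa - 1} \exp\left\{-\left[1 - \kappa\,\frac{x-\mu}{\sigma}\right]^{1/\kappa}\right\}, \quad \kappa \ne 0,$$
where $\mu$, $\sigma$ and $\kappa$ represent the location, scale and shape of the distribution function, respectively. The log-likelihood function [11] for a sample $x_1, \dots, x_n$ is given by
$$\ell(\mu, \sigma, \kappa) = -n \log \sigma + \left(\frac{1}{\kappa} - 1\right)\sum_{i=1}^{n} \log y_i - \sum_{i=1}^{n} y_i^{1/\kappa}, \quad y_i = 1 - \kappa\,\frac{x_i-\mu}{\sigma}.$$
The MLEs of $\mu$, $\sigma$ and $\kappa$ are those values that maximize the likelihood function, subject to the following constraints:
$$\sigma > 0, \quad 1 - \kappa\,\frac{x_i-\mu}{\sigma} > 0 \quad \text{for all } i.$$
A constraint $\kappa \le 1$ is also imposed, because the likelihood can be made infinite when $\kappa > 1$, in which case the MLE does not exist.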
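The constrained log-likelihood can be written down directly. The sketch below is an illustration under an assumed parametrization: the shape $\kappa$ enters as $y = 1 - \kappa (x - \mu)/\sigma$, which is the convention consistent with the constraint $\kappa \le 1$ quoted above; the Gumbel limit is used when $\kappa = 0$. The function name is hypothetical. In practice the negative of this function would be passed to a numerical optimizer (for example `scipy.optimize.minimize`); that step is omitted here.

```python
import math

def gev_loglik(sample, mu, sigma, kappa):
    """GEV log-likelihood with y = 1 - kappa*(x - mu)/sigma (assumed convention).

    Returns -inf when a constraint is violated:
    sigma > 0, kappa <= 1, and y > 0 for every observation.
    """
    if sigma <= 0 or kappa > 1:
        return -math.inf
    total = -len(sample) * math.log(sigma)
    for x in sample:
        z = (x - mu) / sigma
        if kappa == 0:                       # Gumbel limit of the GEV family
            total += -z - math.exp(-z)
            continue
        y = 1.0 - kappa * z
        if y <= 0:                           # observation outside the support
            return -math.inf
        total += (1.0 / kappa - 1.0) * math.log(y) - y ** (1.0 / kappa)
    return total

# Hypothetical exceedances (mm) and trial parameter values.
sample = [55.0, 62.0, 71.0, 90.0, 130.0]
print(gev_loglik(sample, mu=65.0, sigma=20.0, kappa=-0.1))
```

Returning `-inf` for infeasible parameters lets a derivative-free optimizer reject those points without special handling.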

Gamma distribution
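The pdf of the gamma distribution is as follows:
$$f(x; \alpha, \beta) = \frac{x^{\alpha-1} e^{-x/\beta}}{\beta^{\alpha}\,\Gamma(\alpha)}, \quad x > 0,$$
where $\alpha$ and $\beta$ represent the shape and scale of the distribution, respectively. The relationship between the coefficient of variation ($CV$) and mean ($\mu$) of this distribution can be described as
$$CV = \frac{1}{\sqrt{\alpha}}, \quad \mu = \alpha\beta.$$
The MLEs of $\alpha$ and $\beta$ satisfy the following equations, where $\psi$ denotes the digamma function and $\bar{x}$ denotes the sample mean:
$$\log \hat{\alpha} - \psi(\hat{\alpha}) = \log \bar{x} - \frac{1}{n}\sum_{i=1}^{n} \log x_i, \quad \hat{\beta} = \frac{\bar{x}}{\hat{\alpha}}.$$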
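The gamma MLE has no closed form, because the shape estimate solves an equation involving the digamma function. The sketch below assumes the standard likelihood equations $\log \alpha - \psi(\alpha) = \log \bar{x} - \frac{1}{n}\sum \log x_i$ and $\beta = \bar{x}/\alpha$, and solves the first by bisection. The digamma approximation (recurrence plus asymptotic series), the function names and the sample are illustrative choices, not taken from the paper.

```python
import math

def digamma(x):
    """Digamma for x > 0: shift x up to >= 6 by recurrence, then asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x                    # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def fit_gamma(sample):
    """MLE of gamma (shape, scale) by bisection on log(a) - digamma(a) = s."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.log(mean) - sum(math.log(x) for x in sample) / n
    if s <= 0:
        raise ValueError("need at least two distinct positive values")
    lo, hi = 1e-8, 1.0
    while math.log(hi) - digamma(hi) > s:    # left side decreases towards 0
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if math.log(mid) - digamma(mid) > s:
            lo = mid
        else:
            hi = mid
    shape = 0.5 * (lo + hi)
    return shape, mean / shape

sample = [2.0, 3.5, 5.0, 8.0, 12.5]          # hypothetical exceedances in mm
shape_hat, scale_hat = fit_gamma(sample)
print(shape_hat, scale_hat)
```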

Weibull distribution
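The pdf of the Weibull distribution is as follows:
$$f(x; a, b) = \frac{b}{a}\left(\frac{x}{a}\right)^{b-1} \exp\left\{-\left(\frac{x}{a}\right)^{b}\right\}, \quad x \ge 0,$$
where $a$ and $b$ represent scale and shape, respectively. The MLEs of $a$ and $b$ are the solutions of the simultaneous equations:
$$\hat{a} = \left[\frac{1}{n}\sum_{i=1}^{n} x_i^{\hat{b}}\right]^{1/\hat{b}}, \qquad \frac{\sum_{i=1}^{n} x_i^{\hat{b}} \log x_i}{\sum_{i=1}^{n} x_i^{\hat{b}}} - \frac{1}{\hat{b}} = \frac{1}{n}\sum_{i=1}^{n} \log x_i.$$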
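The Weibull likelihood equations can be solved by first finding the shape from the second equation, which involves the shape only, and then substituting it into the expression for the scale. The sketch below assumes the standard two-parameter form with scale $a$ and shape $b$, and uses bisection for the shape; the function name and the sample are hypothetical.

```python
import math

def fit_weibull(sample):
    """MLE of Weibull (scale a, shape b) by bisection on the shape equation."""
    n = len(sample)
    logs = [math.log(x) for x in sample]
    mean_log = sum(logs) / n
    max_log = max(logs)
    if max_log == min(logs):
        raise ValueError("need at least two distinct positive values")

    def weights(b):
        # x**b rescaled by max(x)**b so that large b cannot overflow.
        return [math.exp(b * (lx - max_log)) for lx in logs]

    def score(b):
        # sum(x^b log x) / sum(x^b) - 1/b - mean(log x); increasing in b.
        w = weights(b)
        return sum(wi * lx for wi, lx in zip(w, logs)) / sum(w) - 1.0 / b - mean_log

    lo, hi = 1e-6, 1.0
    while score(hi) < 0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if score(mid) < 0:
            lo = mid
        else:
            hi = mid
    b = 0.5 * (lo + hi)
    # a = (mean of x^b)^(1/b), evaluated on the log scale.
    a = math.exp(max_log + math.log(sum(weights(b)) / n) / b)
    return a, b

sample = [2.0, 3.5, 5.0, 8.0, 12.5]          # hypothetical exceedances in mm
scale_hat, shape_hat = fit_weibull(sample)
print(scale_hat, shape_hat)
```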

Goodness-of-fit test
The goodness-of-fit tests used in this study are the Kolmogorov-Smirnov (K-S), Anderson-Darling (A-D), and chi-square tests, each at a significance level of 5%.

Kolmogorov-Smirnov test
The K-S test compares the empirical distribution function $F_n(x)$ with a specified cumulative distribution function $F(x)$. The equation for computing the Kolmogorov-Smirnov statistic $D$ is:
$$D = \sup_{x} \left| F_n(x) - F(x) \right|,$$
where the equation gives the largest distance between the two functions, $F_n(x)$ and $F(x)$. The larger the value of the test statistic, the greater the inconsistency between the observed data and the specified distribution.
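Because the empirical distribution function is a step function, the supremum in the K-S statistic is attained at one of the sample points, just before or just after a jump. The sketch below computes the statistic that way for an arbitrary model CDF; the function name, the sample, and the exponential rate are hypothetical, and the comparison against a critical value is left out.

```python
import math

def ks_statistic(sample, cdf):
    """D = sup |F_n(x) - F(x)|, checked on both sides of each jump of F_n."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # F_n equals i/n at x and (i - 1)/n just below x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

# Hypothetical exceedances (mm) against an exponential model with rate 1/30.
sample = [12.0, 30.0, 18.0, 60.0]
d_stat = ks_statistic(sample, lambda x: 1.0 - math.exp(-x / 30.0))
print(d_stat)
```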

Anderson-Darling test
The A-D test is a modified version of the K-S test that gives more weight to the tails of the tested distribution. The equation for the A-D test statistic is:
$$A^2 = -n - \frac{1}{n}\sum_{i=1}^{n} (2i - 1)\left[\ln F\left(x_{(i)}\right) + \ln\left(1 - F\left(x_{(n+1-i)}\right)\right)\right],$$
where $x_{(1)}$ to $x_{(n)}$ is the ordered sample of size $n$ from smallest to largest, and $F(x)$ is the cumulative distribution function of the specified distribution. The null hypothesis is rejected if $A^2$ is greater than the critical value at the given significance level.
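The A-D statistic follows directly from the ordered sample and the model CDF. The sketch below is a plain transcription of the standard formula $A^2 = -n - \frac{1}{n}\sum (2i-1)[\ln F(x_{(i)}) + \ln(1 - F(x_{(n+1-i)}))]$; the function name and the sample are hypothetical, and critical values (which depend on the distribution being tested) are not included.

```python
import math

def anderson_darling(sample, cdf):
    """A^2 statistic for a fully specified continuous CDF."""
    xs = sorted(sample)
    n = len(xs)
    total = 0.0
    for i in range(1, n + 1):
        lower = math.log(cdf(xs[i - 1]))          # ln F(x_(i))
        upper = math.log(1.0 - cdf(xs[n - i]))    # ln(1 - F(x_(n+1-i)))
        total += (2 * i - 1) * (lower + upper)
    return -n - total / n

# Hypothetical exceedances (mm) against an exponential model with rate 1/30.
sample = [12.0, 30.0, 18.0, 60.0]
a2 = anderson_darling(sample, lambda x: 1.0 - math.exp(-x / 30.0))
print(a2)
```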

Chi-square test
The chi-square test is used to check the suitability of a specific distribution by comparing the frequencies observed in the sample with those expected under the distribution. Using $O_i$ as the "observed count" and $E_i$ as the "expected count" in class $i$ of $k$ classes, the equation to calculate chi-square is:
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}.$$
The null hypothesis of the test states that there is no significant difference between the observed and expected frequencies, whereas the alternative hypothesis states that they are different.
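A short sketch of the chi-square computation is given here. The paper does not state how the classes were chosen, so the bin edges below are an assumption made purely for illustration, as are the function names and the counts. Expected counts are taken as $n\,[F(b) - F(a)]$ for each class $(a, b]$ under the fitted CDF $F$.

```python
import math

def expected_counts(cdf, edges, n):
    """Expected count n*(F(b) - F(a)) for each pair of consecutive bin edges."""
    return [n * (cdf(b) - cdf(a)) for a, b in zip(edges, edges[1:])]

def chi_square_statistic(observed, expected):
    """Sum over classes of (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts in three classes.
observed = [10, 20, 30]
expected = [15.0, 15.0, 30.0]
chi2 = chi_square_statistic(observed, expected)   # 25/15 + 25/15 + 0
print(chi2)

# Expected counts for 60 values under an exponential model with rate 1/30,
# using assumed class boundaries of 0, 20, 50 and infinity (mm).
cdf = lambda x: 1.0 - math.exp(-x / 30.0)
print(expected_counts(cdf, [0.0, 20.0, 50.0, math.inf], 60))
```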

Results and discussions
The analysis starts with data cleansing, in which all zero-rainfall records are removed from the data. After the data cleansing, the POT method is applied to obtain the extreme rainfall data using the 90th percentile and 95th percentile thresholds. The parameters of each distribution are estimated for both thresholds by maximum likelihood estimation (MLE). Table 2 and Table 3 show the estimated parameters of the exponential, gamma, Weibull, and GEV distributions for all the rainfall stations at the 90th and 95th percentile thresholds, respectively. Once the parameter estimates are obtained, goodness-of-fit tests are used to determine the best-fit distribution for each rainfall station.
The test statistics of all the selected goodness-of-fit tests for the exponential distribution (Exp), the GEV distribution, the gamma distribution (Gamma) and the Weibull distribution (Wei) are shown in Fig. 1. Fig. 1 shows that the GEV distribution gives the best overall result across all three goodness-of-fit tests for both the 90th and 95th percentile thresholds. Therefore, it can be concluded that the GEV distribution is the best-fit distribution for the extreme rainfall at the 10 selected rainfall stations.
Quantile-quantile (Q-Q) plots of the extreme rainfall data are used to further visualize the suitability of the selected distributions. Fig. 2 and Fig. 3 show the Q-Q plots of the four distributions with the thresholds of the 90th percentile and the 95th percentile for 5 selected rainfall stations in Peninsular Malaysia. The Q-Q plots in Fig. 2 and Fig. 3 show that the extreme rainfall data fit the GEV distribution best, with the majority of the data falling around the straight line. The gamma distribution and the Weibull distribution are the second-best choices, with similar Q-Q plots. The exponential distribution is the least favorable distribution for fitting the extreme rainfall data.
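The points of a Q-Q plot pair each ordered observation with the model quantile at a plotting position. The sketch below assumes the plotting position $(i - 0.5)/n$, which is one common choice and is not stated in the paper; the function names and the sample are hypothetical. The exponential quantile function is used because it has a closed form, and any fitted distribution's inverse CDF could be substituted.

```python
import math

def qq_points(sample, quantile_fn):
    """(model quantile, observed value) pairs at plotting positions (i - 0.5)/n."""
    xs = sorted(sample)
    n = len(xs)
    return [(quantile_fn((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]

def exponential_quantile(p, rate):
    """Inverse CDF of the exponential distribution."""
    return -math.log(1.0 - p) / rate

# Hypothetical exceedances (mm); the rate is the MLE, 1 / sample mean.
sample = [12.0, 30.0, 18.0, 60.0]
rate = len(sample) / sum(sample)
for model_q, observed in qq_points(sample, lambda p: exponential_quantile(p, rate)):
    print(round(model_q, 2), observed)
```

Points lying close to the line $y = x$ indicate agreement between the data and the fitted distribution.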

Conclusion
Fitting a distribution to extreme rainfall events is crucial in hydrological studies. The best-fit distribution can be used in hydrological models such as rainfall-runoff models. In this study, data from 10 selected rainfall stations over Peninsular Malaysia during the northeast monsoon season from 1975 until 2008 were fitted to four probability distributions: the exponential, gamma, Weibull and GEV distributions. Three different goodness-of-fit tests were used to assess model performance. The goodness-of-fit tests, namely the K-S, A-D, and chi-square tests, indicated that the GEV distribution is the best fit for all 10 selected rainfall stations. The suitability of the selected probability distributions for the extreme rainfall data was visualized through quantile-quantile plots. Comparing all the quantile-quantile plots, the GEV distribution fits the extreme rainfall data better than the other probability distributions. These results agree with the results obtained from the goodness-of-fit tests. It can be concluded that the GEV distribution is the best-fit probability distribution for extreme rainfall events in Peninsular Malaysia. In future studies, the GEV distribution can be used to predict extreme rainfall events.
The authors are grateful to the Department of Irrigation and Drainage Malaysia for providing the rainfall data. The work is funded by UTAR Research Fund Vote 6200/TG1 awarded by Universiti Tunku Abdul Rahman.