On using nonparametric approaches for precipitation estimation

Nonparametric density estimation methods have been used for precipitation estimation for decades. The new approach for estimating the density proposed recently by Geenens and Wang appears to offer advantages over them as far as their behavior in the tail is concerned. Real data analyses presented in this paper confirm the usefulness of this new approach for precipitation modelling.


Introduction
Kernel density estimation has many applications in environmental sciences; compare e.g. [1][2][3]. However, in many practical situations, e.g. when we want to estimate precipitation distribution, we have to estimate density of positive variables which may be challenging. It is known that the kernel estimators have heavy bias at and near the boundary; compare [4, page 594]. Some remedies for this defect and other drawbacks of kernel estimators that arise in the case when the random variable of interest is positive were proposed in the literature. However, these remedies are not quite satisfactory and the problem of density estimation of positive random variables is still an area of active research.
We will present some new methods addressing this problem in Section 2; among them the method proposed by Geenens and Wang [5] deserves in our opinion particular interest. Its usefulness will be illustrated on real datasets in Section 3. Conclusions will be given in the final section.

Density estimation of positive variables
For a given a sample consisting of i.i.d. copies of a random variable with unknown distribution admitting a density , the kernel of estimator takes the form (1) where the kernel is a non-negative function satisfying the conditions and is known as the bandwidth. K is usually assumed to be symmetric about 0.
The statistical properties of the estimator (1) have been presented e.g. in [6]. It has been shown that it reliably estimates when it is supported on . If the support of is bounded, the kernel estimator may have some drawbacks.
In this section we consider the case when is the distribution of a positive continuous random variable having density supported on . Such variables naturally arise in various areas of environmental sciences.
As it already has been mentioned, the kernel estimator (1) usually fails to correctly estimate the behaviour of near 0; also its tail behaviour often is not satisfactory. In the neighbourhood of 0 the estimator may place positive probability mass on the negative half-line which results in the boundary bias. In the tail region the estimator may produce artificial local maxima at each observation, sometimes referred to as spurious bumps; compare [7]. Several remedies to overcome the boundary bias, such as the reflection method [8], were proposed in 1970-ies and 1980-ies.
In 2000 Chen [9] addressed both mentioned problems arising in the case when is supported on by introducing the asymmetric kernel density estimator of (4) where is an asymmetric -supported density whose parameters are functions of and . He proposed two 'asymmetric kernels': where is the density of the Gamma distribution with the shape parameter and the scale parameter while (7) Several types of asymmetric kernels were investigated in the subsequent literature [10][11][12][13]. However, it seems that the estimators using these kernels do not show significant advantage over the Chen estimators.
Another approach to estimating supported on consists in estimating the density of and obtaining an estimator of the density of using well-known results concerning functions of random variables.
The fact that (8) for all suggests an estimator of having the form (9) Here is a suitably chosen estimator of the density of the random variable supported on . We will refer to the estimator (9) as the log-transform estimator of the density function ; compare [14].
Geenens and Wang [5] pointed out that choosing a kernel estimator for estimating may result in some drawbacks. This is due to the fact that the kernel estimators are known to poorly estimate tails of densities, compare [4, page 594]. One way to overcome this shortcoming is to use the local likelihood density estimator of (instead of the kernel estimator). The estimator constructed using this approach has many favourable properties; compare [5]. A simulation study in [5,Section 7] indicates that when the degree of the polynomial that approximates the logarithm of the density of Y is 2 (compare [5, Section 2.3]), this estimator has small Mean Integrated Absolute Relative Error (MIARE); for an estimator of the density of a positive random variable X MIARE is defined as (10) This shows that this local log-quadratic transformation kernel density estimator based on the log-transformation (in short: LL-LT estimator) perform well in the tail.
This property was confirmed by the simulation study presented in [15], the version of the preprint [5] that was accepted for publication; in this simulation study other criterion than MIARE, the mean integrated squared error, was taken into account.

Applications to precipitation estimation
The applications of nonparametric methods to precipitation estimation were considered in [16]. The authors suggest using the log-transform density estimator (9) in which was computed using the kernel method with bandwidth proposed by Sheather and Jones [17]. The log-transform kernel density estimator was also used in [18] for estimating precipitation amounts for various time intervals.
Using nonparametric methods for estimating precipitation extremes appears to be particularly challenging. The estimator (9) in which is estimated using spline methods seems to be particularly promising for this aim [19].

Real data analyses
We have performed analyses of three datasets that were labelled according to the places they were collected from:


The dataset 'Łódź' contains the measurements of monthly precipitation in June (in mm) in the years 1954-1992 from the meteorological station Łódź-Lublinek.  The dataset 'Wichita' contains the measurements of monthly precipitation in August (in mm) in Wichita, Kansas, in the years 1980-2011. This dataset is a part of the R package SPEI [20]. For the dataset 'Łódź' the Gamma distribution and the Lognormal distribution were fitted using the procedure fitdist from the package fitdistrplus [22]. The LL-LT estimator was computed using the R script accompanying the paper [5] (available at the Journal of the Comuptational and Graphical Statistics website).   Figure 1 together with the graphs of the Gamma kernel estimate as well as the Gaussian kernel estimate of the density of the precipitation. The Gamma kernel estimate was computed using the bandwidth determined by the unbiased cross validation method implemented in the package kdensity [23]. The Gaussian kernel density estimate was computed using the bandwidth determined by the Sheather-Jones method implemented in the bw.SJ procedure from the R package stats [24]. Some quantiles for the estimated densities were displayed in Table 1; the and quantiles are of interest when computing the standardized precipitation index (SPI) which could be defined as , where is the amount of precipitation, denotes the cumulative distribution function of the standard normal distribution and is the cumulative distribution function of the suitably chosen distribution fitted to data; compare e.g. [25].
We see that the LL-LT estimate did approximately as well as the Log-normal estimate. Let us note that the dataset 'Łódź' was also analyzed in [25]: the authors have demonstrated that the fit obtained for the Lognormal distribution was better than for the competing distributions considered in [25].
Similar results were obtained for the dataset 'Wichita'. The LL-LT and the Gamma estimates appear to perform best; the quantile for the Log-normal distribution in the Table 2 seems to be unreasonably large. The Gamma kernel estimate did well in the right tail, but its boundary behaviour seems to be less satisfactory; the values of the density function near the boundary seem to be overestimated using this approach.    The dataset 'Kraków' contains data concerning precipitation extremes. We estimated the density function of the maximal precipitation intensity (mm/h) in the years 1961-1985 with the time interval equal to 24 hours using the same methods as before. Additionally we fit the parameters of the GEV distribution to the data using the maximum likelihood method implemented in the package evd [26]. The densities estimates are displayed in Figure 3, the 0.99 and 0.995 quantiles corresponding to the obtained density estimates are presented in Table 3. We see that that the LL-LT and the GEV estimates of the density appear to perform best. It can be thus said that the 'local-likelihood logtransformation approach' proved to be successful also in this case.

Conclusions
The real data analyses presented in the previous section confirm that the nonparametric methods may prove to be efficient for precipitation estimation. A new nonparametric approach introduced by Geenens and Wang that bases on data transformation and local likelihood estimation appears to be particularly useful for this aim. There is a need of an extensive simulation study comparing this method with other new promising nonparametric approaches that may be used for precipitation estimation.