The study and comparison of one-dimensional kernel estimators – a new approach. Part 1. Theory and methods

In this article we examine and compare the effectiveness of different kernel density estimators for a set of experimental data. For a given random sample $X_1, X_2, \ldots, X_n$ we present eight kernel estimators of the density function, with the Gaussian kernel and with the kernel given by Epanechnikov [1], obtained using several methods: Silverman's rule of thumb, the Sheather–Jones method, cross-validation methods, and other well-known plug-in methods [2–5]. To assess the effectiveness of the considered estimators and their similarity, we applied a distance measure for measurable and integrable functions [6]. All numerical calculations were performed for a set of experimental data recording groundwater level at a land reclamation facility (cf. [7–8]). The goal of the paper is to present a method that allows the study of local properties of the examined kernel estimators.


Introduction
In many natural (e.g. meteorological or climatic) and typical engineering (e.g. hydrological or soil science) problems, it is extremely important to obtain knowledge of the density function of the probability distribution of the features (X) describing the phenomenon being studied. This is also a fundamental concept in statistics. Specifying the density function f gives a natural description of the distribution of X and allows one to determine probabilities of the form $P(a < X < b) = \int_a^b f(x)\,dx$ for $a < b$. One approach to density estimation is parametric estimation (cf. the assessment of climate change impacts on river runoff [9], models of atmospheric precipitation [10], river flow prediction for future climatic conditions [11] and many other papers in the fields of meteorology and hydrology). In this article we shall not be considering parametric estimates (not even the commonly used generalized gamma distributions with 3 parameters); the approach will be nonparametric, in that less rigid assumptions will be made about the distribution of the observed data. Although it will be assumed that the distribution has a probability density f, the data will be allowed to speak for themselves in determining the estimate more than would be the case if f were constrained to fall within a given parametric family. The oldest and most widely used nonparametric density estimator is the histogram. The related naive estimator can be written as $\hat f(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2h}\, w\!\left(\frac{x - X_i}{h}\right)$, where $w(u) = 1$ for $|u| < 1$ and $w(u) = 0$ otherwise, for a given random sample $X_1, X_2, \ldots, X_n$ and a fixed parameter h. It is easy to see that the estimate is constructed by placing a "box" of width 2h (called the binwidth) and height $1/(2nh)$ on each observation and summing to obtain the estimate.
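As an illustration, the naive "box" estimator described above can be sketched in a few lines of Python (the function name and toy data are ours, not part of the original study):

```python
import numpy as np

def naive_estimator(x, data, h):
    """Naive (histogram-like) density estimate: a box of width 2h and
    height 1/(2nh) centred on each observation, summed over the sample."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    # count observations within distance h of x and rescale by 1/(2nh)
    return np.sum(np.abs(x - data) < h) / (2.0 * n * h)
```

For a single observation at 0 and h = 1, the estimate equals 1/2 on (-1, 1) and 0 outside, as expected from the box construction.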
The generalization of this naive estimator consists in the replacement of the function w by a kernel function K which satisfies the condition $\int_{-\infty}^{\infty} K(x)\,dx = 1$. Now, by analogy with the definition of the naive estimator, for a given random sample $X_1, X_2, \ldots, X_n$ the kernel estimator with kernel K is defined by $\hat f(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)$, (1) where h is the window width, also called the smoothing parameter or bandwidth. In the statistical literature, the estimator of the density function of a random variable X given by (1) under more general assumptions is called the Parzen–Rosenblatt estimator or the Akaike–Parzen–Rosenblatt estimator (see [1,12,13]). The results presented here are an important extension of the results of the paper [8], which considered only the Gaussian kernel and a specific bandwidth depending on the sample size and a parameter of the kernel K. We shall consider eight kernel estimators of the density function with two kernels, the Gaussian kernel and the kernel given by Epanechnikov, using several different methods: Silverman's rule of thumb, the Sheather–Jones method, cross-validation methods and other selected plug-in methods. The main aim is to compare the examined kernel estimators, their effectiveness and the prognostic results.
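The kernel estimator (1) translates directly into code; a minimal Python sketch with a Gaussian kernel follows (the function names are ours):

```python
import numpy as np

def gaussian_kernel(u):
    """Density of N(0, 1), used as the kernel K."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kde(x, data, h, kernel=gaussian_kernel):
    """Kernel estimator (1): f_hat(x) = (1/(n*h)) * sum_i K((x - X_i)/h)."""
    data = np.asarray(data, dtype=float)
    u = (x - data) / h
    return kernel(u).sum() / (len(data) * h)
```

Any other kernel integrating to 1 (e.g. the Epanechnikov kernel introduced below) can be passed in place of `gaussian_kernel` without changing the estimator's structure.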
To assess the effectiveness of the considered estimates and their similarity, we applied a distance measure for measurable and integrable functions proposed by Marczewski and Steinhaus [6]. All numerical calculations were performed for a set of hydrological data recording groundwater level at a land reclamation facility (see [7,8]).

Material and methods
Land reclamation studies at the foothill site of Długopole (area approximately 1.5 ha) included long-term measurements of groundwater level using properly installed piezometers (hydrogeological observation holes). Daily registered groundwater levels were averaged based on measurements from about a dozen piezometers suitably located at the research station (the experimental data are derived from the Institute of Agricultural and Forest Improvement at Wroclaw University of Environmental and Life Sciences). The data set includes groundwater level measurements grouped into 10-centimeter classes of levels from 10 up to 150 cm. The experimental data aggregated in a frequency table were reproduced by repeated use of a random number generator with a given frequency structure (see [8], Table 1). Then we obtained the vector of values of groundwater level as $(x_1, x_2, \ldots, x_{366})$. Based on these data, the frequency histogram of the groundwater level is drawn below. Throughout this paper we consider the two most often used kernel functions: the kernel $K_g$, being the density function of the normal distribution N(0,1) (Gaussian kernel), i.e.
$K_g(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}$, (2) or more generally the density of the normal distribution $N(0,\sigma^2)$ (see [15]), and the Epanechnikov kernel given by $K_e(u) = \frac{3}{4\sqrt{5}}\left(1 - \frac{u^2}{5}\right)$ for $|u| \le \sqrt{5}$, or in a simpler version $K_e(u) = \frac{3}{4}\left(1 - u^2\right)$ for $|u| \le 1$, (3) for which the optimality properties in the density estimation setting were first described by Epanechnikov (see e.g. [8]). Graphs of the kernels are shown below in Fig. 2. The structure of the kernel estimator indicates that in density estimation it is important to select not only an appropriate kernel, but also an optimal bandwidth. The value chosen for the bandwidth h exerts a strong influence on the estimator (see Figure 3): the bandwidth parameter controls the smoothness of the density estimate. An optimal solution to the problem of selecting the bandwidth h is to minimize the integrated squared error given by $\mathrm{ISE}(h) = \int \left(\hat f(x) - f(x)\right)^2 dx = R(\hat f) - 2\int \hat f(x) f(x)\, dx + R(f)$, (4) where $R(g)$ denotes a measure of the roughness of a given function g, defined by $R(g) = \int g(x)^2\, dx$, or to minimize the mean integrated squared error $\mathrm{MISE}(h) = \mathbb{E}\int \left(\hat f(x) - f(x)\right)^2 dx = \int \mathrm{Var}\,\hat f(x)\, dx + \int \left(\mathbb{E}\hat f(x) - f(x)\right)^2 dx$, (5) where $\hat f$ is the kernel estimator (1). To compute the bias term in (5), note that $\mathbb{E}\hat f(x) = \int \frac{1}{h} K\!\left(\frac{x - y}{h}\right) f(y)\, dy = \int K(z)\, f(x - hz)\, dz$, by applying the change of variable $z = (x - y)/h$. We further analyze the expression (5), allowing $h \to 0$ and $nh \to \infty$ as $n \to \infty$. Note that a large bandwidth causes important features of f to be smoothed away, thereby causing bias. To understand bandwidth selection it is necessary to analyze carefully the expression MISE(h) (see e.g. [4]). In the numerical analysis of the selection of the optimal smoothing bandwidth, the AMISE is also taken into account, defined as $\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{\sigma_K^4 h^4}{4}\, R(f'')$, (6) where f must have two bounded continuous derivatives and $\sigma_K^2 = \int u^2 K(u)\, du$ is the variance of the kernel K. It can be shown that the following relationship holds: $\mathrm{MISE}(h) = \mathrm{AMISE}(h) + o\!\left(\frac{1}{nh} + h^4\right)$, where the remainder is a quantity that converges to 0 faster than $\frac{1}{nh} + h^4$ does as $h \to 0$ and $nh \to \infty$. If $h \to 0$ and $nh \to \infty$ as $n \to \infty$, then $\mathrm{MISE}(h) \to 0$. Minimizing AMISE(h) with respect to h shows that the optimal bandwidth is $h_0 = \left(\frac{R(K)}{\sigma_K^4\, R(f'')}\right)^{1/5} n^{-1/5}$, (8) but this is a theoretical result; the quantity $h_0$ essentially depends on the unknown density f.
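The optimal bandwidth (8) depends only on the sample size, two kernel constants, and the roughness $R(f'')$ of the unknown density. A minimal Python sketch (our own function names; $R(f'')$ must be supplied by the caller, since it is unknown in practice):

```python
import numpy as np

# Kernel constants entering the optimal-bandwidth formula (8):
# "R" is R(K) = int K(u)^2 du, "sigma2" is sigma_K^2 = int u^2 K(u) du,
# computed for the Gaussian kernel (2) and the Epanechnikov kernel (3).
KERNELS = {
    "gaussian":     {"R": 1.0 / (2.0 * np.sqrt(np.pi)), "sigma2": 1.0},
    "epanechnikov": {"R": 3.0 / 5.0,                     "sigma2": 1.0 / 5.0},
}

def h_amise(n, roughness_f2, kernel="gaussian"):
    """AMISE-optimal bandwidth (8):
    h0 = [R(K) / (sigma_K^4 * R(f''))]^(1/5) * n^(-1/5).

    roughness_f2 stands for R(f''), which depends on the unknown density f
    and must be supplied (or estimated) by the caller.
    """
    k = KERNELS[kernel]
    return (k["R"] / (k["sigma2"] ** 2 * roughness_f2)) ** 0.2 * n ** -0.2
```

As a sanity check, for f equal to the N(0,1) density one has $R(f'') = \frac{3}{8\sqrt{\pi}}$, and the Gaussian-kernel formula reduces to $(4/(3n))^{1/5}$, i.e. Silverman's rule of thumb with unit standard deviation.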
In the statistical literature devoted to nonparametric kernel estimation of an unknown density function, many methods can be found, based on various premises. Here we shall present two methods and several variants of them. One of the most interesting is the cross-validation method, which uses the data in two ways: once to calculate the kernel estimator $\hat f$ from the data, and a second time to evaluate the quality of $\hat f$ as an estimator of f. Let $\hat f_{-i}$ be the density estimate constructed from all of the data points except $X_i$, as follows: $\hat f_{-i}(x) = \frac{1}{(n-1)h}\sum_{j \ne i} K\!\left(\frac{x - X_j}{h}\right)$. (9) Considering the expression (4), we notice that the last term is constant in h, and the middle term $\int \hat f(x) f(x)\, dx$ can be estimated by $\frac{1}{n}\sum_{i=1}^{n} \hat f_{-i}(X_i)$. Now, to get a good bandwidth, it is enough to minimize $\mathrm{UCV}(h) = \int \hat f(x)^2\, dx - \frac{2}{n}\sum_{i=1}^{n} \hat f_{-i}(X_i)$ (10) with respect to h. Use of this criterion is called unbiased cross-validation or least squares cross-validation, because choosing h to minimize UCV(h) minimizes the integrated squared error between $\hat f$ and f. Instead of the exact MISE formula given by (5), we can use a criterion based on the formula for the asymptotic MISE given by (6); this is called biased cross-validation (BCV). To get a good bandwidth we should minimize the expression $\mathrm{BCV}(h) = \frac{R(K)}{nh} + \frac{\sigma_K^4 h^4}{4}\, \widehat{R(f'')}$, (11) where BCV(h) is obtained by replacing the unknown $R(f'')$ in (6) by its estimator $\widehat{R(f'')}$ (see [14]). An alternative method of cross-validation for the smoothing of density estimates was proposed by Bowman (see [15,16]). The smoothing parameter h is chosen to minimize the cross-validation function written as $\mathrm{CV}(h) = \frac{1}{n}\sum_{i=1}^{n} \int \left( I(x - X_i) - \hat F_{-i}(x) \right)^2 dx$, (12) where $I(x - X_i) = 1$ if $x - X_i \ge 0$ and $I(x - X_i) = 0$ if $x - X_i < 0$, and $\hat F_{-i}$ is the distribution function of the leave-one-out estimate $\hat f_{-i}$ given by (9).
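The unbiased cross-validation criterion (10) can be sketched in Python as follows; this is a minimal illustration with a Gaussian kernel (the grid-based integration and the function names are ours), not the implementation used in the paper:

```python
import numpy as np

def gauss(u):
    """Standard normal kernel."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def ucv(h, data, grid_pts=512):
    """Least-squares (unbiased) cross-validation score, cf. (10):
    UCV(h) = int f_hat(x)^2 dx - (2/n) * sum_i f_hat_{-i}(X_i)."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    # first term: integral of f_hat^2, approximated on a wide grid
    xs = np.linspace(data.min() - 4 * h, data.max() + 4 * h, grid_pts)
    fhat = gauss((xs[:, None] - data[None, :]) / h).sum(axis=1) / (n * h)
    term1 = np.sum(fhat ** 2) * (xs[1] - xs[0])
    # second term: leave-one-out estimates (9) evaluated at the sample points
    kmat = gauss((data[:, None] - data[None, :]) / h) / h   # K_h(X_i - X_j)
    loo = (kmat.sum(axis=1) - kmat.diagonal()) / (n - 1)
    return term1 - 2.0 * loo.mean()

def ucv_bandwidth(data, candidates):
    """Pick the candidate bandwidth minimising UCV over a supplied grid."""
    return min(candidates, key=lambda h: ucv(h, data))
```

In practice the minimisation over h would be done with a finer grid or a numerical optimiser; the simple grid search above suffices to convey the idea.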
In addition to cross-validation methods, and with similar success, many authors have proposed so-called plug-in methods. Such a method for obtaining an optimal smoothing factor in the space $L^2$ of square integrable functions was introduced by Woodroofe [17], who obtained an asymptotically optimal expression for the optimal h as a function of f and n. In a second step, he estimated the unknown functional of f (in this case, $R(f'')$) from the data in a nonparametric manner using a pilot bandwidth. To minimize MISE(h) when f is sufficiently smooth and K is a nonnegative kernel, the asymptotically optimal h has the following form: $h_{\mathrm{opt}} = \left(\frac{R(K)}{\sigma_K^4\, R(f'')}\right)^{1/5} n^{-1/5}$. This formula is at the heart of the plug-in method (cf. (8)).
For example, Silverman's rule of thumb gives $h = \left(\frac{4}{3n}\right)^{1/5} \hat\sigma \approx 1.06\, \hat\sigma\, n^{-1/5}$, where $\hat\sigma$ is the sample standard deviation. A better solution, at the cost of a fairly complex process of finding the bandwidth h, is given by an approach known in the literature as the Sheather–Jones method. This is a two-stage process. At the first stage, a simple rule of thumb is used to calculate a pilot bandwidth expressed by $h_0 = C_1\, C_2\, n^{-1/7}$, (13) where $C_1$ and $C_2$ are functionals that depend respectively on the derivatives of f and on the kernel L.
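Silverman's rule of thumb is simple enough to state directly in code; a minimal Python sketch (function name ours):

```python
import numpy as np

def silverman_bandwidth(data):
    """Silverman's rule of thumb:
    h = (4 / (3n))^(1/5) * s  ~  1.06 * s * n^(-1/5),
    where s is the sample standard deviation."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    s = data.std(ddof=1)
    return (4.0 / (3.0 * n)) ** 0.2 * s
```

The rule is exactly AMISE-optimal when f is a normal density; for skewed or multimodal data it tends to oversmooth, which is why the more elaborate selectors discussed here are of interest.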
The bandwidth $h_0$ is used to estimate $R(f'')$, this being the only unknown in the expression for the optimal bandwidth given by $h = \left(\frac{R(K)}{\sigma_K^4\, \widehat{R(f'')}}\right)^{1/5} n^{-1/5}$, (14) where L is a sufficiently differentiable kernel used to construct the estimate $\widehat{R(f'')}$ of $R(f'')$. The estimation of $R(f'')$ in the formula (8) follows from (14). At the second stage we solve, with respect to h, the equation obtained from (14) when the pilot bandwidth used in $\widehat{R(f'')}$ is itself expressed as a function of h. (15) The solution to (15) can be found using, for example, a grid search or a root-finding technique such as Newton's method. This calculation method is quite complex, even using a Gaussian kernel (cf. [4]).
An alternative approach to selecting a bandwidth is to use an estimator of the asymptotically optimal bandwidth. This was proposed as a plug-in estimate by Altman and Leger (see [18]) in the form $\hat h = \left(\frac{\hat V_2}{4\,\hat B_3}\right)^{1/3} n^{-1/3}$, (16) where $\hat V_2$ and $\hat B_3$ are estimators of the variance and bias functionals $V_2$ and $B_3$ respectively, constructed with a positive kernel k and a nonnegative weight function W(x) (see [18]).
The fourth plug-in method of kernel estimation, proposed by Polansky and Baker [19], involves estimation of the asymptotically optimal bandwidth $h_0$ by a modified multi-stage plug-in rule, (17) whose components are defined in [19]. Note: the method for selecting the optimal smoothing bandwidth proposed by Altman and Leger [18] and the modified four-stage method given by Polansky and Baker [19] concern kernel estimation of the distribution function, in contrast to the methods for the density function. It turns out that the values of h that optimize global measures of the accuracy of the distribution function estimator $\hat F$ are different from those for the density estimator $\hat f$. Therefore, the vast array of estimation techniques used in density estimation is not directly applicable to kernel distribution function estimates (see e.g. [19]).
To assess the efficiency of the obtained estimators, the question arises of how to compare them; that is, how to determine a distance measure between measurable and integrable functions on the same space. To compare the obtained kernel density estimates with the empirical frequency polygon of the feature, we used the Marczewski–Steinhaus metric presented in the paper "On a certain distance of sets and the corresponding distance of functions" [6].
For non-negative and $\mu$-integrable functions f and g we define the metric $\sigma$ as follows: $\sigma(f, g) = \frac{\int |f - g|\, d\mu}{\int \max(f, g)\, d\mu}$, with $\sigma(f, g) = 0$ when both integrals vanish. Next, we use the above formula for the selected kernel estimators, representing different approaches, in relation to the empirical frequency polygon (see Figure 1), and we determine their effectiveness in the separate variability intervals $[x_i, x_{i+1})$ for $i = 0, \ldots, 14$ ($x_0 = 0$ and $x_{15} = 150$), in accordance with the experiment conducted. In this way, we obtain a distance matrix (objects × features), where the objects are kernel estimators and the role of features is performed by the relative efficiency in the separate intervals. For the distance matrix thus obtained, taxonomic methods (the complete linkage method) are used to define groups (taxa) with similar behavior.
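For functions tabulated on a common grid, the Marczewski–Steinhaus distance reduces to a ratio of two sums; a minimal Python sketch (function name ours, grid-based discretisation is our simplification):

```python
import numpy as np

def ms_distance(f_vals, g_vals, dx):
    """Marczewski-Steinhaus distance between two non-negative integrable
    functions tabulated on a common grid with spacing dx:
        d(f, g) = int |f - g| / int max(f, g),
    taken as 0 when both functions vanish identically."""
    f_vals = np.asarray(f_vals, dtype=float)
    g_vals = np.asarray(g_vals, dtype=float)
    num = np.abs(f_vals - g_vals).sum() * dx
    den = np.maximum(f_vals, g_vals).sum() * dx
    return num / den if den > 0 else 0.0
```

Restricting the sums to the grid points of each interval $[x_i, x_{i+1})$ yields one feature per interval, from which the objects × features matrix described above can be assembled.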
It is worth noting that the use of density estimates is part of the informal investigation of the properties of a given set of data. Density estimates can give a valuable indication of such features as skewness and multimodality in the data. There is a vast literature on density estimation, much of it concerned with asymptotic results (see [5]). Key features of kernel estimators include strong consistency, asymptotic unbiasedness and uniform convergence (see [1,13]). The essence of all methods of kernel estimation is the optimal choice of the smoothing bandwidth. Our considerations in this article concern the space $L^2$ of square integrable functions. Among the many methods of selecting the bandwidth, we can distinguish the following: the cross-validation method, various plug-in methods, the maximum smoothing principle, the bootstrap method, the projection method, the spacings method, a method based on the Greenwood statistic, etc. (see e.g. [1]). In our numerical considerations, we used two types of commonly used kernels (Gaussian and Epanechnikov) and several methods of selecting the optimal smoothing bandwidth, based on various statistical and analytical conditions.
From among a dozen obtained density estimates, we