The study and comparison of one-dimensional kernel estimators – a new approach. Part 2. A hydrology case study

The main purpose of this article is to present the numerical consequences of selected methods of kernel estimation, using the example of empirical data from a hydrological experiment [1, 2]. In the construction of kernel estimators we used two types of kernels – Gaussian and Epanechnikov – and several methods of selecting the optimal smoothing bandwidth (see Part 1), based on various statistical and analytical conditions [3–6]. Further analysis of the properties of kernel estimators is limited to eight characteristic estimators. To assess the effectiveness of the considered estimates and their similarity, we applied the distance measure of Marczewski and Steinhaus [7]. Theoretical and numerical considerations enable the development of an algorithm for the selection of locally most effective kernel estimators.


Introduction
The results presented in this paper are an essential extension of the results of the paper [2], which considered only the Gaussian kernel K and a specific smoothing window dependent upon the sample size n and some parameter of the kernel K. There the estimator of the unknown density function f was expressed as

$$\hat{f}_n(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right),$$

where X_1, …, X_n is the sample and h > 0 is the smoothing bandwidth. Here we consider over a dozen kernel estimators of the density function f for two kernels, the Gaussian kernel and the kernel given by Epanechnikov, using several different bandwidth selection methods: Silverman's rule of thumb, the Sheather-Jones method, cross-validation methods and other selected plug-in methods. The goal of this paper is to present a method that allows the study of local properties of the examined kernel estimators.
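To make the construction concrete, the classical kernel estimator and the two kernels can be sketched as follows. This is a minimal illustration in Python (the paper's own computations were performed in R); all function names are ours, and only Silverman's rule of thumb is shown among the bandwidth selectors discussed in Part 1.

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2 * pi)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def epanechnikov_kernel(u):
    """Epanechnikov kernel K(u) = 0.75 * (1 - u^2) on [-1, 1], else 0."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def silverman_bandwidth(sample):
    """Silverman's rule of thumb: h = 1.06 * s * n^(-1/5)."""
    return 1.06 * np.std(sample, ddof=1) * len(sample) ** (-0.2)

def kde(x, sample, h, kernel=gaussian_kernel):
    """Kernel estimate f_hat(x) = (1 / (n h)) * sum_i K((x - X_i) / h)."""
    u = (np.asarray(x, float)[:, None] - np.asarray(sample)[None, :]) / h
    return kernel(u).sum(axis=1) / (len(sample) * h)
```

For a sample of groundwater levels, `kde(grid, sample, silverman_bandwidth(sample))` evaluates the Gaussian-kernel estimate on a grid; passing `kernel=epanechnikov_kernel` changes only the kernel, leaving the bandwidth selector free to vary independently.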

Fig. 1. Selected kernel estimators of the density function against the frequency histogram of groundwater levels.

Figure 1 shows the very similar behavior of the individual estimators against the background of the source frequency histogram; it appears that the bimodality of these hydrological data is best reproduced by Bowman's cross-validation estimator with the Gaussian kernel. To obtain a more accurate assessment of the differences between the estimators compared locally (on each interval [x_i, x_{i+1})), an adequate measure of distance (similarity measure) for the estimated density functions of the continuous random variable is required. To compare an obtained kernel density estimate \hat{f} against the frequency polygon f^* for the considered feature, we used the Marczewski-Steinhaus metric, as follows:

$$d(\hat{f}, f^*) = \frac{\int_{x_i}^{x_{i+1}} \left| \hat{f}(x) - f^*(x) \right| \, dx}{\int_{x_i}^{x_{i+1}} \max\left\{ \hat{f}(x), f^*(x) \right\} \, dx}.$$

We use the above formula for the selected kernel estimators, representing different approaches, in relation to the empirical frequency polygon (see Figure 1 in Part 1 and in Part 2), and we determine their effectiveness in the distinct variability intervals [x_i, x_{i+1}) for i = 0, …, 14 (x_0 = 5 and x_{15} = 155), in accordance with the experiment conducted. As a result, we obtain a distance matrix of size 8 × 15 (objects × characteristics), where the objects are the kernel estimators and the role of the features (characteristics) is played by the relative efficiency in the separate intervals. Table 1 presents the obtained numerical results: the last row contains the minimum values for the individual ranges, and the last column shows the values of the metric over the entire range of variability (i.e. [5, 155]) for the individual estimates. The minimum values in the whole matrix are marked in bold.
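The Marczewski-Steinhaus distance can be evaluated numerically for two density estimates tabulated on a common grid. A minimal sketch follows (the names are ours, and trapezoidal integration is an assumption; a small helper is used instead of a library integrator for self-containment):

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal-rule integral of samples y over grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def ms_distance(f, g, x):
    """Marczewski-Steinhaus distance of two nonnegative functions sampled
    on grid x: integral of |f - g| divided by integral of max(f, g)."""
    den = trapezoid(np.maximum(f, g), x)
    return trapezoid(np.abs(f - g), x) / den if den > 0 else 0.0

def ms_by_interval(f, g, x, edges):
    """MS distance restricted to each interval [edges[i], edges[i+1]]."""
    out = []
    for a, b in zip(edges[:-1], edges[1:]):
        m = (x >= a) & (x <= b)
        out.append(ms_distance(f[m], g[m], x[m]))
    return np.array(out)
```

With `edges = [5, 15, ..., 155]`, `ms_by_interval` produces the fifteen per-interval values that form one row of the 8 × 15 distance matrix.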
Analysis of the obtained distance matrix allows one to identify the best of the studied estimators (values in bold): mainly Bowman's cross-validation estimator with the Gaussian kernel in the variability range from 0 to 115 cm, then the Sheather-Jones estimator in the interval (115; 125), Silverman's estimator in the interval (115; 135), the unbiased cross-validation estimator in the interval (135; 145), the Polansky-Baker estimator in the interval (135; 145), and Bowman's cross-validation estimator with the Epanechnikov kernel in the interval (135; 155). Hence it is difficult to indicate a single estimator having the best properties in all intervals.
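Selecting the locally most effective estimator from such a distance matrix amounts to a column-wise minimum. A minimal sketch under hypothetical data (the random matrix below merely stands in for the paper's Table 1):

```python
import numpy as np

# Hypothetical 8 x 15 distance matrix: rows are the kernel estimators,
# columns are the variability intervals [x_i, x_{i+1}). NOT Table 1.
rng = np.random.default_rng(0)
D = rng.uniform(0.0, 1.0, size=(8, 15))

# Index of the locally most effective estimator in each interval
# (the bold minima of each column of the matrix).
locally_best = D.argmin(axis=0)

# A crude global counterpart: the estimator with the smallest
# total distance over all intervals.
globally_best = D.sum(axis=1).argmin()
```

As the paper's results show, `locally_best` generally names different estimators in different intervals, which is exactly why no single row dominates.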
Next, for the distance matrix thus obtained, taxonomic methods were used to define groups (taxa) of estimators with similar behavior; in our case the complete linkage method was applied. Figure 2 presents the numerical results for the distance matrix of the examined estimators, and Figure 3 shows the corresponding dendrogram: the smaller the distance measure, the greater the degree of similarity between the tested functions.

Note. All numerical calculations were performed using the authors' own original procedures on the R platform, together with packages from the literature [13–15].
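The complete-linkage grouping can be sketched with SciPy's hierarchical clustering tools (the paper's own computations were done in R; the random matrix below stands in for the real 8 × 15 distance matrix):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical stand-in for Table 1: 8 estimators x 15 intervals.
rng = np.random.default_rng(1)
D = rng.uniform(0.0, 1.0, size=(8, 15))

# Pairwise distances between the estimators' interval profiles, then
# the complete-linkage merge tree behind a dendrogram like Figure 3.
Z = linkage(pdist(D, metric="euclidean"), method="complete")

# Cut the tree into, e.g., three taxa of similarly behaving estimators.
taxa = fcluster(Z, t=3, criterion="maxclust")
```

Calling `scipy.cluster.hierarchy.dendrogram(Z)` would draw the corresponding tree; under complete linkage the distance between two taxa is the largest pairwise distance between their members.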

Conclusions
In the statistical literature one can find a wealth of different approaches to obtaining the best estimate of the unknown density function of a continuous-type random variable by nonparametric kernel estimation methods. Knowledge of the density function that best reflects the probabilistic structure of the data is invaluable in predicting values of the studied phenomenon. Using the example of hydrological data, the authors have suggested a way of classifying kernel estimators chosen from those most often used in practice. Based on our calculations, we have shown that none of the considered estimators has optimal properties over the entire region. This is a known phenomenon in mathematical statistics, related to the study of the admissibility of statistical decision rules. Each of the assessed estimators has good local properties, but their behavior is strictly dependent on the empirical data.

For example, Bowman et al. (1998) performed a simulation study comparing this method with the plug-in method of Altman and Leger; better results are obtained, in general, with cross-validation (cf. [11]). Plug-in methods apply a pilot bandwidth to estimate one or more important features of the density function f; the bandwidth for estimating f itself is then chosen at a second stage using a criterion that depends on the estimated features. The best plug-in methods have proven very effective in diverse applications and are more popular than cross-validation approaches (see [4, 16]). However, other authors offer arguments against the uncritical rejection of cross-validation. Our results, and the considerations of many authors, give us an incentive to look for a solution that exploits the best local behavior of the tested estimators in a given region of variability, for example by suggesting an estimator that is a convex linear combination of selected estimators.
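The suggestion in the final sentence above can be illustrated directly: a convex combination of density estimates is itself a density. A minimal sketch (the function name and the weights are ours; how the weights should be chosen per interval is precisely the open question):

```python
import numpy as np

def convex_combination(estimates, weights):
    """Pointwise mixture sum_k w_k * f_k on a common grid.
    If every f_k integrates to 1, each w_k >= 0 and sum w_k = 1,
    the result also integrates to 1, hence is again a density."""
    w = np.asarray(weights, float)
    if np.any(w < 0) or abs(w.sum() - 1.0) > 1e-9:
        raise ValueError("weights must be nonnegative and sum to 1")
    # Contract the weight vector against the first axis of the stack
    # of estimates, giving the weighted pointwise sum.
    return np.tensordot(w, np.asarray(estimates, float), axes=1)
```

In the locally adaptive variant hinted at by the authors, the weights would favor, in each interval [x_i, x_{i+1}), the estimators that Table 1 shows to be most effective there.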
In 1989 Devroye introduced and developed the very interesting concept of the double kernel method for density estimation [17], and its usefulness has been demonstrated in extensive simulation studies [16]. In the double kernel method, we take two different kernels K and L whose characteristic functions do not coincide on any open neighborhood of the origin.
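As we understand the double kernel idea of [17], the bandwidth is chosen to minimize the L1 distance between the two estimates built from the same data with kernels K and L. The sketch below is only an illustration of that criterion, with our own names, reusing the Gaussian and Epanechnikov kernels considered in this paper:

```python
import numpy as np

def kde(x, sample, h, kernel):
    """Kernel estimate (1 / (n h)) * sum_i K((x - X_i) / h) on grid x."""
    u = (np.asarray(x, float)[:, None] - np.asarray(sample)[None, :]) / h
    return kernel(u).sum(axis=1) / (len(sample) * h)

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def epan(u):
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def double_kernel_bandwidth(sample, grid, candidates):
    """Return the h among `candidates` minimizing the L1 distance
    integral |f_K,h(x) - f_L,h(x)| dx between the two estimates."""
    def l1(h):
        diff = np.abs(kde(grid, sample, h, gauss)
                      - kde(grid, sample, h, epan))
        # trapezoidal rule for the L1 integral over the grid
        return float(np.sum((diff[1:] + diff[:-1]) * np.diff(grid)) / 2.0)
    return min(candidates, key=l1)
```

A grid search over candidate bandwidths, as above, is only the simplest way to minimize the criterion; any one-dimensional optimizer could replace it.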
Only comprehensive knowledge of the efficiency and properties of kernel density estimators in the one-dimensional case will allow us to consider the cases of two-dimensional or three-dimensional random variables with greater awareness. The problem of stochastic modeling of hydrological or meteorological data using methods of multivariate density function estimation is more difficult and complex [18].