Kernel density estimation and its application

Kernel density estimation is a technique for estimating a probability density function (pdf) that enables the user to analyse the studied probability distribution better than a traditional histogram does. Unlike the histogram, the kernel technique produces a smooth estimate of the pdf, uses the locations of all sample points, and suggests multimodality more convincingly. In two-dimensional applications, kernel estimation is even more advantageous, since the 2D histogram additionally requires defining the orientation of the 2D bins. Two concepts play a fundamental role in kernel estimation: the shape of the kernel function and the smoothing coefficient, of which the latter is crucial to the method. Several real-life examples, both for univariate and bivariate applications, are shown.


Introduction
Out of all probability distribution functions, the probability density function (pdf) best shows how the whole 100% of the probability mass is distributed over the x-axis, i.e., over the values of a random variable X. However, the oldest empirical representation of the pdf, the histogram, is a highly subjective construct: its shape depends on the subjective choice of the number (or widths) of class intervals (bins) into which the range of a sample is divided, and on the choice of the initial point (e.g., [1]). To address this, several formulas have been proposed, most of which relate the number of intervals to the sample size only [2][3]; the others additionally include certain sample characteristics such as the standard deviation [4], the interquartile range [5] or the skewness [6].
Independently of the class selection method used, the histogram suffers from its original sin: data binning, which deprives the data of their individual locations, replacing them with a bin (interval) location. This causes the histogram shape to become discontinuous, and flat within each bin.
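The bin-origin dependence described above can be demonstrated with a small sketch (the sample values and bin settings below are illustrative, not taken from the paper): the same sample, binned with the same width but two different origins, yields histograms with different shapes.

```python
import math

def histogram(data, origin, width):
    """Count sample points per bin; bin k covers [origin + k*width, origin + (k+1)*width)."""
    counts = {}
    for x in data:
        k = math.floor((x - origin) / width)
        counts[k] = counts.get(k, 0) + 1
    return counts

# One sample, one bin width, two different bin origins.
sample = [1.1, 1.4, 1.6, 1.9, 2.1, 2.4, 2.6, 3.2]
h1 = histogram(sample, origin=0.0, width=1.0)  # bins [1,2), [2,3), [3,4)
h2 = histogram(sample, origin=0.5, width=1.0)  # bins [0.5,1.5), [1.5,2.5), [2.5,3.5)

# The bin-count profiles differ, so the two histograms have different shapes.
print(sorted(h1.values()))
print(sorted(h2.values()))
```

Shifting the origin by half a bin width redistributes the same eight points into bins with counts (4, 3, 1) versus (2, 4, 2), changing both the mode's location and the apparent shape.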
Kernel estimation of the probability density function does not have these drawbacks. It produces (in most practical applications) a smooth empirical pdf based on the individual locations of all sample data. Such a pdf estimate seems to represent the "true" pdf of a continuous variable better.
The paper shows the advantages and disadvantages of the method, illustrating them with real-life examples for one- and two-dimensional applications.
Two concepts play a fundamental role in kernel estimation: the kernel function and the smoothing coefficient.

Kernel density
Let {x_1, x_2, ..., x_n} be an independent and identically distributed (iid) sample of n observations taken from a population X with an unknown probability density function f(x). The kernel estimate f̂(x) of the original f(x) assigns to each i-th sample point x_i a function K(x_i, t), called a kernel function, in the following way [11]:

f̂(x) = (1/n) Σ_{i=1..n} K(x_i, x),   (1)

where K(x, t) is nonnegative and bounded for all x and t:

0 ≤ K(x, t) < ∞,   (2)

and, for all real x,

∫ K(x, t) dt = 1.   (3)

Property (3) ensures the required normalization of the kernel density estimate (1):

∫ f̂(x) dx = 1.   (4)

In other words, the kernel transforms the "sharp" (point) location of x_i into an interval centred (symmetrically or not) around x_i.
In the most common practical applications, kernel estimation uses a symmetric kernel function, although asymmetric functions have recently been increasingly used [18][19][20]. Figs. 1 and 2 illustrate the idea of kernel estimation for both cases. The symmetry property allows the kernel function to be written in the form used most frequently:

K(x_i, x) = (1/h) K((x − x_i)/h),   (5)

so that the estimate (1) becomes f̂(x) = (1/(nh)) Σ_{i=1..n} K((x − x_i)/h), where the parameter h, called the smoothing parameter, window width or bandwidth, governs the amount of smoothing applied to the sample (Fig. 3). For symmetric kernel functions, the choice of the shape of the kernel function K(.) has rather little effect on the shape of the estimator [11,21], whereas, as Fig. 3 shows, the influence of the smoothing parameter h is critical because it determines the amount of smoothing. Too small a value of h may cause the estimator to show insignificant details, while too large a value of h oversmooths the information contained in the sample, which, in consequence, may mask some important characteristics of f(x), e.g. multimodality (cf. Fig. 3). A certain compromise is needed. Many types of kernel function can be found in the relevant literature. Examples of symmetric kernels are presented in Table 1 and in Fig. 4, while Table 2 shows the asymmetric ones.

Table 2. Examples of asymmetrical kernel functions.
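The symmetric estimator (5) is simple to implement directly. The sketch below uses a Gaussian kernel with a toy sample and a hand-picked bandwidth (both of my choosing, for illustration only), and verifies the normalization property (4) numerically:

```python
import math

def gauss(u):
    """Standard Gaussian kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, sample, h):
    """Kernel density estimate (5): f_hat(x) = (1/(n*h)) * sum K((x - x_i)/h)."""
    n = len(sample)
    return sum(gauss((x - xi) / h) for xi in sample) / (n * h)

sample = [-1.3, -0.4, 0.1, 0.7, 2.2]   # illustrative sample
h = 0.5                                 # bandwidth chosen by eye

# Numerical check of property (4): the estimate integrates to ~1.
step = 0.01
area = sum(kde(-10.0 + k * step, sample, h) for k in range(2001)) * step
print(round(area, 3))
```

Evaluating `kde` on a fine grid with different values of `h` reproduces the undersmoothing/oversmoothing behaviour described above.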
Symbol b denotes the smoothing parameter.

Fig. 4. Shapes of symmetric kernels defined in Table 1.

The choice of the kernel function (Table 1) used to estimate the pdf influences the shape of the kernel pdf estimate. Triangular and rectangular kernels (especially the latter) produce many local maxima and are thus rather not recommended for application. The biweight kernel has a shorter support than the Epanechnikov one, so it reveals more details and more clearly suggests two basic modes. The Gaussian kernel, distributed over the whole x-axis, produces the smoothest estimate, and this property probably causes the kernel to be the most frequently used.

[Figure caption, partly lost: ... of the Odra river recorded at the Racibórz-Miedonia gauge station (data source: [22]).]

Kernel Definition
The univariate case can easily be formally extended to the multivariate case [23]. However, its illustrative (graphical) power works well only for the bivariate case. The most frequently used bivariate kernel function is symmetric:

f̂(x, y) = (1/(n h_x h_y)) Σ_{i=1..n} K((x − x_i)/h_x, (y − y_i)/h_y),   (6)

where {x_i, y_i}, i = 1, 2, ..., n, is a sample, and h_x and h_y are smoothing coefficients. Multivariate counterparts of the univariate kernel functions listed in Table 1 are available, e.g., the multivariate Epanechnikov kernel or the multivariate Gaussian kernel [11].
Two versions of (6) are used in practice: the product kernel estimator and the radial kernel estimator [24].
In its most popular form, the product kernel estimator may be written as follows:

f̂(x, y) = (1/(n h_x h_y)) Σ_{i=1..n} K((x − x_i)/h_x) K((y − y_i)/h_y).   (7)

The radial kernel estimator is based on the Euclidean distance between an arbitrary point {x, y} and a sample point {x_i, y_i}, i = 1, 2, ..., n:

f̂(x, y) = (1/(n h²)) Σ_{i=1..n} K(√(((x − x_i)/h)² + ((y − y_i)/h)²)).   (8)

In practice, the product kernel estimator is used most often.
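A sketch of the product estimator (7), reusing a univariate Gaussian kernel (the toy sample and bandwidths are my own illustrative choices), with a numerical check that the bivariate estimate also integrates to ~1:

```python
import math

def gauss(u):
    """Standard Gaussian kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde2d(x, y, sample, hx, hy):
    """Bivariate product kernel estimator (7):
    f_hat(x, y) = (1/(n*hx*hy)) * sum K((x - xi)/hx) * K((y - yi)/hy)."""
    n = len(sample)
    s = sum(gauss((x - xi) / hx) * gauss((y - yi) / hy) for xi, yi in sample)
    return s / (n * hx * hy)

sample = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.5), (2.0, 1.0)]   # toy bivariate sample
hx, hy = 0.6, 0.6

# Numerical double integral over a grid wide enough to capture the mass.
step = 0.05
total = 0.0
for i in range(-120, 160):
    for j in range(-120, 160):
        total += kde2d(i * step, j * step, sample, hx, hy)
total *= step * step
print(round(total, 2))
```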
The advantage of the multivariate kernel pdf over the multivariate histogram is even greater than in the univariate case. This is because an additional subjective requirement occurs: the user has to decide on the orientation of the two-dimensional bins, which may considerably influence the final shape of the histogram.

Measures of discrepancy between the kernel density estimator f̂ and the true density f
Each estimator f̂(x) differs from its original f(x) with 100% probability. In order to build a method producing an estimator f̂(x) that is as close to f(x) as possible, certain measures should be defined to evaluate this discrepancy.
For each single x, the difference between the "true" density function f(x) and its estimator f̂(x) can be estimated with the mean squared error, MSE_x, [11]:

MSE_x(f̂) = E[(f̂(x) − f(x))²],   (9)

which, after simple transformations, can be presented as follows:

MSE_x(f̂) = [E f̂(x) − f(x)]² + Var[f̂(x)] = bias² + var,   (10)

that is, MSE_x is the sum of the squared bias and the variance of f̂(x) at x. Reducing the bias causes the variance to increase and vice versa, so a trade-off between these terms is needed. MSE_x is a local measure. Integration of MSE_x over all x gives a global measure of the conformity of f̂(x) with f(x), called the mean integrated square error, MISE, [11]:

MISE(f̂) = E ∫ (f̂(x) − f(x))² dx = ∫ bias² dx + ∫ var dx.   (11)

MISE is one of the measures used to estimate the smoothing parameter.
In practice, an approximate version of MISE, called AMISE (asymptotic MISE), is also used; it is obtained by expanding MISE into a Taylor series and retaining only the most important terms [25,26].
The integrated square error, ISE, is an intermediate measure between MISE and MSE:

ISE(f̂) = ∫ (f̂(x) − f(x))² dx,   (12)

which is also a discrepancy measure used to estimate the magnitude of the smoothing parameter.
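When the true density is known, as in simulation studies, ISE (12) can be approximated numerically. The sketch below is a toy example entirely of my own construction: the "true" density is assumed standard normal, the sample is hand-picked, and the integration range is truncated. It illustrates the oversmoothing penalty discussed above by comparing a moderate bandwidth with an absurdly large one:

```python
import math

def gauss(u):
    """Standard normal density, also used as the Gaussian kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, sample, h):
    return sum(gauss((x - xi) / h) for xi in sample) / (len(sample) * h)

def ise(sample, h, lo=-8.0, hi=8.0, step=0.01):
    """Numerical ISE(h) = integral of (f_hat - f)^2, truncated to [lo, hi];
    the 'true' f is assumed to be the standard normal density."""
    total, x = 0.0, lo
    while x < hi:
        d = kde(x, sample, h) - gauss(x)
        total += d * d * step
        x += step
    return total

sample = [-1.2, -0.5, -0.1, 0.3, 0.8, 1.5]   # roughly N(0,1)-like toy data
ise_moderate = ise(sample, h=0.5)
ise_oversmoothed = ise(sample, h=20.0)        # heavily oversmoothed
print(ise_moderate < ise_oversmoothed)        # → True
```

The oversmoothed estimate is nearly flat, so almost all of the true density's squared mass contributes to its ISE.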

Methods for calculating optimum value of smoothing parameter
The choice of the optimal smoothing parameter is based, among others, on formulas that minimize the criterion functions discussed above, mainly ISE [27], MISE [28] and AMISE [11,15,29-32]. Many other methods for calculating the smoothing parameter are available in the relevant literature; many of them are also available through statistical software.
Two methods are described below: one for the symmetric (Gaussian) kernel function, the other for any kernel function.

Rule-of-thumb method
The rule-of-thumb method is based on the asymptotic mean integrated square error, AMISE, when both the kernel function and the true distribution are assumed normal. Silverman [11] then obtained the value of the smoothing parameter h as follows:

h = 1.06 σ̂ n^(−1/5),   (13)

where σ̂ is the sample standard deviation and n is the sample size. In order to have an estimator more robust against outliers, the sample interquartile range IQR may be applied [11]:

h = 1.06 min(σ̂, IQR/1.34) n^(−1/5).   (14)

Silverman [11] believes that the value (13) smoothes non-unimodal distributions too much, and, as one of the remedies, proposes a slightly reduced value of the smoothing parameter (13):

h = 0.9 min(σ̂, IQR/1.34) n^(−1/5).   (15)

The value (15) is widely used in practice, referred to as Silverman's bandwidth or (Silverman's) rule of thumb, and will be used in most of the remainder of the paper.
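Silverman's rule (15) is straightforward to compute. A sketch using only the Python standard library (the sample is illustrative, and statistics.quantiles is used for the IQR, whose quantile convention may differ slightly from other software):

```python
import statistics

def silverman_h(sample):
    """Rule-of-thumb bandwidth (15): h = 0.9 * min(sd, IQR/1.34) * n^(-1/5)."""
    n = len(sample)
    sd = statistics.stdev(sample)                  # sample standard deviation
    q1, _, q3 = statistics.quantiles(sample, n=4)  # quartiles (exclusive method)
    iqr = q3 - q1
    return 0.9 * min(sd, iqr / 1.34) * n ** (-0.2)

sample = [2.1, 2.4, 2.5, 2.7, 3.0, 3.2, 3.3, 3.6, 4.0, 5.5]  # toy data
h = silverman_h(sample)
print(round(h, 3))
```

Note how the `min(sd, IQR/1.34)` term keeps the outlier 5.5 from inflating the bandwidth.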
Least squares cross-validation (LSCV) method

LSCV uses the integrated square error, ISE (12), which can be expressed in the following form [11]:

ISE(h) = ∫ f̂(x)² dx − 2 ∫ f̂(x) f(x) dx + ∫ f(x)² dx.   (16)

The last part of expression (16) does not depend on the estimator f̂(x) (it is a constant); therefore the choice of the smoothing parameter (in the sense of minimizing ISE) corresponds to the choice of the h which minimizes the function

R(h) = ∫ f̂(x)² dx − 2 ∫ f̂(x) f(x) dx.   (17)

To estimate the second part of (17), a leave-one-out density estimator, f̂_{−i}(x), is used:

f̂_{−i}(x) = (1/(n − 1)) Σ_{j≠i} K(x_j, x),   (18)

which is an estimate of the density function calculated using all sample values except x_i. The resulting form of the LSCV criterion function is

LSCV(h) = ∫ f̂(x)² dx − (2/n) Σ_{i=1..n} f̂_{−i}(x_i).   (19)

The optimal smoothing parameter h_LSCV is the value for which the LSCV(h) function achieves its minimum. The final form of the LSCV function (19), applicable to both symmetric and asymmetric kernels, is:

LSCV(h) = ∫ f̂(x)² dx − (2/(n(n − 1))) Σ_{i=1..n} Σ_{j≠i} K(x_j, x_i).   (20)

Least squares cross-validation is also referred to as unbiased cross-validation [26].
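The LSCV criterion (19) can be implemented directly, if inefficiently: the first term is integrated numerically on a grid and the second term uses the leave-one-out estimator (18). The toy sample and the grid of candidate bandwidths below are my own choices for illustration:

```python
import math

def gauss(u):
    """Standard Gaussian kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, sample, h):
    return sum(gauss((x - xi) / h) for xi in sample) / (len(sample) * h)

def loo_kde(x, sample, i, h):
    """Leave-one-out estimator (18): KDE built without sample point i."""
    rest = sample[:i] + sample[i + 1:]
    return kde(x, rest, h)

def lscv(sample, h, lo=-10.0, hi=10.0, step=0.02):
    """Criterion (19): integral of f_hat^2 minus (2/n) * sum of LOO estimates."""
    term1 = sum(kde(lo + k * step, sample, h) ** 2
                for k in range(int((hi - lo) / step))) * step
    n = len(sample)
    term2 = (2.0 / n) * sum(loo_kde(sample[i], sample, i, h) for i in range(n))
    return term1 - term2

sample = [-1.6, -1.1, -0.9, -0.3, 0.0, 0.2, 0.5, 0.9, 1.3, 1.8]  # toy data
grid = [0.1 * k for k in range(2, 16)]        # candidate bandwidths 0.2 .. 1.5
h_best = min(grid, key=lambda h: lscv(sample, h))
print(h_best)
```

In production use, the closed-form version of (20) for the Gaussian kernel, or a library routine, would replace the numerical integration.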
Unfortunately, the LSCV method also has drawbacks: the variance of the smoothing parameters obtained for samples drawn from the same distribution is large [30]. It happens that the LSCV(h) function has several minima, often false ones and far on the side of too little smoothing [39]; sometimes LSCV(h) does not have any minimum at all [14,30].
There are other versions of the cross-validation method, e.g. biased cross-validation (BCV) or smoothed cross-validation (SCV), and other methods to obtain the optimum smoothing coefficient (e.g., [40]). Some resulting examples are shown in Fig. 6.

Fig. 6. Different methods for kernel smoothing coefficient estimation available in Wolfram Mathematica 11.1, applied to the 1961-1995 series of standardized annual maximum flows of the Odra river recorded at the Racibórz-Miedonia gauge station (data source: [22]).

The univariate case
Figs. 7 through 9 contain several kernel pdf estimates obtained for maximum and minimum annual flows of certain rivers and maximum annual precipitations in Poland. Apart from the nice smoothness contrasting with the histogram shape, a very attractive characteristic of kernel estimation is shown: its ability to suggest multimodality in a more convincing way than the histogram does. Of course, the multimodal shape of a pdf estimate does not prove the existence of real multimodality. It is, however, a sign of possible non-homogeneity that should be considered through an analysis of the mechanism generating the data. Some attempts at statistical testing of multimodality are described in [11]; however, as Silverman ([11], p. 141) concludes: "It may be futile to expect very high power from procedures aimed at such broad hypotheses as unimodality and multimodality". Nevertheless, kernel estimation is a good method for the initial stage of a planned study of a probability distribution.
When the variable under study is nonnegative, the kernel estimate may exhibit an undesirable effect: probability leakage below zero. It occurs when part of the sample lies near zero and the magnitude of the smoothing coefficient enables such crossing in a considerable amount. Four such cases are presented in Fig. 10.

Fig. 10. Probability leakage below zero (marked dark blue) in kernel density estimates for time series of standardized annual maximum flows (top two graphs) and annual minimum flows (bottom two graphs) of the given River/Gauging station (data source: [22]). The numbers within the graphs show the magnitude of the probability leakage.

[Fig. 11 caption fragment: ... (Table 2, kernel K GAM1, b = 0.06); (d) three cumulative distribution functions (data source: [41]).]
If the amount of the probability leakage cannot be disregarded, one of the remedies is to log-transform the data and apply kernel estimation to the transformed data. If the pdf of the log-transformed data is ĝ(x), the following recalculation should be used:

f̂(x) = ĝ(ln x)/x,  x > 0.   (21)

Fig. 11(b) shows the result. The leakage has been removed; unfortunately, the second mode has disappeared, although a certain suggestion of non-unimodality remains visible in the heaviness of the right tail.
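A sketch of remedy (21) with a toy positive-valued sample and a hand-picked bandwidth (both my own assumptions): ĝ is estimated on the logs, and the back-transformed f̂ is zero for x ≤ 0 by construction, so no probability leaks below zero, while the total mass on the positive axis remains ~1:

```python
import math

def gauss(u):
    """Standard Gaussian kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, sample, h):
    return sum(gauss((x - xi) / h) for xi in sample) / (len(sample) * h)

sample = [0.2, 0.5, 1.0, 2.0, 5.0]   # nonnegative toy data near zero
logs = [math.log(x) for x in sample]
h = 0.5                              # bandwidth for the log-scale estimate g_hat

def f_hat(x):
    """Back-transformed estimate (21): f_hat(x) = g_hat(ln x) / x for x > 0."""
    if x <= 0.0:
        return 0.0                   # no probability leakage below zero
    return kde(math.log(x), logs, h) / x

# Numerical check: the estimate integrates to ~1 over the positive axis.
step = 0.002
area = sum(f_hat(k * step) * step for k in range(1, 30001))
print(round(area, 2))
```

The change of variables u = ln x shows why the mass is preserved: ∫ ĝ(ln x)/x dx over x > 0 equals ∫ ĝ(u) du = 1.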
Another remedy is to use an asymmetric kernel, as shown in Fig. 11(c). This approach reveals the bimodality seen in Fig. 11(a). In terms of the cumulative distribution function (Fig. 11(d)), the log-transformation and asymmetric-kernel approaches are almost equivalent.

The bivariate case and some general remarks on the multivariate case
Formally, the univariate case can easily be extended to the multivariate one, as exemplified by equations (7) and (8) for the bivariate kernel. Fig. 12 illustrates, with the use of equation (7), how the relation between the two studied variables evolves over the year.
2D kernel pdf graphics may help the user differentiate the sample into subsamples for which a non-statistical (cause-and-effect) confirmation may be found. Such graphics are informative when a sample contains many identical data points, which are not visible in an x-y scatter plot.
Unfortunately, graphical illustration or interpretation for more than the two-variate case is difficult at best, if not impossible. Moreover, the sample size necessary to preserve an accuracy similar to that of the one-dimensional case grows rapidly with the dimension, a problem known as the 'curse of dimensionality'.
Minimizing the effect of the curse of dimensionality requires not only sufficient data, but also careful data preparation [23]. This may involve a proper transformation of the marginal variables in order to reduce large skewness or heavy tails, determining whether the data are of full rank, and even, if the data do not have many significant digits, carefully blurring the data [23].

Summary and conclusions
When compared with the commonly used histogram, the kernel density estimator shows several advantages.
1. It is a smooth curve and thus better exhibits the details of the pdf, suggesting non-unimodality in some cases.
2. It uses the locations of all sample points, and therefore better reveals the information contained in the sample.
3. It more convincingly suggests multimodality.
4. The bias of the kernel estimator is one order better than that of a histogram estimator [26].
5. Compared with 1D applications, 2D kernel applications are even more advantageous, as the 2D histogram additionally requires the specification of the orientation of the bins, which increases the subjectivity of the histogram.
It should be remembered, however, that the value of the smoothing coefficient is, to some extent, a subjective estimate.