Imputation of missing data in time series by different computation methods in various data set applications

In the current technological era, far larger volumes of data are generated by numerous operations than in earlier times. However, collecting data without a single missing value remains a great challenge. In practice, many solutions have been suggested for handling missing values in time series applications. The existing methods used for imputation, and for prediction with time series, vary with the application. The methods most commonly available for imputation are the least squares support vector machine (LSSVM), autoregressive integrated moving average (ARIMA) models, artificial neural networks (ANN), artificial intelligence (AI) techniques, state space models, Kalman filtering, and fuzzy models. Extensive experimental application data are used to analyze these methods. In addition, a synthetic data set can also be used to forecast missing values, which improves the performance of imputation methods in time series. In this paper, the predominantly used imputation methods are listed with their fundamental computational information, along with their verification on the data sets mentioned.


INTRODUCTION
A time series is a set of observations taken successively at equal time intervals. Time series forecasting depends solely on past recorded data. If the recorded historical data include a few hours, days, or months of missing data, parameter prediction becomes more complex. In various real-time applications, this missing data causes the actual output to deviate. Thus, missing data plays an important role in many decision-making applications, such as monitoring of industrial activities, financial analysis, business resources, and power plants with grid control, where it can cause a loss of prediction quality in several respects. There are two types of methods for evaluating missing data over multiple steps: the direct method and the iterative method. The direct method evaluates the forecast over multiple steps at once and so reaches the final value, while the iterative method evaluates the forecasted value iteratively until it reaches the required step. For multivariable time series, various methods have been proposed in the literature. Methods such as linear interpolation, Stineman interpolation, Kalman filtering with a structural model and smoothing, and weighted moving average have been used to estimate missing values in solar irradiance data [1]. Time series forecasting is further explained in [2], where the author adopted a wavelet model technique. The suggested methods might be infeasible for a particular set of applications or inefficient at predicting missing data in real time. Thus, there is still scope for improving the existing proposed methods and verifying them on various data sets. As the forecasting of time series data plays a vital role, it is very important to determine the accuracy of these forecasting techniques. Different applications use different parameters to measure that accuracy.
Shin-Fu Wu et al. [3] used the normalized root mean squared error (NRMSE) as the performance measure for the accuracy of the LSSVM forecasting technique on time series data. The root mean square error (RMSE) is used to evaluate the performance of the proposed algorithms for missing data estimation in synthetic multivariate time series data [4,5,6].
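To make these error measures concrete, the following minimal Python sketch fills gaps by linear interpolation (one of the simple methods mentioned above) and scores the result with RMSE and NRMSE. The function names and the toy series are hypothetical. Normalising the RMSE by the range of the true values is an assumption, since the cited works may normalise differently (for example by the mean or the standard deviation).

```python
import math

def linear_interpolate(series):
    """Fill None entries on a straight line between the nearest observed neighbours."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:            # gap at the start: copy the first observed value
            filled[i] = series[right]
        elif right is None:         # gap at the end: copy the last observed value
            filled[i] = series[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled

def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def nrmse(actual, predicted):
    """RMSE divided by the range of the actual values (assumed normalisation)."""
    return rmse(actual, predicted) / (max(actual) - min(actual))

# Hypothetical toy data: two consecutive values are removed from the true series.
truth = [1.0, 2.0, 4.0, 4.0, 5.0]
observed = [1.0, None, None, 4.0, 5.0]
imputed = linear_interpolate(observed)   # the gap is filled with 2.0 and 3.0
print(imputed, rmse(truth, imputed), nrmse(truth, imputed))
```

Only the imputed value at index 2 differs from the truth (3.0 against 4.0), so the RMSE is sqrt(1/5), about 0.447, and the range-normalised NRMSE is about 0.112.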

Methodology
In this section, the most commonly used and advanced imputation methods are briefly described.

ARIMA (Autoregressive Integrated Moving Average)
ARIMA stands for autoregressive integrated moving average. It is a statistical modeling technique that uses time series data to predict future values, so it is used in fields where short-term forecasting is needed. It needs a minimum of 40 past data points. If the data record is reasonably long and the correlation between past data points is stable, the ARIMA model is more efficient than exponential smoothing. The flowchart of the ARIMA model is shown in Fig. 1. With an ARIMA model, stationary time series data can be described as a function of autoregressive and moving average parameters. In the flowchart, p, d, and q are the order of the AR model, the order of differencing (used to make non-stationary data stationary), and the order of the MA model, respectively. The AR model is written in terms of the time series data Z(t), the error parameter G(t), and the constants C and D. Vichaya L. et al. [7] proposed a forecasting technique, namely seasonal autoregressive integrated moving average (SARIMA), for missing data imputation, which uses the mean values of the global horizontal irradiance (GHI) averaged over the data. A comparative study of different forecasting models, such as neural network (NN), autoregressive moving average (ARMA), and coupled autoregressive and dynamical system (CARDS), has been carried out for forecasting 1 to 6 hr ahead [8].
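As an illustration of the autoregressive part only, the sketch below assumes a first-order form Z(t) = C + D·Z(t-1) + G(t), which is one plausible reading of the symbols Z(t), G(t), C and D named above and not necessarily the exact equation of the original figure. It estimates C and D by ordinary least squares from consecutive observed pairs and fills each gap with a one-step prediction. It does not perform differencing or fit a moving average term, so it is not a full ARIMA model, and the function names are hypothetical.

```python
def fit_ar1(series):
    """Least-squares estimate of C and D in the assumed model Z(t) = C + D*Z(t-1) + G(t)."""
    pairs = [(series[t - 1], series[t]) for t in range(1, len(series))
             if series[t - 1] is not None and series[t] is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two consecutive observed pairs")
    mean_prev = sum(p for p, _ in pairs) / len(pairs)
    mean_curr = sum(c for _, c in pairs) / len(pairs)
    sxx = sum((p - mean_prev) ** 2 for p, _ in pairs)
    if sxx == 0.0:
        raise ValueError("previous values are constant; D cannot be estimated")
    sxy = sum((p - mean_prev) * (c - mean_curr) for p, c in pairs)
    d = sxy / sxx
    c = mean_curr - d * mean_prev
    return c, d

def impute_ar1(series):
    """Fill None entries with one-step AR(1) predictions from the previous value."""
    if series[0] is None:
        raise ValueError("this sketch needs the first value to be observed")
    c, d = fit_ar1(series)
    filled = list(series)
    for t in range(1, len(filled)):
        if filled[t] is None:
            # The previous value may itself be imputed, so longer gaps are filled recursively.
            filled[t] = c + d * filled[t - 1]
    return filled

# Hypothetical noise-free series generated with C = 1 and D = 0.5; index 3 is missing.
series = [10.0, 6.0, 4.0, None, 2.5, 2.25, 2.125]
c_hat, d_hat = fit_ar1(series)
filled = impute_ar1(series)
print(c_hat, d_hat, filled[3])   # close to 1, 0.5 and 3.0 for this noise-free example
```

With real, noisy data the estimates of C and D would only approximate the underlying values, and a longer record (the text above suggests at least 40 points) would be needed for a stable fit.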

Support Vector Machine (SVM)
Support Vector Machine (SVM) is an effective machine learning method grounded in statistical learning theory. It is mainly used for classification problems and for forecasting time series data. SVM was developed at AT&T Bell Laboratories in the 1990s by Vapnik and colleagues. Forecasting time series data is difficult when data are missing: both replacing the missing data with values obtained by an imputation method and ignoring the missing data affect forecasting performance. The principle of SVM is to find, from a set of data points, a function f that relates the input P to the output Q, i.e. Q = f(P). The input data points are nonlinearly mapped from the input space into a higher-dimensional feature space, where F denotes the time series data, R the set of real numbers, W the noise matrix, and b a constant.
Shin-Fu Wu et al. [3] suggested a machine learning tool, the least squares support vector machine (LSSVM), and applied it to time series forecasting with missing data. For forecasting, the time series data and local time indexes (LTI) are fed to the LSSVM. The results so obtained are compared with the forecasting performance of other imputation methods.
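The following sketch shows how a least squares SVM regression can be set up in plain Python. It uses the standard LS-SVM dual formulation, in which the bias and the coefficients alpha are obtained from a single linear system, together with a Gaussian (RBF) kernel. It is a simplification and does not reproduce the LTI scheme of [3]: the time index is used as the only input, and the values of gamma (regularisation) and sigma (kernel width), the function names, and the toy series are all illustrative assumptions.

```python
import math

def rbf_kernel(a, b, sigma):
    """Gaussian (RBF) kernel between two scalar inputs."""
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))

def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def lssvm_fit(inputs, targets, gamma=100.0, sigma=1.5):
    """Solve [[0, 1^T], [1, K + I/gamma]] [bias; alpha] = [0; targets]."""
    n = len(inputs)
    system = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0]
        for j in range(n):
            k = rbf_kernel(inputs[i], inputs[j], sigma)
            row.append(k + (1.0 / gamma if i == j else 0.0))
        system.append(row)
    solution = solve_linear(system, [0.0] + list(targets))
    return solution[0], solution[1:]   # bias, alphas

def lssvm_predict(x, inputs, alphas, bias, sigma=1.5):
    """Evaluate f(x) = bias + sum_i alpha_i * k(x, x_i)."""
    return bias + sum(a * rbf_kernel(x, xi, sigma) for a, xi in zip(alphas, inputs))

# Hypothetical series with one missing value at time index 4.
series = [0.0, 0.8, 1.4, 1.9, None, 2.1, 1.8, 1.2]
train_t = [float(t) for t, v in enumerate(series) if v is not None]
train_v = [v for v in series if v is not None]
bias, alphas = lssvm_fit(train_t, train_v)
filled = [v if v is not None else lssvm_predict(float(t), train_t, alphas, bias)
          for t, v in enumerate(series)]
print(filled[4])
```

The model is trained on the observed samples only and is then evaluated at the missing time index. In practice gamma and sigma would be tuned, for example by cross-validation, and richer inputs such as lagged values could replace the bare time index.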

ANN (artificial neural network)
Recently, the artificial neural network (ANN), which is modeled on biological nervous systems, has become one of the most widely used machine learning algorithms. A neural network consists of elements called neurons operating in parallel. The flowchart of an artificial neural network is shown in Fig. 3. The neuron is the fundamental information-processing unit of a neural network. It consists of three basic elements: a set of weights, an adder for summing the weighted inputs, and an activation function for limiting the amplitude of the output. Neural networks are rapidly gaining importance because of their capability to offer solutions to a variety of problems. Badia Amrouche et al. [9] proposed a novel approach that combines an artificial neural network with spatial modeling for forecasting global solar irradiance.
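The three elements of a neuron described above can be sketched in a few lines of Python. The weights, bias values, and inputs below are arbitrary illustrative numbers, not trained parameters, and the choice of tanh as the activation function is an assumption.

```python
import math

def neuron(inputs, weights, bias=0.0, activation=math.tanh):
    """One neuron: weights, an adder (weighted sum plus bias), and an activation function."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return activation(total)

def layer(inputs, weight_rows, biases):
    """Several neurons that operate in parallel on the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Hypothetical example: three past samples feed two hidden neurons and one linear output.
past_samples = [0.2, 0.4, 0.6]
hidden = layer(past_samples, [[0.5, -0.5, 0.1], [0.3, 0.3, 0.3]], [0.0, 0.1])
estimate = neuron(hidden, [1.0, 1.0], activation=lambda s: s)
print(hidden, estimate)
```

The tanh activation limits each hidden output to the interval (-1, 1), which is the amplitude-limiting role mentioned above. In a real forecaster the weights would be learned from data, for example by backpropagation.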

Kalman Filter
The optimal estimation of the states of a system can be implemented by the Kalman filtering algorithm, with the help of least squares approximation or maximum likelihood estimation. This improves the accuracy of time series forecasting. The flowchart of the generalised Kalman filter is shown in Fig. 4.
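A minimal sketch of Kalman-filter-based imputation is given below, assuming the simplest possible state model: a scalar random walk (local level) observed with noise. At a missing sample the update step is skipped and the predicted state is used as the imputed value. The variance values, the function name, and the toy data are illustrative assumptions. The structural models and smoothing used in [1] are not reproduced here.

```python
def kalman_impute(observations, process_var=1e-3, measurement_var=0.1):
    """Fill None entries with the predicted state of a scalar random-walk Kalman filter."""
    first = next((z for z in observations if z is not None), None)
    if first is None:
        raise ValueError("series has no observed values")
    state, state_var = first, 1.0      # initial state taken from the first observation
    filled = []
    for z in observations:
        # Predict: a random walk keeps the state and increases its uncertainty.
        state_var += process_var
        if z is None:
            # No measurement: skip the update and impute the predicted state.
            filled.append(state)
        else:
            # Update: blend prediction and measurement using the Kalman gain.
            gain = state_var / (state_var + measurement_var)
            state += gain * (z - state)
            state_var *= (1.0 - gain)
            filled.append(z)
    return filled

# Hypothetical sensor readings with two consecutive missing samples.
readings = [5.0, 5.2, None, None, 4.9, 5.1]
print(kalman_impute(readings))
```

Because the assumed state model has no trend, consecutive missing samples receive the same imputed value. A model with a trend or seasonal component, followed by a backward smoothing pass, would use information from both sides of the gap.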

Results and Discussion
In this section, the most widely used imputation methods have been briefly explained. The imputation methods, along with the data used with each method, are listed in Table 1, which also includes the performance parameter. Wu et al. [3] reported that LTI offers comparable or even better NRMSE values. Furthermore, LTI runs much faster than other imputation methods, including mean, hot deck, and autoregression (AR). Shi et al. [4] infer that, even for a high (90%) missing ratio, the RMSE obtained with the proposed methods is small. Hence, the proposed method can be an alternative model for finding missing values in time series in large-scale multivariable systems. The suggested method can be used in various applications, such as electric equipment monitoring, climate or financial forecasting, environment state monitoring, and security inspection. Mellit et al. [2] showed that solar irradiance can be predicted 24 hours in advance with a multilayer perceptron (MLP) model that takes the mean value of solar irradiance into account. The air temperature and the day of the month are also included in the forecasting. The MLP forecaster finds application in grid-connected photovoltaic (GCPV) plants, 24-hour-ahead solar irradiance forecasting, renewable systems, etc. Gao et al. [11] discussed methods for estimating missing data in the case of sensor failures. Guo et al. [12] reported that the published algorithm showed considerable performance. The verification was carried out on data that are missing in a random fashion. This algorithm is more appropriate for applications with high computational error requirements. Demirhan et al. [1] showed that, when the rate of missingness is increased to 50%, a special forecasting algorithm needs to be followed. For solar radiation forecasting, non-seasonal Stineman interpolation, Kalman filtering in addition to ARIMA, and smoothing work in good agreement.
ARIMA can also be used in marine applications [11,13]. Anava Oren et al. [14] discussed a model that is used in biomedical applications. The application of an adaptive wavelet network architecture to forecasting daily solar radiation is explained in [2,15]. Akarslan et al. [16] showed that predictions are possible with linear prediction filters applied to the measured values. A neural network approach has been the most common for predicting in-plane irradiation [17,18].
Table 1. Imputation methods along with data set and performance parameter.

Methods | Data set used | Performance parameter
[1] linear

Conclusions
LSSVM is used in different fields of application, such as time series prediction and financial forecasting [3]. Mellit et al. [2] proposed an MLP forecaster that finds application in GCPV plants, 24 h ahead solar irradiance forecasting, renewable systems, etc. Junger et al. [8] discussed an imputation-based method that gave good results when compared with the GAM method (GAM is based on the use of ARIMA). ARIMA can also be used in marine applications [11,13]. Anava Oren et al. [14] discussed a model that has been applied to various applications, such as DNA microarrays, market analysis, and noise uncertainty reduction. Amrouche et al. [9] proposed a novel technique that is applied to predict day-wise global horizontal radiation.
Thus, there is still scope for improvement in the existing proposed methods. A newly proposed method, when verified on existing data sets, may give better results for a particular application.