Forecast Model of Urban Stagnant Water Based on Logistic Regression

With the development of information technology, water resource information systems are gradually being built. Against the background of big data, water information work needs to move from quantitative to qualitative change. Analyzing the correlation between data and exploring their deeper value is the key to water information research. Based on research on water big data and the traditional data warehouse architecture, we try to find the connections between different data sources. According to the temporal and spatial correlation of stagnant water and rainfall, we use spatial interpolation to integrate stagnant water and rainfall data that come from different data sources and different sensors, and then use logistic regression to find the relationship between them.


Introduction
Big Data Analytics (BDA) is the core of big data concepts and methods. It analyzes massive, diverse, fast-growing and content-rich data (big data) to find hidden patterns, unknown correlations and other useful information [1]. The key to this whole process, however, is preparing data that are suitable for data mining, which means that we need data fusion to combine data from different sources.
Data fusion is different from traditional data integration or knowledge database technology; it requires wide-ranging, deep and comprehensive research methods. Data fusion is the process of integrating two or more data sets. Its purpose is to generate an improved data set that is superior to the source or input data sets in both geospatial and attribute traits. Big data is a complex collection of data that is large, diverse and rapidly changing [2][3]. Spatial data account for the vast majority of it: about 80% of the data have a spatial location [4][5]. In this paper we therefore combine stagnant water data and rainfall data, which come from two kinds of data source, into one data set through their geographic information attributes and their time attributes. The problem is that the two quantities are generally not available at the same point at the same time. For example, we may have stagnant water data at a point with coordinates (x, y) but no rainfall data in the database for that point, although rainfall data are available at other points around it. So, based on the geographical coordinates of the stagnant water point, we calculate the rainfall at this point by spatial interpolation.
In this paper, we use the Inverse Distance to a Power (IDW) method to obtain the rainfall data, and then use the rainfall data as input, the stagnant water data as output, and logistic regression as the machine learning method to find the potential relationship between the two.

Data preparation
At present, water data fusion between stagnant water and rainfall is limited to transferring the data from each sensor into the data warehouse. Because the sensors serve different functions, the correlation between the resulting tables is weak. However, each sensor has its own unique spatial information, so we introduce an IDW-based spatial interpolation method to integrate the data from different sensors; that is, we calculate the rainfall in the area for which stagnant water data are available. In the end, for a point with stagnant water data we are also able to obtain the rainfall data of that point.

Preprocessing the data of rainfall
First, the current structure of the data is shown in Table 1. Each record stored in the database comes with a unique code, and this code represents the spatial attribute of the data. In order to calculate values by IDW, we need to change the structure of the table so that every code becomes a column whose value is Z. For example, if a table contains three codes, "xxx", "yyy" and "zzz", we change its structure as shown in Table 2. In summary, the steps are: (1) arrange the raw data by hour for each day; (2) change the table's structure to fit the IDW method as described above; (3) calculate the value by IDW.
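The restructuring in step (2) can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layout `(day, hour, code, z)`, the function name `pivot_by_code` and the sample readings are hypothetical and are not taken from the database used in this work.

```python
from collections import defaultdict

def pivot_by_code(rows):
    """Turn (day, hour, code, z) records into one row per (day, hour)
    with one column per station code."""
    codes = sorted({code for _, _, code, _ in rows})
    table = defaultdict(dict)
    for day, hour, code, z in rows:
        table[(day, hour)][code] = z
    # Codes with no reading for a given hour stay as None.
    wide = [
        {"day": day, "hour": hour, **{c: cells.get(c) for c in codes}}
        for (day, hour), cells in sorted(table.items())
    ]
    return codes, wide

# Made-up readings for the three example codes.
rows = [
    ("2016-07-20", 8, "xxx", 1.2),
    ("2016-07-20", 8, "yyy", 0.8),
    ("2016-07-20", 8, "zzz", 2.0),
    ("2016-07-20", 9, "xxx", 3.1),
    ("2016-07-20", 9, "zzz", 2.5),
]
codes, wide = pivot_by_code(rows)
```

Each entry of `wide` then holds the values of all codes for one hour, which is the shape the interpolation step needs.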

Changing the structure of the table by SPSS Modeler
SPSS Modeler is a set of data mining tools for quickly building predictive models and applying them to business activities to improve decision making. Designed with reference to the industry-standard CRISP-DM model, SPSS Modeler supports the entire data mining process, from data to better business results. The flow chart is shown in Figure 1. SPSS provides "data disaggregation" components through which a given time field can be divided into two columns, day and hour. SPSS also provides "data reconstruction" components through which each code can be extracted as one column. Figure 2 shows the rainfall data processing flow chart.
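Outside SPSS Modeler, the same splitting and merging can be approximated in plain Python. The sketch below is not the SPSS flow itself: the timestamp format, the function name `hourly_summary` and the decision to drop zero readings before averaging are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

def hourly_summary(readings):
    """Group (timestamp, code, rainfall) readings by day, hour and code.

    Returns {(day, hour, code): (mean_rainfall, record_count)}.
    Zero readings are dropped first (an assumption), so record_count is
    the number of readings with rain in that hour.
    """
    groups = defaultdict(list)
    for stamp, code, rainfall in readings:
        if rainfall == 0:
            continue
        t = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        groups[(t.date().isoformat(), t.hour, code)].append(rainfall)
    return {k: (sum(v) / len(v), len(v)) for k, v in groups.items()}

# Made-up 15-minute readings for one station code.
readings = [
    ("2016-07-20 08:00", "xxx", 2.0),
    ("2016-07-20 08:15", "xxx", 4.0),
    ("2016-07-20 08:30", "xxx", 0.0),
    ("2016-07-20 09:00", "xxx", 1.0),
]
summary = hourly_summary(readings)
```

The record count kept next to the mean is what later allows the duration of the rain within the hour to be estimated.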

Rainfall data interpolation based on IDW
Inverse Distance to a Power (IDW) interpolation was first proposed by meteorologists and geologists. It is one of the earliest computer interpolation methods and is still widely used. Its basic principle is as follows: given a series of discrete points distributed on the plane whose position coordinates $(x_i, y_i)$ and attribute values $Z_i$ are known, the attribute value $Z_p$ of a point P is interpolated from the attribute values of the surrounding discrete points, weighted by the inverse of their distance to P raised to a power [6,7]. If there are N data points around P, the attribute value of point P is:

$$Z_p = \frac{\sum_{i=1}^{N} Z_i / d_i^{k}}{\sum_{i=1}^{N} 1 / d_i^{k}} \qquad (1)$$

where $d_i = \sqrt{(x_i - x_p)^2 + (y_i - y_p)^2}$ is the distance from the i-th data point to the point P and $k$ is the power.

In this paper, we only need to know the GIS coordinates $(x_i, y_i)$ of all the rainfall points in the urban area and the target coordinates $(x_p, y_p)$, so that we can calculate the distance between each rainfall point and the target point. According to the IDW interpolation method, we then obtain the influence factor of each rainfall point, which we denote $w_i$:

$$w_i = \frac{1 / d_i^{k}}{\sum_{j=1}^{N} 1 / d_j^{k}} \qquad (2)$$

Substituting each point's rainfall $Z_i$ into formula (3) gives the value at the target point:

$$Z_p = \sum_{i=1}^{N} w_i Z_i \qquad (3)$$

According to Figure 3, we can then easily calculate the rainfall at the target point for every hour of every day.
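A minimal sketch of this interpolation, assuming planar coordinates and a power of 2 (the power actually used is not stated in this section); the function name `idw` and the sample stations are hypothetical.

```python
import math

def idw(stations, target, power=2.0):
    """Inverse-distance-weighted estimate at target.

    stations: list of (x, y, z) with z the observed rainfall.
    target:   (x, y) of the stagnant water point.
    power:    exponent applied to the distance (assumed value).
    """
    xp, yp = target
    weighted = []
    for x, y, z in stations:
        d = math.hypot(x - xp, y - yp)
        if d == 0:
            return z  # target coincides with a station
        weighted.append((1.0 / d ** power, z))
    total = sum(w for w, _ in weighted)
    # Normalised weights times observed values.
    return sum(w * z for w, z in weighted) / total

# Two stations at equal distance contribute equally.
midpoint = idw([(0, 0, 10.0), (2, 0, 20.0)], (1, 0))
# The nearer station dominates: weights are 1 and 1/4.
skewed = idw([(0, 0, 10.0), (3, 0, 40.0)], (1, 0))
```

With a power of 2, a station twice as far away receives a quarter of the weight, which is why `skewed` stays close to the value of the nearer station.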

Integrate the data of the stagnant water and the rainfall
We need to determine the probability that stagnant water occurs in an area according to its rainfall, so we use logistic regression to find the relation between stagnant water data and rainfall data, with rainfall data as input and stagnant water data as output. The accumulation area, ground structure and drainage differ between individual water accumulation points. Therefore, a separate model should be established for each water accumulation point. In this paper, we choose the point at Huaxiang Bridge as an example.
First, we calculate the rainfall at one stagnant water point by IDW and convert the stagnant water value to 0 or 1 according to whether stagnant water occurred in this area. We can then combine the stagnant water data of this point with the rainfall data along the time dimension. There are three ways to combine them: outer join (external connection), inner join (internal connection) and partial outer join (partly external connection).

Filling in the vacancy value of the rainfall data
In this paper, we fill in missing rainfall values only for records whose stagnant water value is 1 and delete the remaining records with missing values, because most of the data to be analyzed have a stagnant water value of 0 and we need to reduce them in order to balance the data set. We choose the "linear regression" method, that is, we assume a linear correlation between rainfall and time. Specifically, if a value is missing and other rainfall data exist on the same day, we calculate the missing rainfall from the other data of that day; if not, we delete the row.
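The filling rule can be sketched as follows, assuming an ordinary least-squares line of rainfall against the hour of the day. The row layout `(hour, rainfall_or_None, stagnant_flag)`, the function names and the clamping of negative predictions to zero are assumptions made for this illustration.

```python
def fit_line(points):
    """Ordinary least squares for y = a + b * t on (t, y) pairs."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in points)
    b = sxy / sxx
    return mean_y - b * mean_t, b

def fill_day(rows):
    """rows: (hour, rainfall_or_None, stagnant_flag) for a single day."""
    known = [(h, r) for h, r, _ in rows if r is not None]
    # A line needs readings at two or more distinct hours.
    can_fit = len({h for h, _ in known}) >= 2
    if can_fit:
        a, b = fit_line(known)
    out = []
    for hour, rain, flag in rows:
        if rain is not None:
            out.append((hour, rain, flag))
        elif flag == 1 and can_fit:
            # Clamp at zero because rainfall cannot be negative.
            out.append((hour, max(0.0, a + b * hour), flag))
        # Any other row with a missing value is dropped.
    return out

day = [(8, 2.0, 0), (9, None, 1), (10, 6.0, 1), (11, None, 0)]
filled = fill_day(day)
```

In this example the 09:00 record is filled because its stagnant water flag is 1, while the 11:00 record is dropped.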

Data grouping based on the rainfall
Because the distribution of the entire data set is uneven, we cannot use a single formula to describe the whole situation. Following the Bureau of Meteorology's definition of rainfall levels, we divide our data into five groups, shown in Table 3.
Table 3. Data grouping based on rainfall levels.
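Assigning a record to one of five levels can be sketched with a sorted list of cut points. The cut points below are placeholders and are not the values of Table 3; the labels "moderate rain" and "heavy rain" and the function name `rainfall_level` are likewise assumptions.

```python
from bisect import bisect_right

def rainfall_level(amount, upper_bounds, labels):
    """Return the label of the interval [lower, upper) containing amount.

    upper_bounds must be sorted and have one entry fewer than labels.
    """
    return labels[bisect_right(upper_bounds, amount)]

# Placeholder cut points (mm per hour); NOT the values of Table 3.
UPPER_BOUNDS = [2.5, 8.0, 16.0, 30.0]
LABELS = ["light rain", "moderate rain", "heavy rain",
          "rainstorm", "super rainstorm"]

levels = [rainfall_level(a, UPPER_BOUNDS, LABELS) for a in (1.0, 2.5, 50.0)]
```

Replacing `UPPER_BOUNDS` with the official thresholds is all that is needed to reproduce the actual grouping.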

Logistic regression analysis of rainfall and stagnant water
In the stagnant water analysis, the rainfall variables are used as the independent variables, and the occurrence of stagnant water is used as a binary dependent variable (0 means that stagnant water does not occur, and 1 means that it does). When the dependent variable is binary, a multivariate logistic regression model is used to estimate the regression coefficients of the respective variables from the sample data and to discuss the relationship between the dependent and independent variables in the model. Let p be the probability that the event occurs, in the range 0 to 1; then 1-p is the probability that the event does not occur. This probability can be calculated using the logistic function [8]:

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}} \qquad (4)$$

The logistic function is a nonlinear function of the covariates. To obtain the regression coefficients, we apply the logit transformation to (4), which gives a linear formula:

$$\ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n \qquad (5)$$

In this formula, $x_1, \dots, x_n$ are the independent variables and $\beta_0, \beta_1, \dots, \beta_n$ are the coefficients. The left-hand side of formula (5) is the logarithm of the odds $p/(1-p)$, and $e^{\beta_i}$ is the odds ratio (OR) associated with a one-unit increase in $x_i$. Since the OR has some good properties in measuring association [9], it can be used to describe the effect of the independent variables on the event probability in the logistic regression model. Therefore, it is often used to interpret the regression coefficients of the logistic regression model [10].
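The relationship between the logistic function, the logit transformation and the odds ratio can be checked numerically. The sketch below uses only the Python standard library; the function names are chosen for this illustration.

```python
import math

def logistic(z):
    """Map a linear predictor z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the logistic function: the log of the odds."""
    return math.log(p / (1.0 - p))

def predict(betas, xs):
    """betas[0] is the intercept; betas[1:] pair with the inputs xs."""
    z = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    return logistic(z)

def odds_ratio(beta):
    """Multiplicative change in the odds for a one-unit increase."""
    return math.exp(beta)

# Applying logit to a logistic output recovers the linear predictor.
round_trip = logit(logistic(1.3))
```

A coefficient of zero gives an odds ratio of 1, meaning that the corresponding variable leaves the odds unchanged.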

Determine the data range based on the data distribution
Before data analysis, we need to look at the quality of the data in each group. We describe the quality of each group by its stagnant water distribution ratio; details are given in Table 4. Light rain has only a 3% probability of causing stagnant water, while super rainstorm has a 92% probability, so we consider that light rain is very unlikely to cause stagnant water and rainstorm is very likely to.
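The distribution ratio of a group is simply the share of its records in which stagnant water occurred. A minimal sketch, with made-up records and a hypothetical function name:

```python
from collections import Counter

def stagnant_ratio(records):
    """records: iterable of (group_label, stagnant_flag) with flag 0 or 1.

    Returns {group_label: share of records with stagnant water}.
    """
    totals, positives = Counter(), Counter()
    for group, flag in records:
        totals[group] += 1
        positives[group] += flag
    return {g: positives[g] / totals[g] for g in totals}

# Made-up records, not the data behind Table 4.
records = (
    [("light rain", 0)] * 3 + [("light rain", 1)]
    + [("rainstorm", 1)] * 3 + [("rainstorm", 0)]
)
ratios = stagnant_ratio(records)
```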
Heavy rainfall does not necessarily cause stagnant water; stagnant water is also related to the duration of the rain and to the urban drainage capacity. We treat the urban drainage capacity as constant, which leaves the duration of rain as our only concern. According to the data, almost 90% of rainstorms last less than one hour, so we can take one hour as the unit for analysing the relationship between stagnant water and rainfall. We also believe that rainstorms produce stagnant water more suddenly. For the other groups, we cannot simply assume that the rainfall lasts less than one hour. In this paper, we therefore choose the rainstorm group for analysis.

Logistic regression between rainstorm and stagnant water
Before we put the rainstorm group into logistic regression training, the hourly average rainfall is the only input we have, which is not enough, so we also consider the duration of the rainfall. The number of rainfall records in each hour is obtained in Section 2.1.1, and from it we can determine whether the rainfall lasted less than half an hour: if the number of records is less than 3, the rainfall lasted less than half an hour. In Figure 5 there are three columns, representing the rainfall, the number of rainfall records and the existence of stagnant water. We use the rainfall and the duration as inputs and the stagnant water as output to train the logistic regression.
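Turning the three columns into training samples can be sketched as below, assuming that fewer than 3 records in an hour marks a rain of less than half an hour. The row layout, the function name `build_samples` and the example rows are hypothetical.

```python
def build_samples(rows, short_below=3):
    """rows: (mean_rainfall, record_count, stagnant_flag) per hour.

    Returns ((rainfall, short_flag), stagnant_flag) pairs, where
    short_flag is 1 when the hour holds fewer than short_below records.
    """
    samples = []
    for rainfall, count, flag in rows:
        short = 1 if count < short_below else 0
        samples.append(((rainfall, short), flag))
    return samples

# Made-up rows, not the data of Figure 5.
rows = [(18.4, 4, 1), (21.0, 2, 0), (25.3, 3, 1)]
samples = build_samples(rows)
```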
After that, we find that the significance level (Sig) of the rainfall coefficient is less than 0.05, so the regression result passes the test at the 5% significance level. The fitted logistic regression formula has the form

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x + \beta_2 y_1)}} \qquad (6)$$

where $\beta_0$, $\beta_1$ and $\beta_2$ are the estimated coefficients, x stands for the rainfall and $y_1$ indicates whether the rainstorm lasts half an hour.
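For readers without SPSS, a model of this shape can be fitted by plain gradient ascent on the log-likelihood. This is a sketch and not the procedure used in this work: the data are synthetic, the learning rate and step count are arbitrary, and no significance test is computed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(betas, samples):
    """Sum of log-probabilities of the observed labels."""
    total = 0.0
    for (x, s), y in samples:
        p = sigmoid(betas[0] + betas[1] * x + betas[2] * s)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def fit_logistic(samples, lr=0.1, steps=5000):
    """Maximise the log-likelihood by gradient ascent from zero."""
    betas = [0.0, 0.0, 0.0]
    n = len(samples)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for (x, s), y in samples:
            err = y - sigmoid(betas[0] + betas[1] * x + betas[2] * s)
            grad[0] += err
            grad[1] += err * x
            grad[2] += err * s
        betas = [b + lr * g / n for b, g in zip(betas, grad)]
    return betas

# Synthetic samples: ((scaled rainfall, short-duration flag), label).
samples = [
    ((0.5, 0), 0), ((1.0, 0), 0), ((1.5, 0), 1), ((1.8, 0), 0),
    ((2.5, 0), 1), ((3.0, 0), 1), ((0.8, 1), 0), ((1.6, 1), 0),
    ((2.0, 1), 1), ((2.4, 1), 0), ((2.8, 1), 1),
]
betas = fit_logistic(samples)
ll_zero = log_likelihood([0.0, 0.0, 0.0], samples)
ll_fit = log_likelihood(betas, samples)
```

A statistics package would additionally report standard errors and significance levels for each coefficient, which this sketch leaves out.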

The test of regression equation
In order to test this model, we randomly selected some rainfall points from 2015 and 2016 that can be regarded as rainstorms; the result is shown in Figure 6, where 1 represents the occurrence of stagnant water and 0 represents no stagnant water. According to this figure, when stagnant water actually occurs, the forecast probability is mostly above 80%, while when there is no stagnant water, the forecast probability is mostly below 20%.
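The comparison between forecast and observation can be summarised numerically. The sketch below uses made-up forecasts, not the values of Figure 6, and the function name `confidence_summary` is hypothetical.

```python
def confidence_summary(predictions, high=0.8, low=0.2):
    """predictions: list of (forecast_probability, actual_flag).

    Returns the share of actual events forecast above `high` and the
    share of non-events forecast below `low`.
    """
    pos = [p for p, y in predictions if y == 1]
    neg = [p for p, y in predictions if y == 0]
    return {
        "events_above_high": sum(p > high for p in pos) / len(pos),
        "non_events_below_low": sum(p < low for p in neg) / len(neg),
    }

# Made-up forecasts for eight test points.
predictions = [
    (0.93, 1), (0.85, 1), (0.88, 1), (0.60, 1),
    (0.10, 0), (0.05, 0), (0.30, 0), (0.15, 0),
]
result = confidence_summary(predictions)
```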

Conclusion
In this paper, taking big data analysis as the core idea, we bring data from different sources together according to their spatial and temporal attributes by spatial interpolation, turning previously independent one-dimensional data into two-dimensional data. In past studies, most scholars either analyzed drainage measures and other factors to determine whether stagnant water exists, or analyzed the trend of stagnant water from stagnant water data alone. In this paper, we instead use logistic regression to obtain the relation between stagnant water and rainstorm, predicting from the rainfall amount and the rain duration whether stagnant water will occur, which is a data mining approach. With the formula obtained, we can compute the probability that stagnant water exists. We only considered the amount of rainfall and did not consider its randomness, so to predict stagnant water risk precisely, the randomness of rainfall may need to be considered on the basis of the logistic regression model established here.

Figure 1. Flow chart of the rainfall data process in SPSS

Figure 2. Flow chart of using SPSS to do data processing

As the flow chart shows, the upper components named date and hour change the time dimension to hours. The first merge component computes the average rainfall for the same hour of the same day and records the total number of merged items; we want to know how many rows were merged in order to determine the duration of the rainfall. Part of the result is stored in Excel, as shown in Figure 3. In Figure 3, each column of the first row gives the meaning of the data in that column. The first two columns are self-explanatory, so we start with the third. The third column is the number of rainfall readings collected within the hour, excluding readings whose rainfall is zero; since data are collected every 15 minutes, it is used to determine how long the rain lasted. The columns after the third give the rainfall at each point at the given time.

In this paper, we choose the partial outer join (partly external connection) and take the time of the stagnant water records as reference. The result set is stored in Excel, as shown in Figure 4. In Figure 4, the first two columns are the time, taken from the original stagnant water data, and the column called rainfall is calculated by the IDW method.
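A partial outer join that keeps every stagnant water record can be sketched with dictionaries keyed on day and hour. The key layout, the function name `merge_on_time` and the rule that any positive water depth counts as stagnant water are assumptions made for this illustration.

```python
def merge_on_time(stagnant, rainfall):
    """Partial (left) outer join keyed on (day, hour).

    stagnant: {(day, hour): water_depth}
    rainfall: {(day, hour): interpolated_rainfall}
    Every stagnant water record is kept; rainfall is None when no
    rainfall record exists for the same day and hour.
    """
    merged = []
    for key in sorted(stagnant):
        # Assumed rule: any positive depth counts as stagnant water.
        label = 1 if stagnant[key] > 0 else 0
        merged.append((key[0], key[1], rainfall.get(key), label))
    return merged

# Made-up records for one morning.
stagnant = {("2016-07-20", 8): 0.0, ("2016-07-20", 9): 12.5,
            ("2016-07-20", 10): 3.0}
rainfall = {("2016-07-20", 8): 1.4, ("2016-07-20", 9): 18.2,
            ("2016-07-20", 11): 0.3}
merged = merge_on_time(stagnant, rainfall)
```

The 10:00 record keeps a missing rainfall value, and the 11:00 rainfall record is discarded because no stagnant water record refers to it.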

Figure 4. The result set for logistic regression

Figure 5. The data set of rainstorm and stagnant water for logistic regression

Figure 6. Comparison of predicted probability and actual value. The green part represents the actual stagnant water status, and the blue part represents the predicted stagnant water status. A value of 1 indicates the occurrence of stagnant water and a value of 0 indicates that no stagnant water occurred.

Table 1. The current structure of data.

Table 2. The changed structure of data.

Table 4. Stagnant water distribution ratio of each data group.
If the number of rainfall records is less than 3, the rainfall lasted less than half an hour. The data for the logistic regression are shown in Figure 5.