Residuals in the modelling of pollution concentration depending on meteorological conditions and traffic flow, employing decision trees

Two data mining methods – a random forest and boosted regression trees – were used to model values of roadside air pollution depending on meteorological conditions and traffic flow, using the example of data obtained in the city of Wrocław in the years 2015–2016. Eight explanatory variables – five continuous and three categorical – were considered in the models. A comparison was made of the quality of the fit of the models to empirical data. Commonly used goodness-of-fit measures did not imply a significant preference for either of the methods. Residual analysis was also performed; this showed boosted regression trees to be a more effective method for predicting typical values in the modelling of NO2, NOx and PM2.5, while the random forest method leads to smaller errors when predicting peaks.


Introduction
The modelling of concentrations of air pollutants, on different scales and for different purposes, is a highly topical issue. Anthropogenic factors are the chief source of air pollution; hence there is a natural need to monitor, model and counteract such pollution, which has adverse effects on human health [1]. According to the Provincial Environment Protection Inspectorate in Wrocław, 56% of NO2 emissions and 16% of PM2.5 emissions are produced by road vehicles, while 81% of PM2.5 emissions and 9% of NO2 emissions originate from household and municipal sources [2]. Action continues to be taken to reduce surface emissions both from transport and from the municipal and household sector [3,4]. Unnaturally high concentrations of the aforementioned substances in the air chiefly affect respiratory and cardiovascular health [5][6][7]. Research has also shown that air pollution may be a cause of autism in children [8] and of Parkinson's disease [9], and consequently may even lead to death [10]. Pollution models can help traffic managers to take decisions efficiently, by selecting the most adequate traffic management strategy [11] or decision support system [12]. They also enable assessment of the capacity of the atmosphere for self-cleaning [13].
The main input for the models described in the literature is traffic and meteorological data [14][15][16]. There are also some researchers who use only traffic data [17] or only meteorological data [18]. Laña et al. [19] investigated the effect of the choice of explanatory variables (traffic, meteorological and temporal) on the correctness and fit of the model; they obtained comparable results for sets of temporal and meteorological variables and for sets expanded to include traffic data.
The input data used in the present study consist of information on meteorological and traffic conditions, as well as temporal variables, from the years 2015-2016. With the development of computational techniques and machine learning, an ever greater number of models is becoming available. The popular and still developing multidimensional regression models (originally linear, but now more complex) describe the relationships between variables in an effective manner. González-Aparicio et al. [14] used three different linear regression models (simple linear regression, linear regression with interaction terms, and linear regression with interaction terms following Sawa's Bayesian Information Criteria) to describe the dependence of PM10 concentration on traffic, meteorological and temporal data. Bertaccini et al. [20] and Aldrin and Haff [21] proposed the use of a generalised additive model for modelling the short-term effects of traffic and weather on air pollution. Machine learning, which continues to be developed, has also been applied in the modelling of air pollution concentrations. The method of boosted regression trees (BRT) is one of the classification and regression methods based on decision trees. Sayegh et al. [16] used boosted regression trees to investigate how roadside concentrations of NOx are influenced by the background levels, traffic density, and meteorological conditions. Even more computationally complex is the random forest (RF) method, as used in [19], where the procedure involves the compilation of information from multiple decision trees simultaneously [22].
A fundamental problem arising in modelling is the quality of the fit of the model, as measured by various goodness-of-fit coefficients. Even if the overall fit is good or very good, the model may fail to estimate the concentration peaks correctly. The problem may be approached in two ways: in terms of the short-term forecasting of pollution concentration [23][24][25] or in terms of the multidimensional modelling of dependences in the search for a model that will identify a set of conditions generating the actual value of pollutant concentration; in other words, a model that will effectively predict the concentrations based on easily available values of input variables. The first approach, important from an environmental or public service standpoint, is based on short-term forecasts, which can be highly accurate when made for one hour ahead, for example [23]. The second approach is the one relevant to the present study, where two machine learning models (a random forest and boosted regression trees) are constructed to determine the effect of meteorological, temporal and traffic flow variables on the concentrations of the atmospheric pollutants NO2, NOx and PM2.5. The models are compared in terms of quality of fit to the empirical data, and reference is made to other results reported in the literature. A key part of this work is the comparison of the constructed models in terms of fit errors. The models also underwent verification using data from the year 2017.

Data
The analysis is based on hourly data obtained in the city of Wrocław (southwestern Poland) in the years 2015-2016.
The traffic data are provided by the Traffic and Public Transport Management Department of the Roads and City Maintenance Board in Wrocław, which operates 921 video cameras distributed widely over the area of the city. Cameras manufactured by Autoscope, together with software, are used to monitor city traffic in an Intelligent Transport System (ITS). One of the pieces of information obtained is the number of vehicles passing through the measurement plane on a given traffic lane or lanes. This count includes all vehicles passing through that plane (cars, goods vehicles, public transport vehicles). Marked on Fig. 1 is the camera site used in the present analysis: the intersection of Hallera and Powstańców Śląskich. Pollution data are collected by the Provincial Environment Protection Inspectorate, which operates five measurement stations measuring the concentrations of different pollutants (marked on Fig. 1). In this study we focused on NO2, NOx and PM2.5, which are measured at hourly intervals. There were 17,332 data points for nitrogen oxide concentrations, and 17,003 for particulate matter.
Meteorological data are provided by the Institute of Meteorology and Water Management (IMGW) at only one station, located on the outskirts of the city (see Fig. 1). The meteorological dataset contains hourly air temperature, wind speed, wind direction, relative humidity and atmospheric pressure.

Boosted regression trees
The principal idea of boosted regression trees (BRT) is the creation of a series of simple binary trees consisting of a root and two descendants (one division), where each successive tree is constructed to predict the residuals generated by the preceding trees [26,27]. At successive algorithm boosting steps, a single (best) division of the data is determined, and the deviations of the observed values from the means (the residuals in the division) are calculated. At the next division, the algorithm works on the deviations obtained as a result of the previous division. The method of stochastic gradient boosting used in the algorithm means that each subsequent tree is constructed on the basis of a random sample containing 50% of the entire dataset. Thus subsequent trees are constructed to predict the residuals in independently chosen samples. The introduction of a certain degree of randomness into the analysis is intended to prevent overtraining, and leads to models with the property of generalisation and good predictive accuracy. The described algorithm leads to a good fit between predicted and observed values even if the relationship between the predictors and the dependent variable is highly complex in nature (non-linear, for example). The use of decision trees with the C&RT method of division in an exhaustive search for single-dimensional divisions enables the quantitative evaluation of the importance of variables as a sum, over all nodes of the tree, of increases in the resubstitution estimate, and the expression of this value as a fraction of the maximum sum (over all predictive variables).
The importance of the variables was determined by the procedure described in [28]. A key advantage of regression trees is the possibility of including qualitative variables in the set of explanatory variables.
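The boosting scheme described above can be illustrated with a minimal sketch. It is not the software used in the study: it assumes a single continuous predictor, a squared-error split criterion, and a learning rate of 0.1, none of which are stated in the text. The function names and the synthetic data are hypothetical. Only the two ingredients named above are taken from the description: one-split trees fitted to the current residuals, and a fresh random 50% sample for each tree.

```python
import random

def fit_stump(xs, residuals):
    """Find the single split (root plus two leaves) with the lowest squared error."""
    overall = sum(residuals) / len(residuals)
    best = (float("inf"), xs[0], overall, overall)  # fallback if no split is possible
    for t in sorted(set(xs))[:-1]:  # skip the largest value so the right leaf is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]  # (threshold, left-leaf mean, right-leaf mean)

def boost(xs, ys, n_trees=100, rate=0.1, subsample=0.5, seed=0):
    """Fit a sequence of stumps, each on the residuals of the trees before it."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # Stochastic gradient boosting: each tree sees a new random half of the data.
        idx = rng.sample(range(len(xs)), max(2, int(subsample * len(xs))))
        stump = fit_stump([xs[i] for i in idx], [ys[i] - preds[i] for i in idx])
        stumps.append(stump)
        t, lm, rm = stump
        preds = [p + rate * (lm if x <= t else rm) for p, x in zip(preds, xs)]
    return base, stumps

def predict(base, stumps, x, rate=0.1):
    """Sum the shrunken contributions of all stumps (rate must match the one used in boost)."""
    return base + sum(rate * (lm if x <= t else rm) for t, lm, rm in stumps)

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Synthetic non-linear relationship, for illustration only.
xs = [i / 10 for i in range(100)]
ys = [x ** 2 for x in xs]
base, stumps = boost(xs, ys)
mse_before = mse(ys, [base] * len(ys))
mse_after = mse(ys, [predict(base, stumps, x) for x in xs])
```

Comparing `mse_before` with `mse_after` shows how the accumulated one-split trees approximate a non-linear relationship that no single split could capture.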

Random forest
A random forest (RF) consists of a set number of simple decision trees. Each of the component trees in an RF uses a sample subset of the available data. These subsets are independent, and the same instance may occur in multiple subsets (sampling with replacement). For each tree, the predictors are selected with equal probability. Each weak tree is trained on a different sample subset. The predicted output is obtained by aggregating and averaging the individual predictions of all the component trees. This particular construction method, which blends the concepts of bagging and random feature selection, has been demonstrated to improve performance over other machine learning algorithms and linear regression models [29]. In each of the models described, the importance of the predictive variables was determined as the sum, over all tree nodes, of increases in the resubstitution estimate (∆R), expressed as a fraction of the maximum sum among all variables (given as a percentage). This means that the most important variable (that with the highest resubstitution sum) is assigned an importance of 100. It should be noted that a different understanding of the importance of predictors is presented by Breiman et al. [22]. The main difference is that in the method used here, ∆R values are summed for all predictors over all nodes (and trees), not only at the nodes where the variable in question participates in the division (or is a substitution variable). An advantage of this approach is that it helps to identify variables which have significant predictive power with respect to the dependent variable, but did not participate in any division.
The model included eight input variables, categorised as relating to: (i) traffic volume; (ii) temporal features (day of the week, month); (iii) meteorological conditions (air temperature, wind speed, wind direction, relative humidity, air pressure). The output of the model was the concentration of one of three air pollutants: NO2, NOx and PM2.5.
Of the variables listed, three are categorical: day of the week, month, and wind direction. Wind direction data were originally obtained in continuous numerical form, but it was not appropriate to use the wind direction (in degrees) as an explanatory variable, because values with a large difference may represent winds with a very similar direction (for example, 1° and 360°). For this reason, wind direction was instead expressed using eight categories with 45° separations (N, NE, E, etc.).
The training set consisted of 70% of all samples, and the test set of 30%. For the construction of each tree, five explanatory variables were sampled from the set of eight variables described above. It was decided that the learning process (addition of further trees) would stop when the error fell by less than 5% for 10 cycles. This condition determined the number of trees (stopped the process of creating further trees) only in the case of PM2.5, when the process stopped at 90 trees. Given that the number of variables was 8, the number of predictors randomly selected for the construction of a tree was 5, and consequently the number of possible different subsets of the variables was 56, the number of trees was limited to a maximum of 100.
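The scaling of importances to a maximum of 100 can be shown in a few lines. This is a sketch only: the function name and the ∆R sums are hypothetical and are not values from the study.

```python
def relative_importance(delta_r_sums):
    """Scale the summed increases per variable so that the largest becomes 100."""
    top = max(delta_r_sums.values())
    return {name: 100.0 * value / top for name, value in delta_r_sums.items()}

# Hypothetical summed increases in the resubstitution estimate (not from the paper).
example = {"traffic_flow": 8.0, "wind_speed": 6.0, "air_pressure": 2.0}
scaled = relative_importance(example)
```

Here the variable with the largest sum receives 100, and the others are expressed relative to it (75 and 25 in this made-up case).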
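A possible implementation of this recoding is sketched below. The text does not state where the sector boundaries lie, so the sketch assumes sectors centred on the compass points (N covers 337.5° to 22.5°, and so on); the function name is hypothetical.

```python
SECTORS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

def wind_sector(degrees):
    """Map a bearing in degrees to one of eight 45-degree sectors.

    Assumption: sectors are centred on the compass points, so the
    boundaries fall at 22.5, 67.5, ... degrees.
    """
    index = int(((degrees % 360) + 22.5) // 45) % 8
    return SECTORS[index]
```

With this mapping, bearings of 1° and 360° fall into the same category (N), which removes the artificial numerical distance between them.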
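The figure of 56 is the number of ways of choosing 5 predictors out of 8 without regard to order:

```latex
\binom{8}{5} = \frac{8!}{5!\,3!} = \frac{8 \cdot 7 \cdot 6}{3 \cdot 2 \cdot 1} = 56
```

A forest of 100 trees is therefore already larger than the number of distinct predictor subsets available to it.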
To increase the significance of high values of concentration in the model, in the construction of each of the decision trees making up the forest, a priori probabilities proportional to the value of the instance were used. This meant that the generated forest was more sensitive to high pollution values. This operation represents an approach to extreme values similar to that in the BRT scheme, where the method of creating subsequent binary trees from the residuals of the preceding tree takes account of those values on each occasion.
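One way to read this weighting is as a bootstrap in which each instance is drawn with probability proportional to its concentration value. The sketch below follows that reading; it is an assumption about the mechanism rather than a description of the software used, and the data and function name are hypothetical.

```python
import random

def weighted_bootstrap(values, size, seed=0):
    """Draw indices with replacement, with probability proportional to each value."""
    rng = random.Random(seed)
    return rng.choices(range(len(values)), weights=values, k=size)

# Hypothetical concentrations: 90 typical values and 10 high values.
concentrations = [5.0] * 90 + [100.0] * 10
sample = weighted_bootstrap(concentrations, size=1000)
share_high = sum(1 for i in sample if concentrations[i] == 100.0) / len(sample)
```

Although the high values make up only 10% of this made-up dataset, they dominate the weighted sample, so trees grown on such samples are pushed towards fitting the high concentrations.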

Goodness-of-fit measures
The following goodness-of-fit measures were used to evaluate the quality of the fit of each model: R2, MFB, MADE and MAPE. Popular information criteria such as BIC and AIC were not considered, owing to the fact that the number of variables in the model was predefined and constant. Comparison of the computed values of coefficients makes it possible to evaluate which model is better fitted to the data. The coefficient of determination R2 is one of the fundamental measures of a model's goodness of fit. It takes values in the range [0, 1]: the closer it is to 1, the smaller are the differences between the estimated values of the dependent variable and the empirical values. Other measures of fit, independent of the mean value, include MADE (mean absolute deviation error), MAPE (mean absolute percentage error) and MFB (mean fractional bias) (Table 2). MFB is a measure recommended in the literature for use in the analysis of pollutant concentrations [30] because it builds upon the concept of bias, which measures the tendency of a model to over- or underpredict. MADE denotes the mean absolute error, that is, the mean absolute difference between the empirical and modelled values. MAPE is a similar measure to MADE, but it represents the mean relative error (expressed here in percent). The mathematical formulae for these coefficients of goodness of fit are given in Table 2. Their values, for each of the models considered, are given in Table 3. In the formulae, $\hat{y}_i$ is the ith theoretical value (from the model), $y_i$ is the ith empirical (real) value, $\bar{y}$ is the mean empirical value, and N is the sample size.
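For readers without access to Table 2, the four measures can be sketched as follows. These are the commonly used definitions and are assumed here, not copied from the table; in particular, MFB is taken as the mean of 2(modelled − observed)/(modelled + observed), and R2 as one minus the ratio of residual to total sum of squares. The function names and the sample data are hypothetical.

```python
def r_squared(obs, mod):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - m) ** 2 for o, m in zip(obs, mod))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

def mfb(obs, mod):
    """Mean fractional bias; positive values indicate overprediction."""
    return sum(2.0 * (m - o) / (m + o) for o, m in zip(obs, mod)) / len(obs)

def made(obs, mod):
    """Mean absolute deviation error, in the units of the data."""
    return sum(abs(o - m) for o, m in zip(obs, mod)) / len(obs)

def mape(obs, mod):
    """Mean absolute percentage error; unstable when observed values are near zero."""
    return 100.0 * sum(abs(o - m) / o for o, m in zip(obs, mod)) / len(obs)

# Made-up observed and modelled concentrations, for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
modelled = [12.0, 18.0, 33.0, 37.0]
```

Because MAPE divides by each observed value, a handful of observations close to zero can inflate it considerably even when MADE is small.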

Goodness-of-fit measures
As mentioned at the outset, a fundamental difficulty in describing relationships between pollution concentration and explanatory variables is the low values obtained for measures of the goodness of fit of the models. Based on global data from 5220 air monitors located on all continents, a study using the method of land use regression with Lasso variable selection [31] produced a model for NO2 with an adjusted R2 goodness-of-fit measure equal to 0.52. Sayegh et al. [16] constructed 112 models for the concentration of five airborne pollutants using four different sets of predictors. For explanatory variables covering the largest set of input data (meteorological conditions, temporal variables and traffic flow) they obtained R2 values in the range 0.49-0.54 for nitrogen dioxide, 0.37-0.48 for PM10 and 0.33-0.44 for nitrogen monoxide. In [32] models were constructed for pollutant concentration in different time subsets, using the RF method. The obtained R2 values included 0.57 for NO2 in the winter period, 0.52 for NOx in the summer period (June-August), and 0.58 for PM2.5 on non-working days. Based on the data and computational techniques described in section 2, BRT and RF models were constructed for each of the considered pollutants. The values of the goodness-of-fit measures are given in Table 3. Bold indicates the best goodness-of-fit measures for each pollution type.
The values of the goodness-of-fit measures do not deviate significantly from those reported in the literature. The R2 value, as a measure of explanatory power, indicates that the model explains up to 57% (43%) of the variation in the dependent variable (for the RF and BRT model respectively). The MFB values indicate that the BRT model achieves a better fit for all pollutants. The mean absolute deviation error is 26% (24%) of the mean of concentrations of NO2 (for BRT and RF respectively), 35% of the mean of concentrations of NOx (for both models), and 42% (38%) of the mean value of PM2.5 (for BRT and RF respectively). The MAPE values are higher than those given above because of the occurrence of empirical values close to zero (these appear in the denominator of the formula given in Table 2). The goodness-of-fit measures used indicate that values of pollutant concentration are predicted more effectively by the RF method in the case of PM2.5 and by the BRT method in the case of NOx. For NO2 the measures do not indicate an unambiguously better model. Deeper analysis was therefore carried out on the values of the errors occurring in modelling using the techniques described in sections 2.2 and 2.3, and the conditions in which they were observed.

Importance of predictive variables
The importances of variables (Table 4) were determined in order to identify the variables that exert the largest influence on pollutant concentrations. The different strategies and methods used for tree construction in the RF and BRT cases led to differences in the importances of particular variables in the respective models (Table 4). Generally speaking, the concentrations of nitrogen oxides (NO2 and NOx) in the air were most strongly influenced by traffic flow, wind speed and day of the week. Although day of the week is statistically significantly uncorrelated with traffic flow (r = -0.16), there is a link between them, in view of the weekly variability of traffic volumes. There are clear differences between the values of importance obtained using the BRT and RF methods. According to BRT the greatest impact on nitrogen oxide concentrations comes from wind speed, which is responsible for the evacuation of pollutants; this is in agreement with the results reported in [16]. According to the RF models, however, nitrogen oxide concentrations are most affected by traffic flow, the principal source of emissions of those pollutants; this agrees with the findings of Laña et al. [19].
There is a marked difference in the importances of the variables in the case of particulate matter concentration. According to both models, the most important factor is air temperature, which has a direct influence on heating emissions. For the same reason, the next most important factor is month. In spite of the obvious causal relationship between month and air temperature in Wrocław, the non-parametric correlation coefficient gamma shows, with statistical significance, a lack of correlation (0.14). The next most important variables are wind factors, which are responsible for the evacuation of pollutants.

Residual analysis
RF and BRT modelling do not require the distributions of the input variables to satisfy any normality assumptions. The mean values of the residuals (differences between real and modelled values) in each case did not differ statistically significantly from zero (the t-statistic values ranged from -1.65 to -0.22, and the p-value for the t-test from 0.16 to 0.84). In none of the cases considered did the distribution of residuals correspond to a normal distribution. The errors for each of the six models exhibit right-sided asymmetry and are leptokurtic; that is, they have a longer tail on the right-hand side, and there is a greater concentration of values around the mean than in the case of a normal distribution (Table 5). The non-normality of the distribution of the residuals results from the specific nature of the phenomenon under analysis. All of the dependent variables have right-sided asymmetry (outliers and extreme values above the median). This feature is least pronounced in the case of NO2 (where the skewness is 1.3) and is significantly stronger in the case of NOx (skewness 3.5). For the residuals in the modelling of particulate matter, the greatest difference between the distributions for the two models is observed (Table 5). The models provide good predictions for values that are close to average, but underestimate the extreme values (Fig. 2). This phenomenon occurs to a lesser degree in RF than in BRT models. Figures 2 and 3 show underestimated values in the BRT models (points beneath the line y=x) for large values of concentration. In the BRT models, however, an upper bound on the modelled values is always visible: at 80 µg/m³ for NO2 and PM2.5, and at 300 µg/m³ for NOx. This results from the way in which successive tree boosts are constructed in BRT, where in spite of the large weights of high values of residuals in the previous division, they are divided at the next step into only two subgroups (binary trees). In effect, the high values, which are not numerous, fail to be represented in the model.
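The residual diagnostics reported above (a one-sample t-statistic for a zero mean, skewness, and kurtosis) can be computed as in the following sketch. It uses moment-based estimators and reports excess kurtosis, so a normal distribution would give values near zero for both shape measures; the estimators used in the study may differ in their small-sample corrections. The function name and the residuals are hypothetical.

```python
import math

def residual_summary(residuals):
    """Return the t-statistic for H0: mean = 0, plus skewness and excess kurtosis."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n  # second central moment
    m3 = sum((r - mean) ** 3 for r in residuals) / n  # third central moment
    m4 = sum((r - mean) ** 4 for r in residuals) / n  # fourth central moment
    sd_sample = math.sqrt(m2 * n / (n - 1))           # sample standard deviation
    return {
        "t_stat": mean / (sd_sample / math.sqrt(n)),
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
    }

# Made-up residuals: many small negative errors and one large positive error.
summary = residual_summary([-1.0] * 8 + [0.0] + [8.0])
```

In this made-up case the mean residual is zero while the skewness and excess kurtosis are both positive, which is the same qualitative pattern as described for the six models: unbiased on average, with a long right tail produced by underestimated peaks.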

Verification of models
Each of the models underwent verification using data from the year 2017, including hourly values of eight variables: traffic volume, day of the week, month, air temperature, wind speed, wind direction, relative humidity and air pressure. The values of the main statistics for the numerical variables are given in Table 6, and the verification results in Table 7. The smallest error was obtained in the verification of the models for nitrogen dioxide concentrations. The error value is comparable to that obtained for the test data from 2015-2016. The greater variation in NOx values led to a reduction in the accuracy of prediction of concentrations. Nonetheless, the mean absolute deviation error took lower values than in the case of the test set. The highest relative error (greater than in the model testing process) was recorded in the verification of the models for PM2.5. While the models predicted low values in summer, and higher values with greater variation in winter, the relative error for this pollutant was the highest (46.6% for BRT and 56.6% for RF). This is due to the overestimation of summer values and underestimation of winter values. The RF model also predicted a peak in PM2.5 values, at the end of November and start of December, which did not occur in reality (the real values were almost two times smaller than the modelled values).

Conclusions
Two data mining methods (a random forest and boosted regression trees) were used to model the values of roadside air pollutant concentrations based on meteorological and traffic conditions, using the example of data obtained in the city of Wrocław in 2015-2016. In the models, account was taken of eight continuous or categorical explanatory variables. A comparison was made of the quality of the fit of the models to the empirical data. As in other cases reported in the literature, relatively low values of goodness-of-fit measures were obtained (R2 was approximately 0.5). The effectiveness of the modelling methods was compared for each of the considered pollutants: NO2, NOx and PM2.5. However, the commonly used goodness-of-fit coefficients did not provide an unambiguous answer. For the modelling of NOx they indicated that the BRT model achieved greater predictive accuracy, while for PM2.5 the random forest model proved superior. Residual analysis was performed for each of the models, and led to the conclusion that the effectiveness of a modelling method depends on the assumed priorities. If the prediction of typical values is most important, then better results were obtained by the BRT method. However, for the prediction of higher values (although not the peaks themselves) the RF method gives better results. The RF models, with imposition of the additional condition of the weighting of instances according to their value, predict higher concentration values, although unfortunately they also overestimate the low values. For the BRT model there was observed to be an upper bound on the predicted values of all modelled pollutants. This is a result of the way in which the method operates, in combination with the very large dataset and the small number of atypical values. The decision tree-based approach described in this paper fits the concept of decision support systems.
Such systems are commonly used in a variety of urban management domains, including energy planning [33], climate control [34] and water management [35]. The development of decision support systems also leads to the advancement of technologies that are helpful in processing increasingly large amounts of data [36].