Optimization Method of Fusing Model Tree into Partial Least Squares

Abstract: Partial Least Squares (PLS) cannot adapt to data from many fields that exhibit multiple independent variables, multiple dependent variables, and nonlinearity. The Model Tree (MT), by contrast, adapts well to nonlinear functions, since it is composed of many multiple-linear segments. Based on this, a new method combining PLS and MT to analyze and predict data is proposed: it builds a Model Tree from the principal components extracted by PLS and the dependent variables, and repeatedly extracts residual information to build further Model Trees until a satisfactory accuracy condition is met. Using data on the monarch drug of maxingshigan decoction for treating asthma or cough, together with two sample sets from the UCI Machine Learning Repository, the experimental results show that the new method improves both explanatory and predictive ability.


Introduction
In real life, many actual processes are complex nonlinear processes: nonlinear relationships appear not only among the independent variables themselves, but also between the independent and dependent variables [1]. Because of experimental constraints and various objective or subjective factors, the available sample set is usually small, and the sample size is sometimes even smaller than the sample dimension. PLS [2] was first proposed by Herman Wold and combines Principal Component Analysis (PCA), Canonical Correlation Analysis (CCA), and Multiple Linear Regression (MLR). It has good explanatory power for data with multiple independent variables [3], multiple dependent variables, and a small sample size; however, the linear-regression nature of PLS cannot fully reflect the characteristics of traditional Chinese medicine (TCM) data.
In 1996, Qin S.J. [4] proposed combining an RBF neural network with PLS, which can establish a good nonlinear prediction model, but whose characteristics are hard to interpret because of its approximation to a continuous function. Paper [5] proposed an algorithm that embeds a fuzzy neural network model into iterative PLS and achieved a good nonlinear mapping effect, but the results of the model are vulnerable to the choice of membership function. In 2013, paper [6] proposed a Kernel Partial Least Squares method, which maps the nonlinear data into a high-dimensional linear space with the help of a kernel function so as to extract as much of the relationship between the dependent and independent variables as possible. That method reflects the nonlinear structure of the sample data well; however, choosing a good kernel function is extremely difficult.
The Model Tree [7] is an algorithm proposed by Quinlan in which the leaf nodes adopt multiple linear functions instead of the averaging used in traditional regression trees. It is constructed from several multiple-linear pieces and gives a piecewise-linear approximation to any unknown variable's distribution trend; the model structure is simple, the nonlinear data are easy to explain, and it has high efficiency and good robustness. Based on this, to make up for the linear nature of the internal model of the PLS method, this paper uses an MT as the internal model of PLS to interpret the nonlinear characteristics of TCM data.

Partial Least Square(PLS)
The Partial Least Squares algorithm can not only build regression models for data with multiple independent and dependent variables, but also adapts to the situation where the sample size is smaller than the number of variables [8].
The introduction of PLS is as follows. To simplify the explanation, assume there is an independent variable set X = (x1, x2, ..., xp) and a dependent variable set Y = (y1, y2, ..., yq). The components t1 and u1 are linear weighted combinations of the independent and dependent variables respectively, and they must satisfy the two conditions below: (1) each should carry as much of the variance information of the independent and dependent variables, respectively, as possible; (2) the correlation coefficient between the two should be largest. Extract the first principal component information from X and Y as t1 and u1, perform a multiple linear regression between t1 and u1, and judge the residual information: if it satisfies the requirements, terminate the process; otherwise continue to extract principal component information from the residual information. The above procedure continues until satisfactory accuracy is achieved.
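As an illustrative sketch (not the paper's own code), the extraction of the first component pair t1, u1 can be written with the singular value decomposition of the cross-product matrix of centered X and Y, whose leading singular vectors maximize the covariance criterion described above; the function name `first_pls_component` is ours:

```python
import numpy as np

def first_pls_component(X, Y):
    """Extract the first PLS component pair (t1, u1): the weight vectors
    w1 and c1 are the leading singular vectors of Xc.T @ Yc, which
    maximize cov(X w, Y c) subject to unit-norm weights."""
    Xc = X - X.mean(axis=0)            # center the independent variables
    Yc = Y - Y.mean(axis=0)            # center the dependent variables
    U, s, Vt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    w1, c1 = U[:, 0], Vt[0, :]         # leading left/right singular vectors
    t1 = Xc @ w1                       # score carrying X's variance information
    u1 = Yc @ c1                       # score carrying Y's variance information
    return t1, u1

# Toy data: 20 samples, 5 independent and 2 dependent variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
Y = rng.normal(size=(20, 2))
t1, u1 = first_pls_component(X, Y)
```

The residual information is then obtained by regressing out t1 and iterating, as described above.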

Model Tree(MT)
The Model Tree (MT) [9] adopts multiple linear regression in the leaf nodes rather than the averaging used by the traditional Classification And Regression Tree (CART). It divides the sample data into several discrete regions according to certain rules and chooses a suitable model to construct a regression model for each region. Building the model includes: tree building, searching for splitting attributes, handling internal nodes, pruning, smoothing, and prediction. For a detailed introduction to building a Model Tree, see papers [10][11].

The Algorithm Flow for MTree-PLS
MTree-PLS consists of two modules: one is PLS, which extracts the principal components and eliminates the correlation among multiple variables; the other is the MTree, which establishes the relationship between the independent and dependent variables and makes the model nonlinear.
MTree-PLS is built on the traditional Partial Least Squares (PLS). The external model still adopts the original method to extract the principal component t; the internal model builds a Model Tree from the extracted principal components and the dependent variables, then performs multiple linear regression in the leaf nodes of the Model Tree to obtain the predicted value. If the residual information satisfies the pre-defined condition, tree building stops; otherwise a further Model Tree is built from the residual information until satisfactory accuracy is obtained. The process of the MTree-PLS algorithm is as follows.

An Algorithm for Constructing a Model Tree from Principal Components
The method for constructing a Model Tree from principal components builds the tree from the principal component t1 extracted from the independent variables by PLS and the original dependent variable Y. Owing to t1's linear characteristics, the best split point can be found by calculating the error of the multiple linear regression between t1 and Y; t1 is then split into two subsets according to the best split point, and the subsets are split in the same manner until the number of leaf nodes reaches the predefined threshold or the error fluctuation is no longer obvious. The algorithm for constructing a Model Tree from principal components is shown in Algorithm 1.
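The split search described above can be sketched as follows (an illustrative sketch, not Algorithm 1 itself): every gap in the sorted scores is a candidate split, and the candidate minimizing the summed squared error of the two side-wise linear regressions wins; `min_leaf` is an assumed minimum number of samples per side:

```python
import numpy as np

def best_split(t, y, min_leaf=5):
    """Scan every gap in the sorted scores; at each candidate split fit
    a separate degree-1 regression on each side and keep the split
    with the lowest total squared error."""
    order = np.argsort(t)
    t_s, y_s = t[order], y[order]
    best_err, best_value = np.inf, None
    for i in range(min_leaf, len(t_s) - min_leaf):
        err = 0.0
        for seg_t, seg_y in ((t_s[:i], y_s[:i]), (t_s[i:], y_s[i:])):
            slope, intercept = np.polyfit(seg_t, seg_y, 1)
            err += float(np.sum((seg_y - (slope * seg_t + intercept)) ** 2))
        if err < best_err:
            best_err, best_value = err, t_s[i]
    return best_value, best_err

# Usage: a score with a kink at 0 - the search recovers a split near it.
t = np.linspace(-1.0, 1.0, 50)
y = np.where(t < 0, -t, 2.0 * t)
split, err = best_split(t, y)
```

Recursing on each side in the same way yields the tree; stopping when a side has fewer than `min_leaf` points or the error no longer drops noticeably corresponds to the termination conditions above.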
End of the algorithm. RMSE is the Root Mean Square Error, expressed as follows:

RMSE = sqrt( (1/n) * sum_{i=1..n} (y_i − ŷ_i)^2 )   (1)

Leaf predictions are smoothed towards their parent's fit:

f_new = (n * f_child + k * f_parent) / (n + k)   (2)

In formulas (1) and (2), n is the number of sample data of the current parent node, k [12] is a smoothing constant (default k = 15), f_child and f_parent are the fitting functions of the current leaf node and the current parent node respectively, and f_new is the fitting function after smoothing.
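Formulas (1) and (2) translate directly into code (a small sketch; the function names are ours):

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean square error, formula (1)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def smooth(f_child, f_parent, n, k=15):
    """Smoothed prediction, formula (2): blend a leaf's fitted value
    with its parent's, weighting the leaf by the sample count n."""
    return (n * f_child + k * f_parent) / (n + k)
```

Note that with the default k = 15, a leaf fitted on few samples is pulled strongly towards its parent, while a leaf with many samples keeps most of its own prediction.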

Fusing Model Tree into Partial Least Squares
MTree-PLS obtains the principal component t1 by using PLS, then uses t1 to perform multiple linear regression with X, and meanwhile builds an MT from Y and t1. It still employs the original PLS procedure to obtain the residual information of X, while the residual information of Y is the difference between Y and the predicted Ŷ, i.e. the regression values produced by the leaf nodes of the Model Tree. If the accuracy requirements are not met, principal components continue to be extracted from the residual information, and these principal components together with the residual information of Y are used to build further trees. The above procedure is repeated until a nonlinear model with satisfactory precision is constructed.
The detailed steps are as follows: (1) data preprocessing: normalize X and Y to obtain E0 and F0; (2) according to the principle of Lagrange multipliers, calculate the weight vectors and extract the principal component from E0; (3) build the Model Tree from the principal component and F0 and compute the residual information; (4) repeat steps (2)-(3) on the residuals until the accuracy condition is met; (5) anti-standardize the coefficients of the equation to obtain the equation relating Y and X.
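The residual step in the loop above can be sketched as follows (our notation: E and F are the current X and Y residuals, t the extracted score, and y_hat the Model Tree's leaf prediction):

```python
import numpy as np

def deflate(E, F, t, y_hat):
    """One MTree-PLS deflation step: E is deflated exactly as in
    ordinary PLS (subtract the part of E explained by the score t),
    while the new Y residual is the old one minus the Model Tree's
    leaf-regression prediction."""
    p = E.T @ t / (t @ t)          # loading vector of E on t
    E_next = E - np.outer(t, p)    # X residual, orthogonal to t
    F_next = F - y_hat             # Y residual from the tree prediction
    return E_next, F_next

# Usage with stand-in values for the score and the tree prediction.
rng = np.random.default_rng(1)
E = rng.normal(size=(30, 4))
t = E[:, 0] + 0.5 * E[:, 1]        # stand-in principal-component score
F = rng.normal(size=30)
E1, F1 = deflate(E, F, t, y_hat=0.8 * F)
```

The deflated X residual is orthogonal to the extracted score, so each subsequent component carries new information, exactly as in ordinary PLS.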

Experimental Analysis
The experimental data come from the Key Laboratory of Modern Preparation of TCM, Ministry of Education, at Jiangxi University of Traditional Chinese Medicine, which provided us with the precious data on the monarch drug of maxingshigan decoction for treating asthma or cough. The paper also uses two further sample sets from the UCI Machine Learning Repository, namely yacht_hydrodynamics [13] and CCPP_Folds5x2_pp [14], to test the improved algorithm.

Explanation of the Experimental Data
The data on the monarch drug of maxingshigan decoction for treating asthma (mxsgpc), part of which is shown in Table 1, comprise a total of 46 samples. They describe the impact on pharmacological indicators of the blood medicine composition in rats under 10 distinct dosages of herbal ephedra. There are five blood medicine compositions in rats and two pharmacological indicators, namely incubation period (unit: s) and cough duration (unit: min). The first five compositions are the independent variables and the remaining two are the dependent variables. The data on the monarch drug of maxingshigan decoction for treating cough (mxsgzk), part of which is shown in Table 2, comprise a total of 62 samples. They describe the impact on pharmacological indicators of the blood medicine composition in rats under 10 distinct dosages of almond. There are five blood medicine compositions in rats and one pharmacological indicator, namely the number of coughs. The first five compositions are the independent variables and the remaining one is the dependent variable. Descriptions of yacht_hydrodynamics (yacht) and CCPP_Folds5x2_pp (CCPP) are available at http://archive.ics.uci.edu/ml/.

Analysis of the Experimental Procedure and Results
To validate the effect of the new model, the traditional PLS and Random Forest Regression (RFR) are adopted as contrasts. The original data are divided randomly in the proportion 7:3, with 70% forming the training set and the rest the test set, using the experimental data of maxingshigan decoction for treating asthma or cough and the two sample sets from the UCI Machine Learning Repository. From Table 4 and Fig. 2, we can see the four points below. Firstly, PLS has poor explanatory and predictive ability for nonlinear data and shows obvious inadaptability compared with the improved PLS.
Secondly, for both SSETrain and SSETest on the data of maxingshigan decoction for curing asthma or cough, yacht_hydrodynamics, and CCPP_Folds5x2_pp, the improved PLS achieves a better effect to varying degrees compared with the PLS and RFR algorithms.
Thirdly, although the SSETrain of the improved algorithm is not as good as RFR's, the prediction ability of RFR is evidently poor.
Last, the improved PLS method adapts well not only to TCM data but also to UCI's nonlinear data with medium or large sample sizes.
In summary, the Model Tree shows strong analytical and predictive power for multidimensional nonlinear data. Whether for small or large sample data, in the degree of interpretation of the model or in the analysis and prediction of the data, the improved algorithm is superior to both PLS and RFR.

Analysis of the Algorithm's Time Complexity
For PLS, the time complexity lies mainly in the principal component extraction. Since the eigenvalues and eigenvectors can be obtained via the singular value decomposition, the time complexity is dominated by forming the covariance matrix, which for n samples and p variables is O(n·p^2).

Summary
To deal with the fact that Partial Least Squares cannot explain nonlinear data well, this thesis puts forward Fusing Model Tree into Partial Least Squares. The method makes full use of the nonlinear characteristic that the regression model constructed by the model tree is formed from many multiple-linear segments. After a series of experiments, the conclusion shows that the improved algorithm explains the model well and has more accurate prediction ability. However, the number of leaf nodes directly decides the calculated result of the model; thus, a subject for further study is how to choose an appropriate number of leaf nodes.