Estimating Ads' Click-Through Rate with Recurrent Neural Network

With the development of the Internet, online advertising has spread to every corner of the world, and estimating the ads' click-through rate (CTR) is an important way to improve online advertising revenue. Compared with linear models, nonlinear models can learn much more complex relationships among a large number of nonlinear features and thus improve the accuracy of CTR estimation. The recurrent neural network (RNN) based on Long Short-Term Memory (LSTM) is an improved feedback neural network with a ring structure. The model overcomes the vanishing and exploding gradient problems of the general RNN. Experiments show that the LSTM-based RNN outperforms the linear models and effectively improves the estimation of the ads' click-through rate.


Introduction
In 2014, online advertising overtook television advertising for the first time in terms of market size, reaching 154 billion yuan in China, a 40% increase year-on-year. Compared with 77.3 billion yuan in 2012, the market size of online advertising almost doubled in 2014, and in 2015 it exceeded 200 billion yuan.
As an important research area in the field of computational advertising, ads' click-through rate estimation is one important way to increase online advertising revenue. Based on rich historical data, a CTR estimation model exploits the complex relationships among a large number of nonlinear features of the historical data as fully as possible for the sake of estimation accuracy.
Combined with the advertising position, the ad auction mechanism, and other factors, higher CTR estimation accuracy allows online advertisements to be placed more accurately, which in turn improves the real click-through rate. Among online advertising payment mechanisms, most companies use cost per click (CPC): the more the ads are clicked, the more profit is made [2].
Ads' click-through rate estimation can be divided into four steps: feature extraction, model building, model training, and model estimation. There have been many attempts to estimate ads' click-through rate. Joachims [3] proposed online Bayesian probability regression (OBPR), but it was based on the specific characteristics of an advertisement, which made accurate personalized recommendation difficult. Chapelle et al. [5] proposed the dynamic Bayesian network model. Dave et al. [6] adopted gradient boosting decision trees (GBDT) as a regression model to extract similar characteristics. Richardson [7] used the logistic regression (LR) model to learn nonlinear characteristics, but it does not fully capture the relationships among many features, and as the number of iterations and the learning time increase it can easily overfit. Agrawal et al. [8] presented spatio-temporal prediction models in 2009. Agarwal et al. [9] proposed using a pre-existing hierarchy over sparse data to solve the rate estimation problem for sparse events. Zhang et al. [10] put forward the COEC (clicks over expected clicks) model, which sets an expected figure in advance. Cheng et al. [11] matched the user's search terms with the content of the advertisement. Zhang et al. [12] proposed using an RNN to predict search advertising click-through rate, adopting back-propagation through time (BPTT) for model training.
According to their experimental results, the RNN is more accurate than the LR and NN models. However, vanishing or exploding gradients can occur when training an RNN with gradient descent. To solve this problem, this paper adopts an RNN based on LSTM, whose special structure avoids vanishing and exploding gradients and improves the model's accuracy.
The advertising data in this paper come from the Avazu company. Implicit features are extracted from the explicit features and hidden features such as users' attributes. The hidden layer of our model adopts a three-layer connection structure, which allows the model to be trained adequately. The experimental results show that our model is more accurate than the LR model, the BP neural network (NN) model, and the plain RNN model.

Recurrent Neural Network Model Based on LSTM

Model Definition
The recurrent neural network based on LSTM replaces the hidden-layer nodes of the general recurrent neural network with LSTM structures; each LSTM structure adds an input gate, an output gate, a forget gate, and an internal unit (cell).
The input gate indicates whether the input layer's signal is allowed to enter the hidden-layer node: when the gate is open, the output signal of the input layer enters; when it is closed, the signal is refused. The input gate is denoted $\iota$. The output gate indicates whether the output value of the current node is passed to the next layer: when the gate is open, the hidden-layer node's signal is output; when it is closed, it is not. The output gate is denoted $\omega$. The forget gate decides whether the hidden-layer node retains its stored historical information: when the gate is open, the history is kept; when it is closed, it is discarded. The forget gate is denoted $\phi$, and $s^t$ denotes the value of the information stored at time t. The input and output layers of the model are the same as in the RNN model, as shown in Figure 1; the hidden-layer nodes are replaced as shown in Figure 2.

Model Training
Unlike the general recurrent neural network, whose hidden-layer nodes take input from two sources, the input of each LSTM gate consists of three parts. The input to the input gate at time t combines the output vector of the input-layer nodes $x_i^t$, the output of the hidden layer at the previous step $b_h^{t-1}$, and the cell state $s_c^{t-1}$:

$$a_\iota^t = \sum_i w_{i\iota} x_i^t + \sum_h w_{h\iota} b_h^{t-1} + \sum_c w_{c\iota} s_c^{t-1} \qquad (1)$$

Through the activation function $f$ of the input gate, the output vector of the gate at time t is

$$b_\iota^t = f(a_\iota^t) \qquad (2)$$

The input of the forget gate is also made up of three input vectors, from the same sources as the input gate. The input vector of the forget gate at time t is

$$a_\phi^t = \sum_i w_{i\phi} x_i^t + \sum_h w_{h\phi} b_h^{t-1} + \sum_c w_{c\phi} s_c^{t-1} \qquad (3)$$

and, through the activation function $f$ of the forget gate, its output vector at time t is

$$b_\phi^t = f(a_\phi^t) \qquad (4)$$

As Figure 2 shows, the input of the cell unit is composed of two parts: the input vector of the input layer $x_i^t$ and the output of the hidden layer at the previous step $b_h^{t-1}$:

$$a_c^t = \sum_i w_{ic} x_i^t + \sum_h w_{hc} b_h^{t-1} \qquad (5)$$

According to the forget gate, which determines whether the stored information is retained, the cell state is updated as

$$s_c^t = b_\phi^t s_c^{t-1} + b_\iota^t g(a_c^t) \qquad (6)$$

The input of the output gate is composed of three parts: the output vector of the input layer, the hidden-layer output of the previous step, and the information retained in the cell. The input vector of the output gate at time t is

$$a_\omega^t = \sum_i w_{i\omega} x_i^t + \sum_h w_{h\omega} b_h^{t-1} + \sum_c w_{c\omega} s_c^t \qquad (7)$$

The output vector $b_\omega^t$ of the output gate is obtained through its activation function:

$$b_\omega^t = f(a_\omega^t) \qquad (8)$$

The output vector $b_c^t$ of the cell unit is

$$b_c^t = b_\omega^t h(s_c^t) \qquad (9)$$

The output vector of the cell units, i.e., the output vector of the hidden layer, serves as the input vector of the output layer:

$$a_k^t = \sum_h w_{hk} b_h^t \qquad (10)$$

and the resulting output vector of the output layer is

$$b_k^t = f(a_k^t) \qquad (11)$$

The weight $w_{ij}$ between node i and node j is updated as

$$w_{ij} \leftarrow w_{ij} - \eta \frac{\partial L}{\partial w_{ij}} \qquad (12)$$

where $\eta$ is the learning step and

$$\frac{\partial L}{\partial w_{ij}} = \delta_j^t b_i^t \qquad (13)$$

where $\delta_j^t$ is the residual error of node j and $b_i^t$ is the output vector of node i. So formula (12) can be simplified using formula (13).
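The gate computations described above can be sketched in NumPy for a single time step. This is a minimal illustration, not the paper's implementation: it uses the common LSTM variant without peephole connections from the cell state to the gates, and all variable names are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, W):
    """One LSTM time step.
    x_t: input vector; h_prev: hidden output at t-1;
    s_prev: cell state at t-1; W: dict of gate weight matrices."""
    z = np.concatenate([x_t, h_prev])   # the two input sources, stacked
    i = sigmoid(W["i"] @ z)             # input gate
    f = sigmoid(W["f"] @ z)             # forget gate
    o = sigmoid(W["o"] @ z)             # output gate
    g = np.tanh(W["c"] @ z)             # candidate cell input
    s_t = f * s_prev + i * g            # cell state update (formula 6)
    h_t = o * np.tanh(s_t)              # hidden output (formula 9)
    return h_t, s_t

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
W = {k: rng.normal(scale=0.1, size=(n_hid, n_in + n_hid)) for k in "ifoc"}
h, s = np.zeros(n_hid), np.zeros(n_hid)
h, s = lstm_step(rng.normal(size=n_in), h, s, W)
print(h.shape)  # (3,)
```

Stacking the input vector and the previous hidden output into one vector `z` is equivalent to the two separate weighted sums in the formulas above.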

Loss Function and Evaluation Function
We need a criterion to judge whether the model is good or bad. The Area Under the ROC Curve (AUC) is a common criterion. However, AUC focuses on the ranking of the CTR estimates, while logloss focuses on their accuracy. When all the estimated click-through rates increase by a certain proportion, the AUC does not change, but the logloss does. Logloss reflects the difference between the click rate estimated by the model and the true click rate: the smaller the logloss, the more accurate the estimate of the ads' CTR. This paper uses the logloss implementation in scikit-learn, defined as

$$logloss = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1-y_i)\log(1-p_i)\right] \qquad (15)$$

where $y_i$ is the true value of the i-th click and $p_i$ is the i-th click probability estimated by the model.
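Formula (15) can be computed directly; the sketch below mirrors what scikit-learn's `log_loss` does, including clipping the probabilities away from 0 and 1 to keep the logarithm finite.

```python
import numpy as np

def logloss(y_true, p_pred, eps=1e-15):
    """Cross-entropy between true clicks (0/1) and estimated CTRs."""
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    y = np.asarray(y_true, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = [1, 0, 0, 1]
p = [0.9, 0.1, 0.2, 0.8]
print(round(logloss(y, p), 4))  # 0.1643
```

Note that scaling all predictions by a common factor changes this value even though it leaves the ranking, and hence the AUC, untouched.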

Data Analysis
In this paper we use the Avazu company's advertising data (only the training set) to validate that our proposed model can improve CTR estimation accuracy. The training set consists of 40,428,967 records. Each record consists of 24 features, including 15 explicit features and 9 hidden encrypted features. We divide the training set into four parts: three parts as training data and one part as test data. As shown in Table 1, the click rate of the test data set is similar to the real click rate of the training data set, so the split does not affect the model's prediction.
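The 3:1 split and the click-rate sanity check can be sketched as follows. The labels here are synthetic stand-ins generated at roughly the 17% CTR reported for the Avazu sample, not the real data.

```python
import numpy as np

rng = np.random.default_rng(42)
clicks = rng.binomial(1, 0.17, size=100_000)  # synthetic click labels

# Shuffle and split into four folds: three for training, one for testing.
idx = rng.permutation(len(clicks))
folds = np.array_split(idx, 4)
train_idx = np.concatenate(folds[:3])
test_idx = folds[3]

train_ctr = clicks[train_idx].mean()
test_ctr = clicks[test_idx].mean()
print(f"train CTR={train_ctr:.3f}, test CTR={test_ctr:.3f}")
```

With a random split of this size, the two click rates agree closely, which is the property Table 1 verifies for the real data.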

Feature Processing
Analysis of the advertising data shows that among the 24 features there are two groups beginning with "device_ip" or "device_id", and most of their values occur only a small number of times, so the advertising data has many long-tail feature values. The analysis results are shown in Table 2. To make the model learn more stably, we filter out some of the long-tail feature values: we remove samples whose device_ip frequency is less than 10 or whose device_id frequency is less than 10. This reduces the training set to 23,548,762 records.
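The frequency filter can be sketched with a toy sample (the paper's threshold is 10; a threshold of 2 is used here so the effect is visible on a few records):

```python
from collections import Counter

# Toy stand-in for one column of the Avazu log.
device_ips = ["a", "a", "b", "c", "c", "c"]

def keep_frequent(values, min_count):
    """Keep only records whose feature value occurs >= min_count times."""
    counts = Counter(values)
    return [v for v in values if counts[v] >= min_count]

kept = keep_frequent(device_ips, 2)
print(kept)  # ['a', 'a', 'c', 'c', 'c']  -- the singleton 'b' is dropped
```

In practice the same filter is applied per column (device_ip and device_id), and a record is dropped if either value falls below the threshold.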
We merge the device_ip, device_id, device_model, and C14 features into mosaic characters. The mosaic characters are then hashed into new characters, of which the first 8 characters are taken as the user_id. We merge features C15 and C16 into a new feature, denoted banner_size, which is also hashed. After removing the C15 and C16 features and adding banner_size and user_id, the number of features is still 24. The obtained features are normalized so that the feature values are mapped to [0, 1].
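A sketch of the user_id construction and the [0, 1] normalization. The paper does not name the hash function, so MD5 is an assumption here; the input values are made up for illustration.

```python
import hashlib

def make_user_id(device_ip, device_id, device_model, c14):
    """Concatenate the four features into a mosaic string, hash it,
    and keep the first 8 hex characters as the user_id.
    (MD5 is our assumption; the paper only says the mosaic is hashed.)"""
    mosaic = f"{device_ip}{device_id}{device_model}{c14}"
    return hashlib.md5(mosaic.encode()).hexdigest()[:8]

def min_max_normalize(values):
    """Map numeric feature values onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

uid = make_user_id("1.2.3.4", "devA", "modelX", "21611")
print(uid, min_max_normalize([2, 4, 6]))  # 8-char id, [0.0, 0.5, 1.0]
```

Hashing turns heterogeneous string features into values of a single type, which is what the normalization step then maps into [0, 1].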

Result
Most machine learning models aim to find a line, plane, or higher-dimensional surface that approximates the trend of the data characteristics as closely as possible.
Table 3 shows the logloss values of each model under different numbers of iterations. As the table shows, the logloss of LR reaches its minimum of 0.388364 at the fortieth iteration and then begins to increase, which indicates that the model has been fully trained. The logloss of BP reaches its minimum of 0.461315 at the fiftieth iteration. The BP neural network is essentially a gradient descent algorithm, so it suffers from local minima, which can lead to early termination. Compared with the LR and BP models, the logloss of the RNN reaches its minimum of 0.386299 at the sixth iteration. Therefore, the RNN gives a better ads' click-through rate estimation.
In contrast to BP, the input of the RNN's hidden-layer nodes comes not only from the output of the input-layer nodes but also from the hidden-layer output at the previous moment, so the model can learn relationships among more complex characteristics.
Although the RNN can remember historical information, as the depth of learning increases it suffers from vanishing gradients. We use LSTM units instead of normal neurons, so that the improved LSTM-based RNN can memorize as much history as possible and prevent the gradient from vanishing. The experimental results show that the minimum logloss of the improved LSTM-based RNN is 0.383213; thus the model outperforms the other models in estimating the ads' click-through rate. The bar chart in Figure 4 shows the logloss values of each model more intuitively.

Conclusions and Prospects For Future Work
The estimation of online advertising CTR is one of the most popular problems in computational advertising and has been drawing more and more attention. This paper applies an improved LSTM recurrent neural network model to the estimation of CTR from advertising history data. The experimental results show that our model is more accurate than the baseline models in estimating advertising CTR, so the work done in this paper is effective. The experimental data are Kaggle data provided by Avazu; the data have been encrypted and sampled. According to the data analysis, the click rate of the sampled data is 17%, which is significantly higher than in real life. Real advertisement data are distributed extremely unevenly, so we are still committed to finding real advertisement data for training and evaluating each model. There are many neural network optimization algorithms and objective functions. The optimization algorithm used for the RNN in this paper is SGD. New optimization algorithms keep appearing, such as the adaptive algorithms Adadelta [14] and Adagrad [15]. The objective function in this paper is the cross-entropy function; common objective functions also include MSE (mean squared error), MAE (mean absolute error), and categorical cross-entropy.
Optimization algorithms and objective functions are also problems worthy of research in the future.
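The contrast between the SGD update used in this paper and an adaptive method such as Adagrad can be sketched in a few lines. This is a generic illustration of the two update rules, not the paper's training code.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    """Plain SGD: w <- w - lr * dL/dw, the rule used for the RNN here."""
    return w - lr * grad

def adagrad_step(w, grad, cache, lr=0.01, eps=1e-8):
    """Adagrad: scale each parameter's step by its accumulated
    squared gradients, giving per-parameter learning rates."""
    cache = cache + grad ** 2
    return w - lr * grad / (np.sqrt(cache) + eps), cache

w = np.array([1.0, -2.0])
g = np.array([0.5, -0.5])
print(sgd_step(w, g))  # [ 0.995 -1.995]
```

Adagrad shrinks the step for parameters that have already received large gradients, which often helps with the sparse features typical of CTR data.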
The main contributions of this paper are threefold. First, we analyze a large number of advertising data features and extract hidden user attributes with mosaic characters. The mosaic characters are hashed into new characters, so the originally different types of feature values are turned into the same type; in accordance with the NN model, the feature values are then mapped to [0, 1] by normalization. Second, the improved LSTM-based RNN model has three hidden layers, each with 256 nodes; this model is used to simulate users' click behavior and to estimate the ads' click-through rate. Third, we use a large amount of training and test data to verify the validity of the model; the experimental results show that it is better than the LR model and the general RNN model. The first section of this paper introduces the research on ads' CTR prediction. In the second section, we give the definition of the LSTM-based RNN model, including its training process and evaluation function. After that, we analyze the advertising data and present the experiments and results of each model in section four. Section five presents the conclusions and prospects for future work.

Figure 1. Recurrent Neural Network Structure. Here $w_h$ denotes the weights between the hidden layer and the input gate unit; in the rest of this paper, different subscripts of w indicate the weights between different nodes. The hidden-layer nodes are replaced as shown in Figure 2.

Figure 2. Long Short-Term Memory Structure. The structure of the recurrent neural network model based on LSTM is shown in Figure 3.

Figure 3. Recurrent neural network based on LSTM. Deep learning models such as convolutional neural networks and deep network learning models are gradually being used for ad CTR estimation. This paper adopts an improved LSTM-based RNN for estimating the ads' click-through rate; in view of the experimental results, it is better than the LR model and the general NN model. The NN model is sensitive to the input features, and high-dimensional features affect model training and can even cause it to fail. Therefore, effective feature extraction, feature selection, and feature reduction over massive data will become a hot research direction in the future.

Table 2. The feature sets of device_ip and device_id.