Backpropagation algorithm with fractional derivatives

The paper presents a model of a neural network with a novel backpropagation rule that uses a fractional order derivative mechanism. Using the Grünwald–Letnikov definition of the discrete approximation of the fractional derivative, the author proposes smooth modeling of the transition functions of a single neuron. On this basis, a new concept of a modified backpropagation algorithm is proposed, which uses the fractional derivative mechanism both for modeling the dynamics of individual neurons and for minimizing the error function. The description of the signal flow through the neural network and the mechanism of smooth shape control of the activation functions of individual neurons are given. A model of error function minimization is presented which takes into account the possibility of changes in the characteristics of individual neurons. For the proposed network model, example courses of the learning processes are presented, which demonstrate the convergence of the learning process for different shapes of the transition function. The proposed algorithm allows the learning process to be conducted with a smooth modification of the shape of the transition function, without the need to modify the IT model of the designed neural network. The proposed network model is a new tool that can be used in signal classification tasks.


Introduction
Here we briefly recall some approaches to neural networks connected with gradient methods of error minimization. The backpropagation algorithm was first proposed by the Finnish student Seppo Linnainmaa in 1970, although without any indication of its possible relationship with neural networks [1, 2]. In the history of the development of neural networks, this was the period directly after the 1969 publication of Minsky and Papert titled Perceptrons, in which they presented proof of the limitations of networks of linear perceptrons for solving linearly non-separable tasks, e.g. XOR [1, 3]. Paradoxically, this fact initiated a period of stagnation, in which the development of neural networks was frozen for several years, becoming an unfulfilled dream of artificial intelligence applications. Fortunately, at the beginning of the 1980s a novel approach was presented (in some aspects earlier, by Grossberg in 1973 [3]) by Werbos in 1982, Parker in 1985, and LeCun in 1985, which overcame the non-linear XOR problem by minimizing the error via gradient descent, specifically by using the derivative of the composite error function [1, 3-9]. The classical backpropagation algorithm that uses this mechanism of minimizing the error function assumes that the expression within the error function is differentiable. Because the error function is in this notation a composite function, it is necessary, among other things, to differentiate the transfer function of a single neuron; hence the key limitation on the applicable transfer functions, which must satisfy the condition of differentiability. In practice, more or less accurate approximations of the transition functions are assumed, which in consequence leads to a simplification of the general model of the network. In addition, the assumed network structure forces the use of the same IT model, which includes the numerical implementation of the derivative of the transfer function of a single neuron, e.g. sigmoid, tanh, Gauss, log. In the author's opinion, this limitation is a significant simplification of both the dynamics of individual neurons and the strategy of conducting their learning processes. Taking the above into account, and assuming that biological neural cells can adopt an arbitrary transitional character, the idea arose to design a mechanism that would allow their dynamics to be modeled smoothly. The concept of smoothly changed transfer functions was already presented in works [10] and [11]. The main assumption is a mathematical model of a neuron that uses base functions derived from a family of sigmoidal functions, whose dynamics change depending on the order of the derivative used, which in the general case may have a non-integer value.

Model of the Fractional Back-Prop network
Let us assume the following model of an $L$-layered neural network, for which the model with the fractional order derivative mechanism [13-15] will be given (see Fig. 1). Excluding the transfer functions, this model resembles the classic model of a feedforward network. The input signal of the network is defined as the matrix of input vectors $\mathbf{p}^q = \left[ p_1^q, \ldots, p_R^q \right]^T$, for which the network produces the corresponding matrix of response vectors; $R$ denotes the number of receptors and $N$ stands for the number of network outputs. The matrix of expected value vectors was defined similarly as $\mathbf{t}^q = \left[ t_1^q, \ldots, t_N^q \right]^T$. For the sake of notational simplicity, it is assumed that within a given learning step the index $q$ can be omitted.

Fig. 1. Presumed model of the network
The flow of the signal within the network can be described as follows. The input signal to the $j$-th neuron in the first layer equals

$$e_j^1 = \sum_{i=1}^{R} w_{ij}^1 \, p_i + b_j^1,$$

where $w_{ij}^1$ is an element of the matrix of weights connecting the receptor layer with the first layer of neurons, $i$ is the number of the receptor in the input layer, $j$ is the number of the neuron in the first layer, $p_i$ denotes the $i$-th element of the input vector, and $b_j^1$ is the $j$-th element of the vector of bias values in the first layer. The activation of the $j$-th neuron in the first layer is expressed as

$$a_j^1 = f\!\left( e_j^1 \right),$$

where $f$ is the neuron transfer function of the first layer. Similarly, the input signal for the $k$-th neuron in the second layer is equal to

$$e_k^2 = \sum_{j} w_{jk}^2 \, a_j^1 + b_k^2,$$

where $w_{jk}^2$ is an element of the matrix of weights connecting the first layer with the second layer, and $b_k^2$ is an element of the vector of threshold values in the second layer. Similarly, the neuron activation in the output layer is

$$a_k^2 = f\!\left( e_k^2 \right).$$

Typically, the bias vector might be placed within the weight matrix; the starting values of the $i$ and $j$ indices then change accordingly.
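As a compact illustration of the signal flow just described, the sketch below computes one forward pass through the two-layer part of the architecture in Fig. 1. It is a minimal sketch assuming NumPy and a generic elementwise transfer function f; all names and shapes are illustrative choices, not taken from the paper.

```python
import numpy as np

def forward_pass(p, W1, b1, W2, b2, f):
    """Signal flow of the two-layer part of the network in Fig. 1.

    p  : input vector of R receptor values
    W1 : weights connecting receptors to the first layer, shape (R, J)
    b1 : first-layer biases, shape (J,)
    W2 : weights connecting the first layer to the second, shape (J, K)
    b2 : second-layer biases, shape (K,)
    f  : elementwise transfer function of the neurons
    """
    e1 = W1.T @ p + b1   # e1_j = sum_i w1_ij * p_i + b1_j
    a1 = f(e1)           # a1_j = f(e1_j)
    e2 = W2.T @ a1 + b2  # e2_k = sum_j w2_jk * a1_j + b2_k
    a2 = f(e2)           # output-layer activations
    return e1, a1, e2, a2
```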
In the presented model, differently than in the classic approach, the transition function of a single neuron is taken as a Grünwald–Letnikov (GL) fractional derivative [16-20] of a log base function. Based on the definitions of the integer-order and the fractional derivative, the GL derivative of a discrete function $f$ of a real variable $e$, defined on the interval $[e_0, e]$, is described by the formula

$$D^v f(e) \approx \frac{1}{h^v} \sum_{n=0}^{N} a_n^{(v)} f(e - nh),$$

where the consecutive coefficients $a_n^{(v)}$ are defined as

$$a_n^{(v)} = (-1)^n \binom{v}{n},$$

$\binom{v}{n}$ denotes the Newton binomial, $v$ is the order of the fractional derivative of the base function $f_B$, $e_0$ is the beginning of the interval, $h$ is the step of discretization, and $N$ stands for the number of discrete measurements. The base function $f_B$ might be defined as the log-sigmoid

$$f_B(e) = \frac{1}{1 + e^{-\beta e}},$$

where $\beta \ge 1$ denotes the inclination coefficient; the neuron transfer function is then $f(e) = D^v f_B(e)$, so that $v = 0$ reproduces the base function itself. The set of possible base functions, as well as their retrieval, has been shown in [11].

The BP learning method is based on the minimization of the error function, which for the presented architecture takes the form

$$E^q = \frac{1}{2} \sum_{k=1}^{N} \left( t_k^q - a_k^q \right)^2,$$

where $q$ is the number of the input vector for which the vector $t^q$ is desired at the output of the network. This is the standard error measure of the least squares method (LSM). The expected change of the weight values, which minimizes the error, is expressed as

$$\Delta w_{jk} = -\eta \frac{\partial E^q}{\partial w_{jk}} = -\eta \frac{\partial E^q}{\partial a_k^q} \frac{\partial a_k^q}{\partial e_k^q} \frac{\partial e_k^q}{\partial w_{jk}},$$

where $\eta$ is the speed of learning. Regarding the additivity property of fractional derivatives [21-26], the derivative of the transfer function equals $\frac{d}{de} D^v f_B(e) = D^{v+1} f_B(e)$, which is again computed with the GL approximation of order $v + 1$. By introducing the additional notation

$$\delta_k^q = \left( t_k^q - a_k^q \right) D^{v+1} f_B\!\left( e_k^q \right),$$

we can write the final equation of the weight correction in the output layer as

$$\Delta w_{jk} = \eta \, \delta_k^q \, a_j^q.$$

For the first layer, the considerations are analogous, with the exception of the component describing the error measure propagated from the next layer:

$$\delta_j^{1,q} = D^{v+1} f_B\!\left( e_j^{1,q} \right) \sum_k \delta_k^q \, w_{jk},$$

which, inserted into the expression defining the weight change, finally gives

$$\Delta w_{ij} = \eta \, \delta_j^{1,q} \, p_i^q.$$

Based on the presented considerations, the general formula for the weight modification in an $L$-layered neural network with the modified transition function can be given as

$$\Delta w_{jk}^{(l)} = \eta \, \delta_k^{(l),q} \, a_j^{(l-1),q},$$

where the value $\delta_j^{(l),q}$, while presenting the $q$-th pattern, is written as

$$\delta_j^{(l),q} = \begin{cases} \left( t_j^q - a_j^{(L),q} \right) D^{v+1} f_B\!\left( e_j^{(L),q} \right), & l = L, \\[4pt] D^{v+1} f_B\!\left( e_j^{(l),q} \right) \displaystyle\sum_{k=1}^{N_{l+1}} \delta_k^{(l+1),q} \, w_{jk}^{(l+1)}, & l < L, \end{cases}$$

$l$ is the number of the considered layer, $j$ is the number of the neuron in layer $l$, $k$ runs over the neurons of layer $l+1$, and $N_{l+1}$ is the number of neurons in layer $l+1$. In the above considerations it was assumed that the parameter $v$ is the same in each layer and for all individual neurons, as this corresponds to the first stage of the presented investigations.
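The discrete GL approximation above is straightforward to implement. The following sketch generates the coefficients $a_n^{(v)}$ by the standard recurrence, evaluates $D^v f$ at a point, and wraps the order-$v$ derivative of a base function as a neuron transfer function, following the idea that the shape of the transfer function is controlled smoothly by $v$. The step h and the truncation N are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def gl_coefficients(v, N):
    """Coefficients a_n^(v) = (-1)^n * C(v, n) of the GL expansion,
    generated by the recurrence a_n = a_{n-1} * (n - 1 - v) / n."""
    a = np.empty(N + 1)
    a[0] = 1.0
    for n in range(1, N + 1):
        a[n] = a[n - 1] * (n - 1 - v) / n
    return a

def gl_derivative(f, e, v, h=0.05, N=64):
    """Discrete GL approximation D^v f(e) ~ h^(-v) * sum_n a_n^(v) f(e - n*h)."""
    n = np.arange(N + 1)
    return float((gl_coefficients(v, N) * f(e - n * h)).sum() / h**v)

def transfer(e, v, fB):
    """Neuron transfer function taken as the order-v GL derivative of the
    base function fB; v = 0 reproduces fB itself."""
    return np.array([gl_derivative(fB, x, v) for x in np.atleast_1d(e)])
```

A convenient sanity check: for $v = 1$ the recurrence yields the coefficients $(1, -1, 0, 0, \ldots)$, so gl_derivative returns the backward difference $(f(e) - f(e - h))/h$, i.e. the classical first derivative is recovered as a special case.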

FBP net under the XOR problem
For the accepted network model with the fractional backpropagation (FBP) mechanism, experiments were carried out to examine the convergence of the proposed learning algorithm. The XOR problem was assumed as the input task for the neural network. The diagram of the network designed for this purpose is shown in Fig. 2. It has been assumed that, for a randomly selected set of weights and biases $\{W^1, b^1, W^2, b^2\}$, the learning process will be performed for successive, smoothly changing values of the parameter $\nu$ in the presumed range $0 \le \nu \le 1.1$, so that series of tests can be compared. The network uses the discrete approximation of the fractional GL derivative of the log base function $f_B$ both for acquiring the shape of the transfer function and for obtaining its derivative. The initial values of $\{W^1, b^1, W^2, b^2\}$ are shown in Table 1, and these values are common to each subsequent learning process. They were obtained with the standard randomizing procedure of the Matlab environment.
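To fix notation for the experiment, the sketch below builds the XOR training set and randomly initializes a small network. A 2-2-1 topology is assumed here purely for illustration (Fig. 2 defines the actual diagram), and the NumPy initialization is only a stand-in for the Matlab-generated values of Table 1.

```python
import numpy as np

# XOR training set: four input patterns (columns) with their targets.
P = np.array([[0., 0., 1., 1.],
              [0., 1., 0., 1.]])        # shape (R, Q) = (2, 4)
T = np.array([[0., 1., 1., 0.]])        # shape (N, Q) = (1, 4)

# Random initial weights and biases for an assumed 2-2-1 network;
# a stand-in for the values of Table 1.
rng = np.random.default_rng(seed=1)
W1 = rng.uniform(-1, 1, size=(2, 2))    # receptors -> first layer
b1 = rng.uniform(-1, 1, size=2)
W2 = rng.uniform(-1, 1, size=(2, 1))    # first layer -> output
b2 = rng.uniform(-1, 1, size=1)
```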

Results
In this part of the work, the proposed FBP model was tested on the task of solving the XOR problem. This task both demonstrates the ability of the network to solve non-linearly separable problems and allows the convergence of its learning process to be examined. Using the initial values of the weights and biases, a number of learning processes were carried out; the obtained weights and biases are presented in Table 2. In the initial phase of the experiments, it was checked whether it is possible at all to obtain convergence of the FBP algorithm for non-integer orders of the base function derivative. The left part of Fig. 3 shows the course of the learning process for the classical BP algorithm performed with the set of initial weights and biases from Table 1; the right part shows the course of the learning process in the FBP network for different values of the factor ν. The BP network used for comparison modified its weights with the Quickprop mechanism, i.e. an adaptive change of the learning rate coefficient combined with the momentum mechanism. Table 2 presents sample sets of weights and biases obtained as a consequence of the successive learning processes.
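To make the reported experiment concrete, the following sketch, assuming the GL-derivative/transfer and XOR-setup fragments above are in scope, performs fractional backpropagation epochs and repeats the learning process from one common set of initial weights for several derivative orders, in the spirit of the comparisons in Figs. 3 and 4. The activations use $D^v f_B$ and the error terms use $D^{v+1} f_B$, following the derivation above; the learning rate, epoch count, and the particular values of v are illustrative, and the momentum and Quickprop refinements mentioned above are omitted for brevity.

```python
fB = lambda x: 1.0 / (1.0 + np.exp(-x))   # assumed log-sigmoid base function

def fbp_epoch(P, T, W1, b1, W2, b2, v, eta=0.5):
    """One epoch of the sketched FBP update: activations use D^v fB and the
    error terms use D^(v+1) fB, the derivative of the transfer function.
    Weight arrays are updated in place; returns the SSE over the epoch."""
    sse = 0.0
    for q in range(P.shape[1]):
        p, t = P[:, q], T[:, q]
        e1 = W1.T @ p + b1
        a1 = transfer(e1, v, fB)
        e2 = W2.T @ a1 + b2
        a2 = transfer(e2, v, fB)
        err = t - a2
        sse += float(err @ err)
        d2 = err * transfer(e2, v + 1, fB)        # output-layer deltas
        d1 = (W2 @ d2) * transfer(e1, v + 1, fB)  # first-layer deltas
        W2 += eta * np.outer(a1, d2); b2 += eta * d2
        W1 += eta * np.outer(p, d1);  b1 += eta * d1
    return sse

# Repeat the learning process from the same initial weights for several
# orders v, mirroring the comparisons of Figs. 3 and 4.
for v in (0.0, 0.4, 0.8, 1.0):
    w1, c1, w2, c2 = W1.copy(), b1.copy(), W2.copy(), b2.copy()
    for _ in range(2000):
        sse = fbp_epoch(P, T, w1, c1, w2, c2, v)
    print(f"v = {v:.1f}: final SSE = {sse:.4f}")
```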

Conclusion
The paper has presented the FBP network model, which uses a fractional order derivative mechanism. The network uses the GL derivative both to obtain a base function and to calculate a discrete approximation of its derivative in the individual layers. The proposed network learning model is a new approach that overcomes the limitations associated with the properties of the single-neuron transfer function known from the current subject literature. For the proposed FBP network model, simulations of network convergence on the XOR task were performed. The courses of the learning process illustrated in Fig. 3 and Fig. 4 allow the conclusion that it is possible to carry out the learning process using a derivative of fractional order. The resulting weight sets presented in Table 2 indicate that it is possible to achieve the same minimum of the error function for different values of the derivative order and the assumed shapes of the base function. The accuracy of the fractional derivative approximation has a key influence on the accuracy of the determined weight and bias sets. The new algorithm of error function minimization can be used with various base functions without the need to modify the IT model of the neural network. As a continuation of the ongoing research, experiments should be undertaken to determine the optimal selection of the derivative approximation parameters for the individual base functions. It is also necessary to examine the possibility of selecting specific base functions according to the class of the analysed input signal.


Fig. 3. The process of learning in the classic BP network and in the FBP network.

Fig. 4. The process of learning in the FBP network with changing values of ν. The figure presents a juxtaposition of the learning processes obtained for the successive values of ν; the green circle marks denote the beginning of the learning process, while the red circles indicate its end.



Table 1. Initial values of weights and biases.

Table 2. Exemplary sets of output weights and biases; E denotes the network SSE error and η the learning rate coefficient.