Implementation of Airy function using Graphics Processing Unit (GPU)

Special mathematical functions are an integral part of Fractional Calculus, one of them is the Airy function. But it’s a gruelling task for the processor as well as system that is constructed around the function when it comes to evaluating the special mathematical functions on an ordinary Central Processing Unit (CPU). The Parallel processing capabilities of a Graphics processing Unit (GPU) hence is used. In this paper GPU is used to get a speedup in time required, with respect to CPU time for evaluating the Airy function on its real domain. The objective of this paper is to provide a platform for computing the special functions which will accelerate the time required for obtaining the result and thus comparing the performance of numerical solution of Airy function using CPU and GPU. KeywordsGPU, CPU, Airy function, Speedup.


I. INTRODUCTION
Fractional calculus (FC) is a field of mathematics that deals with derivatives and integrals of arbitrary non-integer order. It is being used in engineering domain extensively due to its ability to model real-world systems more accurately than its integer order variant. It is a topic which is around 320 years old!. The inception of this field can be attributed to a letter written to Guillaume de l'Hopital by Gottfried Leibniz in the year 1695 in which he mentions about fractional order derivatives, see [26], one can also find the foundational work of Fractional calculus in the early papers of Niels Henrik Abel [27]. But due to its mathematically intricate and cumbersome nature it remained mostly unexplored for many years until 1974 when the "First Conference on Fractional Calculus and its Applications" was held at the University of New Haven,Connecticut [28]. In recent years applications of this are growing nowadays in a rapid manner as there is a community of researchers that have came together to revive this field with the help of when advanced computing technology [1], [2]. As mentioned before using this mathematical technique, the scope of describing a real object is more accurate as compared to the classical integer-order methods. The replacement of integer orders with fractional orders has resulted into various realistic and compact model in the field of various physical, engineering, automation, biomedical, chemical engineering according to the mention made in this book [13], The function can be called a "Special Function" only when it has the ability to be useful in some applications and satisfies certain special properties. Some commonly used special functions are Gamma, Gauss Hypergeometric, Airy, Bessel,etc [2].
Airy function is the solution to Airy differential equation,which has the following form [22]: It has its applications in classical physics and mainly quantum physics. The Airy function also appears as the solution to many other differential equations related to elasticity, heat equation, the Orr-Sommerfield equation, etc in case of classical physics. While in the case of Quantum physics,the most well-known application of this equation is while solving WKB approximation problem [3], [24]. Taking a few steps we can arrive at the Airy Differential equation as given below. The following is the One-Dimensional Time Independent Schrödinger's equation Where ψ(x) is the wave-function,h is the reduced Planck's constant (Dirac constant), m being mass of the particle, V (x) is the potential function and E is the total energy. This equation basically says that Kinetic Energy + Potential Energy = Total Energy, and that is what each term in this equation signifies. When we consider a triangular potential well i.e V (x) = qεx (q is charge of particle and ε is the electric field strength), we arrive at the following equation The above equation is the same as equation (1) with some constant terms mugging up the equation of which the Airy function is the solution. Hence we can say that for the Schrödinger equation with triangular potential well's case the solution is Airy function. However, to our knowledge, the computational complexity of such Special Mathematical Functions included in the fractional calculus increases when long time vectors are considered while implementing. It is interesting to make a platform which will allow us to speedup the performance rate and thus use it in various hardware implementations of the functions. This process cannot be automated as the definition of the function is supposed to be written from scratch by a human, hence human intervention is required. For this purpose the concept of parallel computing plays a crucial role. In parallel computing, many arithmetic and logical operations are carried out simultaneously using multi-core processors. Traditionally, parallel computing has been carried out using the CPU with a handful of cores. But, we are using the capabilities of multi-core graphics processors for parallel computing.
Even though the original purpose of GPU is for graphics rendering, in recent years the GPU is increasingly being used for scientific computations. This practice of using GPU for general purpose computations is called General Purpose Computing on Graphics Processing Unit (GPGPU) Technology. In paper [4] a parallel GPU solution of the Caputo fractional reaction-diffusion equation in one spatial dimension with explicit finite difference approximation has been provided. The optimized GPU solution obtained is faster than the optimized parallel CPU solution. So the power of parallel computing on GPU for solving fractional applications can be recognized. The paper, see [5] gives a description of the numerical approach that has been done for the evolution of algorithms which can do multiple operations in a given time, in order to solve the fractional differential equations. The procedure accepted is to fit the existing numerical schemes and to develop model parallel algorithms by utilising Parallel Computing toolbox of MATLAB. Unlike CPU, Graphics Processing Unit has parallel arrays of hundreds of smaller processors which function in parallel. We have utilised this parallel architecture of GPU to accelerate the execution of the above mentioned special function. The solution methods for fractional differential equations are not present. This absence is the main reason behind using the models of integer-order. In recent years, the field Fractional Calculus and its applications have drawn the interest of many authors.
The rest of the paper is organized as follows: In section 2, the basic concept of Airy function has been presented briefly. In section 3, a detailed explanation of GPU architecture has been written which also includes the tools for MATLAB computing with GPU. Section 4 represents the implementation steps which were performed while designing the algorithm. Finally all the results are given in section 5 and the conclusions drawn from this work are presented in section 6.

II. BASIC CONCEPT OF AIRY FUNCTION
The Airy function is named after Sir George Bidell Airy [3].He stumbled upon this following integral while calculating the intensity of light in the neighbourhood of a caustic. For this purpose, he introduced the function defined by the integral given below which is now called the Airy function. W (m) is the solution of the differential equation The numerical calculation of Airy function was very tricky back in those days and even nowadays it is so. But in 1838, the values of W for m varying from -4.0 to 4.0 were given by Airy and in the year it was given for values ranging from -5.6 to 5.6. The problem which arose was that the series converges slowly as m increases [3]. In 1982, a scientist named Jeffrey's introduced the notation Ai(x) which is used nowadays as Two linearly independent solutions of equation (1) which are real when x is real are Ai(x) and Bi(x). For the function Bi(x), the integral representation is, The solutions Ai(x) and Bi(x) satisfy the initial conditions Also the behaviour of the first kind of Airy function which is Ai(x) can be described as follows: For negative values, it has an oscillatory behaviour while for positive values, it exponentially decays. It is also observed that there is an interesting and unique smooth transition between the oscillatory behaviour and exponential decay without any abrupt change in the transition.
For the 2nd kind of Airy function Bi(x) the behaviour for negative values is the same as the 1st kind but it is with a phase shift of π 2 and for positive values it grows exponentially. Airy function is related to Bessel Functions [25] of non integer order of first kind J α (x) and to non integer order of modified Bessel function I α (x) and K α (x) of order 1 3 which is implemented with change of the variable ζ = 2 3 x 3 2 . The relation between Airy functions and Bessel functions for x > 0 is given as For the values of x < 0, the functions Ai(x) and Bi(x) are given by While computing the time required for implementation of these functions in CPU as well as GPU, the definitions of Airy function which we used were the one with Bessel and modified Bessel function (In this experiment every function was written in a different subroutine). The reason behind doing so is simple. For calculating Airy function using the integral definition we have to depend on integration techniques like Runge-Kutta-Fehlberg method to get desired accuracy [29]. Contrary to it by using Bessel functions we eliminate the use of numerical integration as the definition of Bessel functions used for computing Airy function are in terms of a power series [14]. The Bessel function of the first kind J α (x) and Modified Bessel function of non-integer order I α (x) and K α (x) used in Airy function are defined as Here Γ(x) is the Gamma Function. The Bessel functions have there applications in Fluid dynamics, Electrodynamics and mainly in Acoustics [25].

III. GRAPHICS PROCESSING UNIT (GPU)
Graphic rendering is all about compute intensiveness and highly parallel computation. These are the features or specialization of Graphic Processor unit or GPU [6]. The process of generation of an image from a 2D or 3D model by means of computer programs is called rendering. This process needs a huge number of calculations to be performed in fraction of seconds. GPU has developed gradually into a highly parallel and multi threaded processor which consists of many cores. The computational horsepower of GPU is enormous and also it has high memory bandwidth. The floating point capability of GPU and CPU has inconsistency between them. Not only this but there are also differences between the architecture of GPU and traditional CPU. The CPU provides more resources to make single instruction execute faster, whereas the GPU makes use of hundreds of individual processing elements (cores) to execute single instruction on multiple data elements (elements in array) in parallel [2]. It is designed in such a way that more transistors are devoted to data processing rather than data caching and flow control.
To get a speedup in computation time, the applications consisting of large data sets make use of the platform of data parallel programming, the applications of processing of image and media which includes the post-process of image synthesis, video encoding and decoding, resizing of a digital image, stereo vision and automated recognition of patterns and regularities in data which maps the image blocks and pixels to parallel processing threads.It is quite obvious that a GPU will take less time for computing as compared to CPU, but the catch is to write a program by utilising special function/subroutines in such a way that it uses GPU architecture efficiently. In fact, acceleration is provided by data parallel processing for the algorithms which are present outside the field of rendering and processing of image. The GPU used in order to carry out the work represented in this paper is NVIDIA Tesla K80.

Specifications of NVIDIA Tesla K80 GPU
The following are the GPU device specifications used during computing and evaluating the Airy function.

Computation of GPU using MATLAB
The Matrix Laboratory, MATLAB is a programming platform constructed especially for engineers and scientists. It makes use of the MATLAB language, a matrix based language, granting the most logical expression of computational mathematics. MATLAB is used by millions of engineers worldwide. The reason behind its usage is to analyse and design the systems and products for revolutionizing our world. As MATLAB's operation is on whole matrices and arrays, it is one of the efficient ways to express computational mathematics. For envision and to obtain the insights from data, built in graphics are utilised which furthermore makes it an easy job. The desktop environment (DE) gives an opportunity to experiment, analyse and discover. These tools of MATLAB and its effectiveness are all precisely tested and constructed to work together [7]. Parallel Computing Toolbox present in MATLAB lets one to solve computationally and data-intensive problems using multi core processors like GPUs. Some of the High level constructs like parallel for loops and parallelized numerical algorithms enables us to parallelize MATLAB applications. The toolbox allows one to use the full processing power of multi core desktops by executing applications on workers that run locally [8]. The Parallel Computing Toolbox (PCT) requires the NVIDIA CUDA enabled device with compute capability of 1.3 or greater [2].

IV. DESIGN OF IMPLEMENTATION
Some of the functions used while computing in GPU are as follows: • gpuArray: This function is used to transfer the input data from MATLAB workspace to GPU. This function converts the array in the MATLAB workspace into a gpuArray object [8].
• gather: For transferring the data from GPU to MAT-LAB workspace this function is used. The gather function is used after all the operations on the data are performed by GPU.
• gpuDevice: For displaying the properties of the currently selected GPU device, the command gpuDevice is used.
Following are the steps to be followed while running a MATLAB code on GPU: 1) Start with the correct code which gives appropriate results. 2) Perform Vectorization of the code.
3) Profile the code using Run and Time function. 4) Variables should be transferred to GPU memory using gpuArray() function. 5) After the execution of code is done the output variables must be transferred to CPU memory by using gather() function.
The definitions of Airy function used while implementing it are as follows: The following algorithm represents steps of implementation of Airy function: 1) Start with the definition of Airy function used.
2) Write a pseudo-code for the function on paper.
3) Write the actual code for the function in MATLAB and obtain the output. 4) Do the calculations of Airy function using Bessel function manually.

5)
Compare the output thus obtained with that of MAT-LAB's result. 6) If the answer is not of the desired precision, make the necessary changes in the code. 7) Prepare a vectorized code for the same (for GPU). 8) Dump the code in GPU and thus calculate the time required for implementation. 9) On calculating the time required for the same code in CPU, obtain the Speedup parameter by using the following formula: 10) Compile and analyse the results.
The Airy function is implemented on CPU and similarly on GPU by following the steps given in the above algorithm. Vectorization of code is important while implementing the function in GPU because it eliminates loops which are sequentially executed. All the operations performed on the data transferred to GPU memory are executed on GPU in parallel by using MATLAB's PCT. The above algorithm can be generalized for any function.

V. RESULTS AND ANALYSIS
In this section a rigorous comparative study of computation time required for evaluating the Airy function on NVIDIA Tesla K80 CPU and GPU is presented. It should be taken into account that both functions (Airy function of first kind and second kind) were evaluated on total 9 test input in ascending order as shown in table 2 and 3 for 10000 times on CPU as well as GPU, and the computational time was noted down for both platforms. After that a mean was calculated for each input and listed in tables below. The reason behind doing so is to eliminate any statistical anomaly or error that might have occurred unknowingly while calculating the computation time. At last the Speedup parameter is calculated by the formula given in the algorithm described in section 4.
The following two tables present the computational time and the Speedup parameter for the purpose as discussed above. The table 2 is for Ai(x) and table 3 is for Bi(x). We can infer from table 2 that for Ai(x) the Speedup parameter calculated using the device used in this experiment gives an improvement in computation time (for largest input) of around 2.667 times the computation time required for the CPU to evaluate the same function with the same input on GPU. This is the crux of using GPUs multi-core architecture instead of single or dual core CPUs. Similarly for Bi(x) the table 3 shows that the calculated Speedup parameter is around 2.601 for the same system as mentioned above. The Speedup parameter depends on various factors that can decrease its value. As mentioned in section 2, for every function a different subroutine was used to evaluate the code, because of the complexity of the Airy function it calculates a total of 5 subroutines to evaluate the final Ai(x) or Bi(x) value. The function was not hardcoded (implementing Bessel and Airy functions into one single subroutine) as in real world application where this function might be used there would be an existing definition of the Bessel function.

VI. CONCLUSION
The work which was carried out was the acceleration of computation of Airy function on GPU. The acceleration is obtained because of the presence of multiple processing cores in GPU. A validation for the function was set using MATLAB. The same code was transformed into vectorized version using Parallel Computing toolbox of MATLAB. From the results, it is observed that the computational time required for the execution is more in CPU as compared to GPU. Thus the Speedup parameter obtained is around 2 to 3 fold for Airy function when evaluated on systems with NVIDIA Tesla K80 GPU. From the results it may be concluded that Speedup parameter depends on the amount of data and number of iterations of an operation.