Research on heterogeneous acceleration platform based on FPGA

ITM Web of Conferences 45, 01029 (2022), CSCNS2021, https://doi.org/10.1051/itmconf/20224501029

In the era of artificial intelligence, data volumes are exploding. Although horizontally scaling distributed clusters to meet the growing computing-power demands of massive data processing is feasible, adding nodes without limit leads to bloated clusters. Moreover, most of the transistors in a CPU are devoted to cache and control units, which makes the CPU inefficient for the computational operations of massive data processing. Academia therefore uses hardware devices such as the GPU (Graphics Processing Unit), ASIC (Application-Specific Integrated Circuit), and FPGA (Field-Programmable Gate Array) to accelerate computation-intensive workloads such as deep learning and image processing. This paper first discusses the advantages and technical requirements of FPGA acceleration based on the characteristics of a Spark cluster. It then proposes the design of an FPGA-CPU heterogeneous acceleration platform and introduces the radix-2 FFT algorithm. Finally, the paper presents and compares the computation time of the radix-2 FFT algorithm before and after acceleration. The results show that the heterogeneous cluster achieves a speedup of about 1.79x over the CPU cluster.


Introduction
With the rapid growth of data volume in recent years, the design of computing platforms must balance high performance against acceptable power consumption. Spark is an open-source big data processing engine that supports computation on RDDs and provides reusability, fault tolerance, and real-time stream processing [1]. Spark application tasks execute only on the CPU; low parallelism and low power efficiency may limit the performance and scalability of Spark clusters. Heterogeneous accelerators such as FPGAs, GPUs, and MICs outperform general-purpose processors in the field of big data processing, and integrating them into the original Spark framework can significantly improve the performance of each node. A GPU scales its performance through core count and SIMD/SIMT parallelism; it performs best when repeatedly executing the same operation across many identical tasks, but it is inferior to an FPGA when executing different tasks [2]. Based on these considerations, this paper uses the FPGA as the heterogeneous accelerator. Spark applications can be developed in Scala, Java, and Python. Python has an extensive library ecosystem for scientific computation and data processing and allows fast development [3]; therefore, this paper mainly uses PySpark to develop the CPU-side programs.
The rest of this paper is organized as follows: Section 2 introduces the technical background and related research; Section 3 introduces the design for integrating the FPGA acceleration core into a Spark cluster; Section 4 introduces the implementation and simulation of the FPGA acceleration core based on the radix-2 FFT algorithm; Section 5 presents the test and analysis of the computing time and power consumption of the heterogeneous system.
The phase factor W_N^k = e^(-j2πk/N), also known as the rotation factor, is defined in equation 2. Figure 1 shows the flow of the radix-2 FFT algorithm, taking an FFT with 16 sampling points as an example. First, the 16-point FFT is decomposed into the FFTs of two 8-point odd-even subsequences; the two 8-point FFTs are then decomposed in parallel into four 4-point FFTs. The decomposition continues until only 2-point FFTs remain, which are then evaluated with equation 1.
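For reference, the standard form of this decomposition can be written as follows (a reconstruction of the paper's equations 1 and 2; the symbols E(k) and O(k) for the even- and odd-indexed sub-FFTs are ours):

```latex
% Radix-2 DIT butterfly (equation 1): combine the N/2-point FFTs
% E(k) of the even-indexed samples and O(k) of the odd-indexed samples.
X(k)       = E(k) + W_N^{k}\, O(k), \qquad k = 0, \dots, N/2 - 1
X(k + N/2) = E(k) - W_N^{k}\, O(k)
% Phase (rotation) factor (equation 2):
W_N^{k}    = e^{-j 2\pi k / N}
```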

Related research
Baidu's Parallel Distributed Deep Learning platform (Paddle) [6] integrates GPUs and FPGAs into clusters to accelerate applications. IBM's Coherent Accelerator Processor Interface (CAPI) gives accelerators a coherent view of the POWER8 core's system memory and a high-bandwidth, low-latency path to peripheral devices [7]. Microsoft developed a custom FPGA board, Catapult [8], placed in each server of a 1,632-node cluster; Catapult increased throughput per server by 95% and reduced tail latency by 29% under high load. Yu Ting Chen [8] and Ehsan Ghasemi [9] connected the JVM host program and OpenCL through JNI and used the computing framework provided by OpenCL to control and manage the link between FPGA and CPU. Inspur, in cooperation with IFLYTEK [10], conducted accelerated research on a deep-learning DNN speech recognition algorithm on a heterogeneous platform; the results show that the FPGA has a noticeable advantage in performance and energy-consumption ratio. Huang [11] realized FPGA-based acceleration of MuTect2; the experiments showed a maximum speedup of nearly thirty times and an average speedup of about three times per node in the load-balancing test. The above work, however, reports only platform test results and does not introduce the integration scheme of the heterogeneous framework or the acceleration design of the accelerator in detail.
To address the slow development process of FPGA heterogeneous platforms, this work designs a distributed heterogeneous computing platform based on the SDSoC whole-system optimizing compiler, which shortens the overall development cycle. Using the Vivado HLS high-level synthesis tool and acceleration strategies such as pipelining and loop unrolling, an efficient accelerator IP core is designed and realized for the radix-2 FFT algorithm. In simulation and testing, the acceleration effect is better than that of a single-CPU platform.
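A minimal sketch of how such pipelining and unrolling directives look in Vivado HLS C++ (the loop body, function, and array names here are illustrative, not the paper's actual kernel; `#pragma HLS` directives are ignored by an ordinary compiler):

```cpp
#include <cstddef>

// Illustrative HLS-style kernel: a multiply-accumulate loop with the
// pipelining and unrolling directives the paper mentions.
void scale_add(const float a[16], const float b[16], float out[16]) {
loop_main:
    for (std::size_t i = 0; i < 16; ++i) {
#pragma HLS PIPELINE II=1   // start a new iteration every clock cycle
#pragma HLS UNROLL factor=4 // replicate the loop body 4x in hardware
        out[i] = 2.0f * a[i] + b[i];
    }
}
```

In Vivado HLS, PIPELINE with an initiation interval (II) of 1 lets a new loop iteration begin each clock cycle, while UNROLL replicates the loop body so several operations execute in parallel, at the cost of additional DSP and LUT resources.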

Design of heterogeneous computing cluster
In this study, the cluster includes one master node and one slave node. The slave node is equipped with an Intel(R) Xeon(R) CPU E3-1505M v6 @ 3.00 GHz and a Xilinx Artix-7 @ 200 MHz board. The CPU provides three PCIe bus interfaces, including an x4 PCIe interface connected to the Xilinx Artix-7 FPGA development board. The Artix-7 has 126,800 registers for latching data; 63,400 LUTs for control logic, gate circuits, and selectors; 135 BRAMs for caching small amounts of data; and 240 DSP slices, the internal computing resources for multiplication. This paper uses Xilinx SDSoC 2017.4 as the primary design tool, with Vivado HLS and Vivado compiling C/C++ code into Verilog for the FPGA. The system structure is shown in Figure 2: the radix-2 FFT algorithm is realized inside the FPGA, the input data set is preloaded and sliced on the CPU, and the CPU controls the FPGA's initialization stage. The hardware logic mainly consists of a computing unit, on-chip memory, DMA, and a finite-state-machine controller. The DMA preloads control parameters into on-chip memory during the operation phase and writes the final result data set back to off-chip memory. The on-chip memory stores the results produced by the computing unit and buffers the data read by the DMA; it corresponds to BRAM, FIFO, RAM, and other buffer storage resources, respectively. The computing unit is implemented with the LUTs inside the FPGA, and each computing unit invokes different configurable logic blocks and I/O units.
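A rough host-side model of this flow can be sketched in plain C++ as follows. This is our own illustration, not the paper's SDSoC interface: `core` stands in for the FPGA computing unit, and the fixed-size local buffer models the BRAM that the DMA fills and drains.

```cpp
#include <algorithm>
#include <cstddef>

constexpr std::size_t TILE = 4; // on-chip buffer size (illustrative)

// Stand-in for the FPGA computing unit: processes one tile in place.
void core(float buf[TILE]) {
    for (std::size_t i = 0; i < TILE; ++i)
        buf[i] *= 2.0f;
}

// Host-side flow: slice the preloaded input into tiles, "DMA" each
// tile into the on-chip buffer, run the core, and write results back
// to off-chip memory.
void run(const float* in, float* out, std::size_t n) {
    float onchip[TILE]; // models a BRAM buffer
    for (std::size_t off = 0; off + TILE <= n; off += TILE) {
        std::copy(in + off, in + off + TILE, onchip);   // DMA in
        core(onchip);                                   // compute
        std::copy(onchip, onchip + TILE, out + off);    // DMA out
    }
}
```

In the real system the copies are performed by the DMA engine under the finite-state-machine controller, and the tile size is bounded by the available BRAM.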

FPGA implementation and simulation of the radix-2 FFT algorithm on the heterogeneous platform

FPGA implementation of the radix-2 FFT algorithm
High-speed FFT computation requires multipliers, a large amount of memory, and registers, which is why the FFT algorithm is well suited to FPGA implementation [12]. The I/O architecture of the radix-2 FFT is based on the design of the butterfly processing engine, as shown in Figure 4. It mainly comprises RAM (FPGA internal resources), the butterfly operation module, and the rotation-factor ROM. When executed sequentially, the O(N log N) operations of the FFT require O(N log N) time steps. A common way to parallelize the FFT is to organize the computation into log2(N) stages. The actions of each stage depend on those of the previous stage, naturally leading to pipelining across tasks. This architecture allows several FFTs to be computed simultaneously, with task intervals determined by the architecture of each stage.
Each stage of the FFT also contains significant parallelism, because each butterfly calculation is independent of the other butterfly calculations in the same stage. Each clock cycle performs N/2 butterfly calculations with a task interval of one. The FFT code we implemented is a nested three-level for-loop structure.
The outer for-loop implements one stage of the FFT per iteration. There are log2(N) stages, where N is the number of input samples. In this experiment, each stage of the 16-point FFT performs eight butterfly operations.
The second for-loop performs all of the butterfly operations for the current stage; nested inside it is a third for-loop, each iteration of which performs one butterfly operation. The first line of the innermost loop determines the offset of the butterfly operation. The "width" of the butterfly operations varies with the stage: stage one performs butterflies on adjacent elements, stage two on elements whose indices differ by two, and stage three on elements whose indices differ by four. This difference is computed and stored in an offset variable, whose value changes at each stage. The remaining operations of the innermost loop perform the multiplication by the rotation factor and the addition or subtraction. Two temporary variables hold the real and imaginary parts of the data after multiplication by the rotation factor W, and the variables c and s are the real and imaginary parts of W. We store the real and imaginary parts of the complex numbers in two separate arrays: X_R holds the real values and X_I the imaginary values, so X_R[i] and X_I[i] together represent the complex sample with index i. Finally, the elements of the X_R[] and X_I[] arrays are updated with the results of the butterfly calculation.
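The loop structure just described can be modelled in plain C++ as follows. This is our own self-contained software sketch, not the paper's HLS source: only the names X_R, X_I, c, and s follow the text, and the exact division of work between the middle and inner loops is one of several equivalent organisations.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>

// In-place radix-2 DIT FFT over separate real/imaginary arrays.
// N must be a power of two.
void fft(double X_R[], double X_I[], std::size_t N) {
    const double PI = 3.14159265358979323846;
    // Bit-reversal permutation, so that stage one pairs adjacent
    // elements, stage two pairs elements two apart, and so on.
    for (std::size_t i = 1, j = 0; i < N; ++i) {
        std::size_t bit = N >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) {
            std::swap(X_R[i], X_R[j]);
            std::swap(X_I[i], X_I[j]);
        }
    }
    // Outer loop: one FFT stage per iteration (log2(N) stages).
    for (std::size_t len = 2; len <= N; len <<= 1) {
        std::size_t half = len / 2; // butterfly "width" for this stage
        // Middle loop: the groups of butterflies in this stage.
        for (std::size_t start = 0; start < N; start += len) {
            // Inner loop: one butterfly per iteration.
            for (std::size_t k = 0; k < half; ++k) {
                // Rotation factor W = c + j*s.
                double ang = -2.0 * PI * static_cast<double>(k) / len;
                double c = std::cos(ang), s = std::sin(ang);
                std::size_t a = start + k, b = a + half;
                // Odd-half sample multiplied by the rotation factor W.
                double temp_R = X_R[b] * c - X_I[b] * s;
                double temp_I = X_R[b] * s + X_I[b] * c;
                // Update X_R[]/X_I[] with the butterfly results.
                X_R[b] = X_R[a] - temp_R;
                X_I[b] = X_I[a] - temp_I;
                X_R[a] += temp_R;
                X_I[a] += temp_I;
            }
        }
    }
}
```

The trip counts match the text: for N = 16 there are four stages, and each stage performs N/2 = 8 butterflies in total.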
The middle and inner loops execute different numbers of times depending on the stage, but the total number of butterfly operations per stage is constant. The middle loop has one iteration in stage one, two in stage two, and four in stage three; conversely, the inner loop iterates eight times in stage one, four times in stage two, and only two times in stage three. In each stage the inner loop body therefore executes the same total number of times: eight butterfly operations per stage for a 16-point FFT. First, FFT analysis is performed on the 2 MHz signal. According to the latency estimate above, the calculation takes 145 μs, so the simulation needs to run to about 150 μs. The results are shown in Figure 5; the duration of the whole spectrum is precisely the time the out_valid signal stays high. The test data set used in this experiment consists of vibration signals from six sets of signal collectors, with a sampling frequency of 512 and a sampling point count of 512. The data set is stored in a text file with comma-separated values. The data set is computed with the algorithm library of this paper and with MATLAB's built-in FFT function, and the results are stored in text files. Comparison shows that the results agree with MATLAB's built-in function, verifying the correctness of the algorithm.

Test and analysis of the heterogeneous system
In this section, the performance of the radix-2 FFT algorithm on the heterogeneous acceleration platform is tested and analysed. First, we analyse the speedup effect by testing the overall computation time of the algorithm. Then the overall system computation times of the heterogeneous platform and of the CPU executing the radix-2 FFT algorithm at the same scale are compared. Finally, the resource consumption of the FPGA development board is statistically analysed.

Acceleration platform testing and analysis
In this paper, we compare the performance of the FPGA (Artix-7) and the CPU (Xeon E3-1505M) implementing the radix-2 FFT algorithm at different scales; the results are shown in Figure 7. The system computation time of both schemes increases with the operation scale. At scales smaller than 128 x 128 floating-point numbers for the radix-2 FFT (a single kernel in the FPGA implementation), the FPGA performs worse than the CPU; this is caused by the large amount of RDD generation, computation, and collection in the Spark cluster. At scales larger than 128 x 128 floating-point numbers, however, the FPGA outperforms the CPU, as seen in Figure 8: the heterogeneous acceleration platform is 1.79 times faster than the CPU implementation. In future work we will scale up the cluster and configure each slave node with an FPGA accelerator. In addition, we will optimize other time-consuming algorithms and implement a more refined FPGA acceleration strategy.

Power consumption testing and analysis
In this subsection, we test and analyse the power consumption of the designed acceleration system when computing the radix-2 FFT algorithm. Vivado reports the occupancy of the various resources inside the development board, which we use to analyse the power consumption of the FPGA.
Table 1 shows that BRAM consumption is low, because the system uses register groups to cache the temporary RDD calculation results between computing modules; this cache occupies a large number of register resources, which is why registers are heavily used. The many dual-port read and write operations must be coordinated by the control unit to avoid read-write conflicts, so the LUT resources responsible for control logic and data selection are also heavily used.

Fig. 2. The top-level design framework of the heterogeneous Spark cluster.

Fig. 5. Radix-2 FFT simulation of the 2 MHz signal. We use MATLAB to generate 2 MHz (shown in Fig. 5) and 15 MHz (shown in Fig. 6) sine-wave signals and output them to text files. The text files of the two signals are read separately, and FFT analysis of the two single-frequency signals is performed by running the excitation program.

Fig. 6. Radix-2 FFT simulation of the 15 MHz signal. The FFT simulation of the 15 MHz signal is then run, with the results shown in Fig. 6. The peak of the 15 MHz signal's spectrum is noticeably more centred than that of the 2 MHz signal, which is consistent with theory.