Applying NVIDIA CUDA to video processing in the task of roundwood volume estimation

This paper is devoted to parallel computing. The algorithm for roundwood volume estimation had insufficient performance, so it was decided to port its bottleneck to the GPU. Various GPGPU techniques were analyzed, and NVIDIA CUDA was chosen for the implementation. The results show the high potential of a GPU implementation for improving computational performance: the speedup of the roundwood volume estimation algorithm after porting to the GPU with CUDA is more than 300%. This makes it possible to apply the machine vision algorithm in a real-time system.


Introduction
Nowadays GPGPU (General-Purpose Computing on Graphics Processing Units) technology has gained widespread acceptance [1]. This programming technique is rooted in the data-flow computing concept. Data-flow computing is a parallel computing paradigm that treats the data to be processed as a stream whose elements can be processed independently and hence in parallel [2]. The hardware-software architecture CUDA (Compute Unified Device Architecture), developed by NVIDIA, is an implementation of GPGPU technology. CUDA supports programming in high-level languages (C and C++) to solve complex computational problems in less time thanks to the multi-core processing power of the GPU [3].

Analysis of the parallelization possibilities
The software development part of the R&D project "Hardware-software system for assessment of cubic capacity and type of wood" (grant number 14418/8880 of the FASIE fund) resulted in a set of algorithms for measuring roundwood volume via video analysis. All image processing computations were optimized to run on the CPU. The flow diagram of the algorithm is given in Figure 1.
The sequential implementation does not provide sufficient performance, which prevents the use of the algorithm in real-time data processing [4]. The algorithm can be sped up by adapting it to a parallel architecture. The first step in porting the algorithm to the GPU is profiling it to determine the time-consuming code regions that need to be parallelized [5]. Complete porting of the algorithm is impossible because the GPU has access neither to the memory nor to the input-output devices of the PC, since it is an auxiliary computing tool [6].
A computational experiment was carried out to determine the bottleneck of the algorithm. For that purpose a number of images showing various cases in the process of roundwood volume estimation were sampled (Table 1). The algorithm itself was divided into three basic parts:
- image enhancement,
- background model refreshing,
- detection.
The CPU processing time for each part of the algorithm was measured. The results of the computational experiment are given in Table 2. The experiment showed that the bottleneck of the algorithm is image enhancement, which consumes about 70% of the frame sequence processing time.
Image enhancement consists of a combination of the basic morphological operators, erosion and dilation. The consecutive execution of these non-linear signal processing methods is an extremely effective approach to noise reduction within the developed algorithm [7,8,9]. These morphological operators belong to the group of local transformations and are implemented in sliding mode, by successively shifting a scanning area that contains an odd number of image samples. All pixels within the scanning area are processed according to a specific scenario [10]. The result of the processing is an output image pixel corresponding to the center of the scanning area. The final value of the pixel is computed as the minimum value among all pixels in the area for erosion and the maximum value for dilation [11]. Since each pixel is processed by the same sequence of operations (minimum and maximum search), the algorithm can be implemented effectively on the GPU [3,12].
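The sliding-window minimum and maximum search described above can be illustrated with a minimal sequential C++ sketch. It is not the implementation used in the project: the `Image` type, the function names, the square window, and the policy of skipping window positions outside the image are all assumptions made for illustration. The two outer loops show why the operation parallelizes well, since every output pixel is computed independently from the input image.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Grayscale image stored row by row (hypothetical helper type).
struct Image {
    int width;
    int height;
    std::vector<std::uint8_t> pixels;
};

// Slides a (2*radius+1) x (2*radius+1) scanning area over the image and
// writes the minimum (takeMin == true) or the maximum (takeMin == false)
// of the area to the output pixel at its center. Positions of the area
// that fall outside the image are skipped (assumed border policy).
Image morph(const Image& in, int radius, bool takeMin) {
    Image out{in.width, in.height,
              std::vector<std::uint8_t>(in.pixels.size())};
    for (int y = 0; y < in.height; ++y) {
        for (int x = 0; x < in.width; ++x) {
            std::uint8_t acc = takeMin ? 255 : 0;
            for (int dy = -radius; dy <= radius; ++dy) {
                for (int dx = -radius; dx <= radius; ++dx) {
                    const int yy = y + dy;
                    const int xx = x + dx;
                    if (yy < 0 || yy >= in.height ||
                        xx < 0 || xx >= in.width) {
                        continue;
                    }
                    const std::uint8_t v = in.pixels[yy * in.width + xx];
                    acc = takeMin ? std::min(acc, v) : std::max(acc, v);
                }
            }
            out.pixels[y * in.width + x] = acc;
        }
    }
    return out;
}

// Erosion: minimum over the scanning area.
Image erode(const Image& in, int radius) { return morph(in, radius, true); }

// Dilation: maximum over the scanning area.
Image dilate(const Image& in, int radius) { return morph(in, radius, false); }

// Erosion followed by dilation, one possible consecutive combination of
// the two operators for suppressing small bright noise.
Image suppressNoise(const Image& in, int radius) {
    return dilate(erode(in, radius), radius);
}
```

With `radius = 1` the scanning area is 3 x 3 pixels. In this sketch an isolated bright pixel is removed by `suppressNoise`, while a bright 3 x 3 block in the interior of the image is preserved.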

Implementation of the algorithm on the GPU
The image enhancement algorithm using CUDA consists of the following steps:
1. Selecting the active GPU;
2. Allocating storage for an image in the global memory;
3. Loading the input image into the global memory;
4. Loading the coefficients of the filter window into the constant memory;
5. Forming the structure of the computational grid;
6. Launching the kernel functions that perform the image filtering;
7. Copying the output image from the GPU memory to the host memory;
8. Releasing the allocated memory.
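Step 5 can be illustrated with a short sketch of the index arithmetic, written in plain C++ rather than CUDA so that it runs without a GPU. It assumes one thread per output pixel and a two-dimensional thread block (for example 16 x 16 threads); the paper does not state either choice, and the `Dim2` type and the function names are hypothetical. In CUDA code the same values would be passed as `dim3` grid and block dimensions, and the pixel coordinate would be computed inside the kernel from `blockIdx`, `blockDim`, and `threadIdx`.

```cpp
// Two-dimensional size, standing in for CUDA's dim3 (hypothetical type).
struct Dim2 {
    int x;
    int y;
};

// Number of blocks needed to cover `size` elements with `block` threads
// per block (ceiling division).
int blocksFor(int size, int block) {
    return (size + block - 1) / block;
}

// Grid dimensions that cover a width x height image with the given block.
Dim2 gridFor(int width, int height, Dim2 block) {
    return Dim2{blocksFor(width, block.x), blocksFor(height, block.y)};
}

// Pixel coordinate handled by a thread along one axis, mirroring the
// usual CUDA expression blockIdx * blockDim + threadIdx.
int pixelCoord(int blockIdx, int blockDim, int threadIdx) {
    return blockIdx * blockDim + threadIdx;
}

// Threads in the last row or column of blocks may fall outside the image
// when its size is not a multiple of the block size; they must do no work.
bool insideImage(int x, int y, int width, int height) {
    return x < width && y < height;
}
```

For a hypothetical 640 x 480 frame and a 16 x 16 block, `gridFor` yields a grid of 40 x 30 blocks. For a 641 x 481 frame it yields 41 x 31 blocks, and `insideImage` is what keeps the surplus threads of the edge blocks from reading or writing outside the frame.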
ICBDA 2016
As can be seen from Table 3, the average frame processing speedup is more than 300%. Although the average data transfer time to and from the GPU is 0.2 ms, the achieved speedup clearly demonstrates the efficiency of computing on the GPU.

Conclusion
The following problems were solved while porting the existing sequential algorithm to the GPU:
- The existing algorithm for roundwood volume estimation was analyzed to find the most effective way to parallelize it on the GPU;
- A parallel implementation of the algorithm based on CUDA technology was proposed;
- The performance of the algorithm was improved threefold through parallelization.
This makes it possible to apply the roundwood volume estimation algorithm in real-time systems. Thus the results of the research show the high potential of a GPU implementation for improving computational performance. The speedup of the ported algorithm frees resources for additional functionality in the developed hardware-software system for roundwood estimation. It was decided to use two color digital video cameras, which will make it possible to analyze the major part of the log surface and hence support an external estimation of the timber quality.

Table 3. Results of the computational experiment.

Table 1. Set of the images.

Table 2. Results of the computational experiment.