Performance comparison of a parallel fastICA algorithm on the PLGrid infrastructure

When processing EEG signals, methods for removing artifacts play an important role. One of the most commonly used is independent component analysis (ICA) [1-3]. However, algorithms of this type are computationally expensive, and known ICA implementations rarely support parallel computing or exploit the capabilities provided by the architecture itself. This paper presents a parallel implementation of the fastICA algorithm that uses the available libraries and extensions of Intel processors (such as BLAS, MKL and Cilk Plus) and compares its execution times on two selected architectures in the PLGrid infrastructure (Zeus and Prometheus).


Introduction
Electroencephalography, as a non-invasive diagnostic method, has been used for many years both in medicine and in experimental psychology. Various methods of signal analysis and processing are applicable here [4][5]. The recorded signal of brain activity is easily disturbed by external events, so an important part of working with the EEG signal is cleaning it of artifacts [6].
Another important element of signal analysis is the extraction of features evoked by specific stimuli in ERP (event-related potential) experiments. A frequently used approach that accomplishes both tasks is so-called blind source separation, and independent component analysis (ICA) is one such method [7].
While working with various types of EEG machines, one may notice that the EEG system manufacturer does not always provide the right tools for efficient and desired processing, and external software is often needed to apply the aforementioned method. This prompted the authors to write an application with their own implementation of ICA, integrated with the software used in the Department of Neuroinformatics at Maria Curie-Sklodowska University in Lublin (NetStation) [8][9][10][11].
However, the algorithm itself is computationally expensive (it is an iterative algorithm that uses Newton's approximation method) [12]. Processing a single recording can take many hours, which is bothersome both for the researcher and for the patient waiting for results. Therefore, the next step was to develop a method of time optimization of the selected ICA algorithm using the capabilities of modern parallel architectures. The fastICA algorithm [13] was adopted as the reference in this study. This paper presents a comparison of the time efficiency of the algorithm on two machines made available by PLGrid (Zeus and Prometheus). The main purpose of this research is to develop a faster method of preprocessing EEG data.

Independent Component Analysis and fastICA implementation

The ICA method
The recorded EEG signal is, in fact, a mixture of many signals coming from different sources. Separating the source signals from one another on the basis of these mixtures is called blind source separation (BSS) [12,14].
We can illustrate this problem with the equation

S = WX,

where S ∈ R^(C×M) is the matrix of C components over M samples, W ∈ R^(C×N) is the transition matrix holding the weight vectors between each signal and electrode, and X ∈ R^(N×M) contains the data from N electrodes. The problem is to find a separating matrix W that satisfies this equation. However, it should be noted that no more sources can be found than the number of signal mixtures (N), and that the recovered sources retain a similar shape but not their original amplitude.
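To make the dimensions concrete, the unmixing step S = WX can be sketched as a plain matrix multiplication. This is an illustrative C++ fragment only, not the paper's actual code, which delegates such products to BLAS/MKL:

```cpp
#include <vector>
#include <cstddef>

// Unmixing step S = W * X: W is C x N, X is N x M, so S is C x M.
// Plain triple loop for illustration; in practice this is a single
// optimized GEMM call (e.g. from BLAS/MKL).
std::vector<std::vector<double>> unmix(
    const std::vector<std::vector<double>>& W,   // C x N separating matrix
    const std::vector<std::vector<double>>& X)   // N x M electrode data
{
    const std::size_t C = W.size(), N = X.size(), M = X[0].size();
    std::vector<std::vector<double>> S(C, std::vector<double>(M, 0.0));
    for (std::size_t c = 0; c < C; ++c)
        for (std::size_t n = 0; n < N; ++n)      // loop order keeps rows of X hot in cache
            for (std::size_t m = 0; m < M; ++m)
                S[c][m] += W[c][n] * X[n][m];
    return S;
}
```

With W equal to the identity, the "components" are simply the electrode signals themselves; a non-trivial W mixes rows of X into candidate sources.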
The ICA discussed here is one of the approaches to blind source separation. Unlike other BSS methods, ICA is based on higher-order statistics. It should be noted that the use of statistical methods brings one more disadvantage: if the source signals have a distribution close to normal, the result is ambiguous. ICA relies on an assumption motivated by the central limit theorem: the signal sources are independent components, while a mixture of such signals has a distribution closer to normal. The task of the algorithm is therefore to separate the independent components [12].
For this purpose, the data is preprocessed by centering (making the mean of each signal zero) and whitening (making the variance of each signal equal to 1, which is done by decomposing the covariance matrix into eigenvectors and eigenvalues). The result is a set of uncorrelated signals, which are, however, not yet independent of each other.
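The first of these steps, centering, can be sketched in a few lines of C++ (an illustrative fragment, not the paper's code). The whitening step that follows requires an eigendecomposition of the covariance matrix, which in practice is delegated to LAPACK/MKL routines and is omitted here:

```cpp
#include <vector>
#include <cstddef>

// Centering: subtract each channel's mean so that every row of X
// (one row = one electrode's samples) averages to zero. This precedes
// whitening, whose eigendecomposition is typically left to LAPACK/MKL.
void center(std::vector<std::vector<double>>& X)   // N channels x M samples
{
    for (auto& channel : X) {
        double mean = 0.0;
        for (double v : channel) mean += v;
        mean /= static_cast<double>(channel.size());
        for (double& v : channel) v -= mean;
    }
}
```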
Finally, one needs to find weights (the W matrix) such that the resulting signals have distributions as far from normal as possible. Different measures of non-normality (negentropy and kurtosis) are used, and the weights are modified with Newton approximations based on non-quadratic functions [14]. In our case, the hyperbolic tangent function was used.
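For a single weight vector w and whitened data X, the standard fastICA fixed-point (approximate Newton) update with g(u) = tanh(u) is w⁺ = E[x·g(wᵀx)] − E[g′(wᵀx)]·w, followed by normalization. A minimal C++ sketch of one such update (illustrative only; the paper's implementation batches this over all components using BLAS/MKL):

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// One fastICA fixed-point update for a single weight vector w, with
// g(u) = tanh(u):  w+ = E[x * g(w^T x)] - E[g'(w^T x)] * w, then ||w|| = 1.
// X holds whitened data: N channels x M samples.
std::vector<double> fastica_update(const std::vector<std::vector<double>>& X,
                                   const std::vector<double>& w)
{
    const std::size_t N = X.size(), M = X[0].size();
    std::vector<double> w_new(N, 0.0);
    double g_prime_mean = 0.0;
    for (std::size_t m = 0; m < M; ++m) {
        double u = 0.0;                          // u = w^T x_m
        for (std::size_t n = 0; n < N; ++n) u += w[n] * X[n][m];
        const double g = std::tanh(u);
        g_prime_mean += 1.0 - g * g;             // g'(u) = 1 - tanh^2(u)
        for (std::size_t n = 0; n < N; ++n) w_new[n] += X[n][m] * g;
    }
    g_prime_mean /= static_cast<double>(M);
    double norm = 0.0;
    for (std::size_t n = 0; n < N; ++n) {
        w_new[n] = w_new[n] / static_cast<double>(M) - g_prime_mean * w[n];
        norm += w_new[n] * w_new[n];
    }
    norm = std::sqrt(norm);
    for (double& v : w_new) v /= norm;           // keep the weight vector unit-length
    return w_new;
}
```

Iterating this update until w stops changing (up to sign) yields one independent component; extracting several components additionally requires decorrelating the weight vectors between iterations.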
As previously mentioned, the fastICA algorithm was used, as a very common implementation of ICA. It is relatively stable, converges quickly, and its source code is available. In addition, its design lends itself to parallel computation. It differs from other algorithms in the way the weights are modified [12].

Implementation and data representation
The implementation of the fastICA algorithm presented in this paper is based on the versions found in the open it++ library and in MATLAB. In contrast to those, the implementation discussed here does not reduce the matrix dimensions, which increases the accuracy of the calculations. During subsequent weight modifications, the tanh (hyperbolic tangent) function from the standard mathematical library is used. To prevent so-called cache misses, the data is arranged appropriately and the Intel _mm_malloc compiler function is used.
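The idea behind aligned allocation can be illustrated as follows. This sketch uses std::aligned_alloc (C++17) as a portable stand-in for Intel's _mm_malloc, which the paper actually uses; both return buffers aligned to a chosen boundary so that sample arrays start on cache-line and SIMD-register boundaries:

```cpp
#include <cstdlib>
#include <cstdint>
#include <cstddef>

// Allocate a buffer of doubles aligned to a 64-byte (cache-line) boundary.
// std::aligned_alloc (C++17) is a portable stand-in for Intel's _mm_malloc;
// alignment helps both vector loads/stores and cache behaviour.
double* alloc_samples(std::size_t count)
{
    std::size_t bytes = count * sizeof(double);
    // std::aligned_alloc requires the size to be a multiple of the alignment
    std::size_t rounded = (bytes + 63) / 64 * 64;
    return static_cast<double*>(std::aligned_alloc(64, rounded));
}
```

A buffer obtained this way is released with std::free (or _mm_free for _mm_malloc buffers).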
In the parts of the code where the entire signal is processed, parallel blocks are applied. For the most time-consuming parts of the algorithm, i.e. matrix multiplication and eigenvector calculation, functions from the BLAS and MKL libraries are used.
Thanks to the Intel Cilk Plus extensions for C and C++, it is possible to use array notation and built-in reduction functions (such as finding the maximum of an array) that not only make the code more transparent but, above all, enforce effective vectorization [15].
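As an example of such a reduction, Cilk Plus array notation writes a max-reduction as `__sec_reduce_max(a[0:n])`. Since Cilk Plus has been deprecated, the equivalent vectorizable pattern in plain C++ is a single-pass loop, which modern compilers auto-vectorize (an illustrative sketch, not the paper's code):

```cpp
#include <vector>
#include <algorithm>

// Cilk Plus would express this as: double m = __sec_reduce_max(abs_a[0:n]);
// The plain-loop equivalent below is a single-pass reduction over the array
// that auto-vectorizers handle well (no loop-carried dependence besides max).
double max_abs(const std::vector<double>& a)
{
    double m = 0.0;
    for (double v : a)
        m = std::max(m, v < 0.0 ? -v : v);   // running maximum of |v|
    return m;
}
```

In fastICA, such a reduction appears, for instance, when checking how much the weight vectors changed between iterations to decide on convergence.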

Results
All tests were performed on two architectures:
1. Zeus: Intel Xeon X5650, 2.67 GHz, 12 cores (+12 virtual)
2. Prometheus: Intel Xeon E5-2680, 2.5 GHz, 24 cores (+24 virtual)
Prometheus and Zeus are supercomputers providing computing servers as part of the ACK Cyfronet AGH platform. In the latest release of the TOP500 list, Prometheus was ranked 77th; it has over 53,000 computing cores. Zeus was listed on the TOP500 list twelve times in a row and was for many years the fastest supercomputer in Poland; it provides over 25,000 computing cores.
A test comparing the speed of the fastICA implementation for 1, 10, 100 and 1000 seconds of signal recording was performed. For both architectures, the time performance was compared depending on the number of threads with which the program was run.
Table 1 and Table 2 give the program execution times, in seconds, for all data sets. Figure 1 and Figure 2 show the speed-up of the multithreaded runs of the application on both architectures relative to the single-threaded run. Applying the aforementioned methods (parallel blocks, vectorization, functions from the BLAS and MKL libraries) increased the speed of the calculations compared to the single-threaded version. Figure 3 shows that the parallel implementation is scalable and that larger data sizes yield greater gains. However, the physical number of threads matters: using more threads than the hardware provides (24 for Zeus and 48 for Prometheus) yields no further gains. It can also be seen that efficiency on Zeus decreases after the physical number of threads is exceeded, while on Prometheus it stays at the same level. This may mean that newer architectures cope better with hyperthreading.

Conclusions
The obtained results show that the execution time scales with the size of the problem. It can also be seen that on Prometheus (a machine with a newer architecture) the optimal number of threads increases with the problem size, while on Zeus it is constant (12 threads).
The studies have shown that the architecture and the physical number of cores have a fundamental impact on the speed of the parallel version of fastICA. The newer-generation machine handles hyperthreading better, and it pays off to use the capabilities of the latest generation of processors and parallelism.

Summary
This paper presented a parallel version of the fastICA algorithm that exploits the capabilities of multicore Intel architectures and processors. The algorithm was applied to real EEG data and was run on two machines in the PLGrid infrastructure (Zeus and Prometheus).
In addition, the number of physical cores is important, because efficiency decreases (in the case of Zeus) or stops growing (in the case of Prometheus) when the program is run with more threads than there are physical cores.
The future plan is to integrate our solution with tools for EEG signal processing such as NetStation [8][9][10][11]. It is also worth considering tuning the existing solution to the capabilities of a particular architecture to achieve even better results.
We have experience in parallel computing from modelling cognitive processes occurring in a simulated mammalian cortex [16][17][18][19]. From our point of view, it is interesting whether a methodology for algorithm optimization can be found such that a given architecture achieves its maximum performance.