Kohonen network as a classifier of Polish emotional speech

Speech is one of the main tools of human communication. Many factors, such as age, emotions, gender and pitch of the voice, can influence the features of speech. Obviously, information conveyed by voice intonation carries more than the textual meaning alone: the same sentence pronounced in two different ways can have two completely different meanings. This paper describes Kohonen networks as a classifier of Polish emotional speech. The use of the Discrete Wavelet Transform (DWT) as well as an innovative approach to scaleogram processing is also presented. The Mexican Hat wavelet and the Haar wavelet were used in the research. All simulations were carried out in MATLAB 2016 with the Neural Network Toolbox; in total, more than 9000 simulations were performed. Three different speech databases were used. One of them was prepared by professional actors (four women and four men) and contains 240 WAV files; the other two are the results of the authors' own work. The structures of the Kohonen networks used depend on the speech signal decomposition level and the scaleogram division. The following emotional states were considered: anger, joy, sadness, boredom, fear and the neutral state. The achieved results were between 68% and 80%, depending on the wavelet used, the speech signal and the signal decomposition level.


Introduction
Recognition of a speaker's emotional state based on speech signal processing is a relatively new issue, but its significance has been increasing rapidly. One of the reasons for this trend is the burgeoning development of systems based on Brain-Computer Interfaces, as well as Virtual Reality (VR) environments [1].
Many factors, such as the age, emotions, language or gender of a speaker, may have a great influence on the features of the speech signal [2,3]. Obviously, information conveyed by the intonation of the voice has more than a literal meaning.
The biggest problem in emotional speech recognition systems is the number of emotional states; therefore, developing an application which correctly identifies most emotions is not trivial. In the research so far, the following emotional states are most often considered: anger, joy, sadness, fear, boredom and the neutral state [4,5].
Figure 1 presents oscillograms of the same expression, in the semantic sense, for three emotional states: the neutral state (top), joy (middle) and anger (bottom). As can easily be seen, the appearance of a particular emotion changes not only the frequency range but also the shape of the oscillogram.
In this article, an innovative system for Polish emotional speech signal processing is described. The system is based on the discrete wavelet transform and scaleograms, and Kohonen networks are used as the classifier.
The article is divided into four parts. The first characterises the discussed matter. The second describes the databases of Polish emotional speech that were used. The third describes the signal processing algorithm, the research methods and the parameters used for Polish emotional speech. The final part presents the obtained results and suggestions for improving the adopted research methods.

Analysis of issues
Nowadays, researchers still struggle with developing a proper and useful model of emotion. The two most popular models in the literature are the James-Lange and Plutchik models [3,4,6]. The first assumes that behavioural and somatic changes are interpreted as an emotion, which then triggers a reaction. According to the James-Lange model, the sentence 'I am afraid of murder because I am running' is true, rather than 'I am running because I am afraid of murder' [7]. The second model was proposed in 1960 by Robert Plutchik. He introduced eight basic emotional states related to behaviours essential for survival: joy, fear, trust, sadness, surprise, anger, disgust and anticipation. All other states arise from the basic ones [8].
As mentioned, the most complicated issue in Polish emotional speech recognition is the number of emotional states which should be detected. It should be emphasised that an average person is able to recognise another person's emotional state correctly in only 60% of cases [9]. Several methods of emotion detection in the speech signal, based on the Polish language, have been described in the literature. These methods rely on Support Vector Machines (SVM) [10] or the k-Nearest Neighbour algorithm [4]. However, all of the authors conclude that the obtained results are not fully satisfactory and that the signal processing methods used need to be improved. All of the mentioned studies focused on the six most popular emotional states: anger, joy, boredom, fear, the neutral state and sadness [4,10]. Time-frequency methods are among the most popular speech signal processing tools [11]. They make it possible to estimate the speech signal spectrum over short, finite intervals based on window functions and the overlapping method [11]. The Short-Time Fourier Transform (STFT) and the Discrete Wavelet Transform (DWT) play a particular role in these cases [2,4,10,12]. In this article the use of the latter is described in detail.

Description of used databases
The Berlin Database of Emotional Speech (BES) [13] is commonly used in research connected with emotional speech signal processing. However, when Polish speech is considered, researchers tend to use their own databases or the one prepared by the Medical Electronics Division of the Lodz University of Technology [14]. That base was prepared by professional actors (four women and four men), and the collected files were recorded in six emotional states. The whole database contains 240 recordings sampled at 44.1 kHz with 16-bit resolution. This database includes the following statements: 'Johnny went to the hairdresser today', 'They bought a new car today', 'I've stopped shaving from today on', 'His girlfriend is coming here by plane' and 'This lamp is on the desk today'. It was the first database used in the conducted research.
The second database was prepared at the Lublin University of Technology and contains the same statements as the first one. This collection was recorded in an acoustic chamber. The recordings involved people aged between 20 and 30 who were not involved in acting. The entire database contains 306 recordings and its structure is presented in Table 1. Unfortunately, it was not possible to collect the same number of recordings for each emotional state for both sexes. The third database was also prepared at the Lublin University of Technology, but the recordings were collected not in an acoustic chamber but in city environments (a lecture room, a street, a shop). Three women and six men aged 22-31 were involved in this study. They uttered the following five sentences: 'I need to talk to you', 'Is this what you really think?', 'You don't understand anything', 'Peter bought a new bike' and 'This package is in Krakow already'. The whole database contains 266 recordings and its structure is shown in Table 2. In the case of the second and third databases, the assignment of recordings to a specific group of emotions was verified by 94 respondents.

Conducted researches
In the conducted research, the following emotional states were considered: anger, joy, sadness, boredom, fear and the neutral state. All simulations were carried out in MATLAB 2016 with the Neural Network Toolbox. The number of experiments exceeded 9000.

Wavelet Transformation
The Continuous Wavelet Transform (CWT) was developed by Jean Morlet and Alex Grossmann. For a one-dimensional signal $S(t)$ it has the following form [15,16]:

$$CWT(a,b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} S(t)\, \psi^{*}\!\left(\frac{t-b}{a}\right) dt$$

where: $*$ denotes the complex conjugate, $a$ ($a>0$) is the scale parameter, $b$ is the offset parameter and $\psi$ is the mother wavelet. The Continuous Wavelet Transform has many advantages, but in speech signal processing its discrete form is used more often due to its simplicity [17]. The Discrete Wavelet Transform (DWT) is defined as follows [17]:

$$DWT(a,b) = \frac{1}{\sqrt{a}} \sum_{k} S(k)\, \psi^{*}\!\left(\frac{k-b}{a}\right)$$

where: $S(k)$ denotes the input signal, $a$ ($a>0$) is the scale parameter, $b$ is the offset parameter and $\psi$ is the mother wavelet.
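The paper's experiments were run in MATLAB; as a language-neutral illustration of the discrete decomposition described above, the following is a minimal Python sketch of a multi-level DWT with the Haar wavelet (function names are ours, not from the paper), where each level halves the approximation coefficients and produces one set of detail coefficients:

```python
import math

def haar_dwt_step(signal):
    """One level of the Haar DWT: pairwise scaled sums give the
    approximation (low-pass) coefficients, pairwise scaled
    differences give the detail (high-pass) coefficients."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal) - 1, 2)]
    return approx, detail

def haar_decompose(signal, level):
    """Multi-level decomposition: the Haar step is applied
    repeatedly to the approximation coefficients."""
    details = []
    approx = list(signal)
    for _ in range(level):
        approx, d = haar_dwt_step(approx)
        details.append(d)
    return approx, details
```

For example, `haar_decompose([1, 2, 3, 4, 5, 6, 7, 8], 2)` yields the level-2 approximation `[5.0, 13.0]` plus two lists of detail coefficients, one per level.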
One of the advantages of the Discrete Wavelet Transform, compared for example to the Fourier transform, is that the DWT provides accurate and uninterrupted time information, which is a significant enhancement in signal processing [18].
Two types of mother wavelet were considered in the research: the Haar wavelet and the Mexican Hat wavelet. The first is defined as follows [19]:

$$\psi(t) = \begin{cases} 1 & 0 \le t < \frac{1}{2} \\ -1 & \frac{1}{2} \le t < 1 \\ 0 & \text{otherwise} \end{cases}$$

The Mexican Hat wavelet has the following form [20]:

$$\psi(t) = \frac{2}{\sqrt{3}\,\pi^{1/4}} \left(1 - t^{2}\right) e^{-t^{2}/2}$$

A sample graph of the above-mentioned function is shown in Figure 2.
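The two mother wavelets can be evaluated directly; below is a short Python sketch using the standard normalised Mexican Hat (Ricker) form and the textbook Haar definition (the exact normalisation in the paper's references [19,20] may differ):

```python
import math

def mexican_hat(t):
    """Mexican Hat (Ricker) mother wavelet: the negative second
    derivative of a Gaussian, in its commonly used normalised form."""
    c = 2.0 / (math.sqrt(3.0) * math.pi ** 0.25)
    return c * (1.0 - t * t) * math.exp(-t * t / 2.0)

def haar(t):
    """Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    if 0.0 <= t < 0.5:
        return 1.0
    if 0.5 <= t < 1.0:
        return -1.0
    return 0.0
```

The Mexican Hat peaks at `t = 0` and crosses zero at `t = ±1`, which produces the characteristic "sombrero" shape referred to in Figure 2.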

Speech signal processing scheme
The biggest challenge in the conducted research was the preparation and processing of the data, which was divided into several steps. The first step was noise reduction of the input signals; to fulfil this task, the VOICEBOX: Speech Processing Toolbox for MATLAB was used. The second step was normalisation of the speech signal values to the range [-1, 1]. These two steps constituted the pre-processing phase. In the next step the scaleogram was created: to obtain the speech signal spectrum, the discrete wavelet transform was used. An example of signal decomposition is shown in Figure 3.
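The normalisation step described above can be sketched in a few lines of Python (the paper itself used MATLAB; peak normalisation is our assumption about how the [-1, 1] scaling was done):

```python
def normalise(signal):
    """Scale a speech signal to the range [-1, 1] by dividing by
    its peak absolute value, as in the pre-processing step above."""
    peak = max(abs(x) for x in signal)
    if peak == 0:
        return list(signal)  # all-zero signal: nothing to scale
    return [x / peak for x in signal]
```

For example, `normalise([2, -4, 1])` returns `[0.5, -1.0, 0.25]`, so the loudest sample maps exactly to the edge of the range.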
In the next step, features were extracted from the scaleograms. The main issue in this phase was the preparation of an input vector for processing by the Kohonen networks. At the beginning of the feature extraction process, all scaleograms were transformed into grey-scale images. The next step was a transformation to binary images: all values above a specific threshold were replaced by 1, the other ones by 0. To select the threshold, all values between 50 and 200 were tested; the best results were obtained at 100.

The next step was the division of the scaleogram into several subareas. The number of subfields depended on the speech signal decomposition level. An example of scaleogram division is shown in Figure 4. The commonly used k-Nearest Neighbours algorithm was applied in this research to determine the best level of signal decomposition. The fastest form of the k-NN algorithm is its 1-NN version, in which the unknown tested sample is assigned to the same group as its closest neighbour. The results of the Statlog project [21], in which several classifiers were compared, showed that in 75% of the k-NN tests the best results were achieved by the 1-NN version; this version was therefore also used here. The achieved results are shown in Figure 5.

Based on the above-mentioned algorithm, the 7th and 9th speech signal decomposition levels were used. It can easily be noticed that the results achieved below the 7th level of decomposition were unsatisfactory, while the processing time above the 9th level grew rapidly. The number of scaleogram subareas was a multiple of the level of decomposition. The best effectiveness was achieved for a division into 210 subareas at the 7th decomposition level and into 360 subareas at the 9th level. The last step in the processing of the Polish emotional speech signal, before classification, was summing the values in each subarea. The whole processing scheme is illustrated in Figure 6.
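The feature extraction steps above (thresholding the grey-scale scaleogram, dividing it into subareas, and summing each subarea) can be illustrated with a small Python sketch; the function names and the representation of the image as a list of rows are ours, and the sketch assumes the image dimensions divide evenly into the requested grid:

```python
def binarise(image, threshold=100):
    """Threshold a grey-scale scaleogram (values 0-255) into 0/1;
    100 gave the best results in the conducted research."""
    return [[1 if v > threshold else 0 for v in row] for row in image]

def subarea_sums(binary, rows, cols):
    """Split a binary image into rows x cols subareas and sum the
    ones in each, producing the input vector for the classifier."""
    h, w = len(binary), len(binary[0])
    rh, cw = h // rows, w // cols  # subarea height and width
    vector = []
    for r in range(rows):
        for c in range(cols):
            total = sum(binary[y][x]
                        for y in range(r * rh, (r + 1) * rh)
                        for x in range(c * cw, (c + 1) * cw))
            vector.append(total)
    return vector
```

With, say, a 7th-level decomposition the grid would be chosen so that `rows * cols == 210`, matching the 210-element input vector described above.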

Kohonen Networks Architecture and achieved results
Kohonen networks belong to the group of Self-Organising Maps. The main difference between this kind of neural network and one-way (feed-forward) artificial neural networks is that the correct output cannot be defined before the training process (a priori). The main aim of Kohonen networks is to organise multidimensional information in such a way that it can be presented and analysed in a space with a smaller number of dimensions. The map was constructed in such a way that for one input value there were four corresponding neurons in the map, so the size of the map was four times bigger than the input vector. The Kohonen networks work as follows:
1. The inputs are connected to all nodes in the map.
2. Each node stores a weight vector of the same size as the input vector.
3. Each node calculates its activation level as the scalar product of its weight vector and the input vector.
4. The node with the highest activation level is the winner and may update its weight vector.
5. Nodes in the winner's neighbourhood may also update their weight vectors.

In the conducted research, a Euclidean distance weight function was used as the neighbourhood function. The number of epochs was set to 1000. As the training set, 50% of the recordings from the first database were used. The achieved results are presented in Tables 3 to 5. It can be noticed that strong emotions, such as fear or anger, were recognised correctly more often than the weak ones. Moreover, regardless of the database used, fear was identified with the highest efficiency and sadness with the worst. The justification for these results should be sought in the appearance of the speech signal spectrum, which for sadness is quite similar to that for boredom and the neutral state; therefore, these three emotions were often confused. The opposite situation occurs in the case of fear, whose spectrum is clearly different from the spectra of the other emotional states. Over all data, the average effectiveness of classification was almost 76%.
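The training procedure listed above can be sketched as a single update of a one-dimensional map in Python. This is a minimal illustration following the paper's description (dot-product winner selection, Gaussian-shaped neighbourhood decay); the function name, the 1-D map layout and the learning-rate handling are our assumptions, not the MATLAB toolbox implementation actually used:

```python
import math

def train_step(weights, x, lr, radius):
    """One update of a 1-D Kohonen map: pick the winner by the
    highest scalar product with the input, then pull the winner
    and its neighbours toward the input vector."""
    # Step 3: activation = scalar product of weight and input vectors
    activations = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    # Step 4: the node with the highest activation wins
    winner = max(range(len(weights)), key=lambda i: activations[i])
    # Step 5: winner and its neighbours update their weight vectors
    for i, w in enumerate(weights):
        d = abs(i - winner)
        if d <= radius:
            # Gaussian neighbourhood factor: 1 at the winner, decaying with distance
            h = math.exp(-(d * d) / (2.0 * max(radius, 1) ** 2))
            weights[i] = [w_j + lr * h * (x_j - w_j) for w_j, x_j in zip(w, x)]
    return winner
```

Repeating such steps over many epochs (1000 in the conducted research), while shrinking `lr` and `radius`, makes regions of the map specialise in particular input patterns, here, particular emotional states.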

Conclusions
The conducted research showed that the identification of emotion in a speech signal is not a trivial issue. There are few publications directly related to the possibilities offered by spectrographic methods and Kohonen networks in the identification of Polish emotional speech. The scaleogram division and the determination of the energy in its subfields allowed the creation of an appropriate input vector for the Kohonen networks. The conducted research and the obtained results, as well as preliminary results of subsequent experiments, suggest that the proposed method of speech signal processing can be universal and that it is possible to uniquely identify the speaker's emotional state.