Regional Language Speech Emotion Detection Using a Deep Neural Network

Speaking is the most basic and efficient mode of human communication. Emotions help people communicate and understand others' viewpoints by transmitting sentiments and providing feedback. The basic objective of speech emotion recognition is to enable computers to infer human emotional states such as happiness, anger, and contempt from voice cues. Effective methods based on Mel-frequency cepstral coefficients have been proposed for this problem. Mel-frequency cepstral coefficient (MFCC) features and audio-based textual features are extracted from the speech signal, and hybrid textural features are extracted from the video. Speech emotion recognition is used in a variety of applications such as voice monitoring, online learning, clinical studies, deception detection, entertainment, computer games, and call centres.


Introduction
The purpose of emotional speech recognition is to use a person's voice to automatically assess their emotional or physical state. During speech, air moves from the lungs through the trachea to the larynx, vibrating the vocal cords and producing speech signals [9] [7]. Through spoken interaction, people transmit their underlying intention via paralinguistic features such as emotion, intonation, and style. This technology has a bright future and is critical for natural language understanding [4]. Empathic and natural human-computer interaction requires the ability to perceive emotions [11] [12]. Speech emotion recognition (SER) [12] has attracted considerable academic interest in recent years, thanks to the rapid growth of conversational agents such as Siri, Alexa, and Cortana. Emotions aid communication and understanding by transmitting sentiments and providing feedback to others [13]. The human voice provides a natural and intuitive interface for robot communication, and it is therefore commonly used in robots that interact with humans [2]. The ability of computers to understand human emotional states such as joy, anger, and disgust from speech signals is a fundamental goal of speech emotion recognition [12]. In recent years, a variety of viable solutions to this problem have been offered [14] [15] [6]. Speech emotion recognition is utilised in many applications, including voice surveillance, e-learning, clinical studies, lie detection, entertainment, computer games, and call centres [7].

Autonomous speech emotion recognition
In essence, autonomous speech emotion recognition systems use a computer to model the cues that carry human emotion, such as accentuation, intonation, and pauses, and match spectrum-based properties of these cues to the target emotions. A speech emotion recognition system is made up of three phases at its core: speech data preprocessing, emotion feature extraction, and emotion classification [16] [7] [28]. Consequently, two critical components of emotion detection are a sophisticated classification architecture and speech emotion features that capture the crucial information [7]. Numerous models for audio emotion identification now exist, involving both machine learning and deep learning [17] [18] [19] [7]. The classification process begins with feature extraction; the quality and number of features employed determine how well a classification system performs, so feature engineering is a key stage of classification [8]. Speech emotions have been classified using hidden Markov models, support vector machines, deep belief networks, convolutional neural networks (CNN), and long short-term memory networks (LSTM) [12] [1]. Acoustic characteristics of speech are extracted to identify emotions, and many types of machine learning approaches are used to learn the relationship between the extracted speech features and predefined emotion tags [7]. Designing classifiers that generalise across application situations and acoustic conditions remains challenging.

Thanks to the rapid growth of artificial intelligence, emotion detection has been applied in a variety of domains, including smart homes, travel recommendation systems, and health monitoring. Emotions can be observed externally, through sight, speech, and gestures, and internally, through heart rate, respiration, blood pressure, body temperature, EEG signals, and so on. Because building speech and visual datasets is straightforward and intuitive, speech and visual characteristics are commonly employed in emotion identification [3]. Intelligent services such as chatbots, psychological diagnosis aids, intelligent healthcare, sales advertising, and intelligent entertainment address not only the fulfilment of services but also the humanization of the human-computer link [5]. Modelling human emotions in speech is difficult for two key reasons: 1) because of their abstraction, human emotions may be treated as noise and rejected by many existing speech recognition algorithms; 2) in general, human emotion can only be recognised at certain points in a long utterance [21] [4]. Nonverbal sounds can effectively help the brain discriminate emotional expression when the human brain analyses emotional speech [22] [5]. The automated identification and appraisal of human emotions is one of the most active research areas in fields spanning from biomedical engineering and psychophysiology to computer engineering and artificial intelligence [23] [7].
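To make the three phases concrete, the following is a minimal sketch of such a pipeline in Python. The libraries (librosa, scikit-learn), the 13-MFCC summary feature, and the RBF-kernel SVM are illustrative assumptions, not the method proposed later in this work.

```python
# Hypothetical sketch of the three-phase SER pipeline:
# preprocessing -> feature extraction -> emotion classification.
import numpy as np
import librosa
from sklearn.svm import SVC

def preprocess(path, sr=16000):
    """Load speech, resample to 16 kHz, and trim leading/trailing silence."""
    signal, _ = librosa.load(path, sr=sr)
    signal, _ = librosa.effects.trim(signal, top_db=25)
    return signal

def extract_features(signal, sr=16000):
    """Summarise the utterance with the mean of 13 MFCCs (illustrative only)."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)

def train_classifier(paths, labels):
    """Fit an SVM on utterance-level feature vectors."""
    X = np.vstack([extract_features(preprocess(p)) for p in paths])
    return SVC(kernel="rbf").fit(X, labels)
```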

Literature Review
Weighted Fusion and Consistent & Random fusion algorithms were suggested by Sheng Zhang et al. [1]; these are adaptable and appropriate for tasks requiring several modalities. Wisha Zehra et al. [2] investigated an ensemble learning approach, which proved highly useful in developing an emotion identification system for robots that deal with consumers from all over the world. Chen Guanghui et al. [3] adopted a multi-modal emotion identification approach that successfully distinguishes similar classes and fuses speech and visual information, resulting in improved multi-modal emotion recognition performance. Chenghao Zhang et al. [4] employed an emotion embedding autoencoder capable of learning strong emotional information from labels. Jia-Hao Hsu et al. [5] studied speech emotion detection in affective conversations containing nonverbal vocalisations; their approach is useful for recognising both positive and negative emotions. Transfer subspace learning was utilised by Na Liu et al. [6] to solve the unsupervised cross-corpus speech emotion recognition (SER) problem. Mehmet Bilal Er and colleagues [7] showed that a novel hybrid architecture based on acoustic and deep features improves classification accuracy. Sofia Kanwal et al. [8] utilised a clustering-based genetic algorithm that can distinguish between different emotions.

Emotional Speech Databases
The success of speech emotion recognition depends on the naturalness of the data. The Danish Emotional Speech corpus (DES) and the Berlin Emotional Database (EMO-DB) are two public databases, and further databases are available for Spanish, Slovenian, French, and English emotional speech. Only a few databases contain authentic emotions; the majority consist of acted emotional speech. In terms of credibility, three types of databases are used in SER research. Type one is acted emotional speech with human labeling, gathered by having an actor speak with a predetermined emotion; strong objections to the use of acted emotions have surfaced recently, since the range of variation and the recognition accuracies differ between acted and spontaneous samples [15]. Type two is genuine emotional speech with human labeling, which comes from real-world systems (for example, contact centres). Type three is induced emotional speech, in which self-report is used instead of labeling.

Speech Emotion Recognition
This chapter presents the methods used for emotion identification from Marathi speech and the databases used to evaluate them. The objective is, first, to identify acoustic emotion units suited to real-time applications; second, to identify candidate acoustic features for emotion detection that can be extracted quickly and automatically; third, to evaluate a sound technique for selecting the features most relevant to a given goal; and, finally, to choose a fast yet accurate classification algorithm. The approach used for these steps, in both the training and test phases, is therefore detailed in this chapter. Evaluation experiments are conducted on the Marathi-language database of acted emotions so that the conclusions are as general as possible. The outcomes of the experiments are discussed in the next chapter.
A fundamental challenge in speech emotion detection is defining a set of core emotions that an automatic emotion recognizer can classify. Languages encode a large number of emotional states encountered in everyday life; a typical inventory contains around 300 of them. Classifying such a large variety of emotions, however, is extremely difficult. Emotion is therefore commonly broken down into core emotions, much as any colour can be broken down into a few primary hues. The basic emotions are anger, contempt, fear, pleasure, sadness, and surprise [2]; these are the most visible and recognisable feelings in our lives. Table 1 depicts the strong relationship between mood and a few speech features.

Automatic Speech Emotion Recognition
Speech emotion recognition systems use a person's speech to automatically detect his or her emotional state. The generation process of the speech signal is analysed, and features carrying emotional information are extracted and used to identify emotional states. The components of the speech emotion system are shown in Figure 1. Speech emotion recognition resembles general pattern recognition: the steps found in a pattern recognition system are also found in a speech emotion recognition system. For training and testing, the speech emotion recognition system consists of the following stages: emotional speech input, preprocessing, feature extraction, feature normalization, classification, and recognised emotional output [2].
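As an illustration of the feature normalization stage, the sketch below applies z-score normalization with statistics estimated on the training set only. The use of scikit-learn's StandardScaler and the synthetic feature matrices are assumptions made for the example, not details from the proposed system.

```python
# Minimal illustration of feature normalization: per-feature zero mean and
# unit variance, fitted on training data and reused unchanged at test time.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 26))   # e.g. 120 utterances x 26 feature statistics (synthetic)
X_test = rng.normal(size=(60, 26))

scaler = StandardScaler()
X_train_norm = scaler.fit_transform(X_train)   # statistics estimated on training data only
X_test_norm = scaler.transform(X_test)         # the same statistics applied to test data
```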

Database Creation
The speech emotion recognition system is evaluated on the basis of the naturalness of the database used as its input. If a poor database is given as input, the system may reach incorrect conclusions. The database used as input to the speech emotion detection system may contain real or acted emotions; it is more realistic to employ a database compiled from real-world scenarios [15]. The first step in speech emotion recognition is to divide the audio input signal into meaningful units, after which acoustic measurements of those units are used to extract the actual features. The units are generally medium-length, linguistically motivated time intervals such as phrases or utterances. Although the choice of unit is clearly important, it has received little attention; neither the division into utterances nor the assumption of a constant emotion over each unit is straightforward. In general, a good emotion unit must meet a set of conditions. In particular, it should be:
1. Well defined, so that it can be extracted consistently.
2. Long enough that statistical functions can be applied to compute features reliably.
3. Short enough that the emotional acoustic characteristics remain consistent throughout the segment.
4. Consistent with the labeling of the training database.
Speech samples used in training, testing, and applications should all follow the same set of criteria, i.e., they must have the same properties. The Marathi Database is created using these guidelines.
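A possible way to obtain such units automatically is sketched below: non-silent regions are detected first, and over-long regions are split so that every unit falls inside an assumed duration range. The librosa-based implementation and the 1-3 second bounds are illustrative choices, not values taken from the Marathi database.

```python
# Hedged sketch: cut a recording into emotion units that respect the
# length criteria listed above (well defined, not too short, not too long).
import math
import librosa

MIN_LEN, MAX_LEN = 1.0, 3.0   # assumed duration bounds in seconds (not from the paper)

def emotion_units(path, sr=16000):
    """Return a list of waveform segments usable as emotion units."""
    y, _ = librosa.load(path, sr=sr)
    units = []
    for start, end in librosa.effects.split(y, top_db=30):   # non-silent intervals
        dur = (end - start) / sr
        if dur < MIN_LEN:
            continue                          # too short to carry stable emotional cues
        n_parts = math.ceil(dur / MAX_LEN)    # split over-long regions into equal chunks
        step = (end - start) // n_parts
        for i in range(n_parts):
            units.append(y[start + i * step : start + (i + 1) * step])
    return units
```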

Feature Evaluation
A multi-algorithm technique for detecting emotion from audio signals is presented. The proposed MFCC- and Discrete Wavelet Transform-based algorithms are used to extract emotional information from speech data, supported by characteristics derived from pitch and formant frequency. Pitch contour features such as local maxima, local minima, frequency distance, temporal distance, and slope between neighbouring local extrema are computed for each frame of a speech sample. In addition, the first four formant frequencies are determined. MFCC is a well-established technique for analysing speech signals; it is based on a linear cosine transform of a log power spectrum on a nonlinear Mel frequency scale and describes the short-term power spectrum of a sound. As a complementary approach, the DWT is used to decompose the input speech signal into approximation and detail coefficients; a fourth-level decomposition with db4 wavelets is used to derive wavelet characteristics for each input utterance. An SVM classifier is used to compare the extracted features with a set of reference features. The database contains Marathi speech samples for each of the six emotions.
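The following sketch shows how the feature extractors described above could be combined in practice. The libraries (librosa, PyWavelets, scikit-learn), the whole-utterance LPC formant estimate, and the parameter values are illustrative assumptions, not the implementation of the proposed system.

```python
# Hedged sketch: MFCC statistics, level-4 db4 wavelet sub-band energies,
# F0 (pitch) statistics, and crude LPC-based formant estimates, pooled into
# one vector per utterance and fed to an SVM.
import numpy as np
import librosa
import pywt
from sklearn.svm import SVC

def mfcc_features(y, sr):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def wavelet_features(y):
    coeffs = pywt.wavedec(y, "db4", level=4)              # 4th-level db4 decomposition
    return np.array([np.sqrt(np.mean(c ** 2)) for c in coeffs])   # sub-band energies

def pitch_features(y, sr):
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)  # frame-wise fundamental frequency
    f0 = f0[~np.isnan(f0)]
    if f0.size == 0:
        return np.zeros(4)
    return np.array([f0.mean(), f0.std(), f0.min(), f0.max()])

def formant_features(y, sr, n_formants=4):
    # crude whole-utterance LPC-root formant estimate (a simplification)
    a = librosa.lpc(y, order=2 + sr // 1000)
    roots = [r for r in np.roots(a) if np.imag(r) > 0]
    freqs = sorted(np.arctan2(np.imag(roots), np.real(roots)) * sr / (2 * np.pi))
    freqs = [f for f in freqs if f > 90]                   # discard near-DC roots
    freqs += [0.0] * n_formants                            # pad if fewer roots were found
    return np.array(freqs[:n_formants])

def utterance_vector(path):
    y, sr = librosa.load(path, sr=16000)
    return np.concatenate([mfcc_features(y, sr), wavelet_features(y),
                           pitch_features(y, sr), formant_features(y, sr)])

# Illustrative usage with hypothetical training lists:
# clf = SVC(kernel="rbf").fit(np.vstack([utterance_vector(p) for p in train_paths]), train_labels)
```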

Database:
The Emotional Prosody Speech corpus provided our data. This corpus comprises continuous Marathi utterances produced by 6 speakers (3 female, 3 male) in 6 emotions: happiness, anger, neutral, fear, sorrow, and boredom. For the examination of distinct emotions, a corpus of 180 utterances was recorded; for each of the six emotions, each speaker was given a set of five sentences to utter. Recording: The recording was made with an electric microphone in a partly sound-treated environment at 16 kHz/16 bit, with the distance between the lips and the microphone set to about 30 cm. Listening Test: All of one speaker's continuous sentence files were first randomized and then presented to ten naive listeners, who were asked to rate the emotion; the process was repeated for all speakers, and the responses were divided into six categories: neutral, happiness, anger, grief, fear, and surprise. All of the listeners were educated and aged 18 to 28. For this study, only those utterances for which at least 80% of the listeners recognised the intended emotion were selected.
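The 80% agreement criterion from the listening test can be expressed as a simple filter; the function below is an illustrative sketch with hypothetical variable names.

```python
# Sketch of the listening-test filter: keep an utterance only if at least 80%
# of the listeners assigned it the intended emotion.
def keep_utterance(listener_labels, intended_emotion, threshold=0.8):
    """listener_labels: emotion labels given by the naive listeners (hypothetical data)."""
    agreement = listener_labels.count(intended_emotion) / len(listener_labels)
    return agreement >= threshold

# Example: 9 of 10 listeners heard "anger" in an utterance recorded as angry.
print(keep_utterance(["anger"] * 9 + ["fear"], "anger"))   # True (0.9 >= 0.8)
```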

Acoustic Analysis of Emotions:
The acoustic characteristics of the voice, such as intensity, pitch, and duration, are also influenced by emotion. Acoustic analysis of the sentences is performed; the spectrograms of one of the sentences are shown in Figures (4.1-4.6). Both prosody-related and spectral variables were taken into account in our study of emotions. Pitch, perceived as tone height, corresponds acoustically to the fundamental frequency. Automatic pitch estimation is a difficult problem: vocal tract resonances and short-term disruptions in the speech stream can obscure pitch detection [17]. The figures below show pitch contour curves for each emotion for one of the Marathi sentences, "KOKILACHA AWAZ KHUP MADHUR ASTO".
Pitch contour curves of utterances expressing anger rise and fall towards the beginning of the sentence and descend towards its end (fig 2) [28]. For fear, the pitch rises at the start of the sentence and then stays level before falling at the end (fig 3). Pitch contours of utterances in a happy mood show a hold pattern at the start of the sentence and rise and fall at its end (fig 4) [28]. In a neutral mood, the pitch rises and falls at the beginning of the sentence and lowers at its end (fig 5). The F0 curve of the sad emotion falls and rises at the start of sentences (fig. 6), whereas the F0 curve of the bored emotion falls and rises at the start of sentences and declines at the end (fig. 7). With the exception of surprise, the intensity curve was found to vary in proportion to the pitch for most emotions. According to the pitch contours, the duration of sentences spoken with different emotions varies: boredom has the longest duration of 2.8 seconds, while anger has the shortest duration of 1.6 seconds (fig. 6).
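Pitch contour and intensity curves of the kind analysed above can be extracted, for instance, with librosa's pYIN tracker and frame-wise RMS energy; the file name and parameter values in the sketch below are placeholders, not artefacts of this study.

```python
# Hedged sketch: extract and plot the F0 (pitch) contour and an intensity curve.
import numpy as np
import librosa
import matplotlib.pyplot as plt

y, sr = librosa.load("kokilacha_awaz.wav", sr=16000)        # hypothetical file name
f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)   # frame-wise fundamental frequency
rms = librosa.feature.rms(y=y)[0]                           # frame-wise intensity (RMS energy)

plt.plot(librosa.times_like(f0, sr=sr), f0, label="pitch contour (Hz)")
plt.plot(librosa.times_like(rms, sr=sr), rms * 1000, label="intensity (scaled)")
plt.xlabel("time (s)"); plt.legend(); plt.show()
```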
To evaluate single features, the information gain of each feature with respect to the emotion classes was estimated on the database [28]. When all of the features are combined, the system's performance is found to be good, which suggests that speech emotion recognition is better handled by a group of multiple descriptors. Moreover, our methodology compares favourably with the ability of a human listener to classify the respective signals. The classifier gives better results when a fusion of all the features is used, as shown in Tables 5 and 6; Table 4 shows the confusion matrix for recognition using fusion features. In future work, we aim to test system performance on spontaneously expressed (non-acted) emotions in the Marathi language. To further enhance system performance and reduce the confusion rate, one more classifier can be introduced.
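Information gain of individual features with respect to the emotion classes can be approximated with a mutual-information estimator, as in the hedged sketch below; scikit-learn's mutual_info_classif and the synthetic data stand in for the actual feature matrix.

```python
# Sketch: score each feature by mutual information with the emotion labels
# and rank the features from most to least informative.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(1)
X = rng.normal(size=(180, 40))        # e.g. 180 utterances x 40 fused features (synthetic)
y = rng.integers(0, 6, size=180)      # six emotion classes

scores = mutual_info_classif(X, y, random_state=1)   # one score per feature
ranking = np.argsort(scores)[::-1]                   # most informative features first
print(ranking[:10], scores[ranking[:10]])
```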

Results
A multi-algorithm approach is proposed for emotion recognition from speech signals. The system is designed as follows: a database of one of the Indian regional languages, Marathi, was created with speech samples for the six emotions. Prosodic analysis of the Marathi speech samples shows that linguistic changes do not affect the correlation between prosody and emotion. Feature extraction is performed using acoustic features such as pitch and formant frequency, along with the MFCC- and Discrete Wavelet Transform-based algorithms. Recognition of emotions was evaluated using individual features first; the results show that recognition is better when the pitch and MFCC algorithms are employed individually.
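As a closing illustration, the sketch below compares individual feature sets against their fusion using cross-validated SVM accuracy and a confusion matrix. The data are synthetic and the feature dimensions are assumptions, so it only mirrors the structure of the evaluation, not its results.

```python
# Sketch: evaluate each feature set and their fusion with a cross-validated SVM.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(2)
y = rng.integers(0, 6, size=180)                 # six emotion labels (synthetic)
feature_sets = {
    "pitch": rng.normal(size=(180, 4)),          # dimensions are illustrative assumptions
    "mfcc": rng.normal(size=(180, 26)),
    "dwt": rng.normal(size=(180, 5)),
}
feature_sets["fusion"] = np.hstack(list(feature_sets.values()))

for name, X in feature_sets.items():
    pred = cross_val_predict(SVC(kernel="rbf"), X, y, cv=5)
    print(name, "accuracy:", round(accuracy_score(y, pred), 3))
    if name == "fusion":
        print(confusion_matrix(y, pred))         # rows: true emotions, columns: predicted
```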