Automatic Human Facial Expression Recognition Based on Integrated Classifier From Monocular Video with Uncalibrated Camera

Abstract: An automatic recognition framework for human facial expressions from a monocular video with an uncalibrated camera is proposed. The expression characteristics are first acquired from a deformable template modeled on the facial muscle distribution. After associated regularization, the time sequences formed by the trait changes in space-time under a complete expressional production are arranged line by line in a matrix. Next, the matrix dimensionality is reduced by the manifold-learning method of neighborhood-preserving embedding. Finally, the refined matrix containing the expression trait information is recognized by a classifier that integrates the hidden conditional random field (HCRF) and support vector machine (SVM). In an experiment using the Cohn–Kanade database, the proposed method showed a comparatively higher recognition rate than the individual HCRF or SVM methods in direct recognition from two-dimensional human face traits. Moreover, the proposed method was shown to be more robust than the typical Kotsia method because the former retains more structural characteristics of the data to be classified in space-time.


Introduction
Human facial expression recognition is valuable in many areas, such as medical diagnosis, [1][2][3] teaching, [4,5] and human-computer interaction (HCI), [6][7][8] and it has been actively studied for many years. The recognized facial expressions are typically categorized in two types: two-dimensional (2D) expressions [7,9,10] and three-dimensional (3D) expressions. [11][12][13] The former usually recognizes processed traits that are directly acquired from 2D facial images. The latter often utilizes 3D traits acquired from multi-view-angle facial images obtained by several cameras, from special 3D expression databases, or via special sensors. Other methods combine 2D with 3D traits to recognize facial expressions, [14][15][16] typically integrating the two types of trait acquisition and processing above.
Currently, facial expression recognition methods mainly include neural networks, [17] support vector machines (SVMs), [16,18] Markov chains, [13,19] and others. Different learning machines are evidently suited to different kinds of datasets, and when used independently, none can overcome its individual limitations. Alternatively, a real dataset must often be transformed substantially, at the cost of considerable useful trait information, which degrades recognition accuracy.
To retain trait information as much as possible, we therefore consider integrating learning machines with a special architecture to improve the overall performance. Moreover, in typical situations, 2D facial expression images are most often obtained. Thus, we attempt to acquire and utilize reflected 3D information from 2D images to directly realize approximated 3D recognition.
In this paper, we propose an automatic recognition framework for human facial expressions from a monocular video with an uncalibrated camera. From a deformable template, similar to a facial muscle distribution, the associated expression traits are extracted, regularized, and normalized. They are then arranged line by line in a matrix in accordance with the changing of the traits in space-time with complete expressional production. Next, the dimensionality of the trait matrix is reduced by a method of manifold-learning neighborhood preserving embedding (NPE). Finally, with the refined matrix as input, a classifier that integrates the hidden conditional random field (HCRF) with SVM is used to recognize the expression.
The remainder of this paper is arranged as follows. Section 2 describes the principle of facial expression recognition and outlines the proposed method. Section 3 describes our experiment to verify the proposed method and analyzes the results. Our conclusions are presented in Sec. 4.

Facial Expression Recognition Principle

Figure 1 illustrates the presented facial expression recognition framework. The 2D dynamic expression sequence from monocular vision is used as the input expression to be trained or tested. Here, the centroids of the different facial-area gray values in the expression image are employed as traits. A gray level has various attributes (brightness, contrast, and saturation) that change with differing depths. Therefore, the gray values include some depth information, especially relative changing values. These traits are characterized by movement in space-time as each facial expression is produced.
The same types of expression traits typically move in space-time with the same styles and have the same periodicities. Here, for each type of expression, only one facial expression cycle is needed in the recognition framework. With the production of this one cycle, that is to say, under a complete expressional production, the associated traits change in space-time, thereby forming time sequences. All the time-sequences are arrayed in a matrix. Thus, the problem of recognition is transformed into the problem of handling all the time-sequence data. Key details are described in the subsections below.

(Fig. 1: the framework matches a deformable template to compute typical regional centroids, extracts the multi-trait representation based on these centroids, and trains/infers with HCRF and SVM to output the expressional category No. X, X ∈ {1-surprise, 2-happy, 3-sad, 4-fear, 5-angry, 6-disgust, 7-neutral}.)

This strategy effectively reduces processing time. Meanwhile, all detectors are designed as in Ref. [20]. All detection of the areas noted above is performed automatically; manual assistance is not required. Comparing Figs. 2(a) and 2(b), it is evident that the template traits proposed herein are highly consistent with the usual distribution of human facial muscles. This approach is thus convenient for analyzing expressions in terms of the human anatomical structure.

As shown in Fig. 2, based on the anatomy of human facial muscle configurations and dynamics, [21,22] the facial expression is divided into eight areas. In each area, we employ an associated centroid to describe the global changes of all regional pixels in space-time as different expressions are produced. Each centroid is computed by Eq. (1), where, at time t, n_i is the number of pixels in region i, and x_m, y_m, and z_m are the corresponding horizontal coordinate, vertical coordinate, and gray value of each pixel.

As a person forms a facial expression, all pixels in the facial image are in motion in space-time, and the centroids above are therefore in motion as well. These centroids moving in space-time form time sequences. Figure 4 depicts (plotted in MATLAB) the time sequences of the moving centroids as the above expressions are gradually produced, starting from the neutral expression of one person. The distributions and shapes of the time sequences clearly differ between the different kinds of expressions; they can thus serve as the basis for distinguishing the expressions from one another. Equation (2) describes the preprocessing of these time sequences, which includes extracting relative changing values, normalizing, and scattering.
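The per-region centroid of Eq. (1) and the relative-change preprocessing of Eq. (2) can be sketched as follows. This is a minimal illustration, not the authors' code: the region masks are assumed to come from the deformable-template matching, and the scattering step of Eq. (2) is not reproduced.

```python
import numpy as np

def region_centroid(gray_image, region_mask):
    """Centroid (x, y, z) of one facial region at one frame, per Eq. (1).

    gray_image  : 2-D array of gray values.
    region_mask : boolean array, True for the n_i pixels of region i.
    Returns the mean horizontal coordinate x, vertical coordinate y,
    and gray value z over the region's pixels.
    """
    ys, xs = np.nonzero(region_mask)        # pixel coordinates in region i
    n_i = len(xs)
    if n_i == 0:
        return 0.0, 0.0, 0.0
    zs = gray_image[ys, xs].astype(float)   # gray values z_m
    return xs.mean(), ys.mean(), zs.mean()

def preprocess(seq):
    """In the spirit of Eq. (2): relative changes w.r.t. the first
    (neutral) frame, then normalized by the largest absolute change."""
    rel = np.asarray(seq, dtype=float) - seq[0]
    m = np.abs(rel).max()
    return rel / m if m > 0 else rel
```

Applying `region_centroid` to each of the eight areas at every frame, and `preprocess` to each resulting coordinate sequence, yields the normalized time sequences used below.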

Facial Expression Recognition
In general, the mechanism adopted here is similar to that in Ref. [23], where we studied gait recognition: both are problems of multi-sequence classification, although in different application fields. Of course, some associated definitions and settings differ. Combining with Ref. [23], we elaborate our method below.

1) NPE for reducing time-sequence dimensionality

NPE is detailed in Ref. [24], and its main principle is expressed in Eqs. (3) to (5). Before applying NPE, [24] each time-sequence produced by the associated curve is regarded as a data point x_i. To preserve the correlation property among curves, the adjacency graph is constructed by the k-nearest neighbor (KNN) method, with k set to 24; that is, each time-sequence is reconstructed from the 23 other adjacent time-sequences in the facial expression. It is hence reasonable to assume that each local neighborhood is linear, although these data points may reside on a nonlinear sub-manifold. Eq. (3) then computes the weight matrix W of the structural relations among the data points, Eq. (4) computes the dimensionality-reducing projections, and Eq. (5) realizes, in turn, the final dimensionality-reducing transformation of each data point x_i.
Initially, each of the motion curves in a facial expression for one expression cycle has 50 space samples. By using NPE, the corresponding time-sequence dimensionality is reduced from 50 dimensions of one facial expression cycle to 24 dimensions. The local manifold structure is preserved in low-dimensional space with optimal embedding.
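The NPE pipeline above can be sketched as follows. This is a minimal NumPy implementation of the standard NPE formulation (KNN reconstruction weights, then a generalized eigenproblem for the linear projection), not the authors' code; the toy sizes and the small regularizer `reg` are assumptions for illustration.

```python
import numpy as np

def npe(X, k=3, d=2, reg=1e-3):
    """Neighborhood-preserving embedding sketch.

    X : (N, D) array, one row per data point (here: one time-sequence).
    Returns the (N, d) embedding and the (D, d) projection matrix.
    """
    N, D = X.shape
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(dists[i])[1:k + 1]      # k nearest neighbors, no self
        Z = X[nbrs] - X[i]                        # centered neighborhood
        G = Z @ Z.T                               # local Gram matrix
        G += reg * (np.trace(G) + 1.0) * np.eye(k)  # regularize for stability
        w = np.linalg.solve(G, np.ones(k))        # reconstruction weights
        W[i, nbrs] = w / w.sum()                  # rows of W sum to 1 (Eq. (3))
    I = np.eye(N)
    M = (I - W).T @ (I - W)
    # Projections minimize the reconstruction error in the embedded space:
    # generalized eigenproblem X^T M X a = lambda X^T X a (Eq. (4)).
    A = X.T @ M @ X
    B = X.T @ X + reg * np.eye(D)
    vals, vecs = np.linalg.eig(np.linalg.solve(B, A))
    order = np.argsort(vals.real)
    P = vecs[:, order[:d]].real                   # d smallest eigenvalues
    return X @ P, P                               # y_i = P^T x_i (Eq. (5))
```

In the paper's setting, each of the 24 sequences is a 50-dimensional point and the output dimensionality is 24; the code keeps smaller sizes only so the toy example is readable.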
2) Integrating SVM with HCRF for classifying multiple time-sequences

a) HCRF for marking all time-sequences with signs

According to Fig. 1, we assume the expression type is No. X (X = 1, 2, . . ., 7). Because different centroids in the human facial expression correspond to different motion time-sequences, all time-sequences of expression type No. X are marked X1A, X1B, X1C, X2A, X2B, X2C, . . ., X8A, X8B, and X8C, respectively.
The mapping between the sequences and corresponding labels is conducted by HCRF's training or inferring. Here, the HCRF method, as in Ref. [25], is principally similar to that in Ref. [26]. The associated formulas are Eqs. (6) and (7).
Equation (6) describes the principle of inferring label y from the HCRF model given observation x, the HCRF model's parameters θ, and the window parameter w, which is used to incorporate long-range dependencies. Equation (7) describes the estimation of the HCRF model parameters θ.
In combination with the instructions in Ref. [25], the typical conjugate gradient method of Ref. [27] is used to estimate the parameters θ. In both equations above, the interpretations of all terms and the meanings of all parameters are the same as those in Ref. [23].
In this paper, the number of hidden states is set to ten, and the window size w is set to 0, 1, and 2 for testing. In training, each type of time-sequence in each human face sub-area, for all facial expressions of all persons, is learned together with its corresponding label by a separate HCRF model. In inferring, the tested sequence is run through each HCRF model trained for the same time-sequence type, and the class label with the highest probability becomes the label of the test sequence.
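The inference step above reduces to an argmax over per-label model scores. The sketch below shows only that decision rule; the `log_prob` method on each model is a hypothetical stand-in for a real HCRF's marginalization over its hidden states, which is not reproduced here.

```python
def infer_label(sequence, models):
    """Pick the label whose trained model scores the sequence highest.

    sequence : the tested time-sequence (any representation the models accept).
    models   : dict mapping a class label (e.g. "X1A") to a trained model
               exposing a hypothetical log_prob(sequence) method.
    """
    return max(models, key=lambda y: models[y].log_prob(sequence))
```

With 24 sequence types per expression, 24 such model banks run in parallel, each marking one sequence with its sign.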

b) SVM for classifying all multi-sequence signs
Here, SVM is used as the final classifier, recognizing the facial expression from the marked features associated with it as input. This is now a multiclass classification problem. For multiclass SVM, we apply the one-against-one approach, [28,29] which is suitable for practical use, as in Ref. [30]. In this method, if the number of training categories in the database is K, then K(K − 1)/2 binary SVM classifiers are needed. Each classifier adopts the C-support vector classification (C-SVC) model with the radial basis function kernel, which involves two parameters, C and γ. These are selected by typical cross validation via a parallel grid search, and all K(K − 1)/2 decision functions finally share the same (C, γ).
For training data from the ith and jth classes, Eq. (8) displays the binary classification problem to be solved; each term's interpretation and each parameter's meaning are the same as those in Ref. [23]. During final classification, the voting strategy suggested in Ref. [29] predicts that x is in the class with the largest vote: if sign((w^ij)^T Φ(x) + b^ij) indicates that x is in the ith class, the vote for the ith class is increased by one; otherwise, the vote for the jth class is increased by one. If two classes receive an identical vote, the one with the smaller index is selected.
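The one-against-one voting just described can be sketched as follows. The binary decision functions are assumed already trained; `decide[(i, j)](x)` stands in for (w^ij)^T Φ(x) + b^ij, returning a nonnegative value when x is judged to be in class i.

```python
from itertools import combinations

def ovo_predict(x, K, decide):
    """One-against-one voting over K(K-1)/2 binary decision functions.

    decide : dict mapping an ordered pair (i, j), i < j, to a trained
             binary decision function of x.
    Ties go to the smaller class index, as in the text above.
    """
    votes = [0] * K
    for i, j in combinations(range(K), 2):
        if decide[(i, j)](x) >= 0:   # sign((w_ij)^T Phi(x) + b_ij) favors i
            votes[i] += 1
        else:
            votes[j] += 1
    return votes.index(max(votes))   # first maximum = smaller index on ties
```

With K = 7 expression classes, this uses the paper's 21 binary classifiers per prediction.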

Testing Environment
To validate the performance of the proposed method in a basic environment, all experiments were conducted on a Pentium 1.73-GHz personal computer with Microsoft Visual C++ 6.0 as the platform.
The proposed facial expression recognition framework was tested with the Cohn–Kanade AU-Coded Facial Expression (CKACFE) database, [31] which includes 486 sequences from 97 posers, digitized into 640 × 480 or 640 × 490 pixel arrays with 8-bit precision for gray-scale values. Each sequence began with a neutral expression and proceeded to a peak expression. The peak expression of each sequence was coded using the Facial Action Coding System (FACS) [32] and then assigned an emotion label. Each poser had at most seven kinds of emotions: neutrality, surprise, sadness, disgust, anger, fear, and happiness.
In our experiment, we adopted leave-one-out cross validation to train/infer the facial expressions. We first employed all expressions of subject No. 1 for recognition, with the expressions of the remaining subjects used for training. Then, we used all expressions of subject No. 2 for testing, with the expressions of all others used for training. We repeated the training/inferring experiment in this fashion until all subjects' expressions had been inferred. Finally, we computed the rate of successful recognitions for each type of expression and regarded it as the final result.
The associated computation is shown in Eq. (9), where function_infer can be regarded as the whole function of the HCRF+SVM recognition system when the ith tested expression is emotion x_i (x_i > 0), and δ is the unit pulse function: δ(m) = 1 if m = 0, and zero otherwise. The result is correct only when the function's output equals the input x_i.

Combining the methods in Section 2: in training the HCRFs, the conjugate gradient method presented in Ref. [27] is used to estimate the associated parameters; in training the SVMs, cross validation via a parallel grid search, as in Ref. [30], is used. According to Section 2, there are 97 subjects, each with seven kinds of expressions, and each expression corresponds to eight 3D time-sequences. Thus, there are 8 × 3 = 24 HCRF models and 7 × (7 − 1)/2 = 21 SVM classifiers in total. When training on all expressions in the CKACFE database is finished, all trained parameters of the associated HCRF models and SVM classifiers are saved to another database.
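The recognition-rate computation of Eq. (9) can be sketched as follows. `function_infer` here is a stand-in for the whole HCRF+SVM pipeline; only the counting with the unit pulse δ is shown.

```python
def delta(m):
    """Unit pulse: 1 if m == 0, else 0, as used in Eq. (9)."""
    return 1 if m == 0 else 0

def recognition_rate(function_infer, test_expressions):
    """Fraction of correctly inferred expressions.

    function_infer   : maps a tested expression's features to a label 1..7
                       (stand-in for the whole HCRF+SVM system).
    test_expressions : list of (features, true_label x_i) pairs.
    """
    hits = sum(delta(function_infer(f) - y) for f, y in test_expressions)
    return hits / len(test_expressions)
```

In the leave-one-out protocol above, this rate is accumulated per expression type over all subjects.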
In addition, just after reducing the dimensionality of the associated time-sequences, we also used these temporal data directly as input to HCRF and SVM as independent classifiers. In this case, one HCRF model corresponds to one type of expression, so there are seven HCRF models; the number of SVM classifiers is the same as in the previous case. Another database of associated trained parameters is likewise produced from the CKACFE database after training. Finally, at the same window parameter w, we compared these recognition rates with the results of the proposed recognition framework and the typical Kotsia method. [16]

Table 1 and Fig. 6 present the comparisons of the proposed integrated classifier, the HCRF+SVM method, with the typical Kotsia method [16] and with the independent HCRF and SVM classifiers for facial expression recognition. The recognition rate of the proposed method is evidently more stable than that of the Kotsia method, which exceeds ours in only very few cases. The HCRF+SVM recognition rate is higher than those of the independent HCRF and SVM.

Testing Results and Analysis
According to the principle in Sec. 2, when the NPE method reduces the dimensionality of the time-sequences, the local neighborhood structure on the data manifold is preserved. When HCRF trains on or infers the signs of the time-sequences, the underlying graphical model captures temporal dependencies across frames and incorporates long-range dependencies. Meanwhile, when SVM trains on or infers the final expression from the relative signs, the separating margins of the decision boundaries are maximized as the data are mapped into high-dimensional space. Thus, compared with HCRF or SVM alone, the combined HCRF+SVM method obtains more structural traits of the data to be classified, thereby making recognition more efficient.

In addition, the recognition rate evidently differs with the window size w; when w equals one, the recognition rates are universally higher than at other window sizes. On the whole, the results of the proposed recognition framework are satisfactory.
In the phase immediately after reducing the dimensionality of the associated time-sequences, the independent HCRF recognition rate is slightly higher than that of the independent SVM, because HCRF is better suited to classifying time-sequences: SVM treats the correlated sequences as independent inputs, which is unreasonable to some extent. In addition, for both the independent HCRF and SVM, the changing trends of the recognition rates across the different expressions are consistent with those of the integrated HCRF+SVM classifier.

Conclusion
In this paper, we proposed a framework based on the HCRF+SVM integrated classifier for human facial expression recognition. According to universal human facial muscle distributions, the method employs the centroids of different areas of the facial image as traits; this approach is convenient for analyzing expressions in terms of the human anatomical structure. As an expression is produced, these centroids move in space-time, forming multiple time-sequences. Applying NPE to reduce the dimensionality of the trait time-sequences preserves the local neighborhood structure on the data manifold. The proposed integrated HCRF+SVM classifier obtains more structural characteristics of the data in space-time during inference. Consequently, the experimental testing produced higher recognition rates than the respective HCRF or SVM applied directly to the trait sequences. Moreover, the proposed method was shown to be more robust than the typical Kotsia method.
In future work, we intend to improve the proposed method's robustness to different illuminations and occlusions. Beyond the dataset used in the present study, we will evaluate whether the adopted integrated classification can handle other kinds of datasets in other situations. Moreover, in the two-stage classification strategy proposed herein, a first-stage identification error affects the next stage's recognition accuracy; that is, errors can propagate. We will therefore consider using only one mathematical model to directly classify the multi-sequences, thereby further improving the recognition rate.