Emotion Recognition on FER-2013 Face Images Using Fine-Tuned VGG-16

Article history: Received: 25 August, 2020 Accepted: 21 October, 2020 Online: 10 November, 2020 Facial emotion recognition is one among many popular and challenging tasks in the field of computer vision. Numerous researches have been conducted on this task and each proposed either standaloneor ensemble-based processing technique. While many researches strive for better accuracy, this research also attempts to increase the processing efficiency of computer correctly classifying human emotions based on human face by utilizing a single standalone-based neural network. This research proposes the use of standalone-based modified Convolutional Neural Network (CNN) based on Visual Geometry Group – 16 (VGG-16) classification model which was pretrained on ImageNet dataset and fine-tuned for emotion classification. The classification is performed on the publicly available FER-2013 dataset of over 35,000 face images with in-the-wild setting for 7 distinct emotions with the provided 80% training, 10% validation, and 10% testing data distributions. The proposed approach outperforms most standalone-based model results with 69.40% accuracy.

The reviews from [15][16][17][18][19] described many available facial emotion recognition datasets, such as Facial Emotion Recognition 2013 (FER-2013) [1,9], Static Facial Emotion in the Wild (SFEW) [10], Japanese Association of Female Facial Expression (JAFFE) [11], Cohn-Kanade (CK) [12], Extended Cohn-Kanade (CK+) [13], and Expression in the Wild (ExpW) [14] among many others. These datasets vary in many aspects, commonly described by one or combination of the following: amount of data, number of emotion classes, image-or sequential-based, and in lab-like or inthe-wild condition. The amount of data available in each dataset varies from hundreds to tens of thousands, with some datasets having pre-set training, validation, and/or testing data distribution. These datasets also vary in number of emotion classes, commonly ranging from six to eight emotion classes which include anger, disgust, fear, happiness, sadness, surprise, contempt, and is often appended with neutral emotion as well. Although some datasets do not include neutral and/or contempt expression. Image-or sequential-based (video-based or image-sequence-based) datasets also give variance in the conducted researches, as each dataset type's processing require different approach. Another difference is the condition of the dataset, where lab-like dataset differ from inthe-wild dataset. The former is recorded in an ideal condition (proper lighting, proper face alignment, and/or minimal to no usages of face accessories such as eyeglasses), whereas the latter depicts face expressions captured in a non-ideal condition from real-world scenarios.
These datasets are utilized in many works related to solving facial emotion recognition task, some of which uses conventional methods, deep learning, pretrained model, ensemble neural networks, combination of deep learning with hand-crafted feature selection technique(s), and other related works that will be described in following sections. These approaches achieve various results, which could be summarized in most researches' test accuracy for lab-like datasets often achieving above 90%, while test accuracy for in-the-wild datasets barely surpasses 75% [19].
With such low accuracy result from previous researches for facial emotion recognition with in-the-wild dataset (as compared   ASTESJ   ISSN: 2415-6698 with lab-like datasets), this paper attempts to improve facial emotion recognition accuracy for image-based in-the-wild dataset FER-2013. The proposed model achieved test accuracy of 69.40%, which is better than most existing researches with standalonebased neural network architecture, with the advantage of being simpler in terms of network depth and topology and having the capability of end-to-end training.
Continuing this introduction is a review of core concepts, brief description of FER-2013 as the used dataset in this paper's experiments, existing approaches for facial emotion recognition task using FER-2013 dataset, and the works that inspire and are the base of this paper's model. Afterwards, the proposed model is elaborated in detail. Then, the results from the proposed model are discussed and compared with other related works to understand how well the model performs compared to other taken approaches. Finally, this paper concludes with final thoughts about the proposed model and future works to improve facial emotion recognition task.

Studies and Developments
The study of emotion has gained interest since early 1970s, with one remarkable and frequently used research result being the work of Ekman and Friesen. They introduced six basic emotions: anger, disgust, fear, happiness, sadness, and surprise [20]. Many publications would later improve the known basic emotions, such as the work of Matsumoto which described contempt as another universal basic emotion [21]. Other improvements include the introduction of tools for supporting study of emotion recognition, such as the development of Facial Action Coding System (FACS) [22] with the updated version introduced in 2002 [23].
The interest in emotion study gained attention from computer science researchers as well, resulting in published researches on how to create automated emotion recognition system since the late 1980s. Even a survey on the developments made during those times [24]. These researches were conducted with broad and separate focus, such as establishing dataset [11,12,25], mapping features or feature selection process(es) [26,27], designing architecture [27,28], and adopting approaches from other fields of study [25,27]. More recent studies adopted CNN for face emotion recognition, such as the approaches proposed in [29][30][31] One, if not the most, attractive feature of emotion recognition is the human face itself. The main reason behind this is because mostly human face expresses emotions one feels [32]. The expression(s) is/are shown due to the work of face muscles' coordination which produces many expression configurations [33]. Note that the human face can express more than one emotion at the same time or at an approximately same time [34].

Facial Emotion Recognition Dataset (FER-2013)
For such an attractive feature, many datasets have been developed for facial emotion recognition task. As this paper only utilizes one dataset (FER-2013 [1,9]), it is the only dataset discussed in the paper. where each image's pixel values are stored in cells per row. The images were obtained utilizing Google search and are later grouped per emotion classes, which are anger, disgust, fear, happiness, neutral, sadness, and surprise. As the dataset was built utilizing Google search, the images are in in-the-wild condition, with even very few images being animated characters.
The dataset originally had data distribution of 28,709 images for training and 3,589 images for public test, but after the competition ended, another 3,589 images which were used for private test were added to the dataset. The usages of FER-2013's data distribution vary among published researches, with each using public test images for different purposes, either as part of training set, validation set, or test set. Several images contained in FER-2013 dataset are given in Figure 1.

Standalone-Based Neural Network (SBNN)
An architecture was proposed using CNN and batch normalization, in which the batch size was set constant to 256. The training process ran for 8 epochs as higher epochs was reported to result in overfitting [35]. Such architecture resulted in 60.12% test accuracy on FER-2013's private test data.
Other architectures using both CNN and batch normalization, varying in number of filters, were proposed in [36]. The uniqueness of both architectures is that both do not have any fully connected layer and maintain kernel size of 8, and with this design, both architectures achieved test accuracy above 65%.
Global average pooling (GAP) [37] was also used as an addition to both CNN and batch normalization [38]. The architectures vary in terms of depth-wise separable convolution layer, and although both were designed for multi-task purposes (emotion and gender recognition), both architectures achieved 66% accuracy.
Another new architecture inspired by network-in-network architecture and Inception was proposed with addition of utilizing a polynomial-based learning rate decreasing method for faster convergence and better performance [39]. The data used as input were pre-processed first for face feature extraction while still maintaining the image form, which resulted in a maximum of 10% increase in accuracy. With the proposed architecture and data preprocessing, the test result for FER-2013 dataset reached 66.4% after training for 200 epochs with batch size of 250, noting that the test data used is unknown to be either public or private test data.
Two SBNN architectures, one using custom CNN and another using One-vs-All (OVA) Support Vector Machine (SVM) with Histogram of Oriented Gradients (HOG) which resulted in 66.67% and 45.95% accuracy on test data of FER-2013 dataset respectively were also introduced [40].
A new SBNN architecture based on attentional convolutional neural networks was proposed in [41]. The proposed model was able to focus on only face features that are important to emotion recognition, achieving 70.02% test accuracy on FER-2013's private test data.
The winner of FER-2013 challenge, Yichuan Tang, proposed SBNN architecture with L2-SVMs (DLSVM) as replacement for softmax layer at the end of the neural network architecture. Such modification resulted in the proposed model reaching 71.2% test accuracy on private test dataset [42].
The current SOTA SBNN architecture was proposed in [43], in which they theorized that features at mid-level and high-level of a CNN would have impact on the final prediction. They based their model, termed as Multi-Level Convolutional Neural Network (MLCNN), on VGG architecture with 18 layers and achieved 73.03% test accuracy on FER-2013's private test data.

Ensemble-Based Neural Network (EBNN)
An EBNN architecture with 3 subnets, containing 3 and increasing convolution layers respectively for each subnet was proposed [44]. With batch size of 100, momentum value of 0.9, learning rate value of 0.01 decreasing to 0.001 by validation error condition, and running from varying epochs of 20 to 100, and 80% and 20% data distribution for training and test set respectively, they reached 65.03% test accuracy.
Referred in their paper as aggregator or hybrid CNN-SIFT aggregator, [45] combined CNN with Scale Invariant Feature Transform (SIFT), specifically Dense SIFT and regular SIFT. The EBNN, constructed and trained from scratch with combining CNN, Dense SIFT, and regular SIFT, resulted in 73.4% test accuracy on FER-2013 private test data.
Another researcher proposed an EBNN coined as hierarchical committee for facial emotion recognition task with 216 retrained deep convolutional neural network (DCN) models using transfer learning and 108 DCN models trained on aligned faces [46]. This method achieved 70.58% test accuracy on FER-2013's private test dataset, with randomly splitting training data for training and validation purposes after removing 11 images from the training data. The work is further developed in [47] which consisted of 9 DCNs, with each having 3 convolution-max pooling stages and ended with 2 fully connected layers. By using both aligned and non-aligned face structure dataset in training each DCN and using Alignment-Mapping Network (AMN) with the ensemble, they achieved 73.73% accuracy on FER-2013 private test data with the same training and validation data distribution as used in [46].
Another proposed EBNN architecture was offered in Pramerdorfer and Kampel's work [48]. The ensemble consisted of 8 DCN models based on VGG, Inception, and ResNet, where each base model underwent convolution and/or pooling layer(s) removal. They managed to achieve 75.2% test accuracy on FER-2013 private test data.
Currently known SOTA EBNN architecture was proposed in [49], which combined 3 VGG models, 2 pre-trained models which were retrained, and 1 model trained from scratch, combined with SIFT and k-means clustering, and flattened extracted feature vector as input to local SVM. This approach achieved 75.42% test accuracy on FER-2013's private test data.

Other Researches
There are other notable researches that inspire the proposed model which impacts the proposed model's architecture and learning and/or generalization capability. These researches are reviewed in this section in the order of previously stated impact descriptions, with the purpose of giving brief and proper information regarding the researches' results prior to elaborating the proposed model.
The architecture of the model is based on VGG-16 architecture [50], which contains 13 convolution layers with 3 additional fully connected layers at the end of the network. After each convolution layer, ReLU [51] layer is used for non-linearity rectification. Table  1 describes VGG-16 architecture. However, the existence of fully connected layers in a convolutional neural network architecture may lead to higher overfitting probability [37]. To overcome this, the authors in [37] proposed GAP which works by averaging each of final feature map results and directly using the averaged results for classification. This pooling method regularizes feature maps as confidence map for categories with no parameter requiring optimization, hence avoids overfitting.
Another attempt to reduce overfitting possibility is the use of early stopping [52]. Such implementation affects the training mechanism as the training may stop before fully reaching the specified number of epochs due to no further increase in accuracy or decrease in loss in a specific period. Other than using early stopping to avoid overfitting, a proper optimizer is required for better model learning and/or generalization capability.
One most used and popular optimizer is Stochastic Gradient Descent (SGD) [53,54], which takes constant parameter learning rate and traverses the gradient by the specified learning rate to find global minima. With SGD, it was reported that the generalization capability of a model improves [55,56], hence less overfitting probability. Another development for SGD was using momentum as additional parameter which accelerates model learning process, which has been proven in practice in [57]. However, SGD is often slow for learning process since the traversing process is done uniformly by the learning rate value.
An alternative to SGD was proposed in [58] and coined as Adam to overcome SGD's limitation. With Adam, the learning process adapts the learning rate parameter as the learning process continues, resulting in better learning performance compared to SGD. The major drawback of Adam is the generalization capability drops significantly as reported in [56].
In an effort to overcome Adam optimizer's drawback, Keskar and Socher proposed SWATS (Switching from Adam to SGD) [59], which at first uses Adam optimizer to have better initial learning process then switches to SGD on specified trigger condition. They empirically argued that SWATS optimizer gives better performance on most tasks, especially visual-and text-based tasks, by combining both Adam and SGD.

Proposed Model
Our proposed model was defined to be based on VGG-16 as the base model and using GAP as final pooling layer prior to VGG-16's classifier. The best model was obtained after experimenting on multiple combinations of SGD/Adam/SWATS as optimizer and inclusion/non-inclusion of early stopping. The researchers made some modifications for VGG-16's classifier due to the use of GAP.
VGG-16 was chosen as the base model for several reasons. The first reason is because VGG-16 uses convolution layer with 3x3 kernel instead of 5x5 or 7x7, hence lesser parameters to train [50]. The second reason is by using two or three stacked convolution layers with 3x3 kernel as an equal for one convolution layer with 5x5 or 7x7 kernel respectively, VGG-16's architecture adds more rectification layers, which results in a decision function that is more discriminative [50]. The third reason is based on empirical findings as the researchers reviewed previous publications, in which the researchers found that several SBNN and EBNN researches used VGG based models and achieved good, with some achieving SOTA, results on FER-2013 dataset among other tested datasets. While VGG-13, VGG-16, and VGG-19, either in its original architecture or modified, are present in the found literature reviews, the researchers decided to go with VGG-16 since it is not too deep and complex as VGG-19, noting that both architecture demonstrated similar results as reported in [50]. The researchers also agreed to not use VGG-13 as Simonyan and Zisserman stated in their paper that increase in depth in networks results in better accuracy [50]. In addition to the architecture selection, the researchers theorized that the use of pretrained model may improve accuracy and reduce training time. The researchers used VGG-16's pretrained model which was trained on ImageNet [60] dataset rather than initializing the network's weights randomly.
The researchers also modified VGG-16's architecture by using GAP as final pooling layer before entering VGG-16's classifier, and therefore the researchers replaced all of VGG-16's classifier layers with a single fully connected layer, taking 512 averaged input neurons and outputting 7 neurons as the equivalent number of FER-2013's emotion classes. Our proposed model's architecture can be seen in Table 2.
With the proposed model's architecture, the researchers experimented on multiple combinations of training mechanism, optimizer choices, data distribution, frozen layers, and batch normalization usage. Our best model was trained using SGD optimizer and used early stopping in its training process. Such combination proved to be superior as compared with the researchers other experiments. The results of the experiments are discussed in the next section.

Experimental Design
The researchers experimented on combining VGG-16 with several optimizers and training method. The used optimizers were SGD, Adam, and SWATS, and the training method was customized to use early stopping. The experiment was conducted using PyTorch and ran on Google Colab with GPU acceleration support. Due to the training data's size and Google Colab's lifetime, the researchers had to perform run-and-pause per 10 epochs as part of the training mechanism to prevent sudden disconnect from the machine, since in such an event the training process may stop prematurely. Note that the training and validation were done in one epoch. After each run-and-pause, the test was performed per 10 epochs on the model and the results are recorded.
Initially, the researchers had several options to experiment on. The options include the use/disuse of batch normalization, freezing/not freezing selected layers, imbalanced (original)/balanced training data distribution, use/disuse of GAP, and SGD/Adam/SWATS optimizer selection. All in all, the researchers initially estimated the need to experiment on approximately 12 combinations of VGG-16 model. Due to the quite large number of possible combinations and since each epoch takes about 7-10 minutes, the researchers had to estimate which combination merits further investigation per specified run epochs. After each combination has passed the specified run epochs, the combination that provided the highest test accuracy was used to undergo further training and/or further combination. At the end of the experiments, the researchers experimented on 23 combinations, with all combinations having equivalent hyperparameter values. The researchers used learning rate value of 0.001, 0.9 for momentum on SGD optimizer, and batch size of 32.
The results of the experiments are given in Table 3. For better readability, the following acronyms are used in Table 3: The researchers group the experiments to select the best model in each experiment group, and what the next experiment group's models' possible combinations are. GAP layer helps aggregate feature map information and helps prevent overfitting caused by fully connected layers in the model. For this reason, the researchers tried the combination of using and not using GAP in the experiments. Note that although GAP can be used as the classifying layer, the researchers appended a fully connected layer instead as the classifying layer for GAP's aggregated feature map. To easily refer to a specific model, the researchers number each model. Note that italicized series number means the model is used in multiple experiment groups.

Experimental Results
EG1 denotes the first experiments. All combinations in EG1 underwent training for 10 epochs due to the large number of combinations. The best result from this experiment turned out to be model with model series number 7 (M7), using balanced training data distribution with no batch normalization, GAP layer, and SGD optimizer with 64.11% test accuracy result.
From the best result of EG1, the researchers decided to perform the next experiment (EG2) on balanced training data with unfreezing 1-and 2-last convolution layer (denoted 1C and 2C in Table 3, respectively) since the researchers hypothesized that with the use of GAP, training the output (fully connected) layer does not do much on improving the model's generalization capability. In addition to the experiment on balanced training data, the researchers tried running an experiment for imbalanced data with only the output layer being unfrozen. The researchers also extended the number of epochs run from 10 to 30 in EG2. The result was out of what the researchers had thought, with the combination utilizing imbalanced training data having the highest test accuracy of 67.26% (M15), and the combination for balanced data and unfrozen 2-last convolution layer (M14) coming in second with 67.20% test accuracy.
The researchers noticed how M14 had a close accuracy gap with M15 (EG2's best model). The researchers then slightly deviated the experiment in EG3 with the thought of using both balanced and imbalanced dataset on no frozen layers. The researchers unfroze all layers in the proposed model and trained the networks for 30 epochs. The results show that using SGD with imbalanced training data and no frozen layer gave the best result of all the experiments, with 69.46% test accuracy, which is M17. The researchers also noted how the combination that uses balanced training data could not compete with the combination utilizing imbalanced training data.
Considering our last observation on EG3, EG4 consisted of experiments on imbalanced dataset with SGD optimizer, and the researchers tried to reduce the number of unfrozen layers by freezing only 3-last convolution layers (3C in Table 3) in hopes of speeding training time by reducing trainable parameter count. Then the researchers compared the combination's result with M15 and M17. Once again, M17 which was built by unfreezing all layers in the network and using imbalanced dataset produced the best accuracy result.
With the best combination now obtained after four experiment groups, EG5 established the comparison for such combination (imbalanced training data, no batch normalization, using GAP, and no frozen layer) but with different optimizers. The researchers also tried to use early stopping in this experiment group, with the early stopping being effective per epoch 30. The researchers set the minimum passing point of the training epoch to be 30 to have the models learn at first and not getting cut out by the early stopping mechanism. The early stopping mechanism was also executed if the validation accuracy does not improve for last 10 epochs since the last best recorded validation accuracy. To this extent, note that benchmarking epoch (EP in Table 3) refers to last epoch for each model after early stopping was effective. What was unique from this experiment group is that M17 had gotten better validation accuracy but lower test accuracy, proven with the decreasing test accuracy in EG5 as compared to EG3 and EG4's results.
After EG5, the researchers attempted to do training without early stopping, to see whether the model may be caught in local minima when early stopping was enabled. The experiments which did not use early stopping are grouped as EG6. EG6 proved our assumption wrong about our model being caught in local minima with the use of early stopping, as all three models in EG6 did not show much improvement after passing 30 epochs, nor after the benchmarking epoch in EG5. The results of EG6 also show the same findings as reported in EG5, in which the model using SGD optimizer with the specified hyperparameters value in Table 3 reigned triumphant compared to other models in EG6.
With the obtained best model from the experiments, the researchers provide their best model's test accuracy comparison with other works' test accuracy on Table 4. Note that the compared models are those from SBNN group as the proposed model is categorized as SBNN model.
Our proposed model outperforms most proposed approaches by a significant margin of 3-24%. The researchers argue that the capability of VGG-16 is adequate as feature extractor for facial emotion recognition and is further enhanced by the addition of GAP layer prior to classifier (fully connected) layer. GAP's addition is observed to greatly increase test accuracy, as the experiments have shown in Table 3 and shown by [38]. Unfreezing all layers in pretrained VGG-16 model also results in the best performance. The researchers believe that the model learns to better extract features as convolutional layers are learning as well during training instead of just letting the fully connected layers learn the preferred mapping. In addition to the proposed model's capability, the use of SGD optimizer benefits the proposed model as reported in [56] where SGD is reported to yield better performance on unseen data. What is unique is the use of Adam and SWATS optimizer without batch normalization leads to worst performances as shown in Table 3, and yet using batch normalization with SGD does not yield the best results.
Our proposed model also achieves similar, albeit lower, test accuracy as compared to the approaches from [40,41] with less model complexity. This would be beneficial since utilizing a simpler model may result in fewer required resources and faster performance in real-life scenarios. Using simpler model would also open further research on the possibilities of using simpler models when pursuing other objectives.

Conclusions
In this paper, the researchers report their experiments using FER-2013 dataset as the data source. Several researches using FER-2013 are also reported and they used different kinds of approaches, which the researchers group as standalone-based neural network (SBNN) and ensemble-based neural network (EBNN) approaches. The proposed network is classified as SBNN with VGG-16 as the base model, which was modified into using 13 convolutional layers and GAP as the last pooling layer. The network is then experimented on by varying several variables, such as data distribution, use/disuse of batch normalization, GAP, optimizers choice, and freezing some layers. The researchers experimented on 23 models and found out that using an imbalanced dataset, GAP, non-frozen layer, and SGD optimizer results in the highest accuracy throughout the experiments, which is 69.40%. With this result, the network has surpassed most reported networks defined in Table 4 by quite a large margin. In addition, the model supports end-to-end training and is much simpler as compared to other three best SBNN models [41][42][43], which in return results in lower time and memory consumption.
Further investigations could be done regarding hyperparameter tuning as the researchers experimented on the variation of data distribution, pooling layer, and optimizer selection only. Other improvement includes using other existing models or developing new architecture with similar topology as the proposed model for better classifying capability. With the proposed model's result, the researchers believe that further modifications may lead to better performance in classifying emotions with in-the-wild settings.