Emotion recognition and its application in learning scenarios supported by a smart classroom (2023)

Zhen Zhu | Xiaoqing Zheng | Tongping Ke | Guofei Chai*

Digital Campus Construction Center, Quzhou Vocational and Technical School, Quzhou 324000


School of Information Engineering, Quzhou Vocational and Technical College, Quzhou 324000, China


Modern Business Research Center, Zhejiang Gongshang University, Hangzhou 310018, China


School of Electrical and Computer Engineering, Quzhou University, Quzhou 324000, China


Corresponding author email:

chaig@qzc.edu.cn

Pages: 751-758

DOI: https://doi.org/10.18280/ts.400235

Received: January 12, 2023 | Accepted: March 26, 2023 | Published: April 30, 2023

Open Access

Abstract:

Emotion recognition technology is one of the important applications of artificial intelligence and machine learning in education. By recognizing student emotions in learning scenarios, teachers can better understand students' learning status and provide them with personalized learning and support resources. Current emotion recognition methods rely mainly on static facial expressions and ignore their temporal characteristics, which can lead to inaccurate recognition results. To address these challenges, this study investigates emotion recognition and its application to learning scenarios supported by a smart classroom. A Transformer encoder is used to extract the temporal characteristics of students' facial emotions in the learning scene; that is, the encoder's self-attention module extracts how facial emotions evolve over the frame sequence. A residual attention network, a Transformer and a non-local neural network are combined to extract facial emotion features from different perspectives and levels. The combination of the Vision Transformer (ViT) and NetVLAD enables the model to learn data features from multiple perspectives, thereby improving its generalization ability. Experimental results confirm the effectiveness of the constructed model.

1. Introduction

In today's society, student-centred teaching has gradually become mainstream, and increasing attention is being paid to individualized and contextualized learning [1-5]. Teachers therefore need to understand and respond to students' learning emotions in a timely manner in order to achieve the best learning effect [6-11]. The smart classroom is an important product of new educational technology that uses the latest computer, communication and Internet technologies to achieve deep perception, intelligent analysis and precise management of the teaching process [12-17]. Among these technologies, emotion recognition is one of the important applications of artificial intelligence and machine learning in education [18-21]. By recognizing student emotions in learning scenarios, teachers can better understand students' learning status and provide them with personalized learning and support resources. Emotions are closely related to learning outcomes; by recognizing and adapting to students' emotions in real time, teachers can improve students' learning motivation and performance.

Recognizing student emotions in the classroom plays an important role in improving classroom effectiveness. Recognition methods based on traditional image processing generally suffer from problems such as low recognition accuracy and difficult feature extraction. To address these problems, Su and Wang [22] proposed a deep-learning-based method for recognizing students' emotions. The method first introduces MobileNet's convolutional structure to replace DarkNet-53 in YOLO v3, making the model lighter and reducing the number of parameters, and then uses the GIOU loss to improve the model's loss function. Experimental results show that the mAP of the improved model increases by 4%, the F1 score increases by 3.2%, and the detection time is reduced by one third. Liang et al. [23] focused on the teacher's speech signal and designed a sound processing system that infers the teacher's emotional state from their speech. The classification model for speech emotion recognition was built with a recurrent neural network (RNN). They refined the traditional Mel-frequency cepstral coefficient (MFCC) feature extraction process, adding a second-order differential step to remove MFCC convolutional noise and appending one-dimensional energy features to the 39-dimensional MFCC coefficients. The experimental results show that the improved MFCC features and the neural network improve the speech emotion recognition rate more effectively than traditional speech emotion recognition methods and can be used for speech emotion recognition in teaching. Putra and Arifin [24] built a real-time facial emotion recognition system that allows teachers to track students' emotions during classroom activities and that is designed to run reliably on a mid-range computer. Students completed a questionnaire measuring stress, and the survey results were used to analyze whether using the system could reduce students' stress. The results showed that the system was able to detect student emotions early, which allowed teachers to minimize student stress.

Facial emotions are complex and subtle cues: people can show multiple facial emotions at the same time, and different people can express the same emotion in different ways, which makes facial emotion recognition very difficult. Moreover, facial emotions are dynamic, and the onset, duration and offset of facial expressions contain rich emotional information. However, current emotion recognition methods rely mainly on static facial expressions and ignore their temporal characteristics, which can lead to inaccurate recognition results. Meeting these challenges requires more advanced and precise emotion recognition methods. To that end, this study investigates emotion recognition and its application in learning scenarios supported by a smart classroom.

2. Extracting temporal facial emotion features based on the student learning scene

The traditional recurrent neural network (RNN) faces gradient vanishing and explosion when processing long sequences, while the Transformer model uses its self-attention mechanism to capture long-term dependencies in sequences, allowing the model to better understand and analyze dynamic changes in facial emotions. In addition, compared with RNNs, which must process each time step sequentially, the Transformer model can process all time steps simultaneously, significantly improving computational efficiency, which is critical for processing large amounts of learning scene data. Based on the self-attention mechanism, the Transformer model can assign different weights according to the importance of facial expressions at different points in time, which helps the model capture key emotional changes. Therefore, this study uses the Transformer encoder to extract the temporal features of students' facial expressions based on the learning scene; that is, the encoder's self-attention module extracts the temporal features of students' facial expressions in the learning scene.

In the self-attention module, assume that the input sequence of the learner's facial expressions is denoted $s_1$ and $s_2$. Multiplying the inputs by the parameter matrices $Q^w$, $Q^j$ and $Q^c$ yields $w^u$, $j^u$ and $c^u$, where $w$ represents the query, $j$ represents the key, and $c$ represents the value, i.e., the information extracted from $s$.

Taking the inputs $s_1$ and $s_2$ and the parameter matrix $Q^w$ as an example:

$w^1=s_1 Q^w, w^2=s_2 Q^w$ (1)

Since the model can operate in parallel, we also have:

$\left(\begin{array}{l}w^1 \\ w^2\end{array}\right)=\left(\begin{array}{l}s_1 Q^w \\ s_2 Q^w\end{array}\right)$ (2)

Similarly, $(j^1, j^2)$ and $(c^1, c^2)$ can be obtained. In the self-attention mechanism, $W=(w^1, w^2)$, $J=(j^1, j^2)$ and $C=(c^1, c^2)$. Next, $w^1$ is combined with each $j$ through a dot product operation. Because large dot products push the softmax function into regions where the gradient becomes very small, the dot products need to be scaled to obtain the correct $\beta$. The calculation process is as follows:

$\beta_{1,1}=\frac{w^1 j^1}{\sqrt{f}}, \beta_{1,2}=\frac{w^1 j^2}{\sqrt{f}}$ (3)


Similarly, stacking $w^2$ with all $j$ gives $\beta_{2,u}$:

$\left(\begin{array}{ll}\beta_{1,1} & \beta_{1,2} \\ \beta_{2,1} & \beta_{2,2}\end{array}\right)=\frac{\left(\begin{array}{c}w^1 \\ w^2\end{array}\right)\left(\begin{array}{l}j^1 \\ j^2\end{array}\right)^T}{\sqrt{f}}$ (4)

After applying the softmax function to each row of the matrix above, $(\hat{\beta}_{1,1}, \hat{\beta}_{1,2})$ and $(\hat{\beta}_{2,1}, \hat{\beta}_{2,2})$ are obtained. Using $\hat{\beta}$ as the weight of each $c$, the output of the self-attention mechanism is obtained, where $\Omega(\cdot)$ in the following formula denotes the softmax operation:

$ATT(W, J, C)=\Omega\left(\frac{W J^T}{\sqrt{f_j}}\right) C$ (5)

The multi-head self-attention module works in a similar way: each head $u$ obtains $w^u$, $j^u$ and $c^u$ through $Q^w$, $Q^j$ and $Q^c$. Figure 1 shows the architecture of the multi-head attention mechanism. The specific calculation is shown in the formula below:

$\operatorname{ATT}\left(W_u, J_u, C_u\right)=\Omega\left(\frac{W_u J_u^T}{\sqrt{f_j}}\right) C_u$ (6)
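As an illustration of Eqs. (3)-(6), the following PyTorch sketch implements scaled dot-product attention and a multi-head self-attention module over a sequence of facial-expression features. The tensor names mirror the notation above, while the embedding size, number of heads and identifier names (MultiHeadSelfAttention, q_w, q_j, q_c) are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(w, j, c):
    """Eq. (5): ATT(W, J, C) = softmax(W J^T / sqrt(f_j)) C."""
    f_j = j.size(-1)                                   # key dimension f_j
    beta = w @ j.transpose(-2, -1) / (f_j ** 0.5)      # scaled scores, Eqs. (3)-(4)
    beta_hat = F.softmax(beta, dim=-1)                 # row-wise softmax weights
    return beta_hat @ c                                # weighted sum of the values

class MultiHeadSelfAttention(nn.Module):
    """Eq. (6): each head u applies attention to its own (W_u, J_u, C_u)."""
    def __init__(self, embed_dim=256, num_heads=8):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, embed_dim // num_heads
        # Q^w, Q^j, Q^c: learned projection matrices applied to the input sequence s
        self.q_w = nn.Linear(embed_dim, embed_dim)
        self.q_j = nn.Linear(embed_dim, embed_dim)
        self.q_c = nn.Linear(embed_dim, embed_dim)
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, s):                              # s: (batch, seq_len, embed_dim)
        b, n, _ = s.shape
        split = lambda x: x.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        w, j, c = split(self.q_w(s)), split(self.q_j(s)), split(self.q_c(s))
        att = scaled_dot_product_attention(w, j, c)    # per-head attention
        att = att.transpose(1, 2).reshape(b, n, -1)    # concatenate the heads
        return self.out(att)
```

In practice this computation matches what nn.MultiheadAttention already provides; the explicit version is shown only to make the correspondence with the formulas visible.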


Figure 1. Architecture of the multi-head attention mechanism

This module is used to extract the temporal sequence of a learner's facial expressions. Assume that the set of features learned by the Vision Transformer is denoted $\lambda_u^k$. The features $\lambda_u^k$ are fed into the Transformer module, which extracts the temporal characteristics of students' facial emotions across the sequence of learning scenes. The score vectors of the emotion categories are then obtained through a fully connected layer and denoted $z_u$.

Suppose the class output of the Transformer module for sample $u$ is denoted $z_u$, the probability vector produced by the softmax function is denoted $o_u$, the number of labeled samples is denoted $B$, and the corresponding ground-truth label is denoted $t_u$; the cross-entropy loss between the probability vector $o_u$ and $t_u$ is denoted $M_Y$. Finally, feeding $z_u$ into the softmax function and the cross-entropy loss formula yields the model's loss value:

$\left\{\begin{array}{l}o_u=\frac{e^{z_u}}{\sum_{j=1}^B e^{z_j}} \\ M_Y=-\sum_{u=1}^B t_u \log \left(o_u\right)\end{array}\right.$ (7)
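To make the pipeline of this section concrete, a minimal PyTorch sketch is given below: per-frame features $\lambda$ are passed through a Transformer encoder, pooled, and mapped to class scores $z_u$, from which the softmax probabilities $o_u$ and the cross-entropy loss $M_Y$ of Eq. (7) are computed. The feature dimension, the number of emotion classes and the mean-pooling choice are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalEmotionHead(nn.Module):
    """Transformer encoder over per-frame features, then class scores z_u."""
    def __init__(self, feat_dim=256, num_classes=7, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, lam):                     # lam: (batch, frames, feat_dim)
        h = self.encoder(lam)                   # temporal self-attention over frames
        return self.fc(h.mean(dim=1))           # pooled sequence -> class scores z

def transformer_loss(z, t):
    """Eq. (7): softmax over z, then cross-entropy against ground-truth labels t."""
    o = F.softmax(z, dim=-1)                    # probability vector o_u
    log_o = o.clamp_min(1e-12).log()
    return -(F.one_hot(t, o.size(-1)) * log_o).sum(dim=-1).mean()
```

In practice F.cross_entropy(z, t) fuses the softmax and the logarithm more stably; the explicit form above is kept only to mirror Eq. (7).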

3. Recognition of students' facial emotions based on the teaching scene

Existing facial emotion recognition algorithms have many limitations in data collection and in suppressing non-emotional features. Residual attention networks can extract deep features and capture the subtle changes in images that are crucial for recognizing facial emotions. Transformers can process time-series data and capture the dynamic changes of facial emotions. Non-local neural networks can capture long-range relationships in images, which is especially useful for understanding complex facial emotions. By integrating these three networks, facial emotion features can be extracted from different perspectives and levels. Figure 2 shows the overall architecture of the network model.


Figure 2. Network model architecture

The residual attention network consists of two branches: the main (trunk) branch and the mask branch. These two branches work together to extract useful emotional features and to learn attention weights from the input images. Figure 3 illustrates the architecture of the residual attention network.

The main branch is primarily responsible for extracting features from the input image. It usually includes a series of convolutional layers, activation functions and pooling layers, forming the deep structure of the neural network. The design of the main branch mainly follows the structure of the residual network (ResNet), where each residual block contains multiple convolutional layers and a shortcut connection, a combination that effectively extracts deep image features and prevents the vanishing gradient problem. In addition, the main branch can extract image features with different receptive fields by changing the convolution kernel size and stride, so as to better capture both local and global information in the image.

A batch normalization layer and a rectified linear unit (ReLU) activation are placed after each convolutional layer in the main branch. Suppose the output of the $m$-th convolutional layer is denoted $x^m$, the output after batch normalization and activation is denoted $p^m$, the activation function is denoted $d(\cdot)$, and the weights and biases are denoted $Q$ and $n$; the linear activation unit is then computed as:

$p^m=d\left(x^m\right)=d\left(Q p^{m-1}+n\right)$ (8)

The main task of the mask branch is to learn attention weights for the input image, directing the main branch to pay more attention to the key areas of the image. The mask branch generally consists of lightweight convolutional layers and activation functions, and finally outputs the attention weight of each pixel through a sigmoid function. These weights are applied to the main branch's feature maps to enable dynamic, feature-level attention adjustment. This design enables the model to adaptively focus on the areas of the image that are most critical to the task, thereby improving model performance.

An element-wise multiplication of the features and weights output by the two branches yields the attention feature map corresponding to the learning scene. Suppose the input of the residual attention network is denoted $z$, the output of the main branch is denoted $Y_{u, v}(z)$, and the output of the mask branch is denoted $L_{u, v}(z)$; then:

$G_{u, v}(z)=L_{u, v}(z) * Y_{u, v}(z)$ (9)

In order to avoid affecting the transmission of the extracted facial features of the students, this study adopts a method similar to residual connection to combine the output of the two branches. The output expression of the residual attention network model is:

$\left(1+L_{u, v}(z)\right) * Y_{u, v}(z)$ (10)
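A minimal sketch of a residual attention block consistent with Eqs. (9)-(10) is shown below, assuming a PyTorch implementation; the specific layer counts, channel width and 1×1 mask convolutions are illustrative choices, not details specified in the paper.

```python
import torch
import torch.nn as nn

class ResidualAttentionBlock(nn.Module):
    """Trunk branch Y(z) plus mask branch L(z); output (1 + L(z)) * Y(z), Eq. (10)."""
    def __init__(self, channels=64):
        super().__init__()
        # Main (trunk) branch: stacked conv layers in a residual style
        self.trunk = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )
        # Mask branch: lightweight convs ending in a sigmoid attention map
        self.mask = nn.Sequential(
            nn.Conv2d(channels, channels, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, z):
        y = self.trunk(z)          # Y_{u,v}(z): extracted features
        l = self.mask(z)           # L_{u,v}(z): per-pixel attention weights in (0, 1)
        return (1.0 + l) * y       # residual-style combination, Eq. (10)
```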

Facial emotion recognition tasks often involve multidimensional features, such as three-channel RGB images and feature maps from different convolutional layers. A feed-forward neural network (FFNN) can effectively process these multidimensional features and extract useful information. Combined with the multi-head attention network, feature relationships can be modeled in multidimensional space to better understand and recognize facial emotions.

The workflow of the adopted multi-head attention network is as follows:

$ATT((J, C), W)=ATT\left((J, C), w_1\right) \oplus \cdots \oplus ATT\left((J, C), w_g\right)$ (11)

Suppose $x^m$ denotes the linear output of the $m$-th layer, $p^m$ its activated output, $Q^m$ its weight matrix, and $n^m$ its bias. The layer-by-layer iteration of the feed-forward neural network is then:

$x^m=Q^m p^{m-1}+n^m$ (12)

$p^m=d^m\left(x^m\right)$ (13)
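The layer-wise iteration of Eqs. (12)-(13) corresponds to an ordinary feed-forward network; a small sketch follows, with the layer widths and the choice of ReLU as the activation $d(\cdot)$ being assumptions for illustration.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Layer-wise iteration of Eqs. (12)-(13): x^m = Q^m p^{m-1} + n^m, p^m = d^m(x^m)."""
    def __init__(self, dims=(256, 512, 256)):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(dims[m - 1], dims[m]) for m in range(1, len(dims)))

    def forward(self, p):
        for linear in self.layers:
            x = linear(p)          # x^m = Q^m p^{m-1} + n^m
            p = torch.relu(x)      # p^m = d^m(x^m), here with ReLU as the activation
        return p
```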

In facial emotion recognition, global feature relationships (e.g., the interaction between the eyes and the mouth) are crucial for understanding the overall facial emotion. A non-local neural network computes the relationship between any two positions, models global relationships in the feature map, and helps extract more complete emotional information. Because non-local neural networks can capture richer spatial information, they can significantly improve performance on facial emotion recognition tasks. Combining them with the residual attention network and the Transformer network enables the model to better understand complex facial emotions, thereby improving recognition accuracy. Figure 4 shows the architecture of the non-local neural network. Suppose the input features are denoted $z$, the output at position $u$ is denoted $t_u$, the normalization factor is denoted $V(z)$, and $k$ indexes all positions involved in the computation with $u$. The non-local operation is defined as follows:

$t_u=\frac{1}{V(z)} \sum_{\forall_k} d\left(z_u, z_k\right) h\left(z_k\right)$ (14)

The correlation score between position $u$ and every other position is computed by $d\left(z_u, z_k\right)$, and the representation of the input at position $k$ is computed by $h(z_k)$.

In this study, the correlation between positions $u$ and $k$ in the feature map is computed from the embeddings $Q_\phi z$ and $Q_\rho z$ of the non-local network's input, expressed as $d\left(z_u, z_k\right)=\left(Q_\phi z_u\right)^T\left(Q_\rho z_k\right)$. The detailed computation of the non-local neural network proceeds as follows:

Step 1: Apply 1×1 convolutions to the feature map $z$ to obtain the embeddings $\phi$, $\rho$ and $h$.

Step 2: Calculate the correlation score between $u$ and $k$ using the formula $d\left(z_u, z_k\right)=\left(Q_\phi z_u\right)^T\left(Q_\rho z_k\right)$.


Figure 3. Architecture of the residual attention network


Figure 4. Architecture of the non-local neural network

Step 3: Compute the weighted sum of the correlation scores and the feature representations $h(z_k)$ to obtain the learned dependency between $u$ and $k$.
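Steps 1-3 and Eq. (14) can be summarized in a short non-local block sketch like the one below, using the dot-product form $d(z_u, z_k)=(Q_\phi z_u)^T(Q_\rho z_k)$ and taking $V(z)$ as the number of positions; the channel sizes and the residual output projection are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Dot-product non-local operation of Eq. (14) with 1x1-conv embeddings (Steps 1-3)."""
    def __init__(self, channels=64, inter_channels=32):
        super().__init__()
        self.phi = nn.Conv2d(channels, inter_channels, 1)   # Q_phi z
        self.rho = nn.Conv2d(channels, inter_channels, 1)   # Q_rho z
        self.h   = nn.Conv2d(channels, inter_channels, 1)   # h(z)
        self.out = nn.Conv2d(inter_channels, channels, 1)   # project back to input width

    def forward(self, z):                                   # z: (batch, C, H, W)
        b, _, hgt, wid = z.shape
        phi = self.phi(z).flatten(2)                        # (b, C', N), N = H*W
        rho = self.rho(z).flatten(2)
        g   = self.h(z).flatten(2)
        # Step 2: d(z_u, z_k) = (Q_phi z_u)^T (Q_rho z_k) for all position pairs
        d = phi.transpose(1, 2) @ rho                       # (b, N, N)
        d = d / d.size(-1)                                  # normalization factor V(z) = N
        # Step 3: weighted sum of h(z_k) with the correlation scores
        t = (d @ g.transpose(1, 2)).transpose(1, 2)         # (b, C', N)
        t = t.view(b, -1, hgt, wid)
        return z + self.out(t)                              # residual connection
```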

4. General model and iterative update process

In this study, the Vision Transformer (ViT) is used as the main framework. ViT is an image processing model designed to extract spatial features from images. A NetVLAD module is also introduced to extract more valuable features from images through a clustering-based aggregation method. Finally, the Transformer module is used to learn the temporal features of the image sequence and further extract features in the temporal dimension. This enables comprehensive extraction of the spatio-temporal features of the learning scene for more accurate identification of student emotions. Combining ViT and NetVLAD allows the model to learn data features from multiple perspectives, thereby improving the model's generalization ability and making it perform well on unseen data. This can be represented by the following formula:

$M_c=M_B+\eta_1 M_Y+\frac{\eta_2}{2}\|q\|$ (15)

From the formula above, the combined loss function $M_c$ consists of the ViT loss $M_B$ and the NetVLAD branch loss $M_Y$, where $\eta_1$ and $\eta_2$ are hyperparameters that balance $M_Y$ and $M_B$.
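The paper does not spell out the NetVLAD equations, so the following is only a generic sketch of NetVLAD-style pooling as it is commonly implemented: local descriptors are softly assigned to learned cluster centroids and their residuals are aggregated into a fixed-length descriptor. The number of clusters, the descriptor dimension and all identifier names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """NetVLAD-style pooling: soft-assign local descriptors to K clusters and
    aggregate residuals into a fixed-length descriptor."""
    def __init__(self, num_clusters=16, dim=256):
        super().__init__()
        self.assign = nn.Conv1d(dim, num_clusters, kernel_size=1)  # soft-assignment logits
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x):                         # x: (batch, num_descriptors, dim)
        x = x.transpose(1, 2)                     # (batch, dim, N)
        a = F.softmax(self.assign(x), dim=1)      # (batch, K, N) soft cluster weights
        # residuals between each descriptor and each cluster centroid
        res = x.unsqueeze(1) - self.centroids.unsqueeze(0).unsqueeze(-1)  # (b, K, dim, N)
        vlad = (a.unsqueeze(2) * res).sum(dim=-1)                         # (b, K, dim)
        vlad = F.normalize(vlad, p=2, dim=2)      # intra-normalization per cluster
        return F.normalize(vlad.flatten(1), p=2, dim=1)                   # final descriptor
```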

To improve the robustness of the model and maintain high recognition accuracy across different environments and people, this study uses StarGAN to generate student images from the first frame of the learning scene sequence. The images generated by StarGAN increase the diversity of the data and can therefore be regarded as a data augmentation method; data augmentation improves the model's generalization ability, making it perform better on unseen data. Finally, the generated images and the first-frame images are mixed together and fed into the ViT network, allowing the model to understand and learn facial emotions from different perspectives, which helps improve its performance on facial emotion recognition tasks.

The mixed input images are denoted $z_u^l \in Z$ and their labels $t_u \in T$. Let the probability vector produced by the softmax function be $o_u$, the corresponding ground-truth label be $t_u$, the number of labeled samples be $B$, and the cross-entropy loss between the probability vector $o_u$ and $t_u$ be $M_H$. The main framework network then uses the softmax function and the cross-entropy loss as follows:

$M_H=-\sum_{u=1}^B t_u \log \left(o_u\right)$ (16)

Finally, the overall loss function of the model consists of three parts, $M_B$, $M_Y$ and $M_H$, as follows:

$M=M_H+M_B+\eta_1 M_Y+\frac{\eta_2}{2}\|q\|$ (17)
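A hedged sketch of how the overall objective of Eq. (17) could be assembled is given below; the values of $\eta_1$ and $\eta_2$ and the treatment of $\|q\|$ as the norm of the trainable weights are assumptions made for illustration.

```python
import torch

def total_loss(m_h, m_b, m_y, weights, eta1=0.5, eta2=1e-4):
    """Eq. (17): M = M_H + M_B + eta_1 * M_Y + (eta_2 / 2) * ||q||.
    m_h: cross-entropy of the main framework (Eq. (16)), m_b: ViT branch loss,
    m_y: loss of Eq. (7); eta1 and eta2 are balancing hyperparameters (illustrative values)."""
    q_norm = torch.norm(torch.cat([w.flatten() for w in weights]))  # ||q|| over all weights
    return m_h + m_b + eta1 * m_y + (eta2 / 2.0) * q_norm
```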

5. Results and analysis of experiments

Figure 5 compares emotion recognition accuracy before and after the introduction of the NetVLAD module. After the NetVLAD module was introduced, the recognition accuracy of every emotion category improved. For "fear", the accuracy increased by 0.02, from 0.71 to 0.73. For "anger", the accuracy increased by 0.02, from 0.68 to 0.70. For "disgust", the accuracy increased by 0.03, from 0.61 to 0.64. For "happy", the accuracy increased by 0.08, from 0.89 to 0.97, the largest improvement across all emotion categories. For "sadness", the accuracy increased by 0.06, from 0.65 to 0.71. For "surprise", the accuracy increased by 0.015, from 0.695 to 0.71. For "neutral", the accuracy increased by 0.06, from 0.84 to 0.90. These results show that the NetVLAD module plays a positive role in feature extraction and effectively improves the model's ability to recognize emotions, with "happy" benefiting the most. This also confirms the advantage of the NetVLAD module in obtaining more valuable features through its clustering method, thereby improving the discriminative ability of the model.

In addition, this study analyzed how the RMSE of the emotion recognition results for emotional arousal and emotional valence changes as the number of iterations increases. Figure 6 shows that the RMSE for both emotional arousal and emotional valence decreased during training and stabilized after about 30 iterations. For emotional arousal, the RMSE decreased from an initial 0.75 to a final 0.42, indicating that the model's prediction error for emotional arousal gradually decreases and its performance improves. For emotional valence, the RMSE decreased from an initial 1.68 to a final 0.49, which likewise shows a gradually decreasing prediction error and improving performance. Moreover, the error reduction for emotional valence was larger than for emotional arousal, suggesting that the model improves more on predicting emotional valence than on emotional arousal. In general, as the number of iterations increases, the model's prediction error gradually decreases, indicating good learning behavior. After about 30 iterations the error is essentially stable, which means that the model has converged and further training cannot significantly improve its performance. This result also confirms the effectiveness of the strategy of introducing the NetVLAD module and using the StarGAN and ViT networks for facial emotion recognition.

Similar conclusions can be drawn from the curve of the consistency correlation coefficient. As the number of iterations increased, the consistency of the model's predictions of emotional arousal and emotional valence improved, indicating good learning behavior (Figure 7). However, although the consistency correlation coefficient increased, it did not reach 1, indicating that some discrepancy remains between the model's predictions and the actual labels and that the model can be further improved. These results further show that introducing the NetVLAD module and using the StarGAN and ViT networks for facial emotion recognition can effectively improve the consistency of the model's predictions, thereby improving model performance.


Figure 5. Comparison of emotion recognition accuracy before and after the introduction of the NetVLAD module


Figure 6. Root mean square error (RMSE) variation curve


Figure 7. Consistency correlation coefficient variation curve

Table 1. Performance comparison of different network models

Network model | RMSE (emotional arousal) | RMSE (emotional valence) | Consistency correlation coefficient (emotional arousal) | Consistency correlation coefficient (emotional valence)
CNN-GRU | 0.501 | 0.378 | 0.189 | 0.587
LBVCNN | 0.411 | 0.389 | 0.332 | 0.645
Improved ConvLSTM | 0.439 | 0.376 | 0.335 | 0.679
FAN | 0.424 | 0.359 | 0.367 | 0.733
Ours | 0.387 | 0.356 | 0.511 | 0.454

Table 2. Comparison of recognition accuracy of different models

Network model | Accuracy | Input
CNN-GRU | 88.62% | Learning scene frame sequence
LBVCNN | 88.72% | Learning scene frame sequence
Improved ConvLSTM | 84.23% | Learning scene frame sequence
FAN | 85.28% | Learning scene frame sequence
Ours | 91.93% | Learning scene frame sequence

Table 1 shows the performance comparison of the different network models, reporting each model's RMSE and consistency correlation coefficient for emotional arousal and emotional valence. Overall, our model outperforms the other four models on three of the four metrics. For the RMSE of emotional arousal, our model (0.387) is lower than the other models (0.501, 0.411, 0.439, 0.424), indicating that it predicts emotional arousal with greater accuracy. Our model also has the lowest RMSE for emotional valence (0.356), further underlining its accuracy. For the consistency correlation coefficient of emotional arousal, our model (0.511) is significantly higher than the other models (0.189, 0.332, 0.335, 0.367), indicating greater consistency in predicting emotional arousal. For the consistency correlation coefficient of emotional valence, our model (0.454) is lower than the FAN model (0.733) and the other three models (0.587, 0.645, 0.679), indicating that the consistency of its valence predictions still leaves room for improvement, even though its valence RMSE is the lowest.

Table 2 compares the recognition accuracy of the different models when processing learning scene frame sequences. Overall, the proposed model achieves the highest accuracy, 91.93%, significantly higher than the other four models. Specifically, the CNN-GRU model reaches 88.62%, ranking third among the five models; the LBVCNN model is slightly higher at 88.72%, ranking second; the FAN model reaches 85.28%, ranking fourth; and the improved ConvLSTM model reaches 84.23%, ranking fifth. All models achieve a reasonably high accuracy on the learning scene sequences, indicating that they can recognize and understand the sequences to some extent. However, the proposed model performs best, with an accuracy well above the other models, because it can better handle the complexity and diversity of learning scene sequences and thus produce more accurate recognition results. These results further show that the proposed model has significant advantages in the emotion recognition task on learning scene sequences.

The experiment involves three scenarios. 1) Scenario 1: A new semester is starting and the teacher introduces an interesting new topic. The teacher is energetic and explains the topic in plain language. Students are highly engaged, curious and excited about the new knowledge. Interaction between teacher and students is frequent and the classroom atmosphere is active. Students actively participate in discussions and activities and show good learning performance. 2) Scenario 2: The teacher is presenting difficult or boring material. Although the teacher tries to explain it in a lively and interesting way, the students' reactions are not as positive as in Scenario 1 because they have difficulty understanding or are less interested. Overall, however, students still try to keep up, and the classroom atmosphere and learning outcomes remain good. 3) Scenario 3: In the middle of the course, the class is reviewing repetitive content that has already been learned and is not very demanding. The teacher tends to become somewhat tired of the repetitive content, and students get bored with the dry or repetitive material, leading to a drop in engagement. At this point the classroom atmosphere and learning outcomes are relatively average, and it is necessary to change the teaching method or add interactive elements to improve participation.

Table 3 shows the emotion recognition results for the three scenarios, covering changes in the facial expressions of the teacher and students, overall classroom emotional activity and the teaching effect. In Scenario 1, the teacher's facial expression activity is the highest, at 0.7583, and the positive facial expression activity of students 1, 2 and 3 is also high, i.e., they are in a relatively positive state during learning. The overall classroom emotional activity is 0.6753, a relatively high value; the classroom atmosphere is active, interaction is frequent, and the teaching effect is rated "good". In Scenario 2, the teacher's facial expression activity decreases to 0.5637 and the students' positive facial expression activity also decreases, while their negative facial expression activity increases: for students 1 and 2, negative facial activity rises noticeably to 0.5338 and 0.5642, indicating that the students encounter difficulties or challenges in learning. The overall classroom emotional activity drops to 0.3826; the classroom atmosphere is somewhat subdued and the teaching content is more demanding, but the learning outcome is still rated "good". In Scenario 3, the teacher's facial expression activity decreases further to 0.3875, and the students' positive facial expression activity decreases correspondingly, while negative facial expression activity increases, especially for student 3, whose negative facial expression activity reaches a very high 0.7836, indicating significant learning difficulties or challenges. Classroom emotional activity drops to 0.1846, and the learning outcome is rated "average". This shows that there are problems in classroom teaching at this stage and that the teacher needs to adjust teaching methods or strategies in time. Overall, emotion recognition based on learning scenes can effectively reflect the classroom teaching situation, help teachers understand students' learning status in real time, adjust teaching strategies, and improve learning outcomes.

Table 3. Emotion recognition results in the three scenarios

Indicator | Scenario 1 | Scenario 2 | Scenario 3
Teacher: facial expression activity | 0.7583 | 0.5637 | 0.3875
Student 1: positive facial expression activity | 0.6475 | 0.4826 | 0.3902
Student 1: negative facial expression activity | 0.3246 | 0.5338 | 0.6383
Student 2: positive facial expression activity | 0.5684 | 0.4985 | 0.3784
Student 2: negative facial expression activity | 0.4531 | 0.5642 | 0.6284
Student 3: positive facial expression activity | 0.7853 | 0.5829 | 0.2946
Student 3: negative facial expression activity | 0.2345 | 0.4726 | 0.7836
Classroom emotional activity | 0.6753 | 0.3826 | 0.1846
Learning outcome assessment | Good | Good | Average

Table 4 presents the emotion evaluation results for the three scenarios. Across these three typical scenarios, we can observe changes in the emotion evaluations of the teacher, the students and the whole class, as well as in the learning outcomes. In Scenario 1, the emotion evaluations of both teacher and students are "positive", the classroom emotion evaluation is "excellent", and the learning outcome assessment is also "excellent". This means that in such an environment the teacher is enthusiastic about teaching, the students actively participate in learning, the classroom atmosphere is active, and the teaching effect is excellent. In Scenario 2, the teacher's emotion evaluation remains "positive", but the students' emotion evaluation drops to "normal", and both the classroom emotion evaluation and the learning outcome assessment are "good". This means that although the teacher is still teaching actively, the difficulty of the content or other factors has reduced student engagement, and the classroom atmosphere and learning outcomes have declined compared with Scenario 1, though they remain good overall. In Scenario 3, the emotion evaluations of both teacher and students are "average", and the classroom emotion evaluation and learning outcome assessment are also "average". This shows that in this scenario both the teacher's enthusiasm for teaching and the students' enthusiasm for learning have decreased, and the classroom atmosphere and learning outcomes are mediocre. Taken together, these three scenarios reveal a strong relationship between emotional state and teaching effectiveness: the emotional states of teachers and students directly affect the classroom atmosphere and hence the course of the lesson. Teachers should therefore pay attention to and manage their own and their students' emotional states to improve the teaching effect.

Table 4. Emotion evaluation results for the three scenarios

Scenario | Teacher emotion evaluation | Student emotion evaluation | Classroom emotion evaluation | Learning outcome assessment
Scenario 1 | Positive | Positive | Excellent | Excellent
Scenario 2 | Positive | Normal | Good | Good
Scenario 3 | Average | Average | Average | Average

6. Conclusion

In summary, this study addresses emotion recognition in learning scenarios supported by a smart classroom. A Transformer encoder is used to extract the temporal features of learners' facial emotions from the learning scene; that is, the encoder's self-attention module extracts the temporal characteristics of facial emotions across the learning scene sequence. The combination of a residual attention network, a Transformer and a non-local neural network enables extraction of facial emotion features from different perspectives and levels, and the combination of the Vision Transformer (ViT) and NetVLAD enables the model to learn data features from multiple perspectives, thereby improving its generalization ability. This study first compared the recognition accuracy for each emotion before and after the introduction of the NetVLAD module, confirming the advantage of the NetVLAD module in obtaining more valuable features through its clustering method and improving the model's recognition ability. The changes in the root mean square error and the consistency correlation coefficient for emotional arousal and emotional valence were analyzed as the number of iterations increased, further confirming that introducing the NetVLAD module and using the StarGAN and ViT networks for facial emotion recognition can effectively improve the model's predictive ability and performance. A performance comparison of different network models was provided, showing that the proposed model achieves the lowest RMSE for both emotional arousal and emotional valence and the highest consistency correlation coefficient for emotional arousal. In the experiment, three scenarios were set up and the corresponding emotion recognition results were presented, confirming that emotion recognition based on learning scenes can effectively reflect the classroom teaching situation, help teachers understand students' learning status in real time, adjust teaching strategies and improve learning outcomes.

Acknowledgment

This paper was funded by the Quzhou Science and Technology Guiding Plan Project (Grant No.: 2023ZD146) and the Scientific Research Project of Quzhou Vocational and Technical College (Approval No.: QZYY2113).

References

[1] Wang, H.B., Zhou, J.T., Hu, C.L., Chen, W.W. (2022). Vehicle lateral stability control based on stability category recognition and enhanced brain emotion learning network. IEEE Transactions on Vehicular Technology, 71(6): 5930-5943. https://doi.org/10.1109/TVT.2022.3159271

[2] Zhang, H.Z., Yin, J.B., Zhang, X.L. (2020). Research on the five-dimensional model of emotions for facial emotion recognition. Mobile Information Systems, 2020: Article No. 8860608. https://doi.org/10.1155/2020/8860608

[3] ElBedwehy, M.N., Behery, G.M., Elbarougy, R. (2020). Emotional speech recognition based on a weighted distance optimization system. International Journal of Pattern Recognition and Artificial Intelligence, 34(11): 2050027. https://doi.org/10.1142/S0218001420500275

[4] Xu, Z.Q., He, J.L., Liu, Y.Z. (2020). A study of emotion pattern recognition and emotion model construction. In International Big Data Conference on Frontier Computing: Theory, Technology and Applications, pp. 982-992.

[5] Yang, H.S., Fan, Y.Y., Lu, G.Y., Liu, S.Y., Guo, Z. (2023). Using emotion concepts for image emotion recognition. The Visual Computer, 39(5): 2177-2190. https://doi.org/10.1007/s00371-022-02472-8

[6] Sheng, W.J., Lu, X.Y., Li, X.D. (2023). Extending data for emotional gait recognition by separating identity and emotion representations. Robotica, 41(5): 1452-1465. https://doi.org/10.1017/S0263574722001813

[7] Qasim, M., Habib, T., Urooj, S., Mumtaz, B. (2023). DESCU: Binary Urdu emotional speech corpus and recognition system. Speech Communication, 148: 40-52. https://doi.org/10.1016/j.specom.2023.02.002

[8] Tropmann-Frick, M. (2023). Recognizing sign language and signer-specific emotions using similarity measures. Frontiers in Artificial Intelligence and Applications, Information Modelling and Knowledge Bases XXXIV, 364: 21-37. https://doi.org/10.3233/FAIA220490

[9] Banskota, N., Alsadoon, A., Prasad, P.W.C., Dawoud, A., Rashid, T.A., Alsadoon, O.H. (2023). A novel enhanced convolutional neural network with extreme learning machine: Facial emotion recognition in psychological practice. Multimedia Tools and Applications, 82(5): 6479-6503. https://doi.org/10.1007/s11042-022-13567-8

[10] Zhou, Z.J., Asghar, M.A., Nazir, D., Siddique, K., Shofuzzaman, M., Mehmood, R.M. (2023). AI-based affect recognition models using physiological signals for healthcare and emotional well-being. Cluster Computing, 26(2): 1253-1266. https://doi.org/10.1007/s10586-022-03705-0

[11] Chen, K.Y., Yang, X., Fan, C.J., Zhang, W., Ding, Y. (2022). Semantically rich facial expression recognition. IEEE Transactions on Affective Computing, 13(4): 1906-1916. https://doi.org/10.1109/TAFFC.2022.3201290

[12] Sutedja, I., Septia, J. (2022). Recognizing emotional expression: A systematic review of the literature. In 2022 International Conference on Information Management and Technology (ICIMTech), pp. 184-188. https://doi.org/10.1109/ICIMTech55957.2022.9915210

[13] Ryumina, E., Ivanko, D. (2022). Emotional speech recognition based on lip-reading. In Speech and Computer: 24th International Conference, SPECOM 2022, Gurgaon, India, pp. 616-625. https://doi.org/10.1007/978-3-031-20980-2_52

[14] Jiang, H.P., Jia, J.J. (2020). LSTM-based EEG emotion recognition research. In Bio-inspired Computing: Theories and Applications: 14th International Conference, BIC-TA 2019, Zhengzhou, China, pp. 409-417. https://doi.org/10.1007/978-981-15-3415-7_34

[15] Deng, H.X., Qian, G.Y., Zhang, Y.F., Hu, Z.X., Liu, Y., Li, H.F. (2020). Analysis and recognition of emotions based on brain-computer networks. In 2020 8th International Conference on Advanced Cloud and Big Data (CBD), Taiyuan, China, pp. 56-61. https://doi.org/10.1109/CBD51900.2020.00019

[16] Akçay, M.B., Oğuz, K. (2020). Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Communication, 116: 56-76. https://doi.org/10.1016/j.specom.2019.12.001

[17] Cui, Z.L., Zhao, Y.J., Guo, J., Du, H.B., Zhang, J.J. (2020). Typical and reversal effects of other race on emotional facial recognition of Tibetan students. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, pp. 754-760. https://doi.org/10.1109/FG47880.2020.00031

[18] Saeki, K., Kato, M., Kosaka, T. (2020). Language model adaptation for emotional speech recognition using tweet data. In 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Auckland, New Zealand, pp. 371-375.

[19] Zhang, F., Li, X.C., Lim, C.P., Hua, Q., Dong, C.R., Zhai, J.H. (2022). Deep emotional arousal network for multimodal sentiment analysis and emotion recognition. Information Fusion, 88: 296-304. https://doi.org/10.1016/j.inffus.2022.07.006

[20] Dindar, M., Järvelä, S., Ahola, S., Huang, X., Zhao, G. (2020). Leaders and followers identified by emotional mimicry during collaborative learning: A facial expression recognition study on emotional valence. IEEE Transactions on Affective Computing, 13(3): 1390-1400. https://doi.org/10.1109/TAFFC.2020.3003243

[21] Fahad, M.S., Singh, S., Ranjan, A., Deepak, A. (2022). Emotion recognition from spontaneous speech using emotional vowel-like regions. Multimedia Tools and Applications, 81(10): 14025-14043. https://doi.org/10.1007/s11042-022-12453-7

[22] Su, C., Wang, G. (2020). Design and application of emotion recognition for students in the classroom. Journal of Physics: Conference Series, 1651(1): 012158. https://doi.org/10.1088/1742-6596/1651/1/012158

[23] Liang, J., Zhao, X.Y., Zhang, Z.H. (2020). Speech emotion recognition of teachers in classroom teaching. In 2020 Chinese Control and Decision Conference (CCDC), Hefei, China, pp. 5045-5050. https://doi.org/10.1109/CCDC49329.2020.9164823

[24] Putra, W.B., Arifin, F. (2019). Real-time emotion recognition system to monitor students' emotions in the classroom. Journal of Physics: Conference Series, 1413(1): 012021. https://doi.org/10.1088/1742-6596/1413/1/012021
