Multimodal Affective Computing: A Learning Evaluation Application Based on Joint Image and Text Processing
School of Electronics and Computer Engineering, Langfang Normal University, Langfang 065000, China
College of Education, Langfang Normal University, Langfang 065000, China
Corresponding author email: wangshunye@lfnu.edu.cn
Pages: 533-541. DOI: https://doi.org/10.18280/ts.400212
Received: January 10, 2023 | Accepted: April 7, 2023 | Published: April 30, 2023
Open Access
Abstract:
Traditional learning assessment emphasizes students' mastery of knowledge rather than their emotional state. Multimodal Affective Computing (MAC) can comprehensively analyze various kinds of information about students in the classroom, including facial expressions, gestures, and text feedback, helping teachers detect students' emotional problems in time and adjust their teaching methods and strategies accordingly. However, existing MAC techniques may produce unstable or erroneous judgments for complex emotional expressions, leading to inaccurate assessments of students' emotional states and adversely affecting learning evaluation outcomes. In response to these issues, this study applies MAC to learning evaluation based on joint image and text processing. The input text is divided into two parts, the body and the hashtags, and features are extracted from each. Image features are extracted from the two perspectives of the object and the scene, as these two perspectives provide different levels of information about the image. The MAC model is divided into modality-shared tasks and modality-private tasks to better adapt to new learning evaluation scenarios. Experimental results confirm the effectiveness of the proposed method.
Keywords:
Text and image processing, multimodal affective computing (MAC), learning evaluation
1. Introduction
With the rapid development of artificial intelligence and deep learning, significant progress has been made in computer vision and natural language processing [1-5], creating the conditions for the emergence of MAC based on joint text and image processing [6-12]. Traditional teaching evaluation emphasizes students' mastery of knowledge over their emotions [13-15], yet emotional state has a very significant impact on the learning process and on student performance. MAC can comprehensively analyze all kinds of information about students in the classroom, including facial expressions, gestures, and text feedback, helping teachers detect students' emotional problems in time and adjust their teaching methods and strategies appropriately [16-18]. The results obtained in this study can be directly applied in practical teaching evaluation, can help teachers understand students' educational needs and emotional states more fully and accurately, and can provide useful evidence for educational decision-makers, thereby improving learning outcomes, promoting a sensible allocation of educational resources, advancing educational equity, and raising the overall level of education.
Pham and Wang [19] proposed AttentiveLearner2, a multimodal intelligent tutor that runs on unmodified smartphones and complements the click-based learning analytics of today's MOOCs (Massive Open Online Courses). AttentiveLearner2 uses the front and rear cameras of the smartphone as two complementary channels for accurate real-time feedback: the rear camera monitors students' photoplethysmographic (PPG) signals, and the front camera monitors their facial expressions during MOOC learning, from which students' affective and cognitive states are inferred. Barron-Estrada et al. [20] presented an initial implementation of a multimodal emotion recognition system on mobile devices and the creation of an emotion database using a mobile application. The recognizer works inside mobile educational apps to identify the user's emotions when interacting with the device; the emotions identified by the system are engagement and boredom. The emotion database was built from the spontaneous emotions of students interacting with the educational mobile application Duolingo and the data-gathering mobile application EmoData. Hu and Flaxman [21] proposed a new method of multimodal sentiment analysis using deep neural networks that combines visual analysis and natural language processing. Their goal differs from that of standard sentiment analysis, which predicts whether a sentence expresses a positive or negative mood; instead, they aimed to infer users' underlying emotional states by predicting the emotional word tags that users attach to their Tumblr posts, treating them as self-reported emotions. Yin et al. [22] pointed out that using deep learning to analyze multimodal physiological signals and recognize human emotions is becoming increasingly attractive, but traditional deep emotion classifiers may oversimplify the determination of the model structure and the fusion of multimodal feature abstractions. They therefore proposed a multiple-fusion-layer ensemble classifier of stacked autoencoders for emotion recognition, in which the deep structure is identified with a physiological-data-driven approach.
A thorough review of existing research shows that while MAC has potential in learning evaluation, its shortcomings and challenges are also obvious. Existing MAC techniques can produce unstable or erroneous judgments for complex emotional expressions, which leads to inaccurate evaluation of students' emotional states and undermines the effectiveness of learning evaluation. In addition, most existing MAC techniques rely on deep learning models with poor interpretability, which can hinder teachers' understanding of and trust in the evaluation results. Given these issues, this article applies MAC to learning evaluation based on joint image and text processing. In Section 2, the input text is split into two parts, the body and the hashtags, and features are extracted from each separately. Section 3 extracts image features from the two perspectives of the object and the scene, which provide different levels of information about the image. Section 4 divides the MAC model into modality-shared tasks and modality-private tasks to better accommodate new teaching evaluation scenarios. Experimental results confirm the effectiveness of the proposed method.
2. Extraction of text features
MAC can play an important role in teaching evaluation because it integrates text and image processing, detects and assesses students' emotional states from their feedback and facial expression images, and provides personalized learning resources and suggestions according to their emotional state and learning requirements. By analyzing the text responses and images posted by students in online tests or surveys, it can help students understand and accept course content, can be used to assess teacher performance, and can help mental health professionals analyze students' affective conditions. These applications are conducive to improving the quality of teaching, equity in education, and the all-round development of students.
The use of MAC for learning evaluation requires the collection of the following textual and image data: 1) students' textual data, including homework and exam answers, textual data from classroom discussions, forums, chat tools, and other scenarios, as well as their feedback on and assessment of teachers, courses, and teaching methods, which reflect their satisfaction and needs; 2) students' image data, including facial expressions, gestures, and movements in class, and images produced in class, such as drawings, designs, and handicrafts; 3) teachers' textual data, including lesson plans, handouts, and textual records of teacher-student interactions; 4) teachers' image data, including facial expressions, postures, and actions in the classroom.
In this study, the input text is divided into two parts: the body and the hashtags. Textual data for teaching assessment tend to have high dimensionality, and a word-embedding representation can transform high-dimensional textual information into a low-dimensional vector form, thereby reducing computational complexity and improving training and prediction performance. When text is converted into vector form by the word segmentation algorithm, the semantic information of the original text is well preserved, which helps improve the accuracy of the MAC model in analyzing text features. The idea behind this method is to convert text into mathematical vectors, support comparison and computation of similarity between texts, facilitate the identification of texts with similar emotional characteristics in a learning evaluation scene, and thereby improve the accuracy of affective computing. Figure 1 shows the working principle of the word segmentation algorithm.
Figure 1. Principle of the word segmentation algorithm
Suppose $u$ denotes the $u$-th sample, $Y_u^q$ denotes the input body string of the evaluation text, and $Y_u^g$ denotes the hashtag string; the two parts of the text (body and hashtags) can then be written as $Y_u=\{Y_u^q, Y_u^g\}$. Let $q_k$ denote the $k$-th word in the body string of the learning assessment text and $l$ denote the length of the body string, so the input body string can be written as $Y_u^q=\{q_1, q_2, \ldots, q_k, \ldots, q_l\}$. Similarly, let $y_k$ denote the $k$-th hashtag in the hashtag string and $J$ denote the number of hashtags, so the hashtag string can be written as $Y_u^g=\{y_1, y_2, \ldots, y_k, \ldots, y_J\}$.
The process of converting text for teaching assessment into vectors is described by the following formula:
$G_0^y = d_{TE}\left(Y_u^q; \phi_{TE}\right), \quad G_0^y \in E^{l \times f}$ (1)
After preprocessing, the input learning assessment text is converted into word vectors in two steps. Suppose $q_k$ denotes an input word and $C$ denotes the size of the generated dictionary, which equals the length of each one-hot vector, so that $\{0,0,0,\ldots,0,1,0,\ldots,0\}\in E^{C}$ represents the one-hot encoding of a word. If the dimension of the word vector is denoted by $f$, then the weight matrix between the input layer and the output layer can be represented by a matrix $Q_r$ of size $C\times f$:
$Q_r=\begin{pmatrix} q_{11} & q_{12} & \cdots & q_{1f} \\ q_{21} & q_{22} & \cdots & q_{2f} \\ \vdots & \vdots & \ddots & \vdots \\ q_{C1} & q_{C2} & \cdots & q_{Cf} \end{pmatrix}$ (2)
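As a minimal sketch of Eqs. (1)-(2), the snippet below multiplies one-hot word codes by a weight matrix playing the role of $Q_r$ to obtain low-dimensional word vectors. The dictionary size, embedding dimension, and toy token indices are assumptions for illustration, not the paper's settings.

```python
# Minimal sketch (not the authors' code) of the one-hot -> word-vector step.
import numpy as np

C, f = 10, 4                      # dictionary size and word-vector dimension (assumed)
rng = np.random.default_rng(0)
Q_r = rng.normal(size=(C, f))     # weight matrix between input and output layer (Eq. 2)

def one_hot(index: int, size: int) -> np.ndarray:
    """One-hot encoding {0,...,0,1,0,...,0} of a word index."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def embed(word_indices: list[int]) -> np.ndarray:
    """Map a tokenized body string to an l x f matrix of word vectors (Eq. 1)."""
    return np.stack([one_hot(k, C) @ Q_r for k in word_indices])

G_0_y = embed([2, 5, 7])          # toy body string of three words
print(G_0_y.shape)                # (3, f): one row per word, as G_0^y in E^{l x f}
```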
Hashtags (e.g., #keyword) can contain important emotional information, i.e., keywords or phrases closely related to the content of the text, which help analyze the emotional characteristics of the text in more detail. A routing-based encoder can efficiently extract this key information and provide more valuable input for subsequent affective computing. In addition, keywords in hashtags may have similar or identical semantics to words in the body text; by encoding hashtags separately, the weight of these keywords in the text vector can be increased, so that the semantic information of the text is captured better.
Suppose $y_k$ denotes the $k$-th hashtag in the sample and $J$ denotes the number of hashtags; after removing the "#" symbol from each hashtag, the hashtag sequence can be written as $Y_u^g=\{y_1, \ldots, y_k, \ldots, y_J\}$. Meanwhile, let $G_0^g$ denote the generated hidden state of the hashtags, $d_{BM}$ denote the network that encodes the hashtags, and $\varphi_{BM}$ denote the parameters the network needs to learn; the hashtag encoding process can then be expressed by the following formula:
$G_0^g = d_{BM}\left(Y_u^g; \varphi_{BM}\right), \quad G_0^g \in E^{l \times f}$ (3)
As with the body text, hashtags are also represented in vector form. Suppose $C_g$ denotes the number of hashtags and $Q_g\in E^{C_g\times f}$ denotes a learnable matrix; then:
$r_k^g=Q_g \cdot y_k$ (4)
The initialization parameters of the neural network are obtained by uniform sampling over an interval determined by the layer sizes, where $b_{SR}$ and $b_{SC}$ denote the number of neurons in the input layer and the output layer, respectively:
$q \sim U\left(-\sqrt{\frac{6}{b_{SR}+b_{SC}}}, \sqrt{\frac{6}{b_{SR}+b_{SC}}}\right)$ (5)
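Eq. (5) is a Glorot/Xavier-style uniform initialization. The sketch below, with assumed toy layer sizes, shows how such a weight matrix could be drawn; the function name and sizes are illustrative, not from the paper.

```python
# Sketch of the uniform initialization in Eq. (5), assuming b_SR and b_SC are the
# input-layer and output-layer sizes of a linear layer.
import numpy as np

def glorot_uniform(b_SR: int, b_SC: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    bound = np.sqrt(6.0 / (b_SR + b_SC))
    return rng.uniform(-bound, bound, size=(b_SR, b_SC))

W = glorot_uniform(300, 128)      # e.g. a 300 -> 128 projection (toy sizes)
print(W.min(), W.max())           # all entries lie within (-bound, bound)
```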
Figure 2. Principle of the hashtag extraction algorithm
To use MAC for learning evaluation, it is necessary to determine which hashtags automatically learned by the model during training are associated with emotion. In this way, when the model processes the learning assessment text, it can focus on hashtags that contain emotional information, thereby improving the accuracy of affective computing. To implement this, a soft gating method is used to optimize the encoding performed by the routing network. Instead of simply classifying hashtags into two categories (pass or fail), the soft gating method allows multiple hashtags to pass with a certain probability, which preserves the relative weights between hashtags so that the model can capture the differences between them more accurately and enhance the effect of affective computing. Figure 2 shows how the hashtag extraction algorithm works.
A vector $i_v\in E^f$ is used to represent the global textual context of the learning evaluation, and the similarity between the current hashtag and $i_v$ is computed so that a probability-based gating mechanism can decide whether the $k$-th hashtag passes the gate. Specifically, let $r_k^g$ denote the vector representation of the $k$-th hashtag, which is transformed non-linearly to obtain its hidden representation; let $i_k^g$ denote the hidden representation of the $k$-th hashtag, and let the transformation matrix and bias term be $Q_{gg}\in E^{f\times f}$ and $n_g^y\in E^f$. The process can then be described by the following formula:
$i_k^g=\operatorname{Tanh}\left(Q_{gg}\cdot r_k^g+n_g^y\right)$ (6)
The sigmoid function is then used to calculate the probability $\beta_k$ with which each hashtag is selected by the gating mechanism:
${{\beta}_{k}}=\frac{1}{1+{{e}^{-i_{k}^{g}\cdot {{i}_{v}}}}}$ (7)
Then take the weighted sum of the hidden expressions of all the hash tags:
${{r}^{g}}=\sum\limits_{k=1}^{J}{{{\beta}_{k}}\cdot r_{k}^{g}}$ (8)
Finally, a linear transformation of the hashtag representation is performed to map the information from all samples into the same semantic space. Suppose $G_0^g\in E^f$ denotes the final hashtag representation and $Q_{gg}\in E^{f\times f}$ denotes the transformation matrix; then:
$G_0^g={{Q}_{gg}}\cdot {{r}^{g}}$ (9)
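The soft-gated hashtag encoder of Eqs. (6)-(9) can be summarized in a few lines of PyTorch. The module below is a minimal sketch under assumed dimensions; layer names (Q_gg, i_v, Q_out) mirror the symbols above and are not the authors' implementation.

```python
# Minimal PyTorch sketch of the soft-gated hashtag encoder (Eqs. 6-9).
import torch
import torch.nn as nn

class HashtagGate(nn.Module):
    def __init__(self, f: int):
        super().__init__()
        self.Q_gg = nn.Linear(f, f)               # transformation matrix + bias (Eq. 6)
        self.i_v = nn.Parameter(torch.randn(f))   # global textual context vector
        self.Q_out = nn.Linear(f, f, bias=False)  # final linear map (Eq. 9)

    def forward(self, r_g: torch.Tensor) -> torch.Tensor:
        # r_g: (J, f) hashtag vectors of one sample
        i_g = torch.tanh(self.Q_gg(r_g))           # hidden representations (Eq. 6)
        beta = torch.sigmoid(i_g @ self.i_v)       # gate probabilities (Eq. 7)
        r = (beta.unsqueeze(-1) * r_g).sum(dim=0)  # weighted sum over hashtags (Eq. 8)
        return self.Q_out(r)                       # G_0^g (Eq. 9)

gate = HashtagGate(f=64)
print(gate(torch.randn(5, 64)).shape)  # torch.Size([64]) for J=5 hashtags
```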
3. Image feature extraction
To obtain more complete emotional information when applying MAC to learning evaluation, this study extracts image features from the two perspectives of the scene and the object, as these two perspectives provide different levels of information about the image. Object features focus on specific objects in an image and the relationships between them, helping to identify objects and events, while scene features reveal the overall background and surroundings of an image, providing contextual information. Object and scene features complement each other, giving images a richer semantic representation. In learning evaluation, image data can contain multiple objects and scenes; by considering both aspects, the model's ability to generalize to new image data is strengthened. Figure 3 shows the working principle of the image feature extraction algorithm.
Figure 3. Principle of the image feature extraction algorithm
The local object features of the learning assessment scene images are obtained with an object detection algorithm. Suppose $G_0^p$ denotes the object-oriented image features, $X$ denotes the number of objects, $f$ denotes the feature dimension, $d_{PB}$ denotes the object detection network, and $\varphi_{PB}$ denotes the parameters the network needs to learn; this process can then be expressed as:
$G_0^p=d_{PB}\left(L; \varphi_{PB}\right), \quad G_0^p\in E^{X\times f}$ (10)
The local object regions of the teaching evaluation scene image can be written as $R^p=\{r_1^p, r_2^p, \ldots, r_X^p\}$. Suppose $G_0^p\in E^{X\times f}$ denotes the final local object features of the learning evaluation scene and $Q_{gp}\in E^{f\times f}$ denotes the transformation matrix; the non-linear transformation of the image features into the text semantic space is given by:
$G_0^p={{Q}_{gp}}\cdot {{r}^{p}}$ (11)
Suppose $G_0^a$ denotes the scene-oriented image features, $d_{SC}$ denotes the scene classification network, and $\varphi_{SC}$ denotes the parameters the network needs to learn; the global scene feature extraction process is given by the following formula:
$G_0^a=d_{SC}\left(L; \varphi_{SC}\right), \quad G_0^a\in E^{f}$ (12)
Suppose $G_0^a\in E^f$ denotes the final output scene features and $Q_{ga}\in E^{f\times f}$ denotes the transformation matrix; the non-linear transformation of the image features into the text semantic space is given by:
$G_0^a={{Q}_{ga}}\cdot {{r}^{a}}$ (13)
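The two-branch idea of Eqs. (10)-(13) is sketched below with torchvision: one branch pools features from object regions, the other from the whole image, and each is linearly projected into the text semantic space. The ResNet-50 backbone, the 2048-dimensional feature size, and the shared dimension f are assumptions; the paper's actual detection and scene networks may differ.

```python
# Illustrative sketch (not the paper's exact networks) of object- and scene-level
# image feature extraction followed by projection into the text space.
import torch
import torch.nn as nn
import torchvision.models as models

f = 64  # shared semantic dimension (assumed)

backbone = models.resnet50(weights=None)  # untrained backbone, used for both branches
backbone.fc = nn.Identity()               # output 2048-d pooled features
proj_object = nn.Linear(2048, f)          # plays the role of Q_gp (Eq. 11)
proj_scene = nn.Linear(2048, f)           # plays the role of Q_ga (Eq. 13)

image = torch.randn(1, 3, 224, 224)       # whole scene image L
regions = torch.randn(4, 3, 224, 224)     # X=4 cropped object regions (toy)

with torch.no_grad():
    G_0_p = proj_object(backbone(regions))   # (X, f) object features, Eqs. (10)-(11)
    G_0_a = proj_scene(backbone(image))[0]   # (f,) global scene feature, Eqs. (12)-(13)
print(G_0_p.shape, G_0_a.shape)
```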
4. MAC method combining text and image processing
After feature extraction from the collected image and text data, and in order to further apply MAC to learning evaluation with joint image and text processing, the MAC model is divided into modality-shared tasks and modality-private tasks to achieve better adaptability to new classroom scenarios. The modality-shared task integrates information across modalities and extracts cross-modal shared features, which helps the model better capture the relationship between text and images and improves the accuracy of affective computing. The modality-private tasks focus on the intrinsic characteristics of each modality and retain the unique information within it, so that when the model combines multimodal information it does not lose the characteristics of each modality, thereby better reflecting the emotional information of the learning evaluation scene. This division of tasks also improves the generalization ability of the model: modality-shared tasks learn the common features of different modalities, while modality-private tasks focus on the features specific to each modality, making the model more flexible for new learning evaluation scenarios.
Figure 4. Principle of the modality-private task
The key to the modality-private task (Figure 4) is to predict and analyze each modality separately so that the unique information within each modality, namely the Dirichlet distribution parameters and subjective opinions, can be extracted. This helps prevent the loss of modality-specific characteristics when combining multimodal information and thus better reveals emotional information in learning evaluation scenarios. The prediction results of the individual modalities are then combined according to Dempster-Shafer (D-S) evidence theory. Since the Dirichlet distribution parameters and subjective opinions provide an uncertainty estimate for each modality's predictions, D-S evidence theory is applied to deal effectively with this incomplete, uncertain, and conflicting information, so that the model can flexibly combine the modalities and thereby improve its adaptability and accuracy in teaching evaluation scenarios.
Suppose $o^y$ and $o^c$ denote the subjective opinions output by the text and image modalities, $\alpha^y$ and $\alpha^c$ denote the corresponding Dirichlet distribution parameters, $B$ denotes the number of samples, $J$ denotes the number of categories, and $\hat{t}_v=\left[\hat{o}_1, \hat{o}_2, \ldots, \hat{o}_J\right]$ denotes the final prediction result fused according to D-S evidence theory, which carries an uncertainty estimate for each modality; the loss function can then be expressed as:
$Loss_{UN}=-\sum\limits_{u=1}^{B}{\sum\limits_{k=1}^{J}{{{t}_{uk}}\log \left( {{{\hat{o}}}_{uk}} \right)}}$ (14)
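Eq. (14) is a cross-entropy over the class probabilities obtained after the per-modality opinions have been fused. The sketch below shows only this loss under assumed toy inputs; the Dirichlet parameterization and the D-S combination step are abstracted away, so the fused probabilities are stand-ins.

```python
# Sketch of the modality-private loss in Eq. (14). The D-S fusion producing o_hat
# is assumed to have happened upstream.
import torch

def loss_un(t: torch.Tensor, o_hat: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """t: (B, J) one-hot labels; o_hat: (B, J) fused class probabilities."""
    return -(t * torch.log(o_hat + eps)).sum()

t = torch.eye(3)[[0, 2, 1]]                       # B=3 samples, J=3 categories (toy)
o_hat = torch.softmax(torch.randn(3, 3), dim=-1)  # toy fused predictions
print(loss_un(t, o_hat))
```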
The key to the modality-shared task (Figure 5) is to build a feature fusion network based on tensor decomposition and use it to combine the features of the different modalities effectively, integrating the multimodal text and image information into a single representation and strengthening the model's ability to capture emotional information in teaching evaluation scenarios. Tensor decomposition reduces the computational complexity of the fusion process and improves model performance: by performing a low-rank approximation of the multimodal features, the dimensionality of the fused features can be significantly reduced, lowering the computational load on the model. Finally, based on the fusion result, a multilayer perceptron (MLP) is used for predictive classification. The MLP can effectively integrate multimodal features, reduce computational complexity, capture high-level interaction information, and be trained end to end; it also offers good flexibility and scalability, making it a practical design choice for applying MAC in classroom evaluation scenarios.
Figure 5. Principle of the modality-shared task
Suppose $G_y$ and $G_c$ denote the feature vectors of the text and image after the modality-private representation. When fused, the feature tensor can be expressed as $X=x_y\otimes x_c$, where $\otimes$ denotes the tensor product, $f_y$ and $f_c$ denote the dimensions of the feature vectors, and $X\in E^{f_y\times f_c}$; the feature vector of each modality is first expanded by one dimension according to the following formula:
$x_y=\Gamma\left(G_y, 1\right), \; x_y\in E^{f_y}; \quad x_c=\Gamma\left(G_c, 1\right), \; x_c\in E^{f_c}$ (15)
assumptions:askRepresents the linear weight of the layer,Nrepresent prejudiceXrepresents a third-order tensor,ask∈Otherfly×fc×fGIt represents a fourth-order tensor, and the extra dimension is the dimension of the output vectorFG,Goutput through the line layerH(""), these are:
$G=h\left(X; Q, n\right)=Q\cdot X+n, \quad G, n\in E^{f_G}$ (16)
$Q$ can be viewed as the superposition of $f_G$ tensors $\tilde{Q}_j\in E^{f_y\times f_c}$. Each $\tilde{Q}_j$ can be decomposed by a rank-$E$ factorization as follows:
$\tilde{Q}_j=\sum\limits_{u=1}^{E}{q_{y,j}^{(u)}\otimes q_{c,j}^{(u)}}$ (17)
Suppose $q_{y,j}^{(u)}\in E^{f_y}$ and $q_{c,j}^{(u)}\in E^{f_c}$ denote the factorization factors of $\tilde{Q}_j$ for the text and image features, where the rank $E$ is set to a constant. The fixed set of rank-$E$ factorization factors $\{q_{y,j}^{(u)}, q_{c,j}^{(u)}\}$, $u=1,\ldots,E$, $j=1,\ldots,f_G$, can be used to reconstruct the tensors $\tilde{Q}_j$. Let $Q_y^{(u)}=[q_{y,1}^{(u)}, q_{y,2}^{(u)}, \ldots, q_{y,f_G}^{(u)}]$, $Q_s^{(u)}=[q_{s,1}^{(u)}, q_{s,2}^{(u)}, \ldots, q_{s,f_G}^{(u)}]$, and $Q_c^{(u)}=[q_{c,1}^{(u)}, q_{c,2}^{(u)}, \ldots, q_{c,f_G}^{(u)}]$; the weight $Q$ in Eq. (16) can then be represented by the following formula:
$Q=\sum\limits_{u=1}^{E}{q_{y}^{(u)}\otimes q_{s}^{(u)}\otimes q_{c}^{(u)}}$ (18)
In addition, equation 16 can be written as:
$\begin{aligned} g &= \left(\sum\limits_{u=1}^{E}{q_{y,j}^{(u)}\otimes q_{s,j}^{(u)}\otimes q_{c,j}^{(u)}}\right)\cdot X = \sum\limits_{u=1}^{E}{\left(q_{y,j}^{(u)}\otimes q_{s,j}^{(u)}\otimes q_{c,j}^{(u)}\cdot X\right)} \\ &= \sum\limits_{u=1}^{E}{\left(q_{y,j}^{(u)}\otimes q_{s,j}^{(u)}\otimes q_{c,j}^{(u)}\cdot x_y\otimes x_s\otimes x_c\right)} \\ &= \left(\sum\limits_{u=1}^{E}{q_{y}^{(u)}\cdot x_y}\right)\otimes \left(\sum\limits_{u=1}^{E}{q_{s}^{(u)}\cdot x_s}\right)\otimes \left(\sum\limits_{u=1}^{E}{q_{c}^{(u)}\cdot x_c}\right) \end{aligned}$ (19)
Let $Q_v\in E^{f_G\times 1}$ denote the weight and $n_v$ the bias of the final fully connected layer, and let $\hat{t}_v$ denote the emotion prediction result; the multimodal emotion prediction can then be expressed as:
$\hat{t}_v=Q_v^{T}G+n_v$ (20)
The loss function of the modality sharing task can be expressed as:
$Loss_{CO}=\frac{1}{B}\sum\limits_{u=1}^{B}{\left( \hat{t}_u - t_u \right)^2}$ (21)
Suppose $\beta$ denotes a hyperparameter; the total loss function of the MAC model based on joint text and image processing can then be expressed as:
$Loss_{MA}=Loss_{CO}+\beta \, Loss_{UN}$ (22)
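A compact sketch of the modality-shared branch is given below: low-rank fusion of a text vector and an image vector following the final form of Eq. (19), a linear prediction head as in Eq. (20), and the weighted total loss of Eq. (22). The rank, dimensions, beta value, and two-modality simplification are assumptions for illustration, not the paper's configuration.

```python
# PyTorch sketch of low-rank bimodal fusion (Eqs. 15-20) and the total loss (Eqs. 21-22).
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    def __init__(self, f_y: int, f_c: int, f_G: int, rank: int):
        super().__init__()
        # Rank-E factors standing in for the full weight tensor Q (Eqs. 17-18)
        self.Q_y = nn.Parameter(torch.randn(rank, f_y, f_G) * 0.1)
        self.Q_c = nn.Parameter(torch.randn(rank, f_c, f_G) * 0.1)
        self.bias = nn.Parameter(torch.zeros(f_G))
        self.head = nn.Linear(f_G, 1)   # prediction head for t_hat (Eq. 20)

    def forward(self, x_y: torch.Tensor, x_c: torch.Tensor) -> torch.Tensor:
        # Sum each modality's rank-wise projections, then combine elementwise,
        # matching the factorized final form of Eq. (19).
        proj_y = torch.einsum('bd,rdg->bg', x_y, self.Q_y)  # sum_u Q_y^(u) . x_y
        proj_c = torch.einsum('bd,rdg->bg', x_c, self.Q_c)  # sum_u Q_c^(u) . x_c
        G = proj_y * proj_c + self.bias                      # fused feature G
        return self.head(G).squeeze(-1)

model = LowRankFusion(f_y=64, f_c=64, f_G=32, rank=4)
t_hat = model(torch.randn(8, 64), torch.randn(8, 64))        # batch of B=8 samples
loss_co = ((t_hat - torch.randn(8)) ** 2).mean()             # Eq. (21), toy targets
loss_un = torch.tensor(0.5)                                  # from the modality-private task
loss_ma = loss_co + 0.1 * loss_un                            # Eq. (22), beta = 0.1 (assumed)
print(loss_ma)
```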
5. Results and analysis of experiments
Figure 6 shows how emotion classification accuracy changes with the number of key regions. For the six key-region numbers considered (1, 2, 3, 4, 5, 6), the corresponding emotion classification accuracies are 77%, 77.5%, 81.9%, 82.7%, 83.1%, and 83.4%, respectively. The figure shows that as the number of key regions increases, the accuracy of emotion classification increases, indicating that more key regions help the model better capture emotional information. When the number of key regions increases from 1 to 2, the gain in classification accuracy is very limited (from 77% to 77.5%), indicating that a small increase in the number of key regions may not significantly improve accuracy under certain conditions. Once the number of key regions reaches 3, however, the improvement becomes obvious; in particular, when the number of key regions increases from 3 to 4, accuracy rises from 81.9% to 82.7%, a significant improvement, indicating that in the MAC task an appropriate increase in the number of key regions has a positive effect on model performance. As the number of key regions continues to increase from 5 to 6, classification accuracy keeps rising, but the increase is small, indicating that once a certain threshold is reached, further increasing the number of key regions has a limited effect on model performance.
Figure 6. Emotion classification accuracy for different numbers of key regions
Table 1. Experimental results of various models

Model | Precision (%) | Recall (%) | F1 (%) | Accuracy (%)
Base model | 72.16 | 70.63 | 71.96 | 74.02
Base model + text feature extraction | 74.93 | 71.95 | 73.48 | 71.35
Base model + image feature extraction | 78.15 | 85.16 | 82.51 | 85.19
Proposed model | 82.49 | 84.31 | 83.26 | 83.42
Table 1 shows the experimental results of the various models, comparing four models on four metrics: precision, recall, F1 score, and accuracy. The models included in the experiment are the base model, the base model + text feature extraction, the base model + image feature extraction, and the proposed model. In our experiments, ResNet-50 was selected as the base model; it only performs simple text and image processing and does not include advanced feature extraction or MAC. The table shows that the performance of the base model is the lowest among all models, which justifies the need for advanced feature extraction and MAC methods. For the base model + text feature extraction, adding text feature extraction increases precision, recall, and F1, but accuracy decreases slightly. This may be because text feature extraction helps the model better understand the textual information, but without the support of image information the model may misjudge in some cases. For the base model + image feature extraction, adding image feature extraction improves the model's performance on all metrics, which shows that image feature extraction contributes substantially to affective computing; in a learning evaluation scene, the model can capture emotional information in images very well. The proposed model combines text feature extraction, image feature extraction, and the MAC method. According to the table, the proposed model surpasses the other models across the indicators, which confirms the effectiveness of the proposed method in teaching evaluation: by comprehensively analyzing textual and image information, the model can classify emotions accurately.
Figure 7 shows the experimental results of the different models as histograms. The figure shows that the proposed model achieves higher precision than the other models, which means it can identify the type of emotion more accurately and reduce the possibility of misjudgment.
Figure 7. Experimental results of various models
Compared with the other models, the recall of the proposed model is much better, indicating that it can identify samples containing specific emotions more comprehensively and that its coverage is broader. The F1 score is the harmonic mean of precision and recall and is used to evaluate the overall performance of a model; the proposed model surpasses the other models in F1, showing that it balances precision and recall well. In terms of accuracy, the proposed model is also the best of the four models, proving its high accuracy in the overall classification of emotions in learning evaluation scenarios; it can predict real emotion types well and thus provides a valuable reference for learning evaluation. In conclusion, the multi-indicator analysis shows that the method proposed in this article, which combines text feature extraction, image feature extraction, and MAC, brings a significant improvement in teaching evaluation. Across the various indicators, it also shows great potential for application in learning evaluation, offering a new and more effective solution that helps to better understand students' emotional needs and improve the quality of education.
Figure 8 shows the experimental results at 50%, 75%, and 100% data volume. The figure shows that as the amount of training data increases, the accuracy of all models increases, but the accuracy of the proposed model is better than that of the other models at every data volume, which shows that the proposed MAC method performs exceptionally well in the learning evaluation scene. This is mainly because the MAC technique combines text and image processing, so the model can use textual and image information at the same time, thereby improving the accuracy of emotion classification. The fact that the accuracy of all models increases with more data indicates that more training data enables the models to learn richer and more accurate information, increasing their predictive power. Comparing the results of the base model, the base model + text feature extraction, and the base model + image feature extraction shows that adding only text feature extraction or only image feature extraction improves accuracy to some extent, but the effect is not as pronounced as the overall effect of the proposed model, which once again confirms the advantages of combining text and image processing with MAC in the teaching evaluation scene.
Figure 8. Experimental results at 50%, 75% and 100% data volume conditions
Table 2. Experimental results on the MAC dataset (%)

No. | Modality | Model | Set 1 Acc. | Set 1 F1 | Set 2 Acc. | Set 2 F1 | Set 3 Acc. | Set 3 F1
1 | Text | BERT | 63.58 | 52.36 | 69.35 | 52.48 | 52.37 | 52.43
2 | Text | GRU | 61.24 | 68.42 | 61.52 | 69.16 | 65.42 | 59.16
3 | Text | Proposed method | 69.35 | 60.51 | 69.47 | 64.82 | 69.31 | 57.35
4 | Image | Object-oriented | 64.52 | 55.82 | 60.51 | 53.62 | 42.18 | 42.05
5 | Image | Scene-oriented | 53.16 | 59.27 | 69.34 | 59.46 | 46.53 | 46.15
6 | Image | Proposed method | 64.15 | 68.31 | 61.42 | 64.81 | 59.18 | 49.38
7 | Text + image | ICCN | 68.53 | 61.81 | 63.81 | 69.76 | 63.18 | 52.58
8 | Text + image | MISA | 74.15 | 60.18 | 69.52 | 60.85 | 64.32 | 50.16
9 | Text + image | Proposed method | 79.52 | 72.36 | 63.84 | 69.15 | 62.08 | 57.49

(Set 1-3 denote standard sample sets 1-3; Acc. = accuracy.)
Table 2 shows the performance of different modalities and models on the three standard sample sets, from which the following conclusions can be drawn. In the text modality, the accuracy and F1 score of the proposed method (No. 3) are better than those of BERT (No. 1) and GRU (No. 2), indicating that the text feature extraction method proposed in this article better captures emotional information and thus improves the accuracy of emotion classification. In the image modality, the proposed method (No. 6) performs well in terms of accuracy and F1 compared with the object-oriented (No. 4) and scene-oriented (No. 5) methods, indicating that the image feature extraction method used in this article captures emotional information more comprehensively and improves the model's predictive performance. In the combined text-image modality, the accuracy and F1 score of the proposed method (No. 9) are higher than those of ICCN (Interactive Canonical Correlation Network, No. 7) and MISA (Modality-Invariant and -Specific Representations, No. 8), showing that by jointly considering text and image information the proposed MAC method can make full use of both and improve the accuracy of emotion classification.
In conclusion, the experimental results on the MAC dataset prove the effectiveness of the proposed model: the proposed text feature extraction method captures emotional information well, improving the accuracy of emotion classification in the text modality; the image feature extraction method adopted in this paper extracts emotional information from both objects and scenes more comprehensively, improving the predictive performance of the image modality; and the proposed MAC method makes full use of textual and image information, further improving the accuracy of emotion classification by combining them.
6. Conclusion
This study explored the application of MAC to learning evaluation based on joint text and image processing. First, the input text was split into two parts, the body and the hashtags, and features were extracted from each separately. Image features were extracted from the two perspectives of the object and the scene, as these two perspectives provide different levels of information about the image. The MAC model was divided into modality-shared tasks and modality-private tasks to better adapt to new learning evaluation scenarios. The experiments reported the change in emotion classification accuracy with the number of key regions and analyzed the performance of four models on four metrics: precision, recall, F1, and accuracy. The results confirm the effectiveness of the proposed method in teaching evaluation: thanks to the comprehensive analysis of textual and image information, the model can classify emotions more accurately. Experimental results under 50%, 75%, and 100% data volume conditions were presented, reaffirming the advantages of the proposed MAC method combining text and image processing for learning evaluation. Finally, the experimental results of different models on the MAC dataset in different modalities were given, and the reasons why the proposed model performs better on this dataset were summarized.
Funding
This article was supported by the Humanities and Social Sciences Research Project of the Department of Education of Hebei Province (Grant No.: SD2022041).
References
[1] Singh, G.V., Firdaus, M., Ekbal, A., Bhattacharyya, P. (2022). EmoInt-Trans: A multimodal transformer for identifying emotions and intents in social conversations. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31: 290-300. https://doi.org/10.1109/TASLP.2022.3224287
[2] Samsonovich, A.V., Liu, Z., Liu, T.T. (2023). On the possibility of regulating human emotions via multimodal social interaction with an embodied agent controlled by an eBICA-based emotional interaction model. In Artificial General Intelligence: 15th International Conference, AGI 2022, Seattle, WA, USA, pp. 374-383. https://doi.org/10.1007/978-3-031-19907-3_36
[3] Kalatzis, A., Girishan Prabhu, V., Rahman, S., Wittie, M., Stanley, L. (2022). Emotions matter: Personalizing human-computer interactions with a two-layer multimodal approach. In Proceedings of the 2022 International Conference on Multimodal Interaction, Bangalore, India, pp. 63-72. https://doi.org/10.1145/3536221.3556582
[4] Samyoun, S., Mondol, A.S., Stankovic, J. (2022). A multimodal framework for robustly distinguishing among similar emotions using wearable sensors. In 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Glasgow, Scotland, UK, pp. 4668-4671. https://doi.org/10.1109/EMBC48229.2022.9871229
[5] Caldeira, F., Lourenço, J., Tavares, Silva, N., Chambel, T. (2022). Emotional multimodal video search and visualization. In ACM International Conference on Interactive Media Experiences, Aveiro, Portugal, pp. 349-356. https://doi.org/10.1145/3505284.3532987
[6] Teo, C.L., Ong, A.K.K., Lee, A.V.Y. (2022). Studying students' cognitive emotions in knowledge building using multimodal data. In Proceedings of the 15th International Conference on Computer-Supported Collaborative Learning, Hiroshima, Japan, pp. 266-273.
[7] Chen, W., Wu, G. (2022). A multimodal convolutional neural network model for analyzing the influence of musical genre on children's emotional intelligence. Computational Intelligence and Neuroscience, 2022: 5611456. https://doi.org/10.1155/2022/5611456
[8] Muszynski, M., Tian, L., Lai, C., et al. (2019). Recognizing induced emotions of movie audiences from multimodal information. IEEE Transactions on Affective Computing, 12(1): 36-52. https://doi.org/10.1109/TAFFC.2019.2902091
[9] Catharin, L.G., Ribeiro, R.P., Silla, C.N., Costa, Y.M., Feltrim, V.D. (2020). Multimodal classification of emotions in Latin music. In 2020 IEEE International Symposium on Multimedia (ISM), Naples, Italy, pp. 173-180. https://doi.org/10.1109/ISM.2020.00038
[10] Bollini, L., Fazia, I.D. (2020). Situated emotions: The role of soundscapes in geography-based multimodal applications in the field of cultural heritage. In Computational Science and Its Applications - ICCSA 2020: 20th International Conference, Cagliari, Italy, pp. 805-819. https://doi.org/10.1007/978-3-030-58808-3_58
[11] Ordonez-Bolanos, O.A., Gomez-Lara, J.F., Becerra, M.A., Peluffo-Ordóñez, D.H., Duque-Mejia, C.M., Medrano-David, D., Mejia-Arboleda, C. (2019). Recognition of emotions using an ICEEMD-based representation of multimodal physiological signals. In 2019 IEEE 10th Latin American Symposium on Circuits & Systems (LASCAS), Armenia, Colombia. https://doi.org/10.1109/LASCAS
[12] Zheng, W.L., Liu, W., Lu, Y., Lu, B.L., Cichocki, A. (2018). EmotionMeter: A multimodal framework for recognizing human emotions. IEEE Transactions on Cybernetics, 49(3): 1110-1122. https://doi.org/10.1109/TCYB.2018.2797176
[13] Villegas-Ch, W., García-Ortiz, J., Sánchez-Viteri, S. (2023). Identification of emotions from facial expressions in the classroom using machine learning techniques. IEEE Access, 11: 38010-38022. https://doi.org/10.1109/ACCESS.2023.3267007
[14] Hartikainen, S., Pylväs, L., Nokelainen, P. (2022). Engineering students' perceptions of teaching: Teachers' atmosphere and teaching procedures as drivers of students' emotions. European Journal of Engineering Education, 47(5): 814-832. https://doi.org/10.1080/03043797.2022.2034750
[15] Sun, Y. (2022). The use of computer information technology to change the positive emotions of academic art students from a teaching perspective. Applied Bionics and Biomechanics, 2022: 7184274. https://doi.org/10.1155/2022/7184274
[16] Rehmat, A.P., Diefes-Dux, H.A., Panther, G. (2021). Self-reported emotions of mechanical engineering teachers in emergency remote teaching. In 2021 IEEE Frontiers in Education Conference (FIE), Lincoln, NE, USA, pp. 1-6. https://doi.org/10.1109/FIE49875.2021.9637440
[17] Ramos-Aguiar, L.R., Álvarez-Rodríguez, F.J. (2021). Teaching emotions to children with autism spectrum disorder through a computer program with a tangible interface. IEEE Revista Iberoamericana de Tecnologias del Aprendizaje, 16(4): 365-371. https://doi.org/10.1109/RITA.2021.3125901
[18] Liang, Y.C., Lin, K.H.C., Li, C.T. (2021). Analyzing academic emotions in hands-on digital video courses using the STEAM 6E teaching method. In Innovative Technologies and Learning: 4th International Conference, ICITL 2021, Virtual Event, pp. 584-592. https://doi.org/10.1007/978-3-030-91540-7_60
[19] Pham, P., Wang, J. (2018). Predicting students' emotions during MOOC mobile learning via a multimodal intelligent tutor. In Intelligent Tutoring Systems: 14th International Conference, ITS 2018, Montreal, QC, Canada, pp. 150-159. https://doi.org/10.1007/978-3-319-91464-0_15
[20] Barron-Estrada, M.L., Zatarain-Cabada, R., Aispuro-Gallegos, C.G. (2018). Multimodal recognition of emotions with application to mobile learning. In 2018 IEEE 18th International Conference on Advanced Learning Technologies (ICALT), Mumbai, India, pp. 416-418. https://doi.org/10.1109/ICALT.2018.00104
[21] Hu, A., Flaxman, S. (2018). Multimodal sentiment analysis to explore the structure of emotions. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, pp. 350-358. https://doi.org/10.1145/3219819.3219853
[22] Yin, Z., Zhao, M., Wang, Y., Yang, J., Zhang, J. (2017). Recognition of emotions using multimodal physiological signals and an ensemble deep learning model. Computer Methods and Programs in Biomedicine, 140: 93-110. https://doi.org/10.1016/j.cmpb.2016.12.005