Face and facial expression recognition using deep learning and an SNet architecture integrated with a Bottleneck Attention Module
Department of Computer Science and Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Avadi, Chennai 600062, Tamil Nadu, India
Corresponding author email: vtd702@veltech.edu.in
Pages: 647-655. DOI: https://doi.org/10.18280/ts.400223
Received: 24 November 2022 | Accepted: 18 March 2023 | Published: 30 April 2023
Open Access
Abstract:
Infrared face image recognition using deep learning has become one of the most discussed topics in the research field, with many articles published and new discoveries still being made. Thermal infrared images can be recognized regardless of lighting conditions, aging, and facial camouflage. This paper proposes an approach called the SENet-integrated Bottleneck Attention Module (SN-BNAM) for thermal face image recognition, combining the SENet architecture with the Bottleneck Attention Module (BAM). After the squeeze-and-excitation step, channel attention and spatial attention are computed as two independent branches of the BAM, and the module is placed at every bottleneck of the network. The SN-BNAM module can be integrated with any feed-forward convolutional neural network. The effectiveness of the proposed system is assessed by experiments on various architectures, with validation on the VOC 2007, MS COCO, CIFAR-100 and ImageNet-1K datasets. These experiments show that our method yields consistent improvements in image classification and object detection.
Keywords:
Facial Expression Recognition, Face Recognition, SNet Architecture, Bottleneck Attention Module, CNN, Thermal Imaging
1. Introduction
Facial recognition is a biometric application used to access various applications such as unlocking mobile phones, timekeeping, and health and security systems. Traditional methods of authentication and security pose many challenges. For example, passwords and PINs are hard to remember and can easily be stolen, guessed, or forgotten. Smart cards, chip cards, and plastic cards can be lost or duplicated, and magnetic cards can degrade and become unreadable. An efficient technique is needed to overcome the ambiguity of these traditional authentication methods. To overcome these difficulties, biometric human characteristics such as the fingerprint, iris and face are used for authentication. Among these options, facial recognition is the most prominent biometric modality for authentication and security. The widespread use of smartphones and digital cameras has made face recognition simpler and more effective, and it can provide accurate, secure and fast authentication for access control and surveillance.
Many researchers have worked in the visible (RGB) and near-infrared spectrum, but face recognition in those bands is affected by factors such as illumination changes, darkness, and occlusion [1]. To overcome these problems, thermal images, which work well in all lighting conditions, are used for face recognition. Thermal images are recorded in the wavelength range from 8 μm to 12 μm [2]. Thermal images of the human face are formed from the thermal patterns emitted by the body, which are not affected by ambient lighting conditions [3].
Artificial intelligence (AI) refers to human-like behavior exhibited by a system or machine. Programming computers to mimic human behavior is the main task of AI, and it is achieved using vast amounts of data describing past patterns of the same behavior. AI comprises various algorithms that use machine learning (ML) and deep learning (DL). Machine learning is a branch of AI that automates the building of analytical models and prepares machines to adapt independently to new patterns. Deep learning, in turn, is a subset of ML that performs tasks better than traditional machine learning methods. Deep learning mainly combines two key techniques: supervised and unsupervised learning [4]. DL implements an artificial neural network with multiple hidden layers, where the output of one layer becomes the input of the next. Deep learning is regarded as a tool to improve performance and optimize processing time in various computational tasks, and it has become an effective facial recognition tool with impressive results. A convolutional neural network (CNN) is a deep neural network that automatically extracts visual features [5]; DL automates the filter selection process to extract the best features from an image and provides better accuracy. Deep learning applications include image recognition, natural language processing, recommendation systems, and speech recognition [6].
This article focuses on the recognition of thermal images of human faces using deep learning embedded in the SENet architecture. The SENet architecture is built from a special block called the squeeze-and-excitation (SE) block, which can improve the representational power of any network by explicitly modeling interdependencies between channels; the SE network is formed by stacking a series of SE blocks. The SE block computes channel attention and provides increased efficiency at low cost. The main purpose of SENet is to model inter-channel relationships by learning modulation weights for each channel [7]. While the SENet architecture alone is suitable for face recognition, the proposed SENet-Bottleneck Attention Module (SN-BNAM) method further improves accuracy and reduces the error rate.
The Bottleneck Attention Module (BNAM) is located between the SE blocks, at the bottlenecks where information flow is critical. Before describing BAM, the concept of an attention module should be explained. Attention modules, or attention mechanisms, are DL techniques that place additional emphasis on components of particular importance. They are used to improve the performance of CNNs by focusing only on the more important features and suppressing unnecessary ones [8]. Typically, attention mechanisms operate along both the channel and spatial dimensions. The proposed SN-BNAM method uses the Bottleneck Attention Module of Park et al. [9], which can be applied to any CNN. In the proposed system, BAM is incorporated into the SENet architecture to improve accuracy. BAM decomposes the 3D attention map into two branches, a channel attention module and a spatial attention module, which act as feature detectors that learn what to focus on and where.
Experimentation on various underlying architectures using our own Thermal Face Dataset (THFDS) shows that performance can be improved by adding SN-BNAM to the baseline. The effectiveness of the proposed method was also assessed by its implementation on CIFAR-100 and ImageNet-1K data sets, and the results are presented. Finally, performance improvements were achieved in object detection tasks in the VOC 2007 and MS COCO datasets, demonstrating the superiority of SN-BNAM.
1.1 Statement of the problem
There is extensive research on facial recognition systems using different algorithms and architectures. Facial recognition based on RGB images has been studied extensively and with practical success, addressing challenges such as illumination changes, occlusions, and camouflage. Previous studies used multiple images of faces with different poses and lighting conditions to achieve maximum accuracy. This requires a large database containing many images of the same person, leading to higher storage requirements and complex computation. Computation on such a large database is error-prone and time-consuming, and reduces accuracy. Although thermal images have been used for face recognition in many studies, the proposed method recognizes not only face images but also basic facial expressions in any situation. Many traditional algorithms have been used for face recognition, but they are computationally intensive and may include unnecessary features in the recognition process, reducing the accuracy rate and increasing the error rate.
To overcome these problems, the proposed method uses thermal infrared images to recognize faces.
1.2 Contribution
The main contributions of this work are as follows:
(1) This work uses thermal infrared images for facial recognition, which overcomes the problems of illumination variation, pose changes, occlusion, expression and aging.
(2) The proposed SN-BNAM method implements the SNet architecture that integrates a lightweight bottleneck attention module.
(3) In the proposed method, the SE block models the interdependence of the channels and improves the efficiency of the recognition process at minimal computational cost.
(4) The lightweight Bottleneck Attention Module combines channel attention with spatial attention to improve efficiency.
(5) The proposed system also detects the basic facial expressions: happy, sad, neutral, scared and camouflaged.
(6) Our proposed method works well under various environmental conditions, such as indoor, outdoor, day and night.
The rest of this article is organized as follows: Section 2 presents the literature review, Section 3 the methodology of the proposed system, Section 4 the results, and Section 5 the conclusions.
2. Literature review
The performance of an infrared facial recognition system is affected by temporal fluctuations in thermal face images, caused mainly by environmental factors, physiological changes in the subject, and changes in the response of the infrared detectors at acquisition time. Five techniques, Local Binary Patterns (LBP), the Weber Linear Descriptor (WLD), Gabor jet descriptors, the Scale-Invariant Feature Transform (SIFT), and Speeded-Up Robust Features (SURF), were used to develop a thermal facial recognition system [1]. Recent advances in deep learning for thermal infrared facial recognition have enabled numerous research groups working on this topic to achieve groundbreaking results [2]. In addition to identifying faces that cannot be seen in visible light, thermal infrared facial recognition can also capture facial blood vessel structures; prior research on temperature changes, mathematical models, wavelength bands, and infrared facial recognition techniques has been reviewed [3]. Deep learning is a newer field of research within machine learning (ML); for large databases, deep learning methods use non-linear transformations and high-level model abstractions, and recent advances in deep architectures have had a significant impact on AI [4]. However, recognizing faces in everyday settings using convolutional neural networks (CNNs) remains a challenge because of limited samples; samples can be increased by data augmentation combined with multi-stage learning [5]. Advances in machine learning for masked face recognition (MFR) have greatly aided the identification and authentication of people with hidden faces, using deep network architectures and deep feature extraction strategies [6]. Triplet attention encodes inter-channel and spatial information with minimal computational overhead on the input tensor, creating cross-dimensional relationships through rotation operations and residual transformations [7]. A convolutional neural network with a convolutional block attention module (CBAM) for finger vein recognition can capture the visual structure more accurately through the attention mechanism, taking the varying importance of pixels into account [8]. The Bottleneck Attention Module (BAM) is a simple and powerful attention module that can be integrated into any feed-forward convolutional neural network, and it consistently improves the classification and detection performance of various models [9]. The Squeeze-and-Excitation Network (SENet) architecture generalizes extremely well across difficult datasets; most importantly, SE blocks significantly improve the performance of state-of-the-art deep architectures at low additional computational cost [10]. AW convolutions address the mismatch between attention maps and weights by making the shape of the attention map correspond to the shape of the weights rather than the activations; the attention module proposed in [11] is complementary to earlier attention-based strategies that use attention mechanisms to study the relationships between channel and spatial features. The Convolutional Block Attention Module (CBAM) is an attention module designed for use with any feed-forward convolutional neural network.
CBAM can be seamlessly integrated into any CNN architecture with minimal overhead, as it is a lightweight general-purpose module [12]. Residual networks are easier to optimize and gain accuracy from greatly increased depth; on the ImageNet dataset, residual networks of up to 152 layers, eight times deeper than VGG networks yet less complex, have been evaluated [13]. ResNeXt was additionally evaluated on the COCO detection set and the ImageNet-5K set, showing better performance than ResNet [14]. Deep residual networks have emerged as a family of very deep architectures with impressive accuracy and good convergence; by using identity mappings as skip connections and after-addition activation, analysis of the propagation formulation of residual building blocks shows that forward and backward signals can be passed directly from one block to another [15]. An overall summary of the literature review is presented in Table 1.
Table 1. Literature review summary
S. No. | Year | Problem | Solution/Remarks |
1 | 2017 | Thermal face recognition in time-varying conditions | Five FR methods were compared; Weber linear descriptors perform best. |
2 | 2021 | Infrared face recognition | Review: thermal infrared facial recognition is more accurate than visible light and requires intensive development. |
3 | 2022 | Study of thermal face recognition using machine learning | Research on face recognition with visible-light and thermal images; CNN-based methods on thermal face images give better results. |
4 | 2019 | Facial recognition with deep learning | Solution: data augmentation based on orthogonal experiments. Limitation: accuracy below 90%. |
5 | 2018 | Research on facial recognition using deep learning methods | Remark: deep learning models work better on larger datasets. |
6 | 2022 | Finger vein recognition | Solution: CNN with a convolutional block attention module. Limitation: evaluated with minimal data. |
7 | 2021 | Residual attention networks for image classification | Solution: trunk-and-mask encoder module. Limitation: computationally complex due to the direct generation of 3D attention maps. |
8 | 2019 | GCNet: non-local networks meet squeeze-excitation networks and beyond | Solution: a new NL block integrated into the SE block. Limitation: uses complex permutation-based operations to reduce feature maps. |
9 | 2021 | Convolutional triplet attention module | Solution: triplet attention with three branches. Limitation: complex network structure. |
10 | 2017 | Densely connected convolutional networks | Solution: DenseNet connects each layer directly to every other layer. Limitation: focuses on depth, width and cardinality; attention factors are not considered. |
11 | 2017 | Image recognition | Solution: squeeze-and-excitation blocks. Limitation: no spatial attention when deciding "where" to focus. |
3. Methodology
This article proposes an SNet architecture integrated with the Bottleneck Attention Module (SN-BNAM) for face recognition. The proposed method uses a single face image per person, drawn from our own Thermal Face Dataset (THFDS) of thermal face images. Thermal imaging overcomes challenges found in other spectra, such as optical illusions, facial camouflage, and aging. This work also addresses face recognition under various environmental conditions: indoor, outdoor, day and night. In addition to face recognition, the basic expressions happy, sad, neutral, scared and camouflaged are considered for facial expression recognition. Experiments are conducted taking all of the above factors into account, and the results are reported in terms of accuracy, error rate, power, precision, recall and F1 score. Figure 1 shows a block diagram of the proposed method.
Figure 1. Proposed SN-BNAM method
3.1 SENet architecture
Figure 2 shows the structure of the squeeze-and-excitation block. As shown in the figure, a transformation $F_{tr}$ maps the input $X$ to a feature map $U$, where $U \in \mathbb{R}^{H \times W \times C}$; the SE block then performs feature recalibration on $U$ [10].
Figure 2. Squeeze-and-excitation block of SENet [10]
The input $X$ is mapped to the feature map $U$, where $U \in \mathbb{R}^{H \times W \times C}$. First, a squeeze operation is applied to $U$ to generate a channel descriptor by aggregating the feature maps along the spatial dimensions $(H \times W)$. This descriptor embeds the globally distributed channel-wise responses and allows all layers of the network to use information from the global receptive field. The aggregation is followed by an excitation operation, a simple self-gating mechanism that takes the embedding as input and produces per-channel modulation weights. These weights are applied to the feature map $U$ to obtain the output of the squeeze-and-excitation block, which can be fed directly into subsequent layers of the network.
3.1.1 Operation of the squeeze-and-excitation block (SEB)
The SEB is a computational unit built on a transformation $F_{tr}$ that maps the input $X$ to the feature map $U \in \mathbb{R}^{H \times W \times C}$. In the equation below, $F_{tr}$ is treated as a convolution operator; the set of filter kernels is denoted $V=[\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_C]$ and the output $U=[\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_C]$, where [10]:
$u_c=\mathbf{v}_c * \mathbf{X}=\sum_{s=1}^{C^{\prime}} \mathbf{v}_c^s * \mathbf{X}^s$ (1)
In the above formula, $*$ denotes convolution, and $\mathbf{v}_c^s$ is a two-dimensional spatial kernel representing a single channel of $\mathbf{v}_c$ that acts on the corresponding channel of $\mathbf{X}$, with $\mathbf{v}_c=[\mathbf{v}_c^1, \mathbf{v}_c^2, \ldots, \mathbf{v}_c^{C'}]$, $\mathbf{X}=[\mathbf{x}^1, \mathbf{x}^2, \ldots, \mathbf{x}^{C'}]$ and $\mathbf{u}_c \in \mathbb{R}^{H \times W}$. Since the output is generated by summing over all channels, channel dependencies are implicitly embedded in $\mathbf{v}_c$. Convolution models channel relationships only implicitly and locally; it would be better to make these dependencies explicit so that the network can amplify its informative features. The SE block provides this functionality by giving access to global information and recalibrating the filter responses in two steps, squeeze and excitation, before the features enter subsequent transformations.
3.1.2 Squeeze operation
To capture channel dependencies, note that each learned filter operates on a local receptive field, so each unit of the transformed output $U$ cannot exploit contextual information outside this region. To address this, a squeeze operation compresses global spatial information into a channel descriptor, using global average pooling to generate channel-wise statistics. A statistic $z \in \mathbb{R}^{C}$ is produced by shrinking $U$ along its spatial dimensions $H \times W$, where the $c$-th element of $z$ is calculated as [10]:
$z_c=\mathbf{F}_{sq}\left(\mathbf{u}_c\right)=\frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$ (2)
The output $U$ is treated as a set of local descriptors whose statistics are expressive for the whole image.
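For concreteness, the squeeze step of Eq. (2) can be written as a minimal PyTorch sketch (assuming an NCHW tensor layout; function and variable names are illustrative, not from the paper):

```python
import torch

def squeeze(u: torch.Tensor) -> torch.Tensor:
    """Global average pooling over the spatial dimensions (Eq. 2).

    u: feature map of shape (N, C, H, W).
    Returns channel statistics z of shape (N, C).
    """
    return u.mean(dim=(2, 3))

# Example: an 8x64x32x32 feature map collapses to 8x64 channel descriptors.
z = squeeze(torch.randn(8, 64, 32, 32))
print(z.shape)  # torch.Size([8, 64])
```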
3.1.3 Excitation operation
The information collected by the squeeze operation is used in a second operation that fully captures channel-wise dependencies. To achieve this, the function must meet two criteria: (1) it should be able to learn non-linear interactions between channels, and (2) it must learn a non-mutually-exclusive relationship, so that multiple channels can be emphasized at once. These criteria are met by the following simple gating mechanism [10]:
$s=\mathbf{F}_{ex}(\mathbf{z}, \mathbf{W})=\sigma(g(\mathbf{z}, \mathbf{W}))=\sigma\left(\mathbf{W}_2 \delta\left(\mathbf{W}_1 \mathbf{z}\right)\right)$ (3)
In the above, $\sigma$ is the sigmoid activation, $\delta$ is the ReLU function, $\mathbf{W}_1 \in \mathbb{R}^{C/r \times C}$ and $\mathbf{W}_2 \in \mathbb{R}^{C \times C/r}$. The gating mechanism is parameterized as a bottleneck of two fully connected layers: a dimensionality-reduction layer with reduction ratio $r$ and a dimensionality-increasing layer that restores the channel dimension of $U$. The final output of the block is obtained by rescaling $U$ with the activations [10]:
$\widetilde{x}_c=\mathbf{F}_{scale}\left(\mathbf{u}_c, s_c\right)=s_c \mathbf{u}_c$ (4)
where $\mathbf{F}_{scale}\left(\mathbf{u}_c, s_c\right)$ refers to the channel-wise multiplication between the scalar $s_c$ and the feature map $\mathbf{u}_c \in \mathbb{R}^{H \times W}$.
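Putting Eqs. (2)-(4) together, a minimal PyTorch sketch of an SE block as described in [10] might look as follows (the default reduction ratio r=16 and the layer names are illustrative assumptions, not the authors' released code):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation block: squeeze (Eq. 2), excitation (Eq. 3),
    and channel-wise rescaling (Eq. 4)."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        # Bottleneck of two FC layers: C -> C/r -> C (W1, W2 in Eq. 3).
        self.fc1 = nn.Linear(channels, channels // r)
        self.fc2 = nn.Linear(channels // r, channels)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = u.shape
        z = u.mean(dim=(2, 3))                                # squeeze: (N, C)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))  # excitation
        return u * s.view(n, c, 1, 1)                         # scale: s_c * u_c

# Example: recalibrate a 64-channel feature map.
se = SEBlock(64)
out = se(torch.randn(2, 64, 32, 32))
```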
3.2 Bottleneck attention module
Attention module: attention mechanisms capture long-range feature interactions and improve CNN representations [11]. For any input feature map, the two attention branches, channel and spatial, compute complementary attention focused on "what" and "where". The two branches can be arranged in parallel or sequentially; experimental results show that a sequential arrangement with the channel attention branch first leads to better results [12].
The attention module is placed at each bottleneck of the network, where downsampling of the feature maps occurs. BAM thus builds a hierarchical attention structure at the bottlenecks and can be trained end-to-end with the feed-forward model. In the proposed SN-BNAM method, this module is integrated with the SENet architecture, so that the advantages of both SENet and BAM are exploited in the face recognition process to achieve higher accuracy.
3.2.1 Channel attention
This branch aggregates the feature map over each channel, compressing it while limiting information loss. A global average pooling layer summarizes the feature map $F$ and produces a channel vector $F_c \in \mathbb{R}^{C \times 1 \times 1}$. From the channel vector $F_c$, attention is estimated using a multi-layer perceptron (MLP) with one hidden layer, producing a channel attention map $M_c(F) \in \mathbb{R}^{C \times 1 \times 1}$. To limit the parameter overhead, the hidden activation size is set to $\mathbb{R}^{C/r \times 1 \times 1}$, where $r$ is the reduction ratio. After the MLP, a batch normalization (BN) layer is added to scale the result relative to the spatial branch. Channel attention is calculated as follows:
$M_c(F)=BN(MLP(AvgPool(F)))=BN\left(W_1\left(W_0 \, AvgPool(F)+b_0\right)+b_1\right)$ (5)
In the above equation, $W_0 \in \mathbb{R}^{C/r \times C}$, $W_1 \in \mathbb{R}^{C \times C/r}$, $b_0 \in \mathbb{R}^{C/r}$ and $b_1 \in \mathbb{R}^{C}$ are the weights and biases of the MLP. Since the MLP is a shared network, its weights $W_0$ and $W_1$ are shared.
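A minimal PyTorch sketch of this channel attention branch, following Eq. (5) (the layout mirrors the public BAM reference design; the class name and reduction ratio are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """BAM channel branch: global average pool -> MLP -> BN (Eq. 5)."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),   # W0, b0 (reduction to C/r)
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),   # W1, b1 (expansion to C)
        )
        self.bn = nn.BatchNorm1d(channels)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = f.shape
        z = f.mean(dim=(2, 3))         # AvgPool(F): (N, C)
        mc = self.bn(self.mlp(z))      # M_c(F): (N, C)
        return mc.view(n, c, 1, 1)     # reshape for broadcasting over H x W
```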
3.2.2 Spatial attention
The spatial attention branch produces a spatial attention map $M_s(F) \in \mathbb{R}^{H \times W}$ that emphasizes or suppresses features at different spatial locations. It determines which spatial positions to focus on in order to retain important contextual information, which requires a large receptive field that can exploit context effectively. Dilated convolutions are used to enlarge the receptive field efficiently; they construct a more effective spatial map than standard convolutions. The spatial branch adopts a bottleneck structure, which reduces the number of parameters and the computational load. The feature map $F \in \mathbb{R}^{C \times H \times W}$ is first reduced to $\mathbb{R}^{C/r \times H \times W}$ using a 1×1 convolution, compressing the feature map in the channel dimension. After this reduction (with reduction ratio $r$), two 3×3 dilated convolutions are applied to aggregate contextual information efficiently. Finally, the feature map is reduced to $\mathbb{R}^{1 \times H \times W}$ with another 1×1 convolution, and batch normalization is applied at the end of the spatial branch. Spatial attention is calculated as follows:
$M_s(F)=BN\left(f_3^{1 \times 1}\left(f_2^{3 \times 3}\left(f_1^{3 \times 3}\left(f_0^{1 \times 1}(F)\right)\right)\right)\right)$ (6)
where $f$ denotes a convolution operation, $BN$ denotes batch normalization, and the superscripts on $f$ indicate the convolution filter sizes. The two 1×1 convolutions perform channel reduction, while the intermediate 3×3 dilated convolutions aggregate contextual information with a large receptive field.
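The spatial branch of Eq. (6) can be sketched in PyTorch as follows (the dilation value of 4 follows the BAM paper's default and is an assumption here, as are the names):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """BAM spatial branch: 1x1 reduce -> two dilated 3x3 convs -> 1x1 -> BN (Eq. 6)."""
    def __init__(self, channels: int, r: int = 16, dilation: int = 4):
        super().__init__()
        mid = channels // r
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),          # f0: channel reduction
            nn.Conv2d(mid, mid, kernel_size=3,
                      padding=dilation, dilation=dilation),   # f1: dilated 3x3
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3,
                      padding=dilation, dilation=dilation),   # f2: dilated 3x3
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, 1, kernel_size=1),                 # f3: project to 1 x H x W
            nn.BatchNorm2d(1),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return self.body(f)   # M_s(F): (N, 1, H, W)
```

Setting the padding equal to the dilation keeps the spatial size H×W unchanged through the 3×3 dilated convolutions.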
3.2.3 Combining the attention maps
Figure 3 shows the general structure of the bottleneck attention module. Given an input feature map $F \in \mathbb{R}^{C \times H \times W}$, BAM derives a three-dimensional attention map $M(F) \in \mathbb{R}^{C \times H \times W}$. To keep the module efficient, the channel attention $M_c(F) \in \mathbb{R}^{C}$ and the spatial attention $M_s(F) \in \mathbb{R}^{H \times W}$ are first computed as two independent branches, and the combined attention map $M(F)$ is then:
$M(F)=\sigma\left(M_c(F)+M_s(F)\right)$ (7)
Once channel attention and spatial attention are obtained, they are combined to produce the final 3D attention map $M(F)$. Because the two attention maps have different shapes, they are first expanded to $\mathbb{R}^{C \times H \times W}$ before merging. The fusion is performed by element-wise summation followed by a sigmoid function, so that the final 3D attention map $M(F)$ takes values between 0 and 1.
Combining the channel attention $M_c$ and the spatial attention $M_s$, the attention map $M(F)$ is computed by the following equation:
$M(F)=\sigma\left(M_c(F)+M_s(F)\right)$ (8)
The resulting 3D attention map $M(F) \in \mathbb{R}^{C \times H \times W}$ is multiplied element-wise with the input feature map $F$ and added to the original input to produce the refined feature map $F'$, using the following equation, where $\otimes$ denotes element-wise multiplication:
$F'=F+F \otimes M(F)$ (9)
Figure 3. Detailed view of the Bottleneck Attention Module
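To make the SN-BNAM composition concrete, the following hedged sketch shows one way SE blocks and BAM could be interleaved at the bottlenecks of a backbone. The stage layout is our illustrative reading of Figure 1 and the hypothetical helper name is not from the paper:

```python
import torch.nn as nn

def make_sn_bnam_backbone(stages: list[nn.Module], widths: list[int]) -> nn.Module:
    """Interleave SE-augmented stages with BAM at each downsampling bottleneck.

    stages: backbone stages (e.g., ResNet layer1..layer4); widths: their
    output channel counts. Illustrative helper only; reuses the SEBlock and
    BAM sketches defined above.
    """
    layers: list[nn.Module] = []
    for i, (stage, c) in enumerate(zip(stages, widths)):
        layers.append(stage)
        layers.append(SEBlock(c))       # channel recalibration (Section 3.1)
        if i < len(stages) - 1:         # BAM between stages, at the bottlenecks
            layers.append(BAM(c))
    return nn.Sequential(*layers)
```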
4. Results and discussion
To evaluate the proposed SN-BNAM method, baseline network models (such as ResNet18 and ResNet50) were trained on our THFDS dataset, and the experimental results are shown in Table 2. As shown in Table 2, the results of architectures such as ResNet18, ResNet50 and ResNet101 are compared with the results obtained when the proposed SN-BNAM module is integrated into them. Integrating the proposed module into these architectures reduces the error rate and gives better performance; SN-BNAM outperforms the baseline architectures and improves facial recognition accuracy while reducing the error rate. The proposed system was then also evaluated on the CIFAR-100 and ImageNet-1K datasets. Tables 3 and 4 show the performance of SN-BNAM on CIFAR-100 and ImageNet-1K. The results show that our method outperforms other traditional methods and lightweight network architectures in both performance and error reduction.
Table 2. Classification results of baseline network models on our THFDS dataset
Architecture | Parameters | GFLOPs | Top-1 error (%) | Top-5 error (%) |
ResNet18 [13] / ResNet18+SN-BNAM | 11.22M / 11.56M | 1.81 / 1.90 | 29.60 / 28.36 | 10.53 / 10.22 |
ResNet50 [13] / ResNet50+SN-BNAM | 23.11M / 23.18M | 1.22 / 1.35 | 20.00 / 19.97 | 7.17 / 6.01 |
ResNet101 [13] / ResNet101+SN-BNAM | 44.23M / 45.00M | 7.65 / 7.78 | 22.44 / 21.41 | 6.34 / 6.20 |
ResNeXt29 8x64d [14] / ResNeXt29 8x64d+SN-BNAM | 34.43M / 34.52M | 4.99 / 5.12 | 21.23 / 20.12 | 9.87 / 9.76 |
PreResNet110 [15] / PreResNet110+SN-BNAM | 1.743M / 1.756M | 0.245 / 0.261 | 21.89 / 21.54 | 7.78 / 7.62 |
VGG-16 [16] / VGG-16+SN-BNAM | 15.231M / 15.430M | 7.65 / 7.71 | 21.96 / 20.78 | 9.45 / 9.29 |
MobileNet [17] / MobileNet+SN-BNAM | 4.33M / 4.41M | 0.548 / 0.566 | 23.47 / 22.12 | 9.24 / 9.12 |
Table 3. Classification results of SN-BNAM on the CIFAR-100 dataset
Architecture (CIFAR-100) | Parameters | GFLOPs | Error (%) |
ResNet50 [13] / ResNet50+BAM [9] / AW-ResNet50 [11] / ResNet50+SN-BNAM | 23.71M / 24.07M / 23.87M / 24.68M | 1.22 / 1.25 / 1.23 / 1.28 | 21.49 / 20.00 / 19.87 / 19.80 |
ResNet101 [13] / ResNet101+BAM [9] / ResNet101+SN-BNAM | 42.07M / 43.06M / 43.42M | 2.44 / 2.46 / 2.29 | 20.00 / 19.61 / 19.32 |
PreResNet110 [15] / PreResNet110+BAM [9] / PreResNet110+SN-BNAM | 1.726M / 1.733M / 1.757M | 0.245 / 0.246 / 0.249 | 22.22 / 21.96 / 21.53 |
WideResNet28 (w=8) [18] / WideResNet28 (w=8)+BAM [9] / WideResNet28 (w=8)+SN-BNAM | 23.4M / 23.42M / 23.87M | 3.36 / 3.37 / 3.41 | 19.06 / 19.06 / 19.00 |
Table 4. Classification results of SN-BNAM on the ImageNet-1K dataset
Architecture (ImageNet-1K) | Parameters | GFLOPs | Top-1 error (%) | Top-5 error (%) |
ResNet50 [13] / ResNet50+CBAM [12] / AW-ResNet50 [11] / ResNet50+BAM [9] / ResNet50+SN-BNAM | 25.56M / 28.09M / 25.72M / 25.92M / 29.36M | 3.858 / 3.864 / 3.87 / 3.94 / 3.897 | 24.56 / 22.66 / 23.38 / 24.02 / 21.80 | 7.50 / 6.31 / 6.79 / 7.18 / 5.89 |
ResNet101 [13] / ResNet101+CBAM [12] / AW-ResNet101 [11] / ResNet101+SN-BNAM | 44.55M / 49.33M / 44.95M / 49.67M | 7.570 / 7.581 / 7.58 / 7.593 | 23.38 / 21.51 / 22.38 / 20.89 | 6.88 / 5.69 / 6.21 / 5.34 |
MobileNet [17] / MobileNet+BAM [9] / AW-SE-MobileNet [11] / MobileNet+SN-BNAM | 4.23M / 4.32M / 5.52M / 5.82M | 0.569 / 0.59 / 0.623 / 0.687 | 31.39 / 30.58 / 29.41 / 29.21 | 11.51 / 10.90 / 10.59 / 10.24 |
WideResNet18 (w=1.5) [18] / WideResNet18 (w=1.5)+BAM [9] / WideResNet18 (w=1.5)+CBAM [12] / WideResNet18 (w=1.5)+SN-BNAM | 25.88M / 25.93M / 26.08M / 26.12M | 3.866 / 3.88 / 3.868 / 3.861 | 26.85 / 26.67 / 26.10 / 25.98 | 8.88 / 8.69 / 8.43 / 8.12 |
Table 5. Object detection results on MS COCO
Method | Detector | mAP@.5 | mAP@.75 | mAP@[.5, .95] |
ResNet50 [13] / ResNet50+CBAM [12] / ResNet50+SN-BNAM | Faster R-CNN [19] | 46.2 / 48.2 / 49.1 | 28.1 / 29.2 / 30.1 | 27.0 / 28.1 / 29.2 |
ResNet101 [13] / ResNet101+CBAM [12] / ResNet101+SN-BNAM | Faster R-CNN [19] | 48.4 / 50.5 / 52.3 | 30.7 / 32.6 / 33.4 | 29.1 / 30.8 / 31.7 |
Table 6. mAP for object detection on the VOC 2007 test set, using the StairNet detection framework with SE, CBAM and SN-BNAM applied to the detector
Backbone | Detector | mAP@.5 | Parameters (M) |
VGG-16 [16] | SSD [20] | 77.8 | 26.5 |
VGG-16 [16] | StairNet [21] | 78.9 | 32.0 |
VGG-16 [16] | StairNet [21]+SE [10] | 79.1 | 32.1 |
VGG-16 [16] | StairNet [21]+CBAM [12] | 79.3 | 32.1 |
VGG-16 [16] | StairNet [21]+SN-BNAM | 79.5 | 32.3 |
MobileNet [17] | SSD [20] | 68.1 | 5.81 |
MobileNet [17] | StairNet [21] | 70.1 | 5.98 |
MobileNet [17] | StairNet [21]+SE [10] | 70.0 | 5.99 |
MobileNet [17] | StairNet [21]+CBAM [12] | 70.5 | 6.00 |
MobileNet [17] | StairNet [21]+SN-BNAM | 70.9 | 6.04 |
Figure 4. Performance comparison with state-of-the-art algorithms: (a) accuracy compared with other traditional algorithms; (b) performance of different algorithms under different environmental conditions; (c), (d) power and accuracy of the algorithms in detecting the five basic facial expressions; (e) recall and F1 score of different algorithms
4.1 MS COCO object detection
The performance of the proposed system was evaluated for object detection using the Microsoft (MS) COCO dataset [22]. Our model is trained on all training images plus a subset of the validation images, holding out 5,000 validation examples for evaluation. We adopt Faster R-CNN [22] as the detection method and an ImageNet pre-trained ResNet101 [13] as the baseline network, and improve performance by plugging SN-BNAM into this baseline. As shown in Table 5, our proposed SN-BNAM model shows a significant improvement and outperforms the other methods.
4.2 VOC 2007 object detection
Experiments are also carried out on the PASCAL VOC 2007 test set, where SN-BNAM is applied on top of the detectors. The proposed SN-BNAM method adopts the StairNet framework [21], one of the strongest multi-scale methods based on SSD [20]. SN-BNAM is placed immediately before every classifier to refine the final features before prediction, forcing the model to select significant features.
The experimental results are listed in Table 6, and it is clear that SN-BNAM achieves higher accuracy than the strong baseline architectures, with only a small increase in parameters. The results on lightweight networks also show that SN-BNAM can be deployed on low-cost devices.
4.3 Performance comparison with state-of-the-art algorithms
The performance of the proposed system is compared with state-of-the-art algorithms such as the Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Naive Bayes (NB), Random Forest (RF), Linear Regression (LR) and the SNet architecture; the results are shown in Figure 4. Performance is compared using metrics such as accuracy, power, precision, recall and F1 score. Experiments are conducted in a variety of environments (indoor, outdoor, day and night), and the proposed SN-BNAM method is clearly superior to the other state-of-the-art algorithms. The power and accuracy of detecting the five basic facial expressions, happy, sad, neutral, scared and camouflaged, are reported. Compared with all the other algorithms, the proposed method performs better in every respect.
5. Conclusion
This article proposes a SENet architecture integrated with the Bottleneck Attention Module for facial recognition. The proposed SN-BNAM method uses a single thermal image of a person's face, overcoming limitations of visible or RGB images such as illumination changes, occlusions, and pose variation. Aging, glasses and masks do not affect accuracy when thermal face images are used. Moreover, alongside face recognition, the proposed system can detect the five basic facial expressions at any time and place: indoor, outdoor, day and night. Our method captures the important features of the input image. Among the candidate base architectures, the proposed method chooses the SENet architecture for its better accuracy. The input thermal face image is processed by the SE blocks and by the BAM modules placed at the bottleneck locations. The channel and spatial attention within BAM determine what to focus on and where; these attention modules concentrate only on the features important for face recognition, while non-essential features are omitted. The experiments performed in this paper clearly show that SN-BNAM outperforms base architectures such as MobileNet, VGG-16 and ResNet in image classification on our own THFDS dataset as well as on the CIFAR-100 and ImageNet datasets. The proposed system also performs well on the MS COCO and VOC 2007 datasets, and the results are reported. Compared with traditional predictive algorithms such as SVM, KNN and decision trees, our SN-BNAM method also excels in accuracy and error reduction. Integrating the proposed SN-BNAM module into other base architectures improves their accuracy and reduces their error rates; channel and spatial attention improve accuracy by removing noise from input images and focusing only on salient feature attributes. The proposed system uses only single face images in its experiments; future work could extend the datasets to full-size images and videos of people, and develop methods and algorithms to select or crop only the faces from videos for recognition.
References
[1] Vigneau, G.H., Verdugo, J.L., Castro, G.F., Pizarro, F., Vera, E. (2017). Thermal face recognition under temporal variation conditions. IEEE Access, 5: 9663-9672. https://doi.org/10.1109/ACCESS.2017.2704296
[2] Weidlich, V.A. (2021). Thermal infrared face recognition. Cureus, 13(3). https://doi.org/10.7759/cureus.13736
[3] Dixit, A.N., Kasbe, T. (2020). A survey on facial expression recognition using machine learning techniques. In 2nd International Conference on Data, Engineering and Applications (IDEA), pp. 1-6. https://doi.org/10.1109/IDEA49133.2020.9170706
[4] Vargas, R., Mosavi, A., Ruiz, R. (2018). Deep learning: A review. Preprints.org, 2018100218. https://doi.org/10.20944/preprints201810.0218.v1
[5] Pei, Z., Xu, H., Zhang, Y.N., Guo, M., Yang, Y.H. (2019). Face recognition via deep learning using data augmentation based on orthogonal experiments. Electronics, 8(10): 1088. https://doi.org/10.3390/electronics8101088
[6] Hammadi, O.I., Abbas, A.D., Ayed, K.H. (2018). A review of face recognition using deep learning methods. International Journal of Engineering & Technology, 7(4): 6181-6188.
[7] Misra, D., Nalamada, T., Arasanipalai, A.U., Hou, Q. (2021). Rotate to attend: Convolutional triplet attention module. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3139-3148. https://doi.org/10.1109/WACV48630.2021.00318
[8] Zhang, Z., Wang, M. (2022). Convolutional neural network with convolutional block attention module for finger vein recognition. arXiv preprint arXiv:2202.06673. https://doi.org/10.48550/arXiv.2202.06673
[9] Park, J., Woo, S., Lee, J.Y., Kweon, I.S. (2018). BAM: Bottleneck attention module. arXiv preprint arXiv:1807.06514. https://doi.org/10.48550/arXiv.1807.06514
[10] Hu, J., Shen, L., Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132-7141. https://doi.org/10.1109/CVPR.2018.00745
[11] Zhu, B., Hofstee, P., Lee, J., Al-Ars, Z. (2021). An attention module for convolutional neural networks. arXiv preprint arXiv:2108.08205. https://doi.org/10.48550/arXiv.2108.08205
[12] Woo, S., Park, J., Lee, J.Y., Kweon, I.S. (2018). CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3-19. https://doi.org/10.1007/978-3-030-01234-2_1
[13] He, K., Zhang, X., Ren, S., Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778. https://doi.org/10.1109/CVPR.2016.90
[14] Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K. (2017). Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492-1500. https://doi.org/10.1109/CVPR.2017.634
[15] He, K., Zhang, X., Ren, S., Sun, J. (2016). Identity mappings in deep residual networks. In Computer Vision - ECCV 2016: 14th European Conference, Springer International Publishing, pp. 630-645. https://doi.org/10.1007/978-3-319-46493-0_38
[16] Simonyan, K., Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. https://doi.org/10.48550/arXiv.1409.1556
[17] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. https://doi.org/10.48550/arXiv.1704.04861
[18] Zagoruyko, S., Komodakis, N. (2016). Wide residual networks. arXiv preprint arXiv:1605.07146. https://doi.org/10.48550/arXiv.1605.07146
[19] Ren, S., He, K., Girshick, R., Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149. https://doi.org/10.1109/TPAMI.2016.2577031
[20] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C. (2016). SSD: Single shot multibox detector. In Computer Vision - ECCV 2016: 14th European Conference, Springer International Publishing, pp. 21-37. https://doi.org/10.1007/978-3-319-46448-0_2
[21] Woo, S., Hwang, S., Kweon, I.S. (2018). StairNet: Top-down semantic aggregation for accurate one shot detection. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1093-1102. https://doi.org/10.1109/WACV.2018.00125
[22] Ren, S., He, K., Girshick, R., Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS).
[23] Park, J., Woo, S., Lee, J.Y., Kweon, I.S. (2018). BAM: Bottleneck attention module. arXiv preprint arXiv:1807.06514. https://doi.org/10.48550/arXiv.1807.06514
[24] Zhou, S., Qiu, J. (2021). Enhanced SSD with interactive multi-scale attention features for object detection. Multimedia Tools and Applications, 80: 11539-11556. https://doi.org/10.1007/s11042-020-10191-2