Human-computer interaction (HCI) (Card, 2018) has evolved significantly with the emergence of novel technologies and increasingly seeks to emulate the ease and intuitiveness of human-to-human communication. Hand gesture-based interaction is at the forefront of this evolution, providing users with a natural and expressive medium for interacting with machines. The ability to recognize and interpret human gestures in real time opens a new dimension for HCI, improving accessibility, usability, and the overall user experience (Xia, Gao, Lai, Yuan, & Chai, 2017). In particular, dynamic hand gesture recognition (DHGR) has attracted significant attention due to its potential to enable seamless and intuitive interactions in various domains, including games (Kulshreshth, Pfeil, & Laviola, 2017), virtual reality (Sagayam & Hemanth, 2017), robot control (Sun, Zhang, & Chen, 2023), and smart home automation (Chou et al., 2017). Unlike traditional input devices that require explicit commands or predefined gestures, DHGR systems can adapt to the nuanced and fluid movements inherent in human communication. With the advent of sophisticated smart devices and robotic technologies, the ability of machines to interpret and respond to these nuanced gestures has become increasingly important.
Despite the promise of gesture-based interaction, several challenges must be overcome to achieve robust and reliable recognition. These include the complexity of gesture dynamics, variations in execution speed, differences between individuals, and the effects of environmental factors such as lighting conditions and background noise (Qi, Ma, Cui, & Yu, 2024). The traditional visual recognition approach divides hand gesture recognition into gesture segmentation, gesture tracking, gesture feature extraction, and gesture classification. Zhang et al. (2015) applied K-means clustering to the chromaticity channels of the YCbCr color space to separate skin pixels, treated as foreground, from irrelevant background information. Danelljan, Khan, Felsberg, and Weijer (2014) proposed an adaptive method for mapping color attributes to a low-dimensional representation, studying the role of color in a tracking-by-detection framework to improve tracking performance. The scale-invariant feature transform (SIFT) (Lowe, 2004) is widely used for hand gesture feature extraction; building on SIFT, Rekha, Bhattacharya, and Majumder (2011) applied the Speeded-Up Robust Features (SURF) method to hand gesture recognition. Pisharady, Vadakkepat, and Poh (2014) combined edge and texture features with an SVM for gesture recognition; their experimental results show that the algorithm is person-independent and reliably handles variations in hand size and complex backgrounds. However, these traditional approaches have weak generalization ability and low recognition accuracy, making it difficult to meet the requirements of DHGR in practical, complex scenarios.
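As a concrete illustration of this classical pipeline, the Python sketch below shows a skin-color segmentation step in the YCbCr space followed by a simple hand-region hypothesis based on the largest contour. The threshold values, the OpenCV-based implementation, and the function names are illustrative assumptions, not the exact procedures of the cited works.

```python
# Minimal sketch of a classical hand-gesture front end: skin segmentation in
# YCbCr chromaticity, then the largest skin region as the hand hypothesis.
# Thresholds and structure are illustrative, not taken from the cited papers.
import cv2
import numpy as np

def segment_skin(frame_bgr):
    """Return a binary mask of candidate skin pixels using Cr/Cb chromaticity."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    # Commonly used Cr/Cb skin ranges; exact bounds vary per dataset and lighting.
    lower = np.array([0, 133, 77], dtype=np.uint8)
    upper = np.array([255, 173, 127], dtype=np.uint8)
    mask = cv2.inRange(ycrcb, lower, upper)
    # Morphological opening to suppress isolated background noise.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

def largest_hand_contour(mask):
    """Pick the largest connected skin region as the hand candidate."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return max(contours, key=cv2.contourArea) if contours else None
```

Hand-crafted features (e.g., SIFT or SURF descriptors) would then be extracted from the segmented region and passed to a classifier such as an SVM.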
The cornerstone of effective dynamic hand gesture recognition lies in the ability to process and analyze visual data with high accuracy. With the advent of deep learning, particularly the convolutional neural network (CNN) (Krizhevsky, Sutskever, & Hinton, 2012), the ability to extract complex features from visual streams has improved significantly. Its strengths in feature extraction and efficient pattern recognition on large datasets have made it the mainstream approach for hand gesture recognition tasks. Wang et al. (2019) proposed the temporal segment network, which divides a video into multiple segments, uses a two-dimensional CNN (2D-CNN) to extract information from the color and optical-flow modalities of each segment, and then fuses the segment-level information for action recognition. However, dynamic gestures involve video data with temporal information, and a 2D-CNN can, in general, only extract spatial information from individual images, making it suitable mainly for static gesture recognition. With the introduction of the three-dimensional CNN (3D-CNN) (Tran, Bourdev, Fergus, Torresani, & Paluri, 2015), spatiotemporal learning tasks involving time series can be handled effectively, and many improved or redesigned network models based on the 3D-CNN have been applied to DHGR. Al-Hammadi et al. (2020) proposed a DHGR system based on an improved 3D-CNN structure that effectively exploits local and global configurations and focuses on the hand region. Liu, Liu, Gan, Tan, and Ma (2018) proposed the T-C3D model, which adds a temporal encoding scheme on top of the 3D-CNN and effectively improves recognition accuracy. Models based on deep learning can autonomously recognize a range of gestures, from simple movements to complicated sequences, improving recognition accuracy.
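To make the 2D-CNN versus 3D-CNN distinction concrete, the following minimal PyTorch sketch shows a 3D-CNN that convolves jointly over the frame and pixel dimensions of a short clip, in contrast to a 2D-CNN that processes each frame independently. The layer configuration, class name, and input size are illustrative assumptions and do not reproduce the cited C3D or T-C3D architectures.

```python
# Minimal sketch of a 3D-CNN gesture classifier for clips shaped
# (batch, channels, frames, height, width). Layer sizes are illustrative only.
import torch
import torch.nn as nn

class Simple3DCNN(nn.Module):
    def __init__(self, num_classes=25):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),   # joint spatio-temporal convolution
            nn.BatchNorm3d(32), nn.ReLU(inplace=True),
            nn.MaxPool3d((1, 2, 2)),                       # pool over space only at first
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.MaxPool3d(2),                               # pool over time and space
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clips):
        x = self.features(clips)            # (B, 64, 1, 1, 1)
        return self.classifier(x.flatten(1))

# Example: a batch of 2 clips, each with 16 RGB frames of 112x112 pixels.
logits = Simple3DCNN()(torch.randn(2, 3, 16, 112, 112))
```

The main contributions of this work are summarized as follows: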
- We propose MONAS_ABC, a novel multi-objective neural architecture search (NAS) framework based on the binary artificial bee colony algorithm. Unlike traditional NAS methods, MONAS_ABC includes three customized search strategies for the employed, onlooker, and scout bee phases, enabling effective binary-encoded exploration tailored to architecture optimization tasks (a simplified sketch of such a search loop is given after this list).
- We design a MobileNetV2-inspired cell-based search space and a task-specific binary encoding scheme, which together ensure stable search dynamics and efficient exploration of the discrete architecture space, especially under the constraints of DHGR.
- We conduct comprehensive experiments on two widely used multi-objective benchmark tasks (C10/MOP and IN1K/MOP) and two real-world DHGR datasets (EgoGesture and NvGesture). The results show that MONAS_ABC achieves state-of-the-art performance and offers a better balance between accuracy, FLOPs, and model complexity compared to existing 3D-CNN and NAS-based methods.
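To give a flavor of the search procedure referenced in the first contribution, the sketch below shows a heavily simplified binary artificial bee colony loop over architecture bit-strings with a Pareto-dominance acceptance rule. The evaluation function, the neighborhood operator, and the merging of the employed and onlooker phases are illustrative assumptions; the actual MONAS_ABC search strategies are described in Section 3.

```python
# Simplified sketch of a binary ABC search over architecture bit-strings.
# evaluate() is a hypothetical stand-in returning (error, flops); the real
# employed/onlooker/scout strategies of MONAS_ABC are defined in Section 3.
import random

def flip_neighbor(bits, rate=0.1):
    """Neighborhood move: flip a few bits of a candidate encoding."""
    return [b ^ (random.random() < rate) for b in bits]

def dominates(a, b):
    """Pareto dominance on (error, flops): lower is better in both objectives."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def binary_abc(evaluate, n_bits=20, n_food=10, limit=5, iters=50):
    foods = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(n_food)]
    fits = [evaluate(f) for f in foods]
    trials = [0] * n_food
    for _ in range(iters):
        for i in range(n_food):                 # employed + onlooker phases (merged for brevity)
            cand = flip_neighbor(foods[i])
            cfit = evaluate(cand)
            if dominates(cfit, fits[i]):
                foods[i], fits[i], trials[i] = cand, cfit, 0
            else:
                trials[i] += 1
            if trials[i] > limit:               # scout phase: abandon a stagnant food source
                foods[i] = [random.randint(0, 1) for _ in range(n_bits)]
                fits[i], trials[i] = evaluate(foods[i]), 0
    return foods, fits

# Toy usage with a dummy objective (real use would train/estimate each architecture).
archs, objs = binary_abc(lambda bits: (1.0 - sum(bits) / len(bits), float(sum(bits))))
```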
The remainder of this paper is organized as follows. Section 2 reviews related work on DHGR and multi-objective neural architecture search. Section 3 describes MONAS_ABC in detail. Section 4 presents the experimental validation and results. Section 5 concludes this work.