
Real-time fire detection algorithms running on small embedded devices based on MobileNetV3 and YOLOv4



Fires are a serious threat to people’s lives and property. Detecting fires quickly and effectively and extinguishing them in the nascent stage is an effective way to reduce fire hazards. Currently, deep learning-based fire detection algorithms are usually deployed on the PC side.


After migration to small embedded devices, recognition accuracy and speed degrade because of limited computing power. In this paper, we propose a real-time fire detection algorithm based on MobileNetV3-large and YOLOv4. CSPDarknet53 in YOLOv4 is replaced with MobileNetV3-large to perform the initial extraction of flame and smoke features while greatly reducing the computational cost of the network. A path connecting to PANet is added at the G-bneck(104, 104, 24) layer, and SPP is embedded in the path from MobileNetV3 to PANet to improve feature extraction for small targets. The PANet in YOLOv4 is improved with the BiFPN path fusion method, which further improves its feature extraction capability. The Vision Transformer model is added to the backbone feature extraction network and the PANet of YOLOv4 so that its multi-head attention mechanism can pre-process image features. Adding ECA-Net to the head network of YOLOv4 improves the overall recognition performance of the network.


The algorithm runs well on a PC and reaches 95.14% recognition accuracy on the public dataset BoWFire. Finally, the algorithm was migrated to the Jetson Xavier NX platform, and the entire network was quantized and accelerated with TensorRT. With the image propagation function of the fire robot, the overall recognition frame rate can reach about 26.13, giving high real-time performance while maintaining high recognition accuracy.


Several comparative experiments have also validated the effectiveness of this paper’s improvements to the YOLOv4 algorithm and the superiority of these structures. With the effective integration of these components, the algorithm shows high accuracy and real-time performance.



Fires pose a serious threat to people and their property. Detecting fires quickly and effectively and extinguishing them in their initial stage is an effective way to reduce fire hazards. At present, deep learning-based fire detection algorithms are developed on personal computers (PCs).


After migration to increasingly small computers, recognition accuracy and speed degrade because of a lack of computing capacity. In this work, we propose a real-time fire detection algorithm based on YOLOv4 that replaces CSPDarknet53 in YOLOv4 with MobileNetV3-large to obtain the initial features for detecting flames and smoke while greatly reducing the computational effort of the network structure. A path connecting to PANet was explored at G-bneck(104, 104, 24), while SPP was incorporated into the path connecting MobileNetV3-large to PANet to improve feature extraction for small targets. The PANet in YOLOv4 was improved by combining it with the BiFPN fusion method, and this improved PANet further increased the feature extraction capability. The Vision Transformer model is added to the backbone feature extraction network and to the YOLOv4 model so that the model's multi-head attention mechanism can pre-process image features. Adding ECA Net to the head of the YOLOv4 network improves the overall recognition performance of the network.


These algorithms work well on a PC and reach a recognition accuracy of 95% on the public dataset BoWFire. Finally, the algorithms were migrated to the Jetson Xavier NX platform, and the entire network was quantized and accelerated with TensorRT. With the image propagation function of the fire robot, the overall recognition frame rate can reach 26.13, with real-time performance while maintaining high recognition accuracy.


Several comparative experiments have also validated the effectiveness of this work's improvements to the YOLOv4 algorithm and the superiority of these structures. With the effective integration of these components, the algorithm shows high accuracy and real-time performance.


Fire is one of the major public safety disasters that can result in casualties and economic and property losses. Detecting fire conditions as early as possible and extinguishing them in the beginning stages is an effective method to reduce fire hazards. Therefore, researching rapid and accurate fire detection is of great significance (Muhammad et al. 2018a, b, c). Traditional smoke detectors can sense fire when smoke particles enter a room, but this method has a long detection time and is not suitable for outdoor fire detection.

With the development of neural networks, deep learning, and related fields (Gong et al. 2021; Succetti et al. 2022), video-based fire detection methods have been proposed. Compared with traditional methods, they offer fast response, non-contact operation, visualization, intelligence, and easy integration. Most fires pass through a long smoldering process before a flame appears, generating a large amount of smoke. Because smoke diffuses, smoke detection can reveal a developing fire earlier than flame detection, which shortens the response time.

Although smoke detection algorithms have made great progress, they have not been widely used in the real world, mainly for the following reasons. First, fire generally makes the background scene complicated, which reduces detection accuracy, so false alarms and missed alarms occur frequently. Second, although general fire detection algorithms have good accuracy, they are too complex to run well on typical small embedded devices. If an algorithm does not run stably on embedded platforms, it loses its practical applicability.

Based on the above analysis, we conclude that the limitations of current fire detection algorithms are a large number of parameters to compute and poor immunity to environmental disturbances, which makes them prone to false alarms. For these reasons, we propose a new lightweight fire detection algorithm in this paper. The contributions are as follows:

  1. The backbone network CSPDarknet53 of YOLOv4 (Bochkovskiy et al. 2020) is replaced with the MobileNetV3 (Howard et al. 2019) network, which can effectively extract valid information and greatly reduce the computational complexity of the algorithm.

  2. Multi-scale feature fusion in YOLOv4 is improved by extending a PANet (Liu et al. 2018) path at the G-bneck(104, 104, 24) layer, which improves the detection of multi-pose and multi-scale targets.

  3. The Spatial Pyramid Pooling (SPP) module (He et al. 2015) is added to the path from the output feature layers of the backbone to the PANet to improve the feature extraction of small targets.

  4. The path fusion method of BiFPN (Tan et al. 2020) is used to improve the path aggregation of PANet and further improve the feature extraction capability.

  5. The Vision Transformer (Dosovitskiy et al. 2020) model is added to the backbone feature extraction network and the PANet of the YOLOv4 model so that its multi-head attention mechanism can pre-process image features.

  6. Efficient Channel Attention (ECA) (Wang et al. 2020) is added to the head network of YOLOv4, which reduces the input of interfering information and improves the overall recognition performance of the network.

  7. The algorithm, which runs stably on a PC, was successfully migrated to the Jetson Xavier NX, and TensorRT was used to accelerate it.

  8. For model training and experimental comparison, we collected a series of flame and smoke images, including single flame and smoke, multi-body flame and smoke, indoor fire, forest fire, and complex background fire scenarios. The 29,980 images are divided into a training set, a validation set, and a test set in a ratio of 7:1:2.
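As a worked illustration of the 7:1:2 split described in the last item, the following minimal sketch partitions 29,980 items into the three subsets. The function name `split_dataset`, the shuffling, and the fixed seed are assumptions made for illustration; the paper does not state how the images were assigned to each subset.

```python
import random


def split_dataset(items, ratios=(7, 1, 2), seed=0):
    """Shuffle items and split them into train/val/test by integer ratios.

    Hypothetical helper: the shuffle and the seed are illustrative choices.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


train, val, test = split_dataset(range(29980))
print(len(train), len(val), len(test))  # 20986 2998 5996
```

With these ratios the subsets contain 20,986, 2,998, and 5,996 images, respectively.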

Related work

Traditional fire detection is typically based on a combination of flame and smoke sensors, but this type of method places severe restrictions on the operating environment and cannot be used in all situations. With the widespread use of video cameras in public safety systems, fire detection techniques that apply machine learning to image information have developed rapidly. Traditional vision-based fire detection methods generally detect fire by extracting fire features such as color (Töreyin et al. 2006; Chen et al. 2006; Genovese et al. 2011; Celik and Demirel 2009), texture (Gunay et al. 2012; Chunyu et al. 2009; Yuan et al. 2016a, b; Dimitropoulos et al. 2016), shape (Hongyu et al. 2020; Töreyin et al. 2005), and motion state (Han and Lee 2009; Yuan 2008).

Related research results are as follows. Kim et al. (2014) established an RGB color model to achieve fire detection, but the robustness and generalization ability of the method were insufficient. Wang et al. (2020) proposed a flame detection method based on KNN background subtraction that fuses flame color and local features. Günay and Çetin (2015) proposed a real-time dynamic texture recognition method using projection onto random hyperplanes and deep neural network filters and applied it to real-time flame detection in infrared video. Emmy Prema et al. (2018) preliminarily segmented the flame regions in the image according to the YCbCr color space and extracted static and dynamic texture features for the candidate flame regions through 2D temporal wavelet decomposition and 3D volume wavelet decomposition; the candidate flame regions are then classified according to the extracted texture features. Jia et al. (2016) adopted non-linear enhanced smoke color features to identify smoke regions, then used motion features to measure saliency, and finally used motion energy and saliency maps to segment smoke regions. Habiboğlu et al. (2012) divided the video into spatiotemporal blocks and used the covariance-based spatiotemporal features extracted from these blocks to train an SVM classifier. Dimitropoulos et al. (2014) employed background subtraction and color analysis to define candidate regions, then modeled fire behavior in time and space for each candidate region using color probability, flicker, spatial, and energy features simultaneously, and performed dynamic texture analysis; the candidate regions are finally classified using a two-class SVM classifier. Yuan et al. (2016a, b) proposed a method for forest fire detection using drones. First, the candidate area is extracted using the color features of the flame; then the motion vectors of the candidate area are calculated with the Horn-Schunck optical flow algorithm, and a binary image is obtained by thresholding and morphological operations on the motion vectors. Finally, the spot counting method is used to locate the fire source in the binary image. Kim and Lattimer (2015) and Kim et al. (2016) extracted the texture and motion features of flames and smoke from long-wave infrared images for autonomous navigation of robots in fire environments.

Although these algorithms depend less on the computing power of the hardware, their detection accuracy is limited by the accuracy of feature extraction and is susceptible to interference from the environment, and the shape and color characteristics of flames and smoke are very complex and variable. It is becoming clear that traditional vision algorithms alone cannot effectively solve these problems.

The subsequent rise of neural networks, artificial intelligence, and deep learning has provided new opportunities to address fire detection. Related research results are as follows. Frizzi et al. (2016) used a 6-layer CNN to solve the three-class classification problem of fire, smoke, and no fire. Tao et al. (2016) used deep convolutional neural networks to achieve end-to-end training from raw pixel values to classifier output, which successfully improved the accuracy of smoke detection. Yin et al. (2017) proposed a 14-layer deep normalized convolutional neural network (DNCNN) to achieve automatic extraction of smoke features. Xu et al. (2021) applied deep learning techniques to adaptively learn and extract the features of forest fires. The method first integrates two individual learners, YOLOv5 and EfficientDet, to carry out fire detection; a third individual learner, EfficientNet, learns global information to avoid false positives; the final detection results are based on the decisions of the three learners. Kim and Lee (2019) proposed a deep learning-based fire detection method using video sequences, which uses a region-based convolutional neural network (R-CNN) to detect suspected regions of fire (SRoF) and non-fire regions based on their spatial features. The aggregated features within bounding boxes in consecutive frames are then accumulated by an LSTM to classify whether there is a fire in a short period, and the decisions of successive short periods are combined by majority voting into the final decision for a long period. Zhang et al. (2018) solved the problem of insufficient training data by inserting real or simulated smoke into forest backgrounds to generate synthetic smoke images and used the synthetic dataset to train Faster R-CNN to obtain a smoke detection model. Xu et al. (2019) proposed a novel deep saliency network-based method for video smoke detection: informative smoke saliency maps are extracted by combining pixel-level and object-level saliency convolutional neural networks, and the presence of smoke in images is predicted by combining deep feature maps and saliency maps. Lin et al. (2019) constructed a joint framework of RCNN and 3D CNN, using RCNN to extract static spatial information and 3D CNN to extract spatiotemporal features, thus solving the problem of fire smoke detection and localization. It can be seen that complex CNNs can extract the spatial features of smoke targets and locate smoke accurately and promptly, which makes them well suited to smoke detection.

Bhattarai and Martinez-Ramon (2020) used deep convolutional neural networks to extract, process, and analyze key information from thermal imaging, creating an automated system capable of detecting critical objects at fire sites in real time. Wu et al. (2022) proposed a video fire detection algorithm based on YOLOv5 that improves SPP and uses the GELU activation function and DIoU-NMS for predicted bounding box suppression, with excellent final performance. Huang et al. (2023) proposed a lightweight forest fire detection algorithm with a defogging function. The algorithm first obtains a fog-free image by applying a dark channel operation to the image and then runs detection with a lightweight, improved YOLO-L-Light algorithm. Xue et al. (2022) proposed an improved forest fire classification and detection algorithm based on YOLOv5, which introduced SIoU and CBAM and changed PANet into a BiFPN-like structure; the final algorithm outperformed the original in all aspects. Zhao et al. (2022) proposed an improved YOLO algorithm that extends the feature extraction network in three dimensions and adds feature propagation properties to improve network performance and reduce the number of parameters. Sathishkumar et al. (2023) proposed a learning without forgetting (LwF) method for fire detection, which addresses the risk that a detection model loses its ability to classify the original dataset when transfer learning is applied, thereby greatly reducing the number of steps required to transfer the detection model. Zheng et al. (2023) proposed a novel algorithm for remote sensing forest fire detection, which first uses FireYOLO for the initial recognition of the target, then applies the Real-ESRGAN algorithm to the target to improve image clarity, and finally applies FireYOLO for a second recognition.

Each of these algorithms has its own characteristics and solves some of the challenges in fire detection, but improving accuracy and immunity to interference while greatly reducing the number of parameters remains a challenging problem.

Materials and methods

To further improve the real-time performance of deep learning-based fire detection, this paper proposes a fire detection algorithm based on MobileNetV3-large and YOLOv4 (hereafter the MobileNetV3-large-YOLOv4 algorithm), whose structure is shown in Fig. 1. The algorithm modifies the network structure of YOLOv4 as follows: MobileNetV3-large is used as the backbone network to perform the initial extraction of smoke and flame features; the PANet path is extended at the G-bneck(104, 104, 24) layer to improve multi-scale feature fusion and enhance the detection of multi-pose and multi-scale targets; the SPP module is added on the paths from the output feature layers of the backbone to the PANet to improve feature extraction for small targets; the paths of PANet are modified according to the path connection principle of BiFPN; the Vision Transformer model is added to the backbone feature extraction network and the PANet of YOLOv4 so that its multi-head attention mechanism can pre-process the image features; and ECA-Net is introduced in the head network to reduce the input of interfering information and improve the extraction of effective information. The algorithm runs well on a PC and achieves a recognition accuracy of 95.04% on the public dataset BoWFire (Chino et al. 2015). Finally, the algorithm is migrated to the Jetson Xavier NX platform, and the entire network is quantized and accelerated using TensorRT. Using the image propagation function of the fire robot, the overall recognition frame rate can reach about 26.13, so the algorithm achieves high real-time performance while maintaining high recognition accuracy.

Fig. 1 The network structure of MobileNetV3-large-YOLOv4

Fire feature extraction based on MobileNetV3-large

The MobileNet network is a lightweight CNN proposed by Google. MobileNetV1 mainly replaces ordinary convolution with depthwise separable convolution, whose process is shown in Fig. 2. Each input channel is convolved with its own convolution kernel, the channels are then adjusted by 1 × 1 convolution kernels, and a BN (Batch Normalization) layer and a ReLU activation function are added after each convolution layer. Suppose the size of the input feature map is \(D_W \times D_H \times M\) and the size of the output feature map is \(D_W \times D_H \times N\), where \(D_W\) and \(D_H\) are the width and height of the feature map, and \(M\) and \(N\) are the numbers of channels of the input and output feature maps, respectively. A standard convolution with kernel size \(D_K \times D_K\) has \(N\) convolution kernels of size \(D_K \times D_K \times M\), so the number of parameters \(P_N\) can be expressed as:

$$P_{N} = D_{K} \times D_{K} \times M \times N$$
Fig. 2 Depthwise separable convolution structure

Each convolution kernel is applied at \(D_W \times D_H\) positions, so the computational cost \(S_N\) is expressed as:

$$S_{N} = D_{K} \times D_{K} \times M \times N \times D_{W} \times D_{H}$$

In depthwise separable convolution, a standard convolution is divided into two steps: depthwise convolution and pointwise convolution. Depthwise convolution requires only one \(D_K \times D_K \times M\) convolution kernel, and pointwise convolution uses \(N\) convolution kernels of size \(1 \times 1 \times M\). The number of parameters \(P_D\) is therefore expressed as:

$$P_{D} = D_{K} \times D_{K} \times M + M \times N$$

Each parameter of the depthwise and pointwise convolutions takes part in \(D_W \times D_H\) operations, so the computational cost \(S_D\) is expressed as:

$$S_{D} = D_{K} \times D_{K} \times M \times D_{W} \times D_{H} + M \times N \times D_{W} \times D_{H}$$

The ratio \(R_P\) of the parameter count of the depthwise separable convolution to that of the standard convolution is given by formula (5), and the ratio \(R_Q\) of their computational costs is given by formula (6).

$$R_{P} = \frac{{P_{D} }}{{P_{N} }} = \frac{1}{N} + \frac{1}{{D_{K}^{2} }}$$
$$R_{Q} = \frac{{S_{D} }}{{S_{N} }} = \frac{1}{N} + \frac{1}{{D_{K}^{2} }}$$

It can be seen from the above formulas that the parameter count and computational cost of the depthwise separable convolution are reduced to \(\frac{1}{N} + \frac{1}{{D_{K}^{2} }}\) of those of the standard convolution.
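The ratio can be checked numerically with a short sketch. The layer sizes below (a 3 × 3 kernel, 32 input channels, 64 output channels, a 104 × 104 feature map) are illustrative assumptions rather than values from Table 1, and the depthwise cost is counted as \(D_K \times D_K \times M \times D_W \times D_H\), i.e. each depthwise kernel acts on a single channel.

```python
def standard_conv_cost(d_k, m, n, d_w, d_h):
    """Parameter count P_N and multiplication count S_N of a standard convolution."""
    params = d_k * d_k * m * n
    return params, params * d_w * d_h


def depthwise_separable_cost(d_k, m, n, d_w, d_h):
    """Parameter count P_D and multiplication count S_D of a depthwise separable convolution."""
    params = d_k * d_k * m + m * n  # depthwise kernels + 1x1 pointwise kernels
    return params, params * d_w * d_h


# Illustrative sizes (assumed, not taken from the paper)
d_k, m, n, d_w, d_h = 3, 32, 64, 104, 104
p_n, s_n = standard_conv_cost(d_k, m, n, d_w, d_h)
p_d, s_d = depthwise_separable_cost(d_k, m, n, d_w, d_h)

ratio_params = p_d / p_n
ratio_cost = s_d / s_n
expected = 1 / n + 1 / d_k ** 2
print(p_n, p_d)                            # 18432 2336
print(ratio_params, ratio_cost, expected)  # all approximately 0.1267
```

For these sizes the depthwise separable layer needs roughly one eighth of the parameters and multiplications of the standard layer, and the saving is dominated by the \(1/D_K^2\) term once \(N\) is large.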

During training, the convolution kernels of the depthwise convolution part of MobileNetV1 are prone to failure, that is, most of the kernel parameters become 0, which degrades feature extraction. MobileNetV2 builds on V1 with the inverted residual block (Sandler et al. 2018), as shown in Fig. 1E. First, pointwise convolution is used to increase the feature dimension, then depthwise convolution is used for feature extraction, and finally pointwise convolution is used for dimension reduction. The ReLU activation function is replaced with ReLU6, which makes the model more robust under low-precision computation, and the last ReLU layer is removed. The ReLU6 activation function is expressed as:

$$\text{ReLU6}(x) = \min (\max (0,x),6)$$

When the input dimension is the same as the output dimension, the residual connection of ResNet is introduced to connect the output directly with the input. In this inverted residual structure, the first and last layers have low feature dimensions and the middle layer has a high dimension, which avoids the failure of the convolution kernels seen in the depthwise convolution of MobileNetV1, and using only depthwise convolution in the high-dimensional feature layer does not increase the number of parameters much. In addition, the residual connections help avoid vanishing gradients as the network is deepened.

MobileNetV3 uses a 3 × 3 standard convolution and multiple bneck structures to extract features. After the feature extraction layers, a 1 × 1 convolution block replaces the fully connected layer, and a maximum pooling layer is added to obtain the final classification result, which further reduces the number of network parameters. MobileNetV3 comes in two variants, large and small, and this paper uses the large variant. To suit the recognition task, the input image size is set to 416 × 416. The structure of MobileNetV3-large is shown in Table 1, where SE indicates whether the attention module is used, NL indicates which activation function is used, and s is the stride.

Table 1 MobileNetV3-large construction method diagram

Improvement and optimization of the neck structure

The PANet structure is used in four feature layers

The shallow layers of the network contain more localization information, while the deep layers contain more semantic information, and the localization information of small-scale targets is lost after a series of down-sampling operations. The aim of this section is to improve the detection accuracy of the YOLOv4 model for small-scale flames and smoke by fusing multi-scale features, so that more of the localization information of small flame and smoke targets in the shallow layers is transferred to the deeper layers.

The PANet structure first performs top-down feature extraction in the traditional feature pyramid network (FPN) (Lin et al. 2017), which only enhances semantic information and does not convey localization information. It then performs bottom-up path-augmented feature extraction in a second feature pyramid, which conveys the strong localization information of the shallow layers. Next, the adaptive feature pooling layer uses features from every layer of the pyramid to enable more accurate classification and localization at a later stage. Figure 3 shows the network structures related to the neck structure of this paper, where a is FPN, b is PANet, c is BiFPN, and d is the neck structure of the algorithm in this paper, obtained by fusing the structural features of PANet and BiFPN.

Fig. 3 Schematic representation of the various types of network relationship structures

The YOLOv4 algorithm uses the PANet structure on three effective feature layers, but it is still not effective at recognizing small targets and multi-pose targets. Therefore, YOLOv4 is improved as shown in Fig. 3 to perform multi-scale feature fusion on four effective feature layers.

Medium- and large-scale feature layers introduce SPP structure

The SPP structure was originally used as a transition layer between convolutional layers and fully connected layers to solve the problem of size mismatch. Researchers subsequently found that this structure can enlarge the receptive field, so some introduced an improved SPP module into object detection networks to splice multi-scale local region features and improve detection accuracy (Huang et al. 2020; Mao et al. 2020). YOLOv3-608 with the SPP module improves AP50 by 2.7% on the COCO object detection task.

In YOLOv4, the SPP module is placed after the small-scale feature layer, which is responsible for predicting small and medium-sized targets, but no SPP module is placed after the medium-scale and large-scale feature layers. As a result, the features of small targets are easily lost during propagation, which leads to small targets being missed. In this paper, the SPP module is also added after the medium-scale and large-scale feature layers, and the feature tensors of different scales extracted by the backbone network are fed into the SPP modules, which makes the features of small and medium-sized targets more distinct and eases target identification.

The SPP module consists of three max-pooling layers and one concatenation layer. Figure 1C shows the structure of the SPP module. The max pooling uses pooling kernels of size 5 × 5, 9 × 9, and 13 × 13 with a stride of 1. Therefore, after the pooling operation, three new feature maps of the same size as the original feature map are obtained. The three feature maps are concatenated with the original feature map to give the final output of the module.
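The behavior of the SPP module can be illustrated with a minimal pure-Python sketch. It assumes that border windows simply ignore positions outside the map (equivalent to padding with negative infinity), which keeps the output the same size as the input, and that the pooled maps are concatenated with the input along the channel axis. The function names `max_pool_same` and `spp` are hypothetical; a real implementation would use a deep learning framework.

```python
def max_pool_same(fmap, k):
    """k x k max pooling with stride 1; out-of-range positions are ignored,
    so the output has the same height and width as the input."""
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [fmap[y][x]
                      for y in range(max(0, i - r), min(h, i + r + 1))
                      for x in range(max(0, j - r), min(w, j + r + 1))]
            row.append(max(window))
        out.append(row)
    return out


def spp(channels, kernel_sizes=(5, 9, 13)):
    """channels: list of 2D maps. Returns the input channels followed by the
    pooled copies for each kernel size (channel-wise concatenation)."""
    out = list(channels)
    for k in kernel_sizes:
        out.extend(max_pool_same(c, k) for c in channels)
    return out


# One 13 x 13 channel with a single activation in the centre
fmap = [[0.0] * 13 for _ in range(13)]
fmap[6][6] = 1.0
result = spp([fmap])
print(len(result), len(result[0]), len(result[0][0]))  # 4 13 13
print([sum(map(sum, ch)) for ch in result])            # [1.0, 25.0, 81.0, 169.0]
```

A single activation spreads over a 5 × 5, 9 × 9, and 13 × 13 neighborhood in the three pooled maps, which shows how the module mixes local context at several scales while the channel count grows fourfold.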

Improved PANet

After the three feature tensors of different scales pass through the SPP module, the PANet structure is adopted for further feature fusion. Small-scale features are more responsive to the overall target, while large-scale features are better at expressing local details. However, replacing CSPDarknet53 with MobileNetV3-large makes YOLOv4 lightweight at the price of a reduced feature fusion ability. This paper therefore applies the path fusion idea of BiFPN to improve the PANet in YOLOv4. In BiFPN, the input and output nodes of the same level can be connected across layers so that more features are fused without adding much cost. This algorithm adds such cross-layer connections on the same level of PANet (the three orange lines in Figs. 1A and 3d), which shortens the path from low-level to high-level information and allows their semantic features to be combined. In BiFPN, adjacent layers can also be merged in series; accordingly, the adjacent layers of PANet are merged in series in this paper (the three blue lines in Figs. 1A and 3d).

The improved PANet has the characteristics of bidirectional cross-scale connection and weighted feature fusion, which improves the feature fusion ability and further increases the feature extraction ability.
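The text does not give the form of the weighted feature fusion. One common choice is the fast normalized fusion described for BiFPN in EfficientDet, \(O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} I_i\) with non-negative learnable weights \(w_i\). The sketch below illustrates that rule on flattened feature vectors; whether the proposed neck uses exactly this rule is an assumption, and the function name and the example weights are hypothetical.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature vectors with non-negative normalized weights.

    features: list of equally long lists of floats (flattened feature maps)
    weights:  one learnable scalar per input feature (assumed values here)
    """
    clipped = [max(0.0, w) for w in weights]  # ReLU keeps the weights non-negative
    total = sum(clipped) + eps                # eps avoids division by zero
    fused = [0.0] * len(features[0])
    for w, feat in zip(clipped, features):
        for idx, value in enumerate(feat):
            fused[idx] += (w / total) * value
    return fused


# Two inputs, e.g. a same-level skip connection and an upsampled deeper feature
skip = [1.0, 2.0]
deep = [3.0, 4.0]
fused = fast_normalized_fusion([skip, deep], weights=[1.0, 1.0], eps=0.0)
print(fused)  # [2.0, 3.0]
```

With equal weights the result is the plain average of the inputs; during training the weights would shift toward the more informative path.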

Introduction of ECA attention mechanism

When learning and understanding the unknown, humans quickly focus their attention on key areas and ignore useless information in order to get the information they need quickly and accurately. Researchers have been inspired to incorporate attention mechanisms into convolutional neural networks to improve the performance of traditional network models while sacrificing a small amount of computation. In this paper, the Efficient Channel Attention (ECA) module is added to YOLOv4, and the weights are trained on the channel dimensions of the four feature layers of the head network to make the model more focused on useful information. The specific structure of ECA Net is shown in Fig. 1D.

The ECA module can be seen as an improved version of the Squeeze-and-Excitation (SE) module (Hu et al. 2018). The authors of ECA argue that the way SE predicts channel attention captures the dependencies among all channels, which is inefficient and unnecessary, whereas convolution has good cross-channel information acquisition capabilities, so the ECA module replaces the two fully connected layers of the SE module with a 1D convolution. The kernel size of the 1D convolution determines the coverage of cross-channel interaction, so it is important to choose the kernel size k appropriately. Although k can be tuned manually, this wastes a lot of time and effort. Instead, k is taken to be non-linearly related to the channel dimension C: the larger C is, the stronger the long-range interaction; conversely, the smaller C is, the stronger the short-range interaction, i.e.:

$$C = \phi \left( k \right) = 2^{(\gamma \times k - b)}$$

Once the channel dimension C has been determined, the convolution kernel size k is then:

$$k = \varphi (C) = \left| {\frac{{\log_{2} (C)}}{\gamma } + \frac{b}{\gamma }} \right|_{\text{odd}}$$

where \(\gamma\) and \(b\) are the regulation parameters, and \(|t|_{\text{odd}}\) denotes the odd number nearest to \(t\).
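The kernel-size rule can be sketched as follows. The values \(\gamma = 2\) and \(b = 1\) are assumed here because they are the defaults commonly used with ECA-Net; the text does not state which values this paper uses. The rounding convention (truncate, then move up to the next odd integer if the result is even) is likewise an assumption about how the nearest odd number is resolved.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1D convolution kernel size for a layer with `channels` channels.

    gamma and b are assumed defaults, not values reported in the paper.
    """
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 == 1 else t + 1  # force an odd kernel size


for c in (64, 256, 512):
    print(c, eca_kernel_size(c))
# 64 3
# 256 5
# 512 5
```

The kernel size grows slowly with the channel count, so wider layers interact over a larger channel neighborhood while the added parameter count stays negligible.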

In this paper, the ECA module is applied to the enhanced feature extraction network by adding an attention mechanism to the 152 × 152, 76 × 76, and 38 × 38 feature layers extracted from the backbone network, so that the subsequent training of the network can focus on the effective features and improve the detection capability of the algorithm.

Introduction of Vision Transformer

The Vision Transformer model was designed with the principle of not changing the transformer too much, using the Transformer Encoder part to do the classification, i.e., just to solve the problem of its poor performance in classification tasks with large data. Alexey Dosovitskiy et al. were inspired by the success of transformer scaling in NLP and attempted to apply the standard transformer directly to images with as few modifications as possible, eventually proposing the Vision Transformer for computer vision modules, whose network structure is shown in Fig. 1F.

The Vision Transformer first splits the image into patches, so that a single image becomes a sequence of smaller patches, and then adds a classification token to this sequence. The dimensions change as shown in Eq. 10.

$$(B, C, H, W) \Rightarrow (B, N, P^{2} C), \quad N = \frac{HW}{P^{2}}$$
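As a rough illustration of Eq. 10, the sketch below cuts one image into non-overlapping P × P patches and flattens each patch into a vector of length P²C, giving N = HW/P² patches. The names `patchify` and `num_patches`, the nested-list image layout [C][H][W], and the patch size 16 in the example are assumptions for illustration; this section does not state the patch size used in the paper.

```python
def num_patches(height, width, patch):
    """N = H * W / P^2 from Eq. 10."""
    return (height * width) // (patch * patch)


def patchify(image, patch):
    """Split one image ([C][H][W] nested lists) into flattened patches.

    Returns a list of N patches, each a flat list of length P * P * C.
    """
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    if height % patch or width % patch:
        raise ValueError("H and W must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                image[c][top + dy][left + dx]
                for dy in range(patch)
                for dx in range(patch)
                for c in range(channels)
            ]
            patches.append(flat)
    return patches


if __name__ == "__main__":
    # A 416 x 416 input with an assumed patch size of 16.
    print(num_patches(416, 416, 16))
```

A batch of B images simply repeats this per image, which yields the (B, N, P²C) shape on the right-hand side of Eq. 10.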

Rather than using the traditional transformer encoding method directly, the Vision Transformer's position encoding first initializes the position information randomly and then learns it during training along with the image features. Finally, the extracted image features are used to generate feature predictions for the different target classes. The encoding used is shown in Eqs. 11 and 12.

$$\text{PE}(\text{pos}, 2i) = \sin \left( \frac{\text{pos}}{10000^{2i/d_{\text{model}}}} \right)$$
$$\text{PE}(\text{pos}, 2i + 1) = \cos \left( \frac{\text{pos}}{10000^{2i/d_{\text{model}}}} \right)$$
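Eqs. 11 and 12 can be evaluated directly. The following is a minimal sketch with a hypothetical function name and illustrative sizes; it only computes the sinusoidal table and does not model any learned adjustment of the position information.

```python
import math


def sinusoidal_position_encoding(num_positions, d_model):
    """Return a num_positions x d_model table following Eqs. 11 and 12."""
    table = [[0.0] * d_model for _ in range(num_positions)]
    for pos in range(num_positions):
        # even_index plays the role of 2i in the equations.
        for even_index in range(0, d_model, 2):
            angle = pos / (10000 ** (even_index / d_model))
            table[pos][even_index] = math.sin(angle)
            if even_index + 1 < d_model:
                table[pos][even_index + 1] = math.cos(angle)
    return table


if __name__ == "__main__":
    pe = sinusoidal_position_encoding(num_positions=4, d_model=8)
    print(pe[1][:4])
```

Each pair of dimensions shares one frequency, so position 0 always encodes as alternating 0 and 1.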

Algorithm quantization based on TensorRT

To run faster on the embedded platform, this paper further compresses the algorithm. Commonly used compression methods include network pruning and model quantization. Because the MobileNetV3-large-YOLOv4 algorithm already adopts the lightweight MobileNetV3-large network structure, further pruning would destroy the integrity of the network, so this paper adopts model quantization instead.

Model quantization methods can be divided into quantization-aware training and post-training quantization; post-training quantization is further divided into hybrid quantization, 8-bit integer quantization, and half-precision floating-point quantization. This paper uses the TensorRT acceleration engine to process the model weight file with post-training quantization, converting the weights from float to int8, and performs overall optimization through operations such as tensor fusion, kernel tuning, and multi-stream execution. The resulting model can be deployed directly on embedded devices.
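To make the float-to-int8 conversion concrete, here is a minimal sketch of symmetric post-training quantization of a weight list. It is not TensorRT's API and does not reproduce its calibration procedure; the function names and the max-abs choice of scale are assumptions made for illustration.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Assumed rule: the largest magnitude maps to 127.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [-1.0, -0.31, 0.0, 0.27, 1.0]
    q, s = quantize_int8(weights)
    print(q, s, dequantize(q, s))
```

The round trip loses at most half a quantization step per weight, which is the accuracy cost traded for smaller weights and faster integer arithmetic.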

Results and discussion

Training dataset

The dataset in this paper is a collection of flame and smoke images, including single flames and smoke, multiple flames and smoke, indoor fires, forest fires, and fires against complex backgrounds. Smoke is labeled separately. A total of 29,980 images were collected and divided into training, validation, and test sets in a 7:1:2 ratio, as shown in Table 2. The images were selected and merged from several publicly available datasets, including FLAME (Shamsoshoara et al. 2021), FireNet Dataset (Jadon et al. 2019), and BoWFire. Unless an experiment below specifies otherwise, the data used in that experiment is the test set. Figure 4 shows some of the images used in this paper.
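For orientation, the 7:1:2 ratio can be turned into approximate subset sizes as follows. This is only an arithmetic sketch with a hypothetical helper name; the exact per-class counts are those reported in Table 2 and may differ from an even split.

```python
def split_counts(total, ratios=(7, 1, 2)):
    """Split `total` items into parts proportional to `ratios`."""
    unit = sum(ratios)
    counts = [total * r // unit for r in ratios]
    # Any rounding remainder is assigned to the first (training) part.
    counts[0] += total - sum(counts)
    return counts


if __name__ == "__main__":
    train, val, test = split_counts(29980)
    print(train, val, test)
```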

Table 2 The number of each type of dataset
Fig. 4 Part of the dataset

Anchor box

The prior boxes in the MobileNetV3-large-YOLOv4 algorithm are needed for two categories, flame and smoke, and are obtained in this paper by K-means clustering. The input image size is 416 × 416. After 76 and 73 iterations of K-means clustering, the overlap ratio between the prior boxes and the ground-truth boxes reaches 76.54% for flame and 74.6% for smoke, respectively. The resulting flame and smoke prior boxes are shown in Table 3.
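A common way to cluster prior boxes is K-means over box widths and heights with 1 − IoU as the distance. The sketch below follows that approach under stated assumptions: the function names, the mean-based center update, the random initialization, and the toy boxes are all illustrative, and the paper's exact clustering settings are not given in this section.

```python
import random


def iou_wh(box, anchor):
    """IoU of two (width, height) boxes aligned at the same corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) boxes into k anchors; higher IoU means closer."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda j: iou_wh(box, anchors[j]))
            clusters[best].append(box)
        updated = []
        for j, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                updated.append((mean_w, mean_h))
            else:
                updated.append(anchors[j])  # keep an empty cluster's anchor
        if updated == anchors:
            break
        anchors = updated
    return sorted(anchors, key=lambda a: a[0] * a[1])


def mean_best_iou(boxes, anchors):
    """Average, over all boxes, of the IoU with the best-matching anchor."""
    return sum(max(iou_wh(b, a) for a in anchors) for b in boxes) / len(boxes)


if __name__ == "__main__":
    toy_boxes = [(10, 10), (12, 11), (11, 12), (100, 90), (95, 105), (105, 100)]
    anchors = kmeans_anchors(toy_boxes, k=2)
    print(anchors, mean_best_iou(toy_boxes, anchors))
```

The value returned by `mean_best_iou` is the kind of overlap figure that is usually reported as a percentage for clustered prior boxes.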

Table 3 The anchor of fire and smoke

Model building and training

The specific hardware and software configuration is shown in Table 4. The network model is built and trained with the TensorFlow 2.5 deep learning framework, in which the algorithm in this paper is implemented.

Table 4 Software and hardware configuration

Evaluation criteria

The test set is divided into two categories, positive samples and negative samples. TP is the number of positive samples predicted as positive; FP is the number of negative samples predicted as positive; FN is the number of positive samples predicted as negative; TN is the number of negative samples predicted as negative. This paper uses accuracy, detection rate, missing detection rate, false detection rate, precision mAP, and running frame rate FPS as the evaluation indicators of the algorithm. These indicators are defined as follows:

  1. Accuracy

    $$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}}$$

  2. Detection rate (recall rate)

    $$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

  3. Missing detection rate

    $$\text{FN}_{\text{rate}} = \frac{\text{FN}}{\text{FN} + \text{TP}}$$

  4. False detection rate

    $$\text{FP}_{\text{rate}} = \frac{\text{FP}}{\text{FP} + \text{TN}}$$

  5. Precision mAP

    $$\text{mAP} = \frac{\sum \text{AP}}{N(\text{class})} = \frac{\sum \text{AP}}{2}$$

    The definition of mAP is shown in Eq. 17. It is the mean of the per-class average precision AP (calculated from the P-R curve) over N classes, and N = 2 in this experiment.

  6. Running frame rate FPS refers to the number of frames processed per second.
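The indicators above translate directly into code. The following is a minimal sketch; the function names and the counts in the example are made up for illustration and are not results from this paper.

```python
def detection_metrics(tp, fp, fn, tn):
    """Accuracy, recall, missing detection rate, and false detection rate."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "recall": tp / (tp + fn),
        "fn_rate": fn / (fn + tp),
        "fp_rate": fp / (fp + tn),
    }


def mean_average_precision(ap_per_class):
    """mAP as the mean of the per-class AP values (N = 2 for flame and smoke)."""
    return sum(ap_per_class) / len(ap_per_class)


if __name__ == "__main__":
    # Hypothetical counts and AP values, for illustration only.
    print(detection_metrics(tp=90, fp=5, fn=10, tn=95))
    print(mean_average_precision([0.92, 0.88]))
```

Note that the recall and the missing detection rate always sum to 1, since both share the denominator TP + FN.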

Experimental results and analysis

Faster convergence of the network during training

In the “Anchor box” section, the algorithm uses the K-means clustering algorithm to regenerate the prior boxes of the network. The loss curve of this paper’s algorithm, compared with that of YOLOv4 during training, is shown in Fig. 5. By the time training reaches 600 rounds, the algorithm in this paper has essentially stabilized, while the YOLOv4 algorithm is still oscillating slightly. This indicates, to a certain extent, that this paper’s algorithm converges significantly faster than YOLOv4, and its final loss value is also much lower than that of YOLOv4.

Fig. 5 The loss curves

Ablation experiment

Ablation experiments were performed for the MobileNetV3-large-YOLOv4 algorithm. The images used in this experiment were 2000 images containing fire and smoke, randomly selected from the test set. As can be seen from Table 5, using MobileNetV3-large instead of CSPDarknet results in a slight decrease in mAP but a significant increase in FPS (e.g., experiment 2). Adding a path from the backbone to the PANet results in a slight decrease in FPS but an increase in mAP (e.g., experiment 7). Modifying the network structure of the PANet along the lines of BiFPN results in a large increase in mAP but a slight decrease in FPS (e.g., experiment 4). Introducing more SPP modules has a larger effect on mAP at the cost of a small reduction in running speed (e.g., experiment 6). MobileNetV3-large, the BiFPN-based PANet, ECA Net, and the SPP modules introduced at multiple points each improve a different aspect of the algorithm and complement one another. As a result, the MobileNetV3-large-YOLOv4 algorithm proposed in this paper achieves good overall performance. For example, in experiment 64, the algorithm achieves an mAP of 90.30% and an FPS of 61 and can accurately identify smoke and flames in real time.

Table 5 Ablation experiment results

Detection performance of fire-like and smoke-like targets

Because of the nature of flame and smoke as detection targets, flame-like lighting effects and smoke-like white clouds are often encountered in real fire detection scenarios. The presence of these fire-like and smoke-like targets can reduce the detection accuracy of the model. The improvements to the YOLOv4 algorithm in this paper address this problem to a large extent. To verify this, the detection performance of each model file was compared on a collected set of fire-like and smoke-like images, as shown in Table 6. The data in this table show that the algorithm in this paper has a much lower false alarm rate for flame and smoke than the other four algorithms and a much higher accuracy. Figure 6 shows the results of the algorithm runs: only the algorithm in this paper did not identify fire-like and smoke-like scenes as flames and smoke, effectively avoiding interference from the environment. From this, we infer that the original Vision Transformer and ECA Net do have a strong ability to filter out interfering information.

Table 6 Comparison of the recognition effects of various model files for similar fire and smoke
Fig. 6 Recognition renderings

The algorithm in this paper improves the detection effect of small targets

In this paper, we connect PANet structures on four effective feature layers and use multiple SPP structures as transition layers between the convolutional and fully connected layers to address the size mismatch and improve the detection of small flames early in a fire. To evaluate this, a set of small-target fire images was collected and used to compare how well each model detects small flames or smoke. As shown in Table 7 and Fig. 7, the accuracy of the algorithm in this paper is far superior to that of the other algorithms. Figure 7 also shows that all the algorithms except this one miss some small flames and smoke. Together, these experiments demonstrate that the improvements to the neck network in this paper can indeed greatly improve the detection of small targets.

Table 7 Comparison of different methods
Fig. 7 Recognition renderings

Comparison with other algorithms

In this paper, six common deep learning image recognition algorithms are used for fire detection, and the final comparison results are shown in Table 8. The results show that the algorithm in this paper achieves the best balance between recognition speed and accuracy: it is slower only than YOLOv4-tiny, while its accuracy is very close to that of YOLOv4. Because differences of a few percentage points do not make the differences between algorithms easy to see, this paper uses confusion matrices (Fig. 8) for further comparison. The confusion matrix of this paper’s algorithm has the best values on the main diagonal, further showing the advantage of this paper’s algorithm.
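For readers unfamiliar with the comparison, a confusion matrix and its main-diagonal share can be computed as in the sketch below. The class names, labels, and function names are illustrative assumptions and are not taken from Fig. 8.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, prediction in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[prediction]] += 1
    return matrix


def diagonal_fraction(matrix):
    """Share of all samples that lie on the main diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


if __name__ == "__main__":
    classes = ["fire", "smoke", "none"]
    truth = ["fire", "fire", "smoke", "smoke", "none", "none"]
    predicted = ["fire", "smoke", "smoke", "smoke", "none", "fire"]
    cm = confusion_matrix(truth, predicted, classes)
    print(cm, diagonal_fraction(cm))
```

A matrix whose mass is concentrated on the main diagonal corresponds to fewer confusions between classes.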

Table 8 Comparison of different methods
Fig. 8 Confusion matrix for six algorithms

At the same time, this paper selects three classic algorithms (listed in Table 9) and compares them with the MobileNetV3-large-YOLOv4 algorithm. The public dataset used is BoWFire (including 119 fire images and 107 non-fire images), which has been used as a test dataset by many fire detection research works (Howard et al. 2017). We can see that the false alarm rate of MobileNetV3-large-YOLOv4 is slightly higher, but the detection rate and accuracy are better. Figure 9 shows the final recognition effect.

Table 9 Comparison of different methods
Fig. 9 Running renderings on PC

Algorithms are deployed on Jetson NX

We deployed the algorithm in this paper on the fire extinguishing robot RXR-M80D-13KT, as shown in Fig. 10. We played a video from a mobile device to simulate a real-time fire situation and used TensorRT on the embedded device (Jetson Xavier NX) to accelerate the algorithm as it recognized the fire images captured by the fire extinguishing robot. The algorithm achieved a frame rate of 26.13 FPS for real-time recognition and detection, with no significant false detections. With the firefighting robot as the image delivery platform and the Jetson Xavier NX as the algorithm running platform, we then recognized in real time a total of 2334 images, including 1387 flame images, 1082 smoke images, and 846 images without flame and smoke, selected from the image test set and some real collected images. The final test results are presented as a confusion matrix (Fig. 11), and the results are excellent. Some of the recognition results are shown in Fig. 12, where the algorithm shows no misjudgments or omissions.
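A frame rate such as the one reported here is typically obtained by timing inference over a sequence of frames. The sketch below shows one way to do this; `measure_fps`, `frame_time_ms`, and the stand-in workload are hypothetical and are not the measurement code used on the robot.

```python
import time


def measure_fps(infer, frames):
    """Average frames per second of `infer` over a list of frames."""
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed


def frame_time_ms(fps):
    """Per-frame time budget in milliseconds for a given frame rate."""
    return 1000.0 / fps


if __name__ == "__main__":
    # Stand-in workload; a real run would call the detector on camera frames.
    dummy_frames = list(range(50))
    print(measure_fps(lambda frame: sum(range(1000)), dummy_frames))
    print(frame_time_ms(26.13))
```

At 26.13 FPS, each frame has a budget of roughly 38 ms for image transfer, inference, and post-processing combined.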

Fig. 10 Appearance of RXR-M80D-13KT

Fig. 11 The final test results on the confusion matrix

Fig. 12 The real-time recognition effect of fire


This paper presents the MobileNetV3-large-YOLOv4 algorithm, which can be used for real-time fire identification in small embedded devices. Based on the experimental results, the following conclusions can be drawn.

  1. Using the MobileNetV3-large-YOLOv4 algorithm on the public fire dataset BoWFire, the identification accuracy can reach 96.24%. Deployed on the Jetson Xavier NX, the FPS stabilizes at around 26. Overall, this algorithm balances running speed and accuracy, with excellent overall performance.

  2. The MobileNetV3-large-YOLOv4 algorithm has good fire recognition performance and can recognize various types of fires. Several improved components of the algorithm play an important role in real-time fire recognition, and through the effective integration of these components, the algorithm shows high accuracy and real-time performance.

  3. The MobileNetV3-large-YOLOv4 algorithm is suitable not only for the PC side but also for the embedded side. The algorithm can be deployed directly on the embedded Jetson Xavier NX platform and can meet the real-time and accuracy requirements of fire recognition. The cloud AI algorithm can therefore be pushed to the edge side for computation, which is in line with current requirements for edge intelligence.

  4. However, the algorithm still has shortcomings: because of legal restrictions such as fire prevention regulations in urban areas, the algorithm in this paper has not had enough field testing, and it is hoped that more sites will be available for simulating realistic fires in the future.

Availability of data and materials

There were no known competing financial interests or personal relationships that may have affected this work.




This work received support from the Industry University Research Innovation Fund of Science And Technology Development Center of the Ministry of Education (NO. 2021JQR004), Public Welfare Projects in Zhejiang Province (No. LGF20F030002), Project of Hangzhou Science and Technology Bureau (No. 20201203B96), the Ministry of Education Industry-University Cooperation Collaborative Education Project (202102019039), and Zhejiang University City College Scientific Research Cultivation Fund Project (J-202223).

Author information

Authors and Affiliations



H.Z. and Y.L. conceived the idea. H.Z., J.D., Y.D., and Y.L. designed the research methods. H.Z., J.D., and Y.D. coordinated the data collection and assembly. H.Z., J.D., and Y.D. wrote the manuscript. All authors contributed to the editing and revision of the manuscript. The authors read and approved the final manuscript.

Corresponding author

Correspondence to Yan Liu.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit


About this article


Cite this article

Zheng, H., Duan, J., Dong, Y. et al. Real-time fire detection algorithms running on small embedded devices based on MobileNetV3 and YOLOv4. Fire Ecol 19, 31 (2023).



  • Fire detection
  • YOLOv4
  • MobileNetV3-large
  • PANet
  • BiFPN
  • SPP
  • ECA Net
  • TensorRT