Comparing machine learning algorithms to predict vegetation fire detections in Pakistan

Shahzad, Fahad; Mehmood, Kaleem; Hussain, Khadim; Haidar, Ijlal; Anees, Shoaib Ahmad; Muhammad, Sultan; Ali, Jamshid; Adnan, Muhammad; Wang, Zhichao; Feng, Zhongke

doi:10.1186/s42408-024-00289-5

Original research
Open access
Published: 25 June 2024


Fire Ecology volume 20, Article number: 57 (2024)


Abstract

Vegetation fires have major impacts on ecosystems and pose a significant threat to human life. In this study, vegetation fires comprise forest fires, cropland fires, and other vegetation fires. Research on the long-term prediction of vegetation fires in Pakistan is currently limited, and the exact effect of each factor on the frequency of vegetation fires remains unclear under standard analysis. This research used machine learning algorithms to combine data from several sources acquired between 2001 and 2022, including the MODIS Global Fire Atlas dataset and data on topography, climatic conditions, and vegetation types. We tested many algorithms and ultimately chose four models for formal data processing, selected on the basis of performance metrics such as accuracy, computational efficiency, and preliminary test results. Logistic regression, random forest, support vector machine, and eXtreme Gradient Boosting models were used to identify and select nine key factors for forest and cropland fires and seven key factors for other vegetation fires in Pakistan. The vegetation fire prediction models achieved prediction accuracies ranging from 78.7 to 87.5% for forest fires, 70.4 to 84.0% for cropland fires, and 66.6 to 83.1% for other vegetation. Additionally, the area under the curve (AUC) values ranged from 83.6 to 93.4% for forest fires, 72.6 to 90.6% for cropland fires, and 74.2 to 90.7% for other vegetation. The random forest model had the highest accuracy (87.5% for forest fires, 84.0% for cropland fires, and 83.1% for other vegetation) and the highest AUC (93.4% for forest fires, 90.6% for cropland fires, and 90.7% for other vegetation), making it the best-performing model.
The models provided predictive insights into specific conditions and regional susceptibilities to fire occurrences, adding significant value beyond the initial MODIS detection data. The maps generated to analyze Pakistan’s vegetation fire risk showed the geographical distribution of areas with high, moderate, and low vegetation fire risks, highlighting predictive risk assessments rather than historical fire detections.


Introduction

Wildfires represent a critical ecological and environmental challenge, affecting ecosystems and human communities globally. This study focuses on vegetation fires in particular, considering their frequency, spread, and management. Forest loss and degradation, the emission of significant quantities of gases and aerosols, and the decrease in biodiversity have been identified as contributing significantly to increased vulnerability to fires (Albar et al. 2018). The global occurrence of wildfires varies considerably, with estimates suggesting that they affect between 300 and 400 million hectares annually, depending on geographic intensity and local conditions (van Lierop et al. 2015; Attri et al. 2020). Over 80% of global wildfires occur in savannahs and grasslands, mainly in South America, Australia, Africa, and South Asia. Forest and shrub-dominated regions account for 20% (Schultz et al. 2008). Substantial funds are allocated annually to fire management efforts that aim to reduce or prevent the adverse consequences of wildfires (Thomas et al. 2017). Wildfire events lead to the death and displacement of fauna (Tien Bui et al. 2016; Bhujel et al. 2017), pose risks to the lives and livelihoods of local communities, affect soil fertility and water cycles, release harmful pollutants, including particulate matter (Shahdeo et al. 2020) that may contribute to global warming, and result in the loss of vegetation cover (Martell 2007; Usoltsev et al. 2020; Shobairi et al. 2022; Anees et al. 2022b, 2024; Akram et al. 2022; Aslam et al. 2022; Khan et al. 2024). Advances in remote sensing technologies have contributed significantly to the monitoring and evaluation of vegetation fires (Gitas et al. 2012). Previous research has used multi-temporal and multi-sensor remote sensing technologies to assess and monitor vegetation fires (Table 1).

Table 1 List of sensors used for fire monitoring


Vegetation fires result from complex interactions among natural variables, including climate and weather conditions (Andreevich et al. 2020), fuel composition, and topography. Ignition sources include hot surfaces, electrical sparks, flames, friction, static electricity, mechanical impacts (such as machinery contact or falling rocks), and natural events such as lightning (Vadrevu et al. 2008; Bui et al. 2017; Nami et al. 2018). Human activities are globally recognized as predominant causes of fires, and practices such as slash-and-burn agriculture are widespread in South and Southeast Asia. While acknowledging the significant impact of human activities on fire occurrences in regions such as the Eastern Ghats and northeast India (Vadrevu et al. 2008), Sarawak in Malaysia (Kleinman et al. 1995), and the Chittagong hill tracts in Bangladesh (Borggaard et al. 2003), this study addresses how climatic factors, rather than direct human interventions, predominantly influence fire dynamics in Pakistan. Our analysis focuses on how environmental variables (Anees et al. 2022b) such as temperature, humidity, and solar radiation shape the region’s fire ecology. Topographic factors such as aspect, slope (Muhammad et al. 2023), and elevation are also considered for their effects on the extent of burnt areas and fire intensity, based on comparisons across different studies (Nunes et al. 2016; Pan et al. 2023).

Various models have been documented in the literature, each focusing on a distinct phase of the fire control cycle. These include vegetation fire occurrence models (Botequim et al. 2017), vegetation fire spread models (Zhai et al. 2020), deployment and dispatch models, vegetation fire damage models, and decision and information systems that serve as technological support platforms (Marques et al. 2012; Duff and Tolhurst 2015). The studies describing these models briefly discuss prominent algorithms in each category, including supervised, unsupervised, and agent-based modeling approaches, and include references on the fundamentals of machine learning. Supervised learning establishes a relationship between labeled input data and the corresponding known output. When the target variable is continuous, regression analyses are used, with applications including fire vulnerability, fire occurrence, fire spread and burn area estimation, smoke and emissions prediction, and climate change assessment (Jain et al. 2020). Unsupervised learning aims to uncover patterns and relationships within data without using a specific target or outcome variable to guide the learning process. It is applicable to clustering and dimensionality reduction tasks. In this context, clustering is used for fire mapping, fire detection, prediction of burnt areas, and fire weather prediction (Bot & Borges 2022). Some fire prediction algorithms, notable for their computational speed and simplicity, use both supervised and unsupervised learning techniques to determine vegetation fire risks. These include neural networks, decision trees, random forest (Eslami et al. 2021), regression trees, and classification algorithms (Cabral et al. 2018), along with K-nearest neighbor, support vector machines, K-means clustering, self-organizing maps, autoencoders, hidden Markov models, and hard competitive learning (Arnold et al. 2014).
A prominent gap exists in long-term, predictive studies integrating environmental, meteorological, and human factors, particularly across broader geographical scales (Sohail et al. 2023). This gap highlights the need for enhanced predictive modeling to inform proactive fire management strategies. In response to these gaps, our research aims to (1) compile a comprehensive dataset of historical fire incidents in Pakistan from 2001 to 2022; (2) develop a predictive model for wildfire occurrences using MODIS data, incorporating various environmental and meteorological variables to forecast spatial and temporal patterns; and (3) conduct a long-term trend analysis to evaluate the frequency, distribution, and severity of wildfires in Pakistan over the past two decades.

Materials and methods

Study area

The research focused on Pakistan, covering the period from 2001 to 2022. Pakistan is located in the western zone of South Asia, northeast of the Arabian Sea, between latitudes 24° and 37° N and longitudes 62° and 75° E (Qasim et al. 2014). Pakistan covers an area of 875,832 km². Forests cover 2113 km², croplands cover 176,976 km², and other vegetation covers 261,755 km². According to MODIS data, there were 208,943 fire events recorded in Pakistan from 2001 to 2022, including 642 in forests, 158,474 in croplands, and 31,484 in other vegetation types. Figure 1 shows classifications of forested land, cropland, and other vegetated land.

The country is known for its diverse landscapes, which include towering mountains in the north and expansive arid regions in the southwest. It has four distinct seasons: a mild and dry winter (December to February), a hot and dry spring (March to May), a rainy season (June to August), and a post-monsoon season (September to November) (Begum et al. 2011). Pakistan’s forest cover is only 4.5%, a substantial concern considering the country’s agriculture-driven economy and its location within the South Asian Ecological Zone (Oliveira et al. 2011). Throughout the latter half of the twentieth century, evidence indicated an escalating incidence of wildfires in Pakistan, contributing to increased burn area (Rafaqat et al. 2022a, b). The eastern region of Pakistan, where elevations descend to sea level, requires targeted conservation and fire prevention strategies. Remote sensing technologies and worldwide databases provide opportunities for a more detailed identification of the factors causing fires and for enhanced prediction models (Rafaqat et al. 2022a, b). This region is particularly vulnerable to wildfires due to its dry environment with little rainfall and its susceptibility to desertification (Kattel et al. 2019; Anees et al. 2022a).

Datasets

Handling of response variable

This study analyzes historical fire data for the period from January 2001 to December 2022. We used the MODIS fire product from the Fire Information for Resource Management System (FIRMS), which provides information on active fires detected by the MODIS instruments aboard NASA’s Aqua and Terra satellites (https://firms.modaps.eosdis.nasa.gov) (Zhang et al. 2021). We combined the monthly global 500 m gridded MCD64A1 Version 6 Burned Area data product with 1 km MODIS active fire observations to enhance the spatial analysis (Giglio et al. 2018). This product facilitates the identification of per-pixel burned areas, detecting thermal anomalies and fire locations at a moderate resolution (Katagis and Gitas 2022). We used these data to evaluate fire regimes on a national to continental scale, identify global hot spots of fire, and monitor trends in global vegetation fire occurrences (Giglio et al. 2006; Chuvieco et al. 2008). All fire events reported with a confidence level exceeding 50% were considered for detailed analysis. The analysis followed a grid-based approach, examining each 1 × 1 km grid cell for vegetation fire occurrences, binary-labeled as “1” for presence and “0” for absence. Analyzing land use and land cover was crucial for understanding the distribution and types of vegetation affected by fires. The International Geosphere-Biosphere Programme (IGBP) classification scheme of the MODIS product MCD12Q1 was used in the study (Liang et al. 2015; Badshah et al. 2024). This product provides land cover data at 500 m resolution (Sulla-Menashe and Friedl 2018). The dataset, available on the LP DAAC website (https://lpdaac.usgs.gov/), greatly aided in identifying the surfaces beneath various types of vegetation in the study area (Usoltsev et al. 2022; Zhao et al. 2022).
The land cover data for the research area (Table 2) were mosaicked and reprojected using the HDF-EOS to GeoTIFF (HEG) conversion tools. This step was crucial for achieving an accurate and coherent spatial representation of land cover types. Grid cells in the study area were divided into categories based on the land cover type affected by a vegetation fire: forest, cropland, and other vegetation. In total, 512 out of 642 forest cells, 124,179 out of 158,474 cropland cells, and 22,663 out of 31,484 other vegetation cells were marked as “fire cells” and given the value “1.”
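The binary labeling step described above can be sketched as follows. This is a minimal illustration rather than the authors' processing chain: the function name `label_grid_cells`, the toy grid, and the detection records are assumptions made for the example, and only the 50% confidence threshold comes from the text.

```python
def label_grid_cells(cell_ids, detections, min_confidence=50):
    """Label each grid cell 1 (fire present) or 0 (fire absent).

    detections is an iterable of (cell_id, confidence) pairs, one per fire
    detection. A cell is labeled 1 if at least one detection inside it has
    a confidence strictly greater than min_confidence.
    """
    labels = {cell: 0 for cell in cell_ids}
    for cell, confidence in detections:
        if cell in labels and confidence > min_confidence:
            labels[cell] = 1
    return labels


# Hypothetical 2 x 2 grid of 1 x 1 km cells, indexed by (row, column).
cells = [(0, 0), (0, 1), (1, 0), (1, 1)]
# Hypothetical detections: (cell, confidence in percent).
detections = [((0, 0), 80), ((0, 1), 40), ((1, 1), 51)]
labels = label_grid_cells(cells, detections)
```

In this toy case the detection with 40% confidence is discarded, so its cell keeps the label 0.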

Table 2 Descriptions of vegetation types


During the dataset development, we created two random subsets of the actual MODIS vegetation fire ignition spots that were detected. We allocated 70% of this data for training the models and the remaining 30% for testing their performance. This division is standard practice in machine learning to validate models effectively, ensuring they can generalize well to new, unseen data. Using a 70–30 split, we aim to provide a robust dataset for training while retaining sufficient data for an accurate assessment of model performance in real-world scenarios (Rubí et al. 2023).
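A random 70–30 split of this kind might look like the sketch below. The helper name `split_70_30`, the fixed seed, and the ten placeholder sample identifiers are assumptions for illustration; the study does not state how the random subsets were drawn.

```python
import random


def split_70_30(samples, train_fraction=0.7, seed=42):
    """Shuffle the samples reproducibly and split them into train and test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder sample identifiers standing in for fire ignition points.
points = list(range(10))
train, test = split_70_30(points)
```

Fixing the seed keeps the split reproducible, which matters when several models are compared on the same held-out data.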

Selection and handling of predictor variables

This study used the Shuttle Radar Topography Mission (SRTM) Digital Elevation Model (DEM) dataset to investigate the effects of elevation, slope, and aspect (Fig. 2) on vegetation fires. The SRTM dataset, downloaded from the SRTM Data Portal (accessed January 1, 2023), provides highly accurate nationwide coverage.

The historical monthly climatic data were downloaded from two sources: WorldClim (https://www.worldclim.org/) (Barreto and Armenteras 2020) and ERA5 climate reanalysis data (https://cds.climate.copernicus.eu/) (Zhang et al. 2021) (accessed on January 1, 2023). Key climatic variables extracted from WorldClim include minimum temperature (°C), maximum temperature (°C), and precipitation (mm), provided in GeoTIFF format with a spatial resolution of approximately 2.5 arc-minutes (~21 km²). Additional climatic variables sourced from the ERA5 climate reanalysis include the northward and eastward components of the 10 m wind (m/s), skin temperature (°C), surface net solar radiation (W/m²), surface net thermal radiation (W/m²), surface pressure (hPa), soil temperature (°C), and forecast albedo (unitless). These variables are provided in NetCDF format with a spatial resolution of about 9 km². All data were preprocessed in RStudio, using the “raster” and “ncdf4” packages, and in ArcGIS (Table 3).

Table 3 Descriptions of independent variables


Detection of violations of assumptions about independent variables

A linear regression model may encounter multicollinearity, characterized by a substantial correlation among its independent variables. This multicollinearity has the potential to distort the model’s estimation and impede accurate predictions (Chang et al. 2013). The correlation matrix shown in Fig. 3 uses a color scale ranging from blue (low correlation) to red (high correlation) to identify significant correlations between variables. Each cell in the matrix represents the correlation coefficient between two variables, providing a visual aid to detect potential multicollinearity issues. Analysis of multicollinearity involves assessing variance inflation factors (VIF) and tolerance levels (TOL), which are commonly utilized to evaluate the relationships among independent variables. It is widely acknowledged that a TOL value below 0.1 and a VIF value exceeding 10 indicate the presence of multicollinearity (Bui et al. 2019; Li et al. 2022). These thresholds suggest that multicollinearity could significantly impact the reliability of regression and classification model estimates. TOL and VIF are computed as follows (Eqs. 1 and 2):

$$\text{TOL}=1-{\text{R}}^{2}$$

(1)

$$VIF=\frac{1}{1 - {\text{R}}^{2} }=\frac{1}{TOL}$$

(2)

where ${R}^{2}$ is the coefficient of multiple determination obtained when one independent variable is regressed on the remaining independent variables.
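As a small worked example of Eqs. 1 and 2, the sketch below computes TOL and VIF for the simplest case of two predictors, where the $R^2$ from regressing one predictor on the other equals the squared Pearson correlation. The two toy series and the function names are assumptions for illustration; with more than two predictors, $R^2$ would come from a multiple regression instead.

```python
def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def tol_and_vif(r_squared):
    """Tolerance (Eq. 1) and variance inflation factor (Eq. 2)."""
    tol = 1.0 - r_squared
    return tol, 1.0 / tol


# Hypothetical predictor values (for example, two climatic variables).
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 5, 4, 5]
r_squared = pearson_r(x1, x2) ** 2
tol, vif = tol_and_vif(r_squared)
# Flag multicollinearity with the thresholds quoted in the text.
collinear = tol < 0.1 or vif > 10
```

Here $R^2 = 0.6$, giving TOL = 0.4 and VIF = 2.5, so the pair would not be flagged.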

Mann–Kendall mutation test

The Mann–Kendall mutation test is a statistical method used to analyze temporal fluctuations and detect significant trends or “mutational changes” within time series data. These “mutational changes” are substantial alterations in the trend of the data, such as shifts from increasing to decreasing values or vice versa, which could indicate environmental or systemic changes. The method is valued for its straightforward implementation, high precision, broad applicability across diverse datasets, minimal need for human intervention, and efficient validation (Yue et al. 2002). The time series x, comprising n samples, represents the fundamental temporal variations. Analyzing these patterns provides insight into the historical evolution of the environmental system that generated the data, including weather variables and MODIS-detected changes (Mehmood et al. 2024d). The test computes a cumulative rank sequence for detecting mutations according to Eq. 3:

$${d}_{k}={\sum }_{i=1}^{k}{\gamma }_{i} \left(k=2, 3\dots , n\right).$$

(3)

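Here, in the standard formulation of the test, ${\gamma }_{i}$ is the number of earlier observations ${x}_{j}$ ($j<i$) that are smaller than ${x}_{i}$. Assuming that the observations are independent and identically distributed, the statistic UF is computed from ${d}_{k}$ as follows (Zhang et al. 2020):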

$$U{F}_{k}=\frac{{d}_{k}-E\left({d}_{k}\right)}{\sqrt{Var\left({d}_{k}\right)}}$$

(4)

$E\left({d}_{k}\right)$ is the expected value of ${d}_{k}$, $Var\left({d}_{k}\right)$ is its variance, and $U{F}_{k}$ follows a standard normal distribution. The forward statistic UF is computed from the time series in its original order ${x}_{1}, {x}_{2}, \dots , {x}_{n}$. The backward statistic UB is obtained by repeating the procedure on the reversed sequence ${x}_{n}, {x}_{n-1}, \dots , {x}_{1}$. Each computed ${d}_{k}$ is compared with its expected statistical properties, namely the mean and variance, to identify deviations that suggest trends. A UF value greater than 0 indicates an upward trend in the time series, and a value less than 0 indicates a downward trend. When the values exceed the critical threshold set by the significance level, the upward or downward trend is significant. The range in which the curve lies beyond the threshold line is the time region in which the mutation occurs (Feng et al. 2016).

Methodological overview of machine learning models

Logistic regression

Logistic regression (LR) is a classical statistical method for modeling a binary output given one or more independent variables (Balboa et al. 2024). It has proved effective in different geographic locations for predicting and analyzing the variables that drive fire occurrence at different topographical levels (Garcia et al. 1995; Martínez et al. 2009). Many researchers have demonstrated its applicability (Oliveira et al. 2012; Rodrigues and De la Riva 2014). The formula for LR is:

$$\text{Logit}\left(P\right)=\mathit{ln}\left(\frac{P}{1-P}\right)={a}_{0}+{a}_{1}{x}_{1}+{a}_{2}{x}_{2}+\dots +{a}_{n}{x}_{n}$$

(5)

where $P$ is the probability of vegetation fire occurrence, $n$ is the number of variables, ${a}_{0}$ is the intercept, ${a}_{1}, {a}_{2}, \dots , {a}_{n}$ are the coefficients of the variables, and ${x}_{1}, {x}_{2}, \dots , {x}_{n}$ are the factors that affect the rate of vegetation fires (Peng et al. 2002; Zhang et al. 2021).
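To make the logit relationship concrete, the sketch below converts a linear combination of factors into a fire probability by inverting the logit. The intercept, the coefficient values, and the function names are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math


def fire_probability(intercept, coefficients, factors):
    """Invert the logit: P = 1 / (1 + exp(-(a0 + a1*x1 + ... + an*xn)))."""
    z = intercept + sum(a * x for a, x in zip(coefficients, factors))
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients and factor values giving a linear term of 0.
p = fire_probability(-1.0, [0.5, 0.25], [1.0, 2.0])
```

A linear term of zero corresponds to even odds, so `p` is 0.5 here; applying `logit` to any output of `fire_probability` recovers the linear term.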

Random forest

The random forest (RF) model was employed to determine the variables that drive vegetation fires and their respective influences on the probability of vegetation fires in the geographical areas of Pakistan. The RF model, presented by Breiman (2001), uses multiple decision trees to train on and predict samples (Haddouchi and Berrado 2019). RF is a machine learning method based on an ensemble of classification and regression trees (CARTs). Each tree in the RF model is built from a bootstrap sample, which enhances the model’s robustness against outliers and variability and is critical for predictive accuracy in forest fire forecasting (Su et al. 2018; Zhang et al. 2022). The RF model is a fast machine learning approach that can handle many input factors and delivers high predictive accuracy (Sarkar et al. 2024), although it remains susceptible to overfitting (Luo et al. 2024).

$$h\left(x\right)=\frac{1}{T}{\sum }_{t=1}^{T}h\left(x,{\theta }_{t}\right)$$

(6)

Hyperparameter adjustment was critical to derive the final models (Probst et al. 2019; Mehmood et al. 2024a, b, c). The number of trees (n = 1000), tree depth (maximum depth of 8), and minimum node size (a minimum of 7 samples per leaf node) were optimized for forest and cropland fire prediction; for other vegetation, the minimum node size was 6. The final prediction is obtained by taking the mean of the regression subtrees $\{h (x,{\theta }_{t})\}$, where $T$ is the number of decision trees, ${\theta }_{t}$ is an independent and identically distributed random vector, and $x$ is the input vector. The predictive efficacy of the model is determined by the number of random features and trees (Segal and Xiao 2011).
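The averaging in Eq. 6 can be illustrated with a toy ensemble. The three one-rule "trees" below, their thresholds, and the variable names are purely hypothetical stand-ins for fitted decision trees; the sketch shows only how the individual outputs are combined, not how trees are grown from bootstrap samples.

```python
def forest_predict(trees, x):
    """Mean of the individual tree outputs h(x, theta_t), as in Eq. 6."""
    return sum(tree(x) for tree in trees) / len(trees)


# Hypothetical stand-ins for fitted trees; thresholds are illustrative only.
trees = [
    lambda x: 1.0 if x["max_temp"] > 30 else 0.0,
    lambda x: 1.0 if x["precip"] < 20 else 0.0,
    lambda x: 1.0 if x["elevation"] < 1000 else 0.0,
]
sample = {"max_temp": 35, "precip": 50, "elevation": 400}
prob = forest_predict(trees, sample)
```

Two of the three toy trees vote for fire, so the averaged output is 2/3, which can be read as a fire probability for the cell.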

eXtreme Gradient Boosting

eXtreme Gradient Boosting (XGBoost) is a gradient-boosting decision tree (GBDT) algorithm (Chen and Guestrin 2016). It uses a second-order Taylor expansion to optimize the loss function and exhibits improved computing efficiency and generalization ability compared with other machine learning algorithms (Xie et al. 2022). The XGBoost model is expressed as:

$$\begin{array}{c}{\widehat{y}}_{i}={\sum }_{k=1}^{K} {f}_{k}\left({x}_{i}\right),{f}_{k}\in F\end{array}$$

(7)

Here, ${\widehat{y}}_{i}$ is the predicted value for the ith sample, $K$ denotes the number of decision trees, ${x}_{i}$ is the input data for the ith sample, ${f}_{k}\left({x}_{i}\right)$ is the output of the $k$th decision tree, generated in the $k$th iteration, and ${f}_{k}$ belongs to the tree collection space $F$ (Luo et al. 2024).

The objective function for XGBoost is:

$$\begin{array}{c}Obj={\sum }_{i=1}^{N}l\left({y}_{i}, {\widehat{y}}_{i}\right)+{\sum }_{k=1}^{K} \Omega \left({f}_{k}\right)={\sum }_{i=1}^{N}\ l\left[{y}_{i},{{\widehat{y}}_{i}}^{t-1}+{f}_{t}\left({x}_{i}\right)\right]+{\sum }_{k=1}^{K} \Omega \left({f}_{k}\right)\end{array}$$

(8)

In Eq. (8), the first term is the loss function, which measures the difference between the predicted and observed values. The second term is a regularization term that governs the complexity of the model, guides the construction of the tree structure, and prevents overfitting (Piraei et al. 2023).
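The two parts of the objective can be evaluated numerically in a short sketch. Two assumptions are made here that the text does not state: a squared-error loss is used purely for illustration, and the regularization term takes the form $\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_j w_j^2$ from Chen and Guestrin (2016), where $T$ is the number of leaves and $w_j$ are the leaf weights. The function names and toy numbers are hypothetical.

```python
def omega(leaf_weights, gamma, lam):
    """Complexity penalty of one tree: gamma * leaves + 0.5 * lam * sum(w^2)."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)


def objective(y_true, y_pred, trees_leaf_weights, gamma=1.0, lam=1.0):
    """Loss term plus regularization term, mirroring the structure of Eq. 8."""
    loss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # assumed loss
    penalty = sum(omega(w, gamma, lam) for w in trees_leaf_weights)
    return loss + penalty


# Hypothetical labels, predictions, and one tree with two leaves.
obj = objective([1.0, 0.0], [0.8, 0.3], [[0.5, -0.5]])
```

In this example the loss contributes 0.13 and the single two-leaf tree contributes a penalty of 2.25, so larger or heavier trees raise the objective even when the fit is unchanged.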

Support vector machines

Support vector machines (SVMs) are widely used for pattern classification and nonlinear regression. SVMs are based on the idea of minimizing structural risk (Jodhani et al. 2024). The fundamental concept behind SVMs is to construct a classification hyperplane that serves as a decision boundary. Maximizing the distance between positive and negative examples and this boundary yields better generalization accuracy (Naderpour et al. 2019). SVMs handle data in high-dimensional spaces by using kernel functions to tackle diverse nonlinear problems (Rossi and Villa 2006). For a two-class SVM, consider a training set $T=\{({x}_{1},{y}_{1}),\dots ,({x}_{l},{y}_{l})\}\in {(X\times Y)}^{l}$, where ${x}_{i}\in X={R}^{n}$ is the feature vector and ${y}_{i}\in \{1,-1\}$ is the class label for $i=1,2,\dots ,l$. The penalty parameter $C$ and the kernel function $K(x,{x}^{\prime})$ are specified. The optimization problem is then formulated and solved as follows (Boubeta et al. 2015):

$$\underset{\alpha }{\text{min}}\ \frac{1}{2}{\sum }_{i=1}^{l}{\sum }_{j=1}^{l}{y}_{i}{y}_{j}{\alpha }_{i}{\alpha }_{j}K\left({x}_{i},{x}_{j}\right)-{\sum }_{j=1}^{l}{\alpha }_{j}$$

(9)

$$\text{s.t.}\ {\sum }_{i=1}^{l}{y}_{i}{\alpha }_{i}=0,\quad 0\le {\alpha }_{i}\le C,\quad i=1,\dots ,l$$

(10)

The optimal solution ${\alpha }^{*}={\left({\alpha }_{1}^{*},\dots ,{\alpha }_{l}^{*}\right)}^{T}$ is obtained. A positive component ${\alpha }_{j}^{*}$ of ${\alpha }^{*}$ with $0<{\alpha }_{j}^{*}<C$ is then selected, and the threshold is computed as follows (Pang et al. 2022):

$${b}^{*}={y}_{j}-{\sum }_{i=1}^{l}{y}_{i}{\alpha }_{i}^{*}K\left({x}_{i},{x}_{j}\right)$$

(11)

Finally, the decision function is constructed:

$$f\left(x\right)=\text{sgn}\left({\sum }_{i=1}^{l}{\alpha }_{i}^{*}{y}_{i}K\left(x,{x}_{i}\right)+{b}^{*}\right)$$

(12)
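The decision function in Eq. 12 can be evaluated directly once the multipliers and the threshold are known. The sketch below assumes a radial basis function kernel, which the text does not specify, and uses two hypothetical support vectors with hand-picked multipliers that satisfy the equality constraint in Eq. 10; nothing here is a trained model.

```python
import math


def rbf_kernel(a, b, gamma=0.5):
    """Radial basis function kernel (an assumed choice of K)."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def svm_decide(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Sign of sum_i alpha_i * y_i * K(x, x_i) + b, returned as +1 or -1."""
    score = sum(
        alpha * y * kernel(x, sv)
        for alpha, y, sv in zip(alphas, labels, support_vectors)
    ) + b
    return 1 if score >= 0 else -1


# Hypothetical support vectors: one non-fire (-1) and one fire (+1) example.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]  # satisfies sum_i y_i * alpha_i = 0
near_fire = svm_decide((1.8, 1.9), support_vectors, labels, alphas, b=0.0)
near_no_fire = svm_decide((0.1, 0.2), support_vectors, labels, alphas, b=0.0)
```

Points close to the positive support vector receive the label +1 and points close to the negative one receive -1, because the kernel value decays with distance.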

Model performance evaluation methods

Accuracy serves as a metric for evaluating categorical models, representing the percentage of correctly predicted outputs by the model as follows (Shao et al. 2023):

$$\begin{array}{c}Accuracy=\frac{TP +TN}{TP+TN+FP+FN}\end{array}$$

(13)

$TP$ is the number of true positive cases, $TN$ is the number of true negative cases, $FP$ is the number of false positive cases, and $FN$ is the number of false negative cases (Pang et al. 2022). Recall, or sensitivity, which is also reported among our evaluation metrics in Table 5, measures the proportion of actual positives that the model identifies correctly (Eq. 14). Precision, $TP/(TP+FP)$, is the proportion of predicted positives that are correct. The F1 score, which combines precision and recall into a single metric, is particularly useful for imbalanced datasets (Eq. 15).

$$\text{Sensitivity}=\frac{TP}{TP+FN}$$

(14)

The F1 score, combining precision and recall, is computed as:

$$\begin{array}{c}\text{F1 Score}=2\times \frac{\text{Precision}\times \text{Recall}}{\text{Precision}+\text{Recall}}\end{array}$$

(15)

The kappa coefficient is a statistical measure of agreement used to assess the reliability of a classification. It is given by the following expression (Watson and Petrie 2010):

$$\begin{array}{c}Kappa=\frac{{P}_{0}-{P}_{E}}{1-{P}_{E}}\end{array}$$

(16)

where ${P}_{0}$ is the observed agreement (the prediction accuracy) and ${P}_{E}$ is the probability of chance agreement derived from the class probabilities, so that kappa accounts for both observed and expected agreement. Kappa coefficients are grouped into five categories representing varying degrees of accuracy: 0.0 to 0.20 for extremely low accuracy, 0.21 to 0.40 for medium accuracy, 0.41 to 0.60 for high accuracy, 0.61 to 0.80 for excellent accuracy, and 0.81 to 1 for virtually perfect accuracy (Landis and Koch 1977).
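The metrics in Eqs. 13 to 16 all follow from the four confusion-matrix counts, as the sketch below shows. The counts are hypothetical, and the chance agreement is computed from the row and column totals, which is the usual way of obtaining it from class probabilities.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, precision, F1, and kappa from confusion counts."""
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Chance agreement from the predicted and actual class totals.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - p_e) / (1 - p_e)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical confusion counts for 100 grid cells.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

With these counts the accuracy is 0.85 and the chance agreement is 0.5, so kappa is 0.7, noticeably lower than the raw accuracy because half of the agreement would be expected by chance.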

The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity), illustrating the trade-off between true positives and false positives across different thresholds (Carter et al. 2016). Sensitivity is given by Eq. 14, and specificity is the proportion of actual negatives that are correctly identified, $TN/(TN+FP)$ (El Emam et al. 2001; Pang et al. 2022). The area under the ROC curve (AUC) quantifies the overall ability of the model to discriminate between classes (Muschelli III 2020). AUC values are grouped by predictive power: 0.5–0.85 denotes medium performance, 0.85–0.95 signifies high performance, and 1.0 indicates ideal performance (Yingyongyudha et al. 2016; Sun et al. 2021). Figure 4 illustrates the workflow of this study.
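One way to compute the AUC without tracing the curve is to use its rank interpretation: the AUC equals the probability that a randomly chosen fire cell receives a higher score than a randomly chosen non-fire cell, with ties counted as one half. The sketch below applies this to hypothetical model scores; it is an illustration of the quantity, not the procedure used in the study.

```python
def auc_from_scores(positive_scores, negative_scores):
    """AUC as the fraction of positive-negative pairs ranked correctly."""
    wins = 0.0
    for p in positive_scores:
        for q in negative_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical predicted fire probabilities for fire and non-fire cells.
fire_scores = [0.9, 0.8, 0.4]
no_fire_scores = [0.7, 0.3, 0.2]
auc = auc_from_scores(fire_scores, no_fire_scores)
```

Eight of the nine pairs are ranked correctly here, giving an AUC of about 0.89, which would fall in the high-performance band quoted above.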

Results

This study examined the multicollinearity of the environmental and topographic factors. For all three vegetation types (forest, crop, and other vegetation), the tolerance (TOL) values are greater than 0.1 and the variance inflation factors (VIF) are less than 10, as shown in Table 4. This indicates a lack of multicollinearity among the factors that may initiate fires, suggesting that these variables can inform fire risk assessments within the constraints of this study area and period.

Table 4 Results of multicollinearity analysis


Mann–Kendall mutation

The Mann–Kendall test applied to vegetation fires in Pakistan from 2001 to 2022 reveals a fluctuating but overall upward trend in fire hotspots. From 2006 to 2007, UF values were negative, indicating a temporary decline in fire occurrences. Conversely, from 2001 to 2006 and from 2008 to 2022, UF values were consistently above zero, demonstrating a rising trend in the frequency of fires. Notably, the UF curve crosses the bounds of the 0.05 significance level (± 1.96), suggesting that the decline and rise in fire frequencies are statistically significant. These trends are shown in Fig. 5. Figure 6 depicts the temporal evolution of vegetation fires from 2001 to 2022, with a legend that distinguishes forest fires, crop fires, and other vegetation fires.

The cumulative anomaly curve of vegetation fire points in Pakistan remained negative, indicating a consistent buildup of negative anomalies from 2001 to 2022, as shown in Fig. 7. Although the Mann–Kendall test shows a substantial increasing trend in vegetation fires, the curve’s position below zero indicates consistent deviations from expected values. These anomalies suggest that hotspot counts frequently fall below expectations, which warrants further investigation of specific time frames and environmental variables. The point at which the UF and UB curves meet relative to the confidence line supports the detection of a significant change in the number of national hotspots between 2001 and 2022.