COMPARISON BETWEEN CLASSIFICATION ALGORITHMS: GAUSSIAN MIXTURE MODEL - GMM AND RANDOM FOREST - RF, FOR LANDSAT 8 IMAGES

Purpose: Given the importance of monitoring and managing land cover, especially in countries of continental proportions such as Brazil, this research aimed to compare two remote sensing image classification algorithms. Method/design/approach: The article compared the Gaussian Mixture Model and Random Forest classification algorithms using a Landsat 8 image, which was classified in a supervised way with the dzetsaka plugin of QGIS. The performance of each model was assessed using the Kappa index and Total Accuracy. Results and conclusion: The results showed that the Random Forest algorithm performed better than the Gaussian Mixture Model. In terms of the Kappa index (K) and Total Accuracy (po), the models obtained the following performances in the classification of classes: Random Forest, K = 0.94 and po = 96.31%; Gaussian Mixture Model, K = 0.85 and po = 90.60%. Research implications: The results can support the choice of classification method by researchers and others interested in monitoring land cover. Originality/value: This is an original proposal, which compares an algorithm based on Machine Learning with one from the category of probabilistic models. The comparison is of interest because machine learning techniques have been gaining prominence in several contexts.


INTRODUCTION
Brazil is known for its great biodiversity and abundance of environments with exuberant nature. These environments, however, have been compromised by development models that produce unsustainable economic growth at the expense of nature, increasingly threatening ecological processes and living beings (BRANCO, ALMEIDA and FRANCISCO, 2022).
In recent years, anthropogenic actions on the environment have generated significant negative impacts that compromise the conservation of natural resources (ARAGÃO et al., 2022). In view of this, assessing changes in the landscape is fundamental for efficient territorial management, since it can inform decisions related to the use and conservation of natural and environmental resources. To aid in understanding these changes, the landscape is commonly subdivided into land cover and land use classes (MACEDO et al., 2013).
In this respect, Remote Sensing (SR) is an Earth observation technique that allows the imaging of an object without direct contact, through sensors on aircraft or satellites (RICHARDS, 2013). For Curran (1995), it is the use of electromagnetic radiation sensors to record images of the physical environment so that they can be interpreted to generate useful information.
Following this premise, this work uses SR data from the Landsat program, which started in 1972 and currently operates the Landsat 8 mission. The product characteristics of Landsat 8 are in Table 1. One of the ways of analyzing raster data is through Digital Image Processing (DIP) techniques, such as digital image classification, which aims to identify the spectral patterns of objects on the Earth's surface by analyzing the digital value of each pixel of the image (PRINA and TRETIN, 2015).
In supervised classification, which is the approach used in this work, the user initially indicates a set of training samples for each class to be differentiated in the image (DUARTE and SILVA, 2019). Training samples are areas delimited on the image that correspond to representative terrain locations of each class. In this context, it is important that the operator has prior knowledge of the area to be classified; this knowledge may, however, be theoretical, i.e. the operator knows the spectral behavior of each target.
The choice of classifier algorithm depends on the type of classification adopted (supervised or unsupervised). In this respect, according to Duarte and Silva (2019), several algorithms are used in supervised classification, such as Minimum Distance, Maximum Likelihood, Random Forest, Gaussian Mixture Model and, more recently, Support Vector Machines. This work, however, focuses on the comparison between the Gaussian Mixture Model (GMM) and Random Forest (RF) algorithms for image classification.
GMM is one of the most widely used clustering models, as its mathematics is simple to implement and its small number of parameters is easy to estimate. According to Hui, Wu and Nguyen (2013), it is a sum of $M$ Gaussian functions, parametrized by $\theta$, in which each component has a mean vector $\mu_i$ and a covariance matrix $\Sigma_i$. Each of these component densities has a weight, resulting in a weighted sum (WENLONG, JOHNSTON and MENGJIE, 2013; HUI et al., 2013). Equation 1 describes the weighted function.

$$p(x \mid \theta) = \sum_{i=1}^{M} w_i \, g(x \mid \mu_i, \Sigma_i) \tag{1}$$

where $x$ is a $D$-dimensional feature vector, $w_i$ are the mixture weights and $g(x \mid \mu_i, \Sigma_i)$ are the component densities. As pointed out by Weiling, Lei and Ming (2015), once the component weights are constrained by $\sum_{i=1}^{M} w_i = 1$, the parameterization by the mean vector, the covariance matrix and the mixture weight of each component can be written as in Equation 2.

$$\theta = \{ w_i, \mu_i, \Sigma_i \}, \quad i = 1, 2, \ldots, M \tag{2}$$

where $i = 1, 2, \ldots, M$ and $w_i$ are the mixture weights. That is, the weighted sum of the $M$ Gaussian component densities constitutes what we know as the Gaussian Mixture Model.
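The weighted sum of Gaussian densities described above can be illustrated with a minimal sketch. It is not the implementation used by the classification plugin: it assumes diagonal covariance matrices for simplicity, and the function names, class names and two-band reflectance values are all hypothetical. Each class is given its own mixture, and a pixel is assigned to the class whose mixture yields the highest density.

```python
import math

def gaussian_density_diag(x, mean, var):
    """Density of a D-dimensional Gaussian with diagonal covariance (assumption)."""
    log_p = 0.0
    for xd, md, vd in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
    return math.exp(log_p)

def gmm_density(x, weights, means, variances):
    """Weighted sum of component densities; the weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(
        w * gaussian_density_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )

def classify(x, class_models):
    """Assign x to the class whose mixture gives the highest density."""
    return max(class_models, key=lambda name: gmm_density(x, *class_models[name]))

# Hypothetical per-class mixtures: (weights, means, variances) over two bands.
class_models = {
    "water": ([0.6, 0.4],
              [[0.05, 0.02], [0.08, 0.03]],
              [[1e-4, 1e-4], [4e-4, 1e-4]]),
    "vegetation": ([1.0],
                   [[0.04, 0.40]],
                   [[1e-4, 25e-4]]),
}

print(classify([0.05, 0.02], class_models))
print(classify([0.04, 0.41], class_models))
```

In practice the weights, means and covariances would be estimated from the training samples rather than written by hand, and full covariance matrices are commonly used.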
With respect to the Random Forest algorithm, it is a Machine Learning classifier proposed by Breiman (2001), and the name is a general term for ensemble methods that use tree-type classifiers. RF builds a large number of decision trees from subsets of the data drawn from a single defined training set (LOPES et al., 2017).
For example, let H = {h1, h2, h3} be a set, or ensemble, of three classifiers. If the three make distinct errors, then when h1(xi) is wrong it is possible that h2(xi) and h3(xi) are correct, so that combining the hypotheses by voting can correctly classify xi (LIMA et al., 2021). Classification via the Random Forest algorithm is based on this ensemble strategy, i.e. it can be understood as a combination of model results that together produce a more accurate predictive model.
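The voting idea in the three-classifier example can be sketched as follows. This shows only the majority-vote step, not a full Random Forest (there is no tree induction or random sampling here); the three threshold classifiers and the reflectance value are made up for illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_classify(classifiers, x):
    """Collect one prediction per classifier and combine them by voting."""
    return majority_vote([h(x) for h in classifiers])

# Three hypothetical classifiers thresholding a single near-infrared
# reflectance value; h1 deliberately uses a poor threshold.
def h1(nir):
    return "vegetation" if nir > 0.50 else "soil"

def h2(nir):
    return "vegetation" if nir > 0.30 else "soil"

def h3(nir):
    return "vegetation" if nir > 0.25 else "soil"

x_i = 0.40  # hypothetical pixel value
votes = [h(x_i) for h in (h1, h2, h3)]
label = ensemble_classify((h1, h2, h3), x_i)
print(votes, "->", label)
```

Here h1 disagrees with h2 and h3, and the vote follows the majority, which is the situation the example describes.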

Image Classification Performance Metrics in SR
In view of the importance of land use and occupation analyses for the management and monitoring of land cover, classifications require an assessment of their accuracy. In this context, the accuracy assessment indicates the quality of the map that was created using remote sensing data (CONGALTON and GREEN, 2019).
Thematic accuracy is an important validation metric used in SR for image classification. According to Mastella and Vieira (2018), "it is a check to verify if the map is correct regarding the label of the classes, besides increasing the quality of the information of the map, identifying and correcting the sources of errors". To evaluate this accuracy, error matrices are used, also known as confusion matrices (CONGALTON and GREEN, 2019, p. 2).
The confusion matrix, in turn, is a square array resulting from a cross-tabulation between classified SR data and reference data. According to Souza (2020), several measures can be derived from the confusion matrix to assess the accuracy of the data, among them Total Accuracy, the Cohen Kappa Index and the Tau Index.
Souza (2020) conducted a search in the Scopus database and noted that 62.7% and 29.5% of the articles found on accuracy validation metrics in SR used, respectively, Total Accuracy and the Kappa Index. This justifies the choice of these two indices to compare the algorithms that are the object of this study.
Total Accuracy was proposed by Story and Congalton (1986) with the aim of analyzing the general agreement ($p_o$) between the classification and reference data. It is the sum of the principal diagonal of the confusion matrix divided by the total number of samples, according to Equation 3:

$$p_o = \frac{1}{N} \sum_{i=1}^{c} n_{ii} \tag{3}$$

where $n_{ii}$ is the number of samples on the principal diagonal for class $i$, $c$ is the number of classes and $N$ is the total number of samples.

The Kappa coefficient, in turn, is expressed by the estimator $K$ of the KHAT statistic. To estimate $K$ it is necessary to estimate the general agreement ($p_o$) and the proportion of chance agreement ($p_c$). According to Congalton and Green (2019) and Souza (2020), $p_c$ is calculated from the product of the marginal proportions of the rows and columns of the reference and classification data, as in Equation 4, the chance agreement estimator:

$$p_c = \frac{1}{N^2} \sum_{i=1}^{c} n_{i+} \, n_{+i} \tag{4}$$

where $n_{i+}$ and $n_{+i}$ are the totals of row $i$ and column $i$ of the confusion matrix.

Finally, $K$ can be described according to Equation 5:

$$K = \frac{p_o - p_c}{1 - p_c} \tag{5}$$

The result obtained for the Kappa coefficient varies in the range from -1 to +1, and the closer to 1, the better the quality of the classified data (PRINA and TRETIN, 2015). The scientific literature offers several schemes for describing these quantitative values qualitatively; this work adopts that of Fonseca (2000), as provided in Table 2.
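As a minimal sketch of how the general agreement, the chance agreement and the Kappa coefficient follow from a confusion matrix, the function below computes all three from a square matrix of pixel counts. The function name and the 2 x 2 example matrix are hypothetical and are not taken from the tables of this study.

```python
def accuracy_metrics(matrix):
    """Return (p_o, p_c, kappa) for a square confusion matrix of counts."""
    size = len(matrix)
    if any(len(row) != size for row in matrix):
        raise ValueError("confusion matrix must be square")
    total = sum(sum(row) for row in matrix)
    diagonal = sum(matrix[i][i] for i in range(size))
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(size)) for j in range(size)]
    # General agreement: principal diagonal over total samples.
    p_o = diagonal / total
    # Chance agreement: sum of products of row and column marginals.
    p_c = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (p_o - p_c) / (1.0 - p_c)
    return p_o, p_c, kappa

# Hypothetical two-class matrix with 100 samples.
example = [[45, 5],
           [10, 40]]
p_o, p_c, kappa = accuracy_metrics(example)
print(f"p_o = {p_o:.2f}, p_c = {p_c:.2f}, K = {kappa:.2f}")
```

For this example the diagonal holds 85 of 100 samples, so the general agreement is 0.85, the chance agreement is 0.50 and the resulting Kappa is 0.70, which shows how Kappa discounts the agreement expected by chance.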

Study Area
The area chosen for this study is the municipality of Rondon do Pará, located in the southeast of the state of Pará, in the Brazilian Amazon, as shown in Figure 1.

Acquisition and processing of data
Data were acquired through the Image Generation Division (DGI) of the National Institute for Space Research (INPE) <http://www.dgi.inpe.br/siteDgi/index_pt.php>, from which an image from the Landsat 8 satellite, OLI sensor, was obtained, as described above in Table 1, with cloud cover of less than 10%.
After acquisition of the satellite image, correction, reprojection and clipping to the study area were performed. These and all other processes, including the generation of the classified image, the confusion matrix, the statistics of each class and the accuracy measurements, were carried out in the free geoprocessing software Quantum GIS (QGIS Development Team, 2022).
To perform the supervised classification, a vector training layer was created with the six classes chosen for this work. At least 20 representative samples of each class were indicated through that layer, as described by Lopes et al. (2017).
The classes considered in the classification of land use and occupation were chosen taking into account the spatial characteristics of the region of interest, as distinguished by the mapping data of the MapBiomas project. Table 3 lists each class found in the MapBiomas project alongside the corresponding class considered for this study.
In the image classification process, the dzetsaka plugin for QGIS was used. Besides the GMM and RF algorithms chosen for this study, it also offers classification through Support Vector Machines (which, like RF, is based on Machine Learning) and K-Nearest Neighbors (a non-parametric method).
To extract the statistics and other results presented below, the Semi-Automatic Classification Plugin (SCP) was used. These results were obtained by comparing the classified raster image with the training layer, as Congalton and Green (2009), Souza (2020) and Prina and Tretin (2015) did in their studies.
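The comparison between the classified image and the training layer amounts to a cross-tabulation of two label sequences. The sketch below is only an illustration of that step, not the plugin's code: the function name, class names and labels are hypothetical, and the orientation (rows for the reference class, columns for the classified class) is an assumption.

```python
def build_confusion_matrix(reference, classified, classes):
    """Cross-tabulate reference labels (rows) against classified labels (columns)."""
    if len(reference) != len(classified):
        raise ValueError("both label sequences must have the same length")
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for ref_label, cls_label in zip(reference, classified):
        matrix[index[ref_label]][index[cls_label]] += 1
    return matrix

# Hypothetical labels for six pixels covered by training polygons.
classes = ["water", "vegetation", "soil"]
reference = ["water", "water", "vegetation", "vegetation", "soil", "soil"]
classified = ["water", "vegetation", "vegetation", "vegetation", "soil", "water"]

matrix = build_confusion_matrix(reference, classified, classes)
for name, row in zip(classes, matrix):
    print(f"{name:<12}", row)
```

Off-diagonal cells count the pixels of one class that were labeled as another, which is the information discussed for each pair of classes in the results.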

RESULTS AND DISCUSSIONS
The supervised classification of the image with the two methods (Gaussian Mixture Model and Random Forest) produced a collection of results: maps of land use and cover, tables with area data and pixel counts, the confusion matrices and, derived from them, the performance metrics. Figure 2 spatializes the land use and cover for the six classes defined for this work, and Table 4 shows the percentage and area occupied by each class, when the classification was performed by the GMM algorithm. Figure 3, in turn, presents the spatialized classes within the region of interest as classified by the RF algorithm, while Table 5 gives the proportion of area occupied by each class under this same algorithm.
Comparing the two spatializations (Figure 2 and Figure 3), the difference is easy to observe, especially in the classes "Urbanization" and "Exposed soil". Comparing Table 4 and Table 5 as well, the class "Urbanization" went from 1.10% of the area when the raster was classified with the GMM algorithm to 3.44% with the RF algorithm, a relative increase of 212.72%. The "Exposed soil" class showed the inverse behavior: the area it occupied decreased from 6.65% with the GMM method to 4.94% with the RF algorithm.
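As a worked check of the relative changes between the two classifications, computed here from the rounded class percentages quoted in the text (an assumption, since the unrounded areas may give slightly different values):

```latex
\[
\Delta_{\text{Urbanization}} = \frac{3.44 - 1.10}{1.10} \times 100 \approx 212.7\%,
\qquad
\Delta_{\text{Exposed soil}} = \frac{4.94 - 6.65}{6.65} \times 100 \approx -25.7\%
\]
```

The first value agrees with the reported increase up to rounding; the second expresses the decrease in "Exposed soil" on the same relative scale.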

These apparent differences between the classifications produced by the two methods can be better understood by referring to the confusion matrices obtained during the DIP via the GMM algorithm (Table 6) and the RF algorithm (Table 7). Table 6 shows that, out of a total of 4054 pixels of the "Urbanization" class, 1711 were confused with the "Exposed soil" class when using the GMM method. In Table 7, when the RF algorithm was used, only 237 of the 4054 "Urbanization" pixels were confused with "Exposed soil". This confusion between the two classes can occur because urban cover is a complex mixture of concrete, asphalt roads, parking areas, roof tiles and even exposed soil (JENSEN, 2015). In this case, the RF algorithm had the advantage because it classified fewer "Urbanization" pixels as "Exposed soil".
The two matrices also show other classes that were confused in the classification process, for example "Dense vegetation" and "Secondary vegetation", which have similar reflectance in the visible bands (R, G and B). Here too the RF algorithm performed better: while the GMM algorithm classified 3028 of a total of 216320 "Dense vegetation" pixels as "Secondary vegetation", the RF process classified only 1788 as such. In the classification of "Secondary vegetation", the GMM algorithm likewise confused more pixels with "Dense vegetation".
When comparing the class "Bodies of water" with "Dense vegetation", the Random Forest algorithm again performed better: as observed in the matrices, out of a total of 1480 pixels, 96 were confused with "Dense vegetation" in the GMM classification, whereas 69 were classified as "Dense vegetation" in the RF classification.
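To put the pixel counts quoted from the confusion matrices on a common scale, they can be expressed as percentages of each class total. The sketch below uses only the counts reported in the text; the function and variable names are hypothetical.

```python
def confusion_rate(confused_pixels, total_pixels):
    """Percentage of a class's pixels that were assigned to another class."""
    return 100.0 * confused_pixels / total_pixels

# (class, confused with, class total, GMM count, RF count), as quoted in the text.
cases = [
    ("Urbanization", "Exposed soil", 4054, 1711, 237),
    ("Dense vegetation", "Secondary vegetation", 216320, 3028, 1788),
    ("Bodies of water", "Dense vegetation", 1480, 96, 69),
]

rates = {}
for name, other, total, gmm_count, rf_count in cases:
    rates[name] = (confusion_rate(gmm_count, total), confusion_rate(rf_count, total))
    print(f"{name} -> {other}: GMM {rates[name][0]:.2f}%, RF {rates[name][1]:.2f}%")
```

With these counts, roughly 42% of the "Urbanization" pixels were confused with "Exposed soil" under GMM against about 6% under RF, while the differences for the vegetation and water classes are much smaller in relative terms.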
In this context, it is important to note that water has a distinctive spectral reflectance behavior; however, the higher the concentration of suspended sediments, the greater the reflectance at all wavelengths, mainly in the range between 500 and 700 nm, with an increase also seen towards the infrared. This suggests that some of the water portions mistaken for "Dense vegetation" may contain excess sediment, algae or organic matter (JENSEN, 2015; DE PAULA et al., 2021).
From the classified raster file and the training layer, the Semi-Automatic Classification Plugin (SCP) generated the confusion matrices and calculated the performance indices for each algorithm used in the classifications (Table 8). The two performance metrics agree that the Random Forest method has the advantage in classification performance; nevertheless, according to the qualitative grouping of the Kappa coefficient proposed by Fonseca (2000), both classifications are considered excellent (0.8 < K < 1), indicating that the classified information is consistent with the training samples collected.
According to Perroca and Gaidzinski (2003), the Kappa coefficient can be defined as a measure of association used to describe and test the degree of agreement (reliability and precision). In this respect, both models showed high agreement between the classified data and the training layer, demonstrating the efficiency of both in identifying the land use and occupation classes.
Total accuracy expresses the ratio of correctly classified points to the total number of reference points (training layer). The values obtained with the GMM and RF methods are both above the acceptable minimum (85%) proposed by authors such as Anderson (1971) and Fitzpatrick-Lins (1981). However, although both classifications are considered acceptable by this metric, RF once again outperformed the GMM algorithm, which indicates that the Machine Learning-based algorithm is more advantageous when the aim is a map that agrees more closely with reality.

FINAL CONSIDERATIONS
Increasingly, it is necessary to understand the dynamics of land use and cover, whether with the aim of detecting changes, monitoring for preservation or, in a broader context, managing natural resources. In this context, technology has been a great ally in the pursuit of sustainable management of the environment.
Remote sensing has been used for decades in environmental studies and has kept pace with technological evolution. Since the first remote sensing images became available in the 1970s, the quality of SR products has increased significantly, which requires refined methods to conduct the analyses with the greatest possible accuracy.
This study set out to compare two supervised classification algorithms with respect to the Kappa index and total accuracy. Visually, differences were noted between the classifications produced by the Gaussian Mixture Model and Random Forest. Both classifications were nevertheless considered excellent, following the grouping proposed by Fonseca (2000) for the Kappa index, with a noticeable statistical advantage of the Random Forest algorithm over the Gaussian Mixture Model.
Machine learning is gaining ground in many tasks and in various contexts, and digital image processing appears to be no different. Given the robustness of the results obtained here, the method is well placed to remain in use and to become increasingly established in remote sensing research.
The municipality of Rondon do Pará covers an area of around 8,254 km² and its estimated population in 2021 was 53,242 inhabitants, as indicated by the BRAZILIAN INSTITUTE OF GEOGRAPHY AND STATISTICS (IBGE, 2022). The local economy is built around agricultural, mineral and service activities. Due to strong environmental exploitation, whether through timber extraction, which was intense in past decades, farming activities or the extraction of bauxite ore, the spatial configuration of the region has changed considerably in recent years. The municipality currently consists of the municipal headquarters, which borders the state of Maranhão, 9 villages, more than 10 settlements and other smaller settlements (RONDON DO PARÁ GOVERNMENT, 2021).

Figure 1. Location of the municipality of Rondon do Pará. Source: Authors.

Figure 2. Spatialization of land use and cover classes through the GMM method. Source: Authors.

Figure 3. Spatialization of land use and cover classes through the RF method. Source: Authors.

Table 3. Classes mapped by MapBiomas vs. classes used in this mapping.

Table 4. Percentage and area occupied by each class of interest, by GMM classification.

Table 5. Percentage and area occupied by each class of interest, by RF classification.

Table 6. Confusion matrix obtained in classification via the GMM method, by number of pixels. Source: Authors.

Table 7. Confusion matrix obtained in classification via the RF method, by number of pixels. Source: Authors.

Table 8. Performance metrics for classifications via GMM and RF.