SAMPLING BIAS WORSENS THE PREDICTIVE ABILITY OF NICHE MODELS

Purpose: Ecological Niche Modeling (ENM) and Species Distribution Modeling (SDM) have become powerful tools in biology, biogeography, paleoecology and biodiversity conservation. ENM and SDM are evaluated using metrics that account for the errors and successes of the models in predicting the presence or absence of species at given locations. Here, we evaluate the effects of sampling bias on the relationship between evaluation metrics and the predictive capacity of models. Theoretical framework: ENM and SDM have extensive potential for use, but to produce useful results they must be constructed and validated appropriately. The occurrence data used in both processes may not have been collected randomly, which can introduce bias; simply filtering the data and reducing the number of occurrence records may create new problems, so the consequences of using non-randomly collected occurrence data need to be investigated. Material and Methods: We used Virtual Species (VS) to evaluate the effect of sampling bias. VS are the most robust approach for this type of test because the entire distribution of each VS is known. Results and conclusion: Our results show that sampling bias reduces the predictive capacity of ENM and SDM. We did not find a consistent pattern in the effect of sampling bias on the relationship between evaluation metrics and the predictive capacity of models; the effect size varied among bias intensities. We emphasize that reducing the strength of the bias is one of the most efficient ways to minimize this problem.


INTRODUCTION
The use of models to predict natural phenomena has become widespread (Amatulli et al., 2018; I. A. Costa et al., 2023; Guisan et al., 2002). Models can help achieve goals in various areas, such as developing more precise conservation and agricultural strategies (Neves et al., 2023; Velazco et al., 2020; Yang et al., 2023). In the biological and environmental sciences, Ecological Niche Modeling and Species Distribution Modeling (ENM and SDM) stand out for enabling predictions that support sustainable development (Evangelista-Vale et al., 2021). ENM and SDM have become powerful tools in many fields of biology, focused on understanding past, present and future geographic species distributions (Gomes et al., 2020; Nogués-Bravo et al., 2008; Siqueira et al., 2009). Correlating species occurrence data with environmental variables at the locations where the species occur has made the application of ENM and SDM relatively straightforward (Elith & Franklin, 2013). However, occurrence data available in online databases may contain identification and geolocation biases, which can harm both the evaluation and the predictive capacity of models (Beck et al., 2014; Halvorsen et al., 2016; Hortal et al., 2008; Leitão et al., 2011).
The use of occurrence data with sampling bias can lead to imprecise estimates of species niches and distributions, compromising the trustworthiness of ENM and SDM results (Dormann, 2007; Merckx et al., 2011; Segurado et al., 2006). Several methods have been proposed to minimize these effects, such as environmental and spatial data filtering (Aiello-Lammens et al., 2015; Boria et al., 2014; Varela et al., 2014; Zizka et al., 2019). Nevertheless, the development of "bias correction" methods has often led to the exclusion of occurrence data (Aiello-Lammens et al., 2015; Nolan et al., 2022; Robertson et al., 2016; Zizka et al., 2019), and questions about the potential effects of bias on evaluation metrics and model predictive performance have not been fully elucidated (Hortal et al., 2008). Real species data have been used to study the effects of sampling bias, but this approach may not be the most precise way to evaluate methodological issues in ENM and SDM, because the true distribution of a real species is never fully known (Colwell & Rangel, 2009; G. C. Costa et al., 2010; Meynard & Kaplan, 2013).
Bias correction methods based on filtering usually reduce the size of the dataset (exclusion of records) and assume that a smaller, less biased dataset improves the predictive capacity of models (Aiello-Lammens et al., 2015; Varela et al., 2014). Filtering can reduce the effect of collection bias, but it may not be an option when occurrence data are scarce, and it does not reveal the potential negative effects of combining bias with a small number of available records (Baker et al., 2022; Boria et al., 2014; Jiménez-Valverde, 2020; Nolan et al., 2022). A reduced number of records is a recognized source of uncertainty for ENM and SDM and can generate imprecise estimates of species distributions and niches, so sampling the species' occurrence as completely as possible is recommended (Bean et al., 2012; Jiménez-Valverde, 2020; Moudrý & Šímová, 2012). Thus, an approach using Virtual Species (VS) is necessary to investigate the effects of collection bias on the predictive ability of models (the model's ability to correctly predict the distribution of a species) and its possible interference in the relationship between evaluation metrics and that predictive ability (evaluative capacity) (Meynard et al., 2019; Meynard & Kaplan, 2013; Miller, 2014). This approach allows other sources of uncertainty to be controlled (Golicher et al., 2012; Jiménez-Valverde, 2020; Leroy et al., 2018; Liu et al., 2019).
Herein, we assess the effectiveness of spatial projections in accurately representing the current distribution of species and their impact on evaluation metrics (i.e., Area Under the ROC Curve - AUC, Area Under the lift Curve - AUl, Jaccard, MaxVDl, SEDI, Sørensen, True Skill Statistic - TSS, True Positive Rate - TPR, True Negative Rate - TNR and Pearson D). Specifically, we used VS to address two key questions: 1) the influence of sampling bias intensity on model predictive ability, and 2) the potential interference of sampling bias strength in the relationship between evaluation metrics and model performance. We aimed to identify the ability of evaluation metrics to measure the predictive capacity of ENM and SDM when biased data are used. This is particularly important because biased data are the most widely available (Beck et al., 2014).

THEORETICAL FRAMEWORK
ENM and SDM are tools that have gained prominence in various contexts and are increasingly used to address empirical questions. However, the reliability of the results depends on the predictive ability of the models, which must be evaluated, and this evaluation poses a significant challenge (Araújo & Guisan, 2006; Elith & Franklin, 2013; Fielding & Bell, 1997). The challenge becomes even harder when models are built from data collected in a spatially non-homogeneous manner, which over-represents easily accessible localities (Bean et al., 2012; Dudík et al., 2005; Ranc et al., 2017). This is problematic because when a species has many occurrence records in a location, ENM and SDM will consider those conditions suitable for the species, which may not be entirely true. The scenario is even more concerning when models must be evaluated, as the bias can reduce both the predictive ability of the models and the ability of the evaluation metrics to properly assess model performance (Baker et al., 2022; Bean et al., 2012; Boria et al., 2014).

Study Area
The study area encompasses the entire Brazilian continental territory, with an extension of 8,516,000 km², extending from approximately -34.00° to 5.27° of latitude and from -73.99° to -28.84° of longitude. This choice adds realism to the models, since the environmental conditions in this region are heterogeneous, as is common in procedures for modeling real species.

Environmental Variables
To represent a wide range of environmental conditions, we used 22 environmental variables, of which 19 were related to climate and three provided information on relief and topography (see Supplementary Table 3). Climate variables were obtained from the WorldClim database (Fick & Hijmans, 2017), whereas relief and topography variables were obtained from the SoilGrids database (Hengl et al., 2014); the Aspect and Slope variables were derived using the "terrain" function from the 'raster' package in R (Hijmans, 2018). All variables were clipped to the Brazilian territory and resampled to a 5 arc-minute resolution (~10 km).
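For intuition, the slope and aspect that a function like 'terrain' derives from elevation can be illustrated with simple finite differences. The sketch below is a hypothetical Python analogue (the toy grid, cell size, and aspect convention are assumptions), not the authors' R workflow:

```python
import numpy as np

# Hypothetical toy elevation grid (metres); a real input would be a DEM raster.
dem = np.array([[10., 12., 14.],
                [10., 13., 16.],
                [10., 14., 18.]])
cell = 10_000.0  # assumed cell size in metres (~10 km at 5 arc-minutes)

# Finite-difference gradients along rows (north-south) and columns (east-west).
dz_dy, dz_dx = np.gradient(dem, cell)

# Slope: steepness of the surface, in degrees.
slope_deg = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
# Aspect: downslope direction, one common convention (clockwise from north).
aspect_deg = np.degrees(np.arctan2(-dz_dx, dz_dy)) % 360.0
```

The actual 'raster' implementation accounts for geographic coordinates and edge cells, so values will differ in detail; the sketch only shows the underlying idea.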

Virtual Species and Sampling of Occurrences
To create five Virtual Species (VS) distributions in different regions of Brazil, we used the 'virtualspecies' package in R (Leroy et al., 2016) (Figure 1). The distributions were generated on the first three axes of a Principal Component Analysis (PCA) applied to all environmental variables. To create the VS, we used the "generateSpFromPCA" function of the 'virtualspecies' package and followed methodological recommendations for evaluation in ENM and SDM (Meynard et al., 2019). This function creates the "fundamental niche" of a species from a PCA, allowing the species' tolerance limits to be chosen in environmental space and its niche to then be projected onto geographic space, generating the species' distribution in the study area. Initially, a continuous suitability gradient, ranging from 0 to 1, is created for the species' fundamental niche. The continuous maps were then converted into binary maps (0 or 1), where 0 indicates species absence and 1 indicates species presence; the binarization was performed with a threshold approach through the "convertToPA" function of the 'virtualspecies' package. As predictors in the modeling process, we retained the three PCA axes used to generate the VS, guaranteeing that the variables constituting the fundamental niche of each VS were the same ones used to produce the models, so that the models had all the relevant information to predict the VS distributions.

We then sampled occurrence records of the VS for use in the modeling stage, taking samples of 50 occurrence records, which we consider a parsimonious number (Liu et al., 2019; VanDerWal et al., 2009). To incorporate the collection bias effect in the sampling procedure, we used the Brazilian road and waterway network (federal highways and waterways) available at https://ide.geobases.es.gov.br/ and https://www.gov.br/infraestrutura/pt-br/ (Figure 1F). We created a 5 km buffer on each side of the highways and waterways and obtained the biased samples within this buffer, increasing the probability of sampling an occurrence near the roads and waterways (Figure 2). To sample occurrence records, we used the "sampleOccurrences" function from the 'virtualspecies' package at four bias levels. We progressively increased the strength of the bias, using X0, X10, X100, and X1000, corresponding to 0, 10, 100, and 1000 times more chances of obtaining an occurrence near a segment of the Brazilian road network, respectively. In the subsequent modeling step, we used only the occurrence records of the VS.
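The weighting logic behind this biased sampling can be sketched as follows. This is an illustrative Python analogue of weight-based occurrence sampling, not the 'virtualspecies' implementation; the presence grid and buffer flags are simulated assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: 1000 presence cells of a virtual species, of which
# roughly 20% are flagged as lying inside the 5 km road/waterway buffer.
n_cells = 1000
in_buffer = rng.random(n_cells) < 0.2

def biased_sample(in_buffer, bias, n=50, rng=rng):
    """Sample n presence cells; cells inside the buffer are `bias` times
    more likely to be chosen (bias=1 reproduces unbiased sampling)."""
    w = np.where(in_buffer, float(bias), 1.0)
    return rng.choice(len(in_buffer), size=n, replace=False, p=w / w.sum())

# Progressively stronger bias: X0 (no bias), X10, X100, X1000.
for bias in (1, 10, 100, 1000):
    picked = biased_sample(in_buffer, bias)
    frac_near_routes = in_buffer[picked].mean()
```

At X1000, nearly all sampled occurrences fall inside the buffer, mimicking collections concentrated along accessible routes.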

Modeling
To build the models, we used 15 widely used algorithms (see Supplementary Table 2), keeping the default package settings of the versions listed there for all algorithms. Regression-based algorithms (Generalized Linear Models - GLM, Generalized Additive Models - GAM, Multivariate Adaptive Regression Splines - MARS, and Generalized Boosted Models - GBM) and machine learning algorithms (Artificial Neural Networks - ANN, Random Forest - RF, and Naive Bayes) require both presences and absences of the modeled species; for these algorithms we used random points sampled from the study area to represent absences (i.e., pseudo-absences). For algorithms that require background data, such as MaxEnt and SVM, we sampled 10,000 random points throughout the study area. The remaining algorithms were Bioclim, Domain and Ecological Niche Factor Analysis - ENFA.
We built the models under the various bias intensities and partitioned the presence records into 70% for model building (training data) and the remaining 30% for evaluation (test data), a procedure often used in the literature (Elith & Franklin, 2013; Fielding & Bell, 1997; Guisan et al., 2017). The partitioning process was repeated 100 times. For each data partition, we ran all algorithms on the same training dataset to calibrate the models and on the same test dataset to evaluate their performance.
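The repeated 70/30 partitioning can be sketched as follows (an illustrative Python analogue; the record count matches the 50 occurrences sampled per VS, while the seed is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

n_records = 50   # presence records sampled per virtual species
n_repeats = 100  # number of random partitions

splits = []
for _ in range(n_repeats):
    idx = rng.permutation(n_records)
    # 70% calibration (35 records) / 30% evaluation (15 records).
    train, test = idx[:35], idx[35:]
    splits.append((train, test))
    # Every algorithm is calibrated on `train` and scored on `test`,
    # so all models share identical data within a partition.
```

Reusing the same split across algorithms keeps differences in scores attributable to the algorithms rather than to the data partition.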

Evaluation and Overlap Metrics
To evaluate model performance, we used 10 evaluation metrics described in Supplementary Table 1. These metrics are based on the Confusion Matrix (CM) and measure the proportion of correct and incorrect model predictions on the test data. To evaluate predictive ability, we used the overlap metric proposed by Warren et al. (2008), based on the I index. This metric measures the similarity between two maps, ranging from 0 to 1, and was employed to evaluate the model's ability to correctly predict the distribution area of the VS (Warren et al., 2008).
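A minimal Python sketch of both kinds of measure follows. TPR, TNR and TSS stand in as examples of the CM-based metrics, and I follows the Warren et al. (2008) formulation based on the Hellinger distance between normalized suitability maps; this is illustrative, not the authors' code:

```python
import numpy as np

def cm_metrics(obs, pred):
    """TPR, TNR and TSS from binary observed/predicted arrays (0/1)."""
    tp = np.sum((obs == 1) & (pred == 1))  # presences predicted present
    tn = np.sum((obs == 0) & (pred == 0))  # absences predicted absent
    fp = np.sum((obs == 0) & (pred == 1))
    fn = np.sum((obs == 1) & (pred == 0))
    tpr = tp / (tp + fn)            # sensitivity
    tnr = tn / (tn + fp)            # specificity
    return tpr, tnr, tpr + tnr - 1  # TSS

def warren_I(map1, map2):
    """Warren et al. (2008) I statistic between two suitability maps,
    each normalized to sum to 1; 1 = identical maps, 0 = no overlap."""
    p1, p2 = map1 / map1.sum(), map2 / map2.sum()
    return 1.0 - 0.5 * np.sum((np.sqrt(p1) - np.sqrt(p2)) ** 2)
```

Comparing a model's prediction against the known VS distribution with I yields the predictive-ability measure; the CM metrics, in contrast, see only the test records.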

Data analysis
After training and evaluating the models, we generated the evaluation metrics and the overlap metric values at the different bias levels. We categorized these values into four groups according to the intensity of bias in the data used to generate the models: a control group (C = bias X0) and three treatments (T1 = bias X10, T2 = bias X100, T3 = bias X1000).
We performed pairwise comparisons using a paired t-test for the bias factor, generating three comparisons, to assess the effect of increasing sampling bias on the predictive ability of the models. The p-values were calculated from 9999 permutations using the "perm.t.test" function from the 'RVAideMemoire' package in R (Hervé, 2022; R Core Team, 2018), with the I index as the response variable. We applied the Benjamini & Yekutieli correction for multiple hypothesis testing, which controls the false discovery rate even under arbitrary dependence among the tests (Benjamini & Yekutieli, 2001).
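The permutation paired t-test and the Benjamini & Yekutieli adjustment can be sketched as follows. This is an illustrative Python analogue (sign-flip permutations of the paired differences, and a standard BY step-up), not the R 'perm.t.test' implementation:

```python
import numpy as np

def perm_paired_t(x, y, n_perm=9999, rng=None):
    """Paired t-test with a p-value from sign-flip permutations of the
    paired differences (two-sided)."""
    rng = rng or np.random.default_rng()
    d = np.asarray(x) - np.asarray(y)

    def tstat(d):
        return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

    t_obs = tstat(d)
    flips = rng.choice([-1.0, 1.0], size=(n_perm, len(d)))
    t_perm = np.array([tstat(d * f) for f in flips])
    p = (np.sum(np.abs(t_perm) >= abs(t_obs)) + 1) / (n_perm + 1)
    return t_obs, p

def benjamini_yekutieli(pvals):
    """BY step-up FDR adjustment, valid under arbitrary dependence."""
    p = np.asarray(pvals, float)
    m = len(p)
    order = np.argsort(p)
    c_m = np.sum(1.0 / np.arange(1, m + 1))  # BY penalty term
    adj = p[order] * m * c_m / np.arange(1, m + 1)
    adj = np.minimum.accumulate(adj[::-1])[::-1]  # enforce monotonicity
    out = np.empty(m)
    out[order] = np.minimum(adj, 1.0)
    return out
```

Here x and y would be the I-index values of the same model/partition under two bias levels, and the three comparison p-values would then be passed through `benjamini_yekutieli`.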
To evaluate the effect of sampling bias on the relationship between evaluation metrics and the models' predictive ability, we used the Spearman correlation coefficient (rho) between each evaluation metric and the overlap metric, computed for each modeling algorithm, as the response variable. We submitted the rho values for each evaluation metric to a repeated-measures ANOVA with bias as a factor, and used paired t-tests for pairwise comparisons to identify differences between pairs. The p-values were calculated from 9999 permutations using the "perm.t.test" function from the 'RVAideMemoire' package in R (Hervé, 2022; R Core Team, 2018) and corrected using the Benjamini & Yekutieli method (Benjamini & Yekutieli, 2001).
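How the rho response variable arises can be sketched like this (illustrative Python with simulated values; the metric and overlap data are assumptions, not the study's results):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(7)

# Hypothetical values for one algorithm across 100 data partitions:
# the overlap metric I and an evaluation metric (e.g. TSS on test data)
# that partially tracks it.
overlap_I = rng.uniform(0.4, 0.9, size=100)
tss = overlap_I + rng.normal(0, 0.05, size=100)

# Rank correlation between the evaluation metric and predictive ability.
rho, p = spearmanr(tss, overlap_I)
# One such rho per metric x algorithm x bias level then feeds the
# repeated-measures ANOVA with bias as the factor.
```

A high rho means the evaluation metric ranks models consistently with their true predictive ability; the study asks whether sampling bias degrades that agreement.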

RESULTS AND DISCUSSION
Our results indicate that sampling bias negatively impacts the predictive ability of models in geographic space. We found variation in the predictive capacity of the models across all bias intensity scenarios evaluated (repeated-measures factorial ANOVA; p-value < 0.05), and all three pairwise comparisons for bias were significant (p-value < 0.05). Comparing effect sizes, we found that the effect of the bias increases with its intensity (Figure 3). Our findings corroborate those of Baker et al. (2022), who evaluated sampling bias in environmental space (niche space) and identified significant effects on model predictivity. These results indicate that bias effects may go beyond model evaluation and compromise niche estimation, a frequent application of ENM and SDM (Adhikari et al., 2019; G. C. Costa et al., 2010; Warren, 2012). We observed that bias intensity is directly related to effect size, so filtering techniques to reduce bias, such as those of Aiello-Lammens et al. (2015), Vollering et al. (2019) and Zizka et al. (2019), can reduce bias and improve model predictivity. Although Boria et al. (2014) suggested that bias effects may increase with the number of points, our study did not confirm this assumption, as we observed a reduction in effect size as the number of occurrence records increased.
Our analysis found significance for all factors and interactions evaluated in the relationship between sampling bias and the ability of evaluation metrics to measure model predictive ability (ANOVA, p-value < 0.05). Only four of the 10 evaluation metrics analyzed were significant in some pairwise comparison (see Table 1). All significant evaluation metrics depend on the use of absences/pseudo-absences, except TPR (sensitivity), which measures the proportion of presences correctly identified by the model. Among the metrics that require absences, only SEDI and Sørensen showed no statistical significance.

When evaluating the effects of sampling bias on the evaluation metrics, we found considerable variability in the results, with some metrics more affected by bias than others depending on the type of absence data and the scenario considered. However, we noted a recurring pattern: the harm is greatest in situations of high bias intensity, which corroborates the findings on the predictive capacity of the models and reinforces the importance of using filters to reduce bias intensity (Aiello-Lammens et al., 2015; Nolan et al., 2022; Varela et al., 2014). In addition, we emphasize that evaluation metrics that do not depend on absence data are less susceptible to the effects of sampling bias, which makes them conceptually and methodologically attractive alternatives, as has been advocated for ENM and SDM (Leroy et al., 2018; Pearson et al., 2007).

Metrics
We have shown that sampling bias does not follow a consistent pattern in its effect on the relationship between evaluation and overlap metrics, even for metrics that depend only on presence data, such as AUl, MaxVDl, and Pearson D. We conclude that evaluating models based solely on sensitivity (TPR) is not sufficient to ensure the evaluative consistency of ENM and SDM. On the other hand, the three metrics that rely only on presence data, together with SEDI and Sørensen, which depend on absences, proved to be good conceptual and methodological alternatives, since they were not affected by collection bias in any of the evaluated scenarios (Leroy et al., 2018; Pearson et al., 2007; Varela et al., 2014).

Figure 1 :
Figure 1: Continuous distributions of the five Virtual Species (A, B, C, D, and E) and a map of Brazil with its transportation routes - highways/waterways (black lines) used to bias the collection of occurrence data (F). Sources: Prepared by the authors themselves (2023).

Figure 2 :
Figure 2: Schematic illustration of biased occurrences (black dots), which have a higher probability of being sampled near highways and waterways (shaded area), and unbiased occurrences (red dots). Only occurrences are shown; absences were also sampled using the same procedure. Sources: Prepared by the authors themselves (2023).

Figure 3 :
Figure 3: Bias effect (Y axis) when comparing overlap metric values (I) from the bias-free dataset with the datasets biased 10, 100, and 1000 times (X axis). The effect size is expressed by the standardized T statistic, ranging from 0 to 1. Sources: Prepared by the authors themselves (2023).

Table 1 :
Comparative table of the relationship between evaluation metrics and the overlap metric for models constructed with real absence data (PEA), using paired t-tests based on 9999 permutations. Filled cells indicate significance (p-value < 0.05) of the comparison between the bias control group X0 and the other bias groups (X10, X100 and X1000). For more information on each metric, see Supplementary Table 1.

Table 3 :
Table with the variables used in all our analyses. On the left are the variable abbreviations, in the center the variable names, and on the right the references for each variable (e.g., Slope | Degree of Incline of a Surface | Amatulli et al., 2018). Sources: Prepared by the authors themselves (2023).