- Effects of the training dataset characteristics on the performance of nine species distribution models: application to Diabrotica virgifera virgifera.
Effects of the training dataset characteristics on the performance of nine species distribution models: application to Diabrotica virgifera virgifera.
Many distribution models developed to predict the presence/absence of invasive alien species need to be fitted to a training dataset before practical use. The training dataset is characterized by the number of recorded presences/absences and by their geographical locations. The aim of this paper is to study the effect of the training dataset characteristics on model performance and to compare the relative importance of three factors influencing model predictive capability; size of training dataset, stage of the biological invasion, and choice of input variables. Nine models were assessed for their ability to predict the distribution of the western corn rootworm, Diabrotica virgifera virgifera, a major pest of corn in North America that has recently invaded Europe. Twenty-six training datasets of various sizes (from 10 to 428 presence records) corresponding to two different stages of invasion (1955 and 1980) and three sets of input bioclimatic variables (19 variables, six variables selected using information on insect biology, and three linear combinations of 19 variables derived from Principal Component Analysis) were considered. The models were fitted to each training dataset in turn and their performance was assessed using independent data from North America and Europe. The models were ranked according to the area under the Receiver Operating Characteristic curve and the likelihood ratio. Model performance was highly sensitive to the geographical area used for calibration; most of the models performed poorly when fitted to a restricted area corresponding to an early stage of the invasion. Our results also showed that Principal Component Analysis was useful in reducing the number of model input variables for the models that performed poorly with 19 input variables. DOMAIN, Environmental Distance, MAXENT, and Envelope Score were the most accurate models but all the models tested in this study led to a substantial rate of mis-classification.