Our results showed that none of the models perfectly reproduced recorded observations at all sites and in all years, and none could unequivocally be labelled robust and accurate in terms of yield prediction across different environments and crop cultivars with only minimum calibration. Yield estimation was best for DAISY and DSSAT, which had the lowest RMSE values (1428 and 1603 kg ha−1, respectively) and the highest index of agreement (0.71 and 0.74, respectively). CROPSYST systematically underestimated yields (MBE −1186 kg ha−1), whereas HERMES, STICS and WOFOST clearly overestimated them (MBE 1174, 1272 and 1213 kg ha−1, respectively). APES, DAISY, HERMES, STICS and WOFOST furnished high total above-ground biomass estimates, whereas CROPSYST, DSSAT and FASSET provided low total above-ground biomass estimates. Consequently, DSSAT and FASSET produced very high harvest index values, followed by HERMES and WOFOST. APES and DAISY, on the other hand, returned low harvest index values. Although phenological observations were provided, the calibration results for wheat phenology, i.e. estimated dates of anthesis and maturity, were surprisingly variable, with the largest RMSE for anthesis being generated by APES (20.2 days) and for maturity by HERMES (12.6 days).
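The three statistics quoted above can be illustrated with a minimal sketch. It assumes that MBE is computed as simulated minus observed (so that a negative value means underestimation, consistent with the signs reported here) and that the index of agreement is Willmott's d; the text does not state either definition explicitly. The function names and the yield values are hypothetical and serve only to show the arithmetic.

```python
import math


def rmse(simulated, observed):
    """Root mean square error between simulated and observed values."""
    squared = [(s - o) ** 2 for s, o in zip(simulated, observed)]
    return math.sqrt(sum(squared) / len(squared))


def mbe(simulated, observed):
    """Mean bias error, assumed here to be simulated minus observed."""
    diffs = [s - o for s, o in zip(simulated, observed)]
    return sum(diffs) / len(diffs)


def index_of_agreement(simulated, observed):
    """Willmott's index of agreement d (assumed definition), in [0, 1]."""
    obs_mean = sum(observed) / len(observed)
    numerator = sum((s - o) ** 2 for s, o in zip(simulated, observed))
    denominator = sum(
        (abs(s - obs_mean) + abs(o - obs_mean)) ** 2
        for s, o in zip(simulated, observed)
    )
    return 1.0 - numerator / denominator


# Hypothetical yields in kg ha-1, not taken from the study.
observed = [5000.0, 6000.0, 7000.0]
simulated = [5500.0, 6500.0, 7500.0]

yield_rmse = rmse(simulated, observed)
yield_mbe = mbe(simulated, observed)
yield_d = index_of_agreement(simulated, observed)
```

In this made-up case the model overestimates every observation by the same amount, so RMSE and MBE coincide; with errors of mixed sign, MBE shrinks towards zero while RMSE does not, which is why the two are reported together.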
The wide range of grain yield estimates provided by the models for all sites and years reflects substantial uncertainties in model estimates achieved with only minimum calibration. Mean predictions from the eight models, on the other hand, were in good agreement with measured data. This applies both to results across all sites and seasons and to the prediction of observed yield variability at single sites, a very important finding that supports the use of multi-model estimates rather than reliance on single models.
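One way to see why a multi-model mean can agree with observations better than its members is that opposing biases partly cancel. The sketch below is an idealised illustration under that assumption, not the procedure used in the study: the model names, the helper functions and all yield values are hypothetical, and the biases are chosen to cancel exactly.

```python
import math
from statistics import mean


def ensemble_mean(predictions):
    """Average the predictions of all models, season by season."""
    series = list(predictions.values())
    n_seasons = len(series[0])
    return [mean(model[i] for model in series) for i in range(n_seasons)]


def rmse(simulated, observed):
    """Root mean square error between simulated and observed values."""
    squared = [(s - o) ** 2 for s, o in zip(simulated, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical yields in kg ha-1 for three seasons, not taken from the study.
observed = [5000.0, 6000.0, 7000.0]
predictions = {
    "model_low": [4000.0, 5000.0, 6000.0],   # underestimates every season
    "model_high": [6000.0, 7000.0, 8000.0],  # overestimates every season
}

multi_model = ensemble_mean(predictions)
member_rmse = {name: rmse(sim, observed) for name, sim in predictions.items()}
ensemble_rmse = rmse(multi_model, observed)
```

Here each member has a large RMSE while the ensemble mean matches the observations exactly. Real model errors are neither equal nor perfectly opposed, so the improvement from averaging is usually partial.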