Next Sunday, 20 November, the men's FIFA World Cup kicks off in Qatar. The favourite this time is Brazil, with a 15 per cent probability of winning. This is the result of a machine learning forecast by an international team of researchers consisting of Andreas Groll and Neele Hormann (both TU Dortmund), Gunther Schauberger (TU Munich), Christophe Ley (University of Luxembourg), Hans Van Eetvelde (University of Ghent) and Achim Zeileis (University of Innsbruck). Their forecast combines several statistical models for the teams' playing strengths with information about the team structure (such as market value or number of Champions League players) as well as socio-economic factors of the country of origin (population or gross domestic product). “This time, the World Cup is overshadowed by many ethical and sporting problems that we cannot ignore. Nevertheless, for scientific reasons, we have decided to use our machine learning approach, which we have used successfully at previous tournaments, to make probabilistic forecasts,” says Achim Zeileis from the Department of Statistics of the University of Innsbruck.
100,000 simulations
Using the predicted values from the researchers' model, the entire World Cup was simulated 100,000 times: match by match, following the tournament draw and all FIFA rules. This yields probabilities for each team advancing to the different tournament rounds and ultimately winning the championship. This time, the favourite is Brazil with a winning probability of 15 per cent, followed by Argentina (11.2 per cent), the Netherlands (9.7 per cent), Germany (9.2 per cent) and France (9.1 per cent) – the full forecast can be found here. Of course, the tournament is far from being predetermined – this is reflected by the comparatively low winning probabilities of the top teams. “It is in the very nature of forecasts that they can also be incorrect – otherwise football tournaments would be very boring. We provide probabilities, not certainties, and a probability of winning of 15 per cent also implies a probability of 85 per cent of not winning,” explains Andreas Groll. So far, however, the predictions have been quite successful: Achim Zeileis' Innsbruck model, which is based on adjusted bookmakers' odds, correctly predicted the EURO final in 2008, as well as Spain as World and European champion in 2010 and 2012. This year, Zeileis' model will be used for the second time after the EURO 2021 as part of a more comprehensive combined model developed by the teams around Andreas Groll (TU Dortmund), Gunther Schauberger (TU Munich) and Christophe Ley (University of Luxembourg), which surpassed the forecasting quality of the betting providers at the 2018 World Cup.
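The idea of simulating a tournament many times and counting how often each team wins can be illustrated with a small sketch. It is not the researchers' code: the eight teams, their strength values, the simple knockout bracket and the rule that a team wins with probability proportional to its strength are all assumptions made purely for illustration, whereas the real simulation follows the full World Cup draw and FIFA rules.

```python
import random
from collections import Counter

# Hypothetical team strengths (illustrative only, not the researchers' values).
STRENGTHS = {
    "A": 2.0, "B": 1.6, "C": 1.5, "D": 1.4,
    "E": 1.2, "F": 1.0, "G": 0.9, "H": 0.8,
}


def play_match(team1, team2, rng):
    """Return the winner; team1 wins with probability s1 / (s1 + s2)."""
    s1, s2 = STRENGTHS[team1], STRENGTHS[team2]
    return team1 if rng.random() < s1 / (s1 + s2) else team2


def simulate_knockout(bracket, rng):
    """Play one single-elimination tournament and return the champion."""
    teams = list(bracket)
    while len(teams) > 1:
        # Neighbouring teams in the list meet; winners form the next round.
        teams = [play_match(teams[i], teams[i + 1], rng)
                 for i in range(0, len(teams), 2)]
    return teams[0]


def winning_probabilities(bracket, n_sims=100_000, seed=2022):
    """Estimate each team's title probability as its share of simulated wins."""
    rng = random.Random(seed)
    wins = Counter(simulate_knockout(bracket, rng) for _ in range(n_sims))
    return {team: wins[team] / n_sims for team in bracket}


# A smaller run keeps the example fast; the article's figure is 100,000.
probs = winning_probabilities(list(STRENGTHS), n_sims=10_000)
for team, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{team}: {p:.1%}")
```

Because every simulated tournament has exactly one champion, the estimated probabilities add up to 100 per cent, and even the strongest team in this toy example wins only a minority of the runs.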
The 2022 World Cup is interesting for the researchers from a scientific perspective because of the unusual date – the tournament had to be postponed to the winter months because of the extremely high temperatures in Qatar in the summer: “In addition to the widely discussed ethical problems of this World Cup, this also raises very critical sporting questions: In the winter months, all the major football leagues in Europe and South America have to interrupt their usual match schedule to accommodate the tournament. This gives the national teams less time to prepare and the players less time to recover before and after the World Cup. Combined with the extreme climatic conditions, this also increases the risk of injuries,” explains Achim Zeileis. Having a team with many players in the international club competitions – such as the Champions League, Europa League and Europa Conference League – could therefore prove to be more of a disadvantage than an advantage this year, as Andreas Groll points out: “All these factors make it more difficult to predict how the tournament will turn out, as variables that proved to be very meaningful at previous World Cups may not work well or work differently.”
As football fans, the researchers are dismayed by the circumstances under which the World Cup is taking place this year, Achim Zeileis emphasises: “The usual joy and anticipation of a World Cup has been crushed by the terrible circumstances this year: from the alleged corruption in the host selection process, to the human rights and working conditions in Qatar, and the lack of sustainability in the construction and operation of the stadiums.”
Machine learning
The researchers' calculation is based on four sources of information: a statistical model for the playing strength of each team based on all international matches of the past eight years (Universities of Ghent and Luxembourg); another statistical model for the playing strength of the teams based on the betting odds of 28 international bookmakers (University of Innsbruck); and further information about the teams, for example the market value, and their countries of origin, such as population size (TU Dortmund and TU Munich). The fourth source or “partner” is a machine learning model that combines the different sources and optimises them step by step. The researchers previously trained the model with historical data, as Andreas Groll explains: “We fed the model with the current data for the past five World Cups, i.e., between 2002 and 2018, and compared it with the actual outcomes of all matches in the respective tournaments – this way, the weighting of the individual sources of information for the current tournament will ideally be very accurate.” Incidentally, the model trained in this way can also be used for other forecasts in the future – so a better football forecast may eventually also provide more accurate weather forecasts. In any case, we will find out how well the model performs in terms of football forecasts by the evening of 18 December.
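How bookmakers' odds can be turned into winning probabilities is sketched below. The article does not spell out how the Innsbruck model adjusts the odds, so this shows only the most basic approach and should not be read as the researchers' method: decimal odds are inverted, rescaled so that they sum to one (which removes the bookmaker's margin) and then averaged across bookmakers. The two bookmakers, four teams and all odds values are invented for illustration.

```python
# Hypothetical decimal odds for winning the tournament, from two made-up
# bookmakers and four made-up teams (illustrative values only).
ODDS = {
    "bookmaker_1": {"A": 4.0, "B": 6.0, "C": 8.0, "D": 2.0},
    "bookmaker_2": {"A": 4.2, "B": 5.5, "C": 9.0, "D": 1.9},
}


def implied_probabilities(decimal_odds):
    """Convert one bookmaker's decimal odds into probabilities that sum to 1."""
    raw = {team: 1.0 / odds for team, odds in decimal_odds.items()}
    # The raw values sum to more than 1 here; the excess is the bookmaker's margin.
    total = sum(raw.values())
    return {team: p / total for team, p in raw.items()}


def consensus(odds_by_bookmaker):
    """Average the margin-free probabilities across all bookmakers."""
    per_book = [implied_probabilities(o) for o in odds_by_bookmaker.values()]
    teams = per_book[0].keys()
    return {t: sum(p[t] for p in per_book) / len(per_book) for t in teams}


consensus_probs = consensus(ODDS)
for team, p in sorted(consensus_probs.items(), key=lambda item: -item[1]):
    print(f"{team}: {p:.1%}")
```

In the combined model described in the article, probabilities of this kind would be only one of several inputs, with the machine learning model learning how much weight each source deserves.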