Language Models: From Uncertainty Estimation and Optimization, to Soccer Event Prediction
PandaScore Research Insights #13
Learning the Language of Soccer using Transformers
What it is about: This paper was published at the KDD 2022 (Knowledge Discovery and Data Mining) conference by researchers from the University of Southampton. They worked on predicting the next action during live play in a game of soccer.
How it works: To do this, they first built a dataset of action sequences. Only offensive sequences were selected, as their primary interest was in predicting goals. Each sequence is a list of actions: passes, dribbles, crosses, shots, and ball losses. They then augmented these action sequences with time-series data carrying useful context about the recent play: for instance, the ball's position on the field, the distance to the opposing team's goal, and other similar information.
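To make the data representation concrete, here is a minimal sketch of how such a possession could be encoded for a sequence model. This is our own illustration, not the authors' code: the action vocabulary follows the action types named above, while the feature choices (normalized pitch coordinates and distance to the opposing goal) are assumptions.

```python
from dataclasses import dataclass

# Hypothetical action vocabulary, based on the action types listed in the paper.
ACTIONS = ["pass", "dribble", "cross", "shot", "ball_loss"]
ACTION_TO_ID = {a: i for i, a in enumerate(ACTIONS)}

@dataclass
class Event:
    action: str   # one of ACTIONS
    x: float      # ball x-position on a normalized pitch (1.0 = opposing goal line)
    y: float      # ball y-position across the pitch width (0..1)

def encode_sequence(events):
    """Turn a possession into (token ids, feature rows) for a sequence model."""
    token_ids = [ACTION_TO_ID[e.action] for e in events]
    # Example engineered features: position plus distance to the opposing
    # goal, placed at (1.0, 0.5) on the normalized pitch.
    features = [
        [e.x, e.y, ((1.0 - e.x) ** 2 + (0.5 - e.y) ** 2) ** 0.5]
        for e in events
    ]
    return token_ids, features

possession = [Event("pass", 0.4, 0.5), Event("dribble", 0.6, 0.4), Event("shot", 0.85, 0.5)]
ids, feats = encode_sequence(possession)
print(ids)  # [0, 1, 3]
```

The action tokens play the role of "words", while the per-step feature rows carry the spatial context the paper describes.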
They tried two different modeling approaches: one based on transformer neural networks and one based on recurrent neural networks. Given the current live situation, the models must predict both the next action and the next location of the ball.
Results: They analysed the model's performance under different settings. First, they evaluated its accuracy in terms of the number of correct next-action predictions. They then extended the model's usage by deriving metrics such as the estimated number of goals given match sequences. They showed that their goal estimator was on par with current state-of-the-art methods like expected goals (xG), while using fewer statistics from past matches.
What they found out: One interesting metric they built is called poss-util. It measures how effectively a team makes use of a ball possession. Interestingly, they used it to characterise a team's overall performance and noticed a very distinctive pattern separating average teams from the top-performing ones.
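To illustrate how a goal estimator can be derived from such a sequence model, here is a rough sketch of the idea: sum, over each possession in a match, the model's predicted probability that the possession ends in a goal. This is our reading of the approach, not the paper's exact definition, and the toy probability model is purely illustrative.

```python
def expected_goals(possessions, goal_prob):
    """Estimate goals for a match: possessions is a list of action sequences,
    goal_prob is a model callback returning P(goal | sequence so far)."""
    return sum(goal_prob(seq) for seq in possessions)

def toy_goal_prob(seq):
    # Toy stand-in for the trained sequence model: more attacking actions,
    # higher goal probability (the coefficients are made up).
    return min(0.05 * seq.count("shot") + 0.01 * seq.count("cross"), 1.0)

match = [["pass", "cross", "shot"], ["pass", "dribble", "ball_loss"], ["shot"]]
print(round(expected_goals(match, toy_goal_prob), 2))  # 0.11
```

A poss-util-style metric would then aggregate such per-possession values to characterise how well a team exploits its possessions over a season.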
Why it matters: The paper demonstrates that transformer-based models can learn sequences of game actions, which could be valuable for analyzing team behavior patterns. It is a nice extension of models generally used for language tasks.
Our takeaways: Sports and esports share many similarities, and what works for one could greatly benefit the other. Their model is capable of making objective predictions and performing in-depth analyses of the team behaviors that drive team performance.
Teaching Models to Express Their Uncertainty in Words
What it is about: In this paper from June 2022, authors from the University of Oxford and OpenAI finetune models to express epistemic uncertainty in natural language. When posed with a question, the model generates an answer accompanied by a level of confidence (e.g., “90% confidence” or “high confidence”).
How it works: The authors finetuned GPT-3 with supervised learning on a set of mathematical questions (referred to as CalibratedMath). The training set contains addition/subtraction questions, each with a single correct answer. The test set contains multiplication/division questions as well as questions that can have multiple answers (e.g., arithmetic progressions), in order to create a shift in both difficulty and content. Each input consists of a question followed by a GPT-generated answer, and the label is a calibrated confidence level. The basic intuition is that for questions GPT-3 is likely to answer incorrectly, its confidence should be low. The goal is not to improve the model's answers but to improve its calibration when expressing uncertainty about them. Three techniques are compared:
Verbalized: the model expresses its uncertainty in words along with the response.
Answer logit: the normalized log probability of the model's answer.
Indirect logit: the log probability of the token "True" when appended to the model's answer.
Results: From the published results, we observe the following:
Verbalized outperforms both logit methods in the Multi-answer category.
Verbalized probability tends to overfit the training data, as shown by its poor performance on the evaluation set.
Indirect logit generalizes well to Multiply-divide but not to Multi-answer, suggesting overfitting to the task type.
Why it matters: The paper's main outcome is that GPT models can learn to express calibrated uncertainty in words, and that this calibration performance is not explained simply by learning to output logits. This is interesting because most LLMs inherently struggle with probabilities. Like other approaches based on finetuning or on the answer logit, it adds a powerful capability without requiring a full model retraining, which is far less costly.
Our takeaways: For PandaScore, the ability of an LLM to express calibrated uncertainty could pave the way for using such models in probabilistic forecasting. Although the performance showcased in the paper doesn't yet meet the requirements for esports betting, this topic is definitely one to follow closely this year.
Large Language Models as Optimizers
What it is about: This paper, published by DeepMind in December 2023, introduces a novel method named Optimization by PROmpting (OPRO), which uses Large Language Models (LLMs) as general-purpose optimizers. The approach leverages the natural language understanding capabilities of LLMs to tackle optimization tasks, even when the objective is discrete or non-differentiable.
How it works: OPRO works by constructing a “meta-prompt” that encapsulates the optimization problem in natural language. This prompt includes a description of the problem, previously generated solutions, and their respective evaluations. The LLM uses this information to generate new solutions iteratively. Each new solution is evaluated, and the feedback is incorporated back into the prompt for the next iteration. This process enables the model to progressively refine its approach, drawing on its contextual understanding of the problem through natural language.
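The loop just described can be sketched schematically as follows. This is our toy illustration of the OPRO structure, not DeepMind's implementation: the objective is a made-up black-box function, and the "LLM" is a stub that simply proposes neighbors of the best solution so far (a real system would send the meta-prompt to an actual model).

```python
def objective(x):
    # Toy black-box score to maximize (peaks at x = 3).
    return -(x - 3) ** 2

def build_meta_prompt(history):
    """Assemble the meta-prompt: problem description plus past solutions
    and their scores, expressed in natural language."""
    lines = ["Maximize the score. Previous solutions and their scores:"]
    lines += [f"  solution={s}, score={v}" for s, v in sorted(history, key=lambda h: h[1])]
    lines.append("Propose a new solution with a higher score.")
    return "\n".join(lines)

def stub_llm(meta_prompt, history):
    # Stand-in for the LLM: read off the best solution seen so far and
    # propose its two integer neighbors. A real LLM would generate
    # candidates conditioned on meta_prompt instead.
    best = max(history, key=lambda h: h[1])[0]
    return [best - 1, best + 1]

def opro(steps=10):
    history = [(0, objective(0))]            # seed solution and its score
    for _ in range(steps):
        prompt = build_meta_prompt(history)  # feedback folded back into text
        for candidate in stub_llm(prompt, history):
            history.append((candidate, objective(candidate)))
    return max(history, key=lambda h: h[1])

print(opro())  # → (3, 0)
```

The key design point is that all feedback flows through the prompt as text, which is what lets the same loop handle discrete or non-differentiable objectives.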
Results: The paper shows the effectiveness of OPRO through case studies on classic optimization problems like linear regression and the traveling salesman problem, as well as a prompt engineering problem. The authors demonstrate that LLMs can find correct solutions, sometimes surpassing hand-designed heuristic algorithms, in less complex settings. In prompt optimization, the paper reveals that prompts optimized by OPRO can significantly enhance external task accuracy, surpassing human-designed prompts in various benchmarks.
Why it matters: The authors conclude that LLMs have some potential as optimizers. The ability of LLMs to understand and process natural language allows for a more intuitive and flexible approach to optimization tasks. With this in mind, it is important to note that the problems addressed in this paper are relatively simple, and LLMs would need significant improvement to be feasible optimizers in real-world scenarios. The authors acknowledge this, stating that the goal of the research was not state-of-the-art performance but rather to highlight possibilities of LLMs that are not immediately obvious.
Our Takeaways: At PandaScore, optimization tasks are very common, which is why this work attracted our attention. The ability to easily handle all types of optimization problems, including those with non-differentiable scoring functions, is an intriguing research direction for LLMs. However, we see this as a long-term goal, not achievable in the near future with current models or without substantial fine-tuning.