What does GPT-4 think will happen in 2024?
As regular readers will know, each January I run a prediction contest, inviting people to predict what will happen in the year ahead. Contestants are given 40-50 statements - such as ‘Joe Biden will win the US Presidential election’ - and asked to say how likely each is to happen, as a percentage between 0 and 100, where 100 means it will definitely happen and 0 means it definitely won’t. This year’s contest included questions on the UK General Election, the economy and public services, on the wars in Gaza and Ukraine, on the US Presidential election, and on some wild-card matters such as Dune 2’s take at the box office and whether scientists will discover signs of alien life.
At the beginning of the following year the contest is scored and the winner announced. The results of last year’s contest can be found here, and the questions for this year’s contest can be found here - alongside my own forecasts, the forecasts of last year’s winner and the ‘wisdom of crowds’ - the mean of all the forecasts.
One question is how AI would do on this sort of prediction contest. A reader, Rachael, who has access to GPT-4, has very kindly asked it all of the questions, and the results are below. They are surprisingly good!
What was done?
The version used was GPT-4, trained on data up to 2021, and with an 8,000-token context length. It did not have live access to the internet, which, as we’ll see, appears to have made a difference for a few questions. It was also a generic GPT-4 model, not a programme specifically trained on forecasting - all of which means we should consider this a lower bound on what AI is capable of.
In asking the questions, Rachael began with the prompt:
Predict how likely you think the following events are to happen, from 0 to 100, where 0 means certain not to happen and 100 means certain to happen.
For example: The next coin I flip will show heads. Answer: 50.
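The post doesn’t record Rachael’s exact setup, but for readers who want to try the exercise themselves, here is a minimal sketch using the OpenAI Python client; the helper name and sample question are illustrative, not from the contest itself.

```python
# A minimal sketch of putting one contest question to GPT-4 via the
# OpenAI Python client. The exact setup used isn't recorded in the post;
# the helper name and sample question are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Predict how likely you think the following events are to happen, "
    "from 0 to 100, where 0 means certain not to happen and 100 means "
    "certain to happen. For example: The next coin I flip will show "
    "heads. Answer: 50."
)

def forecast(question: str) -> str:
    """Return GPT-4's 0-100 likelihood answer for one question."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(forecast("In the UK general election, Labour will win the most seats."))
```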
Because it didn’t have access to my preambles, she also slightly tweaked some of the questions to make clear - when it wasn’t already - that they referred to the UK and to 2024. ‘In the UK general election’ was also added to all the questions about the UK general election. The table shows the questions as they were put to GPT-4; forecasting nerds are free to compare them to the original questions here, but everyone else should rest assured they are pretty similar.
Of course, these were asked in March, not early January. This would be seriously unfair for a human contestant, but given that the AI didn’t have access to the internet - and didn’t ‘know’ anything that had happened since the start of the year - it shouldn’t make a difference.
The results
I’ve shown the results below - and given the ‘Wisdom of Crowds’ result for comparison. The Wisdom of Crowds is the mean of all human predictions; last year it was more accurate than 80% of human entrants.
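For the curious, there is nothing fancy behind the crowd forecast - it is just the per-question mean of every entrant’s answer. A toy illustration, with made-up numbers rather than real entries:

```python
# The 'Wisdom of Crowds' entry is just the mean of all human forecasts
# for each question. These numbers are made up for illustration.
entries = {
    "Joe Biden will win the US Presidential election": [40, 55, 35, 50],
    "Xi Jinping will remain in charge of China": [95, 90, 92, 97],
}

wisdom_of_crowds = {q: sum(vals) / len(vals) for q, vals in entries.items()}
print(wisdom_of_crowds["Joe Biden will win the US Presidential election"])  # 45.0
```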
How did GPT-4 do?
On one level, it is hard to assess how well GPT-4 did - given that I also don’t know what’s going to happen this year! But there are a few things we can say.
Firstly, it stayed on task and continued to respond with numbers between 0 and 100 all the way through, and didn’t get sidetracked into other kinds of answers. In that respect it did better than about 5% of human entrants!
It also answered all the questions and didn’t refuse to answer any, even on controversial subjects such as Israel.
Its answers seem a little conservative (in that there are very few above 80 or below 20). But they seem to generally be directionally right, and also to be more confident about things we can be confident about (Xi will remain in charge of China) than those that are more uncertain (who will win the US Presidential election).
In most cases the answers strongly track the Wisdom of Crowds. In fact, on 28 of the 45 questions it is within 10 points of the crowd: if we assume that, as last year, the Wisdom of Crowds will be more accurate than 80% of humans, that’s a good sign. On a further 11 it is between 10 and 20 points away, and on only 6 is it more than 20 points away.
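That 28/11/6 split is simple to compute: take the absolute difference between GPT-4’s answer and the crowd’s for each question and bucket the results. A sketch with illustrative pairs standing in for the real table:

```python
# Bucketing how far GPT-4's answers sit from the Wisdom of Crowds.
# These pairs are illustrative, not the actual 45 contest questions.
gpt4 = [55, 70, 20, 90]
crowd = [60, 45, 22, 65]

diffs = [abs(g - c) for g, c in zip(gpt4, crowd)]
within_10 = sum(d <= 10 for d in diffs)
within_20 = sum(10 < d <= 20 for d in diffs)
beyond_20 = sum(d > 20 for d in diffs)
print(within_10, within_20, beyond_20)  # 2 0 2
```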
Of the answers which are very different from the Wisdom of Crowds, it’s notable that three - on Israel/Gaza and on Venezuela - relate to recent events. It seems likely that the lack of internet access is hindering it here: unaware of the events of October 7th and their aftermath, or of the increase in tensions between Venezuela and Guyana, it is giving answers that would have made sense in 2022, rather than taking recent happenings into account.
Interestingly, this shows how well one can apparently do simply by looking at base rates and taking the outside view.
Overall I was pretty impressed. Though when we consider that last year more than half of participants did worse than if they’d written 50% for every question, maybe we shouldn’t be surprised that an AI can hold its own. Humans find forecasting hard!
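The post doesn’t spell out the scoring rule, but contests like this are typically scored with Brier scores, where lower is better and answering 50 to everything scores 0.25 on each question - hence the baseline mentioned above. A sketch under that assumption:

```python
# The scoring rule isn't spelled out in the post; the Brier score is the
# standard choice for contests like this, so this sketch assumes it.
# Lower is better; answering 50 for everything scores 0.25 per question.
def brier(forecast_pct: float, happened: bool) -> float:
    """Squared error between a 0-100 forecast and the 0/1 outcome."""
    p = forecast_pct / 100
    return (p - (1.0 if happened else 0.0)) ** 2

print(brier(80, True))  # 0.04 - a confident, correct forecast
print(brier(50, True))  # 0.25 - the 'all 50s' baseline mentioned above
```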
The proof of the pudding will be in the eating. I'll be scoring GPT-4 at the end of the year, on the same basis as other entries - and we can see how it stacks up!