Great article. I thought a lot about this in grad school, where I used math models to predict dynamics of gene drives. I think even those working closely with the models tend to be overly trusting of the results, and I wrote a paper on the "reification" of these models. It was helpful to keep the humbling "all models are wrong, but some are useful" in mind.
For chess, I agree that the probability measure is at least more useful, and it seems similar to the calculation of win probability from the difference in Elo ratings. I'm reminded of a completely closed position I once saw where there was no way to make progress, but the engines would give something like +2.0.
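For anyone who hasn't seen the Elo calculation mentioned above, here is a minimal sketch of the standard logistic formula. The function name is just illustrative. Note that it gives an expected score (draws count as half a point), not a pure win probability.

```python
def elo_expected_score(rating_diff: float) -> float:
    """Expected score (win = 1, draw = 0.5, loss = 0) for the player who is
    rating_diff Elo points stronger, under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


if __name__ == "__main__":
    for diff in (0, 100, 200, 400):
        print(f"{diff:+4d} Elo -> expected score {elo_expected_score(diff):.3f}")
```

Like an engine's Q value, the curve flattens out at the extremes: the second 400 points of rating advantage adds far less expected score than the first.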
I wonder if Chess.com's Game Review uses the Q score, or some combination of Q and centipawns. I've noticed that in a won position you can go from +9 to +6 and it will give you the thumbs-up "good move", whereas anywhere else a three-point drop would be a blunder. Makes sense, since it makes little difference in winning probability when you're that far ahead.
Another confusing thing about the evaluation of the above position is that, despite Stockfish evaluating white's advantage as "worth" three and a half pawns, if you actually try to give black, say, a bad knight or a bishop or two pawns, the evaluation will immediately swing in black's favor (any light-squared bishop not immediately attacked will put the eval at -6 or so). I've heard tournament commentators for big events say stuff like "the engine thinks white's advantage is worth an extra rook" when the eval bar says +5, even though the evaluations don't work that way at all.
This is pretty much the same thing that leads to being a pawn up giving +3.5 though. When you add the bishop it's a totally new position, and in that position black is up a piece. That piece first helps defend against the threats white had in the previous position, then it helps black with threats of their own. Maybe saying the advantage is worth a rook in +5 positions is wrong, but when an extra pawn gives +3.5, an extra bishop giving -6 shouldn't be any more surprising.
Thankfully, Stockfish 15.1 has the common sense to dispense with this metric:
https://stockfishchess.org/blog/2022/stockfish-15-1/
this is great
I wonder if you could also enlighten us about the Lichess accuracy number. Sometimes it makes no sense. For example, I recently played a game where I went down 100 centipawns early on, managed not to lose more ground but stayed down 100 centipawns for 26 moves, then my opponent blundered and I eventually got the better game. Analysis showed 97% accuracy with no inaccuracies, no mistakes, no blunders. How is that possible? It felt like I was in serious trouble for 26 moves.
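A rough sketch of why that can happen, assuming the per-move formulas Lichess has published for its accuracy metric. The constants below are quoted from memory, so treat them as assumptions and check Lichess's own description. The plain average at the end is my simplification; Lichess aggregates per-move scores in a more elaborate way. The key point is that each move is scored by how much win percentage it gives up, not by how bad the position already is.

```python
import math

# Assumed constants (from memory of Lichess's accuracy description; verify).
WIN_K = 0.00368208
ACC_A, ACC_B, ACC_C = 103.1668, 0.04354, 3.1669


def win_percent(cp: float) -> float:
    """Map a centipawn eval (mover's point of view) to a 0-100 win percentage."""
    return 50.0 + 50.0 * (2.0 / (1.0 + math.exp(-WIN_K * cp)) - 1.0)


def move_accuracy(cp_before: float, cp_after: float) -> float:
    """Score one move by the win percentage it lost, clamped to 0-100."""
    drop = max(0.0, win_percent(cp_before) - win_percent(cp_after))
    return min(100.0, max(0.0, ACC_A * math.exp(-ACC_B * drop) - ACC_C))


if __name__ == "__main__":
    # One move that drops from 0 to -100, then 26 moves holding at -100.
    evals = [0.0] + [-100.0] * 27
    scores = [move_accuracy(a, b) for a, b in zip(evals, evals[1:])]
    print(f"first move: {scores[0]:.1f}, holding moves: {scores[1]:.1f}")
    print(f"simple average: {sum(scores) / len(scores):.1f}")
```

Under these assumptions, every move that merely holds the -100 position scores essentially 100, so a single early slip gets averaged away. The same curve also shows why +9 to +6 costs far less win percentage than +3 to 0.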
Really interesting post, but I wonder also about evaluating based on ease of play. In other words, if one player has a simple plan but the other has a complex plan, simple seems better, certainly in a practical game. Perhaps that's a better measure of the player: their ability to weave in more and more sophisticated ideas to create a more nuanced plan, and do it in a reasonable amount of time? If the evaluation is so winning that even a 1200 will beat a GM, that's pretty overwhelming; there are presumably very few tricks or subtleties available, right? When AlphaGo played that Go match against Lee Sedol, didn't they claim that the move that led to the human winning game 4 was a 1 in 10,000 move (or something like that)? That seems like another measure that could be considered.... Then again, perhaps not...
I think a difficulty metric would be fantastic for humans reviewing games. I've seen a lot of people discussing this, but haven't seen it as part of any chess apps so far. Seems like it should be possible to do.
Your remarks about digital evaluators reporting on probability of outcome raise a curious point. Opponents in a chess game are limited to just two alternative actions outside the board. They can offer a draw or resign. This is rather limiting and overlooks the possibility that they have a continuous range of perceptions of their chances.
What if a player had the possibility of offering his opponent, say, .3 win points to end the game (1 win point = win, 0 WP for a draw)? This would also give observers more ways to participate in grandmaster games by comparing ongoing assessments. One could even imagine a market in which shares indexed to win points are traded.
The problem with both measures is that chess is not a game which produces one result out of two possibilities but one out of three. In both cases 0 is a logical value for a position where winning as Black and winning as White are equally probable, but for me, as a human, it makes a big difference whether it is 40% vs. 40% (with a 20% chance of a draw) or 10% vs. 10% (with an 80% chance of a draw).
And I think this problem is hardly resolvable because it is difficult even to define the desired outcome. If I had to choose between a move which gives me the highest probability of my victory in a game and (a different) move which offers me the best expected value of the final outcome, I could probably make a human decision depending on some factors from outside the board, like the current tournament situation, or my rating goal, or even my plans for the evening. But I can't see any good way to determine the engine's most wanted outcome for such a case.
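To make the two points above concrete, here is a small sketch. The `WDL` type, the `pick` helper and the candidate numbers are all made up for illustration; no engine is claimed to work this way. The two positions from the comment collapse to the same single number, and "best move" depends on which objective you pick.

```python
from typing import Dict, NamedTuple


class WDL(NamedTuple):
    """Hypothetical win/draw/loss probabilities for the side to move."""
    win: float
    draw: float
    loss: float

    def expected_score(self) -> float:
        # A win counts 1, a draw 0.5, a loss 0.
        return self.win + 0.5 * self.draw


def pick(candidates: Dict[str, WDL], must_win: bool) -> str:
    """Choose a move by win probability (must-win) or by expected score."""
    if must_win:
        return max(candidates, key=lambda name: candidates[name].win)
    return max(candidates, key=lambda name: candidates[name].expected_score())


if __name__ == "__main__":
    sharp = WDL(0.40, 0.20, 0.40)
    quiet = WDL(0.10, 0.80, 0.10)
    print(sharp.expected_score(), quiet.expected_score())  # same single number

    moves = {"risky": WDL(0.35, 0.20, 0.45), "solid": WDL(0.15, 0.75, 0.10)}
    print(pick(moves, must_win=True), pick(moves, must_win=False))
```

A human flips `must_win` based on the tournament situation. A single-number evaluation has to bake one choice in.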
There are only three true evaluations: win for white, win for black, or draw.
Any point values (such as centipawns) or range of values (-1 to 1) are just made up numbers that represent a guess at what the true evaluation is. According to game theory, there are only the above three valuations.
This is because chess is finite, does not rely on chance, and has perfect information. It is finite because the game cannot go on forever due to the 50-move rule (strictly, that one has to be claimed, but under FIDE rules the 75-move and fivefold-repetition rules end the game automatically). It doesn't rely on chance, unlike a dice game. And both players have perfect information about the game state, unlike a card game or Scrabble.
This means that chess can be solved, just like tic-tac-toe or checkers. Every single position in chess is either a forced win for white, a forced win for black, or a forced draw for both players... if both players were to play perfectly.
Since chess is unsolved, engines are simply estimating whether a position looks more like a win, loss, or draw. And in many instances their estimations also seem to represent how "easy" of a win it is for non-perfect playing humans. For example, white being up two pawns in a normal position is probably a theoretical forced win for white, but white being up two pawns and a queen is an "easier" win for white thus deserving a higher evaluation number.
Instead of having one single metric to represent the position evaluation, I think there should be multiple metrics. Just to list a few ideas: perhaps percentage of how confident the engine is in it being a win, loss, or draw. How complex the position is (typically being up two pawns is more complicated to win than being up a queen). And perhaps the sharpness of a position (it could be mate in 12 for white but otherwise losing without exact precision, thus perhaps black would practically have better chances in a real human game).
I think there should be multiple metrics for the purpose of studying games and learning from them and stuff, but it's still important to decide which metric is the *most* important, because chess, in addition to being a game, is also a spectator sport, and when you're streaming tournaments on Twitch or YouTube, you gotta have a single bar to avoid clutter. So what should that bar represent? I think that's a meaningful question.
Great post. I have also often felt centipawns do not give a very usable assessment of every position. Especially when comparing the playing strength of players of different eras, the 'centipawn loss' measurements really irk me. There are only three kinds of positions: winning for white, drawing, and winning for black. If chess is ever solved, the perfect engine will give only those outcomes. I did not know that AlphaZero used an evaluation function between -1 and 1, but in this context it makes sense.
Also, lately I have seen a different measurement of playing strength come up called 'accuracy', which is a percentage, so a value between 0 and 1 that captures the average strength of the moves in a given game. Do you know anything about how this relates to models with positional evaluations between -1 and 1?
Of course, individual positions can be much easier for humans to win or draw than others regardless of evaluation (especially when factors such as time and external stakes get factored in), but over a large enough sample I imagine the human curve would look very similar to this curve, so it is likely useful as is (if for no other reason than that I may commit an error, but so may my opponent, and these should more or less offset over huge sample sizes). The main difference is that humans do make big errors, so the Elo difference between them is still very important in predicting results. For example, if Stockfish today had a +6 position, the opponent would be basically irrelevant: any engine in the world, including, say, an improved Stockfish 10 years from now on even stronger hardware. It's +6. Stockfish will win that. However, a 1700 who miraculously caught a GM in a trap, or whose GM opponent simply blundered, is seriously not guaranteed the win in a +6 position. My best guess is that some sort of table multiplying this S-curve's prediction for a given position's advantage by the Elo S-curve's prediction for the rating difference at the start of the game would give us the overall estimate. If true, it might also be useful in creating "fair" odds games, so that players of varying skill can have a competitive game with one another for fun.
How do neural-network-based engines compute the "probability" of a win from a given position?
The network learns to estimate the win probability of a position from millions of games of self-play.
As you said, you are biased because you are a poker player. In poker you know all the cards and can calculate every outcome; in chess this doesn't work, and all the problems that you listed for the centipawn measure are the same for the percentage measure.
Why should a -1 to +1 measure be better than a -inf to +inf one? Numbers are just numbers; the only thing that matters is that the bigger the number, the better it is.
When I see an evaluation of +3 I think: "white has an advantage as if he had 3 more pawns, and he will win with perfect play".
When I see an evaluation of 60% I think: "ok now?"
People don't evaluate in percentages but in material, and even with all the problems you listed, the centipawn method is the closest you can get.
Stockfish already does give an evaluation that is a probability; it is just common to convert that to centipawns. (Or am I wrong on this?)
I think I read not long ago Matthew Sadler saying "+3 is the new +1" (or something similar), pointing to the fact that engine evaluations have changed recently. IMHO, for 99% of chess players, most positions in which Stockfish (SF) gives a +-1 score are perfectly playable if the player understands the position and is comfortable with it (vs a player of similar strength). I wonder if an engine from, say, 5 years ago would be more useful to a "normal" chess player than the latest SF.
It was interesting to read the formula converting between centipawns and Q. Does a similar one exist for SF?
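For reference, here is a sketch of the Leela-style conversion as I remember it being commonly quoted for Lc0. The two constants are quoted from memory and should be treated as assumptions, and the function names are mine. This is not Stockfish's conversion, which as far as I know is defined separately.

```python
import math

# Assumed constants from the commonly quoted Lc0 Q <-> centipawn mapping.
SCALE = 111.714640912
SLOPE = 1.5620688421


def q_to_cp(q: float) -> float:
    """Convert Q in [-1, 1] to centipawns: cp = SCALE * tan(SLOPE * Q)."""
    q = max(-1.0, min(1.0, q))  # clamp so tan() stays on one branch
    return SCALE * math.tan(SLOPE * q)


def cp_to_q(cp: float) -> float:
    """Inverse mapping: Q = atan(cp / SCALE) / SLOPE."""
    return math.atan(cp / SCALE) / SLOPE


if __name__ == "__main__":
    for q in (0.0, 0.25, 0.5, 0.75, 0.9):
        print(f"Q = {q:+.2f} -> {q_to_cp(q):8.0f} cp")
```

The tangent shape is what makes centipawn numbers explode as Q approaches 1, even though the win probability is barely moving.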