Evaluating Chess Positions

The good, the bad, and the unclear

In the dawn of time these were the symbols we used to evaluate a chess game:

+−  White is winning
±   White is clearly better
⩲   White is slightly better
=   the position is equal
⩱   Black is slightly better
∓   Black is clearly better
−+  Black is winning
∞   the position is unclear

Much like ancient hieroglyphics, their precise meaning is difficult to discern from a modern vantage point. The glyph for an unclear position (∞) is especially mysterious. What are we to make of the fact that it is also the symbol for infinity? Is this a hint that the beauty of chess lies in the unknown?

In practice ∞ was most often used when the commentator didn’t know what was going on and didn’t especially feel like figuring it out. If we were going to choose a symbol for unclear today, we’d probably go with ¯\_(ツ)_/¯.

Fortunately that’s not necessary because these days if you don’t know what’s going on you can always fall back on the computer evaluation. Indeed, chess analysis has become more and more computerized, and when you talk to computers you need to speak in numbers. Thus the centipawn evaluation has become the standard, with larger positive numbers representing a bigger White advantage and negative numbers an advantage for Black, such as +1.6 or −0.9.
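Concretely, most engines report this number in centipawns over the UCI protocol, inside their "info" output lines. A minimal sketch of extracting it and converting to the familiar pawn units (the example line is illustrative, not output from any particular engine):

```python
import re
from typing import Optional

def parse_uci_eval(info_line: str) -> Optional[float]:
    """Extract a centipawn score from a UCI 'info' line and
    convert it to pawn units (100 centipawns = 1 pawn)."""
    match = re.search(r"score cp (-?\d+)", info_line)
    if match is None:
        return None  # mate scores etc. are reported differently
    return int(match.group(1)) / 100.0

# An illustrative engine line:
line = "info depth 20 score cp 160 nodes 123456 pv e2e4"
print(parse_uci_eval(line))  # → 1.6
```

Note that the sign is conventionally from White's point of view in analysis output, which is why +1.6 reads as a White advantage.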

These numerical evaluations, however, are not as clear-cut as they seem. While they form an important part of the computer’s workings, it’s not at all clear what they actually mean. They’re what we’d call unnormalized data: they’re not tethered to any real-world quantity.

“Aha,” you say, “but they are normalized, one point is equal to one pawn.”

Well yes, in the computer code one pawn is typically held constant at one point, but does it really make sense to say a +9 advantage is “like being up nine pawns”? Those of us who work with engines have gotten used to pretending this makes sense. As Jonathan Rowson says in The Seven Deadly Chess Sins, “we are asked to compare pieces with vastly different ‘personalities’ and which we value in different ways in different positions not only amongst themselves but also, ultimately, to pawns.”

When DeepMind created the neural network chess engine AlphaZero, they were effectively starting from scratch, and took the opportunity to build in an evaluation that makes more sense from a modern engineering perspective. AlphaZero evaluates positions on a 0–1 scale representing expected points, which is similar to win probability but also takes draws into account. Basically it says, “If a win is worth 1 point and a draw 0.5, how many points would White score from this position on average if we played it out many times?”
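The definition above is simple arithmetic, and people often bridge the two scales with a logistic squash from centipawns to expected score. A small sketch of both; the steepness constant k below is an illustrative assumption, not a standard value:

```python
import math

def expected_score(p_win: float, p_draw: float) -> float:
    """Expected points for White: a win scores 1, a draw 0.5."""
    return p_win + 0.5 * p_draw

def cp_to_expected_score(cp: float, k: float = 0.004) -> float:
    """A logistic mapping from centipawns to a 0-1 expected score.
    The steepness k is an assumed, illustrative value."""
    return 1.0 / (1.0 + math.exp(-k * cp))

print(expected_score(0.4, 0.5))  # → 0.65
print(cp_to_expected_score(0))   # → 0.5 (a level position)
```

The logistic shape captures something the raw centipawn number hides: going from +1 to +3 matters far more than going from +9 to +11, where the game is already decided.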

Compared to the centipawn evaluation, this one has a much clearer definition, but it still conceals some ambiguities. In particular, expected points for whom? As in, who exactly is playing out all these hypothetical games? Since it’s trained using self-play, the answer is that the evaluation is based on AlphaZero playing itself. However, I’m not AlphaZero and neither are my opponents, so it’s not clear how relevant this evaluation is for me. For example, the computer gives White a near-winning advantage in many lines of the Benko Gambit, but go play ten blitz games with White against the Benko and let me know how it goes.

What we could really use is an evaluation that is clearly defined, relevant, and takes into account human fallibility. Such an evaluation could perhaps be built using Monte Carlo simulations, playing out the position over and over, not with full-strength AlphaZero but with an engine that simulates human play including human mistakes. While efforts are underway to develop such engines, the path to perfecting human-like play and leveraging it to create a human evaluation remains ∞.
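Sketched in code, such a Monte Carlo human evaluation might look like the following, where simulate_human_game is a hypothetical stand-in for a human-like engine playing the position out (nothing here is an existing API; the simulator just returns random results for illustration):

```python
import random

def simulate_human_game(position: str, rng: random.Random) -> float:
    """Hypothetical stand-in for a human-like engine playing out
    `position`: returns 1.0 (White win), 0.5 (draw), or 0.0 (loss).
    A real version would model human move choices and mistakes."""
    return rng.choice([1.0, 0.5, 0.0])

def human_expected_score(position: str, n_games: int = 10_000,
                         seed: int = 0) -> float:
    """Monte Carlo estimate: White's average score over many
    simulated human games from the given position."""
    rng = random.Random(seed)
    total = sum(simulate_human_game(position, rng)
                for _ in range(n_games))
    return total / n_games

print(human_expected_score("some Benko Gambit position"))
```

The appeal of this design is that the number would finally mean something concrete for players like us: not “nine pawns,” not “AlphaZero versus AlphaZero,” but how humans actually score from here.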