Automatic Tuning of Evaluation Function

When it comes to Maverick’s evaluation function I’m frustrated and excited in equal measure!

My focus over the last couple of months has been to improve Maverick’s evaluation function.  However I’ve found it quite difficult to improve on Maverick 0.51’s actual playing strength.  Adding extra knowledge (e.g backward pawns) seems to add little in terms of playing strength. I’m struck by the massive number of ways to encode chess knowledge.  I feel like a blind archer as I make wild guesses trying to calibrate the parameters.

There must be a better way!

Early on in Maverick’s development I came across Thomas Petzke‘s approach to tuning evaluation function. He uses a form of genetic algorithm (PBIL) to tune the parameters.  PBIL optimization algorithms are really neat – they represents each number in binary format. Each bit of each number is represented as a floating point value between zero and one. As the system “learns” the floating point values are “trained” and gravitate to either zero or one based on the training data. In Thomas’ case he played a few games of chess to assess the fitness of a set of parameters. This is expensive – but ultimately game-playing-ability is the attribute we’d like to optimize.  Sp maybe the training time is justified.

Back in 2000 I worked a lot with standard genetic algorithms. I used them to evaluate marketing campaigns. I think PBIL may be even better for evaluating marketing campaigns (but that’s a story for another day).  I’m certainly interested in using them to tune Maverick’s evaluation function. The only problem is Thomas’ method take ages to complete (weeks!). I’d prefer a method which is quicker.

Then I came across a post on CCC by Peter Österlund:

How Do You Automatically Tune Your Evaluation Tables

Peter outlines a new way to tune an evaluation function. His approach takes 2.5 million chess positions and minimizes the following fitness function:

E = 1/N * sum(i=1,N, (result[i] – Sigmoid(QuiescientScore(pos[i])))^2)

This is really quite interesting for the following reasons:

  • Since we’re not playing complete chess games this runs *much* faster – maybe less than one day of computing time
  • The sigmoid function is *really* sensitive in the -100 to +100 centipawn range. This is a critical region where virtually all games are decided. If we can create an evaluation function which accurately evaluated this range then we’re likely to have a strong chess engine
  • I suspect Houdini uses a similar technique to calibrate its evaluation function since it’s evaluation is linked to the probability of winning. Robert Houdart mentions this on his website, “Houdini 4 uses calibrated evaluations in which engine scores correlate directly with the win expectancy in the position. A +1.00 pawn advantage gives a 80% chance of winning the game against an equal opponent at blitz time control. At +2.00 the engine will win 95% of the time, and at +3.00 about 99% of the time. If the advantage is +0.50, expect to win nearly 50% of the time
  • Some people have tested this approach to training have reported good results
  • When tested the pawn evaluation parameters (resulting from the PBIL optimization) they have varied considerably between the middlegame and endgame. The middlegame value of a pawn is 50 centipawns, while the endgame value is +145 centipawn. If these values are robust and uses in normal play they are likely to produce exciting chess where the engine is happy to sacrifices a pawn for a modest positional advantage. This sounds like the recipe for an interesting engine – which is one of the goals for Maverick!

So I’m in the process of rewriting Maverick evaluation code to accommodate the PBIL algorithm. I’m also writing a PGN parser (in Delphi) so I can train the evaluation function using different sets of training positions.

With all of this re-writing it may be a little while before I can release a new version of Maverick. My aim is still to take Maverick to 2500 ELO using only an improved evaluation function!

I’ll keep you posted!

Maverick 1.0 To Do List…

It’s been two months since I did any significant chess coding – life got in the way! I did tinker here and there but nothing special.  However, Christmas is around the corner. I’m hoping to have some time to relax with a IPA beer and do some coding.

I thought it would be worthwhile spending some time to think about what I need to do to create version 1.0 of Maverick. I’ve said from the start, version 1.0 will have a playing strength of 2500 ELO (i.e. Grandmaster level) on the CCRL scale. So I’ll release Maverick 1.0 when I can beat all of the following opponents in a 100 game match at 5 mins per move:

  • Tao 5.6 – I’ve admired Bas Hamstra’s engine for a long time – I hope one day he continues the development
  • YACE  – another classic
  • Anmon 5.6
  • Hermann 2.8 64 bit 

I tested Maverick 0.51 at fast time controls. One minute per game was the slowest time control I used. In my test this gave Maverick a rating of about 2370 ELO. However, in the CCRL Division 6 tournament Maverick struggled. I thought it would be in the top half of the table. But it finished one off the bottom! I suspect this is due to the time controls. Maverick is quite a fast searcher but I intentionally haven’t put any real selectivity into the search apart from null move. This probably results in a higher branching factor than other engines with Late-Move-Reduction. At super fast time control Maverick may do well due to its raw speed, but its higher branching factor holds it back at the slower time control. When version 0.51 is tested by CCRL I suspect it will be 100 ELO lower than my tests i.e. about 2270 ELO. This mean I have a long way to go to reach 2500 ELO.

Before I set out my to-do list for version 1.0 it might be helpful to go over what Maverick already contains i.e. the current state-of-the-engine. Here’s a reasonably comprehensive list:

  • Magic bitboard engine
  • Move dictionary architecture
  • Hash tables (4 slots structure)
  • Killer Moves
  • Evaluation function with the following terms:
    • Middle-game to Endgame interpolation
    • Piece square tables
    • All basic endgames (e.g. B + N +K vs. K)
    • Mobility
    • Passed pawns (a major impact on strength)
    • Basic pawn weaknesses (extremely rudimentary)
  • Null Move (R = 3)
  • Static-Exchange-Evaluation for move ordering and quiescent move pruning
  • Checks in the first ply of of the quiescent search
  • Extend on one-reply-to-check on the first ply of quiescent search
  • Internal Iterative Deepening
  • Enhanced Transposition cutoffs
  • Opening book
  • Pondering (really easy using the UCI protocol)

So I’d say Maverick isn’t too complex and doesn’t contain anything different.

Here’s the list of to-do’s for version 1.0:

  • CLOP tuning
  • Revisit some parameters for tuning:
    • Piece-Square-Tables
    • Mobility
    • Passed pawns
  • Add king safety
  • Add potential passed pawn terms to the evaluation
  • Add pawn storm code when castling on opposite sides
  • Add bad exchange terms to the evaluation (and tune)
  • Change the maximum depth measure to exclude the quiescent search (I think?)
  • Use magic multipliers to calculate mobility and center control
  • Improve knight placement and isolation code
  • Add pawn and king endgame code (this could be a big strength gain)
  • Add special endgame code for K + R vs. K + N (inspired by Thomas Pestzke)
  • Tweak move ordering:
    • Evade check move history
    • Null Move capture-evade
  • Multiple PV
  • Support for Chess960
  • Possibly MAC and Linux support (need to get a development environment up and running for both)

I’m hoping this list will be enough to push 2500 ELO – let’s see!

Have I missed anything?

Adding Endgame Knowledge

If you test the first version of Maverick (v 0.05) you’ll soon find out it cannot mate at fast time controls with rook and king vs. a lone king. Quite shocking!  Despite gnawing its way through 9 million positions per second, the mate is too deep for an engine without hash tables.  So it just pushes its king and rook around the board aimlessly.  Obviously it needs help.  

I could add a few “if-then” statements to handle these cases.  This would work well if there are only a few standard endgame.  But I’d like to create a scalable solution where I can add all sorts of knowledge.  I posted a questions on the Computer Chess Club and Steven Edwards pointed me in the rights direction – material hash tables.  

Material hash work like this.  Each position has a material hash value which is calculated based on the number of each type of pieces on the board.  So when a piece us added or taken away the material hash value is updated.  Unlike a normal zobrist hash value the location of the piece is not relevant.

When the evaluation routine is called, one of the first tasks is to check to see if the material hash of the position matches the signature of a known endgame.  If there is a match the endgame specific routine is used to evaluate the position.  Here’s the code to evaluate white rook and king vs. long black king:

void known_endgame_RKvk(struct t_board *board, struct t_chess_eval *eval)
eval->static_score = lone_king[board->king_square[BLACK]] + 800 - 10 * square_distance(board->king_square[WHITE], board->king_square[BLACK]);
eval->static_score *= (1 - board->to_move * 2);

Adding knowledge is also really easy.  Here’s the routine to add KBN vs. K:

//-- R + K vs. k
material[WHITEROOK] = 1;
key = get_material_hash(material);
index = key & material_hash_mask;
if (material_hash[index].key != 0)
return FALSE;
material_hash[index].key = key;
material_hash[index].eval_endgame = &known_endgame_RKvk;

Cool stuff!

Time to Create the Search Routines!

I had planned to have a reasonably solid evaluation function in Maverick from “day 1”.  It looks as if I’ll need to backtrack a little.  

I had imagined I’d put all of the gubbins of the evaluation function (e.g. attack tables) in the “board” structure.  However, the drawback to this approach is the need to store changes as moves are made and unmade on the board.  

For example, if I generate all of the attack tables when I evaluate the position I’d subsequently like to use this information in my Late-Move-Reduction code.  However, if I include the attack tables in the board structure I’ll need to store and retrieve it as part of the make / unmake process.  This sounds expensive! 

So what’s the solution?

What I intend to do is to have a “ply_data” array which stores information about the current position at any give ply.  This is not unusual and Monarch has this structure.  As an example, it enable us to see what move was played two moves ago.  So I’ll move some of the information I was planning to store in the board structure, including the attack tables, over to this ply_data array.

This is a fundamental structural issue.  It concerned me that as I push ahead with the evaluation I may need to make structural changes later when I put the search routines in place. So I’ve also decided to freeze the evaluation function in it’s curent basic state and create the search routines (i.e. make Maverick play chess).  I have the structure of the evaluation function in place so (hopefully) I’ll be able to develop it quickly I have a basic search.

Using Excel to Help Create Piece-Square Tables

I’ve been thinking about Piece-Square tables.  They’re not the most exciting part of computer chess but they seem quite important (based on Ed Schroder’s experiments), so I thought I’d put some time into thinking about reasonable values.  I’ve created a simple spreadsheet in Excel to help create them.  You can download it here (Piece-Square-Tables.xlsx).  It assumes your board representation starts with A1 = 0 and H8 = 63.

Here’s how to use the Excel spreadsheet:

  • The worksheet is protected to prevent accident overwrites.  You can unlock it easily as it’s not password protected
  • Enter the values in the yellow squares on the left.  The gray squares are the mirror image of the yellow squares.
  • The table in the middle is adjusted to make the average table value zero.   This means if I set the value of a bishop to be 350 centi-pawns and put it on a random square,  the value returned by the Piece Square Table will average out to 350 centi-pawns.  The idea is to minimize interaction with other evaluation terms.
  • The column to the far right is the table formatted as a “C” array.  You can copy and paste this code.

Here’s a screenshot:


Domain Space of a Chess Evaluation Function

I’m up to my chops in evaluation code.  As I’ve said in previous posts the engines I’ve written in the past have had basic evaluation function.  I want Maverick to be different.  I’m using a number of good resources to inspire me.  The first is a book by Hans Berliner The System: A World Champion’s Approach to Chess, which takes quite a quantitative approach to chess.  I’ve found it invaluable and would recommend it to any chess programmer.

As I’m writing the code I can’t help but feel the enormity of the domain space of a chess evaluation function.  Firstly there are the values I give to each element.  For example in the middelgame, is a pawn on g4 worth more or less than a pawn on f4?  The pawn on f4 is probably worth more, but how much more – is it 2 centi pawns or 10 centi pawns?  I’ve got to make these decision for all six piece types in the middle-game and endgame.  And then there is the values I assign for a double pawn on h2 or e4?  The task is truly enormous.  Secondly there is complexity due to what I choose to evaluate and how I go about doing so.

This all reinforces the idea of a solid testing framework.  I also recent read an interview with Robert Houdart.  Of course I would expect Robert to test Houdini but I was struck by him playing 100,000 games per day!  That’s a lot of games.  This impressed upon me the need to create a testing setup (when Maverick starts to play chess).  Testing was also one of the things Fabian used to rocket Fruit up the rating lists back in 2005.  Once I get Maverick playing chess I must make it my objective to be 100% test driven.

Is a Lazy Eval a Good Thing?

Lazy evaluation is a technique often used to save time when evaluating chess position.  If a position’s true score is much higher than beta or lower than alpha the full evaluation is really unnecessary.  For example, if you’re a queen down it’s a waste of time checking to see if you have a weak double pawn on a4.  So in the right situation this approach can save a lot of time and speed up an engine.  Of course there is the risk that the lazy evaluation is way-off and the position is incorrectly evaluated.  The technique was quite popular in the 80’s (and probably the 90’s).  Ed Shoeder noted it gave him an enormous speed increase in Rebel. 

I’ve been giving the issue some thought and I’ve come to the conclusion I will not add a lazy evaluation in Maverick.  My reason is simple – I’d like the leaf nodes to be evaluated as accurately as possible.  My hunch is that by evaluating the leaf nodes as accurately as possible I’ll be able to be more selective in positions closer to the root (which will dramatically reduce the tree size).  This type of decision is part of the design philosophy of Maverick. 

I also suspect the lazy evaluation technique is less relevant in modern programs which use null move and late-move-reduction.  Back in the 80’s there was still a battle between the Spracklen’s brute force approach and Lang’s selective search.  In those days it was common to only evaluate at the leaf nodes (i.e. not evaluate the interior nodes).  In this situation the leaf nodes evaluation could have a wide range of values, some much higher than beta and others much lower than alpha.  So lazy evaluation worked well.  In contrast null move and late-move-reduction have the effect of keeping the search tree closer to alpha and beta.  Null move reductions are good at quickly weeding out obvious fail high moves (i.e. very good positions) and late-move-reductions reduce the tree size for obvious fail low positions (i.e. very bad position).  So by utilizing these techniques lazy evaluation becomes less relevant and more risky.

Of course this is just theory – let’s see how it works in practice.

What are your thoughts on Lazy Evaluation?

Chess Position Evaluation Philosophy – Part 2

Tim Ferriss of “Four Hour Work Week / Body / Chef” fame has championed the idea of Minimum Effective Dose.  This is simply the smallest does which will have the desired outcome.  In the context of computer chess evaluation I think it’s interesting. 

I would like to try to create an automatic parameter tuning framework for the evaluation function using Probability Based Incremental LearningThomas Petzke is doing some cool stuff in this area.  However the more parameters I have the more difficult it will be to tune.  So my aim is to create a Minimum Effective Evaluation.  Of course over time I’ll add to the evaluation but only when improvements are significant and proven (at least that’s the intention).

To this end I’ve been going over some articles which I have previously saved to Evernote.  This one from Ed Schroeder is of particular interest – The Value of an Evaluation Function.  Ed shows the relative importance of each evaluation term.  In summary it would seem the high level order of importance is as follows:

  1. Mobility
  2. Passed Pawn
  3. Piece Square Tables
  4. King Safety
  5. Minor Piece “Stuff”
  6. Pawn Evaluation

I’m going to use this to prioritize my effort.

Chess Position Evaluation Philosophy – Part 1

I’m now starting to think about the evaluation function.  I’m doing this at a completely different stage than for Monarch (the Delphi or “C” version) – but that’s intentional – here’s why.

Monarch was my first chess engine.  While I had been a computer chess fan (fanatic?) since the early 80’s, and liked programming, I didn’t write my first engine until 1999.  I remember being giddy with excitement when I completed the move generator and the make / unmake routines.  I don’t think I tested Monarch as rigorously using the perft methodology.  Once I was at this stage my aim was to create a chess playing entity as quickly as possible.  I slapped some piece-square tables together and bolted on a vanilla alpha-beta search to see what would happen.  To my amazement, even with this rudimentary evaluation function, the first version played “OK” (maybe 1900 ELO). 

However, I suspect a poor evaluation routine restricts and hampers improvements to the search.  I look at creating a chess program as one big optimization task.  As programmers we are doing a manual “hill climb” optimization – changing parameter, adding knowledge, improving the search, while all the time keeping improvements and throwing out changes which don’t work.  The number of dimensions and domain space for the optimization is massive.  If we are doing a hill climb optimization where you start has some impact as to where you’ll finish.  My theory is that if you start with a moderately good evaluation function (i.e. not just piece square tables), you’ll have a better chance to improve it in the long term and avoid local optimums.  It’s a hunch – I don’t have any proof that this is the case. 

So I’m trying to put all of the components of a strong engine in place from day one.  I could be completely wrong but that’s why I’m putting some effort into creating the evaluation function before Maverick has even played one move in a real life chess game.

Automatic Tuning of a Chess Evaluation Function

I’ve taken a little detour over the past week to look at Population Based Incremental Tuning.  This was something which Thomas Petzke brought to my attention.  You can read his take here:

It’s an interesting approach.  After some investigation I’ve decide to add some form of automated tuning to Maverick’s chess position evaluation function.  I think PBIL seems to hold some promise.