You all know I’m a maniac when it comes to hunting down bugs. I’ve posted about it before (Bugs-Bugs-Bugs). I suspect many engines don’t get past the 2200 ELO mark simply because they are buggy.
I hit the same issue several times: different node counts between Debug and Release, between x32 and x64, and between the Intel and MSVC compilers. It was painful to hunt down, but in the end there was always a bug.
Comparing node counts using different compilers is an interesting idea, and one I haven’t previously tried. Since I’ve spent many hours perfecting my perft node counts I was quite sure Maverick would sail through this test.
I was wrong!!
I set up the test and used one of the complex perft positions as an example. I then created debug and release versions of Maverick. To my horror when performing a fixed ply search there were different node counts for the two versions.
How could this be? I know the perft node counts are all ok, so my move generation must be ok – right? Doubt starts to creep into my thinking.
Here’s how I tackled the problem.
First I found the minimum depth which triggered a difference in node counts. In my case this was the 10th ply of a late middlegame position.
Then I created a small function which writes a position’s vital statistics (FEN, hash key and move to be played) to a text file. I added a call to this function during the “make_move” routine, just before I make every single move. Effectively I’m writing the whole search tree to a text file.
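A hook like this can be quite small. Here's a minimal sketch of such a tracing function; `open_trace`, `trace_position` and the FEN/hash/move arguments are stand-in names of my own, not Maverick's actual API:

```c
/* Sketch of a search-tree tracing hook. The engine would call
   trace_position() from make_move, just before each move is made,
   passing its own FEN string, hash key and move text. */
#include <stdio.h>
#include <stdint.h>
#include <assert.h>

static FILE *trace_file = NULL;

void open_trace(const char *path) {
    trace_file = fopen(path, "w");
}

/* One line per node: FEN, hash key, move about to be played. */
void trace_position(const char *fen, uint64_t hash_key, const char *move) {
    if (trace_file)
        fprintf(trace_file, "%s %llu %s\n",
                fen, (unsigned long long)hash_key, move);
}

void close_trace(void) {
    if (trace_file) {
        fclose(trace_file);
        trace_file = NULL;
    }
}
```

Because the hook sits in `make_move`, every node in the search lands in the file in visit order, which is exactly what makes a line-by-line diff of two runs meaningful.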
I was then able to run the two versions, which created the two text files containing the search tree. I renamed them debug-tree.txt and release-tree.txt. Both files were about 35 MB in size.
To search by hand through these files would be a boring and tedious task. I’m not a Linux user, but I believe Linux has a “diff” command which finds the differences between two files. Windows doesn’t come with anything similar. However, there is a useful utility called WinMerge which does the same thing. It’s also free.
When I compared the two files this is what I saw:
The yellow sections show the variation. As you can see the first variation in the tree was at line 75,416! The release version seems to go back to the position after evaluating the position with hash key of 15560305028174901452 and play the same move again (e7e4), whereas the debug version plays another move (f3e3). It looks like the release version fired a cutoff, whereas the debug version didn’t. Here’s the critical position after f6e4:
It doesn’t look like anything special. I then inserted a break to stop the search when it encounters the position (i.e. when there is a hash key match).
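The break itself is just a conditional check on the hash key, somewhere the search visits every node. A sketch, with `hit_target` as a hypothetical helper and the hash key taken straight from the trace diff:

```c
/* Sketch: halt the search when a known-problematic position is reached.
   TARGET_HASH is the key identified from the trace-file diff; the real
   engine would pass in its incrementally updated position hash. */
#include <stdint.h>
#include <assert.h>

#define TARGET_HASH 15560305028174901452ULL

static int hit_target(uint64_t hash_key) {
    if (hash_key == TARGET_HASH) {
        /* Set a debugger breakpoint here, or assert(0) to stop hard. */
        return 1;
    }
    return 0;
}
```

With the breakpoint hit in both builds, each version's evaluation of the identical position can be compared side by side.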
Sure enough, the two versions evaluated the same position differently. The bug must be in the evaluation!
Since I knew the problematic position, finding the bug proved to be straightforward. I stepped through the evaluation, comparing the results at each point until I tracked down the bug.
It turns out the bug was in the connected passed pawn code, which was evaluating the white pawns on f2 and g3. The buggy code was the following:
//-- Connected passed pawn mask
connected_pawn_mask[s] = king_mask[s] & neighboring_file[s];
neighboring_file is a pre-calculated array which returns the mask of the neighboring files. However, it takes the file number as an input, not the square. This was the bug. The corrected code is as follows:
//-- Connected passed pawn mask
connected_pawn_mask[s] = king_mask[s] & neighboring_file[COLUMN(s)];
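To see why the original line was wrong, it helps to look at how such a table might be built. The sketch below is my own illustrative construction, not Maverick's actual code: the table has only eight entries, one per file, so indexing it with a square (0–63) reads far past the end of the array and returns garbage masks.

```c
/* Illustrative construction of a neighboring-file lookup table.
   FILE_A and init_neighboring_files are hypothetical names. */
#include <stdint.h>
#include <assert.h>

#define COLUMN(s) ((s) & 7)   /* square 0..63 -> file 0..7 */

static const uint64_t FILE_A = 0x0101010101010101ULL;

uint64_t neighboring_file[8];  /* 8 entries: indexed by FILE, not square */

void init_neighboring_files(void) {
    for (int f = 0; f < 8; f++) {
        uint64_t mask = 0;
        if (f > 0) mask |= FILE_A << (f - 1);  /* file to the left */
        if (f < 7) mask |= FILE_A << (f + 1);  /* file to the right */
        neighboring_file[f] = mask;
    }
}
```

Indexing with `neighboring_file[s]` compiles cleanly in C, which is why the bug survived perft: move generation never touches this table, only the evaluation does.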
Once corrected I created and compared the MSVC Release, MSVC Debug and MinGW x64 versions and they all produced the same node count!
This isn’t a technique I’ve used before. I always assumed perft would track down search errors, and flipping the board and comparing white and black evaluations of the flipped position would find evaluation errors. I was wrong. This will be another technique I’ll put in my debugging toolbox.
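For reference, the flip-the-board symmetry test mentioned above looks something like the following sketch. The `Position`, `mirror` and toy material-only `evaluate` here are stand-ins of my own; a real engine would mirror every piece bitboard and all castling/side-to-move state, then check that the score negates exactly:

```c
/* Sketch of an evaluation symmetry test: mirror the board vertically,
   swap the colours, and the evaluation should negate. */
#include <stdint.h>
#include <assert.h>

typedef struct {
    uint64_t white_pawns;
    uint64_t black_pawns;
} Position;

/* Vertical flip of a bitboard (rank 1 <-> rank 8, etc.).
   __builtin_bswap64 is a GCC/Clang builtin; each byte is one rank. */
static uint64_t flip_vertical(uint64_t bb) {
    return __builtin_bswap64(bb);
}

/* Colour-swapped mirror image of the position. */
static Position mirror(const Position *p) {
    Position m;
    m.white_pawns = flip_vertical(p->black_pawns);
    m.black_pawns = flip_vertical(p->white_pawns);
    return m;
}

/* Toy evaluation: pawn-count difference from White's point of view. */
static int evaluate(const Position *p) {
    return __builtin_popcountll(p->white_pawns)
         - __builtin_popcountll(p->black_pawns);
}
```

As the post notes, this test only catches asymmetries between the white and black code paths; a term that is wrong in the same way for both colours, like the connected passed pawn mask above, sails through it.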
Is your node count the same for different compilers?