Is Perft Speed Important?

The perft routine in Maverick is almost fully debugged!  I believe this is an important milestone in any chess engine’s development.  It’s really the first time the code “plays” chess, in the sense it generates moves, make and un-makes the moves, and iterates up and down the search tree.

I also think the speed of the perft routine is significant and is a measure which can be used to compare engines.  Now some people will disagree with me but here is my logic:

Validation of Structure:  The structure of a chess program could be defined as how the board is represented and the way moves are generated and stored.  This impacts the overall speed of an engine.   A perft routine is a good measure of manipulating this chess board structure.  Clearly there are other factors, most notable the size and complexity of the evaluation function.  But I would even argue a fast perft implementation is a indicator of a chess structure which can support a good evaluation function.  If you think about it, any half decent evaluation function must iterate over the squares and find some proxy for mobility.   At a basic level this is what is happening in a perft routine.  So I would argue a fast perft speed is an indication of a solid foundation upon which a strong chess engine can be developed.

Difficult to “Fiddle”:  Some people talk about an engine being “fast” based on the self reported measure of “number of nodes processed per second’.  The problem with nodes per second as a measure of speed is the definition of a node.  There is no standard definition of a node.  Some engine authors definite it as a call to the “alpha-beta” routine, while other base it around the make / undo move routines, and then again others use the “generate-move” procedure.  There isn’t one standard.  And the measure can easily be fiddled.  This is not the case with perft.  There are only really two ways to count the number of nodes; one being the number of leaf nodes, and the other is the total nodes (internal and leaf).  It would seem the standard measure is the number of leaf nodes.

Having said it’s difficult to fiddle there are three distinct approaches to perft routines and each one impacts the speed:

Make & Unmake All Moves: Most chess engines generate pseudo legal moves.  These are moves which are generally legal but may expose their king to a discovered attack and so are not actually level.  The reason chess engines generate pseudo legal moves is to save the time checking to see if there is a discovered check (which is costly) and may not even be required is there is a “cut-off”.  So the test for the discovered check is often carried out as part of the “Make-Move” routine.  The simplest perft implementation simply iterates through the depths generating the moves, make and unmaking each move in turn and counting the number of nodes.  This is the slowest type of perft implementation.  I regard this type of perft as a measure of the efficiency of the make and unmake routines, since this is the task which is carried out the most.

Counting Legal Moves at The Leaf: In contrast to the first method, another approach is to generate only truly legal moves.  In this approach a lot of time can be saved at the leaf nodes by simply returning the number of moves generated (without having to make and unmake each one).  The cost is a slightly more complex move generator which must detect potential discovered check.  In general this approach will be quite a bit faster than the first approach.  I regards this approach as a measure of the efficiency of the move generation routines.

Hashed Moves: In a perft search many transpositions occur.  This means the whole search can be significantly sped up by hashing the results and storing them in a hash table.  If the position reoccurs in the tree, the number of nodes in the sub-tree can be retrieved from the hash table and there is no need to search the sub tree.  I have not implemented this in Maverick.

Parallel Search: This is not something I have implemented in Maverick but the speed of the perft routine could be improved by implementing a parallel multiprocessor search.

Initial Perft Speed Results:

Based on the above logic I was eager to see how fast the new bitboard structure is compared to Monarch, which used a letter-box data structure. Monarch isn’t by any measure a fast searcher, so I was hoping for a speed improvement.  The position I used is from:

http://chessprogramming.wikispaces.com/Perft+Results

perft chess

FEN: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq –

The number of leaf nodes to a depth of 5 ply deep is 193,690,690.  Monarch’s perft routine is of the “make / unmake all moves” type.  It managed to crank out the six five ply perft for the above position in 71.5 seconds on my Core i7 2670QM running a 2.2 GHz.  To my surprise Maverick blew this away with a time of only 16.3 second.  This is 4.3 times speed up – woohoo!!  Maverick’s legal move generator approach completed perft 5 in exactly 5.0 seconds.

Both engines were compiled using Microsoft Visual Studio Express 2012 in 32 bit mode.  I imagine there will be a reasonable speedup when I move to 64 bit and add the SSE bit-twiddling routines.  This is a much bigger speed-up than I anticipated and illustrates the superiority of the magic bitboard approach.

What type of perft routine does your engine employ and how fast is it?

  • Fred

    black screen is running, white screen is while tracing it

    http://pages.videotron.com/spirch/perft/perft.png
    http://pages.videotron.com/spirch/perft/fullperft.png

    c# engine with array, 0x88 implementation

    • Fred

      oops forgot this; windows 7 x64, Intel 2500k overclocked at 4.4ghz

  • Steve Maughan

    Good stuff Fred. I assume the first one is where you count the legal moves on the last ply without making them – and the second one is make / unmake all moves. Is this correct? Good times for a c# program.

    • Fred

      exactly, i might try to do a bitboard myself when i will have “more” time to see the difference

  • Fred

    in your dropbox, could you provide the solution file with the source code?

    • Steve Maughan

      Hi Fred – it’s in test.c,

      Steve

      • Fred

        what I meant is the visual studio solution file to compile maverick, I tried to make one myself based on your source code but I failed.

        • Steve Maughan

          Hi Fred – I’ll expose my solution file but the DropBox view is a literal link to my source code. So if I’m tinkering with something it might not compile!

          • Fred

            could you include quick readme.txt to describe how to build maverick? I will tell you that i never made a c/c++ solution in visual studio. so far i tried from command line to simply build it / creating win32 console / creating clr console / even an empty solution.
            I was not able to compile it.

          • Steve Maughan

            Hi Fred – I’ve zipped the MSVC solutions and put it on the download page:

            http://www.chessprogramming.net/downloads/

            I’m not an expert with “C” so I hope it works for you. I’ve only managed to get it working using Express 2012.

  • My 0x88 based move generator is pseudo-legal – it doesn’t currently check if the move would put the king in check. It’s single threaded and doesn’t make use of hash table.

    I have a version that uses lookup table for queen, rook and bishop move generation – it runs slower than the non-lookup table version.

    For the above position on a intel core i7 2600k:

    088 Move Generator:
    Perft 5: 201746069
    Time taken: 1.30195 seconds, nps: 154956409

    Lookup table based Move Generator:
    Perft 5: 201746069
    Time taken: 1.58924 seconds, nps: 126945181

    088 Move Generator:
    Perft 6: 8630103144
    Time taken: 63.5041 seconds, nps: 135898423

    Lookup table based Move Generator:
    Perft 6: 8630103144
    Time taken: 72.847 seconds, nps: 118468823

    The perft counts are wrong because my program doesn’t check for full legality.

    • Steve Maughan

      Hi Ankan – it looks as if your engine has the potential to be very fast. However, if I were you I’d implement the necessary checks to ensure the nodes match. It isn’t really a valid comparison until you do so, since those checks can be expensive.

      Steve

      • You are absolutely right. I added the checks (which is currently somewhat brute-force), and now the move generation speed is almost an order of magnitude slower than before. Also, I found around half a dozen bugs by comparing the node counts. After 2 days of painful debugging finally I think I have fixed all the bugs and now will try to improve the performance of my move generator.

        • I finally completed my bitboard based move generator (sliding pieces attacks generated using kogge-stone approach).
          I took many ideas from the chess programming wiki (really great resource) and few tips from frcperft source code (http://members.ziggo.nl/allard.siemelink/spark/index.html).

          I am quite happy with the speed of my new move generator, for the above test position:

          move generation with make move (I use copy-make):
          Perft 5: 193690690, Time taken: 2.76645 seconds, nps: 70014146

          Count only (like frcperft my program only counts the moves without even adding them to movelist – so it’s kind of cheating):
          Perft 5: 193690690, Time taken: 0.86337 seconds, nps: 224342650

          The tests were run with a 64 bit build on a core 2 qx9650.

          My program is available at: https://github.com/ankan-ban/perft_cpu

          • Steve Maughan

            Hi Ankan – that’s an awesome speed! Well done!

  • Thomas Petzke

    Hi Steve, I have a legal move generator and use a make/unmake scheme (instead of copy make). I’m not so obsessed with the perft speed because in real play in the most cases you only generate captures (QS) or play the hash move (no move generation at all). But of course perft is a nice toy.

    On my i7 (2.4 GHz) ice makes in your position

    normal perft
    perft Nodes Time
    5 193.690.690 1.076 sec

    perft generating tactical and non tactical moves separately
    perftTNT Nodes Time
    5 193.690.690 1.295 sec

    and when the hash table is used
    perftTT Nodes Time
    5 193.690.690 0.592 sec

    Thomas…

    • Steve Maughan

      Hi Thomas – Ice is fast! Is this 64 bit with SSE 4.2 popcount and bitscan functions?

      I agree that you shouldn’t be obsessive about perft, but nevertheless it’s a valid point of comparison.

      Steve

      • Thomas Petzke

        Yes, those numbers are for the 64 bit version with SSE 4.2 (but I don’t think I use popcount a lot while generating moves). My 32 bit version uses inline assembler for finding the least significant bit, which is also rather fast.

        On my old Intel Core-2 2.4 GHz 32 bit, iCE needs 4.39 sec for the above perft 5.

        I use my usual board::makeMove function for perft and so I include some stuff that would not be really necessary for perft, like updating the hash keys, determination whether in the position the king is in check, updating the repetition table etc… On the other hand I only count the moves at the leaves and don’t make them.

        I remember when I started iCE two years ago I invested quite some time in speeding up the move generator, so when I say I’m not obsessed with perft, this means not obsessed anymore. I definitely was two years ago 🙂

        Thomas…

  • Dusan

    Hi, nice article.

    but I have a doubt.

    “Maverick’s legal move generator approach completed perft 5 in exactly 5.0 seconds.”
    “Maverick blew this away with a time of only 16.3 second” (perf 6)

    perf 5 = 193,690,690 combinations
    perf 6 = 8,031,647,685

    perft 6 has ~ 41.5 times more nodes and it only takes 3 times more time? That is strange.

    for my java chess moves generator,
    perf 5 = 6.7 s
    perf 6 = 250 s
    which is really aproximatelly linear (constant speed in nodes per second, 37 time more time than perf 5)

    ?

    • Steve Maughan

      Hi Dusan – I’m glad you liked the article.

      Good spot – there was an error. The times area all perft 5. Monarch did a “make/unmake” perft 5 in 71 seconds. Maverick did the same in 16 seconds. Maverick also did a “legal move generation” (i.e. without making and unmaking at the tips) perft 5 in 5 seconds. I’ve now edited the article to correct the error.

      Steve

      P.S. I think the latest 64 bit version does the same perft 5 in 1.7 seconds.

      • Dusan

        🙂 cool.

        I know perft is just a toy, but quite enjoyable.
        So I improved my algorithm and made it multithreaded, posting results below. (not sure if Maveric uses multiple CPU as well)
        Not bad for a Java app, that’s probably all I can do right now.

        FEN: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq –

        Generated 2 plies. Performance: 0.006s, 339833 nodes / seconds. Total combinations: 2039
        Generated 3 plies. Performance: 0.051s, 1918862 nodes / seconds. Total combinations: 97862
        Generated 4 plies. Performance: 0.098s, 41689826 nodes / seconds. Total combinations: 4085603
        Generated 5 plies. Performance: 2.471s, 78385548 nodes / seconds. Total combinations: 193690690
        Generated 6 plies. Performance: 118.801s, 67605892 nodes / seconds. Total combinations: 8031647685

        system info:
        Java(TM) SE Runtime Environment (build 1.7.0_25-b15)
        Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)
        Linux 3.10-ARCH #1 SMP PREEMPT Fri Jul 26 11:26:59 CEST 2013 x86_64 GNU/Linux
        Intel(R) Core(TM) i7-2860QM CPU @ 2.50GHz

        • Steve Maughan

          Hi Dusan – I think that’s quite fast for Java. Maverick isn’t multi-threaded but does about 80 mnps on equivalent hardware. I think the fastest perft engines are doing 200+ mnps!

  • Tapani

    Just a question for comparing perft values — how do you guys count?
    – For depth 1, do you just enumerate the number of moves, or actually make/unmake them?
    – Multi-threaded?
    – Hashing?
    – Full legality?
    – Additional information tracked (like, if moves gives check)

    My umpth (lost count!) attempt at a chessboard class now performs at ~45Mn/s (full legality, all info, full make/unmake, single threaded, no hashing) on a 3.5GHz i7 (64-bit). With 32bit I get roughly half the speed. My exact figure is 45150554n/s for the example pos, depth 5.

    Getting to 200Mn/s (20 clock cycles per generate+make+unmake feels unreal)!

    • Steve Maughan

      Hi Tapani,

      Here’s a rundown of my answers

      – For depth 1 I test for legality but don’t actually make / unmake – if in check I just count as it’s a legal out-of-chech move generator
      – Not multi-threaded
      – No Hashing
      – Full legality – yes, only legal moves are counted but the move generator is pseudo legal
      – Yes, other necessary info is tracked (in check, ep, castling rights etc.)

      On my 2.8 GHz i7 I get about 115 mnps. This turned into about 8.5 mnps for real chess when the evaluation was just piece-square tables (i.e. version 0.05)

      Steve

    • At depth 1 I just return the move count. As my move generator only generates legal moves there is no point in making them. I’m single threaded. I check whether in the resulting position the enemy king is in check. I also update the hash keys (main, pawn, material) which are not really required for perft. I have a normal perft and I have a perft that uses the transposition table (not to speed up perft but as a unit test to check whether the TT works correctly). The hash table perft is of course much faster (> factor 2).

      I also have a perft the checks the incremental move generation for tactical and non tactical moves, so if you generate captures/promotions and non captures separately you don’t miss a move or generate one twice. It is about 20% slower than the normal perft where all moves are generated at once.

      perft is really a useful debugging thingy.

      Thomas…

      • Tapani

        Thomas and Steve,

        thank you. It becomes clearer for me that perft testing is really not a way to benchmark engines. Everyone does things differently. Perft/divide is as you say invaluable tools to find move generation problems.

        Regarding no point of making moves at depth 1 — that depends what you are trying to perf test. Actually doing them adds more weight make/unmake to the total speed, while not doing them weighs move generation more.

        Anyway, seems my performance is adequate for start implementing search and evaluation to go with the board class… 🙂

        • Steve Maughan

          Hi Tapani – you’re right, you certainly have adequate performance to start implementing search and evaluation. If you’re doing full make / unmake then your speed is fine. I still think perft is a valid indicator of the underlying efficiency of the data structure. On your hardware, if you have a perft speed of 25 mnps I would say the data structure could certainly be improved. This was the case for me with Monarch.

          Anyway, good luck with your engine!

          Steve