How do I apply rules, like the rules of chess, to a neural network, so that the network doesn't predict or train on invalid moves?
In the example of AlphaZero Chess, the network's output shape allows for all possible moves for any pieces starting on any square.
From the paper Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm:
A move in chess may be described in two parts: selecting the piece to move, and then selecting among the legal moves for that piece. We represent the policy π(a|s) by a 8 × 8 × 73 stack of planes encoding a probability distribution over 4,672 possible moves. Each of the 8 × 8 positions identifies the square from which to “pick up” a piece. The first 56 planes encode possible ‘queen moves’ for any piece: a number of squares [1..7] in which the piece will be moved, along one of eight relative compass directions {N, NE, E, SE, S, SW, W, NW}. The next 8 planes encode possible knight moves for that piece. The final 9 planes encode possible underpromotions for pawn moves or captures in two possible diagonals, to knight, bishop or rook respectively. Other pawn moves or captures from the seventh rank are promoted to a queen.
So, for example, the network is allowed to output a positive probability for the move g1-f3 even if there isn't a knight on g1, for e8=Q even if there isn't a pawn on e7, or for d1-h5 even if there is a queen on d1 but another piece blocks the diagonal.
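To make the indexing concrete, here is a rough sketch of how moves like these map to positions in the 4,672-dimensional output. The ordering of directions within the planes is my own assumption for illustration; the paper only specifies the plane counts (56 queen-move planes, 8 knight planes, 9 underpromotion planes):

# Sketch of the 8 x 8 x 73 move encoding quoted above. The ordering of
# directions within each group of planes is assumed for illustration.
QUEEN_DIRS = [(0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1)]
KNIGHT_DIRS = [(1, 2), (2, 1), (2, -1), (1, -2), (-1, -2), (-2, -1), (-2, 1), (-1, 2)]

def policy_index(from_file, from_rank, plane):
    """Flatten (from-square, plane) into an index into the 4,672-dim output."""
    return (from_rank * 8 + from_file) * 73 + plane

def queen_plane(dir_idx, distance):
    """Planes 0-55: sliding moves of 1-7 squares in 8 directions."""
    return dir_idx * 7 + (distance - 1)

def knight_plane(dir_idx):
    """Planes 56-63: the 8 knight jumps."""
    return 56 + dir_idx

# g1-f3: from file 6, rank 0, knight jump (-1, +2).
idx_knight = policy_index(6, 0, knight_plane(KNIGHT_DIRS.index((-1, 2))))
# d1-h5: from file 3, rank 0, 'queen move' of 4 squares towards NE (1, 1).
idx_queen = policy_index(3, 0, queen_plane(QUEEN_DIRS.index((1, 1)), 4))
# The network may put probability mass on either index even when the
# move is illegal in the current position.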
The key is that the network outputs a probability distribution over possible moves, and since it is trained by playing against itself, where only legal moves are allowed, it learns to output very low or zero probabilities for illegal moves.
More precisely, after a set number of self-play games, the network is trained with supervised learning to predict the move probabilities and position value recorded during self-play. At the very beginning the network has random weights and outputs significant probabilities for many impossible moves, but after one or more iterations of training the move output probabilities start to look much more reasonable.
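For illustration, here is a sketch of that training step under the standard AlphaZero formulation: cross-entropy between the policy output and the MCTS visit distribution from self-play, plus squared error on the game outcome (weight regularization omitted). Because the visit distribution is zero on illegal moves, training steadily pushes their predicted probabilities toward zero:

import numpy as np

def training_loss(policy_probs, visit_probs, value_pred, outcome):
    """policy_probs: the network's 4,672-dim softmax output.
    visit_probs: MCTS visit distribution from self-play; it is zero on
    illegal moves, which is what drives their predicted probability down.
    outcome: game result from this player's perspective (-1, 0 or +1)."""
    policy_loss = -np.sum(visit_probs * np.log(policy_probs + 1e-8))
    value_loss = (outcome - value_pred) ** 2
    return policy_loss + value_loss  # L2 weight regularization omitted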
The reason the AlphaZero team chose this architecture over one that enforces the rules inside the network is simple: the output must have a fixed size, because a network needs a fixed number of output neurons. It wouldn't make sense for the number of output neurons to vary with the number of legal moves in each position. Nor would it make sense to zero out the outputs for illegal moves inside the network: that would be a highly non-standard operation and a nightmare for backpropagation, since you would need to differentiate through a chess move generator!
Furthermore, when the network uses its policy output to play games, it can simply renormalize the output over only the legal moves, supplied by a move generator, as in the sketch below. In this way move legality is enforced within the self-play system, but not within the neural network architecture itself.
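A minimal sketch of that renormalization step, assuming the list of legal move indices comes from an external (hypothetical) move generator:

import numpy as np

def legal_policy(raw_policy, legal_indices):
    """Zero out illegal entries of the 4,672-dim policy and renormalize.
    legal_indices comes from a move generator outside the network.
    Softmax outputs are strictly positive, so the masked sum is nonzero."""
    masked = np.zeros_like(raw_policy)
    masked[legal_indices] = raw_policy[legal_indices]
    return masked / masked.sum()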
Since you are asking about Keras specifically, you could represent such an output layer as:
from tensorflow.keras.layers import Dense
model.add(Dense(4672, activation='softmax'))
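For context, a self-contained toy model ending in that layer could look like the sketch below. The single convolutional layer is a stand-in, not AlphaZero's deep residual tower, and the separate value head is omitted; the 8 × 8 × 119 input shape follows the input-plane count the paper describes for chess:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense

# Toy body: AlphaZero actually uses a deep residual tower and also has
# a separate value head, both omitted here for brevity.
model = Sequential([
    Conv2D(64, 3, padding='same', activation='relu', input_shape=(8, 8, 119)),
    Flatten(),
    Dense(4672, activation='softmax'),  # fixed-size policy head
])
model.compile(optimizer='adam', loss='categorical_crossentropy')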
In summary: it is not necessary to enforce move legality in the architecture of a neural network for predicting chess moves. We can allow the network to assign probability to all possible moves (including illegal ones) and train it to output low or zero probabilities for the illegal ones. Then, when we use the move probabilities for playing, we normalize over only the legal moves to get the desired result, but this happens outside of the neural network.