I am currently working on a poker hand history parser as a part of my bachelor project. I've been doing some research past couple of days, and came across a few nice parser generators (of which I chose JavaCC, since the project itself will be coded in Java).
Despite the hand history grammar being pretty basic and straightforward, there's an ambiguity problem due to allowed set of characters in player's nickname.
Suppose we have a line in a following format:
Seat 5: myNickname (1500 in chips)
Token myNickname can contain any character as well as white spaces. This means, that both (1500 in chip and Seat 5: are valid nicknames - which ultimately leads to an ambiguity problem. There are no restrictions on player's nickname except for length (4-12 characters).
I need to parse and store several data along with player's nickname (e.g. seat position and amount of chips in this particular case), so my question is, what are my options here?
I would love to do it using JavaCC, something along this:
SeatRecord seat() :
{ Token seatPos, nickname, chipStack; }
{
    "Seat" seatPos=<INTEGER> ":" nickname=<NICKNAME> "(" chipStack=<INTEGER> 
    "in chips)"
    {
        return new SeatRecord(seatPos.image, nickname.image, chipStack.image); 
    }
}  
Which right now doesn't work (due to the mentioned problem)
I also searched around for GLR parsers (which apparently handle ambigious grammars) - but they mostly seem to be abandoned or poorly documented, except for Bison, but that one doesn't support GLR parsers for Java, and might be too complex to work with anway (aside for the ambiguity problem, the grammar itself is pretty basic, as I mentioned)
Or should I stick to tokenizing the string myself, and use indexOf(), lastIndexOf() etc. to parse the data I need? I would go for it only if it was the only option remaining, since it would be too ugly IMHO and I might miss some cases (which would lead to incorrect parsing)
If your input format is as simple as you specify, you can probably get away with a simple regular expression:
^Seat ([0-9]+): (.*) \(([0-9]+) in chips\)$
The NFA of the regex engine in this case solves your ambiguity, and the parentheses are capture groups so that you can extract the information you are interested in.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With