Case-insensitive keyword matching

Tags: regex, perl, yacc

I'm writing a grammar for parsing a computer language that can be used with Parse::Eyapp. This is a Perl package that simplifies writing parsers for context-free languages. It is similar to yacc and other LALR parser generators, but has some useful extensions, like defining tokens in terms of regular expressions.

The language I want to parse uses keywords to denote sections and describe control flow. It also supports identifiers that serve as placeholders for data. An identifier can never have the same name as a keyword.

Now, here comes the tricky part: I need to separate keywords from identifiers, but they may look similar, so I need a regular expression pattern that matches a keyword case-insensitively, and nothing else.

The solution I came up with is the following:

  1. Each keyword is identified by a token of the following form: /((?i)keyword)(?!\w)/
    • (?i) will apply case-insensitive matching for the following subpattern
    • (?!\w) is a negative lookahead that rejects the match if a word character (a letter, digit, or underscore) follows the keyword
    • the lookahead consumes nothing, so the following character is never part of the match
  2. Keywords that are the same as the beginning of another keyword are listed after the longer keyword, so the longer keyword is tried first
  3. The token for matching identifiers comes last so it will only match when no keyword is recognized
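
The three steps above can be sketched as a small standalone tokenizer in plain Perl. This is only an illustration of the regex technique, not Parse::Eyapp's own token syntax; the keyword list, the `tokenize` function, and the `IDENT` pattern are assumptions made up for the example.

```perl
use strict;
use warnings;

# Hypothetical keyword list for illustration; longer keywords come before
# their prefixes ("endif" before "end"), as in step 2.
my @keywords = qw(endif end if then else);

# Token table in priority order: keywords first, identifier last (step 3).
# The identifier pattern is an assumption, not taken from the real grammar.
my @tokens = (
    (map { [ uc($_), qr/((?i)$_)(?!\w)/ ] } @keywords),
    [ 'IDENT', qr/([A-Za-z_]\w*)/ ],
);

# Split $input into [TYPE, text] pairs; dies on an unrecognized character.
sub tokenize {
    my ($input) = @_;
    my @out;
    pos($input) = 0;
    TOKEN: while (1) {
        $input =~ /\G\s+/gc;                    # skip whitespace
        last if pos($input) >= length($input);
        for my $t (@tokens) {
            my ($type, $re) = @$t;
            if ($input =~ /\G$re/gc) {          # anchored at current position
                push @out, [ $type, $1 ];
                next TOKEN;
            }
        }
        die "Unexpected character at offset " . pos($input) . "\n";
    }
    return @out;
}

print join(' ', map { "$_->[0]($_->[1])" } tokenize('If endif_x END ending')), "\n";
# prints: IF(If) IDENT(endif_x) END(END) IDENT(ending)
```

Note that with the `(?!\w)` lookahead in place, `end` cannot match the start of `endif` or `ending` anyway, so in this sketch the ordering from step 2 acts as a safeguard rather than a requirement.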

The token definitions and part of the grammar I came up with work well so far, but there is still a lot to do. However, that is not my question.

What I wanted to ask is, am I on the right track here; are there better, simpler regular expressions for matching those keywords? Should I stop and use a different approach for language parsing altogether?

The idea of using the tokenizer to match whole strings instead of single characters came from the Parse::Eyapp documentation, by the way. I started with a character-by-character grammar first, but that approach wasn't very elegant and seems to contradict the flexible nature of the parser generator. It was very cumbersome to write, too.

asked Jan 30 '26 17:01 by onitake



1 Answer

If you would like to parse a language, Marpa may be much better suited for you. Here's a tutorial. You could also use regexp grammars.

answered Feb 02 '26 06:02 by user1126070
