My question is inspired by this one, but for javascript, using the parsimmon parser-combinator library. I want to parse an indentation-sensitive language, like python or yaml.
I've managed to convert the scala example in that answer to javascript easily enough - the key is the chain function in parsimmon, which is equivalent to the >> operator in scala's parser combinators - they both take a parser and a function that returns a parser, and the result of the first parser is passed to the function to choose the next parser.
However, I can't quite wrap my head around how to make this recursive. The example is for a single block - I don't see how to create nested blocks, tracking de-dent levels as needed to parse something like python.
I'm the maintainer of Parsimmon. I realize this question is really old, but I stumbled across it and wanted to answer.
The python-ish.js example in the parsimmon repo on GitHub should help you understand how to parse an indentation-based language.
It's very similar to Josh's answer, but I think a little easier to understand.
https://github.com/jneen/parsimmon/blob/master/examples/python-ish.js
"use strict";
// Run me with Node to see my output!
let util = require("util");
let P = require("..");
///////////////////////////////////////////////////////////////////////
// Because parsing indentation-sensitive languages such as Python requires
// tracking state, all of our parsers are created inside a function that takes
// the current parsing state. In this case it's just the current indentation
// level, but a real Python parser would also *at least* need to keep track of
// whether the current parsing is inside of () or [] or {} so that you can know
// to ignore all whitespace, instead of further tracking indentation.
//
// Implementing all of Python's various whitespace requirements, including
// comments and line continuations (backslash at the end of the line) is left as
// an exercise for the reader. I've tried and frankly it's pretty tricky.
function PyX(indent) {
return P.createLanguage({
// This is where the magic happens. Basically we need to parse a deeper
// indentation level on the first statement of the block and keep track of
// new indentation level. Then we make a whole new set of parsers that use
// that new indentation level for all their parsing. Each line past the
// first is required to be indented to the same level as that new deeper
// indentation level.
Block: r =>
P.seqObj(
P.string("block:"),
r.NL,
["n", r.IndentMore],
["first", r.Statement]
).chain(args => {
const { n, first } = args;
return PyX(n)
.RestStatement.many()
.map(rest => ["BLOCK", first, ...rest]);
}),
// This is just a statement in our language. To simplify, this is either a
// block of code or just an identifier
Statement: r => P.alt(r.Block, r.Ident),
// This is a statement which is indented to the level of the current parse
// state. It's called RestStatement because the first statement in a block
// is indented more than the previous state, but the *rest* of the
// statements match up with the new state.
RestStatement: r => r.IndentSame.then(r.Statement),
// Just a variable and then the end of the line.
Ident: r => P.regexp(/[a-z]+/i).skip(r.End),
// Consume zero or more spaces and then return the number consumed. For a
// more Python-like language, this parser would also accept tabs and then
// expand them to the correct number of spaces
//
// https://docs.python.org/3/reference/lexical_analysis.html#indentation
CountSpaces: () => P.regexp(/[ ]*/).map(s => s.length),
// Count the current indentation level and assert it's more than the current
// parse state's desired indentation
IndentSame: r =>
r.CountSpaces.chain(n => {
if (n === indent) {
return P.of(n);
}
return P.fail(`${n} spaces`);
}),
// Count the current indentation level and assert it's equal to the current
// parse state's desired indentation
IndentMore: r =>
r.CountSpaces.chain(n => {
if (n > indent) {
return P.of(n);
}
return P.fail(`more than ${n} spaces`);
}),
// Support all three standard text file line endings
NL: () => P.alt(P.string("\r\n"), P.oneOf("\r\n")),
// Lines should always end in a newline sequence, but many files are missing
// the final newline
End: r => P.alt(r.NL, P.eof)
});
}
// Start parsing at zero indentation
let Pythonish = PyX(0);
///////////////////////////////////////////////////////////////////////
let text = `\
block:
alpha
bravo
block:
charlie
delta
echo
block:
foxtrot
golf
`;
function prettyPrint(x) {
let opts = { depth: null, colors: "auto" };
let s = util.inspect(x, opts);
console.log(s);
}
let ast = Pythonish.Statement.tryParse(text);
prettyPrint(ast);
Well, here's one way to do it. It's probably not the best, and it's definitely not intuitive (I'm not sure I understand why it works, and I wrote it :) but it seems to be pretty robust.
Basically it says: a tree is a line optionally followed by a block. A block, in turn, is an indented list of trees.
indent is a function that takes the current indentation level, and returns a parser that expects a line indented more than that. The returned parser returns a stack of previous indentation levels.
I said before it's robust - in fact, it's too robust. It accepts input that really ought to throw an error instead: if you un-indent to an indentation level that doesn't match a previous level, it basically "rounds up" to the next indentation level. I'm not sure where the logic to fix that should go - mutual recursion mixed with parser "chain"-ing is hard to follow!
var {regex, seq} = Parsimmon;
function tree(state) {
return seq(
line,
block(state).atMost(1).map(x => x[0]? x[0] : [])
);
}
function block(state) {
return indent(state)
.chain(tree).atLeast(1);
}
function indent(state) {
return regex(/\s/).atLeast(state + 1)
.map(x => x.length)
.desc('indent');
}
let item = regex(/[^\s].*/).desc('item');
let line = item.skip(regex(/\n?/));
let start = block(-1);
let result = start.parse('top item 1\n sub item 1\n sub item 2\n' +
' even deeper\n sub item 3\ntop item 2');
console.log(JSON.stringify(result['value'], null, 2));
<script src="https://cdn.rawgit.com/jneen/parsimmon/master/src/parsimmon.js"></script>
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With