Wednesday, August 15, 2012

A PEG Parser Generator for Vim

Barry Arthur
v1.2, August 15, 2012
v1.1, October 10, 2011

What is VimPEG?

VimPEG is a Parser Generator which uses the newer Parsing Expression Grammar formalism to specify parse rules.

Why VimPEG?

Vim is a powerful editor. It has lots of features baked right in to make it an editor most awesome. It has a deliciously potent regular expression engine, jaw-dropping text-object manipulations, and fabulous scriptability -- just to name a few of its aces.

One thing our little Vim still lacks, though, is an actual parser. Regular expressions will only get you so far when you're trying to analyse and understand complex chunks of text. If your text is inherently infinite or recursive, then regular expressions become at best combersome, and at worst, useless.

So, Vim needs a parser. I've needed one myself several times when wanting to build a new plugin:
Awesome! This idea will so rock! Now all I need to do is parse <SomeLanguage> and I'll be able to... awww... :-(
I've seen people ask on #vim: How can I <DoSomethingThatNeedsAParser>? And invariably the answer is: You can't. Not easily, anyway. Vimscript is a capable enough language to write your own parser in, but a little alien to most to do so.

You could also use one of the many language bindings that Vim comes bundled with these days to use a parser library in your favourite scripting language. The problem being that your code will only then run on a similarly compiled Vim (not everyone enables these extra language bindings) and with your parser library dependencies.

Beyond those two options, the world of parsing in Vim is quite scant. There exist a small handful of purpose-built recursive descent parsers that target a specific task (like parsing json), but for the general case -- a parser-generator -- you're out of luck. Until now. VimPEG aims to solve this problem.

VimPEG aims to be a 100% VimL solution to your parsing needs.

What would I use VimPEG for?

  • You've come to that paralysing sinkhole in your Vimming when you've said to yourself, "Damn... I wish Vim had a parser."
  • You've asked for something on #vim and the reply is "you can't do that because Vim doesn't have a parser."
  • You're up to your neck in recklessly recursive regexes.

Some ideas:

  • An expression calculator (the beginnings of which we explore here.)
  • Expanding tokens in typed text (think: snippets, abbrevs, maps.)
  • Semantic analysis of code -- for refactoring, reindenting (but sadly not syntax highlighting yet.)
  • C Code bifurcation based on #define values -- want to see what the code would look like with #define DEBUG disabled?
  • Coffeescript for Vim -- sugar-coating some of the uglies in VimL -- this example will be presented in a subsequent VimPEG article.

In fact, most of these ideas have been explored in part inside the examples/ directory of the VimPEG plugin.

For the purposes of introducing VimPEG and parsing in general (if you're new to it), let's consider a fairly easy example of reading and understanding (perhaps calculating) a sum series of integers. They look like this:

    1 + 2 + 12 + 34

NOTE: Vim can already do this for you, so writing a parser for it here is purely pedagogical -- it's a simple enough example without being utterly devoid of educational value. I hope.

The list can be any (reasonable) length, from a single integer upwards. So, this is a valid input to our parser:


As are all of the following:

    1 + 2
    3 + 4 + 5
    123 + 456 + 789

Stop. Right now. And think: How would you parse such an arbitrarily long series of integers separated by + operators? What tool would you reach for? What if you had to do it in Vim? And :echo eval('1 + 2 + 3') is cheating. :-p

We'll continue to use this example throughout this article and eventually show you how VimPEG solves this little parsing requirement.

But first, let's make sure we're all on the same page about the question: What is parsing?


Feel free to skip to the next section if you're comfortable with the following concepts:
  • parsing
  • pasrer generators
  • (E)BNF and PEGs

Let's begin by defining some terms:

What is 'Parsing'?

Parsing is making sense of something. When we want a computer to understand something we've written down for it to do, it needs to 'parse' that writing.  Without going into too much detail yet, let's consider a sentence uttered at one time or another by your parental unit: "Take the rubbish out!". When you (eventually -- after you unplug your iPod, put down your PS3 controller, pocket your smart-phone and wipe the disdain off your face) parse this sentence, your brain goes through two processes:

firstly, syntax recognition:

  • it scans the words to make sure they're legitimate:
    • they're in a language you know
    • they're all valid words, and
    • they're all in the right order

and secondly, semantic analysis:

  • it filters out the 'meaning' and presents that to a higher actor for further deliberation

In this case, the parser would extract the verb phrase 'take out' and the noun 'rubbish'. Your higher self (sarcasm aside) knows where this magic 'out' place is. We'll come back to these two processes ('syntax recognition' and 'semantic analysis') later.

In the case of our sum series of integers, syntax recognition would involve collecting the sequence of digits that comprise an integer, skipping unnecessary whitespace and expecting either an end of input or a + character and another integer and... so on. If the input contained an alphabetic character it would fail in this phase -- alphabetic characters are just not expected in the input. If the lexical recogniser found two integers separated by whitespace or two + characters in a row...  it would not fail in this phase -- these are all valid tokens in 'this' lexical recogniser.

I am describing the more general process of lexical recognition and it being a separate stage to semantic analysis which is typical of a lot of parsers. PEG parsers, however, do not have separate phases as described here -- they are quite strict about not only what shape the next token must have, but also its purpose in this place (context) of the input. Having two consecutive integers or two consecutive characters will upset a PEG parser expecting a sum series of integers -- it's just that it gets upset all in its single parse phase.

The semantic analysis phase is all about doing something
"meaningful" with the collected integers. Maybe we should sum them? Maybe we just want to pass back a nested list structure representing the parse tree, like this:

    [1, '+', [2, '+', [3, '+', 4]]]

given this input:

    1 + 2 + 3 + 4

Either way, whatever is done, it's the job of the semantic analysis phase to do so. In our example in this article, we produce a sum of the collected integer series. So, our parser would return: 10 for the example input given above.

What is a 'Parser Generator'?

Writing a parser is not easy. Well, it's not simple. It's fussy. It's messy. There's a lot of repetition and many edge cases and minutia that bores a good coder to tears. Sure, writing your first recursive descent parser is better than sex, but writing your second one isn't. Writing many is as much fun as abstinence. Enough said.

So, we (as fun loving coders) want a better alternative. Parser generators provide that alternative. They generate parsers; which means they do all the boring, tedious, repetitive hard-labour and clerical book-keeping stuff for us. I hope I've painted that with just the right amount of negative emotion to convince you on a subliminal level that Parser Generators are a Good Thing(TM).

How do they generate a parser? or What's a 'PEG'?

Parser Generators are told what to expect (what is valid or invalid) through a grammar -- a set of rules describing the allowed constructs in the language it's reading. Defining these rules in a declarative form is much easier, quicker and less error-prone than hand-coding the equivalent parser.

Bryan Ford recently (circa 2004) described a better way to declare these rules in the form of what he called Parsing Expression Grammars -- PEGs.

NOTE: We used to declare these parsing rules in EBNF, intended for a recursive descent parser (or an LL or LALR or other parser). And before you drown me in comments of "They so still use that, dude!" -- I know. They do.

In a nutshell, PEGs describe what is expected in the input, rather than the (E)BNF approach of describing what is possible. The difference is subtle but liberating. We'll not go too much into that now -- except to say: PEGs offer a cleaner way to describe languages that computers are expected to parse. If you want to re-program your 13 year old brother, you might not reach for a PEG parser generator, but as we're dabbling here in the confines of computers and the valley of vim, PEGs will do just fine.

A major benefit to PEG parsers is that there is no separate lexical analysis phase necessary. Because PEG parsers 'expect' to see the input in a certain way, they can ask for it in those expected chunks. If it matches, great, move on. If it doesn't match, try another alternative. If all the alternatives fail, then the input doesn't match. Allow for backtracking, and you have all you need to parse 'expected' input.

NOTE: VimPEG is not a memoising (packrat) parser -- not yet, anyway.

A brief overview of the PEG parsing rule syntax

  • Terminal symbols are concrete and represent actual strings (or in the case of VimPEG, Vim regular expressions) to be matched.
  • Non-terminal symbols are names referring to combinations of other terminal and/or non-terminal symbols.
  • Each rule is of the form:   A ::= e -> #s
    • A is a non-terminal symbol
    • e is a parsing expression
    • s (optional) is a semantic transformation (data-munging callback)
  • Each parsing expression is either: a terminal symbol, a non terminal symbol or the empty string.
  • Given the parsing expressions, ++e1++ and ++e2++, a new parsing expression can be constructed using the following operators:
    • Sequence: e1 e2
    • Ordered choice: e1 / e2
    • Zero-or-more: e*
    • One-or-more: e+
    • Optional: e?
    • And-predicate: &e
    • Not-predicate: !e

A Conceptual Model of VimPEG

There are three players in the VimPEG game:
  1. The VimPEG Parser Generator (Vim plugin)
  2. The Language Provider
  3. The Client

The VimPEG Parser Generator

This is a Vim plugin you'll need to install to both create and use VimPEG based parsers.

The Language Provider

This is someone who creates a parser for a new or existing language or data-structure. They create the grammar, data-munging callbacks, utility functions and a public interface into their 'parser'.

The Client

This is someone who wants to 'use' a parser to get some real work done. Clients can either be Vim end-users or other VimL coders using a parser as a support layer for even more awesome and complicated higher-level purposes.

There are five pieces to VimPEG

  1. The VimPEG library (plugin)
  2. A PEG Grammar (provider-side)
  3. Callbacks and utility functions [optional] (provider-side)
  4. A public interface (provider-side)
  5. Client code that calls the provider's public interface. (client-side)

Our Parsing Example

Let's return to our parsing example: recognising (and eventually evaluating) a sum series of integers.

Examples of our expected Input

  • 123
  • 1 + 2 + 3
  • 12 + 34 + 56 + 78

A traditional CFG style PEG for a Series of Integer add & subtract operations:

  Expression  ::= Sum | Integer
  Sum         ::= Integer '+' Expression
  Integer     ::= '\d\+'

In the above PEG for matching a Sum Series of Integers, we have:
  • Three non-terminal symbols: 'Integer', 'Sum' and 'Expression'
  • Two terminal symbols: \d\+ and  '+'
  • One use of Sequence with the three pieces: 'Integer' '+' 'Expression'
  • One use of Ordered choice: 'Sum' | 'Integer'
NOTE: The original (and actual) PEG formalism specifies the fundamental expression type as a simple string. VimPEG shuns (at probable cost) this restriction and allows regular expressions as the fundamental expression type. Original PEG grammars use / to indicate choice, but VimPEG uses | instead.

Anyone familiar with CFG grammar specifications will feel right at home with that example PEG grammar above. Unfortunately, it isn't idiomatic PEG. The thing to be parsed here is a list. PEGs have a compact idiomatic way of expressing that structure:

  Expression   ::= Integer (('+' | '-') Integer)*
  Integer      ::= '\d\+'

Here the arguably simpler concept of iteration replaces the CFG use of recursion to describe the desired list syntax to be parsed. It's so much simpler that it seemed a waste not to bundle subtraction in with the deal. Now our parser can evaluate a series of integer add and subtract operations.


  peg.e(expression, options)                  "(Expression)
  peg.and(sequence, options)                  "(Sequence)
  peg.or(choices, options)                    "(Ordered Choice)
  peg.maybe_many(expression, options)         "(Zero or More)
  peg.many(expression, options)               "(One or More)
  peg.maybe_one(expression, options)          "(Optional)
  peg.has(expression, options)                "(And Predicate)
  peg.not_has(expression, options)            "(Not Predicate)

Defining the Series of Integer Add and Subtract Operations PEG

  let p = vimpeg#parser({'skip_white': 1})
  call p.e('\d\+', {'id': 'integer'})
  let expression =
        \ p.and(
        \   [ 'integer',
        \     p.maybe_many(
        \       p.and(
        \         [ p.or(
        \           [ p.e('+'),
        \             p.e('-')]),
        \           'integer'])),
        \     p.e('$')],
        \   {'on_match': 'Expression'})

This example demonstrates several aspects of VimPEG's API:
  1. Elements that have been 'identfied' (using the 'id' attribute) can be referred to in other expressions. 'Integer' is identified in this case and referenced from 'Expression'.
  2. Only root-level elements need to be assigned to a Vim variable. In this case, the 'expression' element is considered to be a root element -- we can directly call on that element now to parse a series of integer add and subtract operations.
  3. Intermediate processing (for evaluations, reductions, lookups, whatever) is achieved through callback functions identified by the 'on_match' attribute.  The 'Expression' rule uses such a callback to iterate the list of add or subtract operations to evaluate their final total value. Here is that callback function:

  function! Expression(args)
    " initialise val with the first integer in the series
    let val = remove(a:args, 0)
    " remaining element of a:args is a list of [ [<+|->, <int>], ... ] pairs
    let args = a:args[0]
    while len(args) > 0
      let pair = remove(args, 0)
      let val = (pair[0] == '+') ? (val + pair[1]) : (val - pair[1])
    return val

The public API interface

  function! EvaluateExpression(str)
    let res = g:expression.match(a:str)
    if res.is_matched
      return res.value
      return res.errmsg

The res object holds a lot of information about what was actually parsed (and an errmsg  if parsing failed). The value element will contain the cumulative result of all the 'on_match' callbacks as the input was being parsed.

Using it

  echo EvaluateExpression('123')
  echo EvaluateExpression('1 + 2')
  echo EvaluateExpression('1 + 2 + 3')
  echo EvaluateExpression('4 - 5 + 6')
  echo EvaluateExpression('1 - a')

NOTE: The last example there will return the error message: 'Failed to match Sequence at byte 2'. This might seem unexpected -- we might have been hoping for something more meaningful about not expecting an alphabetic character when looking for an integer digit. It's telling us that (after gracefully falling back out of the optional series of add and subtract operations) it can't match '$' (end of line) at byte 2 because a '-' character is in the way.

Not terribly exciting, granted, but hopefully this serves as a reasonable introduction to the VimPEG Parser Generator. What can you do with it? I look forward to seeing weird and wonderful creations and possibilities in Vim now that real parsing tasks are more accessible.

What's Next?

As beautiful (ok, maybe not, but I've seen more hideous interfaces) as VimPEG's API is, she could do with a touch of lipstick. Instead of calling the API directly, it would be nice to be able to declare the rules using the PEG formalism. That's exactly what Raimondi has done in one of his contributions to VimPEG and that's what we'll be talking about in the next article.

In a future article I will show an example of sugar-coating the VimL language to make function declarations both a little easier on the eyes and fingers as well as adding two long-missing features from VimL -- default values in function parameters and inline function
declarations, a la if <condition> | something | endif .

No comments:

Post a Comment