Introducing ...
lrparsing.py

Russell Stuart

Why?


..... Surely someone has done this already?

pyparsing.py

Has an example grammar for Sqlite.
... and has nice syntax:

from pyparsing import *

integer = Word(nums).setParseAction(lambda t:int(t[0]))
variable = Word(alphas,exact=1)
operand = integer | variable

expop, signop, multop  = Literal('^'), oneOf('+ -'), oneOf('* /')
plusop, factop = oneOf('+ -'), Literal('!')

expr = operatorPrecedence( operand,
    [(factop, 1, opAssoc.LEFT), (expop, 2, opAssoc.RIGHT),
     (signop, 1, opAssoc.RIGHT),
     (multop, 2, opAssoc.LEFT), (plusop, 2, opAssoc.LEFT),]
    )
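
Parsing is then a single call (a sketch; parseString is pyparsing's
standard entry point, the input string is made up):

result = expr.parseString("9 + -2 * 3 ^ 2")
print result.asList()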

pyparsing.py ... but

- The example Sqlite grammar is buggy.
- Poor free documentation - buy the book.
- And it's absurdly slow:

$ ./pyparsing-sqlite.py 
select * from xyzzy where z > 100
parse time=0.389035

select * from xyzzy where z > 100 order by zz
parse time=0.646955

select * from xyzzy
parse time=0.001180

select a, b, c from t1 as A, t2 as B
    where x1 > x2 + 2 * 3 - 1 and x1 < 3
    order by 1 asc, 2 desc
parse time=1.323233

Next ... a hand-written LL parser

select a, b, c from t1 as A, t2 as B
    where t1.a > t2.b + 2 * 3 - 1 and t1 < ?
    order by 1 asc, 2 desc, t1.a
parse time= 0.001747






A sane person probably would have stopped at this point.

Next ... parsing.py.

An LR(1) parser generator with LR(1) and GLR(1) parsers.

select a, b, c from t1 as A, t2 as B
    where t1.a > t2.b + 2 * 3 - 1 and t1 < ?
    order by 1 asc, 2 desc, t1.a
parse time= 0.004702

... parsing.py - Sample Code - The Grammar

Grammar definition:

class NArithExpr(NonTerm):
    @NonTerm.reduce
    def reduceA(self, expr1):
        "%reduce NUnaryExpr"
        self.HIDDEN = True
    @NonTerm.reduce
    def reduceB(self, expr1, mult, expr2):
        "%reduce NArithExpr SymbolMult NArithExpr [PrecMult]"
    @NonTerm.reduce
    def reduceC(self, expr1, div, expr2):
        "%reduce NArithExpr SymbolDiv NArithExpr [PrecMult]"
    @NonTerm.reduce
    def reduceD(self, expr1, mod, expr2):
        "%reduce NArithExpr SymbolMod NArithExpr [PrecMult]"
    @NonTerm.reduce
    def reduceE(self, expr1, plus, expr2):
        "%reduce NArithExpr SymbolPlus NArithExpr [PrecAdd]"
    @NonTerm.reduce
    def reduceF(self, expr1, minus, expr2):
        "%reduce NArithExpr SymbolMinus NArithExpr [PrecAdd]"
    @NonTerm.reduce
    def reduceG(self, atom):
        "%reduce NAtom"
        self.HIDDEN = True

... parsing.py - Sample Code - Defining Tokens

Tokens have to be defined individually:

class Token(parsing.Token):
    __metaclass__ = TokenMeta
    #
    # ... etc ...
    #

class Keyword(Token): SUPPRESS = True

class KeywordAnd(Keyword): keyword = "and"
class KeywordAs(Keyword): keyword = "as"
class KeywordAsc(Token): keyword = "asc"
class KeywordBetween(Keyword): keyword = "between"
class KeywordBy(Keyword): keyword = "by"
class KeywordDesc(Token): keyword = "desc"
class KeywordFirst(Keyword): keyword = "first"
class KeywordFrom(Keyword): keyword = "from"
class SymbolPlus(Token): token = '+'
class SymbolMinus(Token): token = '-'
class SymbolParam(Token): token = '?'
class SymbolMult(Token): token = '*'
class SymbolDiv(Token): token = '/'
class SymbolMod(Token): token = '%'
# and on and on ...

... parsing.py - Sample Code - Defining Precedences

Each precedence is a class too:

class Prec(parsing.Precedence):
    pass
class PrecUnary(Prec):
    "%right"
class PrecMult(Prec):
    "%left <PrecUnary"
class PrecAdd(Prec):
    "%left <PrecMult"
class PrecCompare(Prec):
    "%left <PrecAdd"
class PrecEquivalence(Prec):
    "%left <PrecCompare"
class PrecBetween(Prec):
    "%left <PrecEquivalence"
class PrecAnd(Prec):
    "%left <PrecBetween"
class PrecOr(Prec):
    "%left <PrecAnd"
class PrecIdent(Prec):
    "%left"
class PrecExpr(Prec):
    "%left <PrecIdent"

... parsing.py - Summary






Better than a hand-written LL parser?

Afterward ... ply

An LALR(1) parser with inbuilt tokeniser.

select a, b, c from t1 as A, t2 as B
    where x1 > x2 + 2 * 3 - 1 and x1 < 3
    order by 1 asc, 2 desc
parse time=0.001505


Had I found ply, I would have used it.

ply - Sample Code - The Tokeniser

class Lexer(object):
    reserved = {
        "and": "KeywordAnd",
        "as": "KeywordAs",
        "asc": "KeywordAsc",
        "by": "KeywordBy",
        # and so on ...
    }
    tokens = (
        # more lines like below ..
        "SymbolEquals",
        "SymbolStringCat",
        "TokenString",
        "TokenInteger",
        "TokenDotIdent",
        "TokenIdent",) + tuple(reserved.itervalues())
    # more lines like below ..
    t_SymbolEquals = '=|=='
    t_SymbolStringCat = r'\|\|'
    t_TokenString = "(?:'(?:[^']*)')+"
    t_TokenFloat = "[0-9]+[.][0-9]*|[.][0-9]+"
    t_TokenInteger = "[0-9]+"
    t_TokenDotIdent = '[a-zA-Z_][a-zA-Z_0-9]*(?:[.][a-zA-Z][a-zA-Z_0-9]*)+|"[a-zA-Z_][a-zA-Z_0-9]*(?:[.][a-zA-Z][a-zA-Z_0-9]*)+"'

    def t_TokenIdent(self, t):
        '[a-zA-Z_][a-zA-Z_0-9]*|"[a-zA-Z_][a-zA-Z_0-9]*"'
        t.type = self.reserved.get(t.value.lower(), 'TokenIdent')
        return t

ply - Sample Code - The Parser

class Parser(object):
    tokens = Lexer.tokens

    def p_NWhere(self, p):
        """NWhere : KeywordWhere NBoolExpr"""
        p[0] = ("NWhere",) + tuple(p[i] for i in range(1,len(p)))

    def p_NBoolExpr(self, p):
        """NBoolExpr : KeywordNot NBoolExpr %prec KeywordNotU
            | NBoolExpr KeywordAnd NBoolExpr 
            | NBoolExpr KeywordOr NBoolExpr 
            | NArithExpr SymbolLessThan NArithExpr 
            | NArithExpr SymbolLessEqual NArithExpr 
            | NArithExpr SymbolGreaterEqual NArithExpr 
            | NArithExpr SymbolGreaterThan NArithExpr 
            | NArithExpr SymbolEquals NArithExpr 
            | NArithExpr SymbolNotEquals NArithExpr 
            | SymbolLeftParen NBoolExpr SymbolRightParen"""
        p[0] = ("NBoolExpr",) + tuple(p[i] for i in range(1,len(p)))
        

So where are we here?




Lots of choices ...

Some acceptable.

Probably more ...




So why a talk about lrparsing?

Introducing ... lrparsing

An LR(1) parser with inbuilt tokeniser.

select a, b, c from t1 as A, t2 as B
    where x1 > x2 + 2 * 3 - 1 and x1 < 3
    order by 1 asc, 2 desc
parse time=0.000807

lrparsing - Sample Code

class Parser(Grammar):
    class T(TokenRegistry):
        float = Token(re="[0-9]+[.][0-9]*|[.][0-9]+")
        integer = Token(re="[0-9]+")
        ident = Token(re='[a-zA-Z_][a-zA-Z_0-9]*|"[a-zA-Z_][a-zA-Z_0-9]*"')
        dot_ident = Token(re='[a-zA-Z_][a-zA-Z_0-9]*(?:[.][a-zA-Z_][a-zA-Z_0-9]*)+|"[a-zA-Z_][a-zA-Z_0-9]*(?:[.][a-zA-Z][a-zA-Z_0-9]*)+"')
        string = Token(re="(?:'(?:[^']*)')+")

    Tk = lambda s,k=None: Tokens(s, k, case=False); K = lambda k: Keyword(k, case=False)
    NArithExpr = Ref("NArithExpr"); NBoolExpr = Ref("NBoolExpr")
    NFuncCall = T.ident + '(' + List(NBoolExpr, ',') + ')'
    NArithExpr = Prio(
            T.ident | T.dot_ident | T.string | T.integer | T.float | '?' | NFuncCall,
            Tk("+ -") + THIS,
            THIS << Tk("* / %") << THIS, THIS << Tk("+ -") << THIS,)
    NOrderByTerm = Prio(T.ident | T.integer, NArithExpr) + Opt * Tk("", "asc desc")
    NOrderBy = K("order") + K("by") + List(NOrderByTerm, ',', 1)
    NBoolExpr = Prio(
        '(' + THIS + ')',
        K("not") + THIS, NArithExpr + Tk("< <= >= > == = != <>") + NArithExpr,
        THIS << K("or") << THIS, THIS << K("and") << THIS,)
    NFrom = K("from") + List(T.ident + Opt(K("as") + T.ident), ",", 1)
    NSelectColumn = Prio(
            (T.ident | T.dot_ident) + Opt(K("as") + T.ident) | '*',
            NArithExpr + Opt(K("as") + T.ident))
    NSelect = K("select") + List(NSelectColumn, ",", 1)
    NQuery = NSelect + NFrom + Opt(K("where") + NBoolExpr)
    NUnion = List(NQuery, K("union"), 1) + Opt * NOrderBy
    START = NUnion

Back to the beginning ... What does a parser do?


We start with a string like this:

4 + -3 ** 2 / (1 ++ 2)


And we assign meaning.
We do that via a parse tree.
The parse tree assigns names to things.

4 + -3 ** 2 / (1 ++ 2)
expr

And we write that as:

(expr ....)

What does a parser do? ... it breaks things up


Next we break the expr down like this:

  4           +    -3 ** 2 / (1 ++ 2)
number=4   '+'         expr        

And we write that as:

(expr number=4 '+' (expr ...))


And we repeat:

  -3 ** 2    /    (1 ++ 2)
     expr    '/'     expr    

(expr number=4 '+' (expr (expr ...) '/' (expr ...)))

What does a parser do? ... into their constituent bits


So a parser is something that starts with a string:

4 + -3 ** 2 / (1 ++ 2)

 

And turns it into a parse tree:

(expr
  number=4
  '+'
  (expr
    (expr '-' (expr number=3 '**' number=2))
    '/'
    (expr '(' (expr number=1 '+' (expr '+' number=2)) ')')))

And with that parse tree, we could evaluate the expression according to the rules of arithmetic.
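
Evaluating that tree bottom-up: 4 + (-(3 ** 2)) / (1 + (+2)) = 4 + -9 / 3 = 4 + -3 = 1.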

What makes up a parser?


The tokeniser (aka lexer) names the smallest chunks.
   »  * is a '*'
   »  ** is a '**'
   »  ++ is two tokens: '+' followed by '+'.


A grammar that says what the language looks like.


A parser generator that compiles the grammar.
   »  The compiled form is called a parse table.


A parser. A small piece of code that:
   »  Processes the tokenised input stream,
   »  according to a parse table,
   »  to produce a parse tree.

A Grammar

The parse tree:

  -3 ** 2    /    (1 ++ 2)
     expr    '/'     expr    

Suggests this pattern:

expr = expr '/' expr

 

This is called a production. A grammar is just the collection of all the productions making up the language:

expr = expr '+' expr
expr = expr '/' expr
expr = expr '**' expr
expr = '+' expr
expr = '-' expr
expr = '(' expr ')'
expr = number

lrparsing's Grammar

lrparsing uses Python expressions to represent its grammar.

  • Rules have to be valid Python expressions.
  • Assigning to the same name more than once loses all but the last assignment.
  • A rule can't refer to itself before it has been defined (hence THIS).
from lrparsing import Grammar, THIS, Token
class Expr(Grammar):
    expr = (
	THIS + '/' + THIS |
	THIS + '+' + THIS |
	THIS + '**' + THIS |
	'+' + THIS |
	'-' + THIS |
	'(' + THIS + ')' |
	Token(re="[0-9]+"))
    START = expr

lrparsing's tokeniser


Internally tokens are recognised using a single large regular expression:

token1_re|token2_re|...
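
Roughly the idea in plain re (a sketch of the technique, not
lrparsing's actual code; all names here are made up):

import re

# One named group per token regex; re tries alternatives left to
# right, so '**' is listed before '*'.
TOKENS = [
    ("POW",    r"\*\*"),
    ("MULT",   r"\*"),
    ("PLUS",   r"\+"),
    ("NUMBER", r"[0-9]+"),
    ("SPACE",  r"\s+"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % t for t in TOKENS))

def tokenise(s):
    for m in MASTER.finditer(s):
        if m.lastgroup != "SPACE":
            yield m.lastgroup, m.group()

print list(tokenise("3 ** 2 ++ 1"))
# '**' comes out as one POW token, '++' as two PLUS tokens.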

Token constructors:

 
  • Token(literal="**", [case=True]),
    or Token("**", [case=True]),
    or just "**".
  • Token(re="[a-zA-Z_][a-zA-Z0-9_]*")
  • Keyword("and", [case=True])
  • UnrecognisedToken()

OMG! It's a pig in lipstick

from lrparsing import Grammar, THIS, Token
class Expr(Grammar):
    expr = (
	THIS + '/' + THIS | THIS + '+' + THIS | THIS + '**' + THIS |
	'+' + THIS | '-' + THIS | '(' + THIS + ')' |
	Token(re="[0-9]+"))
    START = expr
print Expr.repr_parse_tree(Expr.parse("4 + -3 ** 2 / (1 ++ 2)"))
Which fails with:

While trying to recognise state 11:
  expr = '+' expr ^
  expr = expr ^ '+' expr
on seeing a '+' I could not decide between:
  replace the sequence ['+' expr] with expr
and
  accept the '+' in the hope it will match:
    expr = expr '+' ^ expr
Please remove this ambiguity from your grammar

Translating pig

While trying to recognise state 11:
  expr = '+' expr ^
  expr = expr ^ '+' expr
on seeing a '+' I could not decide
 

Substitute the Reduce into the Shift and we get:

expr = '+' expr ^ '+' expr
 

The choice is between two parse trees:

'+' expr '+' expr
  expr   '+' expr      (reduce first)

and:

'+' expr '+' expr
'+'     expr           (shift first)

Prio()

Rewrite the grammar to get rid of the ambiguity:

class Expr(Grammar):
    expr = Prio(
	'(' + THIS + ')' | Token(re="[0-9]+"),
	THIS + '**' + THIS,
	'+' + THIS | '-' + THIS,
	THIS + '/' + THIS,
	THIS + '+' + THIS)
    START = expr
Which still fails:

lrparsing.GrammarError: Conflict: Shift/Reduce and no associativity
While trying to recognise state 22:
  expr.Prio.Prioritised4 = expr '+' expr ^
  expr.Prio.Prioritised4 = expr ^ '+' expr
on seeing a '+' I could not decide between:
  replace the sequence [expr '+' expr] with expr.Prio.Prioritised4
and
  accept the '+' in the hope it will match:
    expr.Prio.Prioritised4 = expr '+' ^ expr
Please remove this ambiguity from your grammar

Displaying the compiled productions - the code

Lrparsing has compiled your grammar into productions.

#!/usr/bin/python
from lrparsing import Grammar, Prio, THIS, Token

class Expr(Grammar):
    expr = Prio(
	'(' + THIS + ')' | Token(re="[0-9]+"),
	THIS + '**' + THIS,
	'+' + THIS | '-' + THIS,
	THIS + '/' + THIS,
	THIS + '+' + THIS)
    START = expr

try:
  Expr.compile_grammar()
except:
  print Expr.repr_productions()
  raise
print Expr.repr_parse_tree(Expr.parse("4 + -3 ** 2 / (1 ++ 2)"))

Displaying the compiled productions - the result

0     : <E> = START __end_of_input__
1     : START = expr
2     : expr = expr.Prio
3     : expr.Prio = expr.Prio.Prioritised0
        expr.Prio = expr.Prio.Prioritised1
        expr.Prio = expr.Prio.Prioritised2
        expr.Prio = expr.Prio.Prioritised3
        expr.Prio = expr.Prio.Prioritised4
4.1   : expr.Prio.Prioritised0 = '(' expr ')'
        expr.Prio.Prioritised0 = /[0-9]+/
5.1   : expr.Prio.Prioritised1 = expr '**' expr
6.1   : expr.Prio.Prioritised2 = '+' expr
        expr.Prio.Prioritised2 = '-' expr
7.1   : expr.Prio.Prioritised3 = expr '/' expr
8.1   : expr.Prio.Prioritised4 = expr '+' expr
Traceback (most recent call last):
  File "example-3.py", line 15, in 
    Expr.compile_grammar()
  ........

Translating pig (again)

While trying to recognise state 23:
  expr.Prio.Prioritised3 = expr ^ '/' expr
  expr.Prio.Prioritised3 = expr '/' expr ^
on seeing a '/' I could not decide
 

Substitute the Reduce into the Shift and we get:

expr = expr '/' expr ^ '/' expr
 

The choice is between two parse trees:

expr '/' expr '/' expr
     expr     '/' expr      (reduce first: left associative)

and:

expr '/' expr '/' expr
expr '/'     expr           (shift first: right associative)

Using '<<' and '>>'

Rewrite the grammar to get rid of the ambiguity:

#!/usr/bin/python
from lrparsing import Grammar, Prio, THIS, Token

class Expr(Grammar):
    expr = Prio(
	'(' + THIS + ')' | Token(re="[0-9]+"),
	THIS >> '**' >> THIS,
	'+' + THIS | '-' + THIS,
	THIS << '/' << THIS,
	THIS << '+' << THIS)
    START = expr
    
print Expr.repr_parse_tree(Expr.parse("4 + -3 ** 2 / (1 ++ 2)"))

Finally, a parse tree.

(START (expr
  (expr '4') '+'
  (expr
    (expr '-' (expr
      (expr '3') '**'
      (expr '2')))
    '/'
    (expr '('
      (expr
        (expr '1') '+'
        (expr '+' (expr '2'))) ')'))))

Other grammar constructors

Syntactic sugar for Repeat (see the sketch below)

»  Opt(symbol) or Opt * symbol
»  Some(symbol) or Some * symbol
»  Many(symbol) or Many * symbol
»  symbol * N
  »  THIS * 0     (idiom for an empty production: bare "name =" isn't valid Python)
  »  symbol * 1     (idiom for name2 = name1)
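
Pulling the sugar together (a sketch; the Args grammar and its input
are invented for the example):

from lrparsing import Grammar, Many, Opt, Some, Token

class Args(Grammar):
    ident = Token(re="[a-zA-Z_][a-zA-Z_0-9]*")
    # An optional leading '*', one or more idents, then zero or more
    # ',' ident pairs.
    START = Opt(Token('*')) + Some(ident) + Many(',' + ident)

print Args.repr_parse_tree(Args.parse("* a b, c, d"))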

Raw constructors

»  Choice(s0, s1, ...) or s0 | s1 | ...
»  Left(s0 + s1 + ...) or s0 << s1 << ...
»  Right(s0 + s1 + ...) or s0 >> s1 >> ...
»  Sequence(s0, s1, ...) or s0 + s1 + ...

And finally for the truly lazy

»  Tokens(literals, keywords, case=True)

Tokens("== !=", "like in")

Is the same as:

Token("==") | Token("!=") | Keyword("like") | Keyword("in")

lrparsing - Sample Code

class Parser(Grammar):
    class T(TokenRegistry):
        float = Token(re="[0-9]+[.][0-9]*|[.][0-9]+")
        integer = Token(re="[0-9]+")
        ident = Token(re='[a-zA-Z_][a-zA-Z_0-9]*|"[a-zA-Z_][a-zA-Z_0-9]*"')
        dot_ident = Token(re='[a-zA-Z_][a-zA-Z_0-9]*(?:[.][a-zA-Z_][a-zA-Z_0-9]*)+|"[a-zA-Z_][a-zA-Z_0-9]*(?:[.][a-zA-Z][a-zA-Z_0-9]*)+"')
        string = Token(re="(?:'(?:[^']*)')+")
    Tk = lambda s,k=None: Tokens(s, k, case=False); K = lambda k: Keyword(k, case=False)
    NArithExpr = Ref("NArithExpr"); NBoolExpr = Ref("NBoolExpr")
    NFuncCall = T.ident + '(' + List(NBoolExpr, ',') + ')'
    NArithExpr = Prio(
            T.ident | T.dot_ident | T.string | T.integer | T.float | '?' | NFuncCall,
            Tk("+ -") + THIS,
            THIS << Tk("* / %") << THIS, THIS << Tk("+ -") << THIS,)
    NOrderByTerm = Prio(T.ident | T.integer, NArithExpr) + Opt * Tk("", "asc desc")
    NOrderBy = K("order") + K("by") + List(NOrderByTerm, ',', 1)
    NBoolExpr = Prio(
        '(' + THIS + ')',
        K("not") + THIS, NArithExpr + Tk("< <= >= > == = != <>") + NArithExpr,
        THIS << K("or") << THIS, THIS << K("and") << THIS,)
    NFrom = K("from") + List(T.ident + Opt(K("as") + T.ident), ",", 1)
    NSelectColumn = Prio(
            (T.ident | T.dot_ident) + Opt(K("as") + T.ident) | '*',
            NArithExpr + Opt(K("as") + T.ident))
    NSelect = K("select") + List(NSelectColumn, ",", 1)
    NQuery = NSelect + NFrom + Opt(K("where") + NBoolExpr)
    NUnion = List(NQuery, K("union"), 1) + Opt * NOrderBy
    START = NUnion

It's not that un-natural

Plain Python classes already hand back something different from what you assigned (a function becomes an unbound method), so a Grammar handing back symbol objects is no stranger:

$ python
Python 2.7.3 (default, Jan  2 2013, 13:56:14) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> class X(object):
...     def x(self): pass
...     y = [x]
>>> X.x == X.x
True
>>> X.x == X.y[0]
False
>>> from lrparsing import Grammar, Token
>>> class G(Grammar):
...   START = Token('x')
...   y = [START]
>>> G.START == G.START
True
>>> G.START == G.y[0]
False
>>> 

The internals of the parse tree

>>> from lrparsing import *
>>> class Expr(Grammar):
...     class T(TokenRegistry):
...         a = Token('a')
...     A = T.a
...     B = A + Token('b') + Token('b')
...     START = A | B
>>> parse_tree = Expr.parse("a\n   b  b")
>>> print Expr.repr_parse_tree(parse_tree, False)
(START (B (A 'a') 'b' 'b'))
>>> print repr(parse_tree)
(START = A | B, (B = A + 'b' + 'b', (A = T.a='a', (T.a='a', 'a', 0, 1, 1)), \
    ('b', 'b', 5, 2, 4), ('b', 'b', 8, 2, 7)))
>>> print str(Expr.START)
START
>>> print repr(Expr.START)
START = A | B
>>> print repr(Expr.T.a)
T.a='a'
>>> parse_tree[0] is Expr.START
True
>>> parse_tree[1][1][1][0] is Expr.T.a
True
>>> isinstance(parse_tree[1][2][0], TokenSymbol)
True
>>> 

The inbuilt tokeniser

>>> class E(Grammar):
...     class T(TokenRegistry):
...         u = UserToken()
...         x = UnrecognisedToken()
...     START = Repeat(T.u) + Repeat(Token('a') | T.x)
...     COMMENTS = (
...          Token(re='#[^\n]*\n?') |
...          Token(re=r'/\*(?:[^*]|\*[^/])*\*/'))
...     WHITESPACE = "."
>>> parse_tree = E.parse(
...     ((E.T.u, 1), (E.T.u, 2, set()), "a.a.xxxa# some stuff"))
>>> print E.repr_parse_tree(parse_tree)
(START T.u T.u 'a' 'a' '.xxx' 'a')
>>> print repr(parse_tree)
(START = T.u * () + ('a' | T.x=UnrecognisedToken()) * (), (T.u, 1), \
  (T.u, 2, set([])), ('a', 'a', 0, 1, 1), ('a', 'a', 2, 1, 3), \
  (T.x=UnrecognisedToken(), '.xxx', 3, 1, 4), ('a', 'a', 7, 1, 8))
>>> 

Abusing the parse tree

Grammar rules support item assignment, so we can hang an "eval" function off each rule and have the tree_factory run it as each node is built:

#!/usr/bin/python
import operator
from lrparsing import *

class E(Grammar):
    e = Ref("e")
    number = Token(re="[0-9]+")
    binary_expr = Prio(e << Token("/") << e, e << Token("+") << e)
    unary_expr = Token("-") + e
    e = Prio(number, unary_expr, binary_expr)
    START = e

    @classmethod
    def eval_node(cls, n):
        return n[1] if "eval" not in n[0] else n[0]["eval"](n)

    binary_ops = {'+': operator.add, '/': operator.floordiv}
    unary_ops = {'-': operator.neg}
    number["eval"] = lambda n: int(n[1], 10)
    binary_expr["eval"] = lambda n: E.binary_ops[n[2]](n[1], n[3])
    unary_expr["eval"] = lambda n: E.unary_ops[n[1]](n[2])

print E.parse("2 + 3 / -1", tree_factory=E.eval_node)

Using the parse tree

The same evaluator without storing anything on the grammar: a tree_factory object dispatches on node names instead:

#!/usr/bin/python
import operator
from lrparsing import *

class E(Grammar):
    e = Ref("e")
    number = Token(re="[0-9]+")
    binary_expr = Prio(e << Token("/") << e, e << Token("+") << e)
    unary_expr = Token("-") + e
    e = Prio(number, unary_expr, binary_expr)
    START = e


class Node(tuple):
    value = None
    def __new__(cls, n): return super(Node, cls).__new__(cls, n)
    def __repr__(self): return E.repr_parse_tree(self, False)


class Compiler(object):

    def __call__(self, node):
        node = Node(node)
        name = node[0].name
        if not isinstance(node[0], TokenSymbol):
            node.value = node[1].value
        else:
            name = name.split(".")[-1]
        if name in self.__class__.__dict__:
            self.__class__.__dict__[name](self, node)
        return node

    def number(self, node):
        node.value = int(node[1][1], 10)

    binary_ops = {'+': operator.add, '/': operator.floordiv}
    def binary_expr(self, node):
        node.value = self.binary_ops[node[2][1]](node[1].value, node[3].value)

    unary_ops = {'-': operator.neg}
    def unary_expr(self, node):
        node.value = self.unary_ops[node[1][1]](node[2].value)

print E.parse("2 + 3 / -1", tree_factory=Compiler()).value

Error messages

#!/usr/bin/python
import sys
from lrparsing import *

class E(Grammar):
    class T(TokenRegistry):
        number = Token(re="[0-9]+")
    e = Ref("e")
    binary_expr = Prio(e << Token("/") << e, e << Token("+") << e)
    unary_expr = Token("-") + e
    e = Prio(T.number, unary_expr, binary_expr)
    START = e

try:
    print E.repr_parse_tree(E.parse("""2 + 3 // -1"""))
except ParseError, e:
    sys.stderr.write("%s\n" % e)
Which prints:

line 1 column 8: Got '/' when expecting '-' or T.number while trying
    to match binary_expr in state 17

Pre-compiling

Parser generation at start-up can be eliminated by pre-compiling the grammar.

import sys
from lrparsing import *

class E(Grammar):
    e = Ref("e")
    binary_expr = Prio(e << Token("/") << e, e << Token("+") << e)
    unary_expr = Token("-") + e
    e = Prio(Token(re="[0-9]+"), unary_expr, binary_expr)
    START = e
    PRE_COMPILED = ('5709b73c7a329fb147636020987104248ebb2273bfb5d2153e30b452ba9645420e3e7e851f6288a663bb8445987692b5b020c5c1cfd863ef23587416bd0904b3', 0, ({'/[0-9]+/': 7, "'-'": 10}, {2: 1, 4: 2, 5: 3, 6: 4, 7: 5, 8: 6, 10: 8, 11: 9, 13: 11, 14: 12, 15: 13, 16: 14, 17: 15}, ['<E>']), ({'__end_of_input__': 16}, {}, ['<E>']), ({'__end_of_input__': (2, 1, 'START'), "'/'": 17, "'+'": 18}, {}, ['START', 'binary_expr']), ({'__end_of_input__': (4, 1, 'e'), "'/'": (4, 1, 'e'), "'+'": (4, 1, 'e')}, {}, ['e']), ({'__end_of_input__': (5, 1), "'/'": (5, 1), "'+'": (5, 1)}, {}, ['e']), ({'__end_of_input__': (5, 1), "'/'": (5, 1), "'+'": (5, 1)}, {}, ['e']), ({'__end_of_input__': (5, 1), "'/'": (5, 1), "'+'": (5, 1)}, {}, ['e']), ({'__end_of_input__': (6, 1), "'/'": (6, 1), "'+'": (6, 1)}, {}, ['e']), ({'__end_of_input__': (7, 1), "'/'": (7, 1), "'+'": (7, 1)}, {}, ['e']), ({'__end_of_input__': (8, 1), "'/'": (8, 1), "'+'": (8, 1)}, {}, ['e']), ({'/[0-9]+/': 7, "'-'": 10}, {4: 19, 5: 3, 6: 4, 7: 5, 8: 6, 10: 8, 11: 9, 13: 11, 14: 12, 15: 13, 16: 14, 17: 15}, ['unary_expr']), ({'__end_of_input__': (11, 1, 'binary_expr'), "'/'": (11, 1, 'binary_expr'), "'+'": (11, 1, 'binary_expr')}, {}, ['binary_expr']), ({'__end_of_input__': (13, 1), "'/'": (13, 1), "'+'": (13, 1)}, {}, ['binary_expr']), ({'__end_of_input__': (13, 1), "'/'": (13, 1), "'+'": (13, 1)}, {}, ['binary_expr']), ({'__end_of_input__': (14, 1), "'/'": (14, 1), "'+'": (14, 1)}, {}, ['binary_expr']), ({'__end_of_input__': (15, 1), "'/'": (15, 1), "'+'": (15, 1)}, {}, ['binary_expr']), ({'__empty__': (0, 2, '<E>')}, {}, ['<E>']), ({'/[0-9]+/': 7, "'-'": 10}, {4: 20, 5: 3, 6: 4, 7: 5, 8: 6, 10: 8, 11: 9, 13: 11, 14: 12, 15: 13, 16: 14, 17: 15}, ['binary_expr']), ({'/[0-9]+/': 7, "'-'": 10}, {4: 21, 5: 3, 6: 4, 7: 5, 8: 6, 10: 8, 11: 9, 13: 11, 14: 12, 15: 13, 16: 14, 17: 15}, ['binary_expr']), ({'__end_of_input__': (10, 2, 'unary_expr'), "'/'": (10, 2, 'unary_expr'), "'+'": (10, 2, 'unary_expr')}, {}, ['binary_expr', 'unary_expr']), ({'__end_of_input__': (16, 3), "'/'": (16, 3), "'+'": (16, 3)}, {}, ['binary_expr']), ({'__end_of_input__': (17, 3), "'/'": 17, "'+'": (17, 3)}, {}, ['binary_expr']))

print E.pre_compile_grammar(E.PRE_COMPILED)

Returns None if the passed string matches the grammar, a string containing the parse table otherwise.
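
The workflow this implies (a sketch based on the behaviour just
described):

# One-off: print the parse table and paste it into the source as
# PRE_COMPILED.  (Assumes pre_compile_grammar() with no argument
# always returns the table string.)
print repr(E.pre_compile_grammar())

# Every later start-up: None means PRE_COMPILED still matches the
# grammar, so the expensive compile step was skipped.
assert E.pre_compile_grammar(E.PRE_COMPILED) is None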

Utility functions

Lrparsing offers a few other utilities for debugging grammars.

»  pre_compile_grammar(pre_compiled=None)
»  parse(input, tree_factory=None, on_error=None, log=None)
»  repr_grammar() - the grammar as entered.
»  repr_productions() - compiled productions.
»  repr_parse_table(state=None) - print an LR(1) ItemSet.
»  repr_parse_tree(parse_tree) - pretty print a parse tree
»  unused_rules() - frozenset() of unused rules
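
Typical use while debugging a new grammar (a sketch, reusing the E
grammar above):

E.compile_grammar()
print E.repr_grammar()        # the rules as you typed them
print E.repr_productions()    # the productions they compiled into
print E.repr_parse_table()    # the LR(1) item sets
print E.unused_rules()        # rules nothing can ever reach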

Error recovery - saving the worst for last

Only useful if you want to report more than one error: parse() calls on_error instead of raising an exception.

lrparsing.on_error(iterator, input_tuple, stack)

The stack is a partially constructed branch of the parse tree:

([], (__empty__))
([START,binary_expr], ((e '2')))
([binary_expr], ('+'))
([binary_expr], ((e '3')))
([binary_expr], ('/'))

(START (e (binary_expr
  (e '2') '+'
  (e (binary_expr
    (e '3') '/'
    (e (unary_expr '-' (e '1'))))))))
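
A minimal on_error handler might look like this (a sketch: the
signature comes from the manual, but exactly what to return to resume
parsing is something to check there before relying on it):

import sys

def report_error(iterator, input_tuple, stack):
    # Report the offending input tuple and give up.  Returning None
    # is assumed to make parse() raise ParseError as usual; a real
    # handler would resynchronise and continue.
    sys.stderr.write("unexpected input: %r\n" % (input_tuple,))
    return None

E.parse("2 + 3 // -1", on_error=report_error)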

Resources

Home page:
    http://lrparsing.sourceforge.net.

Pypi page:
    https://pypi.python.org/pypi/lrparsing.

Online manual:
    http://lrparsing.sourceforge.net/doc.

And examples:
»  Expression evaluator.
»  Sqlite3 data manipulation statement grammar.
»  Lua 5.2 to Python 2.7 compiler.

The end