Introducing ...
lrparsing.py

Russell Stuart

Why?


..... Surely someone has done this already?

pyparsing.py

Has an example grammar for Sqlite.
... and has nice syntax:

from pyparsing import *

integer = Word(nums).setParseAction(lambda t:int(t[0]))
variable = Word(alphas,exact=1)
operand = integer | variable

expop, signop, multop  = Literal('^'), oneOf('+ -'), oneOf('* /')
plusop, factop = oneOf('+ -'), Literal('!')

expr = operatorPrecedence( operand,
    [(factop, 1, opAssoc.LEFT), (expop, 2, opAssoc.RIGHT),
     (signop, 1, opAssoc.RIGHT),
     (multop, 2, opAssoc.LEFT), (plusop, 2, opAssoc.LEFT),]
    )
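
Parsing is then a single call (a sketch; parseString is pyparsing's
standard entry point, the input string is made up):

result = expr.parseString("9 + -2 * 3 ^ 2")
print result.asList()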

pyparsing.py ... but

- The example Sqlite grammar is buggy.
- Poor free documentation - buy the book.
- And it's absurdly slow:

$ ./pyparsing-sqlite.py 
select * from xyzzy where z > 100
parse time=0.389035

select * from xyzzy where z > 100 order by zz
parse time=0.646955

select * from xyzzy
parse time=0.001180

select a, b, c from t1 as A, t2 as B
    where x1 > x2 + 2 * 3 - 1 and x1 < 3
    order by 1 asc, 2 desc
parse time=1.323233

Next ... a hand-written LL parser

select a, b, c from t1 as A, t2 as B
    where t1.a > t2.b + 2 * 3 - 1 and t1 < ?
    order by 1 asc, 2 desc, t1.a
parse time= 0.001747






A sane person probably would have stopped at this point.

Next ... parsing.py.

An LR(1) parser generator with LR(1) and GLR(1) parsers.

select a, b, c from t1 as A, t2 as B
    where t1.a > t2.b + 2 * 3 - 1 and t1 < ?
    order by 1 asc, 2 desc, t1.a
parse time= 0.004702

... parsing.py - Sample Code - The Grammar

Grammar definition:

class NArithExpr(NonTerm):
    @NonTerm.reduce
    def reduceA(self, expr1):
        "%reduce NUnaryExpr"
        self.HIDDEN = True
    @NonTerm.reduce
    def reduceB(self, expr1, mult, expr2):
        "%reduce NArithExpr SymbolMult NArithExpr [PrecMult]"
    @NonTerm.reduce
    def reduceC(self, expr1, div, expr2):
        "%reduce NArithExpr SymbolDiv NArithExpr [PrecMult]"
    @NonTerm.reduce
    def reduceD(self, expr1, mod, expr2):
        "%reduce NArithExpr SymbolMod NArithExpr [PrecMult]"
    @NonTerm.reduce
    def reduceE(self, expr1, plus, expr2):
        "%reduce NArithExpr SymbolPlus NArithExpr [PrecAdd]"
    @NonTerm.reduce
    def reduceF(self, expr1, minus, expr2):
        "%reduce NArithExpr SymbolMinus NArithExpr [PrecAdd]"
    @NonTerm.reduce
    def reduceG(self, atom):
        "%reduce NAtom"
        self.HIDDEN = True

... parsing.py - Sample Code - Defining Tokens

Tokens have to be defined individually:

class Token(parsing.Token):
    __metaclass__ = TokenMeta
    #
    # ... etc ...
    #

class Keyword(Token): SUPPRESS = True

class KeywordAnd(Keyword): keyword = "and"
class KeywordAs(Keyword): keyword = "as"
class KeywordAsc(Token): keyword = "asc"
class KeywordBetween(Keyword): keyword = "between"
class KeywordBy(Keyword): keyword = "by"
class KeywordDesc(Token): keyword = "desc"
class KeywordFirst(Keyword): keyword = "first"
class KeywordFrom(Keyword): keyword = "from"
class SymbolPlus(Token): token = '+'
class SymbolMinus(Token): token = '-'
class SymbolParam(Token): token = '?'
class SymbolMult(Token): token = '*'
class SymbolDiv(Token): token = '/'
class SymbolMod(Token): token = '%'
# and on and on ...

... parsing.py - Sample Code - Defining Precedences

Each precedence is a class too:

class Prec(parsing.Precedence):
    pass
class PrecUnary(Prec):
    "%right"
class PrecMult(Prec):
    "%left <PrecUnary"
class PrecAdd(Prec):
    "%left <PrecMult"
class PrecCompare(Prec):
    "%left <PrecAdd"
class PrecEquivalence(Prec):
    "%left <PrecCompare"
class PrecBetween(Prec):
    "%left <PrecEquivalence"
class PrecAnd(Prec):
    "%left <PrecBetween"
class PrecOr(Prec):
    "%left <PrecAnd"
class PrecIdent(Prec):
    "%left"
class PrecExpr(Prec):
    "%left <PrecIdent"

... parsing.py - Summary






Better than a hand-written LL parser?

Afterward ... ply

An LALR(1) parser with inbuilt tokeniser.

select a, b, c from t1 as A, t2 as B
    where x1 > x2 + 2 * 3 - 1 and x1 < 3
    order by 1 asc, 2 desc
parse time=0.001505


Had I found ply, I would have used it.

ply - Sample Code - The Tokeniser

class Lexer(object):
    reserved = {
        "and": "KeywordAnd",
        "as": "KeywordAs",
        "asc": "KeywordAsc",
        "by": "KeywordBy",
        # and so on ...
    }
    tokens = (
        # more lines like below ..
        "SymbolEquals",
        "SymbolStringCat",
        "TokenString",
        "TokenInteger",
        "TokenDotIdent",
        "TokenIdent",) + tuple(reserved.itervalues())
    # more lines like below ..
    t_SymbolEquals = '=|=='
    t_SymbolStringCat = r'\|\|'
    t_TokenString = "(?:'(?:[^']*)')+"
    t_TokenFloat = "[0-9]+[.][0-9]*|[.][0-9]+"
    t_TokenInteger = "[0-9]+"
    t_TokenDotIdent = '[a-zA-Z_][a-zA-Z_0-9]*(?:[.][a-zA-Z][a-zA-Z_0-9]*)+|"[a-zA-Z_][a-zA-Z_0-9]*(?:[.][a-zA-Z][a-zA-Z_0-9]*)+"'

    def t_TokenIdent(self, t):
        '[a-zA-Z_][a-zA-Z_0-9]*|"[a-zA-Z_][a-zA-Z_0-9]*"'
        t.type = self.reserved.get(t.value.lower(), 'TokenIdent')
        return t

ply - Sample Code - The Parser

class Parser(object):
    tokens = Lexer.tokens

    def p_NWhere(self, p):
        """NWhere : KeywordWhere NBoolExpr"""
        p[0] = ("NWhere",) + tuple(p[i] for i in range(1,len(p)))

    def p_NBoolExpr(self, p):
        """NBoolExpr : KeywordNot NBoolExpr %prec KeywordNotU
            | NBoolExpr KeywordAnd NBoolExpr 
            | NBoolExpr KeywordOr NBoolExpr 
            | NArithExpr SymbolLessThan NArithExpr 
            | NArithExpr SymbolLessEqual NArithExpr 
            | NArithExpr SymbolGreaterEqual NArithExpr 
            | NArithExpr SymbolGreaterThan NArithExpr 
            | NArithExpr SymbolEquals NArithExpr 
            | NArithExpr SymbolNotEquals NArithExpr 
            | SymbolLeftParen NBoolExpr SymbolRightParen"""
        p[0] = ("NBoolExpr",) + tuple(p[i] for i in range(1,len(p)))
        

So where are we here?




Lots of choices ...

Some acceptable.

Probably more ...




So why a talk about lrparsing?

Introducing ... lrparsing

An LR(1) parser with inbuilt tokeniser.

select a, b, c from t1 as A, t2 as B
    where x1 > x2 + 2 * 3 - 1 and x1 < 3
    order by 1 asc, 2 desc
parse time=0.000807

lrparsing - Sample Code

class Parser(Grammar):
    class T(TokenRegistry):
        float = Token(re="[0-9]+[.][0-9]*|[.][0-9]+")
        integer = Token(re="[0-9]+")
        ident = Token(re='[a-zA-Z_][a-zA-Z_0-9]*|"[a-zA-Z_][a-zA-Z_0-9]*"')
        dot_ident = Token(re='[a-zA-Z_][a-zA-Z_0-9]*(?:[.][a-zA-Z_][a-zA-Z_0-9]*)+|"[a-zA-Z_][a-zA-Z_0-9]*(?:[.][a-zA-Z][a-zA-Z_0-9]*)+"')
        string = Token(re="(?:'(?:[^']*)')+")

    Tk = lambda s,k=None: Tokens(s, k, case=False); K = lambda k: Keyword(k, case=False)
    NArithExpr = Ref("NArithExpr"); NBoolExpr = Ref("NBoolExpr")
    NFuncCall = T.ident + '(' + List(NBoolExpr, ',') + ')'
    NArithExpr = Prio(
            T.ident | T.dot_ident | T.string | T.integer | T.float | '?' | NFuncCall,
            Tk("+ -") + THIS,
            THIS << Tk("* / %") << THIS, THIS << Tk("+ -") << THIS,)
    NOrderByTerm = Prio(T.ident | T.integer, NArithExpr) + Opt * Tk("", "asc desc")
    NOrderBy = K("order") + K("by") + List(NOrderByTerm, ',', 1)
    NBoolExpr = Prio(
        '(' + THIS + ')',
        K("not") + THIS, NArithExpr + Tk("< <= >= > == = != <>") + NArithExpr,
        THIS << K("or") << THIS, THIS << K("and") << THIS,)
    NFrom = K("from") + List(T.ident + Opt(K("as") + T.ident), ",", 1)
    NSelectColumn = Prio(
            (T.ident | T.dot_ident) + Opt(K("as") + T.ident) | '*',
            NArithExpr + Opt(K("as") + T.ident))
    NSelect = K("select") + List(NSelectColumn, ",", 1)
    NQuery = NSelect + NFrom + Opt(K("where") + NBoolExpr)
    NUnion = List(NQuery, K("union"), 1) + Opt * NOrderBy
    START = NUnion

Back to the beginning ... What does a parser do?


We start with a string like this:

4 + -3 ** 2 / (1 ++ 2)


And we assign meaning.
We do that via a parse tree.
The parse tree assigns names to things.

4 + -3 ** 2 / (1 ++ 2)
expr

And we write that as:

(expr ....)

What does a parser do? ... it breaks things up


Next we break the expr down like this:

  4           +    -3 ** 2 / (1 ++ 2)
number=4   '+'         expr        

And we write that as:

(expr number=4 '+' (expr ...))


And we repeat:

  -3 ** 2    /    (1 ++ 2)
     expr    '/'     expr    

(expr number=4 '+' (expr (expr ...) '/' (expr ...)))

What does a parser do? ... into their constituent bits


So a parser is something that starts with a string:

4 + -3 ** 2 / (1 ++ 2)

 

And turns it into a parse tree:

(expr
  number=4
  '+'
  (expr
    (expr '-' (expr number=3 '**' number=2))
    '/'
    (expr '(' (expr number=1 '+' (expr '+' number=2)) ')')))

And with that parse tree, we could evaluate the expression according to the rules of arithmetic.
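
Evaluating that tree bottom-up: 4 + (-(3 ** 2)) / (1 + (+2)) = 4 + -9 / 3 = 4 + -3 = 1.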

What makes up a parser?


The tokeniser (aka lexer) names the smallest chunks.
   »  * is a '*'
   »  ** is a '**'
   »  ++ is two tokens: '+' followed by '+'.


A grammar that says what the language looks like.


A parser generator that compiles the grammar.
   »  The compiled form is called a parse table.


A parser. A small piece of code that:
   »  Processes the tokenised input stream,
   »  according to a parse table,
   »  to produce a parse tree.

A Grammar

The parse tree:

  -3 ** 2    /    (1 ++ 2)
     expr    '/'     expr    

Suggests this pattern:

expr = expr '/' expr

 

This is called a production. A grammar is just the collection of all the productions making up the language:

expr = expr '+' expr
expr = expr '/' expr
expr = expr '**' expr
expr = '+' expr
expr = '-' expr
expr = '(' expr ')'
expr = number

lrparsing's Grammar

lrparsing uses Python expressions to represent its grammar.

  • Rules have to be valid Python expressions.
  • Assigning to the same name more than once loses all but the last assignment.
  • A rule can't refer to itself before it has been defined (hence THIS).
from lrparsing import Grammar, THIS, Token
class Expr(Grammar):
    expr = (
	THIS + '/' + THIS |
	THIS + '+' + THIS |
	THIS + '**' + THIS |
	'+' + THIS |
	'-' + THIS |
	'(' + THIS + ')' |
	Token(re="[0-9]+"))
    START = expr

lrparsing's tokeniser


Internally tokens are recognised using a single large regular expression:

token1_re|token2_re|...
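
Roughly the idea in plain re (a sketch of the technique, not
lrparsing's actual code; all names here are made up):

import re

# One named group per token regex; re tries alternatives left to
# right, so '**' is listed before '*'.
TOKENS = [
    ("POW",    r"\*\*"),
    ("MULT",   r"\*"),
    ("PLUS",   r"\+"),
    ("NUMBER", r"[0-9]+"),
    ("SPACE",  r"\s+"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % t for t in TOKENS))

def tokenise(s):
    for m in MASTER.finditer(s):
        if m.lastgroup != "SPACE":
            yield m.lastgroup, m.group()

print list(tokenise("3 ** 2 ++ 1"))
# '**' comes out as one POW token, '++' as two PLUS tokens.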

Token constructors:

 
  • Token(literal="**", [case=True]),
    or Token("**", [case=True]),
    or just "**".
  • Token(re="[a-zA-Z_][a-zA-Z0-9_]*")
  • Keyword("and", [case=True])
  • UnrecognisedToken()

OMG! It's a pig in lipstick

from lrparsing import Grammar, THIS, Token
class Expr(Grammar):
    expr = (
	THIS + '/' + THIS | THIS + '+' + THIS | THIS + '**' + THIS |
	'+' + THIS | '-' + THIS | '(' + THIS + ')' |
	Token(re="[0-9]+"))
    START = expr
print Expr.repr_parse_tree(Expr.parse("4 + -3 ** 2 / (1 ++ 2)"))
Which fails with:

While trying to recognise state 11:
  expr = '+' expr ^
  expr = expr ^ '+' expr
on seeing a '+' I could not decide between:
  replace the sequence ['+' expr] with expr
and
  accept the '+' in the hope it will match:
    expr = expr '+' ^ expr
Please remove this ambiguity from your grammar

Translating pig

While trying to recognise state 11:
  expr = '+' expr ^
  expr = expr ^ '+' expr
on seeing a '+' I could not decide
 

Substitute the Reduce into the Shift and we get:

expr = '+' expr ^ '+' expr
 

The choice is between two parse trees:

'+' expr '+' expr
  expr   '+' expr      (reduce first)

and:

'+' expr '+' expr
'+'     expr           (shift first)

Prio()

Rewrite the grammar to get rid of the ambiguity:

class Expr(Grammar):
    expr = Prio(
	'(' + THIS + ')' | Token(re="[0-9]+"),
	THIS + '**' + THIS,
	'+' + THIS | '-' + THIS,
	THIS + '/' + THIS,
	THIS + '+' + THIS)
    START = expr
Which still fails:

lrparsing.GrammarError: Conflict: Shift/Reduce and no associativity
While trying to recognise state 22:
  expr.Prio.Prioritised4 = expr '+' expr ^
  expr.Prio.Prioritised4 = expr ^ '+' expr
on seeing a '+' I could not decide between:
  replace the sequence [expr '+' expr] with expr.Prio.Prioritised4
and
  accept the '+' in the hope it will match:
    expr.Prio.Prioritised4 = expr '+' ^ expr
Please remove this ambiguity from your grammar

Displaying the compiled productions - the code

Lrparsing has compiled your grammar into productions.

#!/usr/bin/python
from lrparsing import Grammar, Prio, THIS, Token

class Expr(Grammar):
    expr = Prio(
	'(' + THIS + ')' | Token(re="[0-9]+"),
	THIS + '**' + THIS,
	'+' + THIS | '-' + THIS,
	THIS + '/' + THIS,
	THIS + '+' + THIS)
    START = expr

try:
  Expr.compile_grammar()
except:
  print Expr.repr_productions()
  raise
print Expr.repr_parse_tree(Expr.parse("4 + -3 ** 2 / (1 ++ 2)"))

Displaying the compiled productions - the result

0     : <E> = START __end_of_input__
1     : START = expr
2     : expr = expr.Prio
3     : expr.Prio = expr.Prio.Prioritised0
        expr.Prio = expr.Prio.Prioritised1
        expr.Prio = expr.Prio.Prioritised2
        expr.Prio = expr.Prio.Prioritised3
        expr.Prio = expr.Prio.Prioritised4
4.1   : expr.Prio.Prioritised0 = '(' expr ')'
        expr.Prio.Prioritised0 = /[0-9]+/
5.1   : expr.Prio.Prioritised1 = expr '**' expr
6.1   : expr.Prio.Prioritised2 = '+' expr
        expr.Prio.Prioritised2 = '-' expr
7.1   : expr.Prio.Prioritised3 = expr '/' expr
8.1   : expr.Prio.Prioritised4 = expr '+' expr
Traceback (most recent call last):
  File "example-3.py", line 15, in 
    Expr.compile_grammar()
  ........

Translating pig (again)

While trying to recognise state 23:
  expr.Prio.Prioritised3 = expr ^ '/' expr
  expr.Prio.Prioritised3 = expr '/' expr ^
on seeing a '/' I could not decide
 

Substitute the Reduce into the Shift and we get:

expr = expr '/' expr ^ '/' expr
 

The choice is between two parse trees:

expr '/' expr '/' expr
     expr     '/' expr      (reduce first: left associative)

and:

expr '/' expr '/' expr
expr '/'     expr           (shift first: right associative)

Using '<<' and '>>'

Rewrite the grammar to get rid of the ambiguity:

#!/usr/bin/python
from lrparsing import Grammar, Prio, THIS, Token

class Expr(Grammar):
    expr = Prio(
	'(' + THIS + ')' | Token(re="[0-9]+"),
	THIS >> '**' >> THIS,
	'+' + THIS | '-' + THIS,
	THIS << '/' << THIS,
	THIS << '+' << THIS)
    START = expr
    
print Expr.repr_parse_tree(Expr.parse("4 + -3 ** 2 / (1 ++ 2)"))

Finally, a parse tree.

(START (expr
  (expr '4') '+'
  (expr
    (expr '-' (expr
      (expr '3') '**'
      (expr '2')))
    '/'
    (expr '('
      (expr
        (expr '1') '+'
        (expr '+' (expr '2'))) ')'))))

Other grammar constructors

Syntactic sugar for Repeat (see the sketch below)

»  Opt(symbol) or Opt * symbol
»  Some(symbol) or Some * symbol
»  Many(symbol) or Many * symbol
»  symbol * N
  »  THIS * 0     (idiom for an empty production: bare "name =" isn't valid Python)
  »  symbol * 1     (idiom for name2 = name1)
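
Pulling the sugar together (a sketch; the Args grammar and its input
are invented for the example):

from lrparsing import Grammar, Many, Opt, Some, Token

class Args(Grammar):
    ident = Token(re="[a-zA-Z_][a-zA-Z_0-9]*")
    # An optional leading '*', one or more idents, then zero or more
    # ',' ident pairs.
    START = Opt(Token('*')) + Some(ident) + Many(',' + ident)

print Args.repr_parse_tree(Args.parse("* a b, c, d"))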

Raw constructors

»  Choice(s0, s1, ...) or s0 | s1 | ...
»  Left(s0 + s1 + ...) or s0 << s1 << ...
»  Right(s0 + s1 + ...) or s0 >> s1 >> ...
»  Sequence(s0, s1, ...) or s0 + s1 + ...

And finally for the truly lazy

»  Tokens(literals, keywords, case=True)

Tokens("== !=", "like in")

Is the same as:

Token("==") | Token("!=") | Keyword("like") | Keyword("in")

lrparsing - Sample Code

class Parser(Grammar):
    class T(TokenRegistry):
        float = Token(re="[0-9]+[.][0-9]*|[.][0-9]+")
        integer = Token(re="[0-9]+")
        ident = Token(re='[a-zA-Z_][a-zA-Z_0-9]*|"[a-zA-Z_][a-zA-Z_0-9]*"')
        dot_ident = Token(re='[a-zA-Z_][a-zA-Z_0-9]*(?:[.][a-zA-Z_][a-zA-Z_0-9]*)+|"[a-zA-Z_][a-zA-Z_0-9]*(?:[.][a-zA-Z][a-zA-Z_0-9]*)+"')
        string = Token(re="(?:'(?:[^']*)')+")
    Tk = lambda s,k=None: Tokens(s, k, case=False); K = lambda k: Keyword(k, case=False)
    NArithExpr = Ref("NArithExpr"); NBoolExpr = Ref("NBoolExpr")
    NFuncCall = T.ident + '(' + List(NBoolExpr, ',') + ')'
    NArithExpr = Prio(
            T.ident | T.dot_ident | T.string | T.integer | T.float | '?' | NFuncCall,
            Tk("+ -") + THIS,
            THIS << Tk("* / %") << THIS, THIS << Tk("+ -") << THIS,)
    NOrderByTerm = Prio(T.ident | T.integer, NArithExpr) + Opt * Tk("", "asc desc")
    NOrderBy = K("order") + K("by") + List(NOrderByTerm, ',', 1)
    NBoolExpr = Prio(
        '(' + THIS + ')',
        K("not") + THIS, NArithExpr + Tk("< <= >= > == = != <>") + NArithExpr,
        THIS << K("or") << THIS, THIS << K("and") << THIS,)
    NFrom = K("from") + List(T.ident + Opt(K("as") + T.ident), ",", 1)
    NSelectColumn = Prio(
            (T.ident | T.dot_ident) + Opt(K("as") + T.ident) | '*',
            NArithExpr + Opt(K("as") + T.ident))
    NSelect = K("select") + List(NSelectColumn, ",", 1)
    NQuery = NSelect + NFrom + Opt(K("where") + NBoolExpr)
    NUnion = List(NQuery, K("union"), 1) + Opt * NOrderBy
    START = NUnion

It's not that un-natural

Plain Python classes already hand back something different from what you assigned (a function becomes an unbound method), so a Grammar handing back symbol objects is no stranger:

$ python
Python 2.7.3 (default, Jan  2 2013, 13:56:14) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> class X(object):
...     def x(self): pass
...     y = [x]
>>> X.x == X.x
True
>>> X.x == X.y[0]
False
>>> from lrparsing import Grammar, Token
>>> class G(Grammar):
...   START = Token('x')
...   y = [START]
>>> G.START == G.START
True
>>> G.START == G.y[0]
False
>>> 

The internals of the parse tree

>>> from lrparsing import *
>>> class Expr(Grammar):
...     class T(TokenRegistry):
...         a = Token('a')
...     A = T.a
...     B = A + Token('b') + Token('b')
...     START = A | B
>>> parse_tree = Expr.parse("a\n   b  b")
>>> print Expr.repr_parse_tree(parse_tree, False)
(START (B (A 'a') 'b' 'b'))
>>> print repr(parse_tree)
(START = A | B, (B = A + 'b' + 'b', (A = T.a='a', (T.a='a', 'a', 0, 1, 1)), \
    ('b', 'b', 5, 2, 4), ('b', 'b', 8, 2, 7)))
>>> print str(Expr.START)
START
>>> print repr(Expr.START)
START = A | B
>>> print repr(Expr.T.a)
T.a='a'
>>> parse_tree[0] is Expr.START
True
>>> parse_tree[1][1][1][0] is Expr.T.a
True
>>> isinstance(parse_tree[1][2][0], TokenSymbol)
True
>>> 

The inbuilt tokeniser

>>> class E(Grammar):
...     class T(TokenRegistry):
...         u = UserToken()
...         x = UnrecognisedToken()
...     START = Repeat(T.u) + Repeat(Token('a') | T.x)
...     COMMENTS = (
...          Token(re='#[^\n]*\n?') |
...          Token(re=r'/\*(?:[^*]|\*[^/])*\*/'))
...     WHITESPACE = "."
>>> parse_tree = E.parse(
...     ((E.T.u, 1), (E.T.u, 2, set()), "a.a.xxxa# some stuff"))
>>> print E.repr_parse_tree(parse_tree)
(START T.u T.u 'a' 'a' '.xxx' 'a')
>>> print repr(parse_tree)
(START = T.u * () + ('a' | T.x=UnrecognisedToken()) * (), (T.u, 1), \
  (T.u, 2, set([])), ('a', 'a', 0, 1, 1), ('a', 'a', 2, 1, 3), \
  (T.x=UnrecognisedToken(), '.xxx', 3, 1, 4), ('a', 'a', 7, 1, 8))
>>> 

Abusing the parse tree

Grammar rules support item assignment, so we can hang an "eval" function off each rule and have the tree_factory run it as each node is built:

#!/usr/bin/python
import operator
from lrparsing import *

class E(Grammar):
    e = Ref("e")
    number = Token(re="[0-9]+")
    binary_expr = Prio(e << Token("/") << e, e << Token("+") << e)
    unary_expr = Token("-") + e
    e = Prio(number, unary_expr, binary_expr)
    START = e

    @classmethod
    def eval_node(cls, n):
        return n[1] if "eval" not in n[0] else n[0]["eval"](n)

    binary_ops = {'+': operator.add, '/': operator.floordiv}
    unary_ops = {'-': operator.neg}
    number["eval"] = lambda n: int(n[1], 10)
    binary_expr["eval"] = lambda n: E.binary_ops[n[2]](n[1], n[3])
    unary_expr["eval"] = lambda n: E.unary_ops[n[1]](n[2])

print E.parse("2 + 3 / -1", tree_factory=E.eval_node)

Using the parse tree

The same evaluator without storing anything on the grammar: a tree_factory object dispatches on node names instead:

#!/usr/bin/python
import operator
from lrparsing import *

class E(Grammar):
    e = Ref("e")
    number = Token(re="[0-9]+")
    binary_expr = Prio(e << Token("/") << e, e << Token("+") << e)
    unary_expr = Token("-") + e
    e = Prio(number, unary_expr, binary_expr)
    START = e


class Node(tuple):
    value = None
    def __new__(cls, n): return super(Node, cls).__new__(cls, n)
    def __repr__(self): return E.repr_parse_tree(self, False)


class Compiler(object):

    def __call__(self, node):
        node = Node(node)
        name = node[0].name
        if not isinstance(node[0], TokenSymbol):
            node.value = node[1].value
        else:
            name = name.split(".")[-1]
        if name in self.__class__.__dict__:
            self.__class__.__dict__[name](self, node)
        return node

    def number(self, node):
        node.value = int(node[1][1], 10)

    binary_ops = {'+': operator.add, '/': operator.floordiv}
    def binary_expr(self, node):
        node.value = self.binary_ops[node[2][1]](node[1].value, node[3].value)

    unary_ops = {'-': operator.neg}
    def unary_expr(self, node):
        node.value = self.unary_ops[node[1][1]](node[2].value)

print E.parse("2 + 3 / -1", tree_factory=Compiler()).value

Error messages

#!/usr/bin/python
import sys
from lrparsing import *

class E(Grammar):
    class T(TokenRegistry):
        number = Token(re="[0-9]+")
    e = Ref("e")
    binary_expr = Prio(e << Token("/") << e, e << Token("+") << e)
    unary_expr = Token("-") + e
    e = Prio(T.number, unary_expr, binary_expr)
    START = e

try:
    print E.repr_parse_tree(E.parse("""2 + 3 // -1"""))
except ParseError, e:
    sys.stderr.write("%s\n" % e)
Which prints:

line 1 column 8: Got '/' when expecting '-' or T.number while trying
    to match binary_expr in state 17

Pre-compiling

Parser generation at start-up can be eliminated by pre-compiling the grammar.

import sys
from lrparsing import *

class E(Grammar):
    e = Ref("e")
    binary_expr = Prio(e << Token("/") << e, e << Token("+") << e)
    unary_expr = Token("-") + e
    e = Prio(Token(re="[0-9]+"), unary_expr, binary_expr)
    START = e
    PRE_COMPILED = ('5709b73c7a329fb147636020987104248ebb2273bfb5d2153e30b452ba9645420e3e7e851f6288a663bb8445987692b5b020c5c1cfd863ef23587416bd0904b3', 0, ({'/[0-9]+/': 7, "'-'": 10}, {2: 1, 4: 2, 5: 3, 6: 4, 7: 5, 8: 6, 10: 8, 11: 9, 13: 11, 14: 12, 15: 13, 16: 14, 17: 15}, ['<E>']), ({'__end_of_input__': 16}, {}, ['<E>']), ({'__end_of_input__': (2, 1, 'START'), "'/'": 17, "'+'": 18}, {}, ['START', 'binary_expr']), ({'__end_of_input__': (4, 1, 'e'), "'/'": (4, 1, 'e'), "'+'": (4, 1, 'e')}, {}, ['e']), ({'__end_of_input__': (5, 1), "'/'": (5, 1), "'+'": (5, 1)}, {}, ['e']), ({'__end_of_input__': (5, 1), "'/'": (5, 1), "'+'": (5, 1)}, {}, ['e']), ({'__end_of_input__': (5, 1), "'/'": (5, 1), "'+'": (5, 1)}, {}, ['e']), ({'__end_of_input__': (6, 1), "'/'": (6, 1), "'+'": (6, 1)}, {}, ['e']), ({'__end_of_input__': (7, 1), "'/'": (7, 1), "'+'": (7, 1)}, {}, ['e']), ({'__end_of_input__': (8, 1), "'/'": (8, 1), "'+'": (8, 1)}, {}, ['e']), ({'/[0-9]+/': 7, "'-'": 10}, {4: 19, 5: 3, 6: 4, 7: 5, 8: 6, 10: 8, 11: 9, 13: 11, 14: 12, 15: 13, 16: 14, 17: 15}, ['unary_expr']), ({'__end_of_input__': (11, 1, 'binary_expr'), "'/'": (11, 1, 'binary_expr'), "'+'": (11, 1, 'binary_expr')}, {}, ['binary_expr']), ({'__end_of_input__': (13, 1), "'/'": (13, 1), "'+'": (13, 1)}, {}, ['binary_expr']), ({'__end_of_input__': (13, 1), "'/'": (13, 1), "'+'": (13, 1)}, {}, ['binary_expr']), ({'__end_of_input__': (14, 1), "'/'": (14, 1), "'+'": (14, 1)}, {}, ['binary_expr']), ({'__end_of_input__': (15, 1), "'/'": (15, 1), "'+'": (15, 1)}, {}, ['binary_expr']), ({'__empty__': (0, 2, '<E>')}, {}, ['<E>']), ({'/[0-9]+/': 7, "'-'": 10}, {4: 20, 5: 3, 6: 4, 7: 5, 8: 6, 10: 8, 11: 9, 13: 11, 14: 12, 15: 13, 16: 14, 17: 15}, ['binary_expr']), ({'/[0-9]+/': 7, "'-'": 10}, {4: 21, 5: 3, 6: 4, 7: 5, 8: 6, 10: 8, 11: 9, 13: 11, 14: 12, 15: 13, 16: 14, 17: 15}, ['binary_expr']), ({'__end_of_input__': (10, 2, 'unary_expr'), "'/'": (10, 2, 'unary_expr'), "'+'": (10, 2, 'unary_expr')}, {}, ['binary_expr', 'unary_expr']), ({'__end_of_input__': (16, 3), "'/'": (16, 3), "'+'": (16, 3)}, {}, ['binary_expr']), ({'__end_of_input__': (17, 3), "'/'": 17, "'+'": (17, 3)}, {}, ['binary_expr']))

print E.pre_compile_grammar(E.PRE_COMPILED)

Returns None if the passed string matches the grammar, a string containing the parse table otherwise.
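
The workflow this implies (a sketch based on the behaviour just
described):

# One-off: print the parse table and paste it into the source as
# PRE_COMPILED.  (Assumes pre_compile_grammar() with no argument
# always returns the table string.)
print repr(E.pre_compile_grammar())

# Every later start-up: None means PRE_COMPILED still matches the
# grammar, so the expensive compile step was skipped.
assert E.pre_compile_grammar(E.PRE_COMPILED) is None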

Utility functions

Lrparsing offers a few other utilities for debugging grammars.

»  pre_compile_grammar(pre_compiled=None)
»  parse(input, tree_factory=None, on_error=None, log=None)
»  repr_grammar() - the grammar as entered.
»  repr_productions() - compiled productions.
»  repr_parse_table(state=None) - print an LR(1) ItemSet.
»  repr_parse_tree(parse_tree) - pretty print a parse tree
»  unused_rules() - frozenset() of unused rules
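
Typical use while debugging a new grammar (a sketch, reusing the E
grammar above):

E.compile_grammar()
print E.repr_grammar()        # the rules as you typed them
print E.repr_productions()    # the productions they compiled into
print E.repr_parse_table()    # the LR(1) item sets
print E.unused_rules()        # rules nothing can ever reach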

Error recovery - saving the worst for last

Only useful if you want to report more than one error: parse() calls on_error instead of raising an exception.

lrparsing.on_error(iterator, input_tuple, stack)

The stack is a partially constructed branch of the parse tree:

([], (__empty__))
([START,binary_expr], ((e '2')))
([binary_expr], ('+'))
([binary_expr], ((e '3')))
([binary_expr], ('/'))

(START (e (binary_expr
  (e '2') '+'
  (e (binary_expr
    (e '3') '/'
    (e (unary_expr '-' (e '1'))))))))
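
A minimal on_error handler might look like this (a sketch: the
signature comes from the manual, but exactly what to return to resume
parsing is something to check there before relying on it):

import sys

def report_error(iterator, input_tuple, stack):
    # Report the offending input tuple and give up.  Returning None
    # is assumed to make parse() raise ParseError as usual; a real
    # handler would resynchronise and continue.
    sys.stderr.write("unexpected input: %r\n" % (input_tuple,))
    return None

E.parse("2 + 3 // -1", on_error=report_error)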

Resources

Home page:
    http://lrparsing.sourceforge.net.

Pypi page:
    https://pypi.python.org/pypi/lrparsing.

Online manual:
    http://lrparsing.sourceforge.net/doc.

And examples:
»  Expression evaluator.
»  Sqlite3 data manipulation statement grammar.
»  Lua 5.2 to Python 2.7 compiler.

The end