lrparsing

Abstract

lrparsing is an LR(1) parser hiding behind a pythonic interface.

Author

Russell Stuart <russell-lrparsing@stuart.id.au>

Introduction

Lrparsing provides both an LR(1) parser and a tokeniser. It differs from other Python LR(1) parsers in using Python expressions as grammars, and offers simple-to-use disambiguation tools.

The result is something that is powerful yet concise. For simple tasks this means it can be thought of as an extension to Python’s existing re module, used when regular expressions become too cumbersome. For complex tasks lrparsing offers a high speed parser (very roughly 25us per token from string to parse tree on a desktop CPU), pre-compilation of the grammar and error recovery hooks so parsing can continue after an error is found.

In addition to extensive documentation it comes with a parser for Sqlite3 data manipulation statements and a Lua 5.2 to Python compiler as examples. The documentation can be read online at http://lrparsing.sourceforge.net/doc/html/.

First Steps

Below this section is a wall of text that will overwhelm all but the most determined. This example is designed to help you navigate it. Invest some time in reading it, enough to gain an intuitive feel for how lrparsing works. If your task is small it might be enough to get you through.

The grammar is specified by a class derived from Grammar. Attributes of that class define the grammar:

import lrparsing
from lrparsing import Keyword, List, Prio, Ref, THIS, Token, Tokens

class ExprParser(lrparsing.Grammar):
    #
    # Put Tokens we don't want to re-type in a TokenRegistry.
    #
    class T(lrparsing.TokenRegistry):
        integer = Token(re="[0-9]+")
        integer["key"] = "I'm a mapping!"
        ident = Token(re="[A-Za-z_][A-Za-z_0-9]*")
    #
    # Grammar rules.
    #
    expr = Ref("expr")                # Forward reference
    call = T.ident + '(' + List(expr, ',') + ')'
    atom = T.ident | T.integer | Token('(') + expr + ')' | call
    expr = Prio(                      # If ambiguous choose atom 1st, ...
        atom,
        Tokens("+ - ~") >> THIS,      # >> means right associative
        THIS << Tokens("* / // %") << THIS,
        THIS << Tokens("+ -") << THIS,# THIS means "expr" here
        THIS << (Tokens("== !=") | Keyword("is")) << THIS)
    expr["a"] = "I am a mapping too!"
    START = expr                      # Where the grammar must start
    COMMENTS = (                      # Allow C and Python comments
        Token(re="#(?:[^\r\n]*(?:\r\n?|\n\r?))") |
        Token(re="/[*](?:[^*]|[*][^/])*[*]/"))

parse_tree = ExprParser.parse("1 + /* a */ b + 3 * 4 is c(1, a)")
print(ExprParser.repr_parse_tree(parse_tree))

Will print:

(START (expr                  # Here START is ExprParser.START
  (expr                       # And expr is ExprParser.expr too.
    (expr
      (expr (atom '1')) '+'   # '+' is repr(input_matched_by_token)
      (expr (atom 'b'))) '+'  # 'b' is actually ExprParser.T.ident
    (expr
      (expr (atom '3')) '*'   # '3' is actually ExprParser.T.integer
      (expr (atom '4')))) 'is'
  (expr (atom (call 'c' '('
    (expr (atom '1')) ','     # ',' is an anonymous Token
    (expr (atom 'a')) ')')))))

As is suggested by the output, the parse tree returned by parse() is a tree composed of nested Python tuples. The quoted strings above are actually tuples of the form:

(Token(), 'matched_input', offset_in_stream, line_number, column_number)

So the tuple representing '3' would be (ExprParser.T.integer, '3', 16, 1, 17).
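
Because the parse tree is plain nested tuples, walking it needs no special API. The helper below is a minimal sketch (the name tokens_of is ours, not part of lrparsing) that collects the leaf token tuples of the parse_tree built above:

def tokens_of(node):
    """Collect the token tuples at the leaves of a parse tree, in order."""
    if isinstance(node[0], lrparsing.TokenSymbol):
        return [node]            # a leaf: (Token(), text, offset, line, column)
    tokens = []
    for child in node[1:]:       # an inner node: (Rule, child, child, ...)
        tokens.extend(tokens_of(child))
    return tokens

for tok in tokens_of(parse_tree):
    print("%s %r at line %d column %d" % (tok[0], tok[1], tok[3], tok[4]))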

Your next step will probably be writing a small grammar, and you will become discouraged when lrparsing complains it is ambiguous. Read Ambiguities for hints on how to fix it.

If you go much beyond the example above you will need to read the reference material below, and you will find yourself getting lost in a thicket of jargon. The jargon is unavoidable as parsing involves a lot of concepts, and each needs its own name. Traditionally a Glossary is the solution for jargon, but you may find the Glossary Narrative a gentler introduction to the world of parsing.

Module contents

The module contains many classes and a few functions. The classes are described below.

In addition, there are a few utility routines. They all duplicate methods of the Grammar class. They are provided in case a Grammar subclass redefines them:

lrparsing.compile_grammar(grammar)

Compile grammar, which must be a subclass of Grammar. If the compile fails a GrammarError will be raised. This is typically used when debugging new Grammars. Grammar must be compiled before it can be used, but this is done implicitly for you by parse() or can be done explicitly by pre_compile_grammar() if compiling is too slow to do at run time. repr_parse_table() displays a lot more information when the grammar is compiled by compile_grammar(). Compiling a Grammar is not thread safe.

lrparsing.epoch_symbol(grammar)

Returns the start symbol, which is the Symbol used to initialise the Grammar subclass grammar. It is a non-terminal with a single production: <MyGrammar> = START __end_of_input__. Useful during Error Recovery.

lrparsing.parse(grammar, input, tree_factory=None, on_error=None, log=None)

Parse the input using the subclass of Grammar passed, returning a parse tree if the grammar matches the input or raising a ParseError otherwise, calling compile_grammar() first if the grammar hasn’t been compiled. If the function on_error isn’t None it will be called when the parser input doesn’t match the grammar. It is documented in Error Recovery. If the function log is passed it will be called repeatedly with strings showing the parser’s stack.

The input is passed to a tokeniser which converts it to the stream of input tuples expected by the parser. This tokeniser is created by the TokenRegistry declared in the Grammar subclass. By default the inbuilt tokeniser is used, and it accepts either a string or a generator that yields strings and input tuples. Input tuples are normal Python tuples that have a TokenSymbol instance as their first element. The parser ignores the remaining elements in the tuple.

The default format of the returned parse tree is a tree of nested tuples. These tuples will be one of the input tuples given to the parser, or will have one of the Rule instances defined as a rule in the Grammar followed by the tuples representing the Symbols that made up the right hand side of the production matched. (Use repr_productions() to see these productions.) The format of the returned parse tree can be altered by passing a function as tree_factory. Whatever it returns is substituted into the parse tree instead of the tuple. It must accept one argument: the tuple that would have been inserted.
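
For example, this hedged sketch of a tree_factory (the name node_factory is ours) swaps each tuple for a mutable list, so later passes can edit the tree in place:

def node_factory(node):
    # node is the tuple parse() would otherwise insert into the tree.
    return list(node)

tree = ExprParser.parse("1 + 2", tree_factory=node_factory)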

The parse operation is thread safe but the lazy compiling of the Grammar is not. 1

lrparsing.pre_compile_grammar(grammar, pre_compiled=None)

compile_grammar() can take a long while for reasonably sized grammars (seconds on a 32 bit out-of-order desktop CPU). This function allows you to skip that time in production environments by giving the parser a pre-compiled parse table.

The pre-compiled parse table is actually a (very big) tuple made up entirely of Python constants. This function returns the repr() of that tuple (ie, a string). Use the returned value by passing it to this function(!) as the pre_compiled argument. If you pass the returned string directly it will be around 10 times faster than calling it without the pre_compiled parameter. If you pass it the raw tuple it will be about 100 times faster. The reason for the difference is the passed string must be parsed by eval(), but if you embed the raw tuple into your code it will be tokenised by Python and put in the .pyc file.

If this function is passed a pre-compiled parse table that matches the current grammar it returns None quickly, otherwise it compiles the grammar into a parse table and returns it as a string, which is orders of magnitude slower. This means that, apart from speed, the Grammar always behaves the same way (ie, as specified by the source defining its class) regardless of whether this function is called; even if called with a pre-compiled parse table that doesn’t match the Grammar.

The returned parse table is optimised, meaning information useful for debugging the grammar has been stripped out of it. This operation is not thread safe.
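
The intended workflow is roughly the following sketch, assuming a separate build step and a generated module called parse_table.py (the file and module names are ours, not something lrparsing prescribes):

# Build time: compile once and write the table out as Python source.
with open("parse_table.py", "w") as f:
    f.write("TABLE = %s\n" % ExprParser.pre_compile_grammar())

# Run time: embedding the raw tuple means Python tokenises it into the
# .pyc, giving the roughly 100 fold speed up described above.
from parse_table import TABLE
if ExprParser.pre_compile_grammar(TABLE) is not None:
    raise RuntimeError("parse_table.py is stale; regenerate it")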

lrparsing.repr_grammar(grammar)

Return a printable string with the Rules as seen by the Grammar subclass grammar. Useful for debugging grammars.

lrparsing.repr_parse_table(grammar, state=None)

Return a printable string describing the parse table of the passed Grammar subclass grammar. This only returns something useful after compile_grammar() has been called. The return value may even be useful if compile_grammar() raised an exception. The result returned for a pre-compiled grammar is barely useful for debugging. If state (an integer) is passed only that state is returned instead of the entire table. If state is negative that state and all states that refer to it in their action or reduce tables are returned.

lrparsing.repr_parse_tree(parse_tree, indent=' ')

Return a printable string representing the passed parse_tree (as returned by parse()) in a human readable format. If indent is False it will be on one line, otherwise it will be split across lines and indented using the string passed as indent to reveal the tree structure.

lrparsing.repr_productions(grammar)

Return a printable string describing all the productions in the passed Grammar subclass grammar. This only works if compile_grammar() has been called. It may work even if compile_grammar() has raised an exception. Useful for debugging grammars.

lrparsing.unused_rules(grammar)

Returns a frozenset of the rules that aren’t reachable from the Grammar.START rule. If this isn’t empty the grammar probably contains an error.

The Grammar Class

A grammar’s rules are contained in a class derived from lrparsing's Grammar class. Members of the Grammar subclass that are assigned instances of the Symbol class (described below) make up the rules of the grammar. For the most part other members are simply ignored by lrparsing, meaning you can add your own functions and variables to a Grammar subclass with impunity.

class lrparsing.Grammar

The constructor does nothing by default. Lrparsing uses classes rather than instances to do its work, so it doesn’t expect or care if a subclass is instantiated.

classmethod compile_grammar(optimise=True, cache=None)

Identical to the module’s compile_grammar(). If redefined the module’s function continues to work.

classmethod epoch_symbol()

Identical to the module’s epoch_symbol(). If redefined the module’s function continues to work.

classmethod parse(input, tree_factory=None, on_error=None, log=None)

Identical to the module’s parse(). If redefined the module’s function continues to work.

classmethod pre_compile_grammar(pre_compiled=None)

Identical to the module’s pre_compile_grammar(). If redefined the module’s function continues to work.

classmethod repr_grammar()

Identical to the module’s repr_grammar(). If redefined the module’s function continues to work.

classmethod repr_parse_tree(parse_tree, indent=' ')

Identical to the module’s repr_parse_tree(). If redefined the module’s function continues to work.

classmethod repr_parse_table(state=None)

Identical to the module’s repr_parse_table(). If redefined the module’s function continues to work.

classmethod repr_productions()

Identical to the module’s repr_productions(). If redefined the module’s function continues to work.

classmethod unused_rules()

Identical to the module’s unused_rules(). If redefined the module’s function continues to work.

_parser_

Reserved for internal use by Grammar. Do not use.

COMMENTS

This member must be assigned a Keyword(), Token, UnrecognisedToken() or UserToken, or a Choice of those symbols. TokenSymbols assigned to COMMENTS are ignored by parse().

START

START must be assigned a Rule instance. The goal of the grammar is to recognise the Rule defined by START.

WHITESPACE

This is passed unaltered to the tokeniser. The inbuilt tokeniser insists that if this is present, it is a string whose characters can optionally separate Tokens.

Assigning a Symbol to a member of a Grammar subclass creates a new Rule instance containing the Symbol on the right hand side, and whose Rule.name is the name of the member it was assigned to. This makes it different from normal Python assignment where the left hand side and the right hand side are the same object after the assignment executes. In practice there are two places this matters. Firstly, except for assigning to Grammar.START, you may not assign one Rule directly to another as it makes it impossible to determine the grammar hierarchy 2. Secondly, you cannot get a reference to a TokenSymbol by assigning it to a rule. Use a TokenRegistry to do that 5. Rules whose names start with an underscore (_) are treated specially. They are suppressed in the output parse tree 3.
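
The underscore convention is easiest to see in a tiny sketch (untested; the rule names are ours). Here the _parens rule keeps parenthesised expressions from adding a level to the parse tree:

import lrparsing
from lrparsing import Ref, Token

class Parens(lrparsing.Grammar):
    _parens = Token('(') + Ref("expr") + ')'
    expr = Token(re="[0-9]+") | _parens
    START = expr

# The _parens node is suppressed, leaving something like:
#   (START (expr '(' (expr '42') ')'))
print(Parens.repr_parse_tree(Parens.parse("(42)")))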

Symbol Classes

Instances of Symbol subclasses are used to build rules in a Grammar. Symbol instances can be created in the normal way - by calling their class name (aka the constructor) with the appropriate arguments. However lrparsing provides some syntactic sugar by hijacking the Python expression builder. The Python operators that are translated to calling a Symbol constructor are noted below. Another sweetener is all Symbol class constructors will, when they are expecting a Symbol as an argument, cast some standard Python types to an appropriate Symbol class instance. For example, a string will be cast to a literal Token. 4

The Symbol subclass constructors that can be used to define Grammars are:

class lrparsing.Choice(s0, s1, ...)

Alternate syntax: s0 | s1 | ...

The grammar must choose between one of the symbols s0, s1, …

class lrparsing.Left(s)

Alternate syntax: s = s0 << s1 << s2 << ...

If the symbol s is a Sequence(s0, s1, s2, ...), and there is a choice between the parse trees (((s0 s1) s2) ...) and (... (s0 (s1 s2))) use the (((s0 s1) s2) ...). In other words s is left associative.

class lrparsing.List(s, d, min=0, max=None, opt=False)

The grammar must see a list of the symbol s separated by the symbol d, ie, s d s d s .... If opt is True the list may optionally end with d, otherwise it must end with s. If min is absent or 0 the list may be empty, otherwise s must appear a minimum of min times. If max is None any number of s are allowed, otherwise max is the maximum number of occurrences of s.

class lrparsing.Nonassoc(s)

If the symbol s is a Sequence(s0, s1, s2, ...), and there is a choice between the parse trees (((s0 s1) s2) ...) and (... (s0 (s1 s2))) raise a ParseError. In other words s isn’t associative.

class lrparsing.Prio(s0, s1, ..)

Alternate syntax: (s0, s1, ...)

The grammar must choose between one of the symbols s0, s1, ... If several choices are possible, choose s0 over s1 over ... A tuple assigned directly to a class member is ignored (as any non-Symbol is), but will be converted to a Prio if passed to a Symbol constructor.

class lrparsing.Ref(name)

A forward reference to the Rule assigned to the class member name (a string). It will be replaced by that member definition, which must occur later.

class lrparsing.Repeat(s, min=0, max=None)
Alternative syntaxes:
  • s * N

  • s * (min,)

  • s * (min,max)

  • Opt(s), Opt * s, s * (0,1)

  • Some(s), Some * s, s * (1,)

  • Many(s), Many * s, Repeat * s, s * ()

The grammar must see s repeated. If min is 0 or absent no repeats are required, meaning there may be no s at all, otherwise s must appear a minimum of min times. If max is None any number of s are allowed, otherwise max is the maximum number of repeats. These shortcuts are also provided:

  • Opt(x) or Opt * x is equivalent to Repeat(x, min=0, max=1).

  • Some(x) or Some * x is equivalent to Repeat(x, min=1).

  • Many(x) or Many * x or Repeat * x is equivalent to Repeat(x).

  • x * N is equivalent to Repeat(x, min=N, max=N).

  • x * tuple(...) is equivalent to Repeat(x, *tuple(...)).

class lrparsing.Right(s)

Alternative syntax: s = s0 >> s1 >> s2 >> ...

If the symbol s is a Sequence(s0, s1, s2, ...), and there is a choice between the parse trees (((s0 s1) s2) ...) and (... (s0 (s1 s2))) use the (... (s0 (s1 s2))). In other words s is right associative.

class lrparsing.Sequence(s0, s1, ...)

Alternative syntax: s0 + s1 + ...

The grammar must see symbol s0, followed by s1, followed by …

class lrparsing.THIS

A reference to the Rule being assigned to. So identical to Ref('this') in this = Ref('this') + '+' + Ref('this').

class lrparsing.Tokens(literals[, keywords])

The grammar must choose one of the supplied literals and keywords recognised by the inbuilt tokeniser. Both literals and keywords are strings which are split (as by str.split()) to yield multiple literals and keywords, each of which is recognised using the inbuilt Token class.

Python does not allow rule = for an empty rule. Use rule = THIS * 0 instead.
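
Here is an untested sketch pulling several of these constructors together (the rule names are ours), with the equivalent alternate syntax noted beside each:

import lrparsing
from lrparsing import List, Opt, Some, Token, Tokens

class Demo(lrparsing.Grammar):
    word = Token(re="[a-z]+")
    number = Token(re="[0-9]+")
    phrase = Some(word)            # same as Repeat(word, min=1) or word * (1,)
    sign = Opt(Tokens("+ -"))      # same as Repeat(Tokens("+ -"), min=0, max=1)
    value = sign + number          # a Sequence
    START = phrase + '=' + List(value, ',', min=1)

tree = Demo.parse("total = -12, 3, +4")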

Although the classes above are used to build Grammar rules by appearing on the right hand side of a Python assignment statement, their instances are not directly visible. Instead an instance of the Rule class is created to hold them, and this is what is assigned to your Grammar attribute. Since Rule is a subclass of Symbol, rules can be referenced by other grammar rules:

class lrparsing.Rule
name

The name of the Grammar attribute the Rule was assigned to. A string.

__str__()

Returns name.

All symbols the parser can recognise are derived from this abstract base class, and thus share these attributes:

class lrparsing.Symbol

If you subclass Symbol the constructor must be called, as it performs initialisation.

dict

A dict you can use for your own purposes for all rules bar Grammar.START.

__contains__(key):

Equivalent to key in symbol.dict.

__delitem__(key):

Equivalent to del symbol.dict[key].

__getitem__(key):

Equivalent to symbol.dict[key].

__iter__():

Equivalent to iter(symbol.dict).

__len__():

Equivalent to len(symbol.dict).

__nonzero__():

Always returns True, ensuring a Symbol instance is always truthy.

__setitem__(key, value)

Equivalent to symbol.dict[key] = value.

__repr__()

Returns the production for this Symbol.

__str__()

Generates a unique name for the Symbol, based on the Rule it belongs to. Tokens override this to return their name attribute.

Tokens

Tokens are a special kind of Symbol 9. As a convenience the same token may have several instances in a Grammar. Instances of the same token are merged during the compile, resulting in all bar one being discarded. Two tokens are considered to be the same if they have the same name attribute. The inbuilt tokens generate a suitable name for you, but you can assign a different name to any token using a TokenRegistry. Names containing __ are used internally, and so are reserved.

The following methods and classes are used directly in grammars:

lrparsing.Keyword(word, case=True)

Create an instance of the Token class that recognises a keyword. If case is False the keyword isn’t case sensitive, otherwise it is. The default name is repr(word). A keyword is a literal Token that is recognised by another re Token. For example, an identifier might be recognised by Token(re='[A-Za-z_]+'). The keyword 'then' would normally be recognised by the identifier Token, but if Keyword('then') is present it will match 'then', while all other non-keywords will still be matched by the identifier Token.
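
For example, this hedged sketch (rule names ours) carves two keywords out of the identifier token:

import lrparsing
from lrparsing import Keyword, Token

class IfGrammar(lrparsing.Grammar):
    ident = Token(re="[A-Za-z_]+")
    START = Keyword("if") + ident + Keyword("then", case=False) + ident

IfGrammar.parse("if a then b")   # 'a' and 'b' match ident
IfGrammar.parse("if a THEN b")   # ok: this keyword is case insensitive
# IfGrammar.parse("if if then b") would raise a ParseError: the second
# 'if' is still matched as the keyword, never as an ident.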

class lrparsing.Token(literal=None, re=None, case=True)

Alternative syntax: 'literal'

Use the inbuilt tokeniser to generate a Token symbol when it sees the text specified by literal or re. The arguments become attributes with the same names. You can only supply literal or re, not both. If literal is supplied the tokeniser will match exactly that text, unless case is False in which case it ignores case when matching. If re is supplied it must be a regular expression the tokeniser will match. The re is compiled with re.MULTILINE and re.DOTALL; adding additional flags using re's (?FLAG) syntax affects the recognition of all Tokens. The default name for literal tokens is repr(literal). The default name for re tokens is the re surrounded by /’s. Strings assigned directly to class members are ignored, as are all non-Symbols, but strings passed to Symbol constructors are turned into a literal Token.

class lrparsing.UserToken

A user defined token. The UserToken must be assigned a name by a TokenRegistry. This class can be used as a base for user defined token classes.

lrparsing.UnrecognisedToken()

Create an instance of the Token class that is given to the parser when the inbuilt tokeniser strikes a string it doesn’t recognise as a token or whitespace. If this token doesn’t appear in the Grammar the inbuilt tokeniser raises a TokenError instead.
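
A hedged sketch of the salvage pattern this enables (names ours, untested): unrecognisable input arrives as ordinary tokens the grammar can soak up, rather than raising a TokenError:

import lrparsing
from lrparsing import Repeat, Token, UnrecognisedToken

class Salvage(lrparsing.Grammar):
    word = Token(re="[a-z]+")
    junk = UnrecognisedToken()
    START = Repeat(word | junk)

Salvage.parse("hello @#$ world")   # '@#$' becomes junk instead of an error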

If you delve deeply into lrparsing you will come across these token classes. They aren’t directly used by grammars:

class lrparsing.MetaToken(name)

Used to construct special tokens used internally by the grammar compiler and parser. You will never see these in the output parse tree, but error recovery may see them. There are two: __empty__ represents a 0 width character string, and __end_of_input__ represents the end of the input stream.

class lrparsing.TokenSymbol(name)

The abstract base class all tokens are based on, thus isinstance(symbol, TokenSymbol) will return True if a Symbol is a token of some kind. The passed name is the generated name which can be overridden by the TokenRegistry. TokenSymbols and thus all subclasses have these attributes and methods:

dict

This is inherited from Symbol, as is the shorthand access provided by Symbol's mapping interface. TokenSymbol.dict is best set from within a TokenRegistry. Changes made to this attribute in the body of a Grammar subclass change the Symbol.dict of the Rule the TokenSymbol is assigned to, not the TokenSymbol's dict.

name

A string. The name of the token. If two tokens with the same name are used in a Grammar they will be merged using merge() and only one will be kept.

named

A bool. True if set_name() has been called.

merge(other)

Called by the TokenRegistry when two tokens have the same name. The token other will be discarded after merge() returns.

position(input_tuple)

Given an input_tuple supplied to the parser, return a string describing the position in the input stream. The default implementation returns None, which means that if the input_tuple triggers a parser error, the position of the error in the input stream won’t be included in the message.

set_name(name)

Called by the TokenRegistry to override the generated name. Sets named to True.

__str__()

Returns str(name).

The TokenRegistry Class

The TokenRegistry is the home for tokens. If you don’t define a subclass within the Grammar, an instance of TokenRegistry is used directly 5. Assigning a token to a member of a TokenRegistry class gives you a handle to the token instances recognised by the Grammar, and it is the only way to get such a handle 6. Assigning the token also sets the TokenSymbol.name of the token to the qualified attribute name, which is a string with the format 'NameOfTokenRegistrySubclass.member_name'.

class lrparsing.TokenRegistry

A single instance of TokenRegistry, or of the subclass defined in the Grammar, is created when the Grammar class is declared. The constructor looks up all tokens declared in its body and registers their TokenSymbol.name using _resolve_token_().

_resolve_token_(token_symbol)

Called as tokens are found while the Grammar is compiled into LR(1) productions. It returns the TokenSymbol that will be presented to the parser for a given TokenSymbol.name. That instance will always be the first TokenSymbol passed to it with that name. If the passed token is a duplicate, it is passed to TokenSymbol.merge() before being discarded.

compile_tokens(whitespace)

Called once, during Grammar compilation and before the first call to tokeniser(), so the tokeniser can initialise itself. The whitespace parameter is the Grammar.WHITESPACE attribute if one was declared, None otherwise. This method must raise an exception if whitespace isn’t valid. What “whitespace” means is up to the tokeniser. The default definition creates an instance of the inbuilt tokeniser, and calls its compile_tokens method with the same arguments.

tokeniser(input, whitespace)

This is a generator, called by parse() to turn its input parameter into the stream of input tuples fed to the LR(1) parser. The whitespace parameter is the same one that was passed to compile_tokens(). The input tuples accepted by the parser are tuples whose first element is a TokenSymbol instance. The remainder of the tuple is ignored by the parser, but the tuple is placed unmodified into the output parse tree. The default definition calls the tokeniser() method of the inbuilt tokeniser, passing it the same arguments 8.
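
Pulling these two hooks together, here is a hedged sketch of a TokenRegistry that bypasses the inbuilt tokeniser and feeds the parser UserTokens. The splitting strategy and all the names are ours, not something lrparsing prescribes:

import lrparsing

class WordGrammar(lrparsing.Grammar):
    class T(lrparsing.TokenRegistry):
        word = lrparsing.UserToken()

        def compile_tokens(self, whitespace):
            pass                    # nothing to compile: we split by hand

        def tokeniser(self, input, whitespace):
            # Yield one input tuple per whitespace separated word. Only
            # the first element matters to the parser; the rest is ours.
            for position, text in enumerate(input.split()):
                yield (self.word, text, position)

    START = lrparsing.Repeat(T.word, min=1)

tree = WordGrammar.parse("a few words")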

Inbuilt Tokeniser

The inbuilt tokeniser generates Token instances. Thus if you use Keyword(), Token, Tokens or UnrecognisedToken() in your grammar, you must use the inbuilt tokeniser. It is used by default unless you add a TokenRegistry to your grammar and override its TokenRegistry.compile_tokens() and TokenRegistry.tokeniser() methods.

The input accepted by the inbuilt tokeniser is either a string or a generator yielding a mixture of strings and input tuples 7. If the input is broken up into multiple strings you should ensure the breaks occur at token boundaries. Grammar.WHITESPACE is a string whose characters can appear between tokens but are otherwise ignored. If not defined it defaults to " \f\n\r\t\v".

Strings are turned into input tuples with the following format:

(Token(), matched_data, stream_offset, line_number, column_number)

where:

Token()

is an instance of the Token class used in the Grammar.

matched_data

is the matched string from the input. A string.

stream_offset

is the offset from the start of the input where the matched data was found. An integer.

line_number

is the line number the string was found on. The first line is numbered 1. An integer.

column_number

is the column number within the line where the string was found. The first column is numbered 1. An integer.

If a generator is passed and it yields objects other than strings they are passed onto the parser unchanged in the order they were received.
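
For instance, this sketch (with hand computed position fields) mixes a string with a ready-made input tuple:

def mixed_input():
    yield "1 + "                                 # tokenised by the inbuilt tokeniser
    yield (ExprParser.T.integer, "2", 4, 1, 5)   # passed through unchanged

tree = ExprParser.parse(mixed_input())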

Exceptions

exception lrparsing.LrParsingError

Base: StandardError.

An abstract base class used by all exceptions defined here.

exception lrparsing.GrammarError

Base: LrParsingError.

Raised when there is an error in the grammar. This can happen both when the grammar is first parsed, and when compile_grammar() is called.

exception lrparsing.ParsingError

Base: LrParsingError.

An abstract base class for all errors that can be raised by this module when parse() is called.

exception lrparsing.TokenError(message, data, offset, line, column)

Base: ParsingError.

Raised by the inbuilt tokeniser when it strikes something in the input that isn’t entirely whitespace, isn’t a recognisable Token, and the grammar doesn’t use UnrecognisedToken. The arguments following message become attributes of the instance.

data

The string that wasn’t recognised.

offset

Data's position in the input stream as a character offset from the start. An integer.

line

Data's position in the input stream as a line number starting at 1 for the first line in the input. An integer.

column

Data's position in the input stream as the column number in the current line, starting at 1 for the first column. An integer.

exception lrparsing.ParseError(input_tuple, lr1_state)

Base: ParsingError.

Raised by parse() when the input the parser is given doesn’t match the Grammar. The arguments become instance attributes with the same name.

input_tuple

The input tuple given to the parser that triggered the error. If this input tuple was created by the inbuilt tokeniser it will contain position information, ie, line number, column number and so on.

lr1_state

The current parser state, an Lr1State instance 10, from the parse table (ie, as returned by repr_parse_table()). If the parse table hasn’t been pre compiled repr() reveals a lot of information about what the parser was expecting at the time.
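
A sketch of reporting an error with these attributes; the unpacking assumes the five element input tuples produced by the inbuilt tokeniser:

try:
    ExprParser.parse("1 + * 2")
except lrparsing.ParseError as e:
    symbol, text, offset, line, column = e.input_tuple
    print("unexpected %r at line %d, column %d" % (text, line, column))
    print("expected one of: " +
          ", ".join(sorted(str(t) for t in e.lr1_state.actions)))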

Error Recovery

The default action of parse() is to raise a ParseError exception with a suitable message, which of course stops the parser at that point. If you want the parser to continue so it can discover more errors in the same run you need to do error recovery. Error recovery is always accomplished by manipulating the parser’s input so the parser thinks it has seen something valid, and then letting it continue. It is implemented by the on_error parameter to parse():

lrparsing.on_error(iterator, input_tuple, stack)

Handle a parser error.

Parameters
  • input_tuple – The input tuple that triggered the error.

  • iterator – The iterator the parser is reading input tuples from.

  • stack – The parser’s stack. A list.

Return type

None or an iterable yielding input tuples that will be inserted into the input stream.

The difficulty with any error recovery is determining how to turn the mess you are given as input into something the parser will accept. Invariably this means you try to figure out what the user was trying to do. For example, let’s say you are parsing Python and strike a problem in an expression. If the expression was part of a for statement your best strategy is probably to replace the entire for statement with for x in ():. But if that for was inside a list comprehension inserting a for statement at that point would likely generate a cascade of errors later on. So understanding where you are is critical. Sadly this is one thing LR(1) makes hard. That information is buried in the stack parameter. The stack is a partially constructed branch, running from the root to a leaf, of a parse tree. It is a list of tuples, like this:

[
  (lr1_state, (symbol, ...)),
  (lr1_state, (symbol, ...)),
  ...
]

Consider the following two grammars, which both recognise the same thing, a small toy language:

>>> def on_error(iterator, token, stack):
    for item in stack:
        print "([%s], (%s))" % (
            ','.join(sorted(str(rule) for rule in item[0].rules)),
            ','.join(
                lrparsing.repr_parse_tree(sym, indent="")
                for sym in item[1]))

>>> class Lang1(Grammar):
    e = Prio(
        Token(re='[0-9]+') | Token(re='[a-z]+') + Opt(Token('(') + THIS + ')'),
        THIS << '*' << THIS,
        THIS << '+' << THIS)
    START = Repeat((
        Opt(e) |
        Keyword("if") + e + Keyword('then') + THIS +
        Opt(Keyword('else') + THIS) + Keyword("endif") |
        Keyword("while") + e + Keyword('do') + THIS + Keyword("done") |
        Token(re='[a-z]+') + '=' + e) + ';')

>>> Lang1.parse(
    "e=1; while funca(e) do if a then e=a; fi; done", on_error=on_error)
([<Lang1>], (__empty__))
([START], ('e','=',(e '1'),';'))
([START], ('while'))
([START,e], ((e 'funca' '(' (e 'e') ')')))
([START], ('do'))
([START], ('if'))
([START,e], ((e 'a')))
([START], ('then'))
([START], ((START 'e' '=' (e 'a') ';' (e 'fi') ';')))
lrparsing.ParseError: line 1 column 43: Got 'done' when expecting 'else' or 'endif' while trying to match START in state 44

>>> class Lang2(Grammar):
    class T(TokenRegistry):
        number = Token(re='[0-9]+')
        ident = Token(re='[a-z]+')
        kwd_endif = Keyword("endif")
    exp = Ref("exp")
    block = Ref("block")
    call = T.ident + '(' + exp + ')'
    mul_op = exp << '*' << exp
    add_op = exp << '+' << exp
    exp = Prio(T.number | T.ident | call, mul_op, add_op)
    if_part = Keyword("if") + exp
    then_part =  Keyword("then") + block
    else_part = Keyword("else") + block
    if_st = (if_part + then_part + Opt(else_part) + T.kwd_endif)
    while_part = Keyword("while") + exp
    do_part = Keyword("do") + block
    while_st = while_part + do_part + Keyword("done")
    exp_st = exp * 1
    ass_exp = exp * 1
    ass_st = T.ident + '=' + ass_exp
    empty_st = THIS * 0
    st = ass_st | empty_st | exp_st | if_st | while_st
    block = Repeat(st + ';')
    START = block

>>> Lang2.parse(
    "e=1; while funca(e) do if a then e=a; fi; done", on_error=on_error)
([<Lang2>], (__empty__))
([block], ((st (ass_st 'e' '=' (ass_exp (exp '1')))),';'))
([while_st], ((while_part 'while' (exp (call 'funca' '(' (exp 'e') ')')))))
([do_part], ('do'))
([if_st], ((if_part 'if' (exp 'a'))))
([then_part], ('then'))
([then_part], ((block (st (ass_st 'e' '=' (ass_exp (exp 'a')))) ';' (st (exp_st (exp 'fi'))) ';')))
lrparsing.ParseError: line 1 column 43: Got 'done' when expecting 'else' or 'endif' while trying to match then_part in state 50

Here the on_error() function prints the stack passed, but doesn’t attempt recovery. Looking at the right hand side of the stack, the (symbol, symbol, ...) bit, it shouldn’t be too hard to see how stitching it together will yield a branch of the parse tree. This sequence of symbols must eventually match one of the Rules listed to their left. The LR(1) parser may list several as it will only make its final decision once it has seen the entire production. For Lang2 it is not difficult to see the parser was in the middle of the then_part of an if_st. Lang1 is an example of how to make recognising this difficult.

lr1_state has more information in it than the above example shows. It is an Lr1State 10:

class lrparsing.Lr1State(id, actions, gotos, rules)

This class is immutable, the parameters becoming attributes of the same name. It has no useful methods beyond object.__str__() and object.__repr__().

id

An integer identifying the state. In an Lr1State this integer is its index into the parse table.

actions

The shift table of this parse state. It is a dict, indexed by all valid TokenSymbols that could appear at this point. The error occurred because the current token isn’t a key to this dict.

gotos

The reduce table of this parse state. A dict. Not useful for error recovery.

rules

This is a list of Rules from your grammar the parser was in the process of recognising. In other words if the parser was processing the grammar class G(Grammar): START = Token("x"), then this must be True: rules[0] in (G.START,G.epoch_symbol()).

By just looking at the top item of the passed stack for Lang2 we see the parser was in the middle of a then_part. Its task is to re-arrange the input so the parser can continue. The simple way is to insert one of the TokenSymbols out of Lr1State.actions. In this case 'endif' would be the right choice. Another choice is to erase some input, possibly replacing it with something else. This would be appropriate if, for example, the string being parsed was ((2 func 2;. Replacing func 2 with )) would get the parser onto the next statement so the user only gets one error for the current line.

If on_error() returns None the parser will raise a ParseError as if on_error() wasn’t passed to it. Otherwise it will read input tuples from the returned iterable, and when that is exhausted return to reading the original input. So, returning a list of input tuples will insert them, reading input tuples off the passed iterator will delete them, and reading them and returning what you read allows you to peek ahead. As the parser has already processed the input tuple it passed to on_error(), if you want the parser to reprocess it, it must be in the returned iterable. Going back to the example, this line would insert an 'endif', thus allowing the parser to process the remainder of the input:

endif = (Lang2.T.kwd_endif, Lang2.T.kwd_endif.literal) + token[2:]
return [endif, token]

There is one other option for the adventurous. The stack passed to on_error() is the actual parser stack, so you can modify it. The states on the stack depend on the whims of the parser generator, so adding or modifying them is impossibly fragile. Deleting them, using, say, del stack[-2:], is not. In the ((2 func 2 example deleting the entire expression from the stack and presenting the parser with a replacement expression on its input (eg, just Lang2.T.number) is often simpler.

Ambiguities

Lurking under lrparsing’s hood is an LR(1) parser. LR(1) parser generators are cantankerous beasts. lrparsing puts lipstick on the pig, but the pig is still a pig. The principal way it bites is to say your grammar is ambiguous. This section is here to help you work past this error, but be aware writing large complex grammars requires a fair amount of expertise which can’t be imparted by this small how-to.

To understand what ambiguous means, understand that a parser is meant to take some input, validate it conforms to the grammar, and then produce a parse tree. When the parser generator says the grammar is ambiguous, it means that for some valid input several different parse trees can be produced. An LR(1) grammar requires there be only one parse tree for any given input.

A blatantly ambiguous grammar will produce a Reduce/Reduce conflict. Consider:

class A(Grammar):
   a = Token('x')
   b = Token('x')
   START = a | b

A.compile_grammar() raises this exception:

lrparsing.GrammarError: Conflict: Reduce/Reduce
While trying to recognise:
  b = 'x' ^
  a = 'x' ^
on seeing a __end_of_input__ I could not decide between:
  replace the sequence ['x'] with b
and
  replace the sequence ['x'] with a
Please remove this ambiguity from your grammar

Looking at the error message, it displays every possible production that could apply at this point, with an ^ showing where the parser is up to in each production. There are only two types of actions the parser can take. It can accept the next token in the input stream. This is called a Shift. It isn’t possible here as the next symbol is __end_of_input__, in other words it is at the end of the string. The other action it could take is replace some symbols with the production they match. This is called a Reduce. It can do a reduce as there are two productions that match. Sadly two is one too many, and this conflict is called Reduce/Reduce because both options are reduces. They correspond to these parse trees:

(START (a 'x'))  OR  (START (b 'x'))

Since there are two possible parse trees the grammar is ambiguous. If you look at the error message just the right way, it could almost be said to say that. The solution in this case is to remove one of the redundant Rules:

class A(Grammar):
   a = Token('x')
   START = a

Looping grammars are also ambiguous. Consider this grammar:

class B(Grammar):
    b = Ref("e")
    e = b | Token("x")
    START = e

Again, the Grammar will recognise a single 'x'. B.compile_grammar() raises this exception:

lrparsing.GrammarError: Conflict: Reduce/Reduce
While trying to recognise:
  START = e ^
  b = e ^
on seeing a __end_of_input__ I could not decide between:
  replace the sequence [e] with b
and
  replace the sequence [e] with START
Please remove this ambiguity from your grammar

The error message is saying when the parser has seen a single 'e', it has two choices, which are listed. These two choices correspond to an infinite number of parse trees:

(START (e 'x')) OR (START (e (b (e 'x')))) OR (START (e (b (e (b (e 'x')))))) OR ...

Again since there is more than one possible parse tree, the grammar is ambiguous.

Unfortunately those two simple examples don’t capture the complexity of common ambiguities. Consider this example which is common in real world grammars:

class E(Grammar):
    e = THIS + Tokens("+ /") + THIS | Token(re="[0-9]+")
    START = e

This Grammar recognises strings of integers separated by '+' and '/'. If the grammar writer wanted to produce a grammar that only recognised valid integer expressions using the operators '+' and '/' he was successful. Nonetheless, when he calls E.compile_grammar() this exception is raised:

lrparsing.GrammarError: Conflict: Shift/Reduce and no associativity
While trying to recognise:
  e = e '+' e ^
  e = e ^ '/' e
on seeing a '/' I could not decide between:
  replace the sequence [e '+' e] with e
and
  accepting the '/' in the hope it will match:
    e = e '/' ^ e
Please remove this ambiguity from your grammar

Again the error message lists the productions under consideration and how far the parser has got in each. The smallest possible string the parser could be looking at and end up with those possibles is: e + e ^ / e. The parser wants to know whether it should replace (reduce) e + e with e, which corresponds to performing the addition now, or whether it should continue reading, ie, delay the addition until it processes the '/'. This is called a Shift/Reduce conflict because the two actions possible are Shift and Reduce. The two parse trees under consideration are thus:

Choose Shift:  e + (e / e)
Choose Reduce: (e + e) / e

The normal arithmetic convention is to do the division first, so we want to choose Shift. There are several ways to make the grammar do that. The easiest way in lrparsing is to prioritise the two choices, saying e '/' e should always be done before e '+' e:

class E(Grammar):
    e = Prio(THIS + '/' + THIS, THIS + '+' + THIS) | Token(re="[0-9]+")
    START = e

But doing that only replaces the previous error E.compile_grammar() raised with a new one:

lrparsing.GrammarError: Conflict: Shift/Reduce and no associativity
While trying to recognise:
  e.Choice.Prio.Prioritised0 = e ^ '/' e
  e.Choice.Prio.Prioritised0 = e '/' e ^
on seeing a '/' I could not decide between:
  replace the sequence [e '/' e] with e.Choice.Prio.Prioritised0
and
  accepting the '/' in the hope it will match:
    e.Choice.Prio.Prioritised0 = e '/' ^ e
Please remove this ambiguity from your grammar

Firstly, notice how the choices listed don’t directly correspond to the rules in your Grammar. That is because lrparsing has compiled your Grammar into productions, which is the only thing an LR(1) parser understands. You can print this compiled form using print(E.repr_productions()) which yields:

0   : <E> = START __end_of_input__
1   : START = e
2   : e = e.Choice.Prio
      e = /[0-9]+/
3   : e.Choice.Prio = e.Choice.Prio.Prioritised0
      e.Choice.Prio = e.Choice.Prio.Prioritised1
4   : e.Choice.Prio.Prioritised0 = e '/' e
5   : e.Choice.Prio.Prioritised1 = e '+' e

It is worth taking some time to understand how these productions correspond to the original set of Rules. In any case you can now see where all those odd looking symbol names come from.

Using the same technique as before we deduce the smallest string it could be looking at and end up with these possibilities: e '/' e ^ '/' e. In other words it can’t decide between these two parse trees:

e / (e / e)   OR   (e / e) / e

As an aside this is a real problem because integer division isn’t associative: (9 / 3) / 2 = 1, whereas 9 / (3 / 2) = 9. So if the parse tree was being used to evaluate the expressions, choosing the wrong one would yield an incorrect answer.

We can’t use Prio to solve this because the production is conflicting with itself. The solution is to say division is left associative:

class E(Grammar):
    e = Prio(THIS << '/' << THIS, THIS << '+' << THIS) | Token(re="[0-9]+")
    START = e

And with that the parser will compile, and will yield parse trees that conform to the conventions of arithmetic.

Looking deeper under the hood, an LR(1) parser is called that because it delays all decisions until it has recognised an entire production. If you look at the above error messages, you will see they all involve a reduce, and that is because a reduce is what the parser does when it decides which production it has seen. There are two ways it can get confused.

The first involves two reduces. Since the parser is making its decision after it has seen the entire production, the only way it can be confused is if two productions have the same right hand side. If you look at the Reduce/Reduce conflicts above that is indeed the case. But even in that case the LR(1) parser can still distinguish between the two if they are followed by different tokens. Consider this grammar:

class F(Grammar):
    a = Token('a')
    b = Token('a')
    START = a + 'x' + 'y' | b + 'x' + 'z'

F.compile_grammar() raises this exception:

lrparsing.GrammarError: Conflict: Reduce/Reduce
While trying to recognise state 4:
  a = 'a' ^
  b = 'a' ^
on seeing a 'x' I could not decide between:
  replace the sequence ['a'] with a
and
  replace the sequence ['a'] with b
Please remove this ambiguity from your grammar

Unlike the previous example this Grammar isn’t inherently ambiguous as there is one unique parse tree for every input. The problem is an LR(1) parser isn’t powerful enough to recognise it. The symbol it needs to see in order to know whether it should be an a or a b (the 'y' or 'z') is two tokens away and LR(1) only looks one token ahead. An LR(2) parser would be powerful enough but we don’t have one of those.

In this case it is easy enough to re-write the Grammar so an LR(1) parser is powerful enough. The solution is to delay recognising a and b until the parser can see the y or z:

class F(Grammar):
    a = Token('a') + 'x'
    b = Token('a') + 'x'
    START = a + 'y' | b + 'z'

The problem with re-writing like this is it changes the parse tree. If the back end that processes the parse tree can’t tolerate this, something is going to have to give, and it won’t be the LR(1) parser. Post processing the parse tree so it is in the form the back end wants is the normal solution. Console yourself with the knowledge that people using LL(1) grammars have to do it far more.

Avoid the temptation to use Prio to resolve Reduce/Reduce conflicts. It will make the conflict go away, but it means the parser will never recognise part of your Grammar. In the above example yielding to temptation and using Prio would result in:

class F(Grammar):
    a = Token('a')
    b = Token('a')
    START = Prio(a + 'x' + 'y', b + 'x' + 'z')

And presto F.compile_grammar() works. But now F.parse("axz") yields:

lrparsing.ParseError: line 1 column 3: Got 'z' when expecting 'y'

Since "axz" is a valid input for our Grammar this is a bug. It was expecting a 'y' because our Grammar says if it can’t decide choose a + 'x' + 'y' over b + 'x' + 'z'. As we know it could not decide, so will always choose a + 'x' + 'y'. This means it can never recognise b + 'x' + 'z'.

Since at least one reduce must be involved, the only other possibility is a Shift/Reduce conflict. These always take the form:

SYMBOL = SYMBOL <anything1> SYMBOL ^
SYMBOL = SYMBOL ^ <anything2> SYMBOL

You can pick the Reduce because it has the ^ at the end. In this case the Grammar is always ambiguous, which is good because you can fix it by using associativity or priorities. The trick in figuring out which to choose is to understand what the parser is trying to decide between, and you can do that by replacing a Symbol in the right hand side of the Shift with the entire right hand side of the Reduce. The name of the Symbol replaced must be the same as the left hand side of the Reduce, and the ^ in the two productions must coincide, like this:

SYMBOL = SYMBOL <anything1> SYMBOL ^ <anything2> SYMBOL

The parse trees become either:

Shift:  SYMBOL <anything1> (SYMBOL <anything2> SYMBOL)
Reduce: (SYMBOL <anything1> SYMBOL) <anything2> SYMBOL

If <anything1> and <anything2> are different (like '*' and '+' above) you can use Prio to choose the one that should happen first. If they have the same priority (and in particular this will be the case if <anything1> and <anything2> are the same thing), then you must use Left if the Shift is the correct choice or Right if the Reduce is the correct choice. If it should never happen use Nonassoc.
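
For example, comparison operators are often declared non-associative. In this hedged sketch (names ours) a == b == c is rejected outright:

import lrparsing
from lrparsing import Nonassoc, Prio, THIS, Token

class Cmp(lrparsing.Grammar):
    atom = Token(re="[a-z]+")
    expr = Prio(atom, Nonassoc(THIS + '==' + THIS))
    START = expr

Cmp.parse("a == b")       # fine
# Cmp.parse("a == b == c") raises a ParseError: '==' isn't associative.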

Examples

The source of lrparsing contains the following examples, which can also be viewed online at http://lrparsing.sourceforge.net/examples:

lrparsing-sqlite3.py

A Grammar for all of the Data Manipulation Statements in Sqlite3. lrparsing was written because the author could not find a Python parsing library that ran at an acceptable speed and required fewer lines of code to do this job than a hand written LL(1) parser.

lrparsing-expr.py

An integer expression evaluator. It illustrates one design pattern for building a compiler on top of lrparsing: using a Symbol's inbuilt mapping capabilities. If the job is simple this works well.

lua52.py

A Grammar for Lua 5.2 and a cross compiler to Python. It illustrates a second design pattern for building a compiler: using a compiler class whose functions have the same names as rules and tokens in the Grammar. These functions are then called to compile each node as it is output by parse(). This is a good way to do it for more complex tasks. It cleanly separates the compiler from the parser, while keeping the relationship between the output parse tree and the generated code clear.

Acknowledgements

lrparsing was developed with inspiration from several sources. The two main ones were parsing.py (http://www.canonware.com/Parsing/) which from a technical perspective is a powerful and clean LR parser implementing several things lrparsing doesn’t, including a GLR(1) parser. The other source of inspiration is pyparsing.py (http://pyparsing.wikispaces.com/) which proves that providing a pythonic interface to parsing is a worthwhile thing to do.

Finally, thank you to Greg Black for proof reading this document.

Lrparsing Glossary

If you use a parser generator to its full extent and particularly if you are doing error recovery, you will end up knowing enough about parser generation to write your own. It will be a long journey, made longer by parser generation being an old art that has built up a truly impressive amount of jargon. The glossary is here to help you get past the jargon.

Unfortunately the glossary is large and the concepts complex, so it ends up being another wall of text. The narrative below is less rigorous but hopefully doesn’t suffer from the wall of text problem. May it speed your journey to parser generation zen.

Glossary Narrative

Here is a simple grammar:

import lrparsing
class MyGrammar(lrparsing.Grammar):
    xy_non_term = lrparsing.Token('x') + lrparsing.Token('y')
    y_non_term = lrparsing.Token('y')
    the_world_non_term = lrparsing.Repeat(xy_non_term | y_non_term, min=1)
    START = the_world_non_term

This grammar is composed of rules, which are things Python can parse. lrparsing translates the rules into an alternate representation of the same grammar using productions. Productions are what an LR(1) parser generator understands (this is the output produced by MyGrammar.repr_productions()):

0     : <MyGrammar> = START __end_of_input__
1     : START = the_world_non_term
2     : the_world_non_term = the_world_non_term.Repeat
3     : the_world_non_term.Repeat = xy_non_term
        the_world_non_term.Repeat = y_non_term
        the_world_non_term.Repeat = the_world_non_term.Repeat xy_non_term
        the_world_non_term.Repeat = the_world_non_term.Repeat y_non_term
4     : xy_non_term = 'x' 'y'
5     : y_non_term = 'y'

Productions have an internal structure and each bit has its own name; sometimes several. Apart from the = they are made up of symbols. The symbol on the left hand side is called a non-terminal. A non-terminal is never read from the input stream directly. It is just a place holder created by the grammar author to represent whatever is on the right hand side of the production. This contrasts with the symbols read from the input stream. They are called tokens. lrparsing embeds the tokens in an input tuple, mostly because additional information is needed for error reporting - things like where the token was found (eg, line and column).

The parser generator now creates a parse table from the productions. To do that it goes through an intermediate step. It creates a third representation of the grammar, a directed graph. Each node in the graph is called an item set. The first item set is:

{
   <MyGrammar> = ^ START __end_of_input__
}

As the {} are meant to imply, an item set really is a set, a set of productions. However there is an extra bit of information associated with each production, represented as a ^ here. This is called the dot and it shows how far the parser has progressed in recognising the production. The combination of the production and the dot is called an item, and hence the term item set arises because it is a set of items.

The production used in this initial item set is the root of the parse tree. It includes the special MetaToken __end_of_input__, which signals the entire input has been read. The non-terminal on its left hand side (<MyGrammar> here) has a special name. It is called the start symbol. So the item set above is a succinct representation of the parser’s state before it has read anything. However it’s not finished yet.

In order to finish it off an operation called closure must be performed on it. The closure is intuitively easy to construct once you recognise that if the parser is about to see a START symbol, then the START symbol’s production(s) belong in the closure as well:

{
    <MyGrammar> = ^ START __end_of_input__

    START = ^ the_world_non_term
}

But the closure is still not finished as now the_world_non_term is also a non-terminal preceded by the dot. Adding it adds yet more unclosed non-terminals. This goes on for a bit. Once complete the closure will be:

{
    <MyGrammar> = ^ START __end_of_input__

    START = ^ the_world_non_term
    the_world_non_term = ^ the_world_non_term.Repeat
    the_world_non_term.Repeat = ^ xy_non_term
    the_world_non_term.Repeat = ^ y_non_term
    the_world_non_term.Repeat = ^ the_world_non_term.Repeat xy_non_term
    the_world_non_term.Repeat = ^ the_world_non_term.Repeat y_non_term
    xy_non_term = ^ 'x' 'y'
    y_non_term = ^ 'y'
}

The item set is now complete. There are two tokens that follow the dots in it: ‘x’ and ‘y’. This means these two tokens, and only these two tokens, could legitimately appear in the input stream at this point. Anything else will cause the parser to raise a ParseError.

Let’s say ‘x’ was the next thing to be read from the input stream. Reading it creates a new item which can be used to create a core we haven’t seen before:

{
    xy_non_term = 'x' ^ 'y'
}

Since there are no non-terminals following the dot the closure adds nothing, so this core is also a completed item set. ‘x’ is not the only token that the epoch item set could first read from the input stream. It might also be a ‘y’. This yields another new core, and doing a closure on that core will create a new item set for it.

Recall the parser generator is creating a graph, and these item sets are the nodes in the graph. Since reading a token moves to these new item sets, the net effect of reading any symbol is to create a directed line connecting the two nodes in the graph. The act of moving along that line to a new node in the graph is called an action. This particular variant of an action is called a shift action. All possible shift actions a node / item set can make are held in its shift table. For the item set representing the start symbol, the shift table will be:

{
    'x':     item_set({ xy_non_term = 'x' ^ 'y', }),
    'y':     item_set({ y_non_term = 'y' ^, }),
}

When the xy_non_term reads the ‘x’ followed by ‘y’ the entire right hand side of the production has been seen. Each shift has pushed the token it accepted onto a data structure called the stack. The ‘x’ and ‘y’ on the stack are popped from the stack and the non-terminal xy_non_term is pushed onto the stack, effectively replacing them. These are the first steps of an action called reduce. Effectively, the non-terminal xy_non_term has just been read from the input stream, and based on that the parser must move to a new state (node in the graph, aka item set). It does it in the same way as a shift action: it looks up the non-terminal just “read” in a table called the reduce table. This table can be displayed using MyGrammar.repr_parse_table(). The reduce table of the start symbol’s item set is:

{
    START:
        item_set({<MyGrammar> = START ^ __end_of_input__}),
    the_world_non_term:
        item_set({START = the_world_non_term ^}),
    the_world_non_term.Repeat:
        item_set({
            the_world_non_term = the_world_non_term.Repeat ^
            the_world_non_term.Repeat = the_world_non_term.Repeat ^ xy_non_term
            the_world_non_term.Repeat = the_world_non_term.Repeat ^ y_non_term
        }),
    xy_non_term:
        item_set({the_world_non_term.Repeat = xy_non_term ^}),
    y_non_term:
        item_set({the_world_non_term.Repeat = y_non_term ^}),
}

“Reading” a non-terminal moves the dot just like reading a token does, and so creates a new item set. All these new item sets are processed, yielding new item sets in turn that are also processed - but only if the core hasn’t been seen before. When the process winds down the graph is complete.

To do its job the parser needs only the base structure of the graph, not the details used to build it. In fact each item set can be represented by an integer, which can conveniently be used as an index into an array whose elements contain the reduce table and shift table for the node the item set represents. This stripped down version of the item set graph is the parse table.
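
Continuing the sketch above, the whole generator can be viewed as a worklist loop over unseen cores. Again this is an illustration only: it reuses closure() and the productions from the earlier sketch, and numbers item sets in the order their cores are first seen:

# Rough sketch of the generator's worklist loop. The state numbers
# assigned here become the indexes into the parse table.

def build_graph(start_core, productions):
    states = {frozenset(start_core): 0}  # frozen core -> state number
    edges = {}                           # (state, symbol) -> state number
    work = [frozenset(start_core)]
    while work:
        this_core = work.pop()
        # Group the item set's items by the symbol following the dot,
        # moving the dot over that symbol to form each new core.
        next_cores = {}
        for lhs, rhs, dot in closure(this_core, productions):
            if dot < len(rhs):
                next_cores.setdefault(rhs[dot], set()).add((lhs, rhs, dot + 1))
        for symbol, new_core in next_cores.items():
            new_core = frozenset(new_core)
            if new_core not in states:   # only process unseen cores
                states[new_core] = len(states)
                work.append(new_core)
            edges[(states[this_core], symbol)] = states[new_core]
    return states, edges

states, edges = build_graph(core, productions)
# Edges labelled with a token form the shift tables; edges labelled
# with a non-terminal form the reduce tables.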

Glossary

Action

An action is the collective name for the two steps an LR(1) parser can perform: shift and reduce. Alternatively, if the parse table is thought of as a graph, an action is the name of a line connecting the nodes in the graph. The name for a node in the graph is a state.

Ambiguous

A parser of any sort has two jobs: to verify the input is valid according to the grammar, and to create a parse tree for it. The words create a parse tree could equally well be written assign meaning. Eg, 9 + 3 / 3 is a valid arithmetic expression, but without the normal rules of arithmetic it could have two different meanings: 9 + (3 / 3) = 10 or (9 + 3) / 3 = 4. In English, jokes are based on this same ambiguity, eg, Time flies like an arrow. It follows a parser would not be much use if it could not assign a unique meaning to every input sequence. Sadly it is possible (even easy) to write grammars that don’t assign a unique meaning to every input sequence. These grammars are rejected by the parser constructor as ambiguous. Even more sadly no practical parser, including LR(1), is powerful enough to recognise all non-ambiguous grammars, so they incorrectly say that some non-ambiguous grammars are ambiguous. The only workaround is to re-write the grammar in a form the parser can use, and post process the parse tree if the new version is too far from the truth.
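
For a concrete example, the following grammar does not say whether 1 + 2 + 3 means (1 + 2) + 3 or 1 + (2 + 3), so compiling it fails. This is a sketch assuming lrparsing’s documented GrammarError exception; the exact message will differ:

import lrparsing
from lrparsing import Grammar, THIS, Token

class Ambig(Grammar):
    # No Prio or associativity, so "1 + 2 + 3" has two parse trees.
    expr = Token(re="[0-9]+") | THIS + '+' + THIS
    START = expr

try:
    Ambig.compile_grammar()
except lrparsing.GrammarError as e:   # reports a Shift/Reduce conflict
    print(e)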

Closure

Closure is the name for both an operation performed by item sets during their construction and the output of that operation, which is a set of items. An item set is a set of items, ie, partially recognised productions. Initially an item set contains just its core. If the next expected symbol of any item in the item set is a non-terminal, ie the left hand side of another production, then it follows the parser is also recognising this new item, which is that production with its dot at the start. This new item becomes part of the closure. That new item could also have a non-terminal at its start, thus yielding another new item recursively. All items added in this way become the item set’s closure.

Core

The core is the name of the initial subset of items in an item set. The core is constructed from the item sets in the state graph that have lines leading to this new item set. It is the items in those ancestor item sets after the recognition of the symbol labelling the line connecting them. The dot of those items has been moved along one position to reflect the recognition of that symbol.

Conflict
Reduce/Reduce
Shift/Reduce

The job of the parser generator is to create a graph. The nodes of the graph become the parser’s states. The lines that join the nodes of the graph are termed actions. They are labelled with the symbol the parser just recognised. If the parser generator can’t uniquely decide what the next state should be when a symbol is seen (ie several lines would be labelled with the same symbol), the grammar is said to have a conflict. This means the grammar is ambiguous, because having several possible states reachable for the same symbol means there are several possible parse trees. For an LR parser one of the conflicting actions will always be a reduce, because an LR parser only makes a decision once it has seen an entire production. Since there are only two types of action, shift and reduce, a conflict is therefore either between a shift and a reduce, termed a Shift/Reduce conflict, or between two reduces, termed a Reduce/Reduce conflict.

Dot
^

Marks a position in the right hand side of a production. In days gone by when parse tables were constructed manually on paper, a human would mark a dot in a production to keep track of where he was up to. In a computer the position of a dot is represented by the number of symbols on the right hand side seen so far.
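
For example, in the Python sketches earlier in this section the item START = the_world_non_term ^ would be represented as the tuple ("START", ("the_world_non_term",), 1): the dot is just the integer 1.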

Grammar

Rules that explain how a series of tokens should be interpreted. For example, in English tokens are words and punctuation. English grammar shows how to build up words and punctuation into phrases of various kinds (eg, noun phrases, adverb phrases), then the phrases into sentences, and the sentences into constructs like quotes, paragraphs and chapters. The result is often represented as a parse tree.

Input Tuple

The parser recognises tokens. However, it is often helpful to have other information associated with the token that the parser doesn’t use, such as the position in the input stream it was found at for error reporting. lrparsing calls the combination of a Token plus that other information an input tuple. lrparsing uses a Python tuple with the Token as its first element for this purpose.

Item

A partially recognised production. It is a production, plus a marker (the dot) which shows how much the parser has seen so far, plus the tokens that could legally follow the production. An LR(0) parser doesn’t store any such tokens, an LR(1) parser stores all valid tokens that could come next, and an LR(n) parser stores all valid sequences of n tokens that could possibly follow it. Since an LR(1) parser only makes decisions after seeing the entire right hand side of a production, it only considers the symbols that could follow a production when it does a reduce.

Item Set

This data structure is the heart of an LR(1) parser generator. As the name implies an item set is a set of items. Internally this set is broken up into two distinct subsets, the core and the closure. The generator initialises itself by creating the core of the initial item set from the start symbol. This core consists of the sole production of the start symbol with the dot before the first symbol. The next step is to compute the closure of this core. In this case if the first symbol of the right hand side is a non-terminal, then it follows the parser is also at the start of all the non-terminal’s productions, so they are added to the closure. And so on, recursively, if they have non-terminals as their first symbol. So now we have a set of items (the core and its closure), each waiting for their first symbol to appear. If the parser receives one such symbol it can move forward, and what it moves to is a new item set whose core is all the current items waiting on that symbol, but with their dot position moved forward one step because a new symbol has been seen. This process of computing closures and generating new item sets continues until only duplicate item sets are created. The item sets then become the states of the parse table.

Lhs
Left Hand Side

The symbol on the left hand side of a production. Such symbols are called non-terminals. In an lrparsing Grammar the symbol on the left hand side of a rule is called a Rule.

LR(1)

LR(1) is an abbreviation for Left Right (parser with) 1 (symbol lookahead). Left here simply means it treats the string of characters like English - it starts reading at the left. The Right here means it delays making a decision on whether it has recognised a production until it has read the last symbol (right most) of its right hand side. The 1 symbol lookahead means it considers the next input symbol when it makes this decision. (It only needs to do this if two different non-terminals have the same right hand side. An LR(1) parser will look at the next token to be read to distinguish between them. If the same token can legitimately follow both, doing that doesn’t help, so the parser generator will say it has a Reduce/Reduce conflict.) An LR(0) parser doesn’t consider the next input token. If someone were to write an LR(2) parser it would consider the next 2 tokens. The other common type of parser is an LL(1) parser, also called a recursive descent parser. The only difference is the R becomes an L (for Left), meaning the parser must make the decisions on what it is doing using just the left hand side of the production it is currently trying to recognise and the next input symbol. In theory, because it has significantly less information to work with (it hasn’t seen any of the right hand side yet), an LL(1) parser is less powerful than an LR(1) parser. In practice recursive descent parsers are often written by hand and the programmer cheats by looking ahead or re-writing the grammar. This cheating goes a long way to making up the difference in power. However, automatically generated LL(1) parsers don’t cheat and so are significantly less powerful (ie, LL(1) handles a smaller set of otherwise valid grammars) than automatically generated LR(1) parsers such as lrparsing.

Parser State
State

The state, which is typically identified by a number, is how the LR(1) parser knows what it is up to. A state is called an item set by the parser generator. The name change reflects a change in the way it is used, not in what it is. Conceptually an LR(1) parser is a Deterministic Finite Automaton (DFA), which is a fancy name for a way of moving through a graph. The states are the nodes in this graph. The LR(1) parser progresses by moving from one node to another, but like all graphs this is only possible if the nodes are directly connected by a line. In the parser’s graph these lines are actions. Every valid token that can be accepted as input while the parser is in that state has an associated action. In fact the only time the parser raises an error is when it knows the input stream can’t match the grammar, and the only time it knows that is when the token it just read doesn’t have an action associated with it. A shift action consumes the next token (pushes it onto the stack) and then moves onto the next state by following the line labelled with that token. A reduce action recognises a production, and so it pops the stack to see what parent state the LR(1) parser is trying to recognise. That parent state will have a line in the graph labelled with the non-terminal on the left hand side of the production just recognised, which will take it to the next state. People familiar with the purest form of regular expressions will know that they can also be recognised with DFAs. An LR(1) parser is a souped up version of a regular expression DFA that recognises the right hand side of its productions.

Parse Table

A parse table is the data structure that drives a parser. The parse table is an array of states, the states being identified by their array index. The parse table can be viewed as the description of the nodes and lines of a graph. The nodes of the graph are the states. The lines show the state the parser moves to when it recognises a symbol, and thus are labelled with that symbol. The lines correspond to the state actions.

Parse Tree

The way a parser explains the meaning of its input is to output a parse tree. The parse tree assigns the meaning to the symbols and the relationships between them. The sentence Time flies like an arrow can be parsed two ways. In one reading “Time” is an adjective describing a type of insect, and “like” is a verb describing what “Time flies” do to arrows. An lrparsing parse tree for that meaning might look like:

(sentence
  (noun_phrase (adjective 'Time') (noun 'flies'))
  (verb_phrase
    (verb 'like')
    (noun_phrase (determiner 'an') (noun 'arrow'))))

The alternative, where “Time” is a noun, looks like this:

(sentence
  (noun 'Time')
  (verb_phrase
    (verb 'flies')
    (preposition_phrase
      (preposition 'like')
      (noun_phrase (determiner 'an') (noun 'arrow')))))

Parser

The parser in an LR(1) system is a simple loop that processes a stream of input tuples according to a parse table, and produces a parse tree as a consequence. The same parser is used for all input streams and all parse tables. It is fast because it is simple. Its one internal data structure is the stack. The parse table is actually a graph, and the parser’s sole job is to move to a new node in the graph as it recognises each new symbol, by following the line labelled with that symbol. As it goes it outputs the productions it has recognised, and these become the parse tree.
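
That loop can be sketched with a toy parse table for the single production START = 'x' 'y'. The actions/gotos layout below is invented for this illustration and is not lrparsing’s own representation ('$' stands in for __end_of_input__):

# actions[state][token] is ('shift', next_state),
# ('reduce', lhs, rhs_len) or ('accept',); gotos[state][lhs] is the
# state reached after a reduce replaces a right hand side with lhs.

actions = {
    0: {'x': ('shift', 1)},
    1: {'y': ('shift', 3)},
    3: {'$': ('reduce', 'START', 2)},
    2: {'$': ('accept',)},
}
gotos = {0: {'START': 2}}

def parse(tokens):
    stack = [(None, 0)]                 # pairs of (symbol or subtree, state)
    tokens = iter(tokens)
    token = next(tokens, '$')
    while True:
        action = actions[stack[-1][1]].get(token)
        if action is None:              # no action: lrparsing's ParseError
            raise ValueError("unexpected %r" % token)
        if action[0] == 'shift':
            stack.append((token, action[1]))
            token = next(tokens, '$')
        elif action[0] == 'reduce':
            _, lhs, n = action
            children = tuple(sym for sym, _ in stack[-n:])
            del stack[-n:]              # pop the right hand side,
            state = gotos[stack[-1][1]][lhs]
            stack.append(((lhs,) + children, state))   # push the lhs
        else:                           # accept
            return stack[-1][0]

print(parse(['x', 'y']))                # ('START', 'x', 'y')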

Parser Generator

A parser generator takes a grammar as input and generates a parse table as output, unless the grammar is ambiguous, in which case it raises an error. It does this by constructing item sets.

Production

The LR(1) name for the rules describing the grammar an LR(1) parser generator accepts as input. lrparsing does not use productions, it uses rules. It compiles your rules into LR(1) productions for you. You can display those productions using repr_productions(). Productions are similar to rules in that they have a left hand side which replaces the symbols on the right hand side when they are recognised. The differences between a Rule and a production are that a production can only be a Sequence, and that alternative derivations for the same non-terminal are expressed by allowing a non-terminal to have many productions.

Non-Terminal

A symbol that appears on the left hand side of a production. In lrparsing Grammars the Rules are its non-terminals.

Reduce
Goto

A reduce is the parser’s aha! moment. It happens when it has decided it has recognised a complete production. This occurs when the symbols on the right hand side plus the next token in the input stream uniquely determine the non-terminal on the production’s left hand side. The reduce action now pops the right hand side’s symbols from the stack, and then pushes the non-terminal on the left hand side. In other words, it replaces the symbols on the stack that form the production’s right hand side with the non-terminal on its left hand side. It then looks up the newly arrived non-terminal in the reduce table to determine the next state. Goto is an older name for reduce still found in the literature.

Reduce Table
Goto Table

The reduce table is a mapping of a non-terminal to a state. When a production is recognised by a reduce action it is replaced by the non-terminal on its left hand side. This is done by popping the symbols on the right hand side from the stack, and then pushing the non-terminal. The parser must then move to a new state, and it does that by looking up the non-terminal in the state’s reduce table. Every state has a reduce table, but it will be empty if all the productions in the item set the state represents have right hand sides consisting purely of tokens. If the parse table is thought of as a graph, the reduce table describes the lines in the graph labelled with non-terminals, ie the lines followed when a non-terminal is recognised by a reduce. There is a corresponding mapping for tokens called the shift table, and between the two they describe all the lines in the graph. Goto Table is an older name for the reduce table which is still found in the literature.

Rhs
Right Hand Side

The sequence of symbols on the right hand side of a production. When these symbols are seen in the order they were given, they can be replaced on the stack (via an operation called a reduce) by the non-terminal on the left hand side.

Shift

When a token is recognised as valid it is removed from the input stream and pushed onto the stack. The parser then looks up the token in the shift table to determine the new state it should be in.

Shift Table

The shift table maps a token to another state. Each state has its own shift table that lists every token the grammar says could legitimately come next. If the next token to be read isn’t in that list a ParseError is raised, and in fact this is the only thing that triggers a ParseError. If the parse table is thought of as a graph, the shift table describes the lines in the graph followed when a particular token appears as the next item in the input stream. There must be only one such line, otherwise the grammar would be ambiguous. The act of following it is called a shift, and thus the shift table lists all possible shift actions that can happen when the parser is in a particular state. There is a similar table called the reduce table that describes the lines in the graph to be followed when a non-terminal is recognised. Between them, the shift tables and the reduce tables contain all the lines in the graph, and thus describe all possible transitions the parser can make between states.

Stack

The stack is a data structure the LR(1) parser uses internally to keep track of where it is up to. A shift pushes the next input Token onto the stack. A reduce happens when the symbols on the right hand side of a production match the symbols on the top of the stack. When they do, a reduce replaces those matching symbols with the non-terminal on the left hand side of the production. That symbol will always (unless it is the start symbol) be part of another production that the parser was in the process of recognising, and that production’s (probably partial) right hand side is what is now on the top of the stack. The stack continues downwards with symbols from partially recognised productions until, at its bottom, we find the partially recognised production of the start symbol. The symbols popped from the stack become nodes in the parse tree. Thus the stack is actually a partially constructed branch of the parse tree, with the bottom becoming the root node of the parse tree.

Start Symbol

The goal of the grammar is to recognise this symbol. Although an lrparsing Grammar requires a Symbol named Grammar.START and its goal is indeed to recognise that symbol, it isn’t the real start symbol. The real start symbol has a single production whose right hand side is the Grammar.START symbol followed by an internal MetaToken called __end_of_input__. This real start symbol is returned by epoch_symbol().

Symbol

In LR(1) parlance a symbol is the name for something that is recognised by the parser, ie, it is a collective name for tokens and non-terminals.

Terminal

What lrparsing calls a token.

Token

Typically what the parser is trying to recognise is a series of characters. The first step is to break that series of characters into tokens. In English a token is a word or a punctuation character. The parser then works on these tokens, not the individual characters.

Tokeniser

The thing that turns text into a stream of Tokens.

Footnotes

1

compile_grammar() is the one lrparsing operation that is not thread safe. parse() is thread safe, except that it will lazily call compile_grammar() for you if you haven’t done it yet. This means in threaded environments you must compile your grammar yourself and not rely on parse() to do it. Compile it on module load, or wrap it in a threading.Lock() if you insist on doing it lazily. Calling parse() from multiple threads without calling compile_grammar() or pre_compile_grammar() first will cause lrparsing to crash sporadically.
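
For example, with MyGrammar standing in for any Grammar subclass, either of the following is safe (a sketch; the lock-based variant is only needed if you insist on compiling lazily):

import threading

# Safest: compile once at module load, before any threads start.
MyGrammar.compile_grammar()

# Alternatively, serialise lazy compilation yourself.
_lock = threading.Lock()
_compiled = False

def safe_parse(text):
    global _compiled
    with _lock:
        if not _compiled:
            MyGrammar.compile_grammar()
            _compiled = True
    return MyGrammar.parse(text)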

2

Assigning a Symbol instance to a rule in a Grammar subclass does not work like normal Python assignment, where the left hand side and the right hand side are the same objects after the assignment. Instead the Grammar subclass creates a new Rule object to hold the Symbol object on the right hand side which results in some unusual side effects:

class MyGrammar(Grammar):
    rule1 = Token('a')          # Legal, but rule1 != Token('a')
    rule2 = rule1 + Token('b')  # Also legal
    rule3 = rule2               # Illegal, direct assignment
    rule3 = rule2 * 1           # Same result but is legal
    START = rule3               # Legal because this is START

3

Not outputting rules whose name starts with a leading underscore to the parse tree means these two grammars produce the same parse tree:

class Ex1(Grammar):
    _rule1 = Token('x')
    START = _rule1 + Token("y")

class Ex2(Grammar):
    START = Token('x') + Token("y")

4

An example of syntactic sugar for Symbol. This:

class MyGrammar(Grammar):
    rule = Choice(Token('x'), Prio(Repeat(Token('a'), 1, 1), Sequence(Opt(Token('y')), Token('z'))))

Can be rewritten as this:

class MyGrammar(Grammar):
    rule = Token('x') | (Token('a') * 1, Opt('y') + 'z')

5

An example of using a TokenRegistry:

class MyGrammar(Grammar):
    class MyTokens(TokenRegistry):
        my_token = UserToken()
        identifier = Token(re='[A-Za-z_][A-Za-z_0-9]*', case=False)
        #
        # Being a subclass of Symbol, TokenSymbols have dicts.
        # Since assignments to a TokenSymbol's .dict don't work in
        # the body of a Grammar subclass, it is best to put them in
        # the TokenRegistry's body.
        #
        my_token["a"] = "a_value"
    START = MyTokens.my_token | MyTokens.identifier
assert MyGrammar.MyTokens.my_token == 'MyTokens.my_token'

6

A corollary of a TokenRegistry being the only way to obtain a handle to the TokenSymbol used by the parser is that, if you use UserTokens, you must use a TokenRegistry; otherwise there is no way to pass the correct UserToken instance to the parser.

7

The inbuilt tokeniser passes input tuples through to the parser, so you can use it to handle the bulk of the work even if it can’t handle all of it. Write a generator that replaces the input strings the inbuilt tokeniser can’t handle with input tuples, and yields the remainder of the input untouched. Pass your generator to parse() and you are done.
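
A sketch of the idea, where G is a hypothetical Grammar whose TokenRegistry T defines a UserToken named blob, and «…» spans stand in for the input the inbuilt tokeniser can’t handle:

import re

def my_token_stream(text):
    # Split the text into plain chunks and «...» blobs. This splitter
    # is a stand-in invented for the example.
    for part in re.split(r'(«[^»]*»)', text):
        if part.startswith('«'):
            yield (G.T.blob, part)   # a pre-built input tuple
        elif part:
            yield part               # the inbuilt tokeniser handles this

parse_tree = G.parse(my_token_stream(text))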

8

If you want to intercept the input tuples emitted by the inbuilt tokeniser, perhaps to change or delete them, override TokenRegistry.tokeniser() like this:

class G(Grammar):
    class T(TokenRegistry):
        def tokeniser(self, *args):
            for input_tuple in super(G.T, self).tokeniser(*args):
                # do something with input_tuple, maybe
                yield input_tuple

9

Using tokens rather than dealing with strings directly is primarily an optimisation. Being more powerful than a regular expression, an LR(1) parser could easily do what the tokeniser does: recognise identifiers, numbers, keywords and so on in a character stream. But that power comes at a price. An LR(1) parser both consumes more memory and is slower than a regular expression. Since tokens are typically 4 or 5 characters long, we reduce the workload on the LR(1) parser by a factor of 4 or 5 by getting a lightweight regular expression to break the input stream up into tokens, and then having the LR(1) parser operate on them.

10

The statement that the stack passed to on_error() and ParseError is a Lr1State is a simplification. It will be an Lr1State if pre_compile_grammar() is used, otherwise it will be an item set. However, Lr1State contains a strict subset of the information in an item set so the description applies to both.