lrparsing¶

Abstract

lrparsing is an LR(1) parser hiding behind a pythonic interface.

Author
    Russell Stuart <russell-lrparsing@stuart.id.au>
Introduction¶
Lrparsing provides both an LR(1) parser and a tokeniser. It differs from other Python LR(1) parsers in using Python expressions as grammars, and offers simple to use disambiguation tools.
The result is something that is powerful yet concise. For simple
tasks this means it can be thought of as an extension to Python’s
existing re
module, used when regular expressions become
too cumbersome. For complex tasks lrparsing offers a high speed
parser (very roughly 25us per token from string to parse tree
on a desktop CPU), pre-compilation of the grammar and error recovery
hooks so parsing can continue after an error is found.
In addition to extensive documentation it comes with a parser for Sqlite3 data manipulation statements and a Lua 5.2 to Python compiler as examples. The documentation can be read online at http://lrparsing.sourceforge.net/doc/html/.
First Steps¶
Below this section is a wall of text that will overwhelm all but the
most determined. This example is designed to help you navigate it.
Invest some time in reading it, enough to gain an intuitive feel for
how lrparsing
works. If your task is small it might be enough to
get you through.
The grammar is specified by a class derived from
Grammar
. Attributes of that class define the
grammar:
import lrparsing
from lrparsing import Keyword, List, Prio, Ref, THIS, Token, Tokens

class ExprParser(lrparsing.Grammar):
    #
    # Put Tokens we don't want to re-type in a TokenRegistry.
    #
    class T(lrparsing.TokenRegistry):
        integer = Token(re="[0-9]+")
        integer["key"] = "I'm a mapping!"
        ident = Token(re="[A-Za-z_][A-Za-z_0-9]*")
    #
    # Grammar rules.
    #
    expr = Ref("expr")                # Forward reference
    call = T.ident + '(' + List(expr, ',') + ')'
    atom = T.ident | T.integer | Token('(') + expr + ')' | call
    expr = Prio(                      # If ambiguous choose atom 1st, ...
        atom,
        Tokens("+ - ~") >> THIS,      # >> means right associative
        THIS << Tokens("* / // %") << THIS,
        THIS << Tokens("+ -") << THIS,  # THIS means "expr" here
        THIS << (Tokens("== !=") | Keyword("is")) << THIS)
    expr["a"] = "I am a mapping too!"
    START = expr                      # Where the grammar must start
    COMMENTS = (                      # Allow C and Python comments
        Token(re="#(?:[^\r\n]*(?:\r\n?|\n\r?))") |
        Token(re="/[*](?:[^*]|[*][^/])*[*]/"))

parse_tree = ExprParser.parse("1 + /* a */ b + 3 * 4 is c(1, a)")
print(ExprParser.repr_parse_tree(parse_tree))
Will print:
(START (expr                    # Here START is ExprParser.START
  (expr                         # And expr is ExprParser.expr too.
    (expr
      (expr (atom '1')) '+'     # '+' is repr(input_matched_by_token)
      (expr (atom 'b'))) '+'    # 'b' is actually ExprParser.T.ident
    (expr
      (expr (atom '3')) '*'     # '3' is actually ExprParser.T.integer
      (expr (atom '4')))) 'is'
  (expr (atom (call 'c' '('
    (expr (atom '1')) ','       # ',' is an anonymous Token
    (expr (atom 'a')) ')')))))
As the output suggests, the parse tree returned by parse() is a tree composed of nested Python tuples. The quoted strings above are actually tuples of the form:
(Token(), 'matched_input', offset_in_stream, line_number, column_number)
So the tuple representing '3' would be (ExprParser.T.integer, '3', 16, 1, 17).
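Given those two node formats, walking the tree only requires checking whether a node's second element is itself a tuple. A minimal sketch (it assumes every rule node has at least one child, which is true for the tree above):

def walk(node, depth=0):
    if isinstance(node[1], tuple):     # rule node: (Rule, child, child, ...)
        print("  " * depth + str(node[0]))
        for child in node[1:]:
            walk(child, depth + 1)
    else:                              # input tuple: (Token, 'text', offset, ...)
        print("  " * depth + repr(node[1]))

walk(parse_tree)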
Your next step will probably be writing a small grammar, and
you will become discouraged when lrparsing
complains it is
ambiguous. Read Ambiguities for hints on how to fix it.
If you go much beyond the example above you will need to read the reference material below, and you will find yourself getting lost in a thicket of jargon. The jargon is unavoidable as parsing involves a lot of concepts and each must have its own name. Traditionally a Glossary is the solution for jargon, but you may find the Glossary Narrative a gentler introduction to the world of parsing.
Module contents¶
The module contains many classes and a few functions. The classes are described below, lumped into the following categories:
- Symbol classes describe rules in the grammar.
- Token classes break up the input into input tuples, which is what the parser wants to be fed.
- A TokenRegistry manages tokens, the tokeniser, and the inbuilt tokeniser.
In addition, there are a few utility routines. They all duplicate
methods of the Grammar
class. They are provided in case a
Grammar
subclass redefines them:
- lrparsing.compile_grammar(grammar)¶
  Compile grammar, which must be a subclass of Grammar. If the compile fails a GrammarError will be raised. This is typically used when debugging new Grammars. A Grammar must be compiled before it can be used, but this is done implicitly for you by parse(), or can be done explicitly by pre_compile_grammar() if compiling is too slow to do at run time. repr_parse_table() displays a lot more information when the grammar is compiled by compile_grammar(). Compiling a Grammar is not thread safe.
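For example, compiling explicitly while debugging the ExprParser grammar from First Steps (a sketch; the printed table is large):

try:
    ExprParser.compile_grammar()
except lrparsing.GrammarError as error:
    print(error)                          # eg, an ambiguity to fix
else:
    print(ExprParser.repr_parse_table())  # full detail once compiled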
- lrparsing.epoch_symbol(grammar)¶
  Returns the start symbol, which is the Symbol used to initialise the Grammar subclass grammar. It is a non-terminal with a single production: <MyGrammar> = START __end_of_input__. Useful during Error Recovery.
- lrparsing.parse(grammar, input, tree_factory=None, on_error=None, log=None)¶
  Parse the input using the subclass of Grammar passed, returning a parse tree if the grammar matches the input or raising a ParseError otherwise, calling compile_grammar() first if the grammar hasn't been compiled. If the function on_error isn't None it will be called when the parser input doesn't match the grammar. It is documented in Error Recovery. If the function log is passed it will be called repeatedly with strings showing the parser's stack.
  The input is passed to a tokeniser which converts it to the stream of input tuples expected by the parser. This tokeniser is created by the TokenRegistry declared in the Grammar subclass. By default the inbuilt tokeniser is used, and it accepts either a string or a generator that yields strings and input tuples. Input tuples are normal Python tuples that have a TokenSymbol instance as their first element. The parser ignores the remaining elements in the tuple.
  The default format of the returned parse tree is a tree of nested tuples. These tuples will be one of the input tuples given to the parser, or will have one of the Rule instances defined as a rule in the Grammar followed by the tuples representing the Symbols that made up the right hand side of the production matched. (Use repr_productions() to see these productions.) The format of the returned parse tree can be altered by passing a function as tree_factory. Whatever it returns is substituted into the parse tree instead of the tuple. It must accept one argument: the tuple that would have been inserted.
  The parse operation is thread safe but the lazy compiling of the Grammar is not. 1
- lrparsing.pre_compile_grammar(grammar, pre_compiled=None)¶
  compile_grammar() can take a long while for reasonably sized grammars (seconds on a 32 bit out-of-order desktop CPU). This function allows you to skip that time in production environments by giving the parser a pre-compiled parse table.
  The pre-compiled parse table is actually a (very big) tuple made up entirely of Python constants. This function returns the repr() of that tuple (ie, a string). Use the returned value by passing it to this function(!) as the pre_compiled argument. If you pass the returned string directly it will be around 10 times faster than calling it without the pre_compiled parameter. If you pass it the raw tuple it will be about 100 times faster. The reason for the 10 fold difference is the passed string must be parsed by eval(), but if you embed the raw tuple into your code it will be tokenised by Python and put in the .pyc file.
  If this function is passed a pre-compiled parse table that matches the current grammar it returns None quickly, otherwise it compiles the grammar into a parse table and returns it as a string, which is orders of magnitude slower. This means apart from speed the Grammar always behaves the same way (ie, as specified by the source defining its class) regardless of whether this function is called; even if called with a pre-compiled parse table that doesn't match the Grammar.
  The returned parse table is optimised, meaning information useful for debugging the grammar has been stripped out of it. This operation is not thread safe.
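A sketch of the intended workflow, reusing ExprParser from First Steps (the file name is illustrative): generate the table once, embed it in a module, then feed it back on later runs.

table = ExprParser.pre_compile_grammar()   # slow: compiles and returns a string
with open("expr_table.py", "w") as handle:
    handle.write("TABLE = %s\n" % table)   # the module embeds the raw tuple

# On later runs, assuming expr_table.py is on the import path:
#   from expr_table import TABLE
#   ExprParser.pre_compile_grammar(TABLE)  # the fast path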
- lrparsing.repr_grammar(grammar)¶
  Return a printable string with the Rules as seen by the Grammar subclass grammar. Useful for debugging grammars.
- lrparsing.repr_parse_table(grammar, state=None)¶
  Return a printable string describing the parse table of the passed Grammar subclass grammar. This only returns something useful after compile_grammar() has been called. The return value may even be useful if compile_grammar() raised an exception. The result returned for a grammar compiled by pre_compile_grammar() is barely useful for debugging. If state (an integer) is passed only that state instead of the entire table is returned. If state is negative that state and all states that refer to it in their action or reduce tables are returned.
- lrparsing.repr_parse_tree(parse_tree, indent=' ')¶
  Return a printable string representing the passed parse_tree (as returned by parse()) in a human readable format. If indent is False it will be on one line, otherwise it will be split across lines and indented using the string passed as indent to reveal the tree structure.
- lrparsing.repr_productions(grammar)¶
  Return a printable string describing all the productions in the passed Grammar subclass grammar. This only works if compile_grammar() has been called. It may work even if compile_grammar() has raised an exception. Useful for debugging grammars.
- lrparsing.unused_rules(grammar)¶
  Returns a frozenset of rules that aren't reachable from the Grammar.START rule. If this isn't empty the grammar probably contains an error.
The Grammar Class¶
A grammar’s rules are contained in a class derived
from lrparsing's
Grammar
class.
Members of the Grammar
subclass that are assigned instances
of the Symbol
class (described below) make up the rules of
the grammar. For the most part other members are simply
ignored by lrparsing
, meaning you can add your own functions
and variables to a Grammar
subclass with impunity.
- class lrparsing.Grammar¶
  The constructor does nothing by default. Lrparsing uses classes rather than instances to do its work, so it doesn't expect or care if a subclass is instantiated.

  - classmethod compile_grammar(optimise=True, cache=None)¶
    Identical to the module's compile_grammar(). If redefined the module's function continues to work.

  - classmethod epoch_symbol()¶
    Identical to the module's epoch_symbol(). If redefined the module's function continues to work.

  - classmethod parse(input, tree_factory=None, on_error=None, log=None)¶
    Identical to the module's parse(). If redefined the module's function continues to work.

  - classmethod pre_compile_grammar(pre_compiled=None)¶
    Identical to the module's pre_compile_grammar(). If redefined the module's function continues to work.

  - classmethod repr_grammar()¶
    Identical to the module's repr_grammar(). If redefined the module's function continues to work.

  - classmethod repr_parse_tree(parse_tree, indent=' ')¶
    Identical to the module's repr_parse_tree(). If redefined the module's function continues to work.

  - classmethod repr_parse_table(state=None)¶
    Identical to the module's repr_parse_table(). If redefined the module's function continues to work.

  - classmethod repr_productions()¶
    Identical to the module's repr_productions(). If redefined the module's function continues to work.

  - classmethod unused_rules()¶
    Identical to the module's unused_rules(). If redefined the module's function continues to work.

  - COMMENTS¶
    This member must be assigned a Keyword(), Token, UnrecognisedToken() or UserToken, or a Choice of those symbols. TokenSymbols assigned to COMMENTS are ignored by parse().

  - START¶
    START must be assigned a Rule instance. The goal of the grammar is to recognise the Rule defined by START.

  - WHITESPACE¶
    This is passed unaltered to the tokeniser. The inbuilt tokeniser insists that if this is present, it is a string whose characters can optionally separate Tokens.
Assigning a Symbol
to a member of a Grammar
subclass
creates a new Rule
instance containing the Symbol
on
the right hand side, and whose Rule.name
is the name of the
member it was assigned to. This makes it different from normal Python
assignment where the left hand side and the right hand side are the
same object after the assignment executes. In practice there are two
places this matters. Firstly, except for assigning to
Grammar.START
, you may not assign one Rule
directly
to another as it makes it impossible to determine the grammar hierarchy
2. Secondly, you cannot get a reference to
a TokenSymbol
by assigning it to a rule. Use a
TokenRegistry
to do that 5.
Rules
whose names
start with an
underscore (_
) are treated specially. They are suppressed in the
output parse tree 3.
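A small sketch of the underscore convention (the grammar itself is illustrative): the _sep rule should not appear as a node in the output tree.

import lrparsing
from lrparsing import List, Token

class Csv(lrparsing.Grammar):
    _sep = Token(',')          # leading '_': suppressed in the parse tree
    number = Token(re="[0-9]+")
    START = List(number, _sep, min=1)

print(Csv.repr_parse_tree(Csv.parse("1,2,3")))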
Symbol Classes¶
Instances of Symbol
subclasses are used to build rules in
a Grammar
. Symbol
instances can be created in the
normal way - by calling their class name (aka the constructor) with
the appropriate arguments. However lrparsing
provides some
syntactic sugar by hijacking the Python expression builder. The
Python operators that are translated to calling a Symbol
constructor are noted below. Another sweetener is all
Symbol
class constructors will, when they are expecting a
Symbol
as an argument, cast some standard Python types to
an appropriate Symbol
class instance. For example, a string will be cast to a literal Token. 4
The Symbol
subclass constructors that can be used to define
Grammars
are:
- class lrparsing.Choice(s0, s1, ...)¶
  Alternate syntax: s0 | s1 | ...
  The grammar must choose between one of the symbols s0, s1, ...
- class lrparsing.Left(s)¶
  Alternate syntax: s = s0 << s1 << s2 << ...
  If the symbol s is a Sequence(s0, s1, s2, ...), and there is a choice between the parse trees (((s0 s1) s2) ...) and (... (s0 (s1 s2))), use (((s0 s1) s2) ...). In other words s is left associative.
- class lrparsing.List(s, d, min=0, max=None, opt=False)¶
  The grammar must see a list of the symbol s separated by the symbol d, ie, s d s d s ... If opt is True the list may optionally end with d, otherwise it must end with s. If min is absent or 0 the list may be empty, otherwise s must appear a minimum of min times. If max is None any number of s are allowed, otherwise max is the maximum number of occurrences of s.
- class lrparsing.Nonassoc(s)¶
  If the symbol s is a Sequence(s0, s1, s2, ...), and there is a choice between the parse trees (((s0 s1) s2) ...) and (... (s0 (s1 s2))), raise a ParseError. In other words s isn't associative.
- class lrparsing.Prio(s0, s1, ...)¶
  Alternate syntax: (s0, s1, ...)
  The grammar must choose between one of the symbols s0, s1, ... If several choices are possible, choose s0 over s1 over ... A tuple assigned directly to a class member is ignored (as any non Symbol is), but will be converted to a Prio if passed to a Symbol constructor.
- class lrparsing.Ref(name)¶
  A forward reference to the Rule assigned to the class member name (a string). It will be replaced by that member definition, which must occur later.
- class lrparsing.Repeat(s, min=0, max=None)¶
  Alternative syntaxes:
    s * N
    s * (min,)
    s * (min,max)
    Opt(s), Opt * s, s * (0,1)
    Some(s), Some * s, s * (1,)
    Many(s), Many * s, Repeat * s, s * ()
  The grammar must see s repeated. If min is 0 or absent no repeats are required, meaning there may be no s at all; otherwise s must appear a minimum of min times. If max is None any number of s are allowed, otherwise max is the maximum number of repeats. These shortcuts are also provided:
    Opt(x) or Opt * x is equivalent to Repeat(x, min=0, max=1).
    Some(x) or Some * x is equivalent to Repeat(x, min=1).
    Many(x) or Many * x or Repeat * x is equivalent to Repeat(x).
    x * N is equivalent to Repeat(x, min=N, max=N).
    x * tuple(...) is equivalent to Repeat(x, *tuple(...)).
- class lrparsing.Right(s)¶
  Alternative syntax: s = s0 >> s1 >> s2 >> ...
  If the symbol s is a Sequence(s0, s1, s2, ...), and there is a choice between the parse trees (((s0 s1) s2) ...) and (... (s0 (s1 s2))), use (... (s0 (s1 s2))). In other words s is right associative.
- class lrparsing.Sequence(s0, s1, ...)¶
  Alternative syntax: s0 + s1 + ...
  The grammar must see symbol s0, followed by s1, followed by ...
- class lrparsing.THIS¶
  A reference to the Rule being assigned to. So identical to Ref('this') in this = Ref('this') + '+' + Ref('this').
- class lrparsing.Tokens(literals[, keywords])¶
  The grammar must choose one of the supplied literals and keywords recognised by the inbuilt tokeniser. Both literals and keywords are strings which are string.split() to yield multiple literals and keywords, each of which is recognised using the inbuilt Token class.
Python does not allow "rule =" for an empty rule. Use rule = THIS * 0 instead.
Although the classes above are used to build Grammar rules by appearing on the right hand side of a Python assignment statement, their instances are not directly visible. Instead an instance of the Rule class is created to hold them, and this is what is assigned to your Grammar attribute. Since Rules are a subclass of Symbol they can be referenced by other grammar rules.

All symbols the parser can recognise are derived from this abstract base class, and thus share these attributes:
- class lrparsing.Symbol¶
  If you override Symbol the constructor must be called as it does initialisation.

  - dict¶
    A dict you can use for your own purposes, for all rules bar Grammar.START.

  - __contains__(key)¶
    Equivalent to key in symbol.dict.

  - __delitem__(key)¶
    Equivalent to del symbol.dict[key].

  - __getitem__(key)¶
    Equivalent to symbol.dict[key].

  - __iter__()¶
    Equivalent to iter(symbol.dict).

  - __len__()¶
    Equivalent to len(symbol.dict).

  - __setitem__(key, value)¶
    Equivalent to symbol.dict[key] = value.
Tokens¶
Tokens are a special kind of Symbol 9. As a convenience the same token may have several instances in a Grammar. Instances of the same token are merged during the compile, resulting in all bar one being discarded. Two tokens are considered to be the same if they have the same name attribute. The inbuilt tokens generate a suitable name for you, but you can assign a different name to any token using a TokenRegistry. Names containing __ are used internally, and so are reserved.
The following methods and classes are used directly in
grammars
:
lrparsing.Keyword(word, case=True)¶
  Create an instance of the Token class that recognises a keyword. If case is False the keyword isn't case sensitive, otherwise it is. The default name is repr(word). A keyword is a literal Token that is recognised by another re Token. For example, an identifier might be recognised by Token(re='[A-Za-z_]+'). The keyword 'then' would normally be recognised by the identifier Token, but if Keyword('then') is present it will match 'then', while all other non-keywords will still be matched by the identifier Token.
- class lrparsing.Token(literal=None, re=None, case=True)¶
  Alternative syntax: 'literal'
  Use the inbuilt tokeniser to generate a Token symbol when it sees the text specified by literal or re. The arguments become attributes with the same name. You can only supply literal or re, not both. If literal is supplied the tokeniser will match exactly that text, unless case is False in which case it ignores case when matching. If re is supplied it must be a regular expression the tokeniser will match. The re is compiled with re.MULTILINE and re.DOTALL; adding additional flags using re's (?FLAG) syntax affects the recognition of all Tokens. The default name for literal tokens is repr(literal). The default name for re tokens is the re surrounded by /'s. Strings assigned directly to class members are ignored as are all non Symbols, but strings passed to Symbol constructors are turned into a literal Token.
- class lrparsing.UserToken¶
  A user defined token. The UserToken must be assigned a name by a TokenRegistry. This class can be used as a base for user defined token classes.
lrparsing.UnrecognisedToken()¶
  Create an instance of the Token class that is given to the parser when the inbuilt tokeniser strikes a string it doesn't recognise as a token or whitespace. If this token doesn't appear in the Grammar the inbuilt tokeniser raises a TokenError instead.
If you delve deeply into lrparsing
you will come across these token
classes. They aren’t directly used by grammars
:
- class lrparsing.MetaToken(name)¶
  Used to construct special tokens used internally by the grammar compiler and parser. You will never see these in the output parse tree, but error recovery may see them. There are two: __empty__ represents a 0 width character string, and __end_of_input__ represents the end of the input stream.
- class lrparsing.TokenSymbol(name)¶
  The abstract base class all tokens are based on, thus isinstance(symbol, TokenSymbol) will return True if a Symbol is a token of some kind. The passed name is the generated name, which can be overridden by the TokenRegistry. TokenSymbols and thus all subclasses have these attributes and methods:

  - dict¶
    This is inherited from Symbol, as is the shorthand access provided by Symbol's mapping interface. TokenSymbol.dict is best set from within a TokenRegistry. Changes made to this attribute in the body of a Grammar subclass change the Symbol.dict of the Rule the TokenSymbol is assigned to, not the TokenSymbol's dict.

  - name¶
    A string. The name of the token. If two tokens with the same name are used in a Grammar they will be merged using merge() and only one will be kept.

  - named¶
    A bool. True if set_name() has been called.

  - merge(other)¶
    Called by the TokenRegistry when two tokens have the same name. other will be discarded after merge returns.

  - position(input_tuple)¶
    Given an input_tuple supplied to the parser, return a string describing the position in the input stream. The default implementation returns None, which means that if the input_tuple triggers an error the parser won't insert its position in the input stream into the error.

  - set_name(name)¶
    Called by the TokenRegistry to override the generated name. Sets named to True.
The TokenRegistry Class¶
The TokenRegistry is the home for tokens. If you don't define a subclass within the Grammar it will use an instance of TokenRegistry directly 5. Assigning a token to a member of a TokenRegistry class gives you a handle to the token instances recognised by the Grammar, and it is the only way to get such a handle 6. Assigning the token also sets the TokenSymbol.name of the token to the qualified attribute name, which is a string with the format 'NameOfTokenRegistrySubclass.member_name'.
- class lrparsing.TokenRegistry¶
  A single instance of the TokenRegistry, or of the subclass defined in the Grammar, is created when the Grammar is declared. The constructor looks up all tokens declared in its body and registers their TokenSymbol.name using _resolve_token_().
  - _resolve_token_(token_symbol)¶
    Called when the Grammar is compiled into LR(1) productions, as tokens are found. It returns the TokenSymbol that will be presented to the parser for TokenSymbol.name. That instance will always be the first TokenSymbol passed to it with that TokenSymbol.name. If the passed token is a duplicate, it will pass it to TokenSymbol.merge() before discarding it.
  - compile_tokens(whitespace)¶
    Called once, during Grammar compilation and before the first call to tokeniser(), so the tokeniser can initialise itself. The whitespace parameter is the Grammar.WHITESPACE attribute if one was declared, None otherwise. This method must raise an exception if whitespace isn't valid. What "whitespace" means is up to the tokeniser. The default definition creates an instance of the inbuilt tokeniser, and calls its compile_tokens method with the same arguments.
  - tokeniser(input, whitespace)¶
    This is a generator, called by parse() to create the input tuples that feed the LR(1) parser from the input parameter passed to parse(). The whitespace parameter is the same one that was passed to compile_tokens(). The input tuples accepted by the parser are tuples whose first element is a TokenSymbol instance. The remainder of the tuple is ignored by the parser, but the tuple is placed unmodified into the output parse tree. The default definition calls the tokeniser() method of the inbuilt tokeniser, passing it the same arguments 8.
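A sketch of a hand rolled tokeniser built on these two hooks (everything here is illustrative: it assumes a plain string is passed to parse(), splits it on whitespace, and keeps only rudimentary position information):

import lrparsing

class Words(lrparsing.Grammar):
    class T(lrparsing.TokenRegistry):
        word = lrparsing.UserToken()

        def compile_tokens(self, whitespace):
            pass                     # nothing to pre-compute

        def tokeniser(self, input, whitespace):
            offset = 0
            for text in input.split():
                yield (self.word, text, offset)   # an input tuple
                offset += len(text) + 1
    START = lrparsing.Repeat(T.word, min=1)

tree = Words.parse("one two three")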
Inbuilt Tokeniser¶
The inbuilt tokeniser generates Token
instances. Thus
if you use Keyword()
, Token
, Tokens
or
UnrecognisedToken()
in your grammar, you must use the inbuilt
tokeniser. It is used by default unless you add a
TokenRegistry
to your grammar and override its
TokenRegistry.compile_tokens()
and
TokenRegistry.tokeniser()
methods.
The input accepted by the inbuilt tokeniser is either a
string
or a generator yielding a mixture of strings and
input tuples 7. If the input
is broken up into multiple strings you should ensure the breaks occur
at token boundaries. Grammar.WHITESPACE
is a
string
whose characters can appear between
tokens
but are otherwise ignored. If not defined it
defaults to " \f\n\r\t\v"
.
Strings
are turned into input tuples
with the following format:
(Token(), matched_data, stream_offset, line_number, column_number)
where:
  Token() is the Token instance that matched the input.
  matched_data is the text from the input stream that the Token matched.
  stream_offset is the offset from the start of the input where the matched data was found. An integer.
  line_number is the line number the string was found on. The first line is numbered 1. An integer.
  column_number is the column number within the line where the string was found. The first column is numbered 1. An integer.
If a generator is passed and it yields objects other than strings they are passed onto the parser unchanged in the order they were received.
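For example, feeding ExprParser from First Steps with a generator, taking care that the chunk boundaries fall on token boundaries:

def chunks():
    yield "1 + "
    yield "2 * 3"

tree = ExprParser.parse(chunks())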
Exceptions¶
-
exception
lrparsing.
LrParsingError
¶ Base:
StandardError
.An abstract base class used by all exceptions defined here.
- exception lrparsing.GrammarError¶
  Base: LrParsingError.
  Raised when there is an error in the grammar. This can happen both when the grammar is first parsed, and when compile_grammar() is called.
- exception lrparsing.ParsingError¶
  Base: LrParsingError.
  An abstract base class for all errors that can be raised by this module when parse() is called.
- exception lrparsing.TokenError(message, data, offset, line, column)¶
  Base: ParsingError.
  Raised by the inbuilt tokeniser when it strikes something in the input that isn't entirely whitespace, isn't a recognisable Token, and the grammar doesn't use UnrecognisedToken. The arguments following message become attributes of the instance.
- exception lrparsing.ParseError(input_tuple, lr1_state)¶
  Base: ParsingError.
  Raised by parse() when the input the parser is given doesn't match the Grammar. The arguments become instance attributes with the same name.

  - input_tuple¶
    The input tuple given to the parser that triggered the error. If this input tuple was created by the inbuilt tokeniser it will contain position information, ie, line number, column number and so on.

  - lr1_state¶
    The current parser state, an Lr1State instance 10, from the parse table (ie, as returned by repr_parse_table()). If the parse table hasn't been pre compiled its repr() reveals a lot of information about what the parser was expecting at the time.
Error Recovery¶
The default action of parse()
is to raise a ParseError
exception which creates a suitable message, which of course stops
the parser at that point. If you want the parser
to continue so it can discover more errors in the same run you
need to do error recovery. Error recovery is always accomplished
by manipulating the parser’s input so the parser
thinks it has seen something valid, and then letting it continue.
It is implemented by the on_error
parameter to parse()
:
- lrparsing.on_error(iterator, input_tuple, stack)¶
  Handle a parser error.
  Parameters:
    input_tuple – The input tuple that triggered the error.
    iterator – The iterator the parser is reading input tuples from.
    stack – The parser's stack. A list.
  Return type:
    None, or an iterable yielding input tuples that will be inserted into the input stream.
The difficulty with any error recovery is determining how to turn the
mess you are given as input into something the parser will
accept. Invariably this means you try to figure out what the user was
trying to do. For example, let’s say you are parsing Python and
strike a problem in an expression. If the expression was part of a
for statement your best strategy is probably to replace the entire
for statement with for x in ():. But if that for was inside
a list comprehension inserting a for statement at that point would
likely generate a cascade of errors later on. So understanding where
you are is critical. Sadly this is one thing LR(1) makes
hard. That information is buried in the stack
parameter. The
stack
is a partially constructed branch, running from the root to
a leaf, of a parse tree. It is a list
of
tuples
, like this:
[
    (lr1_state, (symbol, ...)),
    (lr1_state, (symbol, ...)),
    ...
]
Consider the following two grammars which both recognise the same thing, a small subset of the C language:
>>> def on_error(iterator, token, stack):
        for item in stack:
            print("([%s], (%s))" % (
                ','.join(sorted(str(rule) for rule in item[0].rules)),
                ','.join(
                    lrparsing.repr_parse_tree(sym, indent="")
                    for sym in item[1])))
>>> class Lang1(Grammar):
        e = Prio(
            Token(re='[0-9]+') | Token(re='[a-z]+') + Opt(Token('(') + THIS + ')'),
            THIS << '*' << THIS,
            THIS << '+' << THIS)
        START = Repeat((
            Opt(e) |
            Keyword("if") + e + Keyword('then') + THIS +
                Opt(Keyword('else') + THIS) + Keyword("endif") |
            Keyword("while") + e + Keyword('do') + THIS + Keyword("done") |
            Token(re='[a-z]+') + '=' + e) + ';')
>>> Lang1.parse(
        "e=1; while funca(e) do if a then e=a; fi; done", on_error=on_error)
([<Lang1>], (__empty__))
([START], ('e','=',(e '1'),';'))
([START], ('while'))
([START,e], ((e 'funca' '(' (e 'e') ')'))),
([START], ('do'))
([START], ('if'))
([START,e], ((e 'a')))
([START], ('then'))
([START], ((START 'e' '=' (e 'a') ';' (e 'fi') ';')))
lrparsing.ParseError: line 1 column 43: Got 'done' when expecting 'else' or 'endif' while trying to match START in state 44
>>> class Lang2(Grammar):
        class T(TokenRegistry):
            number = Token(re='[0-9]+')
            ident = Token(re='[a-z]+')
            kwd_endif = Keyword("endif")
        exp = Ref("exp")
        block = Ref("block")
        call = T.ident + '(' + exp + ')'
        mul_op = exp << '*' << exp
        add_op = exp << '+' << exp
        exp = Prio(T.number | T.ident | call, mul_op, add_op)
        if_part = Keyword("if") + exp
        then_part = Keyword("then") + block
        else_part = Keyword("else") + block
        if_st = (if_part + then_part + Opt(else_part) + T.kwd_endif)
        while_part = Keyword("while") + exp
        do_part = Keyword("do") + block
        while_st = while_part + do_part + Keyword("done")
        exp_st = exp * 1
        ass_exp = exp * 1
        ass_st = T.ident + '=' + ass_exp
        empty_st = THIS * 0
        st = ass_st | empty_st | exp_st | if_st | while_st
        block = Repeat(st + ';')
        START = block
>>> Lang2.parse(
        "e=1; while funca(e) do if a then e=a; fi; done", on_error=on_error)
([<Lang2>], (__empty__))
([block], ((st (ass_st 'e' '=' (ass_exp (exp '1'))),';'))
([while_st], ((while_part 'while' (exp (call 'funca' '(' (exp 'e') ')'))))
([do_part], ('do'))
([if_st], ((if_part 'if' (exp 'a'))))
([then_part], ('then'))
([then_part], ((block (st (ass_st 'e' '=' (ass_exp (exp 'a'))) ';' (st (exp_st (exp 'fi'))) ';'))),
lrparsing.ParseError: line 1 column 43: Got 'done' when expecting 'else' or 'endif' while trying to match then_part in state 50
Here the on_error()
function prints the stack
passed, but
doesn’t attempt recovery. Looking at the right hand side of the
stack
, the (symbol, symbol, ...)
bit, it shouldn’t be too hard
to see how stitching it together will yield a branch of the
parse tree. This sequence of symbols
must
eventually match one of the Rules
listed to their left.
The LR(1) parser may list several as it will only make its
final decision once it has seen the entire production. For
Lang2
it is not difficult to see the parser was in the
middle of the then_part of an if_st. Lang1
is an example of
how to make recognising this difficult.
lr1_state
has more information in it than the above example
shows. It is an Lr1State
10:
- class lrparsing.Lr1State(id, actions, gotos, rules)¶
  This class is immutable, the parameters becoming attributes of the same name. It has no useful methods beyond object.__str__() and object.__repr__().

  - id¶
    An integer identifying the state. In an Lr1State this integer is its index into the parse table.

  - actions¶
    The shift table of this parse state. It is a dict, indexed by all valid TokenSymbols that could appear at this point. The error occurred because the current token isn't a key to this dict.

  - gotos¶
    The reduce table of this parse state. A dict. Not useful for error recovery.
By just looking at the top item of the passed stack for Lang2 we see the parser was in the middle of a then_part. The task of on_error() is to re-arrange the input so the parser can continue.
The simple way is to insert one of the
TokenSymbols
out of Lr1State.actions
.
In this case 'endif'
would be the right choice. Another choice
is to erase some input, possibly replacing it with something else.
This would be appropriate if, for example the string being parsed was
((2 func 2;
. Replacing func 2
with ))
would get the
parser onto the next statement so the user only gets one
error for the current line.
If on_error()
returns None
the parser will raise a
ParseError
as if on_error()
wasn’t passed to it.
Otherwise it will read input tuples from the
returned iterable, and when that is exhausted return to reading
the original input. So, returning a list
of
input tuples will insert them, reading
input tuples off the passed iterator
will delete them, and reading them and returning what you read
allows you to peek ahead. As the parser has already
processed the input tuple it passed to on_error()
,
if you want the parser to reprocess it, it must be in the returned
iterable. Going back to the example, this line would insert an
'endif'
, thus allowing the parser to process the
remainder of the input:
endif = (Lang2.T.kwd_endif, Lang2.T.kwd_endif.literal) + token[2:]
return [endif, token]
There is one other option for the adventurous. The stack
passed to on_error()
is the actual parser stack,
so you can modify it. The states on the stack
depend on the whims of the parser generator so adding or
modifying them is impossibly fragile. Deleting them, using say
del stack[-2:]
is not. In the ((2 func 2
example deleting
the entire expression from the stack and presenting the
parser with a replacement expression on its input (eg,
just Lang2.T.number
) is often simpler.
Ambiguities¶
Lurking under lrparsing
’s hood is an LR(1) parser.
LR(1) parser generators are cantankerous
beasts. lrparsing
puts lipstick on the pig, but the pig is still a
pig. The principal way it bites is to say your grammar is
ambiguous. This section is here to help you work past this
error, but be aware writing large complex grammars requires a fair
amount of expertise which can’t be imparted by this small how-to.
To understand what ambiguous means, understand that a parser is meant to take some input, validate it conforms to the grammar, and then produce a parse tree. When the parser generator says the grammar is ambiguous, it means that for some valid input several different parse trees can be produced. An LR(1) grammar requires there be only one parse tree for any given input.
A blatantly ambiguous grammar will produce a Reduce/Reduce conflict. Consider:
class A(Grammar):
    a = Token('x')
    b = Token('x')
    START = a | b
A.compile_grammar()
raises this exception:
lrparsing.GrammarError: Conflict: Reduce/Reduce
While trying to recognise:
b = 'x' ^
a = 'x' ^
on seeing a __end_of_input__ I could not decide between:
replace the sequence ['x'] with b
and
replace the sequence ['x'] with a
Please remove this ambiguity from your grammar
Looking at the error message, it displays every possible
production that could apply at this point, with an ^
showing where the parser is up to in each production.
There are only two types of actions the parser can
take. It can accept the next token in the input stream. This is
called a Shift. It isn’t possible here as the next symbol is
__end_of_input__
, in other words it is at the end of the string.
The other action it could take is replace some symbols with the
production they match. This is called a Reduce.
It can do a reduce as there are two productions that match.
Sadly two is one too many, and this conflict is called
Reduce/Reduce because both options are
reduces. They correspond to these
parse trees:
(START (a 'x')) OR (START (b 'x'))
Since there are two possible parse trees the
grammar is ambiguous. If you look at the error message just
the right way, it could almost be said to say that. The solution in
this case is to remove one of the redundant Rules
:
class A(Grammar):
    a = Token('x')
    START = a
Looping grammars are also ambiguous. Consider this grammar:
class B(Grammar):
    b = Ref("e")
    e = b | Token("x")
    START = e
Again, the Grammar will recognise a single 'x'.
B.compile_grammar()
raises this exception:
lrparsing.GrammarError: Conflict: Reduce/Reduce
While trying to recognise:
START = e ^
b = e ^
on seeing a __end_of_input__ I could not decide between:
replace the sequence [e] with b
and
replace the sequence [e] with START
Please remove this ambiguity from your grammar
The error message is saying when the parser has seen a single
'e'
, it has two choices, which are listed. These two choices
correspond to an infinite number of parse trees:
(START (e 'x')) OR (START (e (b (e 'x')))) OR (START (e (b (e (b (e 'x')))))) OR ...
Again since there is more than one possible parse tree, the grammar is ambiguous.
Unfortunately those two simple examples don’t capture the complexity of common ambiguities. Consider this example which is common in real world grammars:
class E(Grammar):
    e = THIS + Tokens("+ /") + THIS | Token(re="[0-9]+")
    START = e
This Grammar
recognises strings of integers separated by '+'
and '/'
. If the grammar writer wanted to produce a grammar that only
recognised valid integer expressions using the operators '+'
and
'/'
he was successful. Nonetheless, when he calls
E.compile_grammar()
this exception is raised:
lrparsing.GrammarError: Conflict: Shift/Reduce and no associativity
While trying to recognise:
e = e '+' e ^
e = e ^ '/' e
on seeing a '/' I could not decide between:
replace the sequence [e '+' e] with e
and
accepting the '/' in the hope it will match:
e = e '/' ^ e
Please remove this ambiguity from your grammar
Again the error message lists the productions
under consideration and how far the parser has got in each. The
smallest possible string the parser could be looking at and end up
with those possibles is: e + e ^ / e
. The parser wants to know
whether it should replace (reduce) e + e with e, which
with e, which
corresponds to performing the addition now, or should it continue
reading, ie, delay the addition until it processes the '/'
.
This is called a Shift/Reduce conflict
because the two actions possible are Shift
and Reduce. The two parse trees
under consideration are thus:
Choose Shift: e + (e / e)
Choose Reduce: (e + e) / e
The normal arithmetic convention is to do the division first, so we want
to choose Shift
. There are several ways to make the grammar do that.
The easiest way in lrparsing
is to prioritise the two choices, saying
e '/' e
should always be done before e '+' e
:
class E(Grammar):
    e = Prio(THIS + '/' + THIS, THIS + '+' + THIS) | Token(re="[0-9]+")
    START = e
But doing that only replaces the previous error E.compile_grammar()
raised with a new one:
lrparsing.GrammarError: Conflict: Shift/Reduce and no associativity
While trying to recognise:
e.Choice.Prio.Prioritised0 = e ^ '/' e
e.Choice.Prio.Prioritised0 = e '/' e ^
on seeing a '/' I could not decide between:
replace the sequence [e '/' e] with e.Choice.Prio.Prioritised0
and
accepting the '/' in the hope it will match:
e.Choice.Prio.Prioritised0 = e '/' ^ e
Please remove this ambiguity from your grammar
Firstly, notice how the choices listed don’t directly correspond to the
rules in your Grammar
. That is because lrparsing
has compiled
your Grammar
into productions, which is the
only thing an LR(1) parser understands. You can print
this compiled form using print(E.repr_productions())
which yields:
0 : <E> = START __end_of_input__
1 : START = e
2 : e = e.Choice.Prio
    e = /[0-9]+/
3 : e.Choice.Prio = e.Choice.Prio.Prioritised0
    e.Choice.Prio = e.Choice.Prio.Prioritised1
4 : e.Choice.Prio.Prioritised0 = e '/' e
5 : e.Choice.Prio.Prioritised1 = e '+' e
It is worth taking some time to understand how these
productions correspond to the original set of
Rules
. In any case you can now see where all those odd
looking symbol names come from.
Using the same technique as before we deduce the smallest string it
could be looking at and end up with these possibilities:
e '/' e ^ '/' e
. In other words it can’t decide between these
two parse trees:
e / (e / e) OR (e / e) / e
As an aside this is a real problem because integer division isn’t
commutative: (9 / 3) / 2 = 1
, whereas 9 / (3 / 2) = 9
.
So if the parse tree was being used to evaluate the
expressions, choosing the wrong one would yield an incorrect
answer.
We can’t use Prio
to solve this because the production
is conflicting with itself. The solution is to say division is left
associative:
class E(Grammar):
    e = Prio(THIS << '/' << THIS, THIS << '+' << THIS) | Token(re="[0-9]+")
    START = e
And with that the parser will compile, and will yield parse trees that conform to the conventions of arithmetic.
Looking deeper under the hood, an LR(1) parser is called that because it delays all decisions until it has recognised an entire production. If you look at the above error messages, you will see they all involve a reduce, and that is because a reduce is what the parser does when it makes its decision on which production it has seen. There are two ways it can get confused.
The first involves two reduces. Since the parser
is making its decision after it has seen the entire production, the
only way it can be confused is if two productions
have the same right hand side. If you look at the
Reduce/Reduce conflicts above that is indeed the
case. But even in that case the LR(1) parser can still
distinguish between the two if they are followed by different
tokens. Consider this ambiguous
Grammar
:
class F(Grammar):
    a = Token('a')
    b = Token('a')
    START = a + 'x' + 'y' | b + 'x' + 'z'
F.compile_grammar()
raises this exception:
lrparsing.GrammarError: Conflict: Reduce/Reduce
While trying to recognise state 4:
a = 'a' ^
b = 'a' ^
on seeing a 'x' I could not decide between:
replace the sequence ['a'] with a
and
replace the sequence ['a'] with b
Please remove this ambiguity from your grammar
Unlike the previous example this Grammar
isn’t inherently
ambiguous as there is one unique parse tree for every
input. The problem is an LR(1) parser isn’t powerful
enough to recognise it. The symbol it needs to see in order
to know whether it should be an a
or a b
(the 'y'
or
'z'
) is two tokens away and LR(1) only
looks one token ahead. An LR(2) parser would be
powerful enough but we don’t have one of those.
In this case it is easy enough to re-write the Grammar so an
so an
LR(1) parser is powerful enough. The solution is to
delay recognising a
and b
until the parser can see the y
or z
:
class F(Grammar):
    a = Token('a') + 'x'
    b = Token('a') + 'x'
    START = a + 'y' | b + 'z'
The problem with re-writing like this is it changes the parse tree. If the back end that processes the parse tree can’t tolerate this, something is going to have to give, and it won’t be the LR(1) parser. Post processing the parse tree so it is in the form the back end wants is the normal solution. Console yourself with the knowledge that people using LL(1) grammars have to do it far more.
Avoid the temptation to use Prio
to resolve
Reduce/Reduce conflicts. It will make the
conflict go away, but it means the parser will never
recognise part of your Grammar
. In the above example
yielding to temptation and using Prio
would result in:
class F(Grammar):
    a = Token('a')
    b = Token('a')
    START = Prio(a + 'x' + 'y', b + 'x' + 'z')
And presto F.compile_grammar()
works. But now F.parse("axz")
yields:
lrparsing.ParseError: line 1 column 3: Got 'z' when expecting 'y'
Since "axz"
is a valid input for our Grammar
this is a bug.
It was expecting a 'y'
because our Grammar
says if it can’t
decide choose a + 'x' + 'y'
over b + 'x' + 'z'
. As we know it
could not decide, so will always choose a + 'x' + 'y'
. This means it
can never recognise b + 'x' + 'z'
.
Since at least one reduce must be involved, the only other possibility is a Shift/Reduce conflict. These always take the form:
SYMBOL = SYMBOL <anything1> SYMBOL ^
SYMBOL = SYMBOL ^ <anything2> SYMBOL
You can pick the Reduce because it has the ^ at the
end. In this case the Grammar
is always ambiguous,
which is good because you can fix it by using associativity or
priorities. The trick in figuring out which to choose is to
understand what the parser is trying to decide between, and
you can do that by replacing a Symbol in the
right hand side of the Shift with the entire
right hand side of the Reduce. The name of the
Symbol replaced must be the same as the left hand side
of the Reduce, and the ^
in the two
productions must coincide, like this:
SYMBOL = SYMBOL <anything1> SYMBOL ^ <anything2> SYMBOL
The parse trees become either:
Shift: SYMBOL <anything1> (SYMBOL <anything2> SYMBOL)
Reduce: (SYMBOL <anything1> SYMBOL) <anything2> SYMBOL
If <anything1>
and <anything2>
are different (like '*'
and
'+'
above) you can use Prio
to choose the one that should
happen first. If they have the same priority (and in particular this
will be the case if <anything1>
and <anything2>
are the same
thing), then you must use Left
if the Shift is the
correct choice or Right
if the Reduce is the
correct choice. If it should never happen use Nonassoc.
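A sketch of Nonassoc (the grammar is illustrative): chained comparisons such as 1 == 2 == 3 are rejected with a ParseError rather than silently grouped one way or the other.

import lrparsing
from lrparsing import Nonassoc, Prio, THIS, Token

class Cmp(lrparsing.Grammar):
    e = Prio(
        Token(re="[0-9]+"),
        Nonassoc(THIS + '==' + THIS))
    START = e

Cmp.parse("1 == 2")         # fine
# Cmp.parse("1 == 2 == 3")  # raises lrparsing.ParseError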
Examples¶
The source of lrparsing contains the following examples, which can also be viewed online at http://lrparsing.sourceforge.net/examples:
- lrparsing-sqlite3.py
  A Grammar for all of the Data Manipulation Statements in Sqlite3. lrparsing was written because the author could not find a Python parsing library that ran at an acceptable speed and required fewer lines of code to do this job than a hand written LL(1) parser.
- lrparsing-expr.py
  An integer expression evaluator. It illustrates one design pattern for building a compiler on top of lrparsing: using a Symbol's inbuilt mapping capabilities. If the job is simple this works well.
- lua52.py
  A Grammar for Lua 5.2 and a cross compiler to Python. It illustrates a second design pattern for building a compiler: using a compiler class whose functions have the same names as the rules and tokens in the Grammar. These functions are then called to compile each node as it is output by parse(). This is a good way to do it for more complex tasks. It cleanly separates the compiler from the parser, while keeping the relationship between the output parse tree and the generated code clear.
Acknowledgements¶
lrparsing
was developed with inspiration from several sources.
The two main ones were parsing.py (http://www.canonware.com/Parsing/)
which from a technical perspective is a powerful and clean LR
parser implementing several things lrparsing
doesn’t, including
a GLR(1) parser. The other source of inspiration is
pyparsing.py (http://pyparsing.wikispaces.com/) which proves that
providing a pythonic interface to parsing is a worthwhile thing
to do.
Finally, thank you to Greg Black for proof reading this document.
Lrparsing Glossary¶
If you use a parser generator to its full extent and particularly if you are doing error recovery, you will end up knowing enough about parser generation to write your own. It will be a long journey, made longer by parser generation being an old art that has built up a truly impressive amount of jargon. The glossary is here to help you get past the jargon.
Unfortunately the glossary is large and the concepts complex, so it ends up being another wall of text. The narrative below is less rigorous but hopefully doesn’t suffer from the wall of text problem. May it speed your journey to parser generation zen.
Glossary Narrative¶
Here is a simple grammar:
import lrparsing

class MyGrammar(lrparsing.Grammar):
    xy_non_term = lrparsing.Token('x') + lrparsing.Token('y')
    y_non_term = lrparsing.Token('y')
    the_world_non_term = lrparsing.Repeat(xy_non_term | y_non_term, min=1)
    START = the_world_non_term
This grammar is composed of rules
, which are
things Python can parse. lrparsing
translates the
rules
into an alternate representation of the same
grammar using productions. Productions
are what an LR(1) parser generator understands (this is the
output produced by MyGrammar.repr_productions()):
0 : <MyGrammar> = START __end_of_input__
1 : START = the_world_non_term
2 : the_world_non_term = the_world_non_term.Repeat
3 : the_world_non_term.Repeat = xy_non_term
    the_world_non_term.Repeat = y_non_term
    the_world_non_term.Repeat = the_world_non_term.Repeat xy_non_term
    the_world_non_term.Repeat = the_world_non_term.Repeat y_non_term
4 : xy_non_term = 'x' 'y'
5 : y_non_term = 'y'
Productions have an internal structure and
each bit has its own name; sometimes several. Apart from the =
they are made up of symbols. The symbol on
the left hand side is called a non-terminal. A
non-terminal is never read from the input stream directly.
It is just a place holder created by the grammar author to represent
whatever is on the right hand side of the production.
This contrasts to the symbols read from the input
stream. They are called tokens. lrparsing
embeds
the tokens in an input tuple mostly because additional information is needed for error reporting - things like where the token was found (eg, line and column).
The parser generator now creates a parse table from the productions. To do that it goes through an intermediate step. It creates a third representation of the grammar, a directed graph. Each node in the graph is called an item set. The first item set is:
{
    <MyGrammar> = ^ START __end_of_input__
}
As the {} are meant to imply, an item set really is a
set
, a set of productions. However
there is an extra bit of information associated with each
production, represented as a ^ here. This is called the
dot and it shows how far the parser has progressed in
recognising the production. The combination of the
production and the dot is called an item,
and hence the term item set arises because it is a set of
items.
The production used in this initial item set is the root of the parse tree. It includes the special end of input MetaToken __end_of_input__, which signals the entire input has been read. The non-terminal on its left hand side (<MyGrammar> here) has a special name: it is called the start symbol. So the item set above is a succinct representation of the parser's state before it has read anything. However it's not finished yet.
In order to finish it off an operation called closure must be performed on it. The closure is intuitively easy to construct once you recognise that if the parser is about to see a START symbol, then the START symbol's production(s) belong in the closure as well:
{
    <MyGrammar> = ^ START __end_of_input__
    START = ^ the_world_non_term
}
But the closure is still not finished, as now the_world_non_term is also a non-terminal preceded by the dot. Adding it adds yet more unclosed non-terminals. This goes on for a bit. Once complete the closure will be:
{
    <MyGrammar> = ^ START __end_of_input__
    START = ^ the_world_non_term
    the_world_non_term = ^ the_world_non_term.Repeat
    the_world_non_term.Repeat = ^ xy_non_term
    the_world_non_term.Repeat = ^ y_non_term
    the_world_non_term.Repeat = ^ the_world_non_term.Repeat xy_non_term
    the_world_non_term.Repeat = ^ the_world_non_term.Repeat y_non_term
    xy_non_term = ^ 'x' 'y'
    y_non_term = ^ 'y'
}
The item set is now complete. There are two
tokens that follow the dots in it:
‘x’ and ‘y’. This means these two tokens, and
only these two tokens, could legitimately appear in the input stream at this point. Anything else will cause the parser to raise a ParseError.
Let's say ‘x’ was the next thing to be read from the input stream. Reading it creates a new item which can be used to create a core we haven't seen before:
{
    xy_non_term = 'x' ^ 'y'
}
Since there are no non-terminals following the dot the closure adds nothing, so this core is also a completed item set. ‘x’ is not the only token that the epoch item set could first read from the input stream. It might also be a ‘y’. This yields another new core, and doing a closure on that core will create a new item set for it.
Recall the parser generator is creating a graph, and these item sets are the nodes in the graph. Since reading a token moves to these new item sets, the net effect of reading any symbol is to create a directed line connecting the two nodes in the graph. The act of moving along that line to a new node in the graph is called an action. This particular variant of an action is called a shift action. All possible shift actions a node / item set can make are held in its shift table. For the item set representing the start symbol, the shift table will be:
{
    'x': item_set({ xy_non_term = 'x' ^ 'y', }),
    'y': item_set({ y_non_term = 'y' ^, }),
}
When the xy_non_term reads the ‘x’ followed by ‘y’ the entire right hand side of the production has been seen. Each shift has pushed the token it accepted onto a data structure called the stack. The ‘x’ and ‘y’ on the stack are popped from the stack and the non-terminal xy_non_term is pushed onto the stack, effectively replacing them. These are the first steps of an action called reduce. Effectively, the non-terminal xy_non_term has just been read from the input stream, and based on that the parser must move to a new state (node in the graph, aka item set). It does it in the same way as a shift action: it looks up the non-terminal just “read” in a table called the reduce table. This table can be displayed using MyGrammar.repr_parse_table(). The reduce table of the start symbol's item set is:
{
    START:
        item_set({<MyGrammar> = START ^ __end_of_input__}),
    the_world_non_term:
        item_set({START = the_world_non_term ^}),
    the_world_non_term.Repeat:
        item_set({
            the_world_non_term = the_world_non_term.Repeat ^
            the_world_non_term.Repeat = the_world_non_term.Repeat ^ xy_non_term
            the_world_non_term.Repeat = the_world_non_term.Repeat ^ y_non_term
        }),
    xy_non_term:
        item_set({the_world_non_term.Repeat = xy_non_term ^}),
    y_non_term:
        item_set({the_world_non_term.Repeat = y_non_term ^}),
}
“Reading” a non-terminal moves the dot just like reading a token does, and so creates a new item set. All these new item sets are processed, yielding new item sets in turn that are also processed - but only if the core hasn’t been seen before. When the process winds down the graph is complete.
To do its job the parser needs only the base structure of the graph, not the details used to build it. In fact each item set can be represented by an integer, which can conveniently be used as an index into an array whose elements contain the reduce table and shift table for the node the item set represented. This stripped down version of the item set graph is the parse table.
Glossary¶
- Action
An action is the collective name for the two steps an LR(1) parser can perform: shift and reduce. Alternatively, if the parse table is thought of as a graph, an action is the name of a line connecting the nodes in the graph. The name for a node in the graph is a state.
- Ambiguous
A parser of any sort has two jobs: to verify the input is valid according to the grammar, and to create a parse tree for it. The words create a parse tree could equally well be written assign meaning. Eg, 9 + 3 / 3 is valid arithmetic, but without the normal rules of arithmetic it could have two different meanings: 9 + (3 / 3) = 10 or (9 + 3) / 3 = 4. In English, jokes are based on this same ambiguity, eg, Time flies like an arrow. It follows a parser would not be much use if it could not assign a unique meaning to every input sequence. Sadly it is possible (even easy) to write grammars that don't assign a unique meaning to every input sequence. These grammars are rejected by the parser constructor as ambiguous. Even more sadly no practical parser, including LR(1), is powerful enough to recognise all non-ambiguous grammars, so they incorrectly say that some non-ambiguous grammars are ambiguous. The only work around is to re-write the grammar in a form the parser can use, and post process the parse tree if the new version is too far from the truth.
Closure is the name for both an operation performed on item sets during their construction and the output of that operation, which is a set of items. An item set is a set of items, ie, partially recognised productions. Initially an item set contains just its core. If the next expected symbol of any item in the item set is a non-terminal, ie the left hand side of another production, then it follows the parser is also recognising this new item, which is that production with its dot at the start. This new item becomes part of the closure. That new item could also have a non-terminal at its start, thus yielding another new item recursively. All items added in this way become the item set's closure.
- Core
The core is the name of the initial subset of items in an item set. The core is constructed from the item sets in the state graph that have lines leading to this new item set: it is the items in those ancestor item sets after the recognition of the symbol labelling the line connecting them, with their dots moved along one position to reflect the recognition of that symbol.
- Conflict
- Reduce/Reduce
- Shift/Reduce
The job of the parser generator is to create a graph. The nodes of the graph become the parser’s states. The lines that join the nodes of the graph are termed actions, and are labelled with the symbol the parser just recognised. If the parser generator can’t uniquely decide what the next state should be when a symbol is seen (ie, several lines would be labelled with the same symbol), the grammar is said to have a conflict. This means the grammar is ambiguous, because having several possible states reachable for the same symbol means there are several possible parse trees. For an LR parser at least one of the conflicting actions will always be a reduce, because an LR parser only makes a decision once it has seen an entire production. The conflict is therefore either between a shift and a reduce, which is termed a Shift/Reduce conflict, or between two reduces, which is termed a Reduce/Reduce conflict.
- Dot
- ^
Marks a position in the right hand side of a production. In days gone by, when parse tables were constructed manually on paper, a human would mark a dot in a production to keep track of how far they had got. In a computer the position of the dot is represented by the number of symbols on the right hand side seen so far.
- Grammar
Rules that explain how a series of tokens should be interpreted. For example, in English tokens are words and punctuation. English grammar shows how to build up words and punctuation into phrases of various kinds (eg, noun phrases, adverb phrases), then the phrases into sentences, and the sentences into constructs like quotes, paragraphs and chapters. The result is often represented as a parse tree.
- Input Tuple
The parser recognises tokens. However, it is often helpful to have other information associated with the token that the parser doesn’t use, such as the position in the input stream it was found at for error reporting. lrparsing calls the combination of a Token plus that other information an input tuple. lrparsing uses a Python tuple with the Token as its first element for this purpose.
- Item
A partially recognised production. It is a production, plus a marker (the dot) which shows how much the parser has seen so far, plus the tokens that could legally follow the production. An LR(0) parser doesn’t store any such tokens, an LR(1) parser stores all valid tokens that could come next, and an LR(n) parser stores all valid sequences of n tokens that could possibly follow. Since an LR(1) parser only makes decisions after seeing the entire right hand side of a production, it only considers the symbols that could follow a production when it does a reduce.
- Item Set
This data structure is the heart of an LR(1) parser generator. As the name implies, an item set is a set of items. Internally this set is broken up into two distinct subsets, the core and the closure. The generator initialises itself by creating the core of the initial item set from the start symbol. This core consists of the sole production of the start symbol with the dot before the first symbol. The next step is to compute the closure of this core: if the first symbol of the right hand side is a non-terminal, then it follows the parser is also at the start of all of that non-terminal’s productions, so they are added to the closure; and so on, recursively, if they have non-terminals as their first symbol. So now we have a set of items (the core and its closure), each waiting for their first symbol to appear. If the parser receives one such symbol it can move forward, and what it moves to is a new item set whose core is all the current items waiting on that symbol, but with their dot positions moved forward one step because a new symbol has been seen. This process of computing closures and generating new item sets continues until only duplicate item sets are created. The item sets then become the states of the parse table.
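The closure computation is essentially a worklist loop. The following toy sketch is not lrparsing’s code: grammars are plain dicts mapping each non-terminal to its right hand sides, items are (lhs, rhs, dot) triples, and lookaheads are omitted, but it shows the shape of the recursion:

def closure(core, grammar):
    items, todo = set(core), list(core)
    while todo:
        lhs, rhs, dot = todo.pop()
        # An item expecting a non-terminal next pulls that
        # non-terminal's productions, dot at the start, into the set.
        if dot < len(rhs) and rhs[dot] in grammar:
            for prod in grammar[rhs[dot]]:
                item = (rhs[dot], tuple(prod), 0)
                if item not in items:
                    items.add(item)
                    todo.append(item)
    return frozenset(items)

grammar = {"E": [("E", "-", "n"), ("n",)]}
print(closure([("S", ("E",), 0)], grammar))   # the core item plus two "E" items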
- Lhs
- Left Hand Side
The Symbol on the left hand side of a production. Such a symbol is called a non-terminal. An lrparsing Grammar calls the symbol on the left hand side of a production a Rule.
- LR(1)
LR(1) is an abbreviation for Left Right (parser with) 1 (symbol lookahead). Left here simply means it treats the string of characters like English: it starts reading at the left. The Right means it delays making a decision on whether it has recognised a production until it has read the last (right most) symbol of its right hand side. The 1 symbol lookahead means it considers the next input symbol when it makes this decision. (It only needs to do this if two different non-terminals have the same right hand side. An LR(1) parser will look at the next token to be read to distinguish between them. If the same token can legitimately follow both, doing that doesn’t help, so the parser generator will say it has a Reduce/Reduce conflict.) An LR(0) parser doesn’t consider the next input token, and if someone were to write an LR(2) parser it would consider the next 2 tokens. The other common type of parser is an LL(1) parser, also called a recursive descent parser. The only difference is the R becomes an L (for Left), meaning the parser must make the decisions on what it is doing using just the left hand side of the production it is currently trying to recognise and the next input symbol. In theory, because it has significantly less information to work with (it hasn’t seen any of the right hand side yet), an LL(1) parser is less powerful than an LR(1) parser. In practice recursive descent parsers are often written by hand, and the programmer cheats by looking ahead or re-writing the grammar. This cheating goes a long way to making up the difference in power. However, automatically generated LL(1) parsers don’t cheat, and so are significantly less powerful (ie, LL(1) handles a smaller set of otherwise valid grammars) than automatically generated LR(1) parsers such as lrparsing.
- Parser State
- State
The state, which is typically identified by a number, is how the LR(1) parser knows what it is up to. A state is called an item set by the parser generator; the name change reflects a change in the way it is used, not in what it is. Conceptually an LR(1) parser is a Deterministic Finite Automaton (DFA), which is a fancy name for a way of moving through a graph. The states are the nodes in this graph. The LR(1) parser progresses by moving from one node to another, but like all graphs this is only possible if the nodes are directly connected by a line. In the parser’s graph these lines are actions. Every valid token that can be accepted as input while the parser is in that state has an associated action. In fact the only time the parser raises an error is when it knows the input stream can’t match the grammar, and the only time it knows that is when the token it just read doesn’t have an action associated with it. A shift action consumes the next token (pushes it onto the stack) and then moves to the next state by following the line labelled with that token. A reduce action recognises a production, and so it pops the stack to see what parent state the LR(1) parser is trying to recognise. That parent state will have a line in the graph labelled with the non-terminal on the left hand side of the production just recognised, which will take it to the next state. People familiar with the purest form of regular expressions will know that they can also be recognised with DFAs. An LR(1) parser is a souped up version of a regular expression DFA that recognises the right hand sides of its productions.
- Parse Table
A parse table is the data structure that drives a parser. The parse table is an array of states, the states being identified by their array index. The parse table can be viewed as the description of the nodes and lines of a graph. The nodes of the graph are the states. Each line leads to the state the parser moves to when it recognises a symbol, and is labelled with that symbol; the lines correspond to the states’ actions.
- Parse Tree
The way a parser explains the meaning of its input is to output a parse tree. The parse tree assigns meaning to the symbols and the relationships between them. The sentence Time flies like an arrow can be parsed two ways. In one way “Time” is an adjective describing a type of insect, and “like” is a verb describing what “Time flies” do to arrows. An lrparsing parse tree for that meaning might look like:
(sentence (noun_phrase (adjective 'Time') (noun 'flies')) (verb_phrase (verb 'like') (noun_phrase (determiner 'an') (noun 'arrow'))))
The alternative, where “Time” is a noun, looks like this:
(sentence (noun 'Time') (verb_phrase (verb 'flies') (preposition_phrase (preposition 'like') (noun_phrase (determiner 'an') (noun 'arrow')))))
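Because a parse tree is nothing more than nested tuples, a few lines of Python suffice to walk one. A minimal sketch, relying only on the tuple shapes described earlier (token tuples carry the matched text at index 1, rule tuples carry nested child tuples):

def walk(node, depth=0):
    if len(node) > 1 and isinstance(node[1], str):
        # A token: (Token, 'matched_input', offset, line, column).
        print("  " * depth + repr(node[1]))
    else:
        # A rule: (Rule, child_tuple, child_tuple, ...).
        print("  " * depth + str(node[0]))
        for child in node[1:]:
            walk(child, depth + 1)

walk(parse_tree)   # parse_tree as returned by SomeGrammar.parse()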
- Parser
The parser in an LR(1) system is a simple loop that processes a stream of input tuples according to a parse table, and produces a parse tree as a consequence. The same parser is used for all input streams and all parse tables. It is fast because it is simple; its one internal data structure is the stack. The parse table is actually a graph, and the parser’s sole job is to move to a new node in the graph as it recognises each new symbol, by following the line labelled with that symbol. As it goes it outputs the productions it has recognised, and these become the parse tree.
- Parser Generator
A parser generator takes a grammar as input and generates a parse table as output, unless the grammar is ambiguous, in which case it raises an error. It does this by constructing item sets.
- Production
The LR(1) name for the lines describing a Grammar that an LR(1) parser generator accepts as input. lrparsing does not use productions, it uses rules; it compiles your rules into LR(1) productions for you. You can display those productions using repr_productions(). Productions are similar to rules in that they have a left hand side which replaces the symbols on the right hand side when they are recognised. The differences between a Rule and a production are that a production can only be a Sequence, and that alternative derivations for the same non-terminal are expressed by allowing a non-terminal to have many productions.
- Non-Terminal
A symbol that appears on the left hand side of a production. In lrparsing Grammars the Rules are its non-terminals.
- Reduce
- Goto
A reduce is the parser’s aha! moment. It happens when it has decided it has recognised a complete production. This occurs when the symbols on the right hand side, plus the next token in the input stream, uniquely determine the non-terminal on the production’s left hand side. The reduce action pops the right hand side’s symbols from the stack, and then pushes the non-terminal on the left hand side; in other words, it replaces the symbols on the stack that are the production’s right hand side with the non-terminal on its left hand side. It then looks up the newly arrived non-terminal in the reduce table to determine the next state. Goto is an older name for reduce still found in the literature.
- Reduce Table
- Goto Table
The reduce table is a mapping from a non-terminal to a state. When a production is recognised by a reduce action it is replaced by the non-terminal on its left hand side. This is done by popping the symbols on the right hand side from the stack, and then pushing the non-terminal. The parser must then move to a new state, and it does that by looking up the non-terminal in the state’s reduce table. Every state has a reduce table, but it will be empty if all the productions in the item set the state represents have right hand sides consisting purely of tokens. If the parse table is thought of as a graph, the reduce table describes the lines in the graph labelled by a non-terminal, ie, to be followed when that non-terminal is recognised by a reduce. There is a corresponding mapping for tokens called the shift table, and between the two they describe all the lines in the graph. Goto Table is an older name for the reduce table which is still found in the literature.
- Rhs
- Right Hand Side
The sequence of symbols on the right hand side of a production. When these symbols are seen in the order they were given, they can be replaced on the stack (via an operation called a reduce) by the non-terminal on the left hand side.
- Shift
When a token is recognised as valid it is removed from the input stream and pushed onto the stack. The parser then looks up the token in the shift table to determine the new state it should be in.
- Shift Table
The shift table maps a token to another state. Each state has its own shift table that lists every token the grammar says could legitimately come next. If the next token to be read isn’t in that list a ParseError is raised; in fact this is the only thing that triggers a ParseError. If the parse table is thought of as a graph, the shift table describes the lines in the graph followed when a particular token appears as the next item in the input stream. There must be only one such line, otherwise the grammar would be ambiguous. The act of following it is called a shift, and thus the shift table lists all possible shift actions that can happen while the parser is in a particular state. There is a similar table called the reduce table that describes the lines in the graph to be followed when a non-terminal is recognised. Between them, the shift tables and the reduce tables contain all the lines in the graph, and thus describe all possible transitions the parser can make between states.
- Stack
The stack is a data structure the LR(1) parser uses internally to keep track of where it is up to. A shift pushes the next input Token onto the stack. A reduce happens when the symbols on the right hand side of a production match the symbols on the top of the stack. When they do, a reduce replaces those matching symbols with the non-terminal on the left hand side of the production. That symbol will always (unless it is the start symbol) be part of another production that the parser was in the process of recognising, and now its (probably partial) right hand side will be what is on the top of the stack. The stack continues downwards with symbols from partially recognised productions until, at its bottom, we find the partially recognised production of the start symbol. The symbols popped from the stack become nodes in the parse tree. Thus the stack is actually a partially constructed branch of the parse tree, with the bottom becoming the root node of the parse tree.
- Start Symbol
The goal of the grammar is to recognise this symbol. Although an lrparsing Grammar requires a Symbol named Grammar.START, and its goal is indeed to recognise that symbol, it isn’t the real start symbol. The real start symbol has a single production whose right hand side is the Grammar.START symbol followed by an internal MetaToken called __end_of_input__. This real start symbol is returned by epoch_symbol().
- Symbol
In LR(1) parlance a symbol is the name for something that is recognised by the parser, ie, it is a collective name for tokens and non-terminals.
- Terminal
- Token
Typically what the parser is trying to recognise is a series of characters. The first step is to break that series of characters into tokens. In English a token is a word or a punctuation character. The parser then works on these tokens, not the individual characters.
- Tokeniser
The thing that turns text into a stream of Tokens.
Footnotes¶
- 1
compile_grammar() is the one lrparsing operation that is not thread safe. parse() is thread safe, except that it will lazily call compile_grammar() for you if you haven’t done it yet. This means in threaded environments you must compile your grammar yourself and not rely on parse() to do it. Compile it on module load, or wrap it in a threading.Lock() if you insist on doing it lazily. Calling parse() from multiple threads without first calling compile_grammar() or pre_compile_grammar() will cause lrparsing to crash sporadically.
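A minimal sketch of the safe pattern, using a deliberately trivial grammar: compile once at module import time so no thread can trigger the lazy compile inside parse():

import lrparsing
from lrparsing import Token

class G(lrparsing.Grammar):
    START = Token(re="[0-9]+")

G.compile_grammar()   # done once at import; parse() never compiles now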
- 2
Assigning a Symbol instance to a rule in a Grammar subclass does not work like normal Python assignment, where the left hand side and the right hand side are the same object after the assignment. Instead the Grammar subclass creates a new Rule object to hold the Symbol object on the right hand side, which results in some unusual side effects:
class MyGrammar(Grammar):
    rule1 = Token('a')          # Legal, but rule1 != Token('a')
    rule2 = rule1 + Token('b')  # Also legal
    rule3 = rule2               # Illegal, direct assignment
    rule3 = rule2 * 1           # Same result but is legal
    START = rule3               # Legal because this is START
- 3
Not outputting rules whose name starts with a leading underscore to the parse tree means these two grammars produce the same parse tree:
class Ex1(Grammar):
    _rule1 = Token('x')
    START = _rule1 + Token("y")

class Ex2(Grammar):
    START = Token('x') + Token("y")
- 4
An example of syntactic sugar for Symbol. This:
class MyGrammar(Grammar):
    rule = Choice(
        Token('x'),
        Prio(
            Repeat(Token('a'), 1, 1),
            Sequence(Opt(Token('y')), Token('z'))))
Can be rewritten as this:
class MyGrammar(Grammar):
    rule = Token('x') | (Token('a') * 1, Opt('y') + 'z')
- 5
An example of using a TokenRegistry:
class MyGrammar(Grammar):

    class MyTokens(TokenRegistry):
        my_token = UserToken()
        identifier = Token(re='[A-Za-z_][A-Za-z_0-9]*', case=False)
        #
        # Being a subclass of Symbol, TokenSymbols have dicts.
        # Since assignments to a TokenSymbol's dict don't work in
        # the body of a Grammar subclass, it is best to put them
        # in the TokenRegistry's body.
        #
        my_token["a"] = "a_value"

    START = MyTokens.my_token | MyTokens.identifier

assert MyGrammar.MyTokens.my_token == 'MyTokens.my_token'
- 6
A corollary of a TokenRegistry being the only way to obtain a handle to the TokenSymbol used by the parser is that if you use UserTokens, you have to use a TokenRegistry; otherwise there is no way to pass the correct UserToken instance to the parser.
- 7
The inbuilt tokeniser passes input tuples through to the parser, so you can use it to handle the bulk of the work even if it can’t handle all of it. Write a generator that replaces the input strings the inbuilt tokeniser can’t handle with input tuples, and yields the remainder of the input untouched. Pass your generator to parse() and you are done.
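A sketch of such a generator. G, its UserToken G.T.blob and the {...} spans are all hypothetical, and only the Token and matched text are shown in the hand-built tuple; the shape is what matters: yield plain strings for the inbuilt tokeniser, and ready-made input tuples for what it can’t handle:

def tokens(text):
    while text:
        if text.startswith("{"):
            # A span the inbuilt tokeniser can't handle: pass it
            # through as a ready-made input tuple.
            end = text.index("}") + 1
            yield (G.T.blob, text[:end])
            text = text[end:]
        else:
            # Everything else goes through as a string for the
            # inbuilt tokeniser to break into tokens as usual.
            cut = text.find("{")
            cut = len(text) if cut < 0 else cut
            yield text[:cut]
            text = text[cut:]

parse_tree = G.parse(tokens(raw_text))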
- 8
If you want to intercept the input tuples emitted by the inbuilt tokeniser, perhaps to change or delete them, override TokenRegistry.tokeniser() like this:
class G(Grammar):

    class T(TokenRegistry):

        def tokeniser(self, *args):
            for input_tuple in super(G.T, self).tokeniser(*args):
                # do something with input_tuple, maybe
                yield input_tuple
- 9
Using tokens rather than dealing with strings directly is primarily an optimisation. Being more powerful than a regular expression, an LR(1) parser could easily do what the tokeniser does: recognise identifiers, numbers, keywords and so on in a character stream. But that power comes at a price: an LR(1) parser both consumes more memory and is slower than a regular expression. Since tokens are typically 4 or 5 characters long, we reduce the workload on the LR(1) parser by a factor of 4 or 5 by getting a light weight regular expression to break the input stream up into tokens, and then having the LR(1) parser operate on them.
- 10
The statement that the stack passed to on_error() and ParseError is an Lr1State is a simplification. It will be an Lr1State if pre_compile_grammar() is used, otherwise it will be an item set. However, Lr1State contains a strict subset of the information in an item set, so the description applies to both.