.. Copyright (c) 2013,2014,2015,2016,2017,2018,2021 Russell Stuart.
   Licensed under GPLv2, or any later version.

.. module:: lrparsing
   :synopsis: An LR(1) parser hiding behind a pythonic interface.

**************
|lrparsing|
**************

.. toctree::
   :maxdepth: 2

.. topic:: Abstract

   |Lrparsing| is an LR(1) parser hiding behind a pythonic interface.

:Author: Russell Stuart <russell-lrparsing@stuart.id.au>

Introduction
============

Lrparsing provides both an :term:`LR(1) parser <LR(1)>` and a
:term:`tokeniser`.  It differs from other Python LR(1) parsers in using
Python expressions as grammars, and it offers simple to use disambiguation
tools.  The result is something that is powerful yet concise.  For simple
tasks this means it can be thought of as an extension to Python's existing
:mod:`re` module, used when regular expressions become too cumbersome.  For
complex tasks lrparsing offers a high speed parser (very roughly 25us per
token from string to :term:`parse tree` on a desktop CPU), pre-compilation
of the grammar, and error recovery hooks so parsing can continue after an
error is found.  In addition to extensive documentation it comes with a
parser for Sqlite3 data manipulation statements and a Lua 5.2 to Python
compiler as examples.  The documentation can be read online at
http://lrparsing.sourceforge.net/doc/html/.

First Steps
===========

Below this section is a wall of text that will overwhelm all but the most
determined.  This example is designed to help you navigate it.  Invest some
time in reading it, enough to gain an intuitive feel for how |lrparsing|
works.  If your task is small it might be enough to get you through.

The :term:`grammar` is specified by a class derived from :class:`Grammar`.
Attributes of that class define the :term:`grammar`::

    import lrparsing
    from lrparsing import Keyword, List, Prio, Ref, THIS, Token, Tokens

    class ExprParser(lrparsing.Grammar):
        #
        # Put Tokens we don't want to re-type in a TokenRegistry.
        #
        class T(lrparsing.TokenRegistry):
            integer = Token(re="[0-9]+")
            integer["key"] = "I'm a mapping!"
            ident = Token(re="[A-Za-z_][A-Za-z_0-9]*")
        #
        # Grammar rules.
        #
        expr = Ref("expr")                # Forward reference
        call = T.ident + '(' + List(expr, ',') + ')'
        atom = T.ident | T.integer | Token('(') + expr + ')' | call
        expr = Prio(                      # If ambiguous choose atom 1st, ...
            atom,
            Tokens("+ - ~") >> THIS,      # >> means right associative
            THIS << Tokens("* / // %") << THIS,
            THIS << Tokens("+ -") << THIS,    # THIS means "expr" here
            THIS << (Tokens("== !=") | Keyword("is")) << THIS)
        expr["a"] = "I am a mapping too!"
        START = expr                      # Where the grammar must start
        COMMENTS = (                      # Allow C and Python comments
            Token(re="#(?:[^\r\n]*(?:\r\n?|\n\r?))") |
            Token(re="/[*](?:[^*]|[*][^/])*[*]/"))

    parse_tree = ExprParser.parse("1 + /* a */ b + 3 * 4 is c(1, a)")
    print(ExprParser.repr_parse_tree(parse_tree))

Will print::

    (START (expr                  # Here START is ExprParser.START
      (expr                       # And expr is ExprParser.expr too.
        (expr
          (expr (atom '1'))
          '+'                     # '+' is repr(input_matched_by_token)
          (expr (atom 'b')))
        '+'                       # 'b' is actually ExprParser.T.ident
        (expr
          (expr (atom '3'))
          '*'                     # '3' is actually ExprParser.T.integer
          (expr (atom '4'))))
      'is'
      (expr (atom (call 'c' '('
        (expr (atom '1'))
        ','                       # ',' is an anonymous Token
        (expr (atom 'a'))
        ')')))))

As is suggested by the output, the :term:`parse tree` returned by
:func:`parse` is a tree composed of nested Python :func:`tuples <tuple>`.
The quoted strings above are actually :func:`tuples <tuple>` of the form::

    (Token(), 'matched_input', offset_in_stream, line_number, column_number)

So the tuple representing ``'3'`` would be
``(ExprParser.T.integer, '3', 16, 1, 17)``.

Your next step will probably be writing a small :term:`grammar`, and you
will become discouraged when |lrparsing| complains it is :term:`ambiguous`.
Read `Ambiguities`_ for hints on how to fix it.
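Because the tree is plain nested tuples, it can be walked with ordinary
Python.  The helper below is an illustrative sketch (``leaf_text`` is not
part of |lrparsing|); it relies only on the leaf format shown above, where
a token tuple's second element is the matched string:

```python
def leaf_text(node):
    # A token leaf looks like (Token(), 'matched_input', offset, line, column),
    # so its second element is a string.  An inner node is a Rule instance
    # followed by the child tuples it was built from.
    if len(node) > 1 and isinstance(node[1], str):
        return [node[1]]
    texts = []
    for child in node[1:]:          # recurse over the children, in order
        texts.extend(leaf_text(child))
    return texts
```

Applied to the parse tree above, ``leaf_text(parse_tree)`` would yield the
matched text of every token in input order.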
If you go much beyond the example above you will need to read the reference
material below, and you will find yourself getting lost in a thicket of
jargon.  The jargon is unavoidable: parsing involves many concepts, and
each needs its own name.  Traditionally a :ref:`glossary` is the solution
for jargon, but you may find the `Glossary Narrative`_ a gentler
introduction to the world of parsing.

.. _module:

Module contents
===============

The module contains many classes and a few functions.  The classes are
described below, lumped into the following categories:

* Define a :term:`grammar` in a subclass of :class:`Grammar`.
* :class:`Symbol` classes describe rules in the grammar.
* :class:`Token <TokenSymbol>` classes break up the input into
  :term:`input tuples <input tuple>`, which is what the :term:`parser`
  wants to be fed.
* A :class:`TokenRegistry` manages :term:`tokens <token>`, the
  :term:`tokeniser`, and the :ref:`inbuilt tokeniser <inbuilt-tokeniser>`.
* :exc:`Exceptions <LrParsingError>`.

In addition, there are a few utility routines.  They all duplicate methods
of the :class:`Grammar` class.  They are provided in case a
:class:`Grammar` subclass redefines them:

.. function:: compile_grammar(grammar)

   Compile ``grammar``, which must be a subclass of :class:`Grammar`.  If
   the compile fails a :exc:`GrammarError` will be raised.  This is
   typically used when debugging new :class:`Grammars <Grammar>`.  A
   ``Grammar`` must be compiled before it can be used, but this is done
   implicitly for you by :func:`parse`, or it can be done explicitly by
   :func:`pre_compile_grammar` if compiling is too slow to do at run time.
   :func:`repr_parse_table` displays a lot more information when the
   ``grammar`` is compiled by :func:`compile_grammar`.  Compiling a
   ``Grammar`` is not thread safe.

.. function:: epoch_symbol(grammar)

   Returns the :term:`start symbol`, which is the :class:`Symbol` used to
   initialise the :class:`Grammar` subclass ``grammar``.
   It is a :term:`non-terminal` with a single :term:`production`:
   ``<MyGrammar> = START __end_of_input__``.  Useful during
   :ref:`error-recovery`.

.. function:: parse(grammar, input, tree_factory=None, on_error=None, log=None)

   Parse the ``input`` using the subclass of :class:`Grammar` passed,
   returning a :term:`parse tree` if the ``grammar`` matches the input, or
   raising a :exc:`ParseError` otherwise, calling :func:`compile_grammar`
   first if the ``grammar`` hasn't been compiled.  If the function
   ``on_error`` isn't :data:`None` it will be called when the parser input
   doesn't match the ``grammar``.  It is documented in
   :ref:`error-recovery`.  If the function ``log`` is passed it will be
   called repeatedly with :class:`strings <str>` showing the parser's
   :term:`stack`.

   The ``input`` is passed to a :term:`tokeniser` which converts it to the
   stream of :term:`input tuples <input tuple>` expected by the
   :term:`parser`.  This :term:`tokeniser` is created by the
   :class:`TokenRegistry` declared in the :class:`Grammar` subclass.  By
   default the :ref:`inbuilt tokeniser <inbuilt-tokeniser>` is used, and it
   accepts either a :class:`string <str>` or a generator that yields
   :class:`strings <str>` and :term:`input tuples <input tuple>`.
   :term:`Input tuples <input tuple>` are normal Python
   :func:`tuples <tuple>` that have a :class:`TokenSymbol` instance as
   their first element.  The :term:`parser` ignores the remaining elements
   in the :func:`tuple <tuple>`.

   The default format of the returned :term:`parse tree` is a tree of
   nested :func:`tuples <tuple>`.  These :func:`tuples <tuple>` will be one
   of the :term:`input tuples <input tuple>` given to the :term:`parser`,
   or will have one of the :class:`Rule` instances defined as a rule in the
   :class:`Grammar` followed by the :func:`tuples <tuple>` representing the
   :class:`Symbols <Symbol>` that made up the right hand side of the
   :term:`production` matched.  (Use :func:`repr_productions` to see these
   productions.)
   The format of the returned :term:`parse tree` can be altered by passing
   a function as ``tree_factory``.  Whatever it returns is substituted into
   the :term:`parse tree` instead of the :func:`tuple <tuple>`.  It must
   accept one argument: the :func:`tuple <tuple>` that would have been
   inserted.

   The parse operation is thread safe, but the lazy compiling of the
   ``Grammar`` is not. [#thread_compile]_

.. function:: pre_compile_grammar(grammar, pre_compiled=None)

   :func:`compile_grammar` can take a long while for reasonably sized
   grammars (seconds on a 32 bit out-of-order desktop CPU).  This function
   allows you to skip that time in production environments by giving the
   parser a pre-compiled :term:`parse table`.

   The pre-compiled :term:`parse table` is actually a (very big) tuple made
   up entirely of Python constants.  This function returns the
   :func:`repr() <repr>` of that tuple (ie, a :class:`string <str>`).  Use
   the returned value by passing it to this function(!) as the
   ``pre_compiled`` argument.  If you pass the returned string directly it
   will be around 10 times faster than calling it without the
   ``pre_compiled`` parameter.  If you pass it the raw tuple it will be
   about 100 times faster.  The reason for the 10 fold difference is that
   the passed string must be passed to :func:`eval() <eval>`, but if you
   embed the raw tuple into your code it will be tokenised by Python and
   put in the .pyc file.

   If this function is passed a pre-compiled :term:`parse table` that
   matches the current grammar it returns :data:`None` quickly, otherwise
   it compiles the grammar into a :term:`parse table` and returns it as a
   :class:`string <str>`, which is orders of magnitude slower.  This means
   that, apart from speed, the :class:`Grammar` always behaves the same way
   (ie, as specified by the source defining its class) regardless of
   whether this function is called, even if it is called with a
   pre-compiled :term:`parse table` that doesn't match the
   :class:`Grammar`.
   The returned :term:`parse table` is optimised, meaning information
   useful for debugging the grammar has been stripped out of it.

   This operation is not thread safe.

.. function:: repr_grammar(grammar)

   Return a printable :class:`string <str>` with the :class:`Rules <Rule>`
   as seen by the :class:`Grammar` subclass ``grammar``.  Useful for
   debugging :class:`grammars <Grammar>`.

.. function:: repr_parse_table(grammar, state=None)

   Return a printable :class:`string <str>` describing the
   :term:`parse table` of the passed :class:`Grammar` subclass ``grammar``.
   This only returns something useful after :func:`compile_grammar` has
   been called.  The return value may even be useful if
   :func:`compile_grammar` raised an exception.  The result returned by a
   :func:`pre compiled <pre_compile_grammar>` ``grammar`` is barely useful,
   except for debugging the pre compiler.

   If ``state`` (an :class:`integer <int>`) is passed, only that state
   instead of the entire table is returned.  If ``state`` is negative, that
   :term:`state` and all :term:`states <state>` that refer to it in their
   :term:`action` or :term:`reduce tables <reduce table>` are returned.

.. function:: repr_parse_tree(parse_tree, indent=" ")

   Return a printable :class:`string <str>` representing the passed
   ``parse_tree`` (as returned by :func:`parse`) in a human readable
   format.  If ``indent`` is :data:`False` it will be on one line,
   otherwise it will be split across lines and indented using the
   :class:`string <str>` passed as ``indent`` to reveal the tree structure.

.. function:: repr_productions(grammar)

   Return a printable :class:`string <str>` describing all the
   :term:`productions <production>` in the passed :class:`Grammar` subclass
   ``grammar``.  This only works if :func:`compile_grammar` has been
   called.  It may work even if :func:`compile_grammar` has raised an
   exception.  Useful for debugging :class:`grammars <Grammar>`.

.. function:: unused_rules(grammar)

   Returns a :class:`frozenset` of rules that aren't reachable from the
   :attr:`Grammar.START` rule.  If this isn't empty the ``grammar``
   probably contains an error.

The Grammar Class
=================

A :term:`grammar's <grammar>` rules are contained in a class derived from
:mod:`lrparsing's <lrparsing>` :class:`Grammar` class.  Members of the
:class:`Grammar` subclass that are assigned instances of the
:class:`Symbol` class (described below) make up the rules of the
:term:`grammar`.  For the most part other members are simply ignored by
:mod:`lrparsing`, meaning you can add your own functions and variables to a
:class:`Grammar` subclass with impunity.

.. class:: Grammar()

   The constructor does nothing by default.  :mod:`Lrparsing <lrparsing>`
   uses classes rather than instances to do its work, so it doesn't expect
   or care if a subclass is instantiated.

   .. classmethod:: compile_grammar(optimise=True, cache=None)

      Identical to the module's :func:`compile_grammar`.  If redefined the
      module's function continues to work.

   .. classmethod:: epoch_symbol()

      Identical to the module's :func:`epoch_symbol`.  If redefined the
      module's function continues to work.

   .. classmethod:: parse(input, tree_factory=None, on_error=None, log=None)

      Identical to the module's :func:`parse`.  If redefined the module's
      function continues to work.

   .. classmethod:: pre_compile_grammar(pre_compiled=None)

      Identical to the module's :func:`pre_compile_grammar`.  If redefined
      the module's function continues to work.

   .. classmethod:: repr_grammar()

      Identical to the module's :func:`repr_grammar`.  If redefined the
      module's function continues to work.

   .. classmethod:: repr_parse_tree(parse_tree, indent=" ")

      Identical to the module's :func:`repr_parse_tree`.  If redefined the
      module's function continues to work.

   .. classmethod:: repr_parse_table(state=None)

      Identical to the module's :func:`repr_parse_table`.  If redefined the
      module's function continues to work.

   .. classmethod:: repr_productions()

      Identical to the module's :func:`repr_productions`.  If redefined the
      module's function continues to work.

   .. classmethod:: unused_rules()

      Identical to the module's :func:`unused_rules`.  If redefined the
      module's function continues to work.

   .. attribute:: _parser_

      Reserved for internal use by :class:`Grammar`.  Do not use.

   .. attribute:: COMMENTS

      This member must be assigned a :func:`Keyword`, :class:`Token`,
      :func:`UnrecognisedToken` or :class:`UserToken`, or a :class:`Choice`
      of those symbols.  :class:`TokenSymbols <TokenSymbol>` assigned to
      :attr:`~Grammar.COMMENTS` are ignored by :func:`parse`.

   .. attribute:: START

      :attr:`~Grammar.START` must be assigned a :class:`Rule` instance.
      The goal of the grammar is to recognise the :class:`Rule` defined by
      :attr:`~Grammar.START`.

   .. attribute:: WHITESPACE

      This is passed unaltered to the :term:`tokeniser`.  The
      :ref:`inbuilt tokeniser <inbuilt-tokeniser>` insists that if this is
      present, it is a string whose characters can optionally separate
      :class:`Tokens <Token>`.

Assigning a :class:`Symbol` to a member of a :class:`Grammar` subclass
creates a new :class:`Rule` instance containing the :class:`Symbol` on the
right hand side, and whose :attr:`Rule.name` is the name of the member it
was assigned to.  This makes it different from normal Python assignment,
where the left hand side and the right hand side are the same object after
the assignment executes.  In practice there are two places this matters.
Firstly, except for assigning to :attr:`Grammar.START`, you may not assign
one :class:`Rule` directly to another, as it makes it impossible to
determine the grammar hierarchy [#grammar_rule_assignment]_.  Secondly, you
cannot get a reference to a :class:`TokenSymbol` by assigning it to a rule.
Use a :class:`TokenRegistry` to do that [#token_registry_example]_.

:class:`Rules <Rule>` whose :attr:`names <Rule.name>` start with an
underscore (``_``) are treated specially.
They are suppressed in the output :term:`parse tree` [#grammar_underscore_example]_. Symbol Classes ============== Instances of :class:`Symbol` subclasses are used to build rules in a :class:`Grammar`. :class:`Symbol` instances can be created in the normal way - by calling their class name (aka the constructor) with the appropriate arguments. However :mod:`lrparsing` provides some syntactic sugar by hijacking the Python expression builder. The Python operators that are translated to calling a :class:`Symbol` constructor are noted below. Another sweetener is all :class:`Symbol` class constructors will, when they are expecting a :class:`Symbol` as an argument, cast some standard Python types to an appropriate :class:`Symbol` class instance. For example, a :class:`string <str>` will be cast to a literal :class:`Token`. [#symbol_example]_ The :class:`Symbol` subclass constructors that can be used to define :class:`Grammars <Grammar>` are: .. class:: Choice(s0, s1, ...) Alternate syntax: ``s0 | s1 | ...`` The :term:`grammar` must choose between one of the :class:`symbols <Symbol>` s0, s1, ... .. class:: Left(s) Alternate syntax: ``s = s0 << s1 << s2 << ...`` If the :class:`symbol <Symbol>` ``s`` is a ``Sequence(s0, s1, s2, ...)``, and there is a choice between the :term:`parse trees <parse tree>` ``(((s0 s1) s2) ...)`` and ``(... (s0 (s1 s2)))`` use the ``(((s0 s1) s2) ...)``. In other words ``s`` is left associative. .. class:: List(s, d, min=0, max=None, opt=False) The :term:`grammar` must see a list of the :class:`symbol <Symbol>` ``s`` separated by the :class:`symbol <Symbol>` ``d``, ie, ``s d s d s ...``. If ``opt`` is :data:`True` the list may optionally end with ``d``, otherwise it must end with ``s``. If ``min`` is absent or 0 the list may be empty, otherwise ``s`` must appear a minimum of ``min`` times. If ``max`` is :data:`None` any number of ``s`` are allowed, otherwise ``max`` is the maximum number of occurrences of ``s``. .. 
class:: Nonassoc(s)

   If the :class:`symbol <Symbol>` ``s`` is a ``Sequence(s0, s1, s2, ...)``,
   and there is a choice between the :term:`parse trees <parse tree>`
   ``(((s0 s1) s2) ...)`` and ``(... (s0 (s1 s2)))``, raise a
   :exc:`ParseError`.  In other words ``s`` isn't associative.

.. class:: Prio(s0, s1, ...)

   Alternate syntax: ``(s0, s1, ...)``

   The :term:`grammar` must choose between one of the symbols
   ``s0, s1, ...``  If several choices are possible, choose ``s0`` over
   ``s1`` over ``...``  A :func:`tuple <tuple>` assigned directly to a
   class member is ignored (as any non :class:`Symbol` is), but will be
   converted to a :class:`Prio` if passed to a :class:`Symbol` constructor.

.. class:: Ref(name)

   A forward reference to the :class:`Rule` assigned to the class member
   ``name`` (a :class:`string <str>`).  It will be replaced by that member
   definition, which must occur later.

.. class:: Repeat(s, min=0, max=None)

   Alternative syntaxes:

   * ``s * N``
   * ``s * (min,)``
   * ``s * (min, max)``
   * ``Opt(s)``, ``Opt * s``, ``s * (0, 1)``
   * ``Some(s)``, ``Some * s``, ``s * (1,)``
   * ``Many(s)``, ``Many * s``, ``Repeat * s``, ``s * ()``

   The :term:`grammar` must see ``s`` repeated.  If ``min`` is 0 or absent
   the repeat may be empty, meaning there need be no ``s`` at all,
   otherwise ``s`` must appear a minimum of ``min`` times.  If ``max`` is
   :data:`None` any number of ``s`` are allowed, otherwise ``max`` is the
   maximum number of repeats.  These shortcuts are also provided:

   * ``Opt(x)`` or ``Opt * x`` is equivalent to ``Repeat(x, min=0, max=1)``.
   * ``Some(x)`` or ``Some * x`` is equivalent to ``Repeat(x, min=1)``.
   * ``Many(x)``, ``Many * x`` or ``Repeat * x`` is equivalent to
     ``Repeat(x)``.
   * ``x * N`` is equivalent to ``Repeat(x, min=N, max=N)``.
   * ``x * tuple(...)`` is equivalent to ``Repeat(x, *tuple(...))``.

..
class:: Right(s) Alternative syntax: ``s = s0 >> s1 >> s2 >> ...`` If the :class:`symbol <Symbol>` ``s`` is a ``Sequence(s0, s1, s2, ...)``, and there is a choice between the :term:`parse trees <parse tree>` ``(((s0 s1) s2) ...)`` and ``(... (s0 (s1 s2)))`` use the ``(... (s0 (s1 s2)))``. In other words ``s`` is right associative. .. class:: Sequence(s0, s1, ...) Alternative syntax: ``s0 + s1 + ...`` The :term:`grammar` must see :class:`symbol <Symbol>` ``s0``, followed by ``s1``, followed by ... .. class:: THIS A reference to the :class:`Rule` being assigned to. So identical to ``Ref('this')`` in ``this = Ref('this') + '+' + Ref('this')``. .. class:: Tokens(literals, [keywords]) The :term:`grammar` must choose one of the supplied literals and keywords recognised by the :ref:`inbuilt tokeniser <inbuilt-tokeniser>`. Both ``literals`` and ``keywords`` are strings which are ``string.split()`` to yield multiple literals and keywords, each of which is recognised using the inbuilt :class:`Token` class. Python does not allow ``rule =`` for an empty rule. Use ``rule = THIS * 0`` instead. Although the classes above are used to build :class:`Grammar` rules by appearing on the right hand side of a Python assignment statement their instances are not directly visible. Instead an instance of the :class:`Rule` class is created to hold them, and this is what is assigned to your :class:`Grammar` attribute. Since they are a subclass of :class:`Symbol` they can be referenced by other grammar rules: .. class:: Rule() .. attribute:: name The name of the :class:`Grammar` attribute the :class:`Rule` was assigned to. A :class:`string <str>`. .. method:: __str__ Returns :attr:`name`. All :term:`symbols <symbol>` the :term:`parser` can recognise are derived from this abstract base class, and thus share these attributes: .. class:: Symbol() If you override :class:`Symbol` the constructor must be called as it does initialisation. .. 
attribute:: dict A :class:`dict` you can use for your own purposes for all rules bar :class:`Grammar.START`. .. method:: __contains__(key): Equivalent to ``key in symbol.dict``. .. method:: __delitem__(key): Equivalent to ``del symbol.dict[key]``. .. method:: __getitem__(key): Equivalent to ``symbol.dict[key]``. .. method:: __iter__(): Equivalent to ``iter(symbol.dict)``. .. method:: __len__(): Equivalent to ``len(symbol.dict)``. .. method:: __nonzero__(): Always returns :data:`True`, ensuring a :class:`Symbol` instance is always truthy. .. method:: __setitem__(key, value) Equivalent to ``symbol.dict[key] = value``. .. method:: __repr__() Returns the production for this :class:`Symbol`. .. method:: __str__() Generates a unique name of the :class:`Symbol`, based on the :class:`Rule` it belongs to. :class:`Tokens <TokenSymbol>` override this to return their :attr:`~TokenSymbol.name` attribute. Tokens ====== :term:`Tokens <Token>` are a special kind of :class:`Symbol` [#tokens]_. As a convenience the same token may have several instances in a :class:`Grammar`. Instances of the same :class:`token <TokenSymbol>` are merged during the compile, resulting in all bar one being discarded. Two :class:`tokens <TokenSymbol>` are considered to be the same if they have the same :attr:`~TokenSymbol.name` attribute. The inbuilt tokens generate a suitable :attr:`~TokenSymbol.name` for you, but you can assign a different name to any :class:`token <TokenSymbol>` using a :class:`TokenRegistry`. :attr:`Names <TokenSymbol.name>` containing ``__`` are used internally, and so are reserved. The following methods and classes are used directly in :class:`grammars <Grammar>`: .. function:: Keyword(word, case=True) Create an instance of the :class:`Token` class that recognises a keyword. If case is ``False`` the keyword isn't case sensitive, otherwise it is. The default :attr:`~TokenSymbol.name` is :func:`repr(word) <repr>`. 
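The effect of :func:`Keyword` can be sketched with the plain :mod:`re`
module (the pattern and names below are illustrative only, not
|lrparsing| internals): a broad identifier pattern does the matching, and
the matched text is then checked against the set of keywords carved out of
it:

```python
import re

IDENT = re.compile(r"[A-Za-z_]+")   # the broader re token
KEYWORDS = {"then"}                 # literal keywords carved out of it

def classify(text):
    # Return 'keyword' or 'ident' depending on whether the matched
    # identifier text is a registered keyword; None if no match at all.
    if IDENT.fullmatch(text) is None:
        return None
    return "keyword" if text in KEYWORDS else "ident"
```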
A keyword is a literal :class:`Token` that is recognised by another re
:class:`Token`.  For example, an identifier might be recognised by
``Token(re='[A-Za-z_]+')``.  The keyword ``'then'`` would normally be
recognised by the identifier :class:`Token`, but if ``Keyword('then')`` is
present it will match ``'then'``, while all other non-keywords will still
be matched by the identifier :class:`Token`.

.. class:: Token(literal=None, re=None, case=True)

   Alternative syntax: ``'literal'``

   Use the :ref:`inbuilt tokeniser <inbuilt-tokeniser>` to generate a
   :class:`Token` symbol when it sees the text specified by ``literal`` or
   ``re``.  The arguments become attributes with the same names.  You can
   only supply ``literal`` or ``re``, not both.  If ``literal`` is supplied
   the :term:`tokeniser` will match exactly that text, unless ``case`` is
   :data:`False` in which case it ignores case when matching.  If ``re`` is
   supplied it must be a regular expression the :term:`tokeniser` will
   match.  The ``re`` is compiled with :data:`re.MULTILINE` and
   :data:`re.DOTALL`; adding additional flags using :mod:`re's <re>`
   ``(?FLAG)`` syntax affects the recognition of *all*
   :class:`Tokens <Token>`.  The default :attr:`~TokenSymbol.name` for
   literal tokens is :func:`repr(literal) <repr>`.  The default
   :attr:`~TokenSymbol.name` for re tokens is the ``re`` surrounded by
   ``/``'s.  Strings assigned directly to class members are ignored, as are
   all non :class:`Symbols <Symbol>`, but strings passed to :class:`Symbol`
   constructors are turned into a literal :class:`Token`.

.. class:: UserToken()

   A user defined token.  The :class:`UserToken` must be assigned a
   :attr:`~TokenSymbol.name` by a :class:`TokenRegistry`.  This class can
   be used as a base for user defined token classes.

.. function:: UnrecognisedToken()

   Create an instance of the :class:`Token` class that is given to the
   :term:`parser` when the :ref:`inbuilt tokeniser <inbuilt-tokeniser>`
   strikes a string it doesn't recognise as a token or whitespace.
If this token doesn't appear in the :class:`Grammar` the :ref:`inbuilt tokeniser <inbuilt-tokeniser>` raises a :exc:`TokenError` instead. If you delve deeply into |lrparsing| you will come across these token classes. They aren't directly used by :class:`grammars <Grammar>`: .. class:: MetaToken(name) Used to construct special tokens used internally by the :term:`grammar compiler <parser generator>` and :term:`parser`. You will never see these in the output :term:`parse tree`, but :ref:`error recovery <error-recovery>` may see them. There are two: ``__empty__`` represents a 0 width character string, and ``__end_of_input__`` represents the end of the input stream. .. class:: TokenSymbol(name) The abstract base class all :term:`tokens <token>` are based on, thus ``isinstance(symbol, TokenSymbol)`` will return :data:`True` if a :class:`Symbol` is a :term:`token` of some kind. The passed ``name`` is the generated name which can be overridden by the :class:`TokenRegistry`. :class:`TokenSymbols <TokenSymbol>` and thus all subclasses have these attributes and methods: .. attribute:: dict This is inherited from :class:`Symbol`, as is the shorthand access provided by :class:`Symbol's <Symbol>` mapping interface. :attr:`TokenSymbol.dict` is best set from within a :class:`TokenRegistry`. Changes made to this attribute in the body of a :class:`Grammar` subclass change the :class:`Symbol.dict` of the :class:`Rule` the :class:`TokenSymbol` is assigned to, not the :class:`TokenSymbols <TokenSymbol>` :attr:`~TokenSymbol.dict`. .. attribute:: name A :class:`string <str>`. The name of the token. If two tokens with the same :attr:`~TokenSymbol.name` are used in a :class:`Grammar` they will be merged using :meth:`~TokenSymbol.merge` and only one will be kept. .. attribute:: named A :class:`bool <bool>`. :data:`True` if :meth:`~TokenSymbol.set_name` has been called. .. method:: merge(other) Called by the :class:`TokenRegistry` when two tokens have the same :attr:`~TokenSymbol.name`. 
   ``other`` will be discarded after :meth:`~TokenSymbol.merge` returns.

   .. method:: position(input_tuple)

      Given an ``input_tuple`` supplied to the parser, return a
      :class:`string <str>` describing the position in the input stream.
      The default implementation returns :data:`None`, which means that if
      the ``input_tuple`` triggers an error the :func:`parser <parse>`
      won't insert the position in the input stream of the error.

   .. method:: set_name(name)

      Called by the :class:`TokenRegistry` to override the generated
      :attr:`~TokenSymbol.name`.  Sets :attr:`~TokenSymbol.named` to
      :data:`True`.

   .. method:: __str__()

      Returns :attr:`str(name) <TokenSymbol.name>`.

The TokenRegistry Class
=======================

The :class:`TokenRegistry` is the home for tokens.  If you don't define a
subclass within the :class:`Grammar` it will use an instance of
:class:`TokenRegistry` directly [#token_registry_example]_.  Assigning a
:class:`token <TokenSymbol>` to a member of a :class:`TokenRegistry` class
gives you a handle to the :class:`token <TokenSymbol>` instances recognised
by the :class:`Grammar`, and it is the only way to get such a handle
[#token_registry_corollary]_.  Assigning the :class:`token <TokenSymbol>`
also sets the :attr:`TokenSymbol.name` of the :class:`token <TokenSymbol>`
to the qualified attribute name, which is a :class:`string <str>` with the
format ``'NameOfTokenRegistrySubclass.member_name'``.

.. class:: TokenRegistry()

   A single instance of :class:`TokenRegistry`, or of the subclass defined
   in the :class:`Grammar`, is created when the :class:`Grammar` is
   declared.  The constructor looks up all tokens declared in its body and
   registers their :attr:`TokenSymbol.name` using
   :meth:`~TokenRegistry._resolve_token_`.

   .. method:: _resolve_token_(token_symbol)

      Called as :class:`tokens <Token>` are found while the
      :class:`Grammar` is compiled into
      :term:`LR(1) productions <production>`.  It returns the
      :class:`TokenSymbol` that will be presented to the parser for that
      :attr:`TokenSymbol.name`.
That instance will always be the first :class:`TokenSymbol` passed to it with the :attr:`TokenSymbol.name`. If the passed token is a duplicate, it will pass it to :meth:`TokenSymbol.merge` before discarding it. .. method:: compile_tokens(whitespace) Called once, during :class:`Grammar` compilation and before the first call to :meth:`~TokenRegistry.tokeniser`, so the :term:`tokeniser` can initialise itself. The ``whitespace`` parameter is the :attr:`Grammar.WHITESPACE` attribute if one was declared, :data:`None` otherwise. This method must raise an exception if ``whitespace`` isn't valid. What "whitespace" means is up to the :term:`tokeniser`. The default definition creates an instance of the :ref:`inbuilt tokeniser <inbuilt-tokeniser>`, and calls its ``compile_tokens`` method with the same arguments. .. method:: tokeniser(input, whitespace) This is a generator, called by :func:`parse` to create :term:`input tuples <input tuple>` to form the input to feed to the :term:`LR(1) parser <parser>` from the ``input`` parameter passed to :func:`parse`. The ``whitespace`` parameter is the same one that was passed to :meth:`~TokenRegistry.compile_tokens`. The :term:`input tuples <input tuple>` accepted by the parser are :func:`tuples <tuple>` whose first element is a :class:`TokenSymbol` instance. The remainder of the :func:`tuple <tuple>` is ignored by the :term:`parser`, but the :func:`tuple <tuple>` is placed unmodified into the output :term:`parse tree`. The default definition calls the :meth:`~TokenRegistry.tokeniser` method of the :ref:`inbuilt tokeniser <inbuilt-tokeniser>`, passing it the same arguments [#tokeniser-override]_. .. _inbuilt-tokeniser: Inbuilt Tokeniser ----------------- The inbuilt tokeniser generates :class:`Token` instances. Thus if you use :func:`Keyword`, :class:`Token`, :class:`Tokens` or :func:`UnrecognisedToken` in your grammar, you must use the inbuilt tokeniser. 
It is used by default unless you add a :class:`TokenRegistry` to your grammar and override its :meth:`TokenRegistry.compile_tokens` and :meth:`TokenRegistry.tokeniser` methods. The input accepted by the inbuilt :term:`tokeniser` is either a :class:`string <str>` or a generator yielding a mixture of strings and :term:`input tuples <input tuple>` [#tokeniser_chain]_. If the input is broken up into multiple strings you should ensure the breaks occur at :term:`token` boundaries. :attr:`Grammar.WHITESPACE` is a :class:`string <str>` whose characters can appear between :class:`tokens <Token>` but are otherwise ignored. If not defined it defaults to ``" \f\n\r\t\v"``. :class:`Strings <str>` are turned into :term:`input tuples <input tuple>` with the following format:: (Token(), matched_data, stream_offset, line_number, column_number) where: ``Token()`` is an instance of the :class:`Token` class used in the :class:`Grammar`. ``matched_data`` is the matched :class:`string <str>` from the input. A :class:`string <str>`. ``stream_offset`` is the offset from the start of the input where the matched data was found. An :class:`integer <int>`. ``line_number`` is the line number the string was found on. The first line is numbered 1. An :class:`integer <int>`. ``column_number`` is the column number within the line where the string was found. The first column is numbered 1. An :class:`integer <int>`. If a generator is passed and it yields objects other than strings they are passed onto the :term:`parser` unchanged in the order they were received. Exceptions ========== .. exception:: LrParsingError Base: :exc:`StandardError`. An abstract base class used by all exceptions defined here. .. exception:: GrammarError Base: :exc:`LrParsingError`. Raised when there is an error in the grammar. This can happen both when the grammar is first parsed, and when :func:`compile_grammar` is called. .. exception:: ParsingError Base: :exc:`LrParsingError`. 
    An abstract base class for all errors that can be raised by this module when :func:`parse` is called.

.. exception:: TokenError(message, data, offset, line, column)

    Base: :exc:`ParsingError`.

    Raised by the :ref:`inbuilt tokeniser <inbuilt-tokeniser>` when it strikes something in the input that isn't entirely whitespace, isn't a recognisable :class:`Token`, and the grammar doesn't use :class:`UnrecognisedToken`. The arguments following ``message`` become attributes of the instance.

    .. attribute:: data

        The :class:`string <str>` that wasn't recognised.

    .. attribute:: offset

        :attr:`Data's <data>` position in the input stream as a character offset from the start. An :class:`integer <int>`.

    .. attribute:: line

        :attr:`Data's <data>` position in the input stream as a line number, starting at 1 for the first line in the input. An :class:`integer <int>`.

    .. attribute:: column

        :attr:`Data's <data>` position in the input stream as the column number in the current line, starting at 1 for the first column. An :class:`integer <int>`.

.. exception:: ParseError(input_tuple, lr1_state)

    Base: :exc:`ParsingError`.

    Raised by :func:`parse` when the input the :term:`parser` is given doesn't match the :class:`Grammar`. The arguments become instance attributes with the same name.

    .. attribute:: input_tuple

        The :term:`input tuple` given to the parser that triggered the error. If this :term:`input tuple` was created by the :ref:`inbuilt tokeniser <inbuilt-tokeniser>` it will contain position information, ie, line number, column number and so on.

    .. attribute:: lr1_state

        The current parser state, an :class:`Lr1State` instance [#error-recovery-lie]_, from the parse table (ie, as returned by :func:`repr_parse_table`). If the parse table hasn't been pre-compiled, :func:`repr` reveals a lot of information about what the parser was expecting at the time.

..
.. _error-recovery:

Error Recovery
==============

The default action of :func:`parse` is to raise a :exc:`ParseError` exception, which creates a suitable message and of course stops the :term:`parser` at that point. If you want the :term:`parser` to continue so it can discover more errors in the same run, you need to do error recovery. Error recovery is always accomplished by manipulating the :term:`parser's <parser>` input so the parser thinks it has seen something valid, and then letting it continue. It is implemented by the ``on_error`` parameter to :func:`parse`:

.. function:: on_error(iterator, input_tuple, stack)

    Handle a parser error.

    :param input_tuple: The :term:`input tuple` that triggered the error.
    :param iterator: The iterator the parser is reading :term:`input tuples <input tuple>` from.
    :param stack: The parser's :term:`stack`. A :class:`list <list>`.
    :rtype: :data:`None` or an iterable yielding :term:`input tuples <input tuple>` that will be inserted into the input stream.

The difficulty with any error recovery is determining how to turn the mess you are given as input into something the :term:`parser` will accept. Invariably this means you try to figure out what the user was trying to do. For example, let's say you are parsing Python and strike a problem in an expression. If the expression was part of a *for* statement, your best strategy is probably to replace the entire *for* statement with *for x in ():*. But if that *for* was inside a list comprehension, inserting a *for* statement at that point would likely generate a cascade of errors later on. So understanding where you are is critical. Sadly this is one thing :term:`LR(1)` makes hard. That information is buried in the ``stack`` parameter. The ``stack`` is a partially constructed branch, running from the root to a leaf, of a :term:`parse tree`. It is a :class:`list <list>` of :func:`tuples <tuple>`, like this::

    [
        (lr1_state, (symbol, ...)),
        (lr1_state, (symbol, ...)),
        ...
    ]

Consider the following two :term:`grammars <grammar>` which both recognise the same thing, a small subset of the C language::

    >>> def on_error(iterator, token, stack):
            for item in stack:
                print("([%s], (%s))" % (
                    ','.join(sorted(str(rule) for rule in item[0].rules)),
                    ','.join(
                        lrparsing.repr_parse_tree(sym, indent="")
                        for sym in item[1])))
    >>> class Lang1(Grammar):
            e = Prio(
                Token(re='[0-9]+') | Token(re='[a-z]+') + Opt(Token('(') + THIS + ')'),
                THIS << '*' << THIS,
                THIS << '+' << THIS)
            START = Repeat((
                Opt(e) |
                Keyword("if") + e + Keyword('then') + THIS +
                    Opt(Keyword('else') + THIS) + Keyword("endif") |
                Keyword("while") + e + Keyword('do') + THIS + Keyword("done") |
                Token(re='[a-z]+') + '=' + e) + ';')
    >>> Lang1.parse(
            "e=1; while funca(e) do if a then e=a; fi; done",
            on_error=on_error)
    ([<Lang1>], (__empty__))
    ([START], ('e','=',(e '1'),';'))
    ([START], ('while'))
    ([START,e], ((e 'funca' '(' (e 'e') ')')))
    ([START], ('do'))
    ([START], ('if'))
    ([START,e], ((e 'a')))
    ([START], ('then'))
    ([START], ((START 'e' '=' (e 'a') ';' (e 'fi') ';')))
    lrparsing.ParseError: line 1 column 43: Got 'done' when expecting 'else' or 'endif' while trying to match START in state 44
    >>> class Lang2(Grammar):
            class T(TokenRegistry):
                number = Token(re='[0-9]+')
                ident = Token(re='[a-z]+')
                kwd_endif = Keyword("endif")
            exp = Ref("exp")
            block = Ref("block")
            call = T.ident + '(' + exp + ')'
            mul_op = exp << '*' << exp
            add_op = exp << '+' << exp
            exp = Prio(T.number | T.ident | call, mul_op, add_op)
            if_part = Keyword("if") + exp
            then_part = Keyword("then") + block
            else_part = Keyword("else") + block
            if_st = (if_part + then_part + Opt(else_part) + T.kwd_endif)
            while_part = Keyword("while") + exp
            do_part = Keyword("do") + block
            while_st = while_part + do_part + Keyword("done")
            exp_st = exp * 1
            ass_exp = exp * 1
            ass_st = T.ident + '=' + ass_exp
            empty_st = THIS * 0
            st = ass_st | empty_st | exp_st | if_st | while_st
            block = Repeat(st + ';')
            START = block
    >>> Lang2.parse(
            "e=1; while funca(e) do if a then e=a; fi; done",
            on_error=on_error)
    ([<Lang2>], (__empty__))
    ([block], ((st (ass_st 'e' '=' (ass_exp (exp '1')))),';'))
    ([while_st], ((while_part 'while' (exp (call 'funca' '(' (exp 'e') ')')))))
    ([do_part], ('do'))
    ([if_st], ((if_part 'if' (exp 'a'))))
    ([then_part], ('then'))
    ([then_part], ((block (st (ass_st 'e' '=' (ass_exp (exp 'a')))) ';' (st (exp_st (exp 'fi'))) ';')))
    lrparsing.ParseError: line 1 column 43: Got 'done' when expecting 'else' or 'endif' while trying to match then_part in state 50

Here the :func:`on_error` function prints the ``stack`` passed, but doesn't attempt recovery. Looking at the right hand side of the ``stack``, the ``(symbol, symbol, ...)`` bit, it shouldn't be too hard to see how stitching it together will yield a branch of the :term:`parse tree`. This sequence of :class:`symbols <Symbol>` must eventually match one of the :class:`Rules <Rule>` listed to their left. The :term:`LR(1)` parser may list several, as it will only make its final decision once it has seen the entire :term:`production`. For ``Lang2`` it is not difficult to see the :term:`parser` was in the middle of the *then_part* of an *if_st*. ``Lang1`` is an example of how to make recognising this difficult.

``lr1_state`` has more information in it than the above example shows. It is an :class:`Lr1State` [#error-recovery-lie]_:

.. class:: Lr1State(id, actions, gotos, rules)

    This class is immutable, the parameters becoming attributes of the same name. It has no useful methods beyond :meth:`object.__str__` and :meth:`object.__repr__`.

    .. attribute:: id

        An :class:`integer <int>` identifying the :term:`state`. In an :class:`Lr1State` this integer is its index into the :term:`parse table`.

    .. attribute:: actions

        The :term:`shift table` of this parse :term:`state`. It is a :class:`dict`, indexed by all valid :class:`TokenSymbols <TokenSymbol>` that could appear at this point. The error occurred because the current :term:`token` isn't a key to this :class:`dict`.

..
    .. attribute:: gotos

        The :term:`reduce table` of this parse :term:`state`. A :class:`dict`. Not useful for error recovery.

    .. attribute:: rules

        This is a :class:`list <list>` of :class:`Rules <Rule>` from your grammar the parser was in the process of recognising. In other words, if the parser was processing the grammar ``class G(Grammar): START = Token("x")``, then this must be :data:`True`: ``rules[0] in (G.START, G.epoch_symbol())``.

By just looking at the top item of the passed stack for ``Lang2``, we see the :term:`parser` was in the middle of a *then_part*. :func:`on_error`'s task is to re-arrange the input so the :term:`parser` can continue. The simple way is to insert one of the :class:`TokenSymbols <TokenSymbol>` out of :attr:`Lr1State.actions`. In this case ``'endif'`` would be the right choice. Another choice is to erase some input, possibly replacing it with something else. This would be appropriate if, for example, the string being parsed was ``((2 func 2;``. Replacing ``func 2`` with ``))`` would get the :term:`parser` onto the next statement, so the user only gets one error for the current line.

If :func:`on_error` returns :data:`None` the parser will raise a :exc:`ParseError` as if :func:`on_error` wasn't passed to it. Otherwise it will read :term:`input tuples <input tuple>` from the returned iterable, and when that is exhausted return to reading the original input. So, returning a :class:`list <list>` of :term:`input tuples <input tuple>` will insert them, reading :term:`input tuples <input tuple>` off the passed ``iterator`` will delete them, and reading them and returning what you read allows you to peek ahead. As the :term:`parser` has already processed the :term:`input tuple` it passed to :func:`on_error`, if you want the :term:`parser` to reprocess it, it must be in the returned iterable.
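The insert, delete and peek behaviours can be sketched with plain iterators. Everything below is hypothetical scaffolding (the two-element tuples stand in for real :term:`input tuples <input tuple>`, and ``after_recovery`` is only an outline of what :func:`parse` does), but it shows how the returned iterable and the passed ``iterator`` interact::

```python
from itertools import chain

def after_recovery(returned, iterator):
    # In outline, what parse() does when on_error returns an iterable:
    # read the returned tuples first, then fall back to the original input.
    return chain(returned, iterator)

def on_error_insert(iterator, input_tuple, stack):
    # Insert: return new tuples, plus the offending tuple so the parser
    # reprocesses it after the insertion.
    endif = ("kwd_endif", "endif")        # hypothetical input tuple
    return [endif, input_tuple]

def on_error_skip_statement(iterator, input_tuple, stack):
    # Delete: reading the passed iterator discards input; returning the
    # last thing read puts it back, so the parser resynchronises on ';'.
    for skipped in iterator:
        if skipped[1] == ";":
            return [skipped]
    return []                             # hit end of input

stream = iter([("ident", "x"), (";", ";"), ("ident", "y")])
bad = ("ident", "fi")
recovered = list(after_recovery(on_error_insert(stream, bad, []), stream))
# recovered starts with the inserted tuple, then the offending tuple,
# then the untouched remainder of the input.
```

The same two building blocks compose: reading a few tuples, inspecting them, and returning exactly what was read is the peek-ahead case.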
Going back to the example, these lines would insert an ``'endif'``, thus allowing the :term:`parser` to process the remainder of the input::

    endif = (Lang2.T.kwd_endif, Lang2.T.kwd_endif.literal) + token[2:]
    return [endif, token]

There is one other option for the adventurous. The ``stack`` passed to :func:`on_error` is the actual :term:`parser stack <stack>`, so you can modify it. The :term:`states <state>` on the stack depend on the whims of the :term:`parser generator`, so adding or modifying them is impossibly fragile. Deleting them, using say ``del stack[-2:]``, is not. In the ``((2 func 2`` example, deleting the entire expression from the :term:`stack` and presenting the :term:`parser` with a replacement expression on its input (eg, just ``Lang2.T.number``) is often simpler.

Ambiguities
===========

Lurking under |lrparsing|'s hood is an :term:`LR(1)` parser. :term:`LR(1) parser generators <parser generator>` are cantankerous beasts. |Lrparsing| puts lipstick on the pig, but the pig is still a pig. The principal way it bites is to say your grammar is :term:`ambiguous`. This section is here to help you work past this error, but be aware writing large complex grammars requires a fair amount of expertise which can't be imparted by this small how-to.

To understand what ambiguous means, understand that a :term:`parser` is meant to take some input, validate it conforms to the :term:`grammar`, and then produce a :term:`parse tree`. When the parser generator says the :term:`grammar` is :term:`ambiguous`, it means that for some valid input several different :term:`parse trees <parse tree>` can be produced. An :term:`LR(1)` :term:`grammar` requires there be only one :term:`parse tree` for any given input. A blatantly ambiguous grammar will produce a :term:`Reduce/Reduce conflict <Reduce/Reduce>`.
Consider::

    class A(Grammar):
        a = Token('x')
        b = Token('x')
        START = a | b

``A.compile_grammar()`` raises this exception::

    lrparsing.GrammarError: Conflict: Reduce/Reduce
    While trying to recognise:
      b = 'x' ^
      a = 'x' ^
    on seeing a __end_of_input__ I could not decide between:
      replace the sequence ['x'] with b
    and
      replace the sequence ['x'] with a
    Please remove this ambiguity from your grammar

Looking at the error message, it displays every possible :term:`production` that could apply at this point, with a :term:`^` showing where the parser is up to in each :term:`production`. There are only two types of :term:`actions <action>` the parser can take. It can accept the next token in the input stream. This is called a :term:`Shift`. It isn't possible here as the next symbol is ``__end_of_input__``; in other words it is at the end of the string. The other action it could take is to replace some symbols with the :term:`production` they match. This is called a :term:`Reduce`. It can do a reduce as there are two productions that match. Sadly two is one too many, and this :term:`conflict` is called :term:`Reduce/Reduce` because both options are :term:`reduces <reduce>`. They correspond to these :term:`parse trees <parse tree>`::

    (START (a 'x'))
        OR
    (START (b 'x'))

Since there are two possible :term:`parse trees <parse tree>` the grammar is :term:`ambiguous`. If you look at the error message just the right way, it could almost be said to say that. The solution in this case is to remove one of the redundant :class:`Rules <Rule>`::

    class A(Grammar):
        a = Token('x')
        START = a

Looping grammars are also ambiguous. Consider this :term:`grammar`::

    class B(Grammar):
        b = Ref("e")
        e = b | Token("x")
        START = e

Again, the :class:`Grammar` will recognise a single ``'x'``.
``B.compile_grammar()`` raises this exception::

    lrparsing.GrammarError: Conflict: Reduce/Reduce
    While trying to recognise:
      START = e ^
      b = e ^
    on seeing a __end_of_input__ I could not decide between:
      replace the sequence [e] with b
    and
      replace the sequence [e] with START
    Please remove this ambiguity from your grammar

The error message is saying that when the :term:`parser` has seen a single ``e``, it has two choices, which are listed. These two choices correspond to an infinite number of :term:`parse trees <parse tree>`::

    (START (e 'x'))
        OR
    (START (e (b (e 'x'))))
        OR
    (START (e (b (e (b (e 'x'))))))
        OR
    ...

Again, since there is more than one possible parse tree, the grammar is :term:`ambiguous`. Unfortunately those two simple examples don't capture the complexity of common :term:`ambiguities <ambiguous>`. Consider this example, which is common in real world :term:`grammars <grammar>`::

    class E(Grammar):
        e = THIS + Tokens("+ /") + THIS | Token(re="[0-9]+")
        START = e

This :class:`Grammar` recognises strings of integers separated by ``'+'`` and ``'/'``. If the grammar writer wanted to produce a grammar that only recognised valid integer expressions using the operators ``'+'`` and ``'/'``, he was successful. Nonetheless, when he calls ``E.compile_grammar()`` this exception is raised::

    lrparsing.GrammarError: Conflict: Shift/Reduce and no associativity
    While trying to recognise:
      e = e '+' e ^
      e = e ^ '/' e
    on seeing a '/' I could not decide between:
      replace the sequence [e '+' e] with e
    and
      accepting the '/' in the hope it will match:
      e = e '/' ^ e
    Please remove this ambiguity from your grammar

Again the error message lists the :term:`productions <production>` under consideration and how far the parser has got in each. The smallest possible string the parser could be looking at and end up with those possibilities is: ``e + e ^ / e``.
The parser wants to know whether it should replace (:term:`reduce`) ``e + e`` with e, which corresponds to performing the addition now, or whether it should continue reading, ie, delay the addition until it processes the ``'/'``. This is called a :term:`Shift/Reduce conflict <Shift/Reduce>` because the two :term:`actions <action>` possible are :term:`Shift` and :term:`Reduce`. The two :term:`parse trees <parse tree>` under consideration are thus::

    Choose Shift:   e + (e / e)
    Choose Reduce:  (e + e) / e

The normal arithmetic convention is to do the division first, so we want to choose ``Shift``. There are several ways to make the grammar do that. The easiest way in |lrparsing| is to prioritise the two choices, saying ``e '/' e`` should always be done before ``e '+' e``::

    class E(Grammar):
        e = Prio(THIS + '/' + THIS, THIS + '+' + THIS) | Token(re="[0-9]+")
        START = e

But doing that only replaces the previous error ``E.compile_grammar()`` raised with a new one::

    lrparsing.GrammarError: Conflict: Shift/Reduce and no associativity
    While trying to recognise:
      e.Choice.Prio.Prioritised0 = e ^ '/' e
      e.Choice.Prio.Prioritised0 = e '/' e ^
    on seeing a '/' I could not decide between:
      replace the sequence [e '/' e] with e.Choice.Prio.Prioritised0
    and
      accepting the '/' in the hope it will match:
      e.Choice.Prio.Prioritised0 = e '/' ^ e
    Please remove this ambiguity from your grammar

Firstly, notice how the choices listed don't directly correspond to the rules in your :class:`Grammar`. That is because |lrparsing| has compiled your :class:`Grammar` into :term:`productions <production>`, which is the only thing an :term:`LR(1) parser <LR(1)>` understands.
You can print this compiled form using ``print E.repr_productions()``, which yields::

    0   : <E> = START __end_of_input__
    1   : START = e
    2   : e = e.Choice.Prio
          e = /[0-9]+/
    3   : e.Choice.Prio = e.Choice.Prio.Prioritised0
          e.Choice.Prio = e.Choice.Prio.Prioritised1
    4   : e.Choice.Prio.Prioritised0 = e '/' e
    5   : e.Choice.Prio.Prioritised1 = e '+' e

It is worth taking some time to understand how these :term:`productions <production>` correspond to the original set of :class:`Rules <Rule>`. In any case, you can now see where all those odd looking symbol names come from. Using the same technique as before, we deduce the smallest string it could be looking at and end up with these possibilities: ``e '/' e ^ '/' e``. In other words, it can't decide between these two :term:`parse trees <parse tree>`::

    e / (e / e)
        OR
    (e / e) / e

As an aside, this is a real problem because integer division isn't associative: ``(9 / 3) / 2 = 1``, whereas ``9 / (3 / 2) = 9``. So if the :term:`parse tree` was being used to evaluate the expressions, choosing the wrong one would yield an incorrect answer. We can't use :class:`Prio` to solve this because the :term:`production` is conflicting with itself. The solution is to say division is left associative::

    class E(Grammar):
        e = Prio(THIS << '/' << THIS, THIS << '+' << THIS) | Token(re="[0-9]+")
        START = e

And with that the parser will compile, and will yield parse trees that conform to the conventions of arithmetic.

Looking deeper under the hood, an :term:`LR(1) parser <LR(1)>` is called that because it delays all decisions until it has recognised an entire :term:`production`. If you look at the above error messages, you will see they all involve a :term:`reduce`, and that is because a :term:`reduce` is what the :term:`parser` does when it makes the decision on which :term:`production` it has seen. There are two ways it can get confused. The first involves two :term:`reduces <reduce>`.
Since the :term:`parser` is making its decision *after* it has seen the entire production, the only way it can be confused is if two :term:`productions <production>` have the same :term:`right hand side`. If you look at the :term:`Reduce/Reduce conflicts <Reduce/Reduce>` above, that is indeed the case. But even in that case the :term:`LR(1) parser <LR(1)>` can still distinguish between the two if they are followed by different :term:`tokens <token>`. Consider this :term:`ambiguous` :class:`Grammar`::

    class F(Grammar):
        a = Token('a')
        b = Token('a')
        START = a + 'x' + 'y' | b + 'x' + 'z'

``F.compile_grammar()`` raises this exception::

    lrparsing.GrammarError: Conflict: Reduce/Reduce
    While trying to recognise state 4:
      a = 'a' ^
      b = 'a' ^
    on seeing a 'x' I could not decide between:
      replace the sequence ['a'] with a
    and
      replace the sequence ['a'] with b
    Please remove this ambiguity from your grammar

Unlike the previous example, this :class:`Grammar` isn't inherently :term:`ambiguous`, as there is one unique :term:`parse tree` for every input. The problem is an :term:`LR(1) parser <LR(1)>` isn't powerful enough to recognise it. The :term:`symbol` it needs to see in order to know whether it should be an ``a`` or a ``b`` (the ``'y'`` or ``'z'``) is two :term:`tokens <token>` away, and :term:`LR(1)` only looks one :term:`token` ahead. An LR(2) :term:`parser` would be powerful enough, but we don't have one of those. In this case it is easy enough to re-write the :class:`Grammar` so an :term:`LR(1) parser <LR(1)>` is powerful enough. The solution is to delay recognising ``a`` and ``b`` until the parser can see the ``y`` or ``z``::

    class F(Grammar):
        a = Token('a') + 'x'
        b = Token('a') + 'x'
        START = a + 'y' | b + 'z'

The problem with re-writing like this is it changes the :term:`parse tree`. If the back end that processes the :term:`parse tree` can't tolerate this, something is going to have to give, and it won't be the :term:`LR(1) parser <LR(1)>`.
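A small pass over the output can put the tree back into the shape the back end was written for. The nested-tuple layout below is assumed for illustration (it mirrors the shape :func:`parse` returns, minus the token position details), and ``fix_tree`` is a hypothetical helper, not part of |lrparsing|::

```python
# The re-written F grammar produces trees like (START (a 'a' 'x') 'y');
# the back end wants the original shape (START (a 'a') 'x' 'y').
# Move the trailing 'x' back out of the a / b subtree.
def fix_tree(tree):
    head, inner = tree[0], tree[1]
    return (head, inner[:-1], inner[-1]) + tree[2:]

fix_tree(("START", ("a", "a", "x"), "y"))
# ('START', ('a', 'a'), 'x', 'y')
```

Real parse trees carry whole input tuples rather than bare strings, so a production version of this pass would walk the tree recursively, but the idea is the same.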
Post processing the :term:`parse tree` so it is in the form the back end wants is the normal solution. Console yourself with the knowledge that people using :term:`LL(1) grammars <LR(1)>` have to do it far more.

Avoid the temptation to use :class:`Prio` to resolve :term:`Reduce/Reduce conflicts <Reduce/Reduce>`. It will make the :term:`conflict` go away, but it means the :term:`parser` will never recognise part of your :class:`Grammar`. In the above example, yielding to temptation and using :class:`Prio` would result in::

    class F(Grammar):
        a = Token('a')
        b = Token('a')
        START = Prio(a + 'x' + 'y', b + 'x' + 'z')

And presto, ``F.compile_grammar()`` works. But now ``F.parse("axz")`` yields::

    lrparsing.ParseError: line 1 column 3: Got 'z' when expecting 'y'

Since ``"axz"`` is a valid input for our :class:`Grammar`, this is a bug. It was expecting a ``'y'`` because our :class:`Grammar` says if it can't decide, choose ``a + 'x' + 'y'`` over ``b + 'x' + 'z'``. As we know, it could not decide, so it will always choose ``a + 'x' + 'y'``. This means it can never recognise ``b + 'x' + 'z'``.

Since at least one :term:`reduce` must be involved, the only other possibility is a :term:`Shift/Reduce conflict <Shift/Reduce>`. These always take the form::

    SYMBOL = SYMBOL <anything1> SYMBOL ^
    SYMBOL = SYMBOL ^ <anything2> SYMBOL

You can pick the :term:`Reduce` because it has the :term:`^` at the end. In this case the :class:`Grammar` is always :term:`ambiguous`, which is good because you can fix it by using associativity or priorities. The trick in figuring out which to choose is to understand what the :term:`parser` is trying to decide between, and you can do that by replacing a :term:`Symbol` in the :term:`right hand side` of the :term:`Shift` with the entire :term:`right hand side` of the :term:`Reduce`.
The name of the :term:`Symbol` replaced must be the same as the :term:`left hand side` of the :term:`Reduce`, and the ``^`` in the two :term:`productions <production>` must coincide, like this::

    SYMBOL = SYMBOL <anything1> SYMBOL ^ <anything2> SYMBOL

The :term:`parse trees <parse tree>` become either::

    Shift:  SYMBOL <anything1> (SYMBOL <anything2> SYMBOL)
    Reduce: (SYMBOL <anything1> SYMBOL) <anything2> SYMBOL

If ``<anything1>`` and ``<anything2>`` are different (like ``'*'`` and ``'+'`` above) you can use :class:`Prio` to choose the one that should happen first. If they have the same priority (and in particular this will be the case if ``<anything1>`` and ``<anything2>`` are the same thing), then you must use :class:`Left` if the :term:`Shift` is the correct choice, or :class:`Right` if the :term:`Reduce` is the correct choice. If it should never happen, use :class:`Nonassoc`.

Examples
========

The source of |lrparsing| contains the following examples, which can also be viewed online at http://lrparsing.sourceforge.net/examples:

lrparsing-sqlite3.py
    A :class:`Grammar` for all of the Data Manipulation Statements in Sqlite3. |lrparsing| was written because the author could not find a Python parsing library that ran at an acceptable speed and required fewer lines of code to do this job than a hand written LL(1) parser.

lrparsing-expr.py
    An integer expression evaluator. It illustrates one design pattern for building a compiler on top of |lrparsing|: using a :class:`Symbol`'s inbuilt mapping capabilities. If the job is simple this works well.

lua52.py
    A :class:`Grammar` for Lua 5.2 and a cross compiler to Python. It illustrates a second design pattern for building a compiler: using a compiler class whose functions have the same names as :class:`rules <Rule>` and :class:`tokens <TokenSymbol>` in the :class:`Grammar`. These functions are then called to compile the node as it is output by :func:`parse`. This is a good way to do it for more complex tasks.
    It cleanly separates the compiler from the parser, while keeping the relationship between the output parse tree and the generated code clear.

Acknowledgements
================

|lrparsing| was developed with inspiration from several sources. The two main ones were parsing.py (http://www.canonware.com/Parsing/), which from a technical perspective is a powerful and clean LR parser implementing several things |lrparsing| doesn't, including a GLR(1) :term:`parser`. The other source of inspiration is pyparsing.py (http://pyparsing.wikispaces.com/), which proves that providing a pythonic interface to parsing is a worthwhile thing to do. Finally, thank you to Greg Black for proof reading this document.

Lrparsing Glossary
==================

If you use a parser generator to its full extent, and particularly if you are doing error recovery, you will end up knowing enough about parser generation to write your own. It will be a long journey, made longer by parser generation being an old art that has built up a truly impressive amount of jargon. The glossary is here to help you get past the jargon. Unfortunately the glossary is large and the concepts complex, so it ends up being another wall of text. The narrative below is less rigorous, but hopefully doesn't suffer from the wall of text problem. May it speed your journey to parser generation zen.

.. _my-narrative:

Glossary Narrative
------------------

Here is a simple :term:`grammar`::

    import lrparsing

    class MyGrammar(lrparsing.Grammar):
        xy_non_term = lrparsing.Token('x') + lrparsing.Token('y')
        y_non_term = lrparsing.Token('y')
        the_world_non_term = lrparsing.Repeat(xy_non_term | y_non_term, min=1)
        START = the_world_non_term

This :term:`grammar` is composed of :class:`rules <Rule>`, which are things Python can parse. |Lrparsing| translates the :class:`rules <Rule>` into an alternate representation of the same :term:`grammar` using :term:`productions <production>`.
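Before looking at the productions, it helps to pin down what this grammar accepts: one or more repetitions of ``xy_non_term`` or ``y_non_term``. Ignoring whitespace, that is the same language as a short regular expression. This is a sketch for intuition only; it has nothing to do with how |lrparsing| processes the grammar::

```python
import re

# One or more of "xy" or "y" - the language MyGrammar recognises,
# assuming single character tokens and no intervening whitespace.
my_grammar_language = re.compile(r"(?:xy|y)+\Z")

[bool(my_grammar_language.match(s)) for s in ("y", "xyy", "x", "")]
# [True, True, False, False]
```

Regular expressions stop being an option once a grammar nests (for example, balanced parentheses), which is exactly when the production machinery below earns its keep.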
Productions are what an LR(1) :term:`parser generator` understands (this is the output produced by ``MyGrammar.repr_productions()``)::

    0   : <MyGrammar> = START __end_of_input__
    1   : START = the_world_non_term
    2   : the_world_non_term = the_world_non_term.Repeat
    3   : the_world_non_term.Repeat = xy_non_term
          the_world_non_term.Repeat = y_non_term
          the_world_non_term.Repeat = the_world_non_term.Repeat xy_non_term
          the_world_non_term.Repeat = the_world_non_term.Repeat y_non_term
    4   : xy_non_term = 'x' 'y'
    5   : y_non_term = 'y'

:term:`Productions <production>` have an internal structure, and each bit has its own name; sometimes several. Apart from the ``=`` they are made up of :term:`symbols <symbol>`. The :term:`symbol` on the :term:`left hand side` is called a :term:`non-terminal`. A :term:`non-terminal` is never read from the input stream directly. It is just a place holder created by the grammar author to represent whatever is on the :term:`right hand side` of the :term:`production`. This contrasts with the :term:`symbols <symbol>` read from the input stream. They are called :term:`tokens <token>`. |Lrparsing| embeds the :term:`tokens <token>` in an :term:`input tuple`, mostly because additional information is needed for error reporting - things like where the :term:`token` was found (eg, line and column).

The :term:`parser generator` now creates a :term:`parse table` from the :term:`productions <production>`. To do that it goes through an intermediate step. It creates a third representation of the :term:`grammar`, a directed graph. Each node in the graph is called an :term:`item set`. The first :term:`item set` is::

    {
        <MyGrammar> = ^ START __end_of_input__
    }

As the ``{}`` are meant to imply, an :term:`item set` really is a :class:`set`, a set of :term:`productions <production>`. However there is an extra bit of information associated with each :term:`production`, represented as a ``^`` here.
This is called the :term:`dot`, and it shows how far the :term:`parser` has progressed in recognising the :term:`production`. The combination of the :term:`production` and the :term:`dot` is called an :term:`item`, and hence the term :term:`item set` arises because it is a set of :term:`items <item>`. The :term:`production` used in this initial :term:`item set` is the root of the :term:`parse tree`. It includes the special :class:`MetaToken` ``__end_of_input__``, which signals the entire input has been read. The :term:`non-terminal` on its :term:`left hand side` (``<MyGrammar>`` here) has a special name. It is called the :term:`start symbol`. So the :term:`item set` above is a succinct representation of the :term:`parser's <parser>` state before it has read anything. However it's not finished yet. In order to finish it off, an operation called :term:`closure` must be performed on it. The :term:`closure` is intuitively easy to construct once you recognise that if the :term:`parser` is about to see a ``START`` symbol, then the ``START`` symbol's :term:`production(s) <production>` belong in the :term:`closure` as well::

    {
        <MyGrammar> = ^ START __end_of_input__
        START = ^ the_world_non_term
    }

But the :term:`closure` is still not finished, as now ``the_world_non_term`` is also a :term:`non-terminal` preceded by the :term:`dot`. Adding it adds yet more unclosed :term:`non-terminals <non-terminal>`. This goes on for a bit. Once complete, the :term:`closure` will be::

    {
        <MyGrammar> = ^ START __end_of_input__
        START = ^ the_world_non_term
        the_world_non_term = ^ the_world_non_term.Repeat
        the_world_non_term.Repeat = ^ xy_non_term
        the_world_non_term.Repeat = ^ y_non_term
        the_world_non_term.Repeat = ^ the_world_non_term.Repeat xy_non_term
        the_world_non_term.Repeat = ^ the_world_non_term.Repeat y_non_term
        xy_non_term = ^ 'x' 'y'
        y_non_term = ^ 'y'
    }

The :term:`item set` is now complete. There are two :term:`tokens <token>` that follow the :term:`dots <dot>` in it: ``'x'`` and ``'y'``.
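The :term:`closure` construction is mechanical enough to sketch in a few lines of Python. The dict-of-productions representation below is an assumption made for this illustration; it is not how |lrparsing| stores a :class:`Grammar`::

```python
# MyGrammar's productions: non-terminal -> list of right hand sides.
PRODUCTIONS = {
    "<MyGrammar>": [("START", "__end_of_input__")],
    "START": [("the_world_non_term",)],
    "the_world_non_term": [("the_world_non_term.Repeat",)],
    "the_world_non_term.Repeat": [
        ("xy_non_term",),
        ("y_non_term",),
        ("the_world_non_term.Repeat", "xy_non_term"),
        ("the_world_non_term.Repeat", "y_non_term"),
    ],
    "xy_non_term": [("'x'", "'y'")],
    "y_non_term": [("'y'",)],
}

def closure(items):
    """An item is (lhs, rhs, dot); dot is the index of the ^ marker."""
    items = set(items)
    while True:
        new = set()
        for lhs, rhs, dot in items:
            if dot < len(rhs) and rhs[dot] in PRODUCTIONS:
                # A non-terminal follows the dot: pull in its productions,
                # each with the dot at position 0.
                for prod in PRODUCTIONS[rhs[dot]]:
                    new.add((rhs[dot], prod, 0))
        if new <= items:        # fixed point reached
            return items
        items |= new

core = {("<MyGrammar>", ("START", "__end_of_input__"), 0)}
len(closure(core))  # 9, the nine items listed above
```

In the result, only the two :term:`tokens <token>` ``'x'`` and ``'y'`` appear immediately after a :term:`dot`.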
This means these two :term:`tokens <token>`, and only these two :term:`tokens <token>`, could legitimately appear in the input stream at this point. Anything else will cause the :term:`parser` to raise a :exc:`ParseError`. Let's say ``'x'`` was the next thing to be read from the input stream. Reading it creates a new :term:`item`, which can be used to create a :term:`core` we haven't seen before::

    {
        xy_non_term = 'x' ^ 'y'
    }

Since there are no :term:`non-terminals <non-terminal>` following the :term:`dot`, the :term:`closure` adds nothing, so this :term:`core` is also a completed :term:`item set`. ``'x'`` is not the only :term:`token` that the epoch :term:`item set <item set>` could first read from the input stream. It might also be a ``'y'``. This yields another new :term:`core`, and doing a :term:`closure` on that core will create a new :term:`item set` for it. Recall the :term:`parser generator` is creating a graph, and these :term:`item sets <item set>` are the nodes in the graph. Since reading a :term:`token` moves to these new :term:`item sets <item set>`, the net effect of reading any :term:`symbol` is to create a directed line connecting the two nodes in the graph. The act of moving along that line to a new node in the graph is called an :term:`action`. This particular variant of an :term:`action` is called a :term:`shift action <shift>`. All possible :term:`shift actions <shift>` a node / :term:`item set` can make are held in its :term:`shift table`. For the :term:`item set` representing the :term:`start symbol`, the :term:`shift table` will be::

    {
        'x': item_set({ xy_non_term = 'x' ^ 'y', }),
        'y': item_set({ y_non_term = 'y' ^, }),
    }

When ``xy_non_term`` reads the ``'x'`` followed by ``'y'``, the entire :term:`right hand side` of the :term:`production` has been seen. Each :term:`shift` has pushed the :term:`token` it accepted onto a data structure called the :term:`stack`.
The `'x'` and `'y'` on the stack are popped from the :term:`stack` and the
:term:`non-terminal` `xy_non_term` is pushed onto the :term:`stack`,
effectively replacing them. These are the first steps of an :term:`action`
called :term:`reduce`. Effectively, the :term:`non-terminal` `xy_non_term`
has just been read from the input stream, and based on that the
:term:`parser` must move to a new :term:`state` (node in the graph, aka
:term:`item set`). It does it in the same way as a
:term:`shift action <shift>`: it looks up the :term:`non-terminal` just
"read" in a table called the :term:`reduce table`. This table can be
displayed using `MyGrammar.repr_parse_table()`. The :term:`reduce table` of
the :term:`start symbol's <start symbol>` :term:`item set` is::

   {
      START: item_set({<MyGrammar> = START ^ __end_of_input__}),
      the_world_non_term: item_set({START = the_world_non_term ^}),
      the_world_non_term.Repeat: item_set({
         the_world_non_term = the_world_non_term.Repeat ^
         the_world_non_term.Repeat = the_world_non_term.Repeat ^ xy_non_term
         the_world_non_term.Repeat = the_world_non_term.Repeat ^ y_non_term
      }),
      xy_non_term: item_set({the_world_non_term.Repeat = xy_non_term ^}),
      y_non_term: item_set({the_world_non_term.Repeat = y_non_term ^}),
   }

"Reading" a :term:`non-terminal` moves the :term:`dot` just like reading a
:term:`token` does, and so creates a new :term:`item set`. All these new
:term:`item sets <item set>` are processed, yielding new
:term:`item sets <item set>` in turn that are also processed - but only if
the :term:`core` hasn't been seen before. When the process winds down the
graph is complete. To do its job the :term:`parser` needs only the base
structure of the graph, not the details used to build it. In fact each
:term:`item set` can be represented by an integer, which can conveniently
be used as an index into an array whose elements contain the
:term:`reduce table` and :term:`shift table` for the node the
:term:`item set` represented.
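To see how little the parser itself needs once the graph is stripped down,
here is a toy shift/reduce driver over a hand-built parse table for an
unrelated mini-grammar, `S = S '+' 'n' | 'n'`. The state numbers, table
layout and `'$'` end marker are invented for this sketch; |lrparsing|
computes and hides all of this for you:

```python
# A hand-built parse table for: S -> S '+' 'n' | 'n'.
# Real tables are generated; this one is written out to show the shape.
ACTIONS = {  # (state, lookahead) -> action
    (0, 'n'): ('shift', 1),
    (1, '+'): ('reduce', 'S', 1), (1, '$'): ('reduce', 'S', 1),
    (2, '+'): ('shift', 3), (2, '$'): ('accept',),
    (3, 'n'): ('shift', 4),
    (4, '+'): ('reduce', 'S', 3), (4, '$'): ('reduce', 'S', 3),
}
GOTO = {(0, 'S'): 2}  # the reduce (goto) table: (state, non-terminal) -> state

def parse(tokens):
    """The whole LR(1) runtime: a loop, a stack, and two table lookups."""
    stack = [0]                    # alternating states and symbols
    tokens = tokens + ['$']        # '$' plays the role of __end_of_input__
    i = 0
    while True:
        action = ACTIONS[(stack[-1], tokens[i])]
        if action[0] == 'shift':
            stack += [tokens[i], action[1]]
            i += 1
        elif action[0] == 'reduce':
            _, lhs, n = action
            tree = tuple([lhs] + stack[len(stack) - 2 * n::2])  # pop the rhs
            del stack[len(stack) - 2 * n:]
            stack += [tree, GOTO[(stack[-1], lhs)]]
        else:                      # accept: return the finished parse tree
            return stack[1]

print(parse(['n', '+', 'n']))      # ('S', ('S', 'n'), '+', 'n')
```

Each reduce folds the popped right hand side into a nested tuple, which is
why the finished stack bottom is already the parse tree.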
This stripped down version of the :term:`item set` graph is the
:term:`parse table`.

Glossary
--------

.. glossary::

   Action
      An :term:`action` is the collective name for the two steps an
      :term:`LR(1) parser <parser>` can perform: :term:`shift` and
      :term:`reduce`. Alternatively, if the :term:`parse table` is thought
      of as a graph, an :term:`action` is the name of a line connecting the
      nodes in the graph. The name for a node in the graph is a
      :term:`state`.

   Ambiguous
      A parser of any sort has two jobs: to verify the input is valid
      according to the grammar, and to create a :term:`parse tree` for it.
      The words *create a parse tree* could equally well be written *assign
      meaning*. Eg, ``9 + 3 / 3`` is valid arithmetic, but without the
      normal rules of arithmetic it could have two different meanings:
      ``9 + (3 / 3) = 10`` or ``(9 + 3) / 3 = 4``. In English, jokes are
      based on this same ambiguity, eg, *Time flies like an arrow*. It
      follows that a parser would not be much use if it could not assign a
      unique meaning to every input sequence. Sadly it is possible (even
      easy) to write grammars that don't assign a unique meaning to every
      input sequence. These grammars are rejected by the parser constructor
      as :term:`ambiguous`. Even more sadly, no practical parser, including
      :term:`LR(1)`, is powerful enough to recognise all non-ambiguous
      grammars, so they incorrectly say that some non-ambiguous grammars
      are ambiguous. The only workaround is to re-write the grammar in a
      form the parser can use, and post-process the :term:`parse tree` if
      the new version is too far from the truth.

   Closure
      Closure is the name for both an operation performed on
      :term:`item sets <item set>` during their construction and the output
      of that operation, which is a set of :term:`items <item>`. An
      :term:`item set` is a set of :term:`items <item>`, ie, partially
      recognised :term:`productions <production>`. Initially an
      :term:`item set` contains just its :term:`core`.
      If the next expected :term:`symbol` of any :term:`item` in the
      :term:`item set` is a :term:`non-terminal`, ie the
      :term:`left hand side` of another :term:`production`, then it follows
      the parser is also recognising this new :term:`item`, which is that
      :term:`production` with its :term:`dot` at the start. This new
      :term:`item` becomes part of the :term:`closure`. That new
      :term:`item` could also have a :term:`non-terminal` at its start,
      thus yielding another new :term:`item`, recursively. All items added
      in this way become the :term:`item set's <item set>` closure.

   Core
      The core is the name of the initial subset of :term:`items <item>` in
      an :term:`item set`. The :term:`core` is constructed from the
      :term:`item sets <item set>` in the :term:`state graph <state>` that
      have lines leading to this new :term:`item set`. It is the
      :term:`items <item>` in those ancestor :term:`item sets <item set>`
      after the recognition of the :term:`symbol` labelling the line
      connecting them. The :term:`dot` of those items has been moved along
      one position to reflect the recognition of that :term:`symbol`.

   Conflict
   Reduce/Reduce
   Shift/Reduce
      The job of the :term:`parser generator` is to create a graph. The
      nodes of the graph become the parser's :term:`states <state>`. The
      lines that join the nodes of the graph are termed
      :term:`actions <action>`. They are labelled with the :term:`symbol`
      the :term:`parser` just recognised. If the :term:`parser generator`
      can't uniquely decide what the next :term:`state` should be when a
      :term:`symbol` is seen (ie, several lines would be labelled with the
      same :term:`symbol`), the :term:`grammar` is said to have a
      :term:`conflict`. This means the :term:`grammar` is
      :term:`ambiguous`, because having several possible
      :term:`states <state>` reachable for the same :term:`symbol` means
      there are several possible :term:`parse trees <parse tree>`.
      For an :term:`LR parser <LR(1)>` one of the conflicting
      :term:`actions <action>` will always be a :term:`reduce`, because an
      :term:`LR parser <LR(1)>` only makes a decision once it has seen an
      entire production. Since there are only two types of
      :term:`action <action>`, :term:`shift` and :term:`reduce`, a
      :term:`conflict` must either be between a :term:`shift` and a
      :term:`reduce`, which is termed a
      :term:`Shift/Reduce conflict <Shift/Reduce>`, or between two
      :term:`reduces <reduce>`, which is termed a
      :term:`Reduce/Reduce conflict <Reduce/Reduce>`.

   Dot
   \^
      Marks a position in the :term:`right hand side` of a
      :term:`production`. In days gone by, when
      :term:`parse tables <parse table>` were constructed manually on
      paper, a human would mark a dot in a production to keep track of how
      far they had progressed. In a computer the position of the dot is
      represented by the number of :term:`symbols <symbol>` on the
      :term:`right hand side` seen so far.

   Grammar
      Rules that explain how a series of :term:`tokens <token>` should be
      interpreted. For example, in English :term:`tokens <token>` are words
      and punctuation. English grammar shows how to build up words and
      punctuation into phrases of various kinds (eg, noun phrases, adverb
      phrases), then the phrases into sentences, and the sentences into
      constructs like quotes, paragraphs and chapters. The result is often
      represented as a :term:`parse tree`.

   Input Tuple
      The :term:`parser` recognises :term:`tokens <token>`. However, it is
      often helpful to have other information associated with the
      :term:`token` that the :term:`parser` doesn't use, such as the
      position in the input stream it was found at, for error reporting.
      |lrparsing| calls the combination of a :term:`Token` plus that other
      information an :term:`input tuple`. |lrparsing| uses a Python
      :func:`tuple <tuple>` with the :term:`Token` as its first element for
      this purpose.

   Item
      A partially recognised :term:`production`.
      It is a production, plus a marker (the :term:`dot`) which shows how
      much the parser has seen so far, plus the :term:`tokens <token>` that
      could legally follow the :term:`production`. An LR(0) parser doesn't
      store any such tokens, an :term:`LR(1)` parser stores all valid
      tokens that could come next, and an LR(n) parser stores all valid
      sequences of n tokens that could possibly follow it. Since an
      :term:`LR(1)` parser only makes decisions after seeing the entire
      :term:`right hand side` of a :term:`production`, it only considers
      the :term:`symbols <symbol>` that could follow a production when it
      does a :term:`reduce`.

   Item Set
      This data structure is the heart of an
      :term:`LR(1) parser generator <parser generator>`. As the name
      implies, an :term:`item set` is a set of :term:`items <item>`.
      Internally this set is broken up into two distinct subsets, the
      :term:`core` and the :term:`closure`. The generator initialises
      itself by creating the :term:`core` of the initial :term:`item set`
      from the :term:`start symbol`. This :term:`core` consists of the sole
      :term:`production` of the :term:`start symbol` with the :term:`dot`
      before the first :term:`symbol`. The next step is to compute the
      :term:`closure` of this :term:`core`. In this case if the first
      :term:`symbol` of the :term:`right hand side` is a
      :term:`non-terminal`, then it follows the parser is also at the start
      of all the :term:`non-terminal's <non-terminal>`
      :term:`productions <production>`, so they are added to the
      :term:`closure`. And so on, recursively, if they have
      :term:`non-terminal`\s as their first :term:`symbol`. So now we have
      a set of :term:`items <item>` (the :term:`core` and its
      :term:`closure`), each waiting for their first :term:`symbol` to
      appear.
      If the parser receives one such :term:`symbol` it can move forward,
      and what it moves to is a new :term:`item set` whose :term:`core` is
      all the current :term:`items <item>` waiting on that :term:`symbol`,
      but with their :term:`dot` position moved forward one step because a
      new :term:`symbol` has been seen. This process of computing
      :term:`closures <closure>` and generating new
      :term:`item sets <item set>` continues until only duplicate
      :term:`item sets <item set>` are created. The
      :term:`item sets <item set>` then become the :term:`states <state>`
      of the :term:`parse table`.

   Lhs
   Left Hand Side
      The :class:`Symbol` on the left hand side of a :term:`production`.
      Such symbols are called :term:`non-terminals <non-terminal>`. A
      |lrparsing| :class:`Grammar` calls the :term:`symbol` on the
      :term:`left hand side` of a :term:`production` a :class:`Rule`.

   LR(1)
      LR(1) is an abbreviation for Left Right (parser with) 1 (symbol
      lookahead). Left here simply means it treats the string of characters
      like English: it starts reading at the left. The Right means it
      delays making a decision on whether it has recognised a
      :term:`production` until it has read the last (right most) symbol of
      its :term:`right hand side`. The 1 symbol lookahead means it
      considers the next input symbol when it makes this decision. (It only
      needs to do this if two different
      :term:`non-terminals <non-terminal>` have the same
      :term:`right hand side`. An :term:`LR(1)` parser will look at the
      next :term:`token` to be read to distinguish between them. If the
      same :term:`token` can legitimately follow both, doing that doesn't
      help, so the parser generator will say it has a
      :term:`Reduce/Reduce conflict <Reduce/Reduce>`.) An LR(0) parser
      doesn't consider the next input :term:`token`, and if someone were to
      write an LR(2) parser it would consider the next 2
      :term:`tokens <token>`. The other common type of parser is an LL(1)
      parser, also called a recursive descent parser.
      The only difference is that the R becomes an L (for Left), meaning
      the parser must make its decisions on what it is doing using just the
      :term:`left hand side` of the production it is currently trying to
      recognise and the next input symbol. In theory, because it has
      significantly less information to work with (it hasn't seen any of
      the :term:`right hand side` yet), an LL(1) parser is less powerful
      than an :term:`LR(1)` parser. In practice recursive descent parsers
      are often written by hand and the programmer cheats by looking ahead
      or re-writing the grammar. This cheating goes a long way to making up
      the difference in power. However, automatically generated LL(1)
      parsers don't cheat and so are significantly less powerful (ie, LL(1)
      handles a smaller set of otherwise valid grammars) than automatically
      generated :term:`LR(1)` parsers such as |lrparsing|.

   Parser State
   State
      The state, which is typically identified by a number, is how the
      :term:`LR(1)` parser knows what it is up to. A :term:`state` is
      called an :term:`item set` by the :term:`parser generator`. The name
      change reflects a change in the way it is used, not in what it is.
      Conceptually an :term:`LR(1)` parser is a Deterministic Finite
      Automaton (DFA), which is a fancy name for a way of moving through a
      graph. The :term:`states <state>` are the nodes in this graph. The
      :term:`LR(1)` parser progresses by moving from one node to another,
      but like all graphs this is only possible if the nodes are directly
      connected by a line. In the parser's graph these lines are
      :term:`actions <action>`. Every valid :term:`token` that can be
      accepted as input while the parser is in that state has an associated
      :term:`action`. In fact the only time the :term:`parser` raises an
      error is when it knows the input stream can't match the
      :term:`grammar`, and the only time it knows that is when the
      :term:`token` it just read doesn't have an :term:`action` associated
      with it.
      A :term:`shift action <shift>` consumes the next :term:`token`
      (pushes it onto the :term:`stack`) and then moves onto the next state
      by following the line labelled with that :term:`token`. A
      :term:`reduce` action recognises a :term:`production`, and so it pops
      the :term:`stack` to see what parent :term:`state` the
      :term:`LR(1) parser <parser>` is trying to recognise. That parent
      state will have a line in the graph labelled with the
      :term:`non-terminal` on the :term:`left hand side` of the
      :term:`production` just recognised, which will take it to the next
      :term:`state`. People familiar with the purest form of regular
      expressions will know that they can also be recognised with DFAs. An
      :term:`LR(1)` parser is a souped-up version of a regular expression
      DFA that recognises the :term:`right hand side` of its productions.

   Parse Table
      A parse table is the data structure that drives a :term:`parser`. The
      parse table is an array of :term:`states <state>`, the
      :term:`states <state>` being identified by their array index. The
      :term:`parse table` can be viewed as the description of the nodes and
      lines of a graph. The nodes of the graph are the
      :term:`states <state>`. The lines show the :term:`state` the
      :term:`parser` moves to when it recognises a :term:`symbol`, and thus
      are labelled with that :term:`symbol`. The lines correspond to the
      state's :term:`actions <action>`.

   Parse Tree
      The way a parser explains the meaning of its input is to output a
      parse tree. The parse tree assigns the meaning to the
      :term:`symbols <symbol>` and the relationships between them. The
      sentence *Time flies like an arrow* can be parsed two ways. In one
      reading "Time" is an adjective describing a type of insect, and
      "like" is a verb describing what "Time flies" do to arrows.
      An |lrparsing| parse tree for that meaning might look like::

         (sentence
            (noun_phrase (adjective 'Time') (noun 'flies'))
            (verb_phrase
               (verb 'like')
               (noun_phrase (determiner 'an') (noun 'arrow'))))

      The alternative, where "Time" is a noun, looks like this::

         (sentence
            (noun 'Time')
            (verb_phrase
               (verb 'flies')
               (preposition_phrase
                  (preposition 'like')
                  (noun_phrase (determiner 'an') (noun 'arrow')))))

   Parser
      The parser in an :term:`LR(1)` system is a simple loop that processes
      a stream of :term:`input tuples <input tuple>` according to a
      :term:`parse table`, and produces a :term:`parse tree` as a
      consequence. The same :term:`parser` is used for all input streams
      and all :term:`parse tables <parse table>`. It is fast because it is
      simple. Its one internal data structure is the :term:`stack`. The
      :term:`parse table` is actually a graph, and the
      :term:`parser's <parser>` sole job is to move to a new node in the
      graph as it recognises each new :term:`symbol` by following the line
      labelled with that :term:`symbol`. As it goes it outputs the
      :term:`productions <production>` it has recognised, and these become
      the :term:`parse tree`.

   Parser Generator
      A parser generator takes a :term:`grammar` as input and generates a
      :term:`parse table` as output, unless the :term:`grammar` is
      :term:`ambiguous`, in which case it raises an error. It does this by
      constructing :term:`item sets <item set>`.

   Production
      The :term:`LR(1)` name for the lines describing a :term:`Grammar` an
      :term:`LR(1) parser generator <parser generator>` accepts as input.
      |lrparsing| does not use :term:`productions <production>`, it uses
      :class:`rules <Rule>`. It compiles your :class:`rules <Rule>` into
      :term:`LR(1)` :term:`productions <production>` for you. You can
      display those productions using :func:`repr_productions`. Productions
      are similar to :class:`rules <Rule>` in that they have a
      :term:`left hand side` which replaces the symbols on the
      :term:`right hand side` when they are recognised.
      The differences between a :class:`Rule` and a :term:`production` are
      that a :term:`production` can only be a :class:`Sequence`, and that
      alternative derivations for the same :term:`non-terminal` are
      expressed by allowing a :term:`non-terminal` to have many
      :term:`productions <production>`.

   Non-Terminal
      A symbol that appears on the :term:`left hand side` of a production.
      In |lrparsing| :class:`Grammars <Grammar>` the :class:`Rules <Rule>`
      are its :term:`non-terminals <non-terminal>`.

   Reduce
   Goto
      A reduce is the :term:`parser's <parser>` aha! moment. It happens
      when it has decided it has recognised a complete :term:`production`.
      This occurs when the :term:`symbols <symbol>` on the
      :term:`right hand side` plus the next :term:`token` in the input
      stream uniquely determine the :term:`non-terminal` on the
      production's :term:`left hand side`. The :term:`reduce`
      :term:`action` now pops the :term:`symbols <symbol>` of the
      :term:`right hand side` from the stack, and then pushes the
      :term:`non-terminal` on the :term:`left hand side`. In other words,
      it replaces the :term:`symbols <symbol>` on the stack that are the
      :term:`production's <production>` right hand side with the
      :term:`non-terminal` on its :term:`left hand side`. It then looks up
      the newly arrived :term:`non-terminal` in the :term:`reduce table` to
      determine the next state. :term:`Goto` is an older name for
      :term:`reduce` still found in the literature.

   Reduce Table
   Goto Table
      The :term:`reduce table` is a mapping of a :term:`non-terminal` to a
      :term:`state`. When a :term:`production` is recognised by a
      :term:`reduce` :term:`action` it is replaced by the
      :term:`non-terminal` on its :term:`left hand side`. This is done by
      popping the :term:`symbols <symbol>` of the :term:`right hand side`
      from the :term:`stack`, and then pushing the :term:`non-terminal`.
      The parser must then move to a new :term:`state`, and it does that by
      looking up the :term:`non-terminal` in the :term:`state's <state>`
      :term:`reduce table`.
      Every :term:`state` has a :term:`reduce table`, but it will be empty
      if all the :term:`productions <production>` in the :term:`item set`
      the :term:`state` represents have
      :term:`right hand sides <right hand side>` consisting purely of
      :term:`tokens <token>`. If the :term:`parse table` is thought of as a
      graph, the :term:`reduce table` describes the lines in the graph
      labelled by a :term:`non-terminal`, ie, the lines to be followed when
      that :term:`non-terminal` is recognised by a :term:`reduce`. There is
      a corresponding mapping for :term:`tokens <token>` called the
      :term:`shift table`, and between the two they describe all the lines
      in the graph. :term:`Goto table <Goto Table>` is an older name for
      the :term:`reduce table` which is still found in the literature.

   Rhs
   Right Hand Side
      The sequence of symbols on the right hand side of a
      :term:`production`. When these symbols are seen in the order they
      were given, they can be replaced on the :term:`stack` (via an
      operation called a :term:`reduce`) by the :term:`non-terminal` on the
      :term:`left hand side`.

   Shift
      When a :term:`token` is recognised as valid it is removed from the
      input stream and pushed onto the :term:`stack`. The :term:`parser`
      then looks up the :term:`token` in the :term:`shift table` to
      determine the new :term:`state` it should be in.

   Shift Table
      The :term:`shift table` maps a :term:`token` to another
      :term:`state`. Each :term:`state` has its own :term:`shift table`
      that lists every :term:`token` the :term:`grammar` says could
      legitimately come next. If the next :term:`token` to be read isn't in
      that list a :exc:`ParseError` is raised, and in fact this is the only
      thing that triggers a :exc:`ParseError`. If the :term:`parse table`
      is thought of as a graph, the :term:`shift table` describes the lines
      in the graph followed when a particular :term:`token` appears as the
      next item in the input stream.
      There must be only one such line, otherwise the :term:`grammar` would
      be :term:`ambiguous`. The act of following a line is called a
      :term:`shift`, and thus the :term:`shift table` lists all possible
      :term:`shift` :term:`actions <action>` that can happen when the
      :term:`parser` is in a particular :term:`state`. There is a similar
      table called the :term:`reduce table` that describes the lines in the
      graph to be followed when a :term:`non-terminal` is recognised.
      Between them, the :term:`shift tables <shift table>` and the
      :term:`reduce tables <reduce table>` contain all lines in the graph,
      and thus describe all possible transitions the :term:`parser` can
      make between :term:`states <state>`.

   Stack
      The stack is a data structure the :term:`LR(1) parser <parser>` uses
      internally to keep track of where it is up to. A :term:`shift` pushes
      the next input :class:`Token` onto the stack. A :term:`reduce`
      happens when the :term:`symbols <symbol>` on the
      :term:`right hand side` of a :term:`production` match the symbols on
      the top of the stack. When they do, a :term:`reduce` replaces those
      matching symbols with the :term:`non-terminal` on the
      :term:`left hand side` of the :term:`production`. That :term:`symbol`
      will always (unless it is the :term:`start symbol`) be part of
      another production that the parser was in the process of recognising,
      and now its (probably partial) :term:`right hand side` will be what
      is on the top of the stack. The stack continues downwards with
      :term:`symbols <symbol>` from partially recognised
      :term:`productions <production>` until, at its bottom, we find the
      partially recognised production of the :term:`start symbol`. The
      :term:`symbols <symbol>` popped from the :term:`stack` become nodes
      in the :term:`parse tree`. Thus the :term:`stack` is actually a
      partially constructed branch of the :term:`parse tree`, with the
      bottom becoming the root node of the :term:`parse tree`.

   Start Symbol
      The goal of the grammar is to recognise this symbol.
      Although an |lrparsing| :class:`Grammar` requires a :class:`Symbol`
      named :attr:`Grammar.START`, and its goal is indeed to recognise that
      symbol, it isn't the real :term:`start symbol`. The real
      :term:`start symbol` has a single :term:`production` whose
      :term:`right hand side` is the :attr:`Grammar.START` symbol followed
      by an internal :class:`MetaToken` called ``__end_of_input__``. This
      real start symbol is returned by :func:`epoch_symbol`.

   Symbol
      In :term:`LR(1)` parlance a symbol is the name for something that is
      recognised by the :term:`parser`, ie, it is a collective name for
      :term:`tokens <token>` and :term:`non-terminals <non-terminal>`.

   Terminal
      What |lrparsing| calls a :term:`token`.

   Token
      Typically what the :term:`parser` is trying to recognise is a series
      of characters. The first step is to break that series of characters
      into :term:`tokens <token>`. In English a :term:`token` is a word or
      a punctuation character. The :term:`parser` then works on these
      :term:`tokens <token>`, not the individual characters.

   Tokeniser
      The thing that turns text into a stream of :term:`Tokens <Token>`.

Footnotes
=========

.. [#thread_compile]
   :func:`compile_grammar() <compile_grammar>` is the one lrparsing
   operation that is not thread safe. :func:`Parse <parse>` is thread safe,
   except that it will lazily call :func:`compile_grammar` for you if you
   haven't done it yet. This means in threaded environments you must
   compile your grammar yourself and not rely on :func:`parse` to do it.
   Compile it on module load, or wrap it in a :func:`threading.Lock` if you
   insist on doing it lazily. Calling :func:`parse` from multiple threads
   without calling :func:`compile_grammar` or :func:`pre_compile_grammar`
   first will **cause lrparsing to crash** sporadically.

.. [#grammar_rule_assignment]
   Assigning a :class:`Symbol` instance to a rule in a :class:`Grammar`
   subclass does not work like normal Python assignment, where the left
   hand side and the right hand side are the same object after the
   assignment.
   Instead the :class:`Grammar` subclass creates a new :class:`Rule` object
   to hold the :class:`Symbol` object on the right hand side, which results
   in some unusual side effects::

      class MyGrammar(Grammar):
          rule1 = Token('a')          # Legal, but rule1 != Token('a')
          rule2 = rule1 + Token('b')  # Also legal
          rule3 = rule2               # Illegal, direct assignment
          rule3 = rule2 * 1           # Same result but is legal
          START = rule3               # Legal because this is START

.. [#grammar_underscore_example]
   Not outputting rules whose name starts with a leading underscore to the
   :term:`parse tree` means these two grammars produce the same
   :term:`parse tree`::

      class Ex1(Grammar):
          _rule1 = Token('x')
          START = _rule1 + Token("y")

      class Ex2(Grammar):
          START = Token('x') + Token("y")

.. [#symbol_example]
   An example of syntactic sugar for :class:`Symbol`. This::

      class MyGrammar(Grammar):
          rule = Choice(
              Token('x'),
              Prio(Repeat(Token('a'), 1, 1),
                   Sequence(Opt(Token('y')), Token('z'))))

   Can be rewritten as this::

      class MyGrammar(Grammar):
          rule = Token('x') | (Token('a') * 1, Opt('y') + 'z')

.. [#token_registry_example]
   An example of using a :class:`TokenRegistry`::

      class MyGrammar(Grammar):
          class MyTokens(TokenRegistry):
              my_token = UserToken()
              identifier = Token(re='[A-Za-z_][A-Za-z_0-9]*', case=False)
              #
              # Being a subclass of Symbol, TokenSymbols have dicts.
              # Since assignments to a TokenSymbol's .dict don't work in
              # the body of Grammar subclasses, it is best to put them in
              # the TokenRegistry's body.
              #
              my_token["a"] = "a_value"
          START = MyTokens.my_token | MyTokens.identifier

      assert MyGrammar.MyTokens.my_token == 'MyTokens.my_token'

.. [#token_registry_corollary]
   A corollary of a :class:`TokenRegistry` being the only way to obtain a
   handle to the :class:`TokenSymbol` used by the parser is that if you use
   :class:`UserTokens <UserToken>`, you have to use a
   :class:`TokenRegistry`, otherwise there is no way to pass the correct
   :class:`UserToken` instance to the parser.

.. [#tokeniser_chain]
   The inbuilt tokeniser passes :term:`input tuples <input tuple>` through
   to the :func:`parser <parse>`, so you can use it to handle the bulk of
   the work even if it can't handle all of it. Write a generator that
   replaces the input :class:`strings <str>` the inbuilt tokeniser can't
   handle with :term:`input tuples <input tuple>` and yields the remainder
   of the input untouched. Pass your generator to :func:`parse` and you are
   done.

.. [#tokeniser-override]
   If you want to intercept the :term:`input tuples <input tuple>` emitted
   by the :ref:`inbuilt tokeniser <inbuilt-tokeniser>`, perhaps to change
   or delete them, override :meth:`TokenRegistry.tokeniser` like this::

      class G(Grammar):
          class T(TokenRegistry):
              def tokeniser(self, *args):
                  for input_tuple in super(G.T, self).tokeniser(*args):
                      # do something with input_tuple, maybe
                      yield input_tuple

.. [#tokens]
   Using :term:`tokens <token>` rather than dealing with
   :class:`strings <str>` directly is primarily an optimisation. Being more
   powerful than a regular expression, an :term:`LR(1)` parser could easily
   do what the :term:`tokeniser` does: recognise identifiers, numbers,
   keywords and so on in a character stream. But that power comes at a
   price. An :term:`LR(1)` parser both consumes more memory and is slower
   than a regular expression. Since :term:`tokens <token>` are typically 4
   or 5 characters long, we reduce the workload on the :term:`LR(1)` parser
   by a factor of 4 or 5 by getting a lightweight regular expression to
   break the input stream up into :term:`tokens <token>`, and then having
   the :term:`LR(1) parser <parser>` operate on them.

.. [#error-recovery-lie]
   The statement that the :term:`stack` passed to :func:`on_error` and
   :exc:`ParseError` is a :class:`Lr1State` is a simplification. It will be
   an :class:`Lr1State` if :func:`pre_compile_grammar` is used, otherwise
   it will be an :term:`item set`.
   However, :class:`Lr1State` contains a strict subset of the information
   in an :term:`item set`, so the description applies to both.

.. |lrparsing| replace:: :mod:`lrparsing`