ML-Yacc is a parser generator for Standard ML modeled after the
Yacc parser generator. It generates parsers for LALR languages, like Yacc,
and has a similar syntax. The generated parsers use a different algorithm
for recovering from syntax errors than parsers generated by Yacc.
The algorithm is a partial implementation of an algorithm described in [1].
A parser tries to recover from a syntax error
by making a single token insertion, deletion, or
substitution near the point in the input stream at which the error
was detected. The parsers delay the evaluation of semantic actions until
parses are completed successfully. This makes it possible for
parsers to recover from syntax errors that occur before the point
of error detection, but it does prevent the parsers from
affecting lexers in any significant way. The parsers
can insert tokens with values and substitute tokens with values
for other tokens. All symbols carry left and right position values
which are available to semantic actions and are used in
syntactic error messages.
ML-Yacc uses context-free grammars to specify the syntax of languages to
be parsed. See [2] for definitions and information on context-free
grammars and LR parsing. We briefly review some terminology here. A
context-free grammar is defined by a set of terminals T, a set of
nonterminals NT, a set of productions P, and a start
nonterminal S.
Terminals are interchangeably referred to as tokens. The terminal
and nonterminal sets are assumed to be disjoint. The set of symbols is the
union of the nonterminal and terminal sets. We use lower case
Greek letters to denote a string of symbols. We use upper case
Roman letters near the beginning of the alphabet to denote nonterminals.
Each production gives a
derivation of a string of symbols from a nonterminal, which we will
write as A ® a. We define a relation between strings of
symbols a and b, written a |- b and read
as a derives b, if and only if a = d A g,
b = d f g and
there exists some production A ® f. We write the
transitive closure of this relation as
|-*. We say that a string of terminals a is a valid sentence
of the language, i.e. it is derivable, if the start symbol
S |-* a. The sequence of derivations is often
visualized as a parse tree.
ML-Yacc uses an attribute grammar scheme with synthesized attributes.
Each symbol in the grammar may have a value (i.e. attribute) associated
with it. Each production has a semantic action associated with it.
A production with a semantic action is called a rule.
Parsers perform bottom-up, left-to-right evaluations of parse trees using semantic
actions to compute values as they do so. Given a production
P = A ® a, the corresponding semantic action is
used to compute a value for A from the values of the symbols in a.
If A has no value, the semantic action is still evaluated but the value is ignored.
Each parse returns the value associated with the start symbol S of the
grammar. A parse returns a nullary value if the start symbol does not carry a value.
The synthesized attribute scheme can be adapted easily to inherited
attributes. An inherited attribute is a value which propagates from
a nonterminal to the symbols produced by the nonterminal according to
some rule. Since functions are values in ML,
the semantic actions for the derived symbols
can return functions which takes the
inherited value as an argument.
1.2 Modules
ML-Yacc uses the ML modules facility to specify the interface between
a parser that it generates and a lexical analyzer that must be supplied
by you. It also uses the ML modules facility to factor out
a set of modules that are common to every generated parser.
These common modules include a parsing structure, which contains
an error-correcting LR parser1, an LR table structure, and a structure
which defines the representation of terminals. ML-Yacc produces
a functor for a particular parser parameterized by the LR table
structure and the representation of terminals. This functor
contains values specific to the parser, such as the
LR table for the parser2, the
semantic actions for the parser, and a structure containing
the terminals for the parser. ML-Yacc produces a signature
for the structure produced by applying this functor
and another signature for the structure containing the terminals for
the parser. You must
supply a functor for the lexing module parameterized this
structure.
Figure 1 is a dependency diagram of the modules that summarizes this
information. A module at the head of an arrow is dependent
on the module at the tail.
parsing structure
¾®
values for a particular parser
values for a particular parser
¾®
lexical analyzer
parsing structure,
¾®
particular parser
values for a particular parser,
lexical analyzer
Figure 1: Module Dependencies
1.3 Error Recovery
The error recovery algorithm is able to accurately recover from many
single token syntax errors. It tries to make a single token
correction at the token in the input stream at which the syntax error
was detected and any of the 15 tokens3 before that token. The algorithm checks corrections
before the point of error detection because a syntax error is often
not detected until several tokens beyond the token which caused the
error.4
The algorithm works by trying corrections at each
of the 16 tokens up to and including the token at which the
error was detected. At each token in the input stream, it
will try deleting the token, substituting other tokens for the
token, or inserting some other token before the token.
The algorithm uses a parse check to evaluate corrections. A parse
check is a check of how far a correction allows a parser to
parse without encountering a syntax error.
You pass an upper bound on how many tokens beyond the error
point a parser may read while doing a parse check as an argument to the
parser. This allows
you to control the amount of lookahead that a parser reads
for different kinds of systems. For an interactive system, you
should set the lookahead to zero. Otherwise, a parser may hang
waiting for input in the case of a syntax error. If the lookahead
is zero, no syntax errors will be corrected. For a batch system,
you should set the lookahead to 15.
The algorithm selects the set of corrections which allows the parse
to proceed the farthest
and parse through at least the error token. It then removes those
corrections involving keywords which do not meet a longer minimum
parse check. If there is more than one correction possible after this,
it uses a simple heuristic priority scheme to order the corrections,
and then arbitrarily chooses one of the corrections with the highest priority.
You have some control over the priority scheme by being able to
name a set of preferred insertions and a set of preferred substitutions.
The priorities for corrections, ordered from highest to lowest
priority, are
preferred insertions, preferred substitutions, insertions, deletions,
and substitutions.
The error recovery algorithm is guaranteed to terminate since it always
selects fixes which parse through the
error token.
The error-correcting LR parser implements the algorithm by keeping
a queue of its state stacks before shifting tokens and using
a lazy stream for the lexer.
This makes it possible to restart the
parse from before an error point and try various corrections. The
error-correcting LR parser does not defer semantic actions. Instead,
ML-Yacc creates semantic actions which are free of side-effects
and always terminate.
ML-Yacc uses higher-order functions to defer the
evaluation of all user semantic actions until the parse is successfully
completed without constructing an explicit parse tree.
You may declare whether your semantic actions are free of side-effects
and always terminate, in which case ML-Yacc does not need to defer
the evaluation of your semantic actions.
1.4 Precedence
ML-Yacc uses the same precedence scheme as Yacc for resolving
shift/reduce conflicts. Each terminal may be assigned a precedence and
associativity. Each rule is then assigned the precedence of its rightmost
terminal. If a shift/reduce conflict occurs, the conflict is resolved
silently if the terminal and the rule in the conflict have
precedences.
If the terminal has the higher precedence, the shift is chosen. If
the rule has the higher precedence, the reduction is chosen. If both
the terminal and the rule have the same precedence, then the associativity
of the terminal is used to resolve the conflict. If the terminal is
left associative, the reduction is chosen. If the terminal is
right associative, the shift is chosen. Terminals may be declared to
be nonassociative, also, in which case an error message is produced
if the associativity is need to resolve the parsing conflict.
If a terminal or a rule in a shift/reduce conflict does not have
a precedence, then an error message is produced and the shift
is chosen.
In reduce/reduce conflicts, an error message is always produced and
the first rule listed in the specification is chosen for reduction.
1.5 Notation
Text surrounded by brackets denotes meta-notation. If you see
something like {parser name}, you should substitute the actual
name of your parser for the meta-notation. Text in a bold-face
typewriter font (like this) denotes text in a specification
or ML code.