<html> <head> <title>ANTLR-centric Language Glossary</title> </head> <body bgcolor="#FFFFFF" text="#000000"> <font face=Arial> <h1><a id="ANTLR-centric_Language_Glossary" name="ANTLR-centric_Language_Glossary">ANTLR-centric Language Glossary</a></h1> <i>Terence Parr</i> <br><br> This glossary defines some of the more important terms used in the ANTLR documentation. I have tried to be very informal and provide references to other pages that are useful. For another great source of information about formal computer languages, see <a href="http://www.wikipedia.org/wiki/Context-free_grammar"><b>Wikipedia</b></a>. <font size=2> <!--index--> <ul> <li><a href="#Ambiguous">Ambiguous</a></li> <li><a href="#ANTLR">ANTLR</a></li> <li><a href="#AST">AST</a></li> <li><a href="#Bit_set">Bit set</a></li> <li><a href="#Child-sibling_Tree">Child-sibling Tree</a></li> <li><a href="#Context-free_grammar">Context-free grammar</a></li> <li><a href="#Context-sensitive">Context-sensitive</a></li> <li><a href="#DFA">DFA</a></li> <li><a href="#FIRST">FIRST</a></li> <li><a href="#FOLLOW">FOLLOW</a></li> <li><a href="#Grammar">Grammar</a></li> <li><a href="#Hoisting">Hoisting</a></li> <li><a href="#Inheritance,_grammar">Inheritance, grammar</a></li> <li><a href="#LA(n)">LA(n)</a></li> <li><a href="#Left-prefix,_left_factor">Left-prefix, left factor</a></li> <li><a href="#Literal">Literal</a></li> <li><a href="#Linear_approximate_lookahead">Linear approximate lookahead</a></li> <li><a href="#LL(k)">LL(k)</a></li> <li><a href="#LT(n)">LT(n)</a></li> <li><a href="#Language">Language</a></li> <li><a href="#Lexer">Lexer</a></li> <li><a href="#Lookahead">Lookahead</a></li> <li><a href="#nextToken">nextToken</a></li> <li><a href="#NFA">NFA</a></li> <li><a href="#Nondeterministic">Nondeterministic</a></li> <li><a href="#Parser">Parser</a></li> <li><a href="#Predicate,_semantic">Predicate, semantic</a></li> <li><a href="#Predicate,_syntactic">Predicate, syntactic</a></li> <li><a href="#Production">Production</a></li> <li><a href="#Protected">Protected</a></li> <li><a href="#Recursive-descent">Recursive-descent</a></li> <li><a href="#Regular">Regular</a></li> <li><a href="#Rule">Rule</a></li> <li><a href="#Scanner">Scanner</a></li> <li><a href="#Semantics">Semantics</a></li> <li><a href="#Subrule">Subrule</a></li> <li><a href="#Syntax">Syntax</a></li> <li><a href="#Token">Token</a></li> <li><a href="#Token_stream">Token stream</a></li> <li><a href="#Tree">Tree</a></li> <li><a href="#Tree_parser">Tree parser</a></li> <li><a href="#Vocabulary">Vocabulary</a></li> <li><a href="#Wow">Wow</a></li> </ul> <!--/index--> </font> <h3><a id="Ambiguous" name="Ambiguous">Ambiguous</a></h3> A language is ambiguous if the same sentence or phrase can be interpreted in more than a single way. For example, the following sentence by Groucho Marx is easily interpreted in two ways: "I once shot an elephant in my pajamas. How he got in my pajamas I'll never know!" In the computer world, a typical language ambiguity is the if-then-else ambiguity where the else-clause may be attached to either the most recent if-then or an older one. Reference manuals for computer languages resolve this ambiguity by stating that else-clauses always match up with the most recent if-then. <p> A grammar is ambiguous if the same input sequence can be derived in multiple ways. Ambiguous languages always yield ambiguous grammars unless you can find a way to encode semantics (actions or predicates etc...) that resolve the ambiguity. Most language tools like ANTLR resolve the if-then-else ambiguity by simply choosing to match greedily (i.e., as soon as possible). This matches the else with the most recent if-then. See nondeterministic. <h3><a id="ANTLR" name="ANTLR">ANTLR</a></h3> <b>AN</b>other <b>T</b>ool for <b>L</b>anguage <b>R</b>ecognition, a predicated-LL(k) parser generator that handles lexers, parsers, and tree parsers. ANTLR has been available since 1990 and led to a resurgence of recursive-descent parsing after decades dominated by LR and other DFA-based strategies. <h3><a id="AST" name="AST">AST</a></h3> <b>A</b>bstract <b>S</b>yntax <b>T</b>ree. ASTs are used as internal representations of an input stream, normally constructed during a parsing phase. Because AST are two-dimensional trees they can encode the structure (as determined by the parser) of the input as well as the input symbols. <p> A homogeneous AST is in one in which the physical objects are all of the same type; e.g., <tt>CommonAST</tt> in ANTLR. A heterogeneous tree may have multiple types such as <tt>PlusNode</tt>, <tt>MultNode</tt> etc... <p> An AST is not a parse tree, which encodes the sequence of rules used to match input symbols. See <a href="http://www.jguru.com/faq/view.jsp?EID=814505"><b>What's the difference between a parse tree and an abstract syntax tree (AST)? Why doesn't ANTLR generate trees with nodes for grammar rules like JJTree does?</b></a>. <p>An AST for input <tt>3+4</tt> might be represented as <pre> + / \ 3 4 </pre> or more typically (ala ANTLR) in child-sibling form: <pre> + | 3--4 </pre> Operators are usually subtree roots and operands are usually leaves. <h3><a id="Bit_set" name="Bit_set">Bit set</a></h3> Bit sets are an extremely efficient representation for dense integer sets. You can easily encode sets of strings also by mapping unique strings to unique integers. ANTLR uses bitsets for lookahead prediction in parsers and lexers. Simple bit set implementations do not work so well for sparse sets, particularly when the maximum integer stored in the set is large. <p> ANTLR's bit set represents membership with a bit for each possible integer value. For a maximum value of <i>n</i>, a bit set needs <i>n/64</i> long words or <i>n/8</i> bytes. For ASCII bit sets with a maximum value of 127, you only need 16 bytes or 2 long words. UNICODE has a max value of \uFFFF or 65535, requiring 8k bytes, and these sets are typically sparse. Fortunately most lexers only need a few of these space inefficient (but speedy) bitsets and so it's not really a problem. <h3><a id="Child-sibling_Tree" name="Child-sibling_Tree">Child-sibling Tree</a></h3> A particularly efficient data structure for representing trees. See AST. <h3><a id="Context-free_grammar" name="Context-free_grammar">Context-free grammar</a></h3> A grammar where recognition of a particular construct does not depend on whether it is in a particular syntactic context. A context-free grammar has a set of rules like <pre> stat : IF expr THEN stat | ... ; </pre> where there is no restriction on when the <tt>IF</tt> alternative may be applied--if you are in rule <tt>stat</tt>, you may apply the alternative. <h3><a id="Context-sensitive" name="Context-sensitive">Context-sensitive</a></h3> A grammar where recognition of a particular construct may depend on a syntactic context. You never see these grammars in practice because they are impractical (Note, an Earley parser is O(n^3) worst-case for context-<i>free</i> grammars). A context-free rule looks like: <pre> Α → γ </pre> but a context-<i>sensitive</i> rule may have context on the left-side: <pre> αΑβ → αγβ </pre> meaning that rule Α may only be applied (converted to γ) in between α and β. <p> In an ANTLR sense, you can recognize context-sensitive constructs with a semantic predicate. The action evaluates to true or false indicating the validity of applying the alternative. <p> See <a href="http://www.wikipedia.org/wiki/Context-sensitive"><b>Context-sensitive gramar</b></a>. <h3><a id="DFA" name="DFA">DFA</a></h3> <b>D</b>eterministic <b>F</b>inite <b>A</b>utomata. A state machine used typically to formally describe lexical analyzers. <tt>lex</tt> builds a DFA to recognize tokens whereas ANTLR builds a recursive descent lexer similar to what you would build by hand. See <a href="http://www.wikipedia.org/wiki/Finite_state_machine"><b>Finite state machine</b></a> and ANTLR's lexer documentation. <h3><a id="FIRST" name="FIRST">FIRST</a></h3> The set of symbols that may be matched on the left-edge of a rule. For example, the FIRST(decl) is set {ID, INT} for the following: <pre> decl : ID ID SEMICOLON | INT ID SEMICOLON ; </pre> The situation gets more complicated when you have optional constructs. The FIRST(a) below is {A,B,C} <pre> a : (A)? B | C ; </pre> because the A is optional and the B may be seen on the left-edge. <p> Naturally k>1 lookahead symbols makes this even more complicated. FIRST_k must track sets of k-sequences not just individual symbols. <h3><a id="FOLLOW" name="FOLLOW">FOLLOW</a></h3> The set of input symbols that may follow any reference to the specified rule. For example, FOLLOW(decl) is {RPAREN, SEMICOLON): <pre> methodHead : ID LPAREN decl RPAREN ; var : decl SEMICOLON ; decl : TYPENAME ID ; </pre> because RPAREN and SEMICOLON both follow references to rule decl. FIRST and FOLLOW computations are used to analyze grammars and generate parsing decisions. <p> This grammar analysis all gets very complicated when k>1. <h3><a id="Grammar" name="Grammar">Grammar</a></h3> A finite means of formally describing the structure of a possibly infinite language. Parser generators build parsers that recognize sentences in the language described by a grammar. Most parser generators allow you to add actions to be executed during the parse. <h3><a id="Hoisting" name="Hoisting">Hoisting</a></h3> Semantic predicates describe the semantic context in which a rule or alternative applies. The predicate is hoisted into a prediction expression. Hoisting typically refers to pulling a predicate out of its enclosing rule and into the prediction expression of another rule. For example, <pre> decl : typename ID SEMICOLON | ID ID SEMICOLON ; typename : {isType(LT(1))}? ID ; </pre> The predicate is not needed in typename as there is no decision, however, rule decl needs it to distinguish between its two alternatives. The first alternative would look like: <pre> if ( LA(1)==ID && isType(LT(1)) ) { typename(); match(ID); match(SEMICOLON); } </pre> PCCTS 1.33 did, but ANTLR currently does not hoist predicates into other rules. <h3><a id="Inheritance,_grammar" name="Inheritance,_grammar">Inheritance, grammar</a></h3> The ability of ANTLR to define a new grammar as it differs from an existing grammar. See the ANTLR documentation. <h3><a id="LA(n)" name="LA(n)">LA(n)</a></h3> The nth lookahead character, token type, or AST node type depending on the grammar type (lexer, parser, or tree parser respectively). <h3><a id="Left-prefix,_left_factor" name="Left-prefix,_left_factor">Left-prefix, left factor</a></h3> A common sequence of symbols on the left-edge of a set of alternatives such as: <pre> a : A B X | A B Y ; </pre> The left-prefix is A B, which you can remove by left-factoring: <pre> a : A B (X|Y) ; </pre> Left-factoring is done to reduce lookahead requirements. <h3><a id="Literal" name="Literal">Literal</a></h3> Generally a literal refers to a fixed string such as <tt>begin</tt> that you wish to match. When you reference a literal in an ANTLR grammar via <tt>"begin"</tt>, ANTLR assigns it a token type like any other token. If you have defined a lexer, ANTLR provides information about the literal (type and text) to the lexer so it may detect occurrences of the literal. <h3><a id="Linear_approximate_lookahead" name="Linear_approximate_lookahead">Linear approximate lookahead</a></h3> An approximation to full lookahead (that can be applied to both LL and LR parsers) for k>1 that reduces the complexity of storing and testing lookahead from O(n^k) to O(nk); exponential to linear reduction. When linear approximate lookahead is insufficient (results in a nondeterministic parser), you can use the approximate lookahead to attenuate the cost of building the full decision. <p>Here is a simple example illustrating the difference between full and approximate lookahead: <pre> a : (A B | C D) | A D ; </pre> This rule is LL(2) but not linear approximate LL(2). The real FIRST_2(a) is {AB,CD} for alternative 1 and {AD} for alternative 2. No intersection, so no problem. Linear approximate lookahead collapses all symbols at depth i yielding k sets instead of a possibly n^k k-sequences. The approximation (compressed) sets are {AB,AD,CD,CB} and {AD}. Note the introduction of the spurious k-sequences AD and CB. Unfortunately, this compression introduces a conflict upon AD between the alternatives. PCCTS did full LL(k) and ANTLR does linear approximate only as I found that linear approximate lookahead works for the vast majority of parsing decisions and is extremely fast. I find one or two problem spots in a large grammar usually with ANTLR, which forces me to reorganize my grammar in a slightly unnatural manner. Unfortunately, your brain does full LL(k) and ANTLR does a slightly weaker linear approximate lookahead--a source of many (invalid) bug reports ;) <p> This compression was the subject of <a href="http://www.antlr.org/papers/parr.phd.thesis.pdf"><b>my doctoral dissertation</b></a> (PDF 477k) at Purdue. <h3><a id="LL(k)" name="LL(k)">LL(k)</a></h3> Formally, LL(k) represents a class of parsers and grammars that parse symbols from left-to-right (beginning to end of input stream) using a leftmost derivation and using k symbols of lookahead. A leftmost derivation is one in which derivations (parses) proceed by attempting to replace rule references from left-to-right within a production. Given the following rule <pre> stat : IF expr THEN stat | ... ; </pre> an LL parser would match the IF then attempt to parse expr rather than a rightmost derivation, which would attempt to parse stat first. <p> LL(k) is synonymous with a "top-down" parser because the parser begins at the start symbol and works its way down the derivation/parse tree (tree here means the stack of method activations for recursive descent or symbol stack for a table-driven parser). A recursive-descent parser is particular implementation of an LL parser that uses functions or method calls to implement the parser rather than a table. <p> ANTLR generates predicate-LL(k) parsers that support syntactic and sematic predicates allowing you to specify many context-free and context-sensitive grammars (with a bit of work). <h3><a id="LT(n)" name="LT(n)">LT(n)</a></h3> In a parser, this is the nth lookahead Token object. <h3><a id="Language" name="Language">Language</a></h3> A possibly infinite set of valid sentences. The vocabulary symbols may be characters, tokens, and tree nodes in an ANTLR context. <h3><a id="Lexer" name="Lexer">Lexer</a></h3> A recognizer that breaks up a stream of characters into vocabulary symbols for a parser. The parser pulls vocabulary symbols from the lexer via a queue. <h3><a id="Lookahead" name="Lookahead">Lookahead</a></h3> When parsing a stream of input symbols, a parser has matched (and no longer needs to consider) a portion of the stream to the left of its read pointer. The next k symbols to the right of the read pointer are considered the fixed lookahead. This information is used to direct the parser to the next state. In an LL(k) parser this means to predict which path to take from the current state using the next k symbols of lookahead. <p> ANTLR supports syntactic predicates, a manually-specified form of backtracking that effectively gives you infinite lookahead. For example, consider the following rule that distinguishes between sets (comma-separated lists of words) and parallel assignments (one list assigned to another): <pre> stat: ( list "=" )=> list "=" list | list ; </pre> If a list followed by an assignment operator is found on the input stream, the first production is predicted. If not, the second alternative production is attempted. <h3><a id="nextToken" name="nextToken">nextToken</a></h3> A lexer method automatically generated by ANTLR that figures out which of the lexer rules to apply. For example, if you have two rules ID and INT in your lexer, ANTLR will generate a lexer with methods for ID and INT as well as a nextToken method that figures out which rule method to attempt given k input characters. <h3><a id="NFA" name="NFA">NFA</a></h3> <b>N</b>ondeterministic <b>F</b>inite <b>A</b>utomata. See <a href="http://www.wikipedia.org/wiki/Finite_state_machine"><b>Finite state machine</b></a>. <h3><a id="Nondeterministic" name="Nondeterministic">Nondeterministic</a></h3> A parser is nondeterministic if there is at least one decision point where the parser cannot resolve which path to take. Nondeterminisms arise because of parsing strategy weaknesses. <ul> <li>If your strategy works only for unambiguous grammars, then ambiguous grammars will yield nondeterministic parsers; this is true of the basic LL, LR strategies. Even unambiguous grammars can yield nondeterministic parsers though. Here is a nondeterministic LL(1) grammar: <pre> decl : ID ID SEMICOLON | ID SEMICOLON ; </pre> Rule <tt>decl</tt> is, however, LL(2) because the second lookahead symbol (either ID or SEMICOLON) uniquely determines which alternative to predict. You could also left-factor the rule to reduce the lookahead requirements.<br><br> <li> If you are willing to pay a performance hit or simply need to handle ambiguous grammars, you can use an Earley parser or a Tomita parser (LR-based) that match all possible interpretations of the input, thus, avoiding the idea of nondeterminism altogether. This does present problems when trying to execute actions, however, because multiple parses are, in effect, occurring in parallel. </ul> <p> Note that a parser may have multiple decision points that are nondeterministic. <h3><a id="Parser" name="Parser">Parser</a></h3> A recognizer that applies a grammatical structure to a stream of vocabulary symbols called tokens. <h3><a id="Predicate,_semantic" name="Predicate,_semantic">Predicate, semantic</a></h3> A semantic predicate is a boolean expression used to alter the parse based upon semantic information. This information is usually a function of the constructs/input that have already been matched, but can even be a flag that turns on and off subsets of the language (as you might do for a grammar handling both K&R and ANSI C). One of the most common semantic predicates uses a symbol table to help distinguish between syntactically, but semantically different productions. In FORTRAN, array references and function calls look the same, but may be distinguished by checking what the type of the identifier is. <pre> expr : {isVar(LT(1))}? ID LPAREN args RPAREN // array ref | {isFunction(LT(1))}? ID LPAREN args RPAREN // func call ; </pre> <h3><a id="Predicate,_syntactic" name="Predicate,_syntactic">Predicate, syntactic</a></h3> A selective form of backtracking used to recognize language constructs that cannot be distinguished without seeing all or most of the construct. For example, in C++ some declarations look exactly like expressions. You have to check to see if it is a declaration. If it parses like a declaration, assume it is a declaration--reparse it with "feeling" (execute your actions). If not, it must be an expression or an error: <pre> stat : (declaration) => declaration | expression ; </pre> <h3><a id="Production" name="Production">Production</a></h3> An alternative in a grammar rule. <h3><a id="Protected" name="Protected">Protected</a></h3> A protected lexer rule does not represent a complete token--it is a helper rule referenced by another lexer rule. This overloading of the access-visibility Java term occurs because if the rule is not visible, it cannot be "seen" by the parser (yes, this nomeclature sucks). <h3><a id="Recursive-descent" name="Recursive-descent">Recursive-descent</a></h3> See LL(k). <h3><a id="Regular" name="Regular">Regular</a></h3> A regular language is one that can be described by a regular grammar or regular expression or accepted by a DFA-based lexer such as those generated by <tt>lex</tt>. Regular languages are normally used to describe tokens. <p> In practice you can pick out a regular grammar by noticing that references to other rules are not allowed accept at the end of a production. The following grammar is regular because reference to <tt>B</tt> occurs at the right-edge of rule <tt>A</tt>. <pre> A : ('a')+ B ; B : 'b' ; </pre> Another way to look at it is, "what can I recognize without a stack (such as a method return address stack)?". <p> Regular grammars cannot describe context-free languages, hence, LL or LR based grammars are used to describe programming languages. ANTLR is not restricted to regular languages for tokens because it generates recursive-descent lexers. This makes it handy to recognize HTML tags and so on all in the lexer. <h3><a id="Rule" name="Rule">Rule</a></h3> A rule describes a partial sentence in a language such as a statement or expression in a programming language. Rules may have one or more alternative productions. <h3><a id="Scanner" name="Scanner">Scanner</a></h3> See Lexer. <h3><a id="Semantics" name="Semantics">Semantics</a></h3> See <a href="http://www.jguru.com/faq/view.jsp?EID=81"><b>What do "syntax" and "semantics" mean and how are they different?</b></a>. <h3><a id="Subrule" name="Subrule">Subrule</a></h3> Essentially a rule that has been expanded inline. Subrules are enclosed in parenthesis and may have suffixes like star, plus, and question mark that indicate zero-or-more, one-or-more, or optional. The following rule has 3 subrules: <pre> a : (A|B)+ (C)* (D)? ; </pre> <h3><a id="Syntax" name="Syntax">Syntax</a></h3> See <a href="http://www.jguru.com/faq/view.jsp?EID=81"><b>What do "syntax" and "semantics" mean and how are they different?</b></a>. <h3><a id="Token" name="Token">Token</a></h3> A vocabulary symbol for a language. This term typically refers to the vocabulary symbols of a parser. A token may represent a constant symbol such as a keyword like <tt>begin</tt> or a "class" of input symbols like <tt>ID</tt> or <tt>INTEGER_LITERAL</tt>. <h3><a id="Token_stream" name="Token_stream">Token stream</a></h3> See <a href="http://www.antlr.org/doc/streams.html"><b>Token Streams</b></a> in the ANTLR documentation. <h3><a id="Tree" name="Tree">Tree</a></h3> See AST and <a href="http://www.jguru.com/faq/view.jsp?EID=814505"><b>What's the difference between a parse tree and an abstract syntax tree (AST)? Why doesn't ANTLR generate trees with nodes for grammar rules like JJTree does?</b></a>. <h3><a id="Tree_parser" name="Tree_parser">Tree parser</a></h3> A recognizer that applies a grammatical structure to a two-dimensional input tree. Grammatical rules are like an "executable comment" that describe the tree structure. These parsers are useful during translation to (i) annotate trees with, for example, symbol table information, (2) perform tree rewrites, and (3) generate output. <h3><a id="Vocabulary" name="Vocabulary">Vocabulary</a></h3> The set of symbols used to construct sentences in a language. These symbols are usually called tokens or token types. For lexers, the vocabulary is a set of characters. <h3><a id="Wow" name="Wow">Wow</a></h3> See ANTLR. </font> </body> </html>