Rather than have the programmer specify a bunch of command-line arguments to the parser generator, ANTLR uses an options section within the grammar itself. This solution is preferable because it associates the required options with the grammar rather than with the ANTLR invocation. The section is introduced by the options keyword and contains a series of option/value assignments surrounded by curly braces, such as:
options { k = 2; tokenVocabulary = IDL; defaultErrorHandler = false; }
The options section for an entire (.g) file, if specified, immediately follows the (optional) file header:
header { package X; }
options { language="FOO"; }
The options section for a grammar, if specified, must immediately follow the ';' of the class specifier:
class MyParser extends Parser;
options { k=2; }
The options section for a rule, if specified, must immediately follow the rule name:
myrule[args] returns [retval]
options { defaultErrorHandler=false; }
  : // body of rule...
  ;
The option names are not keywords in ANTLR, but rather are entries in a symbol table examined by ANTLR. The scope of option names is limited to the options section; identifiers within your grammar may overlap with these symbols.
The only ANTLR options not specified in the options section are those that vary not with the grammar, but with the invocation of ANTLR itself. The best example is debugging information. Typically, the programmer will want a makefile to change an ANTLR flag indicating a debug or release build.
Key for the type column: F=file, G=grammar, R=rule, L=lexer, S=subrule, C=C++ only.
Symbol | Type | Description |
---|---|---|
language | F | Set the generated language |
k | G | Set the lookahead depth |
importVocab | G | Initial grammar vocabulary |
exportVocab | G | Vocabulary exported from grammar |
testLiterals | LG,LR | Generate literal-testing code |
defaultErrorHandler | G,R | Control default exception-handling |
greedy | S | When false, a subrule loop, (...)* or (...)+, exits when it sees lookahead consistent with what follows the loop. |
codeGenMakeSwitchThreshold | G | Control code generation |
codeGenBitsetTestThreshold | G | Control code generation |
buildAST | G | Set automatic AST construction in Parser (transform mode in Tree-Parser) |
analyzerDebug | G | Spit out lots of debugging information while performing grammar analysis. |
codeGenDebug | G | Spit out lots of debugging information while doing code generation. |
ASTLabelType | G | Specify the type of all user-defined labels, overrides default of AST. |
charVocabulary | LG | Set the lexer character vocabulary |
interactive | G | Both the lexer and the parser have an interactive option, which defaults to "false". See the parser speed section above. |
caseSensitive | LG | When false, case is ignored when comparing against character and string literals in the lexer. The case of the input stream is maintained when stored in the token objects. |
ignore | LR | Specify a lexer rule to use as whitespace between lexical rule atomic elements (chars, strings, and rule references). The grammar analysis and, hence, the lookahead sets are aware of the whitespace references. |
paraphrase | LR | An easy way to specify a string to use in place of the token name during error processing. |
caseSensitiveLiterals | LG | When false, case is ignored when comparing tokens against the literals table. |
classHeaderPrefix | G | Replace the usual class prefix ("public" in Java) for the enclosing class definition. |
classHeaderSuffix | G | Append a string to the enclosing class definition. In Java, this amounts to a comma-separated list of interfaces that your lexer, parser, or tree walker must implement. |
mangleLiteralPrefix | F | Sets the prefix for the token type definitions of literals rather than using the default of "TOKEN_". |
warnWhenFollowAmbig | S | Warnings will be printed when the lookahead set of what follows a subrule containing an empty alternative conflicts with a subrule alternative or when the implicit exit branch of a closure loop conflicts with an alternative. The default is true. |
generateAmbigWarnings | S | When false, no ambiguity/nondeterminism warning is generated for the decision associated with the subrule. Use this very carefully: you may change the subrule and miss an ambiguity because of the option. Make very sure that the ambiguity you mask is handled properly by ANTLR. ANTLR-generated parsers resolve ambiguous decisions by consuming input as soon as possible (or by choosing the alternative listed first). See the Java and HTML grammars for proper use of this option. A comment should be supplied for each use indicating why it is ok to shut off the warning. |
filter | LG | When true, the lexer ignores any input not exactly matching one of the nonprotected lexer rules. When set to a rule name, the filter option uses that rule to parse input characters between valid tokens or those tokens of interest. |
namespace | FGC | When set, all the C++ code generated is wrapped in the namespace mentioned here. |
namespaceStd | FGC | When set, the ANTLR_USE_NAMESPACE(std) macros in the generated C++ code are replaced by this value. This is a cosmetic option that only makes the code more readable. It does not replace this macro in the support C++ files. Note: use this option directly after setting the language to C++. |
namespaceAntlr | FGC | When set, the ANTLR_USE_NAMESPACE(antlr) macros in the generated C++ code are replaced by this value. This is a cosmetic option that only makes the code more readable. It does not replace this macro in the support C++ files. Note: use this option directly after setting the language to C++. |
genHashLines | FGC | Boolean toggle. When set to 'true', #line <linenumber> "filename" lines are inserted in the generated code so that compiler errors and warnings refer to the .g files. |
noConstructors | FGLC | Boolean toggle. When set to 'true', the default constructors for the generated lexer/parser/treewalker are omitted. The user can then supply them (with extra initializers etc.). |
ANTLR supports multiple, installable code generators. Any code-generator conforming to the ANTLR specification may be invoked via the language option. The default language is "Java", but "Cpp" and "CSharp" are also supported. The language option is specified at the file-level, for example:
header { package zparse; }
options { language="Java"; }
... classes follow ...
You may set the lookahead depth for any grammar (parser, lexer, or tree-walker), by using the k= option:
class MyLexer extends Lexer;
options { k=3; }
...
Setting the lookahead depth changes the maximum number of tokens that will be examined to select alternative productions, and test for exit conditions of the EBNF constructs (...)?, (...)+, and (...)*. The lookahead analysis is linear approximate (as opposed to full LL(k) ). This is a bit involved to explain in detail, but consider this example with k=2:
r : ( A B | B A ) | A A ;
Full LL(k) analysis would resolve the ambiguity and produce a lookahead test for the first alternate like:
if ( (LA(1)==A && LA(2)==B) || (LA(1)==B && LA(2)==A) )
However, linear approximate analysis would logically OR the lookahead sets at each depth, resulting in a test like:
if ( (LA(1)==A || LA(1)==B) && (LA(2)==A || LA(2)==B) )
This test is ambiguous with the second alternate for {A,A}. Because of this, setting the lookahead depth very high tends to yield diminishing returns in most cases, because the lookahead sets at large depths will include almost everything.
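The difference between the two tests can be checked with a small stand-alone Java sketch. It is not ANTLR-generated code: the class name LookaheadDemo, the token type values, and the method names are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; names and token values are hypothetical.
class LookaheadDemo {
    // Token types named after the tokens in rule r.
    static final int A = 1, B = 2;

    // Full LL(2) test for the first alternative: exactly the sequences {A,B} and {B,A}.
    static boolean fullLL2(int la1, int la2) {
        return (la1 == A && la2 == B) || (la1 == B && la2 == A);
    }

    // Linear approximate analysis keeps one set per depth and ORs the tokens in each.
    static final Set<Integer> DEPTH1 = new HashSet<>(Arrays.asList(A, B));
    static final Set<Integer> DEPTH2 = new HashSet<>(Arrays.asList(B, A));

    static boolean linearApprox(int la1, int la2) {
        return DEPTH1.contains(la1) && DEPTH2.contains(la2);
    }

    public static void main(String[] args) {
        // The sequence {A,A} belongs to the second alternative of r.
        System.out.println("full LL(2) accepts {A,A} for alt 1:  " + fullLL2(A, A));
        System.out.println("linear approx accepts {A,A} for alt 1: " + linearApprox(A, A));
    }
}
```

The approximate test accepts {A,A} for the first alternative even though only the second alternative can match it, which is the conflict described above.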
[See the documentation on vocabularies for more information]
To specify an initial vocabulary (tokens, literals, and token types), use the importVocab grammar option.
class MyParser extends Parser;
options {
importVocab=V;
}
ANTLR will look for VTokenTypes.txt in the current directory and preload the token manager for MyParser with the enclosed information.
This option is useful, for example, if you create an external lexer and want to connect it to an ANTLR parser. Conversely, you may create an external parser and wish to use the token set with an ANTLR lexer. Finally, you may find it more convenient to place your grammars in separate files, especially if you have multiple tree-walkers that do not add any literals to the token set.
The vocabulary file has an identifier on the first line that names the token vocabulary that is followed by lines of the form ID=value or "literal"=value. For example:
ANTLR // vocabulary name
"header"=3
ACTION=4
COLON=5
SEMI=6
...
A file of this form is automatically generated by ANTLR for each grammar.
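If an external tool needs to read such a file, the format described above is simple enough to parse by hand. The following Java sketch is an assumption-laden illustration, not ANTLR's own reader: the class VocabFileSketch and its fields are hypothetical, and it only handles the name line (with an optional trailing // comment) and lines ending in =value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; this is not the reader ANTLR itself uses.
class VocabFileSketch {
    String name;                                          // vocabulary name from the first line
    final Map<String, Integer> tokenTypes = new LinkedHashMap<>();

    static VocabFileSketch parse(String text) {
        VocabFileSketch vocab = new VocabFileSketch();
        for (String raw : text.split("\\r?\\n")) {
            String line = raw.trim();
            if (vocab.name == null) {
                // Name line: drop a trailing "// ..." comment if present.
                int comment = line.indexOf("//");
                if (comment >= 0) line = line.substring(0, comment).trim();
                if (!line.isEmpty()) vocab.name = line;
                continue;
            }
            // Split at the last '=' so a literal that contains '=' still works.
            int eq = line.lastIndexOf('=');
            if (eq < 0) continue;                         // skip blank or unrecognized lines
            String key = line.substring(0, eq).trim();    // ID or "literal", quotes kept
            int value = Integer.parseInt(line.substring(eq + 1).trim());
            vocab.tokenTypes.put(key, value);
        }
        return vocab;
    }
}
```

Keys keep their surrounding double quotes, so a literal such as "header" stays distinguishable from a token named header.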
Note: you must take care to run ANTLR on the vocabulary-generating grammar files before you run ANTLR on the vocabulary-consuming grammar files.
[See the documentation on vocabularies for more information]
The vocabulary of a grammar is the union of the set of tokens provided by an importVocab option and the set of tokens and literals defined in the grammar. ANTLR exports a vocabulary for each grammar whose default name is the same as the grammar. So, the following grammar yields a vocabulary called P:
class P extends Parser;
a : A;
ANTLR generates files PTokenTypes.txt and PTokenTypes.java.
You can specify the name of the exported vocabulary with the exportVocab option. The following grammar generates a vocabulary called V, not P.
class P extends Parser;
options { exportVocab=V; }
a : A;
All grammars in the same file with the same vocabulary name contribute to the same vocabulary (and resulting files). If the grammars were in separate files, on the other hand, they would all overwrite the same file. For example, the following parser and lexer grammars both may contribute literals and tokens to the MyTokens vocabulary.
class MyParser extends Parser;
options { exportVocab=MyTokens; }
...
class MyLexer extends Lexer;
options { exportVocab=MyTokens; }
...
By default, ANTLR will generate code in all lexers to test each token against the literals table (the table generated for literal strings), and change the token type if it matches the table. However, you may suppress this code generation in the lexer by using a grammar option:
class L extends Lexer;
options { testLiterals=false; }
...
If you turn this option off for a lexer, you may re-enable it for specific rules. This is useful, for example, if all literals are keywords, which are special cases of ID:
ID options { testLiterals=true; } : LETTER (LETTER | DIGIT)* ;
If you want to test only a portion of a token's text for a match in the literals table, explicitly test the substring within an action using method:
public int testLiteralsTable(String text, int ttype) {...}
For example, you might want to test the literals table for just the tag word in an HTML word.
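The effect of a literals-table test can be pictured with a small stand-alone Java sketch. This is a mock, not the ANTLR runtime: the class LiteralsTableSketch, the token type values, and the table contents are assumptions; only the method name and parameter list mirror the signature shown above.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative mock of a literals table; not the ANTLR runtime class.
class LiteralsTableSketch {
    // Hypothetical token types.
    static final int ID = 4, LITERAL_if = 5, LITERAL_while = 6;

    private final Map<String, Integer> literals = new HashMap<>();

    LiteralsTableSketch() {
        literals.put("if", LITERAL_if);
        literals.put("while", LITERAL_while);
    }

    // Return the literal's token type if text is in the table, else the type passed in.
    int testLiteralsTable(String text, int ttype) {
        Integer literalType = literals.get(text);
        return literalType != null ? literalType : ttype;
    }
}
```

With this table, testing the text of an ID token that reads if changes its type to LITERAL_if, while any other identifier keeps the type ID.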
By default, ANTLR will generate default exception-handling code for a parser or tree-parser rule. The generated code will catch any parser exceptions, synchronize to the follow set of the rule, and return. This is a simple and often useful error-handling scheme, but it is not very sophisticated. Eventually, you will want to install your own exception handlers. ANTLR will automatically turn off generation of default exception handling for any rule where an exception handler is specified. You may also explicitly control generation of default exception handling on a per-grammar or per-rule basis. For example, this will turn off default error handling for the entire grammar, but turn it back on for rule "r":
class P extends Parser;
options { defaultErrorHandler=false; }
r
options { defaultErrorHandler=true; }
  : A B C;
For more information on exception handling in the lexer, see the lexer documentation.
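The "catch, synchronize to the follow set, and return" behavior of the default handler for a parser rule can be sketched in plain Java. This is a simplified stand-in, not what ANTLR actually emits: the class DefaultHandlerSketch, the token type values, the use of IllegalStateException, and the assumed FOLLOW set { SEMI, EOF } for rule r are all hypothetical.

```java
import java.util.BitSet;

// Illustrative sketch of default error recovery for "r : A B C ;".
class DefaultHandlerSketch {
    // Hypothetical token types.
    static final int EOF = 1, A = 4, B = 5, C = 6, SEMI = 7;

    // Assumed FOLLOW set of rule r; EOF is included so recovery always stops.
    static final BitSet FOLLOW_R = new BitSet();
    static { FOLLOW_R.set(EOF); FOLLOW_R.set(SEMI); }

    private final int[] tokens;
    int p = 0;        // index of the next token
    int errors = 0;   // number of syntax errors recovered from

    DefaultHandlerSketch(int... tokens) { this.tokens = tokens; }

    int LA1() { return p < tokens.length ? tokens[p] : EOF; }

    void consume() { if (p < tokens.length) p++; }

    void match(int type) {
        if (LA1() != type) {
            throw new IllegalStateException("expecting " + type + ", found " + LA1());
        }
        consume();
    }

    // r : A B C ;
    void r() {
        try {
            match(A); match(B); match(C);
        } catch (IllegalStateException ex) {
            errors++;                                  // stands in for error reporting
            while (!FOLLOW_R.get(LA1())) consume();    // synchronize to the follow set
        }
    }
}
```

After a mismatch, the rule returns normally with the input positioned at a token that may legally follow it, so the caller can continue parsing.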
ANTLR will optimize lookahead tests by generating a switch statement instead of a series of if/else tests for rules containing a sufficiently large number of alternates whose lookahead is strictly LL(1). The option codeGenMakeSwitchThreshold controls this test. You may want to change this to control optimization of the parser. You may also want to disable it entirely for debugging purposes, by setting it to a large number:
class P extends Parser;
options { codeGenMakeSwitchThreshold=999; }
...
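The two code shapes that this threshold chooses between can be compared in a short Java sketch. It is hand-written for illustration rather than copied from generated output; the class SwitchThresholdSketch, the token type values, and the string results are assumptions.

```java
// Illustrative sketch of the two decision shapes for an LL(1) choice.
class SwitchThresholdSketch {
    // Hypothetical token types (compile-time constants, so usable as case labels).
    static final int A = 4, B = 5, C = 6;

    // Shape used when the number of alternatives is below the threshold.
    static String chooseWithIfElse(int la1) {
        if (la1 == A) return "alt1";
        else if (la1 == B) return "alt2";
        else if (la1 == C) return "alt3";
        else return "error";
    }

    // Shape used when the number of alternatives reaches the threshold.
    static String chooseWithSwitch(int la1) {
        switch (la1) {
            case A: return "alt1";
            case B: return "alt2";
            case C: return "alt3";
            default: return "error";
        }
    }
}
```

Both methods select the same alternative for every token type; only the form of the test differs, which is why a large threshold is convenient when stepping through the parser in a debugger.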
ANTLR will optimize lookahead tests by generating a bitset test instead of an if statement, for very complex lookahead sets. The option codeGenBitsetTestThreshold controls this test. You may want to change this to control optimization of the parser:
class P extends Parser;
// make bitset if test involves five or more terms
options { codeGenBitsetTestThreshold=5; }
...
You may also want to disable it entirely for debugging purposes, by setting it to a large number:
class P extends Parser;
options { codeGenBitsetTestThreshold=999; }
...
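To see what the threshold trades off, here is a hedged Java sketch of a five-term lookahead test written both ways. It uses java.util.BitSet as a stand-in for whatever bitset class the generated parser relies on; the class BitsetTestSketch and the token type values are assumptions.

```java
import java.util.BitSet;

// Illustrative sketch of a chained comparison versus a bitset membership test.
class BitsetTestSketch {
    // Hypothetical token types.
    static final int PLUS = 4, MINUS = 5, STAR = 6, SLASH = 7, MOD = 8, ID = 9;

    // Below the threshold: one comparison per term.
    static boolean isOperatorChained(int la1) {
        return la1 == PLUS || la1 == MINUS || la1 == STAR || la1 == SLASH || la1 == MOD;
    }

    // At or above the threshold: a precomputed set and a single membership test.
    static final BitSet OPERATORS = new BitSet();
    static {
        OPERATORS.set(PLUS);
        OPERATORS.set(MINUS);
        OPERATORS.set(STAR);
        OPERATORS.set(SLASH);
        OPERATORS.set(MOD);
    }

    static boolean isOperatorBitset(int la1) {
        return OPERATORS.get(la1);
    }
}
```

The bitset form is compact but hides the individual token types, so the chained form is usually easier to read while debugging.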
In a Parser, you can tell ANTLR to generate code to construct ASTs corresponding to the structure of the recognized syntax. The buildAST option, if set to true, will cause ANTLR to generate AST-building code. With this option set, you can then use all of the AST-building syntax and support methods.
In a Tree-Parser, this option turns on "transform mode", which means an output AST will be generated that is a transformation of the input AST. In a tree-walker, the default action of buildAST is to generate a copy of the portion of the input AST that is walked. Tree-transformation is almost identical to building an AST in a Parser, except that the input is an AST, not a stream of tokens.
When you must define your own AST node type, your actions within the grammar will require lots of downcasting from AST (the default type of any user-defined label) to your tree node type; e.g.,
decl : d:ID {MyAST t=(MyAST)#d;} ;
This makes your code a pain to type in and hard to read. To avoid this, use the grammar option ASTLabelType to have ANTLR automatically do casts and define labels of the appropriate type.
class ExprParser extends Parser;
options { buildAST=true; ASTLabelType = "MyAST"; }
expr : a:term ;
The type of #a within an action is MyAST, not AST.
ANTLR processes Unicode. Because of this, ANTLR cannot make any assumptions about the character set in use, else it would wind up generating huge lexers. Instead, ANTLR assumes that the character literals, string literals, and character ranges used in the lexer constitute the entire character set of interest. For example, in this lexer:
class L extends Lexer;
A : 'a';
B : 'b';
DIGIT : '0' .. '9';
The implied character set is { 'a', 'b', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' }. This can produce unexpected results if you assume that the normal ASCII character set is always used. For example, in:
class L extends Lexer;
A : 'a';
B : 'b';
DIGIT : '0' .. '9';
STRING: '"' (~'"')* '"';
The lexer rule STRING will only match strings containing 'a', 'b' and the digits, which is usually not what you want. To control the character set used by the lexer, use the charVocabulary option. This example will use a general eight-bit character set.
class L extends Lexer;
options { charVocabulary = '\3'..'\377'; }
...
This example uses the ASCII character set in conjunction with some values from the extended Unicode character set:
class L extends Lexer;
options { charVocabulary = '\3'..'\377' | '\u1000'..'\u1fff'; }
...
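The reason STRING behaves so differently under the two vocabularies is that a complement such as ~'"' is taken relative to the character vocabulary. The Java sketch below models that with java.util.BitSet; it is an illustration under that assumption, and the class CharVocabSketch and its method names are hypothetical.

```java
import java.util.BitSet;

// Illustrative sketch: complement of '"' relative to a character vocabulary.
class CharVocabSketch {
    // ~'"' means every character in the vocabulary except the double quote.
    static BitSet complementOfQuote(BitSet vocabulary) {
        BitSet result = (BitSet) vocabulary.clone();
        result.clear('"');
        return result;
    }

    // Vocabulary implied by the literals 'a', 'b', '"' and the range '0'..'9'.
    static BitSet impliedVocabulary() {
        BitSet v = new BitSet();
        v.set('a');
        v.set('b');
        v.set('"');
        v.set('0', '9' + 1);   // range end is exclusive
        return v;
    }

    // Vocabulary given by charVocabulary = '\3'..'\377'.
    static BitSet eightBitVocabulary() {
        BitSet v = new BitSet();
        v.set(3, 0377 + 1);    // octal 377 is 255; range end is exclusive
        return v;
    }
}
```

With the implied vocabulary, the complement contains only 'a', 'b' and the digits, so a character such as 'x' inside a string is rejected; with the eight-bit vocabulary it is accepted.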
[Warning: you should know what you are doing before you use the warnWhenFollowAmbig option. I deliberately made it a pain to shut warnings off (rather than using a single-character operator) so you would not just start turning off all the warnings. I thought for a long time before implementing this exact mechanism. I recommend a comment in front of any use of this option that explains why it is ok to hush the warning.]
The warnWhenFollowAmbig subrule option is true by default and controls the generation of nondeterminism (ambiguity) warnings when comparing the FOLLOW lookahead sets for any subrule with an empty alternative and any closure subrule such as (...)+ and (...)*. For example, the following simple rule has a nondeterministic subrule, which arises from a language ambiguity: you could attach an ELSE clause to the most recent IF or to an outer IF because the construct can nest.
stat : "if" expr "then" stat ("else" stat)?
  | ID ASSIGN expr SEMI
  ;
Because the language is ambiguous, the context-free grammar must be ambiguous and the resulting parser nondeterministic (in theory). However, being the practical language folks that we are, we all know you can trivially solve this problem by having ANTLR resolve conflicts by consuming input as soon as possible; I have yet to see a case where this was the wrong thing to do, by the way. This option, when set to false, merely informs ANTLR that it has made the correct assumption and can shut off the ambiguity warning related to this subrule and an empty alternative or exit path. Here is a version of the rule that does not yield a warning message:
stat : "if" expr "then" stat
    ( // standard if-then-else ambig
      options { warnWhenFollowAmbig=false; }
    : "else" stat
    )?
  | ID ASSIGN expr SEMI
  ;
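The "consume input as soon as possible" resolution can be demonstrated with a tiny hand-written recursive-descent parser in Java. This is a sketch under simplifying assumptions, not ANTLR output: the class DanglingElseSketch is hypothetical, tokens are plain strings, "e" stands for any expression, and "s" stands for the assignment alternative.

```java
// Illustrative sketch: greedy handling of the optional "else" clause.
class DanglingElseSketch {
    private final String[] tokens;
    private int p = 0;

    DanglingElseSketch(String... tokens) { this.tokens = tokens; }

    private String LA1() { return p < tokens.length ? tokens[p] : "<EOF>"; }

    private void match(String expected) {
        if (!LA1().equals(expected)) {
            throw new IllegalStateException("expecting " + expected + ", found " + LA1());
        }
        p++;
    }

    // stat : "if" "e" "then" stat ("else" stat)? | "s" ;
    // Returns a parenthesized string showing how the input was grouped.
    String stat() {
        if (LA1().equals("if")) {
            match("if"); match("e"); match("then");
            String thenPart = stat();
            // Greedy choice: if "else" is next, consume it here rather than
            // leaving it for an enclosing "if".
            if (LA1().equals("else")) {
                match("else");
                return "(if " + thenPart + " else " + stat() + ")";
            }
            return "(if " + thenPart + ")";
        }
        match("s");
        return "s";
    }
}
```

For the input if e then if e then s else s, the else attaches to the inner if, which is the conventional interpretation and the one that suppressing the warning asserts to be correct.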
One important note: This option does not affect non-empty alternatives. For example, you will still get a warning for the following subrule between alts 1 and 3 (upon lookahead A):
( options { warnWhenFollowAmbig=false; } : A | B | A )
Further, this option is insensitive to lookahead. Only completely empty alternatives count as candidate alternatives for hushing warnings. So, at k=2, just because ANTLR can see past alternatives with single tokens, you still can get warnings.
Option | Description |
---|---|
-o outputDir | Specify the output directory where all output is generated. |
-glib supergrammarFile | Specify a file with a supergrammar for the generated file. |
-debug | Launch the ParseView debugger upon parser invocation. Unless you have downloaded and unzipped the debugger over the top of the standard ANTLR distribution, the code emanating from ANTLR with this option will not compile (likewise for Swing). |
-html | Generate an HTML file from your grammar without actions and so on. This is only a prototype, but seems to do something useful. It only works for parsers, not lexers or tree parsers. |
-docbook | Generate a DocBook SGML file from your grammar without actions and so on. This is only a prototype, but seems to do something useful. It only works for parsers, not lexers or tree parsers. |
-diagnostic | Generate a text file from your grammar with a lot of debugging info. |
-trace | Have all rules call traceIn/traceOut. |
-traceParser | Have parser rules call traceIn/traceOut. |
-traceLexer | Have lexer rules call traceIn/traceOut. |
-traceTreeParser | Have tree walker rules call traceIn/traceOut. |
-h, -help, --help | Print a help message. |
Version: $Id: //depot/code/org.antlr/release/antlr-2.7.6/doc/options.html#1 $