NAME
flexc++input - Organization of flexc++’s input files
DESCRIPTION
Flexc++(1) was designed after flex(1) and flex++(1). Like these latter two
programs, flexc++ generates code performing pattern-matching on text, possibly
executing actions when certain regular expressions are recognized.
Refer to flexc++(1) for a general overview. This manual page describes how
flexc++’s input files should be organized. It contains the following sections:
- 1. SPECIFICATION FILE(S): the format and contents of flexc++ input files,
specifying the Scanner’s characteristics
- 2. FILE SWITCHING: how to switch to another input specification file
- 3. DIRECTIVES: directives that can be used in input specification files
- 4. MINI SCANNERS: how to declare mini-scanners
- 5. DEFINITIONS: how to define symbolic names for regular expressions
- 6. %% SEPARATOR: the separator between the input specification sections
- 7. REGULAR EXPRESSIONS: regular expressions supported by flexc++
- 8. SPECIFICATION EXAMPLE: an example of a specification file
1. SPECIFICATION FILE(S)
Flexc++ expects an input file containing directives and the regular
expressions that should be recognized by objects of the scanner class
generated by flexc++. In this man page the elements and organization of
flexc++’s input file are described.
Flexc++’s input file consists of two sections, separated from each
other by a line merely containing two consecutive percent characters:
%%
The section before this separator contains directives; the section following
this separator contains regular expressions and possibly actions to perform
when these regular expressions are matched by the object of the scanner class
generated by flexc++. If a second line beginning with two consecutive percent
characters is encountered, it ends flexc++’s input file processing. See also
section 6 (%% SEPARATOR) below.
White space is usually ignored, as are comments. Comments may be of the
traditional C form (i.e., /*, followed by (possibly multi-line) comment text,
followed by */), or of the C++ end-of-line form: two consecutive slashes (//)
start the comment, which continues up to the next newline character.
2. FILE SWITCHING
Flexc++’s input file may be split into multiple files. This allows
for the definition of logically separate elements of the specifications in
different files. Include directives must be specified on a line of their own.
To switch to another specification file the following stanza is used:
//include file-location
The
//include directive starts in the line’s first column. File
locations can be absolute or relative to the location of the file containing
the
//include directive. White space characters following
//include and before the end of the line are ignored. The file
specification may be surrounded by double quotes, but these double quotes are
not required and are ignored (removed) if present. All remaining characters
are expected to define the name of the file where
flexc++’s
rules specifications continue. Once end of file of a sub-file has been
reached, processing continues at the line beyond the
//include
directive of the previously scanned file. The end-of-file of the file that was
initially specified when
flexc++ was called indicates the end of
flexc++’s rules specification.
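The handling of an //include line described above can be sketched in C++. This is only an illustration of the stated rules, not flexc++’s own code: the function name includeTarget is hypothetical, and requiring at least one blank between //include and the file location is an assumption of the sketch.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper (not part of flexc++): returns the file location named
// on an //include line, or std::nullopt if the line is not such a directive.
std::optional<std::string> includeTarget(std::string_view line)
{
    constexpr std::string_view keyword = "//include";
    constexpr std::string_view blanks = " \t\r\n";

    // The directive must start in the line's first column.
    if (line.substr(0, keyword.size()) != keyword)
        return std::nullopt;
    line.remove_prefix(keyword.size());

    // Assumption: at least one blank separates the keyword from the location.
    auto first = line.find_first_not_of(blanks);
    if (first == std::string_view::npos || first == 0)
        return std::nullopt;

    // White space following //include and before the end of the line is ignored.
    auto last = line.find_last_not_of(blanks);
    line = line.substr(first, last - first + 1);

    // Surrounding double quotes are optional and are removed if present.
    if (line.size() >= 2 && line.front() == '"' && line.back() == '"')
        line = line.substr(1, line.size() - 2);

    return std::string(line);
}
```

Under these assumptions, includeTarget("//include \"my rules\"") yields the location my rules, while a line whose //include does not start in the first column yields no location at all.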
3. DIRECTIVES
The first section of
flexc++’s input file consists of directives.
In addition it may associate regular expressions with symbolic names, allowing
you to use these identifiers in the rules section. Each directive is defined
on a line of its own. When available, directives are overridden by
flexc++ command line options.
Some directives require arguments, which are usually provided following
separating (but optional)
= characters. Arguments of directives are
text, surrounded by double quotes (strings), or embedded in raw string
literals (rawstrings). Double quotes or backslashes inside strings must
themselves be preceded by backslashes; these backslashes are not required when
rawstrings are used.
The
%s and
%x directives are immediately followed by name lists,
consisting of identifiers separated by blanks. Here is an example of the
definition of a directive:
%class-name = "MyScanner"
Directives accepting a `filename’ do not accept path names, i.e., they
cannot contain directory separators (/); directives accepting a
`pathname’ may contain directory separators. A `pathname’ containing blank
characters should be surrounded by double quotes.
Some directives may generate errors. This happens when a directive conflicts
with the contents of an existing file which flexc++ cannot modify
(e.g., a scanner class header file exists that doesn’t define a namespace,
while a %namespace directive was provided). To solve the error the
offending directive could be omitted, the existing file could be removed, or
the existing file could be hand-edited according to the directive’s
specification. Note that flexc++ currently does not handle the opposite
error condition: if a previously used directive is omitted, then
flexc++ does not detect the inconsistency. In those cases you may
encounter compilation errors.
- %baseclass-header = "filename"
Defines the name of the file to contain the scanner class’s base
class interface. Corresponding command-line option:
--baseclass-header.
- It is an error if this directive is used and an already existing
scanner-class header file does not include `filename’.
- %case-insensitive
Generates a scanner which case insensitively matches regular
expressions. All regular expressions specified in flexc++’s
input file are interpreted case insensitively and the resulting scanner
object will case insensitively interpret its input.
- Corresponding command-line option: --case-insensitive.
- When this directive is specified the resulting scanner does not
distinguish between the following rules:
First // initial F is transformed to f
first
FIRST // all capitals are transformed to lower case chars
With a case-insensitive scanner only the first rule can be matched, and
flexc++ will issue warnings for the second and third rule about
rules that cannot be matched.
- Input processed by a case-insensitive scanner is also handled case
insensitively. The above mentioned First rule is matched for all of
the following input words: first First FIRST firST.
- Although the matching process proceeds case insensitively, the matched
text (as returned by the scanner’s matched() member) always
contains the original, unmodified text. So, with the above input
matched() returns, respectively first, First, FIRST and
firST, while matching the rule First.
- %class-header = "filename"
Defines the name of the file to contain the scanner class’s
interface. Corresponding command-line option: --class-header.
- %class-name = "className"
Declares the name of the scanner class generated by flexc++. This
directive corresponds to the %name directive used by
flex++(1). Contrary to flex++’s %name
declaration, class-name may appear anywhere in the first section of
the grammar specification file. It may be defined only once. If no
class-name is specified the default class name ( Scanner) is
used. Corresponding command-line option: --class-name.
- It is an error if this directive is used and an already existing
scanner-class header file does not define class
`className’.
- %debug
Provide lex and its support functions with debugging code, showing
the actual parsing process on the standard output stream. When included,
the debugging output is active by default, but its activity may be
controlled using the setDebug(bool on-off) member. Note that
no #ifdef DEBUG macros are used in the generated code.
- %filenames = "basename"
Defines the basename of the Scanner.h, Scanner.ih, and
Scannerbase.h files. E.g., when using the directive
%filenames = "scanner"
the names of the generated files are, respectively, scanner.h,
scanner.ih, and scannerbase.h. Corresponding command-line
option: --filenames. The name of the source file (by default
lex.cc) is controlled by the %lex-source directive.
- %implementation-header = "filename"
Defines the name of the file to contain the implementation header.
Corresponding command-line option: --implementation-header.
- It is an error if this directive is used and an already existing
`filename’ file does not include the scanner class header file.
- %input-implementation = "sourcefile"
Defines the pathname of the file containing the implementation of a
user-defined Input class.
- %input-interface = "interface"
Defines the pathname of the file containing the interface of a user-defined
Input class. See section 17. THE CLASS INPUT in the
flexc++api(3) manual page for additional information about
user-defined Input classes.
- %interactive
Generate an interactive scanner. An interactive scanner reads lines from the
input stream, and then returns the tokens encountered on that line. The
interactive scanner implemented by flexc++ only predefines the
Scanner(std::istream &in, std::ostream &out) constructor,
by default assuming that input is read from std::cin. See also
section 1. INTERACTIVE SCANNER in the flexc++api(3) manual page.
- %lex-function-name = "funname"
Defines the name of the scanner class’s member to perform the lexical
scanning. If this directive is omitted the default name ( lex) is
used. Corresponding command-line option: --lex-function-name.
- %lex-source = "filename"
Defines the name of the file to contain the scanner member lex.
Corresponding command-line option: --lex-source.
- %no-lines
Do not put #line preprocessor directives in the file containing the
scanner’s lex function. If omitted #line directives
are added to this file, unless overridden by the command line options
--lines and --no-lines.
- %namespace = "identifier"
Define the scanner class in the namespace identifier. By default no
namespace is used. If this directive is used the implementation header is
provided with a commented out using namespace declaration
for the requested namespace. In addition, the scanner and scanner base
class header files also use the specified namespace to define their
include guard directives.
- It is an error if this directive is used and an already existing
scanner-class header file does not define namespace identifier.
- %print-tokens
This directive causes the tokens as well as the matched text to be
displayed on the standard output stream, just before returning the token
to lex’s caller. Displaying is suppressed again when the
lex.cc file is generated without using this directive. The function
showing the tokens ( ScannerBase::print__) is called from
Scanner::print(), which is defined in-line in Scanner.h.
Calling ScannerBase::print__, therefore, can also easily be
made to depend on an option handled by the program using the scanner
object. This directive does not show the tokens returned and text
matched by flexc++ itself when reading its input files. If that
is what you want, use the --own-tokens option.
- %s namelist
The %s directive is followed by a list of one or more identifiers,
separated by blanks. Each identifier is the name of an inclusive start
condition.
- %skeleton-directory = "pathname"
Use pathname rather than the default (e.g.,
/usr/share/flexc++) path when looking for flexc++’s
skeleton files. Corresponding command-line option:
--skeleton-directory.
- %target-directory = "pathname"
Pathname defines the directory where generated files should be
written. By default this is the directory where flexc++ is called.
This directive is overruled by the --target-directory command-line
option.
- %x namelist
The %x directive is followed by a list of one or more identifiers,
separated by blanks. Each identifier is the name of an exclusive start
condition.
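The behaviour described for %case-insensitive (matching ignores case, but the matched text keeps its original spelling) can be imitated with the C++ standard library. The sketch below uses std::regex as a stand-in and does not use a flexc++-generated scanner; the function name matchedText merely echoes the scanner’s matched() member and is hypothetical.

```cpp
#include <regex>
#include <string>

// Stand-in for the rule `First' in a case-insensitive scanner: returns the
// text as it appeared in the input, or an empty string if there is no match.
std::string matchedText(std::string const &input)
{
    static std::regex const first("First", std::regex::icase);

    std::smatch match;
    return std::regex_match(input, match, first) ? match.str() : std::string{};
}
```

Called with first, First, FIRST and firST, this function returns each word unchanged, mirroring what the text above describes for matched().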
4. MINI SCANNERS
Mini scanners come in two flavors: inclusive mini scanners and exclusive mini
scanners. The rules that apply to an inclusive mini scanner are the mini
scanner’s own rules as well as the rules which apply to no mini
scanners in particular (i.e., the rules that apply to the default (or
INITIAL) mini scanner). Exclusive mini scanners only use the rules that
were defined for them.
To define an inclusive mini scanner use %s, followed by one or more
identifiers specifying the name(s) of the mini-scanner(s). To define an
exclusive mini scanner use %x, followed by one or more identifiers
specifying the name(s) of the mini-scanner(s). The following example defines
the names of two mini scanners: string and comment:
%x string comment
Following this, rules defined in the context of the
string mini scanner
(see below) will only be used when that mini scanner is active.
A
flexc++ input file may contain multiple
%s and
%x
specifications.
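The difference between inclusive and exclusive mini scanners can be summarized as a small decision function. The following C++ sketch is a model of the description above, not flexc++’s implementation; the MiniScanner type and the ruleIsActive function are hypothetical names introduced for the illustration.

```cpp
#include <string>

// Hypothetical model of a mini scanner; these are not flexc++'s own types.
struct MiniScanner
{
    std::string name;
    bool inclusive;     // true: declared with %s, false: declared with %x
};

// ruleOwner names the mini scanner a rule was written for. Rules written for
// no mini scanner in particular are taken to belong to "INITIAL".
bool ruleIsActive(std::string const &ruleOwner, MiniScanner const &active)
{
    if (ruleOwner == active.name)
        return true;                    // a mini scanner always uses its own rules

    // Only inclusive mini scanners additionally use the INITIAL rules.
    return active.inclusive && ruleOwner == "INITIAL";
}
```

With %x string, a rule owned by INITIAL is inactive while string is active; had string been declared with %s, the same rule would be active.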
5. DEFINITIONS
Definitions are of the form
identifier regular-expression
Each definition must be entered on a line of its own. Definitions associate
identifiers with regular expressions, allowing the use of
{identifier}
as synonym for its regular expression in the rules section of
flexc++’s input file. Once defined, the identifiers representing
regular expressions can also be used in subsequent definitions.
Example:
FIRST [A-Za-z_]
NAME {FIRST}[-A-Za-z0-9_]*
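As a rough mental model, a reference like {FIRST} can be read as its definition placed between parentheses (section 7 states that name expansion has the same precedence as grouping). The C++ sketch below performs that textual substitution. It is an illustration under simplifying assumptions, not the way flexc++ processes definitions: expandNames and Definitions are hypothetical names, and quoted strings and character classes in the pattern are not given any special treatment.

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <string>

using Definitions = std::map<std::string, std::string>;

// Replaces every {identifier} for which a definition exists by that
// definition, wrapped in parentheses. Other text is copied unchanged.
std::string expandNames(std::string const &pattern, Definitions const &definitions)
{
    static std::regex const reference(R"(\{([A-Za-z_][A-Za-z0-9_]*)\})");

    std::string result;
    std::size_t copied = 0;             // pattern[0, copied) was handled already

    for (std::sregex_iterator it(pattern.begin(), pattern.end(), reference), end;
         it != end; ++it)
    {
        auto definition = definitions.find((*it)[1].str());
        if (definition == definitions.end())
            continue;                   // unknown name: leave the text untouched

        auto start = static_cast<std::size_t>(it->position());
        result.append(pattern, copied, start - copied);
        result += '(' + definition->second + ')';
        copied = start + static_cast<std::size_t>(it->length());
    }
    result.append(pattern, copied, std::string::npos);
    return result;
}
```

Given the definition of FIRST from the example, expanding the pattern used for NAME yields ([A-Za-z_])[-A-Za-z0-9_]*; interval specifications such as {2,3} are left alone because they do not look like identifiers.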
6. %% SEPARATOR
Following directives and definitions a line merely containing two consecutive
% characters is expected. Following this line the rules are defined.
Rules consist of regular expressions which should be recognized, possibly
followed by actions to be executed once a rule’s regular expression has
been matched.
If the rule section contains a line starting with two consecutive
%
characters, then any remaining input is ignored. Note that this second
%% separator does not have to be specified. It is purely optional. To
specify a regular expression starting with
%% surround the
%%
with double quotes (
"%%") or prefix the
%% with a
blank space: the
%%-characters are only considered a separator if they
are encountered at the very beginning of a line.
7. REGULAR EXPRESSIONS
The regular expressions defined in
flexc++’s rules files are
matched against the information passed to the scanner’s
lex
function.
Regular expressions begin at the first non-blank character on a line. Comment is
interpreted as comment as long as it isn’t part of the regular
expression. To define a regular expression starting with two slashes, escape
or double quote at least the first slash (e.g., "//".* defines
C++ comment to end-of-line).
Regular expressions end at the first blank character (to add a blank character,
e.g., a space character, to a regular expression, prefix it by a backslash or
put it in a double-quoted string).
Actions may be associated with regular expressions. At a match the action that
is associated with the regular expression is executed, after which scanning
continues when the lexical scanning function (e.g.,
lex) is called
again. Actions are not required, and regular expressions can be defined
without any actions at all. If such action-less regular expressions are
matched then the match is performed silently, after which processing
continues.
Flexc++ tries to match as many characters of the input file as possible
(i.e., it uses `greedy matching’). Non-greedy matching is accomplished
by a combination of a scanner and parser and/or by using the
`lookahead’ operator (
/).
The following regular expression `building blocks’ are available. More
complex regular expressions are created by combining them:
- x
- the character `x’;
- .
- any character (byte) except newline;
- [xyz]
- a character class; in this case, the pattern matches either an `x’,
a `y’, or a `z’. See also the paragraph about character
classes below;
- [abj-oZ]
- a character class containing a range; matches an `a’, a `b’,
any letter from `j’ through `o’, or a `Z’. See also
the paragraph about character classes below;
- [^A-Z]
- a negated character class, i.e., any character except for those in the
class. In this example, any non-capital character. See also the paragraph
about character classes below;
- "[xyz]\"foo"
- text between double quotes matches the literal string:
[xyz]"foo;
- R"([xyz]\"foo)"
- the literal string `[xyz]\"foo’ (using a raw string
literal);
- \X
- if X is `a’, `b’, `f’, `n’, `r’,
`t’, or `v’, then the ANSI-C interpretation of `\x’
is matched. Otherwise, a literal `X’ is matched (this is used to
escape operators such as `*’);
- \0
- a NUL character (ASCII code 0);
- \123
- the character with octal value 123;
- \x2a
- the character with hexadecimal value 2a;
- (r)
- the regular expression `r’; parentheses are used to override
precedence (see below);
- {name}
- the expansion of the `name’ definition;
- r*
- zero or more regular expressions `r’. This also matches the empty
string;
- r+
- one or more regular expressions `r’;
- r?
- zero or one regular expression `r’. This also matches the empty
string;
- rs
- the regular expression `r’ followed by the regular expression
`s’; called concatenation;
- r{m, n}
- regular expression `r’ at least m, but at most n times (1 <=
m <= n);
- r{m,}
- regular expression `r’ m or more times (1 <= m);
- r{m}
- regular expression `r’ exactly m times (1 <= m);
- r|s
- either regular expression `r’ or regular expression
`s’;
- r/s
- regular expression `r’ if it is followed by regular expression
`s’. The text matched by `s’ is included when determining
whether this rule results in the longest match, but `s’ is then
returned to the input before the rule’s action (if defined) is
executed.
- If flexc++ detects patterns potentially not matching any text it
generates warnings like this:
[Warning] input, line 7: null-matching regular expression
By placing the comment
//%nowarn
on the line just before a regular expression that potentially does not match
any text, the warning for that regular expression is suppressed;
- ^r
- a regular expression `r’ at the beginning of a line or file;
- r$
- a regular expression `r’, occurring at the end of a line. This
pattern is identical to `r/\n’;
- <s>r
- a regular expression `r’ in start condition `s’;
- <s1,s2,s3>r
- a regular expression `r’ in start conditions s1, s2, or s3;
- <*>r
- a regular expression `r’ in all start conditions;
- <<EOF>>
- an end-of-file;
- <s1,s2><<EOF>>
- an end-of-file when in start conditions s1 or s2.
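The two literal-string building blocks listed above ("[xyz]\"foo" and R"([xyz]\"foo)") use the same notation as C++ string literals and raw string literals. As an analogy only, the C++ fragment below shows how the two notations differ in C++ itself; it says nothing about how flexc++ stores such patterns, and the names quoted and raw are made up for the example.

```cpp
#include <string_view>

// In a double-quoted literal the backslash escapes the quote and disappears.
constexpr std::string_view quoted = "[xyz]\"foo";

// In a raw string literal the backslash is kept as an ordinary character.
constexpr std::string_view raw = R"([xyz]\"foo)";

static_assert(quoted.size() == 9, "nine characters: [xyz], a quote, foo");
static_assert(raw.size() == 10, "ten characters: the backslash is kept");
```

This corresponds to the two entries in the list: the quoted form stands for the text [xyz]"foo, the raw form for [xyz]\"foo.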
Character classes
Inside a character class all regular expression operators lose their special
meanings, except for the escape character (
\), the character range
operator
-, the end of character class operator
], and, at the
beginning of the class,
^. All ordinary escape sequences are supported,
all other escaped characters are interpreted as literal characters (e.g.,
\c is a literal
c).
To add a closing bracket to a character class use
[] or
\]. To add
a closing bracket to a negated character class use
[^] (or use
[^ followed by
\] somewhere within the character class). Minus characters are used to define
character ranges (e.g., [a-d], defining [abcd]) except in the following cases,
where flexc++ recognizes a literal minus character: [- or [^- (a minus at the
very beginning of a character class); -] (a minus at the very end of a
character class); or \- (an escaped minus character). Once a character
class has started, all subsequent characters and character ranges are added to
the set, until the final closing bracket (]) has been reached.
Operator precedence
The regular expressions listed above are grouped according to precedence, from
highest precedence at the top to lowest at the bottom. From lowest to highest
precedence, the operators are:
- |: the or-operator at the end of a line (instead of an action)
indicates that this expression’s action is identical to the action
of the next rule;
- /: the look-ahead operator;
- |: the or-operator within a regular expression;
- CHAR: individual elements of the regular expression: characters,
strings, quoted characters, escaped characters, character sets etc. are
all considered CHAR elements. Multiple CHAR elements can be
combined by enclosing them in parentheses (e.g., (abc)+ indicates
sequences of abc characters, like abcabcabc);
- *, ?, +, {: multipliers:
?: zero or one occurrence of the previous element;
+: one or more repetitions of the previous element;
*: zero or more repetitions of the previous element;
{...}: interval specification: a specified number of repetitions of
the previous element (see above for specific forms of the interval
specification);
- {+}, {-}: set operators ({+} computing the union of two
sets, {-} computing the difference of the left-hand side set minus
the elements in the right-hand side set).
The lex standard defines concatenation as having a higher precedence than the
interval expression. This is different from many other regular expression
engines, and
flexc++ follows these latter engines, giving all
`multiplication operators’ equal priority.
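The practical effect of giving the interval the same priority as * and + is that in ab{2} the interval applies to b only. std::regex is one of the engines that behave this way, so it can serve as a stand-in to show the reading that the text attributes to flexc++; the helper fullyMatches is a hypothetical name, and the code does not run flexc++ itself.

```cpp
#include <regex>
#include <string>

// True if `pattern' (std::regex's default ECMAScript syntax) matches all of `text'.
bool fullyMatches(std::string const &text, std::string const &pattern)
{
    return std::regex_match(text, std::regex(pattern));
}
```

With this helper, ab{2} matches abb but not abab; to repeat the whole sequence the grouping must be written explicitly, as in (ab){2}.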
Name expansion has the same precedence as grouping (using parentheses to
influence the precedence of the other operators in the regular expression).
Since the name expansion is treated as a group in
flexc++, it is not
allowed to use the lookahead operator in a name definition (a named pattern,
defined in the definition section).
Predefined sets of characters
Character classes can also contain character class expressions. These are
expressions enclosed inside
[: and
:] delimiters (which
themselves must appear between the
[ and
] of the character
class. Other elements may occur inside the character class as well). The
character class expressions are:
[:alnum:] [:alpha:] [:blank:]
[:cntrl:] [:digit:] [:graph:]
[:lower:] [:print:] [:punct:]
[:space:] [:upper:] [:xdigit:]
Character class expressions designate a set of characters equivalent to the
corresponding standard
C isXXX function. For example,
[:alnum:]
designates those characters for which
isalnum returns true - i.e., any
alphabetic or numeric character. For example, the following character classes
are all equivalent:
[[:alnum:]]
[[:alpha:][:digit:]]
[[:alpha:][0-9]]
[a-zA-Z0-9]
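The equivalence of such classes can be checked mechanically for 7-bit ASCII. The sketch below again uses std::regex as a stand-in, since it also supports the [:name:] notation inside bracket expressions; sameCharacters is a hypothetical helper, and the check assumes the default C locale. The form [[:alpha:][0-9]] is left out, as std::regex is not guaranteed to read it the way flexc++ does.

```cpp
#include <regex>
#include <string>

// True if both bracket expressions accept exactly the same characters in the
// range 1..127 (7-bit ASCII without NUL: a restriction of this sketch).
bool sameCharacters(std::string const &lhs, std::string const &rhs)
{
    std::regex const left(lhs);
    std::regex const right(rhs);

    for (int ch = 1; ch != 128; ++ch)
    {
        std::string const text(1, static_cast<char>(ch));
        if (std::regex_match(text, left) != std::regex_match(text, right))
            return false;
    }
    return true;
}
```

For example, sameCharacters("[[:alnum:]]", "[[:alpha:][:digit:]]") and sameCharacters("[[:alnum:]]", "[a-zA-Z0-9]") are expected to hold in the C locale.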
A negated character class such as the example
[^A-Z] above will match a
newline unless
\n (or an equivalent escape sequence) is one of the
characters explicitly present in the negated character class (e.g.,
[^A-Z\n]). This differs from the way many other regular expression
tools treat negated character classes, but unfortunately the inconsistency is
historically entrenched. Matching newlines means that a pattern like
[^"]* can match the entire input unless there’s another
quote in the input.
Flexc++ allows negation of character class expressions by prepending
^ to the POSIX character class name.
[:^alnum:] [:^alpha:] [:^blank:]
[:^cntrl:] [:^digit:] [:^graph:]
[:^lower:] [:^print:] [:^punct:]
[:^space:] [:^upper:] [:^xdigit:]
Combining character sets
The
{-} operator computes the difference of two character classes. For
example,
[a-c]{-}[b-z] represents all the characters in the class
[a-c] that are not in the class
[b-z] (which in this case, is
just the single character
a). The
{-} operator is left
associative, so
[abc]{-}[b]{-}[c] is the same as
[a].
The
{+} operator computes the union of two character classes. For
example,
[a-z]{+}[0-9] is the same as
[a-z0-9]. This operator is
useful when preceded by the result of a difference operation, as in,
[[:alpha:]]{-}[[:lower:]]{+}[q], which is equivalent to
[A-Zq]
in the
C locale.
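The set operators can be modelled with bit sets, which makes the examples above easy to verify. This is a sketch of the set arithmetic only, restricted to 7-bit ASCII and the C locale; CharSet, range, where, difference and unite are hypothetical names and not part of flexc++.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

using CharSet = std::bitset<128>;       // 7-bit ASCII only: an assumption

// The set [first-last].
CharSet range(char first, char last)
{
    int const lo = first;
    int const hi = last;
    assert(0 <= lo && lo <= hi && hi < 128);

    CharSet set;
    for (int ch = lo; ch <= hi; ++ch)
        set.set(static_cast<std::size_t>(ch));
    return set;
}

// The set of characters for which a <cctype>-style predicate holds.
CharSet where(int (*predicate)(int))
{
    CharSet set;
    for (int ch = 0; ch != 128; ++ch)
        if (predicate(ch))
            set.set(static_cast<std::size_t>(ch));
    return set;
}

// lhs{-}rhs: the elements of lhs that are not in rhs.
CharSet difference(CharSet const &lhs, CharSet const &rhs)
{
    return lhs & ~rhs;
}

// lhs{+}rhs: the elements that are in lhs, in rhs, or in both.
CharSet unite(CharSet const &lhs, CharSet const &rhs)
{
    return lhs | rhs;
}
```

Because the operators associate to the left, [abc]{-}[b]{-}[c] corresponds to difference(difference(abc, b), c), and [[:alpha:]]{-}[[:lower:]]{+}[q] to unite(difference(alpha, lower), q).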
Trailing context
A rule can have at most one instance of trailing context (the
/ operator
or the
$ operator). The start condition,
^, and
<<EOF>> patterns can only occur at the beginning of a
pattern, and cannot be surrounded by parentheses. The characters
^ and
$ only have their special properties at, respectively, the beginning
and end of regular expressions. In all other cases they are treated as
normal characters.
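The effect of trailing context (r/s matches r only when s follows, and s remains available as input) resembles a lookahead assertion in other regular expression dialects. The C++ sketch below imitates it with std::regex; it is an analogy and not flexc++’s algorithm, scanWithTrailingContext is a hypothetical name, and std::regex does not apply the longest-match rule described for flexc++.

```cpp
#include <regex>
#include <string>
#include <utility>

// Tries to match `r' at the start of `input', provided `s' follows directly.
// Returns {matched text, unread input}; on failure nothing is matched and the
// complete input is still unread.
std::pair<std::string, std::string> scanWithTrailingContext(
        std::string const &input, std::string const &r, std::string const &s)
{
    // (?=...) is a lookahead: it must match, but it consumes no input.
    std::regex const rule("(" + r + ")(?=" + s + ")");

    std::smatch match;
    if (!std::regex_search(input, match, rule,
                           std::regex_constants::match_continuous))
        return {"", input};

    return {match[1].str(), match.suffix().str()};
}
```

For the input foobar with r = foo and s = bar the matched text is foo and bar is left unread. The pattern r$ is described above as identical to r/\n, which in this sketch corresponds to passing "\\n" as s.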
8. SPECIFICATION EXAMPLE
%debug
%x comment
NAME [[:alpha:]][_[:alnum:]]*
%%
"//".* // ignore
"/*" begin(comment);
<comment>.|\n // ignore
<comment>"*/" begin(INITIAL);
^a return 1;
a return 2;
a$ return 3;
{NAME} return 4;
.|\n // ignore
FILES
Flexc++’s default skeleton files are in
/usr/share/flexc++.
By default,
flexc++ generates the following files:
- Scanner.h: the header file containing the scanner class’s
interface.
- Scannerbase.h: the header file containing the interface of the
scanner class’s base class.
- Scanner.ih: the internal header file that is meant to be included
by the scanner class’s source files (e.g., it is included by
lex.cc, see the next item’s file), and that should contain
all declarations required for compiling the scanner class’s
sources.
- lex.cc: the source file implementing the scanner class member
function lex (and support functions), performing the lexical scan.
SEE ALSO
flexc++(1),
flexc++api(3)
BUGS
- The priority of interval expressions ({...}) equals the priority of
other multiplicative operators (like *).
- All INITIAL rules apply to inclusive mini scanners, also those
INITIAL rules that were explicitly associated with the
INITIAL mini scanner.
COPYRIGHT
This is free software, distributed under the terms of the GNU General Public
License (GPL).
AUTHOR
Frank B. Brokken (f.b.brokken@rug.nl),
Jean-Paul van Oosten (j.p.van.oosten@rug.nl),
Richard Berendsen (richardberendsen@xs4all.nl) (until 2010).