NAME¶
fst-compiler - Two compilers for SFST programs
SYNOPSIS¶
fst-compiler grammar-file [ 
output-file ]
 
fst-compiler-utf8 grammar-file [ 
output-file ]
OPTIONS¶
  - -c
 
  - Store the transducer in compact format which is used by
      fst-infl2 and.
 
  - -l
 
  - Store the transducer in lowmem format.
 
  - -s
 
  - Switch upper and lower level of the transducer. You have to
      use this switch in order to use fst-infl (fst-infl2, fst-infl3) for
      generation rather than analysis.
 
DESCRIPTION¶
fst-compiler is a compiler for finite-state transducer programs. It
  generates a minimized finite state transducer which can be used with
  
fst-mor, fst-infl, fst-print, fst-compare,
  fst-parse, and 
fst-lattice. The compact transducer
  representation which is generated with the -c flag, is supported by
  
fst-infl2, fst-train, and 
fst-match. The memory-efficient
  transducer representation which is generated with the -l flag, is only
  supported by 
fst-infl3.
The first program argument is the name of a file which contains the transducer
  program. The programming language is described below. The second argument is
  the name of the file to which the resulting transducer will be written in
  binary form. If a second argument is missing, the output will be written to
  
stdout.
fst-compiler-utf8 differs from 
fst-compiler only in the character
  encoding. 
fst-compiler-utf8 supports UTF8 encoding of the source files
  whereas 
fst-compiler is to be used for 8-Bit character codes like
  latin1 which are an extension of the ASCII code. Information about the
  encoding is stored in the transducer files and used by the other SFST
  programs.
A transducer program consists of an (optional) sequence of 
alphabet and
  
variable definitions followed by a single 
transducer expression
  which defines the result transducer.
Alphabet
An alphabet definition consists of the keyword ALPHABET followed by = and some
  transducer expression e.g.
  - 
  
   ALPHABET = [a-z]:[A-Z] 
  
This command redefines the alphabet as the set of symbol pairs occurring on the
  transitions of the transducer. Occurrences of two-level operators, negation
  operators and unquoted periods always have to be preceded by an alphabet
  definition.
Variables
There are two different types of variables. 
Symbol set variables are
  enclosed by hash signs (#) and take symbol sequences (see below) as values:
  - 
  
   #UC# = A-Z 
  - 
    
     #LC# = a-z 
Transducer variables are enclosed by dollar signs and take transducer
  expressions as values:
  - 
  
   $MAP$ = [a-z]:[A-Z]+ 
  - 
    
     $MAP$ = [#LC#]:[#UC#]+ 
Variables whose name starts with the symbol `=' are special 
agreement
  variables. If an agreement variable occurs more than once in a transducer
  expression, it will always have the same value. Consider the following
  transducer program:
  - 
  
   $=1$ = [abc] 
  - 
    
     $=1$ X $=1$ 
The result transducer recognizes the strings aXa, bXb, and cXc. Only acyclic
  transducers (i.e. transducers with a finite set of string mappings) can be
  assigned to agreement variables.
Symbols
A symbol is either
- a single character like A s 5,
- a quoted character like \* or \_,
  - - a multi-character symbol like <X> or <ab.c5>
    (which is always
 
  - enclosed in angle brackets) or
 
  - - a backslash followed by a number which is the numeric
    code of the
 
  - designated character
 
- the null symbol <>.
Symbol sequence
A symbol sequence is a sequence of characters, multi-character symbols and
  character ranges, e.g. a-z \. <x>.
symbol range
A symbol range is either
- a single symbol
- a symbol sequence enclosed in square brackets like [A-Za-z] or
- a symbol sequence starting with ^ and enclosed in square brackets like
  [^A-Za-z] (designating the complement of [a-zA-Z]) or
- the period (which represents any symbol from the alphabet)
Transducer expressions
A transducer expression (TE) is recursively defined as follows:
  - - A pair of two symbol ranges separated by a colon is a
    TE.
 
  - 
    
 
    [a-z]:[a-Z] 
  - - A single symbol range like [a-z] is a TE.
 
  - It is a short form for [a-z]:[a-z].
 
  - - Two symbol sequences enclosed in braces and separated by
    a colon are
 
  - a TE. {a[bc]}:{def} is equivalent to a:d b:e <>:f |
      a:d c:e <>:f.
 
  - - X Y is a TE if X and Y are TEs.
 
  - (Blanks are ignored unless they are quoted.)
 
  - - (X) is a TE if X is a TE.
 
  
  - - X op is a TE is X is a TE and op is either * (Kleene's
    star operator), +
 
  - (Kleene's plus operator), or ? (optionality operator)
 
  - - op X is a TE is X is a TE and op is either ! (negation
    operator), ^
 
  - (target language extraction operator), _ (source language
      extraction operator), or ^_ (source and target switch operator).
 
  - - X op Y is a TE is X and Y are TEs and op is either &
    (conjunction
 
  - operator), | (disjunction operator), || (composition
      operator), or - (subtraction operator)
 
  - - L x op y R is a TE if L and R are TEs, x and y are symbol
    ranges and
 
  - op is either => (two-level restriction), <=
      (two-level coercion), or <=> (two-level restriction and
    coercion).
 
  - - X op L__R is a TE if X, L and R are TEs and op is either
    ^-> (upward
 
  - replacement), _-> (downward replacement), /->
      (leftward replacement) or \-> (rightward replacement). Furthermore, L
      and R must define automata (i.e. which map their strings onto themselves).
      These operators correspond to Karttunen's replace operators. If the arrow
      is followed by a question mark (?), the replacement becomes optional.
 
  - - X << l is a TE if X is a TE, and l is either of the
    form
 
  - a or the form a:b where a and b are single characters or
      symbols. The result is a transducer where l was freely inserted into X.
      The transducer ab << c for instance is equivalent to c*ac*bc*.
 
  - - X op Y L1__R2, ... , LN__RN is a TE if X,Y, L1 through LN
    and R1
 
  - through RN are TEs, and op is either => (general
      restriction), <= (general coercion), ^=> (general surface
      restriction), ^<= (general surface coercion), ^<=> (general
      surface restriction and coercion), _=> (general deep restriction),
      _<= (general deep coercion), _<=> (general deep restriction and
      coercion). (These operators were implemented following a suggestion by
      Anssi Yli-Jyra.)
 
  - - "fname" is a TE. The compiler reads the file
    named fname and turns
 
  - it into a transducer of the form line1|line2|line3|...
      where linex is the x-th line of the file. All characters other than : and
      \ are interpreted literally (i.e. not as operators). This TE is typically
      used e.g. to read morpheme list from a file.
 
  - - "<fname>" is a TE. The compiler reads a
    pre-compiled transducer from
 
  - the file named fname. This
 
Further Features
Comments start with the symbol % and extend up to the end of the line. Blanks
  are ignored unless they are quoted. Expressions terminate at the end of a line
  unless the end of line is preceded by a backslash. The command
  - #include "fname"
 
  
can be used to insert source code from a file named fname. The command
  - RE >> "fname"
 
  
stores the regular expression RE in the file fname.
 
EXAMPLE¶
Here is an example of a simple transducer program. Assuming that the file
  "adj-stems" contains the two lines
easy late big
this transducer will correctly analyse the adjective forms easy, easier, easiest
  and late, later, and latest.
 
ALPHABET = [a-zA-Z] y:i e:<> <ADJ>:<>
 
$R$ = y<=>i (<ADJ>:<> e)
 
$R2$ = e<=><> (<ADJ>:<> e)
 
$R$ = $R$ & $R2$
 
$Stems$ = "adj-stems"
 
$S$ = $Stems$ <ADJ>
  (<pos>:<>|<cmp>:{er}|<sup>:{est})
 
$S$ || $R$
 
EXIT STATUS¶
fst-compiler returns 0 unless some error occurs.
BUGS¶
The compiler gets the operator precedence wrong in case of two-level rules and
  interprets the expression "ab c<=>d ef" as "a(b
  c<=>d (ef))". Therefore, you should always surround the left
  context of two-level rules with parenthesis: (ab) c<=>d (ef)
SEE ALSO¶
fst-mor, fst-infl, fst-infl2, fst-infl3, fst-print, fst-compact, fst-parse,
  fst-compare, fst-compact, fst-lowmem, fst-lattice, fst-train
AUTHOR¶
Helmut Schmid, Institute for Computational Linguistics, University of Stuttgart,
  Email: schmid@ims.uni-stuttgart.de, This software is available under the GNU
  Public License.