NAME
swath - General-purpose Thai word segmentation utility
SYNOPSIS
swath [options] < infile > outfile
DESCRIPTION
Thai script has no word delimiter. Applications need to recognize word
boundaries before they can do useful things with Thai text, such as line
wrapping.
Swath provides a word analysis filter that inserts word delimiters into a given
text stream. It reads text from standard input, analyzes it for word boundaries
by consulting a Thai word list, and writes the same text to standard output
with the predefined word delimiters inserted.
Currently, it can read plain text, HTML, RTF, LaTeX and Lambda (Unicode version
of LaTeX with Omega typesetter kernel) documents and insert the common word
delimiter for each format (pipe `|' for plain text). The user can always
override this with a preferred delimiter.
OPTIONS
- -b [delimiter]
- Define a string to be used as the word delimiter code in the output text.
- -d [dict-path]
- Specify an alternative dictionary location. dict-path must be either a
directory containing the swath dictionary file `swathdic.tri', or a path
to the dictionary file itself. The dictionary file must be a trie file
prepared with the trietool-0.2(1) utility from the libdatrie package.
If this option is given, swath will bypass the normal dictionary search
and will exit if it fails to find the given dictionary. Otherwise, if
the SWATHDICT environment variable is set, it will try to open the
dictionary from the location specified by its value. Otherwise, it will
try the current working directory, and finally the usual installed location.
- -f [format]
- Specify format of the input. Possible formats are: html, rtf, latex,
lambda.
- -m [scheme]
- Choose the word matching scheme used when analyzing word boundaries. Possible
schemes are `long' (for longest or greedy matching) and `max' (for maximal
matching, which prefers the fewest words). Maximal matching is the default.
- -u input-enc,output-enc
- Specify the encodings of input and output. input-enc and
output-enc can each be either 'u' (for UTF-8 encoding) or 't' (for
TIS-620 encoding). Swath will convert the character encoding as necessary.
If omitted, TIS-620 encoding is assumed for both input and output.
- -v, --verbose
- Turn on verbose mode.
- -help, --help
- Show help.
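The difference between the two -m schemes can be seen on a toy input. The sketch below is only an illustration under stated assumptions, not swath's implementation: the word list is a made-up set of ASCII stand-ins rather than the Thai dictionary, the function name segment is hypothetical, and the fallback to a single character when no word matches is an assumption.

```sh
#!/bin/sh
# Toy word list: hypothetical ASCII stand-ins, not the swath dictionary.
WORDS="ab abc cde d e"

# segment SCHEME TEXT: print TEXT with `|' between words.
# SCHEME is `long' (greedy) or anything else for maximal matching.
segment() {
  awk -v scheme="$1" -v text="$2" -v words="$WORDS" '
  BEGIN {
    n = split(words, list, " ")
    for (k = 1; k <= n; k++) dict[list[k]] = 1
    len = length(text)
    out = ""
    if (scheme == "long") {
      # Greedy: at each position take the longest dictionary word,
      # falling back to a single character when nothing matches.
      pos = 1
      while (pos <= len) {
        for (l = len - pos + 1; l > 1; l--)
          if (substr(text, pos, l) in dict) break
        out = out (out == "" ? "" : "|") substr(text, pos, l)
        pos += l
      }
    } else {
      # Maximal: dynamic programming for the fewest words overall.
      # cost[i] is the fewest words covering the first i characters.
      cost[0] = 0
      for (i = 1; i <= len; i++) {
        cost[i] = -1
        for (j = 0; j < i; j++) {
          if (cost[j] < 0 || !(substr(text, j + 1, i - j) in dict)) continue
          if (cost[i] < 0 || cost[j] + 1 < cost[i]) {
            cost[i] = cost[j] + 1
            prev[i] = j
          }
        }
      }
      # No full segmentation exists: print the text unchanged.
      if (cost[len] < 0) { print text; exit }
      for (i = len; i > 0; i = prev[i])
        out = substr(text, prev[i] + 1, i - prev[i]) (out == "" ? "" : "|" out)
    }
    print out
  }'
}

segment long abcde   # prints abc|d|e
segment max abcde    # prints ab|cde
```

Greedy matching commits to `abc' first and is then left with two one-letter words, while maximal matching accepts the shorter `ab' so that the whole input fits in two words.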
ENVIRONMENT VARIABLES
- SWATHDICT
- If specified, swath will search for the dictionary in this location
before the usual places (the current working directory, then the usual
installed directory). This value is overridden by the -d option.
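The dictionary lookup order described for -d and SWATHDICT can be sketched as a small shell function. This is a minimal sketch under assumptions, not swath's own code: the names dict_at and resolve_dict are hypothetical, INSTALLED_DIR is a placeholder for the usual installed location, SWATHDICT is assumed to accept the same file-or-directory forms as -d, and a failed SWATHDICT lookup is assumed to fall through to the usual places.

```sh
#!/bin/sh
# Placeholder for the usual installed location (an assumption).
INSTALLED_DIR=${INSTALLED_DIR:-/usr/share/swath}

# dict_at PATH: accept either the trie file itself or a directory
# containing swathdic.tri; print the file path, or fail.
dict_at() {
  if [ -f "$1" ]; then
    printf '%s\n' "$1"
  elif [ -f "$1/swathdic.tri" ]; then
    printf '%s\n' "$1/swathdic.tri"
  else
    return 1
  fi
}

# resolve_dict [DICT_PATH]: DICT_PATH plays the role of the -d argument.
resolve_dict() {
  if [ -n "${1:-}" ]; then
    # -d given: no fallback, fail if the dictionary is not there.
    dict_at "$1"
    return
  fi
  # SWATHDICT is tried before the usual places.
  if [ -n "${SWATHDICT:-}" ] && dict_at "$SWATHDICT"; then
    return 0
  fi
  # Current working directory, then the installed location.
  dict_at . || dict_at "$INSTALLED_DIR"
}
```

Calling resolve_dict with an argument mirrors the -d behaviour (a missing dictionary is an error), while calling it without one walks through SWATHDICT, the current directory and the installed directory in turn.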
EXAMPLES
For LaTeX (to be used with babel-thai package):
$ swath -f latex < thaifile.tex > thaifile.ttex
$ latex thaifile.ttex
For HTML (to provide web pages to web browsers that cannot wrap Thai lines
properly, but support the <wbr> tag):
$ swath -f html < myweb.html > myweb-wbr.html
To preprocess a Thai UTF-8 encoded LaTeX file for babel-thai with tis620
inputenc:
$ swath -f latex -u u,t < thaifile.tex > thaifile.ttex
$ latex thaifile.ttex
This is equivalent to filtering with iconv(1):
$ iconv -f UTF-8 -t TIS-620 thaifile.tex | swath -f latex > thaifile.ttex
$ latex thaifile.ttex
To use the longest matching scheme with a LaTeX document:
$ swath -f latex -m long < thaifile.tex > thaifile.ttex
$ latex thaifile.ttex
To use an alternative dictionary from libthai:
$ swath -f latex -d /usr/share/libthai/thbrk.tri < thaifile.tex > thaifile.ttex
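When files of several formats have to be segmented, the choice of -f can be wrapped in a small helper. The following is a sketch under assumptions rather than a recommended tool: the function names format_for and segment_file are hypothetical, the mapping from file extension to -f value is a guess made for illustration, and plain text is assumed to need no -f option.

```sh
#!/bin/sh
# format_for FILE: guess the -f value from the file extension
# (assumed mapping; prints an empty line for plain text).
format_for() {
  case $1 in
    *.html|*.htm) echo html ;;
    *.rtf)        echo rtf ;;
    *.tex)        echo latex ;;
    *)            echo "" ;;
  esac
}

# segment_file INFILE OUTFILE: run swath with the guessed format.
segment_file() {
  fmt=$(format_for "$1")
  if [ -n "$fmt" ]; then
    swath -f "$fmt" < "$1" > "$2"
  else
    swath < "$1" > "$2"
  fi
}
```

For instance, `segment_file thaifile.tex thaifile.ttex' would run the same command as the first LaTeX example above.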
AUTHOR
This manual page was written by Theppitak Karoonboonyanan
<theppitak@gmail.com>.