.TH ucto 1 "2023 apr 21"

.SH NAME
ucto \- Unicode Tokenizer
.SH SYNOPSIS
ucto [[options]] [input\(hyfile] [[output\(hyfile]]

.SH DESCRIPTION
.B ucto
ucto tokenizes text files: it separates words from punctuation, splits
sentences (and optionally paragraphs), and finds paired quotes.
Ucto is preconfigured with tokenisation rules for several languages.

.SH OPTIONS

.BR \-c " configfile"
.RS
read settings from a file
.RE

.BR \-d " value"
.RS
set debug mode to 'value'
.RE

.BR \-e " value"
.RS
set input encoding. (default UTF8)
.RE

.BR \-N " value"
.RS
set UTF8 output normalization. (default NFC)
.RE

.BR \-\-filter =[YES|NO]
.RS
disable filtering of special characters, (default YES)
These special characters can be specified in the [FILTER] block of the
configuration file.
.RE

.BR \-f
.RS
OBSOLETE. use --filter=NO
.RE

.BR \-L " language"
.RS
Automatically selects a configuration file by language code.
The language code is generally a three-letter iso-639-3 code.
For example, 'fra' will select the file tokconfig\(hyfra from the installation directory
.RE

.BR \-\-detectlanguages =<lang1,lang2,..langn>
.RS
try to detect all the specified languages. The default language will be 'lang1'.
(only useful for FoLiA output).

All language codes must be iso-639-3.

You can use the special language code `und`. This ensures there is NO default
language, but any language that is NOT in the list will remain unanalyzed.

Warning: To be able to handle utterances of mixed language, Ucto uses a simple
sentence splitter based on the markers '.' '?' and '!'.
This may occasionally lead to surprising results.
.RE

.BR \-l
.RS
Convert to all lowercase
.RE

.BR \-u
.RS
Convert to all uppercase
.RE

.BR \-n
.RS
Emit one sentence per line on output
.RE

.BR \-m
.RS
Assume one sentence per line on input
.RE

.BR \-\-normalize =class1,class2,..,classn
.RS
map all occurrences of tokens with class1,...class to their generic names. e.g \-\-normalize=DATE will map all dates to the word {{DATE}}. Very useful to normalize tokens like URL's, DATE's, E\-mail addresses and so on.
.RE

.BR \-T\  value
or
.BR \-\-textredundancy =value
.RS
set text redundancy level for text nodes in FoLiA output:
 'full'    - add text to all levels: <p> <s> <w> etc.
 'minimal' - don't introduce text on higher levels, but retain what is already
 there.
 'none'    - only introduce text on <w>, AND remove all text from higher levels
.RE

.BR \-\-allow-word-correction
.RS
Allow ucto to tokenize inside FoLiA Word elements, creating FoLiA Corrections
.RE

.BR \-\-ignore-tag-hints
.RS
Skip all
.B tag=token
hints from the FoLiA input. These hints can be used to signal text markup like
subscript and superscript
.RE

.BR \-\-add\-tokens ="file"
.RS
Add additional tokens to the [TOKENS] block of the default language.
The file should contain one TOKEN per line.
.RE

.BR \-\-passthru
.RS
Don't tokenize, but perform input decoding and simple token role detection
.RE

.BR \-\-filterpunct
.RS
remove most of the punctuation from the output. (not from abreviations and embedded punctuation like John's)
.RE

.B \-P
.RS
Disable Paragraph Detection
.RE

.B \-Q
.RS
Enable Quote Detection. (this is experimental and may lead to unexpected results)
.RE

.B \-s
<string>
.RS
Set End\(hyof\(hysentence marker. (Default <utt>)
.RE

.B \-V
or
.B \-\- version
.RS
Show version information
.RE

.B \-v
.RS
set Verbose mode
.RE

.B \-F
.RS
Read a FoLiA XML document, tokenize it, and output the modified doc. (this disables usage of most other options: \-nPQvs)
For files with an '.xml' extension, \-F is the default.
.RE

.BR \-\-inputclass ="cls"
.RS
When tokenizing a FoLiA XML document, search for text nodes of class 'cls'.
The default is "current".
.RE

.BR \-\-outputclass ="cls"
.RS
When tokenizing a FoLiA XML document, output the tokenized text in text nodes with 'cls'.
The default is "current".
It is recommended to have different classes for input and output.
.RE

.BR \-\-textclass ="cls" (obsolete)
.RS
use 'cls' for input and output of text from FoLiA. Equivalent to both \-\-inputclass='cls' and \-\-outputclass='cls')

This option is obsolete and NOT recommended. Please use the separate \-\-inputclass= and \-\-outputclass options.
.RE

.BR \-\-copyclass
.RS
when ucto is used on FoLiA with fully tokenized text in inputclass='inputclass',
no text in textclass 'outputclass' is produced. (A warning will be given).
To circumvent this. Add the
.B \-\-copyclass
option. Which assures that text will be emitted in that class
.RE

.B \-X
.RS
Output FoLiA XML. (this disables usage of most other options: \-nPQvs)
.RE

.B \-\-id
<DocId>
.RS
Use the specified Document ID for the FoLiA XML
.RE

.B \-x
<DocId>
.B (obsolete)
.RS
Output FoLiA XML, use the specified Document ID. (this disables usage of most other options: \-nPQvs).

.B obsolete
Use
.B \-X
and
.B \-\-id
instead
.RE

.SH BUGS
likely

.SH AUTHORS
Maarten van Gompel proycon@anaproy.nl

Ko van der Sloot Timbl@uvt.nl