ucto(1) | General Commands Manual | ucto(1)
NAME
ucto - Unicode Tokenizer
SYNOPSIS
ucto [[options]] [input-file] [[output-file]]
DESCRIPTION
ucto tokenizes text files: it separates words from punctuation, splits sentences (and optionally paragraphs), and finds paired quotes. Ucto is preconfigured with tokenization rules for several languages.
OPTIONS
-c configfile
Read settings from a file
Set debug mode to 'value'
Set input encoding (default: UTF8)
Disable filtering of special characters
Automatically select a configuration file by language code; for example, 'fr' selects the file tokconfig-fr from the installation directory
Convert to all lowercase
Convert to all uppercase
Assume one sentence per line on input
Emit one sentence per line on output
Don't tokenize, but perform input decoding and simple token role detection
Disable paragraph detection
Enable quote detection (this is experimental and may lead to unexpected results)
Disable sentence detection
Set the end-of-sentence marker (default: <utt>)
Show version information
Set verbose mode
Output FoLiA XML, using the specified document ID (this disables most other options: -nulPQvsS)
Read a FoLiA XML document, tokenize it, and output the modified document (this disables most other options: -nulPQvsS)
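As an illustration of the synopsis combined with the -c option, the following is a minimal sketch rather than a canonical invocation. The file names input.txt, output.txt and ./tokconfig-custom are hypothetical placeholders, and it is an assumption here that -c accepts a path to a configuration file. The script prints the command it would run and only invokes ucto when the program and the input file are actually present.

```shell
#!/bin/sh
# Sketch: tokenize one file using an explicit configuration file.
# All three names below are hypothetical placeholders, not files shipped with ucto.
config="./tokconfig-custom"   # assumption: -c is given a path to a settings file
input="input.txt"
output="output.txt"

# Assemble the command in the order given by the SYNOPSIS:
# options first, then the input file, then the output file.
ucto_cmd="ucto -c $config $input $output"
echo "$ucto_cmd"

# Invoke ucto only when it is installed and the input file exists.
if command -v ucto >/dev/null 2>&1 && [ -f "$input" ]; then
    ucto -c "$config" "$input" "$output"
fi
```

Quoting each variable in the real invocation keeps file names containing spaces intact; the echoed string is only for display.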
BUGS
likely
AUTHORS
Maarten van Gompel proycon@anaproy.nl
2011 november 28