ucto(1) General Commands Manual ucto(1)

NAME

ucto - Unicode Tokenizer

SYNOPSIS

ucto [options] [input-file] [output-file]
 

DESCRIPTION

ucto tokenizes text files: it separates words from punctuation, splits sentences (and optionally paragraphs), and finds paired quotes. Ucto is preconfigured with tokenization rules for several languages.
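
As a rough illustration of consuming tokenized output from a script, the following Python sketch groups a token stream into sentences. It assumes that tokens are separated by whitespace and that each sentence is followed by the default end-of-sentence marker <utt> (see -s); that layout is an assumption rather than something this page specifies. The function name split_sentences and the sample string are hypothetical and not part of ucto.

```python
def split_sentences(tokenized, marker="<utt>"):
    """Group whitespace-separated tokens into sentences, closing each at marker."""
    sentences = []
    current = []
    for token in tokenized.split():
        if token == marker:
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(token)
    if current:  # trailing tokens that were not followed by a marker
        sentences.append(current)
    return sentences


# Hypothetical sample in the assumed output layout.
sample = "Hello , world ! <utt> This is a test . <utt>"
for sentence in split_sentences(sample):
    print(" ".join(sentence))
```

If a different marker was set with -s, pass it as the marker argument.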
 

OPTIONS

-c configfile
Read settings from a file
 
-d value
Set debug mode to 'value'
 
-e value
Set input encoding. (Default: UTF8)
 
-f
Disable filtering of special characters
 
-L language
Automatically select a configuration file by language code, e.g. 'fr' selects the file tokconfig-fr from the installation directory
 
-l
Convert to all lowercase
 
-u
Convert to all uppercase
 
-n
Assume one sentence per line on input
 
-m
Emit one sentence per line on output
 
--passthru
Don't tokenize, but perform input decoding and simple token role detection
 
-P
Disable paragraph detection
 
-Q
Enable quote detection. (This is experimental and may lead to unexpected results.)
 
-S
Disable sentence detection
 
-s <string>
Set end-of-sentence marker. (Default: <utt>)
 
-V
Show version information
 
-v
Set verbose mode
 
-x <DocId>
Output FoLiA XML, using the specified document ID. (This disables most other options: -nulPQvsS)
 
-F
Read a FoLiA XML document, tokenize it, and output the modified document. (This disables most other options: -nulPQvsS)
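
The options above can be combined when calling ucto from a script. The Python sketch below assembles an argument list in the order given in the synopsis (options, then input file, then optional output file), using only the -L, -m and -l options documented here. The helper names build_ucto_argv and run_ucto and the file names are hypothetical, and actually running the command assumes that ucto is installed and on the PATH.

```python
import subprocess


def build_ucto_argv(input_file, output_file=None, language=None,
                    sentence_per_line=False, lowercase=False):
    """Assemble an ucto argument list: options, input file, optional output file."""
    argv = ["ucto"]
    if language is not None:
        argv += ["-L", language]  # select a configuration file by language code
    if sentence_per_line:
        argv.append("-m")         # emit one sentence per line on output
    if lowercase:
        argv.append("-l")         # convert to all lowercase
    argv.append(input_file)
    if output_file is not None:
        argv.append(output_file)
    return argv


def run_ucto(*args, **kwargs):
    """Run ucto; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_ucto_argv(*args, **kwargs), check=True)


# Hypothetical file names; printing the command does not require ucto.
print(build_ucto_argv("input.txt", "output.txt", language="fr",
                      sentence_per_line=True))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.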
 
 

BUGS

likely
 

AUTHORS

Maarten van Gompel proycon@anaproy.nl
 
Ko van der Sloot Timbl@uvt.nl
 
2011 november 28