.TH ucto 1 "2023 apr 21"
.SH NAME
ucto \- Unicode Tokenizer
.SH SYNOPSIS
ucto [[options]] [input\(hyfile] [[output\(hyfile]]
.SH DESCRIPTION
.B ucto
tokenizes text files: it separates words from punctuation, splits sentences (and optionally paragraphs), and finds paired quotes.
Ucto is preconfigured with tokenization rules for several languages.
.SH OPTIONS
.BR \-c " configfile"
.RS
Read settings from a file
.RE
.BR \-d " value"
.RS
Set debug mode to 'value'
.RE
.BR \-e " value"
.RS
Set input encoding. (default UTF8)
.RE
.BR \-N " value"
.RS
Set UTF8 output normalization. (default NFC)
.RE
.BR \-\-filter =[YES|NO]
.RS
Enable or disable filtering of special characters. (default YES)
These special characters can be specified in the [FILTER] block of the configuration file.
.RE
.BR \-f
.RS
OBSOLETE. Use \-\-filter=NO
.RE
.BR \-L " language"
.RS
Automatically select a configuration file by language code.
The language code is generally a three\(hyletter ISO 639\-3 code.
For example, 'fra' will select the file tokconfig\(hyfra from the installation directory.
.RE
.BR \-\-detectlanguages =<lang1,lang2,..,langn>
.RS
Try to detect all the specified languages. The default language will be 'lang1'. (only useful for FoLiA output)
All language codes must be ISO 639\-3 codes.
You can use the special language code 'und'. This ensures there is NO default language, and any language that is NOT in the list will remain unanalyzed.
Warning: To be able to handle utterances of mixed language, Ucto uses a simple sentence splitter based on the markers '.' '?' and '!'. This may occasionally lead to surprising results.
.RE
.BR \-l
.RS
Convert to all lowercase
.RE
.BR \-u
.RS
Convert to all uppercase
.RE
.BR \-n
.RS
Emit one sentence per line on output
.RE
.BR \-m
.RS
Assume one sentence per line on input
.RE
.BR \-\-normalize =class1,class2,..,classn
.RS
Map all occurrences of tokens with class1,...,classn to their generic names.
For example, \-\-normalize=DATE will map all dates to the word {{DATE}}.
Very useful for normalizing tokens such as URLs, dates, e\(hymail addresses and so on.
.RE
.BR \-T " value"
or
.BR \-\-textredundancy =value
.RS
Set text redundancy level for text nodes in FoLiA output:
.br
\&'full' - add text to all levels: <p> <s> <w> etc.
.br
\&'minimal' - don't introduce text on higher levels, but retain what is already there.
.br
\&'none' - only introduce text on <w>, AND remove all text from higher levels
.RE
.BR \-\-allow\-word\-correction
.RS
Allow ucto to tokenize inside FoLiA Word elements, creating FoLiA Corrections
.RE
.BR \-\-ignore\-tag\-hints
.RS
Skip all
.B tag=token
hints from the FoLiA input. These hints can be used to signal text markup like subscript and superscript
.RE
.BR \-\-add\-tokens ="file"
.RS
Add additional tokens to the [TOKENS] block of the default language. The file should contain one TOKEN per line.
.RE
.BR \-\-passthru
.RS
Don't tokenize, but perform input decoding and simple token role detection
.RE
.BR \-\-filterpunct
.RS
Remove most of the punctuation from the output. (not from abbreviations and embedded punctuation like John's)
.RE
.B \-P
.RS
Disable Paragraph Detection
.RE
.B \-Q
.RS
Enable Quote Detection. (this is experimental and may lead to unexpected results)
.RE
.B \-s
.RS
Set End\(hyof\(hysentence marker. (Default <utt>)
.RE
.B \-V
or
.B \-\-version
.RS
Show version information
.RE
.B \-v
.RS
Set Verbose mode
.RE
.B \-F
.RS
Read a FoLiA XML document, tokenize it, and output the modified doc. (this disables usage of most other options: \-nPQvs)
For files with an '.xml' extension, \-F is the default.
.RE
.BR \-\-inputclass ="cls"
.RS
When tokenizing a FoLiA XML document, search for text nodes of class 'cls'.
The default is "current".
.RE
.BR \-\-outputclass ="cls"
.RS
When tokenizing a FoLiA XML document, output the tokenized text in text nodes with class 'cls'.
The default is "current".
It is recommended to have different classes for input and output.
.RE
.BR \-\-textclass ="cls" (obsolete)
.RS
Use 'cls' for input and output of text from FoLiA. (Equivalent to both \-\-inputclass='cls' and \-\-outputclass='cls')
This option is obsolete and NOT recommended. Please use the separate \-\-inputclass and \-\-outputclass options.
.RE
.BR \-\-copyclass
.RS
When ucto is used on FoLiA with fully tokenized text in inputclass='inputclass', no text in textclass 'outputclass' is produced. (A warning will be given.)
To circumvent this, add the
.B \-\-copyclass
option, which ensures that text will be emitted in that class.
.RE
.B \-X
.RS
Output FoLiA XML. (this disables usage of most other options: \-nPQvs)
.RE
.B \-\-id
.RS
Use the specified Document ID for the FoLiA XML
.RE
.B \-x
.B (obsolete)
.RS
Output FoLiA XML, using the specified Document ID. (this disables usage of most other options: \-nPQvs)
This option is obsolete. Use
.B \-X
and
.B \-\-id
instead.
.RE
.SH BUGS
likely
.SH AUTHORS
Maarten van Gompel proycon@anaproy.nl
.br
Ko van der Sloot Timbl@uvt.nl