'\" t
.\"     Title: unicharambigs
.\"    Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets vsnapshot
.\"      Date: 01/11/2023
.\"    Manual: \ \&
.\"    Source: \ \&
.\"  Language: English
.\"
.TH "UNICHARAMBIGS" "5" "01/11/2023" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
unicharambigs \- Tesseract unicharset ambiguities
.SH "DESCRIPTION"
.sp
The unicharambigs file (a component of traineddata, see combine_tessdata(1)) is used by Tesseract to represent possible ambiguities between characters, or groups of characters\&.
.sp
The file contains a number of lines, laid out as follows:
.sp
.if n \{\
.RS 4
.\}
.nf
[num] [char(s)] [num] [char(s)] [num]
.fi
.if n \{\
.RE
.\}
.sp
.TS
tab(:);
lt lt
lt lt
lt lt
lt lt
lt lt.
T{
.sp
Field one
T}:T{
.sp
the number of characters contained in field two
T}
T{
.sp
Field two
T}:T{
.sp
the character sequence to be replaced
T}
T{
.sp
Field three
T}:T{
.sp
the number of characters contained in field four
T}
T{
.sp
Field four
T}:T{
.sp
the character sequence used to replace field two
T}
T{
.sp
Field five
T}:T{
.sp
contains either 1 or 0\&.
1 denotes a mandatory replacement, 0 denotes an optional replacement\&.
T}
.TE
.sp 1
.sp
Characters appearing in fields two and four should appear in unicharset\&. The numbers in fields one and three refer to the number of unichars (not bytes)\&.
.SH "EXAMPLE"
.sp
.if n \{\
.RS 4
.\}
.nf
v1
2 \*(Aq \*(Aq 1 " 1
1 m 2 r n 0
3 i i i 1 m 0
.fi
.if n \{\
.RE
.\}
.sp
The first line is a version identifier\&. In this example, all instances of the \fI2\fR character sequence \fI\*(Aq\*(Aq\fR will \fBalways\fR be replaced by the \fI1\fR character sequence \fI"\fR; a \fI1\fR character sequence \fIm\fR \fBmay\fR be replaced by the \fI2\fR character sequence \fIrn\fR, and the \fI3\fR character sequence \fIiii\fR \fBmay\fR be replaced by the \fI1\fR character sequence \fIm\fR\&.
.sp
Tesseract 3\&.03 and later support a new, simpler format for the unicharambigs file:
.sp
.if n \{\
.RS 4
.\}
.nf
v2
\*(Aq\*(Aq " 1
m rn 0
iii m 0
.fi
.if n \{\
.RE
.\}
.sp
In this format, the "error" and "correction" are simple UTF\-8 strings separated by a space, followed, after another space, by the same type specifier as in v1 (0 for optional and 1 for mandatory substitution)\&. The downside of this simpler format is that Tesseract has to encode the UTF\-8 strings into the components of the unicharset\&. In complex scripts, this encoding may be ambiguous\&. In that case, the encoding is chosen to use the fewest UTF\-8 characters for each component, i\&.e\&. the shortest unicharset components make up the encoding\&.
.SH "HISTORY"
.sp
The unicharambigs file first appeared in Tesseract 3\&.00; prior to that, a similar format, called DangAmbigs (\fIdangerous ambiguities\fR), was used: the format was almost identical, except that only mandatory replacements could be specified, and field 5 was absent\&.
.SH "BUGS"
.sp
This is a documentation "bug": it\(cqs not currently clear what should be done in the case of ligatures (such as \fIfi\fR) which may also appear as regular letters in the unicharset\&.
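.SH "NOTES"
.sp
The two layouts described in EXAMPLE can be read with a few lines of code\&. The following Python sketch is illustrative only and is not how Tesseract itself loads the file; the names Ambig, parse_v1_line, parse_v2_line and parse_unicharambigs are hypothetical, and it assumes that fields are separated by whitespace and that no unichar contains whitespace\&. For v2 the strings are returned unencoded, because splitting them into unichars requires the unicharset\&.
.sp
.nf
```python
from typing import List, NamedTuple, Tuple


class Ambig(NamedTuple):
    """One v1 entry (hypothetical container, not a Tesseract type)."""
    wrong: List[str]   # field two: unichars to be replaced
    right: List[str]   # field four: replacement unichars
    mandatory: bool    # field five: True for 1, False for 0


def parse_type(flag: str) -> bool:
    """Map the type specifier to a boolean (1 = mandatory, 0 = optional)."""
    if flag not in ("0", "1"):
        raise ValueError("type specifier must be 0 or 1: " + repr(flag))
    return flag == "1"


def parse_v1_line(line: str) -> Ambig:
    """Parse one v1 entry: [num] [char(s)] [num] [char(s)] [num]."""
    fields = line.split()  # assumption: whitespace-separated fields
    n_wrong = int(fields[0])
    wrong = fields[1:1 + n_wrong]
    n_right = int(fields[1 + n_wrong])
    start = 2 + n_wrong
    right = fields[start:start + n_right]
    rest = fields[start + n_right:]
    # The counts in fields one and three must match what follows them.
    if len(wrong) != n_wrong or len(right) != n_right or len(rest) != 1:
        raise ValueError("malformed v1 entry: " + repr(line))
    return Ambig(wrong, right, parse_type(rest[0]))


def parse_v2_line(line: str) -> Tuple[str, str, bool]:
    """Parse one v2 entry: error string, correction string, type."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError("malformed v2 entry: " + repr(line))
    return fields[0], fields[1], parse_type(fields[2])


def parse_unicharambigs(text: str) -> list:
    """Dispatch on the version identifier found on the first line."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    version = lines[0].strip()
    if version == "v1":
        return [parse_v1_line(ln) for ln in lines[1:]]
    if version == "v2":
        return [parse_v2_line(ln) for ln in lines[1:]]
    raise ValueError("unknown version identifier: " + repr(version))
```
.fi
.sp
Applied to the v1 example, the entry "1 m 2 r n 0" yields the unichar m, the replacement unichars r and n, and an optional (0) type\&. Checking the counts in fields one and three against the unichars that follow them is a design choice of this sketch, intended to catch entries whose counts and sequences disagree\&.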
.SH "SEE ALSO"
.sp
tesseract(1), unicharset(5)
.sp
\m[blue]\fBhttps://tesseract\-ocr\&.github\&.io/tessdoc/Training\-Tesseract\-3\&.03%E2%80%933\&.05\&.html#the\-unicharambigs\-file\fR\m[]
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.