OCRODJVU(1) | ocrodjvu manual | OCRODJVU(1) |
NAME
ocrodjvu - OCR for DjVu files
SYNOPSIS
ocrodjvu {-o | --save-bundled} output-djvu-file [option...] djvu-file
ocrodjvu {-i | --save-indirect} index-djvu-file [option...] djvu-file
ocrodjvu --save-script script-file [option...] djvu-file
ocrodjvu --in-place [option...] djvu-file
ocrodjvu --dry-run [option...] djvu-file
ocrodjvu {--version | --help | -h | --list-engines | --list-languages}
DESCRIPTION
ocrodjvu is a wrapper for OCR systems that allows you to perform OCR on DjVu files.
The following OCR engines are supported: OCRopus[1], Cuneiform for Linux[2], Ocrad[3], GOCR[4], and Tesseract[5].
OPTIONS
OCR engine options
-e, --engine=engine-id
The default is “tesseract”. (The default was “ocropus” prior to ocrodjvu 0.8.)
--list-engines
Options controlling output
-o, --save-bundled=output-djvu-file
-i, --save-indirect=index-djvu-file
--save-script=script-file
--in-place
--dry-run
It is mandatory to use exactly one of the above options.
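The exactly-one rule can be checked before the program is launched. The following is a minimal sketch and not part of ocrodjvu: the helper name `build_command` and its parameters are hypothetical, and it uses only the output options listed in the synopsis.

```python
# Output options that take a file argument, and those that do not,
# as listed in the SYNOPSIS section.
MODES_WITH_PATH = {"--save-bundled", "--save-indirect", "--save-script"}
MODES_WITHOUT_PATH = {"--in-place", "--dry-run"}


def build_command(djvu_file, mode, target=None):
    """Return an argv list for ocrodjvu with exactly one output option.

    Hypothetical helper for illustration; it only assembles arguments
    and never runs the program.
    """
    if mode in MODES_WITH_PATH:
        if target is None:
            raise ValueError(f"{mode} requires a file argument")
        return ["ocrodjvu", f"{mode}={target}", djvu_file]
    if mode in MODES_WITHOUT_PATH:
        if target is not None:
            raise ValueError(f"{mode} does not take a file argument")
        return ["ocrodjvu", mode, djvu_file]
    raise ValueError(f"unknown output option: {mode}")
```

The returned list can be passed to `subprocess.run` on a system where ocrodjvu is installed. Because the helper accepts a single `mode`, it cannot produce a command line with two output options.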
--ocr-only
The default is to save all pages, even when the -p/--pages option is in effect.
--clear-text
--save-raw-ocr=output-directory
--raw-ocr-filename-template=template
The template language uses the Python string formatting syntax[6]. The following fields are available:
page, page+N, page-N
id
id-ext
The default template is “{id-ext}”.
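To illustrate how such a template might expand, here is a minimal sketch built on Python's `string.Formatter`. It is not ocrodjvu's implementation. The class name `RawOcrTemplate` is hypothetical, and the field meanings are assumptions: `page` is taken to be the page number, `page+N` and `page-N` that number shifted by N, `id` the page identifier, and `id-ext` the identifier without its file extension.

```python
import re
import string


class RawOcrTemplate(string.Formatter):
    """Hypothetical resolver for the fields page, page+N, page-N, id, id-ext."""

    def __init__(self, page, page_id):
        super().__init__()
        self.page = page        # assumed: page number
        self.page_id = page_id  # assumed: page identifier, e.g. "p0003.djvu"

    def get_value(self, key, args, kwargs):
        # Positional fields such as "{0}" are not part of this template language.
        if not isinstance(key, str):
            raise KeyError(key)
        if key == "id":
            return self.page_id
        if key == "id-ext":
            # Assumed meaning: the identifier with its file extension removed.
            return self.page_id.rsplit(".", 1)[0]
        match = re.fullmatch(r"page(?:([+-])(\d+))?", key)
        if match is None:
            raise KeyError(key)
        sign, shift = match.groups()
        if sign is None:
            return self.page
        if sign == "+":
            return self.page + int(shift)
        return self.page - int(shift)
```

For example, `RawOcrTemplate(page=3, page_id="p0003.djvu").format("{page:04}")` applies an ordinary format specification to the page number, and an unknown field name raises `KeyError`.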
Text segmentation options
-t lines, --details=lines
This is the default for OCRopus 0.2. The option is ineffective with stand-alone Tesseract 2.0.
-t words, --details=words
This is the default for most OCR engines.
This option is ineffective with OCRopus 0.2 and stand-alone Tesseract 2.0.
-t chars, --details=chars
This option is ineffective with OCRopus 0.2 and stand-alone Tesseract 2.0.
--word-segmentation=simple
This is the default, despite being linguistically incorrect.
--word-segmentation=uax29
This option breaks assumptions of some DjVu tools that words are separated by spaces, and therefore it is not recommended.
Other options
-l, --language=language-id
Tesseract ≥ 3.02 allows specifying multiple languages separated by “+” characters.
The default is always “eng” (English).
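A multi-language value can be assembled by joining language identifiers with “+”. This is a small sketch with a hypothetical helper name, `language_option`; the identifier “deu” is used only as an example, and the identifiers actually available depend on the engine (see --list-languages).

```python
def language_option(language_ids):
    """Build a --language option from one or more language identifiers.

    Hypothetical helper; more than one identifier is only meaningful
    with Tesseract >= 3.02, as noted above.
    """
    if not language_ids:
        # Fall back to the documented default.
        language_ids = ["eng"]
    return "--language=" + "+".join(language_ids)
```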
--list-languages
--render=mask
This is the default.
--render=foreground
--render=all
This option is necessary to OCR DjVu files with invalid foreground/background separation.
-p, --pages=page-range
The default is to process all pages.
-j, --jobs=n
The default is 1.
--version
-h, --help
Advanced options
-D, --debug
-X key=value
--on-error=abort
This is the default.
--on-error=resume
This option is strongly discouraged.
--html5
EXIT STATUS
One of the following exit values can be returned by ocrodjvu:
0
1
2
ENVIRONMENT
The following environment variable affects ocrodjvu:
TMPDIR
BUGS
Known bugs
Tesseract 3.00 is affected by a bug[9] that makes it produce invalid hOCR output in certain circumstances. ocrodjvu does not try to recover from this fault (which could not be done reliably anyway) unless you pass the -X fix-html=1 option.
Extracting bounding boxes of particular characters (which happens when either --details=chars or --word-segmentation=uax29 is enabled) is slow with Tesseract < 3.04.
Reporting new bugs
Please report bugs at: https://github.com/jwilk/ocrodjvu/issues
SEE ALSO
djvu(1), djvu2hocr(1), hocr2djvused(1),
ocroscript(1), tesseract(1), cuneiform(1), ocrad(1), gocr(1)
NOTES
1. OCRopus
2. Cuneiform for Linux
3. Ocrad
4. GOCR
5. Tesseract
6. Python string formatting syntax
7. Unicode Text Segmentation
8. HTML5 parser
2019-02-11 | ocrodjvu 0.11 |