CATDVI(1) | General Commands Manual | CATDVI(1) |
NAME
catdvi - a DVI to plain text converter
SYNOPSIS
catdvi [-d debuglevel, --debug=debuglevel] [-e outenc, --output-encoding=outenc] [-p pagespec, --first-page=pagespec] [-l pagespec, --last-page=pagespec] [-N, --list-page-numbers] [-s, --sequential] [-U, --show-unknown-glyphs] [-h, --help] [--version] [--copyright] [dvi-file]
DESCRIPTION
This manual page documents catdvi version 0.14.
catdvi reads the DVI (typesetter DeVice Independent) file dvi-file and dumps a plain text approximation of the document it describes to stdout. If the argument dvi-file is omitted or a dash (`-'), catdvi will read from stdin. Several output encodings (different character sets of the plain text output) are supported, most notably UTF-8.
The current version of catdvi is a work in progress; it may not be robust enough for production use, but already works fine with linear English text. Many mathematical symbols (e.g. the uppercase Greek letters) and moderately complex formulae also come out right.
The program needs to read the TFM (TeX Font Metric) files corresponding to the fonts used in the DVI file. These are searched for (and, if necessary and possible, created on the fly) through the Kpathsea library.
In order to correctly translate a DVI file to text, the input encoding of the fonts used in it (i.e. a meaning-preserving mapping from font code points to Unicode) must be known. There are a lot of different font encodings in use. At the time of writing, catdvi understands the following input encodings:
- `TEX TEXT'
- Knuth's original font encoding, also known as OT1.
- `TEX TEXT WITHOUT F-LIGATURES'
- A variant of the above.
- `EXTENDED TEX FONT ENCODING - LATIN'
- The Cork encoding, also known as T1.
- `TEX MATH ITALIC'
- The encoding of Knuth's math italic fonts, also known as OML.
- `TEX MATH SYMBOLS'
- The encoding of Knuth's math symbol fonts, also known as OMS.
- `TEX MATH EXTENSION' (most of it)
- The encoding of Knuth's math extension fonts (big operators, brackets, etc.), also known as OMX.
- `TEX TYPEWRITER TEXT'
- The encoding of Knuth's typewriter type fonts.
- `LATEX SYMBOLS'
- The encoding of the lasy fonts.
- Henrik Theiling's European currency symbol (`eurosym') font.
- `TEX TEXT COMPANION SYMBOLS 1---TS1' (almost everything)
- The encoding of the text companion fonts.
- Martin Vogel's symbol (`MarVoSym') font.
- Both the 1998 and the 2000 version are supported as far as possible; about half of the symbols are not representable in Unicode.
- `BLACKBOARD'
- The encoding of the blackboard bold math (`bbm') fonts.
- All AMS fonts except the Cyrillic ones.
- This includes the AMS math symbols group A and group B, Euler fraktur, Euler cursive, Euler script and Euler compatible extension fonts.
It is impossible to do perfect translation from unmarked-up DVI to plain text, since the former describes only the layout of a page, and a translator such as this should really know where words and paragraphs end, and more importantly, which glyphs should be aligned vertically and which shouldn't. The current alignment algorithm tries to preserve the relative horizontal positions of word beginnings; this works well in most cases. Word breaks are detected using simple heuristics; paragraphs are not detected at all (and no paragraph fill is attempted).
The price of alignment is that the output will likely be more than 80 columns wide, even though catdvi tries very hard not to use more columns than strictly necessary. Output is usually less than 120 columns, almost always less than 132 columns wide. It may be a good idea to switch your terminal to one of these modes if possible.
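The trade-off between alignment and line width can be illustrated with a toy model. The Python sketch below is not the algorithm catdvi uses; it only assumes that each word arrives with a horizontal position and that one text column corresponds to a fixed width. The function name layout_line and the unit parameter are hypothetical.

```python
def layout_line(words, unit=6.0):
    """Place (x, text) pairs on one text line.

    Each word starts at column round(x / unit) when that column is still
    free. Otherwise it is pushed right so that exactly one space separates
    it from the previous word. `unit` is an assumed column width, in the
    same (arbitrary) units as x.
    """
    line = ""
    for x, text in sorted(words):
        col = round(x / unit)
        if line and col <= len(line):
            # The wanted column is taken: keep the words apart instead.
            col = len(line) + 1
        line += " " * (col - len(line)) + text
    return line


if __name__ == "__main__":
    print(layout_line([(0.0, "Name"), (60.0, "Value")]))
    print(layout_line([(0.0, "a"), (6.0, "b")]))
```

In this model a word is never moved left of its wanted column, only right, which is one way a line can end up wider than the page it came from.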
OPTIONS
The program follows the usual GNU command line syntax, with long options starting with two dashes.
- -d debuglevel, --debug=debuglevel
- Set the debug output level to debuglevel (default is 10). Large values will result in lots of debug output, 0 in none at all. The maximal debug output level currently used is 150.
- -e outenc, --output-encoding=outenc
- Specify the encoding of the output character set. outenc can be one of the numbers or names from the table below. Names are case insensitive. The following output encodings should be available:
0: UTF-8
1: US-ASCII
2: ISO-8859-1
3: ISO-8859-15
The command catdvi --help (see below) will give a more up-to-date list of all compiled-in output encodings. The default encoding is 1.
- -p pagespec, --first-page=pagespec
- Do not output pages before page pagespec. Pages can be specified in three different ways; the first two are exactly the same as for dvips(1).
A (possibly negative) number num specifies a TeX page number, which is stored as the so-called count0 value in the DVI file for every page. Plain TeX uses negative page numbers for roman-numbered frontmatter (title page, preface, TOC, etc.) so the count0 values compare as
A number prefixed by an equals sign (`=num') specifies a physical page, i.e. the num-th page appearing in the DVI file. Numbering starts with 1. Note that with the long form of the option you actually need two equals signs, one as part of the long option and one as part of the page specification. Example:
The third form of a page specification, two numbers separated by a colon (`num1:num2'), is useful for documents with separately-numbered parts, e.g. chapters. It refers to the page with count0 value equal to num2 that catdvi believes to be in part num1. Since those part numbers are not stored in the DVI file, the program has to guess them: an internal chapter counter is increased by one every time the count0 value of the current page is not greater (in above ordering) than that of the previous page. The counter is initialized to 1 if the first page has negative count0 value and to 0 otherwise. (A document with separately numbered parts will probably have separately numbered frontmatter as well, and then this rule keeps the internal counter equal to real-world part numbers.)
- -l pagespec, --last-page=pagespec
- Do not output pages after page pagespec. Pages are specified exactly as for the --first-page option above.
- -N, --list-page-numbers
- Instead of the contents of pages, output their physical page count, count0 value and chapter count (see the --first-page option above for a definition of these).
- -s, --sequential
- Do not attempt to reproduce the page layout; output glyphs in the order they appear in the DVI file. This may be useful with e.g. multi-column page layouts.
- -U, --show-unknown-glyphs
- Show the Unicode number of unknown glyphs instead of `?'.
- -h, --help
- Show usage information and a list of available output encodings, then exit.
- --version
- Show version information and exit.
- --copyright
- Show copyright information and exit.
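The three page specification forms and the chapter counter described under --first-page can be modelled in a few lines of Python. This is a sketch under stated assumptions, not catdvi's code: the function names are hypothetical, the ordering of count0 values is assumed to put negative (frontmatter) values first in the sequence -1, -2, -3, ... followed by 0, 1, 2, ..., and a plain number is assumed to select the first page carrying that count0 value.

```python
def count0_key(n):
    # ASSUMPTION: negative values sort before non-negative ones, and among
    # negative values a larger magnitude comes later (-1, -2, -3, ...).
    return (0, -n) if n < 0 else (1, n)


def chapter_counts(count0_values):
    """Return the guessed chapter count for every page, in file order."""
    counts = []
    chapter = 0
    prev = None
    for i, c in enumerate(count0_values):
        if i == 0:
            # Initial value: 1 if the first page is negative, else 0.
            chapter = 1 if c < 0 else 0
        elif count0_key(c) <= count0_key(prev):
            # "Not greater than the previous page": a new part starts.
            chapter += 1
        counts.append(chapter)
        prev = c
    return counts


def parse_pagespec(spec):
    """Split a pagespec string into a (kind, numbers...) tuple."""
    if spec.startswith("="):
        return ("physical", int(spec[1:]))
    if ":" in spec:
        part, page = spec.split(":", 1)
        return ("chapter", int(part), int(page))
    return ("count0", int(spec))


def find_page(spec, count0_values):
    """Return the zero-based index of the page selected by spec, or None."""
    kind, *nums = parse_pagespec(spec)
    if kind == "physical":
        idx = nums[0] - 1  # physical numbering starts with 1
        return idx if 0 <= idx < len(count0_values) else None
    chapters = chapter_counts(count0_values)
    for i, (ch, c) in enumerate(zip(chapters, count0_values)):
        if kind == "count0" and c == nums[0]:
            return i  # ASSUMPTION: the first match wins
        if kind == "chapter" and (ch, c) == (nums[0], nums[1]):
            return i
    return None


if __name__ == "__main__":
    pages = [-1, -2, 1, 2, 3, 1, 2]  # frontmatter, part 1, part 2
    for physical, (c, ch) in enumerate(zip(pages, chapter_counts(pages)), 1):
        print(physical, c, ch)
```

For the sample document in the sketch, the specification `2:1' selects the sixth physical page, while a plain `1' selects the third.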
ENVIRONMENT
The usual environment variables TFMFONTS, TEXFONTS, etc. for Kpathsea font search and creation apply. Refer to the Kpathsea documentation for details.
SEE ALSO
xdvi(1), dvips(1), tex(1), mktextfm(1), the Kpathsea texinfo documentation, utf-8(7).
BUGS
These things do not work (yet):
- No rules are converted.
- Extensible recipes (very large brackets, braces, etc. built out of several smaller pieces) are not properly handled.
- Complicated math formulae are sometimes misaligned (mostly due to lack of appropriate word break heuristics).
- Some fonts and font encodings are not recognised yet.
- Most mathematical symbols have no representation in the available output character sets except Unicode, and hence show up as `?' unless UTF-8 output encoding is selected. A textual transcription would be desirable.
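The `?' substitution in the last item can be reproduced in miniature with Python's own codecs. The sketch below is not how catdvi is implemented; it only takes the numbers and names from the output encoding table of the -e option, maps them to Python codec names (an assumption of this example), and replaces unencodable characters with `?'. The names resolve_encoding and render are hypothetical.

```python
# Numbers and names as listed for the -e option; the codec names on the
# right are Python's, chosen for this illustration.
OUTPUT_ENCODINGS = {0: "utf-8", 1: "ascii", 2: "iso-8859-1", 3: "iso-8859-15"}
NAMES = {"UTF-8": 0, "US-ASCII": 1, "ISO-8859-1": 2, "ISO-8859-15": 3}


def resolve_encoding(outenc):
    """Accept a table number or a case-insensitive encoding name."""
    text = str(outenc).strip()
    if text.isdigit():
        number = int(text)
    else:
        number = NAMES[text.upper()]  # KeyError for unknown names
    return OUTPUT_ENCODINGS[number]


def render(text, outenc=1):
    """Encode text; characters missing from the target set become b'?'."""
    return text.encode(resolve_encoding(outenc), errors="replace")


if __name__ == "__main__":
    print(render("\u0393(x)"))           # default encoding 1 (US-ASCII)
    print(render("\u0393(x)", "utf-8"))  # names are case insensitive
```

With the default encoding the uppercase Greek gamma is lost, while selecting UTF-8 keeps it.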
Watch out for these:
- If there is a space where it does not belong or if there is no space where there should be one, report this as a bug (send the DVI file to the catdvi maintainer, stating where in the file the bug is seen).
AUTHORS
catdvi was written by Antti-Juhani Kaijanaho <gaia@iki.fi>, based on a skeletal version by J.H.M. Dassen (Ray). Bjoern Brill <brill@fs.math.uni-frankfurt.de> made further improvements and currently maintains the program.
The manual page was compiled by Bjoern Brill, using material written by the first two program authors.
8 November 2002 |