NAME
Encode::Arabic::ArabTeX - Interpreter of the ArabTeX notation of Arabic
REVISION
$Revision: 717 $ $Date: 2008-10-03 00:28:12 +0200 (Fri, 03 Oct 2008) $
SYNOPSIS
use Encode::Arabic::ArabTeX; # imports just like 'use Encode' would, plus extended options
while ($line = <>) { # maps the ArabTeX notation for Arabic into the Arabic script
    print encode 'utf8', decode 'arabtex', $line; # 'ArabTeX' alias 'Lagally' alias 'TeX'
}
# ArabTeX lower ASCII transliteration <--> Arabic script in Perl's internal format
$string = decode 'ArabTeX', $octets;
$octets = encode 'ArabTeX', $string;
Encode::Arabic::ArabTeX->encoder('dump' => '!./encoder.code'); # dump the encoder engine to file
Encode::Arabic::ArabTeX->decoder('load'); # load the decoder engine from module's extra sources
DESCRIPTION
ArabTeX is an excellent extension to TeX/LaTeX designed for typesetting the
right-to-left scripts of the Orient. It comes with very intuitive and
comprehensible lower ASCII transliterations, the expressive power of which is
even better than that of the scripts.
Encode::Arabic::ArabTeX implements the rules needed for proper interpretation of
the ArabTeX notation of Arabic. The conversion itself is done by
Encode::Mapper, and the user interface is built on the Encode::Encoding
module.
ENCODING BUSINESS
Since the ArabTeX notation is not a simple mapping to the graphemes of the
Arabic script, encoding the script into the notation is ambiguous. Two
different strings in the notation may correspond to identical strings in the
script. Heuristics must be engaged to decide which of the representations is
more appropriate.
In addition to this bottleneck, encoding may not be perfectly invertible by the
decode operation, due to over-generation or approximations in the encoding
algorithm.
There are situations where conversion from the Arabic script to the ArabTeX
notation is still convenient and useful. Imagine you need to edit the data,
enhance it with vowels or other diacritical marks, produce phonetic
transcripts and trim the typography of the script ... Do it in the ArabTeX
notation, with unrivalled control over the result!
Nonetheless, encoding is not the main purpose of this module ;)
DECODING BUSINESS
The module decodes the ArabTeX notation as defined in the User Manual Version
4.00 of March 11, 2004,
<ftp://ftp.informatik.uni-stuttgart.de/pub/arabtex/doc/arabdoc.pdf>. The
implementation uses three levels of Encode::Mapper engines to solve the
problem:
- Hamza writing
- Hamza carriers are determined from the context in
accordance with the Arabic orthographical conventions. The first level of
mapping expands every "<'>" into the verbatim encoding of
the relevant carrier. This level of processing can become optional, if
people ever need to encode the hamza carriers explicitly.
Interpretation of geminated hamza "<''>" is
correct here, as opposed to ArabTeX itself. In order to deduce the
proper spelling rules, we resorted to
<http://www.arabic-morphology.com/> and experimented with words like
"<ra''asa>", "<ru''isa>",
"<tara''usuN>", etc.
On this level, word-internal occurrences of "<T>" get
translated into "<t>", which is an extension to the
notation that simplifies some requirements in modeling of the Arabic
morphology.
- Grapheme generation
- The core level includes most of the rules needed, and
converts the ArabTeX notation to Arabic graphemes in Unicode. The engine
recognizes all the consonants of Modern Standard Arabic, plus the
following letters:
[ "|",  ""         ], # invisible consonant
[ "B",  "\x{0640}" ], # consonantal ta.twil
[ "T",  "\x{0629}" ], # ta' marbu.ta
[ "H",  "\x{0629}" ], # ta' marbu.ta silent
[ "p",  "\x{067E}" ], # pa'
[ "v",  "\x{06A4}" ], # va'
[ "g",  "\x{06AF}" ], # gaf
[ "c",  "\x{0681}" ], # .ha with hamza
[ "^c", "\x{0686}" ], # gim with three dots
[ ",c", "\x{0685}" ], # _ha with three dots
[ "^z", "\x{0698}" ], # zay with three dots
[ "^n", "\x{06AD}" ], # kaf with three dots
[ "^l", "\x{06B5}" ], # lam with bow above
[ ".r", "\x{0695}" ], # ra' with bow below
There are many nice features in the notation, like assimilation, gemination,
hyphenation, all implemented here. Defective and historical writings of
vowels are supported, too! Try for yourself whether your fonts can handle these ;)
Word-initial sequences like "<lV-all>",
"<lV-al->", "<lV-al-CC>" and
"<lV-aC-C>", where "V" stands for a short,
possibly quoted or missing, vowel, and "C" represents a fixed
consonant, are processed according to the requirements of the Arabic
orthography. Thus, "<li-al-laylaTi>" reduces to
"<li-llaylaTi>", "<li-al-rra^guli>"
becomes "<lir-ra^guli>", and
"<la-al-ma^gdu>" equals "<lal-ma^gdu>",
while "<li-alla_dI>" turns into
"<lilla_dI>".
- Wasla and ligatures
- Wasla is introduced if there is a preceding long or
short vowel, and the blank space is one newline, one tabulator, or up to
four single spaces. Optionally, diacritical marks in between laam
and 'alif are moved after the latter letter, since most current
systems rendering the Arabic script do not produce the desired ligatures
if the two kinds of graphemes are not immediately adjacent.
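As a rough illustration of the table of additional letters listed under Grapheme generation, the sketch below applies those pairs with a longest-match-first substitution, so that "^c" is not mistaken for a plain "c". This is not how Encode::Mapper works internally: the helper name map_extra_letters and the regex approach are assumptions made for this example, and only the pairs themselves come from the table.

```perl
use strict;
use warnings;

# Pairs copied from the table of additional letters: notation => grapheme.
my @pairs = (
    [ "|",  ""         ],    # invisible consonant
    [ "B",  "\x{0640}" ],    # consonantal ta.twil
    [ "T",  "\x{0629}" ],    # ta' marbu.ta
    [ "H",  "\x{0629}" ],    # ta' marbu.ta silent
    [ "p",  "\x{067E}" ],    # pa'
    [ "v",  "\x{06A4}" ],    # va'
    [ "g",  "\x{06AF}" ],    # gaf
    [ "c",  "\x{0681}" ],    # .ha with hamza
    [ "^c", "\x{0686}" ],
    [ ",c", "\x{0685}" ],
    [ "^z", "\x{0698}" ],
    [ "^n", "\x{06AD}" ],
    [ "^l", "\x{06B5}" ],
    [ ".r", "\x{0695}" ],
);

my %map = map { $_->[0] => $_->[1] } @pairs;

# Build one alternation with longer tokens first, escaping regex
# metacharacters such as "|", "^", "." that occur in the notation.
my $tokens = join '|',
             map  { quotemeta($_) }
             sort { length $b <=> length $a } keys %map;

# Hypothetical helper: replace every known token, leave everything else as is.
sub map_extra_letters {
    my ($text) = @_;
    $text =~ s/($tokens)/$map{$1}/g;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print map_extra_letters($_), "\n" for "^c", "c", ".r";
```

The real engines handle far more context than this (vowels, hamza carriers, assimilation), which is why the module delegates the work to Encode::Mapper rather than to a single substitution.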
There are modes and options in ArabTeX that have not been dealt with yet in
Encode::Arabic::ArabTeX. Still, mutual consistency of the systems is very
high. This new release does support vowel quoting and works in ArabTeX's
"\vocalize" mode by default. The other conversion modes are implemented,
too, as described below within the "enmode" and "demode" methods.
EXPORTS, ENGINES & MODES
The module exports as if "use Encode" also appeared in the package.
The "import" options, except for the first-place subsequence of
":xml", ":simple" or ":describe", are simply
delegated to Encode, which performs the imports.
If the first element in the list to "use" is ":xml", all XML
markup, or rather any data enclosed in the well-paired and non-nested
angle brackets "<" and ">", will be
preserved. Properties of the Encode::Arabic::ArabTeX engines can be generally
controlled through the Encode::Mapper API.
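The effect of ":xml" can be pictured with a small stand-alone sketch that converts only the text outside well-paired, non-nested angle brackets. It is an illustration of the behaviour described here, not the module's implementation; the function name transform_outside_markup and the use of "uc" as a stand-in converter are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: apply $convert to everything except <...> spans.
sub transform_outside_markup {
    my ($text, $convert) = @_;

    # The capturing group keeps the bracketed spans in the result list:
    # even indices hold plain text, odd indices hold markup.
    my @parts = split /(<[^<>]*>)/, $text;

    my $out = '';
    for my $i (0 .. $#parts) {
        $out .= $i % 2 ? $parts[$i] : $convert->($parts[$i]);
    }
    return $out;
}

# Stand-in converter; with the module installed one could pass
# sub { decode 'arabtex', $_[0] } instead.
print transform_outside_markup('<w id="1">kitAb</w> wa', sub { uc $_[0] }), "\n";
```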
If the next element in this list, which may also be the first, is
":simple", the rules in the engines are simplified: quotes are mapped to
empty strings, and infrequent or experimental notations of vowels are not
given their special ArabTeX interpretation. Using
":simple" is recommended for simple everyday tasks where these
nuances have no impact and where full initialization would be a nuisance.
The ":describe" option calls the Encode::Mapper's "describe"
method on the module's engines right after their compilation.
Initialization of the engines takes place the first time they are used, unless
they have already been defined. There are two explicit methods for it:
- encoder
- Initialize or redefine the encoder engine. If no parameters
are given, rules in the module are compiled into a list of Encode::Mapper
objects. Currently, the "--dump" and "--load" options
have some experimental meaning.
- decoder
- See the description of "encoder".
There are five conversion modes currently recognized in this module, and
their aliases are mapped according to the module's %modemap hash. Selection of
the appropriate mode is done best through the "enmode" and
"demode" functions of Encode::Arabic, or with a direct call of the
namesake methods in Encode::Arabic::ArabTeX:
our %Encode::Arabic::ArabTeX::modemap = ( # the module provides these definitions
    'default'      => 3, 'undef'    => 0,
    'fullvocalize' => 4, 'full'     => 4,
    'vocalize'     => 3, 'nosukuun' => 3,
    'novocalize'   => 2, 'novowels' => 2, 'none' => 2,
    'noshadda'     => 1, 'noneplus' => 1,
);
# the function calls might be preferred as more comfortable
Encode::Arabic::demode 'arabtex', 'full'; # like 'encode' and 'decode' of Encode
Encode::Arabic::ArabTeX->demode('fullvocalize'); # like the Encode::Encoding interfaces
# how modes can be set easily
use Encode::Arabic ':modes'; enmode 'arabtex', 'undef'; demode 'arabtex', 'noneplus';
- enmode
- Currently in development. The mode is fixed to 'undef'
internally.
- demode
- Enforces the proper version of the final, third level of
the Encode::Mapper engines.
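To make the aliasing concrete, the following sketch copies the alias table quoted above into a local hash and groups the aliases by the numeric mode they select. The helper resolve_mode is hypothetical and only mirrors the table; how the module consults %modemap internally is not shown in this document.

```perl
use strict;
use warnings;

# Local copy of the alias table quoted above; the module keeps its own
# in %Encode::Arabic::ArabTeX::modemap.
my %modemap = (
    'default'      => 3, 'undef'    => 0,
    'fullvocalize' => 4, 'full'     => 4,
    'vocalize'     => 3, 'nosukuun' => 3,
    'novocalize'   => 2, 'novowels' => 2, 'none' => 2,
    'noshadda'     => 1, 'noneplus' => 1,
);

# Hypothetical helper: numeric mode for an alias, or undef if unknown.
sub resolve_mode {
    my ($alias) = @_;
    return exists $modemap{$alias} ? $modemap{$alias} : undef;
}

# Group the aliases by the mode they select.
my %aliases_of;
push @{ $aliases_of{ $modemap{$_} } }, $_ for sort keys %modemap;

for my $mode (sort { $a <=> $b } keys %aliases_of) {
    print "$mode: @{ $aliases_of{$mode} }\n";
}
```

Grouping this way shows that the eleven aliases collapse into the five distinct modes, numbered 0 through 4.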
SEE ALSO
Encode::Arabic, Encode::Mapper, Encode::Encoding, Encode
ArabTeX system
<ftp://ftp.informatik.uni-stuttgart.de/pub/arabtex/arabtex.htm>
Klaus Lagally
<http://www.informatik.uni-stuttgart.de/ifi/bs/people/lagall_e.htm>
ArabTeX extensions
<http://sourceforge.net/projects/encode-arabic/>
ArabXeTeX
<http://tug.ctan.org/info/?id=arabxetex>
Encode Arabic: Exercise in Functional Parsing
<http://ufal.mff.cuni.cz/padt/online/2006/06/encode-arabic.html>
AUTHOR
Otakar Smrz, <http://ufal.mff.cuni.cz/~smrz/>
eval { 'E<lt>' . ( join '.', qw 'otakar smrz' ) . "\x40" . ( join '.', qw 'mff cuni cz' ) . 'E<gt>' }
Perl is also designed to make the easy jobs not that easy ;)
COPYRIGHT AND LICENSE
Copyright 2003-2008 by Otakar Smrz
This library is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.