NAME
Unicode::Map8 - Mapping table between 8-bit chars and Unicode
SYNOPSIS
require Unicode::Map8;
my $no_map = Unicode::Map8->new("ISO646-NO") || die;
my $l1_map = Unicode::Map8->new("latin1") || die;
my $ustr = $no_map->to16("V}re norske tegn b|r {res\n");
my $lstr = $l1_map->to8($ustr);
print $lstr;
print $no_map->tou("V}re norske tegn b|r {res\n")->utf8;
DESCRIPTION
The Unicode::Map8 class implements efficient mapping tables between 8-bit
character sets and 16-bit character sets like Unicode. The tables are
efficient both in terms of space allocated and translation speed. The 16-bit
strings are assumed to use network byte order.
The following methods are available:
- $m = Unicode::Map8->new( [$charset] )
- The object constructor creates new instances of the
Unicode::Map8 class. It takes an optional argument that specifies the name
of an 8-bit character set to initialize mappings from. The argument can
also be the name of a mapping file. If the charset/file cannot be
located, then the constructor returns undef.
If you omit the argument, then an empty mapping table is constructed. You
must then add mapping pairs to it using the addpair() method
described below.
- $m->addpair( $u8, $u16 );
- Adds a new mapping pair to the mapping object. It takes two
arguments. The first is the code value in the 8-bit character set and the
second is the corresponding code value in the 16-bit character set. The
same codes can be used multiple times (but using the same pair has no
effect). The first definition for a code is the one that is used.
Consider the following example:
$m->addpair(0x20, 0x0020);
$m->addpair(0x20, 0x00A0);
$m->addpair(0xA0, 0x00A0);
It means that the characters 0x20 and 0xA0 in the 8-bit charset map to
themselves in the 16-bit set, but 0x00A0 in the 16-bit character set maps
to 0x20.
- $m->default_to8( $u8 )
- Set the code of the default character to use when mapping
from 16-bit to 8-bit strings. If there is no mapping pair defined for a
character then this default is substituted by to8() and
recode8().
- $m->default_to16( $u16 )
- Set the code of the default character to use when mapping
from 8-bit to 16-bit strings. If there is no mapping pair defined for a
character then this default is used by to16(), tou() and
recode8().
- $m->nostrict;
- All undefined mappings are replaced with the identity
mapping. Undefined characters are normally just removed (or replaced with
the default if defined) when converting between character sets.
- $m->to8( $ustr );
- Converts a 16-bit character string to the corresponding
string in the 8-bit character set.
- $m->to16( $str );
- Converts an 8-bit character string to the corresponding
string in the 16-bit character set.
- $m->tou( $str );
- Same as to16() but returns a Unicode::String object
instead of a plain UCS2 string.
- $m->recode8($m2, $str);
- Map the string $str from one 8-bit character set ($m) to
another one ($m2). Since we assume we know the mappings towards the common
16-bit encoding we can use this to convert between any of the 8-bit
character sets.
- $m->to_char16( $u8 )
- Maps a single 8-bit character code to a 16-bit code. If
the 8-bit character is unmapped then the constant NOCHAR is returned. The
default is not used and the callback method is not invoked.
- $m->to_char8( $u16 )
- Maps a single 16-bit character code to an 8-bit code. If
the 16-bit character is unmapped then the constant NOCHAR is returned. The
default is not used and the callback method is not invoked.
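The first-definition rule of addpair() and the effect of default_to8() can be checked on a small table. The following is a minimal sketch that uses only the methods documented above; the extra pair for "A" and the choice of "?" as the default character are assumptions made for illustration.

```perl
use strict;
use warnings;
use Unicode::Map8;

my $m = Unicode::Map8->new;          # no argument: empty mapping table
$m->addpair(0x20, 0x0020);
$m->addpair(0x20, 0x00A0);           # 0x20 is already mapped, so this only
                                     # defines the 16-bit to 8-bit direction
$m->addpair(0xA0, 0x00A0);
$m->addpair(0x41, 0x0041);           # extra pair for "A" (illustration only)

# Single code lookups in each direction.
printf "8-bit 0x20    -> 16-bit 0x%04X\n", $m->to_char16(0x20);
printf "16-bit 0x00A0 -> 8-bit 0x%02X\n",  $m->to_char8(0x00A0);

# Substitute "?" for 16-bit codes that have no 8-bit mapping.
$m->default_to8(ord "?");

# 16-bit string in network byte order: "A", U+00A0 and the unmapped U+263A.
my $ustr = pack("n*", 0x0041, 0x00A0, 0x263A);
my $str  = $m->to8($ustr);
print "to8 gave: '$str'\n";
```

Because to_char16() and to_char8() ignore the default, they are the more direct way to inspect what a table actually contains.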
The following callback methods are available. You can override these methods by
creating a subclass of Unicode::Map8.
- $m->unmapped_to8
- When mapping to an 8-bit character string and there is no
mapping defined (and no default either), then this method is called as the
last resort. It is called with a single integer argument which is the code
of the unmapped 16-bit character. It is expected to return a string that
will be incorporated in the 8-bit string. The default version of this
method always returns an empty string.
Example:
package MyMapper;
@ISA=qw(Unicode::Map8);
sub unmapped_to8
{
    my($self, $code) = @_;
    require Unicode::CharName;
    "<" . Unicode::CharName::uname($code) . ">";
}
- $m->unmapped_to16
- Likewise, when mapping to a 16-bit character string and no
mapping is defined, this method is called. It should return a 16-bit
string with the bytes in network byte order. The default version of this
method always returns an empty string.
FILES
The Unicode::Map8 constructor can parse two different file formats: a
binary format and a textual format.
The binary format is simple. It consists of a sequence of 16-bit integer pairs in
network byte order. The first pair should contain the magic value 0xFFFE,
0x0001. Of each pair, the first value is the code of an 8-bit character and
the second is the code of the 16-bit character. It follows from this that the
first value should be less than 256.
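The binary layout can be sketched with pack and unpack in core Perl. The helper names encode_map8_bin and decode_map8_bin are hypothetical and not part of Unicode::Map8; the sketch only mirrors the layout described above and is not a substitute for the supplied conversion scripts.

```perl
use strict;
use warnings;

# Hypothetical helper: serialize [8-bit code, 16-bit code] pairs as 16-bit
# integers in network byte order, preceded by the magic pair.
sub encode_map8_bin {
    my @pairs = @_;
    my $data = pack("n2", 0xFFFE, 0x0001);       # magic pair comes first
    for my $pair (@pairs) {
        my ($u8, $u16) = @$pair;
        die sprintf("8-bit code 0x%X is out of range\n", $u8) if $u8 > 0xFF;
        $data .= pack("n2", $u8, $u16);
    }
    return $data;
}

# Hypothetical helper: parse the same layout back into a list of pairs.
sub decode_map8_bin {
    my ($data) = @_;
    my @words = unpack("n*", $data);
    die "truncated mapping data\n" if @words < 2 || @words % 2;
    my ($hi, $lo) = splice(@words, 0, 2);
    die "bad magic value\n" unless $hi == 0xFFFE && $lo == 0x0001;
    my @pairs;
    while (my ($u8, $u16) = splice(@words, 0, 2)) {
        push @pairs, [$u8, $u16];
    }
    return @pairs;
}

my $bin       = encode_map8_bin([0x20, 0x0020], [0xA0, 0x00A0]);
my @bin_pairs = decode_map8_bin($bin);
printf "0x%02X 0x%04X\n", @$_ for @bin_pairs;
```

Each pair occupies four bytes, so a table with N pairs takes 4 * (N + 1) bytes including the magic pair.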
The textual format consists of lines that are each either a comment (first
non-blank character is '#'), a completely blank line, or a line with two
hexadecimal numbers. The hexadecimal numbers must be preceded by "0x" as in C
and Perl. This is the same format used by the Unicode mapping files available
from <URL:ftp://ftp.unicode.org/Public>.
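A reader for the textual format might look like the following core Perl sketch. The function name parse_map8_text is hypothetical, the sample lines are invented, and tolerating a trailing comment after the two numbers is an assumption based on how the Unicode mapping files are laid out; the module's own parser may behave differently on malformed lines.

```perl
use strict;
use warnings;

# Hypothetical helper: extract [8-bit code, 16-bit code] pairs from text in
# the format described above.
sub parse_map8_text {
    my ($text) = @_;
    my @pairs;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:#|$)/;          # comment or blank line
        if ($line =~ /^\s*0x([0-9A-Fa-f]+)\s+0x([0-9A-Fa-f]+)/) {
            push @pairs, [hex($1), hex($2)];
        }
        else {
            warn "skipping unrecognized line: $line\n";
        }
    }
    return @pairs;
}

# Invented sample input for illustration.
my $sample = <<'EOT';
# 8-bit code, then 16-bit code
0x20 0x0020 # SPACE

0xA0 0x00A0 # NO-BREAK SPACE
EOT

my @text_pairs = parse_map8_text($sample);
printf "0x%02X => 0x%04X\n", @$_ for @text_pairs;
```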
The mapping table files are installed in the Unicode/Map8/maps directory
somewhere in the Perl @INC path. The variable $Unicode::Map8::MAPS_DIR is the
complete path name to this directory. Binary mapping files are stored within
this directory with the suffix .bin. Textual mapping files are stored
with the suffix .txt.
The scripts map8_bin2txt and map8_txt2bin can translate between
these mapping file formats.
A special file called aliases within $MAPS_DIR specifies all the alias
names that can be used to denote the various character sets. The first name on
each line is the real file name and the rest are alias names separated by
spaces.
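The alias layout lends itself to a simple lookup table. The sketch below uses core Perl only; parse_aliases is a hypothetical helper, the sample lines are invented rather than taken from the installed file, and exact (case-sensitive) name matching is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: build a "name => real file name" lookup from text laid
# out as described above (real name first, aliases after, space separated).
sub parse_aliases {
    my ($text) = @_;
    my %real_name;
    for my $line (split /\n/, $text) {
        my ($real, @aliases) = split ' ', $line;
        next unless defined $real;               # skip blank lines
        $real_name{$_} = $real for $real, @aliases;
    }
    return %real_name;
}

# Invented sample lines, not the contents of the installed aliases file.
my %real_name = parse_aliases("charset_one one uno\ncharset_two two\n");
print "uno is stored as $real_name{uno}\n";
```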
The "umap --list" command can be used to list the supported character
sets.
BUGS
Does not handle Unicode surrogate pairs as a single character.
SEE ALSO
umap(1), Unicode::String
COPYRIGHT
Copyright 1998 Gisle Aas.
This library is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.