NAME
I18N::Charset - IANA Character Set Registry names and Unicode::MapUTF8 (et al.)
conversion scheme names
SYNOPSIS
use I18N::Charset;
$sCharset = iana_charset_name('WinCyrillic');
# $sCharset is now 'windows-1251'
$sCharset = umap_charset_name('Adobe DingBats');
# $sCharset is now 'ADOBE-DINGBATS' which can be passed to Unicode::Map->new()
$sCharset = map8_charset_name('windows-1251');
# $sCharset is now 'cp1251' which can be passed to Unicode::Map8->new()
$sCharset = umu8_charset_name('x-sjis');
# $sCharset is now 'sjis' which can be passed to Unicode::MapUTF8->new()
$sCharset = libi_charset_name('x-sjis');
# $sCharset is now 'MS_KANJI' which can be passed to `iconv -f $sCharset ...`
$sCharset = enco_charset_name('Shift-JIS');
# $sCharset is now 'shiftjis' which can be passed to Encode::from_to()
I18N::Charset::add_iana_alias('my-japanese' => 'iso-2022-jp');
I18N::Charset::add_map8_alias('my-arabic' => 'arabic7');
I18N::Charset::add_umap_alias('my-hebrew' => 'ISO-8859-8');
I18N::Charset::add_libi_alias('my-sjis' => 'x-sjis');
I18N::Charset::add_enco_alias('my-japanese' => 'shiftjis');
DESCRIPTION
The "I18N::Charset" module provides access to the IANA Character Set
Registry names for identifying character encoding schemes. It also provides a
mapping to the character set names used by the Unicode::Map, Unicode::Map8,
and Unicode::MapUTF8 modules, by the iconv library, and by the Encode module.
So, for example, if you get an HTML document with a META CHARSET="..."
tag, you can fairly quickly determine what Unicode::MapXXX module can be used
to convert it to Unicode.
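The META CHARSET scenario above might look something like the following sketch. The helper charset_label_from_html() and its regular expression are hypothetical (they are not part of I18N::Charset), and the example assumes the core Encode module is installed so that enco_charset_name() can return a usable name.

```perl
use strict;
use warnings;
use Encode ();
use I18N::Charset;

# Hypothetical helper (not part of I18N::Charset): pull the charset label
# out of the first <meta ...charset=...> tag found in the document.
sub charset_label_from_html {
    my ($html) = @_;
    return $1 if $html =~ /<meta[^>]*\bcharset\s*=\s*["']?([^"'\s>;]+)/i;
    return;
}

# A tiny byte string standing in for a downloaded document.
my $bytes = qq{<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">}
          . "\xCF\xF0\xE8\xE2\xE5\xF2";

my $label = charset_label_from_html($bytes);
my $enc   = defined $label ? enco_charset_name($label) : undef;

if (defined $enc) {
    # $enc is a name that Encode understands, so it can be handed to decode().
    my $text = Encode::decode($enc, $bytes);
    printf "decoded %d characters using Encode name '%s'\n", length $text, $enc;
}
else {
    warn "no Encode name found for this document\n";
}
```

The same label could instead be passed to map8_charset_name() or umap_charset_name() when one of the Unicode::MapXXX modules is preferred over Encode.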
If you don't have the module Unicode::Map installed, the umap_ functions will
always return undef. If you don't have the module Unicode::Map8 installed, the
map8_ functions will always return undef. If you don't have the module
Unicode::MapUTF8 installed, the umu8_ functions will always return undef. If
you don't have the iconv library installed, the libi_ functions will always
return undef. If you don't have the Encode module installed, the enco_
functions will always return undef.
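Because each family of functions returns undef when its backing module or library is missing, a caller can probe them in order of preference. The following is only a sketch: the helper first_available_scheme() and the preference order are assumptions made for illustration, not something the module provides.

```perl
use strict;
use warnings;
use I18N::Charset;

# Hypothetical helper: return (scheme, name) for the first conversion scheme
# that reports a name for the given charset, or an empty list if none does.
sub first_available_scheme {
    my ($charset) = @_;
    my @lookups = (
        [ 'Encode'           => \&enco_charset_name ],
        [ 'Unicode::MapUTF8' => \&umu8_charset_name ],
        [ 'Unicode::Map8'    => \&map8_charset_name ],
        [ 'Unicode::Map'     => \&umap_charset_name ],
        [ 'iconv'            => \&libi_charset_name ],
    );
    for my $lookup (@lookups) {
        my ($scheme, $fn) = @$lookup;
        my $name = $fn->($charset);    # undef if unknown or backend missing
        return ($scheme, $name) if defined $name;
    }
    return;
}

for my $charset ('windows-1251', 'x-sjis') {
    my ($scheme, $name) = first_available_scheme($charset);
    if (defined $scheme) {
        print "$charset: use $scheme with '$name'\n";
    }
    else {
        print "$charset: no installed conversion scheme knows this name\n";
    }
}
```

Which line is printed for each charset depends on which of the optional modules are installed on the system.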
CONVERSION ROUTINES
There are four main conversion routines: "iana_charset_name()",
"map8_charset_name()", "umap_charset_name()", and
"umu8_charset_name()". Related routines for MIME, Encode, iconv, and
MIBenum lookups are documented in the same list.
- iana_charset_name()
- This function takes a string containing the name of a
character set and returns a string which contains the official IANA name
of the character set identified. If no valid character set name can be
identified, then "undef" will be returned. The case and
punctuation within the string are not important.
$sCharset = iana_charset_name('WinCyrillic');
- mime_charset_name()
- This function takes a string containing the name of a
character set and returns a string which contains the preferred MIME name
of the character set identified. If no valid character set name can be
identified, then "undef" will be returned. The case and
punctuation within the string are not important.
$sCharset = mime_charset_name('Extended_UNIX_Code_Packed_Format_for_Japanese');
- enco_charset_name()
- This function takes a string containing the name of a
character set and returns a string which contains a name of the character
set suitable to be passed to the Encode module. If no valid character set
name can be identified, or if Encode is not installed, then
"undef" will be returned. The case and punctuation within the
string are not important.
$sCharset = enco_charset_name('Extended_UNIX_Code_Packed_Format_for_Japanese');
- libi_charset_name()
- This function takes a string containing the name of a
character set and returns a string which contains a name of the character
set suitable to be passed to iconv. If no valid character set name can be
identified, then "undef" will be returned. The case and
punctuation within the string are not important.
$sCharset = libi_charset_name('Extended_UNIX_Code_Packed_Format_for_Korean');
- mib_to_charset_name()
- This function takes a string containing the MIBenum of a
character set and returns a string which contains a name for the character
set. If the given MIBenum does not correspond to any character set, then
"undef" will be returned.
$sCharset = mib_to_charset_name('3');
- mib_charset_name()
- This is a synonym for mib_to_charset_name().
- charset_name_to_mib()
- This function takes a string containing the name of a
character set in almost any format and returns a MIBenum for the character
set. For IANA-registered character sets, this is the IANA-registered MIB.
For non-IANA character sets, this is an unambiguous unique string whose
only use is to pass to other functions in this module. If no valid
character set name can be identified, then "undef" will be
returned.
$iMIB = charset_name_to_mib('US-ASCII');
- map8_charset_name()
- This function takes a string containing the name of a
character set (in almost any format) and returns a string which contains a
name for the character set that can be passed to
Unicode::Map8::new(). Note: the returned string will be capitalized
just like the name of the .bin file in the Unicode::Map8::MAPS_DIR
directory. If no valid character set name can be identified, then
"undef" will be returned. The case and punctuation within the
argument string are not important.
$sCharset = map8_charset_name('windows-1251');
- umap_charset_name()
- This function takes a string containing the name of a
character set (in almost any format) and returns a string which contains a
name for the character set that can be passed to
Unicode::Map::new(). If no valid character set name can be
identified, then "undef" will be returned. The case and
punctuation within the argument string are not important.
$sCharset = umap_charset_name('hebrew');
- umu8_charset_name()
- This function takes a string containing the name of a
character set (in almost any format) and returns a string which contains a
name for the character set that can be passed to
Unicode::MapUTF8::new(). If no valid character set name can be
identified, then "undef" will be returned. The case and
punctuation within the argument string are not important.
$sCharset = umu8_charset_name('windows-1251');
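The registry lookups above can be combined to see how one label resolves. This is a minimal sketch; describe_charset() is a hypothetical helper, and the values printed depend on the registry tables shipped with the installed version of the module.

```perl
use strict;
use warnings;
use I18N::Charset;

# Hypothetical helper: collect what the registry lookups report for one label.
# Returns undef if the label is not recognized at all.
sub describe_charset {
    my ($label) = @_;
    my $mib = charset_name_to_mib($label);
    return unless defined $mib;
    return {
        iana     => scalar iana_charset_name($label),
        mime     => scalar mime_charset_name($label),
        mib      => $mib,
        from_mib => scalar mib_to_charset_name($mib),   # round trip via the MIBenum
    };
}

for my $label ('US-ASCII', 'WinCyrillic', 'no-such-charset') {
    my $info = describe_charset($label);
    unless ($info) {
        print "$label: not recognized\n";
        next;
    }
    for my $field (qw(iana mime mib from_mib)) {
        my $value = $info->{$field};
        printf "%-12s %-9s %s\n", $label, $field, defined $value ? $value : '(undef)';
    }
}
```

Checking each result with defined() matters here, since any of the lookups may return undef for a name the registry does not cover.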
QUERY ROUTINES
There is one function which can be used to obtain a list of all IANA-registered
character set names.
- "all_iana_charset_names()"
- Returns a list of all registered IANA character set names.
The names are not in any particular order.
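Since the list comes back unordered, callers will usually want to sort or filter it. A small sketch follows; the function is called by its fully qualified name on the assumption that it may not be exported by default, and the ISO-8859 filter pattern is just an illustration.

```perl
use strict;
use warnings;
use I18N::Charset;

# Sort case-insensitively, because the registry mixes upper and lower case.
my @names = sort { lc $a cmp lc $b } I18N::Charset::all_iana_charset_names();
printf "%d registered IANA names\n", scalar @names;

# Illustrative filter: show only the names that look like ISO 8859 parts.
my @iso8859 = grep { /^ISO[-_]8859/i } @names;
print "  $_\n" for @iso8859;
```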
CHARACTER SET NAME ALIASING
This module supports several semi-private routines for specifying character set
name aliases.
- add_iana_alias()
- This function takes two strings: a new alias, and a target
IANA Character Set Name (or another alias). It defines the new alias to
refer to that character set name (or to the character set name to which
the second alias refers).
Returns the target character set name of the successfully installed alias.
Returns 'undef' if the target character set name is not registered.
Returns 'undef' if the target character set name of the second alias is
not registered.
I18N::Charset::add_iana_alias('my-alias1' => 'Shift_JIS');
With this code, "my-alias1" becomes an alias for the existing IANA
character set name 'Shift_JIS'.
I18N::Charset::add_iana_alias('my-alias2' => 'sjis');
With this code, "my-alias2" becomes an alias for the IANA
character set name referred to by the existing alias 'sjis' (which happens
to be 'Shift_JIS').
- add_map8_alias()
- This function takes two strings: a new alias, and a target
Unicode::Map8 Character Set Name (or an existing alias to a Map8 name). It
defines the new alias to refer to that mapping name (or to the mapping
name to which the second alias refers).
If the first argument is a registered IANA character set name, then all
aliases of that IANA character set name will end up pointing to the target
Map8 mapping name.
Returns the target mapping name of the successfully installed alias. Returns
'undef' if the target mapping name is not registered. Returns 'undef' if
the target mapping name of the second alias is not registered.
I18N::Charset::add_map8_alias('normal' => 'ANSI_X3.4-1968');
With the above statement, "normal" becomes an alias for the
existing Unicode::Map8 mapping name 'ANSI_X3.4-1968'.
I18N::Charset::add_map8_alias('normal' => 'US-ASCII');
With the above statement, "normal" becomes an alias for the
existing Unicode::Map8 mapping name 'ANSI_X3.4-1968' (which is what
"US-ASCII" is an alias for).
I18N::Charset::add_map8_alias('IBM297' => 'EBCDIC-CA-FR');
With the above statement, "IBM297" becomes an alias for the
existing Unicode::Map8 mapping name 'EBCDIC-CA-FR'. As a side effect, all
the aliases for 'IBM297' (i.e. 'cp297' and 'ebcdic-cp-fr') also become
aliases for 'EBCDIC-CA-FR'.
- add_umap_alias()
- This function works identically to add_map8_alias()
above, but operates on Unicode::Map encoding tables.
- add_libi_alias()
- This function takes two strings: a new alias, and a target
iconv Character Set Name (or existing iconv alias). It defines the new
alias to refer to that character set name (or to the character set name to
which the existing alias refers).
Returns the target conversion scheme name of the successfully installed
alias. Returns 'undef' if there is no such target conversion scheme or
alias.
Examples:
I18N::Charset::add_libi_alias('my-chinese1' => 'CN-GB');
With this code, "my-chinese1" becomes an alias for the existing
iconv conversion scheme 'CN-GB'.
I18N::Charset::add_libi_alias('my-chinese2' => 'EUC-CN');
With this code, "my-chinese2" becomes an alias for the iconv
conversion scheme referred to by the existing alias 'EUC-CN' (which
happens to be 'CN-GB').
- add_enco_alias()
- This function takes two strings: a new alias, and a target
Encode encoding name (or existing Encode alias). It defines the new alias
to refer to that encoding name (or to the encoding to which the existing
alias refers).
Returns the target encoding name of the successfully installed alias.
Returns 'undef' if there is no such encoding or alias.
Examples:
I18N::Charset::add_enco_alias('my-japanese1' => 'jis0201-raw');
With this code, "my-japanese1" becomes an alias for the existing
encoding 'jis0201-raw'.
I18N::Charset::add_enco_alias('my-japanese2' => 'my-japanese1');
With this code, "my-japanese2" becomes an alias for the encoding
referred to by the existing alias 'my-japanese1' (which happens to be
'jis0201-raw' after the previous call).
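Since every add_*_alias() routine returns undef on failure, it is worth checking the return value before relying on a new alias. The sketch below does this for add_iana_alias(); the alias names are made up for the example, and the module may also emit its own warning when the target is unknown.

```perl
use strict;
use warnings;
use I18N::Charset;

# Alias an existing IANA name, then alias that alias.
my $first  = I18N::Charset::add_iana_alias('my-alias1' => 'Shift_JIS');
my $second = I18N::Charset::add_iana_alias('my-alias2' => 'my-alias1');

# The target here is deliberately bogus, so undef is expected.
my $bad = I18N::Charset::add_iana_alias('my-alias3' => 'no-such-charset');

for my $result ( [ 'my-alias1' => $first ], [ 'my-alias2' => $second ], [ 'my-alias3' => $bad ] ) {
    my ($alias, $target) = @$result;
    if (defined $target) {
        print "$alias now refers to $target\n";
    }
    else {
        print "$alias was not installed\n";
    }
}
```

Once installed, an alias can be passed to iana_charset_name() like any other name.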
KNOWN BUGS AND LIMITATIONS
- There could probably be many more aliases added (for
convenience) to all the IANA names. If you have some specific
recommendations, please email the author!
- The only character set names which have a corresponding
mapping in the Unicode::Map8 module are the character sets that
Unicode::Map8 can convert.
Similarly, the only character set names which have a corresponding mapping
in the Unicode::Map module are the character sets that Unicode::Map can
convert.
- In the current implementation, all tables are read in and
initialized when the module is loaded, and then held in memory until the
program exits. A "lazy" implementation (or a less-portable tied
hash) might lead to a shorter startup time. Suggestions, patches, comments
are always welcome!
SEE ALSO
- Unicode::Map
- Convert strings from various multi-byte character encodings
to and from Unicode.
- Unicode::Map8
- Convert strings from various 8-bit character encodings to
and from Unicode.
- Jcode
- Convert strings among various Japanese character encodings
and Unicode.
- Unicode::MapUTF8
- A wrapper around all three of these character set
conversion distributions.
AUTHOR
Martin 'Kingpin' Thurn, "mthurn at cpan.org",
<http://tinyurl.com/nn67z>.
LICENSE
This module is free software; you can redistribute it and/or modify it under the
same terms as Perl itself.