NAME
Unicode::Japanese - Convert encoding of Japanese text
SYNOPSIS
 use Unicode::Japanese;
 use Unicode::Japanese qw(unijp);
 
 # convert utf8 -> sjis
 
 print Unicode::Japanese->new($str)->sjis;
 print unijp($str)->sjis; # same as above.
 
 # convert sjis -> utf8
 
 print Unicode::Japanese->new($str,'sjis')->get;
 
 # convert sjis (imode_EMOJI) -> utf8
 
 print Unicode::Japanese->new($str,'sjis-imode')->get;
 
 # convert zenkaku (utf8) -> hankaku (utf8)
 
 print Unicode::Japanese->new($str)->z2h->get;
DESCRIPTION
The Unicode::Japanese module converts Japanese text from one encoding to
  another.
FEATURES
  - An instance of Unicode::Japanese internally holds a string in UTF-8.

  - This module has two implementations: XS and pure perl. If efficiency
    is important to you, build and install the XS module. If you don't
    want to, or can't, build the XS module, you may use the pure perl
    module instead. In that case, all you have to do is copy Japanese.pm
    somewhere in @INC.

  - This module can convert characters from zenkaku (full-width) form to
    hankaku (half-width) form, and vice versa. Conversion between hiragana
    and katakana (the two Japanese phonetic syllabaries) is also supported.

  - This module has mapping tables for emoji (graphic characters) defined
    by various Japanese mobile phone services: DoCoMo i-mode, ASTEL dot-i
    and J-PHONE J-Sky. These characters are mapped to the Unicode Private
    Use Area, so the Unicode strings the module outputs remain valid even
    if they contain emoji, and you can safely pass them to other software
    that handles Unicode.

  - This module can map some emoji from one set to another. Different
    mobile phones define different sets of emoji, so mapping between them
    is not always possible. But since some emoji exist in two or more sets
    with similar appearance, this module treats those emoji as the same.

  - This module uses the mapping table for MS-CP932 instead of standard
    Shift_JIS. The Shift_JIS encoding used by MS-Windows (MS-SJIS/MS-CP932)
    differs slightly from the standard.

  - When the module converts strings from Unicode to Shift_JIS, EUC-JP or
    ISO-2022-JP, Unicode characters that can't be represented in those
    encodings are encoded in "&#dddd;" form (decimal character reference).
    Note, however, that characters in the Unicode Private Use Area are
    replaced with a '?' mark ('QUESTION MARK'; U+003F) instead of being
    encoded. In addition, when encoding to character sets for mobile
    phones, every unrepresentable character is replaced with a '?' mark.

  - On perl-5.8.0 or later, this module handles the UTF-8 flag: the method
    utf8() returns a UTF-8 byte string, and the method getu() returns a
    UTF-8 character string.

    Currently the method get() returns a UTF-8 byte string, but this
    behavior may change in the future.

    Methods such as sjis(), jis() and utf8() return byte strings. The
    methods new(), set() and getcode() simply ignore the UTF-8 flag of the
    strings they take.
REQUIREMENT
  - perl 5.10.x, 5.8.x, etc. (5.004 and later)

  - (optional) C compiler. This module supports both XS and pure perl. If
    you have no C compiler, Unicode::Japanese will be installed as a pure
    perl module.

  - (optional) Test.pm and Test::More for testing.

No other modules are required at run time.
METHODS
  - $s = Unicode::Japanese->new($str [, $icode [, $encode]])

    Create a new instance of Unicode::Japanese.

    Any given parameters are passed internally to the method set().

  - $s = unijp($str [, $icode [, $encode]])

    Same as Unicode::Japanese->new(...).

  - $s->set($str [, $icode [, $encode]])

    - $str: string
    - $icode: optional character encoding (default: 'utf8')
    - $encode: optional binary encoding (default: no binary encoding is
      assumed)

Store a string into the instance.
 
Possible character encodings are:
 
 auto
 utf8 ucs2 ucs4
 utf16-be utf16-le utf16
 utf32-be utf32-le utf32
 sjis cp932 euc euc-jp jis
 sjis-imode sjis-imode1 sjis-imode2
 utf8-imode utf8-imode1 utf8-imode2
 sjis-doti sjis-doti1
 sjis-jsky sjis-jsky1 sjis-jsky2
 jis-jsky  jis-jsky1  jis-jsky2
 utf8-jsky utf8-jsky1 utf8-jsky2
 sjis-au sjis-au1 sjis-au2
 jis-au  jis-au1  jis-au2
 sjis-icon-au sjis-icon-au1 sjis-icon-au2
 euc-icon-au  euc-icon-au1  euc-icon-au2
 jis-icon-au  jis-icon-au1  jis-icon-au2
 utf8-icon-au utf8-icon-au1 utf8-icon-au2
 ascii binary
 
(see also "SUPPORTED ENCODINGS".)
 
If you want Unicode::Japanese to detect the character encoding of the string,
  you must explicitly specify 'auto' as the second argument. In that case, the
  given string is passed to the method getcode() to guess the encoding.
 
For binary encodings, only 'base64' is currently supported. If you specify
  'base64' as the third argument, the given string will be decoded using Base64
  decoder.
 
Specify 'binary' as the second argument if you want your string to be stored
  without modification.
 
When you specify 'sjis-imode' or 'sjis-doti' as the character encoding, any
  occurrences of '&#dddd;' (decimal character reference) in the string are
  interpreted and decoded as emoji code points, just like emoji embedded in
  the string in binary form.
 
Since the byte sequences of different encodings overlap, it is not always
  possible to detect with certainty which encoding a given string uses.
 
When a given string can be interpreted as both Shift_JIS and UTF-8, this
  module considers it to be encoded in Shift_JIS. And if it cannot tell
  'sjis-au' from 'sjis-doti', it considers the string to be 'sjis-au'.
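
The second and third arguments described above might be used as follows. This is a small sketch rather than a reference example: it assumes an installed Unicode::Japanese, and the sample byte sequences (HIRAGANA LETTER A in Shift_JIS, ISO-2022-JP, and Base64-encoded UTF-8) were chosen only for illustration.

```perl
use strict;
use warnings;
use Unicode::Japanese;

# Shift_JIS bytes for HIRAGANA LETTER A (U+3042).
my $s = Unicode::Japanese->new("\x82\xA0", 'sjis');
my $from_sjis = $s->utf8;

# The same character as Base64 of its UTF-8 bytes (E3 81 82).
$s->set("44GC", 'utf8', 'base64');
my $from_base64 = $s->utf8;

# The same character in ISO-2022-JP; 'auto' asks the module to guess.
$s->set("\e\$B\x24\x22\e(B", 'auto');
my $from_auto = $s->utf8;

# Print as hex so the output does not depend on the terminal encoding.
print join(' ', map { unpack('H*', $_) }
    $from_sjis, $from_base64, $from_auto), "\n";
```

Passing the encoding explicitly is the safer choice whenever it is known, since guessing can fail for the reasons given above.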
 
  - $str = $s->get

Get the internal string in UTF-8.

This method currently returns a byte string (whose UTF-8 flag is turned off),
  but this behavior may change in the future.

If you absolutely want a byte string, use the method utf8() instead. And if
  you want a character string (whose UTF-8 flag is turned on), use the method
  getu().

  - $str = $s->getu

Get the internal string in UTF-8.

On perl-5.8.0 or later, this method returns a character string with its UTF-8
  flag turned on.
  - $code = $s->getcode($str)

    - $str: string
    - $code: name of character encoding

Detect the character encoding of the given string.

Note that this method, exceptionally, doesn't deal with the internal string of
  an instance.

To guess the encoding, the following algorithm is used:
 
(For pure perl implementation)
  1. If the string has a UTF-32 BOM, its encoding is 'utf32'.

  2. If it has a UTF-16 BOM, its encoding is 'utf16'.

  3. If it is valid for UTF-32BE, its encoding is 'utf32-be'.

  4. If it is valid for UTF-32LE, its encoding is 'utf32-le'.

  5. If it contains neither ESC characters nor bytes whose eighth bit is
     on, its encoding is 'ascii'. All ASCII control characters (0x00-0x1F
     and 0x7F) except ESC (0x1B) are considered to be within 'ascii'.

  6. If it contains escape sequences of ISO-2022-JP, its encoding is 'jis'.

  7. If it contains any emoji defined for J-PHONE, its encoding is
     'sjis-jsky'.

  8. If it is valid for EUC-JP, its encoding is 'euc'.

  9. If it is valid for Shift_JIS, its encoding is 'sjis'.

  10. If it contains any emoji defined for au, and everything else is
      valid for Shift_JIS, its encoding is 'sjis-au'.

  11. If it contains any emoji defined for i-mode, and everything else is
      valid for Shift_JIS, its encoding is 'sjis-imode'.

  12. If it contains any emoji defined for dot-i, and everything else is
      valid for Shift_JIS, its encoding is 'sjis-doti'.

  13. If it is valid for UTF-8, its encoding is 'utf8'.

  14. If none of the conditions above is fulfilled, its encoding is
      'unknown'.
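
The first checks of the pure perl algorithm can be sketched with core Perl alone. The helper below is hypothetical and is not the module's code: it covers only steps 1, 2 and 5, and skips the BOM-less UTF-32 checks of steps 3 and 4, so it can disagree with getcode() on such input. It does show why the UTF-32 BOM has to be tested before the UTF-16 BOM.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Unicode::Japanese. Expects a byte string.
sub guess_prefix {
    my ($bytes) = @_;

    # Step 1: UTF-32 BOM. This must come first, because the UTF-32LE BOM
    # (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
    return 'utf32' if $bytes =~ /\A(?:\x00\x00\xFE\xFF|\xFF\xFE\x00\x00)/;

    # Step 2: UTF-16 BOM.
    return 'utf16' if $bytes =~ /\A(?:\xFE\xFF|\xFF\xFE)/;

    # Step 5: no ESC (0x1B) and no byte with the eighth bit set.
    return 'ascii' if $bytes !~ /[\x1B\x80-\xFF]/;

    # Steps 6 to 14 (jis, euc, sjis, ...) are not sketched here.
    return;
}

for my $sample ("\xFF\xFE\x00\x00a\x00\x00\x00", "\xFE\xFF\x00a", "hello\n") {
    printf "%s\n", guess_prefix($sample) // 'undecided';
}
```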
 
 
 
(For XS implementation)
  1. If the string has a UTF-32 BOM, its encoding is 'utf32'.

  2. If it has a UTF-16 BOM, its encoding is 'utf16'.

  3. Find all encodings that might have been applied to the string from
     the following:

     ascii / euc / sjis / jis / utf8 / utf32-be / utf32-le / sjis-jsky /
     sjis-imode / sjis-au / sjis-doti

  4. If any encodings were found possible, the module picks the one with
     the highest priority among them. The priority order is as follows:

     utf32-be / utf32-le / ascii / jis / euc / sjis / sjis-jsky /
     sjis-imode / sjis-au / sjis-doti / utf8

  5. If none of the conditions above is fulfilled, its encoding is
     'unknown'.
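
Step 4 of the XS algorithm amounts to a first-match lookup over a fixed priority list. The sketch below uses core Perl only; the function name is hypothetical and the candidate list is supplied by hand rather than computed from a string.

```perl
use strict;
use warnings;

# Priority order as listed for the XS implementation, highest first.
my @PRIORITY = qw(
    utf32-be utf32-le ascii jis euc sjis
    sjis-jsky sjis-imode sjis-au sjis-doti utf8
);

# Hypothetical helper: return the highest-priority name among the
# candidates, or 'unknown' when there is no candidate at all.
sub pick_by_priority {
    my (@candidates) = @_;
    my %is_candidate = map { $_ => 1 } @candidates;
    for my $encoding (@PRIORITY) {
        return $encoding if $is_candidate{$encoding};
    }
    return 'unknown';
}

print pick_by_priority(qw(utf8 sjis)), "\n";   # sjis outranks utf8
print pick_by_priority(), "\n";                # no candidates
```

Because 'utf8' sits at the end of the list, a string that is valid in both Shift_JIS and UTF-8 is reported as 'sjis'.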
 
 
 
Pay attention to the following pitfalls in the above algorithm:
  - UTF-8 strings might be mistakenly considered to be encoded in
    Shift_JIS.

  - UCS-2 strings (sequences of raw UCS-2 characters in big-endian; each
    character always takes 2 bytes) can't be detected, because they look
    like nothing but sequences of random bytes of even length.

  - UTF-16 strings must have a BOM to be detected.

  - Emoji are only recognized if they are embedded in the string in binary
    form. If they are written in '&#dddd;' form, they aren't considered to
    be emoji.
 
 
 
Since the XS and pure perl implementations use different algorithms to guess
  the encoding, they may guess differently for the same string. In particular,
  the pure perl implementation recognizes a Shift_JIS string containing an ESC
  character (0x1B) as Shift_JIS, but the XS implementation does not, because
  such strings can hardly be distinguished from 'sjis-jsky'. The XS
  implementation also rejects EUC-JP strings containing an ESC character for
  the same reason.
 
  - $code = $s->getcodelist($str)

    - $str: string
    - $code: names of character encodings

Detect the character encoding of the given string.

Unlike the method getcode(), getcodelist() returns a list of possible
  encodings.
 
  - $str = $s->conv($ocode, $encode)

    - $ocode: character encoding. Possible encodings are:

 utf8 ucs2 ucs4 utf16
 sjis cp932 euc euc-jp jis
 sjis-imode sjis-imode1 sjis-imode2
 utf8-imode utf8-imode1 utf8-imode2
 sjis-doti sjis-doti1
 sjis-jsky sjis-jsky1 sjis-jsky2
 jis-jsky  jis-jsky1  jis-jsky2
 utf8-jsky utf8-jsky1 utf8-jsky2
 sjis-au sjis-au1 sjis-au2
 jis-au  jis-au1  jis-au2
 sjis-icon-au sjis-icon-au1 sjis-icon-au2
 euc-icon-au  euc-icon-au1  euc-icon-au2
 jis-icon-au  jis-icon-au1  jis-icon-au2
 utf8-icon-au utf8-icon-au1 utf8-icon-au2
 binary
    
     
    (see also "SUPPORTED ENCODINGS".)

    Some encodings for mobile phones have a trailing digit, like 'sjis-au2'.
      The digit is the version number of the encoding. Each such encoding
      also has a variant with no trailing digit, like 'sjis-au', which is the
      same as the latest version among its variants.

    - $encode: optional binary encoding
    - $str: string

Return the internal string of the instance, encoded in the given character
  encoding.

If you want the resulting string to be encoded in Base64, specify 'base64' as
  the second argument.

On perl-5.8.0 or later, the UTF-8 flag of the resulting string is turned off
  even if you specify 'utf8' as the first argument.
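
A short sketch of conv() with and without the binary encoding argument, assuming an installed Unicode::Japanese. The exact Base64 output (for example, whether a trailing line break is appended) is not specified above, so the example only prints it.

```perl
use strict;
use warnings;
use Unicode::Japanese;

# UTF-8 bytes for HIRAGANA LETTER A (U+3042).
my $s = Unicode::Japanese->new("\xE3\x81\x82");

# Encode the internal string as Shift_JIS (MS-CP932 mapping).
my $sjis = $s->conv('sjis');

# Encode as UTF-8, then apply Base64 as the binary encoding.
my $base64 = $s->conv('utf8', 'base64');

printf "sjis bytes: %s\n", unpack('H*', $sjis);
print  "base64:     $base64\n";
```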
 
  - $s->tag2bin

    Interpret decimal character references (&#dddd;) in the instance, and
    replace them with the single characters they represent.
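
As an illustration, the following sketch (assuming an installed Unicode::Japanese) stores a string containing the reference for U+3042 (decimal 12354) and reads it back after tag2bin(). The methods are called as separate statements so that nothing is assumed about the return value of tag2bin().

```perl
use strict;
use warnings;
use Unicode::Japanese;

# "&#12354;" is the decimal character reference for HIRAGANA LETTER A.
my $s = Unicode::Japanese->new("A&#12354;B");
$s->tag2bin;

my $utf8 = $s->utf8;
printf "%s\n", unpack('H*', $utf8);   # hex dump of the resulting bytes
```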
 
  - $s->z2h

    Replace zenkaku (full-width) letters in the instance with hankaku
    (half-width) letters.

  - $s->h2z

    Replace hankaku (half-width) letters in the instance with zenkaku
    (full-width) letters.

  - $s->hira2kata

    Replace any hiragana in the instance with katakana.

  - $s->kata2hira

    Replace any katakana in the instance with hiragana.
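
The in-place conversions above might be exercised like this. The sketch assumes an installed Unicode::Japanese; the sample characters are written as UTF-8 byte escapes so that the source file's own encoding does not matter.

```perl
use strict;
use warnings;
use Unicode::Japanese;

my $zenkaku_a  = "\xEF\xBC\xA1";   # U+FF21 FULLWIDTH LATIN CAPITAL LETTER A
my $hiragana_a = "\xE3\x81\x82";   # U+3042 HIRAGANA LETTER A

# Full-width to half-width.
my $s = Unicode::Japanese->new($zenkaku_a);
$s->z2h;
my $hankaku = $s->utf8;

# Hiragana to katakana and back again.
my $k = Unicode::Japanese->new($hiragana_a);
$k->hira2kata;
my $katakana = $k->utf8;
$k->kata2hira;
my $hiragana = $k->utf8;

printf "%s %s %s\n", map { unpack('H*', $_) } $hankaku, $katakana, $hiragana;
```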
 
  - $str = $s->jis

    $str: byte string in ISO-2022-JP

    Return the internal string encoded in ISO-2022-JP.

  - $str = $s->euc

    $str: byte string in EUC-JP

    Return the internal string encoded in EUC-JP.

  - $str = $s->utf8

    $str: byte string in UTF-8

    Return the internal UTF-8 string of the instance.

    On perl-5.8.0 or later, the UTF-8 flag of the resulting string is
    turned off.

  - $str = $s->ucs2

    $str: byte string in UCS-2

    Return the internal string as a sequence of raw UCS-2 characters in
    big-endian. Note that this differs from UTF-16BE, as a raw UCS-2
    sequence has no concept of surrogate pairs.

  - $str = $s->ucs4

    $str: byte string in UCS-4

    Return the internal string as a sequence of raw UCS-4 characters in
    big-endian. This is practically the same as UTF-32BE.

  - $str = $s->utf16

    $str: byte string in UTF-16

    Return the internal string encoded in big-endian UTF-16 with no BOM
    prepended.

  - $str = $s->sjis

    $str: byte string in Shift_JIS

    Return the internal string encoded in Shift_JIS (MS-SJIS / MS-CP932).

  - $str = $s->sjis_imode

    $str: byte string in 'sjis-imode'

    Return the internal string encoded in 'sjis-imode'.

  - $str = $s->sjis_imode1

    $str: byte string in 'sjis-imode1'

    Return the internal string encoded in 'sjis-imode1'.

  - $str = $s->sjis_imode2

    $str: byte string in 'sjis-imode2'

    Return the internal string encoded in 'sjis-imode2'.

  - $str = $s->sjis_doti

    $str: byte string in 'sjis-doti'

    Return the internal string encoded in 'sjis-doti'.

  - $str = $s->sjis_jsky

    $str: byte string in 'sjis-jsky'

    Return the internal string encoded in 'sjis-jsky'.

  - $str = $s->sjis_jsky1

    $str: byte string in 'sjis-jsky1'

    Return the internal string encoded in 'sjis-jsky1'.

  - $str = $s->sjis_jsky2

    $str: byte string in 'sjis-jsky2'

    Return the internal string encoded in 'sjis-jsky2'.

  - $str = $s->sjis_icon_au

    $str: byte string in 'sjis-icon-au'

    Return the internal string encoded in 'sjis-icon-au'.
  - $str_arrayref = $s->strcut($len)

    - $len: maximum length of each chunk (in number of full-width
      characters)
    - $str_arrayref: reference to an array of strings

Split the internal string of the instance into chunks of the given length.

On perl-5.8.0 or later, the UTF-8 flag of each chunk is turned on.

  - $len = $s->strlen

    $len: character width of the internal string

    Calculate the character width of the internal string. Half-width
    characters have a width of one unit, and full-width characters have a
    width of two units.
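
For instance, a string made of three half-width letters and one full-width character should have a width of 3 + 2 = 5 units under the rule above. A minimal sketch, assuming an installed Unicode::Japanese:

```perl
use strict;
use warnings;
use Unicode::Japanese;

# "abc" (three half-width letters) followed by HIRAGANA LETTER A
# (one full-width character), given as UTF-8 bytes.
my $s = Unicode::Japanese->new("abc\xE3\x81\x82");

my $width = $s->strlen;
print "display width: $width\n";
```

Note that this is a display width, so it differs from both the byte length (6) and the character count (4) of the same string.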
  - $s->join_csv(@values);

    @values: array of strings

    Build a line of CSV from the arguments, and store it into the instance.
    The resulting line has a trailing line break ("\n").

  - @values = $s->split_csv;

    @values: array of strings

    Parse a line of CSV in the instance and return its columns. The line
    is chomp()ed before being parsed.

    If the internal string was decoded from the 'binary' encoding (see
    methods new() and set()), the UTF-8 flags of the resulting strings are
    turned off. Otherwise the flags are turned on.
SUPPORTED ENCODINGS
 +---------------+----+-----+-------+
 |encoding       | in | out | guess |
 +---------------+----+-----+-------+
 |auto           : OK : --  | ----- |
 +---------------+----+-----+-------+
 |utf8           : OK : OK  | OK    |
 |ucs2           : OK : OK  | ----- |
 |ucs4           : OK : OK  | ----- |
 |utf16-be       : OK : --  | ----- |
 |utf16-le       : OK : --  | ----- |
 |utf16          : OK : OK  | OK(#) |
 |utf32-be       : OK : --  | OK    |
 |utf32-le       : OK : --  | OK    |
 |utf32          : OK : --  | OK(#) |
 +---------------+----+-----+-------+
 |sjis           : OK : OK  | OK    |
 |cp932          : OK : OK  | ----- |
 |euc            : OK : OK  | OK    |
 |euc-jp         : OK : OK  | ----- |
 |jis            : OK : OK  | OK    |
 +---------------+----+-----+-------+
 |sjis-imode     : OK : OK  | OK    |
 |sjis-imode1    : OK : OK  | ----- |
 |sjis-imode2    : OK : OK  | ----- |
 |utf8-imode     : OK : OK  | ----- |
 |utf8-imode1    : OK : OK  | ----- |
 |utf8-imode2    : OK : OK  | ----- |
 +---------------+----+-----+-------+
 |sjis-doti      : OK : OK  | OK    |
 |sjis-doti1     : OK : OK  | ----- |
 +---------------+----+-----+-------+
 |sjis-jsky      : OK : OK  | OK    |
 |sjis-jsky1     : OK : OK  | ----- |
 |sjis-jsky2     : OK : OK  | ----- |
 |jis-jsky       : OK : OK  | ----- |
 |jis-jsky1      : OK : OK  | ----- |
 |jis-jsky2      : OK : OK  | ----- |
 |utf8-jsky      : OK : OK  | ----- |
 |utf8-jsky1     : OK : OK  | ----- |
 |utf8-jsky2     : OK : OK  | ----- |
 +---------------+----+-----+-------+
 |sjis-au        : OK : OK  | OK    |
 |sjis-au1       : OK : OK  | ----- |
 |sjis-au2       : OK : OK  | ----- |
 |jis-au         : OK : OK  | ----- |
 |jis-au1        : OK : OK  | ----- |
 |jis-au2        : OK : OK  | ----- |
 |sjis-icon-au   : OK : OK  | ----- |
 |sjis-icon-au1  : OK : OK  | ----- |
 |sjis-icon-au2  : OK : OK  | ----- |
 |euc-icon-au    : OK : OK  | ----- |
 |euc-icon-au1   : OK : OK  | ----- |
 |euc-icon-au2   : OK : OK  | ----- |
 |jis-icon-au    : OK : OK  | ----- |
 |jis-icon-au1   : OK : OK  | ----- |
 |jis-icon-au2   : OK : OK  | ----- |
 |utf8-icon-au   : OK : OK  | ----- |
 |utf8-icon-au1  : OK : OK  | ----- |
 |utf8-icon-au2  : OK : OK  | ----- |
 +---------------+----+-----+-------+
 |ascii          : OK : --  | OK    |
 |binary         : OK : OK  | ----- |
 +---------------+----+-----+-------+
 (#): guessed only when the string has a BOM.
GUESSING ORDER
 1.  utf32 (#)
 2.  utf16 (#)
 3.  utf32-be
 4.  utf32-le
 5.  ascii
 6.  jis
 7.  sjis-jsky (pp)
 8.  euc
 9.  sjis
 10. sjis-jsky (xs)
 11. sjis-au
 12. sjis-imode
 13. sjis-doti
 14. utf8
 15. unknown
 (#): only when the string has a BOM. (pp): pure perl implementation.
 (xs): XS implementation.
DESCRIPTION OF UNICODE MAPPING
Transcoding between Unicode encodings and other ones is performed as below:
  - Shift_JIS

    This module uses the mapping table of MS-CP932.

    <ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT>

    When the module converts a Unicode string to Shift_JIS, it represents
    most characters that aren't available in Shift_JIS as decimal character
    references ('&#dddd;'). There is one exception: every graphic character
    for mobile phones is replaced with a '?' mark.

    For the variants of Shift_JIS defined for mobile phones, every
    unrepresentable character is replaced with a '?' mark, unlike plain
    Shift_JIS.

  - EUC-JP/ISO-2022-JP

    This module doesn't convert Unicode strings directly from/to EUC-JP or
    ISO-2022-JP: it converts via Shift_JIS as an intermediate step. So
    characters that aren't available in Shift_JIS cannot be translated
    properly.

  - DoCoMo i-mode

    This module maps emoji in the range F800 - F9FF to U+0FF800 - U+0FF9FF.

  - ASTEL dot-i

    This module maps emoji in the range F000 - F4FF to U+0FF000 - U+0FF4FF.

  - J-PHONE J-SKY

    The encoding method defined by J-SKY is as follows: first comes an
    escape sequence "\e\$" indicating the beginning of emoji, then the
    first byte of an emoji, then the second bytes of one or more emoji, and
    finally "\x0f" indicating the end of emoji. If a string contains a
    series of emoji whose first bytes are identical, the sequence can be
    compressed by appending their second bytes to a single first byte.

    This module considers each pair of first and second bytes to be one
    character, and maps them from 4500 - 47FF to U+0FFB00 - U+0FFDFF.

    When the module encodes J-SKY emoji, it performs the compression
    automatically.

  - AU

    This module maps AU emoji to U+0FF500 - U+0FF6FF.
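
The ranges listed above have matching sizes on both sides, which suggests a plain offset from the carrier code to the Unicode code point. The sketch below rests on that assumption: it is not taken from the module's conversion tables, and the table and function names are hypothetical. The AU range is left out because only its Unicode side is given above.

```perl
use strict;
use warnings;

# Source range and Unicode base per carrier, as listed above.
# ASSUMPTION: codes map linearly within each range.
my %EMOJI_RANGE = (
    'i-mode' => { first => 0xF800, last => 0xF9FF, base => 0x0FF800 },
    'dot-i'  => { first => 0xF000, last => 0xF4FF, base => 0x0FF000 },
    'j-sky'  => { first => 0x4500, last => 0x47FF, base => 0x0FFB00 },
);

# Hypothetical helper: return the Unicode code point for a carrier emoji
# code, or nothing if the carrier is unknown or the code is out of range.
sub emoji_to_private_use {
    my ($carrier, $code) = @_;
    my $range = $EMOJI_RANGE{$carrier} or return;
    return if $code < $range->{first} || $code > $range->{last};
    return $range->{base} + ($code - $range->{first});
}

printf "U+%06X\n", emoji_to_private_use('i-mode', 0xF89F);
printf "U+%06X\n", emoji_to_private_use('j-sky',  0x4501);
```

For J-SKY the code passed in is the pair of first and second bytes read as one 16-bit value, following the description above.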
 
PurePerl mode
   use Unicode::Japanese qw(PurePerl);
To explicitly select the pure perl implementation, pass 'PurePerl' as an
  argument to the "use" statement.
BUGS
Please report bugs and requests to "bug-unicode-japanese at rt.cpan.org" or
  <http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Unicode-Japanese>. If you
  report them through the web interface, you will automatically be notified
  of any progress on your report.

  - This module doesn't convert Unicode strings directly from/to EUC-JP or
    ISO-2022-JP: it converts via Shift_JIS as an intermediate step. So
    characters that aren't available in Shift_JIS cannot be translated
    properly.

  - The XS implementation of getcode() fails to detect the encoding when
    the given string contains \e while its encoding is EUC-JP or Shift_JIS.

  - Japanese.pm is composed of textual perl script and a binary character
    conversion table. If you transfer it over FTP in ASCII mode, the file
    will be corrupted.
 
SUPPORT
You can find documentation for this module with the perldoc command.
    perldoc Unicode::Japanese
You can find more information at:
  - AnnoCPAN: Annotated CPAN documentation

    <http://annocpan.org/dist/Unicode-Japanese>

  - CPAN Ratings

    <http://cpanratings.perl.org/d/Unicode-Japanese>

  - RT: CPAN's request tracker

    <http://rt.cpan.org/NoAuth/Bugs.html?Dist=Unicode-Japanese>

  - Search CPAN

    <http://search.cpan.org/dist/Unicode-Japanese>
CREDITS
Thanks very much to:
NAKAYAMA Nao
SUGIURA Tatsuki & Debian JP Project
COPYRIGHT & LICENSE
Copyright 2001-2008 SANO Taku (SAWATARI Mikage) and YAMASHINA Hio, all rights
  reserved.
This program is free software; you can redistribute it and/or modify it under
  the same terms as Perl itself.