NAME
kcc - Kanji code converter with automatic encoding detection
SYNOPSIS
kcc [ -IOchnvxz ] [ -b bufsize ] [ file ] ...
DESCRIPTION
kcc is a filter that reads each file sequentially, converts its kanji
encoding, and writes the result to stdout. If no file is specified, or if -
is given as a filename, it reads from stdin. You can specify the kanji
encodings for input and output. If you do not specify an input encoding,
kcc detects it automatically.
The available kanji encodings are JIS (7 bit and/or 8 bit), Shift JIS, EUC
and DEC. The input may mix 7 bit JIS with one of EUC, DEC or Shift JIS.
Both SI/SO and ESC(I are recognized as designating JIS halfwidth kana.
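
To make the difference between these encodings concrete, the following Python sketch encodes the same two kanji three ways. It is only an illustration: Python's standard codecs (`iso2022_jp`, `shift_jis`, `euc_jp`) stand in for kcc here, the helper name `encode_all` is made up for this example, and DEC kanji is left out because Python ships no codec for it.

```python
# Illustration only: Python's standard codecs stand in for kcc.
TEXT = "漢字"


def encode_all(text):
    """Return the bytes of `text` in three of the encodings kcc handles."""
    return {
        "7 bit JIS": text.encode("iso2022_jp"),
        "Shift JIS": text.encode("shift_jis"),
        "EUC": text.encode("euc_jp"),
    }


if __name__ == "__main__":
    for name, data in encode_all(TEXT).items():
        # Print each encoding as space-separated hex bytes.
        print(f"{name:10} {data.hex(' ')}")
```

The 7 bit JIS form is the only one that carries escape sequences; the other two differ only in the ranges their 8 bit bytes fall into.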
OPTIONS
- -O
- -IO
- I selects the input kanji encoding and O the output kanji encoding.
When no input encoding is specified, it is detected automatically. If
neither input nor output encoding is specified, the output encoding is
7 bit JIS.
- You can specify one of the following for the input encoding option, I.
- e
- EUC (may be mixed with 7 bit JIS)
- d
- DEC (may be mixed with 7 bit JIS)
- s
- Shift JIS (may be mixed with 7 bit JIS)
- j, 7 or k
- 7 bit JIS
- 8
- 8 bit JIS
- You can specify one of the following for the output encoding option, O.
- e
- EUC
- d
- DEC
- s
- Shift JIS
- jXY or 7XY
- 7 bit JIS (using SI/SO for JIS kana designation)
- kXY
- 7 bit JIS (using ESC(I for JIS kana designation)
- 8XY
- 8 bit JIS
- The XY suffix of the O option selects which escape sequences are used in
JIS encoding. The default is BJ. The supplemental kanji designation is
fixed to ESC$(D.
- X
- Kanji is designated by:
- B
- ESC$B (JIS X0208-1983)
- @
- ESC$@ (JIS X0208-1978)
- +
- ESC&@ESC$B (JIS X0208-1990)
- Y
- Alphanumerics are designated by:
- B
- ESC(B (ASCII)
- J
- ESC(J (JIS Roman; JIS X0201)
- H
- ESC(H (Swedish; strongly deprecated)
- -v
- Output the result of input encoding detection to stderr.
- -x
- Extended mode. Automatic detection of the input encoding also recognizes
user-defined characters and extended character regions (the user-defined
area of EUC, undefined halfwidth kana, the control character area C1,
and the extended character region of Shift JIS). DEC and EUC are
distinguished in this mode.
- -z
- Shrink mode. Do not recognize halfwidth kana (except in 7 bit JIS)
during input encoding detection. With this option, automatic detection of
the input encoding becomes much more accurate for files without halfwidth
kana.
- -h
- Normally, halfwidth kana become fullwidth katakana when converted to DEC.
With this option, they become hiragana.
- -n
- User-defined characters, extended characters and supplemental kanji
characters are converted to a fullwidth white box, and the undefined
region of halfwidth kana is converted to a halfwidth centered dot.
- -b bufsize
- Specify the buffer size. The default is 8 kbytes.
- -c
- Do not convert; only check the input encoding and print the result to
stdout. Unlike normal automatic detection, the whole contents of the file
are checked. However, when an inconsistency between encodings is found,
reading is aborted and "data" is printed. Options other than -x and -z
are ignored.
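
The XY suffix described under -IO amounts to two small lookup tables. The Python sketch below shows that mapping, using only the escape sequences listed in this manual. The names `KANJI_DESIGNATION`, `ALNUM_DESIGNATION` and `jis_designations` are hypothetical and do not come from the kcc source.

```python
ESC = b"\x1b"

# X: kanji designation, as listed under the -IO option.
KANJI_DESIGNATION = {
    "B": ESC + b"$B",
    "@": ESC + b"$@",
    "+": ESC + b"&@" + ESC + b"$B",
}

# Y: alphanumeric designation, as listed under the -IO option.
ALNUM_DESIGNATION = {
    "B": ESC + b"(B",
    "J": ESC + b"(J",
    "H": ESC + b"(H",
}


def jis_designations(xy="BJ"):
    """Map an XY suffix such as "BJ" or "+J" to (kanji, alphanumeric) escapes.

    Hypothetical helper for illustration; BJ is the documented default.
    """
    if len(xy) != 2:
        raise ValueError("XY must be exactly two characters")
    x, y = xy
    return KANJI_DESIGNATION[x], ALNUM_DESIGNATION[y]


if __name__ == "__main__":
    print(jis_designations())      # default suffix BJ
    print(jis_designations("+J"))  # suffix used by the -k+J example
```

An unknown X or Y raises `KeyError` in this sketch; how kcc itself reports an invalid suffix is not stated in this manual.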
EXAMPLES
- % kcc -e file
- The input encoding is detected automatically, and the output is in EUC
encoding.
- % kcc -sj file1 file2
- Two files in Shift JIS are concatenated and converted to JIS.
- % command | kcc -k+J
- The output of command is converted to 7 bit JIS (kanji as JIS X0208,
alphanumerics as JIS Roman, halfwidth kana designated by ESC(I).
- % kcc -c file
- The encoding of the contents of file is detected (no conversion).
BUG
Automatic detection of the input encoding works well in normal cases, but it
has the following problems.
7 bit JIS is recognized with certainty by its escape sequences. EUC and DEC
are treated as the same (referred to as EUC series). Halfwidth kana of 8 bit
JIS is the same as halfwidth kana of Shift JIS (referred to as Shift JIS
series). However, EUC series and Shift JIS series, which are both 8 bit
encodings, share wide regions. So the problem in automatic detection is
telling these two encodings apart.
Detection of EUC series versus Shift JIS series is done line by line. As
soon as a line shows that the input is not Shift JIS series, or that it is
not EUC series, the encoding is determined. When an inconsistency is found,
the input is treated as "data" and the contents of the output are not
guaranteed.
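
The line-by-line elimination can be sketched in Python as follows. This is a simplified model, not kcc's actual code: the function names are invented for this example, and the byte ranges cover only ordinary two-byte kanji and halfwidth kana (no user-defined or extended regions).

```python
def could_be_euc(line: bytes) -> bool:
    """True if every 8 bit byte fits an EUC kanji pair or SS2 + kana."""
    i, n = 0, len(line)
    while i < n:
        b = line[i]
        if b < 0x80:
            i += 1
        elif b == 0x8E:  # SS2 introduces one halfwidth kana byte
            if i + 1 >= n or not 0xA1 <= line[i + 1] <= 0xDF:
                return False
            i += 2
        elif 0xA1 <= b <= 0xFE:  # two-byte kanji, both bytes 0xA1-0xFE
            if i + 1 >= n or not 0xA1 <= line[i + 1] <= 0xFE:
                return False
            i += 2
        else:
            return False
    return True


def could_be_sjis(line: bytes) -> bool:
    """True if every 8 bit byte fits Shift JIS kanji or halfwidth kana."""
    i, n = 0, len(line)
    while i < n:
        b = line[i]
        if b < 0x80 or 0xA1 <= b <= 0xDF:  # ASCII or halfwidth kana
            i += 1
        elif 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF:  # kanji lead byte
            if i + 1 >= n:
                return False
            t = line[i + 1]
            if not (0x40 <= t <= 0x7E or 0x80 <= t <= 0xFC):
                return False
            i += 2
        else:
            return False
    return True


def detect(lines):
    """Return "EUC series", "Shift JIS series", "data", or None if undecided."""
    euc_ok = sjis_ok = True
    for line in lines:
        euc_ok = euc_ok and could_be_euc(line)
        sjis_ok = sjis_ok and could_be_sjis(line)
        if not euc_ok and not sjis_ok:
            return "data"  # inconsistent with both series
        if euc_ok != sjis_ok:
            return "EUC series" if euc_ok else "Shift JIS series"
    return None  # every line was valid in both series


if __name__ == "__main__":
    print(detect([b"\xb4\xc1\xbb\xfa\n"]))  # bytes valid only as EUC
    print(detect([b"\x8a\xbf\x8e\x9a\n"]))  # bytes valid only as Shift JIS
    print(detect([b"\xb1\xb2\n"]))          # valid in both series
```

The last call shows the shared region: the pair 0xB1 0xB2 is one kanji in EUC and two halfwidth kana in Shift JIS, so this line alone cannot settle the question.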
While it is deciding between EUC series and Shift JIS series after an 8 bit
code has been found, kcc suspends conversion and holds the input data in a
buffer. If the buffer becomes full, it assumes EUC series and forces
conversion to start. Rationale: documents containing kanji usually include
JIS non-kanji or JIS level 1 kanji, and in Shift JIS these fall in a region
that is not shared with EUC, so Shift JIS can be detected with certainty. If
the encoding still cannot be determined, it is therefore very likely to be
EUC.
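
The buffering rule can be modelled with a short Python sketch. Everything here is an assumption made for illustration: `decide_with_buffer` and the `classify` callback are hypothetical names, the default size mirrors the 8 kbyte default of -b, and the behaviour at end of input (returning None) is a placeholder because this manual does not describe it.

```python
def decide_with_buffer(lines, classify, bufsize=8 * 1024):
    """Return (verdict, pending_lines).

    `classify(line)` is a hypothetical callback that returns "EUC series",
    "Shift JIS series", or None while the line is still ambiguous.
    """
    pending = []       # lines held back, not yet converted
    pending_bytes = 0
    for line in lines:
        if not pending and all(b < 0x80 for b in line):
            continue   # no 8 bit code seen yet: nothing needs to wait
        pending.append(line)
        pending_bytes += len(line)
        verdict = classify(line)
        if verdict is not None:
            return verdict, pending
        if pending_bytes >= bufsize:
            # Buffer full: fall back to EUC series, as the rationale explains.
            return "EUC series", pending
    return None, pending  # input ended before any decision was reached


if __name__ == "__main__":
    ambiguous = [b"\xb1\xb2\n"] * 10
    verdict, held = decide_with_buffer(ambiguous, lambda line: None, bufsize=9)
    print(verdict, len(held))
```

With a 9 byte buffer and 3 byte lines that never resolve, the sketch gives up after the third line and assumes EUC series.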
If the input is 8 bit JIS and its halfwidth kana always occur in sequences
of even length, it will be wrongly detected as EUC kanji. Be careful.
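
The even-length pitfall is easy to reproduce with Python's standard codecs, used here only as a stand-in for kcc; the helper name `reads_as_euc` is made up for this example. Two halfwidth kana encoded as single bytes form a byte pair that is also a valid EUC kanji, while three do not.

```python
def reads_as_euc(data: bytes) -> bool:
    """True if `data` decodes cleanly with Python's euc_jp codec."""
    try:
        data.decode("euc_jp")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    even = "ｱｲ".encode("shift_jis")   # two halfwidth kana, one byte each
    odd = "ｱｲｳ".encode("shift_jis")   # three halfwidth kana
    print(even.hex(" "), reads_as_euc(even))
    print(odd.hex(" "), reads_as_euc(odd))
```

The odd-length run leaves a dangling byte that cannot start a complete EUC character, which is what lets a detector rule EUC out.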
If the input has no halfwidth kana, use -z and detection becomes much more
accurate. This is because the shared region is then restricted to the area
of JIS level 2 kanji.
The extended region of Shift JIS, the user-defined area of EUC, the control
characters C1 of EUC, and the undefined region of halfwidth kana of EUC are
outside the range of automatic detection, so detection fails if the input
contains these characters. Use the -x option to select extended mode, or
specify the input encoding.
SEE ALSO
cat(1)
NOTES
Usually, user-defined characters, extended characters and supplemental kanji
characters are each mapped to their corresponding characters. However,
characters outside the range of extended characters become FCFC in
hexadecimal when converted to Shift JIS. Although the control character
region C1 of EUC and DEC remains when converted to JIS, these characters are
deleted when converted to Shift JIS. The undefined area of halfwidth kana
becomes a halfwidth centered dot when converted to Shift JIS. Halfwidth kana
become fullwidth kana when converted to DEC.
When the output is in JIS encoding, control characters such as newline, TAB
and DEL, as well as (halfwidth) white space, are output in ASCII mode.
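
Python's `iso2022_jp` codec follows the same convention, which makes the effect easy to see. The sketch below is only an analogy: it shows Python's output, not kcc's, and `to_jis` is a name invented for this example.

```python
def to_jis(text):
    """Encode with Python's iso2022_jp codec (a stand-in for 7 bit JIS)."""
    return text.encode("iso2022_jp")


if __name__ == "__main__":
    encoded = to_jis("漢\n字")
    # The newline is preceded by ESC ( B, so it is emitted in ASCII mode,
    # and kanji mode is designated again for the following character.
    print(encoded)
```

Returning to ASCII before each control character keeps every line of the output readable on its own, at the cost of a few extra escape bytes.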
When the input encoding is detected wrongly, or the input contains
characters that are undefined in the expected character sets, the output is
undefined.
This manual was translated by Fumitoshi UKAI <ukai@debian.or.jp> for the
Debian system, but you can use it for any purpose.