Algorithm to read multi-byte character?
Lassi Kortela 17 Oct 2020 10:01 UTC
A question to anyone who might know, but Shiro and John in particular:
How complex is an algorithm to read one multi-byte character from a byte
stream, supporting all character encodings that are still used with
terminals? Is it too limiting to support UTF-8 only?
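
For a sense of scale in the UTF-8-only case, here is a minimal sketch in Python of reading one character from a byte stream. The function name `read_utf8_char` is made up for illustration, and the sketch assumes a buffered binary stream whose `read(n)` returns fewer than `n` bytes only at end of input.

```python
import io

def read_utf8_char(stream):
    """Read one UTF-8 encoded code point from a buffered binary stream.

    Returns the code point as an int, or None at end of stream.
    Raises ValueError on malformed or truncated input.
    """
    first = stream.read(1)
    if not first:
        return None  # clean end of stream
    b0 = first[0]
    if b0 < 0x80:
        return b0  # single-byte (ASCII) character
    # The lead byte determines how many continuation bytes follow.
    # `lowest` is the smallest code point allowed for that length,
    # used below to reject overlong encodings.
    if 0xC2 <= b0 <= 0xDF:
        need, cp, lowest = 1, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 <= 0xEF:
        need, cp, lowest = 2, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 <= 0xF4:
        need, cp, lowest = 3, b0 & 0x07, 0x10000
    else:
        raise ValueError("invalid UTF-8 lead byte: 0x%02X" % b0)
    tail = stream.read(need)
    if len(tail) != need:
        raise ValueError("truncated UTF-8 sequence")
    for b in tail:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid UTF-8 continuation byte: 0x%02X" % b)
        cp = (cp << 6) | (b & 0x3F)
    # Reject overlong forms, surrogates, and values beyond U+10FFFF.
    if cp < lowest or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("invalid code point: 0x%X" % cp)
    return cp

stream = io.BytesIO("a€".encode("utf-8"))
print(hex(read_utf8_char(stream)))  # code point of "a"
print(hex(read_utf8_char(stream)))  # code point of "€"
```

The sketch reads the continuation bytes before validating them, so after an error the stream position is past the bad bytes; a real reader would need to decide how to resynchronize.
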
If we ignore "character sets" (i.e. lookup tables mapping unsigned
integers to elements of human languages), and only consider "character
encodings" (i.e. algorithms mapping byte sequences to unsigned
integers), how many different encodings are still in wide use?
* UTF-8
* Chinese: EUC-CN, GB2312, GBK, Big5, ...
* Japanese: EUC-JP, Shift-JIS, ...
* Korean: EUC-KR, ...
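
As a point of comparison for the encodings listed above, here is a hedged sketch that leans on an existing codec library instead of hand-written decoders. It uses Python's standard `codecs.getincrementaldecoder`, feeding one byte at a time until a character comes out. The helper name `iter_chars` is made up for illustration, and the codec names are Python's spellings, not anything a terminal would report.

```python
import codecs
import io

def iter_chars(stream, encoding):
    """Yield decoded characters one at a time from a binary stream.

    A single decoder object is kept for the whole stream so that any
    buffered partial sequence carries over between reads.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        byte = stream.read(1)
        if not byte:
            # Flush the decoder; with the default "strict" error
            # handling, leftover partial input is reported as an error.
            yield from decoder.decode(b"", final=True)
            return
        # An empty result means the decoder needs more bytes.
        yield from decoder.decode(byte)

for name, text in [("utf-8", "a€"), ("euc_jp", "日本"),
                   ("shift_jis", "日本"), ("gbk", "中文"),
                   ("big5", "中文"), ("euc_kr", "한글")]:
    data = text.encode(name)
    print(name, len(data), list(iter_chars(io.BytesIO(data), name)))
```

This does not answer how complex each decoder is internally; it only shows that the "read bytes until one character is complete" loop can be the same for every encoding if the per-encoding logic lives behind a common incremental interface.
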
Are there other languages besides Chinese, Japanese, and Korean that
require multi-byte character sets?
Can we use ISO 2022 and/or ISO 646 to take advantage of common parts
between different encodings? (I have no idea what I'm talking about, but
the names of these standards keep popping up on related Wikipedia pages.)