Big5 is a character encoding developed in Taiwan in the early 1980s to represent Traditional Chinese characters on computers. The name « Big5 » refers to the five major Taiwanese computer companies that collaborated on its design. It is a double-byte encoding: each Chinese character is stored as a two-byte code, and later extensions and revisions enlarged the original repertoire.
Background The development of Big5 was driven by the need to handle Chinese text on systems built around single-byte encodings such as ASCII, at a time when Unicode did not yet exist. In Taiwan at that time there were several competing, mutually incompatible character sets, and a standardization effort resulted in Big5 being adopted as the de facto standard.
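Key Features The primary feature of Big5 is its ability to encode the Traditional Chinese characters in everyday use, together with punctuation and symbols. Each character is assigned its own two-byte code rather than being composed from radicals; within the code table, characters are ordered by stroke count and then by Kangxi radical. Big5 was not designed for Japanese or Korean: kanji and hanja can be represented only where they share a form with a Traditional Chinese character in the set.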
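Character Set Composition The original Big5 repertoire encompasses just over 13,000 Chinese characters plus several hundred symbols, divided into:
- Frequently used Traditional Chinese characters (5,401)
- Less frequently used Traditional Chinese characters (7,652)
- Punctuation marks and other symbols (several hundred)
Simplified Chinese characters, Japanese kanji, and Korean hanja have no dedicated coverage; they are available only where their form coincides with a Traditional Chinese character in the set.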
Character Encoding Structure Each Big5 Chinese character is a two-byte code: a lead byte in the range 0x81 to 0xFE (0xA1 to 0xF9 for the standard assignments) followed by a trail byte in the range 0x40 to 0x7E or 0xA1 to 0xFE. Single bytes below 0x80 are read as ASCII. The standard Chinese characters are organized into two main blocks:
- Level 1: The 5,401 frequently used characters, starting at 0xA440.
- Level 2: The 7,652 less frequently used characters, starting at 0xC940.
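The two-byte structure of Big5 can be illustrated with a short Python sketch. It assumes the byte ranges commonly cited for Big5 and its extensions (lead byte 0x81 to 0xFE, trail byte 0x40 to 0x7E or 0xA1 to 0xFE); the helper names are hypothetical, and the code only checks byte ranges, not whether a given code is actually assigned to a character.

```python
def is_big5_lead(byte):
    """Assumed lead-byte range for Big5 and its extensions: 0x81 to 0xFE."""
    return 0x81 <= byte <= 0xFE


def is_big5_trail(byte):
    """Assumed trail-byte ranges: 0x40 to 0x7E or 0xA1 to 0xFE."""
    return 0x40 <= byte <= 0x7E or 0xA1 <= byte <= 0xFE


def split_big5(data):
    """Split a byte string into one-byte (ASCII) and two-byte (Big5) units."""
    units = []
    i = 0
    while i < len(data):
        if is_big5_lead(data[i]):
            # A lead byte must be followed by a byte in the trail ranges.
            if i + 1 >= len(data) or not is_big5_trail(data[i + 1]):
                raise ValueError(f"invalid Big5 sequence at offset {i}")
            units.append(data[i:i + 2])
            i += 2
        else:
            # Anything else is treated as a single-byte character.
            units.append(data[i:i + 1])
            i += 1
    return units


print(split_big5(b"\xa7A\xa6n!"))
```

Because a trail byte may look like an ASCII letter, a parser has to walk the data from the start, as this sketch does, rather than inspect bytes in isolation.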
Comparison with Other Encoding Schemes Compared with Unicode and its encoding forms (UTF-8 or UTF-16), Big5 is less flexible because it covers only Traditional Chinese. Unicode, by contrast, can represent many writing systems and a much wider range of characters, including pictograms, diacritical marks, and symbols not found in the typical Big5 set.
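The difference in coverage can be checked with Python's built-in `big5` and `utf-8` codecs. This is a minimal sketch: the sample strings and the `can_encode` helper are illustrative choices, and Python's `big5` codec is only one implementation of the encoding.

```python
def can_encode(text, codec):
    """Return True if every character of text can be encoded with codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Illustrative samples: one Traditional Chinese string and two strings
# from outside the Big5 repertoire.
samples = {
    "Traditional Chinese": "你好",
    "Korean Hangul": "한글",
    "Emoji": "😀",
}

coverage = {
    label: (can_encode(text, "big5"), can_encode(text, "utf-8"))
    for label, text in samples.items()
}

for label, (in_big5, in_utf8) in coverage.items():
    print(f"{label}: big5={in_big5}, utf-8={in_utf8}")

# Converting legacy Big5 data to UTF-8 goes through a Unicode string.
big5_bytes = "你好".encode("big5")
utf8_bytes = big5_bytes.decode("big5").encode("utf-8")
print(len(big5_bytes), len(utf8_bytes))
```

For Chinese text Big5 is the more compact of the two (two bytes per character against three in UTF-8), which is one reason it remained attractive on older systems.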
Examples To illustrate how Big5 works, consider the common Chinese greeting « hello » (« nǐ hǎo »), written with two characters that both belong to the frequently used part of the set. Each character maps directly to one two-byte code, a simple scheme that helped Big5 gain widespread acceptance in Traditional Chinese computing before Unicode became widely used:
The word « nǐ hǎo » (你好) is encoded in Big5 as 0xA741 0xA66E, one two-byte code per character.
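The Big5 codes for this greeting can be reproduced with Python's built-in `big5` codec. The sketch below assumes the greeting is written 你好; the variable names are illustrative.

```python
text = "你好"  # "nǐ hǎo"

# Encode the string with Python's built-in Big5 codec.
encoded = text.encode("big5")

# Group the bytes in pairs: one two-byte code per Chinese character.
codes = [
    f"0x{encoded[i]:02X}{encoded[i + 1]:02X}"
    for i in range(0, len(encoded), 2)
]

print(codes)  # ['0xA741', '0xA66E']
```

Both trail bytes here (0x41 and 0x6E) happen to fall in the ASCII range, which is why the raw bytes display as `b'\xa7A\xa6n'` in Python.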
Comparison with EUC EUC (Extended Unix Code) is a family of multibyte encodings often compared with Big5 because one of its members, EUC-TW, also encodes Traditional Chinese. The main differences lie in the platforms each was used on and in the byte values each allows:
- Big5 was used mainly on IBM-compatible personal computers, whereas EUC was used mainly on Unix systems.
- Both Big5 and EUC-TW serve regions where Traditional Chinese is the standard script, but EUC sets the high bit on every byte of a multibyte character, while a Big5 trail byte may fall in the ASCII range.
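One practical difference between the two families is which byte values a multibyte character may contain. The sketch below is illustrative only: Python's standard library has no EUC-TW codec, so it uses `euc_jp` as a stand-in for the EUC family, and it picks the character 許, which exists in both character sets.

```python
char = "許"

# In Big5 the trail byte of this character is 0x5C, the ASCII backslash.
big5_bytes = char.encode("big5")

# EUC-JP is used here only as a stand-in for the EUC family (assumption:
# no EUC-TW codec is available in the standard library).
euc_bytes = char.encode("euc_jp")

big5_has_ascii_byte = any(b < 0x80 for b in big5_bytes)
euc_has_ascii_byte = any(b < 0x80 for b in euc_bytes)

print(big5_bytes, big5_has_ascii_byte)
print(euc_bytes, euc_has_ascii_byte)
```

Software that scans for ASCII characters such as the backslash byte by byte can therefore misread Big5 text, a problem the EUC design avoids.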
Comparison with Shift JIS Shift JIS is another double-byte character encoding. It was developed independently of EUC, is based on the Japanese JIS X 0208 character set, and was commonly used for Japanese text on personal computers from the 1980s. It encodes Japanese kana and kanji, whereas Big5 targets Traditional Chinese.
Types or Variations Vendor and regional needs led to several variants of Big5, including:
- MacBig5: The variant used by Apple on the classic Mac OS, which differs from other implementations in some code assignments.
- Big5-2003: A revision published in Taiwan in 2003 that standardizes the original repertoire together with a number of widely used extensions.
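How variants relate to the base encoding can be explored with the Big5-family codecs that ship with Python. Note the assumption: the standard library provides `big5`, `cp950` (Microsoft's code page) and `big5hkscs` (the Hong Kong extension), not the MacBig5 or Big5-2003 variants listed above, so this sketch only illustrates the general pattern. The helper name is hypothetical.

```python
# Big5-family codecs available in Python's standard library.
VARIANT_CODECS = ("big5", "cp950", "big5hkscs")


def encode_with_variants(text):
    """Encode text with each codec; None marks an unsupported character."""
    results = {}
    for codec in VARIANT_CODECS:
        try:
            results[codec] = text.encode(codec)
        except UnicodeEncodeError:
            results[codec] = None
    return results


# Characters from the original repertoire are expected to encode
# identically in every variant.
common = encode_with_variants("你好")
print(common)

# Characters added by later extensions may be supported by some codecs
# only; the result depends on each codec's mapping table.
print(encode_with_variants("€"))
```

Variants generally agree on the original repertoire and differ in the extra characters they add, which is why text that round-trips in one variant can fail in another.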
Regional and Legal Context Which version or variant is used can depend on the territory in which a user lives or works. However, no strict laws enforce a specific encoding system, so users are free to choose the one that best fits their requirements.
Common Misconceptions or Myths Big5 is often confused with Unicode (UTF-8) and with other character sets, but such confusion usually stems from a misunderstanding of how each one works. Big5 defines both a limited repertoire and the byte codes for it, whereas UTF-8 is one way of serializing the much larger Unicode repertoire.
