Encoding

  • Last update:  2021-12-30

    I. Description

    Encoding is the conversion of information from its source form into another format according to a defined standard. An n-bit binary number can take 2 to the nth power distinct values, so n bits can distinguish that many pieces of information. Assigning a specific code group to each piece of information is also called encoding.
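    The 2-to-the-nth-power count can be checked with a short sketch. The function name `code_groups` and the eight-letter alphabet are illustrative assumptions, not part of any standard:

```python
from itertools import product

def code_groups(n):
    """Return every n-bit code group as a string of 0s and 1s."""
    return ["".join(bits) for bits in product("01", repeat=n)]

# Three bits give 2**3 = 8 distinct code groups.
groups = code_groups(3)

# Assign one code group to each symbol of a hypothetical 8-symbol alphabet.
symbols = ["A", "B", "C", "D", "E", "F", "G", "H"]
code_table = dict(zip(symbols, groups))
```

    Here `code_table` maps each symbol to a unique 3-bit group; a ninth symbol would need a fourth bit.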

    A typical case is the interaction between a server and a client, where data is encoded before it is sent and decoded when it is received.

    As a simple analogy, the standard language in China is Mandarin. When an English-speaking American visits China, what he says must be translated into Chinese before Chinese listeners can understand it. That translation step plays the role of encoding.

    II. Encoding Principle

    The world has many languages and scripts, and their characters must be encoded before computers can process and transmit them. Many encodings exist today, and their common purpose is to convert information from one representation to another.

    III. Classification

    Commonly used character encodings include ASCII (American Standard Code for Information Interchange), EBCDIC (Extended Binary Coded Decimal Interchange Code), GB2312, Unicode, UTF-8, ISO-8859-1 and GBK. The encodings that are widely used and easily confused are explained below.

    1. ISO-8859-1

    This is a single-byte encoding. It can represent at most the values 0-255 and is used for English and other Western European languages. For example, the code of the letter 'a' is 0x61 = 97.
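    A minimal sketch of the single-byte property, using Python's built-in `iso-8859-1` codec (the variable names are illustrative):

```python
letter = "a"
encoded = letter.encode("iso-8859-1")  # one byte per character
code = encoded[0]                      # integer value of that byte

# Every byte value 0-255 maps to exactly one character,
# so any byte string can be decoded as ISO-8859-1.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-1")
```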

    The range of characters that ISO-8859-1 can represent is clearly narrow, and it cannot represent Chinese characters. However, because it is a single-byte encoding, matching the most basic storage unit of a computer, it is still used in many situations, and many protocols use it by default. For example, the two characters "中文" ("Chinese") do not exist in ISO-8859-1. In GB2312 they are encoded as "d6d0 cec4". Read as ISO-8859-1, those bytes are split into 4 separate bytes (a byte is an 8-bit binary number): "d6 d0 ce c4" (storage is byte-oriented in any case). In UTF-8 the same text takes 6 bytes: "e4 b8 ad e6 96 87". Carrying Chinese text through ISO-8859-1 in this way therefore still depends on another encoding to give the bytes their meaning.
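    The byte values quoted above can be reproduced with Python's standard codecs. This is a sketch for illustration; it assumes the example text is the two characters "中文":

```python
text = "中文"

gb_bytes = text.encode("gb2312")   # two bytes per Chinese character
utf8_bytes = text.encode("utf-8")  # three bytes per Chinese character

# Reading the GB2312 bytes as ISO-8859-1 yields four unrelated
# single-byte characters, but no information is lost.
as_latin1 = gb_bytes.decode("iso-8859-1")

# The original text comes back only by applying GB2312 again,
# which is why this representation depends on another encoding.
restored = as_latin1.encode("iso-8859-1").decode("gb2312")
```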


    2. GB2312/GBK

    These are the Chinese national standard encodings, designed to represent Chinese characters. Each Chinese character takes two bytes, while English letters keep their single-byte ASCII values (the range shared with ISO-8859-1). GBK can express both traditional and simplified characters, while GB2312 can express only simplified characters. GBK is backward compatible with GB2312.
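    The difference between the two can be sketched with Python's `gb2312` and `gbk` codecs. The helper `can_encode` and the choice of the character pair 龙 / 龍 (simplified and traditional "dragon") are assumptions made for illustration:

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

simplified = "龙"   # simplified form
traditional = "龍"  # traditional form

# GBK produces the same bytes as GB2312 for text GB2312 can express.
same_bytes = "中文".encode("gbk") == "中文".encode("gb2312")

# English letters take one byte, Chinese characters take two.
lengths = [len(ch.encode("gbk")) for ch in "a中"]
```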


    3. Unicode

    Unicode aims to be a single encoding that can represent the characters of every language. In the sense used here, it is a fixed-length encoding of two bytes per character (UCS-2), or four bytes per character (UCS-4), and English letters occupy the full width as well. Its byte sequences are therefore not compatible with ISO-8859-1 or with any other encoding. For the characters that ISO-8859-1 covers, however, the two-byte Unicode form simply adds a 0 byte in front; for example, the letter 'a' is "00 61".
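    The "00 61" form can be reproduced in Python. This sketch uses the big-endian `utf-16-be` and `utf-32-be` codecs as stand-ins for the two-byte and four-byte forms; that mapping is an assumption, since the text does not name a specific codec:

```python
letter = "a"

latin1_hex = letter.encode("iso-8859-1").hex()  # single byte
two_byte_hex = letter.encode("utf-16-be").hex() # 0 byte added in front
four_byte_hex = letter.encode("utf-32-be").hex()

# A Chinese character also fits in two bytes in this form:
# its bytes are simply its code point, U+4E2D.
zhong_hex = "中".encode("utf-16-be").hex()
```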

    Fixed-length encodings are convenient for computers to process (note that GB2312/GBK is not fixed-length), and Unicode can represent all characters. For these reasons many software systems use Unicode internally; Java, for example, stores strings as 16-bit units.


    4. UTF

    Fixed-length Unicode is not compatible with ISO-8859-1 and tends to take up more space, because even English letters need two bytes each. This makes it inconvenient for transmission and storage, and the UTF encodings, most commonly UTF-8, were created in response. UTF-8 is byte-compatible with ASCII (the 0-127 range shared with ISO-8859-1) and can still represent characters in all languages. It is a variable-length encoding: the original design allowed 1 to 6 bytes per character, and the current standard limits this to 1 to 4 bytes. UTF-8 also provides a simple check function, because lead bytes and continuation bytes follow fixed bit patterns that make malformed sequences detectable. Generally speaking, English letters are represented by one byte, while Chinese characters use three bytes.
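    The variable length and the built-in check can both be observed with Python's `utf-8` codec. The sample characters and the helper `is_valid_utf8` are illustrative assumptions:

```python
def is_valid_utf8(data):
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Number of bytes used for characters from different ranges.
samples = ["a", "é", "中", "😀"]
widths = [len(ch.encode("utf-8")) for ch in samples]

# English letters keep their single-byte values.
ascii_unchanged = "abc".encode("utf-8") == b"abc"

complete = b"\xe4\xb8\xad"   # the three bytes of "中"
truncated = b"\xe4\xb8"      # lead byte with one continuation byte missing
stray = b"\xad"              # continuation byte with no lead byte
```

    The decoder rejects `truncated` and `stray` because each lead byte announces how many continuation bytes must follow.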

    Note that UTF-8 saves space only relative to fixed-length Unicode. For text known to consist of Chinese characters, GB2312/GBK is the most economical at two bytes per character. On the other hand, although UTF-8 uses three bytes per Chinese character, even a Chinese web page is usually smaller in UTF-8 than in two-byte Unicode, because web pages contain many English characters in their markup.
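    The size trade-off can be sketched by encoding the same strings in each encoding. The sample strings are made up for illustration, and `utf-16-be` stands in for two-byte Unicode (an assumption):

```python
def sizes(text):
    """Return the encoded size of text in bytes for each encoding."""
    return {enc: len(text.encode(enc)) for enc in ("gbk", "utf-8", "utf-16-be")}

# Pure Chinese text: 2 characters.
pure_chinese = sizes("中文")

# A tiny hypothetical web page: 33 ASCII markup characters
# around the same 2 Chinese characters.
page = "<html><body><p>中文</p></body></html>"
web_page = sizes(page)
```

    For pure Chinese text UTF-8 is the largest of the three, while for the marked-up page the ASCII characters dominate and UTF-8 is far smaller than the two-byte form.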
