- ASCII characters (code points 0-127) are represented using a single byte with the same binary value as the ASCII code.
- Code points 128-2047 are represented using two bytes. The code point is treated as an 11-bit value. The first byte starts with the bits 110, followed by the upper 5 bits of the code point. The second byte starts with the bits 10, followed by the lower 6 bits of the code point.
- Code points 2048-65535 are represented using three bytes. The code point is treated as a 16-bit value. The first byte starts with the bits 1110, followed by the upper 4 bits of the code point. The second and third bytes each start with the bits 10, followed by 6 bits each of the remaining 12 bits of the code point, in order from most to least significant.
- Code points 65536-1114111 are represented using four bytes. The code point is treated as a 21-bit value. The first byte starts with the bits 11110, followed by the upper 3 bits of the code point. The second, third, and fourth bytes each start with the bits 10, followed by 6 bits each of the remaining 18 bits of the code point, in order from most to least significant.
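- Byte sequences that do not conform to any of the above rules are invalid and must not appear in UTF-8 encoded strings. This includes overlong encodings (using more bytes than the code point requires), encodings of the surrogate code points 55296-57343 (U+D800 to U+DFFF), and continuation bytes (starting with 10) that do not follow a valid leading byte.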
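
The rules above can be sketched as a small hand-written encoder in Python. This is only an illustration under stated assumptions: the function name `utf8_encode` is hypothetical, it handles one code point at a time, and in real code the built-in `str.encode("utf-8")` should be used instead.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 (illustrative sketch)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside the Unicode range 0-1114111")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")

    if code_point <= 0x7F:
        # One byte: 0xxxxxxx, same value as the ASCII code.
        return bytes([code_point])
    if code_point <= 0x7FF:
        # Two bytes: 110xxxxx 10xxxxxx (5 + 6 bits).
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0xFFFF:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx (4 + 6 + 6 bits).
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (3 + 6 + 6 + 6 bits).
    return bytes([
        0b11110000 | (code_point >> 18),
        0b10000000 | ((code_point >> 12) & 0b111111),
        0b10000000 | ((code_point >> 6) & 0b111111),
        0b10000000 | (code_point & 0b111111),
    ])


# Compare the sketch against Python's built-in encoder for one
# code point from each of the four length classes.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```

For example, the euro sign (code point 8364, U+20AC) falls in the three-byte range and should come out as the bytes E2 82 AC.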