Douglas Crockford

UTF-8

UTF-8 is one of the smartest things I've seen. It is a byte stream encoding for simple values (like Unicode characters) that can be bigger than bytes. UTF-8 has some wonderful properties:

A first byte can contain between 2 and 7 bits of data. A first byte also determines the number of continuation bytes. Each continuation byte carries 6 bits of data.
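As a rough illustration of how a first byte determines what follows, here is a minimal Python sketch. It assumes the 32-bit formulation tabulated at the end of this page; the names `lead_byte_info` and `total_data_bits` are hypothetical, chosen only for this example.

```python
def lead_byte_info(byte):
    """Return (continuation_bytes, data_bits_in_first_byte) for a first byte.

    Assumes the 32-bit formulation tabulated on this page.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("not a byte")
    if byte < 0x80:      # 0xxxxxxx
        return 0, 7
    if byte < 0xC0:      # 10xxxxxx is a continuation byte, never a first byte
        raise ValueError("continuation byte")
    if byte < 0xE0:      # 110xxxxx
        return 1, 5
    if byte < 0xF0:      # 1110xxxx
        return 2, 4
    if byte < 0xF8:      # 11110xxx
        return 3, 3
    if byte < 0xFC:      # 111110xx
        return 4, 2
    return 5, 2          # 111111xx


def total_data_bits(byte):
    """Data bits in the first byte plus 6 bits per continuation byte."""
    continuation, first_bits = lead_byte_info(byte)
    return first_bits + 6 * continuation
```

For instance, `lead_byte_info(0xE2)` reports 2 continuation bytes and 4 data bits in the first byte, for 4 + 2 × 6 = 16 data bits in total.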

UTF-8 has one unfortunate disadvantage: many characters that fit in 16 bits (800 thru FFFF) are encoded in 3 bytes. This disadvantage is more than offset by its advantages, and by having a single, simple encoding that can work in all languages and contexts. The benefits range from greater reliability to better security. That is why JSON recommends UTF-8. UTF-8 is the good stuff. Thank you, Ken Thompson.
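The size trade-off can be checked with Python's built-in codecs. This is only a small sketch: the helper name `encoded_sizes` and the sample characters are assumptions made for illustration.

```python
def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_bytes) for a single character."""
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # "-be" avoids the 2-byte byte order mark
    return len(utf8), len(utf16)


for ch in ["A", "é", "€", "中"]:
    utf8_size, utf16_size = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_size} bytes, UTF-16 {utf16_size} bytes")
```

Characters such as "€" (U+20AC) and "中" (U+4E2D) take 3 bytes in UTF-8 against 2 in UTF-16, while ASCII text takes 1 byte against 2.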

In my own work, I use a formulation that works well with 32-bit characters.

Binary    Hex  Thru  Range     Thru       Continuation Bytes  Total Data Bits
0xxxxxxx  00   7F    00        7F         0                   7
10xxxxxx  80   BF    continuation                             6
110xxxxx  C0   DF    80        7FF        1                   11
1110xxxx  E0   EF    800       FFFF       2                   16
11110xxx  F0   F7    1 0000    1F FFFF    3                   21
111110xx  F8   FB    20 0000   3FF FFFF   4                   26
111111xx  FC   FF    400 0000  FFFF FFFF  5                   32
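The table can be read as an algorithm. The following Python sketch is one possible encoder and decoder for it, not the author's own code: the names `FORMS`, `encode`, and `decode` are hypothetical, and the decoder assumes it is handed exactly one encoded value and does not reject overlong forms.

```python
# One row per first-byte form, taken from the table:
# (first-byte prefix, data bits in first byte, continuation bytes, largest value)
FORMS = [
    (0x00, 7, 0, 0x7F),
    (0xC0, 5, 1, 0x7FF),
    (0xE0, 4, 2, 0xFFFF),
    (0xF0, 3, 3, 0x1FFFFF),
    (0xF8, 2, 4, 0x3FFFFFF),
    (0xFC, 2, 5, 0xFFFFFFFF),
]


def encode(value):
    """Encode an unsigned 32-bit value using the shortest form that fits."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    for prefix, _first_bits, continuation, limit in FORMS:
        if value <= limit:
            break
    shift = 6 * continuation
    out = [prefix | (value >> shift)]      # high bits go in the first byte
    while shift > 0:
        shift -= 6
        out.append(0x80 | ((value >> shift) & 0x3F))  # 6 bits per continuation
    return bytes(out)


def decode(data):
    """Decode exactly one value from a bytes object holding one encoding."""
    first = data[0]
    if 0x80 <= first <= 0xBF:
        raise ValueError("unexpected continuation byte")
    # Scan from the longest form down; the first prefix <= first byte matches.
    for prefix, first_bits, continuation, _limit in reversed(FORMS):
        if first >= prefix:
            break
    if len(data) != continuation + 1:
        raise ValueError("wrong number of continuation bytes")
    value = first & ((1 << first_bits) - 1)
    for byte in data[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("expected continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value
```

For values up to 1F FFFF the output has the same bit layout as standard UTF-8, so `encode(0x20AC)` yields the bytes E2 82 AC. The last two rows extend the scheme, with FF BF BF BF BF BF representing FFFF FFFF.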