UTF-8

From ICANNWiki

UTF-8 (Unicode Transformation Format, 8-bit) is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII.

Overview

UTF-8 encodes each Unicode character as a sequence of one to four octets. The number of octets depends on the character's code point, that is, the integer value assigned to the character. UTF-8 is the default encoding for XML and has been the dominant character encoding on the web since 2010.[1]
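As a rough illustration of how the octet count follows from the integer value, the sketch below maps a code point to its UTF-8 length using the standard range boundaries (U+007F, U+07FF, U+FFFF) and checks the result against Python's built-in encoder. The function name `utf8_octet_count` and the sample characters are chosen here for illustration only.

```python
def utf8_octet_count(code_point: int) -> int:
    """Return how many octets UTF-8 needs for the given Unicode code point.

    Hypothetical helper for illustration; it does not treat surrogate
    code points (U+D800..U+DFFF) specially.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:      # ASCII range
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4                    # U+10000..U+10FFFF


# Sample characters: Latin capital A, e with acute, euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # The built-in encoder should agree with the range-based count.
    assert len(encoded) == utf8_octet_count(ord(ch))
    print(f"U+{ord(ch):04X}: {len(encoded)} octet(s), hex {encoded.hex()}")
```

Running the loop shows one, two, three, and four octets for the four samples respectively.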

W3C has offered several reasons for the popularity of UTF-8:

  1. An HTML page can be in only one encoding; UTF-8 supports many languages, so it can accommodate pages and forms that mix them.
  2. Barriers to using Unicode are very low; by January 2012, Google reported that over 60% of the Web in their sample used UTF-8.
  3. ASCII is a subset of UTF-8; every ASCII character is encoded in UTF-8 with the same byte it has in ASCII, which helps with interoperability.
  4. The HTML5 specification says "Authoring tools should default to using UTF-8 for newly-created documents."[2]
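The ASCII-subset point in the list above can be checked directly. The following is a minimal sketch, assuming only Python's standard `ascii` and `utf-8` codecs; the variable names are illustrative.

```python
# Build a string containing all 128 ASCII characters (U+0000..U+007F).
all_ascii = "".join(chr(i) for i in range(128))

ascii_bytes = all_ascii.encode("ascii")
utf8_bytes = all_ascii.encode("utf-8")

# Every ASCII character produces the identical single byte in both encodings.
same_bytes = ascii_bytes == utf8_bytes

# Consequently, data written as ASCII can be read back as UTF-8 unchanged.
round_trip = ascii_bytes.decode("utf-8") == all_ascii

# Non-ASCII characters use only bytes with the high bit set, so their
# encoded bytes are never mistaken for ASCII characters.
non_ascii_bytes = "\u00e9\u20ac".encode("utf-8")
high_bit_only = all(b >= 0x80 for b in non_ascii_bytes)

print(same_bytes, round_trip, high_bit_only)
```

All three checks are expected to print `True`.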

References