Encoding Schemes

Java's Encoding Schemes

This sidebar describes the character-encoding schemes that are supported by the Java platform.

US-ASCII

US-ASCII is a 7-bit encoding scheme that covers the English-language alphabet, along with digits, punctuation, and control characters, for 128 characters in all. It is not large enough to cover the characters used in other languages, however, so it is not very useful for internationalization.
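
As a minimal sketch of that limitation, the following checks whether a string can be represented in US-ASCII by using the standard `CharsetEncoder.canEncode` method. The class name `AsciiCheck` and the helper `isAsciiEncodable` are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

class AsciiCheck {
    // Hypothetical helper: true if every character in the text fits in US-ASCII.
    static boolean isAsciiEncodable(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // Plain English letters are all within the 7-bit range.
        System.out.println(isAsciiEncodable("hello"));       // prints "true"
        // U+00E9 (e with acute accent) lies outside US-ASCII.
        System.out.println(isAsciiEncodable("h\u00e9llo"));  // prints "false"
    }
}
```

Note that `String.getBytes(StandardCharsets.US_ASCII)` does not fail on such text; it silently substitutes a replacement byte for each unmappable character, so an explicit check like this one is safer when data loss matters.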

UTF-8

UTF-8 is a variable-width encoding scheme built on 8-bit units. Characters from the English-language alphabet are each encoded using a single 8-bit byte. Characters from other languages are encoded using 2, 3, or even 4 bytes. UTF-8 therefore produces compact documents for the English language, but larger documents for languages whose characters need several bytes each. If the majority of a document's text is in English, then UTF-8 is a good choice because it allows for internationalization while still minimizing the space required for encoding.
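
The variable width can be observed directly by counting the bytes that `String.getBytes` produces for a few sample characters. This is only an illustrative sketch; the class name `Utf8Lengths` and the helper `utf8Length` are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Hypothetical helper: number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));             // Latin letter: prints 1
        System.out.println(utf8Length("\u00e9"));        // U+00E9, e with acute: prints 2
        System.out.println(utf8Length("\u4e2d"));        // U+4E2D, a Chinese ideograph: prints 3
        System.out.println(utf8Length("\uD83D\uDE00"));  // U+1F600, an emoji: prints 4
    }
}
```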

UTF-16

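UTF-16 is an encoding scheme built on 16-bit units. It can encode every Unicode character, including the ideographs used in languages like Chinese. The most commonly used characters are encoded using 2 bytes; less common ones, which lie outside Unicode's Basic Multilingual Plane, are encoded using 4 bytes (a surrogate pair). An English-language document that uses UTF-16 will be about twice as large as the same document encoded using UTF-8. Documents written in languages whose characters take 3 bytes in UTF-8, such as Chinese or Japanese, will be smaller using UTF-16, while text that takes 2 bytes per character in UTF-8, such as Greek or Cyrillic, will be about the same size in either encoding.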
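
To compare the two schemes on concrete text, the sketch below encodes the same strings both ways and reports the byte counts. It assumes `StandardCharsets.UTF_16BE` rather than `StandardCharsets.UTF_16`, because the latter prepends a 2-byte byte-order mark when encoding and would skew the counts. The class name `EncodingSizes` and its helper methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    // Hypothetical helper: byte count when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Hypothetical helper: byte count when encoded as big-endian UTF-16, without a byte-order mark.
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String english = "hello";          // 5 Latin letters
        String chinese = "\u4e2d\u6587";   // 2 Chinese ideographs (U+4E2D, U+6587)
        String emoji = "\uD83D\uDE00";     // 1 supplementary character (U+1F600)

        System.out.println(utf8Length(english) + " vs " + utf16Length(english)); // prints "5 vs 10"
        System.out.println(utf8Length(chinese) + " vs " + utf16Length(chinese)); // prints "6 vs 4"
        System.out.println(utf8Length(emoji) + " vs " + utf16Length(emoji));     // prints "4 vs 4"
    }
}
```

The last line shows a character that needs a surrogate pair in UTF-16: it occupies 4 bytes in both encodings.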