Unicode is an international character encoding standard that assigns a unique number to each character and can represent most of the world’s languages.
Before Unicode, many different character encoding systems existed, each designed to represent the characters of particular languages. Exchanging text between systems that used different encodings was unreliable, because the same value could stand for different characters in each system. Unicode solved this problem by defining a single character set shared by all languages.
The Unicode standard assigns each character a numeric value called a code point, which is conventionally written in hexadecimal.
For example, the value 0x0041 represents the Latin character A.
The Unicode standard was originally designed as a fixed-width 16-bit character encoding, which allows 65,536 distinct values. It has since been changed to allow for characters whose representation requires more than 16 bits.
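The relationship between hexadecimal code points and 16-bit storage can be sketched in Java. This is a minimal illustration, not part of the standard itself: the class name `CodePointDemo`, the helper names `hex` and `codeUnits`, and the sample value 0x1F600 are assumptions chosen for the example. Only `Character.charCount` and `Character.toChars` are standard library calls.

```java
class CodePointDemo {
    // Formats a code point in the conventional U+XXXX hexadecimal notation.
    static String hex(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    // Number of 16-bit code units (Java chars) needed to store the code point.
    static int codeUnits(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int latinA = 0x0041;          // the Latin character A, fits in 16 bits
        int supplementary = 0x1F600;  // illustrative value above 0xFFFF

        System.out.println(hex(latinA) + " needs " + codeUnits(latinA) + " code unit(s)");
        System.out.println(hex(supplementary) + " needs " + codeUnits(supplementary) + " code unit(s)");

        // A value above 0xFFFF is stored as a pair of 16-bit code units.
        for (char unit : Character.toChars(supplementary)) {
            System.out.println("  code unit: " + String.format("0x%04X", (int) unit));
        }
    }
}
```

Running `main` should report one code unit for U+0041 and two for U+1F600, which is how a character that does not fit in 16 bits is still represented with 16-bit units.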
Encoding System in Java
- According to the Oxford Dictionary, lexical means relating to the words or vocabulary of a language.
- Java programs are written using the Unicode character set, and text is represented as sequences of 16-bit code units using the UTF-16 encoding.
- Lexical translations are applied first, so that Unicode escapes such as \u0041 can be used to include any Unicode character in a program using only ASCII characters.
- The Unicode characters resulting from the lexical translations are then reduced to a sequence of input elements.
- In Java, all input elements in a program are formed only from ASCII characters (or Unicode escapes that result in ASCII characters), except for comments, identifiers, and the contents of character and string literals.
- There are three kinds of input elements:
1. white space: White space is defined as the ASCII space character, horizontal tab character, form feed character, and line terminator characters.
2. comments: Comments are text that the compiler ignores, written either as /* ... */ or as // up to the end of the line.
3. tokens: Tokens are the terminal symbols of the syntactic grammar. Input elements that are not white space or comments are tokens.
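The points above about lexical translation and white space can be illustrated with a short Java sketch. The class name `LexicalDemo`, the field `abc`, and the helper `isWhiteSpace` are hypothetical names introduced for this example. The helper simply mirrors the white space definition given above, taking the line terminator characters to be line feed and carriage return, and is not a library method.

```java
class LexicalDemo {
    // The identifier below is written with a Unicode escape for the letter a.
    // After lexical translation the compiler sees the identifier "abc".
    static int \u0061bc = 7;

    // Hypothetical helper mirroring the white space definition:
    // space, horizontal tab, form feed, and the line terminators LF and CR.
    static boolean isWhiteSpace(char c) {
        return c == ' ' || c == '\t' || c == '\f' || c == '\n' || c == '\r';
    }

    public static void main(String[] args) {
        char escaped = '\u0041';  // the same character as 'A'
        System.out.println(escaped == 'A');       // true
        System.out.println(abc);                  // 7
        System.out.println(isWhiteSpace('\t'));   // true
        System.out.println(isWhiteSpace('x'));    // false
    }
}
```

Because Unicode escapes are translated before input elements are formed, the field declared with an escape can be referred to as plain `abc` elsewhere in the program.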