Brief History of Character Sets & Encodings

(Image: Epic of Gilgamesh)

Introduction

Homo sapiens has come a long way from inventing picture symbols such as hieroglyphics, and we have done a tremendous job of continuously refining them into modern-day smileys. The first true alphabet, as we know it today, was developed by the ancient Phoenicians around 1000 BC. The oldest surviving literature is the Epic of Gilgamesh, a Mesopotamian poem written on clay tablets dating back to around 2000 BC, which tells the story of the Sumerian hero Gilgamesh and his journey to find the secret of immortality. Written communication helps us pass information from one generation to another without distortion.

Let’s fast forward to the invention of computers and see how we solved the problem of getting computers to understand different characters and alphabets.

BCDIC 

BCDIC (Binary-Coded Decimal Interchange Code) is a character encoding system used in early IBM computers. IBM developed it in the 1950s and 1960s as an extension of binary-coded decimal (BCD), a way of representing decimal numbers using a binary code.

BCDIC was designed for IBM’s own systems. At its heart is BCD, which represents each decimal digit as a 4-bit nibble; packed together, two nibbles fit into one 8-bit byte, with the first nibble (group of 4 bits) describing the tens digit of a two-digit decimal number and the second nibble representing the ones digit. BCDIC itself was a 6-bit code that added zone bits to the digit bits so that letters and a handful of symbols could be encoded as well.

For example, the decimal number 57 would be represented in packed BCD as 01010111, with the first nibble (0101) representing the tens digit (5) and the second nibble (0111) representing the ones digit (7).
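
To make the packed-BCD idea concrete, here is a minimal Python sketch (my own illustration, not IBM’s original implementation) that packs a two-digit number into two nibbles:

def to_packed_bcd(n: int) -> str:
    """Encode a two-digit decimal number as packed BCD: one 4-bit nibble per digit."""
    if not 0 <= n <= 99:
        raise ValueError("packed BCD stores two decimal digits per byte")
    tens, ones = divmod(n, 10)
    # The tens digit goes in the high nibble, the ones digit in the low nibble.
    return f"{tens:04b}{ones:04b}"

print(to_packed_bcd(57))  # 01010111 -> nibbles 0101 (5) and 0111 (7)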

Newer versions of BCDIC remained geared primarily toward IBM machines; it was essentially a closed system. Whenever two different computers had to communicate, or data had to move between IBM and non-IBM machines, the code failed miserably.

Code Pages

Code Page 437 & 850

As BCDIC and its successor EBCDIC (Extended Binary Coded Decimal Interchange Code) handled data movement between different computers poorly, MS-DOS 3.3 (released in 1987) introduced the concept of code pages to IBM PC users, an idea later carried forward by Windows. A code page defines a mapping of character codes to characters. Think of code pages as reference charts the computer consults when the processor needs to interpret a string.

Initially there were only a couple of code pages, but over time they evolved to support a broader range of characters and languages. For example, ASCII, developed in the 1960s, only supported English characters, whereas later standards such as the ISO-8859 family, and eventually Unicode, were designed to support a far wider range of languages and writing systems.

In the days of code pages, when data moved from one computer to another, the receiving computer had to be told explicitly which code page to use to interpret the string. As code pages became popular, many OEMs started developing their own versions, sometimes going so far as to override the mappings of existing code pages. In no time, code pages were a mess.
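
Python still ships codecs for these old code pages, so the mix-up is easy to reproduce; the byte value 0x9B below is just an illustrative example:

# The same raw byte decodes to different characters depending on the code page.
raw = bytes([0x9B])

print(raw.decode("cp437"))  # '¢' under IBM PC code page 437
print(raw.decode("cp850"))  # 'ø' under the later "multilingual" code page 850

If the receiving computer guessed the wrong code page, every byte above 127 could silently turn into a different character.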

ASCII

ASCII, first standardized in the 1960s and widely adopted through the 1970s, set out to solve the problems faced by earlier approaches to encoding characters.

ASCII has just 128 code points, of which only 95 are printable characters, severely limiting its scope. The first word of the acronym indicates the big problem with ASCII. ASCII is genuinely an American standard and isn’t good enough for other countries where English is not spoken. Where is the British pound symbol (£), for instance?
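
The 7-bit limit is easy to demonstrate in Python; the pound symbol simply has no ASCII code point:

# ASCII covers code points 0-127; anything outside that range cannot be encoded.
print("dollar: $".encode("ascii"))  # b'dollar: $'

try:
    "pound: £".encode("ascii")
except UnicodeEncodeError as err:
    print(err)  # 'ascii' codec can't encode character '\xa3' ...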

English uses the Latin (or Roman) alphabet. Among written languages that use the Latin alphabet, English is unusual in that very few words require letters with accent marks (or “diacritics”). Even for those English words where diacritics are traditionally proper, such as coöperate or résumé, the spellings without diacritics are perfectly acceptable.

Since 7-bit ASCII could never serve as a universal standard, desperate attempts were made to extend it, but even the resulting 8-bit “extended ASCII” sets could not cater to the world’s enormous appetite for characters.

DBCS

DBCS (Double-Byte Character Set) is a character encoding approach used to represent characters in some East Asian writing systems, such as Chinese and Japanese. A DBCS uses up to two bytes (16 bits) per character (in practice, most such encodings mix single-byte and double-byte characters), allowing for a much larger repertoire than single-byte character sets like ASCII.

The development of DBCS was motivated by the need to accurately represent the large number of characters used in East Asian writing systems. Unlike alphabetic writing systems, which can typically be represented using a single byte per character, East Asian writing systems use thousands of unique characters. As a result, traditional single-byte character sets were not sufficient for representing these writing systems.
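
A quick Python illustration using Shift-JIS (one widely used DBCS encoding) shows Latin letters staying one byte wide while Japanese characters take two bytes each:

# In Shift-JIS, ASCII letters occupy one byte while kana and kanji occupy two.
for text in ["ABC", "日本語"]:
    encoded = text.encode("shift_jis")
    print(text, len(text), "characters ->", len(encoded), "bytes")

# Output:
# ABC 3 characters -> 3 bytes
# 日本語 3 characters -> 6 bytes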

The first DBCS character sets were developed in the 1980s and were used in early computer systems for storing and displaying East Asian text. Over time, DBCS encodings were refined and expanded to support a broader range of languages and writing systems.

But DBCS encodings were still region-specific and limited in space, so they could not serve the globalization of computer systems.

Unicode

Unicode is a character encoding standard used to represent text in most of the world’s writing systems. It was developed in the late 1980s and early 1990s in response to the need for a consistent and universal way of representing text on computers.

Before the development of Unicode, different character encoding standards were used for different languages and writing systems. This made it challenging to exchange text between computer systems, as each system might use a different encoding. Unicode was designed to solve this problem by providing a single character encoding that could represent all of the world’s writing systems.

The first version of Unicode was released in 1991 and included support for a limited number of languages and writing systems. Over time, Unicode has been expanded and refined, and it now provides support for over 137,000 characters from a wide range of languages and writing systems.
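
Every character, from any writing system, gets a single unique code point; a quick Python sketch:

# Unicode assigns one code point per character, regardless of script.
for ch in ["A", "£", "あ", "中", "😀"]:
    print(ch, f"U+{ord(ch):04X}")

# Output: A U+0041, £ U+00A3, あ U+3042, 中 U+4E2D, 😀 U+1F600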

Today, Unicode is the world’s most widely used character encoding standard.

UTF Encodings

UTF (Unicode Transformation Format, sometimes expanded as UCS Transformation Format) is a character encoding scheme used to represent text in computers. It is part of the Unicode family of character encoding standards and was developed to provide a standardized way of representing Unicode text in computer systems.

UTF has several variants, which use code units of different sizes. The most commonly used are UTF-8, UTF-16, and UTF-32. UTF-8 uses 8-bit code units, encoding each character in one to four bytes, and is designed to be backward-compatible with ASCII. UTF-16 uses 16-bit code units (one or two per character) and is the default character encoding in many modern operating systems. UTF-32 uses a single 32-bit code unit per character, a fixed width that makes processing simple at the cost of storage space.
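
The trade-off shows up directly in how many bytes the same short string occupies under each variant (a small Python comparison; the explicit -le forms are used here to avoid the byte-order mark):

text = "Hi €"  # two ASCII letters, a space, and one non-ASCII character

for encoding in ["utf-8", "utf-16-le", "utf-32-le"]:
    encoded = text.encode(encoding)
    print(f"{encoding}: {len(encoded)} bytes")

# utf-8:     6 bytes  (1 + 1 + 1 + 3)
# utf-16-le: 8 bytes  (2 + 2 + 2 + 2)
# utf-32-le: 16 bytes (4 + 4 + 4 + 4)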

What’s next

As of now, we can map virtually every character in use on planet Earth. Unicode’s main disadvantage is the extra space required when the UTF-16 and UTF-32 encoding formats are used for data storage. If we are confident that UTF-8 can serve our purpose, we should prefer converting our texts to UTF-8, although transcoding existing data can sometimes be a risky move.
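
A typical transcoding step looks like the Python sketch below; the source encoding (cp1252 here) is only an assumption, and guessing it wrongly is exactly where the risk lies:

# Bytes produced under a legacy single-byte encoding (assumed here to be cp1252).
legacy_bytes = "naïve café".encode("cp1252")

# Decode with the assumed source encoding, then re-encode as UTF-8.
# If the assumption is wrong, the result is garbled text (mojibake)
# or a UnicodeDecodeError, which is why such conversions can be risky.
text = legacy_bytes.decode("cp1252")
utf8_bytes = text.encode("utf-8")

print(utf8_bytes)  # b'na\xc3\xafve caf\xc3\xa9'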

With growing technologies and expanding human communication, we now have smileys to represent emotions, but in a few languages we still have nothing to represent dialects and accents directly. Once we do, we might even need a UTF-64 encoding. Time will tell, and the next generation of Homo sapiens will work it out.

