Character Encoding

This article is not assessed by the IB but may be helpful to deepen your understanding. Plus, I think it's cool.

Big idea

All data in a computer system is ultimately stored as binary patterns. When we work with text — letters, digits, punctuation, symbols — the computer must have a systematic way to map each character to a unique binary number. A character encoding defines that mapping. Without a shared encoding, two systems cannot reliably interpret the same text.

Character encoding therefore solves a fundamental problem in computer science: how to represent human language in binary form.

 

1. Why character encoding is necessary

Characters themselves have no inherent binary meaning. The letter A must be mapped to some binary pattern, but different systems might choose different patterns unless a standard exists.

A character encoding ensures that:

  1. Every character has a unique code point (a number assigned to that character).
  2. Every code point has a standard binary representation (1 byte, several bytes, etc.).
  3. Different computers interpret the same file the same way, even across operating systems, programming languages, and network protocols.
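
The three guarantees above can be sketched in Python. The helper name describe is hypothetical and exists only for this illustration; ord, chr, str.encode, and bytes.decode are Python built-ins.

```python
def describe(ch: str, encoding: str = "utf-8") -> tuple[int, bytes]:
    """Return the code point and the encoded bytes of a single character."""
    code_point = ord(ch)        # guarantee 1: character -> code point
    raw = ch.encode(encoding)   # guarantee 2: code point -> bytes
    return code_point, raw

code_point, raw = describe("A")
print(code_point, raw)          # 65 b'A'

# Guarantee 3: any system that applies the same encoding recovers the same text.
assert raw.decode("utf-8") == "A"
assert chr(code_point) == "A"
```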

This belongs to the broader topic of data representation in the computer science syllabus: how integers, characters, strings, images, and other data types are encoded as binary.

 

2. ASCII: the historical foundation

ASCII (American Standard Code for Information Interchange) was one of the earliest widely adopted encoding schemes.

Key properties

  • 7-bit encoding (128 possible values: 0–127).
  • Represents English letters, digits, punctuation, and control codes.
  • A capital 'A' is represented as:
    • Decimal: 65
    • Hex: 0x41
    • Binary: 0100 0001₂
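
As a quick check of the three notations above, the following Python sketch prints the same value in decimal, hexadecimal, and binary, using only built-in functions.

```python
code = ord("A")              # look up the code for capital A

print(code)                  # 65
print(hex(code))             # 0x41
print(format(code, "08b"))   # 01000001

# ASCII is a 7-bit code, so every ASCII value is below 128.
assert code < 128
# Going the other way: the binary pattern maps back to the same letter.
assert chr(0b01000001) == "A"
```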

Advantages

  • Compact, simple, stable for decades.
  • Many modern encodings, including UTF-8, keep the ASCII values unchanged for compatibility.

Limitation

ASCII cannot represent characters outside basic English — no accented letters, no non-Latin scripts, no emojis, no scientific symbols.
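
This limitation is easy to observe in Python. The sketch below assumes a hypothetical helper, fits_in_ascii, and then shows what happens when text containing an accented letter is encoded as ASCII.

```python
def fits_in_ascii(text: str) -> bool:
    """True if every character has a code point in the ASCII range 0-127."""
    return all(ord(ch) < 128 for ch in text)

word = "caf\u00e9"               # "cafe" with an acute accent on the e

print(fits_in_ascii("Hello"))    # True
print(fits_in_ascii(word))       # False

try:
    word.encode("ascii")
except UnicodeEncodeError as err:
    # err.start and err.end mark the character that ASCII cannot represent.
    bad = err.object[err.start:err.end]
    print("cannot encode:", bad)
```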

 

3. Unicode: a universal character set

Unicode was designed to encode every written symbol used by humans, plus many technical, historical, and symbolic systems. Unicode assigns each character a unique code point, written as:

U+0041  → 'A'
U+03A9  → 'Ω'
U+1F600 → 😀

The key idea

Unicode is a table of code points, not a storage format.
To store Unicode code points in memory or on disk, we need an encoding. Several encodings exist, including UTF-8, UTF-16, and UTF-32.
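
To make the distinction between a code point and its stored form concrete, here is a small Python sketch. The helper code_point_label is a hypothetical name; the big-endian codec names ("utf-16-be", "utf-32-be") are chosen here so that no byte-order mark is added to the output.

```python
def code_point_label(ch: str) -> str:
    """Format a character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"

# One code point, three different byte sequences depending on the encoding.
for ch in ["A", "\u03a9", "\U0001F600"]:
    print(
        code_point_label(ch),
        "| UTF-8:", ch.encode("utf-8").hex(" "),
        "| UTF-16:", ch.encode("utf-16-be").hex(" "),
        "| UTF-32:", ch.encode("utf-32-be").hex(" "),
    )
```

The code point of each character stays the same on every line; only the bytes used to store it change.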

 

4. UTF-8: the dominant encoding

UTF-8 is the most widely used encoding on the modern internet and in most software systems.

Design goals

  • Backward-compatible with ASCII.
  • Efficient for English-language text.
  • Capable of encoding all Unicode code points.
  • Self-synchronizing (good for error recovery and network transmission).

Encoding rules (variable-length)

UTF-8 uses 1 to 4 bytes per character:

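Bytes | Binary prefix | Purpose
----- | ------------- | -------
1     | 0xxxxxxx      | ASCII characters (0–127)
2     | 110xxxxx …    | Most European letters
3     | 1110xxxx …    | Many non-Latin scripts
4     | 11110xxx …    | Rare symbols, emoji, historic scripts

The … stands for the continuation bytes that follow the first byte; each continuation byte has the form 10xxxxxx.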

Because ASCII bytes keep their original values, any ASCII text is also valid UTF-8.
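
The prefix rules can be turned into a hand-written encoder. This is a teaching sketch rather than a replacement for Python's built-in codec: the function name utf8_encode is hypothetical, and for brevity it does not reject the surrogate range U+D800 to U+DFFF, which a strict encoder would refuse.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, following the UTF-8 prefix rules."""
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point < 0x110000:    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11110000 | (code_point >> 18),
            0b10000000 | ((code_point >> 12) & 0b111111),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    raise ValueError("code point is outside the Unicode range")

# Compare the hand-written encoder with Python's built-in one.
for ch in ["A", "\u03a9", "\u20ac", "\U0001F600"]:
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```

Each branch places the high bits of the code point after the prefix in the first byte and packs the remaining bits, six at a time, into continuation bytes.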

Why UTF-8 dominates

  • Efficient for English (1 byte per character).
  • Universal and portable.
  • Backward compatible with legacy systems.
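  • Robust to transmission errors: because the encoding is self-synchronizing, a corrupted or lost byte damages only one character, not the rest of the stream.
  • The default or recommended encoding for HTML5, most Linux systems, Python 3 source files, and most modern APIs.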
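
The self-synchronizing property can be demonstrated with a short Python sketch. The helper is_continuation is a hypothetical name, and deleting a byte from a bytearray is only a stand-in for a real transmission error.

```python
def is_continuation(byte: int) -> bool:
    """Continuation bytes always have the bit pattern 10xxxxxx."""
    return (byte & 0b11000000) == 0b10000000

text = "A\u03a9B"                          # A, Greek capital omega, B
data = bytearray(text.encode("utf-8"))    # bytes: 41 ce a9 42
del data[1]                               # simulate losing the lead byte 0xCE

# The orphaned byte 0xA9 is recognisable as a continuation byte,
# so a decoder knows it cannot be the start of a character.
flags = [is_continuation(b) for b in data]
print(flags)                              # [False, True, False]

# Only the damaged character is replaced by U+FFFD; A and B survive.
recovered = bytes(data).decode("utf-8", errors="replace")
print(recovered == "A\ufffdB")            # True
```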

 

5. Characters vs strings: the software perspective

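A character, for the purposes of this article, is a single code point. (Some symbols that look like one character on screen, such as a letter followed by a combining accent, are built from several code points.)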
A string is a sequence of zero or more characters.

Memory and storage implications

  • Character length ≠ byte length.
    • Example: "A" → 1 byte in UTF-8
    • Example: "Ω" → 2 bytes in UTF-8
    • Example: "😀" → 4 bytes in UTF-8
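
The examples above can be checked in Python, where len on a string counts code points and len on the encoded bytes counts bytes. The helper name utf8_length is hypothetical.

```python
def utf8_length(text: str) -> int:
    """Number of bytes the text occupies when stored as UTF-8."""
    return len(text.encode("utf-8"))

for ch in ["A", "\u03a9", "\U0001F600"]:
    print(f"U+{ord(ch):04X}: {len(ch)} character, {utf8_length(ch)} byte(s)")

word = "na\u00efve"              # 5 characters, one of them outside ASCII
assert len(word) == 5
assert utf8_length(word) == 6    # the accented letter needs 2 bytes
```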

Practical consequences for programmers

  • Counting bytes is not the same as counting characters.
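  • Slicing raw bytes requires awareness of character boundaries: cutting inside a multi-byte sequence leaves invalid data.
  • Python 3 strings are sequences of Unicode code points, not raw bytes. Java strings are sequences of UTF-16 code units, so a character outside the Basic Multilingual Plane (such as 😀) occupies two char values.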
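
One way to see why byte boundaries matter is to cut encoded text at a fixed byte limit. The sketch below is one possible approach, not the only one; truncate_utf8 is a hypothetical helper that relies on the errors="ignore" handler to drop an incomplete sequence left at the end of the slice.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    data = text.encode("utf-8")[:max_bytes]
    # The input was valid UTF-8, so the only possible damage is a
    # partial character at the very end, which errors="ignore" discards.
    return data.decode("utf-8", errors="ignore")

raw = "A\u03a9B".encode("utf-8")    # bytes: 41 ce a9 42

try:
    raw[:2].decode("utf-8")          # naive slice ends inside the omega
except UnicodeDecodeError:
    print("slice ended inside a character")

print(truncate_utf8("A\u03a9B", 2) == "A")         # True
print(truncate_utf8("A\u03a9B", 3) == "A\u03a9")   # True
```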

This aligns with the IB requirement to understand how data such as characters and strings are encoded in binary.

 

6. Common failure modes

Students frequently encounter encoding problems when:

  1. Opening files in the wrong encoding.
  2. Mixing ASCII, Latin-1, and UTF-8 in a system.
  3. Handling network data without specifying encoding.
  4. Passing byte arrays where code points are expected.

These situations cause the classic "replacement character" or invalid byte-sequence errors.
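
Two of these failures can be reproduced in a few lines of Python. The sketch assumes a UTF-8 and Latin-1 mix-up, which is only one of many possible mismatches.

```python
original = "caf\u00e9"                   # "cafe" with an accented e
utf8_bytes = original.encode("utf-8")    # bytes: 63 61 66 c3 a9
latin1_bytes = original.encode("latin-1")  # bytes: 63 61 66 e9

# UTF-8 bytes read as Latin-1: no error, but the accented letter
# silently turns into two wrong characters (U+00C3 and U+00A9).
garbled = utf8_bytes.decode("latin-1")
print(len(original), len(garbled))       # 4 5

# Latin-1 bytes read as UTF-8: the lone byte 0xE9 is not a valid sequence.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("invalid byte sequence")

# With errors="replace" the decoder substitutes U+FFFD instead of failing.
replaced = latin1_bytes.decode("utf-8", errors="replace")
print(replaced == "caf\ufffd")           # True
```

Naming the encoding explicitly on both sides, for example open(path, encoding="utf-8"), avoids relying on a platform default.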

 

7. Character encoding comparison table

Encoding | Type | Bytes per Character | Supported Character Range | Backward Compatibility | Advantages | Disadvantages
-------- | ---- | ------------------- | ------------------------- | ---------------------- | ---------- | -------------
ASCII | Fixed-length (7-bit) | 1 byte (7 bits used, 1 unused) | 128 characters (0–127) | N/A (original standard) | Very simple; compact for English; foundation for later encodings | Cannot represent accented characters, non-Latin scripts, emoji, or modern symbols
UTF-8 | Variable-length | 1–4 bytes | Full Unicode range (1,114,112 code points) | Yes: ASCII bytes are unchanged | Dominant web encoding; efficient for English; robust to transmission errors; self-synchronizing | Characters outside ASCII require 2–4 bytes; random access by index is slower
UTF-16 | Variable-length | 2 or 4 bytes | Full Unicode range | Not ASCII-compatible, but compatible with UCS-2 for BMP characters | Efficient for Asian languages; fixed 2-byte format for BMP characters; widely used in Windows and Java | Uses surrogate pairs; endianness issues (UTF-16LE/UTF-16BE); more complex than UTF-8
UTF-32 | Fixed-length | 4 bytes for every character | Full Unicode range | Not ASCII-compatible | Very simple mapping: 1 code point = 1 32-bit word; fast random access | Wastes space (4× the size of ASCII text, 2× the size of UTF-16 for BMP characters); rarely used for storage or transmission
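
The space trade-offs in the table can be measured directly. In this sketch the sample strings and the helper name encoded_sizes are assumptions chosen for illustration; the little-endian codec names are used so that no byte-order mark inflates the counts.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Byte count of the same text in each Unicode encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

samples = {
    "English": "hello",
    "Greek": "\u03b1\u03b2\u03b3",     # alpha, beta, gamma
    "Chinese": "\u4f60\u597d",         # two CJK characters
    "Emoji": "\U0001F600",
}

for name, text in samples.items():
    print(name, encoded_sizes(text))
```

English text is smallest in UTF-8, the CJK sample is smallest in UTF-16, and UTF-32 always uses 4 bytes per code point.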

 

8. Summary

Human languages require rich expressive symbols, but computers operate only with binary data. A character encoding defines the mapping between characters and their binary representations. ASCII provides a compact 7-bit foundation, while Unicode generalizes this idea to cover every symbol used worldwide. UTF-8 then provides a practical, efficient, backward-compatible way to store and transmit Unicode text.

Understanding character encoding is foundational for programming, networking, databases, and web development — wherever text data must be stored, compared, sorted, or transmitted reliably.