Big idea

All data in a computer system is ultimately stored as binary patterns. When we work with text — letters, digits, punctuation, symbols — the computer must have a systematic way to map each character to a unique binary number. A character encoding defines that mapping. Without a shared encoding, two systems cannot reliably interpret the same text.

Character encoding therefore solves a fundamental problem in computer science: how to represent human language in binary form.

1. Why character encoding is necessary

Characters themselves have no inherent binary meaning. The letter A must be mapped to some binary pattern, but different systems might choose different patterns unless a standard exists.

A character encoding ensures that:

Every character has a unique code point (a number assigned to that character).
Every code point has a standard binary representation (1 byte, several bytes, etc.).
Different computers interpret the same file the same way, even across operating systems, programming languages, and network protocols.

This belongs to the broader topic of data representation in the computer science syllabus: how integers, characters, strings, images, and other data types are encoded as binary.

2. ASCII: the historical foundation

ASCII (American Standard Code for Information Interchange) was one of the earliest widely adopted encoding schemes.

Key properties

7-bit encoding (128 possible values: 0–127).
Represents English letters, digits, punctuation, and control codes.
A capital 'A' is represented as:
- Decimal: 65
- Hex: 0x41
- Binary: 0100 0001₂
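The three notations above can be checked directly. This is a minimal sketch using Python's built-in `ord`, `chr`, `hex`, and `format`; the variable names are illustrative only.

```python
# Look up the numeric code for 'A' and show it in the three notations above.
code = ord("A")                      # character -> integer code
hex_form = hex(code)                 # hexadecimal string with 0x prefix
binary_form = format(code, "08b")    # binary string, zero-padded to 8 bits

print(code, hex_form, binary_form)

# The mapping also works in reverse: integer code -> character.
back = chr(code)
```

Padding to 8 bits reflects how ASCII is normally stored: one byte per character, with the top bit set to 0.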

Advantages

Compact, simple, stable for decades.
Many modern encodings, including UTF-8, remain compatible with ASCII.

Limitation

ASCII cannot represent characters outside basic English: no accented letters, no non-Latin scripts, no emoji, and almost no mathematical or scientific symbols.

3. Unicode: a universal character set

Unicode aims to assign a code point to every character in the world's writing systems, modern and historical, along with technical symbols and emoji. Code points range from U+0000 to U+10FFFF and are written as U+ followed by hexadecimal digits:

U+0041  → 'A'
U+03A9  → 'Ω'
U+1F600 → 😀

The key idea

Unicode is a table of code points, not a storage format.
To store Unicode code points in memory or on disk, we need an encoding. Several encodings exist, including UTF-8, UTF-16, and UTF-32.
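To make the distinction between a code point and its stored bytes concrete, the sketch below takes one character and encodes it three ways. It assumes the big-endian variants (`utf-16-be`, `utf-32-be`) so that Python does not prepend a byte-order mark; the variable names are illustrative.

```python
ch = "Ω"

# The code point is just a number in the Unicode table.
code_point = ord(ch)
label = f"U+{code_point:04X}"

# The same code point stored under three different encodings.
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")
utf32_bytes = ch.encode("utf-32-be")

print(label)
print("UTF-8 :", utf8_bytes.hex())
print("UTF-16:", utf16_bytes.hex())
print("UTF-32:", utf32_bytes.hex())
```

The code point never changes; only the byte sequence used to store it does.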

4. UTF-8: the dominant encoding

UTF-8 is the most widely used encoding on the modern internet and in most software systems.

Design goals

Backward-compatible with ASCII.
Efficient for English-language text.
Capable of encoding all Unicode code points.
Self-synchronizing: a decoder can find the start of the next character from any byte position, which helps recovery after corrupted or truncated data.

Encoding rules (variable-length)

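UTF-8 uses 1 to 4 bytes per character. The leading bits of the first byte give the sequence length, and every following byte has the form 10xxxxxx (a continuation byte):

Bytes	Code point range	Byte pattern	Typical content
1	U+0000–U+007F	0xxxxxxx	ASCII characters (0–127)
2	U+0080–U+07FF	110xxxxx 10xxxxxx	Accented Latin letters, Greek, Cyrillic, Hebrew, Arabic
3	U+0800–U+FFFF	1110xxxx 10xxxxxx 10xxxxxx	Rest of the Basic Multilingual Plane, including most Chinese, Japanese, and Korean characters
4	U+10000–U+10FFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx	Emoji, historic scripts, rare symbols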

Because ASCII bytes keep their original values, any ASCII text is also valid UTF-8.
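The encoding rules can be written out by hand to see where each bit goes. The function below is a teaching sketch, not how production code should encode text (Python's built-in `str.encode` does that); `utf8_encode` is a hypothetical name. It relies on the fact that every byte after the first in a multi-byte sequence has the form 10xxxxxx.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using the bit patterns directly."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")

    def cont(shift: int) -> int:
        # Continuation byte: 10xxxxxx carrying 6 bits of the code point.
        return 0b10000000 | ((code_point >> shift) & 0b00111111)

    if code_point <= 0x7F:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6), cont(0)])
    if code_point <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12), cont(6), cont(0)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (code_point >> 18), cont(12), cont(6), cont(0)])


# Compare the hand-written encoder with Python's built-in one.
for ch in "AΩ€😀":
    print(f"U+{ord(ch):04X}", utf8_encode(ord(ch)).hex(), ch.encode("utf-8").hex())
```

Each pair of hex strings printed should match, one example for each of the four sequence lengths.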

Why UTF-8 dominates

Efficient for English (1 byte per character).
Universal and portable.
Backward compatible with legacy systems.
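Robust to transmission errors: corruption affects only nearby characters, because a decoder can resynchronize at the next lead byte.
The default or recommended encoding in the HTML standard, most Linux distributions, Python 3 source files, and most modern APIs.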
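The robustness point rests on self-synchronization: continuation bytes always start with the bits 10, so a decoder dropped into the middle of a stream can skip forward to the next character boundary. A minimal sketch, assuming the hypothetical helper names `is_continuation` and `next_boundary`:

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx."""
    return (byte & 0b11000000) == 0b10000000


def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index >= pos that is not a continuation byte."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


# Byte layout: 41 | ce a9 | f0 9f 98 80 | 42
data = "AΩ😀B".encode("utf-8")

# Index 4 falls inside the 4-byte emoji, so decoding from there would fail.
start = next_boundary(data, 4)
recovered = data[start:].decode("utf-8")
print(start, recovered)
```

Only the character that was cut is lost; everything after the next boundary decodes normally.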

5. Characters vs strings: the software perspective

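A character, for the purposes of this topic, is a single code point. (Some symbols a reader sees as one character, such as a letter with a combining accent, are built from several code points.)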
A string is a sequence of zero or more characters.

Memory and storage implications

Character length ≠ byte length.
- Example: "A" → 1 byte in UTF-8
- Example: "Ω" → 2 bytes in UTF-8
- Example: "😀" → 4 bytes in UTF-8
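The difference between character count and byte count is easy to measure. In this sketch `utf8_size` is a hypothetical helper; in Python 3, `len` on a `str` counts code points, while `len` on the encoded `bytes` counts bytes.

```python
def utf8_size(text: str) -> int:
    """Number of bytes needed to store text as UTF-8."""
    return len(text.encode("utf-8"))


for text in ["A", "Ω", "😀", "AΩ😀"]:
    # !a prints non-ASCII characters as escapes, so this works on any terminal.
    print(f"{text!a}: {len(text)} code point(s), {utf8_size(text)} byte(s)")
```

A buffer sized by character count would be too small for the last three strings.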

Practical consequences for programmers

Counting bytes is not the same as counting characters.
Slicing strings requires awareness of encoding boundaries.
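String types differ by language: Python 3 strings are indexed by Unicode code point, whereas Java strings are sequences of UTF-16 code units, so a character outside the BMP (such as 😀) counts as two in length(). Neither works in raw bytes.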
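The slicing hazard shows up as soon as text is handled as bytes. The sketch below, with illustrative variable names, slices the same text once as a Python 3 `str` and once as UTF-8 `bytes`.

```python
text = "Ω is omega"
data = text.encode("utf-8")      # 'Ω' occupies the first two bytes

# Slicing the str works on code points, so the character stays whole.
first_char = text[:1]

# Slicing the bytes at the same index cuts 'Ω' in half.
broken = data[:1]
try:
    broken.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# One safe pattern: decode first, slice the text, then encode again.
safe = data.decode("utf-8")[:1].encode("utf-8")
print(decoded_ok, safe.hex())
```

Truncating at a byte limit (for a database column or a network packet, say) needs the same care: cut at a character boundary, not at an arbitrary byte.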

This aligns with the IB requirement to understand how data such as characters and strings are encoded in binary.

6. Common failure modes

Students frequently encounter encoding problems when:

Opening files in the wrong encoding.
Mixing ASCII, Latin-1, and UTF-8 in a system.
Handling network data without specifying encoding.
Passing byte arrays where code points are expected.

These situations produce garbled text (mojibake), the classic � "replacement character" (U+FFFD), or invalid byte-sequence errors.
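Two of these failures can be reproduced in a few lines. The sketch assumes a sample word, "café", and uses Python's standard `latin-1` and `utf-8` codecs; the variable names are illustrative.

```python
original = "café"

# Failure 1: UTF-8 bytes read as Latin-1. No error is raised, but the
# two bytes of 'é' are shown as two separate, wrong characters.
utf8_data = original.encode("utf-8")
garbled = utf8_data.decode("latin-1")

# Failure 2: Latin-1 bytes read as UTF-8. The single byte 0xE9 is not
# a valid UTF-8 sequence, so strict decoding raises an error.
latin1_data = original.encode("latin-1")
try:
    latin1_data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With errors="replace", the bad byte becomes U+FFFD instead.
replaced = latin1_data.decode("utf-8", errors="replace")

print(ascii(garbled), strict_failed, ascii(replaced))
```

The usual fix is to state the encoding explicitly wherever bytes become text, for example `open(path, encoding="utf-8")`, rather than relying on a platform default.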

7. Character Encoding Comparison Table

Encoding	Type	Bytes per Character	Supported Character Range	Backward Compatibility	Advantages	Disadvantages
ASCII	Fixed-length (7-bit)	1 byte (7 bits used, 1 unused)	128 characters (0–127)	N/A (original standard)	Very simple; compact for English; foundation for later encodings	Cannot represent accented characters, non-Latin scripts, emoji, or modern symbols
UTF-8	Variable-length	1–4 bytes	Full Unicode range (U+0000–U+10FFFF; 1,114,112 code points, minus the 2,048 surrogates)	Yes: ASCII bytes are unchanged	Dominant web encoding; efficient for English; corruption stays local; self-synchronizing	Characters outside ASCII require 2–4 bytes; random access by character index is slower
UTF-16	Variable-length	2 or 4 bytes	Full Unicode range	Not ASCII-compatible; BMP characters use the same 2-byte values as UCS-2	Compact for many East Asian scripts (2 bytes where UTF-8 needs 3); one 2-byte unit per BMP character; used internally by Windows and Java	Uses surrogate pairs outside the BMP; endianness issues (UTF-16LE/UTF-16BE); more complex than UTF-8
UTF-32	Fixed-length	4 bytes for every character	Full Unicode range	Not ASCII-compatible	Very simple mapping: 1 code point = 1 32-bit word; fast random access	Wastes space (4× the size of ASCII text, 2× UTF-16 for BMP text); endianness issues; rarely used for storage or transmission
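The space trade-offs in the table can be measured on short samples. This sketch assumes four sample strings chosen for illustration and uses the little-endian codecs (`utf-16-le`, `utf-32-le`) so that no byte-order mark is added to the counts.

```python
samples = {
    "english": "hello",
    "greek": "αβγ",
    "cjk": "漢字",
    "emoji": "😀",
}
encodings = ["utf-8", "utf-16-le", "utf-32-le"]

# sizes[name][encoding] = number of bytes for that sample in that encoding
sizes = {
    name: {enc: len(text.encode(enc)) for enc in encodings}
    for name, text in samples.items()
}

for name, row in sizes.items():
    print(f"{name:8}", row)
```

UTF-8 is smallest for the English sample, UTF-16 is smallest for the CJK sample, and UTF-32 is always exactly 4 bytes per code point.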

8. Summary

Human languages require rich expressive symbols, but computers operate only with binary data. A character encoding defines the mapping between characters and their binary representations. ASCII provides a compact 7-bit foundation, while Unicode generalizes this idea to cover every symbol used worldwide. UTF-8 then provides a practical, efficient, backward-compatible way to store and transmit Unicode text.

Understanding character encoding is foundational for programming, networking, databases, and web development — wherever text data must be stored, compared, sorted, or transmitted reliably.