Big idea
All data in a computer system is ultimately stored as binary patterns. When we work with text — letters, digits, punctuation, symbols — the computer must have a systematic way to map each character to a unique binary number. A character encoding defines that mapping. Without a shared encoding, two systems cannot reliably interpret the same text.
Character encoding therefore solves a fundamental problem in computer science: how to represent human language in binary form.
1. Why character encoding is necessary
Characters themselves have no inherent binary meaning. The letter A must be mapped to some binary pattern, but different systems might choose different patterns unless a standard exists.
A character encoding ensures that:
- Every character has a unique code point (a number assigned to that character).
- Every code point has a standard binary representation (1 byte, several bytes, etc.).
- Different computers interpret the same file the same way, even across operating systems, programming languages, and network protocols.
This belongs to the broader topic of data representation in the computer science syllabus: how integers, characters, strings, images, and other data types are encoded as binary.
2. ASCII: the historical foundation
ASCII (American Standard Code for Information Interchange) was one of the earliest widely adopted encoding schemes.
Key properties
- 7-bit encoding (128 possible values: 0–127).
- Represents English letters, digits, punctuation, and control codes.
- A capital 'A' is represented as:
  - Decimal: 65
  - Hex: 0x41
  - Binary: 0100 0001₂
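The three notations for 'A' listed above can be checked with a few lines of Python. This is a minimal sketch using the built-in `ord` and `chr` functions; the variable names are only illustrative.

```python
# Look up the code for 'A' and format it in decimal, hex, and binary.
code_point = ord("A")             # character -> integer code
as_decimal = str(code_point)      # "65"
as_hex = f"0x{code_point:02X}"    # "0x41"
as_binary = f"{code_point:08b}"   # "01000001" (padded to a full byte)

print(as_decimal, as_hex, as_binary)

# chr() is the inverse of ord(): integer code -> character.
assert chr(65) == "A"
```

Note that the binary form is padded to 8 bits: ASCII needs only 7, so the leading bit is always 0.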
Advantages
- Compact, simple, stable for decades.
- Many modern encodings, including UTF-8, are designed to be compatible with ASCII.
Limitation
ASCII cannot represent characters outside basic English — no accented letters, no non-Latin scripts, no emojis, no scientific symbols.
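The limitation is easy to observe in practice. The sketch below assumes Python 3 and uses a hypothetical helper, `try_ascii`, that reports whether a string fits in ASCII.

```python
def try_ascii(text: str) -> str:
    """Return a short report on whether text can be encoded as ASCII."""
    try:
        data = text.encode("ascii")
        return f"ok: {len(data)} bytes"
    except UnicodeEncodeError as err:
        # err.start is the index of the first character that ASCII cannot encode
        return f"failed at index {err.start}: {text[err.start]!r}"

print(try_ascii("cafe"))        # plain English letters fit in ASCII
print(try_ascii("caf\u00e9"))   # the accented letter (U+00E9) does not
```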
3. Unicode: a universal character set
Unicode was designed to encode every written symbol used by humans, plus many technical, historical, and symbolic systems. Unicode assigns each character a unique code point, written as:
U+0041 → 'A'
U+03A9 → 'Ω'
U+1F600 → 😀
The key idea
Unicode is a table of code points, not a storage format.
To store Unicode code points in memory or on disk, we need an encoding. Several encodings exist, including UTF-8, UTF-16, and UTF-32.
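To illustrate the difference between a code point and its stored form, the following sketch encodes the same character, Ω (U+03A9), in three ways. The big-endian codec names (`utf-16-be`, `utf-32-be`) are chosen here only so that no byte-order mark is added to the output.

```python
omega = "\u03a9"  # the character Ω

# The code point is one number, independent of any storage format.
code_point_label = f"U+{ord(omega):04X}"

# The same code point produces different byte sequences in each encoding.
utf8_hex = omega.encode("utf-8").hex()       # 2 bytes
utf16_hex = omega.encode("utf-16-be").hex()  # 2 bytes
utf32_hex = omega.encode("utf-32-be").hex()  # 4 bytes

print(code_point_label, utf8_hex, utf16_hex, utf32_hex)
```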
4. UTF-8: the dominant encoding
UTF-8 is the most widely used encoding on the modern internet and in most software systems.
Design goals
- Backward-compatible with ASCII.
- Efficient for English-language text.
- Capable of encoding all Unicode code points.
- Self-synchronizing (good for error recovery and network transmission).
Encoding rules (variable-length)
UTF-8 uses 1 to 4 bytes per character:
| Bytes | Binary prefix | Purpose |
|---|---|---|
| 1 | 0xxxxxxx | ASCII characters (0–127) |
| 2 | 110xxxxx … | Most European letters |
| 3 | 1110xxxx … | Many non-Latin scripts |
| 4 | 11110xxx … | Rare symbols, emoji, historic scripts |
Because ASCII bytes keep their original values, any ASCII text is also valid UTF-8.
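The byte patterns in the table can be turned into a small encoder. The sketch below is for illustration only (real programs should use `str.encode`); the function name `utf8_encode` is hypothetical, and the range limits 0x80, 0x800, and 0x10000 come from the UTF-8 definition rather than from the table above. Every byte after the first uses the continuation pattern 10xxxxxx.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustrative only)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:          # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])

# Compare against Python's built-in encoder for one character of each length.
for ch in ["A", "\u03a9", "\u20ac", "\U0001F600"]:
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {utf8_encode(ord(ch)).hex()}")
```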
Why UTF-8 dominates
- Efficient for English (1 byte per character).
- Universal and portable.
- Backward compatible with legacy systems.
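- Robust to transmission errors: a corrupted byte damages only the character it belongs to, because a decoder can find the start of the next character.
- The default or recommended encoding in HTML5, most Linux systems, Python 3 source files, and most modern APIs.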
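Self-synchronization can be sketched as follows. Continuation bytes always have the form 10xxxxxx, so a decoder that lands in the middle of a character can skip forward to the next byte that does not match that pattern. The helper name `next_boundary` is hypothetical.

```python
def next_boundary(data: bytes, index: int) -> int:
    """Return the first position >= index that starts a UTF-8 character."""
    # Skip continuation bytes, which all match the bit pattern 10xxxxxx.
    while index < len(data) and (data[index] & 0b11000000) == 0b10000000:
        index += 1
    return index

# "A" (1 byte) + "Ω" (2 bytes) + "😀" (4 bytes) = 7 bytes in total
data = "A\u03a9\U0001F600".encode("utf-8")

start = next_boundary(data, 2)   # index 2 is the second byte of Ω
print(start, data[start:].decode("utf-8"))
```

Starting from index 2, the function skips one continuation byte and stops at index 3, where the emoji begins, so the rest of the data decodes cleanly.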
5. Characters vs strings: the software perspective
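For this topic, a character is a single Unicode code point. (Some symbols that a reader sees as one character, such as a flag emoji, are built from several code points.)
A string is a sequence of zero or more characters.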
Memory and storage implications
- Character length ≠ byte length. Examples:
  - "A" → 1 byte in UTF-8
  - "Ω" → 2 bytes in UTF-8
  - "😀" → 4 bytes in UTF-8
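The gap between character count and byte count can be measured directly. This sketch assumes Python 3, where `len` on a `str` counts code points and `len` on the encoded `bytes` counts bytes; the helper name `sizes` is illustrative.

```python
def sizes(text: str) -> tuple:
    """Return (number of code points, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))

for sample in ["A", "\u03a9", "\U0001F600", "A\u03a9\U0001F600"]:
    chars, nbytes = sizes(sample)
    print(f"{sample!r}: {chars} character(s), {nbytes} byte(s)")
```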
Practical consequences for programmers
- Counting bytes is not the same as counting characters.
- Slicing strings requires awareness of encoding boundaries.
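- String libraries in modern languages work with Unicode text rather than raw bytes, but the unit differs: Python 3 strings are indexed by code point, while Java strings are indexed by UTF-16 code units, so a character such as 😀 counts as two units in Java.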
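A short example shows why slicing must respect encoding boundaries. In this sketch (Python 3 assumed), slicing the string is safe, but slicing the encoded bytes at the same index cuts a two-byte character in half.

```python
text = "na\u00efve"            # "naïve"; the ï is 2 bytes in UTF-8
data = text.encode("utf-8")    # 6 bytes for 5 characters

# Slicing the string works on whole characters.
text_slice = text[:3]          # "naï"

# Slicing the bytes at the same index splits ï across the cut.
byte_slice = data[:3]          # b"na\xc3" ends mid-character
try:
    byte_slice.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(text_slice, byte_slice, decoded_ok)
```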
This aligns with the IB requirement to understand how data such as characters and strings are encoded in binary.
6. Common failure modes
Students frequently encounter encoding problems when:
- Opening files in the wrong encoding.
- Mixing ASCII, Latin-1, and UTF-8 in a system.
- Handling network data without specifying encoding.
- Passing byte arrays where code points are expected.
These situations produce garbled text (often called mojibake), the classic � "replacement character" (U+FFFD), or invalid byte-sequence errors.
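Two of these failures can be reproduced in a few lines. The sketch below (Python 3 assumed) first decodes UTF-8 bytes as Latin-1, which gives garbled text, and then decodes Latin-1 bytes as UTF-8 with `errors="replace"`, which gives the replacement character.

```python
word = "caf\u00e9"  # "café"

# Case 1: written as UTF-8, read as Latin-1 -> each byte becomes its own character.
utf8_bytes = word.encode("utf-8")          # b"caf\xc3\xa9"
garbled = utf8_bytes.decode("latin-1")     # 4 letters become 5 characters

# Case 2: written as Latin-1, read as UTF-8 -> the lone byte 0xE9 is invalid.
latin1_bytes = word.encode("latin-1")      # b"caf\xe9"
replaced = latin1_bytes.decode("utf-8", errors="replace")

print(garbled)    # the é shows up as two unrelated characters
print(replaced)   # the é shows up as U+FFFD
```

Without `errors="replace"`, the second decode raises `UnicodeDecodeError` instead, which is the invalid byte-sequence error mentioned above.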
7. Character Encoding Comparison Table
| Encoding | Type | Bytes per Character | Supported Character Range | Backward Compatibility | Advantages | Disadvantages |
|---|---|---|---|---|---|---|
| ASCII | Fixed-length (7-bit) | 1 byte (7 bits used, 1 unused) | 128 characters (0–127) | N/A (original standard) | Very simple; compact for English; foundation for later encodings | Cannot represent accented characters, non-Latin scripts, emoji, or modern symbols |
| UTF-8 | Variable-length | 1–4 bytes | Full Unicode range (1,114,112 code points) | Yes — ASCII bytes are unchanged | Dominant web encoding; efficient for English; robust to transmission errors; self-synchronizing | Characters outside ASCII may require 2–4 bytes; random access by index is slower |
| UTF-16 | Variable-length | 2 or 4 bytes | Full Unicode range | Not ASCII-compatible; backward-compatible with the older UCS-2 encoding for BMP characters | Efficient for Asian languages; fixed 2-byte format for BMP characters (the Basic Multilingual Plane, U+0000–U+FFFF); widely used in Windows and Java | Uses surrogate pairs; endianness issues (UTF-16LE/UTF-16BE); more complex than UTF-8 |
| UTF-32 | Fixed-length | 4 bytes for every character | Full Unicode range | Not ASCII-compatible | Very simple mapping: 1 code point = 1 32-bit word; fast random access | Wastes space (4× ASCII, 2× BMP); rarely used for storage or transmission |
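The surrogate pairs and endianness issues listed for UTF-16 can be made concrete. The following sketch computes the surrogate pair for 😀 (U+1F600) by hand and compares it with Python's built-in codecs; the helper name `surrogate_pair` is hypothetical.

```python
def surrogate_pair(code_point: int):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

emoji = "\U0001F600"
high, low = surrogate_pair(ord(emoji))

utf16_hex = emoji.encode("utf-16-be").hex()   # two 16-bit units, 4 bytes
utf32_hex = emoji.encode("utf-32-be").hex()   # one 32-bit unit, 4 bytes

# The plain "utf-16" codec prepends a 2-byte byte-order mark (BOM)
# so that a reader can tell which endianness was used.
with_bom = "A".encode("utf-16")

print(f"{high:04X} {low:04X}", utf16_hex, utf32_hex, len(with_bom))
```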
8. Summary
Human languages require rich expressive symbols, but computers operate only with binary data. A character encoding defines the mapping between characters and their binary representations. ASCII provides a compact 7-bit foundation, while Unicode generalizes this idea to cover every symbol used worldwide. UTF-8 then provides a practical, efficient, backward-compatible way to store and transmit Unicode text.
Understanding character encoding is foundational for programming, networking, databases, and web development — wherever text data must be stored, compared, sorted, or transmitted reliably.