Problem Set 16: Markov Chains

How to Approach Problem Sets

When you start a new problem set, your first instinct might be to open your computer and begin typing code right away. While this can feel productive, it often leads to frustration when things don't work as expected. Instead, take a few minutes to slow down and plan.

Here are some helpful strategies:

Understand the problem clearly
- Read the instructions carefully — twice if needed.
- Ask yourself: What exactly is being asked?
Break the problem into smaller steps
- Think about the smallest possible actions the computer will need to perform.
- For example: If the task is to find the first recurring letter in a word, what steps must happen first?
Try solving it on paper first
- Write out your steps in plain language (pseudocode).
- Test your steps with a simple example before touching the keyboard.
Translate your steps into code
- Start small — write only a few lines at a time and test often.
- Don't worry about perfection at first; get a working version, then improve it.
Check your solution
- Run it with different examples, including edge cases.
- Ask: Does this solve the problem in all situations?

Read the question carefully (twice).
Break the task into the smallest steps.
Sketch or write pseudocode before coding.
Start small — test as you go.
Check your solution with different cases.

Background: What is a Markov Chain?

A Markov chain is a mathematical model that describes a system that moves between different "states." Each next state depends only on the current state, not on the full history of previous states. This is called the Markov property.

For text generation:

The states are words.
The transitions are probabilities of one word following another.
By randomly walking through these transitions, we can generate new text that “sounds like” the original input.

Example:
Training text:

the cat sat on the mat
the cat ate a rat

From this we build transitions:

"the" → {"cat": 2}
"cat" → {"sat": 1, "ate": 1}
"sat" → {"on": 1}
"on" → {"the": 1}

If we start with "the", we might generate:
the cat ate a rat
or
the cat sat on the mat

Part A: Building the Transition Table

Task A1: Write a function that:

Takes a string of text.
Splits it into words.
Builds a dictionary of transitions where each key is a word and each value is a list of possible next words.

Test Example:
Input: "the cat sat on the mat"
Expected Output (dictionary):

{
 "the": ["cat", "mat"],
 "cat": ["sat"],
 "sat": ["on"],
 "on": ["the"]
}

Part B: Generating Random Text

Task B1: Write a function that:

Starts with a random word.
Picks the next word by choosing randomly from the list of possible transitions.
Repeats this process for N words.

Test Example:
Input: N = 6
Possible Output:
the cat sat on the mat

Part C: Improving the Generator

Task C1: Modify your generator to allow choosing a starting word (e.g., always start with "the").

Task C2: Extend the model to use bigrams (pairs of words as states) instead of single words.
Example: "the cat" → {"sat": 1, "ate": 1}

Challenge Extension

Add punctuation handling.
Use a larger sample text (e.g., a paragraph from a book).
Generate at least 50 words of new text.
Compare the “realism” of 1-word states vs. 2-word states.

Reflection Questions

Why is a Markov chain useful for text generation?
How does the size of the training text affect the realism of the generated text?
How does using bigrams (or trigrams) change the quality of the text compared to single words?

Mark Scheme

A1 (Transition table): 4 marks
- Correctly splits text into words (1)
- Creates dictionary with keys as words (1)
- Stores next words as lists (1)
- Works on multiple lines of input (1)
B1 (Generator): 4 marks
- Starts with random/selected word (1)
- Correctly finds next word (1)
- Loops for N steps (1)
- Produces sensible output (1)
C1/C2 (Improvements): up to 4 marks for creativity, correctness, and testing.
Reflection: 3 marks for thoughtful answers with examples.