An overview of testing

Big Idea

Testing is how you find out whether your program actually does what you intended — not just whether it runs without crashing. A program can run perfectly and still be wrong. Testing is the discipline of being honest about that gap: between what you built and what you meant to build, between what works in your head and what works for someone else. Different types of testing reveal different kinds of problems, which is why good testing uses more than one approach.

This article is the starting point for the Testing section of the knowledge base. It gives you a map of the different types of testing you will use in this course and explains what each one is for. Later articles in this section go deeper on specific techniques — writing test cases, conducting user testing, collecting playtest evidence, and writing honest evaluations.

Level | What it covers | When to read it
Basic idea | What testing is and why running the program is not enough; the five types of testing used in this course | Right now.
At a deeper level | How the types of testing fit together; what each type finds that the others miss; how testing connects to MYP Criterion D | Once you have read the basic descriptions and have a project in progress.
At the deepest level | How professional software teams approach testing; the limits of testing; what testing cannot prove | When you want to understand where the ideas come from and where they lead.

1   Why Running the Program Is Not Enough

Basic Idea

When a program runs without crashing, it is easy to feel like it works. It does not mean the program is correct. It means the program did not crash on that particular input, on that particular day, when you ran it.

There are three distinct things a program can be:

  1. Running. The program starts and does not crash.
  2. Correct. The program produces the right output for every valid input.
  3. Usable. A real person can use the program to accomplish what they need without confusion or frustration.

These are three separate properties. A program can have any combination of them. The goal is all three. Running the program once with your own comfortable inputs only checks the first. Testing is what checks the second and third.

At a Deeper Level

Consider a grade calculator that takes a list of scores and returns the average. You run it with [80, 90, 70] and get 80.0. It works. But:

  • What happens if the list is empty? (Correctness under edge cases.)
  • What happens if a score is entered as "85" instead of 85? (Correctness with wrong input types.)
  • What happens if a score is -10 or 150? (Correctness with out-of-range values.)
  • Would a classmate looking at the output understand what the number means? (Usability.)
  • If you add a new feature next week, does the average calculation still work? (Correctness after change.)

Each of these questions reveals a different kind of gap between "it ran" and "it works." Each requires a different type of testing to answer.
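
The questions above can be turned into concrete checks. The following is a minimal sketch, not the course's reference solution: the function names average and describe are hypothetical, and the decision to reject empty lists, non-numeric scores, and scores outside 0-100 with a ValueError is an assumption made for illustration.

```python
def average(scores):
    """Return the mean of a list of numeric scores.

    Assumed rules for this sketch: an empty list, a non-numeric score,
    or a score outside 0-100 raises ValueError.
    """
    if len(scores) == 0:
        raise ValueError("no scores given")
    for score in scores:
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            raise ValueError(f"score is not a number: {score!r}")
        if score < 0 or score > 100:
            raise ValueError(f"score out of range: {score!r}")
    return sum(scores) / len(scores)


def describe(scores):
    """Run average() and report either the result or the reason it was rejected."""
    try:
        return f"average = {average(scores)}"
    except ValueError as error:
        return f"rejected: {error}"


print(describe([80, 90, 70]))   # the comfortable input
print(describe([]))             # empty list
print(describe([80, "85"]))     # score entered as a string
print(describe([80, -10]))      # score below the valid range
print(describe([80, 150]))      # score above the valid range
```

Only the first call is the kind of input a developer tends to try on their own. The other four are the ones that show whether the program is correct, not just running.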

Testing is an act of honesty

The purpose of testing is not to show that your program works. It is to find out whether it works — including in cases where it might not. A test that only uses inputs you know will succeed is not a test; it is a demonstration. A genuine test tries to break the program. If you cannot break it, you have more confidence it is correct. If you can break it, you have found something important.

At the Deepest Level

The computer scientist Edsger Dijkstra made a precise and important observation: testing can show the presence of bugs, but never their absence. No matter how many tests pass, you cannot prove from testing alone that the program is correct for every possible input, because the number of possible inputs is usually infinite, or at least far too large to try them all, and you can only test finitely many of them.

This does not make testing pointless. It means that testing gives you increasing confidence, not certainty. The more diverse and deliberate your tests — covering typical cases, edge cases, and cases designed to stress the program's logic — the higher your justified confidence. Formal proof of correctness exists as a discipline (called formal verification), but for the programs in this course, well-designed testing is both sufficient and professionally standard.
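
A small illustration of why passing tests are not proof. The function classify_score_buggy below is hypothetical and contains a deliberate mistake (> where >= was intended); the cutoffs of 50 and 70 are assumed for the sketch. Three tests with comfortable mid-range inputs all pass, and a fourth, chosen deliberately, does not.

```python
def classify_score_buggy(score):
    # Deliberate bug for illustration: > is used where >= was intended
    if score > 70:
        return "Merit"
    elif score > 50:
        return "Pass"
    else:
        return "Fail"


# Three tests with mid-range inputs: every one passes
comfortable_cases = [(85, "Merit"), (65, "Pass"), (30, "Fail")]
for score, expected in comfortable_cases:
    assert classify_score_buggy(score) == expected

# One deliberately chosen input exposes the bug
print(classify_score_buggy(70))   # intended "Merit", actually prints "Pass"
```

The three passing tests were true evidence, but only about the inputs they used. Confidence grows when the tests are chosen to find failures, not to avoid them.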

2   The Five Types of Testing in This Course

Basic Idea

This course uses five types of testing. Each one answers a different question about your program.

Functional Testing

Does the program do what it is supposed to do? You run the program with a specific input and check whether the output matches what you expected. This is the most basic type of testing and the starting point for everything else.

Boundary Testing

Does the program handle the edges correctly? You test with values at the limits of what the program is designed to accept — the minimum, the maximum, zero, an empty list, the exact cutoff value in a condition. Bugs hide at boundaries more often than anywhere else.

User Testing

Can a real person use the program without your help? You give the program to someone who did not build it and watch them use it. You are looking for confusion, unexpected behaviour, and moments where they cannot figure out what to do. Their experience is evidence, not opinion.

Playtest Testing

Used specifically for games and interactive programs. A playtester plays the program without coaching from the developer. You observe what they do, what confuses them, what they try that you did not expect, and whether the experience is engaging. Playtesting is a specialised form of user testing with additional focus on feel, flow, and fun.

Regression Testing

Did a change you made break something that used to work? Whenever you add a feature, fix a bug, or refactor your code, you re-run your previous tests to check that nothing broke. A regression is a working feature that stops working after a change. Regression testing catches regressions before they reach a user.

At a Deeper Level

Each type of testing has a distinct focus, produces distinct evidence, and finds bugs the other types tend to miss:

Type | Question it answers | What it finds | What it misses
Functional | Does it produce the right output? | Logic errors, incorrect calculations, wrong output format | Edge cases, usability problems, regressions
Boundary | Does it handle the limits correctly? | Off-by-one errors, missing validation, crashes on empty input | Usability problems, regressions after changes
User | Can someone else use it? | Confusing prompts, unclear output, unexpected user behaviour | Correctness bugs the tester did not trigger, boundary failures
Playtest | Is the experience engaging and understandable? | Pacing problems, unclear rules, features that are fun or not fun | Correctness bugs, boundary cases, regressions
Regression | Did my changes break anything? | Features that stopped working after a change | New bugs in new code, usability problems

No single type of testing is sufficient on its own. A program that passes functional testing can still fail boundary testing. A program that passes all technical testing can still fail user testing because real users interact with it in ways the developer did not anticipate. A program that passed all tests last week can fail regression testing today because of a change made yesterday.

At the Deepest Level

In professional software development, these categories map loosely onto a larger taxonomy. Functional and boundary testing are forms of black-box testing — you test the program's behaviour without looking at the internal code, only at inputs and outputs. This contrasts with white-box testing, where you design tests based on knowledge of the code's internal structure — making sure every branch of every conditional is exercised, for example.

User testing in professional contexts is part of usability testing and UX research — a discipline with its own methodologies for recruiting representative users, designing observation protocols, and analysing qualitative data. Playtesting is similarly established in the game industry, often conducted in dedicated labs with video recording and eye-tracking equipment. The techniques in this course are simplified versions of real professional practices.

3   Functional Testing in Practice

Basic Idea

A functional test has three parts: an input, an expected output, and the actual output. You run the program with the input, compare the actual output to the expected output, and record whether they match.

Writing the expected output before running the test is important. If you run first and then decide what the output means, you are not testing — you are rationalising. The expected output must come from your understanding of what the program should do, not from what the program happened to produce.

# Function under test
def classify_score(score):
    if score >= 70:
        return "Merit"
    elif score >= 50:
        return "Pass"
    else:
        return "Fail"

# Functional tests
print(classify_score(85))   # Expected: Merit
print(classify_score(70))   # Expected: Merit
print(classify_score(65))   # Expected: Pass
print(classify_score(50))   # Expected: Pass
print(classify_score(49))   # Expected: Fail
print(classify_score(0))    # Expected: Fail

Each line is one test. The comment states the expected result. After running, you compare. If any actual result differs from the expected result, you have found a bug.
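
Comparing by eye works for six tests but becomes error-prone as the list grows. One possible refinement, sketched below, stores each test as an (input, expected output) pair and lets the program do the comparison. The names test_cases and run_tests are hypothetical; classify_score is the function from above.

```python
def classify_score(score):
    if score >= 70:
        return "Merit"
    elif score >= 50:
        return "Pass"
    else:
        return "Fail"


# Each test case is written down as (input, expected output) before running
test_cases = [
    (85, "Merit"),
    (70, "Merit"),
    (65, "Pass"),
    (50, "Pass"),
    (49, "Fail"),
    (0, "Fail"),
]


def run_tests(function, cases):
    """Return one (input, expected, actual, passed) row per test case."""
    results = []
    for given, expected in cases:
        actual = function(given)
        results.append((given, expected, actual, actual == expected))
    return results


for given, expected, actual, passed in run_tests(classify_score, test_cases):
    status = "PASS" if passed else "FAIL"
    print(f"input={given:<4} expected={expected:<6} actual={actual:<6} {status}")
```

Because the expected values sit in the list before the function is ever called, this layout also makes it harder to decide what the output "should" be after seeing it.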

At a Deeper Level

Notice that the tests above include values at the exact boundary of each condition: 70 (the cutoff between Pass and Merit), 50 (the cutoff between Fail and Pass), and 49 (just below the Pass cutoff). These are the values most likely to reveal off-by-one errors, which is why they count as both functional tests and boundary tests.

In practice, functional tests and boundary tests are written together, because the boundary values are just specific functional tests chosen for their location relative to the conditions in the code. The distinction matters more conceptually — it reminds you to think about boundaries explicitly — than as a practical separation.

Write tests before you are sure your code is correct

Tests are most useful when you run them while you still expect to find failures. If you wait until you are confident the program is correct, you will unconsciously avoid inputs that might reveal problems. Write your test cases as soon as you understand what the function is supposed to do: before you have written the function, or at least before you have debugged it. Running them against a fresh function reveals bugs while your critical mindset is still active.

At the Deepest Level

The test structure above — input, expected output, actual output — is the structure of every unit test in frameworks like pytest. In pytest, the tests are written as functions, and the comparison is made with an assert statement:

def test_classify_score():
    assert classify_score(85) == "Merit"
    assert classify_score(70) == "Merit"
    assert classify_score(65) == "Pass"
    assert classify_score(50) == "Pass"
    assert classify_score(49) == "Fail"

When pytest runs, it calls every function whose name starts with test_, executes the assertions, and reports which passed and which failed. This automation means a test suite of hundreds of cases can be run in seconds. The manual approach in this course produces exactly the same thinking — the framework just runs it faster and keeps a record automatically.
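
One design choice worth knowing about: when an assert fails, the rest of that test function does not run, so a single large test function can hide later failures behind the first one. A common alternative, sketched here with hypothetical test names, is to split the checks into several small functions. The file name test_classify.py is also an assumption; pytest collects files named test_*.py or *_test.py.

```python
# Hypothetical file name: test_classify.py

def classify_score(score):
    if score >= 70:
        return "Merit"
    elif score >= 50:
        return "Pass"
    else:
        return "Fail"


def test_merit_at_and_above_cutoff():
    assert classify_score(85) == "Merit"
    assert classify_score(70) == "Merit"


def test_pass_range():
    assert classify_score(65) == "Pass"
    assert classify_score(50) == "Pass"


def test_fail_below_cutoff():
    assert classify_score(49) == "Fail"
    assert classify_score(0) == "Fail"
```

With this layout, a bug at the 70 cutoff would be reported against test_merit_at_and_above_cutoff while the other two tests still run and report their own results.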

4   Boundary Testing in Practice

Basic Idea

Boundary testing focuses on the values where behaviour is most likely to change — or break. These are not random inputs. They are chosen deliberately because of where they sit relative to conditions, list lengths, and valid ranges.

For any condition or range in your program, test three values:

  • One below the boundary. For a lower limit, this is just outside the accepted range.
  • Exactly at the boundary. The precise cutoff value.
  • One above the boundary. For a lower limit, this is just inside the accepted range. (For an upper limit, the two sides swap.)

For any collection (list, string, dictionary), always test the empty case. Empty input is the boundary that is most often forgotten and most often causes crashes.
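
The three-value rule can be written as a tiny helper. This is a sketch under stated assumptions: the helper name around is hypothetical, it assumes integer inputs with a step of 1, and classify_score with cutoffs at 50 and 70 is borrowed from the functional testing section.

```python
def around(boundary, step=1):
    """Return the three values to test for one boundary: below, at, above."""
    return [boundary - step, boundary, boundary + step]


def classify_score(score):
    if score >= 70:
        return "Merit"
    elif score >= 50:
        return "Pass"
    else:
        return "Fail"


for cutoff in (50, 70):
    for score in around(cutoff):
        print(score, classify_score(score))
```

For the 50 cutoff, the result should change between 49 and 50 and stay the same between 50 and 51. If the change happens one step late, the condition is probably using > where >= was intended.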

At a Deeper Level

Here are the boundary cases for common situations in student programs:

Situation | Boundary values to test | Why they matter
Condition score >= 60 | 59, 60, 61 | Tests whether the boundary is inclusive (>=) or exclusive (>); catches off-by-one in the condition
Loop using range(n) | n = 0, n = 1, typical n | n = 0 means the loop body never runs; n = 1 means it runs exactly once; both are common crash or logic-error sites
List index access | Index 0, index len-1, index len | First and last items are frequently off by one; accessing index len raises IndexError
List passed to a function | Empty list [], one-item list, typical list | Empty list causes division by zero in average calculations; crashes functions that assume at least one item exists
User input that should be a number | Zero, negative number, very large number, non-numeric string | Division by zero; wrong sign handling; overflow; ValueError from conversion
String that should contain content | Empty string "", single character, string with only spaces | Empty string causes crashes in functions that assume content exists; whitespace-only strings behave unexpectedly in comparisons

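
Several of the failures named in the table can be reproduced in a few lines. The sketch below uses a hypothetical helper, raises, to check that a call fails with the expected error; the unguarded average function and the items list are made up for illustration.

```python
def raises(expected_error, function, *args):
    """Return True if calling function(*args) raises expected_error."""
    try:
        function(*args)
    except expected_error:
        return True
    return False


def average(scores):
    return sum(scores) / len(scores)   # no guard for the empty list


items = ["a", "b", "c"]

print(raises(ZeroDivisionError, average, []))              # empty list
print(raises(IndexError, lambda: items[len(items)]))       # index len is one past the end
print(raises(ValueError, int, "abc"))                      # non-numeric string
print(list(range(0)))                                      # n = 0: the loop body never runs
print("   ".strip() == "")                                 # whitespace-only string is empty once stripped
```

Each line is a boundary test in miniature: it picks the one input most likely to break the code and checks what actually happens.
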
At the Deepest Level

Boundary testing is grounded in the observation that the density of bugs is much higher near boundaries than in the middle of a valid range. A grade classifier that works correctly for scores of 55, 65, and 75 almost certainly works correctly for 58, 63, and 72 as well — because the same code path handles all of them. But the code path that handles 50 might be different from the code path that handles 49, because there is a condition boundary between them. The boundary is where the code makes a decision, and decisions are where mistakes are made.

This is formalised in two related testing concepts: equivalence partitioning and boundary value analysis. Divide the input space into regions (partitions) where the program should behave identically, then test one representative value from each partition plus the values at the boundaries between adjacent partitions. If a function behaves correctly for one value in a partition, it is likely correct for all values in that partition, so testing the representatives and the boundaries gives good coverage with few test cases.
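
The partitioning idea above can be sketched for the grade classifier. The assumptions here are that valid scores are integers from 0 to 100 and that the cutoffs are 50 and 70, as in the classify_score example; the names partitions and partition_cases are hypothetical.

```python
def classify_score(score):
    if score >= 70:
        return "Merit"
    elif score >= 50:
        return "Pass"
    else:
        return "Fail"


# Each partition: (lowest value, highest value, expected result)
partitions = [
    (0, 49, "Fail"),
    (50, 69, "Pass"),
    (70, 100, "Merit"),
]


def partition_cases(partitions):
    """Build (input, expected) pairs: both edges and one middle value per partition."""
    cases = []
    for low, high, expected in partitions:
        middle = (low + high) // 2
        for value in (low, middle, high):
            cases.append((value, expected))
    return cases


cases = partition_cases(partitions)
failures = [(value, expected, classify_score(value))
            for value, expected in cases
            if classify_score(value) != expected]
print(len(cases), "cases,", len(failures), "failures")
```

Nine test cases stand in for all 101 possible scores: one representative and two edges from each of the three partitions.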

5   User Testing in Practice

Basic Idea

User testing means giving your program to someone who did not build it and observing what happens. The key word is observing — not explaining, not helping, not correcting. You watch.

What you are looking for:

  • Moments where they stop and look confused
  • Things they try that you did not expect
  • Inputs they enter that you did not design for
  • Features they look for that do not exist
  • Things they understand immediately without any help

All of these are evidence. You record what you observe — not what you think of their choices, not what they should have done. What they actually did, and what happened.

At a Deeper Level

The hardest part of user testing is staying quiet. When a user does something unexpected — tries to enter a name where you expected a number, clicks something that does not respond, stares at an output message that seems obvious to you — the instinct is to explain. Do not explain. The confusion is data. If your program requires explanation to use, that is a finding.

After the session, ask the user a small number of focused questions. "Was there a moment where you were not sure what to do?" is more useful than "Did you like it?" Good questions produce specific observations; vague questions produce vague answers that are hard to act on.

Record what you observed in writing, in enough detail that you could act on it. "User was confused" is not a useful record. "User typed their name when the prompt said 'Enter score' — stopped and looked at the screen for about ten seconds, then typed a number" is a useful record. It identifies a specific prompt that may need to be made clearer.

One tester is not enough; two or three is a minimum

A single user test tells you what one person did on one occasion. Some observations will be specific to that person's habits or prior experience. When two or three testers independently have the same confusion at the same point in your program, that confusion is reliable evidence of a real usability problem worth fixing. Observations that only one tester experienced may or may not represent a real problem — note them but treat them as lower priority.

At the Deepest Level

The principle behind staying quiet during user testing is that you are trying to observe behaviour in its natural state. As soon as you explain, help, or correct, you have changed the thing you are studying. The user is no longer interacting with your program; they are interacting with your program plus your explanation. This is called the observer effect in research methodology — the act of observation changes the phenomenon being observed.

Professional usability researchers use a technique called think-aloud protocol: they ask users to narrate what they are thinking as they use the product, without the researcher responding. This gives a continuous stream of qualitative data about the user's mental model, their expectations, and where those expectations are violated. For Grade 9 projects, a simplified version — asking the user to say out loud what they are trying to do, without the developer responding — produces similar insight without formal training.

6   Playtest Testing in Practice

Basic Idea

Playtesting applies to games and interactive programs. The core method is the same as user testing — someone who did not build the program uses it while you observe — but the questions you are asking are different.

In user testing, you are asking: can they use it? In playtesting, you are asking: is the experience what you intended?

  • Did they understand the rules without reading documentation?
  • Did they find the difficulty appropriate — not too easy, not too frustrating?
  • Did they want to play again, or did they stop as soon as they could?
  • Did they discover interactions or strategies you did not plan for?
  • Were there moments of genuine engagement or delight?
  • Were there moments of confusion, boredom, or frustration?

Playtest evidence is qualitative — it captures the texture of the experience, not just whether the program is correct.

At a Deeper Level

The most important thing to record during a playtest is not what the tester said — it is what they did. Did they read the instructions? Did they restart immediately after losing? Did they try the same strategy twice in a row? Did they laugh? Did they sigh? These behavioural observations are more reliable evidence than verbal responses, because what people say about an experience is often different from how they actually respond to it.

After the playtest, a short debrief is valuable: "What was the most confusing moment?" "Was there a rule you had to re-read?" "Did you feel like you understood why you won or lost?" These questions invite specific recollection rather than general evaluation. "It was fun" is not useful evidence. "I didn't understand why I lost at the end — I thought I had more time" is evidence you can act on.

Playtest evidence feeds directly into your evaluation. When you write about what worked and what did not work in your program, playtest observations are some of the most credible evidence you can cite — because they come from someone other than you, under real conditions, without coaching.

Record observations during the session, not after

Take brief notes while the playtest is happening. Do not rely on memory. You will remember the moments that confirmed what you already thought. You will forget the moments that surprised you — and those are the most valuable ones. A few words per observation is enough: "confused by score display," "played three times in a row," "tried clicking before reading prompt."

At the Deepest Level

Playtesting as a formal discipline emerged from the board game and video game industries, where the gap between a designer's intention and a player's experience can be enormous. A mechanic that feels elegant to the designer — who has deep familiarity with the system — can feel arbitrary or confusing to a player encountering it for the first time. Playtesting is the systematic process of discovering and closing that gap before release.

The distinction between intended experience and actual experience is the central design question that playtesting answers. A related concept is emergent gameplay — behaviours and strategies that arise from the rules of a game that the designer did not explicitly plan for. Sometimes emergence is delightful (players invent creative strategies), sometimes it is a problem (a strategy makes the game trivially easy). Playtesting is how you find out which kind of emergence your game has.

7   Regression Testing in Practice

Basic Idea

Every time you make a change to your program — adding a feature, fixing a bug, reorganising code — you risk breaking something that used to work. A regression is a working feature that stops working after a change. Regression testing is the practice of re-running your existing tests after every change to check that nothing broke.

This means keeping your test cases. If you delete or forget them, you cannot run regression tests. The test cases you write in functional and boundary testing are also your regression tests. Write them so you can run them again.

At a Deeper Level

Regressions are particularly common in these situations:

  • Fixing one bug introduces another. You change a condition to fix a boundary case, and the change accidentally affects a different part of the logic.
  • Adding a feature changes shared data. A new feature modifies a list or dictionary that other parts of the program depend on. The new feature works, but the old features now see unexpected data.
  • Refactoring breaks an assumption. You reorganise code for clarity — moving lines, renaming variables, extracting a function — and inadvertently change the order of operations or the scope of a variable.
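
The first situation in the list can be shown with two versions of the grade classifier. Both versions are hypothetical and contain deliberate mistakes: version 1 gets the 50 cutoff wrong, and version 2 fixes that but accidentally changes the 70 cutoff. The saved test cases are the ones from the functional testing section.

```python
# Test cases saved from earlier functional and boundary testing
SAVED_CASES = [
    (85, "Merit"),
    (70, "Merit"),
    (65, "Pass"),
    (50, "Pass"),
    (49, "Fail"),
    (0, "Fail"),
]


def classify_v1(score):
    # Version 1: 50 is wrongly classified as "Fail" (> used instead of >=)
    if score >= 70:
        return "Merit"
    elif score > 50:
        return "Pass"
    else:
        return "Fail"


def classify_v2(score):
    # Version 2: the 50 cutoff is fixed, but the 70 cutoff was edited by mistake
    if score > 70:
        return "Merit"
    elif score >= 50:
        return "Pass"
    else:
        return "Fail"


def failing_cases(function):
    """Return (input, expected, actual) for every saved case that fails."""
    failures = []
    for given, expected in SAVED_CASES:
        actual = function(given)
        if actual != expected:
            failures.append((given, expected, actual))
    return failures


print("v1 failures:", failing_cases(classify_v1))
print("v2 failures:", failing_cases(classify_v2))
```

Checking only the input that was just fixed (50) would make version 2 look finished. Re-running every saved case is what reveals that 70 now returns "Pass".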

The discipline of running regression tests after every change is what makes it safe to keep working on a program over time. Without it, a growing program becomes increasingly fragile — you are afraid to change anything because you might break something you cannot find.

Commit before you change, test after you change

A practical habit: before making any significant change to your program, commit the current working version to Git. Then make the change and run your tests. If regression tests fail, you can compare against the previous commit to understand exactly what changed. If you need to, you can revert to the last known-good version with a single Git command. This is one of the most important practical benefits of using version control throughout development.

At the Deepest Level

In professional software engineering, regression testing is almost always automated. A test suite runs automatically every time code is committed to the shared repository, a practice called continuous integration (CI). If any test fails, the team is notified immediately and the failing code is not merged into the main codebase until the regression is fixed. This keeps the cost of a regression low: failures are caught within minutes of being introduced, usually by the person who introduced them, while the change is still fresh in their mind.

The alternative — manual regression testing performed periodically — is far more expensive. By the time a regression is discovered manually, it may be days or weeks old, introduced by any of dozens of changes, and require significant investigation to locate and fix. Automation does not change what regression testing finds; it changes how quickly it finds it.

8   Testing and MYP Criterion D

Basic Idea

In this course, testing is not separate from assessment — it is part of how you demonstrate that your program meets your design criteria. MYP Criterion D (Evaluating) asks you to test your program, reflect on its success, and identify what you would improve. The evidence you produce during testing is what makes that reflection credible.

A Criterion D response that says "my program works well" without specific test evidence is not a strong response. A response that says "I tested with five inputs including boundary values, observed two users, and found that the score display confused both of them before I revised the prompt" is credible because it is specific and grounded in evidence you actually collected.

At a Deeper Level

Criterion D has two distinct components, and different types of testing serve each one:

Criterion D component | What it requires | Types of testing that produce this evidence
Testing the product | Evidence that the program was tested against the design criteria you set in Criterion B, using specific inputs and expected outputs | Functional testing; boundary testing; regression testing
Evaluating the product | Honest reflection on what works, what does not, and what you would do differently, supported by evidence from real users, not just your own judgment | User testing; playtest testing; functional testing where it revealed failures

The word "honest" in evaluation matters. A weak evaluation lists only what worked. A strong evaluation acknowledges what did not work, explains why, and proposes a specific improvement. The evidence from user testing and playtesting is what makes an honest evaluation possible — if you only tested by yourself, you have no basis for knowing whether a real user could use the program.

Testing evidence is only useful if you record it as you go

You cannot write a credible Criterion D evaluation from memory. Keep a testing log as you work — a simple document or a section of your project notes where you record each test you ran, what input you used, what you expected, and what actually happened. User testing observations go in the same place. When you sit down to write the evaluation, the evidence is already there.
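
A testing log does not need to be elaborate. As one possible format, the sketch below records each test as a row of CSV text using Python's standard csv module. The column names, the test names, and the file name testing_log.csv are assumptions, not a required course format; classify_score is the function from the functional testing section.

```python
import csv
import io


def classify_score(score):
    if score >= 70:
        return "Merit"
    elif score >= 50:
        return "Pass"
    else:
        return "Fail"


LOG_FIELDS = ["test", "input", "expected", "actual", "result"]


def log_entry(name, given, expected):
    """Run one test and return a row for the testing log."""
    actual = classify_score(given)
    return {
        "test": name,
        "input": given,
        "expected": expected,
        "actual": actual,
        "result": "PASS" if actual == expected else "FAIL",
    }


entries = [
    log_entry("typical merit", 85, "Merit"),
    log_entry("merit cutoff", 70, "Merit"),
    log_entry("just below pass", 49, "Fail"),
]

# An in-memory file keeps the sketch self-contained. To save a real log,
# use open("testing_log.csv", "w", newline="") instead.
log_file = io.StringIO()
writer = csv.DictWriter(log_file, fieldnames=LOG_FIELDS)
writer.writeheader()
writer.writerows(entries)
print(log_file.getvalue())
```

User testing observations do not fit neatly into input and expected-output columns, so a plain written section alongside the table is a reasonable place for them.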

At the Deepest Level

The IB MYP Design cycle asks students to evaluate against criteria they set themselves in Criterion B — not against a generic standard. This means the quality of your Criterion D evaluation depends in part on the quality of your Criterion B criteria. Criteria that are vague ("the program will be good") cannot be tested. Criteria that are specific and measurable ("the program will return the correct classification for any integer score between 0 and 100, including boundary values") can be directly tested and the results directly reported.

This is one of the reasons that learning to write specific, testable criteria is a skill developed deliberately in this course. A criterion you cannot test is a criterion you cannot honestly evaluate. The connection between Criterion B and Criterion D is not administrative — it reflects a genuine principle in engineering and design: you must define what success looks like before you can determine whether you achieved it.
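
A specific criterion translates almost directly into a test. The example criterion above limits the input to integer scores from 0 to 100, which is a small enough set to check every value. In this sketch the expected table is written out from the criterion itself, assuming the cutoffs of 50 and 70 used in the classify_score example; the names expected and mismatches are hypothetical.

```python
def classify_score(score):
    if score >= 70:
        return "Merit"
    elif score >= 50:
        return "Pass"
    else:
        return "Fail"


# Expected results written from the criterion, not from the code
expected = {}
for score in range(0, 50):
    expected[score] = "Fail"
for score in range(50, 70):
    expected[score] = "Pass"
for score in range(70, 101):
    expected[score] = "Merit"

# Compare the function against the criterion for every valid score
mismatches = [score for score in range(0, 101)
              if classify_score(score) != expected[score]]
print(len(expected), "scores checked,", len(mismatches), "mismatches")
```

A vague criterion such as "the program will be good" offers nothing to put in the expected table, which is the practical meaning of "cannot be tested".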

Check Your Understanding
  1. A program runs without crashing. Does this mean it is correct? Does it mean it is usable? Explain the difference between the three properties a program can have: running, correct, and usable.
  2. Name the five types of testing used in this course. For each one, write one sentence describing the question it answers.
  3. A student writes a function that converts a Celsius temperature to Fahrenheit. List at least five specific test cases they should run, including at least two boundary values. For each test case, state the input and the expected output.
  4. Explain why the expected output for a test must be written before the test is run, not after. What goes wrong when you decide what the expected output is after seeing the actual output?
  5. A student conducts a user test and their tester gets confused at the same point twice. The student wants to explain what the tester should do. Why should they not? What should they do instead?
  6. Describe the difference between user testing and playtest testing. What does playtesting ask that user testing does not?
  7. A student fixes a bug in their grade calculator and then submits the program. They do not re-run their previous tests. The next day, a classmate points out that the average calculation is now wrong. What type of testing would have caught this? What habit would have prevented it?
  8. In your own words, explain the connection between testing and MYP Criterion D. Why is "my program works well" not sufficient as an evaluation, and what would make it stronger?