A4.3.6 Describe how an agent learns to make decisions by interacting with its environment in reinforcement learning. (HL only)

• The principle of cumulative reward and the foundational concepts of agent–environment interaction, encompassing actions, states, rewards and policies

• The exploration versus exploitation trade-off as a core concept in reinforcement learning

Reinforcement Learning: How Agents Learn Through Interaction

The Big Idea

Reinforcement Learning (RL) is a branch of machine learning where an agent learns to make sequences of decisions by interacting with an environment in order to maximize a cumulative reward. Unlike supervised learning, where the model learns from labelled examples, RL agents learn through trial and error, using feedback from the environment to guide future actions.

RL problems are usually formalized as a Markov Decision Process (MDP): decisions are made over time under uncertainty, and the next state and reward depend only on the current state and the action taken. RL is used for sequential decision-making problems such as robotics, game playing, recommendation systems, and autonomous control systems.

Core Concepts in Reinforcement Learning

1. Agent–Environment Interaction

In reinforcement learning, learning is defined by the agent–environment loop. At each discrete time step $t$ :

The agent observes the current state $s_t \in \mathcal{S}$
Based on a policy $\pi(a|s)$ , the agent chooses an action $a_t \in \mathcal{A}$
The environment returns a new state $s_{t+1}$ and a reward $r_t \in \mathbb{R}$

This cycle continues, and the agent's objective is to learn a policy that maximizes expected cumulative reward over time.
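
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a standard API: `CorridorEnv`, `random_policy`, and `run_episode` are hypothetical names, and the environment (a five-cell corridor with the goal at the right end, a reward of -1 per step and +10 at the goal) is invented for the example.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: cells 0..4 on a line, goal at cell 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        """Start a new episode and return the initial state s_0."""
        self.state = 0
        return self.state

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (s_next, reward, done)."""
        # The agent cannot leave the corridor, so the position is clamped.
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 10.0 if done else -1.0
        return self.state, reward, done


def random_policy(state, rng):
    """A policy that ignores the state and picks left or right at random."""
    return rng.choice([-1, +1])


def run_episode(env, policy, rng, max_steps=100):
    """Run one episode of the observe -> act -> receive reward loop."""
    state = env.reset()
    trajectory, total_reward = [], 0.0
    for _ in range(max_steps):
        action = policy(state, rng)                    # agent chooses a_t
        next_state, reward, done = env.step(action)    # environment responds
        trajectory.append((state, action, reward, next_state))
        total_reward += reward
        state = next_state
        if done:
            break
    return trajectory, total_reward


if __name__ == "__main__":
    trajectory, total = run_episode(CorridorEnv(), random_policy, random.Random(0))
    print(len(trajectory), "steps, total reward", total)
```

A learning agent would replace `random_policy` with a policy that is updated from the recorded `(state, action, reward, next_state)` tuples.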

Formal Components:

State space $\mathcal{S}$ : The set of all possible configurations of the environment.
Action space $\mathcal{A}$ : The set of actions the agent can take.
Reward function $R(s, a)$ : The feedback signal received after taking action $a$ in state $s$ .
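Policy $\pi(a|s)$ : The agent's decision rule. A deterministic policy maps each state to a single action; a stochastic policy gives the probability of choosing action $a$ in state $s$ .
Value function $V(s)$ : Estimates the expected return (cumulative reward) from state $s$ when following policy $\pi$ .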
Q-function $Q(s, a)$ : Estimates the expected return from taking action $a$ in state $s$ and following policy $\pi$ thereafter.
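
Written out in the notation of these notes, the two value functions can be stated as follows. This is a sketch: the superscript $\pi$ and the expectation symbol $\mathbb{E}_{\pi}$ are added here for clarity, and $G_t$ is the return defined in section 2.

$V^{\pi}(s) = \mathbb{E}_{\pi}[\, G_t \mid s_t = s \,]$
$Q^{\pi}(s, a) = \mathbb{E}_{\pi}[\, G_t \mid s_t = s,\ a_t = a \,]$

The two are linked by averaging over the policy's action choices:

$V^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a|s)\, Q^{\pi}(s, a)$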

2. Cumulative Reward and Return

The goal of an RL agent is to maximize total reward over time, not just immediate reward.

The return $G_t$ is the discounted sum of rewards from time $t$ onward:
$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$
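Here $\gamma \in [0,1)$ is the discount factor, which controls the agent's preference for short-term versus long-term rewards: a reward received $k$ steps in the future is weighted by $\gamma^k$ .

A small $\gamma$ (close to 0) makes the agent favor immediate reward. A large $\gamma$ (close to 1) makes future rewards count almost as much as immediate ones, which encourages long-term planning.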
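
The effect of the discount factor is easiest to see with numbers. The sketch below computes the return for a finite reward sequence; the function name `discounted_return` and the sample rewards (three steps at -1 followed by +10) are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute G_0 = sum over k of gamma**k * rewards[k] for a finite episode."""
    g = 0.0
    # Work backwards so that each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [-1.0, -1.0, -1.0, 10.0]   # three costly steps, then the goal

print(discounted_return(rewards, gamma=0.0))   # -1.0: only the first reward counts
print(discounted_return(rewards, gamma=0.5))   # -0.5: the goal is heavily discounted
print(discounted_return(rewards, gamma=0.9))   # about 4.58: the goal dominates
```

With a small discount factor the same sequence looks unattractive because the agent barely values the +10 that arrives three steps later; with a larger one the delayed reward outweighs the step costs.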

3. Exploration vs Exploitation Trade-off

One of the central challenges in RL is the exploration–exploitation dilemma:

Exploitation: Choosing the action that currently has the highest estimated value, to collect as much reward as possible from what the agent already knows.
Exploration: Trying less certain actions to discover potentially better long-term strategies.

Balancing both is critical:

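Too much exploitation can lock the agent into a suboptimal strategy, because better alternatives are never tried.
Too much exploration wastes steps on actions already known to be poor, so the agent collects less reward and learns slowly.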

Common Strategies:

ε-greedy: With probability ε, choose a random action; otherwise choose the best-known action.
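Softmax action selection: Choose each action with a probability that increases with its estimated value, so better-looking actions are picked more often but no action is excluded.
Upper Confidence Bound (UCB): Choose the action with the highest estimated value plus an uncertainty bonus, where the bonus is larger for actions that have been tried less often.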
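
The three strategies above can be sketched as small Python functions that pick an action index from a list of estimated values. This is an illustrative sketch, not a library API: the function names, the `temperature` parameter, and the exploration constant `c` are assumptions, and the UCB rule is shown in the UCB1 form commonly used for multi-armed bandits.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, else the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax_action(q_values, temperature, rng):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    best = max(q_values)  # subtracting the max keeps exp() numerically stable
    weights = [math.exp((q - best) / temperature) for q in q_values]
    return rng.choices(range(len(q_values)), weights=weights, k=1)[0]


def ucb_action(q_values, counts, total_steps, c=2.0):
    """Pick the action maximizing Q + c * sqrt(ln(total_steps) / count)."""
    # Any action that has never been tried is selected first.
    for action, n in enumerate(counts):
        if n == 0:
            return action
    scores = [
        q + c * math.sqrt(math.log(total_steps) / n)
        for q, n in zip(q_values, counts)
    ]
    return max(range(len(q_values)), key=lambda a: scores[a])


if __name__ == "__main__":
    rng = random.Random(0)
    q = [0.1, 0.9, 0.3]
    print(epsilon_greedy(q, epsilon=0.1, rng=rng))
    print(softmax_action(q, temperature=0.5, rng=rng))
    print(ucb_action(q, counts=[3, 10, 1], total_steps=14))
```

A high `temperature` makes softmax selection close to uniform, while a low one makes it close to greedy; `epsilon` and `c` play the same role of tuning how much the agent explores.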

Technical Example: Gridworld Navigation

Imagine a robot navigating a 5×5 grid to reach a goal:

States: Each cell in the grid.
Actions: Up, down, left, right.
Rewards: +10 for reaching the goal, -1 for each step, -10 for falling into a pit.

The agent starts with no knowledge of the grid and learns by interacting: it tries different paths, experiences the outcomes, and updates its policy.

Over time, it learns:

Which states are promising (via value functions)
Which actions yield long-term benefit (via Q-values)
To balance exploring new paths and exploiting known safe routes

This is the essence of model-free reinforcement learning (like Q-learning or SARSA), where the agent does not build a model of the environment, but instead learns from direct experience.
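
As a concrete illustration, the gridworld can be solved with tabular Q-learning, which applies the update $Q(s,a) \leftarrow Q(s,a) + \alpha [\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,]$ after every step. The sketch below makes several assumptions that the description above does not fix: the start, goal, and pit positions, the hyperparameters ($\alpha$, $\gamma$, $\varepsilon$, number of episodes), the choice that moving into a wall leaves the robot in place, and the choice that reaching the goal or the pit ends the episode. All function names are hypothetical.

```python
import random

SIZE = 5
# Start, goal, and pit positions are assumptions made for this sketch.
START, GOAL, PIT = (0, 0), (4, 4), (2, 2)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Move one cell (staying put at walls); return (next_state, reward, done)."""
    row = min(max(state[0] + action[0], 0), SIZE - 1)
    col = min(max(state[1] + action[1], 0), SIZE - 1)
    next_state = (row, col)
    if next_state == GOAL:
        return next_state, 10.0, True
    if next_state == PIT:
        return next_state, -10.0, True
    return next_state, -1.0, False


def train(episodes=3000, alpha=0.1, gamma=0.95, epsilon=0.2, seed=0):
    """Learn a Q-table with epsilon-greedy exploration and Q-learning updates."""
    rng = random.Random(seed)
    q = {((r, c), a): 0.0
         for r in range(SIZE) for c in range(SIZE) for a in range(len(ACTIONS))}
    for _ in range(episodes):
        state, done, steps = START, False, 0
        while not done and steps < 200:  # cap episode length
            if rng.random() < epsilon:   # explore
                a = rng.randrange(len(ACTIONS))
            else:                        # exploit the current estimates
                a = max(range(len(ACTIONS)), key=lambda i: q[(state, i)])
            next_state, reward, done = step(state, ACTIONS[a])
            # Terminal states have no future value.
            best_next = 0.0 if done else max(
                q[(next_state, i)] for i in range(len(ACTIONS)))
            q[(state, a)] += alpha * (reward + gamma * best_next - q[(state, a)])
            state = next_state
            steps += 1
    return q


def greedy_path(q, max_steps=50):
    """Follow the highest-valued action from START, with no exploration."""
    state, path, done = START, [START], False
    while not done and len(path) <= max_steps:
        a = max(range(len(ACTIONS)), key=lambda i: q[(state, i)])
        state, _, done = step(state, ACTIONS[a])
        path.append(state)
    return path


if __name__ == "__main__":
    print(greedy_path(train()))
```

After training, following the greedy policy should lead from the start cell to the goal while avoiding the pit, although the exact route depends on the seed and hyperparameters. Replacing `best_next` with the value of the action actually taken in the next state would turn this into SARSA.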

Real-World Applications

Games: AlphaGo, the Dota 2 bot OpenAI Five, and chess engines such as AlphaZero have used RL with self-play to reach or exceed the level of top human players.
Robotics: Agents learn to walk, grasp, or fly by maximizing physical reward signals like balance or accuracy.
Finance: Trading agents learn to buy/sell stocks based on delayed returns.
Healthcare: RL is studied for adaptive treatment strategies, which adjust decisions over time to improve patient recovery outcomes.

Summary

Reinforcement learning is a learning paradigm in which agents learn effective behavior through direct interaction with an environment, guided by the goal of maximizing cumulative reward. The agent must cope with uncertainty and delayed feedback, using strategies that trade off exploration of unknown actions against exploitation of known rewards. Through this process, RL agents develop policies that adapt and improve over time, even in complex, dynamic environments.