A1.3.5 Explain the role of the operating system in managing multitasking and resource allocation. (HL only) .

A1.3.5 Explain the role of the operating system in managing multitasking and resource allocation. (HL only)

• The challenges of multitasking and resource allocation, including task scheduling, resource contention and deadlock

The big idea

An operating system (OS) turns a single collection of physical components into the illusion of many concurrently running programs, each safely and efficiently sharing finite resources. It does this by multitasking—rapidly switching the processor among contexts—and by resource allocation—deciding which task may consume CPU cycles, memory pages, disk/I-O bandwidth, network sockets, or other critical assets at any instant.

There have been some student questions about what happens when a process is sleeping (or blocked) and a network request comes in for that process. Read this article for clarification on interrupts, schedulers, and processes.

1 Multitasking in depth

Concept	Precise meaning	How the OS realises it
Task	An execution context: a process with its own virtual address space and descriptors, or a kernel thread/lightweight process that shares an address space.	PCB/TCB structures store registers, page-table base, open-file table, credentials, cgroup membership. PCB / TCB PCB (Process Control Block): A kernel data structure that stores all information the operating system needs to manage a process. TCB (Thread Control Block): A similar kernel data structure, but for a thread; it stores thread-specific execution state, often linked to a PCB. Registers Small, fast storage locations inside the CPU that hold the current execution state of a task, including the program counter, stack pointer, and general-purpose registers. Saving and restoring registers allows the OS to pause and resume tasks during context switching. Page-table base A reference (pointer) to the base of the page table used by the task. It tells the memory management unit (MMU) how the task’s virtual addresses map to physical memory. Open-file table A data structure listing all files the task currently has open, including file descriptors, access modes, and current read/write positions. It allows the OS to track and manage file I/O independently for each task. Credentials Security-related information associated with the task, such as user ID (UID), group ID (GID), and access privileges. These determine what system resources the task is allowed to access. cgroup membership Information indicating which control group (cgroup) the task belongs to. cgroups are used by the OS to limit and account for resource usage (CPU time, memory, I/O) across groups of tasks. At a conceptual level: the PCB/TCB is the OS’s memory of a task, capturing everything needed to stop it, manage it safely, and resume it later as if nothing happened.
Context switch	The act of saving the CPU state of the current task and restoring that of the next runnable task.	Triggered by timer interrupt, I/O completion, system call, or explicit yield; involves TLB flush or ASID update, pipeline drain, cache interference. Timer interrupt A hardware signal generated periodically by a system timer. It allows the operating system to regain control of the CPU to enforce scheduling decisions, such as pre-emption in multitasking systems. I/O completion An event generated when an input/output operation (for example, disk read or network send) finishes. It notifies the operating system that a previously blocked task can resume execution. System call A controlled entry point through which a user-level task requests a service from the operating system kernel. It causes a transition from user mode to kernel mode, often triggering scheduling or state changes. Explicit yield A voluntary action by a running task to give up the CPU before its time slice expires. It allows the scheduler to select another ready task without forcing pre-emption. TLB (Translation Lookaside Buffer) A small, high-speed cache inside the CPU that stores recent virtual-to-physical memory address translations. It reduces the cost of address translation by avoiding repeated page-table lookups during memory access. At a conceptual level: the TLB exists to make virtual memory fast, but it also explains why context switches are expensive—because cached translations may no longer be valid when the active task changes. TLB flush The invalidation of entries in the Translation Lookaside Buffer (TLB), which caches virtual-to-physical address translations. It is required when switching between tasks with different address spaces to prevent incorrect memory access. ASID update A change to the Address Space Identifier used by the MMU to distinguish between different virtual address spaces. It allows the OS to avoid flushing the TLB by tagging translations with a task-specific identifier. Pipeline drain The process of clearing partially executed instructions from the CPU pipeline. It ensures that instructions from the old task do not complete after a context switch. Cache interference Performance degradation caused when a context switch replaces cached data from one task with data from another. It increases memory access latency until the cache is repopulated with relevant data. At a conceptual level: a context switch is triggered by control events and incurs real hardware costs, because the CPU and memory system must be reoriented to safely and correctly execute a different task.
Pre-emption	Forcible suspension of a running task by the kernel before the task voluntarily blocks or exits.	Driven by the scheduler tick or higher-priority wake-ups; guarded by pre-emption disable sections inside the kernel. Scheduler tick A periodic interrupt generated by the system timer. It allows the scheduler to update accounting information and decide whether the currently running task should be pre-empted. Higher-priority wake-up An event in which a task with a higher scheduling priority transitions from a blocked state to a ready state. It can cause the scheduler to immediately pre-empt the currently running task. Pre-emption disable sections Critical regions of kernel code where pre-emption is temporarily turned off. They prevent the scheduler from interrupting execution while shared kernel data structures are being safely modified. At a conceptual level: pre-emption is policy-driven but carefully controlled, allowing responsive scheduling without corrupting kernel state.

1.1 Task scheduling

The policy and mechanism that choose which task’s context is loaded next.

Policy layer — defines goals (fair share, priority, energy) and computes eligibility (e.g., CFS’s virtual-runtime minimum, EDF’s earliest deadline).
Mechanism layer — an O(1) or O(log n) ready-queue data structure (red-black tree, bitmap + array, skip-list) and per-core run queue selection.

Policy layer

The part of the scheduler that defines what the system is trying to optimise.
It encodes scheduling goals such as fairness, priority enforcement, deadline compliance, or energy efficiency, and computes which tasks are eligible to run and in what order.

Fair share

A scheduling goal where CPU time is distributed proportionally among tasks or groups.
It prevents starvation by ensuring no task can monopolise the processor over time.

Priority

A numerical or ordered value assigned to a task that influences its preference for CPU access.
Higher-priority tasks are selected to run before lower-priority ones when contention exists.

Energy (energy-aware scheduling)

A scheduling goal that considers power consumption alongside performance.
The scheduler may consolidate tasks onto fewer cores or prefer energy-efficient cores to reduce overall energy use.

Virtual runtime (CFS)

A logical time value representing how much processor time a task has received, weighted by priority.
The Completely Fair Scheduler selects the task with the smallest virtual runtime to approximate ideal multitasking fairness.

Earliest deadline (EDF)

A real-time scheduling policy where tasks are ordered by their absolute deadlines.
The task with the closest deadline is always selected to run next.

Mechanism layer

The part of the scheduler that defines how scheduling decisions are implemented.
It provides concrete data structures and algorithms to efficiently select the next runnable task.

Ready queue

A kernel-managed data structure holding tasks that are ready to execute but not currently running.
Its organisation directly affects scheduling overhead and scalability.

O(1) / O(log n)

Descriptions of the time complexity required to select the next task to run.
O(1) schedulers make constant-time decisions, while O(log n) schedulers trade slightly higher cost for more precise ordering.

Red-black tree

A self-balancing binary search tree used to keep tasks ordered by scheduling criteria.
It allows efficient insertion, deletion, and minimum-value selection.

Bitmap + array

A scheduling structure where priority levels map to arrays, tracked by a bitmap.
It enables constant-time lookup of the highest-priority runnable task.

Skip list

A layered probabilistic data structure that maintains ordered elements with logarithmic access time.
It offers simpler implementation than balanced trees with similar performance.

Per-core run queue

A separate ready queue maintained for each CPU core in a multicore system.
It improves scalability and cache locality by reducing contention between cores.

At a conceptual level: the policy layer decides who should run, and the mechanism layer makes that decision fast enough to be practical.

Challenges

High utilisation vs. overhead – short time slices improve responsiveness but increase cache-destructive context switches.
Fairness vs. predictability – weight-proportional share conflicts with hard real-time latency bounds.
Priority inversion – a low-priority task holding a lock needed by a high-priority task stalls the latter until the kernel boosts (“priority inheritance”).

2 Resource allocation

2.1 Definition

The kernel must partition, arbitrate, and account for finite resources so that no task corrupts another’s state and overall performance meets design goals.

Resource	Allocation techniques
CPU	Scheduler quanta, CPU affinity masks, cgroups CPU, share, energy-aware placement. Scheduler quanta (time quantum / time slice) The maximum amount of CPU time a runnable task is allowed to execute before it may be pre-empted. It enforces fairness and responsiveness by preventing any single task from running indefinitely. CPU affinity masks Bitmasks that specify which CPU cores a task is permitted to run on. They constrain scheduler placement to improve cache locality, meet hardware constraints, or satisfy real-time requirements. cgroups CPU A control group subsystem that manages how CPU time is distributed among groups of tasks. It allows administrators or the OS to account for and limit CPU usage at the group level rather than per task. CPU share A relative weighting assigned to a task or cgroup that determines its proportion of CPU time under contention. Higher shares result in more CPU time when multiple runnable tasks compete. Energy-aware placement A scheduling strategy that considers power efficiency when selecting which CPU core should run a task. It balances performance against energy consumption by accounting for core characteristics, workload intensity, and thermal state. At a conceptual level: these parameters tune how aggressively and where tasks run, shaping fairness, performance predictability, and energy efficiency in modern schedulers.
Memory	Demand paging, copy-on-write, NUMA page migration, working-set size limits, slab/slub caches. Demand paging A memory management technique where pages are loaded into physical memory only when they are first accessed. It reduces memory usage and startup time by avoiding loading unused pages. Copy-on-write (CoW) A memory optimisation technique where multiple tasks initially share the same physical memory pages marked as read-only. A private copy of a page is created only when a task attempts to modify it. NUMA page migration The relocation of memory pages between NUMA nodes to be closer to the CPU core accessing them most frequently. It reduces memory access latency and improves performance on non-uniform memory access systems. Working-set size limits Constraints on the amount of physical memory a task or group of tasks is allowed to actively use. They prevent excessive paging and help the OS maintain overall memory system stability. Slab / SLUB caches Kernel memory allocators designed for efficient allocation of frequently used, fixed-size objects. They reduce fragmentation and allocation overhead by reusing pre-initialised memory objects. NUMA (Non-Uniform Memory Access) A memory architecture in which each CPU socket has its own local memory, with slower access to memory attached to other sockets. It improves scalability on multiprocessor systems but requires the OS to manage memory placement carefully to minimise access latency. Slab allocator A kernel memory allocation strategy that manages memory in caches of pre-sized objects. It improves performance and reduces fragmentation by reusing already-initialised kernel objects. SLUB allocator A simplified and more scalable successor to the slab allocator used in modern Linux kernels. It reduces metadata overhead and improves performance on multicore systems while retaining object-caching benefits. At a conceptual level: these mechanisms control when, where, and how memory is allocated, balancing efficiency, isolation, and performance under memory pressure.
Disk / NVMe	Per-I/O priority queues (CFQ, BFQ), IOThrottle, tagged command queuing. Per-I/O priority queues Separate queues used by the operating system to organise I/O requests according to priority or workload class. They allow the I/O scheduler to balance fairness, latency, and throughput across competing tasks. CFQ (Completely Fair Queuing) An I/O scheduling algorithm that allocates disk time fairly among processes. It prevents any single process from monopolising I/O bandwidth by time-slicing access to the device. BFQ (Budget Fair Queuing) An I/O scheduler that assigns each process a budget of I/O operations. It improves responsiveness for interactive and multimedia workloads while maintaining fairness. IOThrottle A kernel mechanism that limits the rate of I/O operations for a task or control group. It prevents excessive disk usage from degrading system-wide performance. Tagged Command Queuing (TCQ) A storage device feature that allows multiple I/O commands to be issued and reordered internally by the device. It improves throughput by enabling the device to optimise execution order, especially on high-latency storage media. At a conceptual level: modern I/O scheduling separates requests by intent and priority, then carefully meters access to slow devices to keep the system responsive.
Network	Token Bucket Filters, XDP, QoS classes, eBPF shaping, receive-side scaling (RSS) steering. Token Bucket Filter (TBF) A traffic-shaping algorithm that controls the rate at which packets are transmitted. It allows short bursts of traffic while enforcing a long-term average bandwidth limit. XDP (eXpress Data Path) A high-performance packet processing framework that runs programs at the earliest point in the Linux networking stack. It enables extremely fast packet filtering, redirection, and shaping before packets reach the kernel network stack. QoS (Quality of Service) classes Categories used to classify network traffic according to priority or service requirements. They allow the network stack to guarantee bandwidth, limit latency, or reduce packet loss for critical traffic. eBPF shaping The use of extended Berkeley Packet Filter programs to control and modify packet flow dynamically. It allows programmable, fine-grained traffic shaping and policy enforcement inside the kernel. Receive-Side Scaling (RSS) steering A hardware-assisted technique that distributes incoming network packets across multiple CPU cores. It improves throughput and reduces contention by steering packets for the same flow to the same core. At a conceptual level: modern network scheduling combines rate control, programmable filtering, and parallelism to manage high-throughput traffic efficiently.
Kernel objects	Reference counts for files, sockets; semaphore counts; quota tables. Reference counts (files, sockets) Integer counters that track how many active references exist to a kernel object such as a file or socket. They ensure resources are released only when no tasks are still using them. Semaphore counts Integer values associated with semaphores that represent the number of available resources or permits. They are decremented when a task acquires the semaphore and incremented when the resource is released. Quota tables Kernel-managed data structures that record resource usage limits and current consumption for users, groups, or control groups. They enforce system-wide and per-entity limits on resources such as disk space, CPU time, or memory. At a conceptual level: counting mechanisms let the OS safely share finite resources without leaks, races, or starvation.

2.2 Resource contention

Two or more runnable tasks require the same non-shareable resource concurrently.

Symptoms – rising wait times in run queue, lock contention spin-time, cache-line bouncing (false sharing), elevated latency.
Kernel mitigations – adaptive spin locks, MCS locks for fairness, NUMA-aware memory placement, per-CPU data structures, I/O merging, lock elision with transactional memory.

3 Deadlock

3.1 Formal definition

A deadlock exists when a set of tasks is permanently blocked because each holds at least one resource and waits to acquire a resource held by another task in the same set. Coffman’s four necessary conditions:

Mutual exclusion
Hold-and-wait
No pre-emption of held resources
Circular wait (closed wait-for graph)

3.2 OS strategies

Strategy	Principle	Example
Prevention	Remove at least one Coffman condition	Force tasks to request all locks at once (hold-and-wait gone); impose global lock ordering (circular wait gone).
Avoidance	Admit a request only if system would remain in a safe state	Banker’s algorithm for fixed-size resource vectors.
Detection & recovery	Periodically scan wait-for graph; if cycle found, kill or roll back a victim task	Linux’s `CONFIG_DETECT_HUNG_TASK`, databases’ deadlock detectors.
Static analysis	Prove absence of cycles at design time in synchronous reactive systems	Model checking of mutex protocols in avionics RTOS.

4 Putting it all together: the OS control loop

Hardware timer fires → trap to kernel.
Scheduler charges the running task’s wall-time and consults the policy (Wall-time is the actual elapsed real-world time during which a task is running on the CPU. It includes all time spent executing, regardless of instruction count or work performed.)
If pre-emption needed, context switch to next eligible task.
That task’s memory pages are ensured resident via page-fault handling.
I/O requests are queued to device schedulers that observe per-task quotas.
Synchronisation primitives gate critical regions; kernel monitors for deadlock hints (long lock hold times, wait-for cycles).
Accounting subsystem logs CPU, memory, and I/O usage for fairness rebalancing and billing.

Throughout, the kernel maintains invariants: no address-space overlap, no unauthorised device register access, no starvation beyond configured policy limits.

Key takeaways

Multitasking turns scarce CPU cycles into the appearance of parallel threads through pre-emption and rapid context switches.
Resource allocation guarantees isolation and efficient usage of all hardware resources; its correctness depends on the scheduler, memory manager, and I/O subsystems cooperating.
Resource contention and deadlock are inherent risks; kernels deploy prevention, avoidance, detection, and recovery mechanisms, plus modern techniques like lock-free data structures and eBPF-driven scheduling, to sustain throughput without sacrificing correctness.