Problem Set 3: Url shortener – Convert long URLs into short codes.

Read the question carefully (twice).
Break the task into the smallest steps.
Sketch or write pseudocode before coding.
Start small — test as you go.
Check your solution with different cases.

Overview

Build a tiny URL-shortening service. Given a long URL, your program returns a short code (e.g., aZ3fQ9). When someone provides that short code, your program returns the original long URL.

Key constraint: No duplicate short codes may exist. If a code is already in use, your program must not reuse it for a different URL.

Learning objectives

You should be able to:

Design simple data structures to map keys to values.
Implement a collision-safe short-code generator.
Validate and normalize user input (URLs).
Write tests that verify uniqueness, idempotency, and lookup correctness.

Success criteria (what “done” looks like)

shorten(url) returns a 6–10 character short code and never creates a duplicate code.
Re-shortening the same URL returns the same code (idempotent behavior).
resolve(code) returns the original URL that was shortened.
Input validation: rejects empty strings and clearly invalid URLs.
A test suite that passes: uniqueness under load, idempotency, lookup, and error cases.

Constraints & rules

Codes are URL-safe: only [A–Z][a–z][0–9].
Codes must be unique. Handle collisions robustly.
Keep state in memory (dictionary) for the core task.
Extension: add persistence with a simple file or SQLite.
Deterministic behavior for the same input (idempotency): if URL X was shortened before, return the same code.

Starter design

Data structures

url_to_code: dict[str, str] – map long URL → short code
code_to_url: dict[str, str] – map short code → long URL

Core functions (implement these)

normalize_url(url: str) -> str
generate_code(n: int = 6) -> str
shorten(url: str) -> str
resolve(code: str) -> str

Skeleton (Python)

Paste this into shortener.py. Do not change function names or signatures.

import re
import string
import random
from urllib.parse import urlparse

# In-memory stores
url_to_code: dict[str, str] = {}
code_to_url: dict[str, str] = {}

ALPHABET = string.ascii_letters + string.digits
CODE_LEN = 6

def normalize_url(url: str) -> str:
    """
    Return a cleaned, canonical-looking URL or raise ValueError.
    Rules:
      - Must have scheme (http/https). If missing, assume https.
      - Must have netloc (domain).
      - Trim whitespace.
    """
    if not isinstance(url, str):
        raise ValueError("URL must be a string")
    url = url.strip()
    if not url:
        raise ValueError("Empty URL")

    # Prepend https:// if missing scheme
    parsed = urlparse(url)
    if not parsed.scheme:
        url = "https://" + url
        parsed = urlparse(url)

    if parsed.scheme not in {"http", "https"}:
        raise ValueError("Unsupported scheme")

    if not parsed.netloc:
        raise ValueError("Missing domain")

    # Optional: lowercase scheme + host; keep path/query as-is
    canonical = f"{parsed.scheme.lower()}://{parsed.netloc.lower()}{parsed.path}"
    if parsed.params:
        canonical += f";{parsed.params}"
    if parsed.query:
        canonical += f"?{parsed.query}"
    if parsed.fragment:
        canonical += f"#{parsed.fragment}"
    return canonical

def generate_code(n: int = CODE_LEN) -> str:
    """
    Generate a random code using ALPHABET. Collision-safe logic
    is enforced in shorten() (this function only makes a candidate).
    """
    return "".join(random.choices(ALPHABET, k=n))

def shorten(url: str) -> str:
    """
    Return a short code for the given URL.
    Requirements:
      - Idempotent: the same normalized URL gets the same code.
      - No duplicate codes: if a generated code exists, regenerate.
    """
    normalized = normalize_url(url)

    # Idempotency: if URL already shortened, return existing code
    if normalized in url_to_code:
        return url_to_code[normalized]

    # Generate until an unused code is found (collision handling)
    # This loop guarantees uniqueness even under rare collisions.
    attempts = 0
    while True:
        attempts += 1
        if attempts > 10_000:
            raise RuntimeError("Too many collisions; increase code length.")
        candidate = generate_code(CODE_LEN)
        if candidate not in code_to_url:
            # Reserve the code
            code_to_url[candidate] = normalized
            url_to_code[normalized] = candidate
            return candidate

def resolve(code: str) -> str:
    """
    Given a short code, return the original URL.
    Raise KeyError if the code does not exist.
    """
    if not isinstance(code, str) or not code:
        raise ValueError("Code must be a non-empty string")
    if not re.fullmatch(r"[A-Za-z0-9]+", code):
        raise ValueError("Invalid code characters")
    return code_to_url<pre><code>

Manual usage example (REPL)

>>> from shortener import shorten, resolve
>>> c = shorten("https://www.example.com/some/long/path?x=1")
>>> c
'aZ3fQ9'  # example
>>> resolve(c)
'https://www.example.com/some/long/path?x=1'
>>> shorten("www.example.com/some/long/path?x=1")  # idempotent after normalization
'aZ3fQ9'

Tests (save as `test_shortener.py`)

Run with python -m pytest -q or python -m unittest discover.

import re
from shortener import shorten, resolve, url_to_code, code_to_url, normalize_url, CODE_LEN

def test_normalization_adds_scheme_and_lowers_host():
    u = normalize_url("Example.com/Path")
    assert u.startswith("https://")
    assert "example.com" in u

def test_shorten_returns_fixed_length_alnum_code():
    code = shorten("https://example.com/alpha")
    assert len(code) == CODE_LEN
    assert re.fullmatch(r"[A-Za-z0-9]{%d}" % CODE_LEN, code)

def test_idempotent_same_url_same_code():
    a = shorten("https://example.com/idempotent")
    b = shorten("example.com/idempotent")
    assert a == b

def test_resolve_round_trip():
    code = shorten("https://example.com/roundtrip")
    assert resolve(code) == "https://example.com/roundtrip"

def test_reject_resolve_unknown_code():
    import pytest
    with pytest.raises(KeyError):
        resolve("NoSuchCode123")

def test_reject_invalid_inputs():
    import pytest
    with pytest.raises(ValueError):
        normalize_url("")
    with pytest.raises(ValueError):
        resolve("bad-code!")  # invalid characters

def test_uniqueness_under_load():
    # Create many URLs; ensure all codes are unique
    codes = set()
    for i in range(2000):
        url = f"https://example.org/resource/{i}"
        code = shorten(url)
        assert code not in codes
        codes.add(code)

Stretch goals (optional)

Deterministic codes (Base62 of an auto-increment id):
Maintain a counter next_id, encode it in Base62 for the code. Guarantees uniqueness without random collisions.
Custom code support:
Allow users to request a custom code (validate and ensure it’s unused).
Persistence:
Save url_to_code and code_to_url to a JSON file on exit and load on start, or use SQLite with a unique index on code.
CLI or Web API:
- CLI: python cli.py shorten <url> and python cli.py resolve <code>.
- Web: tiny Flask app with POST /shorten and GET /<code>; enforce a unique DB constraint on code.

Hints & guidance

Uniqueness enforcement lives at the point of insertion. Even with random codes, a while loop plus a set/dict membership check prevents duplicates.
Idempotency reduces storage and improves UX; normalize before looking up or inserting.
Length vs collisions: Longer codes reduce collision probability. If you ever hit many collisions, increase CODE_LEN.
Validation: Use urllib.parse.urlparse to check scheme and netloc; be strict and clear in error messages.