Problem Set 3: Url shortener – Convert long URLs into short codes.

When you start a new problem set, your first instinct might be to open your computer and begin typing code right away. While this can feel productive, it often leads to frustration when things don't work as expected. Instead, take a few minutes to slow down and plan.

Here are some helpful strategies:

  1. Understand the problem clearly
    • Read the instructions carefully — twice if needed.
    • Ask yourself: What exactly is being asked?
  2. Break the problem into smaller steps
    • Think about the smallest possible actions the computer will need to perform.
    • For example: If the task is to find the first recurring letter in a word, what steps must happen first?
  3. Try solving it on paper first
    • Write out your steps in plain language (pseudocode).
    • Test your steps with a simple example before touching the keyboard.
  4. Translate your steps into code
    • Start small — write only a few lines at a time and test often.
    • Don't worry about perfection at first; get a working version, then improve it.
  5. Check your solution
    • Run it with different examples, including edge cases.
    • Ask: Does this solve the problem in all situations?

  • Read the question carefully (twice).
  • Break the task into the smallest steps.
  • Sketch or write pseudocode before coding.
  • Start small — test as you go.
  • Check your solution with different cases.

Overview

Build a tiny URL-shortening service. Given a long URL, your program returns a short code (e.g., aZ3fQ9). When someone provides that short code, your program returns the original long URL.

Key constraint: No duplicate short codes may exist. If a code is already in use, your program must not reuse it for a different URL.


Learning objectives

You should be able to:

  • Design simple data structures to map keys to values.
  • Implement a collision-safe short-code generator.
  • Validate and normalize user input (URLs).
  • Write tests that verify uniqueness, idempotency, and lookup correctness.

Success criteria (what “done” looks like)

  • shorten(url) returns a 6–10 character short code and never creates a duplicate code.
  • Re-shortening the same URL returns the same code (idempotent behavior).
  • resolve(code) returns the original URL that was shortened.
  • Input validation: rejects empty strings and clearly invalid URLs.
  • A test suite that passes: uniqueness under load, idempotency, lookup, and error cases.

Constraints & rules

  • Codes are URL-safe: only [A–Z][a–z][0–9].
  • Codes must be unique. Handle collisions robustly.
  • Keep state in memory (dictionary) for the core task.
    Extension: add persistence with a simple file or SQLite.
  • Deterministic behavior for the same input (idempotency): if URL X was shortened before, return the same code.

Starter design

Data structures

  • url_to_code: dict[str, str] – map long URL → short code
  • code_to_url: dict[str, str] – map short code → long URL

Core functions (implement these)

  • normalize_url(url: str) -> str
  • generate_code(n: int = 6) -> str
  • shorten(url: str) -> str
  • resolve(code: str) -> str

 

Skeleton (Python)

Paste this into shortener.py. Do not change function names or signatures.

import re
import string
import random
from urllib.parse import urlparse

# In-memory stores
url_to_code: dict[str, str] = {}
code_to_url: dict[str, str] = {}

ALPHABET = string.ascii_letters + string.digits
CODE_LEN = 6

def normalize_url(url: str) -> str:
    """
    Return a cleaned, canonical-looking URL or raise ValueError.
    Rules:
      - Must have scheme (http/https). If missing, assume https.
      - Must have netloc (domain).
      - Trim whitespace.
    """
    if not isinstance(url, str):
        raise ValueError("URL must be a string")
    url = url.strip()
    if not url:
        raise ValueError("Empty URL")

    # Prepend https:// if missing scheme
    parsed = urlparse(url)
    if not parsed.scheme:
        url = "https://" + url
        parsed = urlparse(url)

    if parsed.scheme not in {"http", "https"}:
        raise ValueError("Unsupported scheme")

    if not parsed.netloc:
        raise ValueError("Missing domain")

    # Optional: lowercase scheme + host; keep path/query as-is
    canonical = f"{parsed.scheme.lower()}://{parsed.netloc.lower()}{parsed.path}"
    if parsed.params:
        canonical += f";{parsed.params}"
    if parsed.query:
        canonical += f"?{parsed.query}"
    if parsed.fragment:
        canonical += f"#{parsed.fragment}"
    return canonical

def generate_code(n: int = CODE_LEN) -> str:
    """
    Generate a random code using ALPHABET. Collision-safe logic
    is enforced in shorten() (this function only makes a candidate).
    """
    return "".join(random.choices(ALPHABET, k=n))

def shorten(url: str) -> str:
    """
    Return a short code for the given URL.
    Requirements:
      - Idempotent: the same normalized URL gets the same code.
      - No duplicate codes: if a generated code exists, regenerate.
    """
    normalized = normalize_url(url)

    # Idempotency: if URL already shortened, return existing code
    if normalized in url_to_code:
        return url_to_code[normalized]

    # Generate until an unused code is found (collision handling)
    # This loop guarantees uniqueness even under rare collisions.
    attempts = 0
    while True:
        attempts += 1
        if attempts > 10_000:
            raise RuntimeError("Too many collisions; increase code length.")
        candidate = generate_code(CODE_LEN)
        if candidate not in code_to_url:
            # Reserve the code
            code_to_url[candidate] = normalized
            url_to_code[normalized] = candidate
            return candidate

def resolve(code: str) -> str:
    """
    Given a short code, return the original URL.
    Raise KeyError if the code does not exist.
    """
    if not isinstance(code, str) or not code:
        raise ValueError("Code must be a non-empty string")
    if not re.fullmatch(r"[A-Za-z0-9]+", code):
        raise ValueError("Invalid code characters")
    return code_to_url<pre><code>

 

Manual usage example (REPL)

>>> from shortener import shorten, resolve
>>> c = shorten("https://www.example.com/some/long/path?x=1")
>>> c
'aZ3fQ9'  # example
>>> resolve(c)
'https://www.example.com/some/long/path?x=1'
>>> shorten("www.example.com/some/long/path?x=1")  # idempotent after normalization
'aZ3fQ9'

 

Tests (save as test_shortener.py)

Run with python -m pytest -q or python -m unittest discover.

import re
from shortener import shorten, resolve, url_to_code, code_to_url, normalize_url, CODE_LEN

def test_normalization_adds_scheme_and_lowers_host():
    u = normalize_url("Example.com/Path")
    assert u.startswith("https://")
    assert "example.com" in u

def test_shorten_returns_fixed_length_alnum_code():
    code = shorten("https://example.com/alpha")
    assert len(code) == CODE_LEN
    assert re.fullmatch(r"[A-Za-z0-9]{%d}" % CODE_LEN, code)

def test_idempotent_same_url_same_code():
    a = shorten("https://example.com/idempotent")
    b = shorten("example.com/idempotent")
    assert a == b

def test_resolve_round_trip():
    code = shorten("https://example.com/roundtrip")
    assert resolve(code) == "https://example.com/roundtrip"

def test_reject_resolve_unknown_code():
    import pytest
    with pytest.raises(KeyError):
        resolve("NoSuchCode123")

def test_reject_invalid_inputs():
    import pytest
    with pytest.raises(ValueError):
        normalize_url("")
    with pytest.raises(ValueError):
        resolve("bad-code!")  # invalid characters

def test_uniqueness_under_load():
    # Create many URLs; ensure all codes are unique
    codes = set()
    for i in range(2000):
        url = f"https://example.org/resource/{i}"
        code = shorten(url)
        assert code not in codes
        codes.add(code)

 

Stretch goals (optional)

  1. Deterministic codes (Base62 of an auto-increment id):
    Maintain a counter next_id, encode it in Base62 for the code. Guarantees uniqueness without random collisions.
  2. Custom code support:
    Allow users to request a custom code (validate and ensure it’s unused).
  3. Persistence:
    Save url_to_code and code_to_url to a JSON file on exit and load on start, or use SQLite with a unique index on code.
  4. CLI or Web API:
    • CLI: python cli.py shorten <url> and python cli.py resolve <code>.
    • Web: tiny Flask app with POST /shorten and GET /<code>; enforce a unique DB constraint on code.

 

Hints & guidance

  • Uniqueness enforcement lives at the point of insertion. Even with random codes, a while loop plus a set/dict membership check prevents duplicates.
  • Idempotency reduces storage and improves UX; normalize before looking up or inserting.
  • Length vs collisions: Longer codes reduce collision probability. If you ever hit many collisions, increase CODE_LEN.
  • Validation: Use urllib.parse.urlparse to check scheme and netloc; be strict and clear in error messages.