DAY 92 / 210

Foundations of Production LLM Inference

This opening day of Phase 3 establishes why inference differs from training and why it dominates real-world costs. It builds the mental model you need before any optimization or serving work begins.
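To make that mental model concrete, here is a rough back-of-envelope sketch comparing how long one decode step would take if it were limited only by arithmetic versus only by reading the weights from memory. Everything in it is an assumption chosen for illustration: the function name `estimateDecodeStep`, the 7B-parameter fp16 model, and the 300 TFLOP/s and 2 TB/s hardware figures are hypothetical, and KV cache traffic is ignored.

```typescript
// Back-of-envelope roofline check for a single decode step.
// All model and hardware numbers used below are illustrative assumptions.

interface DecodeEstimate {
  computeMs: number; // step time if limited only by arithmetic throughput
  memoryMs: number;  // step time if limited only by reading the weights
  bound: "memory" | "compute";
}

function estimateDecodeStep(
  params: number,        // model parameter count
  bytesPerParam: number, // 2 for fp16/bf16
  batchSize: number,     // sequences decoded together in one step
  peakFlops: number,     // assumed accelerator peak, in FLOP/s
  memBandwidth: number,  // assumed memory bandwidth, in bytes/s
): DecodeEstimate {
  // Roughly 2 FLOPs (multiply + add) per parameter per generated token.
  const flops = 2 * params * batchSize;
  // The weights are read once per step, no matter how many sequences share it.
  // KV cache reads are left out to keep the sketch small.
  const bytes = params * bytesPerParam;
  const computeMs = (flops / peakFlops) * 1000;
  const memoryMs = (bytes / memBandwidth) * 1000;
  return {
    computeMs,
    memoryMs,
    bound: memoryMs > computeMs ? "memory" : "compute",
  };
}

// Hypothetical 7B-parameter fp16 model on a device with
// 300 TFLOP/s peak compute and 2 TB/s memory bandwidth.
const single = estimateDecodeStep(7e9, 2, 1, 300e12, 2e12);
const batched = estimateDecodeStep(7e9, 2, 256, 300e12, 2e12);

console.log("batch 1:", single.bound, single.memoryMs.toFixed(2), "ms mem");
console.log("batch 256:", batched.bound, batched.computeMs.toFixed(2), "ms compute");
```

Under these assumed numbers, a batch of 1 spends about 7 ms moving weights against roughly 0.05 ms of arithmetic, so the step is memory-bound. At a batch of 256 the same weight read is shared across many sequences and the estimate tips to compute-bound. Real systems also pay for KV cache reads, which grow with batch size and sequence length, so treat this as a starting point for the day's reading rather than a measurement.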

⏱ 45 min target · 📝 3 quiz Qs

Resources

reading · arXiv
Efficient Memory Management for Large Language Model Serving with PagedAttention (the vLLM paper)
Abstract and Section 1
25 min

Deliverable

Journal entry listing three inference bottlenecks observed in current app/maku routes plus one candidate fix

Quiz · 3 questions

1. Why is LLM inference typically memory-bound rather than compute-bound?

Model weights exceed cache sizes
GPUs lack enough cores
Token generation is sequential
Batch sizes are always 1

2. Name one concrete difference between training and inference memory access patterns.

3. How might the rate-limiter in lib/rate-limiter.ts interact with an inference queue under bursty traffic?

Journal

Time spent (minutes)

Blockers

Commit / PR links (one per line)