LLM Inference Fundamentals and Latency
This first day of Phase 3 establishes the core mental model for production inference before optimization work begins. Understanding the token-generation loop and KV-cache mechanics explains why later days target throughput and cost. The day also surfaces common misconceptions about batching versus latency that come up in real deployment reviews.
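As a rough illustration of why the KV cache matters in the token-generation loop, the sketch below counts how many per-token key/value projections a decoder would perform with and without a cache. It is a toy cost model, not a real inference engine: the function name `kv_projections` and its parameters are hypothetical, and it assumes one K/V projection per token position per decode step.

```python
def kv_projections(prompt_len: int, new_tokens: int, use_cache: bool) -> int:
    """Count per-token K/V projections in a toy autoregressive decode loop.

    Hypothetical cost model: each decode step must have K/V available for
    every position in the sequence. With a cache, only positions not yet
    stored are projected; without one, every position is re-projected.
    """
    cached = 0            # positions whose K/V are already stored
    seq_len = prompt_len  # current sequence length (prompt + generated)
    total = 0

    for _ in range(new_tokens):
        if use_cache:
            # Only project positions the cache has not seen yet.
            total += seq_len - cached
            cached = seq_len
        else:
            # No cache: recompute K/V for the whole sequence every step.
            total += seq_len
        seq_len += 1      # the newly generated token extends the sequence

    return total


if __name__ == "__main__":
    print(kv_projections(8, 4, use_cache=True))   # 11: prefill of 8, then 1 per step
    print(kv_projections(8, 4, use_cache=False))  # 38: 8 + 9 + 10 + 11
```

With a cache the count grows linearly in the number of generated tokens, while without one it grows quadratically. That gap is the recomputation the cache removes, and the price is memory that grows with sequence length.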
Resources
- 20 min
- 25 min · reading · vLLM Blog · vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
Introduction and PagedAttention explanation
Deliverable
Journal entry with a 150-word summary of the KV cache's role, plus one latency bottleneck identified in app/maku/page.tsx
Quiz · 3 questions
1. Which component most directly reduces recomputation during autoregressive generation?
2. Explain in one sentence why increasing batch size can increase tail latency even when throughput rises.
3. Describe a scenario from the current Maku codebase where inference latency would be mis-measured if only average tokens-per-second is tracked.