DAY 109 / 210

LLM Inference Fundamentals and Latency

This first day of Phase 3 establishes the core mental model for production inference before optimization work begins. Understanding the token generation loop and KV-cache mechanics explains why later days target throughput and cost. The day also surfaces common misconceptions about batching versus latency that come up in real deployment reviews.

⏱ 50 min target · 📝 3 quiz Qs
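
The KV-cache mechanics named in the overview can be illustrated with a toy decode loop. This is a minimal sketch, not a real model: `ToyDecoder`, `step`, and `generate` are hypothetical names, the "projection" is a placeholder for computing a token's key and value tensors, and the "next token" is a stand-in for attention plus sampling. It only counts how many per-token key/value projections are performed with and without a cache.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDecoder:
    """Hypothetical stand-in for one attention layer; not a real library API."""
    use_cache: bool
    kv_cache: list = field(default_factory=list)
    projections: int = 0  # number of per-token key/value projections performed

    def _project(self, token):
        # Placeholder for computing the (key, value) pair of one token.
        self.projections += 1
        return (token, token)

    def step(self, tokens):
        if self.use_cache:
            # Only tokens not yet in the cache are projected.
            for tok in tokens[len(self.kv_cache):]:
                self.kv_cache.append(self._project(tok))
            kv = self.kv_cache
        else:
            # Every step re-projects the whole sequence seen so far.
            kv = [self._project(tok) for tok in tokens]
        # Stand-in for attention + sampling: a deterministic "next token".
        return len(kv)


def generate(prompt, new_tokens, use_cache):
    """Autoregressive loop: each step appends one token to the sequence."""
    decoder = ToyDecoder(use_cache)
    tokens = list(prompt)
    for _ in range(new_tokens):
        tokens.append(decoder.step(tokens))
    return tokens, decoder.projections


if __name__ == "__main__":
    out_plain, n_plain = generate(range(100), 50, use_cache=False)
    out_cached, n_cached = generate(range(100), 50, use_cache=True)
    print(out_plain == out_cached)  # True: same output either way
    print(n_plain, n_cached)        # 6225 149
```

With a 100-token prompt and 50 generated tokens, the uncached loop performs 6225 projections and the cached loop 149 (100 for the prompt, then one per later step). The cache trades memory that grows with sequence length for this saving, which is the pressure that memory-management schemes such as PagedAttention address.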

Resources

Reading · Hugging Face
Text Generation Inference Documentation
Overview and quickstart sections
20 min

Reading · vLLM Blog
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
Introduction and PagedAttention explanation
25 min

Deliverable

Journal entry with a 150-word summary of the KV cache's role, plus one latency bottleneck identified in app/maku/page.tsx
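
When looking for a latency bottleneck, it can help to separate time to first token (TTFT) from decode speed per request. The following is a minimal sketch, assuming token arrival timestamps have been logged for each request. The trace data is hypothetical and is not taken from app/maku/page.tsx; `summarize`, `percentile`, and `make_hypothetical_traces` are illustrative helpers, written in Python for brevity rather than in the codebase's TypeScript.

```python
import math
from statistics import mean


def summarize(request_start, token_times):
    """Per-request metrics from a start time and token arrival times (seconds)."""
    ttft = token_times[0] - request_start          # time to first token
    decode_time = token_times[-1] - token_times[0]
    decode_tps = (len(token_times) - 1) / decode_time if decode_time > 0 else 0.0
    return {"ttft_s": ttft, "decode_tps": decode_tps}


def percentile(values, pct):
    """Nearest-rank percentile; adequate for small samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def make_hypothetical_traces():
    """Ten made-up requests: nine start streaming after 0.2 s, one waits 3.0 s.

    All ten decode 50 further tokens at a steady 20 ms per token.
    """
    traces = []
    for ttft in [0.2] * 9 + [3.0]:
        start = 0.0
        token_times = [start + ttft + i * 0.02 for i in range(51)]
        traces.append((start, token_times))
    return traces


if __name__ == "__main__":
    stats = [summarize(start, times) for start, times in make_hypothetical_traces()]
    ttfts = [s["ttft_s"] for s in stats]
    print("mean decode tokens/s:", round(mean(s["decode_tps"] for s in stats), 1))
    print("p50 TTFT (s):", percentile(ttfts, 50))
    print("p95 TTFT (s):", percentile(ttfts, 95))
```

In this made-up data every request decodes at about 50 tokens per second, so the average looks healthy, while the p95 TTFT of 3.0 s exposes the one request that waited before streaming began. Tracking percentiles per request alongside the aggregate rate makes that kind of wait visible.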

Quiz · 3 questions

1. Which component most directly reduces recomputation during autoregressive generation?

Gradient checkpointing · KV cache · FlashAttention only · Quantization

2. Explain in one sentence why increasing batch size can increase tail latency even when throughput rises.

3. Describe a scenario from the current Maku codebase where inference latency would be mis-measured if only average tokens-per-second is tracked.

Journal

Time spent (minutes)

Blockers

Commit / PR links (one per line)