DAY 115 / 210

LLM Inference Fundamentals and Tradeoffs

This opening day of Phase 3 establishes the core mental models for production inference before any optimization layers are added. It matters because every later technique in the arc (batching, quantization, serving engines) is measured against these baseline latency, throughput, and memory constraints.
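
To make the memory side of that baseline concrete, here is a rough back-of-the-envelope sketch of how KV-cache size scales with context length and batch size. The model shape (32 layers, 32 KV heads, head dimension 128, fp16) is a hypothetical 7B-class configuration chosen for illustration, not the configuration of any endpoint in this course, and the function name `kv_cache_bytes` is made up for this example.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV-cache size in bytes for a decoder-only transformer.

    The leading factor of 2 accounts for storing both keys and values
    in every layer. bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape (an assumption for illustration only).
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128

per_request = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM,
                             seq_len=2048, batch_size=1)
batch_of_16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM,
                             seq_len=2048, batch_size=16)

print(f"one 2048-token request: {per_request / 2**30:.1f} GiB")
print(f"batch of 16 requests:   {batch_of_16 / 2**30:.1f} GiB")
```

Under these assumptions a single 2048-token request holds about 1 GiB of cache, and the cost grows linearly with batch size, which is why larger batches buy throughput at the expense of memory headroom.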

⏱ 45 min target · 📝 3 quiz Qs

Resources

reading · Lilian Weng
Inference Optimization for Large Language Models
entire article
25 min

reading · arXiv
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
abstract + sections 1-2
15 min

Deliverable

One-page journal entry listing the three primary inference metrics (latency, throughput, memory) and one concrete tradeoff that each introduces for the Maku brief endpoint.
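
One possible way to gather numbers for the journal entry is a small timing harness around a token stream. The sketch below measures time to first token and decode throughput. Because the real endpoint's client interface is not described here, it uses a stand-in generator, `fake_token_stream`, with made-up delays; both function names are hypothetical. Swap in whatever iterable of tokens your actual client yields. Peak memory is not captured by this harness and would need to be read from the serving side.

```python
import time
from typing import Iterable, Iterator


def fake_token_stream(n_tokens: int,
                      first_delay_s: float = 0.05,
                      per_token_delay_s: float = 0.01) -> Iterator[str]:
    """Stand-in for a streaming response (assumption, not a real API).

    Sleeps once to imitate prefill, then once per later token to
    imitate decode steps.
    """
    time.sleep(first_delay_s)
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_delay_s)
        yield f"tok{i}"


def measure_stream(stream: Iterable[str]) -> dict:
    """Consume a token stream and report basic latency/throughput stats."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            # Time from request start until the first token arrives.
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start

    # Decode throughput excludes the first token, which is dominated
    # by prefill rather than by per-token decode steps.
    decode_time = total - ttft if ttft is not None else 0.0
    decode_tps = None
    if count > 1 and decode_time > 0:
        decode_tps = (count - 1) / decode_time

    return {
        "ttft_s": ttft,
        "total_s": total,
        "tokens": count,
        "decode_tokens_per_s": decode_tps,
    }


stats = measure_stream(fake_token_stream(5))
print(stats)
```

Separating time to first token from decode throughput keeps the two latency components apart, which helps when arguing which one a given tradeoff actually affects.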

Quiz · 3 questions

1. Which metric is most directly increased by larger batch sizes during LLM inference?

Time to first token
Throughput
Peak memory per request
Model parameter count

2. Name one reason PagedAttention reduces memory fragmentation compared with naive, contiguous KV-cache allocation.

3. For the current /api/maku/brief route, which single inference metric would you optimize first and why?

Journal

Time spent (minutes)

Blockers

Commit / PR links (one per line)