DAY 110 / 210

Core Concepts of LLM Inference Serving

This opening day of phase-3-inference establishes the vocabulary and metrics of production model serving before any optimization work begins. It matters in the course arc because every later technique (batching, continuous batching, quantization) will be evaluated against these baseline definitions and trade-offs. The learner's existing Next.js routes provide the deployment surface that the inference patterns will eventually target.
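To make the idea of a baseline metric concrete, here is a minimal sketch of how per-request latency figures could be derived from timestamps recorded around a streamed response. The metric names used (time to first token, time per output token, end-to-end latency, tokens per second) are commonly tracked serving metrics and are an assumption here; this page does not name its four metrics. The `RequestTrace` shape and the `summarize` helper are hypothetical and exist only for illustration.

```typescript
// Hypothetical trace shape: millisecond timestamps captured around one streamed request.
interface RequestTrace {
  sentAtMs: number;    // when the request was sent
  tokenAtMs: number[]; // arrival time of each output token, in order
}

// Assumed metric set; the lesson itself does not name its metrics.
interface LatencyMetrics {
  ttftMs: number;       // time to first token
  tpotMs: number;       // mean time per output token after the first
  e2eMs: number;        // end-to-end latency (send to last token)
  tokensPerSec: number; // output tokens per second for this single request
}

function summarize(trace: RequestTrace): LatencyMetrics {
  const n = trace.tokenAtMs.length;
  if (n === 0) throw new Error("trace has no output tokens");
  const first = trace.tokenAtMs[0];
  const last = trace.tokenAtMs[n - 1];
  const e2eMs = last - trace.sentAtMs;
  return {
    ttftMs: first - trace.sentAtMs,
    // With a single token there is no inter-token gap to average.
    tpotMs: n > 1 ? (last - first) / (n - 1) : 0,
    e2eMs,
    tokensPerSec: e2eMs > 0 ? (n * 1000) / e2eMs : 0,
  };
}

// Example: first token after 200 ms, then one token every 50 ms.
const example = summarize({ sentAtMs: 0, tokenAtMs: [200, 250, 300, 350, 400] });
console.log(example);
```

Note that `tokensPerSec` here describes a single request. System-level throughput is measured across all concurrent requests, which is why a serving setup can raise throughput while individual request latency gets worse.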

⏱ 45 min target · 📝 3 quiz Qs

Resources

reading · vLLM project
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
Introduction and PagedAttention section
25 min
reading · Hugging Face
Hugging Face Inference Endpoints documentation
Quickstart and pricing model
15 min

Deliverable

Create a new branch inference-foundations and commit a 300-word journal entry that records the four primary inference metrics and one open question for the next day.

Quiz · 3 questions

1. Which metric is most directly improved by continuous batching?

Throughput
Model parameter count
Training loss
Embedding dimension

2. Define KV cache and state its primary memory impact during autoregressive generation.

3. Why might an engineer choose a higher-latency inference engine over a lower-latency one for a startup API?

Journal

Time spent (minutes)

Blockers

Commit / PR links (one per line)