Core Concepts of LLM Inference Serving
This opening day of phase-3-inference establishes the vocabulary and metrics of production model serving before any optimization work begins. Every later technique (batching, quantization, continuous batching) will be evaluated against these baseline definitions and trade-offs, so they need to be pinned down first. The learner's existing Next.js routes provide the deployment surface that the inference patterns will eventually target.
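To make "metrics" concrete, here is a minimal sketch of how serving metrics can be derived from per-token timestamps. This outline does not name the metrics, so the four used below (time to first token, time per output token, end-to-end latency, and throughput) are an assumption based on common usage. The `RequestTrace` record and all function names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical record of one streamed request; all times in seconds."""
    sent_at: float            # when the client sent the request
    token_times: List[float]  # arrival time of each output token, in order


def time_to_first_token(trace: RequestTrace) -> float:
    """Delay between sending the request and receiving the first token."""
    return trace.token_times[0] - trace.sent_at


def end_to_end_latency(trace: RequestTrace) -> float:
    """Delay between sending the request and receiving the last token."""
    return trace.token_times[-1] - trace.sent_at


def time_per_output_token(trace: RequestTrace) -> float:
    """Average gap between consecutive tokens after the first one."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0  # a single token has no inter-token gap
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def throughput_tokens_per_second(traces: List[RequestTrace]) -> float:
    """Total output tokens divided by the wall-clock span of all requests."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)
```

The first three functions describe a single request, while throughput describes the whole system under load. That split is why a change can improve one kind of metric while worsening the other.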
Resources
- 25 min · reading · vLLM project · vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
  Introduction and PagedAttention section
- 15 min
Deliverable
Create a new branch named inference-foundations and commit a 300-word journal entry that records the four primary inference metrics and one open question for the next day.
Quiz · 3 questions
1. Which metric is most directly improved by continuous batching?
2. Define KV cache and state its primary memory impact during autoregressive generation.
3. Why might an engineer choose a higher-latency inference engine over a lower-latency one for a startup API?