DAY 80 / 210

LLM Inference Pipeline Fundamentals

This opening day of phase-3-inference covers the core mechanics of model serving, token generation, and latency measurement that every later optimization builds on. It supports Maku's work on StartupTribunal by grounding the API and rate-limiting patterns already present in the codebase in how inference requests are actually served.

⏱ 45 min target · 📝 2 quiz Qs

Resources

reading · Hugging Face
Text Generation
entire page
25 min
reading · vLLM
vLLM Documentation
Getting Started and Serving LLMs sections
20 min

Deliverable

journal entry recording your first local inference benchmark and the latency numbers you observed (for example, time-to-first-token and per-token latency)
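
One way to collect the latency numbers for the journal entry is to time a token stream as it arrives. The sketch below is a minimal illustration, not a prescribed method: `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` is only a stand-in that sleeps to imitate a slow first token followed by faster ones. For a real benchmark, pass in whatever token iterator your local serving setup exposes.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a token stream; report time-to-first-token and mean inter-token latency."""
    start = clock()
    stamps = []
    for _ in tokens:
        # Record the arrival time of every token as it is yielded.
        stamps.append(clock())
    if not stamps:
        raise ValueError("stream produced no tokens")
    # Gaps between consecutive tokens (empty if only one token arrived).
    gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    return {
        "tokens": len(stamps),
        "ttft_s": stamps[0] - start,
        "mean_inter_token_s": sum(gaps) / len(gaps) if gaps else 0.0,
        "total_s": stamps[-1] - start,
    }


def fake_stream(n: int, first_delay: float, step_delay: float) -> Iterator[str]:
    """Hypothetical stand-in for a model: slow first token, faster later tokens."""
    time.sleep(first_delay)
    yield "tok0"
    for i in range(1, n):
        time.sleep(step_delay)
        yield f"tok{i}"


if __name__ == "__main__":
    stats = measure_stream(fake_stream(5, first_delay=0.05, step_delay=0.01))
    print(stats)
```

Reporting time-to-first-token separately from the mean gap between later tokens keeps the two phases of generation distinct in the journal, which makes later comparisons easier to read.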

Quiz · 2 questions

1. Which factor most directly increases time-to-first-token in autoregressive decoding?

batch size
context length
model parameter count
all of the above

2. Explain why KV caching reduces per-token latency after the first token.

Journal

Time spent (minutes)

Blockers

Commit / PR links (one per line)