DAY 122 / 210

Foundations of LLM Inference Pipelines

Phase 3 shifts focus from training to serving models at scale. Today establishes the core inference concepts so that later optimization work has a measurable baseline. Understanding pipelines early helps you spot bottlenecks before inference is integrated into StartupTribunal workflows. This matters because inference latency and cost largely determine whether the product is viable for real users.

⏱ 45 min target · 📝 3 quiz Qs

Resources

Reading · Hugging Face · Transformers Pipelines · entire page · 20 min
Reading · vLLM Blog · vLLM: Easy, Fast, and Cheap LLM Serving · intro and architecture overview · 15 min

Deliverable

A journal entry in app/maku/BriefForm.tsx documenting your first inference latency measurement on a local model.
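
One possible way to take that first measurement is a small timing harness like the sketch below. It uses only the Python standard library. The names `measure_latency` and `fake_generate` are hypothetical and do not come from the course material; `fake_generate` is a stand-in that just sleeps, so the numbers it produces say nothing about a real model. The warm-up step assumes the first call pays one-off costs that you do not want in the baseline.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(
    generate: Callable[[str], object],
    prompt: str,
    warmup: int = 1,
    runs: int = 5,
) -> Dict[str, float]:
    """Time generate(prompt) and return latency statistics in milliseconds."""
    # Warm-up calls are not timed: the first call often pays one-off costs
    # (weight loading, cache allocation) that would skew the baseline.
    for _ in range(warmup):
        generate(prompt)

    samples_ms: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "mean_ms": statistics.mean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
    }


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a local model call; sleeps about 10 ms."""
    time.sleep(0.01)
    return prompt + " ..."


if __name__ == "__main__":
    stats = measure_latency(fake_generate, "Summarize this brief.", warmup=1, runs=5)
    for name, value in stats.items():
        print(f"{name}: {value:.1f}")
```

To measure a real local model, pass your own callable in place of `fake_generate`, for example a function that wraps a Hugging Face `pipeline("text-generation", ...)` object with a model of your choice. Record the prompt, the run count, and the hardware alongside the numbers so the baseline stays comparable later.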

Quiz · 3 questions

1. Which component most directly controls batching behavior during inference?

Tokenizer · Model forward pass scheduler · Data collator · Optimizer

2. Why might increasing batch size reduce per-request latency under load up to a point, but then increase it?

3. Describe one concrete change you would make to the current brief submission flow if inference latency exceeded 2 s.

Journal

Time spent (minutes)

Blockers

Commit / PR links (one per line)