DAY 90 / 210

Core LLM Inference Patterns and Tradeoffs

Day 90 opens phase-3-inference by covering production serving fundamentals before any optimization work. This matters because inference is where StartupTribunal moves from model training to user-facing value, so the existing app structure must now be measured against real serving constraints such as latency, GPU memory, and throughput.

⏱ 45 min target · 📝 3 quiz Qs

Resources

Reading · Hugging Face
Text Generation Inference
Getting Started and Supported Models sections
25 min

Reading · vLLM Blog
vLLM: Easy, Fast, and Cheap LLM Serving
Entire post
15 min

Deliverable

300-word journal entry mapping inference latency/memory tradeoffs to the current Maku app routes
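
One possible starting point for the memory side of the journal entry is a back-of-the-envelope estimate of how much KV cache a single request holds. The sketch below uses the usual accounting for a decoder-only transformer with a contiguous, per-request cache. The function names and the model dimensions (32 layers, 32 KV heads, head size 128, fp16) are illustrative assumptions, not figures for any model or route in the course codebase.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV cache bytes held by one request at seq_len tokens.

    Assumes one K and one V tensor per layer, each storing
    num_kv_heads * head_dim values per token (fp16 by default).
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def max_concurrent_requests(free_gpu_bytes, per_request_bytes):
    """Upper bound on requests that fit if each reserves its full cache."""
    return free_gpu_bytes // per_request_bytes


GIB = 2**30

# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
per_request = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=2048)
print(per_request / GIB)                                # 1.0 (GiB per request)
print(max_concurrent_requests(10 * GIB, per_request))   # 10
```

Under these assumptions, a route that allows 2048-token contexts reserves about 1 GiB per in-flight request, so the memory left after loading the weights caps concurrency well before compute does. Swapping in the real model shape and each route's maximum context length gives a first-pass number to discuss in the journal entry.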

Quiz · 3 questions

1. Which factor most directly limits concurrent request throughput in a naive transformer pipeline?

Batch size
GPU memory fragmentation from per-request KV cache
Tokenizer vocabulary size
Learning rate schedule

2. Name one concrete downside of always using greedy decoding in a production chat endpoint.

3. How might the rate-limiter in the current codebase interact with an inference server that uses continuous batching?

Journal

Time spent (minutes)

Blockers

Commit / PR links (one per line)