DAY 79 / 210

Core Concepts of LLM Inference Serving

This day opens phase 3 by grounding learners in the realities of production inference rather than training. It directly supports Maku's StartupTribunal work by clarifying how model outputs reach users at scale. The focus on measurable trade-offs, such as latency versus throughput, guards against the common assumption that raw model quality alone determines what users experience.

⏱ 45 min target · 📝 2 quiz Qs

Resources

reading · arXiv
Efficient Memory Management for Large Language Model Serving with PagedAttention (the vLLM paper)
abstract and section 1
25 min

Deliverable

300-word journal entry on inference metrics relevant to StartupTribunal
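
As a starting point for the journal entry, the sketch below shows one way to compute three commonly reported serving metrics from per-request timestamps: time to first token (TTFT), mean time per output token, and aggregate token throughput. The `RequestTrace` record, the function names, and the sample timings are all hypothetical and chosen for illustration; they do not come from vLLM or from StartupTribunal.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical timing record for one request (all times in seconds)."""
    arrival: float            # when the request reached the server
    token_times: List[float]  # wall-clock time at which each output token was emitted


def time_to_first_token(trace: RequestTrace) -> float:
    """Latency the user waits before seeing any output."""
    return trace.token_times[0] - trace.arrival


def mean_time_per_output_token(trace: RequestTrace) -> float:
    """Average gap between consecutive output tokens after the first."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def token_throughput(traces: List[RequestTrace]) -> float:
    """Output tokens per second across all requests, measured over the
    window from the earliest arrival to the last emitted token."""
    total_tokens = sum(len(t.token_times) for t in traces)
    start = min(t.arrival for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return total_tokens / (end - start)


if __name__ == "__main__":
    # Made-up timings for two overlapping requests.
    traces = [
        RequestTrace(arrival=0.0, token_times=[0.5, 0.75, 1.0, 1.25]),
        RequestTrace(arrival=0.25, token_times=[1.0, 1.5, 2.0]),
    ]
    for t in traces:
        print(time_to_first_token(t), mean_time_per_output_token(t))
    print(token_throughput(traces))
```

The first two functions describe what a single user experiences, while the third describes the server as a whole, so it can help to report them separately in the entry.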

Quiz · 2 questions

1. Which factor most directly limits throughput when batch size increases?

GPU memory fragmentation
Training dataset size
Number of attention heads
Embedding dimension

2. Explain in two sentences why latency and throughput are not always improved by the same technique.

Journal

Time spent (minutes)

Blockers

Commit / PR links (one per line)