DAY 113 / 210

LLM Inference Fundamentals and Tradeoffs

This first day of Phase 3 establishes the core mental model for production inference that every later optimization builds on. Maku is building StartupTribunal, where latency, throughput, and cost at inference time directly determine whether the product can serve real users reliably.

⏱ 45 min target · 📝 2 quiz Qs

Resources

Reading · arXiv
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
Abstract + Sections 1-3
20 min

Reading · Hugging Face
Hugging Face Inference Endpoints Overview
Quickstart and Pricing sections
15 min

Deliverable

Journal entry comparing inference latency and cost for a 7B model on two providers, with concrete numbers for the StartupTribunal workload.
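
One way to turn measurements into the numbers this journal entry asks for is sketched below. It is a minimal helper, not a prescribed method: the function names (`latency_percentile`, `per_token_cost`, `per_hour_cost`) are hypothetical, and every price and latency in the example run is a placeholder to be replaced with your own measurements and each provider's published rates. It assumes one provider bills per token and the other per instance-hour, which may not match the two providers you pick.

```python
from statistics import quantiles


def latency_percentile(samples_s, pct):
    """Return the pct-th percentile (1-99) of latency samples in seconds.

    Needs at least two samples; quantiles(n=100) yields 99 cut points,
    so index pct - 1 is the pct-th percentile.
    """
    return quantiles(samples_s, n=100, method="inclusive")[pct - 1]


def per_token_cost(input_tokens, output_tokens,
                   usd_per_1m_input, usd_per_1m_output):
    """Cost in USD of one request on a provider that bills per token."""
    total = (input_tokens * usd_per_1m_input
             + output_tokens * usd_per_1m_output)
    return total / 1_000_000


def per_hour_cost(usd_per_hour, requests_per_hour):
    """Cost in USD of one request on a provider that bills per instance-hour.

    requests_per_hour is the sustained load the instance actually serves;
    idle time makes each request more expensive.
    """
    return usd_per_hour / requests_per_hour


if __name__ == "__main__":
    # Placeholder measurements and prices, NOT real provider data.
    samples = [0.8, 0.9, 1.1, 1.0, 2.4, 0.95, 1.05, 1.2]
    print("p50 (s):", latency_percentile(samples, 50))
    print("p95 (s):", latency_percentile(samples, 95))
    print("per-token provider, USD/request:",
          per_token_cost(1200, 300, 0.20, 0.20))
    print("per-hour provider, USD/request:",
          per_hour_cost(1.00, 500))
```

Reporting p50 and p95 side by side with cost per request keeps the comparison honest: a provider can look cheap per request while its tail latency is too slow for interactive use.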

Quiz · 2 questions

1. Which factor most directly limits concurrent users in a naive transformer inference server?

GPU memory fragmentation from KV cache
Number of training epochs
Embedding dimension size
Tokenizer vocabulary size

2. List two concrete metrics you would track to decide whether to switch from API calls to self-hosted inference for StartupTribunal.

Journal

Time spent (minutes)

Blockers

Commit / PR links (one per line)