DAY 116 / 210

LLM Inference Fundamentals and Serving Basics

This opening day of Phase 3 establishes the mental model for production inference that all later optimization work builds on. It directly supports Maku's StartupTribunal goals by clarifying how model calls will be structured inside the existing Next.js API routes. Getting this clear early prevents costly rework once real traffic patterns appear.

⏱ 40 min target · 📝 2 quiz Qs

Resources

Reading · Hugging Face
Text Generation Inference
Quickstart and architecture sections
25 min

Deliverable

journal entry with 3 concrete latency numbers measured against a local inference endpoint
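
One way to collect the three numbers is sketched below. It is a minimal sketch under stated assumptions, not a prescribed method: it assumes a Text Generation Inference server is running locally on port 8080 and exposes the `/generate` route as in the quickstart, and that the script runs on Node 18+ where `fetch` and `performance` are globals. The helper names (`percentile`, `timeCalls`, `generateOnce`, `report`), the prompt, the token count, and the sample size of 20 are all hypothetical choices made for illustration.

```typescript
// Nearest-rank percentile over latency samples in milliseconds.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  // Integer arithmetic before the division avoids float rounding in the rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Runs `call` n times, one after another, and returns wall-clock durations in ms.
async function timeCalls(
  call: () => Promise<unknown>,
  n: number
): Promise<number[]> {
  const durations: number[] = [];
  for (let i = 0; i < n; i++) {
    const start = performance.now();
    await call();
    durations.push(performance.now() - start);
  }
  return durations;
}

// Assumption: local TGI server started as in the quickstart, mapped to port 8080.
const ENDPOINT = "http://127.0.0.1:8080/generate";

// One non-streaming generation request; the prompt and token count are arbitrary.
async function generateOnce(): Promise<void> {
  const res = await fetch(ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      inputs: "What is deep learning?",
      parameters: { max_new_tokens: 20 },
    }),
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  await res.json(); // reading the body is part of the measured latency
}

// Prints three numbers for the journal: p50, p95, and the slowest request.
async function report(): Promise<void> {
  await generateOnce(); // warm-up request, discarded from the samples
  const ms = await timeCalls(generateOnce, 20);
  console.log({
    p50: percentile(ms, 50),
    p95: percentile(ms, 95),
    max: Math.max(...ms),
  });
}
```

With the server running, calling `report()` prints the three values. The requests are issued sequentially, so the numbers describe single-request latency at batch size 1. They also include the HTTP round trip and JSON parsing, not only model compute, which is worth noting in the journal entry next to the figures.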

Quiz · 2 questions

1. Which factor most directly increases p50 latency when batch size grows from 1 to 8?

GPU memory bandwidth saturation
Tokenizer vocabulary size
HTTP keep-alive timeout
React hydration cost

2. Name one common misconception when developers first measure inference latency in a web route handler.

Journal

Time spent (minutes)

Blockers

Commit / PR links (one per line)