DAY 93 / 210

Introduction to Efficient LLM Inference

Phase 3 shifts focus from training to serving models at scale. This day establishes core inference concepts so later optimizations can be measured against real bottlenecks in the existing Maku app stack.

⏱ 40 min target · 📝 2 quiz Qs

Resources

reading · arXiv
Efficient Memory Management for Large Language Model Serving with PagedAttention (the vLLM paper)
Abstract + Sections 1-3
25 min
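
As a warm-up for the reading, the toy model below compares how many KV-cache slots get reserved when every request pre-allocates room for the maximum sequence length versus when each request holds only fixed-size blocks it has actually filled, which is the idea behind PagedAttention. This is a minimal sketch that only counts slots: the function names, the request lengths, and the block size of 16 are illustrative assumptions, and nothing here calls vLLM or touches GPU memory.

```typescript
// Toy model only: counts KV-cache slots. Not vLLM code and not a real allocator.
const BLOCK_SIZE = 16; // tokens per block (assumed value for illustration)

// Contiguous scheme: every request reserves room for the maximum length up front.
function reservedContiguous(lengths: number[], maxLen: number): number {
  return lengths.length * maxLen;
}

// Paged scheme: each request holds only as many fixed-size blocks as it has filled,
// so waste is limited to the unused tail of its last block.
function reservedPaged(lengths: number[], blockSize: number = BLOCK_SIZE): number {
  return lengths.reduce(
    (sum, len) => sum + Math.ceil(len / blockSize) * blockSize,
    0,
  );
}

// Slots reserved but not holding any token.
function wastedSlots(reserved: number, lengths: number[]): number {
  const used = lengths.reduce((a, b) => a + b, 0);
  return reserved - used;
}

// Hypothetical request lengths (in tokens) and a hypothetical maximum length.
const lengths = [37, 120, 512, 9];
const maxLen = 2048;

const contiguousWaste = wastedSlots(reservedContiguous(lengths, maxLen), lengths);
const pagedWaste = wastedSlots(reservedPaged(lengths), lengths);

console.log({ contiguousWaste, pagedWaste });
```

With these made-up lengths, the contiguous scheme leaves 7514 slots idle while the paged scheme leaves 26. The exact numbers depend entirely on the assumed inputs; the point to carry into the reading is that paged waste is bounded by one block per request.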

Deliverable

Journal entry listing three inference bottlenecks observed in app/maku/BriefForm.tsx and app/api/maku/brief/route.ts
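
One way to ground the journal entry in numbers rather than impressions is to time the response as it arrives. The helper below is a hedged sketch: `timeStream` and `StreamTiming` are hypothetical names that do not exist in the Maku codebase, and it assumes the caller can expose the model output as an `AsyncIterable` of chunks. It records time to first chunk (a rough proxy for time to first token) and total duration.

```typescript
// Hypothetical helper for this exercise; not part of the Maku app.
interface StreamTiming {
  timeToFirstChunkMs: number | null; // null when the stream yields nothing
  totalMs: number;
  chunkCount: number;
}

// Consumes the stream and records when the first chunk and the end arrive.
// The clock is injectable so the helper can be tested deterministically.
async function timeStream<T>(
  stream: AsyncIterable<T>,
  now: () => number = () => Date.now(),
): Promise<StreamTiming> {
  const start = now();
  let firstChunkAt: number | null = null;
  let chunkCount = 0;

  for await (const _chunk of stream) {
    if (firstChunkAt === null) firstChunkAt = now();
    chunkCount += 1;
  }

  const end = now();
  return {
    timeToFirstChunkMs: firstChunkAt === null ? null : firstChunkAt - start,
    totalMs: end - start,
    chunkCount,
  };
}

// Usage with a stand-in stream; replace it with the real response chunks.
async function* demoStream(): AsyncGenerator<string> {
  yield "Hello";
  yield ", ";
  yield "world";
}

timeStream(demoStream()).then((timing) => console.log(timing));
```

A large gap between time to first chunk and total time points at generation speed, while a slow first chunk points at queueing, prompt processing, or network overhead. If the route returns the whole response at once instead of streaming, only `totalMs` is meaningful.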

Quiz · 2 questions

1. Which technique in vLLM primarily reduces memory fragmentation during LLM serving?

PagedAttention
Quantization
Speculative decoding
Tensor parallelism

2. Name one key difference between continuous batching and static batching that affects inference throughput.

Journal

Time spent (minutes)

Blockers

Commit / PR links (one per line)