DAY 3 / 210

Designing your first eval set

Data shape precedes code; selecting representative rows forces explicit criteria before any automation. This day locks the foundation for all later triangulation and graph work by making the target distribution visible.

⏱ 45 min target📝 2 quiz Qs🔗 2 code anchors

Resources

readingpromptfoo.dev
Getting Started
full page
20 min

Codebase anchors

The Tribunal code that demonstrates today's concept. Click the line to open in GitHub or VS Code.

lib/maku/curriculum-arc.ts:L55audit-recent-catalog-country

GitHub ↗VS Code ↗

this is the existing pattern for Week 1 first eval that you will now populate with the concrete 10-row dataset

⚠️ This anchor needs updating — the file or line is no longer reachable.

lib/maku/curriculum-arc.ts:L65audit-recent-catalog-country

GitHub ↗VS Code ↗

this is the closest existing usage of eval-related milestones we will measure the new 10-row set against

⚠️ This anchor needs updating — the file or line is no longer reachable.

Deliverable

10-row eval set draft saved as eval-set.json in branch eval-set-v1

Quiz · 2 questions

1. Why select rows before writing any scorer code?

To make the target distribution explicitBecause the model needs training dataTo satisfy linter rulesTo reduce token count

2. List two concrete signals that would make a StartupTribunal row representative of high-trust pain.

Journal

Time spent (minutes)

Blockers

Commit / PR links (one per line)