DAY 5 / 210

Add LLM-Rubric Assertions for Soft Properties

Today's work moves evaluation from brittle string matching to model-graded checks, which scale to soft attributes such as marketing fluff that literal assertions miss. It lays the groundwork for trustworthy automated review in later phases of the eval arc.

⏱ 45 min target · 📝 3 quiz Qs

Resources

reading · arXiv
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Abstract, Section 1, Section 3
25 min
reading · GitHub
OpenAI Evals
README and model-graded eval examples
20 min

Deliverable

A commit that adds at least three new LLM-rubric assertions to the evaluation harness, with passing test output included.
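
As a starting point for the deliverable, here is a minimal sketch of what one LLM-rubric assertion could look like in Python. Everything named here (`Judge`, `RubricResult`, `PROMPT_TEMPLATE`, `llm_rubric`, `stub_judge`) is hypothetical and not taken from OpenAI Evals or any other harness; adapt it to whatever interface your harness exposes. The stub judge is a deterministic keyword check used only to exercise the plumbing; a real assertion would replace it with a call to a judge model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: a judge takes a complete prompt and returns raw reply text.
Judge = Callable[[str], str]


@dataclass
class RubricResult:
    passed: bool
    reason: str


# Assumed prompt wording; tune it for your own judge model.
PROMPT_TEMPLATE = (
    "You are grading a model output against a rubric.\n"
    "Rubric: {rubric}\n"
    "Output: {output}\n"
    "Answer with PASS or FAIL on the first line, then one sentence of reasoning."
)


def llm_rubric(output: str, rubric: str, judge: Judge) -> RubricResult:
    """Ask the judge to grade `output` against `rubric` and parse its verdict."""
    reply = judge(PROMPT_TEMPLATE.format(rubric=rubric, output=output)).strip()
    verdict, _, reason = reply.partition("\n")
    verdict = verdict.strip().upper()
    if verdict not in ("PASS", "FAIL"):
        # Treat unparseable replies as failures instead of silently passing.
        return RubricResult(False, f"unparseable judge reply: {reply!r}")
    return RubricResult(verdict == "PASS", reason.strip())


def stub_judge(prompt: str) -> str:
    """Deterministic stand-in for a real model call (assumption: tests only)."""
    graded = prompt.split("Output: ", 1)[1].split("\nAnswer with", 1)[0].lower()
    fluff = ("world-class", "revolutionary", "best-in-class")
    hits = [word for word in fluff if word in graded]
    if hits:
        return "FAIL\nContains unsupported superlatives: " + ", ".join(hits)
    return "PASS\nClaims are specific."


if __name__ == "__main__":
    rubric = "The text makes specific, verifiable claims and avoids superlatives."
    result = llm_rubric("Our revolutionary platform is world-class.", rubric, stub_judge)
    print(result.passed, result.reason)
```

Failing closed on an unparseable verdict is a deliberate choice in this sketch: a judge that ignores the requested format should surface as a test failure, not as a quiet pass.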

Quiz · 3 questions

1. Why do literal contains/equals checks fail for marketing-fluff detection?

They require exact token matches
They cannot capture semantic intent or tone
They run too slowly on large datasets
They only work with numeric outputs

2. Write a one-sentence model-graded rubric prompt that distinguishes substantive claims from marketing fluff.

3. What failure mode might arise if the judge LLM shares the same biases as the generator model?

Journal

Time spent (minutes)

Blockers

Commit / PR links (one per line)