Eval Discipline · Week 1 · Day 2/7
DAY 2 / 210

LLM-as-Judge Biases and Scaling

Zheng et al. describe three biases in LLM judges: position bias, verbosity bias, and self-enhancement bias. Reading the paper equips you to detect all three before you build production judges, and it directly informs how the existing VibeJudge scoring consumed by disqualification-engine will be replaced or augmented with systematic LLM evals.
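Position bias is the easiest of the three to probe in code. The sketch below, a minimal illustration rather than anything from the Tribunal codebase, calls a pairwise judge twice with the answer order swapped and keeps a win only when both calls agree, which is the conservative swap strategy discussed by Zheng et al. The names `Verdict`, `PairwiseJudge`, `judgeWithSwap`, and both toy judges are hypothetical; a real judge would call a model and would likely be asynchronous.

```typescript
// Hypothetical types: none of these names come from the Tribunal codebase.
type Verdict = 'A' | 'B' | 'tie';

// A pairwise judge sees two answers in order and names the better slot:
// 'A' means the first slot, 'B' means the second slot.
type PairwiseJudge = (question: string, first: string, second: string) => Verdict;

// Call the judge in both orders; keep a win only if it survives the swap.
function judgeWithSwap(
  judge: PairwiseJudge,
  question: string,
  answerA: string,
  answerB: string
): { verdict: Verdict; consistent: boolean } {
  const forward = judge(question, answerA, answerB);
  const raw = judge(question, answerB, answerA);
  // Map the swapped call's slot labels back onto the original answers.
  const backward: Verdict = raw === 'A' ? 'B' : raw === 'B' ? 'A' : 'tie';
  const consistent = forward === backward;
  // Inconsistent verdicts are downgraded to a tie instead of trusted.
  return { verdict: consistent ? forward : 'tie', consistent };
}

// Toy judge with pure position bias: always prefers the first slot.
const firstSlotJudge: PairwiseJudge = () => 'A';

// Toy judge that prefers the longer answer regardless of position.
const lengthJudge: PairwiseJudge = (_q, first, second) =>
  first.length === second.length ? 'tie' : first.length > second.length ? 'A' : 'B';

const biased = judgeWithSwap(firstSlotJudge, 'q', 'short', 'a longer answer');
const stable = judgeWithSwap(lengthJudge, 'q', 'short', 'a longer answer');

console.log(biased); // { verdict: 'tie', consistent: false }
console.log(stable); // { verdict: 'B', consistent: true }
```

The second toy judge passes the swap check even though it rewards length alone, so order swapping only addresses position bias; verbosity and self-enhancement bias need their own probes.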

45 min target · 3 quiz Qs · 2 code anchors

Resources

Codebase anchors

The Tribunal code that demonstrates today's concept.

lib/disqualification-engine.ts:L4 · VibeJudge

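This file consumes VibeJudge scores (it imports SubmissionScore from @/lib/vibejudge-client) rather than computing them. It is the existing judge-driven decision logic that will be extended or measured against the LLM-as-Judge patterns learned today.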

 1  /**
 2   * Disqualification Engine
 3   *
 4   * Evaluates VibeJudge scores post-analysis to determine whether a submission
 5   * should be disqualified from prize eligibility.
 6   *
 7   * Rules:
 8   * - Innovation score <= configurable threshold AND ai_detection indicates framework patterns → disqualified
 9   * - Additional rules can be added without changing the interface.
10   *
11   * Requirements: 6.1, 6.2, 6.4, 9.1, 9.2, 9.3
12   */
13
14  import type { SubmissionScore } from '@/lib/vibejudge-client';
15  import type { GitHistoryAnalysis } from '@/types/hackathon';
16
17  export interface DisqualificationResult {
18    disqualified: boolean;
19    reason?: string;
20    low_confidence?: boolean;
21    low_confidence_agents?: string[];
22  }
lib/disqualification-engine.ts:L30 · VibeJudge

evaluateDisqualification is the closest existing use of judge-style scoring, and the baseline we will measure against when introducing bias-aware LLM judges.

24  export interface DisqualificationConfig {
25    lowInnovationThreshold?: number; // default 1.0
26    lowConfidenceThreshold?: number; // default 0.20
27  }
28
29  /**
30   * Evaluate whether a submission should be disqualified based on VibeJudge scores
31   * and optional git history analysis.
32   *
33   * Current rules:
34   * 1. Innovation score <= configurable threshold (default 1.0) AND ai_detection
35   *    indicates framework patterns → disqualified
36   *
37   * Returns { disqualified: false } for all other cases.
38   */
39  export function evaluateDisqualification(
40    scores: SubmissionScore,
41    gitHistoryAnalysis?: GitHistoryAnalysis,
42    config?: DisqualificationConfig
43  ): DisqualificationResult {
44    // Rule 1: Low innovation + framework patterns in ai_detection
45    const innovationScore = scores.innovation_scorer.score;
46    const lowInnovationThreshold = config?.lowInnovationThreshold ?? 1.0;
47    const aiDetectionVerdict = scores.ai_detection.verdict.toLowerCase();
48    const aiDetectionEvidence = scores.ai_detection.evidence.map((e) => e.toLowerCase());
49
50    const frameworkIndicators = [
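Measuring a new judge against this baseline can start with a plain agreement rate over the same submissions, the same kind of statistic Zheng et al. use when comparing LLM judges with human raters. The sketch below assumes you have already collected the `disqualified` flag from the existing rules and from a candidate judge for each submission; `DecisionPair`, `AgreementReport`, and `compareDecisions` are hypothetical names, not part of lib/disqualification-engine.ts.

```typescript
// Hypothetical record shape, not part of lib/disqualification-engine.ts.
interface DecisionPair {
  submissionId: string;
  baseline: boolean;  // disqualified flag from the existing rules
  candidate: boolean; // disqualified flag from the new judge under test
}

interface AgreementReport {
  agreement: number;       // fraction of submissions where both flags match
  disagreements: string[]; // submission ids worth reviewing by hand
}

function compareDecisions(pairs: DecisionPair[]): AgreementReport {
  if (pairs.length === 0) {
    throw new Error('compareDecisions needs at least one pair');
  }
  // Collect the ids where the two decision sources disagree.
  const disagreements = pairs
    .filter((p) => p.baseline !== p.candidate)
    .map((p) => p.submissionId);
  return {
    agreement: (pairs.length - disagreements.length) / pairs.length,
    disagreements,
  };
}

// Illustrative data only; these ids and flags are made up.
const report = compareDecisions([
  { submissionId: 's1', baseline: true, candidate: true },
  { submissionId: 's2', baseline: false, candidate: false },
  { submissionId: 's3', baseline: false, candidate: true },
  { submissionId: 's4', baseline: false, candidate: false },
]);

console.log(report.agreement);     // 0.75
console.log(report.disagreements); // [ 's3' ]
```

A raw agreement rate says nothing about which side is right, so the disagreement list is the useful output: those are the submissions to inspect for length, ordering, or self-preference effects.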

Deliverable

Commit a 300-word bias-analysis note attached to lib/disqualification-engine.ts:4.

Quiz · 3 questions

1. Which bias causes an LLM judge to favor the first answer presented?

2. Name one concrete mitigation Zheng et al. suggest for position bias.

3. How might the current rules in disqualification-engine.ts, which act on VibeJudge scores, amplify verbosity bias?
