DAY 2 / 210

LLM-as-Judge Biases and Scaling

Reading Zheng et al. equips you to detect position, verbosity, and self-enhancement biases before you build production judges. This directly informs how the VibeJudge scoring consumed by lib/disqualification-engine.ts will be replaced or augmented with systematic LLM evals.
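
One way to make "detect position bias" concrete is to run a pairwise judge twice per pair, once in each presentation order, and count how often it picks the same answer. The following is a minimal sketch under stated assumptions: `Verdict`, `PairwiseJudge`, `Pair`, `positionConsistency`, and both toy judges are hypothetical names that do not exist in the Tribunal codebase, and the judge is kept synchronous for brevity even though a real LLM call would be asynchronous.

```typescript
// Hypothetical types for illustration; none of these come from Tribunal.
type Verdict = 'A' | 'B' | 'tie';

// A pairwise judge: receives two answers in presentation order and reports
// which slot won ('A' = first slot, 'B' = second slot).
// Assumption: synchronous here; a real judge would wrap an async LLM call.
type PairwiseJudge = (question: string, first: string, second: string) => Verdict;

interface Pair {
  question: string;
  answerA: string;
  answerB: string;
}

// Map a verdict from the swapped presentation back to the original labels.
function unswap(v: Verdict): Verdict {
  if (v === 'A') return 'B';
  if (v === 'B') return 'A';
  return 'tie';
}

// Fraction of pairs on which the judge picks the same answer in both orders.
function positionConsistency(judge: PairwiseJudge, pairs: Pair[]): number {
  if (pairs.length === 0) return 1;
  let consistent = 0;
  for (const p of pairs) {
    const forward = judge(p.question, p.answerA, p.answerB);
    const backward = unswap(judge(p.question, p.answerB, p.answerA));
    if (forward === backward) consistent += 1;
  }
  return consistent / pairs.length;
}

// Toy judges, for illustration only.
// Always picks whatever is shown first: maximally position-biased.
const alwaysFirst: PairwiseJudge = () => 'A';
// Picks the longer answer regardless of slot: position-consistent but verbose-biased.
const prefersLonger: PairwiseJudge = (_q, first, second) =>
  first.length === second.length ? 'tie' : first.length > second.length ? 'A' : 'B';
```

Note that the `prefersLonger` toy judge is fully consistent across orderings while still rewarding length alone, so a high consistency rate says nothing about verbosity bias; that needs its own check.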

⏱ 45 min target · 📝 3 quiz Qs · 🔗 2 code anchors

Resources

reading · arXiv
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Sections 1-4 and Appendix A
35 min

Codebase anchors

Tribunal code that today's concept applies to.

lib/disqualification-engine.ts:L4 · VibeJudge

This module consumes VibeJudge scores rather than producing them; its disqualification rules are the logic that will be extended or measured against the LLM-as-Judge patterns learned today.

 1  /**
 2   * Disqualification Engine
 3   *
 4   * Evaluates VibeJudge scores post-analysis to determine whether a submission
 5   * should be disqualified from prize eligibility.
 6   *
 7   * Rules:
 8   * - Innovation score <= configurable threshold AND ai_detection indicates framework patterns → disqualified
 9   * - Additional rules can be added without changing the interface.
10   *
11   * Requirements: 6.1, 6.2, 6.4, 9.1, 9.2, 9.3
12   */
13
14  import type { SubmissionScore } from '@/lib/vibejudge-client';
15  import type { GitHistoryAnalysis } from '@/types/hackathon';
16
17  export interface DisqualificationResult {
18    disqualified: boolean;
19    reason?: string;
20    low_confidence?: boolean;
21    low_confidence_agents?: string[];
22  }
23
24  export interface DisqualificationConfig {

lib/disqualification-engine.ts:L30 · VibeJudge

This is the closest existing usage of judge-style scoring we will measure against when introducing bias-aware LLM judges.

24  export interface DisqualificationConfig {
25    lowInnovationThreshold?: number;  // default 1.0
26    lowConfidenceThreshold?: number;  // default 0.20
27  }
28
29  /**
30   * Evaluate whether a submission should be disqualified based on VibeJudge scores
31   * and optional git history analysis.
32   *
33   * Current rules:
34   * 1. Innovation score <= configurable threshold (default 1.0) AND ai_detection
35   *    indicates framework patterns → disqualified
36   *
37   * Returns { disqualified: false } for all other cases.
38   */
39  export function evaluateDisqualification(
40    scores: SubmissionScore,
41    gitHistoryAnalysis?: GitHistoryAnalysis,
42    config?: DisqualificationConfig
43  ): DisqualificationResult {
44    // Rule 1: Low innovation + framework patterns in ai_detection
45    const innovationScore = scores.innovation_scorer.score;
46    const lowInnovationThreshold = config?.lowInnovationThreshold ?? 1.0;
47    const aiDetectionVerdict = scores.ai_detection.verdict.toLowerCase();
48    const aiDetectionEvidence = scores.ai_detection.evidence.map((e) => e.toLowerCase());
49
50    const frameworkIndicators = [
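
Because Rule 1 compares a single innovation score against a hard threshold (`score <= lowInnovationThreshold`, default 1.0), any judge noise near that threshold can flip the outcome. The sketch below is one hedged way to probe this for the bias-analysis note: it assumes the innovation score can be re-sampled several times (for example with reordered or paraphrased inputs) and checks whether the threshold comparison holds across all samples. `ThresholdStability` and `classifyThresholdStability` are hypothetical names and are not part of lib/disqualification-engine.ts.

```typescript
// Hypothetical helper for analysis only; not part of the Tribunal engine.
type ThresholdStability = 'always_triggers' | 'never_triggers' | 'unstable';

// Classify repeated innovation scores against the rule `score <= threshold`.
// The default of 1.0 mirrors the documented default of lowInnovationThreshold.
function classifyThresholdStability(
  scores: number[],
  threshold: number = 1.0
): ThresholdStability {
  if (scores.length === 0) {
    throw new Error('classifyThresholdStability needs at least one score');
  }
  // Count how many samples would satisfy the low-innovation condition.
  const hits = scores.filter((s) => s <= threshold).length;
  if (hits === scores.length) return 'always_triggers';
  if (hits === 0) return 'never_triggers';
  // Some samples fall on each side: the rule outcome depends on judge noise.
  return 'unstable';
}
```

An `'unstable'` result would be a natural candidate for the existing `low_confidence` flag on `DisqualificationResult`, though whether the engine should use it that way is an open design question rather than current behavior.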

Deliverable

Commit a 300-word bias-analysis note attached to disqualification-engine.ts:4

Quiz · 3 questions

1. Which bias causes an LLM judge to favor the first answer presented?

   a. verbosity bias
   b. position bias
   c. self-enhancement bias
   d. length bias

2. Name one concrete mitigation Zheng et al. suggest for position bias.

3. How might the current VibeJudge rules in disqualification-engine.ts amplify verbosity bias?

Journal

Time spent (minutes)

Blockers

Commit / PR links (one per line)