LLM-as-Judge Biases and Scaling
Reading Zheng et al. equips you to detect position, verbosity, and self-enhancement biases before you build production judges. It also informs how the VibeJudge scores consumed by disqualification-engine.ts will be replaced or augmented with systematic LLM evals.
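To make position bias concrete, the sketch below judges a pair of answers twice, once in each order, and accepts a win only when both orderings agree. This swap-and-compare check is one of the mitigations discussed by Zheng et al. All names here (Verdict, PairwiseJudge, judgeWithSwap, and the two stub judges) are hypothetical and do not come from the Tribunal codebase; a real judge would wrap an LLM call rather than a synchronous function.

```typescript
// Hypothetical types: none of these names come from the Tribunal codebase.
// "A" means the answer shown first won, "B" means the answer shown second won.
type Verdict = "A" | "B" | "tie";

// A pairwise judge. In practice this would wrap an LLM call; here it is any
// synchronous function so the sketch stays self-contained.
type PairwiseJudge = (question: string, first: string, second: string) => Verdict;

interface SwapResult {
  verdict: Verdict;            // verdict in terms of answerA / answerB
  positionConsistent: boolean; // false when the verdict flipped with the order
}

// Judge the pair in both orders. Disagreement between the two orderings is
// treated as a tie, which is the conservative choice.
function judgeWithSwap(
  judge: PairwiseJudge,
  question: string,
  answerA: string,
  answerB: string
): SwapResult {
  const forward = judge(question, answerA, answerB);
  const reversedRaw = judge(question, answerB, answerA);
  // Map the reversed verdict back onto the original A/B labels.
  const reversed: Verdict =
    reversedRaw === "A" ? "B" : reversedRaw === "B" ? "A" : "tie";

  if (forward === reversed) {
    return { verdict: forward, positionConsistent: true };
  }
  return { verdict: "tie", positionConsistent: false };
}

// Stub judge with pure position bias: always prefers whatever is shown first.
const firstPositionJudge: PairwiseJudge = () => "A";

// Stub judge that prefers the longer answer regardless of position.
const longerAnswerJudge: PairwiseJudge = (_q, first, second) =>
  first.length === second.length ? "tie" : first.length > second.length ? "A" : "B";
```

With firstPositionJudge the two orderings disagree, so the result is a tie flagged as position-inconsistent. With longerAnswerJudge the verdict survives the swap, which shows that passing this check says nothing about verbosity bias; that needs a separate probe.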
Resources
- 35 min
Codebase anchors
The Tribunal code that demonstrates today's concept.
This is the existing VibeJudge implementation whose scoring logic will be extended or measured against LLM-as-Judge patterns learned today.
    /**
     * Disqualification Engine
     *
     * Evaluates VibeJudge scores post-analysis to determine whether a submission
     * should be disqualified from prize eligibility.
     *
     * Rules:
     * - Innovation score <= configurable threshold AND ai_detection indicates framework patterns → disqualified
     * - Additional rules can be added without changing the interface.
     *
     * Requirements: 6.1, 6.2, 6.4, 9.1, 9.2, 9.3
     */

    import type { SubmissionScore } from '@/lib/vibejudge-client';
    import type { GitHistoryAnalysis } from '@/types/hackathon';

    export interface DisqualificationResult {
      disqualified: boolean;
      reason?: string;
      low_confidence?: boolean;
      low_confidence_agents?: string[];
    }

    export interface DisqualificationConfig {
      lowInnovationThreshold?: number; // default 1.0
      lowConfidenceThreshold?: number; // default 0.20
    }

This is the closest existing usage of judge-style scoring we will measure against when introducing bias-aware LLM judges.

    /**
     * Evaluate whether a submission should be disqualified based on VibeJudge scores
     * and optional git history analysis.
     *
     * Current rules:
     * 1. Innovation score <= configurable threshold (default 1.0) AND ai_detection
     *    indicates framework patterns → disqualified
     *
     * Returns { disqualified: false } for all other cases.
     */
    export function evaluateDisqualification(
      scores: SubmissionScore,
      gitHistoryAnalysis?: GitHistoryAnalysis,
      config?: DisqualificationConfig
    ): DisqualificationResult {
      // Rule 1: Low innovation + framework patterns in ai_detection
      const innovationScore = scores.innovation_scorer.score;
      const lowInnovationThreshold = config?.lowInnovationThreshold ?? 1.0;
      const aiDetectionVerdict = scores.ai_detection.verdict.toLowerCase();
      const aiDetectionEvidence = scores.ai_detection.evidence.map((e) => e.toLowerCase());

      const frameworkIndicators = [

Deliverable
Commit a 300-word bias-analysis note attached to disqualification-engine.ts:4
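One way to gather evidence for the note is a simple verbosity probe: correlate the length of each judged text with the score it received. The sketch below is a minimal illustration under stated assumptions. The JudgedSample shape and both function names are hypothetical, and how you obtain the text and score for each submission from VibeJudge is left open because the excerpt above does not show it.

```typescript
// Hypothetical record shape for the probe; not taken from the Tribunal codebase.
interface JudgedSample {
  text: string;  // the submission or answer text shown to the judge
  score: number; // the score the judge assigned to it
}

// Pearson correlation between two equal-length numeric series.
// Returns 0 when either series has no variance, where correlation is undefined.
function pearson(xs: number[], ys: number[]): number {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("need two series of equal length >= 2");
  }
  const mean = (v: number[]): number => v.reduce((a, b) => a + b, 0) / v.length;
  const mx = mean(xs);
  const my = mean(ys);
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < xs.length; i++) {
    const dx = xs[i] - mx;
    const dy = ys[i] - my;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  if (varX === 0 || varY === 0) return 0;
  return cov / Math.sqrt(varX * varY);
}

// Correlate word count with judge score across a batch of samples.
function lengthScoreCorrelation(samples: JudgedSample[]): number {
  const lengths = samples.map((s) => s.text.split(/\s+/).filter(Boolean).length);
  const scores = samples.map((s) => s.score);
  return pearson(lengths, scores);
}
```

A strongly positive value is a signal to investigate, not proof of verbosity bias, since longer submissions may simply be better. Comparing scores for the same content at different lengths is a stronger test.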
Quiz · 3 questions
1. Which bias causes an LLM judge to favor the first answer presented?
2. Name one concrete mitigation Zheng et al. suggest for position bias.
3. How might the current VibeJudge rules in disqualification-engine.ts amplify verbosity bias?