Methodology — StartupTribunal

Overview

StartupTribunal's catalog is built by a multi-LLM pipeline that ingests public news and feeds, ranks signals by country relevance + pain depth, extracts the underlying pain into a structured record, and synthesizes a startup idea + savage verdict + scoring. Every catalog row preserves its source URL, extracted raw quotes, and a content-hash of the inputs so any claim can be traced back to its evidence.

We publish this page so users can audit our claims and so engineers can see exactly how the system is wired. The system has known limits — they're listed under Honest caveats without hedging.

The pipeline

For each scheduled slot (10 Cloud Scheduler jobs covering a 24-hour rotation across the African, Tier 1 global, and Tier 2 global pools), the pipeline runs:

Source ingestion — Brave/Serper web search and curated RSS feeds return 5-40 candidate articles. Each candidate has a URL, title, snippet, and (when available) full body content.
Ranking — Signals are scored on pain intensity (vocabulary of concrete suffering vs marketing prose) and country relevance. The top 5 enter the gate.
Gate — A 4-rule country-relevance check rejects signals that mention the target country only in passing. Details under The country-relevance gate.
Web-search fallback — If no top-5 signal passes the gate, the pipeline runs a second targeted search before giving up. If still nothing, the slot is skipped — we do not publish a low-confidence row.
Pain extraction — Grok extracts a structured PainSignal (problem statement, raw verbatim quotes, audience, urgency) from the winning article. A multi-shot retry guards against malformed output.
Synthesis — Title, savage verdict, idea, scoring, and AI consumer price are all generated from the PainSignal. Each step is prompt-versioned.
Persistence — The row is written with full provenance: source_url, pain_source_url, raw_quotes[], factor_hash, and pain_signal_strength. Duplicate source_url hits short-circuit before paying for extraction.
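
The control flow of the steps above can be sketched as follows. This is a minimal illustration, not the orchestrator's actual code: the `Candidate` and `SlotDeps` shapes and the `runSlot` function are hypothetical names, the stages are kept synchronous for brevity, and the point at which duplicates are dropped is an assumption.

```typescript
// Hypothetical types for illustration; the real pipeline's types may differ.
interface Candidate {
  url: string;
  title: string;
  score: number; // combined pain-intensity and country-relevance score
}

interface SlotDeps {
  ingest: () => Candidate[]; // web search + RSS results
  fallbackSearch: () => Candidate[]; // second, targeted search
  passesGate: (c: Candidate) => boolean; // country-relevance gate
  alreadyPersisted: (url: string) => boolean; // duplicate source_url check
}

type SlotResult =
  | { status: "selected"; winner: Candidate; usedFallback: boolean }
  | { status: "skipped"; reason: string };

// Rank by score, keep the top 5, drop known URLs (assumed to happen here),
// and return the first candidate that passes the gate.
function pickWinner(cands: Candidate[], deps: SlotDeps): Candidate | undefined {
  return [...cands]
    .sort((a, b) => b.score - a.score)
    .slice(0, 5)
    .filter((c) => !deps.alreadyPersisted(c.url))
    .find((c) => deps.passesGate(c));
}

function runSlot(deps: SlotDeps): SlotResult {
  const primary = pickWinner(deps.ingest(), deps);
  if (primary) return { status: "selected", winner: primary, usedFallback: false };

  // Web-search fallback: one more targeted attempt before giving up.
  const fallback = pickWinner(deps.fallbackSearch(), deps);
  if (fallback) return { status: "selected", winner: fallback, usedFallback: true };

  // No low-confidence row is published; the slot is skipped instead.
  return { status: "skipped", reason: "no candidate passed the country gate" };
}
```

Returning a tagged `SlotResult` rather than throwing keeps "skipped" a normal, loggable outcome instead of an error path.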

Stack: Next.js 16 on Vercel (Fluid Compute, Node 24), GCP Cloud Run workers for catalog generation and tribunal judging, Cloud SQL Postgres 15, GCS for blob storage, LangGraph 1.x StateGraph for the gate orchestration, Grok / Gemini / Claude pluggable router, AWS Bedrock for VibeJudge agents.

Pipeline source: lib/spawnforge/orchestrator-cloud-complete.ts · lib/graphs/pain-discovery-graph.ts · cloud-run/auto-catalog-generator/src/server.ts

The country-relevance gate

The hardest problem in country-tagged catalog generation is the "passing mention" failure: an article about U.S. AI policy mentions Kenya once and gets tagged KE. Four layered rules guard against this:

Rule 1 — Target in title (strong accept)

If any of the target country's natural-language keywords (name, demonym, capital, major cities — from REGION_META) appears in the article title, accept immediately. This is the highest-confidence signal: a journalist who put the country in the headline meant it as the subject.

Rule 2 — Competing country in title (strong reject)

If a different country's keywords appear in the title without the target's, reject. The journalist named a different subject. This prevents, for example, an article about Saudi Arabia from landing in a Kenya slot.

Rule 2.5 — Body dominance ratio (3:1)

When the title is location-neutral, we count keyword occurrences in the body. If any competing country's keywords appear at least 3 times as often as the target's, reject. This rule was added in May 2026 after an AO slot accepted an article whose body mentioned "Angola" once but "South Africa" five times.

Rule 3 — Body presence (last-resort accept)

With a location-neutral title and no body-dominance failure, accept if the target's keywords appear at least once in the body. The tightening in Rules 2 and 2.5 is what makes this last-resort accept safe.

Implementation: lib/graphs/pain-discovery-nodes.ts:399 · tests: __tests__/lib/graphs/mentions-target-country.test.ts · country keywords: REGION_META in lib/auto-catalog/types.ts
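
The four rules above can be sketched as a single function. This is an illustration under stated assumptions, not the code in pain-discovery-nodes.ts: the `CountryKeywords` table stands in for REGION_META (whose real shape is not shown here), `gate` and `countMentions` are hypothetical names, and whole-word, case-insensitive matching on ASCII keywords is an assumption.

```typescript
// Hypothetical stand-in for REGION_META: ISO code -> natural-language keywords.
type CountryKeywords = Record<string, string[]>;

interface Article {
  title: string;
  body: string;
}

interface GateDecision {
  accept: boolean;
  rule: "1" | "2" | "2.5" | "3"; // the rule that decided the outcome
}

// Count whole-word, case-insensitive occurrences of any keyword in the text.
function countMentions(text: string, keywords: string[]): number {
  let total = 0;
  for (const kw of keywords) {
    const escaped = kw.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    const matches = text.match(new RegExp(`\\b${escaped}\\b`, "gi"));
    total += matches ? matches.length : 0;
  }
  return total;
}

function gate(article: Article, target: string, meta: CountryKeywords): GateDecision {
  const targetKw = meta[target] ?? [];
  const competitors = Object.keys(meta).filter((code) => code !== target);

  // Rule 1: target in title -> strong accept.
  if (countMentions(article.title, targetKw) > 0) return { accept: true, rule: "1" };

  // Rule 2: a competing country in the title (target absent) -> strong reject.
  if (competitors.some((c) => countMentions(article.title, meta[c] ?? []) > 0)) {
    return { accept: false, rule: "2" };
  }

  // Rule 2.5: any competitor mentioned at least 3x as often as the target -> reject.
  const targetCount = countMentions(article.body, targetKw);
  const dominated = competitors.some((c) => {
    const n = countMentions(article.body, meta[c] ?? []);
    return n > 0 && n >= 3 * targetCount;
  });
  if (dominated) return { accept: false, rule: "2.5" };

  // Rule 3: location-neutral title, no dominance -> accept on any body mention.
  return { accept: targetCount > 0, rule: "3" };
}
```

With a body that mentions "South Africa" five times and "Angola" once, the ratio is 5:1, which meets the 3:1 threshold, so an AO slot rejects the article under Rule 2.5.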

The gate has been iterated three times based on production incidents (LR / ZA / TG / NG / AO each surfaced a class of failure that informed the rules above). When the gate rejects every top-5 candidate, the slot is skipped, not soft-published.

Pain extraction + provenance

Once a signal passes the gate, Grok extracts a structured PainSignal with these fields:

problem_statement — multi-sentence framing
raw_quotes[] — verbatim quotes pulled from the source
audience — the affected group (e.g. "Kenyan SMEs")
urgency — qualitative scale
pain_level — numeric (1–10)
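
As a rough sketch, the fields above map onto a type plus a validator along these lines. The field names come from the list; the validation rules (non-empty strings, at least one quote) and the function names are assumptions, and the allowed labels for `urgency` are left open because they are not listed here.

```typescript
interface PainSignal {
  problem_statement: string;
  raw_quotes: string[];
  audience: string;
  urgency: string; // qualitative scale; allowed labels are not specified in this document
  pain_level: number; // 1 to 10
}

function isNonEmptyString(v: unknown): v is string {
  return typeof v === "string" && v.trim().length > 0;
}

// Returns a list of validation errors; an empty list means the value is usable.
function validatePainSignal(value: unknown): string[] {
  if (typeof value !== "object" || value === null) return ["value is not an object"];
  const v = value as Record<string, unknown>;
  const errors: string[] = [];

  for (const field of ["problem_statement", "audience", "urgency"]) {
    if (!isNonEmptyString(v[field])) errors.push(`${field} must be a non-empty string`);
  }

  // Assumption: a signal without at least one verbatim quote is not traceable.
  const quotes = v["raw_quotes"];
  if (!Array.isArray(quotes) || quotes.length === 0 || !quotes.every((q) => isNonEmptyString(q))) {
    errors.push("raw_quotes must be a non-empty array of strings");
  }

  const level = v["pain_level"];
  if (typeof level !== "number" || !Number.isFinite(level) || level < 1 || level > 10) {
    errors.push("pain_level must be a number from 1 to 10");
  }
  return errors;
}

// Type guard built on the validator, convenient for retry loops.
function isPainSignal(value: unknown): value is PainSignal {
  return validatePainSignal(value).length === 0;
}
```

Returning the error list, rather than a bare boolean, lets a retry prompt tell the model exactly which field was malformed.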

The extraction call is wrapped by extractPainWithRetry, which retries up to N times on schema-validation failure. A retry-fallback country-revalidation check guards against the "Frankenstein row" failure mode where a fallback URL won the extraction but the source provenance still pointed at the primary URL that originally failed.

Implementation: lib/spawnforge/extract-pain-with-retry.ts · lib/spawnforge/retry-country-revalidation.ts
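
The retry and revalidation behavior described above could look roughly like this. It is a sketch under assumptions, not the real extractPainWithRetry: `extractWithFallback`, `Source`, and the callback parameters are hypothetical, and the gate is re-run on a fallback before extraction here, whereas the production code may re-run it after the fallback wins.

```typescript
// Hypothetical shapes for illustration only.
interface Source {
  url: string;
  body: string;
}

interface ExtractionResult<T> {
  pain: T;
  pain_source_url: string; // always the URL the pain was actually extracted from
  attempts: number;
}

async function extractWithFallback<T>(
  sources: Source[], // primary first, then fallbacks in priority order
  extract: (src: Source) => Promise<unknown>, // one LLM call
  isValid: (raw: unknown) => raw is T, // schema validation
  passesCountryGate: (src: Source) => boolean,
  maxAttempts: number,
): Promise<ExtractionResult<T> | null> {
  let attempts = 0;
  for (let i = 0; i < sources.length; i++) {
    const src = sources[i];
    // Revalidate every fallback against its own content, so a fallback that is
    // about a different country can never win the extraction.
    if (i > 0 && !passesCountryGate(src)) continue;

    for (let n = 0; n < maxAttempts; n++) {
      attempts++;
      try {
        const raw = await extract(src);
        // Provenance is taken from the source that produced the valid output.
        if (isValid(raw)) return { pain: raw, pain_source_url: src.url, attempts };
      } catch {
        // Treat a thrown error like malformed output and try again.
      }
    }
  }
  return null; // nothing usable; the caller skips the slot
}
```

Because `pain_source_url` is set from the same loop variable that produced the valid output, the provenance cannot point at a URL that failed earlier.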

Every persisted catalog row carries:

source_url — the article that triggered the row
pain_source_url — the article the pain was actually extracted from (may differ from source_url after fallback)
pain_extracted — the full PainSignal JSON, including raw_quotes
pain_signal_strength — confidence score
factor_hash — SHA hash of the input factors so we can detect prompt drift

This is the audit trail. Any claim in a catalog row can be traced back to its source URL and its extracted raw quotes.
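
A hash of the input factors only detects drift if the same factors always serialize to the same bytes. The sketch below shows one way to get that: recursive key-sorted serialization. The function name and the factor names in the usage note are hypothetical, and the document does not state which SHA variant or serialization the pipeline actually uses.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Serialize with object keys sorted recursively, so two factor objects that
// differ only in key order produce an identical string (and thus an identical hash).
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map((v) => canonicalize(v)).join(",")}]`;
  const keys = Object.keys(value).sort();
  return `{${keys.map((k) => `${JSON.stringify(k)}:${canonicalize(value[k])}`).join(",")}}`;
}
```

For example, `canonicalize({ prompt_version: "v3", model: "grok" })` and `canonicalize({ model: "grok", prompt_version: "v3" })` return the same string. Feeding that string to a SHA-256 digest (for instance Node's `createHash("sha256")` from `node:crypto`) then gives a value that changes only when a factor changes.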

Anti-hallucination measures

LLMs hallucinate. We don't pretend otherwise. Four measures reduce hallucination at the cost of more code and more LLM calls:

Title sanitizer — Grok occasionally returns **"Title"**\n(reasoning) instead of just the title. The sanitizer strips markdown wrapping, surrounding quotes, and trailing rationale parentheticals. lib/spawnforge/title-prompt.ts
ISO-code ban — Grok shortens "French Fishing And Farming" to "FR Fishing And Farming". The title prompt now forbids the ISO code and lists acceptable natural-language forms per country.
Country body-dominance check — see Rule 2.5 above. A 3:1 ratio rejects articles that mention the target country only in passing while really being about a competing country.
Retry-fallback revalidation — when the primary URL fails and a fallback wins, we re-run the country gate against the fallback's actual content to prevent Frankenstein rows.
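
To make the first measure concrete, here is a minimal sketch of a title sanitizer that handles the three cases named above (markdown wrapping, surrounding quotes, trailing rationale). It is not the code in title-prompt.ts; `sanitizeTitle` is a hypothetical name and the exact patterns are assumptions. One known trade-off of this approach: a legitimate title that ends in a parenthetical would also lose it.

```typescript
function sanitizeTitle(raw: string): string {
  // Keep only the first non-empty line; rationale usually follows on later lines.
  let title =
    raw
      .split(/\r?\n/)
      .map((line) => line.trim())
      .find((line) => line.length > 0) ?? "";

  // Drop a trailing parenthetical such as "(chosen because ...)".
  title = title.replace(/\s*\([^()]*\)\s*$/, "");

  // Peel heading markers, emphasis markers, and surrounding quotes.
  // Repeat until stable, because the wrappers can nest (e.g. **"Title"**).
  let previous: string;
  do {
    previous = title;
    title = title
      .replace(/^#{1,6}\s+/, "")
      .replace(/^(\*\*|__|\*|_)(.*)\1$/, "$2")
      .replace(/^["'“‘](.*)["'”’]$/, "$1")
      .trim();
  } while (title !== previous);

  return title;
}
```

Looping until the string stops changing avoids having to guess the order in which the model nested quotes and emphasis.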

Acknowledged gap: there is no second-LLM verification step on the extracted pain ("given this article, quote the supporting sentence for this claim, or reject"). This is planned. See Honest caveats.

Hackathon judging (VibeJudge)

Hackathon submissions are scored by VibeJudge, an AWS Lambda service that runs five Bedrock-backed agents in parallel:

bug_hunter — Nova Lite. File:line citations for security and correctness issues.
innovation — assesses originality vs prior art.
performance — runtime and bundle-size review.
code_quality — architecture and idiom check.
ai_detection — flags submissions that look entirely AI-generated.

Each agent returns a Pydantic-validated structured score. The final submission score is a weighted aggregate. The repository is not retained — VibeJudge clones, analyzes, and discards.
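
The weighted aggregate can be illustrated as below. VibeJudge itself validates scores with Pydantic, so this TypeScript version only shows the arithmetic. The weights, the 0 to 100 score range, and the function name are hypothetical; the real values are not published on this page.

```typescript
type AgentName = "bug_hunter" | "innovation" | "performance" | "code_quality" | "ai_detection";
type PerAgent = Record<AgentName, number>;

// Hypothetical weights, chosen only for illustration.
const EXAMPLE_WEIGHTS: PerAgent = {
  bug_hunter: 0.25,
  innovation: 0.25,
  performance: 0.2,
  code_quality: 0.2,
  ai_detection: 0.1,
};

// Weighted mean of the agent scores. Dividing by the weight sum means the
// weights do not have to add up to exactly 1.
function weightedScore(scores: PerAgent, weights: PerAgent): number {
  let total = 0;
  let weightSum = 0;
  for (const name of Object.keys(weights) as AgentName[]) {
    total += scores[name] * weights[name];
    weightSum += weights[name];
  }
  if (weightSum <= 0) throw new Error("weights must sum to a positive number");
  return total / weightSum;
}
```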

Source: github.com/ma-za-kpe/vibejudge-kiro

Eval discipline

Every catalog generation logs to gate-decision traces with countryRelevant / webSearchAttempted / region so production failures can be diagnosed without re-running the pipeline. Each gate rule has a regression test that pins an exact production incident.
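
A gate-decision trace of this kind might be emitted as one JSON object per line, as sketched below. The three fields `countryRelevant`, `webSearchAttempted`, and `region` are the ones named above; the `event`, `rule`, `sourceUrl`, and `at` fields and the function name are hypothetical additions for illustration.

```typescript
interface GateTrace {
  region: string; // target country code for the slot
  countryRelevant: boolean; // final gate outcome
  webSearchAttempted: boolean; // whether the fallback search ran
  rule?: string; // hypothetical: which rule decided
  sourceUrl?: string; // hypothetical: the candidate that was judged
  at: string; // hypothetical: ISO 8601 timestamp
}

// One trace per line keeps the log greppable and easy to load into a query tool.
function formatGateTrace(trace: GateTrace): string {
  return JSON.stringify({ event: "gate-decision", ...trace });
}
```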

A separate audit script (scripts/audit-recent-catalog-country.ts) walks the most recent N catalog rows and reports country-tag mismatches. The script exits non-zero on any mismatch — usable as a CI smoke gate.
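
The audit's core logic is small enough to sketch. This is not the contents of audit-recent-catalog-country.ts: the row shape, the function name, and the injected `isCountryRelevant` check are assumptions, and the function returns an exit code rather than exiting so it stays testable.

```typescript
// Hypothetical row shape; the real catalog schema has more fields.
interface AuditRow {
  id: string;
  country: string; // ISO code the row was tagged with
  title: string;
  created_at: string; // ISO 8601 timestamp
}

interface AuditReport {
  checked: number;
  mismatches: AuditRow[];
  exitCode: 0 | 1;
}

function auditRecentRows(
  rows: AuditRow[],
  n: number,
  isCountryRelevant: (row: AuditRow) => boolean,
): AuditReport {
  // ISO 8601 timestamps in the same format sort correctly as plain strings.
  const recent = [...rows]
    .sort((a, b) => (a.created_at < b.created_at ? 1 : a.created_at > b.created_at ? -1 : 0))
    .slice(0, n);
  const mismatches = recent.filter((row) => !isCountryRelevant(row));
  // Any mismatch yields a non-zero exit code, so CI fails the run.
  return { checked: recent.length, mismatches, exitCode: mismatches.length > 0 ? 1 : 0 };
}
```

A thin CLI wrapper would print `mismatches` and pass `exitCode` to the process exit call.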

Beyond regex-grade tests, an LLM-based eval harness using promptfoo is on the roadmap. Phase 1 of the founder's personal syllabus at /syllabus is dedicated entirely to building this out.

Honest caveats

We do not claim perfection. The system has known gaps. They are listed here so users can weight catalog rows accordingly.

Single source per row. Each catalog claim comes from one article. Tier-1 research institutions (McKinsey, CB Insights, Bain) require triangulation from two or more independent sources before publishing. We do not do this yet. Triangulation is on the Phase 1 roadmap at /syllabus.
No source-tier floor. Any URL that Brave/Serper returns can become a row. We do not whitelist national-paper-grade outlets over SEO content farms.
No second-LLM verification. The pain extraction is trusted verbatim. A RAG-as-validator step ("given this article, quote the supporting sentence") would cut a class of subtle hallucinations and is planned.
Catalog quality varies by country. Tier 1 (US, UK, DE, FR, IN, SG) has dense, high-quality news coverage. Tier 2 (BR, MX, AR) is patchier. African long-tail (SL, GW, KM) is the thinnest — when we can't find a country-relevant signal, we skip the slot rather than publish a Frankenstein row, but this means some smaller markets get less coverage.
No human review for routine rows. The admin catalog editor exists at /admin/catalog for spot-correcting individual rows but routine generation is unsupervised.

Methodology changelog

2026-05 — Body dominance ratio (Rule 2.5). Added after AO/ZA passing-mention incident.
2026-05 — Retry-fallback country revalidation. Closes the Frankenstein-row failure mode.
2026-05 — Title prompt: ban ISO codes. Stops the "FR Fishing And Farming" class of titles.
2026-05 — LangGraph mergeIntoState refactor. Stops silent channel loss from spread-based state returns.
2026-05 — Country-leading title rule. Lead with affected country, not antagonist.
2026-04 — Slot-skip on gate failure. Stops the "degraded floor" from silently shipping low-confidence rows.
2026-04 — Title sanitizer. Strip Grok rationale and markdown wrapping from titles.

Full commit history at github.com/ma-za-kpe/CROWN/commits/main.