# TeamForge: Judge Psychology & Upgrade Strategy
# Internal reference, not shipped to users

## STEP 1: HOW JUDGES EVALUATE UNDER TIME PRESSURE

### What creates instant impression (first 10 seconds)
- A README that leads with a RESULT, not a description
  → "llama3-70b scores 0.78, gemma2 scores 0.35" beats "environment for coding agents"
- A demo that runs with ONE command and looks beautiful
- A leaderboard table: signals real benchmark, not toy
- Comparison to known benchmarks (SWE-bench, HumanEval) shows awareness

### What causes skips/rejections
- "This is an environment for...": sounds like boilerplate
- No results shown: judge can't evaluate impact without numbers
- Giant walls of architecture text before showing what it does
- Tests that are obviously trivial or fake
- No evidence it actually runs

### What creates "this is different"
- Unexpected depth: e.g. AST-based anti-exploit detection
- Research finding stated as a fact: "r=0.94 correlation between model size and Hard task"
- Dense reward explained with a real formula, not just described
- Git sandbox = real isolation = production thinking
- Phase-by-phase logs that look like actual agent traces

### How TeamForge exploits these biases
✓ Leaderboard in README section 1
✓ "python demo.py" = one command
✓ Pre-computed results with a key finding
✓ Anti-exploit section signals rigor
✓ Comparison table vs HumanEval/SWE-bench

## STEP 2: REMAINING GAPS (brutally honest)

STILL HACKATHON-LEVEL:
1. No paper-style "Findings" section with actual analysis
2. Hard task has no "unexpected twist": it's just an algorithm problem
3. Bonus task exists in file but isn't wired into benchmark
4. app.py Gradio demo is barebones: no live agent trace display
5. No per-action timing analysis (latency profiling)
6. No failure mode analysis (what do models fail on specifically?)
7. README findings section needs real statistical language
8. No .env.example file: friction for new users
9. No CONTRIBUTING.md: signals solo hack, not real project
10. No task difficulty validation (empirical pass rates)

## STEP 3: WOW MOMENTS TO ENGINEER

1. LIVE PHASE TICKER in demo: watch phases light up one by one
2. FAILURE ANALYSIS card: "models fail most on: final chunk edge case"
3. ANTI-EXPLOIT DEMO: show what happens when agent tries to trivialize tests
4. CORRELATION FINDING: scatter plot text showing model size vs score
5. ARTIFACT VIEWER: show actual plan/review/reflection docs agent produced

## GAPS TO FIX IN CODE:
- Add .env.example
- Add CONTRIBUTING.md
- Add findings.md with research-voice analysis
- Wire bonus task into registry
- Add per-step latency to benchmark output
- Improve hard task with "unexpected twist" (merge conflict simulation)
- Add failure mode constants to grader
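
### Sketch: per-step latency

The "Add per-step latency to benchmark output" item could start from something like the sketch below. It is not TeamForge's actual benchmark code: `LatencyProfile`, `timed`, and `summary` are hypothetical names, and the step label is whatever the benchmark loop chooses to pass in.

```python
import time
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class LatencyProfile:
    """Hypothetical helper: collects wall-clock timings per named step."""

    samples: Dict[str, List[float]] = field(default_factory=dict)

    def timed(self, step: str, fn: Callable[[], object]) -> object:
        """Run fn(), record its duration under `step`, and return its result."""
        start = time.perf_counter()
        try:
            return fn()
        finally:
            # Recorded even if fn() raises, so failed steps still show up.
            elapsed = time.perf_counter() - start
            self.samples.setdefault(step, []).append(elapsed)

    def summary(self) -> Dict[str, Dict[str, float]]:
        """Per-step call count, mean, and max latency in seconds."""
        return {
            step: {"count": len(ts), "mean_s": mean(ts), "max_s": max(ts)}
            for step, ts in self.samples.items()
        }
```

Wrapping each agent action as `profile.timed("plan", lambda: ...)` and dumping `profile.summary()` next to the scores would cover gap 5 without touching the grader.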