
TeamForge: Judge Psychology & Upgrade Strategy

Internal reference, not shipped to users

STEP 1: HOW JUDGES EVALUATE UNDER TIME PRESSURE

What creates instant impression (first 10 seconds)

  • A README that leads with a RESULT, not a description → "llama3-70b scores 0.78, gemma2 scores 0.35" beats "environment for coding agents"
  • A demo that runs with ONE command and looks beautiful
  • A leaderboard table: signals a real benchmark, not a toy
  • Comparison to known benchmarks (SWE-bench, HumanEval), which shows awareness of prior work

What causes skips/rejections

  • Opening with "This is an environment for...", which sounds like boilerplate
  • No results shown: a judge can't evaluate impact without numbers
  • Giant walls of architecture text before showing what it does
  • Tests that are obviously trivial or fake
  • No evidence it actually runs

What creates "this is different"

  • Unexpected depth: e.g. AST-based anti-exploit detection
  • Research finding stated as a fact: "r=0.94 correlation between model size and Hard task score"
  • Dense reward explained with a real formula, not just described
  • Git sandbox = real isolation = production thinking
  • Phase-by-phase logs that look like actual agent traces

How TeamForge exploits these biases

  ✓ Leaderboard in README section 1
  ✓ "python demo.py" = one command
  ✓ Pre-computed results with a key finding
  ✓ Anti-exploit section signals rigor
  ✓ Comparison table vs HumanEval/SWE-bench

STEP 2: REMAINING GAPS (brutally honest)

STILL HACKATHON-LEVEL:

  1. No paper-style "Findings" section with actual analysis
  2. Hard task has no "unexpected twist": it's just an algorithm problem
  3. Bonus task exists in file but isn't wired into benchmark
  4. app.py Gradio demo is barebones: no live agent trace display
  5. No per-action timing analysis (latency profiling)
  6. No failure mode analysis (what do models fail on specifically?)
  7. README findings section needs real statistical language
  8. No .env.example file: friction for new users
  9. No CONTRIBUTING.md: signals solo hack, not real project
  10. No task difficulty validation (empirical pass rates)
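
Gap 10 (empirical pass rates) could be closed with something as small as the sketch below. It assumes benchmark runs can be reduced to (task_name, passed) pairs; that record shape, the function names, and the tier labels in the usage lines are hypothetical, not taken from the TeamForge code.

```python
from collections import defaultdict


def pass_rates(runs):
    """Map each task name to the fraction of runs that passed it.

    `runs` is an iterable of (task_name, passed) pairs (assumed format).
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for task, passed in runs:
        totals[task] += 1
        passes[task] += int(passed)
    return {task: passes[task] / totals[task] for task in totals}


def difficulty_is_ordered(rates, order):
    """True when pass rates never increase from easier to harder tasks."""
    values = [rates[name] for name in order]
    return all(a >= b for a, b in zip(values, values[1:]))


# Placeholder records to show the call shape, not real benchmark results.
example_runs = [("easy", True), ("easy", True), ("hard", True), ("hard", False)]
example_rates = pass_rates(example_runs)
print(example_rates, difficulty_is_ordered(example_rates, ["easy", "hard"]))
```

If difficulty_is_ordered returns False on real runs, the task labels do not match observed difficulty, which is exactly what this gap is meant to catch.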

STEP 3: WOW MOMENTS TO ENGINEER

  1. LIVE PHASE TICKER in demo: watch phases light up one by one
  2. FAILURE ANALYSIS card: "models fail most on: final chunk edge case"
  3. ANTI-EXPLOIT DEMO: show what happens when agent tries to trivialize tests
  4. CORRELATION FINDING: scatter plot text showing model size vs score
  5. ARTIFACT VIEWER: show actual plan/review/reflection docs agent produced
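
For wow moment 4, the correlation itself should be computed rather than asserted. A minimal standard-library sketch of Pearson's r follows; the function name pearson_r is an assumption, and the inputs would be model sizes paired with benchmark scores taken from real runs. No values are supplied here on purpose.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of centered products; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)
```

With only a handful of models the estimate is noisy, so the README should report the number of models next to r.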

GAPS TO FIX IN CODE:

  • Add .env.example
  • Add CONTRIBUTING.md
  • Add findings.md with research-voice analysis
  • Wire bonus task into registry
  • Add per-step latency to benchmark output
  • Improve hard task with "unexpected twist" (merge conflict simulation)
  • Add failure mode constants to grader
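
The "per-step latency" item above could start from a small timing helper like the following sketch. The class name StepTimer and the step labels are hypothetical; how it would hook into the real benchmark loop is left open.

```python
import time
from contextlib import contextmanager


class StepTimer:
    """Collects wall-clock durations for named steps (illustrative helper)."""

    def __init__(self):
        self.records = []  # list of (step_name, seconds), in execution order

    @contextmanager
    def step(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the step raised an exception.
            self.records.append((name, time.perf_counter() - start))

    def summary(self):
        """Total seconds spent per step name."""
        totals = {}
        for name, seconds in self.records:
            totals[name] = totals.get(name, 0.0) + seconds
        return totals


timer = StepTimer()
with timer.step("plan"):
    time.sleep(0.01)  # stand-in for a real agent action
print(timer.summary())
```

time.perf_counter is used because it is a monotonic, high-resolution clock intended for measuring short durations; the summary dict can be appended to the benchmark output as is.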