# TeamForge: Judge Psychology & Upgrade Strategy
# Internal reference, not shipped to users
## STEP 1: HOW JUDGES EVALUATE UNDER TIME PRESSURE
### What creates instant impression (first 10 seconds)
- A README that leads with a RESULT, not a description
  → "llama3-70b scores 0.78, gemma2 scores 0.35" beats "environment for coding agents"
- A demo that runs with ONE command and looks beautiful
- A leaderboard table: signals a real benchmark, not a toy
- Comparison to known benchmarks (SWE-bench, HumanEval) shows awareness of prior work
### What causes skips/rejections
- Opening with "This is an environment for...", which sounds like boilerplate
- No results shown: a judge can't evaluate impact without numbers
- Giant walls of architecture text before showing what it does
- Tests that are obviously trivial or fake
- No evidence it actually runs
### What creates "this is different"
- Unexpected depth: e.g. AST-based anti-exploit detection
- Research finding stated as a fact: "r=0.94 correlation between model size and Hard task score"
- Dense reward explained with a real formula, not just described
- Git sandbox = real isolation = production thinking
- Phase-by-phase logs that look like actual agent traces
### How TeamForge exploits these biases
- ✓ Leaderboard in README section 1
- ✓ "python demo.py" = one command
- ✓ Pre-computed results with a key finding
- ✓ Anti-exploit section signals rigor
- ✓ Comparison table vs HumanEval/SWE-bench
## STEP 2: REMAINING GAPS (brutally honest)
STILL HACKATHON-LEVEL:
1. No paper-style "Findings" section with actual analysis
2. Hard task has no "unexpected twist": it's just an algorithm problem
3. Bonus task exists in file but isn't wired into the benchmark
4. app.py Gradio demo is barebones: no live agent trace display
5. No per-action timing analysis (latency profiling)
6. No failure mode analysis (what do models fail on specifically?)
7. README findings section needs real statistical language
8. No .env.example file: friction for new users
9. No CONTRIBUTING.md: signals solo hack, not real project
10. No task difficulty validation (empirical pass rates)
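
Gap 10 (empirical pass rates) could be closed with something as small as the sketch below. It is a rough illustration under assumptions: the record shape `(task_name, passed)` and both function names are hypothetical, not part of the existing benchmark code.

```python
from collections import defaultdict


def pass_rates(records):
    """Map each task name to its empirical pass rate.

    `records` is assumed to be an iterable of (task_name, passed) pairs,
    one per benchmark run.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for task, passed in records:
        totals[task] += 1
        passes[task] += int(bool(passed))
    return {task: passes[task] / totals[task] for task in totals}


def difficulty_is_consistent(rates, ordered_tasks):
    """True if the pass rate never increases from easiest to hardest task."""
    values = [rates[task] for task in ordered_tasks]
    return all(a >= b for a, b in zip(values, values[1:]))
```

If `difficulty_is_consistent` returns False for the intended easy-to-hard ordering, the difficulty labels are not backed by the data and should be revisited before they appear in the README.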
## STEP 3: WOW MOMENTS TO ENGINEER
1. LIVE PHASE TICKER in demo: watch phases light up one by one
2. FAILURE ANALYSIS card: "models fail most on: final chunk edge case"
3. ANTI-EXPLOIT DEMO: show what happens when an agent tries to trivialize tests
4. CORRELATION FINDING: text scatter plot showing model size vs score
5. ARTIFACT VIEWER: show the actual plan/review/reflection docs the agent produced
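
For the correlation finding, the r value should come from code rather than be typed by hand. A minimal sketch, assuming model sizes and scores are available as two equal-length numeric lists (the function name `pearson_r` is illustrative, and no real TeamForge results are shown here):

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)
```

With only a handful of models, r is a weak statistic, so the demo should report the sample size next to it.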
## GAPS TO FIX IN CODE
- Add .env.example
- Add CONTRIBUTING.md
- Add findings.md with research-voice analysis
- Wire bonus task into registry
- Add per-step latency to benchmark output
- Improve hard task with "unexpected twist" (merge conflict simulation)
- Add failure mode constants to grader
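
The per-step latency item could start from a small timing helper like the one below. This is a sketch under assumptions: `StepTimer` and the step names in the usage note are hypothetical, and how the totals get merged into the benchmark output is left open.

```python
import time
from contextlib import contextmanager


class StepTimer:
    """Collects wall-clock durations for named agent steps."""

    def __init__(self):
        self.records = []  # list of (step_name, seconds) in execution order

    @contextmanager
    def step(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the step raised an exception.
            self.records.append((name, time.perf_counter() - start))

    def summary(self):
        """Total seconds per step name, in first-seen order."""
        totals = {}
        for name, seconds in self.records:
            totals[name] = totals.get(name, 0.0) + seconds
        return totals
```

Usage would look like `with timer.step("plan"): ...` around each agent action, with `timer.summary()` written alongside the scores at the end of a run.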