TeamForge: Judge Psychology & Upgrade Strategy
Internal reference, not shipped to users
STEP 1: HOW JUDGES EVALUATE UNDER TIME PRESSURE
What creates instant impression (first 10 seconds)
- A README that leads with a RESULT, not a description: "llama3-70b scores 0.78, gemma2 scores 0.35" beats "environment for coding agents"
- A demo that runs with ONE command and looks beautiful
- A leaderboard table: signals a real benchmark, not a toy
- Comparison to known benchmarks (SWE-bench, HumanEval) shows awareness
What causes skips/rejections
- "This is an environment for...": sounds like boilerplate
- No results shown: a judge can't evaluate impact without numbers
- Giant walls of architecture text before showing what it does
- Tests that are obviously trivial or fake
- No evidence it actually runs
What creates "this is different"
- Unexpected depth: e.g. AST-based anti-exploit detection
- Research finding stated as a fact: "r=0.94 correlation between model size and Hard task score"
- Dense reward explained with a real formula, not just described
- Git sandbox = real isolation = production thinking
- Phase-by-phase logs that look like actual agent traces
How TeamForge exploits these biases
- Leaderboard in README section 1
- "python demo.py" = one command
- Pre-computed results with a key finding
- Anti-exploit section signals rigor
- Comparison table vs HumanEval/SWE-bench
STEP 2: REMAINING GAPS (brutally honest)
STILL HACKATHON-LEVEL:
- No paper-style "Findings" section with actual analysis
- Hard task has no "unexpected twist": it's just an algorithm problem
- Bonus task exists in file but isn't wired into benchmark
- app.py Gradio demo is barebones: no live agent trace display
- No per-action timing analysis (latency profiling)
- No failure mode analysis (what do models fail on specifically?)
- README findings section needs real statistical language
- No .env.example file: friction for new users
- No CONTRIBUTING.md: signals solo hack, not real project
- No task difficulty validation (empirical pass rates)
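Two of the gaps above, failure mode analysis and empirical pass rates, can be closed with a small tally over run records. The sketch below is one possible shape, not the existing grader: the `FailureMode` labels, the `summarize` function, and the `(task, failure)` record format are all hypothetical.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    # Hypothetical labels; the real grader may distinguish other cases.
    TESTS_FAILED = "tests_failed"
    TIMEOUT = "timeout"
    EXPLOIT_DETECTED = "exploit_detected"
    INVALID_ACTION = "invalid_action"


def summarize(runs):
    """Summarize runs given as (task_name, FailureMode or None) pairs.

    Returns (pass_rates, failure_counts): pass_rates maps each task to its
    empirical pass rate; failure_counts lists (mode, count) pairs, most
    frequent first.
    """
    attempts, passes, failures = Counter(), Counter(), Counter()
    for task, failure in runs:
        attempts[task] += 1
        if failure is None:
            passes[task] += 1
        else:
            failures[failure] += 1
    pass_rates = {task: passes[task] / attempts[task] for task in attempts}
    return pass_rates, failures.most_common()
```

Feeding every benchmark run through `summarize` gives the numbers needed for both a difficulty check (is "hard" actually passed less often than "easy"?) and a "models fail most on" statement.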
STEP 3: WOW MOMENTS TO ENGINEER
- LIVE PHASE TICKER in demo: watch phases light up one by one
- FAILURE ANALYSIS card, e.g. "models fail most on: final chunk edge case"
- ANTI-EXPLOIT DEMO: show what happens when agent tries to trivialize tests
- CORRELATION FINDING: scatter plot text showing model size vs score
- ARTIFACT VIEWER: show actual plan/review/reflection docs agent produced
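For the correlation finding, the reported r should be computed from the benchmark results rather than typed in by hand. A minimal sketch of the Pearson coefficient follows; the function name `pearson_r` is an assumption, and the inputs would be model sizes and their scores from the actual runs.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant series has no defined correlation.
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

With only a handful of models, r is unstable, so the finding should state the number of models (n) next to the coefficient.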
GAPS TO FIX IN CODE:
- Add .env.example
- Add CONTRIBUTING.md
- Add findings.md with research-voice analysis
- Wire bonus task into registry
- Add per-step latency to benchmark output
- Improve hard task with "unexpected twist" (merge conflict simulation)
- Add failure mode constants to grader
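Per-step latency can be added without touching the agent logic by wrapping each step in a timing context manager. The sketch below assumes nothing about the current benchmark code: `StepTimer` and the step names in the usage note are hypothetical.

```python
import time
from contextlib import contextmanager


class StepTimer:
    """Collects wall-clock durations for named steps."""

    def __init__(self):
        self.records = []  # list of (step_name, seconds), in execution order

    @contextmanager
    def step(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the step raises.
            self.records.append((name, time.perf_counter() - start))

    def totals(self):
        """Return total seconds spent per step name."""
        out = {}
        for name, seconds in self.records:
            out[name] = out.get(name, 0.0) + seconds
        return out
```

Usage would look like `with timer.step("plan"): ...` around each phase, with `timer.totals()` appended to the benchmark output. `time.perf_counter` is used because it is monotonic and intended for measuring short durations.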