# TeamForge: Judge Psychology & Upgrade Strategy
# Internal reference, not shipped to users
## STEP 1: HOW JUDGES EVALUATE UNDER TIME PRESSURE
### What creates an instant impression (first 10 seconds)
- A README that leads with a RESULT, not a description
  - "llama3-70b scores 0.78, gemma2 scores 0.35" beats "environment for coding agents"
- A demo that runs with ONE command and looks beautiful
- A leaderboard table: signals a real benchmark, not a toy
- Comparison to known benchmarks (SWE-bench, HumanEval) shows awareness of the field
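
A minimal sketch of how the README leaderboard could be generated from pre-computed scores rather than typed by hand. The function name `render_leaderboard` and the `{model: score}` input shape are assumptions for illustration; the two scores are the ones quoted in the list above.

```python
from __future__ import annotations


def render_leaderboard(scores: dict[str, float]) -> str:
    """Return a Markdown table of models sorted by score, best first."""
    rows = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    lines = ["| Rank | Model | Score |", "|---|---|---|"]
    for rank, (model, score) in enumerate(rows, start=1):
        lines.append(f"| {rank} | {model} | {score:.2f} |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Scores quoted in the notes; any further rows would come from real runs.
    print(render_leaderboard({"gemma2": 0.35, "llama3-70b": 0.78}))
```

Generating the table from the stored results keeps the README numbers in sync with whatever the benchmark last produced.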
### What causes skips/rejections
- "This is an environment for...": sounds like boilerplate
- No results shown: a judge can't evaluate impact without numbers
- Giant walls of architecture text before showing what it does
- Tests that are obviously trivial or fake
- No evidence it actually runs
### What creates "this is different"
- Unexpected depth, e.g. AST-based anti-exploit detection
- A research finding stated as a fact: "r=0.94 correlation between model size and Hard task score"
- Dense reward explained with a real formula, not just described
- Git sandbox = real isolation = production thinking
- Phase-by-phase logs that look like actual agent traces
### How TeamForge exploits these biases
- Leaderboard in README section 1
- "python demo.py" = one command
- Pre-computed results with a key finding
- Anti-exploit section signals rigor
- Comparison table vs HumanEval/SWE-bench
## STEP 2: REMAINING GAPS (brutally honest)
STILL HACKATHON-LEVEL:
1. No paper-style "Findings" section with actual analysis
2. Hard task has no "unexpected twist"; it's just an algorithm problem
3. Bonus task exists in a file but isn't wired into the benchmark
4. app.py Gradio demo is barebones: no live agent trace display
5. No per-action timing analysis (latency profiling)
6. No failure mode analysis (what do models fail on specifically?)
7. README findings section needs real statistical language
8. No .env.example file: friction for new users
9. No CONTRIBUTING.md: signals a solo hack, not a real project
10. No task difficulty validation (empirical pass rates)
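
Gap 10 (task difficulty validation) could be closed with something like the sketch below, which computes empirical pass rates per task and checks that they do not rise as tasks get harder. The `(task_id, passed)` record shape and both function names are assumptions, not the project's existing API.

```python
from collections import defaultdict


def pass_rates(runs):
    """Map each task id to the fraction of runs that passed.

    `runs` is an iterable of (task_id, passed) pairs.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for task_id, passed in runs:
        totals[task_id] += 1
        passes[task_id] += int(bool(passed))
    return {task: passes[task] / totals[task] for task in totals}


def is_monotonic(rates, order):
    """True if the pass rate never increases along `order` (easiest first)."""
    values = [rates[task] for task in order]
    return all(a >= b for a, b in zip(values, values[1:]))
```

If `is_monotonic` returns `False` for the intended ordering, the difficulty labels do not match observed behavior and a task probably needs re-tiering.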
## STEP 3: WOW MOMENTS TO ENGINEER
1. LIVE PHASE TICKER in demo: watch phases light up one by one
2. FAILURE ANALYSIS card: "models fail most on: final chunk edge case"
3. ANTI-EXPLOIT DEMO: show what happens when an agent tries to trivialize tests
4. CORRELATION FINDING: text scatter plot showing model size vs score
5. ARTIFACT VIEWER: show the actual plan/review/reflection docs the agent produced
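
For the correlation finding, the reported r should be recomputed from the actual benchmark results each time rather than hard-coded. A minimal sketch of Pearson's r in plain Python follows; the function name `pearson_r` is an assumption, and no sample data is included because the real model sizes and scores must come from benchmark runs.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)
```

Call it as `pearson_r(model_sizes, hard_task_scores)`. Note that with only two models r is always exactly +1 or -1, so the finding is only worth stating once several models have been evaluated.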
## GAPS TO FIX IN CODE:
- Add .env.example
- Add CONTRIBUTING.md
- Add findings.md with research-voice analysis
- Wire bonus task into registry
- Add per-step latency to benchmark output
- Improve hard task with "unexpected twist" (merge conflict simulation)
- Add failure mode constants to grader
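
One way the per-step latency item might be approached is a small timing helper wrapped around each agent step. This is a sketch under assumptions: the class name `StepTimer` and the step labels are hypothetical, and how it hooks into the benchmark loop depends on code not shown here.

```python
import time
from contextlib import contextmanager


class StepTimer:
    """Collects wall-clock durations for named steps."""

    def __init__(self):
        self.records = []  # list of (step_name, seconds)

    @contextmanager
    def step(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the step raised an exception.
            self.records.append((name, time.perf_counter() - start))

    def summary(self):
        """Return {step_name: total_seconds} across all recorded steps."""
        totals = {}
        for name, seconds in self.records:
            totals[name] = totals.get(name, 0.0) + seconds
        return totals
```

Usage would look like `with timer.step("plan"): ...` around each phase, with `timer.summary()` appended to the benchmark output at the end of an episode.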