# TeamForge: Judge Psychology & Upgrade Strategy
# Internal reference, not shipped to users
## STEP 1: HOW JUDGES EVALUATE UNDER TIME PRESSURE
### What creates instant impression (first 10 seconds)
- A README that leads with a RESULT, not a description
  - "llama3-70b scores 0.78, gemma2 scores 0.35" beats "environment for coding agents"
- A demo that runs with ONE command and looks beautiful
- A leaderboard table: signals a real benchmark, not a toy
- Comparison to known benchmarks (SWE-bench, HumanEval) shows awareness
### What causes skips/rejections
- "This is an environment for...": sounds like boilerplate
- No results shown: judge can't evaluate impact without numbers
- Giant walls of architecture text before showing what it does
- Tests that are obviously trivial or fake
- No evidence it actually runs
### What creates "this is different"
- Unexpected depth: e.g. AST-based anti-exploit detection
- Research finding stated as a fact: "r=0.94 correlation between model size and Hard task"
- Dense reward explained with a real formula, not just described
- Git sandbox = real isolation = production thinking
- Phase-by-phase logs that look like actual agent traces
### How TeamForge exploits these biases
- Leaderboard in README section 1
- "python demo.py" = one command
- Pre-computed results with a key finding
- Anti-exploit section signals rigor
- Comparison table vs HumanEval/SWE-bench
## STEP 2: REMAINING GAPS (brutally honest)
STILL HACKATHON-LEVEL:
1. No paper-style "Findings" section with actual analysis
2. Hard task has no "unexpected twist": it's just an algorithm problem
3. Bonus task exists in file but isn't wired into benchmark
4. app.py Gradio demo is barebones: no live agent trace display
5. No per-action timing analysis (latency profiling)
6. No failure mode analysis (what do models fail on specifically?)
7. README findings section needs real statistical language
8. No .env.example file: friction for new users
9. No CONTRIBUTING.md: signals solo hack, not real project
10. No task difficulty validation (empirical pass rates)
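
Gap 10 (empirical pass rates) could be closed with something as small as the sketch below. It assumes benchmark runs are available as `(task_id, passed)` pairs. That record shape, the task names other than the Hard task, and the helper names `pass_rates` and `is_monotonic_difficulty` are hypothetical, and the sample runs are placeholders, not real results.

```python
from collections import defaultdict


def pass_rates(runs):
    """Map each task id to the fraction of runs that passed."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for task_id, passed in runs:
        totals[task_id] += 1
        passes[task_id] += int(passed)
    return {task: passes[task] / totals[task] for task in totals}


def is_monotonic_difficulty(rates, ordered_tasks):
    """True if the pass rate never increases as tasks get nominally harder."""
    values = [rates[task] for task in ordered_tasks]
    return all(a >= b for a, b in zip(values, values[1:]))


# Placeholder runs for illustration only; not measured data.
runs = [
    ("easy", True), ("easy", True),
    ("medium", True), ("medium", False),
    ("hard", False), ("hard", False),
]
rates = pass_rates(runs)
print(rates, is_monotonic_difficulty(rates, ["easy", "medium", "hard"]))
```

If the monotonicity check fails on real runs, the difficulty labels are probably not doing what they claim.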
## STEP 3: WOW MOMENTS TO ENGINEER
1. LIVE PHASE TICKER in demo: watch phases light up one by one
2. FAILURE ANALYSIS card: "models fail most on: final chunk edge case"
3. ANTI-EXPLOIT DEMO: show what happens when agent tries to trivialize tests
4. CORRELATION FINDING: scatter plot text showing model size vs score
5. ARTIFACT VIEWER: show actual plan/review/reflection docs agent produced
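
For the correlation finding, the reported r should be computed from the benchmark output, not typed in by hand. A minimal sketch of Pearson's r in plain Python follows. The function name `pearson_r` is an assumption, and no model data is included here because it has to come from actual runs.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)
```

Pass in model sizes (possibly log-scaled) as `xs` and Hard-task scores as `ys`. With only a handful of models the estimate is fragile, so the README should state the sample size next to r.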
## GAPS TO FIX IN CODE:
- Add .env.example
- Add CONTRIBUTING.md
- Add findings.md with research-voice analysis
- Wire bonus task into registry
- Add per-step latency to benchmark output
- Improve hard task with "unexpected twist" (merge conflict simulation)
- Add failure mode constants to grader
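
For the per-step latency item above, one low-friction option is a timing context manager wrapped around each agent step. The sketch below is an assumption about how that could look: `timed`, `summarize`, and the step name `"plan"` are hypothetical and not part of the existing benchmark code.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(step_name, timings):
    """Record the wall-clock duration of the wrapped block under step_name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the step raises, so failed steps still show up.
        timings.setdefault(step_name, []).append(time.perf_counter() - start)


def summarize(timings):
    """Collapse raw durations into call count, mean, and max per step."""
    return {
        name: {
            "calls": len(durations),
            "mean_s": sum(durations) / len(durations),
            "max_s": max(durations),
        }
        for name, durations in timings.items()
    }


timings = {}
with timed("plan", timings):
    time.sleep(0.01)  # stand-in for a real agent step
print(summarize(timings))
```

The summary dict can be merged into the existing benchmark output without changing how steps are executed.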