# TeamForge: Judge Psychology & Upgrade Strategy
# Internal reference, not shipped to users

## STEP 1: HOW JUDGES EVALUATE UNDER TIME PRESSURE

### What creates instant impression (first 10 seconds)
- A README that leads with a RESULT, not a description
  → "llama3-70b scores 0.78, gemma2 scores 0.35" beats "environment for coding agents"
- A demo that runs with ONE command and looks beautiful
- A leaderboard table: signals a real benchmark, not a toy
- Comparison to known benchmarks (SWE-bench, HumanEval) shows awareness
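
A minimal sketch of generating that README leaderboard from a score mapping. `render_leaderboard` and the column names are hypothetical, not existing TeamForge code; the two scores are the example figures quoted above.

```python
def render_leaderboard(scores: dict[str, float]) -> str:
    """Render a markdown leaderboard table, best score first.

    Hypothetical helper: the Rank/Model/Score columns are an assumption.
    """
    rows = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    lines = ["| Rank | Model | Score |", "| --- | --- | --- |"]
    for rank, (model, score) in enumerate(rows, start=1):
        lines.append(f"| {rank} | {model} | {score:.2f} |")
    return "\n".join(lines)


# Example figures taken from the README line quoted above.
print(render_leaderboard({"gemma2": 0.35, "llama3-70b": 0.78}))
```

Generating the table from the benchmark output, rather than typing it by hand, keeps the README numbers tied to an actual run.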

### What causes skips/rejections
- "This is an environment for..." sounds like boilerplate
- No results shown: a judge can't evaluate impact without numbers
- Giant walls of architecture text before showing what it does
- Tests that are obviously trivial or fake
- No evidence it actually runs

### What creates "this is different"
- Unexpected depth: e.g. AST-based anti-exploit detection
- Research finding stated as a fact: "r=0.94 correlation between model size and Hard task score"
- Dense reward explained with a real formula, not just described
- Git sandbox = real isolation = production thinking
- Phase-by-phase logs that look like actual agent traces

### How TeamForge exploits these biases
✓ Leaderboard in README section 1
✓ "python demo.py" = one command
✓ Pre-computed results with a key finding
✓ Anti-exploit section signals rigor
✓ Comparison table vs HumanEval/SWE-bench

## STEP 2: REMAINING GAPS (brutally honest)

STILL HACKATHON-LEVEL:
1. No paper-style "Findings" section with actual analysis
2. Hard task has no "unexpected twist": it's just an algorithm problem
3. Bonus task exists in file but isn't wired into the benchmark
4. app.py Gradio demo is barebones: no live agent trace display
5. No per-action timing analysis (latency profiling)
6. No failure mode analysis (what do models fail on specifically?)
7. README findings section needs real statistical language
8. No .env.example file: friction for new users
9. No CONTRIBUTING.md: signals a solo hack, not a real project
10. No task difficulty validation (empirical pass rates)

## STEP 3: WOW MOMENTS TO ENGINEER

1. LIVE PHASE TICKER in demo: watch phases light up one by one
2. FAILURE ANALYSIS card, e.g. "models fail most on: final chunk edge case"
3. ANTI-EXPLOIT DEMO: show what happens when an agent tries to trivialize tests
4. CORRELATION FINDING: scatter plot text showing model size vs score
5. ARTIFACT VIEWER: show the actual plan/review/reflection docs the agent produced
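
For item 4, the correlation figure should come from the measured results, not be asserted. A minimal sketch of the Pearson coefficient follows, assuming one (model size, score) pair per model; `pearson_r` is a hypothetical helper and no benchmark data is included here.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and squares (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)
```

Call it as `pearson_r(sizes, scores)` with the per-model values from the benchmark output. With only a handful of models the coefficient is sensitive to a single outlier, so the sample size should be stated next to any reported r.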

## GAPS TO FIX IN CODE:
- Add .env.example
- Add CONTRIBUTING.md
- Add findings.md with research-voice analysis
- Wire bonus task into registry
- Add per-step latency to benchmark output
- Improve hard task with "unexpected twist" (merge conflict simulation)
- Add failure mode constants to grader
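
One possible shape for the per-step latency item, as a sketch only: `StepTimer` and the step names are hypothetical, and how it would hook into the benchmark loop is left open.

```python
import time
from contextlib import contextmanager


class StepTimer:
    """Collect wall-clock durations per named step (hypothetical helper)."""

    def __init__(self) -> None:
        self.records: list[tuple[str, float]] = []  # (step name, seconds)

    @contextmanager
    def step(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the step raised.
            self.records.append((name, time.perf_counter() - start))

    def summary(self) -> dict[str, float]:
        """Total seconds spent per step name."""
        totals: dict[str, float] = {}
        for name, seconds in self.records:
            totals[name] = totals.get(name, 0.0) + seconds
        return totals


if __name__ == "__main__":
    timer = StepTimer()
    with timer.step("plan"):  # placeholder step name
        time.sleep(0.01)  # stands in for a real agent action
    print(timer.summary())
```

`time.perf_counter()` is used because it is monotonic and intended for measuring short intervals, unlike `time.time()`.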