# TeamForge — Judge Psychology & Upgrade Strategy
# Internal reference — not shipped to users

## STEP 1: HOW JUDGES EVALUATE UNDER TIME PRESSURE

### What creates instant impression (first 10 seconds)
- A README that leads with a RESULT, not a description
  → "llama3-70b scores 0.78, gemma2 scores 0.35" beats "environment for coding agents"
- A demo that runs with ONE command and looks beautiful
- A leaderboard table — signals real benchmark, not toy
- Comparison to known benchmarks (SWE-bench, HumanEval) shows awareness

### What causes skips/rejections
- "This is an environment for..." — sounds like boilerplate
- No results shown — judge can't evaluate impact without numbers
- Giant walls of architecture text before showing what it does
- Tests that are obviously trivial or fake
- No evidence it actually runs

### What creates "this is different"
- Unexpected depth: e.g. AST-based anti-exploit detection
- A research finding stated as a fact: "r=0.94 correlation between model size and Hard-task score"
- Dense reward explained with a real formula, not just described
- Git sandbox = real isolation = production thinking
- Phase-by-phase logs that look like actual agent traces

### How TeamForge exploits these biases
- ✓ Leaderboard in README section 1
- ✓ "python demo.py" = one command
- ✓ Pre-computed results with a key finding
- ✓ Anti-exploit section signals rigor
- ✓ Comparison table vs HumanEval/SWE-bench
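
As a minimal sketch of how the README leaderboard could be generated from pre-computed results rather than typed by hand: `render_leaderboard` is a hypothetical helper, and the input is assumed to be a plain mapping from model name to score. The two scores in the usage note are the illustrative figures quoted earlier in this note, not a full result set.

```python
def render_leaderboard(scores: dict[str, float]) -> str:
    """Render a markdown leaderboard table, highest score first."""
    rows = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    lines = ["| Rank | Model | Score |", "|---|---|---|"]
    for rank, (model, score) in enumerate(rows, start=1):
        lines.append(f"| {rank} | {model} | {score:.2f} |")
    return "\n".join(lines)
```

Calling `render_leaderboard({"gemma2": 0.35, "llama3-70b": 0.78})` would place `llama3-70b` in the first row. Generating the table from the results file keeps the README consistent with what the benchmark actually produced.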

## STEP 2: REMAINING GAPS (brutally honest)

STILL HACKATHON-LEVEL:
1. No paper-style "Findings" section with actual analysis
2. Hard task has no "unexpected twist" — it's just an algorithm problem
3. Bonus task exists in file but isn't wired into benchmark
4. app.py Gradio demo is barebones — no live agent trace display
5. No per-action timing analysis (latency profiling)
6. No failure mode analysis (what do models fail on specifically?)
7. README findings section needs real statistical language
8. No .env.example file — friction for new users
9. No CONTRIBUTING.md — signals solo hack, not real project
10. No task difficulty validation (empirical pass rates)
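
For gap 5, per-action timing could be collected with a small context manager along the lines below. This is a sketch under assumptions: `LatencyProfiler` and the action name `"edit_file"` are hypothetical, and nothing here reflects how the existing benchmark loop is structured.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class LatencyProfiler:
    """Collects wall-clock durations per action name (hypothetical helper)."""

    def __init__(self):
        self.samples = defaultdict(list)

    @contextmanager
    def track(self, action: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the action raised.
            self.samples[action].append(time.perf_counter() - start)

    def summary(self) -> dict[str, dict[str, float]]:
        """Return count, mean, and max duration in seconds for each action."""
        return {
            action: {
                "count": len(durations),
                "mean_s": sum(durations) / len(durations),
                "max_s": max(durations),
            }
            for action, durations in self.samples.items()
        }
```

Wrapping each agent action in `with profiler.track("edit_file"):` and adding `profiler.summary()` to the benchmark output would give per-action latency without changing the actions themselves.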

## STEP 3: WOW MOMENTS TO ENGINEER

1. LIVE PHASE TICKER in demo — watch phases light up one by one
2. FAILURE ANALYSIS card — "models fail most on: final chunk edge case"
3. ANTI-EXPLOIT DEMO — show what happens when agent tries to trivialize tests
4. CORRELATION FINDING — text-rendered scatter plot of model size vs. score
5. ARTIFACT VIEWER — show actual plan/review/reflection docs agent produced
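
The correlation finding in item 4 could be computed and rendered as text along these lines. This is a sketch only: `pearson_r` and `text_scatter` are hypothetical helpers, and each row is assumed to be a `(label, size_in_billions, score)` tuple with scores in [0, 1]. No real results are included here.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return cov / (sx * sy)


def text_scatter(rows: list[tuple[str, float, float]], width: int = 40) -> str:
    """One line per model, ordered by size, with '*' placed at the score."""
    lines = []
    for label, size_b, score in sorted(rows, key=lambda r: r[1]):
        col = round(score * width)
        lines.append(f"{label:>12} {size_b:>5}B |" + " " * col + "*")
    return "\n".join(lines)
```

With only a handful of models, a high r is weak evidence, so the README should report the sample size alongside the coefficient.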

## GAPS TO FIX IN CODE:
- Add .env.example
- Add CONTRIBUTING.md
- Add findings.md with research-voice analysis
- Wire bonus task into registry
- Add per-step latency to benchmark output
- Improve hard task with "unexpected twist" (merge conflict simulation)
- Add failure mode constants to grader
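
The failure mode constants in the last item could be sketched as an enum plus a small aggregation step that feeds the failure analysis card. Everything here is an assumption: the `FailureMode` members, the `failure_mode` key, and the episode record shape are hypothetical and would need to match whatever the grader actually emits.

```python
from collections import Counter
from enum import Enum


class FailureMode(str, Enum):
    """Hypothetical failure categories a grader could attach to an episode."""
    TESTS_FAILED = "tests_failed"
    EDGE_CASE_MISSED = "edge_case_missed"
    TESTS_TRIVIALIZED = "tests_trivialized"
    TIMEOUT = "timeout"


def failure_breakdown(episodes: list[dict]) -> list[tuple[FailureMode, int]]:
    """Count failure modes across episode records, most common first."""
    counts = Counter(
        e["failure_mode"] for e in episodes if e.get("failure_mode") is not None
    )
    return counts.most_common()
```

The first entry of `failure_breakdown(...)` is the mode to surface in a "models fail most on" line. Successful episodes are assumed to carry `failure_mode=None` and are skipped.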