Spaces:

PrakashCider
/

teamforge

Sleeping

App Files Files Community

teamforge / STRATEGY.md

Your Name

fix: add FastAPI REST endpoints for OpenEnv validator

637f42c about 1 month ago

preview code

raw

history blame contribute delete

2.82 kB

TeamForge — Judge Psychology & Upgrade Strategy

Internal reference — not shipped to users

STEP 1: HOW JUDGES EVALUATE UNDER TIME PRESSURE

What creates instant impression (first 10 seconds)

A README that leads with a RESULT, not a description → "llama3-70b scores 0.78, gemma2 scores 0.35" beats "environment for coding agents"
A demo that runs with ONE command and looks beautiful
A leaderboard table — signals real benchmark, not toy
Comparison to known benchmarks (SWE-bench, HumanEval) shows awareness

What causes skips/rejections

"This is an environment for..." — sounds like boilerplate
No results shown — judge can't evaluate impact without numbers
Giant walls of architecture text before showing what it does
Tests that are obviously trivial or fake
No evidence it actually runs

What creates "this is different"

Unexpected depth: e.g. AST-based anti-exploit detection
Research finding stated as a fact: "r=0.94 correlation between model size and Hard task"
Dense reward explained with a real formula, not just described
Git sandbox = real isolation = production thinking
Phase-by-phase logs that look like actual agent traces

How TeamForge exploits these biases

✓ Leaderboard in README section 1 ✓ "python demo.py" = one command ✓ Pre-computed results with a key finding ✓ Anti-exploit section signals rigor ✓ Comparison table vs HumanEval/SWE-bench

STEP 2: REMAINING GAPS (brutally honest)

STILL HACKATHON-LEVEL:

No paper-style "Findings" section with actual analysis
Hard task has no "unexpected twist" — it's just an algorithm problem
Bonus task exists in file but isn't wired into benchmark
app.py Gradio demo is barebones — no live agent trace display
No per-action timing analysis (latency profiling)
No failure mode analysis (what do models fail on specifically?)
README findings section needs real statistical language
No .env.example file — friction for new users
No CONTRIBUTING.md — signals solo hack, not real project
No task difficulty validation (empirical pass rates)

STEP 3: WOW MOMENTS TO ENGINEER

LIVE PHASE TICKER in demo — watch phases light up one by one
FAILURE ANALYSIS card — "models fail most on: final chunk edge case"
ANTI-EXPLOIT DEMO — show what happens when agent tries to trivialize tests
CORRELATION FINDING — scatter plot text showing model size vs score
ARTIFACT VIEWER — show actual plan/review/reflection docs agent produced

GAPS TO FIX IN CODE:

Add .env.example
Add CONTRIBUTING.md
Add findings.md with research-voice analysis
Wire bonus task into registry
Add per-step latency to benchmark output
Improve hard task with "unexpected twist" (merge conflict simulation)
Add failure mode constants to grader