teamforge / STRATEGY.md

Your Name

fix: add FastAPI REST endpoints for OpenEnv validator

637f42c about 1 month ago


	# TeamForge — Judge Psychology & Upgrade Strategy
	# Internal reference — not shipped to users

	## STEP 1: HOW JUDGES EVALUATE UNDER TIME PRESSURE

	### What creates instant impression (first 10 seconds)
	- A README that leads with a RESULT, not a description
	→ "llama3-70b scores 0.78, gemma2 scores 0.35" beats "environment for coding agents"
	- A demo that runs with ONE command and looks beautiful
	- A leaderboard table — signals real benchmark, not toy
	- Comparison to known benchmarks (SWE-bench, HumanEval) shows awareness

	### What causes skips/rejections
	- "This is an environment for..." — sounds like boilerplate
	- No results shown — judge can't evaluate impact without numbers
	- Giant walls of architecture text before showing what it does
	- Tests that are obviously trivial or fake
	- No evidence it actually runs

	### What creates "this is different"
	- Unexpected depth: e.g. AST-based anti-exploit detection
	- Research finding stated as a fact: "r=0.94 correlation between model size and Hard task score"
	- Dense reward explained with a real formula, not just described
	- Git sandbox = real isolation = production thinking
	- Phase-by-phase logs that look like actual agent traces

	### How TeamForge exploits these biases
	- ✓ Leaderboard in README section 1
	- ✓ "python demo.py" = one command
	- ✓ Pre-computed results with a key finding
	- ✓ Anti-exploit section signals rigor
	- ✓ Comparison table vs HumanEval/SWE-bench

	## STEP 2: REMAINING GAPS (brutally honest)

	STILL HACKATHON-LEVEL:
	1. No paper-style "Findings" section with actual analysis
	2. Hard task has no "unexpected twist" — it's just an algorithm problem
	3. Bonus task exists in file but isn't wired into benchmark
	4. app.py Gradio demo is barebones — no live agent trace display
	5. No per-action timing analysis (latency profiling)
	6. No failure mode analysis (what do models fail on specifically?)
	7. README findings section needs real statistical language (sample sizes, correlation coefficients, variance), not adjectives
	8. No .env.example file — friction for new users
	9. No CONTRIBUTING.md — signals solo hack, not real project
	10. No task difficulty validation (empirical pass rates)
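
Gap 10 (empirical pass rates) could be checked with something like the sketch below. It assumes benchmark runs are available as `(task_name, passed)` pairs and that tasks have an intended easy-to-hard ordering; the functions `pass_rates` and `difficulty_violations` and any task names passed to them are hypothetical, not part of the existing codebase.

```python
from collections import defaultdict


def pass_rates(runs):
    """Compute the fraction of passing runs per task.

    runs: iterable of (task_name, passed) pairs (assumed input format).
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for task, passed in runs:
        totals[task] += 1
        passes[task] += int(bool(passed))
    return {task: passes[task] / totals[task] for task in totals}


def difficulty_violations(rates, expected_order):
    """List adjacent (easier, harder) pairs where the harder task passes more often.

    expected_order: task names from easiest to hardest (an assumption the
    caller supplies; every name must be a key in `rates`).
    """
    violations = []
    for easier, harder in zip(expected_order, expected_order[1:]):
        if rates[harder] > rates[easier]:
            violations.append((easier, harder))
    return violations
```

An empty result from `difficulty_violations` only says the observed rates are consistent with the labels; with few runs per task the rates are noisy, so the run count should be reported alongside them.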

	## STEP 3: WOW MOMENTS TO ENGINEER

	1. LIVE PHASE TICKER in demo — watch phases light up one by one
	2. FAILURE ANALYSIS card — "models fail most on: final chunk edge case"
	3. ANTI-EXPLOIT DEMO — show what happens when agent tries to trivialize tests
	4. CORRELATION FINDING — scatter plot text showing model size vs score
	5. ARTIFACT VIEWER — show actual plan/review/reflection docs agent produced
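
For wow moment 4, a correlation figure should be computed from the benchmark's own numbers rather than quoted. The sketch below shows one way to get a Pearson r and a plain-text plot with only the standard library. The helper names `pearson_r` and `text_scatter` are hypothetical, and the input shape (a size value and a score in [0, 1] per model) is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def text_scatter(points, width=40):
    """Render one row per model, ordered by size; marker column tracks score.

    points: list of (label, size, score) with score assumed to lie in [0, 1].
    """
    rows = []
    for label, _size, score in sorted(points, key=lambda p: p[1]):
        col = round(score * (width - 1))
        rows.append(f"{label:>12} |" + " " * col + "*")
    return "\n".join(rows)
```

With only a handful of models, r is unstable, so the README should state how many models the figure is based on.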

	## GAPS TO FIX IN CODE:
	- Add .env.example
	- Add CONTRIBUTING.md
	- Add findings.md with research-voice analysis
	- Wire bonus task into registry
	- Add per-step latency to benchmark output
	- Improve hard task with "unexpected twist" (merge conflict simulation)
	- Add failure mode constants to grader
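
The per-step latency item could start from a small timing wrapper like the one below. This is a sketch under assumptions: the `StepTimer` class is hypothetical, it measures wall-clock time with `time.perf_counter`, and the step names the caller passes in (for example a phase name) are illustrative rather than taken from the benchmark code.

```python
import time
from statistics import mean


class StepTimer:
    """Collect wall-clock durations for named steps (hypothetical helper)."""

    def __init__(self):
        self.samples = {}  # step name -> list of durations in seconds

    def time_call(self, step_name, fn, *args, **kwargs):
        """Run fn(*args, **kwargs), record its duration, and return its result."""
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            # Recorded even if fn raises, so failed steps still show up.
            elapsed = time.perf_counter() - start
            self.samples.setdefault(step_name, []).append(elapsed)

    def summary(self):
        """Return count, mean, and max duration per step."""
        return {
            name: {"count": len(durations), "mean_s": mean(durations), "max_s": max(durations)}
            for name, durations in self.samples.items()
        }
```

Wrapping each agent action as `timer.time_call("step_name", action_fn, ...)` and appending `timer.summary()` to the benchmark output would give per-step numbers without touching the actions themselves.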