# Contributing to TeamForge

TeamForge is an open benchmark, and contributions are how it stays relevant.
## Ways to Contribute
### 1. Submit Your Model's Results

Run the benchmark and open a PR adding your results to `results/`:

```bash
export GROQ_API_KEY=gsk_...
python benchmark.py --model your-model-name --task all
# Results saved to results/your-model-name/
git add results/your-model-name/
git commit -m "bench: add results for your-model-name"
```
### 2. Add a New Task

Tasks live in `tasks/`. Each task needs:

- `TASK_ID` – unique string
- `DIFFICULTY` – `"easy"` | `"medium"` | `"hard"`
- `MAX_STEPS` – integer
- `DESCRIPTION` – markdown string shown to the agent
- `INITIAL_FILES` – dict of `{path: content}` for the git sandbox
- `REQUIRED_KEYWORDS_IN_REVIEW` – list of strings the grader checks for
- `PASSING_TESTS` – expected number of passing tests

Copy `tasks/easy_task.py` as a template, then register your task in `tasks/task_registry.py`.
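Put together, a minimal task module might look like the sketch below. The field names come from the list above; every value (the task ID, the files, the keywords) is illustrative, not taken from the real codebase.

```python
"""Hypothetical example task module.

The module-level field names match the contributing guide; the concrete
values here are illustrative placeholders, not a real TeamForge task.
"""

TASK_ID = "example_fix_typo"
DIFFICULTY = "easy"  # one of "easy" | "medium" | "hard"
MAX_STEPS = 10
DESCRIPTION = "Fix the typo in `greet()` so the test suite passes."

# Files seeded into the git sandbox before the agent starts.
INITIAL_FILES = {
    "app.py": 'def greet():\n    return "helo"\n',
    "test_app.py": (
        "from app import greet\n\n"
        "def test_greet():\n"
        '    assert greet() == "hello"\n'
    ),
}

# Strings the grader expects to find in the agent's review.
REQUIRED_KEYWORDS_IN_REVIEW = ["typo", "greet"]
PASSING_TESTS = 1
```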
### 3. Improve the Grader

`grader.py` is the most security-sensitive file. Any change must:

- Not break existing tests: `pytest tests/test_environment.py -v`
- Remain deterministic (same inputs → same score, always)
- Include a test case in `tests/test_environment.py`
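The simplest determinism check is to grade identical inputs twice and assert the scores match. The sketch below assumes a hypothetical `grade()` function; the real signature in `grader.py` may differ.

```python
# Sketch only: grade() here is a stand-in, not the real grader.py API.
def grade(review_text: str, required_keywords: list[str]) -> float:
    """Score a review by the fraction of required keywords it mentions."""
    text = review_text.lower()
    hits = sum(1 for kw in required_keywords if kw.lower() in text)
    return hits / len(required_keywords)


def test_grading_is_deterministic() -> None:
    review = "Fixed the typo in greet(); all tests pass."
    keywords = ["typo", "greet"]
    # Same inputs must always produce the same score.
    assert grade(review, keywords) == grade(review, keywords) == 1.0
```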
### 4. Report Failure Modes

If a model fails in an interesting way, open an issue with:

- Task ID
- Model name
- Step count at failure
- Last observation output

This data directly improves the benchmark.
## Development Setup

```bash
git clone https://github.com/yourname/teamforge
cd teamforge
python -m venv venv && source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env   # add your GROQ_API_KEY
pytest tests/ -v       # must be 21/21 green before any PR
```
## Code Standards

- All models use Pydantic v2
- All public functions have type hints
- All new files have a module docstring
- `ruff check .` must pass clean
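A new file that meets all of the standards above might look like this sketch: a module docstring, a Pydantic v2 model, and a fully type-hinted public function. The model name and fields are illustrative, not taken from the codebase.

```python
"""Illustrative module: docstring, Pydantic v2 model, type-hinted functions.

The `TaskResult` fields are hypothetical examples, not the real schema.
"""

from pydantic import BaseModel


class TaskResult(BaseModel):
    """One graded run of a task (hypothetical field names)."""

    task_id: str
    score: float
    steps_used: int


def passed(result: TaskResult, threshold: float = 0.5) -> bool:
    """Return True when the run's score meets or beats the threshold."""
    return result.score >= threshold
```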
## Benchmark Integrity

TeamForge uses deterministic grading. To prevent score inflation:

- Never modify test files in tasks
- The AST-based tamper detector will zero your score if you do
- Results PRs are verified by re-running the benchmark
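The tamper detector's actual implementation isn't shown here, but the AST approach can be sketched with the standard library: parse a test file before and after the run and compare the AST dumps. Comments and whitespace never reach the AST, so cosmetic edits pass, while any change to the test logic is caught.

```python
import ast


def ast_fingerprint(source: str) -> str:
    """Dump the parsed AST; comments and whitespace don't affect it."""
    return ast.dump(ast.parse(source))


original = "def test_add():\n    assert add(1, 2) == 3\n"
reformatted = "def test_add():  # same logic, new comment\n    assert add(1, 2) == 3\n"
tampered = "def test_add():\n    assert True\n"

assert ast_fingerprint(original) == ast_fingerprint(reformatted)  # cosmetic edit passes
assert ast_fingerprint(original) != ast_fingerprint(tampered)     # logic change is caught
```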
---

*Questions? Open an issue. Disagreements about scoring? Open a discussion.*