# Contributing to TeamForge

TeamForge is an open benchmark, and contributions are how it stays relevant.
## Ways to Contribute
### 1. Submit Your Model's Results

Run the benchmark and open a PR adding your results to `results/`:

```bash
export GROQ_API_KEY=gsk_...
python benchmark.py --model your-model-name --task all
# Results saved to results/your-model-name/
git add results/your-model-name/
git commit -m "bench: add results for your-model-name"
```
### 2. Add a New Task

Tasks live in `tasks/`. Each task needs:

- `TASK_ID` – unique string
- `DIFFICULTY` – `"easy"` | `"medium"` | `"hard"`
- `MAX_STEPS` – integer
- `DESCRIPTION` – markdown string shown to the agent
- `INITIAL_FILES` – dict of `{path: content}` for the git sandbox
- `REQUIRED_KEYWORDS_IN_REVIEW` – list of strings the grader checks for
- `PASSING_TESTS` – expected number of passing tests

Copy `tasks/easy_task.py` as a template, then register your task in `tasks/task_registry.py`.
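Put together, a minimal task module might look like the sketch below. The field names come from the list above; every value (the task ID, the files, the keywords) is illustrative, not taken from the real codebase.

```python
"""Hypothetical example task module.

The module-level field names match the contributing guide; the concrete
values here are illustrative placeholders, not a real TeamForge task.
"""

TASK_ID = "example_fix_typo"
DIFFICULTY = "easy"  # one of "easy" | "medium" | "hard"
MAX_STEPS = 10
DESCRIPTION = "Fix the typo in `greet()` so the test suite passes."

# Files seeded into the git sandbox before the agent starts.
INITIAL_FILES = {
    "app.py": 'def greet():\n    return "helo"\n',
    "test_app.py": (
        "from app import greet\n\n"
        "def test_greet():\n"
        '    assert greet() == "hello"\n'
    ),
}

# Strings the grader expects to find in the agent's review.
REQUIRED_KEYWORDS_IN_REVIEW = ["typo", "greet"]
PASSING_TESTS = 1
```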
### 3. Improve the Grader

`grader.py` is the most security-sensitive file. Any change must:

- Not break existing tests: `pytest tests/test_environment.py -v`
- Remain deterministic (same inputs → same score, always)
- Include a test case in `tests/test_environment.py`
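The simplest determinism check is to grade identical inputs twice and assert the scores match. The sketch below assumes a hypothetical `grade()` function; the real signature in `grader.py` may differ.

```python
# Sketch only: grade() here is a stand-in, not the real grader.py API.
def grade(review_text: str, required_keywords: list[str]) -> float:
    """Score a review by the fraction of required keywords it mentions."""
    text = review_text.lower()
    hits = sum(1 for kw in required_keywords if kw.lower() in text)
    return hits / len(required_keywords)


def test_grading_is_deterministic() -> None:
    review = "Fixed the typo in greet(); all tests pass."
    keywords = ["typo", "greet"]
    # Same inputs must always produce the same score.
    assert grade(review, keywords) == grade(review, keywords) == 1.0
```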
### 4. Report Failure Modes

If a model fails in an interesting way, open an issue with:

- Task ID
- Model name
- Step count at failure
- Last observation output

This data directly improves the benchmark.
## Development Setup

```bash
git clone https://github.com/yourname/teamforge
cd teamforge
python -m venv venv && source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env   # add your GROQ_API_KEY
pytest tests/ -v       # must be 21/21 green before any PR
```
## Code Standards

- All models use Pydantic v2
- All public functions have type hints
- All new files have a module docstring
- `ruff check .` must pass clean
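A new file that meets all of the standards above might look like this sketch: a module docstring, a Pydantic v2 model, and a fully type-hinted public function. The model name and fields are illustrative, not taken from the codebase.

```python
"""Illustrative module: docstring, Pydantic v2 model, type-hinted functions.

The `TaskResult` fields are hypothetical examples, not the real schema.
"""

from pydantic import BaseModel


class TaskResult(BaseModel):
    """One graded run of a task (hypothetical field names)."""

    task_id: str
    score: float
    steps_used: int


def passed(result: TaskResult, threshold: float = 0.5) -> bool:
    """Return True when the run's score meets or beats the threshold."""
    return result.score >= threshold
```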
## Benchmark Integrity

TeamForge uses deterministic grading. To prevent score inflation:

- Never modify test files in tasks
- The AST-based tamper detector will zero your score if you do
- Results PRs are verified by re-running the benchmark
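The tamper detector's actual implementation isn't shown here, but the AST approach can be sketched with the standard library: parse a test file before and after the run and compare the AST dumps. Comments and whitespace never reach the AST, so cosmetic edits pass, while any change to the test logic is caught.

```python
import ast


def ast_fingerprint(source: str) -> str:
    """Dump the parsed AST; comments and whitespace don't affect it."""
    return ast.dump(ast.parse(source))


original = "def test_add():\n    assert add(1, 2) == 3\n"
reformatted = "def test_add():  # same logic, new comment\n    assert add(1, 2) == 3\n"
tampered = "def test_add():\n    assert True\n"

assert ast_fingerprint(original) == ast_fingerprint(reformatted)  # cosmetic edit passes
assert ast_fingerprint(original) != ast_fingerprint(tampered)     # logic change is caught
```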
---

*Questions? Open an issue. Disagreements about scoring? Open a discussion.*