# Contributing to TeamForge

TeamForge is an open benchmark — contributions are how it stays relevant.

## Ways to Contribute

### 1. Submit Your Model's Results

Run the benchmark and open a PR adding your results to `results/`:

```bash
export GROQ_API_KEY=gsk_...
python benchmark.py --model your-model-name --task all
# Results saved to results/your-model-name/
git add results/your-model-name/
git commit -m "bench: add results for your-model-name"
```

### 2. Add a New Task

Tasks live in `tasks/`. Each task needs:

- `TASK_ID` — unique string
- `DIFFICULTY` — "easy" | "medium" | "hard"
- `MAX_STEPS` — integer
- `DESCRIPTION` — markdown string shown to the agent
- `INITIAL_FILES` — dict of `{path: content}` for the git sandbox
- `REQUIRED_KEYWORDS_IN_REVIEW` — list of strings the grader checks for
- `PASSING_TESTS` — expected number of passing tests

Copy `tasks/easy_task.py` as a template. Register your task in `tasks/task_registry.py`. A sketch of a minimal task module appears in the appendix at the end of this file.

### 3. Improve the Grader

`grader.py` is the most security-sensitive file. Any change must:

- Not break existing tests: `pytest tests/test_environment.py -v`
- Remain deterministic (same inputs → same score, always)
- Include a test case in `tests/test_environment.py`

A sketch of a determinism test appears in the appendix.

### 4. Report Failure Modes

If a model fails in an interesting way, open an issue with:

- Task ID
- Model name
- Step count at failure
- Last observation output

This data directly improves the benchmark.

## Development Setup

```bash
git clone https://github.com/yourname/teamforge
cd teamforge
python -m venv venv && source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env  # add your GROQ_API_KEY
pytest tests/ -v      # must be 21/21 green before any PR
```

## Code Standards

- All models use Pydantic v2
- All public functions have type hints
- All new files have a module docstring
- `ruff check .` must pass cleanly

## Benchmark Integrity

TeamForge uses deterministic grading. To prevent score inflation:

- Never modify test files in tasks
- The AST-based tamper detector will zero your score if you do
- Results PRs are verified by re-running the benchmark

---

*Questions? Open an issue. Disagreements about scoring? Open a discussion.*
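
## Appendix: Example Task Module

The sketch below shows roughly what a new task module (section "2. Add a New Task") might look like. Only the attribute names come from this guide; the task ID, file contents, keywords, and values are invented for illustration, and `tasks/easy_task.py` remains the authoritative template to copy.

```python
# tasks/example_refactor_task.py -- illustrative sketch only; not a task that
# ships with the benchmark. Attribute names follow "Add a New Task" above;
# every value is a made-up example.

TASK_ID = "medium_refactor_logger"      # unique string
DIFFICULTY = "medium"                   # "easy" | "medium" | "hard"
MAX_STEPS = 30                          # integer step budget
DESCRIPTION = """
Refactor `logger.py` so log levels are configurable, keep the existing
tests green, and describe the change in your review.
"""                                     # markdown string shown to the agent
INITIAL_FILES = {                       # {path: content} for the git sandbox
    "logger.py": "def log(msg):\n    print(msg)\n",
    "tests/test_logger.py": (
        "from logger import log\n"
        "def test_log_is_callable():\n"
        "    assert callable(log)\n"
    ),
}
REQUIRED_KEYWORDS_IN_REVIEW = ["log level", "backwards compatible"]
PASSING_TESTS = 1                       # expected number of passing tests
```

After saving the file, register the task in `tasks/task_registry.py`; follow the existing entries there for the exact registration shape, which this guide does not specify.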
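
## Appendix: Example Determinism Test

Grader changes (section "3. Improve the Grader") must stay deterministic and ship with a test in `tests/test_environment.py`. The sketch below shows one way such a check could look. `grade_submission` and its arguments are hypothetical placeholders, not the actual API of `grader.py`; adapt the call to whatever entry point the grader really exposes.

```python
# tests/test_environment.py (sketch) -- checks that grading the same inputs
# twice yields the same score. grade_submission(task_id, transcript) is a
# hypothetical placeholder for grader.py's real entry point.
import grader


def test_grading_is_deterministic():
    task_id = "easy_task"                                    # hypothetical fixed task ID
    transcript = "Tests pass. Review mentions: log level."   # canned input

    first = grader.grade_submission(task_id, transcript)
    second = grader.grade_submission(task_id, transcript)

    # Same inputs must always produce exactly the same score.
    assert first == second
```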