# Contributing to TeamForge

TeamForge is an open benchmark — contributions are how it stays relevant.

## Ways to Contribute

### 1. Submit Your Model's Results

Run the benchmark and open a PR adding your results to `results/`:

```bash
export GROQ_API_KEY=gsk_...
python benchmark.py --model your-model-name --task all
# Results saved to results/your-model-name/
git add results/your-model-name/
git commit -m "bench: add results for your-model-name"
```

### 2. Add a New Task

Tasks live in `tasks/`. Each task needs:

- `TASK_ID` — unique string
- `DIFFICULTY` — "easy" | "medium" | "hard"
- `MAX_STEPS` — integer
- `DESCRIPTION` — markdown string shown to the agent
- `INITIAL_FILES` — dict of `{path: content}` for the git sandbox
- `REQUIRED_KEYWORDS_IN_REVIEW` — list of strings the grader checks for
- `PASSING_TESTS` — expected number of passing tests

Copy `tasks/easy_task.py` as a template. Register your task in `tasks/task_registry.py`. A sketch of a minimal task module appears in the appendix at the end of this file.

### 3. Improve the Grader

`grader.py` is the most security-sensitive file. Any change must:

- Not break existing tests: `pytest tests/test_environment.py -v`
- Remain deterministic (same inputs → same score, always)
- Include a test case in `tests/test_environment.py`

A sketch of a determinism test appears in the appendix.

### 4. Report Failure Modes

If a model fails in an interesting way, open an issue with:

- Task ID
- Model name
- Step count at failure
- Last observation output

This data directly improves the benchmark.

## Development Setup

```bash
git clone https://github.com/yourname/teamforge
cd teamforge
python -m venv venv && source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env  # add your GROQ_API_KEY
pytest tests/ -v      # must be 21/21 green before any PR
```

## Code Standards

- All models use Pydantic v2
- All public functions have type hints
- All new files have a module docstring
- `ruff check .` must pass cleanly

## Benchmark Integrity

TeamForge uses deterministic grading. To prevent score inflation:

- Never modify test files in tasks
- The AST-based tamper detector will zero your score if you do
- Results PRs are verified by re-running the benchmark

---

*Questions? Open an issue. Disagreements about scoring? Open a discussion.*
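
## Appendix: Example Task Module

The sketch below shows roughly what a new task module (section "2. Add a New Task") might look like. Only the attribute names come from this guide; the task ID, file contents, keywords, and values are invented for illustration, and `tasks/easy_task.py` remains the authoritative template to copy.

```python
# tasks/example_refactor_task.py -- illustrative sketch only; not a task that
# ships with the benchmark. Attribute names follow "Add a New Task" above;
# every value is a made-up example.

TASK_ID = "medium_refactor_logger"      # unique string
DIFFICULTY = "medium"                   # "easy" | "medium" | "hard"
MAX_STEPS = 30                          # integer step budget
DESCRIPTION = """
Refactor `logger.py` so log levels are configurable, keep the existing
tests green, and describe the change in your review.
"""                                     # markdown string shown to the agent
INITIAL_FILES = {                       # {path: content} for the git sandbox
    "logger.py": "def log(msg):\n    print(msg)\n",
    "tests/test_logger.py": (
        "from logger import log\n"
        "def test_log_is_callable():\n"
        "    assert callable(log)\n"
    ),
}
REQUIRED_KEYWORDS_IN_REVIEW = ["log level", "backwards compatible"]
PASSING_TESTS = 1                       # expected number of passing tests
```

After saving the file, register the task in `tasks/task_registry.py`; follow the existing entries there for the exact registration shape, which this guide does not specify.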
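
## Appendix: Example Determinism Test

Grader changes (section "3. Improve the Grader") must stay deterministic and ship with a test in `tests/test_environment.py`. The sketch below shows one way such a check could look. `grade_submission` and its arguments are hypothetical placeholders, not the actual API of `grader.py`; adapt the call to whatever entry point the grader really exposes.

```python
# tests/test_environment.py (sketch) -- checks that grading the same inputs
# twice yields the same score. grade_submission(task_id, transcript) is a
# hypothetical placeholder for grader.py's real entry point.
import grader


def test_grading_is_deterministic():
    task_id = "easy_task"                                    # hypothetical fixed task ID
    transcript = "Tests pass. Review mentions: log level."   # canned input

    first = grader.grade_submission(task_id, transcript)
    second = grader.grade_submission(task_id, transcript)

    # Same inputs must always produce exactly the same score.
    assert first == second
```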