teamforge / CONTRIBUTING.md
Your Name
fix: add FastAPI REST endpoints for OpenEnv validator
637f42c
# Contributing to TeamForge
TeamForge is an open benchmark β€” contributions are how it stays relevant.
## Ways to Contribute
### 1. Submit Your Model's Results
Run the benchmark and open a PR adding your results to `results/`:
```bash
export GROQ_API_KEY=gsk_...
python benchmark.py --model your-model-name --task all
# Results saved to results/your-model-name/
git add results/your-model-name/
git commit -m "bench: add results for your-model-name"
```
### 2. Add a New Task
Tasks live in `tasks/`. Each task needs:
- `TASK_ID` β€” unique string
- `DIFFICULTY` β€” "easy" | "medium" | "hard"
- `MAX_STEPS` β€” integer
- `DESCRIPTION` β€” markdown string shown to agent
- `INITIAL_FILES` β€” dict of `{path: content}` for the git sandbox
- `REQUIRED_KEYWORDS_IN_REVIEW` β€” list of strings grader checks for
- `PASSING_TESTS` β€” expected number of passing tests
Copy `tasks/easy_task.py` as a template.
Register your task in `tasks/task_registry.py`.
### 3. Improve the Grader
`grader.py` is the most security-sensitive file.
Any change must:
- Not break existing tests: `pytest tests/test_environment.py -v`
- Remain deterministic (same inputs β†’ same score, always)
- Include a test case in `tests/test_environment.py`
### 4. Report Failure Modes
If a model fails in an interesting way, open an issue with:
- Task ID
- Model name
- Step count at failure
- Last observation output
This data directly improves the benchmark.
## Development Setup
```bash
git clone https://github.com/yourname/teamforge
cd teamforge
python -m venv venv && source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env # add your GROQ_API_KEY
pytest tests/ -v # must be 21/21 green before any PR
```
## Code Standards
- All models use Pydantic v2
- All public functions have type hints
- All new files have a module docstring
- `ruff check .` must pass clean
## Benchmark Integrity
TeamForge uses deterministic grading. To prevent score inflation:
- Never modify test files in tasks
- The AST-based tamper detector will zero your score if you do
- Results PRs are verified by re-running the benchmark
---
*Questions? Open an issue. Disagreements about scoring? Open a discussion.*