
Contributing to TeamForge

TeamForge is an open benchmark; contributions are how it stays relevant.

Ways to Contribute

1. Submit Your Model's Results

Run the benchmark and open a PR adding your results to results/:

export GROQ_API_KEY=gsk_...
python benchmark.py --model your-model-name --task all
# Results saved to results/your-model-name/
git add results/your-model-name/
git commit -m "bench: add results for your-model-name"

2. Add a New Task

Tasks live in tasks/. Each task needs:

  • TASK_ID: unique string
  • DIFFICULTY: "easy" | "medium" | "hard"
  • MAX_STEPS: integer
  • DESCRIPTION: markdown string shown to the agent
  • INITIAL_FILES: dict of {path: content} for the git sandbox
  • REQUIRED_KEYWORDS_IN_REVIEW: list of strings the grader checks for
  • PASSING_TESTS: expected number of passing tests

Copy tasks/easy_task.py as a template. Register your task in tasks/task_registry.py.
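
As a rough illustration of the fields listed above, a new task module might look like the sketch below. The field names come from the list; every value (the task ID, file paths, file contents, keywords, step budget) is a made-up placeholder, and tasks/easy_task.py remains the authoritative template.

```python
"""Hypothetical example task: fix an off-by-one bug in a pagination helper."""

TASK_ID: str = "example_pagination_fix"  # placeholder; must be unique
DIFFICULTY: str = "easy"                 # one of "easy" | "medium" | "hard"
MAX_STEPS: int = 20                      # placeholder step budget

# Markdown text shown to the agent.
DESCRIPTION: str = """\
## Fix the pagination helper

`paginate()` drops the last item of every page. Fix it and write a short review.
"""

# Files written into the git sandbox before the agent starts: {path: content}.
INITIAL_FILES: dict[str, str] = {
    "app/pagination.py": (
        "def paginate(items, page, size):\n"
        "    start = page * size\n"
        "    return items[start:start + size - 1]  # bug: off by one\n"
    ),
    "tests/test_pagination.py": (
        "from app.pagination import paginate\n\n"
        "def test_full_page():\n"
        "    assert paginate(list(range(10)), 0, 5) == [0, 1, 2, 3, 4]\n"
    ),
}

# Strings the grader looks for in the agent's review.
REQUIRED_KEYWORDS_IN_REVIEW: list[str] = ["off-by-one", "slice"]

# Expected number of passing tests once the bug is fixed.
PASSING_TESTS: int = 1
```

How the module is registered in tasks/task_registry.py is not shown here because that interface is not described in this guide.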

3. Improve the Grader

grader.py is the most security-sensitive file. Any change must:

  • Not break existing tests: pytest tests/test_environment.py -v
  • Remain deterministic (same inputs → same score, always)
  • Include a test case in tests/test_environment.py
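
A determinism check for the grader could take roughly the following shape. The `grade` function here is a stand-in written only for this sketch (the real entry point and signature in grader.py may differ); the point is the pattern of scoring identical inputs repeatedly and asserting that only one distinct result appears.

```python
"""Sketch of a determinism test; `grade` is a stand-in, not the real grader API."""


def grade(
    review: str,
    required_keywords: list[str],
    tests_passed: int,
    tests_expected: int,
) -> float:
    """Toy scorer: half from keyword coverage, half from passing tests."""
    text = review.lower()
    hits = sum(1 for kw in required_keywords if kw.lower() in text)
    keyword_score = hits / len(required_keywords) if required_keywords else 1.0
    if tests_expected:
        test_score = min(tests_passed, tests_expected) / tests_expected
    else:
        test_score = 1.0
    return round(0.5 * keyword_score + 0.5 * test_score, 4)


def test_grade_is_deterministic() -> None:
    """Identical inputs must always produce the identical score."""
    kwargs = dict(
        review="Fixed the off-by-one in the slice bounds.",
        required_keywords=["off-by-one", "slice"],
        tests_passed=1,
        tests_expected=1,
    )
    scores = {grade(**kwargs) for _ in range(50)}
    assert len(scores) == 1
```

A test in this style catches accidental sources of nondeterminism such as unseeded randomness, wall-clock reads, or iteration over unordered collections.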

4. Report Failure Modes

If a model fails in an interesting way, open an issue with:

  • Task ID
  • Model name
  • Step count at failure
  • Last observation output

This data directly improves the benchmark.
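
To keep reports consistent, the four fields above can be collected in a small structure before pasting them into an issue. This helper is purely illustrative and not part of the repository; the class and method names are invented for the sketch.

```python
"""Sketch: format a failure report for an issue (helper is hypothetical)."""
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureReport:
    task_id: str
    model_name: str
    step_count: int
    last_observation: str

    def to_issue_body(self) -> str:
        """Render the report as plain text, one field per line."""
        lines = [
            f"Task ID: {self.task_id}",
            f"Model name: {self.model_name}",
            f"Step count at failure: {self.step_count}",
            "Last observation output:",
            self.last_observation.strip(),
        ]
        return "\n".join(lines)
```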

Development Setup

git clone https://github.com/yourname/teamforge
cd teamforge
python -m venv venv && source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env   # add your GROQ_API_KEY
pytest tests/ -v       # must be 21/21 green before any PR

Code Standards

  • All models use Pydantic v2
  • All public functions have type hints
  • All new files have a module docstring
  • ruff check . must pass clean
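
Two of these rules (module docstrings and type hints) can be spot-checked locally with the standard library before opening a PR. The sketch below is an assumption about how such a check might be written, not a tool shipped with TeamForge; `check_standards` is a hypothetical name, it treats any function whose name does not start with an underscore as public, and it does not inspect `*args` or `**kwargs`.

```python
"""Sketch: check source text for a module docstring and typed public functions."""
import ast


def check_standards(source: str) -> list[str]:
    """Return human-readable problems found in `source` (empty list means clean)."""
    tree = ast.parse(source)
    problems: list[str] = []
    if ast.get_docstring(tree) is None:
        problems.append("missing module docstring")
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if node.name.startswith("_"):
            continue  # private helpers are skipped in this sketch
        params = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
        untyped = [
            p.arg
            for p in params
            if p.annotation is None and p.arg not in ("self", "cls")
        ]
        if untyped:
            problems.append(f"{node.name}: untyped parameters {untyped}")
        if node.returns is None:
            problems.append(f"{node.name}: missing return annotation")
    return problems
```

This complements `ruff check .` rather than replacing it; whether ruff itself enforces these rules depends on the project's ruff configuration.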

Benchmark Integrity

TeamForge uses deterministic grading. To prevent score inflation:

  • Never modify test files in tasks
  • The AST-based tamper detector will zero your score if you do
  • Results PRs are verified by re-running the benchmark

Questions? Open an issue. Disagreements about scoring? Open a discussion.