Contributing to TeamForge

TeamForge is an open benchmark — contributions are how it stays relevant.

Ways to Contribute

1. Submit Your Model's Results

Run the benchmark and open a PR adding your results to results/:

export GROQ_API_KEY=gsk_...
python benchmark.py --model your-model-name --task all
# Results saved to results/your-model-name/
git add results/your-model-name/
git commit -m "bench: add results for your-model-name"

2. Add a New Task

Tasks live in tasks/. Each task needs:

TASK_ID — unique string
DIFFICULTY — "easy" | "medium" | "hard"
MAX_STEPS — integer
DESCRIPTION — markdown string shown to agent
INITIAL_FILES — dict of {path: content} for the git sandbox
REQUIRED_KEYWORDS_IN_REVIEW — list of strings grader checks for
PASSING_TESTS — expected number of passing tests

Copy tasks/easy_task.py as a template. Register your task in tasks/task_registry.py.
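
As a rough illustration of the fields listed above, a new task module might look like the sketch below. This assumes the fields are plain module-level constants; tasks/easy_task.py is the authoritative template, and the task content, file paths, and the validate_task helper here are hypothetical.

```python
"""Hypothetical example task: fix an off-by-one bug (illustrative only)."""

TASK_ID: str = "example_off_by_one"          # unique string
DIFFICULTY: str = "easy"                     # "easy" | "medium" | "hard"
MAX_STEPS: int = 10
DESCRIPTION: str = (
    "## Fix the off-by-one bug\n\n"
    "`last_n` returns one item too few. Fix it and request a review."
)
# Files placed in the git sandbox, keyed by path.
INITIAL_FILES: dict[str, str] = {
    "src/listutil.py": "def last_n(items, n):\n    return items[-n + 1:]\n",
    "tests/test_listutil.py": (
        "from src.listutil import last_n\n\n"
        "def test_last_n():\n"
        "    assert last_n([1, 2, 3], 2) == [2, 3]\n"
    ),
}
REQUIRED_KEYWORDS_IN_REVIEW: list[str] = ["off-by-one", "slice"]
PASSING_TESTS: int = 1


def validate_task() -> list[str]:
    """Hypothetical helper: return problems with the fields above (empty = OK)."""
    problems: list[str] = []
    if not TASK_ID.strip():
        problems.append("TASK_ID must be a non-empty string")
    if DIFFICULTY not in {"easy", "medium", "hard"}:
        problems.append("DIFFICULTY must be easy, medium, or hard")
    if MAX_STEPS <= 0:
        problems.append("MAX_STEPS must be a positive integer")
    if not INITIAL_FILES:
        problems.append("INITIAL_FILES must contain at least one file")
    if PASSING_TESTS < 0:
        problems.append("PASSING_TESTS must not be negative")
    return problems


if __name__ == "__main__":
    print(validate_task() or "task definition looks consistent")
```

Running a check like this before registering the task in tasks/task_registry.py catches typos in the difficulty label or an empty sandbox early.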

3. Improve the Grader

grader.py is the most security-sensitive file. Any change must:

Not break existing tests: pytest tests/test_environment.py -v
Remain deterministic (same inputs → same score, always)
Include a test case in tests/test_environment.py
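
The determinism requirement above can be checked with a test along these lines. This is only a sketch: score_review is a stand-in keyword scorer written for this example, not the actual grader.py API, so substitute the real grading entry point when adding a test to tests/test_environment.py.

```python
def score_review(review: str, required_keywords: list[str]) -> float:
    """Stand-in grader (assumption): fraction of required keywords in the review."""
    if not required_keywords:
        return 0.0
    text = review.lower()
    hits = sum(1 for kw in required_keywords if kw.lower() in text)
    return hits / len(required_keywords)


def test_score_is_deterministic() -> None:
    """Same inputs must always produce the same score."""
    review = "Fixed the off-by-one error in the slice bounds."
    keywords = ["off-by-one", "slice", "regression"]
    # Collect scores from repeated runs; a deterministic grader yields one value.
    scores = {score_review(review, keywords) for _ in range(100)}
    assert len(scores) == 1


if __name__ == "__main__":
    test_score_is_deterministic()
    print("deterministic")
```

A grader that reads the clock, uses unseeded randomness, or iterates over unordered data in a way that affects the score would fail this kind of test.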

4. Report Failure Modes

If a model fails in an interesting way, open an issue with:

Task ID
Model name
Step count at failure
Last observation output

This data directly improves the benchmark.
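
If you collect failures programmatically, a small helper can assemble the four items above into an issue body. The FailureReport class and its field names below are hypothetical and not part of TeamForge; the sketch only shows one way to keep reports consistent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureReport:
    """Hypothetical container for the four items a failure issue should include."""

    task_id: str
    model_name: str
    step_count: int
    last_observation: str

    def to_issue_body(self) -> str:
        """Format the report as plain text suitable for pasting into an issue."""
        lines = [
            f"Task ID: {self.task_id}",
            f"Model name: {self.model_name}",
            f"Step count at failure: {self.step_count}",
            "Last observation output:",
            self.last_observation.strip(),
        ]
        return "\n".join(lines)


if __name__ == "__main__":
    report = FailureReport(
        task_id="easy_task",
        model_name="your-model-name",
        step_count=7,
        last_observation="1 failed, 2 passed\n",
    )
    print(report.to_issue_body())
```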

Development Setup

git clone https://github.com/yourname/teamforge
cd teamforge
python -m venv venv && source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env   # add your GROQ_API_KEY
pytest tests/ -v       # must be 21/21 green before any PR

Code Standards

All models use Pydantic v2
All public functions have type hints
All new files have a module docstring
ruff check . must pass clean

Benchmark Integrity

TeamForge uses deterministic grading. To prevent score inflation:

Never modify test files in tasks
The AST-based tamper detector will zero your score if you do
Results PRs are verified by re-running the benchmark
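
To give a feel for why AST-based checks are hard to slip past, here is a minimal sketch of the general technique using Python's standard ast module. It is an assumption about how such a detector could work, not TeamForge's actual implementation; the function names and sample sources are made up for this example.

```python
import ast


def ast_fingerprint(source: str) -> str:
    """Return a structural dump of the code, ignoring comments and layout."""
    # ast.dump omits line/column attributes by default, so whitespace and
    # comment-only changes produce the same fingerprint.
    return ast.dump(ast.parse(source))


def is_tampered(original: str, current: str) -> bool:
    """True if the current file differs structurally from the original."""
    try:
        return ast_fingerprint(original) != ast_fingerprint(current)
    except SyntaxError:
        # Source that no longer parses is treated as tampered in this sketch.
        return True


ORIGINAL = "def test_add():\n    assert add(2, 2) == 4\n"
REFORMATTED = "# harmless comment\ndef test_add():\n\n    assert add(2, 2) == 4\n"
WEAKENED = "def test_add():\n    assert True\n"

if __name__ == "__main__":
    print(is_tampered(ORIGINAL, REFORMATTED))
    print(is_tampered(ORIGINAL, WEAKENED))
```

Because the comparison is on the parsed structure rather than the raw text, reformatting a test file leaves the fingerprint unchanged while weakening an assertion changes it.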

Questions? Open an issue. Disagreements about scoring? Open a discussion.