# Contributing to TeamForge
TeamForge is an open benchmark; contributions are how it stays relevant.
## Ways to Contribute
### 1. Submit Your Model's Results

Run the benchmark and open a PR adding your results to `results/`:

```shell
export GROQ_API_KEY=gsk_...
python benchmark.py --model your-model-name --task all
# Results saved to results/your-model-name/
git add results/your-model-name/
git commit -m "bench: add results for your-model-name"
```
### 2. Add a New Task

Tasks live in `tasks/`. Each task needs:

- `TASK_ID`: unique string
- `DIFFICULTY`: `"easy"` | `"medium"` | `"hard"`
- `MAX_STEPS`: integer
- `DESCRIPTION`: markdown string shown to the agent
- `INITIAL_FILES`: dict of `{path: content}` for the git sandbox
- `REQUIRED_KEYWORDS_IN_REVIEW`: list of strings the grader checks for
- `PASSING_TESTS`: expected number of passing tests

Copy `tasks/easy_task.py` as a template, then register your task in `tasks/task_registry.py`.
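A minimal task module might look like the sketch below. The field names and types come from the list above; all concrete values (the task ID, file contents, keywords) are made up for illustration, so model yours on `tasks/easy_task.py` rather than this.

```python
"""Example task: fix a broken addition helper in a tiny git sandbox."""

TASK_ID = "example_fix_add"  # unique string
DIFFICULTY = "easy"          # "easy" | "medium" | "hard"
MAX_STEPS = 10               # integer step budget for the agent
DESCRIPTION = "The `add` function in `calc.py` is broken. Fix it so the test passes."
INITIAL_FILES = {            # {path: content} seeded into the git sandbox
    "calc.py": "def add(a, b):\n    return a - b\n",
    "test_calc.py": (
        "from calc import add\n\n"
        "def test_add():\n"
        "    assert add(2, 3) == 5\n"
    ),
}
REQUIRED_KEYWORDS_IN_REVIEW = ["add", "sign"]  # grader checks the review mentions these
PASSING_TESTS = 1            # expected number of passing tests
```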
### 3. Improve the Grader

`grader.py` is the most security-sensitive file. Any change must:

- Not break existing tests: `pytest tests/test_environment.py -v`
- Remain deterministic (same inputs → same score, always)
- Include a test case in `tests/test_environment.py`
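A determinism test can be as simple as grading the same inputs twice and comparing scores. This sketch uses a stand-in keyword-counting `grade` function, since the real grader's signature isn't shown here; only the shape of the test is the point.

```python
"""Sketch of a determinism test for tests/test_environment.py."""


def grade(review_text: str, required_keywords: list[str]) -> float:
    # Stand-in for the real grader: fraction of required keywords
    # that appear in the review. Purely deterministic by construction.
    hits = sum(1 for kw in required_keywords if kw in review_text)
    return hits / len(required_keywords)


def test_grader_is_deterministic():
    review = "The add function used the wrong sign; changed it to a + b."
    keywords = ["add", "sign"]
    first = grade(review, keywords)
    second = grade(review, keywords)
    assert first == second  # same inputs must always give the same score


test_grader_is_deterministic()
```

If your grader touches anything nondeterministic (time, randomness, filesystem ordering), seed or sort it before scoring.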
### 4. Report Failure Modes

If a model fails in an interesting way, open an issue with:
- Task ID
- Model name
- Step count at failure
- Last observation output
This data directly improves the benchmark.
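A report covering the four items above might look like this (all values here are hypothetical; use whatever format is convenient as long as the four fields are present):

```
**Task ID:** medium_refactor
**Model:** your-model-name
**Step count at failure:** 7 / 15
**Last observation:** pytest exited 1; model then re-ran the same failing
command instead of reading the traceback.
```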
## Development Setup

```shell
git clone https://github.com/yourname/teamforge
cd teamforge
python -m venv venv && source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env   # add your GROQ_API_KEY
pytest tests/ -v       # must be 21/21 green before any PR
```
## Code Standards

- All models use Pydantic v2
- All public functions have type hints
- All new files have a module docstring
- `ruff check .` must pass clean
## Benchmark Integrity
TeamForge uses deterministic grading. To prevent score inflation:
- Never modify test files in tasks
- The AST-based tamper detector will zero your score if you do
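One common way an AST-based detector works is to fingerprint each test file's parsed AST and compare it to a stored hash, so formatting-only edits pass while any logic change is caught. This is a sketch of the general technique, not the actual `grader.py` code:

```python
"""Sketch: flag tampering with task test files via AST fingerprints."""
import ast
import hashlib


def ast_fingerprint(source: str) -> str:
    # Hash the dump of the parsed AST (no line/column attributes), so
    # whitespace and comment changes leave the fingerprint unchanged
    # while any change to the code's logic alters it.
    tree = ast.parse(source)
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()


original = "def test_add():\n    assert add(2, 3) == 5\n"
reformatted = "def test_add():\n    assert add(2, 3) == 5  # unchanged logic\n"
tampered = "def test_add():\n    assert True\n"

assert ast_fingerprint(original) == ast_fingerprint(reformatted)
assert ast_fingerprint(original) != ast_fingerprint(tampered)  # would zero the score
```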
- Results PRs are verified by re-running the benchmark
Questions? Open an issue. Disagreements about scoring? Open a discussion.