Implementation Checklist
Slice 1: Submission reliability
- pip install -e '.[dev]'
- pytest -q
- openenv validate .
- scripts/validate-local.sh --skip-docker
- scripts/validate-local.sh on a Docker-enabled machine
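
The Slice 1 commands can be chained so that the first failure stops the run. The following is a minimal sketch, not part of the repository: the helper name run_steps and the list SLICE_1_STEPS are hypothetical, and the commands are copied verbatim from the checklist on the assumption that they are run from the repository root.

```python
import subprocess

# Commands copied from the Slice 1 checklist, in order.
# The last entry is the full run, which needs a Docker-enabled machine.
SLICE_1_STEPS = [
    ["pip", "install", "-e", ".[dev]"],
    ["pytest", "-q"],
    ["openenv", "validate", "."],
    ["scripts/validate-local.sh", "--skip-docker"],
    ["scripts/validate-local.sh"],
]


def run_steps(steps):
    """Run each command in order.

    Returns the first command that fails (non-zero exit code or
    executable not found), or None if every command succeeds.
    """
    for cmd in steps:
        print("$ " + " ".join(cmd), flush=True)
        try:
            result = subprocess.run(cmd)
        except OSError:
            # Raised when the executable is missing or not runnable.
            return cmd
        if result.returncode != 0:
            return cmd
    return None
```

Calling run_steps(SLICE_1_STEPS) returns None when every step passes; otherwise it returns the failing command, which can be printed and fixed before rerunning. On a machine without Docker, dropping the last entry of the list matches the --skip-docker path.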
Slice 2: Benchmark quality
- verify each task has a strong-score happy path
- verify unsafe or irrelevant actions reduce final score
- verify duplicate handling is required for medium and hard tasks
- verify the hard task's reply does not overclaim a confirmed breach
Slice 3: Docs and demo polish
- README reflects final baseline and deployment instructions
- benchmark brief and glossary are present
- Hugging Face Space secrets are configured
- remote /health and /reset checks succeed
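
The remote /health and /reset checks can be scripted with the Python standard library. This is a sketch under stated assumptions: it assumes /health answers GET and /reset accepts POST with an empty JSON body, and that both return HTTP 200 on success. None of that is confirmed by this checklist, and the helper names http_status and check_space are hypothetical.

```python
import json
import urllib.error
import urllib.request


def http_status(url, method="GET", body=None, timeout=10.0):
    """Send one request and return its HTTP status code, or None if unreachable."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    req = urllib.request.Request(url, data=data, method=method)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def check_space(base_url, fetch=http_status):
    """Probe /health and /reset; return a dict mapping each path to True/False.

    Assumption: GET for /health, POST with an empty JSON body for /reset,
    and status 200 means success.
    """
    base = base_url.rstrip("/")
    return {
        "/health": fetch(base + "/health", method="GET") == 200,
        "/reset": fetch(base + "/reset", method="POST", body={}) == 200,
    }
```

Passing the Space URL to check_space gives a quick pass/fail per endpoint. The fetch parameter exists so the logic can be exercised without network access by supplying a stub.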
Slice 4: Final submission
- push to Hugging Face Space
- run remote validator against the Space URL
- record final baseline scores and URL in the submission notes