	# Verification Specification

	Feature: F007
	Generated from: specs/F007-VERIFICATION_INPUT.json
	Generated: 2026-03-27

	---

	## 1. Unit Tests

	### Dockerfile Validation

	\| Test \| Description \| Input \| Expected \| Category \|
	\|------\|-------------\|-------\|----------\|----------\|
	\| test_dockerfile_exists \| Dockerfile exists at server/Dockerfile \| N/A \| File exists \| happy \|
	\| test_dockerfile_has_base_image_arg \| BASE_IMAGE ARG is declared \| Parse Dockerfile \| `ARG BASE_IMAGE` present \| happy \|
	\| test_dockerfile_port_env_variable \| PORT env var with fallback to 8000 \| Parse Dockerfile \| `ENV PORT` or CMD reads `$PORT` \| happy \|
	\| test_dockerfile_cmd_uses_port_env \| CMD respects PORT env override \| Set `PORT=7860` \| Server binds to 7860 \| happy \|
	\| test_dockerfile_non_root_user \| Container runs as non-root user \| Parse Dockerfile \| `USER appuser` or equivalent non-root USER directive \| security \|
	\| test_dockerfile_copies_databases \| Spider databases are bundled \| Parse Dockerfile \| COPY instruction includes `data/databases/` \| happy \|
	\| test_dockerfile_healthcheck \| Health check endpoint configured \| Parse Dockerfile \| HEALTHCHECK directive present \| happy \|
	\| test_dockerfile_no_dev_dependencies \| No test/dev packages in final image \| Inspect final stage \| No pytest, ruff, etc. \| edge \|

	Run: `uv run pytest tests/unit/test_dockerfile.py -v`
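The static checks in the table above could be sketched as follows. This is a minimal illustration, not the project's actual test code: the helper names `dockerfile_instructions` and `check_dockerfile` are hypothetical, and the `SAMPLE` Dockerfile (including its image tag, `useradd` line, and `uvicorn` command) is an assumed example rather than the contents of `server/Dockerfile`.

```python
import re


def dockerfile_instructions(text: str) -> list[tuple[str, str]]:
    """Return (INSTRUCTION, arguments) pairs, joining backslash-continued lines."""
    joined = re.sub(r"\\\s*\n", " ", text)
    pairs = []
    for line in joined.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        keyword, _, rest = line.partition(" ")
        pairs.append((keyword.upper(), rest.strip()))
    return pairs


def check_dockerfile(text: str) -> dict[str, bool]:
    """Evaluate a few of the parse-only checks from the table (hypothetical helper)."""
    instrs = dockerfile_instructions(text)
    users = [args for kw, args in instrs if kw == "USER"]
    return {
        "base_image_arg": any(
            kw == "ARG" and args.split("=")[0] == "BASE_IMAGE" for kw, args in instrs
        ),
        # The last USER directive decides which user the container runs as.
        "non_root_user": bool(users) and users[-1].split(":")[0] not in ("root", "0"),
        "copies_databases": any(
            kw == "COPY" and "data/databases" in args for kw, args in instrs
        ),
        "healthcheck": any(
            kw == "HEALTHCHECK" and args.upper() != "NONE" for kw, args in instrs
        ),
    }


# Assumed sample for illustration only; the real Dockerfile may look different.
SAMPLE = """\
ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
FROM ${BASE_IMAGE}
COPY data/databases/ /app/env/data/databases/
RUN useradd --create-home appuser
USER appuser
HEALTHCHECK CMD curl -f http://localhost:${PORT:-8000}/health || exit 1
CMD ["sh", "-c", "uvicorn server.app:app --host 0.0.0.0 --port ${PORT:-8000}"]
"""

print(check_dockerfile(SAMPLE))
```

Parsing instructions rather than grepping raw text keeps a commented-out `# USER appuser` line from satisfying the non-root check.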

	### openenv.yaml Manifest

	\| Test \| Description \| Input \| Expected \| Category \|
	\|------\|-------------\|-------\|----------\|----------\|
	\| test_manifest_exists \| openenv.yaml exists at project root \| N/A \| File exists \| happy \|
	\| test_manifest_spec_version \| spec_version field equals 1 \| Parse YAML \| `spec_version: 1` \| happy \|
	\| test_manifest_name \| name field is sql_env \| Parse YAML \| `name: sql_env` \| happy \|
	\| test_manifest_type_space \| type field is 'space' \| Parse YAML \| `type: space` \| happy \|
	\| test_manifest_runtime_fastapi \| runtime field is 'fastapi' \| Parse YAML \| `runtime: fastapi` \| happy \|
	\| test_manifest_app_entrypoint \| app field points to valid module \| Parse YAML \| `app: server.app:app` \| happy \|
	\| test_manifest_port \| port field is 8000 \| Parse YAML \| `port: 8000` \| happy \|
	\| test_manifest_no_extra_fields \| No unrecognized fields \| Parse YAML \| Only spec_version, name, type, runtime, app, port \| edge \|
	\| test_manifest_missing_required_field \| Missing field produces validation error \| Remove `name` \| Validation error \| error \|

	Run: `uv run pytest tests/unit/test_manifest.py -v`
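One way to express the manifest checks above is a single function that reports every problem at once. The sketch below assumes the manifest has already been parsed into a `dict` (in practice probably with a YAML parser such as PyYAML's `yaml.safe_load`, which is a third-party dependency and an assumption here); `EXPECTED` and `manifest_problems` are hypothetical names.

```python
# Expected values taken from the table above.
EXPECTED = {
    "spec_version": 1,
    "name": "sql_env",
    "type": "space",
    "runtime": "fastapi",
    "app": "server.app:app",
    "port": 8000,
}


def manifest_problems(manifest: dict) -> list[str]:
    """Return human-readable problems; an empty list means the manifest passes."""
    problems = []
    for key, expected in EXPECTED.items():
        if key not in manifest:
            problems.append(f"missing field: {key}")
        elif manifest[key] != expected:
            problems.append(f"{key}: expected {expected!r}, got {manifest[key]!r}")
    for key in manifest:
        if key not in EXPECTED:
            problems.append(f"unrecognized field: {key}")
    return problems


# Removing `name` mirrors test_manifest_missing_required_field.
without_name = {k: v for k, v in EXPECTED.items() if k != "name"}
print(manifest_problems(without_name))
```

Returning a list instead of raising on the first mismatch lets one test run surface every manifest problem together.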

	### Blog Outline (docs/blog-outline.md)

	\| Test \| Description \| Input \| Expected \| Category \|
	\|------\|-------------\|-------\|----------\|----------\|
	\| test_blog_outline_exists \| Blog outline file exists \| N/A \| File at `docs/blog-outline.md` \| happy \|
	\| test_blog_has_hook_section \| Hook section present \| Parse markdown \| Section heading for hook/intro \| happy \|
	\| test_blog_has_problem_section \| Problem section present \| Parse markdown \| Section about static benchmarks \| happy \|
	\| test_blog_has_solution_section \| Solution/architecture section present \| Parse markdown \| Section about SQLEnv architecture \| happy \|
	\| test_blog_has_results_placeholder \| Results placeholder for F006 \| Parse markdown \| Placeholder text for training results \| happy \|
	\| test_blog_has_try_it_section \| Try-it-yourself section with links \| Parse markdown \| Links to HF Space, notebook, GitHub \| happy \|
	\| test_blog_links_not_broken \| All links in blog are valid or marked placeholder \| Parse markdown \| No dead internal links \| edge \|
	\| test_blog_minimum_length \| Blog outline has substantive content \| Parse markdown \| At least 200 words \| edge \|

	Run: `uv run pytest tests/unit/test_blog_outline.py -v`
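The blog outline checks reduce to a few text-level helpers. The following is a rough sketch under stated assumptions: the keyword lists in `REQUIRED` are guesses at how section headings might be worded, only ATX-style (`#`) headings and inline `[text](target)` links are recognized, and all function names are hypothetical.

```python
import re

# Assumed heading keywords per required section; adjust to the real outline.
REQUIRED = {
    "hook": ("hook", "intro"),
    "problem": ("problem",),
    "solution": ("solution", "architecture"),
    "results": ("results",),
    "try_it": ("try it",),
}


def headings(markdown: str) -> list[str]:
    """ATX headings ('#' through '######') with the marker stripped."""
    pattern = re.compile(r"^(#{1,6})\s+(.+)$", re.MULTILINE)
    return [m.group(2).strip() for m in pattern.finditer(markdown)]


def word_count(markdown: str) -> int:
    """Approximate word count, counting runs of word characters."""
    return len(re.findall(r"\b\w+\b", markdown))


def links(markdown: str) -> list[tuple[str, str]]:
    """(text, target) pairs for inline links such as [text](target)."""
    return re.findall(r"\[([^\]]+)\]\(([^)\s]+)\)", markdown)


def missing_sections(markdown: str, required: dict[str, tuple[str, ...]]) -> list[str]:
    """Names of required sections with no heading containing any of their keywords."""
    titles = [h.lower() for h in headings(markdown)]
    return [
        name
        for name, keywords in required.items()
        if not any(k in title for title in titles for k in keywords)
    ]


print(missing_sections("# Hook\n## The Problem\n", REQUIRED))
```

A minimum-length test would then assert `word_count(text) >= 200`, and a link test would walk `links(text)` and resolve relative targets against the repository root.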

	### Training Notebook (notebooks/train_grpo.ipynb)

	\| Test \| Description \| Input \| Expected \| Category \|
	\|------\|-------------\|-------\|----------\|----------\|
	\| test_notebook_exists \| Notebook file exists \| N/A \| File at `notebooks/train_grpo.ipynb` \| happy \|
	\| test_notebook_valid_json \| Notebook is valid JSON / ipynb format \| Parse file \| Valid nbformat structure \| happy \|
	\| test_notebook_has_setup_cell \| Setup cell with pip install \| Inspect cells \| Cell containing `pip install` \| happy \|
	\| test_notebook_has_connect_cell \| Connect cell using SQLEnvClient \| Inspect cells \| Cell importing/using SQLEnvClient \| happy \|
	\| test_notebook_has_train_cell \| Training cell with GRPO loop \| Inspect cells \| Cell with training logic \| happy \|
	\| test_notebook_has_eval_cell \| Evaluation cell for held-out questions \| Inspect cells \| Cell with evaluation logic \| happy \|
	\| test_notebook_has_plot_cell \| Plotting cell with matplotlib \| Inspect cells \| Cell importing matplotlib and plotting \| happy \|
	\| test_notebook_colab_compatible \| Colab badge or runtime metadata \| Inspect metadata \| `colab` in metadata or Colab badge in first cell \| happy \|
	\| test_notebook_no_hardcoded_paths \| No absolute local paths \| Inspect all cells \| No `/Users/`, `/home/`, `C:\\` paths \| edge \|
	\| test_notebook_cells_ordered \| Setup before connect before train \| Inspect cell order \| Correct logical ordering \| edge \|
	\| test_notebook_empty_outputs \| Notebook shipped with cleared outputs \| Inspect cells \| All `outputs` arrays empty \| edge \|

	Run: `uv run pytest tests/unit/test_notebook.py -v`
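Because an `.ipynb` file is plain JSON, the edge-case checks above need only the standard library. The sketch below is illustrative: the helper names are hypothetical, and it deliberately skips full schema validation, which the `nbformat` package would normally handle.

```python
import json
import re

# Path prefixes named in the table: /Users/, /home/, and Windows drive paths such as C:\
ABSOLUTE_PATH = re.compile(r"/Users/|/home/|[A-Za-z]:\\")


def load_notebook(path: str) -> dict:
    """Parse the notebook; json.load raises an error if the file is not valid JSON."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def cell_source(cell: dict) -> str:
    """Notebook sources may be stored as a single string or a list of lines."""
    src = cell.get("source", "")
    return "".join(src) if isinstance(src, list) else src


def hardcoded_path_cells(nb: dict) -> list[int]:
    """Indices of cells whose source mentions an absolute local path."""
    return [
        i for i, cell in enumerate(nb.get("cells", []))
        if ABSOLUTE_PATH.search(cell_source(cell))
    ]


def cells_with_outputs(nb: dict) -> list[int]:
    """Indices of code cells that still carry saved outputs."""
    return [
        i for i, cell in enumerate(nb.get("cells", []))
        if cell.get("cell_type") == "code" and cell.get("outputs")
    ]
```

Both checks return offending cell indices rather than a boolean, so a failing test can say which cell to fix.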

	---

	## 2. Integration Tests

	### Flow: Local Docker Build and Run

	\| Step \| Action \| Expected \| Verification \|
	\|------\|--------\|----------\|--------------\|
	\| 1 \| `docker build -t sql-env:test -f server/Dockerfile .` \| Build succeeds with exit code 0 \| Check exit code \|
	\| 2 \| `docker run -d -p 8000:8000 --name sql-env-test sql-env:test` \| Container starts \| Container running (`docker ps`) \|
	\| 3 \| Wait for health check (up to 30s) \| `/health` returns 200 \| `curl -f http://localhost:8000/health` \|
	\| 4 \| Connect WebSocket client, call reset \| Episode starts, observation returned \| Valid SQLObservation JSON \|
	\| 5 \| Send DESCRIBE action via WebSocket \| Column info returned \| Non-empty result field \|
	\| 6 \| Send ANSWER action via WebSocket \| Episode ends, reward returned \| `done: true`, reward is numeric \|
	\| 7 \| Stop container \| Container stops cleanly \| `docker stop sql-env-test` exits 0 \|

	Run: `uv run pytest tests/integration/test_docker_local.py -v`
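Step 3 of the flow above amounts to polling with a deadline. A minimal sketch, assuming the health endpoint answers plain HTTP on `/health`; `http_ok` and `wait_until` are hypothetical helper names, not part of the project.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """True if the URL answers with HTTP 200; connection errors count as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def wait_until(probe: Callable[[], bool], timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Call probe repeatedly until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example use against a locally running container (not executed here):
# healthy = wait_until(lambda: http_ok("http://localhost:8000/health"), timeout=30.0)
```

Separating the probe from the polling loop lets the same `wait_until` serve the PORT override flow by passing a probe bound to port 7860.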

	### Flow: PORT Override for HF Spaces

	\| Step \| Action \| Expected \| Verification \|
	\|------\|--------\|----------\|--------------\|
	\| 1 \| `docker run -d -p 7860:7860 -e PORT=7860 --name sql-env-port sql-env:test` \| Container starts on port 7860 \| Container running \|
	\| 2 \| `curl -f http://localhost:7860/health` \| Health check passes \| HTTP 200 \|
	\| 3 \| Probe port 8000 from the host \| No response on 8000 \| `curl -f http://localhost:8000/health` exits non-zero \|

	Run: `uv run pytest tests/integration/test_port_override.py -v`
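The "not listening" check in step 3 can also be done without `curl` by attempting a TCP connection. This is a small sketch using only the standard library; `is_listening` is a hypothetical helper name.

```python
import socket


def is_listening(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


# Example assertions for the flow above (not executed here):
# assert is_listening(7860)
# assert not is_listening(8000)
```

One caveat: with `-p 7860:7860` only port 7860 is published, so a host-side probe of port 8000 fails regardless of what the server binds inside the container. The host-side check therefore confirms the published mapping; it does not by itself prove that the server ignored the default port.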

	### Flow: Database Bundling Verification

	\| Step \| Action \| Expected \| Verification \|
	\|------\|--------\|----------\|--------------\|
	\| 1 \| Build Docker image \| Build succeeds \| Exit code 0 \|
	\| 2 \| `docker run --rm sql-env:test ls /app/env/data/databases/` \| Spider databases present \| At least one database directory listed \|
	\| 3 \| `docker run --rm sql-env:test find /app/env/data/databases/ -name "*.sqlite"` \| SQLite files present \| At least one .sqlite file found \|
	\| 4 \| Start container and reset episode \| Episode loads a bundled database \| No "database not found" error \|

	Run: `uv run pytest tests/integration/test_db_bundling.py -v`
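The `ls` and `find` steps above have a straightforward Python equivalent, which can also confirm that each file really is a SQLite database rather than a file with the right extension. A sketch with hypothetical helper names; it would need to run inside the container (or against a copied-out directory) to be meaningful.

```python
from pathlib import Path

# Header string found at the start of a non-empty SQLite 3 database file.
SQLITE_MAGIC = b"SQLite format 3\x00"


def sqlite_files(root: Path) -> list[Path]:
    """All *.sqlite files under root, searched recursively, in sorted order."""
    return sorted(p for p in root.rglob("*.sqlite") if p.is_file())


def is_sqlite(path: Path) -> bool:
    """True if the file starts with the SQLite 3 header string."""
    with path.open("rb") as fh:
        return fh.read(len(SQLITE_MAGIC)) == SQLITE_MAGIC


# Example use inside the image (path taken from the flow above; not executed here):
# found = sqlite_files(Path("/app/env/data/databases"))
# assert found and all(is_sqlite(p) for p in found)
```

Opening each file for reading doubles as a permission check: if the non-root user cannot read a database, `is_sqlite` raises `PermissionError` instead of silently passing.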

	---

	## 3. API Tests

	No new API endpoints are introduced by F007. The existing `/health`, WebSocket, and REST endpoints from prior features are tested via integration tests above.

	---

	## 4. E2E Tests

	### Scenario: Judge Experience -- Visit HF Space and Play Episode

	Setup: Docker container running (locally simulating HF Space)
	Actions:
	1. Open health endpoint URL -- confirm service is up
	2. Connect via WebSocket
	3. Call `reset` -- receive initial observation with question and schema
	4. Call `step` with DESCRIBE action -- receive column details
	5. Call `step` with QUERY action -- receive query results
	6. Call `step` with ANSWER action -- receive terminal observation with reward
	Expected: Full episode completes without errors; reward is 0.0 or 1.0

	Run: `uv run pytest tests/e2e/test_judge_experience.py -v`
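The episode in this scenario can be driven by a small loop that is independent of the transport. The sketch below is deliberately abstract: `EpisodeClient` is a hypothetical interface, the action dictionaries (`action_type`, `argument`) and the observation fields other than `done` and `reward` are assumptions, and `FakeClient` is an in-memory stand-in so the example runs without a container. The real `SQLEnvClient` API may differ.

```python
from typing import Protocol


class EpisodeClient(Protocol):
    """Hypothetical client surface; not the actual SQLEnvClient API."""

    def reset(self) -> dict: ...
    def step(self, action: dict) -> dict: ...


def play_episode(client: EpisodeClient, actions: list[dict]) -> list[dict]:
    """Reset, send each action in order, and return every observation seen."""
    observations = [client.reset()]
    for action in actions:
        if observations[-1].get("done"):
            raise RuntimeError(f"episode ended before {action['action_type']}")
        observations.append(client.step(action))
    return observations


def is_valid_terminal(obs: dict) -> bool:
    """Terminal observation must be done with a numeric reward of 0.0 or 1.0."""
    reward = obs.get("reward")
    is_number = isinstance(reward, (int, float)) and not isinstance(reward, bool)
    return obs.get("done") is True and is_number and reward in (0.0, 1.0)


class FakeClient:
    """In-memory stand-in with made-up data, for illustration only."""

    def reset(self) -> dict:
        return {"question": "How many singers are there?", "done": False, "reward": None}

    def step(self, action: dict) -> dict:
        if action["action_type"] == "ANSWER":
            reward = 1.0 if action["argument"] == "6" else 0.0
            return {"result": "", "done": True, "reward": reward}
        return {"result": f"ran {action['action_type']}", "done": False, "reward": None}


# Assumed action shapes mirroring steps 4 to 6 of the scenario.
ACTIONS = [
    {"action_type": "DESCRIBE", "argument": "singer"},
    {"action_type": "QUERY", "argument": "SELECT COUNT(*) FROM singer"},
    {"action_type": "ANSWER", "argument": "6"},
]

episode = play_episode(FakeClient(), ACTIONS)
print(is_valid_terminal(episode[-1]))
```

Swapping `FakeClient` for a thin wrapper around the real WebSocket client would turn the same loop into the E2E test body.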

	### Scenario: Notebook Cell Sequence Validation

	Setup: Notebook file at `notebooks/train_grpo.ipynb`
	Actions:
	1. Parse notebook JSON
	2. Validate each cell type and content markers in order:
	   - Cell with `pip install` (setup)
	   - Cell with `SQLEnvClient` (connect)
	   - Cell with training loop keywords: `grpo`, `train`, `optimizer` (train)
	   - Cell with `eval` or `accuracy` or `held-out` (evaluate)
	   - Cell with `matplotlib` or `plt.` (plot)
	3. Validate no syntax errors in code cells (compile check)
	Expected: All five cell categories present in correct order; no syntax errors

	Run: `uv run pytest tests/e2e/test_notebook_validation.py -v`
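The ordering and compile checks in this scenario could look roughly like the sketch below. It takes a list of code-cell sources as input; the marker patterns come from the list above, while the function names and the handling of IPython magics (`%pip`, `!pip`) are assumptions. Replacing magic lines with `pass` is a heuristic so that `compile` does not reject notebook-only syntax.

```python
import re
from typing import Optional

# Marker patterns in the required order, taken from the scenario above.
CATEGORIES = [
    ("setup", re.compile(r"pip install")),
    ("connect", re.compile(r"SQLEnvClient")),
    ("train", re.compile(r"grpo|train|optimizer", re.IGNORECASE)),
    ("evaluate", re.compile(r"eval|accuracy|held-out", re.IGNORECASE)),
    ("plot", re.compile(r"matplotlib|plt\.")),
]


def ordered_indices(sources: list[str]) -> Optional[list[int]]:
    """Greedily match each category to the first later cell; None if any is missing."""
    indices, start = [], 0
    for _name, pattern in CATEGORIES:
        match = next(
            (i for i in range(start, len(sources)) if pattern.search(sources[i])), None
        )
        if match is None:
            return None
        indices.append(match)
        start = match + 1  # the next category must appear in a strictly later cell
    return indices


def strip_magics(source: str) -> str:
    """Replace IPython magic and shell lines with `pass`, keeping indentation."""
    out = []
    for line in source.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(("%", "!")):
            out.append(line[: len(line) - len(stripped)] + "pass")
        else:
            out.append(line)
    return "\n".join(out)


def syntax_errors(sources: list[str]) -> list[tuple[int, str]]:
    """(cell index, message) for every cell that fails to compile."""
    errors = []
    for i, src in enumerate(sources):
        try:
            compile(strip_magics(src), f"<cell {i}>", "exec")
        except SyntaxError as exc:
            errors.append((i, exc.msg))
    return errors
```

Taking the earliest possible match for each category is sufficient here: if any valid ordering of the five categories exists, this scan finds one.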

	### Scenario: README Has Competition-Ready Content

	Setup: README.md at project root
	Actions:
	1. Verify README contains project description
	2. Verify README contains quickstart / getting started section
	3. Verify README contains link to HF Space (or placeholder)
	4. Verify README contains link to training notebook
	5. Verify README contains architecture or how-it-works section
	Expected: All five content sections present

	Run: `uv run pytest tests/e2e/test_readme_completeness.py -v`

	---

	## 5. Edge Cases Checklist

	- [ ] Dockerfile builds on CPU-only machine (no CUDA dependencies in final image)
	- [ ] Container memory stays under HF Spaces free tier limit (~16GB)
	- [ ] PORT env variable with non-numeric value handled gracefully
	- [ ] PORT env variable with value 0 or negative handled gracefully
	- [ ] Missing `data/databases/` directory causes clear error at startup, not silent failure
	- [ ] `openenv.yaml` with wrong `spec_version` is rejected by `openenv validate`
	- [ ] Blog outline contains no TODO/FIXME/placeholder markers except the results section
	- [ ] Notebook code cells have no import errors when dependencies are installed
	- [ ] Notebook does not require GPU (runs on Colab free tier CPU)
	- [ ] Container starts within 60 seconds (reasonable cold start)
	- [ ] Docker image size is under 2GB (reasonable for free tier)
	- [ ] `.dockerignore` excludes test files, `.git`, `__pycache__`, `.env`
	- [ ] Non-root user can read database files (file permissions correct)
	- [ ] Container handles SIGTERM gracefully (clean shutdown)
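For the two PORT items in the checklist, "handled gracefully" is open to interpretation. The sketch below takes one reading: fall back to 8000 when `PORT` is unset or empty, and fail fast with a clear message when it is set to something unusable. `resolve_port` is a hypothetical helper, not the server's actual startup code.

```python
import os
from collections.abc import Mapping

DEFAULT_PORT = 8000  # fallback named in the Dockerfile checks


def resolve_port(env: Mapping[str, str], default: int = DEFAULT_PORT) -> int:
    """Return the port to bind, validating the PORT variable if it is set."""
    raw = env.get("PORT", "").strip()
    if not raw:
        return default
    try:
        port = int(raw)
    except ValueError:
        raise ValueError(f"PORT must be an integer, got {raw!r}") from None
    if not 1 <= port <= 65535:
        raise ValueError(f"PORT must be between 1 and 65535, got {port}")
    return port


if __name__ == "__main__":
    print(resolve_port(os.environ))
```

Passing the environment in as a mapping keeps the function testable without mutating `os.environ`. An alternative reading of "gracefully" would log a warning and fall back to the default instead of raising.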

	---

	## 6. Evidence Requirements

	\| Category \| Evidence Type \| Example \|
	\|----------\|---------------\|---------\|
	\| Unit tests \| pytest output \| `X passed` \|
	\| Integration \| pytest + docker logs \| `Container healthy, episode complete` \|
	\| Dockerfile \| docker build output \| `Successfully built <hash>` \|
	\| Port override \| curl output \| `HTTP 200 on port 7860` \|
	\| Database bundling \| docker run output \| `find` lists .sqlite files \|
	\| Blog outline \| File exists + content check \| `5 sections present` \|
	\| Notebook \| nbformat validation \| `Valid ipynb, 5+ cells in order` \|
	\| README \| Content grep \| `All required sections present` \|
	\| E2E \| Full episode log \| `reset -> steps -> answer, reward=1.0` \|
	\| Image size \| docker images output \| `< 2GB` \|

	---

	## 7. External Deployment Prerequisites and Remediation

	Use this checklist when deployment verification fails with external auth/access errors.

	### GHCR Base Image Access (`403 Forbidden`)

	1. Authenticate Docker to GHCR:
	   - `echo "$GITHUB_TOKEN" \| docker login ghcr.io -u <github-username> --password-stdin`
	2. Ensure `GITHUB_TOKEN` has the `read:packages` scope and can read `ghcr.io/meta-pytorch/openenv-base`.
	3. Retry the build using an explicit lowercase tag:
	   - `uv run openenv build -t openenv-sql-env-f007-hf-submission`

	### Hugging Face Push Readiness

	1. Authenticate the Hugging Face CLI:
	   - `huggingface-cli login`
	2. Confirm the target Space repo exists and the token has write access.
	3. Run the push:
	   - `uv run openenv push`

	### Verification Outcome Rules for External Failures

	- If local tests pass but GHCR/HF auth fails, record the status as partial verification (external blocker) and include the exact remediation commands above.
	- Do not mark the verifier result as `approved` until at least one authenticated build+push attempt is documented.
	- Record authenticated evidence in `specs/F007-DEMO.md` under `## Live Local Proof` with separate `Authenticated Build Evidence` and `Hugging Face Push Evidence` subsections containing raw command output.