| # Verification Specification |
|
|
| **Feature:** F007 |
| **Generated from:** specs/F007-VERIFICATION_INPUT.json |
| **Generated:** 2026-03-27 |
| |
| --- |
| |
| ## 1. Unit Tests |
| |
| ### Dockerfile Validation |
| |
| | Test | Description | Input | Expected | Category | |
| |------|-------------|-------|----------|----------| |
| | test_dockerfile_exists | Dockerfile exists at server/Dockerfile | N/A | File exists | happy | |
| | test_dockerfile_has_base_image_arg | BASE_IMAGE ARG is declared | Parse Dockerfile | `ARG BASE_IMAGE` present | happy | |
| | test_dockerfile_port_env_variable | PORT env var with fallback to 8000 | Parse Dockerfile | `ENV PORT` or CMD reads `$PORT` | happy | |
| | test_dockerfile_cmd_uses_port_env | CMD respects PORT env override | Set `PORT=7860` | Server binds to 7860 | happy | |
| | test_dockerfile_non_root_user | Container runs as non-root user | Parse Dockerfile | `USER appuser` or equivalent non-root USER directive | security | |
| | test_dockerfile_copies_databases | Spider databases are bundled | Parse Dockerfile | COPY instruction includes `data/databases/` | happy | |
| | test_dockerfile_healthcheck | Health check endpoint configured | Parse Dockerfile | HEALTHCHECK directive present | happy | |
| | test_dockerfile_no_dev_dependencies | No test/dev packages in final image | Inspect final stage | No pytest, ruff, etc. | edge | |
|
|
| **Run:** `uv run pytest tests/unit/test_dockerfile.py -v` |
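
The "Parse Dockerfile" checks above could be built on a small instruction parser. The sketch below is one possible approach, not the project's actual test code: the helper names (`parse_instructions`, `final_stage`, `runs_as_non_root`, `declares_arg`) are hypothetical, and the sample Dockerfile text, including the `:latest` tag and the `appuser` setup, is illustrative only.

```python
import re

# Illustrative Dockerfile text; the image tag and user setup are assumptions.
SAMPLE_DOCKERFILE = """\
ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
FROM ${BASE_IMAGE}
COPY data/databases/ /app/env/data/databases/
RUN useradd --create-home appuser \\
    && chown -R appuser /app
USER appuser
HEALTHCHECK CMD curl -f http://localhost:${PORT:-8000}/health || exit 1
CMD ["sh", "-c", "uvicorn server.app:app --host 0.0.0.0 --port ${PORT:-8000}"]
"""


def parse_instructions(dockerfile_text):
    """Return (INSTRUCTION, arguments) pairs, joining backslash-continued lines."""
    joined = re.sub(r"\\\s*\n", " ", dockerfile_text)
    pairs = []
    for line in joined.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        keyword, _, args = line.partition(" ")
        pairs.append((keyword.upper(), args.strip()))
    return pairs


def final_stage(pairs):
    """Instructions from the last FROM onward (the stage that becomes the image)."""
    last_from = max((i for i, (kw, _) in enumerate(pairs) if kw == "FROM"), default=0)
    return pairs[last_from:]


def runs_as_non_root(stage_pairs):
    """True if the last USER directive in the stage names a non-root user.

    A stage with no USER directive inherits the base image's user, which this
    sketch conservatively treats as root.
    """
    users = [args for kw, args in stage_pairs if kw == "USER"]
    return bool(users) and users[-1].split(":")[0] not in {"root", "0"}


def declares_arg(pairs, name):
    """True if some ARG instruction declares `name`, with or without a default."""
    return any(
        kw == "ARG" and re.split(r"[=\s]", args, maxsplit=1)[0] == name
        for kw, args in pairs
    )
```

A test such as `test_dockerfile_non_root_user` would then read `server/Dockerfile` and assert `runs_as_non_root(final_stage(parse_instructions(text)))`. Restricting the check to the final stage matters for multi-stage builds, where a `USER` directive in a builder stage does not carry over.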
|
|
| ### openenv.yaml Manifest |
|
|
| | Test | Description | Input | Expected | Category | |
| |------|-------------|-------|----------|----------| |
| | test_manifest_exists | openenv.yaml exists at project root | N/A | File exists | happy | |
| | test_manifest_spec_version | spec_version field equals 1 | Parse YAML | `spec_version: 1` | happy | |
| | test_manifest_name | name field is sql_env | Parse YAML | `name: sql_env` | happy | |
| | test_manifest_type_space | type field is 'space' | Parse YAML | `type: space` | happy | |
| | test_manifest_runtime_fastapi | runtime field is 'fastapi' | Parse YAML | `runtime: fastapi` | happy | |
| | test_manifest_app_entrypoint | app field points to valid module | Parse YAML | `app: server.app:app` | happy | |
| | test_manifest_port | port field is 8000 | Parse YAML | `port: 8000` | happy | |
| | test_manifest_no_extra_fields | No unrecognized fields | Parse YAML | Only spec_version, name, type, runtime, app, port | edge | |
| | test_manifest_missing_required_field | Missing field produces validation error | Remove `name` | Validation error | error | |
|
|
| **Run:** `uv run pytest tests/unit/test_manifest.py -v` |
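
As a rough illustration of the manifest checks above, the following sketch validates an already-parsed manifest. It assumes the YAML has been loaded into a plain `dict` by whatever loader the tests use; `manifest_problems` is a hypothetical helper, and the expected values are copied from the table rather than from any OpenEnv schema.

```python
# Expected values taken from the test table above.
EXPECTED_FIELDS = {
    "spec_version": 1,
    "name": "sql_env",
    "type": "space",
    "runtime": "fastapi",
    "app": "server.app:app",
    "port": 8000,
}


def manifest_problems(manifest):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    present = set(manifest)
    allowed = set(EXPECTED_FIELDS)
    for field in sorted(allowed - present):
        problems.append(f"missing required field: {field}")
    for field in sorted(present - allowed):
        problems.append(f"unrecognized field: {field}")
    for field, expected in EXPECTED_FIELDS.items():
        if field in manifest and manifest[field] != expected:
            problems.append(
                f"{field}: expected {expected!r}, got {manifest[field]!r}"
            )
    return problems
```

Returning every problem at once, rather than raising on the first, keeps a failing test's output useful when several fields are wrong together.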
|
|
| ### Blog Outline (docs/blog-outline.md) |
|
|
| | Test | Description | Input | Expected | Category | |
| |------|-------------|-------|----------|----------| |
| | test_blog_outline_exists | Blog outline file exists | N/A | File at `docs/blog-outline.md` | happy | |
| | test_blog_has_hook_section | Hook section present | Parse markdown | Section heading for hook/intro | happy | |
| | test_blog_has_problem_section | Problem section present | Parse markdown | Section about static benchmarks | happy | |
| | test_blog_has_solution_section | Solution/architecture section present | Parse markdown | Section about SQLEnv architecture | happy | |
| | test_blog_has_results_placeholder | Results placeholder for F006 | Parse markdown | Placeholder text for training results | happy | |
| | test_blog_has_try_it_section | Try-it-yourself section with links | Parse markdown | Links to HF Space, notebook, GitHub | happy | |
| | test_blog_links_not_broken | All links in blog are valid or marked placeholder | Parse markdown | No dead internal links | edge | |
| | test_blog_minimum_length | Blog outline has substantive content | Parse markdown | At least 200 words | edge | |
| |
| **Run:** `uv run pytest tests/unit/test_blog_outline.py -v` |
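
The link and length checks (`test_blog_links_not_broken`, `test_blog_minimum_length`) might look something like the sketch below. The helper names are hypothetical. The spec does not say how a placeholder link is marked, so this version only distinguishes external URLs and in-page anchors from relative paths and leaves placeholder handling out.

```python
import re
from pathlib import Path

# Matches the target of inline markdown links and images: [text](target)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Anything starting with a URI scheme (https:, mailto:, ...) counts as external.
SCHEME_RE = re.compile(r"^[a-z][a-z0-9+.-]*:", re.IGNORECASE)


def dead_internal_links(markdown, base_dir):
    """Return relative link targets that do not exist under base_dir."""
    dead = []
    for target in LINK_RE.findall(markdown):
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue  # external URL or in-page anchor: not checked here
        path_part = target.split("#", 1)[0]
        if not (Path(base_dir) / path_part).exists():
            dead.append(target)
    return dead


def word_count(markdown):
    """Count runs of word characters, so 'held-out' counts as two words."""
    return len(re.findall(r"\w+", markdown))
```

Relative links resolve against the directory containing the markdown file, so for `docs/blog-outline.md` the caller would pass `docs/` as `base_dir`.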
| |
| ### Training Notebook (notebooks/train_grpo.ipynb) |
|
|
| | Test | Description | Input | Expected | Category | |
| |------|-------------|-------|----------|----------| |
| | test_notebook_exists | Notebook file exists | N/A | File at `notebooks/train_grpo.ipynb` | happy | |
| | test_notebook_valid_json | Notebook is valid JSON / ipynb format | Parse file | Valid nbformat structure | happy | |
| | test_notebook_has_setup_cell | Setup cell with pip install | Inspect cells | Cell containing `pip install` | happy | |
| | test_notebook_has_connect_cell | Connect cell using SQLEnvClient | Inspect cells | Cell importing/using SQLEnvClient | happy | |
| | test_notebook_has_train_cell | Training cell with GRPO loop | Inspect cells | Cell with training logic | happy | |
| | test_notebook_has_eval_cell | Evaluation cell for held-out questions | Inspect cells | Cell with evaluation logic | happy | |
| | test_notebook_has_plot_cell | Plotting cell with matplotlib | Inspect cells | Cell importing matplotlib and plotting | happy | |
| | test_notebook_colab_compatible | Colab badge or runtime metadata | Inspect metadata | `colab` in metadata or Colab badge in first cell | happy | |
| | test_notebook_no_hardcoded_paths | No absolute local paths | Inspect all cells | No `/Users/`, `/home/`, `C:\\` paths | edge | |
| | test_notebook_cells_ordered | Setup before connect before train | Inspect cell order | Correct logical ordering | edge | |
| | test_notebook_empty_outputs | Notebook shipped with cleared outputs | Inspect cells | All `outputs` arrays empty | edge | |
|
|
| **Run:** `uv run pytest tests/unit/test_notebook.py -v` |
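
Two of the edge-case checks above (`test_notebook_no_hardcoded_paths` and `test_notebook_empty_outputs`) can be expressed directly against the notebook's JSON. This is a minimal sketch with hypothetical helper names; it assumes the notebook has already been loaded with `json.load` and follows the nbformat 4 layout, where `source` may be a string or a list of strings.

```python
import re

# The path prefixes named in the test table: /Users/, /home/, and drive letters.
LOCAL_PATH_RE = re.compile(r"/Users/|/home/|[A-Za-z]:\\")


def cell_source(cell):
    """Return a cell's source as one string (nbformat allows str or list of str)."""
    source = cell.get("source", "")
    return "".join(source) if isinstance(source, list) else source


def cells_with_local_paths(notebook):
    """Indices of cells whose source mentions an absolute local path."""
    return [
        index
        for index, cell in enumerate(notebook["cells"])
        if LOCAL_PATH_RE.search(cell_source(cell))
    ]


def cells_with_outputs(notebook):
    """Indices of code cells that still carry saved outputs."""
    return [
        index
        for index, cell in enumerate(notebook["cells"])
        if cell.get("cell_type") == "code" and cell.get("outputs")
    ]
```

The path pattern is deliberately crude: a URL containing `/home/` would also match, so a real test may want to exempt markdown cells or known-good URLs.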
|
|
| --- |
|
|
| ## 2. Integration Tests |
|
|
| ### Flow: Local Docker Build and Run |
|
|
| | Step | Action | Expected | Verification | |
| |------|--------|----------|--------------| |
| | 1 | `docker build -t sql-env:test -f server/Dockerfile .` | Build succeeds with exit code 0 | Check exit code | |
| | 2 | `docker run -d -p 8000:8000 --name sql-env-test sql-env:test` | Container starts | Container running (`docker ps`) | |
| | 3 | Wait for health check (up to 30s) | `/health` returns 200 | `curl -f http://localhost:8000/health` | |
| | 4 | Connect WebSocket client, call reset | Episode starts, observation returned | Valid SQLObservation JSON | |
| | 5 | Send DESCRIBE action via WebSocket | Column info returned | Non-empty result field | |
| | 6 | Send ANSWER action via WebSocket | Episode ends, reward returned | `done: true`, reward is numeric | |
| | 7 | Stop container | Container stops cleanly | `docker stop sql-env-test` exits 0 | |
|
|
| **Run:** `uv run pytest tests/integration/test_docker_local.py -v` |
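
Step 3 waits up to 30 seconds for `/health` to return 200. One way to implement that wait is a polling loop with a deadline, sketched below using only the standard library. `health_ok` and `wait_until` are hypothetical names; the `clock` and `sleep` parameters exist only so the loop can be tested without real delays.

```python
import time
import urllib.request


def health_ok(url, timeout=2.0):
    """True if GET url answers with HTTP 200; any connection or HTTP error is False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError (raised for 4xx/5xx) are both OSError subclasses,
        # as are connection-refused and timeout errors.
        return False


def wait_until(probe, deadline_s=30.0, interval_s=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it returns True or deadline_s elapses."""
    give_up_at = clock() + deadline_s
    while True:
        if probe():
            return True
        if clock() >= give_up_at:
            return False
        sleep(interval_s)
```

An integration test would call something like `wait_until(lambda: health_ok("http://localhost:8000/health"))` after `docker run` and fail the step if it returns `False`.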
|
|
| ### Flow: PORT Override for HF Spaces |
|
|
| | Step | Action | Expected | Verification | |
| |------|--------|----------|--------------| |
| | 1 | `docker run -d -p 7860:7860 -e PORT=7860 --name sql-env-port sql-env:test` | Container starts on port 7860 | Container running | |
| | 2 | `curl -f http://localhost:7860/health` | Health check passes | HTTP 200 | |
| | 3 | `curl -f http://localhost:8000/health` | Nothing answers on port 8000 | `curl` exits non-zero (connection refused) | |
|
|
| **Run:** `uv run pytest tests/integration/test_port_override.py -v` |
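
Step 3 can also be checked without `curl` by attempting a TCP connection. The helper below is a small sketch with a hypothetical name. Note one limitation of checking from the host: because the container is started with only `-p 7860:7860`, host port 8000 is closed whether or not the server also bound 8000 inside the container, so this confirms what is published rather than what the process bound.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """True if a TCP connection to (host, port) succeeds within timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```

In the test, `is_listening("127.0.0.1", 7860)` should be `True` and `is_listening("127.0.0.1", 8000)` should be `False`.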
|
|
| ### Flow: Database Bundling Verification |
|
|
| | Step | Action | Expected | Verification | |
| |------|--------|----------|--------------| |
| | 1 | Build Docker image | Build succeeds | Exit code 0 | |
| | 2 | `docker run --rm sql-env:test ls /app/env/data/databases/` | Spider databases present | At least one database directory listed | |
| | 3 | `docker run --rm sql-env:test find /app/env/data/databases/ -name "*.sqlite"` | SQLite files present | At least one .sqlite file found | |
| | 4 | Start container and reset episode | Episode loads a bundled database | No "database not found" error | |
|
|
| **Run:** `uv run pytest tests/integration/test_db_bundling.py -v` |
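
Steps 2 and 3 list files but do not confirm that they are usable databases. If the same directory is reachable from Python (for example inside the container, or in a local checkout), a check along these lines could supplement them. Both helper names are hypothetical, and opening read-only via a `file:` URI is a choice made here so that the check cannot create or modify files.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def find_sqlite_files(root):
    """All *.sqlite files under root, at any depth, in sorted order."""
    return sorted(Path(root).rglob("*.sqlite"))


def is_readable_sqlite(path):
    """True if path opens read-only and its schema table can be queried."""
    uri = f"{Path(path).resolve().as_uri()}?mode=ro"
    try:
        with closing(sqlite3.connect(uri, uri=True)) as conn:
            conn.execute("SELECT name FROM sqlite_master LIMIT 1").fetchall()
        return True
    except sqlite3.Error:
        # Raised for missing files, permission problems, and non-database content.
        return False
```

Running this as the non-root container user would also cover the file-permission item in the edge cases checklist.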
|
|
| --- |
|
|
| ## 3. API Tests |
|
|
| No new API endpoints are introduced by F007. The existing `/health`, WebSocket, and REST endpoints from prior features are covered by the integration tests above. |
|
|
| --- |
|
|
| ## 4. E2E Tests |
|
|
| ### Scenario: Judge Experience -- Visit HF Space and Play Episode |
|
|
| **Setup:** Docker container running (locally simulating HF Space) |
| **Actions:** |
| 1. Open health endpoint URL -- confirm service is up |
| 2. Connect via WebSocket |
| 3. Call `reset` -- receive initial observation with question and schema |
| 4. Call `step` with DESCRIBE action -- receive column details |
| 5. Call `step` with QUERY action -- receive query results |
| 6. Call `step` with ANSWER action -- receive terminal observation with reward |
| **Expected:** Full episode completes without errors; reward is 0.0 or 1.0 |
|
|
| **Run:** `uv run pytest tests/e2e/test_judge_experience.py -v` |
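
The expected outcome of this scenario can be stated as a check over the observations collected during the episode. The sketch below assumes each observation has been decoded into a `dict` with `done` and `reward` keys, matching the field names used in the integration flow above. The exact wire format of `SQLObservation` is not specified here, and `episode_problems` is a hypothetical helper.

```python
def episode_problems(observations):
    """Return problems found in an ordered list of observation dicts."""
    if not observations:
        return ["episode produced no observations"]
    problems = []
    *earlier, last = observations
    for index, observation in enumerate(earlier):
        if observation.get("done", False):
            problems.append(f"observation {index} ended the episode early")
    if last.get("done") is not True:
        problems.append("final observation is not marked done")
    reward = last.get("reward")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(reward, bool) or not isinstance(reward, (int, float)):
        problems.append(f"reward is not numeric: {reward!r}")
    elif reward not in (0.0, 1.0):
        problems.append(f"reward outside {{0.0, 1.0}}: {reward!r}")
    return problems
```

For the four-call sequence in this scenario (reset, DESCRIBE, QUERY, ANSWER), the test would collect four observations and assert that `episode_problems` returns an empty list.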
|
|
| ### Scenario: Notebook Cell Sequence Validation |
|
|
| **Setup:** Notebook file at `notebooks/train_grpo.ipynb` |
| **Actions:** |
| 1. Parse notebook JSON |
| 2. Validate each cell type and content markers in order: |
|    - Cell with `pip install` (setup) |
|    - Cell with `SQLEnvClient` (connect) |
|    - Cell with any of the training loop keywords `grpo`, `train`, `optimizer` (train) |
|    - Cell with `eval` or `accuracy` or `held-out` (evaluate) |
|    - Cell with `matplotlib` or `plt.` (plot) |
| 3. Validate no syntax errors in code cells (compile check) |
| **Expected:** All five cell categories present in correct order; no syntax errors |
|
|
| **Run:** `uv run pytest tests/e2e/test_notebook_validation.py -v` |
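
The ordering and compile checks in this scenario could be sketched as follows. The marker keywords come from the list above; everything else (function names, treating a cell as matching when it contains any one keyword, and replacing `!` and `%` lines with `pass` before compiling) is an assumption of this sketch. The input is a list of code-cell source strings in notebook order.

```python
# (category, keywords) in the order the scenario requires.
MARKERS = [
    ("setup", ("pip install",)),
    ("connect", ("sqlenvclient",)),
    ("train", ("grpo", "train", "optimizer")),
    ("evaluate", ("eval", "accuracy", "held-out")),
    ("plot", ("matplotlib", "plt.")),
]


def match_in_order(sources):
    """Greedily match each category to the first later cell containing a keyword.

    Returns (positions, missing): positions maps category to cell index, and
    missing is the first category that could not be placed, or None.
    """
    positions = {}
    start = 0
    for name, keywords in MARKERS:
        for index in range(start, len(sources)):
            text = sources[index].lower()
            if any(keyword in text for keyword in keywords):
                positions[name] = index
                start = index + 1
                break
        else:
            return positions, name
    return positions, None


def syntax_errors(sources):
    """Compile each cell, returning (index, message) for cells that fail."""
    errors = []
    for index, source in enumerate(sources):
        lines = []
        for line in source.splitlines():
            stripped = line.lstrip()
            if stripped.startswith(("!", "%")):
                # Shell escapes and magics are not Python; keep the indentation.
                line = line[: len(line) - len(stripped)] + "pass"
            lines.append(line)
        try:
            compile("\n".join(lines), f"<cell {index}>", "exec")
        except SyntaxError as exc:
            errors.append((index, exc.msg))
    return errors
```

Because matching is greedy and each category must land after the previous one, a notebook whose plotting cell precedes its training cell is reported with `plot` as the missing category.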
|
|
| ### Scenario: README Has Competition-Ready Content |
|
|
| **Setup:** README.md at project root |
| **Actions:** |
| 1. Verify README contains project description |
| 2. Verify README contains quickstart / getting started section |
| 3. Verify README contains link to HF Space (or placeholder) |
| 4. Verify README contains link to training notebook |
| 5. Verify README contains architecture or how-it-works section |
| **Expected:** All five content sections present |
|
|
| **Run:** `uv run pytest tests/e2e/test_readme_completeness.py -v` |
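
A content check for the README might combine heading matching with simple substring tests for the two links. In the sketch below the heading patterns are guesses at reasonable wording, not names the README is known to use, and `readme_problems` is a hypothetical helper. It does not skip `#` lines inside fenced code blocks, which a stricter version would need to do.

```python
import re

HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)

# Assumed wording for section headings; adjust to the README's actual titles.
SECTION_PATTERNS = {
    "quickstart": r"quick\s*start|getting started",
    "architecture": r"architecture|how it works",
}
# Substrings that indicate the two required links.
LINK_MARKERS = {
    "HF Space link": "huggingface.co/spaces",
    "notebook link": "notebooks/train_grpo.ipynb",
}


def readme_problems(markdown):
    """Return the names of required sections or links that were not found."""
    headings = HEADING_RE.findall(markdown)
    problems = [
        f"missing section: {name}"
        for name, pattern in SECTION_PATTERNS.items()
        if not any(re.search(pattern, h, re.IGNORECASE) for h in headings)
    ]
    problems += [
        f"missing {name}"
        for name, marker in LINK_MARKERS.items()
        if marker not in markdown
    ]
    return problems
```

The "or placeholder" allowance for the HF Space link is not modeled here, since the spec does not define what a placeholder looks like.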
|
|
| --- |
|
|
| ## 5. Edge Cases Checklist |
|
|
| - [ ] Dockerfile builds on CPU-only machine (no CUDA dependencies in final image) |
| - [ ] Container memory stays under HF Spaces free tier limit (~16GB) |
| - [ ] `PORT` env variable with non-numeric value handled gracefully |
| - [ ] `PORT` env variable with value 0 or negative handled gracefully |
| - [ ] Missing `data/databases/` directory causes clear error at startup, not silent failure |
| - [ ] `openenv.yaml` with wrong `spec_version` is rejected by `openenv validate` |
| - [ ] Blog outline contains no TODO/FIXME/placeholder markers except the results section |
| - [ ] Notebook code cells have no import errors when dependencies are installed |
| - [ ] Notebook does not require GPU (runs on Colab free tier CPU) |
| - [ ] Container starts within 60 seconds (reasonable cold start) |
| - [ ] Docker image size is under 2GB (reasonable for free tier) |
| - [ ] `.dockerignore` excludes test files, `.git`, `__pycache__`, `.env` |
| - [ ] Non-root user can read database files (file permissions correct) |
| - [ ] Container handles SIGTERM gracefully (clean shutdown) |
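
The two `PORT` items in this checklist leave "handled gracefully" undefined. One reading, sketched below, is to fall back to 8000 when `PORT` is unset or empty and to fail fast with a clear message when it is set to something unusable. `resolve_port` is a hypothetical helper, not the server's actual startup code.

```python
def resolve_port(environ, default=8000):
    """Return the port from environ['PORT'], or default when unset or empty.

    Raises ValueError with a descriptive message for non-numeric or
    out-of-range values instead of letting the server fail later.
    """
    raw = environ.get("PORT")
    if raw is None or raw.strip() == "":
        return default
    try:
        port = int(raw)
    except ValueError:
        raise ValueError(f"PORT must be an integer, got {raw!r}") from None
    if not 1 <= port <= 65535:
        raise ValueError(f"PORT must be between 1 and 65535, got {port}")
    return port
```

Passing the environment in as a mapping (for example `os.environ`) keeps the function easy to test. Whether an invalid value should abort startup or fall back to the default is a policy choice the spec leaves open.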
| |
| --- |
| |
| ## 6. Evidence Requirements |
| |
| | Category | Evidence Type | Example | |
| |----------|---------------|---------| |
| | Unit tests | pytest output | `X passed` | |
| | Integration | pytest + docker logs | `Container healthy, episode complete` | |
| | Dockerfile | docker build output | Exit code 0 (the legacy builder also prints `Successfully built <hash>`) | |
| | Port override | curl output | `HTTP 200 on port 7860` | |
| | Database bundling | docker run output | `find` lists .sqlite files | |
| | Blog outline | File exists + content check | `5 sections present` | |
| | Notebook | nbformat validation | `Valid ipynb, 5+ cells in order` | |
| | README | Content grep | `All required sections present` | |
| | E2E | Full episode log | `reset -> steps -> answer, reward=1.0` | |
| | Image size | docker images output | `< 2GB` | |
| |
| --- |
| |
| ## 7. External Deployment Prerequisites and Remediation |
| |
| Use this checklist when deployment verification fails with external auth/access errors. |
| |
| ### GHCR Base Image Access (`403 Forbidden`) |
| |
| 1. Authenticate Docker to GHCR: |
|    - `echo "$GITHUB_TOKEN" | docker login ghcr.io -u <github-username> --password-stdin` |
| 2. Ensure `GITHUB_TOKEN` has the `read:packages` scope and can pull `ghcr.io/meta-pytorch/openenv-base`. |
| 3. Retry the build using an explicit lowercase tag: |
|    - `uv run openenv build -t openenv-sql-env-f007-hf-submission` |
|
|
| ### Hugging Face Push Readiness |
|
|
| 1. Authenticate the Hugging Face CLI: |
|    - `huggingface-cli login` |
| 2. Confirm the target Space repo exists and the token has write access. |
| 3. Run the push: |
|    - `uv run openenv push` |
|
|
| ### Verification Outcome Rules for External Failures |
|
|
| - If local tests pass but GHCR/HF auth fails, record the status as **partial verification** (external blocker) and include the exact remediation commands above. |
| - Do not mark the verifier result as `approved` until at least one authenticated build+push attempt is documented. |
| - Record authenticated evidence in `specs/F007-DEMO.md` under `## Live Local Proof` with separate `Authenticated Build Evidence` and `Hugging Face Push Evidence` subsections containing raw command output. |
|
|