# Verification Specification
**Feature:** F007
**Generated from:** specs/F007-VERIFICATION_INPUT.json
**Generated:** 2026-03-27
---
## 1. Unit Tests
### Dockerfile Validation
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_dockerfile_exists | Dockerfile exists at server/Dockerfile | N/A | File exists | happy |
| test_dockerfile_has_base_image_arg | BASE_IMAGE ARG is declared | Parse Dockerfile | `ARG BASE_IMAGE` present | happy |
| test_dockerfile_port_env_variable | PORT env var with fallback to 8000 | Parse Dockerfile | `ENV PORT` or CMD reads `$PORT` | happy |
| test_dockerfile_cmd_uses_port_env | CMD respects PORT env override | Set `PORT=7860` | Server binds to 7860 | happy |
| test_dockerfile_non_root_user | Container runs as non-root user | Parse Dockerfile | `USER appuser` or equivalent non-root USER directive | security |
| test_dockerfile_copies_databases | Spider databases are bundled | Parse Dockerfile | COPY instruction includes `data/databases/` | happy |
| test_dockerfile_healthcheck | Health check endpoint configured | Parse Dockerfile | HEALTHCHECK directive present | happy |
| test_dockerfile_no_dev_dependencies | No test/dev packages in final image | Inspect final stage | No pytest, ruff, etc. | edge |
**Run:** `uv run pytest tests/unit/test_dockerfile.py -v`
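
Several of the static checks in this table could be implemented by scanning the Dockerfile's instructions. The following is a minimal sketch, not the contents of `tests/unit/test_dockerfile.py`: the helper names `parse_instructions` and `check_dockerfile` are hypothetical, and line continuations are joined naively.

```python
import re


def parse_instructions(dockerfile_text: str) -> list[tuple[str, str]]:
    """Return (INSTRUCTION, arguments) pairs, skipping comments and blank lines.

    Simplification: a trailing backslash joins the next line with a space.
    """
    joined = re.sub(r"\\[ \t]*\n", " ", dockerfile_text)
    pairs = []
    for line in joined.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        args = parts[1].strip() if len(parts) > 1 else ""
        pairs.append((parts[0].upper(), args))
    return pairs


def check_dockerfile(dockerfile_text: str) -> dict[str, bool]:
    """Evaluate a few of the table's checks (hypothetical helper)."""
    instructions = parse_instructions(dockerfile_text)
    users = [args for kw, args in instructions if kw == "USER"]
    return {
        # ARG BASE_IMAGE, with or without a default value
        "base_image_arg": any(
            kw == "ARG" and args.split("=")[0] == "BASE_IMAGE"
            for kw, args in instructions
        ),
        # The last USER directive decides who the container runs as
        "non_root_user": bool(users) and users[-1].split(":")[0] not in ("root", "0"),
        "copies_databases": any(
            kw == "COPY" and "data/databases" in args for kw, args in instructions
        ),
        "healthcheck": any(kw == "HEALTHCHECK" for kw, _ in instructions),
    }
```

Each key in the returned dictionary maps to one test in the table, so a test can assert on a single entry and report which directive is missing.
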
### openenv.yaml Manifest
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_manifest_exists | openenv.yaml exists at project root | N/A | File exists | happy |
| test_manifest_spec_version | spec_version field equals 1 | Parse YAML | `spec_version: 1` | happy |
| test_manifest_name | name field is sql_env | Parse YAML | `name: sql_env` | happy |
| test_manifest_type_space | type field is 'space' | Parse YAML | `type: space` | happy |
| test_manifest_runtime_fastapi | runtime field is 'fastapi' | Parse YAML | `runtime: fastapi` | happy |
| test_manifest_app_entrypoint | app field points to valid module | Parse YAML | `app: server.app:app` | happy |
| test_manifest_port | port field is 8000 | Parse YAML | `port: 8000` | happy |
| test_manifest_no_extra_fields | No unrecognized fields | Parse YAML | Only spec_version, name, type, runtime, app, port | edge |
| test_manifest_missing_required_field | Missing field produces validation error | Remove `name` | Validation error | error |
**Run:** `uv run pytest tests/unit/test_manifest.py -v`
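
The manifest checks amount to comparing a parsed mapping against the six expected fields. A minimal sketch follows, assuming the YAML has already been loaded into a `dict` (for example with PyYAML's `yaml.safe_load`); `validate_manifest` is a hypothetical helper and is unrelated to how `openenv validate` works internally.

```python
# Expected values are taken from the table above.
EXPECTED_MANIFEST = {
    "spec_version": 1,
    "name": "sql_env",
    "type": "space",
    "runtime": "fastapi",
    "app": "server.app:app",
    "port": 8000,
}


def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for key, expected in EXPECTED_MANIFEST.items():
        if key not in manifest:
            problems.append(f"missing required field: {key}")
        elif manifest[key] != expected:
            problems.append(f"{key}: expected {expected!r}, got {manifest[key]!r}")
    # Anything outside the six known fields is reported as unrecognized.
    for key in sorted(set(manifest) - set(EXPECTED_MANIFEST)):
        problems.append(f"unrecognized field: {key}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a single test run report every deviation at once.
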
### Blog Outline (docs/blog-outline.md)
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_blog_outline_exists | Blog outline file exists | N/A | File at `docs/blog-outline.md` | happy |
| test_blog_has_hook_section | Hook section present | Parse markdown | Section heading for hook/intro | happy |
| test_blog_has_problem_section | Problem section present | Parse markdown | Section about static benchmarks | happy |
| test_blog_has_solution_section | Solution/architecture section present | Parse markdown | Section about SQLEnv architecture | happy |
| test_blog_has_results_placeholder | Results placeholder for F006 | Parse markdown | Placeholder text for training results | happy |
| test_blog_has_try_it_section | Try-it-yourself section with links | Parse markdown | Links to HF Space, notebook, GitHub | happy |
| test_blog_links_not_broken | All links in blog are valid or marked placeholder | Parse markdown | No dead internal links | edge |
| test_blog_minimum_length | Blog outline has substantive content | Parse markdown | At least 200 words | edge |
**Run:** `uv run pytest tests/unit/test_blog_outline.py -v`
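
The section and length checks could be built on two small primitives: extracting heading text and counting words. This sketch assumes the outline uses ATX (`#`) headings; the helper names and the keyword matching are illustrative only.

```python
import re


def heading_titles(markdown: str) -> list[str]:
    """Return the text of ATX headings (lines starting with 1-6 '#')."""
    pattern = re.compile(r"^#{1,6}[ \t]+(.+)$", re.MULTILINE)
    return [match.group(1).strip() for match in pattern.finditer(markdown)]


def has_section(markdown: str, *keywords: str) -> bool:
    """True if any heading contains any of the keywords (case-insensitive)."""
    titles = [title.lower() for title in heading_titles(markdown)]
    return any(keyword.lower() in title for title in titles for keyword in keywords)


def word_count(markdown: str) -> int:
    """Count whitespace-separated tokens; a rough proxy for content length."""
    return len(markdown.split())
```

For example, `has_section(text, "hook", "intro")` would back `test_blog_has_hook_section`, and `word_count(text) >= 200` would back `test_blog_minimum_length`.
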
### Training Notebook (notebooks/train_grpo.ipynb)
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_notebook_exists | Notebook file exists | N/A | File at `notebooks/train_grpo.ipynb` | happy |
| test_notebook_valid_json | Notebook is valid JSON / ipynb format | Parse file | Valid nbformat structure | happy |
| test_notebook_has_setup_cell | Setup cell with pip install | Inspect cells | Cell containing `pip install` | happy |
| test_notebook_has_connect_cell | Connect cell using SQLEnvClient | Inspect cells | Cell importing/using SQLEnvClient | happy |
| test_notebook_has_train_cell | Training cell with GRPO loop | Inspect cells | Cell with training logic | happy |
| test_notebook_has_eval_cell | Evaluation cell for held-out questions | Inspect cells | Cell with evaluation logic | happy |
| test_notebook_has_plot_cell | Plotting cell with matplotlib | Inspect cells | Cell importing matplotlib and plotting | happy |
| test_notebook_colab_compatible | Colab badge or runtime metadata | Inspect metadata | `colab` in metadata or Colab badge in first cell | happy |
| test_notebook_no_hardcoded_paths | No absolute local paths | Inspect all cells | No `/Users/`, `/home/`, `C:\\` paths | edge |
| test_notebook_cells_ordered | Setup before connect before train | Inspect cell order | Correct logical ordering | edge |
| test_notebook_empty_outputs | Notebook shipped with cleared outputs | Inspect cells | All `outputs` arrays empty | edge |
**Run:** `uv run pytest tests/unit/test_notebook.py -v`
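
Because an `.ipynb` file is plain JSON, the cleared-outputs and hardcoded-path checks need only the standard library. The sketch below is one possible approach with hypothetical helper names; note that the path pattern is deliberately crude and would also flag a URL containing `/home/`.

```python
import json
import re

# Absolute-path prefixes named in the table: /Users/, /home/, and C:\
LOCAL_PATH = re.compile(r"/Users/|/home/|[A-Za-z]:\\")


def load_notebook(path: str) -> dict:
    """Read a notebook file into a plain dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def cell_source(cell: dict) -> str:
    """The notebook format stores source as a string or a list of lines."""
    source = cell.get("source", "")
    return "".join(source) if isinstance(source, list) else source


def uncleared_cells(notebook: dict) -> list[int]:
    """Indices (among all cells) of code cells whose outputs are non-empty."""
    return [
        index
        for index, cell in enumerate(notebook.get("cells", []))
        if cell.get("cell_type") == "code" and cell.get("outputs")
    ]


def cells_with_local_paths(notebook: dict) -> list[int]:
    """Indices (among all cells) whose source mentions an absolute local path."""
    return [
        index
        for index, cell in enumerate(notebook.get("cells", []))
        if LOCAL_PATH.search(cell_source(cell))
    ]
```

Both functions return cell indices so that a failing test can point at the offending cell rather than only reporting a boolean.
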
---
## 2. Integration Tests
### Flow: Local Docker Build and Run
| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | `docker build -t sql-env:test -f server/Dockerfile .` | Build succeeds with exit code 0 | Check exit code |
| 2 | `docker run -d -p 8000:8000 --name sql-env-test sql-env:test` | Container starts | Container running (`docker ps`) |
| 3 | Wait for health check (up to 30s) | `/health` returns 200 | `curl -f http://localhost:8000/health` |
| 4 | Connect WebSocket client, call reset | Episode starts, observation returned | Valid SQLObservation JSON |
| 5 | Send DESCRIBE action via WebSocket | Column info returned | Non-empty result field |
| 6 | Send ANSWER action via WebSocket | Episode ends, reward returned | `done: true`, reward is numeric |
| 7 | Stop container | Container stops cleanly | `docker stop sql-env-test` exits 0 |
**Run:** `uv run pytest tests/integration/test_docker_local.py -v`
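
Step 3 waits up to 30 seconds for `/health` to return 200. One way to express that wait is a polling loop with a deadline; the sketch below uses only the standard library, and the function name and one-second interval are assumptions rather than part of the spec.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url: str, timeout_s: float = 30.0, interval_s: float = 1.0) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not accepting connections yet, or it returned an error status.
            pass
        time.sleep(interval_s)
    return False
```

A test would call `wait_for_health("http://localhost:8000/health")` after `docker run` and fail with the container logs attached if it returns `False`.
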
### Flow: PORT Override for HF Spaces
| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | `docker run -d -p 7860:7860 -e PORT=7860 --name sql-env-port sql-env:test` | Container starts on port 7860 | Container running |
| 2 | `curl -f http://localhost:7860/health` | Health check passes | HTTP 200 |
| 3 | `curl -f http://localhost:8000/health` | Port 8000 is not listening; no response | `curl` exits non-zero |
**Run:** `uv run pytest tests/integration/test_port_override.py -v`
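
The "port 8000 is not listening" step can also be checked from Python with a plain TCP connection attempt. This is a sketch with a hypothetical helper name; it probes the host side only, so it says nothing about what is bound inside the container.

```python
import socket


def is_port_open(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """True if a TCP connection to (host, port) succeeds within the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout_s)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```

In the flow above, the test would expect `is_port_open("localhost", 7860)` to be true and `is_port_open("localhost", 8000)` to be false.
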
### Flow: Database Bundling Verification
| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Build Docker image | Build succeeds | Exit code 0 |
| 2 | `docker run --rm sql-env:test ls /app/env/data/databases/` | Spider databases present | At least one database directory listed |
| 3 | `docker run --rm sql-env:test find /app/env/data/databases/ -name "*.sqlite"` | SQLite files present | At least one .sqlite file found |
| 4 | Start container and reset episode | Episode loads a bundled database | No "database not found" error |
**Run:** `uv run pytest tests/integration/test_db_bundling.py -v`
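
Steps 2 and 3 confirm that the files exist; a stricter variant would also confirm that each file opens as a SQLite database. The sketch below shows that idea with hypothetical helper names. It assumes it runs somewhere the database directory is readable, for example inside the container or against the local `data/databases/` tree.

```python
import sqlite3
from pathlib import Path


def find_sqlite_files(root: str) -> list[Path]:
    """Recursively list *.sqlite files under `root`, sorted for stable output."""
    return sorted(Path(root).rglob("*.sqlite"))


def is_readable_sqlite(path: Path) -> bool:
    """True if the file opens read-only and its schema table can be queried."""
    uri = path.resolve().as_uri() + "?mode=ro"
    try:
        connection = sqlite3.connect(uri, uri=True)
        try:
            connection.execute("SELECT name FROM sqlite_master LIMIT 1").fetchall()
        finally:
            connection.close()
        return True
    except sqlite3.Error:
        return False
```

Opening with `mode=ro` keeps the check from creating or modifying files, which matters when the container runs as a non-root user with read-only access.
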
---
## 3. API Tests
No new API endpoints are introduced by F007. The existing `/health`, WebSocket, and REST endpoints from prior features are tested via integration tests above.
---
## 4. E2E Tests
### Scenario: Judge Experience -- Visit HF Space and Play Episode
**Setup:** Docker container running (locally simulating HF Space)
**Actions:**
1. Open health endpoint URL -- confirm service is up
2. Connect via WebSocket
3. Call `reset` -- receive initial observation with question and schema
4. Call `step` with DESCRIBE action -- receive column details
5. Call `step` with QUERY action -- receive query results
6. Call `step` with ANSWER action -- receive terminal observation with reward
**Expected:** Full episode completes without errors; reward is 0.0 or 1.0
**Run:** `uv run pytest tests/e2e/test_judge_experience.py -v`
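
The expected outcome of the final step can be captured as a small validator for the terminal observation. This sketch assumes the observation arrives as a JSON object with top-level `done` and `reward` fields, as the integration flow suggests; the actual `SQLObservation` schema may differ, and `check_terminal_observation` is a hypothetical name.

```python
def check_terminal_observation(observation: dict) -> list[str]:
    """Return problems with a terminal observation; an empty list means it passes."""
    problems = []
    if observation.get("done") is not True:
        problems.append("episode did not terminate (done is not true)")
    reward = observation.get("reward")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(reward, bool) or not isinstance(reward, (int, float)):
        problems.append(f"reward is not numeric: {reward!r}")
    elif reward not in (0.0, 1.0):
        problems.append(f"reward must be 0.0 or 1.0, got {reward!r}")
    return problems
```
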
### Scenario: Notebook Cell Sequence Validation
**Setup:** Notebook file at `notebooks/train_grpo.ipynb`
**Actions:**
1. Parse notebook JSON
2. Validate each cell type and content markers in order:
   - Cell with `pip install` (setup)
   - Cell with `SQLEnvClient` (connect)
   - Cell with training loop keywords: `grpo`, `train`, `optimizer` (train)
   - Cell with `eval` or `accuracy` or `held-out` (evaluate)
   - Cell with `matplotlib` or `plt.` (plot)
3. Validate no syntax errors in code cells (compile check)
**Expected:** All five cell categories present in correct order; no syntax errors
**Run:** `uv run pytest tests/e2e/test_notebook_validation.py -v`
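
The ordering and compile checks in this scenario can be sketched as an ordered marker search followed by a `compile` pass over each code cell. The marker patterns come from the list above; the helper names, the case-insensitive matching, and the handling of IPython `!` and `%` lines are assumptions.

```python
import re

# Marker patterns from the scenario, in the required order.
STAGES = [
    ("setup", re.compile(r"pip install")),
    ("connect", re.compile(r"SQLEnvClient")),
    ("train", re.compile(r"grpo|train|optimizer", re.IGNORECASE)),
    ("evaluate", re.compile(r"eval|accuracy|held-out", re.IGNORECASE)),
    ("plot", re.compile(r"matplotlib|plt\.")),
]


def stages_in_order(sources: list[str]) -> bool:
    """True if a matching cell for each stage can be found after the previous match."""
    position = 0
    for _, pattern in STAGES:
        for index in range(position, len(sources)):
            if pattern.search(sources[index]):
                position = index + 1
                break
        else:
            return False  # No cell at or after `position` matched this stage.
    return True


def syntax_errors(sources: list[str]) -> list[int]:
    """Indices of code cells that fail to compile as plain Python.

    Lines starting with '!' or '%' (IPython shell and magic syntax) are
    replaced by `pass` at the same indentation before compiling.
    """
    failures = []
    for index, source in enumerate(sources):
        lines = []
        for line in source.splitlines():
            stripped = line.lstrip()
            if stripped.startswith(("!", "%")):
                line = line[: len(line) - len(stripped)] + "pass"
            lines.append(line)
        try:
            compile("\n".join(lines), f"<cell {index}>", "exec")
        except SyntaxError:
            failures.append(index)
    return failures
```

`compile` only parses the code, so the check needs none of the notebook's dependencies installed and never executes a cell.
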
### Scenario: README Has Competition-Ready Content
**Setup:** README.md at project root
**Actions:**
1. Verify README contains project description
2. Verify README contains quickstart / getting started section
3. Verify README contains link to HF Space (or placeholder)
4. Verify README contains link to training notebook
5. Verify README contains architecture or how-it-works section
**Expected:** All five content sections present
**Run:** `uv run pytest tests/e2e/test_readme_completeness.py -v`
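
The two link checks (actions 3 and 4) could be implemented by extracting inline markdown link targets. This sketch is illustrative: the helper names are hypothetical, reference-style links are ignored, and the "or placeholder" allowance for the HF Space link is not handled.

```python
import re

# Inline links of the form [text](target); reference-style links are ignored.
LINK = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")


def link_targets(markdown: str) -> list[str]:
    """Return the URL or path of every inline markdown link."""
    return [match.group(2) for match in LINK.finditer(markdown)]


def readme_link_report(markdown: str) -> dict[str, bool]:
    """Report whether the README links to a HF Space and to a notebook."""
    targets = link_targets(markdown)
    return {
        "hf_space": any("huggingface.co/spaces/" in target for target in targets),
        "notebook": any(target.endswith(".ipynb") for target in targets),
    }
```
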
---
## 5. Edge Cases Checklist
- [ ] Dockerfile builds on CPU-only machine (no CUDA dependencies in final image)
- [ ] Container memory stays under HF Spaces free tier limit (~16GB)
- [ ] PORT env variable with non-numeric value handled gracefully
- [ ] PORT env variable with value 0 or negative handled gracefully
- [ ] Missing data/databases/ directory causes clear error at startup, not silent failure
- [ ] openenv.yaml with wrong spec_version is rejected by openenv validate
- [ ] Blog outline contains no TODO/FIXME/placeholder markers except in the results section
- [ ] Notebook code cells have no import errors when dependencies are installed
- [ ] Notebook does not require GPU (runs on Colab free tier CPU)
- [ ] Container starts within 60 seconds (reasonable cold start)
- [ ] Docker image size is under 2GB (reasonable for free tier)
- [ ] .dockerignore excludes test files, .git, __pycache__, .env
- [ ] Non-root user can read database files (file permissions correct)
- [ ] Container handles SIGTERM gracefully (clean shutdown)
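
For the two `PORT` items in this checklist, one reading of "handled gracefully" is to fall back to 8000 when the variable is unset and to fail fast with a clear message when it is invalid. The sketch below shows that interpretation; it is an assumption about the desired behavior, and `resolve_port` is a hypothetical helper rather than code from the server.

```python
from typing import Optional

DEFAULT_PORT = 8000


def resolve_port(raw: Optional[str], default: int = DEFAULT_PORT) -> int:
    """Parse a PORT value, falling back to `default` when unset or empty.

    Raises ValueError with a clear message for non-numeric or out-of-range input.
    """
    if raw is None or raw.strip() == "":
        return default
    try:
        port = int(raw)
    except ValueError:
        raise ValueError(f"PORT must be an integer, got {raw!r}") from None
    if not 1 <= port <= 65535:
        raise ValueError(f"PORT must be between 1 and 65535, got {port}")
    return port
```

A startup script would call `resolve_port(os.environ.get("PORT"))` once and pass the result to the server, so a bad value stops the container with a readable error instead of an obscure bind failure.
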
---
## 6. Evidence Requirements
| Category | Evidence Type | Example |
|----------|---------------|---------|
| Unit tests | pytest output | `X passed` |
| Integration | pytest + docker logs | `Container healthy, episode complete` |
| Dockerfile | docker build output | `Successfully built <hash>` |
| Port override | curl output | `HTTP 200 on port 7860` |
| Database bundling | `docker run` output | `ls` and `find` show .sqlite files |
| Blog outline | File exists + content check | `5 sections present` |
| Notebook | nbformat validation | `Valid ipynb, 5+ cells in order` |
| README | Content grep | `All required sections present` |
| E2E | Full episode log | `reset -> steps -> answer, reward=1.0` |
| Image size | docker images output | `< 2GB` |
---
## 7. External Deployment Prerequisites and Remediation
Use this checklist when deployment verification fails with external auth/access errors.
### GHCR Base Image Access (`403 Forbidden`)
1. Authenticate Docker to GHCR:
   - `echo "$GITHUB_TOKEN" | docker login ghcr.io -u <github-username> --password-stdin`
2. Ensure `GITHUB_TOKEN` has package read access (the `read:packages` scope on a classic personal access token) for `ghcr.io/meta-pytorch/openenv-base`.
3. Retry the build using an explicit lowercase tag:
   - `uv run openenv build -t openenv-sql-env-f007-hf-submission`
### Hugging Face Push Readiness
1. Authenticate Hugging Face CLI:
   - `huggingface-cli login`
2. Confirm target Space repo exists and token has write access.
3. Run push:
   - `uv run openenv push`
### Verification Outcome Rules for External Failures
- If local tests pass but GHCR/HF auth fails, record status as **partial verification** (external blocker) and include exact remediation commands above.
- Do not mark verifier result as `approved` until at least one authenticated build+push attempt is documented.
- Record authenticated evidence in `specs/F007-DEMO.md` under `## Live Local Proof` with separate `Authenticated Build Evidence` and `Hugging Face Push Evidence` subsections containing raw command output.