Space: developmentseed/gazet

Daniel Wiesmann committed on Apr 20

Commit af5cded (unverified) · 2 parents: 5e72207, 789bf58

Merge pull request #1 from developmentseed/slm-qwen3.5


Files changed (32)

.dockerignore +12 -0
.gitignore +22 -2
Dockerfile +20 -0
README.md +4 -4
dataset/README.md +177 -0
dataset/__init__.py +1 -0
dataset/config.yaml +63 -0
dataset/modal_app.py +272 -0
dataset/scripts/__init__.py +1 -0
dataset/scripts/build_inventory.py +124 -0
dataset/scripts/build_relations.py +557 -0
dataset/scripts/cli.py +377 -0
dataset/scripts/export_training_data.py +372 -0
dataset/scripts/generate_samples.py +1560 -0
dataset/scripts/sql_templates.py +1651 -0
dataset/scripts/validate_dataset.py +309 -0
docker-compose.yml +41 -0
finetune/README.md +291 -0
finetune/__init__.py +1 -0
finetune/check_token_lengths.py +155 -0
finetune/eval_cli.py +248 -0
finetune/eval_demo.py +351 -0
finetune/train_modal_qwen35.py +363 -0
gazet_demo.py +14 -2
pyproject.toml +11 -1
src/gazet/api.py +33 -16
src/gazet/config.py +30 -15
src/gazet/export.py +64 -7
src/gazet/lm.py +230 -7
src/gazet/search.py +24 -28
src/gazet/sql.py +104 -8
uv.lock +0 -0

.dockerignore ADDED

	@@ -0,0 +1,12 @@

+.git
+.venv
+__pycache__
+data/
+finetune/models/
+dataset/output/
+dataset/intermediate/
+results/
+*.gguf
+*.safetensors
+.windsurf/
+.claude/

.gitignore CHANGED

@@ -133,6 +133,26 @@ dmypy.json
 # Pyre type checker
 .pyre/
+# Dataset
 data/
-output/
+output/
+*.parquet
+*.jsonl
+# Eval results
+results/
+# IDE
+.windsurf/
+# Local notes
+notes.md
+# Model
+models/
+*.gguf
+*.safetensors
+*.bin
+*.pt
+*.pth
+*.ckpt

Dockerfile ADDED

	@@ -0,0 +1,20 @@

+FROM python:3.13-slim
+COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
+WORKDIR /app
+# Install dependencies first (cache layer)
+COPY pyproject.toml uv.lock ./
+RUN uv sync --frozen --no-install-project --extra demo
+# Copy application code
+COPY src/ src/
+COPY gazet_demo.py .
+# Install the project itself
+RUN uv sync --frozen --extra demo
+ENV PATH="/app/.venv/bin:$PATH"
+EXPOSE 8000 8501

README.md CHANGED

@@ -26,14 +26,14 @@ uv sync --extra dev --extra demo
 Example for downloading overture
 ```bash
-aws s3 sync
-s3 sync s3://overturemaps-us-west-2/release/2026-02-18.0/theme=divisions/type=division_area/ data/overture/divisions_area
+aws s3 sync s3://overturemaps-us-west-2/release/2026-02-18.0/theme=divisions/type=division_area/ data/overture/divisions_area
 ```
 Example for running conversion script for natural earth
 ```bash
-python -m ingest.convert_natural_earth ~/Downloads/10m_physical
+unzip ~/Downloads/10m_physical.zip -d data/natural_earth
+python -m ingest.convert_natural_earth data/natural_earth
 ```
 ### Based on ollama
@@ -61,7 +61,7 @@ uv run streamlit run gazet_demo.py   # demo UI
 | Module | Contents |
 | --- | --- |
 | `config.py` | data paths, model name, SQL schema description |
-| `types.py` | `SUBTYPES`, `COUNTRIES`, `Place`, `PlacesResult` |
+| `schemas.py` | `SUBTYPES`, `COUNTRIES`, `Place`, `PlacesResult` |
 | `lm.py` | DSPy signatures + LM init (`extract`, `write_sql`) |
 | `search.py` | fuzzy search against `divisions_area` / `natural_earth` |
 | `sql.py` | code-act SQL generation loop |

dataset/README.md ADDED

	@@ -0,0 +1,177 @@

+# Gazet Dataset Generation
+Generates synthetic training data for fine-tuning the geocoding model.
+Two datasets come out of one pipeline run:
+- **SQL generation** — `(question + candidates) -> DuckDB SQL`
+- **Place extraction** — `question -> place names JSON`
+Both tasks export in **conversation format** (`messages` list of
+system/user/assistant turns), ready for chat-template fine-tuning.
+---
+## Prerequisites
+```bash
+uv sync
+```
+You need the Overture and Natural Earth parquet files under `data/` locally,
+or on a Modal volume if running in the cloud.
+---
+## Option A — Run locally (small datasets, development)
+Use this when you want to iterate quickly on a laptop with a subset of countries.
+**Step 1 — Pick a run name and countries in `config.yaml`**
+```yaml
+run_name: "v1"   # change this every time you generate fresh data
+countries:
+  - IN   # India
+  - BR   # Brazil
+  - US   # United States
+  # add more, or use "- all" for every country (slow locally)
+```
+**Step 2 — Run the full pipeline**
+```bash
+gazet-dataset full-pipeline --config dataset/config.yaml
+```
+That's it. It runs all four steps in order and puts the results in
+`dataset/output/runs/{run_name}/` (`dataset/output/runs/v1/` for the config above).
+If you want to run steps individually (e.g. to re-export without regenerating):
+```bash
+gazet-dataset build-relations  --config dataset/config.yaml  # ~5 min
+gazet-dataset generate-samples --config dataset/config.yaml  # ~15 min
+gazet-dataset validate         --config dataset/config.yaml  # ~5 min
+gazet-dataset export           --config dataset/config.yaml  # <1 min
+```
+---
+## Option B — Run on Modal (large datasets, production)
+Use this when you need 10 K+ samples or want to use all countries. Modal
+distributes generation across many containers in parallel.
+**Step 1 — One-time setup**
+```bash
+modal setup   # authenticate with Modal (one time)
+gazet-dataset modal-upload --config dataset/config.yaml   # upload parquet data to Modal volume
+```
+**Step 2 — Set run name and targets in `config.yaml`**
+```yaml
+run_name: "v1"
+countries:
+  - all
+sample_targets:
+  adjacency:     1250
+  containment:   1250
+  # ... see config.yaml for all families
+```
+**Step 3 — Run on Modal**
+```bash
+gazet-dataset modal-generate --config dataset/config.yaml
+```
+This builds relations, generates samples, validates, and exports — same as
+`full-pipeline` but distributed across 100 cloud containers.
+If relations are already built from a previous run (same countries, same
+template version), skip rebuilding them:
+```bash
+gazet-dataset modal-generate --config dataset/config.yaml --skip-relations
+```
+---
+## Output
+After running, your training files are at:
+```
+dataset/output/runs/{run_name}/
+  sql/
+    train.jsonl    <- fine-tune the SQL generation model
+    val.jsonl
+    test.jsonl
+  places/
+    train.jsonl    <- fine-tune the place extraction model
+    val.jsonl
+    test.jsonl
+  stats.json       <- sample counts by family
+```
+Each JSONL row is a conversation-format dict:
+```json
+{
+  "messages": [
+    {"role": "system",    "content": "..."},
+    {"role": "user",      "content": "..."},
+    {"role": "assistant", "content": "..."}
+  ]
+}
+```
+**SQL task**: the system prompt includes the full two-table schema inside
+`<SCHEMA>` tags. The user prompt contains only `<CANDIDATES>` CSV and
+`<USER_QUERY>`. The assistant response is pretty-printed SQL (via `sqlparse`).
+All parquet paths are symbolic (`divisions_area` / `natural_earth`), never
+runtime-specific.
+**Places task**: the system prompt includes output format, extraction rules,
+and the full list of Overture subtypes. The assistant response is a JSON
+object with a `places` array.
+---
+## When to regenerate from scratch
+Change `run_name` and regenerate from scratch whenever you:
+- Change any SQL templates (`sql_templates.py`)
+- Add new template families
+- Change the candidate format or count
+- Change the system/user prompt structure or content
+- Change the export format
+For local runs, the default is a clean run. For Modal, `modal-generate` appends
+by default; pass `--fresh` to overwrite existing samples.
+---
+## Troubleshooting
+**Very few samples generated for a family**
+The generation loop makes `retry_multiplier × target` attempts per family and discards SQL that
+returns empty results. Some families (e.g. `multi_adjacency`, `chained`) have
+a lower success rate. Increase `sample_targets` for those families or increase
+`retry_multiplier` in `config.yaml`.
+**Relations step is slow**
+Normal for `countries: [all]` — it's a spatial self-join over millions of
+features. Use a country subset for development. Relations only need to be
+rebuilt when you add countries or change template families.
+**Validate step drops many samples**
+The validate step re-executes every SQL query and discards ones that return
+empty results. This is expected; check `dataset/output/runs/{run_name}/stats.json`
+for per-family counts after export.
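
Reviewer note (not part of the commit): the conversation format described in the README above can be sanity-checked with a short script before fine-tuning. This is a minimal sketch. The `check_row` helper, the sample row, and the strict system/user/assistant order are assumptions taken only from the JSON example in the README; the actual export script may produce other shapes.

```python
import json

# Role order assumed from the README's example row.
EXPECTED_ROLES = ["system", "user", "assistant"]


def check_row(line: str) -> dict:
    """Parse one JSONL line and verify it holds the three expected turns."""
    row = json.loads(line)
    messages = row["messages"]
    roles = [m["role"] for m in messages]
    if roles != EXPECTED_ROLES:
        raise ValueError(f"unexpected roles: {roles}")
    # Every turn should carry non-empty string content.
    if not all(isinstance(m["content"], str) and m["content"] for m in messages):
        raise ValueError("every turn needs non-empty string content")
    return row


# Hypothetical row for illustration; real rows would be read line by line
# from dataset/output/runs/{run_name}/sql/train.jsonl.
sample_line = json.dumps({
    "messages": [
        {"role": "system", "content": "<SCHEMA>placeholder</SCHEMA>"},
        {"role": "user", "content": "<USER_QUERY>placeholder</USER_QUERY>"},
        {"role": "assistant", "content": "SELECT 1"},
    ]
})
row = check_row(sample_line)
```

Running the same check over each line of `train.jsonl`, `val.jsonl`, and `test.jsonl` would catch truncated or malformed rows early.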

dataset/__init__.py ADDED

	@@ -0,0 +1 @@


+"""Synthetic dataset generation package."""

dataset/config.yaml ADDED

	@@ -0,0 +1,63 @@

+# Dataset Generation Configuration
+# This config controls which countries to process and how many samples to generate
+# Countries to include in relation building
+# Use ISO 3166-1 alpha-2 codes, or "all" to include every country
+countries:
+  - all
+  # Or specify a subset:
+  # - IN   # India
+  # - PK   # Pakistan
+  # - EC   # Ecuador
+  # - BE   # Belgium
+  # - KE   # Kenya
+# Sample generation targets per family
+# Relation limits are auto-calculated from these targets
+sample_targets:
+  direct_lookup:       500
+  adjacency:           750
+  multi_adjacency:     300
+  containment:         750
+  intersection:        500
+  buffer:              500
+  chained:             750   # coastal / landlocked variants
+  difference:          300
+  border_corridor:     300
+  set_operations:      500
+  partial_selection:   500
+  aggregation:         500
+  window_function:     300
+  attribute_filter:    300
+# Generation settings
+generation:
+  max_workers: 8           # Number of parallel workers
+  retry_multiplier: 2      # Generate 2x samples to account for failures
+  append_mode: false       # Set false for clean regeneration after template/format changes
+# Auto-scaling configuration
+# Relation limits are automatically calculated: target * retry_multiplier * safety_factor
+auto_scaling:
+  safety_factor: 1.5       # Extra buffer to ensure enough unique pairs
+  # Manual overrides (optional) - uncomment to override auto-calculated limits
+  manual_limits: {}
+    # adjacency: 10000     # Uncomment to manually set
+    # containment: 2000
+    # intersection: 1000
+    # cross_source: 500
+# Modal configuration for distributed generation
+modal:
+  volume_name: "gazet-data"         # Modal Volume for parquet data
+  app_name: "gazet-dataset"         # Modal app name
+  num_containers: 100               # Number of parallel containers for sample generation
+  container_cpu: 2                  # CPUs per container
+  container_memory: 4096            # Memory (MB) per container
+  timeout: 3600                     # Per-container timeout in seconds
+# Run name — used to version exported splits so re-runs never overwrite previous data.
+# Change this whenever you regenerate from scratch (e.g. after template changes).
+# Exported files land in: output/runs/{run_name}/
+run_name: "v1"
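
Reviewer note (not part of the commit): the `auto_scaling` comment in the config above gives the relation limit as `target * retry_multiplier * safety_factor`. The sketch below works that formula through for a few families. The `relation_limits` function is hypothetical; the real logic lives in `calculate_relation_limits` in `dataset/scripts/cli.py`, which this diff excerpt does not show. Rounding up and letting `manual_limits` take precedence are assumptions.

```python
import math


def relation_limits(targets, retry_multiplier, safety_factor, manual_limits=None):
    """Hypothetical helper: per-family relation limit from the config formula."""
    auto = {
        family: math.ceil(target * retry_multiplier * safety_factor)
        for family, target in targets.items()
    }
    # Assumption: manual overrides, when present, replace the computed value.
    return {**auto, **(manual_limits or {})}


# Values copied from sample_targets, generation, and auto_scaling in config.yaml.
targets = {"adjacency": 750, "containment": 750, "intersection": 500}
limits = relation_limits(targets, retry_multiplier=2, safety_factor=1.5)
# adjacency: 750 * 2 * 1.5 = 2250; intersection: 500 * 2 * 1.5 = 1500
```

With these settings each relation table needs roughly three times as many candidate pairs as the sample target for that family.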

dataset/modal_app.py ADDED

	@@ -0,0 +1,272 @@

+"""Modal app for distributed dataset generation."""
+import modal
+app = modal.App("gazet-dataset")
+VOLUME_MOUNT = "/data"
+INTERMEDIATE_MOUNT = "/intermediate"
+volume = modal.Volume.from_name("gazet-data", create_if_missing=True)
+intermediate_volume = modal.Volume.from_name(
+    "gazet-intermediate", create_if_missing=True
+)
+image = (
+    modal.Image.debian_slim(python_version="3.12")
+    .pip_install(
+        "duckdb>=1.4.4",
+        "dspy>=3.1.3",
+        "fastapi>=0.100",
+        "pandas>=2.2",
+        "pydantic>=2.0",
+        "pyarrow>=17.0.0",
+        "pyyaml>=6.0",
+    )
+    .env({"GAZET_DATA_DIR": VOLUME_MOUNT, "PYTHONPATH": "/root"})
+    .add_local_dir("src/gazet", "/root/gazet")
+    .add_local_dir("dataset", "/root/dataset")
+)
+@app.function(
+    image=image,
+    volumes={VOLUME_MOUNT: volume, INTERMEDIATE_MOUNT: intermediate_volume},
+    timeout=300,
+    cpu=2,
+    memory=4096,
+)
+def build_inventory_remote():
+    """Build entity inventory from parquet files on the volume."""
+    from pathlib import Path
+    from dataset.scripts.build_inventory import build_inventory_to_dir
+    result = build_inventory_to_dir(Path(INTERMEDIATE_MOUNT))
+    intermediate_volume.commit()
+    return result
+@app.function(
+    image=image,
+    volumes={VOLUME_MOUNT: volume, INTERMEDIATE_MOUNT: intermediate_volume},
+    timeout=3600,
+    cpu=4,
+    memory=32768,
+)
+def build_relation_remote(relation_type: str, countries: list, limit: int):
+    """Compute one relation type and save to intermediate volume."""
+    from pathlib import Path
+    from dataset.scripts.build_relations import compute_single_relation
+    count = compute_single_relation(
+        relation_type=relation_type,
+        countries=countries,
+        limit=limit,
+        output_dir=Path(INTERMEDIATE_MOUNT),
+    )
+    intermediate_volume.commit()
+    return {"relation_type": relation_type, "count": count}
+@app.function(
+    image=image,
+    volumes={VOLUME_MOUNT: volume, INTERMEDIATE_MOUNT: intermediate_volume},
+    timeout=3600,
+    cpu=2,
+    memory=4096,
+)
+def generate_batch_remote(work_items: list) -> list:
+    """Process a batch of work items on a Modal container."""
+    from dataset.scripts.generate_samples import generate_batch_core
+    results = generate_batch_core(
+        work_items=work_items,
+        intermediate_dir=INTERMEDIATE_MOUNT,
+    )
+    print(f"Batch complete: {sum(1 for r in results if r['sample'])} success / "
+          f"{sum(1 for r in results if not r['sample'])} failed out of {len(work_items)}")
+    return results
+@app.local_entrypoint()
+def run_pipeline(
+    config_path: str = "dataset/config.yaml",
+    num_containers: int = 0,
+    skip_inventory: bool = False,
+    skip_relations: bool = False,
+    fresh: bool = False,
+):
+    """Run the full distributed pipeline."""
+    import yaml
+    from pathlib import Path
+    config = yaml.safe_load(Path(config_path).read_text())
+    countries = config["countries"]
+    sample_targets = config["sample_targets"]
+    modal_cfg = config.get("modal", {})
+    n_containers = num_containers or modal_cfg.get("num_containers", 50)
+    retry_multiplier = config["generation"]["retry_multiplier"]
+    print(f"Countries: {countries}")
+    print(f"Sample targets: {sample_targets}")
+    print(f"Containers: {n_containers}")
+    if not skip_inventory:
+        print("Building inventory...")
+        result = build_inventory_remote.remote()
+        print(f"  Inventory: {result}")
+    if not skip_relations:
+        print("Building relations...")
+        from dataset.scripts.cli import calculate_relation_limits
+        relation_needs = calculate_relation_limits(config)
+        handles = []
+        for rel_type, limit in relation_needs.items():
+            h = build_relation_remote.spawn(rel_type, countries, max(limit, 500))
+            handles.append((rel_type, h))
+        for rel_type, h in handles:
+            result = h.get()
+            print(f"  {rel_type}: {result['count']} pairs")
+    print(f"Generating samples across {n_containers} containers...")
+    import json
+    from dataset.scripts.generate_samples import prepare_work_items
+    output_dir = Path("dataset/output")
+    output_dir.mkdir(exist_ok=True, parents=True)
+    output_file = output_dir / "dataset_raw.jsonl"
+    existing_samples = []
+    sample_counter = 1
+    if not fresh and output_file.exists():
+        with open(output_file) as f:
+            for line in f:
+                if line.strip():
+                    existing_samples.append(json.loads(line))
+        if existing_samples:
+            max_id = max(
+                int(s["id"].split("_")[1])
+                for s in existing_samples
+                if s["id"].startswith("sample_")
+            )
+            sample_counter = max_id + 1
+            print(f"  Appending to {len(existing_samples)} existing samples")
+    work_items = prepare_work_items(
+        target_counts=sample_targets,
+        retry_multiplier=retry_multiplier,
+        start_counter=sample_counter,
+        intermediate_dir_str="",
+    )
+    total_work = len(work_items)
+    print(f"  Total work items: {total_work}")
+    batch_size = max(1, (total_work + n_containers - 1) // n_containers)
+    batches = [
+        work_items[i : i + batch_size]
+        for i in range(0, total_work, batch_size)
+    ]
+    print(f"  Batches: {len(batches)} x ~{batch_size} items")
+    new_sample_count = 0
+    failed_batches = 0
+    family_progress = {}
+    write_mode = "w" if fresh else "a"
+    fout = open(output_file, write_mode)
+    try:
+        for batch_results in generate_batch_remote.map(
+            batches, return_exceptions=True
+        ):
+            if isinstance(batch_results, Exception):
+                failed_batches += 1
+                print(f"  Batch failed: {batch_results}")
+                continue
+            batch_samples = []
+            for r in batch_results:
+                fam = r["family"]
+                if fam not in family_progress:
+                    family_progress[fam] = {"success": 0, "failed": 0}
+                if r["sample"]:
+                    batch_samples.append(r["sample"])
+                    family_progress[fam]["success"] += 1
+                else:
+                    family_progress[fam]["failed"] += 1
+            for sample in batch_samples:
+                fout.write(json.dumps(sample) + "\n")
+            fout.flush()
+            new_sample_count += len(batch_samples)
+            done = sum(p["success"] + p["failed"] for p in family_progress.values())
+            print(f"  Progress: {done}/{total_work} items | {new_sample_count} saved | {failed_batches} batch errors")
+    except Exception as e:
+        print(f"  Map interrupted: {e}")
+    finally:
+        fout.close()
+    print(f"\nResults by family:")
+    for fam in sorted(family_progress.keys()):
+        s = family_progress[fam]["success"]
+        f = family_progress[fam]["failed"]
+        total = s + f
+        rate = (s / total * 100) if total > 0 else 0
+        target = sample_targets.get(fam, 0)
+        print(
+            f"  {fam:20s}: {s:4d} success / {f:4d} failed "
+            f"({rate:5.1f}%, target: {target})"
+        )
+    total_samples = len(existing_samples) + new_sample_count
+    status = "COMPLETE" if failed_batches == 0 else "PARTIAL"
+    print(f"\nGeneration {status}: {new_sample_count} new, {total_samples} total")
+    if failed_batches:
+        print(f"  Failed batches: {failed_batches}/{len(batches)}")
+    print(f"  Output: {output_file}")
+@app.local_entrypoint()
+def upload_data(data_dir: str = "data"):
+    """Upload local data directory to the Modal volume."""
+    import os
+    from pathlib import Path
+    data_path = Path(data_dir)
+    if not data_path.exists():
+        print(f"Error: {data_path} does not exist")
+        return
+    print(f"Uploading {data_path} to Modal volume 'gazet-data'...")
+    file_count = 0
+    total_size = 0
+    for root, dirs, files in os.walk(data_path):
+        for f in files:
+            local_path = os.path.join(root, f)
+            # Relative path within data_dir becomes the volume path
+            rel = os.path.relpath(local_path, data_path)
+            size = os.path.getsize(local_path)
+            total_size += size
+            file_count += 1
+            print(f"  {rel} ({size / (1024*1024):.1f} MB)")
+    print(f"  {file_count} files, {total_size / (1024*1024):.1f} MB")
+    vol = modal.Volume.from_name("gazet-data", create_if_missing=True)
+    with vol.batch_upload() as batch:
+        batch.put_directory(str(data_path), "/")
+    print("Upload complete")
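
Reviewer note (not part of the commit): `run_pipeline` above sizes batches with a ceiling division and then slices the work list. The standalone sketch below reproduces that arithmetic so the resulting batch count is easy to see. The `split_into_batches` name and the item count of 1050 are illustrative assumptions, not values from the commit.

```python
def split_into_batches(work_items: list, n_containers: int) -> list:
    """Slice work_items into chunks using the same formula as run_pipeline."""
    total = len(work_items)
    # Ceiling division, never smaller than 1 so range() gets a valid step.
    batch_size = max(1, (total + n_containers - 1) // n_containers)
    return [work_items[i : i + batch_size] for i in range(0, total, batch_size)]


# Illustrative numbers: 1050 work items spread over 100 containers.
batches = split_into_batches(list(range(1050)), 100)
# batch_size is 11, so there are 96 batches and the last one holds 5 items.
```

Because the batch size is rounded up, the number of batches can be lower than `num_containers`, which is why the pipeline logs `len(batches)` instead of the container count.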

dataset/scripts/__init__.py ADDED

	@@ -0,0 +1 @@


+"""Dataset generation scripts package."""

dataset/scripts/build_inventory.py ADDED

	@@ -0,0 +1,124 @@

+"""
+Build entity inventory from divisions_area and natural_earth parquet files.
+This script creates compact inventory tables containing only the fields needed
+for candidate sampling and distractor generation.
+Output:
+- intermediate/divisions_area_inventory.parquet
+- intermediate/natural_earth_inventory.parquet
+"""
+import duckdb
+import pandas as pd
+from pathlib import Path
+from gazet.config import DIVISIONS_AREA_PATH, NATURAL_EARTH_PATH
+def build_divisions_area_inventory(con: duckdb.DuckDBPyConnection) -> pd.DataFrame:
+    """Extract compact inventory from divisions_area."""
+    query = """
+    SELECT
+        'divisions_area' AS source,
+        id,
+        names."primary" AS name,
+        subtype,
+        country,
+        region,
+        admin_level,
+        class,
+        is_land,
+        is_territorial,
+        division_id,
+        ST_Area(geometry) AS area_sq_deg,
+        ST_XMin(geometry) AS xmin,
+        ST_YMin(geometry) AS ymin,
+        ST_XMax(geometry) AS xmax,
+        ST_YMax(geometry) AS ymax
+    FROM read_parquet(?)
+    WHERE names."primary" IS NOT NULL
+      AND trim(names."primary") != ''
+    """
+    df = con.execute(query, [DIVISIONS_AREA_PATH]).fetchdf()
+    print(f"Divisions area inventory: {len(df)} entities")
+    print(f"Subtypes: {df['subtype'].value_counts().to_dict()}")
+    print(f"Countries: {df['country'].nunique()} unique")
+    return df
+def build_natural_earth_inventory(con: duckdb.DuckDBPyConnection) -> pd.DataFrame:
+    """Extract compact inventory from natural_earth."""
+    query = """
+    SELECT
+        'natural_earth' AS source,
+        id,
+        names."primary" AS name,
+        subtype,
+        country,
+        region,
+        admin_level,
+        class,
+        is_land,
+        is_territorial,
+        ST_Area(geometry) AS area_sq_deg,
+        ST_XMin(geometry) AS xmin,
+        ST_YMin(geometry) AS ymin,
+        ST_XMax(geometry) AS xmax,
+        ST_YMax(geometry) AS ymax
+    FROM read_parquet(?)
+    WHERE names."primary" IS NOT NULL
+      AND trim(names."primary") != ''
+    """
+    df = con.execute(query, [NATURAL_EARTH_PATH]).fetchdf()
+    print(f"\nNatural earth inventory: {len(df)} entities")
+    print(f"Subtypes: {df['subtype'].value_counts().to_dict()}")
+    return df
+def build_inventory_to_dir(output_dir: Path) -> dict:
+    """Build and save all inventory tables to output_dir.
+    Reusable entry point for both local CLI and Modal.
+    Returns:
+        Dict with counts: {"divisions_area": int, "natural_earth": int}
+    """
+    output_dir.mkdir(exist_ok=True, parents=True)
+    con = duckdb.connect()
+    con.execute("INSTALL spatial")
+    con.execute("LOAD spatial")
+    print("Building divisions_area inventory...")
+    divisions_df = build_divisions_area_inventory(con)
+    divisions_path = output_dir / "divisions_area_inventory.parquet"
+    divisions_df.to_parquet(divisions_path, index=False)
+    print(f"Saved to {divisions_path}")
+    print("\nBuilding natural_earth inventory...")
+    natural_earth_df = build_natural_earth_inventory(con)
+    natural_earth_path = output_dir / "natural_earth_inventory.parquet"
+    natural_earth_df.to_parquet(natural_earth_path, index=False)
+    print(f"Saved to {natural_earth_path}")
+    con.close()
+    total = len(divisions_df) + len(natural_earth_df)
+    print(f"\nInventory build complete")
+    print(f"  Total entities: {total}")
+    return {"divisions_area": len(divisions_df), "natural_earth": len(natural_earth_df)}
+def main():
+    """Build and save inventory tables."""
+    output_dir = Path(__file__).parent.parent / "intermediate"
+    build_inventory_to_dir(output_dir)
+if __name__ == "__main__":
+    main()

dataset/scripts/build_relations.py ADDED

	@@ -0,0 +1,557 @@

+"""
+Precompute spatial relation tables for efficient anchor sampling.
+This script computes:
+- Adjacency pairs (touching features)
+- Containment pairs (features within other features)
+- Intersection pairs (overlapping features)
+- Cross-source relations (divisions_area ↔ natural_earth)
+Output:
+- intermediate/adjacency_pairs.parquet
+- intermediate/containment_pairs.parquet
+- intermediate/intersection_pairs.parquet
+- intermediate/cross_source_relations.parquet
+"""
+import duckdb
+import pandas as pd
+from pathlib import Path
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from gazet.config import DIVISIONS_AREA_PATH, NATURAL_EARTH_PATH
+# Subtypes too granular for spatial self-joins at global scale
+_EXCLUDED_SUBTYPES_FOR_GLOBAL = ("locality", "neighborhood", "microhood", "macrohood")
+def _country_filter(countries: list) -> tuple[str, list]:
+    """Return (SQL WHERE clause, params) handling 'all' sentinel."""
+    if countries == ["all"]:
+        return "", []
+    return "WHERE country IN (SELECT unnest(?))", [countries]
+def _country_filter_for_join(countries: list) -> tuple[str, list]:
+    """Like _country_filter but also excludes fine-grained subtypes for global runs.
+    When joining all 1M+ entities, localities/neighborhoods/microhoods cause
+    OOM. Excluding them keeps ~110K higher-level admin entities.
+    """
+    excluded = "', '".join(_EXCLUDED_SUBTYPES_FOR_GLOBAL)
+    subtype_clause = f"AND subtype NOT IN ('{excluded}')"
+    if countries == ["all"]:
+        return f"WHERE 1=1 {subtype_clause}", []
+    return f"WHERE country IN (SELECT unnest(?)) {subtype_clause}", [countries]
+def compute_adjacency_pairs(
+    con: duckdb.DuckDBPyConnection,
+    countries: list,
+    limit: int
+) -> pd.DataFrame:
+    """Find all pairs of features that touch (share a boundary)."""
+    print("Computing adjacency pairs (optimized with spatial index)...")
+    cfilter, cparams = _country_filter_for_join(countries)
+    # Use bounding box pre-filter to avoid full cartesian product
+    query = f"""
+    WITH features AS (
+        SELECT
+            id,
+            names."primary" AS name,
+            subtype,
+            country,
+            admin_level,
+            geometry,
+            ST_Envelope(geometry) AS bbox
+        FROM read_parquet(?)
+        {cfilter}
+    )
+    SELECT
+        a.id AS anchor_id,
+        a.name AS anchor_name,
+        a.subtype AS anchor_subtype,
+        a.country AS anchor_country,
+        b.id AS target_id,
+        b.name AS target_name,
+        b.subtype AS target_subtype,
+        b.country AS target_country,
+        'adjacency' AS relation_type
+    FROM features AS a
+    JOIN features AS b ON (
+        a.id < b.id
+        AND ST_Intersects(a.bbox, b.bbox)
+        AND ST_Touches(a.geometry, b.geometry)
+    )
+    LIMIT ?
+    """
+    df = con.execute(query, [DIVISIONS_AREA_PATH] + cparams + [limit]).fetchdf()
+    print(f"Found {len(df)} adjacency pairs")
+    return df
+def compute_containment_pairs(
+    con: duckdb.DuckDBPyConnection,
+    countries: list,
+    limit: int
+) -> pd.DataFrame:
+    """Find all pairs where one feature contains another."""
+    print("\nComputing containment pairs (optimized)...")
+    cfilter, cparams = _country_filter(countries)
+    query = f"""
+    WITH features AS (
+        SELECT
+            id,
+            names."primary" AS name,
+            subtype,
+            country,
+            admin_level,
+            geometry,
+            ST_Envelope(geometry) AS bbox
+        FROM read_parquet(?)
+        {cfilter}
+    )
+    SELECT
+        a.id AS container_id,
+        a.name AS container_name,
+        a.subtype AS container_subtype,
+        b.id AS contained_id,
+        b.name AS contained_name,
+        b.subtype AS contained_subtype,
+        'containment' AS relation_type
+    FROM features AS a
+    JOIN features AS b ON (
+        a.id != b.id
+        AND a.admin_level < b.admin_level
+        AND ST_Intersects(a.bbox, b.bbox)
+        AND ST_Within(b.geometry, a.geometry)
+    )
+    LIMIT ?
+    """
+    df = con.execute(query, [DIVISIONS_AREA_PATH] + cparams + [limit]).fetchdf()
+    print(f"Found {len(df)} containment pairs")
+    return df
+def compute_intersection_pairs(
+    con: duckdb.DuckDBPyConnection,
+    countries: list,
+    limit: int
+) -> pd.DataFrame:
+    """Find pairs that intersect but don't touch or contain."""
+    print("\nComputing intersection pairs (optimized)...")
+    cfilter, cparams = _country_filter_for_join(countries)
+    query = f"""
+    WITH features AS (
+        SELECT
+            id,
+            names."primary" AS name,
+            subtype,
+            country,
+            admin_level,
+            geometry,
+            ST_Envelope(geometry) AS bbox
+        FROM read_parquet(?)
+        {cfilter}
+    )
+    SELECT
+        a.id AS anchor_id,
+        a.name AS anchor_name,
+        a.subtype AS anchor_subtype,
+        b.id AS target_id,
+        b.name AS target_name,
+        b.subtype AS target_subtype,
+        'intersection' AS relation_type
+    FROM features AS a
+    JOIN features AS b ON (
+        a.id < b.id
+        AND ST_Intersects(a.bbox, b.bbox)
+        AND ST_Intersects(a.geometry, b.geometry)
+        AND NOT ST_Touches(a.geometry, b.geometry)
+        AND NOT ST_Within(a.geometry, b.geometry)
+        AND NOT ST_Within(b.geometry, a.geometry)
+    )
+    LIMIT ?
+    """
+    df = con.execute(query, [DIVISIONS_AREA_PATH] + cparams + [limit]).fetchdf()
+    print(f"Found {len(df)} same-source intersection pairs")
+    return df
+def compute_cross_source_relations(
+    con: duckdb.DuckDBPyConnection,
+    countries: list,
+    limit: int
+) -> pd.DataFrame:
+    """Find relations between divisions_area and natural_earth.
+    Covers all natural_earth subtypes that appear in SQL templates:
+    seas/oceans (adjacency, buffer, chained), terrain areas and island
+    groups (chained_03, intersect_02, buffer_03/04).
+    """
+    print("\nComputing cross-source relations...")
+    cfilter, cparams = _country_filter(countries)
+    query = f"""
+    WITH divisions AS (
+        SELECT
+            id,
+            names."primary" AS name,
+            subtype,
+            country,
+            geometry
+        FROM read_parquet(?)
+        {cfilter}
+    ),
+    natural_features AS (
+        SELECT
+            id,
+            names."primary" AS name,
+            subtype,
+            ST_SetCRS(geometry, 'OGC:CRS84') AS geometry
+        FROM read_parquet(?)
+        WHERE subtype IN (
+            'sea', 'ocean', 'Lake', 'River', 'Basin', 'gulf', 'bay',
+            'Terrain area', 'Island group', 'Peninsula', 'Strait',
+            'Reef', 'Range/Mts', 'Depression'
+        )
+        LIMIT 500
+    )
+    SELECT
+        d.id AS division_id,
+        d.name AS division_name,
+        d.subtype AS division_subtype,
+        d.country AS division_country,
+        n.id AS natural_id,
+        n.name AS natural_name,
+        n.subtype AS natural_subtype,
+        CASE
+            WHEN ST_Touches(d.geometry, n.geometry) THEN 'touches'
+            WHEN ST_Within(d.geometry, n.geometry) THEN 'within'
+            WHEN ST_Contains(d.geometry, n.geometry) THEN 'contains'
+            WHEN ST_Intersects(d.geometry, n.geometry) THEN 'intersects'
+        END AS relation_type
+    FROM divisions AS d
+    JOIN natural_features AS n ON ST_Intersects(d.geometry, n.geometry)
+    LIMIT ?
+    """
+    df = con.execute(
+        query, [DIVISIONS_AREA_PATH] + cparams + [NATURAL_EARTH_PATH, limit]
+    ).fetchdf()
+    print(f"Found {len(df)} cross-source relations")
+    return df
+def compute_coastal_containment_pairs(
+    con: duckdb.DuckDBPyConnection,
+    countries: list,
+    limit: int,
+) -> pd.DataFrame:
+    """Containment pairs where the container is in a coastal country.
+    Used by chained_01 (coastal towns of X) to ensure sampled containment
+    anchors actually have sea-adjacent sub-features, keeping the SQL
+    verification step from constantly returning empty results.
+    Strategy: find countries whose geometry intersects any ocean/sea in
+    natural_earth, then filter containment_pairs to those countries.
+    """
+    print("\nComputing coastal containment pairs...")
+    cfilter, cparams = _country_filter(countries)
+    query = f"""
+    WITH coastal_countries AS (
+        SELECT DISTINCT d.country
+        FROM read_parquet(?) AS d
+        JOIN read_parquet(?) AS n
+          ON ST_Intersects(d.geometry, ST_SetCRS(n.geometry, 'OGC:CRS84'))
+        WHERE d.subtype = 'country'
+          AND n.subtype IN ('sea', 'ocean')
+    ),
+    features AS (
+        SELECT
+            id,
+            names."primary" AS name,
+            subtype,
+            country,
+            admin_level,
+            geometry,
+            ST_Envelope(geometry) AS bbox
+        FROM read_parquet(?)
+        {cfilter}
+    )
+    SELECT
+        a.id AS container_id,
+        a.name AS container_name,
+        a.subtype AS container_subtype,
+        b.id AS contained_id,
+        b.name AS contained_name,
+        b.subtype AS contained_subtype,
+        a.country AS container_country,
+        'coastal_containment' AS relation_type
+    FROM features AS a
+    JOIN features AS b ON (
+        a.id != b.id
+        AND a.admin_level < b.admin_level
+        AND ST_Intersects(a.bbox, b.bbox)
+        AND ST_Within(b.geometry, a.geometry)
+    )
+    WHERE a.country IN (SELECT country FROM coastal_countries)
+    LIMIT ?
+    """
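+    # Placeholders bind in order of appearance: the two coastal_countries
+    # reads, the features read, the country filter values, then LIMIT.
+    df = con.execute(
+        query,
+        [DIVISIONS_AREA_PATH, NATURAL_EARTH_PATH, DIVISIONS_AREA_PATH] + cparams + [limit],
+    ).fetchdf()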
+    print(f"Found {len(df)} coastal containment pairs")
+    return df
+def compute_landlocked_containment_pairs(
+    con: duckdb.DuckDBPyConnection,
+    countries: list,
+    limit: int,
+) -> pd.DataFrame:
+    """Containment pairs where the container is in a landlocked country.
+    Used by chained_02 (landlocked regions within X) to ensure sampled
+    anchors genuinely have no sea access, keeping SQL verification from
+    always returning empty.
+    """
+    print("\nComputing landlocked containment pairs...")
+    cfilter, cparams = _country_filter(countries)
+    query = f"""
+    WITH coastal_countries AS (
+        SELECT DISTINCT d.country
+        FROM read_parquet(?) AS d
+        JOIN read_parquet(?) AS n
+          ON ST_Intersects(d.geometry, ST_SetCRS(n.geometry, 'OGC:CRS84'))
+        WHERE d.subtype = 'country'
+          AND n.subtype IN ('sea', 'ocean')
+    ),
+    features AS (
+        SELECT
+            id,
+            names."primary" AS name,
+            subtype,
+            country,
+            admin_level,
+            geometry,
+            ST_Envelope(geometry) AS bbox
+        FROM read_parquet(?)
+        {cfilter}
+    )
+    SELECT
+        a.id AS container_id,
+        a.name AS container_name,
+        a.subtype AS container_subtype,
+        b.id AS contained_id,
+        b.name AS contained_name,
+        b.subtype AS contained_subtype,
+        a.country AS container_country,
+        'landlocked_containment' AS relation_type
+    FROM features AS a
+    JOIN features AS b ON (
+        a.id != b.id
+        AND a.admin_level < b.admin_level
+        AND ST_Intersects(a.bbox, b.bbox)
+        AND ST_Within(b.geometry, a.geometry)
+    )
+    WHERE a.country NOT IN (SELECT country FROM coastal_countries)
+    LIMIT ?
+    """
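+    # Placeholders bind in order of appearance: the two coastal_countries
+    # reads, the features read, the country filter values, then LIMIT.
+    df = con.execute(
+        query,
+        [DIVISIONS_AREA_PATH, NATURAL_EARTH_PATH, DIVISIONS_AREA_PATH] + cparams + [limit],
+    ).fetchdf()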
+    print(f"Found {len(df)} landlocked containment pairs")
+    return df
+def compute_common_neighbor_pairs(
+    con: duckdb.DuckDBPyConnection,
+    countries: list,
+    limit: int,
+) -> pd.DataFrame:
+    """Pairs of anchors that share at least one common touching neighbour.
+    Used by multi_adj_01 (borders both X and Y) so that the generated SQL
+    is guaranteed to return at least one result rather than failing constantly
+    on random pairs that have no common neighbour.
+    Derived by self-joining adjacency_pairs on the shared target_id.
+    """
+    print("\nComputing common-neighbor pairs...")
+    adj_path = Path(__file__).parent.parent / "intermediate" / "adjacency_pairs.parquet"
+    if not adj_path.exists():
+        print("  adjacency_pairs.parquet not found — skipping (run adjacency first)")
+        return pd.DataFrame(columns=[
+            "anchor_id_1", "anchor_name_1", "anchor_id_2", "anchor_name_2",
+            "shared_neighbor_id", "shared_neighbor_name",
+        ])
+    query = """
+    SELECT DISTINCT
+        a1.anchor_id   AS anchor_id_1,
+        a1.anchor_name AS anchor_name_1,
+        a2.anchor_id   AS anchor_id_2,
+        a2.anchor_name AS anchor_name_2,
+        a1.target_id   AS shared_neighbor_id,
+        a1.target_name AS shared_neighbor_name
+    FROM read_parquet(?) AS a1
+    JOIN read_parquet(?) AS a2
+      ON a1.target_id = a2.target_id
+     AND a1.anchor_id < a2.anchor_id
+    LIMIT ?
+    """
+    df = con.execute(query, [str(adj_path), str(adj_path), limit]).fetchdf()
+    print(f"Found {len(df)} common-neighbor pairs")
+    return df
+def _make_connection():
+    """Create a new DuckDB connection with spatial extension loaded."""
+    con = duckdb.connect()
+    con.execute("INSTALL spatial")
+    con.execute("LOAD spatial")
+    con.execute("SET memory_limit='24GB'")
+    con.execute("SET temp_directory='/tmp/duckdb_tmp'")
+    con.execute("SET threads=4")
+    return con
+def _compute_and_save(compute_fn, countries, limit, output_path):
+    """Compute a relation table and save it to parquet. Uses its own DuckDB connection."""
+    con = _make_connection()
+    try:
+        df = compute_fn(con, countries, limit)
+        df.to_parquet(output_path, index=False)
+        print(f"Saved to {output_path}")
+        return df
+    finally:
+        con.close()
+RELATION_FUNCTIONS = {
+    "adjacency":              compute_adjacency_pairs,
+    "containment":            compute_containment_pairs,
+    "intersection":           compute_intersection_pairs,
+    "cross_source":           compute_cross_source_relations,
+    "coastal_containment":    compute_coastal_containment_pairs,
+    "landlocked_containment": compute_landlocked_containment_pairs,
+    "common_neighbor":        compute_common_neighbor_pairs,
+}
+def compute_single_relation(
+    relation_type: str,
+    countries: list,
+    limit: int,
+    output_dir: Path,
+) -> int:
+    """Compute one relation type and save to output_dir.
+    Returns the number of rows computed. Usable from Modal or locally.
+    """
+    compute_fn = RELATION_FUNCTIONS.get(relation_type)
+    if compute_fn is None:
+        raise ValueError(
+            f"Unknown relation type: {relation_type}. "
+            f"Expected one of {list(RELATION_FUNCTIONS)}"
+        )
+    output_dir.mkdir(exist_ok=True, parents=True)
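+    # NOTE: for "cross_source" this yields cross_source_pairs.parquet, whereas
+    # main() writes cross_source_relations.parquet.
+    output_path = output_dir / f"{relation_type}_pairs.parquet"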
+    df = _compute_and_save(compute_fn, countries, limit, output_path)
+    return len(df)
+def main(countries: list = None, relation_limits: dict = None):
+    """Compute and save all relation tables in parallel.
+    Args:
+        countries: List of country codes to process
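+        relation_limits: Dict of row limits keyed by relation type. All seven
+            keys are required: adjacency, containment, intersection,
+            cross_source, coastal_containment, landlocked_containment,
+            common_neighbor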
+    """
+    # Defaults
+    if countries is None:
+        countries = ['EC', 'BE', 'KE', 'AE', 'SG', 'CH']
+    if relation_limits is None:
+        relation_limits = {
+            'adjacency':              50000,
+            'containment':            1000,
+            'intersection':           500,
+            'cross_source':           500,
+            'coastal_containment':    1000,
+            'landlocked_containment': 500,
+            'common_neighbor':        5000,
+        }
+    output_dir = Path(__file__).parent.parent / "intermediate"
+    output_dir.mkdir(exist_ok=True, parents=True)
+    # Define all relation tasks.
+    # common_neighbor depends on adjacency_pairs so it runs after adjacency.
+    tasks = [
+        ("adjacency",              compute_adjacency_pairs,              relation_limits['adjacency'],              output_dir / "adjacency_pairs.parquet"),
+        ("containment",            compute_containment_pairs,            relation_limits['containment'],            output_dir / "containment_pairs.parquet"),
+        ("intersection",           compute_intersection_pairs,           relation_limits['intersection'],           output_dir / "intersection_pairs.parquet"),
+        ("cross_source",           compute_cross_source_relations,       relation_limits['cross_source'],           output_dir / "cross_source_relations.parquet"),
+        ("coastal_containment",    compute_coastal_containment_pairs,    relation_limits['coastal_containment'],    output_dir / "coastal_containment_pairs.parquet"),
+        ("landlocked_containment", compute_landlocked_containment_pairs, relation_limits['landlocked_containment'], output_dir / "landlocked_containment_pairs.parquet"),
+        ("common_neighbor",        compute_common_neighbor_pairs,        relation_limits['common_neighbor'],        output_dir / "common_neighbor_pairs.parquet"),
+    ]
+    # common_neighbor reads adjacency_pairs.parquet so it must run after
+    # adjacency finishes.  Split into two waves.
+    independent_tasks = [t for t in tasks if t[0] != "common_neighbor"]
+    dependent_tasks   = [t for t in tasks if t[0] == "common_neighbor"]
+    print(f"Computing {len(independent_tasks)} relation types in parallel...")
+    with ThreadPoolExecutor(max_workers=len(independent_tasks)) as executor:
+        futures = {
+            executor.submit(_compute_and_save, compute_fn, countries, limit, path): name
+            for name, compute_fn, limit, path in independent_tasks
+        }
+        for future in as_completed(futures):
+            name = futures[future]
+            try:
+                future.result()
+            except Exception as e:
+                print(f"ERROR computing {name}: {e}")
+                raise
+    for name, compute_fn, limit, path in dependent_tasks:
+        print(f"\nComputing {name} (depends on adjacency)...")
+        try:
+            _compute_and_save(compute_fn, countries, limit, path)
+        except Exception as e:
+            print(f"ERROR computing {name}: {e}")
+            raise
+    print("\nRelation tables build complete")
+if __name__ == "__main__":
+    main()
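
The self-join behind `compute_common_neighbor_pairs` can be tried on a toy table. The following is a minimal sketch, assuming an in-memory SQLite table as a stand-in for `adjacency_pairs.parquet` (the script itself uses DuckDB and `read_parquet`); the rows and the `common_neighbor_pairs` helper are hypothetical. An `ORDER BY` is added here only to make the output deterministic, since the original query applies `LIMIT` without one.

```python
import sqlite3

# Hypothetical adjacency rows: (anchor_id, anchor_name, target_id, target_name).
ADJACENCY = [
    ("A", "Alpha", "C", "Gamma"),
    ("B", "Beta", "C", "Gamma"),
    ("B", "Beta", "D", "Delta"),
    ("E", "Epsilon", "D", "Delta"),
    ("E", "Epsilon", "F", "Zeta"),
]

# Same join shape as the script: match rows on target_id and keep each
# unordered anchor pair once via anchor_id < anchor_id.
QUERY = """
SELECT DISTINCT
    a1.anchor_id AS anchor_id_1,
    a2.anchor_id AS anchor_id_2,
    a1.target_id AS shared_neighbor_id
FROM adjacency AS a1
JOIN adjacency AS a2
  ON a1.target_id = a2.target_id
 AND a1.anchor_id < a2.anchor_id
ORDER BY 1, 2, 3
LIMIT ?
"""


def common_neighbor_pairs(rows, limit):
    """Return (anchor_1, anchor_2, shared_neighbor) tuples from adjacency rows."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute(
            "CREATE TABLE adjacency ("
            "anchor_id TEXT, anchor_name TEXT, target_id TEXT, target_name TEXT)"
        )
        con.executemany("INSERT INTO adjacency VALUES (?, ?, ?, ?)", rows)
        return con.execute(QUERY, [limit]).fetchall()
    finally:
        con.close()


pairs = common_neighbor_pairs(ADJACENCY, limit=10)
print(pairs)
```

The `a1.anchor_id < a2.anchor_id` condition drops self-pairs and mirrored duplicates, so Alpha/Beta (sharing Gamma) and Beta/Epsilon (sharing Delta) each appear once, while Zeta, which has a single anchor, contributes nothing.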

dataset/scripts/cli.py ADDED

	@@ -0,0 +1,377 @@

+#!/usr/bin/env python3
+"""
+CLI for synthetic dataset generation.
+Usage:
+    python cli.py build-relations --config ../config.yaml
+    python cli.py generate-samples --config ../config.yaml
+    python cli.py generate-samples --config ../config.yaml --append
+    python cli.py full-pipeline --config ../config.yaml
+"""
+import argparse
+import subprocess
+import sys
+from pathlib import Path
+from typing import Dict
+import pandas as pd
+import yaml
+def load_config(config_path: Path) -> dict:
+    """Load configuration from YAML file."""
+    with open(config_path) as f:
+        return yaml.safe_load(f)
+def should_rebuild_relations(config: dict, intermediate_dir: Path, append: bool) -> bool:
+    """Check if relation tables need to be rebuilt.
+    Returns True if:
+    - Not in append mode (always rebuild)
+    - Relation tables don't exist
+    - Countries in config differ from countries in existing relation tables
+    """
+    if not append:
+        return True
+    # Check if relation tables exist
+    adjacency_file = intermediate_dir / "adjacency_pairs.parquet"
+    if not adjacency_file.exists():
+        print("WARNING: Relation tables not found, will rebuild despite append mode")
+        return True
+    # Check if countries have changed
+    try:
+        df = pd.read_parquet(adjacency_file)
+        if 'anchor_country' in df.columns:
+            existing_countries = set(df['anchor_country'].unique())
+            config_countries = set(config['countries'])
+            if existing_countries != config_countries:
+                print("WARNING: Countries changed:")
+                print(f"    Previous: {sorted(existing_countries)}")
+                print(f"    New: {sorted(config_countries)}")
+                print("    Will rebuild relation tables to include new countries")
+                return True
+            else:
+                print(f"Countries unchanged: {sorted(config_countries)}")
+                return False
+        else:
+            # Can't determine countries, rebuild to be safe
+            print("WARNING: Cannot determine countries from existing tables, will rebuild")
+            return True
+    except Exception as e:
+        print(f"WARNING: Error reading existing relation tables: {e}")
+        print("    Will rebuild to be safe")
+        return True
+def calculate_relation_limits(config: dict) -> Dict[str, int]:
+    """Auto-calculate relation limits based on sample targets."""
+    sample_targets = config['sample_targets']
+    retry_mult = config['generation']['retry_multiplier']
+    safety = config.get('auto_scaling', {}).get('safety_factor', 1.5)
+    # Map each task family to the relation tables it draws anchors from.
+    # A family can need multiple relation types.
+    family_to_relations = {
+        'direct_lookup':      [],
+        'adjacency':          ['adjacency'],
+        'multi_adjacency':    ['adjacency', 'common_neighbor'],
+        'containment':        ['containment'],
+        'intersection':       ['intersection', 'cross_source'],
+        'buffer':             ['adjacency'],
+        'chained':            ['coastal_containment', 'landlocked_containment', 'containment'],
+        'difference':         ['containment', 'cross_source'],
+        'border_corridor':    ['adjacency'],
+        'set_operations':     ['containment', 'cross_source'],
+        'partial_selection':  ['containment', 'cross_source'],
+        'aggregation':        ['containment'],
+        'window_function':    [],
+        'attribute_filter':   [],
+    }
+    relation_needs: Dict[str, int] = {}
+    for family, target in sample_targets.items():
+        for rel_type in family_to_relations.get(family, []):
+            needed = int(target * retry_mult * safety)
+            relation_needs[rel_type] = relation_needs.get(rel_type, 0) + needed
+    # common_neighbor is derived from adjacency — keep its limit proportional
+    if 'common_neighbor' not in relation_needs and 'adjacency' in relation_needs:
+        relation_needs['common_neighbor'] = relation_needs['adjacency'] * 3
+    # Apply manual overrides if specified
+    # An empty YAML key parses as None, so fall back to an empty dict.
+    manual = (config.get('auto_scaling') or {}).get('manual_limits') or {}
+    relation_needs.update(manual)
+    return relation_needs
+def build_relations(config_path: Path):
+    """Run relation building with config."""
+    config = load_config(config_path)
+    # Auto-calculate relation limits
+    relation_limits = calculate_relation_limits(config)
+    print("=" * 60)
+    print("STEP 1: Building Relation Tables")
+    print("=" * 60)
+    print(f"Countries: {', '.join(config['countries'])}")
+    print("\nAuto-calculated relation limits:")
+    for rel_type, limit in relation_limits.items():
+        print(f"  {rel_type:20s}: {limit:,}")
+    print()
+    # Import the relation builder under an alias so the module does not
+    # shadow this function's own name.
+    from dataset.scripts import build_relations as br_module
+    # Run with config parameters
+    br_module.main(
+        countries=config['countries'],
+        relation_limits=relation_limits
+    )
+    print("\nRelation tables built successfully")
+def generate_samples(config_path: Path, append: bool = False):
+    """Run sample generation with config."""
+    config = load_config(config_path)
+    print("=" * 60)
+    print("STEP 2: Generating Samples")
+    print("=" * 60)
+    print(f"Targets: {config['sample_targets']}")
+    print(f"Workers: {config['generation']['max_workers']}")
+    print(f"Append mode: {append or config['generation']['append_mode']}")
+    print()
+    # Import the generator module so its module-level settings can be overridden
+    from dataset.scripts import generate_samples as gs_module
+    # Override config values
+    gs_module.TARGET_COUNTS = config['sample_targets']
+    gs_module.MAX_WORKERS = config['generation']['max_workers']
+    gs_module.RETRY_MULTIPLIER = config['generation']['retry_multiplier']
+    gs_module.APPEND_MODE = append or config['generation']['append_mode']
+    # Run the main function
+    gs_module.main()
+    print("\nSamples generated successfully")
+def validate_dataset(config_path: Path):
+    """Run dataset validation."""
+    print("=" * 60)
+    print("STEP 3: Validating Dataset")
+    print("=" * 60)
+    script_dir = Path(__file__).parent
+    subprocess.run(
+        [sys.executable, str(script_dir / "validate_dataset.py")],
+        check=True
+    )
+    print("\nDataset validated successfully")
+def export_dataset(config_path: Path):
+    """Run dataset export for both SQL generation and place extraction tasks."""
+    print("=" * 60)
+    print("STEP 4: Exporting Dataset")
+    print("=" * 60)
+    from dataset.scripts.export_training_data import main as export_main
+    export_main(config_path=config_path)
+    print("\nDataset exported successfully")
+def modal_upload(config_path: Path):
+    """Upload local data to Modal volume."""
+    subprocess.run(
+        [sys.executable, "-m", "modal", "run",
+         "dataset/modal_app.py::upload_data"],
+        check=True
+    )
+def modal_generate(config_path: Path, num_containers: int = 0,
+                   skip_inventory: bool = False, skip_relations: bool = False,
+                   fresh: bool = False):
+    """Run distributed generation on Modal (appends by default)."""
+    cmd = [
+        sys.executable, "-m", "modal", "run",
+        "dataset/modal_app.py::run_pipeline",
+        "--config-path", str(config_path),
+    ]
+    if num_containers > 0:
+        cmd.extend(["--num-containers", str(num_containers)])
+    if skip_inventory:
+        cmd.append("--skip-inventory")
+    if skip_relations:
+        cmd.append("--skip-relations")
+    if fresh:
+        cmd.append("--fresh")
+    subprocess.run(cmd, check=True)
+    validate_dataset(config_path)
+    export_dataset(config_path)
+def full_pipeline(config_path: Path, append: bool = False):
+    """Run the full pipeline."""
+    print("Running full dataset generation pipeline")
+    config = load_config(config_path)
+    # Check if inventory exists, create if not
+    script_dir = Path(__file__).parent
+    intermediate_dir = script_dir.parent / "intermediate"
+    inventory_files = [
+        intermediate_dir / "divisions_area_inventory.parquet",
+        intermediate_dir / "natural_earth_inventory.parquet"
+    ]
+    inventory_missing = any(not f.exists() for f in inventory_files)
+    if inventory_missing:
+        print("=" * 60)
+        print("STEP 0: Building Entity Inventory")
+        print("=" * 60)
+        print("Inventory files not found, building...")
+        from dataset.scripts import build_inventory
+        build_inventory.main()
+    # Check if we need to rebuild relations
+    need_rebuild = should_rebuild_relations(config, intermediate_dir, append)
+    if need_rebuild:
+        build_relations(config_path)
+    else:
+        print("Using existing relation tables (append mode, same countries)")
+    generate_samples(config_path, append=append)
+    validate_dataset(config_path)
+    export_dataset(config_path)
+    print("\nPipeline complete")
+def main():
+    parser = argparse.ArgumentParser(
+        description="Synthetic dataset generation CLI",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  # Build relation tables only
+  python cli.py build-relations --config ../config.yaml
+  # Generate samples only
+  python cli.py generate-samples --config ../config.yaml
+  # Generate and append to existing dataset
+  python cli.py generate-samples --config ../config.yaml --append
+  # Run full pipeline
+  python cli.py full-pipeline --config ../config.yaml
+  # Run full pipeline in append mode (reuses relation tables if they exist and countries are unchanged)
+  python cli.py full-pipeline --config ../config.yaml --append
+  # Upload data to Modal volume (one-time)
+  python cli.py modal-upload --config ../config.yaml
+  # Run distributed generation on Modal
+  python cli.py modal-generate --config ../config.yaml
+  python cli.py modal-generate --config ../config.yaml --num-containers 100
+  python cli.py modal-generate --config ../config.yaml --skip-inventory --skip-relations
+        """
+    )
+    parser.add_argument(
+        'command',
+        choices=['build-relations', 'generate-samples', 'validate', 'export',
+                 'full-pipeline', 'modal-upload', 'modal-generate'],
+        help='Command to run'
+    )
+    parser.add_argument(
+        '--config',
+        type=Path,
+        required=True,
+        help='Path to config YAML file'
+    )
+    parser.add_argument(
+        '--append',
+        action='store_true',
+        help='Append to existing dataset instead of overwriting'
+    )
+    parser.add_argument(
+        '--num-containers',
+        type=int,
+        default=0,
+        help='Number of Modal containers (0 = use config default)'
+    )
+    parser.add_argument(
+        '--skip-inventory',
+        action='store_true',
+        help='Skip inventory building on Modal'
+    )
+    parser.add_argument(
+        '--skip-relations',
+        action='store_true',
+        help='Skip relation building on Modal'
+    )
+    parser.add_argument(
+        '--fresh',
+        action='store_true',
+        help='Overwrite existing dataset instead of appending (Modal only)'
+    )
+    args = parser.parse_args()
+    # Validate config file exists
+    if not args.config.exists():
+        print(f"Error: Config file not found: {args.config}")
+        sys.exit(1)
+    # Run the appropriate command
+    try:
+        if args.command == 'build-relations':
+            build_relations(args.config)
+        elif args.command == 'generate-samples':
+            generate_samples(args.config, args.append)
+        elif args.command == 'validate':
+            validate_dataset(args.config)
+        elif args.command == 'export':
+            export_dataset(args.config)
+        elif args.command == 'full-pipeline':
+            full_pipeline(args.config, args.append)
+        elif args.command == 'modal-upload':
+            modal_upload(args.config)
+        elif args.command == 'modal-generate':
+            modal_generate(
+                args.config,
+                num_containers=args.num_containers,
+                skip_inventory=args.skip_inventory,
+                skip_relations=args.skip_relations,
+                fresh=args.fresh,
+            )
+    except Exception as e:
+        print(f"\nError: {e}")
+        sys.exit(1)
+if __name__ == "__main__":
+    main()
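
The arithmetic in `calculate_relation_limits` is easier to check with concrete numbers. The following is a minimal sketch, assuming a four-family subset of the mapping and hypothetical targets and multipliers (none of these values come from `config.yaml`); the `relation_limits` helper is illustrative and omits the `common_neighbor` fallback and manual overrides.

```python
from typing import Dict, List

# Subset of the family-to-relation mapping used by calculate_relation_limits.
FAMILY_TO_RELATIONS: Dict[str, List[str]] = {
    "direct_lookup": [],
    "adjacency": ["adjacency"],
    "multi_adjacency": ["adjacency", "common_neighbor"],
    "containment": ["containment"],
}


def relation_limits(
    sample_targets: Dict[str, int], retry_mult: float, safety: float
) -> Dict[str, int]:
    """Sum target * retry_mult * safety over every family that uses a relation."""
    needs: Dict[str, int] = {}
    for family, target in sample_targets.items():
        for rel_type in FAMILY_TO_RELATIONS.get(family, []):
            needed = int(target * retry_mult * safety)
            needs[rel_type] = needs.get(rel_type, 0) + needed
    return needs


# Hypothetical sample targets and multipliers, chosen for easy arithmetic.
limits = relation_limits(
    {"direct_lookup": 300, "adjacency": 100, "multi_adjacency": 50, "containment": 200},
    retry_mult=3,
    safety=1.5,
)
print(limits)
```

Here `adjacency` receives 450 rows from its own family plus 225 from `multi_adjacency`, which is why a relation shared by several families ends up with a larger limit than any single target suggests. Families with an empty relation list, such as `direct_lookup`, add nothing.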

dataset/scripts/export_training_data.py ADDED

	@@ -0,0 +1,372 @@

+"""
+Export validated dataset to train/val/test splits.
+Produces two task datasets from the same source samples:
+1. SQL generation  (prompt = question + candidates CSV, completion = SQL)
+2. Place extraction (prompt = question only, completion = PlacesResult JSON)
+Place extraction pairs are derived automatically: for each SQL sample the
+selected_candidates give us the correct place names, subtypes, and country
+codes that the extractor should return.
+Output layout (all paths relative to dataset/):
+    output/runs/{run_name}/sql/train.jsonl
+    output/runs/{run_name}/sql/val.jsonl
+    output/runs/{run_name}/sql/test.jsonl
+    output/runs/{run_name}/places/train.jsonl
+    output/runs/{run_name}/places/val.jsonl
+    output/runs/{run_name}/places/test.jsonl
+    output/runs/{run_name}/stats.json
+"""
+import json
+import random
+import sys
+from collections import defaultdict
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Tuple
+import sqlparse
+import yaml
+from gazet.config import DIVISIONS_AREA_PATH, NATURAL_EARTH_PATH
+# ---------------------------------------------------------------------------
+# Loading
+# ---------------------------------------------------------------------------
+def load_samples(jsonl_path: Path) -> List[Dict[str, Any]]:
+    samples = []
+    with open(jsonl_path) as f:
+        for line in f:
+            line = line.strip()
+            if line:
+                samples.append(json.loads(line))
+    return samples
+def load_run_name(config_path: Optional[Path]) -> str:
+    if config_path and config_path.exists():
+        with open(config_path) as f:
+            cfg = yaml.safe_load(f)
+        return cfg.get("run_name", "default")
+    return "default"
+# ---------------------------------------------------------------------------
+# Splitting
+# ---------------------------------------------------------------------------
+def stratified_split(
+    samples: List[Dict[str, Any]],
+    train_ratio: float = 0.8,
+    val_ratio: float = 0.1,
+    seed: int = 42,
+) -> Tuple[List[Dict], List[Dict], List[Dict]]:
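+    """Split stratified by task_family, applying the same ratios to each family.
+    Counts are truncated with int(), so a family with very few samples can be
+    missing from train or val; the remainder always goes to test.
+    """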
+    random.seed(seed)
+    by_family: Dict[str, List] = defaultdict(list)
+    for s in samples:
+        by_family[s["metadata"]["task_family"]].append(s)
+    train, val, test = [], [], []
+    for family_samples in by_family.values():
+        random.shuffle(family_samples)
+        n = len(family_samples)
+        n_train = int(n * train_ratio)
+        n_val = int(n * val_ratio)
+        train.extend(family_samples[:n_train])
+        val.extend(family_samples[n_train : n_train + n_val])
+        test.extend(family_samples[n_train + n_val :])
+    random.shuffle(train)
+    random.shuffle(val)
+    random.shuffle(test)
+    return train, val, test
+# ---------------------------------------------------------------------------
+# SQL generation format
+# Conversational prompt-completion: model sees system + user, generates SQL.
+# ---------------------------------------------------------------------------
+_SQL_SYSTEM = """You are a text to SQL query translator that helps in natural language geocoding.
+You have access to two DuckDB parquet tables. Given a set of candidate entities and a user query, generate the SQL to retrieve the desired geometry.
+<SCHEMA>
+1. divisions_area  -- Overture polygon/multipolygon admin boundaries
+   query: read_parquet('divisions_area')
+   columns:
+     id VARCHAR              -- unique feature id
+     names STRUCT("primary" VARCHAR, ...)
+     country VARCHAR         -- ISO 3166-1 alpha-2
+     subtype VARCHAR         -- country | region | dependency | county | localadmin |
+                               locality | macrohood | neighborhood | microhood
+     class VARCHAR
+     region VARCHAR
+     admin_level INTEGER
+     division_id VARCHAR
+     is_land BOOLEAN
+     is_territorial BOOLEAN
+     geometry GEOMETRY       -- WGS-84 polygon/multipolygon (spatial ext loaded)
+2. natural_earth  -- Natural Earth geography polygons (oceans, seas, rivers, terrain)
+   query: read_parquet('natural_earth')
+   columns:
+     id VARCHAR              -- unique feature id prefixed 'ne_'
+     names STRUCT("primary" VARCHAR, ...)
+     country VARCHAR
+     subtype VARCHAR         -- e.g. 'ocean', 'sea', 'bay', 'Terrain area', 'Island group'
+     class VARCHAR
+     region VARCHAR
+     admin_level INTEGER
+     is_land BOOLEAN
+     is_territorial BOOLEAN
+     geometry GEOMETRY       -- WGS-84 polygon/multipolygon (spatial ext loaded)
+</SCHEMA>
+The candidates table has a 'source' column: 'divisions_area' or 'natural_earth'.
+Use read_parquet('divisions_area') or read_parquet('natural_earth') accordingly.
+Use ST_AsGeoJSON(geometry) for all geometry outputs."""
+_CANDIDATES_COLS = [
+    "source", "id", "name", "subtype", "country", "region",
+    "admin_level", "similarity",
+]
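+def _candidates_csv(candidates: List[Dict]) -> str:
+    """Render candidates as CSV, keeping only the known columns that occur."""
+    import csv
+    import io
+    if not candidates:
+        return ""
+    # Use every known column present in at least one candidate, so rows with
+    # differing keys cannot make DictWriter raise on an unexpected field.
+    fieldnames = [col for col in _CANDIDATES_COLS if any(col in c for c in candidates)]
+    buf = io.StringIO()
+    writer = csv.DictWriter(buf, fieldnames=fieldnames)
+    writer.writeheader()
+    for c in candidates:
+        writer.writerow({col: c.get(col, "") for col in fieldnames})
+    return buf.getvalue().strip()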
+def _to_symbolic_sql(sql: str) -> str:
+    """Normalize any hardcoded or runtime paths back to symbolic names."""
+    sql = sql.replace(DIVISIONS_AREA_PATH, "divisions_area")
+    sql = sql.replace(NATURAL_EARTH_PATH, "natural_earth")
+    sql = sql.replace("/data/overture/division_area/*.parquet",          "divisions_area")
+    sql = sql.replace("/data/overture/divisions_area/*.parquet",         "divisions_area")
+    sql = sql.replace("/data/natural_earth_geoparquet/ne_geography.parquet", "natural_earth")
+    return sql
+def _format_sql(sql: str) -> str:
+    """Pretty-print SQL so the model learns clean, readable style."""
+    return sqlparse.format(
+        sql,
+        reindent=True,
+        keyword_case="upper",
+        indent_width=4,
+    ).strip()
+def sample_to_sql_pair(sample: Dict[str, Any]) -> Optional[Dict]:
+    """Convert a raw sample to a conversational prompt-completion pair for SQL generation."""
+    sql = sample.get("target", {}).get("sql", "").strip()
+    if not sql:
+        return None
+    sql = _format_sql(_to_symbolic_sql(sql))
+    user_content = (
+        f"<CANDIDATES>\n{_candidates_csv(sample.get('candidates', []))}\n</CANDIDATES>\n\n"
+        f"<USER_QUERY>\n{sample['question']}\n</USER_QUERY>"
+    )
+    return {
+        "messages": [
+            {"role": "system",    "content": _SQL_SYSTEM},
+            {"role": "user",      "content": user_content},
+            {"role": "assistant", "content": sql},
+        ],
+        "metadata": sample.get("metadata", {}),
+    }
+# ---------------------------------------------------------------------------
+# Place extraction format
+# Derived from the same SQL samples: selected_candidates → PlacesResult JSON.
+# ---------------------------------------------------------------------------
+_PLACE_SYSTEM = """You are a geographic entity extractor. Extract place names from the user query and return valid JSON only.
+OUTPUT FORMAT:
+{"places": [{"place": "<name>", "country": "<ISO-2>", "subtype": "<subtype>"}]}
+"country" and "subtype" are optional; omit if not applicable.
+RULES:
+- Only extract places explicitly mentioned. Never infer or expand (e.g. "states of India" -> extract "India" only).
+- No duplicate place names.
+- "country": ISO 3166-1 alpha-2. Include only if explicitly mentioned or unambiguous.
+- "subtype": include only when the geographic level is clear from the query.
+SUBTYPES:
+country, dependency, region, county, localadmin, locality, macrohood, neighborhood, microhood
+- Default to locality for cities/towns; omit for physical features (oceans, rivers, mountains)."""
+# Overture division subtypes — used to filter out natural_earth candidates
+# from the place extraction output (NE features don't have these subtypes).
+_DIVISION_SUBTYPES = {
+    "country", "region", "dependency", "county", "localadmin",
+    "locality", "macrohood", "neighborhood", "microhood",
+}
+def _candidate_to_place(c: Dict) -> Optional[Dict]:
+    """Convert a selected candidate to a Place dict for PlacesResult."""
+    name = (c.get("name") or "").strip()  # tolerate a null name
+    if not name:
+        return None
+    place: Dict[str, Any] = {"place": name}
+    subtype = c.get("subtype", "")
+    if subtype in _DIVISION_SUBTYPES:
+        place["subtype"] = subtype
+    country = c.get("country", "")
+    if country and len(country) == 2:
+        place["country"] = country
+    return place
+def sample_to_place_pair(sample: Dict[str, Any]) -> Optional[Dict]:
+    """Convert a raw sample to a conversational prompt-completion pair for place extraction.
+    Uses selected_candidates to determine the correct PlacesResult output.
+    Skips samples where no valid places can be derived.
+    """
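+    # De-duplicate while preserving order so the emitted places list is deterministic.
+    selected_ids = list(dict.fromkeys(sample.get("target", {}).get("selected_candidates", [])))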
+    if not selected_ids:
+        return None
+    id_to_candidate = {c["candidate_id"]: c for c in sample.get("candidates", [])}
+    places = []
+    seen_names: set = set()
+    for cid in selected_ids:
+        c = id_to_candidate.get(cid)
+        if not c:
+            continue
+        place = _candidate_to_place(c)
+        if place and place["place"].lower() not in seen_names:
+            places.append(place)
+            seen_names.add(place["place"].lower())
+    if not places:
+        return None
+    completion_json = json.dumps({"places": places}, ensure_ascii=False)
+    return {
+        "messages": [
+            {"role": "system",    "content": _PLACE_SYSTEM},
+            {"role": "user",      "content": sample["question"]},
+            {"role": "assistant", "content": completion_json},
+        ],
+        "metadata": sample.get("metadata", {}),
+    }
+# ---------------------------------------------------------------------------
+# I/O helpers
+# ---------------------------------------------------------------------------
+def save_jsonl(records: List[Dict], path: Path) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    with open(path, "w", encoding="utf-8") as f:  # ensure_ascii=False needs a UTF-8 file
+        for r in records:
+            f.write(json.dumps(r, ensure_ascii=False) + "\n")
+def split_stats(samples: List[Dict]) -> Dict[str, int]:
+    counts: Dict[str, int] = defaultdict(int)
+    for s in samples:
+        counts[s.get("metadata", {}).get("task_family", "unknown")] += 1
+    return dict(sorted(counts.items()))
+# ---------------------------------------------------------------------------
+# Main
+# ---------------------------------------------------------------------------
+def main(config_path: Optional[Path] = None) -> None:
+    script_dir = Path(__file__).parent
+    dataset_dir = script_dir.parent
+    output_dir = dataset_dir / "output"
+    run_name = load_run_name(config_path or dataset_dir / "config.yaml")
+    validated_file = output_dir / "dataset_validated.jsonl"
+    if not validated_file.exists():
+        print(f"Error: {validated_file} not found. Run validate first.")
+        sys.exit(1)
+    run_dir = output_dir / "runs" / run_name
+    sql_dir = run_dir / "sql"
+    places_dir = run_dir / "places"
+    print(f"Run name   : {run_name}")
+    print(f"Output dir : {run_dir}")
+    # Load
+    print("\nLoading validated samples...")
+    samples = load_samples(validated_file)
+    print(f"  {len(samples):,} samples loaded")
+    # Split once, reuse for both tasks
+    print("\nSplitting 80 / 10 / 10 (stratified by task family)...")
+    train_raw, val_raw, test_raw = stratified_split(samples)
+    print(f"  train={len(train_raw):,}  val={len(val_raw):,}  test={len(test_raw):,}")
+    # --- SQL generation ---
+    print("\nBuilding SQL generation splits...")
+    sql_stats: Dict = {}
+    for split_name, raw in [("train", train_raw), ("val", val_raw), ("test", test_raw)]:
+        pairs = [p for s in raw if (p := sample_to_sql_pair(s)) is not None]
+        save_jsonl(pairs, sql_dir / f"{split_name}.jsonl")
+        sql_stats[split_name] = {"total": len(pairs), "by_family": split_stats(pairs)}
+        print(f"  sql/{split_name}.jsonl  — {len(pairs):,} pairs")
+    # --- Place extraction ---
+    print("\nBuilding place extraction splits...")
+    place_stats: Dict = {}
+    for split_name, raw in [("train", train_raw), ("val", val_raw), ("test", test_raw)]:
+        pairs = [p for s in raw if (p := sample_to_place_pair(s)) is not None]
+        save_jsonl(pairs, places_dir / f"{split_name}.jsonl")
+        place_stats[split_name] = {"total": len(pairs), "by_family": split_stats(pairs)}
+        print(f"  places/{split_name}.jsonl  — {len(pairs):,} pairs")
+    # --- Stats ---
+    stats = {
+        "run_name": run_name,
+        "total_samples": len(samples),
+        "sql_generation": sql_stats,
+        "place_extraction": place_stats,
+    }
+    stats_path = run_dir / "stats.json"
+    with open(stats_path, "w") as f:
+        json.dump(stats, f, indent=2)
+    print(f"\nStats written to {stats_path}")
+    print("\nDone. Training-ready files:")
+    print(f"  SQL generation  : {sql_dir}/{{train,val,test}}.jsonl")
+    print(f"  Place extraction: {places_dir}/{{train,val,test}}.jsonl")
+if __name__ == "__main__":
+    main()
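
The place-extraction export above can be illustrated with a minimal sketch. It assumes the sample layout that `sample_to_place_pair` reads (`question`, `candidates`, `target.selected_candidates`). The `candidate_to_place` helper is a simplified stand-in for `_candidate_to_place`, and the sample values are invented for illustration rather than taken from the dataset.

```python
import json

# Subtypes copied from _DIVISION_SUBTYPES in the export script.
DIVISION_SUBTYPES = {
    "country", "region", "dependency", "county", "localadmin",
    "locality", "macrohood", "neighborhood", "microhood",
}


def candidate_to_place(c):
    """Simplified stand-in for _candidate_to_place (illustrative only)."""
    name = (c.get("name") or "").strip()
    if not name:
        return None
    place = {"place": name}
    # Only Overture division subtypes are kept; other subtypes are dropped.
    if c.get("subtype") in DIVISION_SUBTYPES:
        place["subtype"] = c["subtype"]
    country = c.get("country") or ""
    if len(country) == 2:
        place["country"] = country
    return place


# Hypothetical sample: names and values are made up for this example.
sample = {
    "question": "Which part of Portugal lies within 10 km of the Tagus?",
    "candidates": [
        {"candidate_id": "c1", "name": "Portugal", "subtype": "country", "country": "PT"},
        {"candidate_id": "c2", "name": "Portugal Cove", "subtype": "locality", "country": "CA"},
        {"candidate_id": "c3", "name": "Tagus", "subtype": "river", "country": None},
    ],
    "target": {"selected_candidates": ["c1", "c3"]},
}

by_id = {c["candidate_id"]: c for c in sample["candidates"]}
places = [candidate_to_place(by_id[cid]) for cid in sample["target"]["selected_candidates"]]
completion = json.dumps({"places": places}, ensure_ascii=False)
print(completion)
```

In this sketch the unselected distractor `c2` never reaches the completion. The river keeps only its name, because its subtype is not a division subtype and it has no two-letter country code.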

dataset/scripts/generate_samples.py ADDED

	@@ -0,0 +1,1560 @@

+"""
+Generate synthetic training samples for the text-to-SQL task.
+This script:
+1. Loads relation tables and entity inventories
+2. For each SQL template, samples valid anchors
+3. Renders and executes SQL to verify it works
+4. Builds candidate lists with controlled distractors
+5. Generates natural language questions using an LLM
+6. Saves complete training samples
+Output:
+- output/samples/sample_*.json (individual samples)
+- output/dataset_raw.jsonl (all samples)
+"""
+import json
+import random
+import warnings
+from pathlib import Path
+from typing import List, Dict, Any, Optional
+from concurrent.futures import ProcessPoolExecutor, as_completed
+from functools import partial
+import duckdb
+import pandas as pd
+from pydantic import BaseModel
+# Suppress warnings
+warnings.filterwarnings('ignore')
+from gazet.config import DIVISIONS_AREA_PATH, NATURAL_EARTH_PATH
+# Fixed paths embedded in every training SQL string.
+# The model learns these short, stable strings rather than machine-specific
+# local paths.  At inference, sql.py's _rewrite_data_paths substitutes them
+# with the actual runtime paths from gazet.config.
+_DIVISIONS_SQL_PATH = 'divisions_area'
+_NATURAL_EARTH_SQL_PATH = 'natural_earth'
+def _for_execution(sql: str) -> str:
+    """Replace symbolic placeholder paths with actual local paths for verification."""
+    return (
+        sql
+        .replace("read_parquet('divisions_area')", f"read_parquet('{DIVISIONS_AREA_PATH}')")
+        .replace("read_parquet('natural_earth')",  f"read_parquet('{NATURAL_EARTH_PATH}')")
+    )
+# Configurable parameters (can be overridden by CLI)
+TARGET_COUNTS = None  # Will be set in main() or by CLI
+MAX_WORKERS = 8
+RETRY_MULTIPLIER = 2
+APPEND_MODE = False
+# Import templates from same directory
+from . import sql_templates
+TEMPLATES = sql_templates.TEMPLATES
+SQLTemplate = sql_templates.SQLTemplate
+get_templates_by_family = sql_templates.get_templates_by_family
+class Candidate(BaseModel):
+    """Candidate entity for grounding."""
+    candidate_id: str
+    source: str
+    id: str
+    name: str
+    subtype: Optional[str] = None
+    country: Optional[str] = None
+    region: Optional[str] = None
+    admin_level: Optional[int] = None
+    similarity: float = 0.0
+class TrainingSample(BaseModel):
+    """Complete training sample."""
+    id: str
+    question: str
+    candidates: List[Candidate]
+    target: Dict[str, Any]
+    metadata: Dict[str, Any]
+def load_relation_tables(intermediate_dir: Path, quiet: bool = False) -> Dict[str, pd.DataFrame]:
+    """Load all precomputed relation tables."""
+    tables = {}
+    for file in intermediate_dir.glob("*.parquet"):
+        name = file.stem
+        tables[name] = pd.read_parquet(file)
+        if not quiet:
+            print(f"  {name}: {len(tables[name])} rows")
+    return tables
+def sample_adjacency_anchor(adjacency_df: pd.DataFrame) -> Optional[Dict[str, Any]]:
+    """Sample a random adjacency pair."""
+    if adjacency_df.empty:
+        return None
+    row = adjacency_df.sample(n=1).iloc[0]
+    return {
+        'anchor_id': row['anchor_id'],
+        'anchor_name': row['anchor_name'],
+        'anchor_subtype': row['anchor_subtype'],
+        'anchor_country': row.get('anchor_country'),  # May not exist in all tables
+        'target_subtype': row.get('target_subtype')
+    }
+def sample_intersection_anchor(intersection_df: pd.DataFrame) -> Optional[Dict[str, Any]]:
+    """Sample a random intersection pair."""
+    if intersection_df.empty:
+        return None
+    row = intersection_df.sample(n=1).iloc[0]
+    return {
+        'anchor_id': row['anchor_id'],
+        'anchor_name': row['anchor_name'],
+        'anchor_subtype': row['anchor_subtype'],
+        'target_id': row.get('target_id'),
+        'target_name': row.get('target_name'),
+        'target_subtype': row.get('target_subtype')
+    }
+def sample_containment_anchor(containment_df: pd.DataFrame) -> Optional[Dict[str, Any]]:
+    """Sample a random containment pair."""
+    if containment_df.empty:
+        return None
+    row = containment_df.sample(n=1).iloc[0]
+    return {
+        'container_id': row['container_id'],
+        'container_name': row['container_name'],
+        'container_subtype': row['container_subtype'],
+        'contained_subtype': row['contained_subtype']
+    }
+def sample_cross_source_anchor(cross_source_df: pd.DataFrame) -> Optional[Dict[str, Any]]:
+    """Sample a random cross-source relation."""
+    if cross_source_df.empty:
+        return None
+    row = cross_source_df.sample(n=1).iloc[0]
+    return {
+        'division_id': row['division_id'],
+        'division_name': row['division_name'],
+        'division_subtype': row['division_subtype'],
+        'natural_id': row['natural_id'],
+        'natural_name': row['natural_name'],
+        'natural_subtype': row['natural_subtype'],
+        'relation_type': row['relation_type']
+    }
+def _merge_candidate_lists(
+    *lists: List[Candidate],
+    max_total: int = 10,
+) -> List[Candidate]:
+    """Merge N candidate lists, deduplicate by id, reassign candidate_ids.
+    Interleaves the lists so each anchor is represented before any anchor
+    gets a second candidate — matching the grouped-then-interleaved order
+    that inference produces.
+    """
+    from itertools import zip_longest
+    seen: set = set()
+    merged: List[Candidate] = []
+    for row in zip_longest(*lists):
+        for c in row:
+            if c is None:
+                continue
+            if c.id not in seen:
+                merged.append(c)
+                seen.add(c.id)
+            if len(merged) >= max_total:
+                break
+        if len(merged) >= max_total:
+            break
+    for i, c in enumerate(merged, 1):
+        c.candidate_id = f"c{i}"
+    return merged
+def build_candidate_list(
+    con: duckdb.DuckDBPyConnection,
+    anchor_id: str,
+    anchor_name: str,
+    anchor_source: str,
+    num_candidates: int = 10,
+    difficulty: str = "medium"
+) -> List[Candidate]:
+    """Build candidate list with true anchor + distractors."""
+    # Helper to convert pandas NA to None
+    def safe_get(row, key, default=None):
+        val = row.get(key, default)
+        return None if pd.isna(val) else val
+    # Get the true anchor
+    if anchor_source == "divisions_area":
+        query = """
+        SELECT
+            id,
+            names."primary" AS name,
+            subtype,
+            country,
+            region,
+            admin_level
+        FROM read_parquet(?)
+        WHERE id = ?
+        """
+        anchor_row = con.execute(query, [DIVISIONS_AREA_PATH, anchor_id]).fetchdf().iloc[0]
+    else:
+        query = """
+        SELECT
+            id,
+            names."primary" AS name,
+            subtype
+        FROM read_parquet(?)
+        WHERE id = ?
+        """
+        anchor_row = con.execute(query, [NATURAL_EARTH_PATH, anchor_id]).fetchdf().iloc[0]
+    # Build true candidate
+    true_candidate = Candidate(
+        candidate_id="c1",
+        source=anchor_source,
+        id=anchor_id,
+        name=safe_get(anchor_row, 'name'),
+        subtype=safe_get(anchor_row, 'subtype'),
+        country=safe_get(anchor_row, 'country'),
+        region=safe_get(anchor_row, 'region'),
+        admin_level=safe_get(anchor_row, 'admin_level'),
+        similarity=1.0
+    )
+    # Build distractors (difficulty is forwarded, but build_distractors does not use it yet)
+    distractors = build_distractors(
+        con,
+        anchor_name,
+        anchor_source,
+        anchor_id,
+        num_candidates - 1,
+        difficulty
+    )
+    # Order: true anchor first, then same-source distractors, then cross-source
+    # distractors. This mirrors inference order (anchor at top by similarity,
+    # same source grouped before the other source).
+    candidates = [true_candidate] + distractors
+    # Reassign candidate IDs in order
+    for i, cand in enumerate(candidates, 1):
+        cand.candidate_id = f"c{i}"
+    return candidates
+def build_distractors(
+    con: duckdb.DuckDBPyConnection,
+    anchor_name: str,
+    anchor_source: str,
+    exclude_id: str,
+    num_distractors: int,
+    difficulty: str,
+    cross_source_ratio: float = 0.5,
+) -> List[Candidate]:
+    """Build distractor candidates using fuzzy search.
+    Always includes candidates from both sources so the model sees mixed
+    ``source`` values in every training example — matching the inference
+    behaviour where search.py queries divisions_area AND natural_earth equally
+    (5 results each per place).
+    Args:
+        cross_source_ratio: Fraction of distractors drawn from the *other*
+            source.  Defaults to 0.5 (50/50 split) to match inference exactly.
+    """
+    def safe_get(row, key, default=None):
+        val = row.get(key, default)
+        return None if pd.isna(val) else val
+    def _query_source(path: str, src_name: str, n: int, excl_id: str) -> List[Candidate]:
+        query = """
+        SELECT
+            id,
+            names."primary" AS name,
+            subtype,
+            country,
+            region,
+            admin_level,
+            jaro_winkler_similarity(lower(names."primary"), lower(?)) AS similarity
+        FROM read_parquet(?)
+        WHERE id != ?
+          AND names."primary" IS NOT NULL
+        ORDER BY similarity DESC
+        LIMIT ?
+        """
+        df = con.execute(query, [anchor_name, path, excl_id, n]).fetchdf()
+        results = []
+        for _, row in df.iterrows():
+            results.append(Candidate(
+                candidate_id="temp",
+                source=src_name,
+                id=row["id"],
+                name=safe_get(row, "name"),
+                subtype=safe_get(row, "subtype"),
+                country=safe_get(row, "country"),
+                region=safe_get(row, "region"),
+                admin_level=safe_get(row, "admin_level"),
+                similarity=float(row["similarity"]),
+            ))
+        return results
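+    if num_distractors <= 0:
+        return []
+    # Clamp cross_n so same_n can never go negative (e.g. cross_source_ratio > 1).
+    cross_n = min(num_distractors, max(1, round(num_distractors * cross_source_ratio)))
+    same_n = num_distractors - cross_n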
+    if anchor_source == "divisions_area":
+        same = _query_source(DIVISIONS_AREA_PATH, "divisions_area", same_n, exclude_id)
+        cross = _query_source(NATURAL_EARTH_PATH, "natural_earth", cross_n, "")
+    else:
+        same = _query_source(NATURAL_EARTH_PATH, "natural_earth", same_n, exclude_id)
+        cross = _query_source(DIVISIONS_AREA_PATH, "divisions_area", cross_n, "")
+    return same + cross
+def generate_adjacency_sample(
+    con: duckdb.DuckDBPyConnection,
+    adjacency_df: pd.DataFrame,
+    sample_id: str
+) -> Optional[TrainingSample]:
+    """Generate a sample for adjacency task."""
+    anchor = sample_adjacency_anchor(adjacency_df)
+    if not anchor:
+        return None
+    # Build SQL
+    sql = f"""WITH a AS (
+  SELECT geometry FROM read_parquet('divisions_area')
+  WHERE id = '{anchor['anchor_id']}'
+)
+SELECT b.id, b.names."primary" AS name, b.geometry
+FROM read_parquet('divisions_area') AS b, a
+WHERE b.id != '{anchor['anchor_id']}'
+  AND b.subtype = '{anchor['target_subtype']}'
+  AND ST_Touches(a.geometry, b.geometry)"""
+    # Execute to verify
+    try:
+        result = con.execute(_for_execution(sql)).fetchdf()
+        if result.empty:
+            return None
+    except Exception as e:
+        print(f"SQL execution failed: {e}")
+        return None
+    # Build candidates
+    candidates = build_candidate_list(
+        con,
+        anchor['anchor_id'],
+        anchor['anchor_name'],
+        "divisions_area",
+        num_candidates=10,
+        difficulty="medium"
+    )
+    # Find which candidate is the true anchor
+    selected_candidate_ids = [c.candidate_id for c in candidates if c.id == anchor['anchor_id']]
+    # Generate question
+    question = f"Which {anchor['target_subtype']}s border {anchor['anchor_name']}?"
+    return TrainingSample(
+        id=sample_id,
+        question=question,
+        candidates=candidates,
+        target={
+            "selected_candidates": selected_candidate_ids,
+            "sql": sql
+        },
+        metadata={
+            "task_family": "adjacency",
+            "sql_difficulty": "medium",
+            "grounding_difficulty": "medium",
+            "template_id": "adj_02",
+            "num_candidates": len(candidates),
+            "anchor_source": "divisions_area",
+            "sql_verified": True
+        }
+    )
+def generate_containment_sample(
+    con: duckdb.DuckDBPyConnection,
+    containment_df: pd.DataFrame,
+    sample_id: str
+) -> Optional[TrainingSample]:
+    """Generate a sample for containment task."""
+    anchor = sample_containment_anchor(containment_df)
+    if not anchor:
+        return None
+    # Build SQL
+    sql = f"""WITH a AS (
+  SELECT geometry FROM read_parquet('divisions_area')
+  WHERE id = '{anchor['container_id']}'
+)
+SELECT b.id, b.names."primary" AS name, b.geometry
+FROM read_parquet('divisions_area') AS b, a
+WHERE b.id != '{anchor['container_id']}'
+  AND b.subtype = '{anchor['contained_subtype']}'
+  AND ST_Within(b.geometry, a.geometry)"""
+    # Execute to verify
+    try:
+        result = con.execute(_for_execution(sql)).fetchdf()
+        if result.empty:
+            return None
+    except Exception as e:
+        print(f"SQL execution failed: {e}")
+        return None
+    # Build candidates
+    candidates = build_candidate_list(
+        con,
+        anchor['container_id'],
+        anchor['container_name'],
+        "divisions_area",
+        num_candidates=10,
+        difficulty="medium"
+    )
+    # Find which candidate is the true anchor
+    selected_candidate_ids = [c.candidate_id for c in candidates if c.id == anchor['container_id']]
+    # Generate question
+    question = f"What {anchor['contained_subtype']}s are in {anchor['container_name']}?"
+    return TrainingSample(
+        id=sample_id,
+        question=question,
+        candidates=candidates,
+        target={
+            "selected_candidates": selected_candidate_ids,
+            "sql": sql
+        },
+        metadata={
+            "task_family": "containment",
+            "sql_difficulty": "medium",
+            "grounding_difficulty": "medium",
+            "template_id": "contain_01",
+            "num_candidates": len(candidates),
+            "anchor_source": "divisions_area",
+            "sql_verified": True
+        }
+    )
+def sample_random_entity(
+    con: duckdb.DuckDBPyConnection,
+    inventory_df: pd.DataFrame,
+    source: str
+) -> Optional[Dict[str, Any]]:
+    """Sample a random entity from inventory."""
+    if inventory_df.empty:
+        return None
+    row = inventory_df.sample(n=1).iloc[0]
+    return {
+        'id': row['id'],
+        'name': row['name'],
+        'subtype': row.get('subtype'),
+        'country': row.get('country'),
+        'source': source
+    }
+def generate_template_based_sample(
+    con: duckdb.DuckDBPyConnection,
+    template: SQLTemplate,
+    tables: Dict[str, pd.DataFrame],
+    sample_id: str
+) -> Optional[TrainingSample]:
+    """Generate a sample based on a SQL template."""
+    # Sample anchor based on template requirements
+    if template.family == "direct_lookup":
+        # Just pick a random entity
+        if template.anchor_source == "divisions_area":
+            anchor = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
+        else:
+            anchor = sample_random_entity(con, tables['natural_earth_inventory'], 'natural_earth')
+        if not anchor:
+            return None
+        # Render SQL
+        sql = template.sql_template.format(
+            anchor_id=anchor['id']
+        )
+        # Build candidates
+        candidates = build_candidate_list(
+            con, anchor['id'], anchor['name'], anchor['source'],
+            num_candidates=10, difficulty="easy"
+        )
+        # Question
+        question = random.choice(template.question_hints).format(anchor_name=anchor['name'])
+    elif template.family == "adjacency":
+        anchor = sample_adjacency_anchor(tables['adjacency_pairs'])
+        if not anchor:
+            return None
+        sql = template.sql_template.format(
+            anchor_id=anchor['anchor_id'],
+            target_subtype=anchor['target_subtype']
+        )
+        candidates = build_candidate_list(
+            con, anchor['anchor_id'], anchor['anchor_name'], 'divisions_area',
+            num_candidates=10, difficulty="medium"
+        )
+        question = random.choice(template.question_hints).format(
+            anchor_name=anchor['anchor_name'],
+            target_subtype=anchor['target_subtype']
+        )
+    elif template.family == "containment":
+        anchor = sample_containment_anchor(tables['containment_pairs'])
+        if not anchor:
+            return None
+        sql = template.sql_template.format(
+            anchor_id=anchor['container_id'],
+            target_subtype=anchor['contained_subtype']
+        )
+        candidates = build_candidate_list(
+            con, anchor['container_id'], anchor['container_name'], 'divisions_area',
+            num_candidates=10, difficulty="medium"
+        )
+        question = random.choice(template.question_hints).format(
+            anchor_name=anchor['container_name'],
+            target_subtype=anchor['contained_subtype']
+        )
+    elif template.family == "intersection":
+        if template.anchor_source == "natural_earth":
+            anchor = sample_cross_source_anchor(tables['cross_source_relations'])
+            if not anchor:
+                return None
+            sql = template.sql_template.format(
+                anchor_id=anchor['natural_id'],
+                target_subtype='country'
+            )
+            candidates = build_candidate_list(
+                con, anchor['natural_id'], anchor['natural_name'], 'natural_earth',
+                num_candidates=10, difficulty="medium"
+            )
+            question = random.choice(template.question_hints).format(
+                anchor_name=anchor['natural_name'],
+                target_subtype='country'
+            )
+        else:
+            # Same-source intersection
+            anchor = sample_intersection_anchor(tables['intersection_pairs'])
+            if not anchor:
+                return None
+            # Use a generic subtype if not available
+            target_subtype = anchor.get('target_subtype') or 'region'
+            sql = template.sql_template.format(
+                anchor_id=anchor['anchor_id'],
+                target_subtype=target_subtype
+            )
+            candidates = build_candidate_list(
+                con, anchor['anchor_id'], anchor['anchor_name'], 'divisions_area',
+                num_candidates=10, difficulty="medium"
+            )
+            question = random.choice(template.question_hints).format(
+                anchor_name=anchor['anchor_name'],
+                target_subtype=target_subtype
+            )
+    elif template.family == "set_operations":
+        if template.template_id == "union_03":
+            # 3-anchor union by ID — candidates: 3 per anchor (9 total)
+            anchors = [
+                sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
+                for _ in range(3)
+            ]
+            if any(a is None for a in anchors):
+                return None
+            anchor1, anchor2, anchor3 = anchors
+            sql = template.sql_template.format(
+                anchor_id_1=anchor1['id'],
+                anchor_id_2=anchor2['id'],
+                anchor_id_3=anchor3['id'],
+            )
+            per_anchor = 3
+            cands = [
+                build_candidate_list(con, a['id'], a['name'], 'divisions_area',
+                                     num_candidates=per_anchor, difficulty="medium")
+                for a in anchors
+            ]
+            candidates = _merge_candidate_lists(*cands, max_total=9)
+            question = random.choice(template.question_hints).format(
+                anchor_1_name=anchor1['name'],
+                anchor_2_name=anchor2['name'],
+                anchor_3_name=anchor3['name'],
+            )
+        elif template.template_id in ("contain_multi_01", "contain_multi_02"):
+            # country IN clause — 2 or 3 anchors, each contributes its country code
+            num_a = 3 if template.template_id == "contain_multi_02" else 2
+            anchors = [
+                sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
+                for _ in range(num_a)
+            ]
+            if any(a is None for a in anchors):
+                return None
+            countries = [a.get('country') or 'US' for a in anchors]
+            target_subtype = random.choice(['region', 'locality'])
+            per_anchor = 3 if num_a == 3 else 4
+            fmt_kwargs = dict(
+                target_subtype=target_subtype,
+            )
+            for i, c in enumerate(countries, 1):
+                fmt_kwargs[f'country_{i}'] = c
+            sql = template.sql_template.format(**fmt_kwargs)
+            cands = [
+                build_candidate_list(con, a['id'], a['name'], 'divisions_area',
+                                     num_candidates=per_anchor, difficulty="medium")
+                for a in anchors
+            ]
+            candidates = _merge_candidate_lists(*cands, max_total=num_a * per_anchor)
+            q_kwargs = dict(target_subtype=target_subtype)
+            for i, a in enumerate(anchors, 1):
+                q_kwargs[f'anchor_{i}_name'] = a['name']
+            question = random.choice(template.question_hints).format(**q_kwargs)
+        elif template.template_id == "union_02":
+            # Filtered union: ST_Union_Agg of contained sub-features
+            pair = sample_containment_anchor(tables['containment_pairs'])
+            if not pair:
+                return None
+            target_subtype = pair.get('contained_subtype', 'locality')
+            sql = template.sql_template.format(
+                anchor_id=pair['container_id'],
+                target_subtype=target_subtype,
+            )
+            candidates = build_candidate_list(
+                con, pair['container_id'], pair['container_name'], 'divisions_area',
+                num_candidates=10, difficulty="medium"
+            )
+            question = random.choice(template.question_hints).format(
+                anchor_name=pair['container_name'],
+                target_subtype=target_subtype,
+            )
+        else:
+            # union_01: 2-anchor union by ID — candidates: 5 per anchor
+            anchor1 = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
+            anchor2 = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
+            if not anchor1 or not anchor2:
+                return None
+            sql = template.sql_template.format(
+                anchor_id_1=anchor1['id'],
+                anchor_id_2=anchor2['id'],
+            )
+            cands1 = build_candidate_list(
+                con, anchor1['id'], anchor1['name'], 'divisions_area',
+                num_candidates=5, difficulty="medium"
+            )
+            cands2 = build_candidate_list(
+                con, anchor2['id'], anchor2['name'], 'divisions_area',
+                num_candidates=5, difficulty="medium"
+            )
+            candidates = _merge_candidate_lists(cands1, cands2, max_total=10)
+            question = random.choice(template.question_hints).format(
+                anchor_1_name=anchor1['name'],
+                anchor_2_name=anchor2['name'],
+            )
+    elif template.family == "buffer":
+        # Buffer operations
+        # Kilometre distances used by buffer_01 and buffer_03 templates.
+        # Metre distances used by buffer_02 and buffer_04 templates.
+        # The template SQL divides by 111 320 to convert to degrees.
+        _buffer_km_choices = [1, 2, 5, 10, 25, 50, 100, 200]
+        _buffer_m_choices = [100, 250, 500, 1000, 2000, 5000]
+        if template.num_anchors == 1:
+            if template.anchor_source == "natural_earth":
+                anchor = sample_random_entity(con, tables['natural_earth_inventory'], 'natural_earth')
+            else:
+                anchor = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
+            if not anchor:
+                return None
+            # Choose unit based on which placeholder the template uses.
+            uses_km = "{buffer_km}" in template.sql_template
+            if uses_km:
+                buffer_val = random.choice(_buffer_km_choices)
+                fmt_kwargs = dict(
+                    anchor_id=anchor['id'],
+                    buffer_km=buffer_val,
+                )
+                q_kwargs = dict(anchor_name=anchor['name'], buffer_km=buffer_val)
+            else:
+                buffer_val = random.choice(_buffer_m_choices)
+                fmt_kwargs = dict(
+                    anchor_id=anchor['id'],
+                    buffer_m=buffer_val,
+                )
+                q_kwargs = dict(anchor_name=anchor['name'], buffer_m=buffer_val)
+            sql = template.sql_template.format(**fmt_kwargs)
+            candidates = build_candidate_list(
+                con, anchor['id'], anchor['name'], anchor['source'],
+                num_candidates=10, difficulty="medium"
+            )
+            question = random.choice(template.question_hints).format(**q_kwargs)
+        else:
+            # Two anchor buffer (union / set-op style) — kept for completeness.
+            anchor1 = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
+            anchor2 = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
+            if not anchor1 or not anchor2:
+                return None
+            buffer_val = random.choice(_buffer_km_choices[:4])  # smaller range for two-anchor
+            sql = template.sql_template.format(
+                anchor_id_1=anchor1['id'],
+                anchor_id_2=anchor2['id'],
+                buffer_km=buffer_val,
+            )
+            candidates1 = build_candidate_list(
+                con, anchor1['id'], anchor1['name'], 'divisions_area',
+                num_candidates=5, difficulty="medium"
+            )
+            candidates2 = build_candidate_list(
+                con, anchor2['id'], anchor2['name'], 'divisions_area',
+                num_candidates=5, difficulty="medium"
+            )
+            candidates = candidates1 + candidates2
+            seen_ids = set()
+            unique_candidates = []
+            for c in candidates:
+                if c.id not in seen_ids:
+                    unique_candidates.append(c)
+                    seen_ids.add(c.id)
+            candidates = unique_candidates[:10]
+            for i, c in enumerate(candidates, 1):
+                c.candidate_id = f"c{i}"
+            question = random.choice(template.question_hints).format(
+                anchor_1_name=anchor1['name'],
+                anchor_2_name=anchor2['name'],
+                buffer_km=buffer_val,
+            )
+    elif template.family == "partial_selection":
+        # Partial selection (northern half, clipping, etc.)
+        anchor = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
+        if not anchor:
+            return None
+        if template.num_anchors == 1:
+            sql = template.sql_template.format(
+                anchor_id=anchor['id'],
+            )
+            question = random.choice(template.question_hints).format(
+                anchor_name=anchor['name'],
+            )
+            candidates = build_candidate_list(
+                con, anchor['id'], anchor['name'], 'divisions_area',
+                num_candidates=10, difficulty="hard",
+            )
+        else:
+            # Mixed-source clip: division intersected with a natural_earth feature.
+            # Use cross_source_relations so the pair is guaranteed to intersect —
+            # random sampling almost never produces an intersecting pair.
+            cs_df = tables.get('cross_source_relations', pd.DataFrame())
+            if cs_df.empty:
+                return None
+            row = cs_df.sample(n=1).iloc[0]
+            clip_feature = {
+                'id':   row['natural_id'],
+                'name': row['natural_name'],
+                'source': 'natural_earth',
+            }
+            # Override the division anchor with the paired division so the
+            # ST_Intersects check in the SQL is guaranteed to pass.
+            anchor = {
+                'id':   row['division_id'],
+                'name': row['division_name'],
+                'source': 'divisions_area',
+            }
+            sql = template.sql_template.format(
+                anchor_id=anchor['id'],
+                clip_feature_id=clip_feature['id'],
+            )
+            question = random.choice(template.question_hints).format(
+                anchor_name=anchor['name'],
+                clip_feature_name=clip_feature['name'],
+            )
+            # Build candidates for BOTH anchors so the model sees both IDs
+            # in context and learns to pick the right one for each placeholder.
+            div_cands = build_candidate_list(
+                con, anchor['id'], anchor['name'], 'divisions_area',
+                num_candidates=5, difficulty="hard",
+            )
+            ne_cands = build_candidate_list(
+                con, clip_feature['id'], clip_feature['name'], 'natural_earth',
+                num_candidates=5, difficulty="hard",
+            )
+            candidates = _merge_candidate_lists(div_cands, ne_cands, max_total=10)
+    elif template.family == "aggregation":
+        top_n = random.choice([3, 5, 10])
+        target_subtype = random.choice(['locality', 'region'])
+        if template.template_id in ['agg_03', 'agg_04']:
+            # Country-level aggregation: SQL uses country code, not anchor id.
+            anchor = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
+            if not anchor:
+                return None
+            country = anchor.get('country') or 'US'
+            sql = template.sql_template.format(
+                country=country,
+                target_subtype=target_subtype,
+                top_n=top_n,
+            )
+            candidates = build_candidate_list(
+                con, anchor['id'], anchor['name'], 'divisions_area',
+                num_candidates=10, difficulty="hard"
+            )
+            question = random.choice(template.question_hints).format(
+                top_n=top_n,
+                target_subtype=target_subtype,
+                anchor_name=anchor['name'],
+            )
+        else:
+            # Containment-based aggregation: anchor is the container region.
+            anchor = sample_containment_anchor(tables['containment_pairs'])
+            if not anchor:
+                return None
+            sql = template.sql_template.format(
+                anchor_id=anchor['container_id'],
+                target_subtype=target_subtype,
+                top_n=top_n,
+            )
+            candidates = build_candidate_list(
+                con, anchor['container_id'], anchor['container_name'], 'divisions_area',
+                num_candidates=10, difficulty="hard"
+            )
+            question = random.choice(template.question_hints).format(
+                top_n=top_n,
+                target_subtype=target_subtype,
+                anchor_name=anchor['container_name'],
+            )
+    elif template.family == "chained":
+        # Use pre-filtered coastal/landlocked containment pairs so the SQL
+        # verification step doesn't constantly return empty results.
+        if template.template_id == "chained_01":
+            table_key = 'coastal_containment_pairs'
+        elif template.template_id == "chained_02":
+            table_key = 'landlocked_containment_pairs'
+        else:
+            table_key = 'containment_pairs'
+        anchor = sample_containment_anchor(tables.get(table_key, tables['containment_pairs']))
+        if not anchor:
+            return None
+        target_subtype = anchor.get('contained_subtype', 'locality')
+        sql = template.sql_template.format(
+            anchor_id=anchor['container_id'],
+            target_subtype=target_subtype,
+        )
+        candidates = build_candidate_list(
+            con, anchor['container_id'], anchor['container_name'], 'divisions_area',
+            num_candidates=10, difficulty="hard"
+        )
+        question = random.choice(template.question_hints).format(
+            anchor_name=anchor['container_name'],
+            target_subtype=target_subtype,
+        )
+    elif template.family == "multi_adjacency":
+        # Use common_neighbor_pairs so anchor1 and anchor2 are guaranteed to
+        # share at least one touching neighbour — SQL will return non-empty.
+        cn_df = tables.get('common_neighbor_pairs', pd.DataFrame())
+        if cn_df.empty:
+            return None
+        row = cn_df.sample(n=1).iloc[0]
+        anchor1 = {'id': row['anchor_id_1'], 'name': row['anchor_name_1'], 'source': 'divisions_area'}
+        anchor2 = {'id': row['anchor_id_2'], 'name': row['anchor_name_2'], 'source': 'divisions_area'}
+        sql = template.sql_template.format(
+            anchor_id_1=anchor1['id'],
+            anchor_id_2=anchor2['id'],
+        )
+        candidates1 = build_candidate_list(
+            con, anchor1['id'], anchor1['name'], 'divisions_area',
+            num_candidates=5, difficulty="medium"
+        )
+        candidates2 = build_candidate_list(
+            con, anchor2['id'], anchor2['name'], 'divisions_area',
+            num_candidates=5, difficulty="medium"
+        )
+        candidates = _merge_candidate_lists(candidates1, candidates2)
+        question = random.choice(template.question_hints).format(
+            anchor_1_name=anchor1['name'],
+            anchor_2_name=anchor2['name'],
+        )
+    elif template.family == "difference":
+        if template.anchor_source == "mixed":
+            # divisions_area anchor differenced against a natural_earth feature.
+            # Use cross_source_relations so the pair is guaranteed to intersect
+            # (ST_Difference on non-intersecting geometries is always equal to
+            # the original geometry — a trivial and uninformative sample).
+            cs_df = tables.get('cross_source_relations', pd.DataFrame())
+            if cs_df.empty:
+                return None
+            row = cs_df.sample(n=1).iloc[0]
+            anchor = {
+                'id':   row['division_id'],
+                'name': row['division_name'],
+                'source': 'divisions_area',
+            }
+            clip_feature = {
+                'id':   row['natural_id'],
+                'name': row['natural_name'],
+                'source': 'natural_earth',
+            }
+            sql = template.sql_template.format(
+                anchor_id=anchor['id'],
+                clip_feature_id=clip_feature['id'],
+            )
+            question = random.choice(template.question_hints).format(
+                anchor_name=anchor['name'],
+                clip_feature_name=clip_feature['name'],
+            )
+            # Build candidates for BOTH anchors — model must see both IDs
+            # to correctly assign anchor_id vs clip_feature_id in the SQL.
+            div_cands = build_candidate_list(
+                con, anchor['id'], anchor['name'], 'divisions_area',
+                num_candidates=5, difficulty="hard",
+            )
+            ne_cands = build_candidate_list(
+                con, clip_feature['id'], clip_feature['name'], 'natural_earth',
+                num_candidates=5, difficulty="hard",
+            )
+            candidates = _merge_candidate_lists(div_cands, ne_cands, max_total=10)
+        else:
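+            # Two divisions_area anchors. anchor1 is the container from a
+            # containment pair; anchor2 is sampled at random below, so the two
+            # are NOT guaranteed to intersect and ST_Difference may simply
+            # return anchor1's geometry unchanged.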
+            pair = sample_containment_anchor(tables['containment_pairs'])
+            if not pair:
+                return None
+            anchor1 = {'id': pair['container_id'], 'name': pair['container_name']}
+            anchor2_row = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
+            if not anchor2_row:
+                return None
+            anchor2 = anchor2_row
+            sql = template.sql_template.format(
+                anchor_id_1=anchor1['id'],
+                anchor_id_2=anchor2['id'],
+            )
+            candidates1 = build_candidate_list(
+                con, anchor1['id'], anchor1['name'], 'divisions_area',
+                num_candidates=5, difficulty="medium"
+            )
+            candidates2 = build_candidate_list(
+                con, anchor2['id'], anchor2['name'], 'divisions_area',
+                num_candidates=5, difficulty="medium"
+            )
+            candidates = _merge_candidate_lists(candidates1, candidates2)
+            question = random.choice(template.question_hints).format(
+                anchor_1_name=anchor1['name'],
+                anchor_2_name=anchor2['name'],
+            )
+    elif template.family == "border_corridor":
+        # Buffered border zone — needs two anchors that actually touch.
+        pair = sample_adjacency_anchor(tables['adjacency_pairs'])
+        if not pair:
+            return None
+        # The adjacency table only records one direction; sample a second
+        # anchor that is known to be adjacent to the first.
+        anchor1 = {'id': pair['anchor_id'], 'name': pair['anchor_name']}
+        # Find a random neighbour of anchor1 from adjacency pairs
+        neighbours = tables['adjacency_pairs']
+        neighbours = neighbours[neighbours['anchor_id'] == anchor1['id']]
+        if neighbours.empty:
+            return None
+        nb_row = neighbours.sample(n=1).iloc[0]
+        anchor2 = {
+            'id': nb_row.get('target_id', nb_row['anchor_id']),
+            'name': nb_row.get('target_name', nb_row['anchor_name']),
+        }
+        if anchor1['id'] == anchor2['id']:
+            return None
+        buffer_val = random.choice([5, 10, 25, 50])
+        sql = template.sql_template.format(
+            anchor_id_1=anchor1['id'],
+            anchor_id_2=anchor2['id'],
+            buffer_km=buffer_val,
+        )
+        candidates1 = build_candidate_list(
+            con, anchor1['id'], anchor1['name'], 'divisions_area',
+            num_candidates=5, difficulty="medium"
+        )
+        candidates2 = build_candidate_list(
+            con, anchor2['id'], anchor2['name'], 'divisions_area',
+            num_candidates=5, difficulty="medium"
+        )
+        candidates = _merge_candidate_lists(candidates1, candidates2)
+        question = random.choice(template.question_hints).format(
+            anchor_1_name=anchor1['name'],
+            anchor_2_name=anchor2['name'],
+            buffer_km=buffer_val,
+        )
+    elif template.family == "window_function":
+        anchor = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
+        if not anchor:
+            return None
+        country = anchor.get('country') or 'US'
+        target_subtype = random.choice(['locality', 'neighborhood'])
+        sql = template.sql_template.format(
+            country=country,
+            target_subtype=target_subtype,
+        )
+        candidates = build_candidate_list(
+            con, anchor['id'], anchor['name'], 'divisions_area',
+            num_candidates=10, difficulty="hard"
+        )
+        question = random.choice(template.question_hints).format(
+            anchor_name=anchor['name'],
+            target_subtype=target_subtype,
+        )
+    elif template.family == "attribute_filter":
+        anchor = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
+        if not anchor:
+            return None
+        country = anchor.get('country') or 'US'
+        target_subtype = template.target_subtype or random.choice(['dependency', 'region', 'locality'])
+        sql = template.sql_template.format(
+            country=country,
+            target_subtype=target_subtype,
+        )
+        candidates = build_candidate_list(
+            con, anchor['id'], anchor['name'], 'divisions_area',
+            num_candidates=10, difficulty="medium"
+        )
+        question = random.choice(template.question_hints).format(
+            anchor_name=anchor['name'],
+            target_subtype=target_subtype,
+            country=country,
+        )
+    else:
+        # Skip unsupported families
+        return None
+    # Execute SQL to verify
+    try:
+        result = con.execute(_for_execution(sql)).fetchdf()
+        if result.empty:
+            return None
+    except Exception:
+        # Treat SQL failures like empty results; the batch worker records
+        # the sample as failed.
+        return None
+    # Collect every anchor ID that appears in the generated SQL so we can
+    # mark them as the "selected" candidates in the training sample.
+    _multi_anchor_families = {"set_operations", "multi_adjacency", "difference", "border_corridor"}
+    # Mixed partial_selection (partial_05) and mixed difference (diff_02) each
+    # have two anchors from different sources — both must be marked selected.
+    _is_mixed_two_anchor = (
+        template.anchor_source == "mixed" and template.num_anchors == 2
+    )
+    if template.family in _multi_anchor_families and template.num_anchors >= 2:
+        anchor_ids: set = set()
+        for var in ("anchor1", "anchor2", "anchor3"):
+            obj = locals().get(var)
+            if obj:
+                anchor_ids.add(obj.get("id", ""))
+        if "anchors" in locals():
+            for a in locals()["anchors"]:
+                if a:
+                    anchor_ids.add(a.get("id", ""))
+        selected_candidate_ids = [c.candidate_id for c in candidates if c.id in anchor_ids]
+    elif _is_mixed_two_anchor:
+        # partial_05 / diff_02: anchor (division) + clip_feature (natural_earth)
+        mixed_ids = {anchor.get("id", ""), clip_feature.get("id", "")}
+        selected_candidate_ids = [c.candidate_id for c in candidates if c.id in mixed_ids]
+    else:
+        anchor_id_to_find = (
+            anchor.get('anchor_id')
+            or anchor.get('container_id')
+            or anchor.get('natural_id')
+            or anchor.get('id')
+        )
+        selected_candidate_ids = [c.candidate_id for c in candidates if c.id == anchor_id_to_find]
+    return TrainingSample(
+        id=sample_id,
+        question=question,
+        candidates=candidates,
+        target={
+            "selected_candidates": selected_candidate_ids,
+            "sql": sql
+        },
+        metadata={
+            "task_family": template.family,
+            "sql_difficulty": template.sql_difficulty,
+            "grounding_difficulty": "medium",
+            "template_id": template.template_id,
+            "num_candidates": len(candidates),
+            "anchor_source": template.anchor_source,
+            "sql_verified": True
+        }
+    )
+def generate_cross_source_sample(
+    con: duckdb.DuckDBPyConnection,
+    cross_source_df: pd.DataFrame,
+    sample_id: str
+) -> Optional[TrainingSample]:
+    """Generate a sample for cross-source intersection task."""
+    anchor = sample_cross_source_anchor(cross_source_df)
+    if not anchor:
+        return None
+    # Build SQL (natural feature -> divisions)
+    sql = f"""WITH a AS (
+  SELECT geometry FROM read_parquet('natural_earth')
+  WHERE id = '{anchor['natural_id']}'
+)
+SELECT b.id, b.names."primary" AS name, b.geometry
+FROM read_parquet('divisions_area') AS b, a
+WHERE b.subtype = 'country'
+  AND ST_Intersects(b.geometry, a.geometry)"""
+    # Execute to verify
+    try:
+        result = con.execute(_for_execution(sql)).fetchdf()
+        if result.empty:
+            return None
+    except Exception as e:
+        print(f"SQL execution failed: {e}")
+        return None
+    # Build candidates for natural feature
+    candidates = build_candidate_list(
+        con,
+        anchor['natural_id'],
+        anchor['natural_name'],
+        "natural_earth",
+        num_candidates=10,
+        difficulty="medium"
+    )
+    # Find which candidate is the true anchor
+    selected_candidate_ids = [c.candidate_id for c in candidates if c.id == anchor['natural_id']]
+    # Generate question
+    question = f"Which countries intersect the {anchor['natural_name']}?"
+    return TrainingSample(
+        id=sample_id,
+        question=question,
+        candidates=candidates,
+        target={
+            "selected_candidates": selected_candidate_ids,
+            "sql": sql
+        },
+        metadata={
+            "task_family": "intersection",
+            "sql_difficulty": "medium-hard",
+            "grounding_difficulty": "medium",
+            "template_id": "intersect_02",
+            "num_candidates": len(candidates),
+            "anchor_source": "natural_earth",
+            "sql_verified": True
+        }
+    )
+def generate_sample_batch_worker(args):
+    """Worker function that processes a batch of work items with a single DuckDB connection.
+    Initializes DuckDB, spatial extension, templates module, and relation tables
+    ONCE per batch, then processes all items sequentially.
+    """
+    from pathlib import Path
+    work_items, intermediate_dir_str = args
+    # Convert string back to Path
+    intermediate_dir = Path(intermediate_dir_str)
+    # Initialize DuckDB ONCE for the entire batch
+    con = duckdb.connect()
+    con.execute("SET enable_progress_bar=false")
+    con.execute("INSTALL spatial")
+    con.execute("LOAD spatial")
+    # Load relation tables ONCE
+    tables = load_relation_tables(intermediate_dir, quiet=True)
+    # Process all items in batch
+    results = []
+    for family, template_dict, sample_id, _ in work_items:
+        # Reconstruct template from dict (sql_templates is already imported at module level)
+        template = sql_templates.SQLTemplate(**template_dict)
+        try:
+            sample = generate_template_based_sample(con, template, tables, sample_id)
+            if sample:
+                results.append((sample, family, template.template_id, None))
+            else:
+                results.append((None, family, template.template_id, "Empty result"))
+        except Exception as e:
+            results.append((None, family, template_dict.get('template_id', 'unknown'), str(e)))
+    con.close()
+    return results
+def generate_batch_core(
+    work_items: List[tuple],
+    intermediate_dir: str,
+) -> List[Dict[str, Any]]:
+    """Standalone batch worker usable from Modal or any remote context.
+    Data paths are resolved via GAZET_DATA_DIR env var (set in Modal image).
+    Args:
+        work_items: List of (family, template_dict, sample_id, _) tuples
+        intermediate_dir: Path to intermediate dir with relation parquets
+    Returns:
+        List of dicts with keys: sample (dict or None), family, template_id, error
+    """
+    from pathlib import Path as _Path
+    intermediate = _Path(intermediate_dir)
+    con = duckdb.connect()
+    con.execute("SET enable_progress_bar=false")
+    con.execute("INSTALL spatial")
+    con.execute("LOAD spatial")
+    tables = load_relation_tables(intermediate, quiet=True)
+    results = []
+    for family, template_dict, sample_id, _ in work_items:
+        template = sql_templates.SQLTemplate(**template_dict)
+        try:
+            sample = generate_template_based_sample(con, template, tables, sample_id)
+            if sample:
+                results.append({
+                    "sample": sample.model_dump(),
+                    "family": family,
+                    "template_id": template.template_id,
+                    "error": None,
+                })
+            else:
+                results.append({
+                    "sample": None,
+                    "family": family,
+                    "template_id": template.template_id,
+                    "error": "Empty result",
+                })
+        except Exception as e:
+            results.append({
+                "sample": None,
+                "family": family,
+                "template_id": template_dict.get('template_id', 'unknown'),
+                "error": str(e),
+            })
+    con.close()
+    return results
+def prepare_work_items(
+    target_counts: Dict[str, int],
+    retry_multiplier: int = 2,
+    start_counter: int = 1,
+    intermediate_dir_str: str = "",
+) -> List[tuple]:
+    """Prepare shuffled work items for sample generation.
+    Returns list of (family, template_dict, sample_id, intermediate_dir_str) tuples.
+    Reusable by both local main() and Modal orchestrator.
+    """
+    work_items = []
+    sample_counter = start_counter
+    for family, target_count in target_counts.items():
+        if target_count == 0:
+            continue
+        family_templates = [t for t in TEMPLATES if t.family == family]
+        if not family_templates:
+            print(f"No templates found for {family}, skipping...")
+            continue
+        for _ in range(target_count * retry_multiplier):
+            template = random.choice(family_templates)
+            template_dict = {
+                'template_id': template.template_id,
+                'family': template.family,
+                'sql_difficulty': template.sql_difficulty,
+                'anchor_source': template.anchor_source,
+                'num_anchors': template.num_anchors,
+                'sql_template': template.sql_template,
+                'question_hints': template.question_hints,
+                'target_subtype': template.target_subtype,
+                'requires_buffer': template.requires_buffer,
+                'requires_aggregation': template.requires_aggregation
+            }
+            work_items.append((
+                family,
+                template_dict,
+                f"sample_{sample_counter:06d}",
+                intermediate_dir_str,
+            ))
+            sample_counter += 1
+    random.shuffle(work_items)
+    return work_items
+def main():
+    """Generate training samples."""
+    global TARGET_COUNTS, MAX_WORKERS, RETRY_MULTIPLIER, APPEND_MODE
+    # Setup paths
+    script_dir = Path(__file__).parent
+    intermediate_dir = script_dir.parent / "intermediate"
+    output_dir = script_dir.parent / "output"
+    output_dir.mkdir(exist_ok=True, parents=True)
+    # Load relation tables once to check availability
+    print("Loading relation tables...")
+    tables = load_relation_tables(intermediate_dir, quiet=False)
+    # Use configured target counts or defaults
+    if TARGET_COUNTS is None:
+        target_counts = {
+            'direct_lookup':    100,
+            'adjacency':        150,
+            'multi_adjacency':   75,
+            'containment':      100,
+            'intersection':     100,
+            'buffer':           100,
+            'chained':          150,
+            'difference':        75,
+            'border_corridor':   75,
+            'set_operations':   150,
+            'partial_selection': 75,
+            'aggregation':      100,
+            'window_function':   75,
+            'attribute_filter':  75,
+        }
+    else:
+        target_counts = TARGET_COUNTS
+    # Load existing samples if in append mode
+    existing_samples = []
+    existing_sample_ids = set()
+    jsonl_file = output_dir / "dataset_raw.jsonl"
+    if APPEND_MODE and jsonl_file.exists():
+        print(f"\nAppend mode: Loading existing samples from {jsonl_file}")
+        with open(jsonl_file, 'r') as f:
+            for line in f:
+                if line.strip():
+                    sample_data = json.loads(line)
+                    existing_samples.append(sample_data)
+                    existing_sample_ids.add(sample_data['id'])
+        print(f"  Found {len(existing_samples)} existing samples")
+        # Determine starting sample counter
+        max_existing_id = max([int(s['id'].split('_')[1]) for s in existing_samples if s['id'].startswith('sample_')], default=0)
+        sample_counter = max_existing_id + 1
+    else:
+        sample_counter = 1
+    # Prepare work items using shared helper
+    work_items = prepare_work_items(
+        target_counts=target_counts,
+        retry_multiplier=RETRY_MULTIPLIER,
+        start_counter=sample_counter,
+        intermediate_dir_str=str(intermediate_dir),
+    )
+    starting_sample_counter = sample_counter
+    # Partition work items into batches (one per worker)
+    num_workers = min(MAX_WORKERS, len(work_items))
+    if num_workers == 0:
+        print("No work items to process")
+        return
+    batch_size = (len(work_items) + num_workers - 1) // num_workers
+    batches = []
+    for i in range(0, len(work_items), batch_size):
+        batch = work_items[i:i + batch_size]
+        batches.append((batch, str(intermediate_dir)))
+    # Generate samples in parallel (one batch per worker)
+    active_families = len([f for f in target_counts.values() if f > 0])
+    print(f"\nAttempting {len(work_items)} samples across {active_families} families...")
+    print(f"  Split into {len(batches)} batches of ~{batch_size} items (1 DuckDB init per batch)")
+    if APPEND_MODE and existing_samples:
+        print(f"Appending: starting from sample_{starting_sample_counter:06d}")
+    all_samples = []
+    family_progress = {f: {'success': 0, 'failed': 0} for f in target_counts.keys() if target_counts[f] > 0}
+    with ProcessPoolExecutor(max_workers=num_workers) as executor:
+        # Submit one batch per worker
+        futures = {executor.submit(generate_sample_batch_worker, batch): i for i, batch in enumerate(batches)}
+        # Collect results as batches complete
+        batches_done = 0
+        for future in as_completed(futures):
+            try:
+                batch_results = future.result()
+                for sample, family, template_id, error in batch_results:
+                    if sample:
+                        all_samples.append(sample)
+                        family_progress[family]['success'] += 1
+                    else:
+                        family_progress[family]['failed'] += 1
+            except Exception as e:
+                print(f"\n  Batch failed: {e}")
+            batches_done += 1
+            total_done = sum(p['success'] + p['failed'] for p in family_progress.values())
+            print(f"\r  Progress: {total_done}/{len(work_items)} samples ({batches_done}/{len(batches)} batches) ", end='', flush=True)
+        print()  # New line after progress
+    # Show distribution (keep all samples, no filtering)
+    print("\nResults by family:")
+    for family in sorted(family_progress.keys()):
+        success = family_progress[family]['success']
+        failed = family_progress[family]['failed']
+        target = target_counts.get(family, 0)
+        total = success + failed
+        success_rate = (success / total * 100) if total > 0 else 0
+        print(f"  {family:20s}: {success:3d} success / {failed:3d} failed ({success_rate:5.1f}% success rate, target: {target})")
+    # Save combined JSONL (skip individual JSON files for speed at scale)
+    print(f"\nSaving {len(all_samples)} new samples...")
+    if APPEND_MODE and existing_samples:
+        # Append to existing dataset
+        print(f"Appending to existing dataset ({len(existing_samples)} existing samples)")
+        with open(jsonl_file, 'a') as f:
+            for sample in all_samples:
+                f.write(json.dumps(sample.model_dump()) + '\n')
+        total_samples = len(existing_samples) + len(all_samples)
+    else:
+        # Overwrite dataset
+        with open(jsonl_file, 'w') as f:
+            for sample in all_samples:
+                f.write(json.dumps(sample.model_dump()) + '\n')
+        total_samples = len(all_samples)
+    print(f"\nGenerated {len(all_samples)} new samples")
+    print(f"Total dataset size: {total_samples} samples")
+    print(f"  Dataset: {jsonl_file}")
+if __name__ == "__main__":
+    main()
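
Reviewer note: the batch split in `main()` uses ceiling division, so the number of batches can end up smaller than the worker count. The sketch below isolates that arithmetic so it can be checked on its own. `partition_work_items` is a hypothetical helper name used only for illustration; it is not a function in this commit.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def partition_work_items(items: Sequence[T], max_workers: int) -> List[List[T]]:
    """Split items into contiguous batches, at most one per worker.

    Mirrors the arithmetic in main(): batch_size is ceil(len(items) / workers).
    Hypothetical helper for illustration only.
    """
    num_workers = min(max_workers, len(items))
    if num_workers == 0:
        return []
    # Ceiling division without floats.
    batch_size = (len(items) + num_workers - 1) // num_workers
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


if __name__ == "__main__":
    # 10 items over 4 workers: batch_size is 3, giving sizes 3, 3, 3, 1.
    print([len(b) for b in partition_work_items(list(range(10)), 4)])
```

With 9 items and 4 workers the same formula yields a batch size of 3 and therefore only 3 batches, so one worker stays idle. That is harmless here, but worth knowing when reading the "Split into N batches" log line.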

dataset/scripts/sql_templates.py ADDED Viewed

	@@ -0,0 +1,1651 @@

+"""
+SQL template definitions for synthetic data generation.
+Geometry output convention
+--------------------------
+Every final SELECT wraps geometry with ST_AsGeoJSON():
+    ST_AsGeoJSON(geometry) AS geometry
+This returns a GeoJSON string instead of raw WKB bytes, which is directly
+JSON-serialisable and matches what the serving stack expects.
+CTEs that compute intermediate geometries (used only for spatial predicates
+or ST_Area) keep the column as raw GEOMETRY so DuckDB spatial functions work.
+Buffer distance convention
+--------------------------
+All buffer templates use {buffer_km} or {buffer_m} (never degrees).
+SQL converts to degrees: metres / 111_320.
+Mixed-source candidates
+-----------------------
+generate_samples.py pads every candidate list with 50 % cross-source
+distractors so the model always sees both source values and learns the
+correct parquet path from the candidates table.
+Template families
+-----------------
+direct_lookup      Simple single-feature fetch by ID.
+adjacency          ST_Touches — features sharing a border.
+multi_adjacency    Features that simultaneously touch TWO anchors.
+containment        ST_Within / ST_Contains — hierarchical nesting.
+intersection       ST_Intersects — overlapping or crossing features.
+buffer             ST_Buffer — proximity zones in km or metres.
+chained            Containment + EXISTS/NOT EXISTS sea predicate.
+difference         ST_Difference — geometry subtraction.
+border_corridor    Buffered ST_Intersection of a shared border.
+set_operations     ST_Union_Agg — merging multiple geometries.
+partial_selection  Bbox clipping — directional halves or feature clips.
+aggregation        TOP-N by area with ORDER BY.
+window_function    ROW_NUMBER() OVER (PARTITION BY) — per-group ranking.
+attribute_filter   Pure attribute predicates: is_land, country, etc.
+"""
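
Reviewer note: the buffer distance convention in the docstring above can be checked by hand. The sketch below assumes the flat 111,320 metres-per-degree constant quoted there, and assumes kilometre values are converted to metres first. `buffer_to_degrees` is a hypothetical helper for illustration, not part of this module. Because the constant is a latitude approximation, a buffer expressed this way comes out narrower east-west than requested at higher latitudes.

```python
METRES_PER_DEGREE = 111_320  # constant quoted in the module docstring


def buffer_to_degrees(buffer_km=None, buffer_m=None):
    """Convert a buffer distance to degrees using a flat approximation.

    Hypothetical helper: exactly one of buffer_km or buffer_m must be given.
    """
    if (buffer_km is None) == (buffer_m is None):
        raise ValueError("pass exactly one of buffer_km or buffer_m")
    # Assumption: kilometres are scaled to metres before dividing.
    metres = buffer_km * 1000 if buffer_km is not None else buffer_m
    return metres / METRES_PER_DEGREE


if __name__ == "__main__":
    print(buffer_to_degrees(buffer_m=111_320))  # one degree
    print(round(buffer_to_degrees(buffer_km=50), 4))
```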
+from dataclasses import dataclass
+from typing import List, Literal, Optional
+@dataclass
+class SQLTemplate:
+    """SQL template for synthetic data generation."""
+    template_id: str
+    family: str
+    sql_difficulty: Literal["easy", "medium", "medium-hard", "hard"]
+    anchor_source: Literal["divisions_area", "natural_earth", "mixed"]
+    num_anchors: int
+    sql_template: str
+    question_hints: List[str]
+    target_subtype: Optional[str] = None
+    requires_buffer: bool = False
+    requires_aggregation: bool = False
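Both `sql_template` and `question_hints` carry named `{placeholders}`, so one plausible way to fill them is `str.format`. How `generate_samples.py` actually does the substitution is not shown in this part of the diff, so treat the following as a minimal sketch. The `render` helper and the example values are hypothetical.

```python
# Trimmed copy of the lookup_01 SQL from the catalog.
sql_template = (
    "SELECT ST_AsGeoJSON(geometry) AS geometry,"
    " names.\"primary\" AS name, id, subtype, country"
    " FROM read_parquet('divisions_area')"
    " WHERE id = '{anchor_id}'"
)
question_hint = "Get the boundary of {anchor_name}"


def render(template: str, **values: str) -> str:
    """Hypothetical helper: substitute named placeholders with str.format."""
    return template.format(**values)


# Example values are invented for illustration only.
sql = render(sql_template, anchor_id="example-id-123")
question = render(question_hint, anchor_name="Lisbon")
```

With this approach a missing value raises `KeyError`, so a template is never emitted with an unfilled placeholder.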
+# ---------------------------------------------------------------------------
+# Template catalog
+# ---------------------------------------------------------------------------
+TEMPLATES = [
+    # ── DIRECT LOOKUP ────────────────────────────────────────────────────────
+    SQLTemplate(
+        template_id="lookup_01",
+        family="direct_lookup",
+        sql_difficulty="easy",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        sql_template=(
+            "SELECT ST_AsGeoJSON(geometry) AS geometry,"
+            " names.\"primary\" AS name, id, subtype, country"
+            " FROM read_parquet('divisions_area')"
+            " WHERE id = '{anchor_id}'"
+        ),
+        question_hints=[
+            "Show me {anchor_name}",
+            "Get the boundary of {anchor_name}",
+            "Find {anchor_name}",
+            "Where is {anchor_name}?",
+            "Give me the outline of {anchor_name}",
+            "Display {anchor_name} on a map",
+            "What does {anchor_name} look like?",
+            "I need the shape of {anchor_name}",
+            "Pull up {anchor_name}",
+            "Can you show {anchor_name}?",
+            "Map of {anchor_name}",
+            "{anchor_name} boundary",
+            "Locate {anchor_name} for me",
+        ],
+    ),
+    SQLTemplate(
+        template_id="lookup_02",
+        family="direct_lookup",
+        sql_difficulty="easy",
+        anchor_source="natural_earth",
+        num_anchors=1,
+        sql_template=(
+            "SELECT ST_AsGeoJSON(geometry) AS geometry,"
+            " names.\"primary\" AS name, id, subtype"
+            " FROM read_parquet('natural_earth')"
+            " WHERE id = '{anchor_id}'"
+        ),
+        question_hints=[
+            "Show me the {anchor_name}",
+            "Get {anchor_name}",
+            "Find the {anchor_name}",
+            "Where is the {anchor_name}?",
+            "Show the extent of the {anchor_name}",
+            "Give me the geometry of the {anchor_name}",
+            "Display the {anchor_name}",
+            "Pull up the {anchor_name}",
+            "I want to see the {anchor_name}",
+            "Map the {anchor_name}",
+            "How big is the {anchor_name}?",
+            "Outline of the {anchor_name}",
+        ],
+    ),
+    # ── ADJACENCY ────────────────────────────────────────────────────────────
+    SQLTemplate(
+        template_id="adj_01",
+        family="adjacency",
+        sql_difficulty="medium",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name, b.subtype, b.country,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, a"
+            " WHERE b.id != '{anchor_id}'"
+            "   AND ST_Touches(a.geometry, b.geometry)"
+        ),
+        question_hints=[
+            "Which regions border {anchor_name}?",
+            "What administrative units touch {anchor_name}?",
+            "List all places adjacent to {anchor_name}",
+            "What shares a border with {anchor_name}?",
+            "Neighbours of {anchor_name}",
+            "What is adjacent to {anchor_name}?",
+            "What surrounds {anchor_name}?",
+            "Places next to {anchor_name}",
+            "Everything bordering {anchor_name}",
+        ],
+    ),
+    SQLTemplate(
+        template_id="adj_02",
+        family="adjacency",
+        sql_difficulty="medium",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="region",
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name, b.subtype, b.country,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, a"
+            " WHERE b.id != '{anchor_id}'"
+            "   AND b.subtype = '{target_subtype}'"
+            "   AND ST_Touches(a.geometry, b.geometry)"
+        ),
+        question_hints=[
+            "Which {target_subtype}s border {anchor_name}?",
+            "What {target_subtype}s share a border with {anchor_name}?",
+            "{target_subtype}s that touch {anchor_name}",
+            "Neighbouring {target_subtype}s of {anchor_name}",
+            "Which {target_subtype}s are adjacent to {anchor_name}?",
+            "{target_subtype}s along the {anchor_name} border",
+            "Find {target_subtype}s next to {anchor_name}",
+        ],
+    ),
+    SQLTemplate(
+        template_id="adj_03",
+        family="adjacency",
+        sql_difficulty="medium",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="sea",
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT n.id, n.names.\"primary\" AS name, n.subtype,"
+            "        ST_AsGeoJSON(n.geometry) AS geometry"
+            " FROM read_parquet('natural_earth') AS n, a"
+            " WHERE n.subtype IN ('ocean', 'sea')"
+            "   AND ST_Touches(a.geometry, n.geometry)"
+        ),
+        question_hints=[
+            "Which seas touch {anchor_name}?",
+            "What seas border {anchor_name}?",
+            "Which bodies of water is {anchor_name} adjacent to?",
+            "What ocean or sea borders {anchor_name}?",
+            "Which oceans touch {anchor_name}?",
+            "What coastline does {anchor_name} have?",
+            "Which water bodies does {anchor_name} border?",
+            "Does {anchor_name} have access to the sea?",
+            "What ocean is {anchor_name} on?",
+        ],
+    ),
+    # ── MULTI-ADJACENCY ──────────────────────────────────────────────────────
+    SQLTemplate(
+        template_id="multi_adj_01",
+        family="multi_adjacency",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=2,
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id_1}'"
+            "),"
+            " b AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id_2}'"
+            ")"
+            " SELECT c.id, c.names.\"primary\" AS name, c.subtype, c.country,"
+            "        ST_AsGeoJSON(c.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS c, a, b"
+            " WHERE c.id NOT IN ('{anchor_id_1}', '{anchor_id_2}')"
+            "   AND ST_Touches(c.geometry, a.geometry)"
+            "   AND ST_Touches(c.geometry, b.geometry)"
+        ),
+        question_hints=[
+            "Which regions border both {anchor_1_name} and {anchor_2_name}?",
+            "What places touch both {anchor_1_name} and {anchor_2_name}?",
+            "Regions adjacent to both {anchor_1_name} and {anchor_2_name}",
+            "What lies between {anchor_1_name} and {anchor_2_name}?",
+            "Common neighbours of {anchor_1_name} and {anchor_2_name}",
+        ],
+    ),
+    # ── CONTAINMENT ──────────────────────────────────────────────────────────
+    SQLTemplate(
+        template_id="contain_01",
+        family="containment",
+        sql_difficulty="medium",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="locality",
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name, b.subtype, b.country,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, a"
+            " WHERE b.id != '{anchor_id}'"
+            "   AND b.subtype = '{target_subtype}'"
+            "   AND ST_Within(b.geometry, a.geometry)"
+        ),
+        question_hints=[
+            "What {target_subtype}s are in {anchor_name}?",
+            "Which {target_subtype}s fall within {anchor_name}?",
+            "List all {target_subtype}s inside {anchor_name}",
+            "{target_subtype}s contained by {anchor_name}",
+            "All {target_subtype}s within the boundaries of {anchor_name}",
+            "{target_subtype}s of {anchor_name}",
+            "Show every {target_subtype} in {anchor_name}",
+        ],
+    ),
+    SQLTemplate(
+        template_id="contain_02",
+        family="containment",
+        sql_difficulty="medium",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="country",
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name, b.subtype,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, a"
+            " WHERE b.id != '{anchor_id}'"
+            "   AND b.subtype = '{target_subtype}'"
+            "   AND ST_Contains(b.geometry, a.geometry)"
+        ),
+        question_hints=[
+            "What country contains {anchor_name}?",
+            "Which country is {anchor_name} in?",
+            "What country does {anchor_name} belong to?",
+            "Which nation contains {anchor_name}?",
+            "{anchor_name} is part of which country?",
+            "Where does {anchor_name} fall geographically?",
+            "What country is {anchor_name} located in?",
+        ],
+    ),
+    SQLTemplate(
+        template_id="contain_03",
+        family="containment",
+        sql_difficulty="medium",
+        anchor_source="natural_earth",
+        num_anchors=1,
+        target_subtype="region",
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('natural_earth') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name, b.subtype, b.country,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, a"
+            " WHERE b.subtype = '{target_subtype}'"
+            "   AND ST_Within(b.geometry, a.geometry)"
+        ),
+        question_hints=[
+            "Which {target_subtype}s are in the {anchor_name}?",
+            "What {target_subtype}s fall within the {anchor_name}?",
+            "{target_subtype}s inside the {anchor_name}",
+            "Administrative {target_subtype}s within the {anchor_name}",
+            "All {target_subtype}s contained by the {anchor_name}",
+            "What {target_subtype}s does the {anchor_name} contain?",
+            "{target_subtype}s covered by the {anchor_name}",
+        ],
+    ),
+    # ── INTERSECTION ─────────────────────────────────────────────────────────
+    SQLTemplate(
+        template_id="intersect_01",
+        family="intersection",
+        sql_difficulty="medium-hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="region",
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name, b.subtype, b.country,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, a"
+            " WHERE b.id != '{anchor_id}'"
+            "   AND b.subtype = '{target_subtype}'"
+            "   AND ST_Intersects(b.geometry, a.geometry)"
+        ),
+        question_hints=[
+            "Which {target_subtype}s intersect {anchor_name}?",
+            "What {target_subtype}s overlap with {anchor_name}?",
+            "{target_subtype}s that cross into {anchor_name}",
+            "Which {target_subtype}s overlap {anchor_name}?",
+            "{target_subtype}s partially inside {anchor_name}",
+            "What {target_subtype}s extend into {anchor_name}?",
+        ],
+    ),
+    SQLTemplate(
+        template_id="intersect_02",
+        family="intersection",
+        sql_difficulty="medium-hard",
+        anchor_source="natural_earth",
+        num_anchors=1,
+        target_subtype="country",
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('natural_earth') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name, b.subtype,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, a"
+            " WHERE b.subtype = '{target_subtype}'"
+            "   AND ST_Intersects(b.geometry, a.geometry)"
+        ),
+        question_hints=[
+            "Which countries intersect the {anchor_name}?",
+            "What countries does the {anchor_name} pass through?",
+            "Countries that overlap with the {anchor_name}",
+            "Which countries touch the {anchor_name}?",
+            "Nations intersected by the {anchor_name}",
+            "Which nations does the {anchor_name} cross?",
+            "Countries along the {anchor_name}",
+            "What countries does the {anchor_name} cover?",
+            "Countries that the {anchor_name} spans across",
+        ],
+    ),
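The adjacency and intersection families differ only in the spatial predicate. In standard simple-features semantics `ST_Intersects` is also true for features that merely touch, so the intersection templates will generally return bordering features as well as overlapping ones. A toy sketch with axis-aligned boxes illustrates the distinction. It is plain Python rather than DuckDB, and `Box` and the three functions are illustrative stand-ins.

```python
from typing import NamedTuple


class Box(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(a: Box, b: Box) -> bool:
    """True if the closed boxes share at least one point (boundary included)."""
    return a.xmin <= b.xmax and b.xmin <= a.xmax and a.ymin <= b.ymax and b.ymin <= a.ymax


def interiors_overlap(a: Box, b: Box) -> bool:
    """True if the boxes share area, not just an edge or a corner."""
    return a.xmin < b.xmax and b.xmin < a.xmax and a.ymin < b.ymax and b.ymin < a.ymax


def touches(a: Box, b: Box) -> bool:
    """True if the boxes meet only along their boundaries."""
    return intersects(a, b) and not interiors_overlap(a, b)


anchor = Box(0, 0, 2, 2)
neighbour = Box(2, 0, 4, 2)    # shares the edge x = 2 with the anchor
overlapping = Box(1, 1, 3, 3)  # shares area with the anchor
distant = Box(5, 5, 6, 6)      # nowhere near the anchor
```

`neighbour` satisfies both predicates, `overlapping` satisfies only `intersects`, and `distant` satisfies neither.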
+    # ── BUFFER ───────────────────────────────────────────────────────────────
+    # CTE computes the buffered geometry (raw) for the spatial join.
+    # Final SELECT wraps the result features with ST_AsGeoJSON.
+    SQLTemplate(
+        template_id="buffer_01",
+        family="buffer",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        requires_buffer=True,
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT ST_Buffer(geometry, {buffer_km} * 1000.0 / 111320.0) AS geom"
+            "  FROM read_parquet('divisions_area')"
+            "  WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name, b.subtype, b.country,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, a"
+            " WHERE b.id != '{anchor_id}'"
+            "   AND ST_Intersects(b.geometry, a.geom)"
+        ),
+        question_hints=[
+            "What is within {buffer_km} km of {anchor_name}?",
+            "Administrative units within {buffer_km} km of {anchor_name}",
+            "Features within a {buffer_km} km radius of {anchor_name}",
+            "Places within {buffer_km} kilometers of {anchor_name}",
+            "{buffer_km} km buffer around {anchor_name}",
+            "What falls within {buffer_km} km of {anchor_name}?",
+            "Everything within {buffer_km} km of {anchor_name}",
+        ],
+    ),
+    SQLTemplate(
+        template_id="buffer_02",
+        family="buffer",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        requires_buffer=True,
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT ST_Buffer(geometry, {buffer_m} / 111320.0) AS geom"
+            "  FROM read_parquet('divisions_area')"
+            "  WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name, b.subtype, b.country,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, a"
+            " WHERE b.id != '{anchor_id}'"
+            "   AND ST_Intersects(b.geometry, a.geom)"
+        ),
+        question_hints=[
+            "What is within {buffer_m} meters of {anchor_name}?",
+            "Features within {buffer_m} m of {anchor_name}",
+            "Places within {buffer_m} metres of {anchor_name}",
+            "{buffer_m} meter buffer around {anchor_name}",
+            "What falls within {buffer_m} m of {anchor_name}?",
+            "Administrative units within {buffer_m} metres of {anchor_name}",
+        ],
+    ),
+    SQLTemplate(
+        template_id="buffer_03",
+        family="buffer",
+        sql_difficulty="hard",
+        anchor_source="natural_earth",
+        num_anchors=1,
+        requires_buffer=True,
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT ST_Buffer(geometry, {buffer_km} * 1000.0 / 111320.0) AS geom"
+            "  FROM read_parquet('natural_earth')"
+            "  WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name, b.subtype, b.country,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, a"
+            " WHERE ST_Intersects(b.geometry, a.geom)"
+        ),
+        question_hints=[
+            "What administrative units are within {buffer_km} km of the {anchor_name}?",
+            "Countries within {buffer_km} km of the {anchor_name}",
+            "Regions within {buffer_km} km of the {anchor_name}",
+            "What falls within {buffer_km} km of the {anchor_name}?",
+            "Administrative divisions within a {buffer_km} km radius of the {anchor_name}",
+            "Places within {buffer_km} kilometers of the {anchor_name}",
+        ],
+    ),
+    SQLTemplate(
+        template_id="buffer_04",
+        family="buffer",
+        sql_difficulty="hard",
+        anchor_source="natural_earth",
+        num_anchors=1,
+        requires_buffer=True,
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT ST_Buffer(geometry, {buffer_m} / 111320.0) AS geom"
+            "  FROM read_parquet('natural_earth')"
+            "  WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name, b.subtype, b.country,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, a"
+            " WHERE ST_Intersects(b.geometry, a.geom)"
+        ),
+        question_hints=[
+            "What is within {buffer_m} meters of the {anchor_name}?",
+            "Administrative units within {buffer_m} m of the {anchor_name}",
+            "Places within {buffer_m} metres of the {anchor_name}",
+            "{buffer_m} meter buffer around the {anchor_name}",
+        ],
+    ),
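The buffer templates divide metres by 111320 to get a distance in degrees. The sketch below mirrors that arithmetic and then estimates how much east-west ground distance the same degree value covers at a given latitude. The cosine scaling is a common spherical approximation and an assumption of this note, not something the templates compute. The function names are hypothetical.

```python
import math

METRES_PER_DEGREE = 111_320.0  # constant used by the buffer templates


def km_to_degrees(buffer_km: float) -> float:
    """Mirror the SQL expression {buffer_km} * 1000.0 / 111320.0."""
    return buffer_km * 1000.0 / METRES_PER_DEGREE


def east_west_km(buffer_deg: float, latitude_deg: float) -> float:
    """Approximate east-west ground distance covered by buffer_deg at a latitude."""
    scale = math.cos(math.radians(latitude_deg))
    return buffer_deg * METRES_PER_DEGREE * scale / 1000.0


deg = km_to_degrees(50)                 # roughly 0.449 degrees
at_equator = east_west_km(deg, 0.0)     # about 50 km
at_60_north = east_west_km(deg, 60.0)   # about 25 km, since cos(60 deg) = 0.5
```

Under this approximation a "50 km" buffer is close to 50 km north-south everywhere but only about half that east-west at 60 degrees latitude.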
+    # ── CHAINED ──────────────────────────────────────────────────────────────
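+    # Spatial filter (ST_Within or ST_Intersects) + EXISTS/NOT EXISTS against
+    # natural_earth features (ocean/sea, or landforms in chained_03).
+    # CTE holds raw geometry for the spatial predicate; final SELECT wraps with ST_AsGeoJSON.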
+    SQLTemplate(
+        template_id="chained_01",
+        family="chained",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="locality",
+        sql_template=(
+            "WITH region AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name, b.subtype, b.country,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, region"
+            " WHERE b.subtype = '{target_subtype}'"
+            "   AND ST_Within(b.geometry, region.geometry)"
+            "   AND EXISTS ("
+            "     SELECT 1 FROM read_parquet('natural_earth') AS n"
+            "     WHERE n.subtype IN ('ocean', 'sea')"
+            "       AND ST_Intersects(b.geometry, n.geometry)"
+            "   )"
+        ),
+        question_hints=[
+            "Coastal {target_subtype}s of {anchor_name}",
+            "{target_subtype}s in {anchor_name} with sea access",
+            "Which {target_subtype}s in {anchor_name} are on the coast?",
+            "Seaside {target_subtype}s within {anchor_name}",
+            "{target_subtype}s in {anchor_name} bordering the sea",
+            "Oceanfront {target_subtype}s in {anchor_name}",
+            "Which {target_subtype}s in {anchor_name} have a coastline?",
+        ],
+    ),
+    SQLTemplate(
+        template_id="chained_02",
+        family="chained",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="country",
+        sql_template=(
+            "WITH region AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name, b.subtype, b.country,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, region"
+            " WHERE b.subtype = '{target_subtype}'"
+            "   AND ST_Intersects(b.geometry, region.geometry)"
+            "   AND NOT EXISTS ("
+            "     SELECT 1 FROM read_parquet('natural_earth') AS n"
+            "     WHERE n.subtype IN ('ocean', 'sea')"
+            "       AND ST_Intersects(b.geometry, n.geometry)"
+            "   )"
+        ),
+        question_hints=[
+            "Landlocked {target_subtype}s in {anchor_name}",
+            "Which {target_subtype}s in {anchor_name} have no sea access?",
+            "{target_subtype}s in {anchor_name} that are landlocked",
+            "{target_subtype}s in {anchor_name} with no coastline",
+            "Which {target_subtype}s within {anchor_name} are landlocked?",
+            "Interior {target_subtype}s of {anchor_name} with no ocean border",
+        ],
+    ),
+    SQLTemplate(
+        template_id="chained_03",
+        family="chained",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="locality",
+        sql_template=(
+            "WITH region AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name, b.subtype, b.country,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, region"
+            " WHERE b.subtype = '{target_subtype}'"
+            "   AND ST_Within(b.geometry, region.geometry)"
+            "   AND EXISTS ("
+            "     SELECT 1 FROM read_parquet('natural_earth') AS n"
+            "     WHERE n.subtype IN ('Terrain area', 'Island group', 'Peninsula')"
+            "       AND ST_Intersects(b.geometry, n.geometry)"
+            "   )"
+        ),
+        question_hints=[
+            "{target_subtype}s in {anchor_name} on a terrain feature or island",
+            "{target_subtype}s of {anchor_name} on a peninsula or island group",
+            "{target_subtype}s within {anchor_name} on notable landforms",
+            "Island and peninsula {target_subtype}s of {anchor_name}",
+        ],
+    ),
+    # ── DIFFERENCE ───────────────────────────────────────────────────────────
+    # CTEs hold raw geometry; ST_Difference result wrapped with ST_AsGeoJSON.
+    SQLTemplate(
+        template_id="diff_01",
+        family="difference",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=2,
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id_1}'"
+            "),"
+            " b AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id_2}'"
+            ")"
+            " SELECT ST_AsGeoJSON(ST_Difference(a.geometry, b.geometry)) AS geometry"
+            " FROM a, b"
+            " WHERE ST_Intersects(a.geometry, b.geometry)"
+        ),
+        question_hints=[
+            "{anchor_1_name} excluding {anchor_2_name}",
+            "{anchor_1_name} minus {anchor_2_name}",
+            "The part of {anchor_1_name} that is not in {anchor_2_name}",
+            "{anchor_1_name} without the {anchor_2_name} area",
+            "Remove {anchor_2_name} from {anchor_1_name}",
+            "{anchor_1_name} with {anchor_2_name} cut out",
+            "Subtract {anchor_2_name} from {anchor_1_name}",
+            "What is left of {anchor_1_name} after removing {anchor_2_name}?",
+        ],
+    ),
+    SQLTemplate(
+        template_id="diff_02",
+        family="difference",
+        sql_difficulty="hard",
+        anchor_source="mixed",
+        num_anchors=2,
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            "),"
+            " b AS ("
+            "  SELECT geometry FROM read_parquet('natural_earth') WHERE id = '{clip_feature_id}'"
+            ")"
+            " SELECT ST_AsGeoJSON(ST_Difference(a.geometry, b.geometry)) AS geometry"
+            " FROM a, b"
+            " WHERE ST_Intersects(a.geometry, b.geometry)"
+        ),
+        question_hints=[
+            "The part of {anchor_name} outside the {clip_feature_name}",
+            "{anchor_name} excluding the {clip_feature_name}",
+            "{anchor_name} minus the {clip_feature_name}",
+            "The land area of {anchor_name} not covered by the {clip_feature_name}",
+            "{anchor_name} with the {clip_feature_name} removed",
+            "What remains of {anchor_name} after removing the {clip_feature_name}?",
+        ],
+    ),
+    # ── BORDER CORRIDOR ──────────────────────────────────────────────────────
+    # Intermediate intersection kept raw; final buffer wrapped with ST_AsGeoJSON.
+    SQLTemplate(
+        template_id="corridor_01",
+        family="border_corridor",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=2,
+        requires_buffer=True,
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id_1}'"
+            "),"
+            " b AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id_2}'"
+            "),"
+            " border AS ("
+            "  SELECT ST_Intersection(a.geometry, b.geometry) AS line"
+            "  FROM a, b"
+            "  WHERE ST_Intersects(a.geometry, b.geometry)"
+            ")"
+            " SELECT ST_AsGeoJSON(ST_Buffer(border.line, {buffer_km} * 1000.0 / 111320.0)) AS geometry"
+            " FROM border"
+            " WHERE border.line IS NOT NULL"
+        ),
+        question_hints=[
+            "{buffer_km} km zone along the border between {anchor_1_name} and {anchor_2_name}",
+            "The {buffer_km} km border corridor between {anchor_1_name} and {anchor_2_name}",
+            "Area within {buffer_km} km of the {anchor_1_name}-{anchor_2_name} border",
+            "The region straddling the border of {anchor_1_name} and {anchor_2_name} within {buffer_km} km",
+            "{buffer_km} km on either side of the {anchor_1_name} and {anchor_2_name} border",
+            "Buffer the {anchor_1_name}-{anchor_2_name} boundary by {buffer_km} km",
+        ],
+    ),
+    # ── SET OPERATIONS ───────────────────────────────────────────────────────
+    # union_01 / union_02: 2-anchor and filtered-containment unions.
+    # union_03: 3-anchor union — trains the model on IN-clause with 3 IDs.
+    # contain_multi: subtype within multiple countries via country IN clause.
+    SQLTemplate(
+        template_id="union_01",
+        family="set_operations",
+        sql_difficulty="medium-hard",
+        anchor_source="divisions_area",
+        num_anchors=2,
+        sql_template=(
+            "SELECT ST_AsGeoJSON(ST_Union_Agg(geometry)) AS geometry,"
+            " array_agg(names.\"primary\") AS names"
+            " FROM read_parquet('divisions_area')"
+            " WHERE id IN ('{anchor_id_1}', '{anchor_id_2}')"
+        ),
+        question_hints=[
+            "The combined area of {anchor_1_name} and {anchor_2_name}",
+            "Union of {anchor_1_name} and {anchor_2_name}",
+            "Merge {anchor_1_name} and {anchor_2_name}",
+            "{anchor_1_name} and {anchor_2_name} together",
+            "Combined geometry of {anchor_1_name} and {anchor_2_name}",
+        ],
+    ),
+    SQLTemplate(
+        template_id="union_03",
+        family="set_operations",
+        sql_difficulty="medium-hard",
+        anchor_source="divisions_area",
+        num_anchors=3,
+        sql_template=(
+            "SELECT ST_AsGeoJSON(ST_Union_Agg(geometry)) AS geometry,"
+            " array_agg(names.\"primary\") AS names"
+            " FROM read_parquet('divisions_area')"
+            " WHERE id IN ('{anchor_id_1}', '{anchor_id_2}', '{anchor_id_3}')"
+        ),
+        question_hints=[
+            "Show me {anchor_1_name}, {anchor_2_name} and {anchor_3_name}",
+            "The combined area of {anchor_1_name}, {anchor_2_name} and {anchor_3_name}",
+            "Union of {anchor_1_name}, {anchor_2_name} and {anchor_3_name}",
+            "Merge {anchor_1_name}, {anchor_2_name} and {anchor_3_name}",
+            "{anchor_1_name}, {anchor_2_name} and {anchor_3_name} together",
+            "Display {anchor_1_name}, {anchor_2_name} and {anchor_3_name}",
+        ],
+    ),
+    SQLTemplate(
+        template_id="contain_multi_01",
+        family="set_operations",
+        sql_difficulty="medium-hard",
+        anchor_source="divisions_area",
+        num_anchors=2,
+        target_subtype="region",
+        sql_template=(
+            "SELECT id, names.\"primary\" AS name, subtype, country,"
+            "       ST_AsGeoJSON(geometry) AS geometry"
+            " FROM read_parquet('divisions_area')"
+            " WHERE country IN ('{country_1}', '{country_2}')"
+            "   AND subtype = '{target_subtype}'"
+        ),
+        question_hints=[
+            "{target_subtype}s of {anchor_1_name} and {anchor_2_name}",
+            "All {target_subtype}s in {anchor_1_name} and {anchor_2_name}",
+            "Show {target_subtype}s across {anchor_1_name} and {anchor_2_name}",
+            "{target_subtype}s belonging to {anchor_1_name} and {anchor_2_name}",
+            "List {target_subtype}s in {anchor_1_name} and {anchor_2_name}",
+        ],
+    ),
+    SQLTemplate(
+        template_id="contain_multi_02",
+        family="set_operations",
+        sql_difficulty="medium-hard",
+        anchor_source="divisions_area",
+        num_anchors=3,
+        target_subtype="region",
+        sql_template=(
+            "SELECT id, names.\"primary\" AS name, subtype, country,"
+            "       ST_AsGeoJSON(geometry) AS geometry"
+            " FROM read_parquet('divisions_area')"
+            " WHERE country IN ('{country_1}', '{country_2}', '{country_3}')"
+            "   AND subtype = '{target_subtype}'"
+        ),
+        question_hints=[
+            "{target_subtype}s of {anchor_1_name}, {anchor_2_name} and {anchor_3_name}",
+            "All {target_subtype}s in {anchor_1_name}, {anchor_2_name} and {anchor_3_name}",
+            "Show {target_subtype}s across {anchor_1_name}, {anchor_2_name} and {anchor_3_name}",
+            "List {target_subtype}s in {anchor_1_name}, {anchor_2_name} and {anchor_3_name}",
+        ],
+    ),
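The `contain_multi` templates use `{country_N}` in the SQL but `{anchor_N_name}` in the question hints, so whatever fills them has to supply both sets of values. One way to list the placeholders a template needs is `string.Formatter().parse` from the standard library. This is a sketch of a possible check, not code from the repository. The `placeholders` helper is hypothetical and the SQL is a shortened copy of `contain_multi_01`.

```python
from string import Formatter


def placeholders(template: str) -> set[str]:
    """Return the named {fields} that appear in a format string."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


# Shortened copy of the contain_multi_01 SQL and one of its hints.
sql = (
    "SELECT id FROM read_parquet('divisions_area')"
    " WHERE country IN ('{country_1}', '{country_2}')"
    "   AND subtype = '{target_subtype}'"
)
hint = "All {target_subtype}s in {anchor_1_name} and {anchor_2_name}"

sql_fields = placeholders(sql)
hint_fields = placeholders(hint)
required = sql_fields | hint_fields  # everything a sample must provide
```

Only `target_subtype` is shared between the two sets. The country codes and the anchor names are separate inputs.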
+    SQLTemplate(
+        template_id="union_02",
+        family="set_operations",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="locality",
+        requires_aggregation=True,
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT ST_AsGeoJSON(ST_Union_Agg(b.geometry)) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, a"
+            " WHERE b.subtype = '{target_subtype}'"
+            "   AND ST_Within(b.geometry, a.geometry)"
+        ),
+        question_hints=[
+            "Merge all {target_subtype}s in {anchor_name} into one geometry",
+            "Combined geometry of all {target_subtype}s in {anchor_name}",
+            "Union of all {target_subtype}s within {anchor_name}",
+            "All {target_subtype}s of {anchor_name} merged together",
+            "The overall extent of {target_subtype}s in {anchor_name}",
+        ],
+    ),
+    # ── PARTIAL SELECTION ────────────────────────────────────────────────────
+    # Bbox clip CTEs use raw geometry; the final ST_Intersection result is
+    # wrapped in ST_AsGeoJSON.
+    SQLTemplate(
+        template_id="partial_01",
+        family="partial_selection",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            "),"
+            " bbox AS ("
+            "  SELECT ST_XMin(geometry) AS xmin, ST_XMax(geometry) AS xmax,"
+            "         ST_YMin(geometry) AS ymin, ST_YMax(geometry) AS ymax FROM a"
+            "),"
+            " clip AS ("
+            "  SELECT ST_MakeEnvelope(xmin, (ymin + ymax) / 2.0, xmax, ymax) AS half_geom FROM bbox"
+            ")"
+            " SELECT ST_AsGeoJSON(ST_Intersection(a.geometry, clip.half_geom)) AS geometry"
+            " FROM a, clip"
+        ),
+        question_hints=[
+            "The northern half of {anchor_name}",
+            "Northern part of {anchor_name}",
+            "The top half of {anchor_name}",
+            "Northern portion of {anchor_name}",
+        ],
+    ),
+    SQLTemplate(
+        template_id="partial_02",
+        family="partial_selection",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            "),"
+            " bbox AS ("
+            "  SELECT ST_XMin(geometry) AS xmin, ST_XMax(geometry) AS xmax,"
+            "         ST_YMin(geometry) AS ymin, ST_YMax(geometry) AS ymax FROM a"
+            "),"
+            " clip AS ("
+            "  SELECT ST_MakeEnvelope(xmin, ymin, xmax, (ymin + ymax) / 2.0) AS half_geom FROM bbox"
+            ")"
+            " SELECT ST_AsGeoJSON(ST_Intersection(a.geometry, clip.half_geom)) AS geometry"
+            " FROM a, clip"
+        ),
+        question_hints=[
+            "The southern half of {anchor_name}",
+            "Southern part of {anchor_name}",
+            "The bottom half of {anchor_name}",
+            "Southern portion of {anchor_name}",
+        ],
+    ),
+    SQLTemplate(
+        template_id="partial_03",
+        family="partial_selection",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            "),"
+            " bbox AS ("
+            "  SELECT ST_XMin(geometry) AS xmin, ST_XMax(geometry) AS xmax,"
+            "         ST_YMin(geometry) AS ymin, ST_YMax(geometry) AS ymax FROM a"
+            "),"
+            " clip AS ("
+            "  SELECT ST_MakeEnvelope((xmin + xmax) / 2.0, ymin, xmax, ymax) AS half_geom FROM bbox"
+            ")"
+            " SELECT ST_AsGeoJSON(ST_Intersection(a.geometry, clip.half_geom)) AS geometry"
+            " FROM a, clip"
+        ),
+        question_hints=[
+            "The eastern half of {anchor_name}",
+            "Eastern part of {anchor_name}",
+            "The right half of {anchor_name}",
+            "Eastern portion of {anchor_name}",
+        ],
+    ),
+    SQLTemplate(
+        template_id="partial_04",
+        family="partial_selection",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            "),"
+            " bbox AS ("
+            "  SELECT ST_XMin(geometry) AS xmin, ST_XMax(geometry) AS xmax,"
+            "         ST_YMin(geometry) AS ymin, ST_YMax(geometry) AS ymax FROM a"
+            "),"
+            " clip AS ("
+            "  SELECT ST_MakeEnvelope(xmin, ymin, (xmin + xmax) / 2.0, ymax) AS half_geom FROM bbox"
+            ")"
+            " SELECT ST_AsGeoJSON(ST_Intersection(a.geometry, clip.half_geom)) AS geometry"
+            " FROM a, clip"
+        ),
+        question_hints=[
+            "The western half of {anchor_name}",
+            "Western part of {anchor_name}",
+            "The left half of {anchor_name}",
+            "Western portion of {anchor_name}",
+        ],
+    ),
+    SQLTemplate(
+        template_id="partial_05",
+        family="partial_selection",
+        sql_difficulty="hard",
+        anchor_source="mixed",
+        num_anchors=2,
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry AS g1 FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            "),"
+            " b AS ("
+            "  SELECT geometry AS g2 FROM read_parquet('natural_earth') WHERE id = '{clip_feature_id}'"
+            ")"
+            " SELECT ST_AsGeoJSON(ST_Intersection(a.g1, b.g2)) AS geometry"
+            " FROM a, b"
+            " WHERE ST_Intersects(a.g1, b.g2)"
+        ),
+        question_hints=[
+            "The part of {anchor_name} that overlaps the {clip_feature_name}",
+            "{anchor_name} within the {clip_feature_name}",
+            "The portion of {anchor_name} inside the {clip_feature_name}",
+            "Clip {anchor_name} to the {clip_feature_name}",
+        ],
+    ),
+    # ── AGGREGATION ──────────────────────────────────────────────────────────
+    # ST_Area runs on raw geometry and is aliased as area for the ORDER BY;
+    # the SELECT wraps the output geometry with ST_AsGeoJSON.
+    SQLTemplate(
+        template_id="agg_01",
+        family="aggregation",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype=None,  # filled at generation time: locality or region
+        requires_aggregation=True,
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry,"
+            "        ST_Area(b.geometry) AS area"
+            " FROM read_parquet('divisions_area') AS b, a"
+            " WHERE ST_Within(b.geometry, a.geometry)"
+            "   AND b.subtype = '{target_subtype}'"
+            " ORDER BY area DESC"
+            " LIMIT {top_n}"
+        ),
+        question_hints=[
+            "Top {top_n} largest {target_subtype}s in {anchor_name}",
+            "Biggest {top_n} {target_subtype}s in {anchor_name}",
+            "{top_n} largest {target_subtype}s inside {anchor_name}",
+            "The {top_n} biggest {target_subtype}s within {anchor_name}",
+            "Largest {target_subtype} in {anchor_name}",
+            "Which {target_subtype} in {anchor_name} has the most area?",
+        ],
+    ),
+    SQLTemplate(
+        template_id="agg_02",
+        family="aggregation",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype=None,  # filled at generation time: locality or region
+        requires_aggregation=True,
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry,"
+            "        ST_Area(b.geometry) AS area"
+            " FROM read_parquet('divisions_area') AS b, a"
+            " WHERE ST_Within(b.geometry, a.geometry)"
+            "   AND b.subtype = '{target_subtype}'"
+            " ORDER BY area ASC"
+            " LIMIT {top_n}"
+        ),
+        question_hints=[
+            "Top {top_n} smallest {target_subtype}s in {anchor_name}",
+            "Smallest {top_n} {target_subtype}s in {anchor_name}",
+            "{top_n} smallest {target_subtype}s inside {anchor_name}",
+            "The {top_n} tiniest {target_subtype}s within {anchor_name}",
+            "Smallest {target_subtype} in {anchor_name}",
+            "Which {target_subtype} in {anchor_name} has the least area?",
+        ],
+    ),
+    SQLTemplate(
+        template_id="agg_03",
+        family="aggregation",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype=None,  # filled at generation time: locality or region
+        requires_aggregation=True,
+        sql_template=(
+            "SELECT id, names.\"primary\" AS name,"
+            "       ST_AsGeoJSON(geometry) AS geometry,"
+            "       ST_Area(geometry) AS area"
+            " FROM read_parquet('divisions_area')"
+            " WHERE country = '{country}'"
+            "   AND subtype = '{target_subtype}'"
+            " ORDER BY area DESC"
+            " LIMIT {top_n}"
+        ),
+        question_hints=[
+            "Top {top_n} largest {target_subtype}s in {anchor_name}",
+            "{top_n} biggest {target_subtype}s in {anchor_name}",
+            "Largest {top_n} {target_subtype}s in {anchor_name}",
+            "The {top_n} largest {target_subtype}s in {anchor_name}",
+            "Biggest {target_subtype} in {anchor_name}",
+            "Which {target_subtype} in {anchor_name} is the largest?",
+        ],
+    ),
+    SQLTemplate(
+        template_id="agg_04",
+        family="aggregation",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype=None,  # filled at generation time: locality or region
+        requires_aggregation=True,
+        sql_template=(
+            "SELECT id, names.\"primary\" AS name,"
+            "       ST_AsGeoJSON(geometry) AS geometry,"
+            "       ST_Area(geometry) AS area"
+            " FROM read_parquet('divisions_area')"
+            " WHERE country = '{country}'"
+            "   AND subtype = '{target_subtype}'"
+            " ORDER BY area ASC"
+            " LIMIT {top_n}"
+        ),
+        question_hints=[
+            "Top {top_n} smallest {target_subtype}s in {anchor_name}",
+            "{top_n} smallest {target_subtype}s in {anchor_name}",
+            "Smallest {top_n} {target_subtype}s in {anchor_name}",
+            "The {top_n} smallest {target_subtype}s in {anchor_name}",
+            "Smallest {target_subtype} in {anchor_name}",
+            "Which {target_subtype} in {anchor_name} is the smallest?",
+        ],
+    ),
+    # ── WINDOW FUNCTION ──────────────────────────────────────────────────────
+    # CTE keeps raw geometry for ST_Area; final SELECT wraps with ST_AsGeoJSON.
+    SQLTemplate(
+        template_id="window_01",
+        family="window_function",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="locality",
+        requires_aggregation=True,
+        sql_template=(
+            "WITH ranked AS ("
+            "  SELECT id, names.\"primary\" AS name, subtype, country, region, geometry,"
+            "         ST_Area(geometry) AS area,"
+            "         ROW_NUMBER() OVER (PARTITION BY region ORDER BY ST_Area(geometry) DESC) AS rn"
+            "  FROM read_parquet('divisions_area')"
+            "  WHERE country = '{country}'"
+            "    AND subtype = '{target_subtype}'"
+            ")"
+            " SELECT id, name, subtype, country, region,"
+            "        ST_AsGeoJSON(geometry) AS geometry, area"
+            " FROM ranked"
+            " WHERE rn = 1"
+        ),
+        question_hints=[
+            "The largest {target_subtype} in each region of {anchor_name}",
+            "Biggest {target_subtype} per region in {anchor_name}",
+            "Largest {target_subtype} for every region of {anchor_name}",
+            "The biggest {target_subtype} in each province of {anchor_name}",
+        ],
+    ),
+    SQLTemplate(
+        template_id="window_02",
+        family="window_function",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="locality",
+        requires_aggregation=True,
+        sql_template=(
+            "WITH ranked AS ("
+            "  SELECT id, names.\"primary\" AS name, subtype, country, region, geometry,"
+            "         ST_Area(geometry) AS area,"
+            "         ROW_NUMBER() OVER (PARTITION BY region ORDER BY ST_Area(geometry) ASC) AS rn"
+            "  FROM read_parquet('divisions_area')"
+            "  WHERE country = '{country}'"
+            "    AND subtype = '{target_subtype}'"
+            ")"
+            " SELECT id, name, subtype, country, region,"
+            "        ST_AsGeoJSON(geometry) AS geometry, area"
+            " FROM ranked"
+            " WHERE rn = 1"
+        ),
+        question_hints=[
+            "The smallest {target_subtype} in each region of {anchor_name}",
+            "Smallest {target_subtype} per region in {anchor_name}",
+            "Tiniest {target_subtype} for every region of {anchor_name}",
+            "The smallest {target_subtype} in each province of {anchor_name}",
+        ],
+    ),
+    # ── ATTRIBUTE FILTER ─────────────────────────────────────────────────────
+    # No spatial op: plain WHERE filters on country, subtype, and
+    # is_land / is_territorial.
+    SQLTemplate(
+        template_id="attr_01",
+        family="attribute_filter",
+        sql_difficulty="medium",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="dependency",
+        sql_template=(
+            "SELECT id, names.\"primary\" AS name, subtype, country,"
+            "       ST_AsGeoJSON(geometry) AS geometry"
+            " FROM read_parquet('divisions_area')"
+            " WHERE country = '{country}'"
+            "   AND is_land = TRUE"
+            "   AND subtype = '{target_subtype}'"
+        ),
+        question_hints=[
+            "Island territories of {anchor_name}",
+            "Overseas island {target_subtype}s belonging to {anchor_name}",
+            "Which islands are part of {anchor_name}?",
+            "Land territories of {anchor_name}",
+            "Island possessions of {anchor_name}",
+            "{anchor_name}'s island {target_subtype}s",
+        ],
+    ),
+    SQLTemplate(
+        template_id="attr_02",
+        family="attribute_filter",
+        sql_difficulty="medium",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="region",
+        sql_template=(
+            "SELECT id, names.\"primary\" AS name, subtype, country,"
+            "       ST_AsGeoJSON(geometry) AS geometry"
+            " FROM read_parquet('divisions_area')"
+            " WHERE country = '{country}'"
+            "   AND is_territorial = TRUE"
+            "   AND subtype = '{target_subtype}'"
+        ),
+        question_hints=[
+            "Territorial {target_subtype}s of {anchor_name}",
+            "Official territorial divisions of {anchor_name}",
+            "Recognised territorial {target_subtype}s belonging to {anchor_name}",
+            "Which territorial regions does {anchor_name} have?",
+        ],
+    ),
+    SQLTemplate(
+        template_id="attr_03",
+        family="attribute_filter",
+        sql_difficulty="medium",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="locality",
+        sql_template=(
+            "SELECT id, names.\"primary\" AS name, subtype, country,"
+            "       ST_AsGeoJSON(geometry) AS geometry"
+            " FROM read_parquet('divisions_area')"
+            " WHERE country = '{country}'"
+            "   AND subtype = '{target_subtype}'"
+            "   AND is_land = TRUE"
+        ),
+        question_hints=[
+            "Land-based {target_subtype}s of {anchor_name}",
+            "{target_subtype}s on the mainland of {anchor_name}",
+            "All {target_subtype}s on land in {anchor_name}",
+            "Non-island {target_subtype}s of {anchor_name}",
+        ],
+    ),
+    # ── NATURAL EARTH ADJACENCY ─────────────────────────────────────────────
+    # Division anchor, natural_earth targets. The handler supplies anchor_id and
+    # target_subtype, but the SQL ignores target_subtype and hardcodes the NE
+    # subtypes (like adj_03).
+    SQLTemplate(
+        template_id="adj_04",
+        family="adjacency",
+        sql_difficulty="medium",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="river",
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT n.id, n.names.\"primary\" AS name, n.subtype,"
+            "        ST_AsGeoJSON(n.geometry) AS geometry"
+            " FROM read_parquet('natural_earth') AS n, a"
+            " WHERE n.subtype IN ('River', 'Lake', 'Basin')"
+            "   AND ST_Intersects(a.geometry, n.geometry)"
+        ),
+        question_hints=[
+            "What rivers or lakes are in {anchor_name}?",
+            "Natural water features of {anchor_name}",
+            "Which rivers flow through {anchor_name}?",
+            "Lakes and rivers within {anchor_name}",
+            "Water features inside {anchor_name}",
+            "What bodies of water cross {anchor_name}?",
+            "Rivers of {anchor_name}",
+            "Show me the lakes in {anchor_name}",
+        ],
+    ),
+    SQLTemplate(
+        template_id="adj_05",
+        family="adjacency",
+        sql_difficulty="medium",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="range",
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT n.id, n.names.\"primary\" AS name, n.subtype,"
+            "        ST_AsGeoJSON(n.geometry) AS geometry"
+            " FROM read_parquet('natural_earth') AS n, a"
+            " WHERE n.subtype IN ('Range/Mts', 'Terrain area', 'Peninsula', 'Depression')"
+            "   AND ST_Intersects(a.geometry, n.geometry)"
+        ),
+        question_hints=[
+            "What mountain ranges are in {anchor_name}?",
+            "Terrain features of {anchor_name}",
+            "Which mountain ranges cross {anchor_name}?",
+            "Landforms inside {anchor_name}",
+            "Peninsulas and ranges in {anchor_name}",
+            "Geographic features within {anchor_name}",
+            "Mountains of {anchor_name}",
+            "What terrain does {anchor_name} contain?",
+        ],
+    ),
+    # ── NATURAL EARTH INTERSECTION ──────────────────────────────────────────
+    # intersect_03: NE anchor, finding overlapping regions (vs countries in
+    # intersect_02). Uses cross_source_relations handler.
+    # intersect_04: division anchor, finding NE features that overlap it.
+    # Uses intersection_pairs handler (extra NE subtypes ignored in SQL).
+    SQLTemplate(
+        template_id="intersect_03",
+        family="intersection",
+        sql_difficulty="medium-hard",
+        anchor_source="natural_earth",
+        num_anchors=1,
+        target_subtype="region",
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('natural_earth') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name, b.subtype, b.country,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, a"
+            " WHERE b.subtype = '{target_subtype}'"
+            "   AND ST_Intersects(b.geometry, a.geometry)"
+        ),
+        question_hints=[
+            "Which regions does the {anchor_name} pass through?",
+            "What administrative regions overlap with the {anchor_name}?",
+            "Regions that the {anchor_name} crosses",
+            "Administrative areas intersected by the {anchor_name}",
+            "What provinces does the {anchor_name} span?",
+            "Regions along the {anchor_name}",
+            "Which provinces overlap the {anchor_name}?",
+        ],
+    ),
+    SQLTemplate(
+        template_id="intersect_04",
+        family="intersection",
+        sql_difficulty="medium-hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="region",
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT n.id, n.names.\"primary\" AS name, n.subtype,"
+            "        ST_AsGeoJSON(n.geometry) AS geometry"
+            " FROM read_parquet('natural_earth') AS n, a"
+            " WHERE ST_Intersects(n.geometry, a.geometry)"
+        ),
+        question_hints=[
+            "What natural features intersect {anchor_name}?",
+            "Natural earth features that overlap {anchor_name}",
+            "Which geographic features cross {anchor_name}?",
+            "Everything from natural earth that touches {anchor_name}",
+            "What geographic features does {anchor_name} contain?",
+            "Natural features within or crossing {anchor_name}",
+        ],
+    ),
+    # ── NATURAL EARTH CHAINED ───────────────────────────────────────────────
+    # chained_04: localities in a region that intersect a river, lake, or basin.
+    # chained_05: localities in a region that intersect a mountain range or
+    # depression.
+    SQLTemplate(
+        template_id="chained_04",
+        family="chained",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="locality",
+        sql_template=(
+            "WITH region AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name, b.subtype, b.country,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, region"
+            " WHERE b.subtype = '{target_subtype}'"
+            "   AND ST_Within(b.geometry, region.geometry)"
+            "   AND EXISTS ("
+            "     SELECT 1 FROM read_parquet('natural_earth') AS n"
+            "     WHERE n.subtype IN ('River', 'Lake', 'Basin')"
+            "       AND ST_Intersects(b.geometry, n.geometry)"
+            "   )"
+        ),
+        question_hints=[
+            "Riverside {target_subtype}s in {anchor_name}",
+            "{target_subtype}s in {anchor_name} near a river or lake",
+            "Which {target_subtype}s in {anchor_name} are on a waterway?",
+            "Lakeside or riverside {target_subtype}s within {anchor_name}",
+            "{target_subtype}s in {anchor_name} that touch a river",
+            "Which {target_subtype}s in {anchor_name} are on a lake?",
+            "Waterfront {target_subtype}s of {anchor_name}",
+        ],
+    ),
+    SQLTemplate(
+        template_id="chained_05",
+        family="chained",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="locality",
+        sql_template=(
+            "WITH region AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name, b.subtype, b.country,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, region"
+            " WHERE b.subtype = '{target_subtype}'"
+            "   AND ST_Within(b.geometry, region.geometry)"
+            "   AND EXISTS ("
+            "     SELECT 1 FROM read_parquet('natural_earth') AS n"
+            "     WHERE n.subtype IN ('Range/Mts', 'Depression')"
+            "       AND ST_Intersects(b.geometry, n.geometry)"
+            "   )"
+        ),
+        question_hints=[
+            "Mountain {target_subtype}s in {anchor_name}",
+            "{target_subtype}s in {anchor_name} on a mountain range",
+            "Which {target_subtype}s in {anchor_name} are in the mountains?",
+            "Highland {target_subtype}s within {anchor_name}",
+            "{target_subtype}s of {anchor_name} in mountainous terrain",
+            "{target_subtype}s in {anchor_name} near a mountain range",
+        ],
+    ),
+    # ── CHAINED (county-level) ──────────────────────────────────────────────
+    # Same spatial patterns as chained_01..05 but targeting counties/districts
+    # so the model learns "coastal districts of X", "riverside counties", etc.
+    SQLTemplate(
+        template_id="chained_06",
+        family="chained",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="county",
+        sql_template=(
+            "WITH region AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name, b.subtype, b.country,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, region"
+            " WHERE b.subtype = '{target_subtype}'"
+            "   AND ST_Within(b.geometry, region.geometry)"
+            "   AND EXISTS ("
+            "     SELECT 1 FROM read_parquet('natural_earth') AS n"
+            "     WHERE n.subtype IN ('ocean', 'sea')"
+            "       AND ST_Intersects(b.geometry, n.geometry)"
+            "   )"
+        ),
+        question_hints=[
+            "Coastal {target_subtype}s of {anchor_name}",
+            "Which districts of {anchor_name} are on the coast?",
+            "{target_subtype}s in {anchor_name} that border the sea",
+            "Seaside {target_subtype}s within {anchor_name}",
+            "{target_subtype}s of {anchor_name} with ocean access",
+            "Which {target_subtype}s in {anchor_name} touch the sea?",
+            "Maritime {target_subtype}s of {anchor_name}",
+        ],
+    ),
+    SQLTemplate(
+        template_id="chained_07",
+        family="chained",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="county",
+        sql_template=(
+            "WITH region AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name, b.subtype, b.country,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, region"
+            " WHERE b.subtype = '{target_subtype}'"
+            "   AND ST_Within(b.geometry, region.geometry)"
+            "   AND NOT EXISTS ("
+            "     SELECT 1 FROM read_parquet('natural_earth') AS n"
+            "     WHERE n.subtype IN ('ocean', 'sea')"
+            "       AND ST_Intersects(b.geometry, n.geometry)"
+            "   )"
+        ),
+        question_hints=[
+            "Landlocked {target_subtype}s of {anchor_name}",
+            "Which districts of {anchor_name} have no coastline?",
+            "Interior {target_subtype}s within {anchor_name}",
+            "{target_subtype}s in {anchor_name} with no sea access",
+            "Non-coastal {target_subtype}s of {anchor_name}",
+            "Inland {target_subtype}s of {anchor_name}",
+        ],
+    ),
+    SQLTemplate(
+        template_id="chained_08",
+        family="chained",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="county",
+        sql_template=(
+            "WITH region AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name, b.subtype, b.country,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, region"
+            " WHERE b.subtype = '{target_subtype}'"
+            "   AND ST_Within(b.geometry, region.geometry)"
+            "   AND EXISTS ("
+            "     SELECT 1 FROM read_parquet('natural_earth') AS n"
+            "     WHERE n.subtype IN ('River', 'Lake', 'Basin')"
+            "       AND ST_Intersects(b.geometry, n.geometry)"
+            "   )"
+        ),
+        question_hints=[
+            "Riverside {target_subtype}s of {anchor_name}",
+            "Which districts of {anchor_name} have a river or lake?",
+            "{target_subtype}s in {anchor_name} on a waterway",
+            "Lakeside {target_subtype}s within {anchor_name}",
+            "{target_subtype}s of {anchor_name} along a river",
+            "Which {target_subtype}s in {anchor_name} border a lake?",
+        ],
+    ),
+    SQLTemplate(
+        template_id="chained_09",
+        family="chained",
+        sql_difficulty="hard",
+        anchor_source="divisions_area",
+        num_anchors=1,
+        target_subtype="county",
+        sql_template=(
+            "WITH region AS ("
+            "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name, b.subtype, b.country,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, region"
+            " WHERE b.subtype = '{target_subtype}'"
+            "   AND ST_Within(b.geometry, region.geometry)"
+            "   AND EXISTS ("
+            "     SELECT 1 FROM read_parquet('natural_earth') AS n"
+            "     WHERE n.subtype IN ('Range/Mts', 'Depression')"
+            "       AND ST_Intersects(b.geometry, n.geometry)"
+            "   )"
+        ),
+        question_hints=[
+            "Mountain {target_subtype}s of {anchor_name}",
+            "Which districts of {anchor_name} are in the mountains?",
+            "{target_subtype}s in {anchor_name} on a mountain range",
+            "Highland {target_subtype}s within {anchor_name}",
+            "{target_subtype}s of {anchor_name} in mountainous terrain",
+            "Which {target_subtype}s in {anchor_name} have mountain ranges?",
+        ],
+    ),
+    # ── NATURAL EARTH CONTAINMENT ───────────────────────────────────────────
+    # contain_04: NE anchor (sea/gulf/bay), find countries that touch it.
+    # Uses containment handler via containment_pairs.
+    SQLTemplate(
+        template_id="contain_04",
+        family="containment",
+        sql_difficulty="medium",
+        anchor_source="natural_earth",
+        num_anchors=1,
+        target_subtype="country",
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT geometry FROM read_parquet('natural_earth') WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT b.id, b.names.\"primary\" AS name, b.subtype,"
+            "        ST_AsGeoJSON(b.geometry) AS geometry"
+            " FROM read_parquet('divisions_area') AS b, a"
+            " WHERE b.subtype = '{target_subtype}'"
+            "   AND ST_Intersects(b.geometry, a.geometry)"
+        ),
+        question_hints=[
+            "Which countries border the {anchor_name}?",
+            "What countries are along the {anchor_name}?",
+            "Countries surrounding the {anchor_name}",
+            "Nations on the {anchor_name}",
+            "Which countries touch the {anchor_name}?",
+            "Countries with coastline on the {anchor_name}",
+            "What nations lie on the {anchor_name}?",
+        ],
+    ),
+    # ── NATURAL EARTH BUFFER ────────────────────────────────────────────────
+    # buffer_05: NE anchor, find other NE features within a buffer distance.
+    # Uses buffer handler for natural_earth.
+    SQLTemplate(
+        template_id="buffer_05",
+        family="buffer",
+        sql_difficulty="hard",
+        anchor_source="natural_earth",
+        num_anchors=1,
+        requires_buffer=True,
+        sql_template=(
+            "WITH a AS ("
+            "  SELECT ST_Buffer(geometry, {buffer_km} * 1000.0 / 111320.0) AS geom"
+            "  FROM read_parquet('natural_earth')"
+            "  WHERE id = '{anchor_id}'"
+            ")"
+            " SELECT n.id, n.names.\"primary\" AS name, n.subtype,"
+            "        ST_AsGeoJSON(n.geometry) AS geometry"
+            " FROM read_parquet('natural_earth') AS n, a"
+            " WHERE ST_Intersects(n.geometry, a.geom)"
+        ),
+        question_hints=[
+            "Natural features within {buffer_km} km of the {anchor_name}",
+            "What is within {buffer_km} km of the {anchor_name}?",
+            "Geographic features near the {anchor_name} within {buffer_km} km",
+            "Everything within {buffer_km} km of the {anchor_name}",
+            "What natural features are close to the {anchor_name}?",
+            "{buffer_km} km radius around the {anchor_name}",
+        ],
+    ),
+]
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+def get_templates_by_family(family: str) -> List[SQLTemplate]:
+    """Return all templates for a specific task family."""
+    return [t for t in TEMPLATES if t.family == family]
+def get_template_by_id(template_id: str) -> SQLTemplate:
+    """Return a template by its ID, raising ValueError if not found."""
+    for t in TEMPLATES:
+        if t.template_id == template_id:
+            return t
+    raise ValueError(f"Template '{template_id}' not found")
+if __name__ == "__main__":
+    families: dict = {}
+    for t in TEMPLATES:
+        families[t.family] = families.get(t.family, 0) + 1
+    print("SQL Template Catalog")
+    print("=" * 60)
+    for family, count in sorted(families.items()):
+        print(f"{family:20s}: {count:2d} templates")
+    print(f"{'TOTAL':20s}: {len(TEMPLATES):2d} templates")
+    # Check that every template's SQL contains ST_AsGeoJSON (a substring check;
+    # it does not prove that the final SELECT is the place where it is used)
+    print()
+    print("Geometry output check (all should show ST_AsGeoJSON)")
+    print("=" * 60)
+    for t in TEMPLATES:
+        has_geojson = "ST_AsGeoJSON" in t.sql_template
+        status = "OK" if has_geojson else "MISSING"
+        print(f"  {t.template_id:20s}: {status}")
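
The placeholders in `sql_template` (`{anchor_id}`, `{target_subtype}`, `{buffer_km}`) look like plain `str.format` fields. The sketch below shows how one template might be filled under that assumption. It is not the code in `generate_samples.py`, and `render_template`, `km_to_degrees` and the sample id are hypothetical names. It also spells out the kilometre-to-degree conversion that `buffer_05` embeds in its SQL.

```python
# Trimmed copy of the buffer_05 template from the diff above.
BUFFER_TEMPLATE = (
    "WITH a AS ("
    "  SELECT ST_Buffer(geometry, {buffer_km} * 1000.0 / 111320.0) AS geom"
    "  FROM read_parquet('natural_earth')"
    "  WHERE id = '{anchor_id}'"
    ")"
    " SELECT n.id FROM read_parquet('natural_earth') AS n, a"
    " WHERE ST_Intersects(n.geometry, a.geom)"
)

METERS_PER_DEGREE = 111_320.0  # same constant the template uses


def km_to_degrees(km: float) -> float:
    """Convert kilometres to degrees using the template's flat approximation."""
    return km * 1000.0 / METERS_PER_DEGREE


def render_template(template: str, **params: object) -> str:
    """Hypothetical helper: fill the named placeholders with str.format."""
    return template.format(**params)


sql = render_template(BUFFER_TEMPLATE, anchor_id="ne_0001", buffer_km=50)
buffer_degrees = km_to_degrees(50)
```

Because the buffer is expressed in degrees, the distance it covers east to west shrinks with latitude, so `{buffer_km}` is only approximate away from the equator.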

dataset/scripts/validate_dataset.py ADDED Viewed

	@@ -0,0 +1,309 @@

+"""
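+Validate the generated dataset and report how balanced it is.
+This script:
+1. Loads all generated samples
+2. Validates SQL executability (skipped for samples already marked sql_verified)
+3. Checks candidate list quality, including duplicate candidates
+4. Normalizes SQL paths back to symbolic names
+5. Drops invalid samples
+6. Generates dataset statistics (counts per task family and difficulty)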
+Output:
+- output/dataset_validated.jsonl
+- output/dataset_stats.json
+"""
+import json
+from pathlib import Path
+from typing import List, Dict, Any, Tuple
+from collections import Counter
+from concurrent.futures import ProcessPoolExecutor, as_completed
+import duckdb
+import pandas as pd
+from gazet.config import DIVISIONS_AREA_PATH, NATURAL_EARTH_PATH
+def load_samples(jsonl_path: Path) -> List[Dict[str, Any]]:
+    """Load samples from JSONL file."""
+    samples = []
+    with open(jsonl_path, 'r') as f:
+        for line in f:
+            samples.append(json.loads(line))
+    return samples
+def _resolve_paths(sql: str) -> str:
+    """Replace symbolic placeholder paths with actual runtime paths for execution."""
+    sql = sql.replace(
+        "read_parquet('divisions_area')", f"read_parquet('{DIVISIONS_AREA_PATH}')"
+    )
+    sql = sql.replace(
+        "read_parquet('natural_earth')", f"read_parquet('{NATURAL_EARTH_PATH}')"
+    )
+    # Legacy fixed Docker paths from earlier dataset versions
+    sql = sql.replace("/data/overture/division_area/*.parquet",          DIVISIONS_AREA_PATH)
+    sql = sql.replace("/data/overture/divisions_area/*.parquet",         DIVISIONS_AREA_PATH)
+    sql = sql.replace("/data/natural_earth_geoparquet/ne_geography.parquet", NATURAL_EARTH_PATH)
+    return sql
+def _to_symbolic_sql(sql: str) -> str:
+    """Normalize any hardcoded or runtime paths back to symbolic names for storage."""
+    # Current local runtime paths
+    sql = sql.replace(DIVISIONS_AREA_PATH, "divisions_area")
+    sql = sql.replace(NATURAL_EARTH_PATH, "natural_earth")
+    # Legacy Docker paths
+    sql = sql.replace("/data/overture/division_area/*.parquet",          "divisions_area")
+    sql = sql.replace("/data/overture/divisions_area/*.parquet",         "divisions_area")
+    sql = sql.replace("/data/natural_earth_geoparquet/ne_geography.parquet", "natural_earth")
+    return sql
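
The two helpers above are meant to be inverses for SQL that uses the symbolic names. A minimal sketch of that round trip follows, assuming made-up runtime paths (the real ones come from `gazet.config`) and leaving out the legacy Docker paths. `SYMBOLIC_TO_PATH`, `resolve_paths` and `to_symbolic_sql` are illustrative names, not the functions in the diff.

```python
# Hypothetical runtime paths; the real values are imported from gazet.config.
DIVISIONS_AREA_PATH = "/tmp/gazet/divisions_area/*.parquet"
NATURAL_EARTH_PATH = "/tmp/gazet/natural_earth.parquet"

SYMBOLIC_TO_PATH = {
    "divisions_area": DIVISIONS_AREA_PATH,
    "natural_earth": NATURAL_EARTH_PATH,
}


def resolve_paths(sql: str) -> str:
    """Swap symbolic read_parquet() names for runtime paths before execution."""
    for name, path in SYMBOLIC_TO_PATH.items():
        sql = sql.replace(f"read_parquet('{name}')", f"read_parquet('{path}')")
    return sql


def to_symbolic_sql(sql: str) -> str:
    """Swap runtime paths back to symbolic names before storing the sample."""
    for name, path in SYMBOLIC_TO_PATH.items():
        sql = sql.replace(path, name)
    return sql


stored = "SELECT id FROM read_parquet('divisions_area') WHERE subtype = 'country'"
runnable = resolve_paths(stored)
restored = to_symbolic_sql(runnable)
```

Keeping the stored SQL symbolic means the dataset does not depend on where the parquet files live on the machine that generated it.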
+def validate_sql(con: duckdb.DuckDBPyConnection, sql: str) -> tuple[bool, str]:
+    """Validate that SQL executes without error and returns at least one row.
+    Resolves symbolic path placeholders to actual runtime paths before execution.
+    """
+    try:
+        result = con.execute(_resolve_paths(sql)).fetchdf()
+        if result.empty:
+            return False, "Empty result"
+        return True, "OK"
+    except Exception as e:
+        return False, str(e)
+def validate_candidates(sample: Dict[str, Any]) -> tuple[bool, str]:
+    """Validate candidate list quality."""
+    candidates = sample['candidates']
+    selected = sample['target']['selected_candidates']
+    # Check we have candidates
+    if not candidates:
+        return False, "No candidates"
+    # Check selected candidates exist
+    candidate_ids = {c['candidate_id'] for c in candidates}
+    for sel_id in selected:
+        if sel_id not in candidate_ids:
+            return False, f"Selected candidate {sel_id} not in candidate list"
+    # Check for duplicates
+    ids = [c['id'] for c in candidates]
+    if len(ids) != len(set(ids)):
+        return False, "Duplicate candidates"
+    return True, "OK"
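
For reference, this is the smallest sample shape the validator appears to read, inferred only from the keys accessed in this file; the real records from `generate_samples.py` may carry more fields, and every value below is made up. Note the two identifiers: `candidate_id` is what selections are matched against, while `id` is what the duplicate check compares. `check_candidates` restates the three checks above so the example runs on its own.

```python
from typing import Any, Dict, Tuple

# Hypothetical sample holding only the fields the validator reads.
SAMPLE: Dict[str, Any] = {
    "id": "sample-0001",
    "question": "Which countries border the Example Sea?",
    "candidates": [
        {"candidate_id": "c1", "id": "ne-0001", "subtype": "sea"},
        {"candidate_id": "c2", "id": "ne-0002", "subtype": "gulf"},
    ],
    "target": {"selected_candidates": ["c1"], "sql": "SELECT 1"},
    "metadata": {"sql_verified": True},
}


def check_candidates(sample: Dict[str, Any]) -> Tuple[bool, str]:
    """Restatement of the three candidate checks in the diff above."""
    candidates = sample["candidates"]
    if not candidates:
        return False, "No candidates"
    known = {c["candidate_id"] for c in candidates}
    for selected in sample["target"]["selected_candidates"]:
        if selected not in known:
            return False, f"Selected candidate {selected} not in candidate list"
    ids = [c["id"] for c in candidates]
    if len(ids) != len(set(ids)):
        return False, "Duplicate candidates"
    return True, "OK"


ok = check_candidates(SAMPLE)

# Selecting an id that is not in the candidate list fails the second check.
broken = {**SAMPLE, "target": {"selected_candidates": ["c9"], "sql": "SELECT 1"}}
bad = check_candidates(broken)
```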
+def validate_sample(con: duckdb.DuckDBPyConnection, sample: Dict[str, Any]) -> tuple[bool, List[str]]:
+    """Validate a single sample. Returns (is_valid, list_of_issues)."""
+    issues = []
+    # Skip SQL re-execution if already verified during generation
+    if not sample.get('metadata', {}).get('sql_verified', False):
+        sql_valid, sql_msg = validate_sql(con, sample['target']['sql'])
+        if not sql_valid:
+            issues.append(f"SQL: {sql_msg}")
+    # Validate candidates
+    cand_valid, cand_msg = validate_candidates(sample)
+    if not cand_valid:
+        issues.append(f"Candidates: {cand_msg}")
+    # Check question exists
+    if not sample.get('question') or len(sample['question'].strip()) == 0:
+        issues.append("Empty question")
+    return len(issues) == 0, issues
+def validate_sample_worker(sample: Dict[str, Any]) -> Tuple[str, bool, List[str], Dict[str, Any] | None]:
+    """Worker function for parallel validation.
+    Returns (sample_id, is_valid, issues, sample); sample is None when validation fails.
+    """
+    # Each worker creates its own DuckDB connection
+    con = duckdb.connect()
+    con.execute("SET enable_progress_bar=false")
+    con.execute("INSTALL spatial")
+    con.execute("LOAD spatial")
+    try:
+        is_valid, issues = validate_sample(con, sample)
+        con.close()
+        if is_valid:
+            sample['target']['sql'] = _to_symbolic_sql(sample['target']['sql'])
+        return (sample['id'], is_valid, issues, sample if is_valid else None)
+    except Exception as e:
+        con.close()
+        return (sample['id'], False, [f"Validation error: {str(e)}"], None)
+def compute_statistics(samples: List[Dict[str, Any]]) -> Dict[str, Any]:
+    """Compute dataset statistics."""
+    stats = {
+        'total_samples': len(samples),
+        'task_families': {},
+        'sql_difficulty': {},
+        'grounding_difficulty': {},
+        'anchor_sources': {},
+        'avg_candidates_per_sample': 0,
+        'avg_question_length': 0,
+        'countries_covered': set(),
+        'subtypes_covered': set()
+    }
+    total_candidates = 0
+    total_question_length = 0
+    for sample in samples:
+        meta = sample['metadata']
+        # Count by family
+        family = meta['task_family']
+        stats['task_families'][family] = stats['task_families'].get(family, 0) + 1
+        # Count by SQL difficulty
+        sql_diff = meta['sql_difficulty']
+        stats['sql_difficulty'][sql_diff] = stats['sql_difficulty'].get(sql_diff, 0) + 1
+        # Count by grounding difficulty
+        ground_diff = meta['grounding_difficulty']
+        stats['grounding_difficulty'][ground_diff] = stats['grounding_difficulty'].get(ground_diff, 0) + 1
+        # Count by anchor source
+        anchor_src = meta['anchor_source']
+        stats['anchor_sources'][anchor_src] = stats['anchor_sources'].get(anchor_src, 0) + 1
+        # Candidates
+        total_candidates += len(sample['candidates'])
+        # Question length
+        total_question_length += len(sample['question'].split())
+        # Countries and subtypes (from selected/answer candidates only)
+        selected_ids = set(sample.get('target', {}).get('selected_candidates', []))
+        for cand in sample['candidates']:
+            if cand['candidate_id'] in selected_ids:
+                if cand.get('country'):
+                    stats['countries_covered'].add(cand['country'])
+                if cand.get('subtype'):
+                    stats['subtypes_covered'].add(cand['subtype'])
+    stats['avg_candidates_per_sample'] = total_candidates / len(samples) if samples else 0
+    stats['avg_question_length'] = total_question_length / len(samples) if samples else 0
+    stats['countries_covered'] = sorted(list(stats['countries_covered']))
+    stats['subtypes_covered'] = sorted(list(stats['subtypes_covered']))
+    return stats
+def main():
+    """Validate and analyze dataset."""
+    script_dir = Path(__file__).parent
+    output_dir = script_dir.parent / "output"
+    raw_file = output_dir / "dataset_raw.jsonl"
+    validated_file = output_dir / "dataset_validated.jsonl"
+    stats_file = output_dir / "dataset_stats.json"
+    if not raw_file.exists():
+        print(f"Error: {raw_file} not found. Run generate_samples.py first.")
+        return
+    # Load samples
+    print("Loading samples...")
+    samples = load_samples(raw_file)
+    print(f"Loaded {len(samples)} samples")
+    # Validate samples in parallel
+    print("\nValidating samples in parallel...")
+    valid_samples = []
+    invalid_samples = []
+    with ProcessPoolExecutor(max_workers=8) as executor:
+        # Submit all validation tasks
+        futures = {executor.submit(validate_sample_worker, sample): sample for sample in samples}
+        # Collect results as they complete
+        completed = 0
+        for future in as_completed(futures):
+            sample_id, is_valid, issues, validated_sample = future.result()
+            if is_valid:
+                valid_samples.append(validated_sample)
+            else:
+                invalid_samples.append((sample_id, issues))
+            completed += 1
+            if completed % 50 == 0 or completed == len(samples):
+                print(f"\r  Progress: {completed}/{len(samples)} ", end='', flush=True)
+        print()  # New line after progress
+    print(f"\nValidation results:")
+    print(f"  Valid: {len(valid_samples)}")
+    print(f"  Invalid: {len(invalid_samples)}")
+    if invalid_samples:
+        if len(invalid_samples) <= 20:
+            print("\nInvalid samples:")
+        else:
+            print(f"\n{len(invalid_samples)} invalid samples (showing first 20):")
+        for sample_id, issues in invalid_samples[:20]:
+            print(f"  {sample_id}: {', '.join(issues)}")
+    # Save validated samples
+    if valid_samples:
+        with open(validated_file, 'w') as f:
+            for sample in valid_samples:
+                f.write(json.dumps(sample) + '\n')
+        print(f"\nSaved {len(valid_samples)} valid samples to {validated_file}")
+    # Compute statistics
+    print("\nComputing statistics...")
+    stats = compute_statistics(valid_samples)
+    # Save statistics
+    # Convert sets to lists for JSON serialization
+    stats_json = {k: (list(v) if isinstance(v, set) else v) for k, v in stats.items()}
+    with open(stats_file, 'w') as f:
+        json.dump(stats_json, f, indent=2)
+    print(f"Saved statistics to {stats_file}")
+    # Print summary
+    print("\n" + "=" * 60)
+    print("DATASET STATISTICS")
+    print("=" * 60)
+    print(f"\nTotal samples: {stats['total_samples']}")
+    print("\nTask families:")
+    for family, count in sorted(stats['task_families'].items()):
+        print(f"  {family:20s}: {count:3d}")
+    print("\nSQL difficulty:")
+    for diff, count in sorted(stats['sql_difficulty'].items()):
+        print(f"  {diff:20s}: {count:3d}")
+    print("\nGrounding difficulty:")
+    for diff, count in sorted(stats['grounding_difficulty'].items()):
+        print(f"  {diff:20s}: {count:3d}")
+    print("\nAnchor sources:")
+    for src, count in sorted(stats['anchor_sources'].items()):
+        print(f"  {src:20s}: {count:3d}")
+    print(f"\nAverage candidates per sample: {stats['avg_candidates_per_sample']:.1f}")
+    print(f"Average question length (words): {stats['avg_question_length']:.1f}")
+    print(f"Countries covered: {len(stats['countries_covered'])}")
+    print(f"Subtypes covered: {len(stats['subtypes_covered'])}")
+    print("\n✓ Validation complete")
+if __name__ == "__main__":
+    main()

docker-compose.yml ADDED Viewed

	@@ -0,0 +1,41 @@

+services:
+  llama:
+    image: ghcr.io/ggml-org/llama.cpp:server
+    volumes:
+      - ./finetune/models/qwen-base-run/ckpt-001.gguf:/models/model.gguf:ro
+    command: >
+      -m /models/model.gguf
+      --port 9000
+      --host 0.0.0.0
+      --ctx-size 2048
+      -t 4
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:9000/health"]
+      interval: 10s
+      timeout: 5s
+      retries: 30
+      start_period: 30s
+  app:
+    build: .
+    volumes:
+      - ./data:/data:ro
+    environment:
+      GAZET_DATA_DIR: /data
+      LLAMA_SERVER_URL: http://llama:9000
+    ports:
+      - "8000:8000"
+    command: uvicorn gazet.api:app --host 0.0.0.0 --port 8000
+    depends_on:
+      llama:
+        condition: service_healthy
+  demo:
+    build: .
+    environment:
+      GAZET_API_URL: http://app:8000
+    ports:
+      - "8501:8501"
+    command: streamlit run gazet_demo.py --server.port 8501 --server.address 0.0.0.0
+    depends_on:
+      - app
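
When running the services outside Docker, the `healthcheck` plus `depends_on: condition: service_healthy` behaviour can be approximated with a small retry loop. The sketch below is an assumption about how one might do that, not part of this commit; `wait_until_healthy` is a hypothetical helper, and its defaults copy the `retries` and `interval` values from the compose file. The probe is injected so the example runs without a server.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    retries: int = 30,
    interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to `retries` times, sleeping `interval` seconds between tries."""
    for attempt in range(retries):
        if probe():
            return True
        if attempt < retries - 1:
            sleep(interval)
    return False


# Fake probe that succeeds on its third call, standing in for a real HTTP check.
calls = {"count": 0}


def fake_probe() -> bool:
    calls["count"] += 1
    return calls["count"] >= 3


became_healthy = wait_until_healthy(
    fake_probe, retries=5, interval=0.0, sleep=lambda _seconds: None
)
```

A real probe could wrap a request to the server's `/health` route, which is what `check_server` in `finetune/eval_cli.py` does.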

finetune/README.md ADDED Viewed

	@@ -0,0 +1,291 @@

+# Fine-tuning and Inference
+LoRA fine-tuning of Qwen3.5-0.8B (via Unsloth) to perform two geospatial
+tasks (text-to-SQL and place extraction), then serving locally via
+llama-server.
+---
+## End-to-end workflow
+```
+0. Generate dataset       →  dataset/  (see dataset/README.md)
+1. Check token lengths    →  check_token_lengths.py
+2. Train on Modal         →  train_modal_qwen35.py
+3. Convert to GGUF        →  llama.cpp
+4. Serve locally          →  llama-server
+5. Eval locally           →  eval_cli.py (interactive or batch) + eval_demo.py
+```
+---
+## Step 1 — Check token lengths
+Before training, verify that your `max_length` setting covers the data.
+SQL samples are long (schema + candidates + SQL), places samples are short.
+```bash
+modal run finetune/check_token_lengths.py
+modal run finetune/check_token_lengths.py --run-dir /mnt/gazet/data/v1
+```
+This prints per-split statistics (min, max, P95, P99) and recommends a
+`max_length` value. Adjust `--max-seq-length` in `train_modal_qwen35.py` accordingly.
+---
+## Step 2 — Train (Qwen3.5 + Unsloth)
+Training runs on Modal with an A100-80GB GPU. The script loads both SQL and
+places JSONL files from the run directory, applies the Qwen3.5 ChatML
+template, and trains a LoRA adapter using Unsloth's
+`train_on_responses_only` to mask non-assistant tokens.
+```bash
+# Default settings (Qwen3.5-0.8B, r=16, 1 epoch)
+modal run finetune/train_modal_qwen35.py --experiment-name qwen35-v1
+# Override any config field from CLI
+modal run finetune/train_modal_qwen35.py \
+    --experiment-name qwen35-v1 \
+    --base-model unsloth/Qwen3.5-0.8B \
+    --num-train-epochs 3 \
+    --lora-r 32 \
+    --max-seq-length 2048
+# Quick smoke test
+modal run finetune/train_modal_qwen35.py --experiment-name qwen35-v1 --max-train-samples 100
+```
+All CLI overrides: `--base-model`, `--experiment-name`, `--run-dir`,
+`--num-train-epochs`, `--per-device-train-batch-size`, `--max-train-samples`,
+`--max-eval-samples`, `--lora-r`, `--max-seq-length`. When `--lora-r` is
+overridden, `lora_alpha` is automatically set to `2 * r`.
+### Training config defaults (`Qwen35Config`)
+```
+base_model:       unsloth/Qwen3.5-0.8B
+run_dir:          /mnt/gazet/data/v1
+lora_r:           16
+lora_alpha:       32       (2 * r, Unsloth recommendation for Qwen)
+lora_dropout:     0.0
+num_train_epochs: 1
+batch_size:       32 (x 1 gradient accumulation = 32 effective)
+learning_rate:    1e-4
+lr_scheduler:     linear
+optim:            adamw_8bit
+max_seq_length:   2048
+```
+### Output
+Checkpoints and the merged model are saved to the Modal volume:
+```
+/mnt/gazet/checkpoints/{experiment_name}/
+    adapter_config.json       # LoRA adapter
+    adapter_model.safetensors
+    checkpoint-*/             # intermediate checkpoints
+    merged/                   # full merged 16-bit model
+        model.safetensors
+        tokenizer.json
+```
+Pass `--experiment-name` to set a human-readable name (e.g. `qwen35-v1`).
+If omitted, it is auto-generated as `{model}-r{lora_r}-{timestamp}`.
+Training metrics are logged to [trackio](https://huggingface.co/spaces/srmsoumya/gazet-trackio).
+---
+## Step 3 — Convert merged model to GGUF
+After training, download the merged model from Modal and convert to GGUF
+for local inference with llama-server.
+```bash
+# Download from Modal volume
+modal volume get gazet checkpoints/qwen35-v1/merged ./finetune/models/merged
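+# Convert to GGUF. Run this from a llama.cpp checkout that sits next to this repo,
+# and point the input path at the directory the merged model was downloaded into.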
+uv run \
+    --no-project \
+    --with transformers \
+    --with sentencepiece \
+    --with protobuf \
+    --with torch \
+    python convert_hf_to_gguf.py \
+    ../gazet/finetune/models/qwen-base/merged \
+    --outtype q8_0 \
+    --outfile ../gazet/finetune/models/qwen-base/ckpt-001.gguf
+```
+---
+## Step 4 — Serve with llama-server
+### Local
+```bash
+llama-server \
+    -m finetune/models/qwen-base/ckpt-001.gguf \
+    -ngl 99 \
+    --port 9000 \
+    --ctx-size 2048
+```
+`--ctx-size` is the total KV cache shared across all parallel slots. SQL
+prompts can be ~600 tokens; with `--parallel 4` and up to 2048 output
+tokens, use at least `8192`. Match `--parallel` to `--workers` in
+`eval_cli.py`.
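
The sizing advice above can be checked with a little arithmetic. This sketch assumes the context is split evenly across slots, as the paragraph describes; llama.cpp builds may differ in how they share the cache, so treat the numbers as a planning aid. Both function names are hypothetical.

```python
def per_slot_context(ctx_size: int, parallel: int) -> int:
    """Tokens available to each slot if --ctx-size is divided evenly."""
    return ctx_size // parallel


def required_ctx_size(prompt_tokens: int, max_output_tokens: int, parallel: int) -> int:
    """Total --ctx-size needed so every slot fits a full prompt plus a full output."""
    return (prompt_tokens + max_output_tokens) * parallel


# Figures from the README: ~600-token prompts, 2048 output tokens, 4 slots.
per_slot = per_slot_context(8192, 4)
output_room = per_slot - 600
worst_case = required_ctx_size(600, 2048, 4)
```

Under that assumption, `8192` leaves each of 4 slots 2048 tokens, enough for a 600-token prompt and about 1448 generated tokens. Fitting the full 2048-token output in every slot would need 10592.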
+### Docker (CPU-only)
+Useful for testing inference in a constrained environment. Adjust `--cpus`
+and `--memory` to simulate deployment targets. Set `-t` to match `--cpus`.
+```bash
+docker run \
+    --cpus="2" --memory="4g" \
+    -v $(pwd)/finetune/models:/models \
+    -p 9000:9000 \
+    ghcr.io/ggml-org/llama.cpp:server \
+        -m /models/qwen-base/ckpt-001.gguf \
+        --port 9000 --host 0.0.0.0 \
+        --ctx-size 2048 -t 2 -v
+```
+Notes:
+- `--host 0.0.0.0` is required so the port forward from Docker works
+- `-v` (verbose) enables per-request timing logs (prompt eval t/s, generation t/s)
+- `-ngl` is omitted since the default Docker image is CPU-only; for GPU use
+  the CUDA image (`ghcr.io/ggml-org/llama.cpp:server-cuda`) with `--gpus`
+- The model is memory-mapped by default (`mmap = true`), so containers with
+  less RAM than the model size may still start but will be slow due to page
+  thrashing
+The server exposes `/v1/chat/completions` (chat API) on
+`http://localhost:9000`. All eval scripts use this endpoint.
+---
+## Step 5 — Evaluate
+Two evaluation tools, both using a locally running llama-server.
+### Interactive or batch eval (`eval_cli.py`)
+Requires llama-server running on port 9000 (see Step 4).
+**Interactive** — spot-check individual samples:
+```bash
+uv run finetune/eval_cli.py              # prompts for sample index
+uv run finetune/eval_cli.py 0 5 12       # run specific samples
+uv run finetune/eval_cli.py --task places 0 5
+uv run finetune/eval_cli.py -v 0         # print full prompt
+```
+**Batch** — run the full split and save a JSON results file:
+```bash
+# Full val set, SQL task
+uv run finetune/eval_cli.py --all --label finetuned-qwen35
+# Places task
+uv run finetune/eval_cli.py --all --task places --label finetuned-places
+# Limit samples, custom output path
+uv run finetune/eval_cli.py --all --max-samples 100 --output results/eval-v5.json
+# Evaluate test split instead of val
+uv run finetune/eval_cli.py --all --split test --label finetuned-qwen35
+```
+All batch CLI args:
+| Arg | Default | Description |
+|-----|---------|-------------|
+| `--all` | off | Enable batch mode |
+| `--label` | `local-gguf` | Label used in the output filename |
+| `--task` | `sql` | `sql` or `places` |
+| `--split` | `val` | Data split to evaluate (`val`, `test`) |
+| `--run-dir` | `dataset/output/runs/v1` | Directory with `{task}/{split}.jsonl` |
+| `--max-samples` | all | Cap the number of samples |
+| `--output` | `eval-{label}-{task}.json` | Output JSON path |
+| `--workers` | `4` | Concurrent requests; match llama-server `--parallel` |
+Results are saved to `results/eval-{label}-{task}.json` with this structure:
+```json
+{
+  "summary": {"label": "...", "task": "sql", "exact_match_rate": 0.85, ...},
+  "results": [
+    {"index": 0, "question": "...", "expected": "...", "predicted": "...", "exact_match": true},
+    ...
+  ]
+}
+```
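
A results file in this shape is easy to post-process. The sketch below recomputes the exact-match rate and lists the failing indices from made-up data. How `eval_cli.py` itself derives `exact_match_rate` is not shown here, so the fraction-of-matches definition is an assumption, and both helper names are hypothetical.

```python
from typing import Any, Dict, List

# Made-up results in the documented shape.
results_doc: Dict[str, Any] = {
    "summary": {"label": "demo", "task": "sql"},
    "results": [
        {"index": 0, "question": "q0", "expected": "SELECT 1",
         "predicted": "SELECT 1", "exact_match": True},
        {"index": 1, "question": "q1", "expected": "SELECT 2",
         "predicted": "SELECT 3", "exact_match": False},
    ],
}


def exact_match_rate(results: List[Dict[str, Any]]) -> float:
    """Fraction of results whose prediction equals the expected output."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["exact_match"]) / len(results)


def mismatched_indices(results: List[Dict[str, Any]]) -> List[int]:
    """Indices worth inspecting by hand or in eval_demo.py."""
    return [r["index"] for r in results if not r["exact_match"]]


rate = exact_match_rate(results_doc["results"])
failed = mismatched_indices(results_doc["results"])
```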
+Config constants at the top of `eval_cli.py`: `SERVER_URL` (default
+`http://localhost:9000`), `MAX_TOKENS` (2048), `TEMPERATURE` (0.6).
+### Visual eval (`eval_demo.py`)
+Streamlit app that loads JSON results from `eval_cli.py --all` and displays
+them interactively. For SQL results, it shows formatted SQL side-by-side,
+a diff view for mismatches, and executes both queries against DuckDB to
+render the geometry on a map. For places results, it shows expected vs
+predicted JSON.
+```bash
+streamlit run finetune/eval_demo.py
+```
+Reads result files from `results/eval-*.json` by default. Override with:
+```bash
+GAZET_EVAL_DIR=/path/to/results streamlit run finetune/eval_demo.py
+```
+Set `GAZET_DATA_DIR` if your parquet data is not in the default `data/` directory.
+---
+## File reference
+| File | What it does |
+|---|---|
+| `train_modal_qwen35.py` | Modal training script — Qwen3.5 LoRA fine-tuning with Unsloth |
+| `check_token_lengths.py` | Modal script to analyze token length distribution before training |
+| `eval_cli.py` | Local eval — interactive spot-check or full batch mode via llama-server |
+| `eval_demo.py` | Streamlit app — visual diff + map rendering of `eval_cli.py --all` results |
+| `models/` | GGUF model files for local llama-server inference |
+---
+## Data format
+The Qwen3.5 training pipeline (`train_modal_qwen35.py`) expects data in
+**messages format**:
+```json
+{
+  "messages": [
+    {"role": "system", "content": "You are a text to SQL query translator..."},
+    {"role": "user",   "content": "GIVEN the <SCHEMA_DETAILS>..."},
+    {"role": "assistant", "content": "SELECT ST_AsGeoJSON(geometry) ..."}
+  ]
+}
+```
+The Qwen3.5 chat template (ChatML) is applied by the tokenizer. Unsloth's
+`train_on_responses_only` then masks everything before the assistant
+response marker (`<|im_start|>assistant\n<think>\n\n</think>\n\n`), so
+loss is computed only on the completion tokens.
+SQL in the training data uses symbolic path placeholders
+(`read_parquet('divisions_area')`) instead of real file paths. At inference
+time, `src/gazet/sql.py` replaces these with actual runtime paths before
+executing against DuckDB.
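
A minimal sketch of building one training record in the messages format and splitting it back into prompt and expected completion, the way the eval script does. The prompt texts are shortened placeholders, and `make_record` is a hypothetical helper rather than the exporter in `dataset/scripts/export_training_data.py`.

```python
import json
from typing import Dict, List

Record = Dict[str, List[Dict[str, str]]]


def make_record(system: str, user: str, assistant: str) -> Record:
    """Build one messages-format record: system, user, then the assistant target."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]
    }


record = make_record(
    "You are a text to SQL query translator.",
    "<USER_QUERY>Countries on the Example Sea</USER_QUERY>",
    # Symbolic placeholder instead of a real parquet path.
    "SELECT id FROM read_parquet('divisions_area') WHERE subtype = 'country'",
)

jsonl_line = json.dumps(record)           # one line of train.jsonl
prompt_messages = record["messages"][:-1]  # what is sent to the model
expected = record["messages"][-1]["content"]  # what the loss or eval compares against
```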

finetune/__init__.py ADDED Viewed

	@@ -0,0 +1 @@


+

finetune/check_token_lengths.py ADDED Viewed

	@@ -0,0 +1,155 @@

+"""Check token lengths of training samples to validate max_length setting.
+Usage
+-----
+modal run finetune/check_token_lengths.py
+modal run finetune/check_token_lengths.py --run-dir /mnt/gazet/data/v1
+"""
+from __future__ import annotations
+import modal
+app = modal.App("gazet-check-token-lengths")
+check_image = (
+    modal.Image.debian_slim(python_version="3.11")
+    .pip_install(
+        "datasets>=3.0",
+        "transformers>=4.46",
+        "jinja2>=3.1",
+    )
+    .add_local_python_source("finetune", copy=True)
+    .env({"HF_HOME": "/mnt/gazet/model_cache"})
+)
+gazet_vol = modal.Volume.from_name("gazet", create_if_missing=True)
+VOLUMES = {
+    "/mnt/gazet": gazet_vol,
+}
+@app.function(
+    image=check_image,
+    volumes=VOLUMES,
+    secrets=[modal.Secret.from_name("huggingface-secret")],
+)
+def analyze_token_lengths(run_dir: str, base_model: str):
+    import json
+    import pathlib
+    from datasets import Dataset, DatasetDict
+    from transformers import AutoTokenizer
+    def load_jsonl(path):
+        rows = []
+        with open(path) as f:
+            for line in f:
+                line = line.strip()
+                if line:
+                    rows.append(json.loads(line))
+        return rows
+    print(f"Loading tokenizer: {base_model}")
+    tokenizer = AutoTokenizer.from_pretrained(base_model)
+    root = pathlib.Path(run_dir)
+    ds_dict = {}
+    for split in ("train", "val", "test"):
+        combined = []
+        for task in ("sql", "places"):
+            path = root / task / f"{split}.jsonl"
+            if path.exists():
+                combined.extend(load_jsonl(path))
+        if combined:
+            ds_dict[split] = Dataset.from_list(combined)
+    ds = DatasetDict(ds_dict)
+    def token_lengths(dataset):
+        lengths = []
+        for row in dataset:
+            msgs = row["messages"]
+            text = tokenizer.apply_chat_template(msgs, tokenize=False)
+            lengths.append(len(tokenizer.encode(text)))
+        return lengths
+    def report(split_name: str, lengths: list[int]):
+        lengths.sort()
+        n = len(lengths)
+        if not n:
+            print(f"\n{split_name}: empty")
+            return
+        print(f"\n{'='*60}")
+        print(f"{split_name} ({n:,} samples)")
+        print(f"{'='*60}")
+        print(f"  Min:    {min(lengths):,}")
+        print(f"  Max:    {max(lengths):,}")
+        print(f"  Mean:   {sum(lengths)/n:.0f}")
+        print(f"  Median: {lengths[n//2]:,}")
+        print(f"  P90:    {lengths[int(n*0.90)]:,}")
+        print(f"  P95:    {lengths[int(n*0.95)]:,}")
+        print(f"  P99:    {lengths[int(n*0.99)]:,}")
+        buckets = [512, 1024, 2048, 4096, 8192]
+        print(f"\n  Distribution:")
+        prev = 0
+        for limit in buckets:
+            count = sum(1 for l in lengths if prev < l <= limit)
+            pct = 100 * count / n
+            bar = "#" * int(pct / 2)
+            print(f"    {prev+1:>5}-{limit:<5}: {count:5,} ({pct:5.1f}%) {bar}")
+            prev = limit
+        over = sum(1 for l in lengths if l > buckets[-1])
+        if over:
+            print(f"    {buckets[-1]+1:>5}+     : {over:5,} ({100*over/n:5.1f}%)")
+        return lengths
+    all_lengths = []
+    for split in ("train", "val", "test"):
+        if split not in ds:
+            continue
+        lengths = token_lengths(ds[split])
+        report(split, lengths)
+        all_lengths.extend(lengths)
+    if all_lengths:
+        all_lengths.sort()
+        n = len(all_lengths)
+        max_len = max(all_lengths)
+        p99 = all_lengths[int(n * 0.99)]
+        print(f"\n{'='*60}")
+        print(f"RECOMMENDATION")
+        print(f"{'='*60}")
+        print(f"  Total samples: {n:,}")
+        print(f"  Max length:    {max_len:,}")
+        print(f"  P99:           {p99:,}")
+        for threshold in [1024, 2048, 4096]:
+            over = sum(1 for l in all_lengths if l > threshold)
+            pct = 100 * over / n
+            print(f"  > {threshold:5,}: {over:5,} ({pct:5.1f}%)")
+        if max_len <= 1024:
+            print(f"\n  All samples fit in 1024 tokens. Use --max-seq-length 1024.")
+        elif max_len <= 2048:
+            print(f"\n  All samples fit in 2048 tokens. Use --max-seq-length 2048.")
+        else:
+            over_2048 = sum(1 for l in all_lengths if l > 2048)
+            print(f"\n  {over_2048} samples exceed 2048. Consider --max-seq-length {max_len}")
+            print(f"  or reduce candidate count to keep samples shorter.")
+@app.local_entrypoint()
+def main(
+    run_dir: str = "/mnt/gazet/data/v1",
+    base_model: str = "unsloth/Qwen3.5-0.8B",
+):
+    print(f"Checking token lengths:")
+    print(f"  Model:   {base_model}")
+    print(f"  Run dir: {run_dir}")
+    analyze_token_lengths.remote(run_dir, base_model)
+    print("Analysis complete!")
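
The percentile lines in `report()` index a sorted list at `int(n * p)`. A small worked example of that lookup on made-up lengths follows; `percentile` and `share_over` are hypothetical helpers, and the `min(...)` guard is an addition that only matters if `p` is 1.0.

```python
from typing import List


def percentile(sorted_lengths: List[int], p: float) -> int:
    """Look up the value at index int(n * p), as report() does."""
    n = len(sorted_lengths)
    return sorted_lengths[min(int(n * p), n - 1)]


def share_over(lengths: List[int], threshold: int) -> float:
    """Percentage of samples longer than `threshold` tokens."""
    return 100 * sum(1 for length in lengths if length > threshold) / len(lengths)


# Made-up token lengths for ten samples.
lengths = sorted([120, 340, 560, 610, 700, 980, 1500, 1900, 2100, 4000])

median = percentile(lengths, 0.50)  # index 5
p90 = percentile(lengths, 0.90)     # index 9
over_2048 = share_over(lengths, 2048)
```

With only ten samples the P90 already lands on the maximum, which is why the script reports the maximum and the over-threshold counts alongside the percentiles.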

finetune/eval_cli.py ADDED Viewed

	@@ -0,0 +1,248 @@

+"""Interactive eval: run test samples through the local GGUF model.
+Requires llama-server running on port 9000 (must match SERVER_URL below):
+  llama-server -m finetune/models/<model>.gguf -ngl 99 --port 9000 --ctx-size 4096 --log-disable
+Uses the /v1/chat/completions endpoint with a messages list. The Qwen3 GGUF
+embeds its chat template in metadata, so llama-server applies it automatically.
+Usage
+-----
+uv run finetune/eval_cli.py          # prompts for index
+uv run finetune/eval_cli.py 5        # run sample at index 5
+uv run finetune/eval_cli.py 5 12 20  # run multiple samples
+Use --task places for place extraction:
+  uv run finetune/eval_cli.py --task places 0 5
+Override run directory:
+  uv run finetune/eval_cli.py --run-dir dataset/output/runs/v1 0
+"""
+from __future__ import annotations
+import argparse
+import json
+import sys
+import urllib.error
+import urllib.request
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from datetime import datetime
+from pathlib import Path
+SERVER_URL = "http://localhost:9000"
+MAX_TOKENS = 2048
+TEMPERATURE = 0.6
+DEFAULT_RUN_DIR = Path("dataset/output/runs/v1")
+def postprocess_sql(text: str) -> str:
+    cleaned = text.strip()
+    if "```sql" in cleaned:
+        cleaned = cleaned.split("```sql", 1)[1]
+    if cleaned.startswith("```"):
+        cleaned = cleaned[3:]
+    if "```" in cleaned:
+        cleaned = cleaned.split("```", 1)[0]
+    return cleaned.strip()
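
A few inputs show what `postprocess_sql` keeps. The function is restated below so the example runs on its own; the only change is that the Markdown fence is built from single backticks instead of being written literally. The sample strings are made up.

```python
FENCE = "`" * 3  # a Markdown code fence (three backticks)


def postprocess_sql(text: str) -> str:
    """Same steps as the function above: keep only the SQL inside a fence."""
    cleaned = text.strip()
    if FENCE + "sql" in cleaned:
        cleaned = cleaned.split(FENCE + "sql", 1)[1]
    if cleaned.startswith(FENCE):
        cleaned = cleaned[3:]
    if FENCE in cleaned:
        cleaned = cleaned.split(FENCE, 1)[0]
    return cleaned.strip()


# Model output with prose around a sql-tagged fence.
with_prose = f"Here is the query:\n{FENCE}sql\nSELECT 1\n{FENCE}\nDone."
# Untagged fence.
bare_fence = f"{FENCE}\nSELECT 2\n{FENCE}"
# No fence at all, only surrounding whitespace.
plain = "  SELECT 3  "

cleaned_outputs = [postprocess_sql(s) for s in (with_prose, bare_fence, plain)]
```

Since `exact_match` is a string comparison, this cleanup decides whether an otherwise correct query counts as a match.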
+def check_server() -> bool:
+    try:
+        urllib.request.urlopen(f"{SERVER_URL}/health", timeout=2)
+        return True
+    except Exception:
+        return False
+def chat_complete(messages: list[dict]) -> str:
+    """Call llama-server /v1/chat/completions with a messages list."""
+    payload = json.dumps({
+        "messages": messages,
+        "n_predict": MAX_TOKENS,
+        "temperature": TEMPERATURE,
+        "chat_template_kwargs": {"enable_thinking": False},
+    }).encode()
+    req = urllib.request.Request(
+        f"{SERVER_URL}/v1/chat/completions",
+        data=payload,
+        headers={"Content-Type": "application/json"},
+    )
+    with urllib.request.urlopen(req, timeout=60) as resp:
+        return json.loads(resp.read())["choices"][0]["message"]["content"]
+def load_samples(run_dir: Path, task: str, split: str = "val") -> list[dict]:
+    path = run_dir / task / f"{split}.jsonl"
+    if not path.exists():
+        print(f"Error: {path} not found")
+        sys.exit(1)
+    print(f"Loading {task} samples from: {path}")
+    with path.open() as f:
+        return [json.loads(line) for line in f if line.strip()]
+def build_raw_prompt(sample: dict) -> str:
+    """Reconstruct the plain prompt string from messages format (all turns except assistant)."""
+    return "\n\n".join(m["content"] for m in sample["messages"][:-1])
+def eval_sample(sample: dict, task: str) -> dict:
+    """Run a single sample through the server and return a result dict."""
+    expected = sample["messages"][-1]["content"]
+    messages = sample["messages"][:-1]
+    user_content = sample["messages"][-2]["content"]
+    if "<USER_QUERY>" in user_content:
+        question = user_content.split("<USER_QUERY>")[-1].split("</USER_QUERY>")[0].strip()
+    else:
+        question = user_content[:120]
+    raw = chat_complete(messages)
+    predicted = postprocess_sql(raw) if task == "sql" else raw.strip()
+    return {
+        "question": question,
+        "expected": expected,
+        "predicted": predicted,
+        "exact_match": predicted.strip() == expected.strip(),
+    }
+def run_sample(sample: dict, task: str, total: int, index: int, verbose: bool = False) -> None:
+    user_content = sample["messages"][-2]["content"]
+    if "<USER_QUERY>" in user_content:
+        question = user_content.split("<USER_QUERY>")[-1].split("</USER_QUERY>")[0].strip()
+    else:
+        question = user_content[:120]
+    header = f"  Sample {index}/{total-1} | {task}  "
+    print(f"\n{'━' * 60}")
+    print(header.center(60, "━"))
+    print(f"{'━' * 60}")
+    print(f"\nQuestion: {question}\n")
+    if verbose:
+        prompt = build_raw_prompt(sample)
+        print(f"{'─' * 60}")
+        print(f"Full prompt ({len(prompt)} chars, ~{len(prompt.split())} words):")
+        print(f"{'─' * 60}")
+        print(prompt)
+    result = eval_sample(sample, task)
+    print(f"{'─' * 60}")
+    print("Expected:")
+    print(f"{'─' * 60}")
+    print(result["expected"])
+    print(f"\n{'─' * 60}")
+    print("Generated:")
+    print(f"{'─' * 60}")
+    print(result["predicted"])
+    print(f"\n{'─' * 60}")
+    print(f"Match: {'YES' if result['exact_match'] else 'NO'}")
+def run_batch(
+    samples: list[dict],
+    task: str,
+    label: str,
+    output_path: Path,
+    workers: int = 8,
+) -> None:
+    """Run all samples concurrently and save results to a JSON file."""
+    total = len(samples)
+    results = [None] * total
+    completed = 0
+    with ThreadPoolExecutor(max_workers=workers) as executor:
+        futures = {executor.submit(eval_sample, s, task): i for i, s in enumerate(samples)}
+        for future in as_completed(futures):
+            i = futures[future]
+            result = future.result()
+            results[i] = {"index": i, **result}
+            completed += 1
+            if completed % 50 == 0 or completed == total:
+                print(f"{completed}/{total} done", flush=True)
+    matches = sum(1 for r in results if r["exact_match"])
+    exact_match_rate = matches / total if total else 0
+    output = {
+        "summary": {
+            "label": label,
+            "task": task,
+            "num_samples": total,
+            "exact_matches": matches,
+            "exact_match_rate": exact_match_rate,
+            "timestamp": datetime.now().isoformat(),
+        },
+        "results": results,
+    }
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    with output_path.open("w") as f:
+        json.dump(output, f, indent=2)
+    print(f"\n{'=' * 60}")
+    print(f"[{label}] {matches}/{total} exact matches ({100 * exact_match_rate:.1f}%)")
+    print(f"Results saved to {output_path}")
+    print(f"{'=' * 60}")
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Interactive eval against llama-server")
+    parser.add_argument("indices", nargs="*", type=int, help="Sample indices to evaluate")
+    parser.add_argument("--task", default="sql", choices=["sql", "places"])
+    parser.add_argument(
+        "--run-dir",
+        type=Path,
+        default=DEFAULT_RUN_DIR,
+        help="Run directory containing {task}/{split}.jsonl files",
+    )
+    parser.add_argument("--split", default="val", choices=["val", "test"], help="Dataset split")
+    parser.add_argument("--verbose", "-v", action="store_true", help="Print full prompt sent to the model")
+    parser.add_argument("--all", dest="run_all", action="store_true", help="Run all samples in batch mode")
+    parser.add_argument("--max-samples", type=int, default=None, help="Limit number of samples (batch mode)")
+    parser.add_argument("--label", default="local-gguf", help="Label for batch output file")
+    parser.add_argument("--output", type=Path, default=None, help="Output JSON path (batch mode)")
+    parser.add_argument("--workers", type=int, default=4, help="Concurrent requests; match llama-server --parallel (default 4)")
+    args = parser.parse_args()
+    if not check_server():
+        print("llama-server not running. Start it with:")
+        print("llama-server -m finetune/models/<model>.gguf -ngl 99 --port 9000 --ctx-size 2048 --log-disable")
+        sys.exit(1)
+    samples = load_samples(args.run_dir, args.task, args.split)
+    total = len(samples)
+    if args.run_all:
+        if args.max_samples is not None:
+            samples = samples[: args.max_samples]
+        output_path = args.output or Path(f"eval-{args.label}-{args.task}.json")
+        print(f"Running batch eval: {len(samples)} samples, {args.workers} workers")
+        run_batch(samples, args.task, args.label, output_path, workers=args.workers)
+        return
+    if not args.indices:
+        print(f"{args.split} split has {total} {args.task} samples (0-{total-1})")
+        raw = input("Enter index (or press Enter for 0): ").strip()
+        indices = [int(raw) if raw else 0]
+    else:
+        indices = args.indices
+    for idx in indices:
+        if not (0 <= idx < total):
+            print(f"Index {idx} out of range (0-{total-1}), skipping")
+            continue
+        run_sample(samples[idx], args.task, total, idx, verbose=args.verbose)
+    print(f"\n{'━' * 60}\n")
+if __name__ == "__main__":
+    main()
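
Review note on `postprocess_sql`: the fence handling can be exercised in isolation. The sketch below mirrors the helper from the diff (the fence string is built programmatically only so the example can sit inside a code fence) and runs it on made-up model outputs. The sample strings are illustrative assumptions, not real model output.

```python
FENCE = "`" * 3  # built programmatically so this example nests inside a fence


def postprocess_sql(text: str) -> str:
    """Mirror of the eval_cli.py helper: keep only the SQL inside a fence."""
    cleaned = text.strip()
    if FENCE + "sql" in cleaned:
        cleaned = cleaned.split(FENCE + "sql", 1)[1]
    if cleaned.startswith(FENCE):
        cleaned = cleaned[3:]
    if FENCE in cleaned:
        cleaned = cleaned.split(FENCE, 1)[0]
    return cleaned.strip()


# Hypothetical model outputs, made up for illustration.
samples = {
    "tagged": f"{FENCE}sql\nSELECT 1\n{FENCE}",
    "untagged": f"{FENCE}\nSELECT 2\n{FENCE}",
    "with_prose": f"Here is the query:\n{FENCE}sql\nSELECT 3\n{FENCE}\nDone.",
    "bare": "  SELECT 4  ",
    "prose_untagged": f"Here:\n{FENCE}\nSELECT 5\n{FENCE}",
}

for name, raw in samples.items():
    print(name, "->", repr(postprocess_sql(raw)))
```

The first four cases reduce to the bare statement. The last one returns the prose before the fence (`Here:`) rather than the SQL, because an untagged fence is only stripped when it opens the text. That may matter if the model ever prefixes an untagged fence with commentary.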

finetune/eval_demo.py ADDED

	@@ -0,0 +1,351 @@

+"""Streamlit eval viewer: compare expected vs predicted SQL and view results on a map.
+Usage: streamlit run finetune/eval_demo.py
+"""
+import difflib
+import json
+import math
+import os
+import pathlib
+import duckdb
+import numpy as np
+import pandas as pd
+import pydeck as pdk
+import sqlparse
+import streamlit as st
+PROJECT_ROOT = pathlib.Path(__file__).resolve().parent.parent
+DATA_DIR = pathlib.Path(
+    os.environ.get("GAZET_DATA_DIR", str(PROJECT_ROOT / "data"))
+)
+EVAL_DIR = pathlib.Path(
+    os.environ.get("GAZET_EVAL_DIR", str(PROJECT_ROOT / "results"))
+)
+def load_eval_results(path):
+    with open(path) as f:
+        return json.load(f)
+def rewrite_data_paths(sql):
+    """Replace symbolic and legacy paths with actual local data paths."""
+    # Legacy fixed Docker paths must be replaced first to avoid double-expansion
+    sql = sql.replace("/data/", f"{DATA_DIR}/")
+    div_path = str(DATA_DIR / "overture" / "divisions_area" / "*.parquet")
+    ne_path = str(DATA_DIR / "natural_earth_geoparquet" / "ne_geography.parquet")
+    sql = sql.replace("read_parquet('divisions_area')", f"read_parquet('{div_path}')")
+    sql = sql.replace("read_parquet('natural_earth')", f"read_parquet('{ne_path}')")
+    return sql
+def format_sql(sql):
+    """Pretty-print SQL with sqlparse."""
+    return sqlparse.format(sql, reindent=True, keyword_case="upper")
+def sql_diff_html(expected, predicted):
+    """Return an HTML diff of two SQL strings."""
+    expected_lines = format_sql(expected).splitlines()
+    predicted_lines = format_sql(predicted).splitlines()
+    diff = difflib.HtmlDiff(tabsize=2, wrapcolumn=80)
+    return diff.make_table(
+        expected_lines, predicted_lines,
+        fromdesc="Expected", todesc="Predicted",
+        context=False,
+    )
+def get_duckdb_connection():
+    con = duckdb.connect()
+    con.execute("INSTALL spatial")
+    con.execute("LOAD spatial")
+    return con
+def execute_sql(con, sql):
+    """Execute SQL, converting geometry columns to simplified GeoJSON strings."""
+    rel = con.sql(sql)
+    cols = rel.columns
+    types = [str(t) for t in rel.dtypes]
+    select_parts = []
+    for col, dtype in zip(cols, types):
+        if "GEOMETRY" in dtype.upper():
+            select_parts.append(
+                f'ST_AsGeoJSON(ST_SimplifyPreserveTopology("{col}", 0.001)) AS "{col}"'
+            )
+        else:
+            select_parts.append(f'"{col}"')
+    wrapped = f"SELECT {', '.join(select_parts)} FROM ({sql})"
+    return con.execute(wrapped).fetchdf()
+def _is_notna(val):
+    """Check if a value is not NA, handling arrays/lists/numpy arrays safely."""
+    if isinstance(val, (list, tuple, np.ndarray)):
+        return len(val) > 0
+    return pd.notna(val)
+def _to_python(val):
+    """Convert numpy/pandas types to native Python for JSON serialization."""
+    if isinstance(val, (np.integer,)):
+        return int(val)
+    if isinstance(val, (np.floating,)):
+        return float(val)
+    if isinstance(val, np.ndarray):
+        return val.tolist()
+    if isinstance(val, (np.bool_,)):
+        return bool(val)
+    return val
+def to_feature_collection(result_df):
+    """Build GeoJSON FeatureCollection from a DataFrame with GeoJSON string columns."""
+    geom_cols = []
+    for c in result_df.columns:
+        vals = [v for v in result_df[c].head(5) if isinstance(v, str)]
+        if vals and all(v.lstrip().startswith('{"type":') for v in vals):
+            geom_cols.append(c)
+    prop_cols = [c for c in result_df.columns if c not in geom_cols]
+    features = []
+    for _, row in result_df.iterrows():
+        geometry = None
+        if geom_cols:
+            raw = row[geom_cols[0]]
+            if raw and isinstance(raw, str):
+                geometry = json.loads(raw)
+        properties = {}
+        for c in prop_cols:
+            val = row[c]
+            if _is_notna(val):
+                properties[c] = _to_python(val)
+        features.append(
+            {"type": "Feature", "geometry": geometry, "properties": properties}
+        )
+    return {"type": "FeatureCollection", "features": features}
+def bbox_from_geojson(geojson):
+    lngs, lats = [], []
+    for f in geojson.get("features", []):
+        geom = f.get("geometry")
+        if geom:
+            for coord in _extract_coords(geom):
+                lngs.append(coord[0])
+                lats.append(coord[1])
+    if not lngs:
+        return None
+    return min(lngs), min(lats), max(lngs), max(lats)
+def _extract_coords(geom):
+    t = geom.get("type", "")
+    coords = geom.get("coordinates", [])
+    if t == "Point":
+        yield coords
+    elif t in ("LineString", "MultiPoint"):
+        yield from coords
+    elif t == "Polygon":
+        for ring in coords:
+            yield from ring
+    elif t in ("MultiLineString", "MultiPolygon"):
+        for part in coords:
+            if t == "MultiLineString":
+                yield from part
+            else:
+                for ring in part:
+                    yield from ring
+    elif t == "GeometryCollection":
+        for g in geom.get("geometries", []):
+            yield from _extract_coords(g)
+def _centroids_from_geojson(geojson):
+    """Extract centroid [lng, lat] for each feature to use as scatter markers."""
+    centroids = []
+    for f in geojson.get("features", []):
+        geom = f.get("geometry")
+        if not geom:
+            continue
+        lngs, lats = [], []
+        for coord in _extract_coords(geom):
+            lngs.append(coord[0])
+            lats.append(coord[1])
+        if lngs:
+            centroids.append({"lng": sum(lngs) / len(lngs), "lat": sum(lats) / len(lats)})
+    return centroids
+def render_map(geojson, color, key):
+    n = len(geojson.get("features", []))
+    if not n:
+        st.info("Query returned no features.")
+        return
+    layers = [
+        pdk.Layer(
+            "GeoJsonLayer",
+            data=geojson,
+            get_fill_color=color,
+            get_line_color=[100, 100, 100, 200],
+            get_line_width=2,
+            pickable=True,
+        ),
+    ]
+    bbox = bbox_from_geojson(geojson)
+    if bbox:
+        min_lng, min_lat, max_lng, max_lat = bbox
+        span = max(max_lng - min_lng, max_lat - min_lat, 1e-6)
+        zoom = max(0, min(18, math.log2(360 / span) - 0.8))
+        # Add scatter markers when polygons would be too small to see
+        if zoom < 4:
+            centroids = _centroids_from_geojson(geojson)
+            if centroids:
+                layers.append(
+                    pdk.Layer(
+                        "ScatterplotLayer",
+                        data=centroids,
+                        get_position=["lng", "lat"],
+                        get_fill_color=color[:3] + [220],
+                        get_radius=50000,
+                        radius_min_pixels=6,
+                        pickable=True,
+                    )
+                )
+        view = pdk.ViewState(
+            latitude=(min_lat + max_lat) / 2,
+            longitude=(min_lng + max_lng) / 2,
+            zoom=zoom,
+        )
+    else:
+        view = pdk.ViewState(latitude=0, longitude=0, zoom=1)
+    st.pydeck_chart(
+        pdk.Deck(layers=layers, initial_view_state=view, map_style=None),
+        width="stretch",
+        height=400,
+        key=key,
+    )
+# --- App ---
+st.set_page_config(page_title="Eval Viewer", layout="wide")
+st.title("Eval Viewer")
+eval_files = sorted(EVAL_DIR.glob("eval-*.json"))
+if not eval_files:
+    st.error(f"No eval result files found in {EVAL_DIR}")
+    st.stop()
+selected_file = st.sidebar.selectbox(
+    "Eval file",
+    eval_files,
+    format_func=lambda p: p.stem,
+)
+data = load_eval_results(selected_file)
+summary = data["summary"]
+results = data["results"]
+st.sidebar.markdown(f"""
+**Model**: `{summary.get('label', '')}`
+**Exact match**: {summary['exact_matches']}/{summary['num_samples']} ({summary['exact_match_rate']:.1%})
+""")
+filter_option = st.sidebar.radio("Filter", ["All", "Matches only", "Mismatches only"])
+if filter_option == "Matches only":
+    results = [r for r in results if r["exact_match"]]
+elif filter_option == "Mismatches only":
+    results = [r for r in results if not r["exact_match"]]
+if not results:
+    st.warning("No results match the current filter.")
+    st.stop()
+questions = [
+    f"[{r['index']}] {r.get('question', 'Sample ' + str(r['index']))}"
+    for r in results
+]
+selected_idx = st.selectbox("Select a query", range(len(questions)), format_func=lambda i: questions[i])
+row = results[selected_idx]
+match_label = "MATCH" if row["exact_match"] else "MISMATCH"
+match_color = "green" if row["exact_match"] else "red"
+st.markdown(f"### :{match_color}[{match_label}]")
+is_sql = summary.get("task", "sql") == "sql"
+expected = row["expected"]
+predicted = row["predicted"]
+# Formatted output side-by-side
+col_expected, col_predicted = st.columns(2)
+with col_expected:
+    st.markdown("**Expected**")
+    if is_sql:
+        st.code(format_sql(expected), language="sql")
+    else:
+        st.code(expected, language="json")
+with col_predicted:
+    st.markdown("**Predicted**")
+    if is_sql:
+        st.code(format_sql(predicted), language="sql")
+    else:
+        st.code(predicted, language="json")
+# Diff view
+if not row["exact_match"]:
+    with st.expander("Diff", expanded=True):
+        diff_html = sql_diff_html(expected, predicted)
+        diff_css = """
+        <style>
+        .diff_add { background-color: rgba(40, 167, 69, 0.15); }
+        .diff_sub { background-color: rgba(220, 53, 69, 0.15); }
+        .diff_chg { background-color: rgba(255, 193, 7, 0.15); }
+        .diff_header { background-color: rgba(128, 128, 128, 0.1); font-weight: bold; }
+        table.diff { border-collapse: collapse; width: 100%; font-family: monospace; color: inherit; }
+        table.diff td, table.diff th { padding: 4px 8px; border: 1px solid rgba(128, 128, 128, 0.2); }
+        </style>
+        """
+        st.html(f"{diff_css}<div style='overflow-x:auto; font-size:13px;'>{diff_html}</div>")
+# Auto-execute SQL and show maps (only for sql task)
+if is_sql:
+    con = get_duckdb_connection()
+    map_col1, map_col2 = st.columns(2)
+    with map_col1:
+        st.markdown("**Expected result**")
+        sql = rewrite_data_paths(expected)
+        try:
+            df = execute_sql(con, sql)
+            geojson = to_feature_collection(df)
+            render_map(geojson, [40, 180, 160, 140], key="map_expected")
+            with st.expander("Result table"):
+                st.dataframe(df, width="stretch")
+        except Exception as e:
+            st.error(f"Execution error: {e}")
+    with map_col2:
+        st.markdown("**Predicted result**")
+        sql = rewrite_data_paths(predicted)
+        try:
+            df = execute_sql(con, sql)
+            geojson = to_feature_collection(df)
+            render_map(geojson, [180, 80, 60, 140], key="map_predicted")
+            with st.expander("Result table"):
+                st.dataframe(df, width="stretch")
+        except Exception as e:
+            st.error(f"Execution error: {e}")
+    con.close()
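
Review note on `render_map`: the zoom level is derived from the larger side of the bounding box, in degrees. The sketch below isolates that formula so the clamping behaviour is easy to see. `zoom_for_span` is a hypothetical helper name introduced here for illustration; the constants are the ones used in the diff.

```python
import math


def zoom_for_span(span_deg: float) -> float:
    """Zoom level for a bounding box whose larger side spans span_deg degrees."""
    span = max(span_deg, 1e-6)  # same floor as render_map; avoids dividing by zero
    # log2(360 / span) is the number of halvings of the full longitude range;
    # the 0.8 offset zooms out slightly so the features do not touch the edges.
    return max(0, min(18, math.log2(360 / span) - 0.8))


for span in (360, 45, 1, 0):
    print(span, round(zoom_for_span(span), 2))
```

A global extent (360 degrees) clamps to zoom 0, a 45 degree extent gives 2.2, and a single point clamps to 18. The `zoom < 4` branch that adds centroid markers therefore triggers for extents wider than roughly 13 degrees.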

finetune/train_modal_qwen35.py ADDED

	@@ -0,0 +1,363 @@

+"""Modal training script for gazet Qwen3.5 LoRA fine-tuning with Unsloth.
+Key differences from train_modal.py (Gemma):
+- Uses Unsloth's FastLanguageModel for memory-efficient training
+- Applies Qwen3.5 chat template to format data (not plain prompt+completion strings)
+- Uses train_on_responses_only with ChatML markers to mask non-assistant tokens
+- Saves merged 16-bit model via unsloth's save_pretrained_merged
+Usage
+-----
+modal run finetune/train_modal_qwen35.py
+modal run finetune/train_modal_qwen35.py --base-model unsloth/Qwen3.5-0.8B
+modal run finetune/train_modal_qwen35.py --run-dir /mnt/gazet/data/v3-symbolic-paths
+modal run finetune/train_modal_qwen35.py --num-train-epochs 5 --lora-r 32
+"""
+from __future__ import annotations
+import pathlib
+from dataclasses import dataclass
+from datetime import datetime
+from typing import Optional
+import modal
+app = modal.App("gazet-nlg-qwen35-finetune-v2")
+GPU_TYPE = "A100-80GB"
+TIMEOUT_HOURS = 24
+MAX_RETRIES = 1
+train_image = (
+    modal.Image.debian_slim(python_version="3.11")
+    .pip_install(
+        # Use unsloth's bundled CUDA+torch extra so bitsandbytes, xformers,
+        # and trl are all resolved together against the same CUDA/torch build.
+        # Mirrors the approach in https://modal.com/docs/examples/unsloth_finetune
+        "unsloth[cu129-torch280]",
+        "unsloth_zoo",
+        "transformers~=5.2.0",
+        "hf-transfer==0.1.9",
+        "trackio[gpu]==0.21.1",
+        "datasets",
+        "pandas",
+    )
+    .add_local_python_source("finetune", copy=True)
+    .env({
+        "HF_HOME": "/mnt/gazet/model_cache",
+        "HF_HUB_ENABLE_HF_TRANSFER": "1",
+    })
+)
+with train_image.imports():
+    from unsloth import FastLanguageModel
+    from unsloth.chat_templates import train_on_responses_only
+    from trl import SFTConfig, SFTTrainer
+    from transformers import set_seed
+gazet_vol = modal.Volume.from_name("gazet", create_if_missing=True)
+VOLUMES = {
+    "/mnt/gazet": gazet_vol,
+}
+# ChatML response markers for Qwen3.5 — the empty <think> block is how Qwen3.5
+# formats non-thinking responses. We train only on tokens after this prefix.
+INSTRUCTION_PART = "<|im_start|>user\n"
+RESPONSE_PART = "<|im_start|>assistant\n<think>\n\n</think>\n\n"
+@dataclass
+class Qwen35Config:
+    # Model
+    base_model: str = "unsloth/Qwen3.5-0.8B"
+    # Dataset — path to run dir with {task}/{split}.jsonl files
+    run_dir: str = "/mnt/gazet/data/v1"
+    max_train_samples: Optional[int] = None
+    max_eval_samples: Optional[int] = None
+    # Sequence
+    max_seq_length: int = 2048
+    # LoRA — alpha=2*r follows unsloth recommendation for Qwen models
+    lora_r: int = 16
+    lora_alpha: int = 32
+    lora_dropout: float = 0.0
+    # Training
+    num_train_epochs: int = 1
+    per_device_train_batch_size: int = 32
+    per_device_eval_batch_size: int = 16
+    gradient_accumulation_steps: int = 1  # effective batch = 32 (32 per device x 1 step)
+    learning_rate: float = 1e-4
+    max_grad_norm: float = 1.0
+    warmup_steps: int = 50
+    lr_scheduler_type: str = "linear"
+    weight_decay: float = 0.01
+    optim: str = "adamw_8bit"
+    # Logging / saving
+    logging_steps: int = 10
+    save_strategy: str = "steps"
+    save_steps: int = 400
+    eval_strategy: str = "steps"
+    eval_steps: int = 200
+    report_to: str = "trackio"
+    trackio_space_id: Optional[str] = "srmsoumya/gazet-trackio"
+    project: str = "gazet-nlg-qwen35"
+    # Experiment
+    seed: int = 42
+    experiment_name: Optional[str] = None
+    def __post_init__(self):
+        if self.experiment_name is None:
+            timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
+            model_short = self.base_model.split("/")[-1]
+            self.experiment_name = f"{model_short}-r{self.lora_r}-{timestamp}"
+def _load_data(run_dir: str, tokenizer, max_train_samples=None, max_eval_samples=None):
+    """Load JSONL data and apply Qwen3.5 chat template.
+    Each sample must have:
+      messages: list of {role, content} dicts (system + user + assistant)
+    The chat template produces the full ChatML string including the assistant turn.
+    train_on_responses_only then masks everything except the assistant response.
+    """
+    import json
+    from datasets import Dataset, DatasetDict
+    def load_jsonl(path: pathlib.Path) -> list[dict]:
+        rows = []
+        with open(path) as f:
+            for line in f:
+                line = line.strip()
+                if line:
+                    rows.append(json.loads(line))
+        return rows
+    def to_message(sample: dict) -> dict:
+        text = tokenizer.apply_chat_template(
+            sample["messages"],
+            tokenize=False,
+            add_generation_prompt=False,
+        )
+        return {"messages": text}
+    run_dir = pathlib.Path(run_dir)
+    tasks = ("sql", "places")
+    splits = ("train", "val")
+    ds_dict: dict = {}
+    for split in splits:
+        combined: list[dict] = []
+        for task in tasks:
+            path = run_dir / task / f"{split}.jsonl"
+            if not path.exists():
+                print(f"Missing {path} — skipping")
+                continue
+            rows = load_jsonl(path)
+            flattened = [to_message(r) for r in rows]
+            combined.extend(flattened)
+            print(f"Loaded {len(rows):,} {task}/{split} rows")
+        if combined:
+            ds_dict[split] = Dataset.from_list(combined)
+            print(f"{split} split: {len(combined):,} total rows")
+    ds = DatasetDict(ds_dict).shuffle(seed=42)
+    if max_train_samples is not None and "train" in ds:
+        ds["train"] = ds["train"].select(range(min(max_train_samples, len(ds["train"]))))
+    if max_eval_samples is not None and "val" in ds:
+        ds["val"] = ds["val"].select(range(min(max_eval_samples, len(ds["val"]))))
+    return ds
+def _find_latest_checkpoint(checkpoint_dir: pathlib.Path) -> str | None:
+    if not checkpoint_dir.exists():
+        return None
+    checkpoints = list(checkpoint_dir.glob("checkpoint-*"))
+    if not checkpoints:
+        return None
+    latest = max(checkpoints, key=lambda p: int(p.name.split("-")[1]))
+    print(f"Found existing checkpoint: {latest}")
+    return str(latest)
+@app.function(
+    image=train_image,
+    gpu=GPU_TYPE,
+    volumes=VOLUMES,
+    secrets=[modal.Secret.from_name("huggingface-secret")],
+    timeout=TIMEOUT_HOURS * 60 * 60,
+    retries=modal.Retries(initial_delay=0.0, max_retries=MAX_RETRIES),
+)
+def finetune(config_dict: dict):
+    """Run Qwen3.5 LoRA SFT training with Unsloth inside a Modal container."""
+    config = Qwen35Config(**config_dict)
+    set_seed(config.seed)
+    experiment_dir = pathlib.Path("/mnt/gazet/checkpoints") / config.experiment_name
+    experiment_dir.mkdir(parents=True, exist_ok=True)
+    print(f"Experiment:       {config.experiment_name}")
+    print(f"Model:            {config.base_model}")
+    print(f"Run dir:          {config.run_dir}")
+    # Load base model with unsloth — gradient checkpointing is handled internally
+    model, processor = FastLanguageModel.from_pretrained(
+        config.base_model,
+        max_seq_length=config.max_seq_length,
+        load_in_4bit=False,
+        use_gradient_checkpointing="unsloth",
+        fast_inference=False,
+    )
+    tokenizer = processor.tokenizer
+    # Apply LoRA adapters — let unsloth select target modules via finetune_* flags
+    model = FastLanguageModel.get_peft_model(
+        model,
+        r=config.lora_r,
+        lora_alpha=config.lora_alpha,
+        lora_dropout=config.lora_dropout,
+        finetune_vision_layers=False,
+        finetune_language_layers=True,
+        finetune_attention_modules=True,
+        finetune_mlp_modules=True,
+        bias="none",
+        random_state=config.seed,
+        use_gradient_checkpointing=False,  # already set in from_pretrained
+    )
+    total_params = sum(p.numel() for p in model.parameters())
+    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
+    print(f"Total parameters:     {total_params:,}")
+    print(f"Trainable parameters: {trainable_params:,}")
+    ds = _load_data(
+        config.run_dir,
+        tokenizer,
+        max_train_samples=config.max_train_samples,
+        max_eval_samples=config.max_eval_samples,
+    )
+    if "train" not in ds:
+        raise RuntimeError(
+            f"No training data found in {config.run_dir}. "
+            "Run the dataset pipeline and upload exported data to the volume first."
+        )
+    print(f"Train samples:    {len(ds['train']):,}")
+    if "val" in ds:
+        print(f"Val samples:      {len(ds['val']):,}")
+    effective_batch = config.per_device_train_batch_size * config.gradient_accumulation_steps
+    print(f"Effective batch:  {effective_batch}")
+    sft_args = SFTConfig(
+        output_dir=str(experiment_dir),
+        dataset_text_field="messages",
+        max_seq_length=config.max_seq_length,
+        num_train_epochs=config.num_train_epochs,
+        per_device_train_batch_size=config.per_device_train_batch_size,
+        per_device_eval_batch_size=config.per_device_eval_batch_size,
+        gradient_accumulation_steps=config.gradient_accumulation_steps,
+        learning_rate=config.learning_rate,
+        max_grad_norm=config.max_grad_norm,
+        warmup_steps=config.warmup_steps,
+        lr_scheduler_type=config.lr_scheduler_type,
+        weight_decay=config.weight_decay,
+        optim=config.optim,
+        bf16=True,
+        logging_steps=config.logging_steps,
+        save_strategy=config.save_strategy,
+        save_steps=config.save_steps,
+        eval_strategy=config.eval_strategy,
+        eval_steps=config.eval_steps,
+        report_to=config.report_to,
+        trackio_space_id=config.trackio_space_id,
+        project=config.project,
+        dataset_num_proc=8,
+        seed=config.seed,
+    )
+    trainer = SFTTrainer(
+        model=model,
+        tokenizer=tokenizer,
+        train_dataset=ds["train"],
+        eval_dataset=ds.get("val"),
+        args=sft_args,
+    )
+    # Mask all tokens except the assistant response — train on completions only
+    trainer = train_on_responses_only(
+        trainer,
+        instruction_part=INSTRUCTION_PART,
+        response_part=RESPONSE_PART,
+    )
+    resume_from = _find_latest_checkpoint(experiment_dir)
+    if resume_from:
+        print(f"Resuming from {resume_from}")
+    trainer.train(resume_from_checkpoint=resume_from)
+    # Save LoRA adapter + tokenizer (lightweight, for future merging)
+    print(f"Saving LoRA adapter to {experiment_dir}")
+    model.save_pretrained(str(experiment_dir))
+    tokenizer.save_pretrained(str(experiment_dir))
+    # Save merged 16-bit model (full weights, ready for inference / GGUF conversion)
+    merged_dir = experiment_dir / "merged"
+    merged_dir.mkdir(parents=True, exist_ok=True)
+    print(f"Saving merged 16-bit model to {merged_dir}")
+    model.save_pretrained_merged(str(merged_dir), tokenizer, save_method="merged_16bit")
+    gazet_vol.commit()
+    print(f"Training complete: {config.experiment_name}")
+    return config.experiment_name
+@app.local_entrypoint()
+def main(
+    base_model: Optional[str] = None,
+    experiment_name: Optional[str] = None,
+    run_dir: Optional[str] = None,
+    num_train_epochs: Optional[int] = None,
+    per_device_train_batch_size: Optional[int] = None,
+    max_train_samples: Optional[int] = None,
+    max_eval_samples: Optional[int] = None,
+    lora_r: Optional[int] = None,
+    max_seq_length: Optional[int] = None,
+):
+    overrides = {
+        k: v for k, v in dict(
+            base_model=base_model,
+            experiment_name=experiment_name,
+            run_dir=run_dir,
+            num_train_epochs=num_train_epochs,
+            per_device_train_batch_size=per_device_train_batch_size,
+            max_train_samples=max_train_samples,
+            max_eval_samples=max_eval_samples,
+            lora_r=lora_r,
+            max_seq_length=max_seq_length,
+        ).items() if v is not None
+    }
+    config = Qwen35Config(**overrides)
+    # Keep lora_alpha at 2 * r when r is overridden (alpha is not a CLI option)
+    if lora_r is not None:
+        config.lora_alpha = 2 * config.lora_r
+    print(f"Starting experiment: {config.experiment_name}")
+    print(f"Model:               {config.base_model}")
+    print(f"Run dir:             {config.run_dir}")
+    print(f"LoRA:                r={config.lora_r}, alpha={config.lora_alpha}")
+    effective_batch = config.per_device_train_batch_size * config.gradient_accumulation_steps
+    print(f"Effective batch:     {effective_batch}")
+    result = finetune.remote(config.__dict__)
+    print(f"Training complete: {result}")
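
Review note on response-only training: the script relies on `train_on_responses_only` to mask every token that precedes the assistant response. The sketch below illustrates the general idea on toy token ids. It is not Unsloth's implementation; `mask_before_response` and the ids are hypothetical, and `-100` is used because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # labels with this value are skipped by the loss


def mask_before_response(input_ids: list[int], response_ids: list[int]) -> list[int]:
    """Return labels that keep only the tokens after the last response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    n = len(response_ids)
    # Scan right to left so the last marker wins; single-turn samples have one.
    for start in range(len(input_ids) - n, -1, -1):
        if input_ids[start:start + n] == response_ids:
            end = start + n
            labels[end:] = input_ids[end:]
            break
    return labels


# Toy ids: 1, 2 = prompt tokens; 9, 8 = response marker; 5, 6, 7 = answer.
print(mask_before_response([1, 2, 9, 8, 5, 6, 7], [9, 8]))
# No marker found: every position stays masked, so the sample adds no loss.
print(mask_before_response([1, 2, 3], [9, 8]))
```

The second case is the one to watch for in practice: if the tokenized `RESPONSE_PART` never appears in a sample (for example because the chat template renders the empty think block differently), the whole sample is masked and contributes nothing to training.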

gazet_demo.py CHANGED

@@ -2,6 +2,7 @@
 import json
 import math
 import pandas as pd
 import requests
@@ -100,7 +101,7 @@ def _render_map(geojson, placeholder):
             st.json(geojson)
-API = "http://127.0.0.1:8000"
 EXAMPLES = [
     "Angola and Mozambique",
     "Mediterranean Sea",
@@ -114,6 +115,17 @@ st.set_page_config(page_title="Gazet", page_icon="🌍", layout="wide")
 st.title("Gazet")
 st.caption("Natural-language geo search · click an example or type your own")
 if "run_q" not in st.session_state:
     st.session_state.run_q = None
@@ -149,7 +161,7 @@ with col2:
         try:
             with requests.get(
-                f"{API}/search/stream", params={"q": to_run}, stream=True, timeout=120
             ) as r:
                 r.raise_for_status()

 import json
 import math
+import os
 import pandas as pd
 import requests
             st.json(geojson)
+API = os.environ.get("GAZET_API_URL", "http://127.0.0.1:8000")
 EXAMPLES = [
     "Angola and Mozambique",
     "Mediterranean Sea",
 st.title("Gazet")
 st.caption("Natural-language geo search · click an example or type your own")
+backend = st.sidebar.radio(
+    "SQL Backend",
+    ["gguf", "dspy"],
+    index=0,
+    format_func=lambda x: "⚡ GGUF (llama-server)" if x == "gguf" else "🧠 DSPy (cloud LM)",
+)
+st.sidebar.caption(
+    "**gguf** → finetuned Qwen3.5 via llama-server\n\n"
+    "**dspy** → Ollama / cloud LM with retry loop"
+)
 if "run_q" not in st.session_state:
     st.session_state.run_q = None
         try:
             with requests.get(
+                f"{API}/search/stream", params={"q": to_run, "backend": backend}, stream=True, timeout=120
             ) as r:
                 r.raise_for_status()
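The demo reads `/search/stream` as newline-delimited JSON, one event object per line. A minimal sketch of decoding such a stream, assuming each line carries a `type` field as described in `_run_stream`; the helper name `iter_events` and the sample payloads are hypothetical and not part of this change.

```python
from __future__ import annotations

import json
from typing import Any, Iterable, Iterator


def iter_events(lines: Iterable[bytes | str]) -> Iterator[dict[str, Any]]:
    """Decode NDJSON lines into event dicts, skipping blank lines."""
    for raw in lines:
        text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not text.strip():
            continue  # blank lines carry no event
        yield json.loads(text)


# Hypothetical sample of what a stream might carry (payloads are made up).
sample = [
    b'{"type": "places", "data": {"places": [{"place": "Angola"}]}}',
    b"",
    b'{"type": "error", "data": "No candidates found"}',
]
events = list(iter_events(sample))
event_types = [event["type"] for event in events]
```

With `requests`, the iterable could be `r.iter_lines()` on a response opened with `stream=True`, which yields one bytes object per line.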

pyproject.toml CHANGED

@@ -16,8 +16,18 @@ dependencies = [
     "pydantic>=2.0",
     "pyarrow>=17.0.0",
     "geopandas>=1.1.2",
 ]
 optional-dependencies = { demo = ["streamlit", "requests", "pydeck"], dev = ["ruff"] }
 [tool.hatch.build.targets.wheel]
-packages = ["src/gazet"]

     "pydantic>=2.0",
     "pyarrow>=17.0.0",
     "geopandas>=1.1.2",
+    "httpx>=0.28.1",
+    "sqlparse>=0.5.5",
 ]
 optional-dependencies = { demo = ["streamlit", "requests", "pydeck"], dev = ["ruff"] }
+[project.scripts]
+gazet-dataset = "dataset.scripts.cli:main"
 [tool.hatch.build.targets.wheel]
+packages = ["src/gazet", "dataset"]
+[dependency-groups]
+dataset = [
+    "modal>=1.4.0",
+]

src/gazet/api.py CHANGED

@@ -1,4 +1,5 @@
 import json
 from typing import Any, Generator
 import duckdb
@@ -7,19 +8,33 @@ from fastapi import FastAPI, HTTPException
 from fastapi.responses import StreamingResponse
 from .export import to_feature_collection
-from .lm import extract
-from .search import search_divisions_area, search_natural_earth
-from .sql import run_geo_sql_loop
 app = FastAPI()
 def _df_to_records(df: pd.DataFrame) -> list[dict[str, Any]]:
     """Convert DataFrame to list of dicts for JSON; handle non-JSON-serializable types."""
     return df.replace({float("nan"): None}).to_dict(orient="records")
-def _run_stream(query: str) -> Generator[str, None, None]:
     """Yield NDJSON lines as each stage of the search completes.
     Event ``type`` values (in order of emission):
@@ -30,9 +45,12 @@ def _run_stream(query: str) -> Generator[str, None, None]:
     - ``geojson``     – final FeatureCollection
     - ``error``       – fatal error (no result)
     """
-    pred = extract(query=query)
-    print("extract result:", pred.result)
-    places_result = pred.result
     yield json.dumps({"type": "places", "data": places_result.model_dump()}) + "\n"
@@ -41,12 +59,10 @@ def _run_stream(query: str) -> Generator[str, None, None]:
     con.execute("LOAD spatial")
     try:
         all_candidates: list[pd.DataFrame] = []
         for place in places_result.places:
-            for search_fn in (search_divisions_area, search_natural_earth):
-                df = search_fn(con, place)
-                if not df.empty:
-                    all_candidates.append(df)
         if not all_candidates:
             yield json.dumps({"type": "error", "data": "No candidates found"}) + "\n"
@@ -64,8 +80,9 @@ def _run_stream(query: str) -> Generator[str, None, None]:
             + "\n"
         )
         result_df: pd.DataFrame | None = None
-        for event in run_geo_sql_loop(con, query, candidates_df):
             if event["type"] == "sql_attempt":
                 yield (
                     json.dumps(
@@ -105,13 +122,13 @@ def _run_stream(query: str) -> Generator[str, None, None]:
 @app.get("/search/stream")
-def search_stream(q: str) -> StreamingResponse:
     """Stream search progress as NDJSON (one JSON object per line)."""
-    return StreamingResponse(_run_stream(q), media_type="application/x-ndjson")
 @app.get("/search", response_model=None)
-def search(q: str) -> dict[str, Any]:
     """Run geo search for natural-language query (non-streaming).
     Returns GeoJSON FeatureCollection, the executed SQL, and the identified
@@ -122,7 +139,7 @@ def search(q: str) -> dict[str, Any]:
     sql = ""
     geojson: dict | None = None
-    for line in _run_stream(q):
         if not line.strip():
             continue
         event = json.loads(line)

 import json
+import math
 from typing import Any, Generator
 import duckdb
 from fastapi.responses import StreamingResponse
 from .export import to_feature_collection
+from .lm import extract, generate_places
+from .search import search_candidates
+from .sql import run_geo_sql_dspy, run_geo_sql_gguf
 app = FastAPI()
+def _per_source_limit(num_places: int) -> int:
+    """Candidates to fetch per source per place, scaled by number of places.
+    Keeps the total candidate count in the prompt manageable:
+      1 place  → 5 per source → 10 total
+      2 places → 4 per source → 16 total
+      3 places → 3 per source → 18 total
+      4 places → 2 per source → 16 total
+      5 places → 2 per source → 20 total
+    """
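+    table = {1: 5, 2: 4, 3: 3, 4: 2, 5: 2}
+    if num_places in table:
+        return table[num_places]
+    # dict.get() evaluates its default eagerly, so guard against num_places == 0.
+    return max(1, math.ceil(5 / max(num_places, 1)))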
 def _df_to_records(df: pd.DataFrame) -> list[dict[str, Any]]:
     """Convert DataFrame to list of dicts for JSON; handle non-JSON-serializable types."""
     return df.replace({float("nan"): None}).to_dict(orient="records")
+def _run_stream(query: str, backend: str = "gguf") -> Generator[str, None, None]:
     """Yield NDJSON lines as each stage of the search completes.
     Event ``type`` values (in order of emission):
     - ``geojson``     – final FeatureCollection
     - ``error``       – fatal error (no result)
     """
+    if backend == "gguf":
+        places_result = generate_places(query)
+    else:
+        pred = extract(query=query)
+        places_result = pred.result
+    print("places:", places_result)
     yield json.dumps({"type": "places", "data": places_result.model_dump()}) + "\n"
     con.execute("LOAD spatial")
     try:
+        limit = _per_source_limit(len(places_result.places))
         all_candidates: list[pd.DataFrame] = []
         for place in places_result.places:
+            all_candidates.extend(search_candidates(con, place, limit=limit))
         if not all_candidates:
             yield json.dumps({"type": "error", "data": "No candidates found"}) + "\n"
             + "\n"
         )
+        sql_fn = run_geo_sql_gguf if backend == "gguf" else run_geo_sql_dspy
         result_df: pd.DataFrame | None = None
+        for event in sql_fn(con, query, candidates_df):
             if event["type"] == "sql_attempt":
                 yield (
                     json.dumps(
 @app.get("/search/stream")
+def search_stream(q: str, backend: str = "gguf") -> StreamingResponse:
     """Stream search progress as NDJSON (one JSON object per line)."""
+    return StreamingResponse(_run_stream(q, backend), media_type="application/x-ndjson")
 @app.get("/search", response_model=None)
+def search(q: str, backend: str = "gguf") -> dict[str, Any]:
     """Run geo search for natural-language query (non-streaming).
     Returns GeoJSON FeatureCollection, the executed SQL, and the identified
     sql = ""
     geojson: dict | None = None
+    for line in _run_stream(q, backend):
         if not line.strip():
             continue
         event = json.loads(line)
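The candidate budget described in `_per_source_limit` can be checked by tabulating the totals. The sketch below assumes two sources per place, as in `search_candidates`. `per_source_limit` and `max_candidates` are standalone illustrations, not the functions in this change, and the explicit lookup is an assumption chosen so that zero places cannot trigger a division by zero.

```python
import math

SOURCES_PER_PLACE = 2  # divisions_area and natural_earth


def per_source_limit(num_places: int) -> int:
    """Candidates fetched per source for each place."""
    table = {1: 5, 2: 4, 3: 3, 4: 2, 5: 2}
    if num_places in table:
        return table[num_places]
    # Outside the table: at least 1 per source, and never divide by zero.
    return max(1, math.ceil(5 / max(num_places, 1)))


def max_candidates(num_places: int) -> int:
    """Upper bound on candidate rows that end up in the prompt."""
    return num_places * per_source_limit(num_places) * SOURCES_PER_PLACE


totals = {n: max_candidates(n) for n in range(1, 8)}
```

Past five places the per-source limit bottoms out at 1, so the total grows by two rows per extra place.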

src/gazet/config.py CHANGED

@@ -1,7 +1,11 @@
 import pathlib
-# Data lives at project root (gazet/data/), not inside the package
-_DATA_DIR = pathlib.Path(__file__).resolve().parent.parent.parent / "data"
 DIVISIONS_AREA_PATH = str(_DATA_DIR / "overture/divisions_area/*.parquet")
 NATURAL_EARTH_PATH = str(_DATA_DIR / "natural_earth_geoparquet/ne_geography.parquet")
@@ -9,18 +13,29 @@ NATURAL_EARTH_PATH = str(_DATA_DIR / "natural_earth_geoparquet/ne_geography.parq
 # MODEL = "granite4:350m"
 # MODEL = "gemma3:12b-cloud"
 # MODEL = "qwen3.5:397b-cloud"
-MODEL = "gpt-oss:20b-cloud"
 # MODEL = "qwen3:4b"
 # MODEL = "qwen3-coder-next:cloud"
 # MODEL = "deepseek-coder:1.3b"
 MAX_SQL_ITERATIONS = 5
-SCHEMA_INFO = f"""
 Available DuckDB datasets (read via read_parquet):
 1. divisions_area  — Overture polygon/multipolygon admin boundaries
-   path: '{DIVISIONS_AREA_PATH}'
    columns:
      id VARCHAR              -- unique feature id (use this to filter precisely)
      names STRUCT("primary" VARCHAR, ...)
@@ -36,7 +51,7 @@ Available DuckDB datasets (read via read_parquet):
      geometry GEOMETRY       -- boundary polygon/multipolygon (WKB, spatial ext loaded)
 2. natural_earth  — Natural Earth geography polygons (oceans, seas, terrain regions, islands)
-   path: '{NATURAL_EARTH_PATH}'
    columns:
      id VARCHAR              -- unique feature id prefixed 'ne_'
      names STRUCT("primary" VARCHAR, ...)
@@ -49,26 +64,26 @@ Available DuckDB datasets (read via read_parquet):
      is_territorial BOOLEAN
      geometry GEOMETRY       -- polygon/multipolygon (WKB, spatial ext loaded)
-Spatial extension is already loaded — use ST_AsGeoJSON(geometry) or ST_AsText(geometry).
 To access names use: names."primary"
 The candidates table has a 'source' column: 'divisions_area' or 'natural_earth'.
-Use the matching path for each candidate's source when querying.
 Example patterns:
   -- single region boundary from divisions_area
-  SELECT id, names."primary" AS name, ST_AsGeoJSON(geometry) AS geojson
-  FROM read_parquet('{DIVISIONS_AREA_PATH}')
   WHERE id = '<candidate_id>'
   -- feature from natural_earth
-  SELECT id, names."primary" AS name, ST_AsGeoJSON(geometry) AS geojson
-  FROM read_parquet('{NATURAL_EARTH_PATH}')
   WHERE id = '<candidate_id>'
   -- shared border between two adjacent regions
-  WITH a AS (SELECT geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE id = '<id_a>'),
-       b AS (SELECT geometry FROM read_parquet('{DIVISIONS_AREA_PATH}') WHERE id = '<id_b>')
-  SELECT ST_AsGeoJSON(ST_Intersection(a.geometry, b.geometry)) AS border
   FROM a, b
 """

+import os
 import pathlib
+# Data lives at project root (gazet/data/), not inside the package.
+# Override with GAZET_DATA_DIR env var for remote execution (e.g. Modal volume at /data).
+_DATA_DIR = pathlib.Path(os.environ.get("GAZET_DATA_DIR", str(
+    pathlib.Path(__file__).resolve().parent.parent.parent / "data"
+)))
 DIVISIONS_AREA_PATH = str(_DATA_DIR / "overture/divisions_area/*.parquet")
 NATURAL_EARTH_PATH = str(_DATA_DIR / "natural_earth_geoparquet/ne_geography.parquet")
 # MODEL = "granite4:350m"
 # MODEL = "gemma3:12b-cloud"
 # MODEL = "qwen3.5:397b-cloud"
+# MODEL = "gpt-oss:20b-cloud"
 # MODEL = "qwen3:4b"
 # MODEL = "qwen3-coder-next:cloud"
 # MODEL = "deepseek-coder:1.3b"
+# MODEL = "qwen3.5:2b"
+# MODEL = "qwen3.5:0.8b"
+# MODEL = "qwen2.5-coder:1.5b"
+PLACE_EXTRACTION_MODEL = "gpt-oss:20b-cloud"
+SQL_GENERATION_MODEL = "gpt-oss:20b-cloud"
 MAX_SQL_ITERATIONS = 5
+# ── GGUF / llama-server config ────────────────────────────────────────────────
+LLAMA_SERVER_URL = os.environ.get("LLAMA_SERVER_URL", "http://localhost:9000")
+LLAMA_MAX_TOKENS = int(os.environ.get("LLAMA_MAX_TOKENS", "2048"))
+LLAMA_TEMPERATURE = float(os.environ.get("LLAMA_TEMPERATURE", "0"))
+SCHEMA_INFO = """
 Available DuckDB datasets (read via read_parquet):
 1. divisions_area  — Overture polygon/multipolygon admin boundaries
+   query: read_parquet('divisions_area')
    columns:
      id VARCHAR              -- unique feature id (use this to filter precisely)
      names STRUCT("primary" VARCHAR, ...)
      geometry GEOMETRY       -- boundary polygon/multipolygon (WKB, spatial ext loaded)
 2. natural_earth  — Natural Earth geography polygons (oceans, seas, terrain regions, islands)
+   query: read_parquet('natural_earth')
    columns:
      id VARCHAR              -- unique feature id prefixed 'ne_'
      names STRUCT("primary" VARCHAR, ...)
      is_territorial BOOLEAN
      geometry GEOMETRY       -- polygon/multipolygon (WKB, spatial ext loaded)
+Spatial extension is already loaded — use ST_AsGeoJSON(geometry) for geometry outputs.
 To access names use: names."primary"
 The candidates table has a 'source' column: 'divisions_area' or 'natural_earth'.
+Use read_parquet('divisions_area') or read_parquet('natural_earth') accordingly.
 Example patterns:
   -- single region boundary from divisions_area
+  SELECT id, names."primary" AS name, ST_AsGeoJSON(geometry) AS geometry
+  FROM read_parquet('divisions_area')
   WHERE id = '<candidate_id>'
   -- feature from natural_earth
+  SELECT id, names."primary" AS name, ST_AsGeoJSON(geometry) AS geometry
+  FROM read_parquet('natural_earth')
   WHERE id = '<candidate_id>'
   -- shared border between two adjacent regions
+  WITH a AS (SELECT geometry FROM read_parquet('divisions_area') WHERE id = '<id_a>'),
+       b AS (SELECT geometry FROM read_parquet('divisions_area') WHERE id = '<id_b>')
+  SELECT ST_AsGeoJSON(ST_Intersection(a.geometry, b.geometry)) AS geometry
   FROM a, b
 """
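The `GAZET_DATA_DIR` override can be illustrated in isolation. This is a sketch under assumptions: `resolve_data_dir` is a hypothetical helper, the default directory `/app/data` is made up, and `PurePosixPath` is used only to keep the output platform-independent.

```python
from pathlib import PurePosixPath
from typing import Mapping


def resolve_data_dir(env: Mapping[str, str], default: PurePosixPath) -> PurePosixPath:
    """Return GAZET_DATA_DIR from the environment mapping, else the default."""
    return PurePosixPath(env.get("GAZET_DATA_DIR", str(default)))


default_dir = PurePosixPath("/app/data")  # hypothetical local data directory

# No override: the project-local directory is used.
local_dir = resolve_data_dir({}, default_dir)

# Override, e.g. a volume mounted at /data for remote execution.
remote_dir = resolve_data_dir({"GAZET_DATA_DIR": "/data"}, default_dir)
divisions_glob = str(remote_dir / "overture/divisions_area/*.parquet")
```

Passing the environment as a mapping rather than reading `os.environ` directly keeps the helper easy to test; in real use `os.environ` would be passed in.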

src/gazet/export.py CHANGED

@@ -2,9 +2,25 @@ import json
 import pathlib
 import re
 import pandas as pd
 def _is_geojson_col(series: pd.Series) -> bool:
     """Heuristic: a string column whose non-null values start with '{"type":'."""
     sample = series.dropna().head(5)
@@ -16,6 +32,36 @@ def _is_geojson_col(series: pd.Series) -> bool:
     )
 def save_geojson(
     result_df: pd.DataFrame, query: str, output_dir: pathlib.Path | str = "."
 ) -> pathlib.Path:
@@ -43,22 +89,33 @@ def to_feature_collection(result_df: pd.DataFrame) -> dict:
 def _to_feature_collection(result_df: pd.DataFrame) -> dict:
-    geom_cols = [c for c in result_df.columns if _is_geojson_col(result_df[c])]
     prop_cols = [c for c in result_df.columns if c not in geom_cols]
     features = []
     for _, row in result_df.iterrows():
         geometry = None
-        if geom_cols:
-            raw = row[geom_cols[0]]
             if raw and isinstance(raw, str):
                 try:
                     geometry = json.loads(raw)
                 except json.JSONDecodeError:
                     pass
-        properties = {c: row[c] for c in prop_cols if pd.notna(row[c])}
-        for c in geom_cols[1:]:
-            if pd.notna(row[c]):
-                properties[c] = row[c]
         features.append(
             {"type": "Feature", "geometry": geometry, "properties": properties}
         )

 import pathlib
 import re
+import numpy as np
 import pandas as pd
+def _to_serializable(val):
+    """Convert a value to a JSON-serializable Python type."""
+    if isinstance(val, (bytearray, bytes)):
+        return None
+    if isinstance(val, np.ndarray):
+        return val.tolist()
+    if isinstance(val, (np.integer,)):
+        return int(val)
+    if isinstance(val, (np.floating,)):
+        return float(val)
+    if isinstance(val, (np.bool_,)):
+        return bool(val)
+    return val
 def _is_geojson_col(series: pd.Series) -> bool:
     """Heuristic: a string column whose non-null values start with '{"type":'."""
     sample = series.dropna().head(5)
     )
+def _is_wkb_col(series: pd.Series) -> bool:
+    """Heuristic: a column whose non-null values are bytearray or bytes (WKB geometry)."""
+    sample = series.dropna().head(5)
+    return (
+        sample.apply(lambda v: isinstance(v, (bytearray, bytes))).all()
+        and len(sample) > 0
+    )
+def _wkb_to_geojson(wkb: bytearray | bytes) -> dict | None:
+    """Convert WKB geometry to GeoJSON dict via DuckDB."""
+    import duckdb
+    con = duckdb.connect()
+    try:
+        con.execute("INSTALL spatial")
+        con.execute("LOAD spatial")
+        result = con.execute(
+            "SELECT ST_AsGeoJSON(ST_GeomFromWKB(?::BLOB)) AS geojson",
+            [bytes(wkb)],
+        ).fetchone()
+        if result and result[0]:
+            return json.loads(result[0])
+    except Exception:
+        pass
+    finally:
+        con.close()
+    return None
 def save_geojson(
     result_df: pd.DataFrame, query: str, output_dir: pathlib.Path | str = "."
 ) -> pathlib.Path:
 def _to_feature_collection(result_df: pd.DataFrame) -> dict:
+    geojson_cols = [c for c in result_df.columns if _is_geojson_col(result_df[c])]
+    wkb_cols = [c for c in result_df.columns if _is_wkb_col(result_df[c])]
+    geom_cols = geojson_cols + wkb_cols
     prop_cols = [c for c in result_df.columns if c not in geom_cols]
     features = []
     for _, row in result_df.iterrows():
         geometry = None
+        if geojson_cols:
+            raw = row[geojson_cols[0]]
             if raw and isinstance(raw, str):
                 try:
                     geometry = json.loads(raw)
                 except json.JSONDecodeError:
                     pass
+        elif wkb_cols:
+            raw = row[wkb_cols[0]]
+            if raw and isinstance(raw, (bytearray, bytes)):
+                geometry = _wkb_to_geojson(raw)
+        properties = {}
+        for c in prop_cols:
+            v = row[c]
+            try:
+                if not pd.notna(v):
+                    continue
+            except ValueError:
+                pass  # pd.notna fails on arrays — treat as present
+            properties[c] = _to_serializable(v)
         features.append(
             {"type": "Feature", "geometry": geometry, "properties": properties}
         )
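The split between geometry and properties in `_to_feature_collection` can be sketched without pandas. The example assumes the geometry arrives as a GeoJSON string, the case handled by `_is_geojson_col`, and leaves WKB aside. `row_to_feature` and the sample rows are hypothetical.

```python
import json
from typing import Any


def row_to_feature(row: dict[str, Any], geom_col: str) -> dict[str, Any]:
    """Build a GeoJSON Feature from one result row."""
    geometry = None
    raw = row.get(geom_col)
    if isinstance(raw, str) and raw:
        try:
            geometry = json.loads(raw)
        except json.JSONDecodeError:
            geometry = None  # keep the feature, drop the unparsable geometry
    # Every non-null, non-geometry column becomes a property.
    properties = {k: v for k, v in row.items() if k != geom_col and v is not None}
    return {"type": "Feature", "geometry": geometry, "properties": properties}


rows = [
    {"id": "a1", "name": "Example", "geometry": '{"type": "Point", "coordinates": [1.0, 2.0]}'},
    {"id": "b2", "name": None, "geometry": "not json"},
]
collection = {
    "type": "FeatureCollection",
    "features": [row_to_feature(r, "geometry") for r in rows],
}
serialized = json.dumps(collection)
```

A row whose geometry fails to parse still produces a feature with `"geometry": null`, which is valid GeoJSON.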

src/gazet/lm.py CHANGED

@@ -1,7 +1,20 @@
 import dspy
-from .config import MODEL
-from .schemas import PlacesResult
 class ExtractPlaces(dspy.Signature):
@@ -20,6 +33,13 @@ class ExtractPlaces(dspy.Signature):
     Where possible and relevant, also extract the ISO country code for each place.
     Do not repeat the same place name in the result.
     If the user does not explicitly mention a country, dont add the country code to the result.
@@ -103,10 +123,213 @@ class WriteGeoSQL(dspy.Signature):
     )
-lm = dspy.LM(
-    f"ollama_chat/{MODEL}", api_base="http://localhost:11434", api_key="", temperature=0.1, cache=False,
 )
-dspy.configure(lm=lm)
-extract = dspy.Predict(ExtractPlaces)
-write_sql = dspy.Predict(WriteGeoSQL)

+import json
+import logging
 import dspy
+import httpx
+import pandas as pd
+from .config import (
+    LLAMA_MAX_TOKENS,
+    LLAMA_SERVER_URL,
+    LLAMA_TEMPERATURE,
+    PLACE_EXTRACTION_MODEL,
+    SQL_GENERATION_MODEL,
+)
+from .schemas import Place, PlacesResult
+logger = logging.getLogger(__name__)
 class ExtractPlaces(dspy.Signature):
     Where possible and relevant, also extract the ISO country code for each place.
+    Only extract place names that are explicitly mentioned in the query.
+    Do NOT generate or infer place names from your own knowledge.
+    For example:
+    - "north half of India" -> extract "India", NOT individual state names
+    - "coastal cities of France" -> extract "France", NOT city names
+    - "neighbouring states of Odisha" -> extract "Odisha", NOT neighbouring state names
     Do not repeat the same place name in the result.
     If the user does not explicitly mention a country, dont add the country code to the result.
     )
+place_extraction_lm = dspy.LM(
+    f"ollama_chat/{PLACE_EXTRACTION_MODEL}",
+    api_base="http://localhost:11434",
+    api_key="",
+    temperature=0.1,
+    cache=False,
+)
+sql_generation_lm = dspy.LM(
+    f"ollama_chat/{SQL_GENERATION_MODEL}",
+    api_base="http://localhost:11434",
+    api_key="",
+    temperature=0.1,
+    cache=False,
+    think=False
 )
+class PlaceExtractor(dspy.Module):
+    def __init__(self, lm):
+        super().__init__()
+        self.lm = lm
+        self.predictor = dspy.Predict(ExtractPlaces)
+    def forward(self, query: str):
+        with dspy.context(lm=self.lm):
+            return self.predictor(query=query)
+class SQLWriter(dspy.Module):
+    def __init__(self, lm):
+        super().__init__()
+        self.lm = lm
+        self.predictor = dspy.Predict(WriteGeoSQL)
+    def forward(self, user_query: str, schema: str, candidates: str,
+                previous_sql: str = "", execution_error: str = ""):
+        with dspy.context(lm=self.lm):
+            return self.predictor(
+                user_query=user_query,
+                schema=schema,
+                candidates=candidates,
+                previous_sql=previous_sql,
+                execution_error=execution_error
+            )
+extract = PlaceExtractor(lm=place_extraction_lm)
+write_sql = SQLWriter(lm=sql_generation_lm)
+# ── GGUF SQL generation via llama-server ──────────────────────────────────────
+_SYSTEM_PROMPT = """You are a text to SQL query translator that helps in natural language geocoding.
+You have access to two DuckDB parquet tables. Given a set of candidate entities and a user query, generate the SQL to retrieve the desired geometry.
+<SCHEMA>
+1. divisions_area  -- Overture polygon/multipolygon admin boundaries
+   query: read_parquet('divisions_area')
+   columns:
+     id VARCHAR              -- unique feature id
+     names STRUCT("primary" VARCHAR, ...)
+     country VARCHAR         -- ISO 3166-1 alpha-2
+     subtype VARCHAR         -- country | region | dependency | county | localadmin |
+                               locality | macrohood | neighborhood | microhood
+     class VARCHAR
+     region VARCHAR
+     admin_level INTEGER
+     division_id VARCHAR
+     is_land BOOLEAN
+     is_territorial BOOLEAN
+     geometry GEOMETRY       -- WGS-84 polygon/multipolygon (spatial ext loaded)
+2. natural_earth  -- Natural Earth geography polygons (oceans, seas, rivers, terrain)
+   query: read_parquet('natural_earth')
+   columns:
+     id VARCHAR              -- unique feature id prefixed 'ne_'
+     names STRUCT("primary" VARCHAR, ...)
+     country VARCHAR
+     subtype VARCHAR         -- e.g. 'ocean', 'sea', 'bay', 'Terrain area', 'Island group'
+     class VARCHAR
+     region VARCHAR
+     admin_level INTEGER
+     is_land BOOLEAN
+     is_territorial BOOLEAN
+     geometry GEOMETRY       -- WGS-84 polygon/multipolygon (spatial ext loaded)
+</SCHEMA>
+The candidates table has a 'source' column: 'divisions_area' or 'natural_earth'.
+Use read_parquet('divisions_area') or read_parquet('natural_earth') accordingly.
+Use ST_AsGeoJSON(geometry) for all geometry outputs."""
+_USER_PROMPT_TEMPLATE = """<CANDIDATES>
+{candidates_csv}
+</CANDIDATES>
+<USER_QUERY>
+{question}
+</USER_QUERY>
+"""
+def _postprocess_sql(text: str) -> str:
+    """Strip markdown fences and whitespace from generated SQL."""
+    cleaned = text.strip()
+    if "```sql" in cleaned:
+        cleaned = cleaned.split("```sql", 1)[1]
+    if cleaned.startswith("```"):
+        cleaned = cleaned[3:]
+    if "```" in cleaned:
+        cleaned = cleaned.split("```", 1)[0]
+    return cleaned.strip()
+def is_llama_server_available() -> bool:
+    """Check if the llama-server is running and healthy."""
+    try:
+        resp = httpx.get(f"{LLAMA_SERVER_URL}/health", timeout=2)
+        return resp.status_code == 200
+    except (httpx.ConnectError, httpx.TimeoutException):
+        return False
+def _llama_chat_complete(messages: list[dict]) -> str:
+    """Call llama-server /v1/chat/completions with a messages list."""
+    resp = httpx.post(
+        f"{LLAMA_SERVER_URL}/v1/chat/completions",
+        json={
+            "messages": messages,
+            "n_predict": LLAMA_MAX_TOKENS,
+            "temperature": LLAMA_TEMPERATURE,
+            "chat_template_kwargs": {"enable_thinking": False},
+        },
+        timeout=60,
+    )
+    if resp.status_code != 200:
+        logger.error("llama-server %s: %s", resp.status_code, resp.text[:500])
+    resp.raise_for_status()
+    return resp.json()["choices"][0]["message"]["content"]
+_PLACES_SYSTEM_PROMPT = """You are a geographic entity extractor. Extract place names from the user query and return valid JSON only.
+OUTPUT FORMAT:
+{"places": [{"place": "<name>", "country": "<ISO-2>", "subtype": "<subtype>"}]}
+"country" and "subtype" are optional; omit if not applicable.
+RULES:
+- Only extract places explicitly mentioned. Never infer or expand (e.g. "states of India" -> extract "India" only).
+- No duplicate place names.
+- "country": ISO 3166-1 alpha-2. Include only if explicitly mentioned or unambiguous.
+- "subtype": include only when the geographic level is clear from the query.
+SUBTYPES:
+country, dependency, region, county, localadmin, locality, macrohood, neighborhood, microhood
+- Default to locality for cities/towns; omit for physical features (oceans, rivers, mountains)."""
+def generate_places(user_query: str) -> PlacesResult:
+    """Extract place names from a query using the finetuned GGUF model.
+    Uses the same prompt format the model was trained on.
+    Returns a PlacesResult; on parse failure, falls back to a single place holding the whole query.
+    """
+    messages = [
+        {"role": "system", "content": _PLACES_SYSTEM_PROMPT},
+        {"role": "user", "content": user_query},
+    ]
+    raw_output = _llama_chat_complete(messages).strip()
+    # Strip markdown fences if the model wrapped the JSON
+    if raw_output.startswith("```"):
+        raw_output = raw_output.split("```")[1]
+        if raw_output.startswith("json"):
+            raw_output = raw_output[4:]
+        raw_output = raw_output.strip()
+    try:
+        data = json.loads(raw_output)
+        return PlacesResult.model_validate(data)
+    except Exception as exc:
+        logger.warning("generate_places: failed to parse output %r: %s", raw_output, exc)
+        # Best-effort: treat the entire query as a single place name
+        return PlacesResult(places=[Place(place=user_query)])
+def generate_sql(user_query: str, candidates_df: pd.DataFrame) -> str:
+    """Generate SQL from a natural language query using the finetuned GGUF model.
+    Uses the same prompt format the model was trained on:
+    SYSTEM_PROMPT (includes schema) + USER_PROMPT_TEMPLATE with candidates CSV and question.
+    Single-shot — no retry loop (the finetuned model can't improve from error feedback).
+    """
+    # Keep only columns the model was trained on
+    keep_cols = ["source", "id", "name", "subtype", "country", "region", "admin_level", "similarity"]
+    cols = [c for c in keep_cols if c in candidates_df.columns]
+    candidates_csv = candidates_df[cols].to_csv(index=False)
+    user_prompt = _USER_PROMPT_TEMPLATE.format(
+        candidates_csv=candidates_csv.strip(),
+        question=user_query.strip(),
+    )
+    messages = [
+        {"role": "system", "content": _SYSTEM_PROMPT},
+        {"role": "user", "content": user_prompt},
+    ]
+    raw_output = _llama_chat_complete(messages)
+    return _postprocess_sql(raw_output)
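The fence stripping in `_postprocess_sql` is easiest to follow on concrete model outputs. The sketch below restates the same steps as a standalone function, with the hypothetical name `strip_sql_fences`, and runs it on made-up samples. The fence marker is built programmatically only so the example can sit inside a fenced block.

```python
FENCE = "`" * 3  # three backticks, built at runtime


def strip_sql_fences(text: str) -> str:
    """Drop markdown fences and surrounding chatter from generated SQL."""
    cleaned = text.strip()
    marker = FENCE + "sql"
    if marker in cleaned:
        # Keep only what follows the opening sql fence.
        cleaned = cleaned.split(marker, 1)[1]
    if cleaned.startswith(FENCE):
        cleaned = cleaned[3:]
    if FENCE in cleaned:
        # Keep only what precedes the closing fence.
        cleaned = cleaned.split(FENCE, 1)[0]
    return cleaned.strip()


samples = {
    "fenced": f"{FENCE}sql\nSELECT 1\n{FENCE}",
    "chatty": f"Here is the query:\n{FENCE}sql\nSELECT 2\n{FENCE}\nDone.",
    "bare_fence": f"{FENCE}\nSELECT 3\n{FENCE}",
    "plain": "SELECT 4",
}
cleaned = {name: strip_sql_fences(text) for name, text in samples.items()}
upper_tag = strip_sql_fences(f"{FENCE}SQL\nSELECT 5\n{FENCE}")
```

One edge case worth noting: the match on the `sql` tag is case-sensitive, so an upper-case `SQL` tag survives as the first line of the output (`upper_tag` above).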

src/gazet/search.py CHANGED

@@ -5,31 +5,16 @@ from .config import DIVISIONS_AREA_PATH, NATURAL_EARTH_PATH
 from .schemas import Place
-def _fuzzy_search(
     con: duckdb.DuckDBPyConnection,
     path: str,
     source: str,
     place: Place,
     extra_select: str = "",
     limit: int = 5,
-    is_overture: bool = False,
 ) -> pd.DataFrame:
-    """Generic Levenshtein fuzzy search against any parquet with a names.primary column."""
-    country_filter = ""
-    country_params: list = []
-    if is_overture and place.country:
-        country_filter = "AND country = ?"
-        country_params = [place.country]
-    subtype_filter = ""
-    subtype_params: list = []
-    if is_overture and place.subtype:
-        subtype_filter = "AND subtype = ?"
-        subtype_params = [place.subtype]
-    params = (
-        [place.place, place.place, path] + country_params + subtype_params + [limit]
-    )
     extra_clause = f", {extra_select}" if extra_select else ""
     rel = con.execute(
@@ -44,12 +29,9 @@ def _fuzzy_search(
             admin_level,
             is_land,
             is_territorial{extra_clause},
-            1.0 - (levenshtein(lower(names."primary"), lower(?))::float
-                   / greatest(length(names."primary"), length(?), 1)) AS similarity
         FROM read_parquet(?)
         WHERE names."primary" IS NOT NULL AND trim(names."primary") != ''
-        {country_filter}
-        {subtype_filter}
         ORDER BY similarity DESC, admin_level ASC
         LIMIT ?
         """,
@@ -57,11 +39,10 @@ def _fuzzy_search(
     )
     df = rel.fetchdf()
     df.insert(0, "source", source)
-    label = f'"{place.place}"' + (f" [{place.country}]" if place.country else "")
     if df.empty:
-        print(f"\n{source} – {label}: no matches")
     else:
-        print(f"\n{source} – {label} (top {len(df)} by name similarity):")
         print(df.to_string(index=False))
     return df
@@ -70,14 +51,13 @@ def search_divisions_area(
     con: duckdb.DuckDBPyConnection, place: Place, limit: int = 5
 ) -> pd.DataFrame:
     """Fuzzy-match a place against divisions_area (Overture admin boundaries)."""
-    return _fuzzy_search(
         con,
         DIVISIONS_AREA_PATH,
         "divisions_area",
         place,
         extra_select="division_id",
         limit=limit,
-        is_overture=True,
     )
@@ -85,10 +65,26 @@ def search_natural_earth(
     con: duckdb.DuckDBPyConnection, place: Place, limit: int = 5
 ) -> pd.DataFrame:
     """Fuzzy-match a place against Natural Earth geography polygons."""
-    return _fuzzy_search(
         con,
         NATURAL_EARTH_PATH,
         "natural_earth",
         place,
         limit=limit,
     )

 from .schemas import Place
+def simple_fuzzy_search(
     con: duckdb.DuckDBPyConnection,
     path: str,
     source: str,
     place: Place,
     extra_select: str = "",
     limit: int = 5,
 ) -> pd.DataFrame:
+    """Jaro-Winkler similarity search using only the place name."""
+    params = [place.place, path, limit]
     extra_clause = f", {extra_select}" if extra_select else ""
     rel = con.execute(
             admin_level,
             is_land,
             is_territorial{extra_clause},
+            jaro_winkler_similarity(lower(names."primary"), lower(?)) AS similarity
         FROM read_parquet(?)
         WHERE names."primary" IS NOT NULL AND trim(names."primary") != ''
         ORDER BY similarity DESC, admin_level ASC
         LIMIT ?
         """,
     )
     df = rel.fetchdf()
     df.insert(0, "source", source)
     if df.empty:
+        print(f"\n{source} - \"{place.place}\": no matches")
     else:
+        print(f"\n{source} - \"{place.place}\" (top {len(df)} by Jaro-Winkler):")
         print(df.to_string(index=False))
     return df
     con: duckdb.DuckDBPyConnection, place: Place, limit: int = 5
 ) -> pd.DataFrame:
     """Fuzzy-match a place against divisions_area (Overture admin boundaries)."""
+    return simple_fuzzy_search(
         con,
         DIVISIONS_AREA_PATH,
         "divisions_area",
         place,
         extra_select="division_id",
         limit=limit,
     )
     con: duckdb.DuckDBPyConnection, place: Place, limit: int = 5
 ) -> pd.DataFrame:
     """Fuzzy-match a place against Natural Earth geography polygons."""
+    return simple_fuzzy_search(
         con,
         NATURAL_EARTH_PATH,
         "natural_earth",
         place,
         limit=limit,
     )
+def search_candidates(
+    con: duckdb.DuckDBPyConnection, place: Place, limit: int = 5
+) -> list[pd.DataFrame]:
+    """Return candidate DataFrames for a place from both sources.
+    Always searches divisions_area and natural_earth to avoid missing
+    natural features when the model assigns an incorrect admin subtype.
+    """
+    results = []
+    for fn in (search_divisions_area, search_natural_earth):
+        df = fn(con, place, limit=limit)
+        if not df.empty:
+            results.append(df)
+    return results
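For readers unfamiliar with the metric behind `jaro_winkler_similarity`, here is a pure-Python reference sketch of the textbook Jaro-Winkler definition (prefix scale 0.1, prefix capped at 4 characters). It is an illustration only: DuckDB's built-in may differ in implementation details, and the `rank` helper and candidate names are hypothetical.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i in range(len1):
        for j in range(max(0, i - window), min(i + window + 1, len2)):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, scale: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scale * (1.0 - base)


def rank(query: str, names: list[str], limit: int = 5) -> list[tuple[float, str]]:
    """Score names against the query, best first (case-insensitive)."""
    scored = [(jaro_winkler(n.lower(), query.lower()), n) for n in names]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


ranked = rank("angola", ["Mongolia", "Angola", "Anguilla"])
```

Because the metric rewards shared prefixes, names that start like the query tend to outrank names that only share a suffix, which differs from the Levenshtein ratio used before this change.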

src/gazet/sql.py CHANGED

@@ -4,8 +4,40 @@ from typing import Any, Generator, Optional
 import duckdb
 import pandas as pd
-from .config import MAX_SQL_ITERATIONS, SCHEMA_INFO
-from .lm import write_sql
 def _strip_fences(sql: Optional[str]) -> str:
@@ -17,13 +49,38 @@ def _strip_fences(sql: Optional[str]) -> str:
     return sql.strip()
-def run_geo_sql_loop(
     con: duckdb.DuckDBPyConnection,
     user_query: str,
     candidates_df: pd.DataFrame,
-    max_iterations: int = MAX_SQL_ITERATIONS,
 ) -> Generator[dict[str, Any], None, None]:
-    """Code-act loop yielding progress events.
     Event types:
     - ``sql_attempt``  – ``{"type": "sql_attempt", "sql": str, "iteration": int}``
@@ -31,7 +88,46 @@ def run_geo_sql_loop(
     - ``result``       – ``{"type": "result", "df": DataFrame | None, "sql": str}``
     """
     if candidates_df.empty:
-        print("\n[SQL-Act] No candidates to work with — skipping.")
         yield {"type": "result", "df": None, "sql": ""}
         return
@@ -41,7 +137,7 @@ def run_geo_sql_loop(
     for iteration in range(1, max_iterations + 1):
         print(f"\n{'=' * 60}")
-        print(f"[SQL-Act] Iteration {iteration}/{max_iterations}")
         try:
             pred = write_sql(
@@ -86,6 +182,6 @@ def run_geo_sql_loop(
             yield {"type": "sql_error", "error": error, "iteration": iteration}
     print(
-        f"\n[SQL-Act] Exhausted {max_iterations} iterations without a successful query."
     )
     yield {"type": "result", "df": None, "sql": ""}

 import duckdb
 import pandas as pd
+from .config import DIVISIONS_AREA_PATH, MAX_SQL_ITERATIONS, NATURAL_EARTH_PATH, SCHEMA_INFO
+from .lm import generate_sql, write_sql
+def _rewrite_data_paths(sql: str) -> str:
+    """Replace any read_parquet table reference with the correct runtime path.
+    Handles three generations of model output:
+      - Symbolic:  read_parquet('divisions_area')
+      - Old paths: read_parquet('/data/overture/division_area/...')
+      - Hallucinated variants: any quoted path containing 'division_area', 'divisions_area', or 'natural_earth'
+    Legacy replacements run FIRST so the absolute path is never re-matched.
+    """
+    # Any quoted path that looks like a divisions_area reference
+    sql = re.sub(
+        r"read_parquet\(['\"][^'\"]*(?:division_area|divisions_area)[^'\"]*['\"]\)",
+        f"read_parquet('{DIVISIONS_AREA_PATH}')",
+        sql,
+    )
+    # Any quoted path that looks like a natural_earth reference
+    sql = re.sub(
+        r"read_parquet\(['\"][^'\"]*natural_earth[^'\"]*['\"]\)",
+        f"read_parquet('{NATURAL_EARTH_PATH}')",
+        sql,
+    )
+    # Symbolic names (current training format)
+    sql = sql.replace(
+        "read_parquet('divisions_area')", f"read_parquet('{DIVISIONS_AREA_PATH}')"
+    )
+    sql = sql.replace(
+        "read_parquet('natural_earth')", f"read_parquet('{NATURAL_EARTH_PATH}')"
+    )
+    return sql
 def _strip_fences(sql: Optional[str]) -> str:
     return sql.strip()
+def _execute_sql(
+    con: duckdb.DuckDBPyConnection,
+    sql: str,
+    label: str,
+    iteration: int,
+) -> Generator[dict[str, Any], None, None]:
+    """Execute SQL and yield result/error events. Shared by both paths."""
+    try:
+        result_df = con.execute(sql).fetchdf()
+        if result_df.empty:
+            print(f"[{label}] Query returned no rows.")
+            yield {"type": "sql_error", "error": "Query returned no rows", "iteration": iteration}
+            yield {"type": "result", "df": None, "sql": sql}
+        else:
+            print(f"[{label}] Result ({len(result_df)} row(s))")
+            yield {"type": "result", "df": result_df, "sql": sql}
+    except Exception as exc:
+        error = str(exc)
+        print(f"[{label}] Execution error: {error}")
+        yield {"type": "sql_error", "error": error, "iteration": iteration}
+        yield {"type": "result", "df": None, "sql": sql}
+# ── GGUF path: finetuned model via llama-server (single-shot) ─────────────────
+def run_geo_sql_gguf(
     con: duckdb.DuckDBPyConnection,
     user_query: str,
     candidates_df: pd.DataFrame,
 ) -> Generator[dict[str, Any], None, None]:
+    """Single-shot text-to-SQL via the finetuned GGUF model (llama-server).
     Event types:
     - ``sql_attempt``  – ``{"type": "sql_attempt", "sql": str, "iteration": int}``
     - ``result``       – ``{"type": "result", "df": DataFrame | None, "sql": str}``
     """
     if candidates_df.empty:
+        print("\n[SQL·GGUF] No candidates to work with — skipping.")
+        yield {"type": "result", "df": None, "sql": ""}
+        return
+    try:
+        sql = generate_sql(user_query, candidates_df)
+    except Exception as exc:
+        error = f"GGUF generation failed: {exc}"
+        print(f"[SQL·GGUF] {error}")
+        yield {"type": "sql_error", "error": error, "iteration": 1}
+        yield {"type": "result", "df": None, "sql": ""}
+        return
+    if not sql:
+        print("[SQL·GGUF] Model returned empty SQL.")
+        yield {"type": "sql_error", "error": "Empty SQL response", "iteration": 1}
+        yield {"type": "result", "df": None, "sql": ""}
+        return
+    sql = _rewrite_data_paths(sql)
+    print(f"\n[SQL·GGUF] Generated:\n{sql}\n")
+    yield {"type": "sql_attempt", "sql": sql, "iteration": 1}
+    yield from _execute_sql(con, sql, "SQL·GGUF", iteration=1)
+# ── DSPy path: cloud/local LM with retry loop ────────────────────────────────
+def run_geo_sql_dspy(
+    con: duckdb.DuckDBPyConnection,
+    user_query: str,
+    candidates_df: pd.DataFrame,
+    max_iterations: int = MAX_SQL_ITERATIONS,
+) -> Generator[dict[str, Any], None, None]:
+    """Code-act retry loop using the DSPy SQL writer (Ollama / cloud LM).
+    Same event types as ``run_geo_sql_gguf``.
+    """
+    if candidates_df.empty:
+        print("\n[SQL·DSPy] No candidates to work with — skipping.")
         yield {"type": "result", "df": None, "sql": ""}
         return
     for iteration in range(1, max_iterations + 1):
         print(f"\n{'=' * 60}")
+        print(f"[SQL·DSPy] Iteration {iteration}/{max_iterations}")
         try:
             pred = write_sql(
             yield {"type": "sql_error", "error": error, "iteration": iteration}
     print(
+        f"\n[SQL·DSPy] Exhausted {max_iterations} iterations without a successful query."
     )
     yield {"type": "result", "df": None, "sql": ""}
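
Both `run_geo_sql_gguf` and `run_geo_sql_dspy` are generators that emit the same event dictionaries, so a caller can treat them interchangeably. The sketch below shows one possible way to fold such a stream into a summary. The helper `summarize_events` and the fake stream `fake_single_shot` are hypothetical and exist only for illustration; the event shapes follow the docstrings in the diff.

```python
from typing import Any, Iterable, Iterator


def summarize_events(events: Iterable[dict[str, Any]]) -> dict[str, Any]:
    """Fold a stream of progress events into a single summary dict."""
    attempts: list[str] = []
    errors: list[str] = []
    final: dict[str, Any] = {"df": None, "sql": ""}
    for event in events:
        kind = event["type"]
        if kind == "sql_attempt":
            attempts.append(event["sql"])
        elif kind == "sql_error":
            errors.append(event["error"])
        elif kind == "result":
            final = {"df": event["df"], "sql": event["sql"]}
    return {"attempts": attempts, "errors": errors, **final}


def fake_single_shot() -> Iterator[dict[str, Any]]:
    """Hypothetical stream for a query that executes but returns no rows."""
    sql = "SELECT 1 WHERE false"
    yield {"type": "sql_attempt", "sql": sql, "iteration": 1}
    yield {"type": "sql_error", "error": "Query returned no rows", "iteration": 1}
    yield {"type": "result", "df": None, "sql": sql}


summary = summarize_events(fake_single_shot())
print(summary["errors"])
print(summary["sql"])
```

One difference worth noting from the diff: on the single-shot path a failed execution still ends with a `result` event carrying the attempted SQL, whereas the retry loop ends with an empty `sql` string once its iterations are exhausted. A consumer that wants to display the last attempted query should therefore track `sql_attempt` events instead of relying on the final `result`.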

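The path rewriting in `_rewrite_data_paths` can be exercised in isolation. The following is a minimal sketch, not the committed function: the two path constants are hypothetical placeholders for the values imported from `.config`, the sample queries are invented, and the replacement is passed as a callable (an assumption of this sketch) so that backslashes in a path would not be interpreted as regex escapes.

```python
import re

# Hypothetical runtime paths; the real values come from gazet's config module.
DIVISIONS_AREA_PATH = "/srv/data/overture/divisions_area.parquet"
NATURAL_EARTH_PATH = "/srv/data/natural_earth.parquet"

# Same patterns as in the diff: any quoted read_parquet argument whose text
# contains one of the known table names.
_DIVISIONS_RE = re.compile(
    r"read_parquet\(['\"][^'\"]*(?:division_area|divisions_area)[^'\"]*['\"]\)"
)
_NATURAL_RE = re.compile(r"read_parquet\(['\"][^'\"]*natural_earth[^'\"]*['\"]\)")


def rewrite_data_paths(sql: str) -> str:
    """Point every recognised read_parquet(...) call at the runtime path."""
    # A function replacement is inserted literally, without escape processing.
    sql = _DIVISIONS_RE.sub(lambda _m: f"read_parquet('{DIVISIONS_AREA_PATH}')", sql)
    sql = _NATURAL_RE.sub(lambda _m: f"read_parquet('{NATURAL_EARTH_PATH}')", sql)
    return sql


symbolic = "SELECT * FROM read_parquet('divisions_area')"
legacy = 'SELECT * FROM read_parquet("/data/overture/division_area/part-0.parquet")'
natural = "SELECT * FROM read_parquet('natural_earth')"

for query in (symbolic, legacy, natural):
    print(rewrite_data_paths(query))
```

In this sketch the regex passes already match the bare symbolic names, because the text before and after the table name may be empty, so no separate `str.replace` step is needed. Applying the function twice gives the same result as applying it once, since a rewritten path is simply replaced by itself.
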
uv.lock CHANGED

The diff for this file is too large to render.