# Gazet Dataset Generation

Generates synthetic training data for fine-tuning the geocoding model. Two datasets come out of one pipeline run:

- **SQL generation**: `(question + candidates) -> DuckDB SQL`
- **Place extraction**: `question -> place names JSON`

Both tasks export in **conversation format** (`messages` list of system/user/assistant turns), ready for chat-template fine-tuning.

---

## Prerequisites

```bash
uv sync
```

You need the Overture and Natural Earth parquet files under `data/` locally, or on a Modal volume if running in the cloud.

Before large runs, normalize them once so both datasets use harmonized geometry metadata and cross-source joins behave the same locally and on Modal:

```bash
gazet-dataset normalize-data --config dataset/config.yaml
```

This writes:

- `data/overture_normalized/divisions_area/part-000.parquet`
- `data/natural_earth_normalized/ne_geography.parquet`

When those files exist, `gazet.config` will prefer them automatically.

---

## Option A: Run locally (small datasets, development)

Use this when you want to iterate quickly on a laptop with a subset of countries.

**Step 1: Pick a run name and countries in `config.yaml`**

```yaml
run_name: "v1"   # change this every time you generate fresh data
countries:
  - IN   # India
  - BR   # Brazil
  - US   # United States
  # add more, or use "- all" for every country (slow locally)
```

**Step 2: Run the full pipeline**

```bash
gazet-dataset full-pipeline --config dataset/config.yaml
```

That's it. It runs all four steps in order and puts the results in `dataset/output/runs/{run_name}/` (here, `dataset/output/runs/v1/`).

If you want to run steps individually (e.g.
to re-export without regenerating):

```bash
gazet-dataset build-relations --config dataset/config.yaml   # ~5 min
gazet-dataset generate-samples --config dataset/config.yaml  # ~15 min
gazet-dataset validate --config dataset/config.yaml          # ~5 min
gazet-dataset export --config dataset/config.yaml            # <1 min
```

---

## Option B: Run on Modal (large datasets, production)

Use this when you need 10K+ samples or want to use all countries. Modal distributes generation across many containers in parallel.

Modal uses two volumes:

- `gazet-data`: read-only source parquets (Overture + Natural Earth). Populated once by `modal-upload`.
- `gazet-intermediate`: entity inventories and relation tables built by the pipeline. Regenerated on each run unless `--skip-inventory` or `--skip-relations` is passed.

**Step 1: One-time setup (only first time, or when source parquets change)**

First normalize the source geodata locally:

```bash
gazet-dataset normalize-data --config dataset/config.yaml
```

Then upload `data/` to Modal:

```bash
modal setup                                               # authenticate
gazet-dataset modal-upload --config dataset/config.yaml   # ~15 min, uploads data/ to gazet-data volume
```

Verify:

```bash
modal volume ls gazet-data
# should show: overture_normalized/, natural_earth_normalized/
```

Skip this step on subsequent runs; the volume persists across runs.

**Step 2: Set run name and targets in `config.yaml`**

```yaml
run_name: "v2"   # bump this every time you regenerate from scratch
countries:
  - all
sample_targets:
  adjacency: 1500
  containment: 1200
  # ... see config.yaml for all families
```

**Step 3: Run on Modal**

```bash
gazet-dataset modal-generate --config dataset/config.yaml --fresh
```

This builds inventories + relations, generates samples across ~100 containers, validates, and exports. Output lands in `dataset/output/runs/{run_name}/`.

Flags:

- `--fresh` overwrites `dataset/output/dataset_raw.jsonl` instead of appending.
- `--skip-inventory` reuses `{divisions_area,natural_earth}_inventory.parquet` on the intermediate volume.
- `--skip-relations` reuses the seven `*_pairs.parquet` / `*_relations.parquet` files on the intermediate volume. Only safe when countries and template families are unchanged.

### Fresh-start recipe (after template / SQL / prompt changes)

Always clear stale state so nothing from the previous run leaks in:

```bash
# 1. Bump run_name in config.yaml (e.g. v1 -> v2)

# 2. Wipe the intermediate volume so inventories and relations are rebuilt
modal volume ls gazet-intermediate
# for each file shown: modal volume rm gazet-intermediate <path>

# 3. Remove local raw/validated files so nothing gets appended to
rm -f dataset/output/dataset_raw.jsonl dataset/output/dataset_validated.jsonl

# 4. Run the full pipeline
gazet-dataset modal-generate --config dataset/config.yaml --fresh
```

You do NOT need to re-run `modal-upload`; source parquets on `gazet-data` don't change.

### Faster iteration (same templates, just more samples)

If relations + inventories are still valid from a previous run:

```bash
gazet-dataset modal-generate --config dataset/config.yaml \
    --skip-inventory --skip-relations --fresh
```

---

## Output

After running, your training files are at:

```
dataset/output/runs/{run_name}/
  sql/
    train.jsonl    <- fine-tune the SQL generation model
    val.jsonl
    test.jsonl
  places/
    train.jsonl    <- fine-tune the place extraction model
    val.jsonl
    test.jsonl
  stats.json       <- sample counts by family
```

Each JSONL row is a conversation-format dict:

```json
{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ]
}
```

**SQL task**: the system prompt includes the full two-table schema inside `` tags. The user prompt contains only `` CSV and ``. The assistant response is pretty-printed SQL (via `sqlparse`). All parquet paths are symbolic (`divisions_area` / `natural_earth`), never runtime-specific.

**Places task**: the system prompt includes output format, extraction rules, and the full list of Overture subtypes.
The assistant response is a JSON object with a `places` array.

---

## When to regenerate from scratch

Change `run_name` and regenerate from scratch whenever you:

- Change any SQL templates (`sql_templates.py`)
- Add new template families
- Change the candidate format or count
- Change the system/user prompt structure or content
- Change the export format

For local runs, the default is a clean run. For Modal, `modal-generate` appends by default; pass `--fresh` to overwrite existing samples.

---

## Data quality checks

After a run, spot-check the output with the pytest suite:

```bash
uv run --extra dev pytest dataset/tests/ -v
```

The suite reads `dataset/output/dataset_validated.jsonl` plus the exported `runs/{run_name}/*.jsonl` and verifies:

- schema
- no unresolved `{placeholders}` in questions
- candidate refs resolve
- SQL shape
- template coverage
- subtype-filtered templates match their phrasing
- disambiguation samples have same-name distractors
- exported assistant payloads parse as valid JSON / SQL

Tests skip gracefully when outputs are missing.

---

## Troubleshooting

**Very few samples generated for a family**

The generation loop makes `retry_multiplier × target` attempts and discards SQL that returns empty results. Some families (e.g. `multi_adjacency`, `chained`) have a lower success rate. Increase `sample_targets` for those families or increase `retry_multiplier` in `config.yaml`.

**Relations step is slow**

Normal for `countries: [all]`: it's a spatial self-join over millions of features. Use a country subset for development. Relations only need to be rebuilt when you add countries or change template families.

**Validate step drops many samples**

The validate step re-executes every SQL query and discards ones that return empty results. This is expected; check `dataset/output/runs/{run_name}/stats.json` for per-family counts after export.
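
When investigating dropped samples, it can also help to look at an exported file directly, outside the pytest suite. The following is a minimal sketch, not part of the `gazet-dataset` tooling: the function name `check_conversations` is hypothetical, and it assumes every row holds exactly one system, one user, and one assistant turn in that order, as in the conversation format described under Output.

```python
import json
from pathlib import Path

# Assumption: every exported row holds exactly these three turns, in this order.
EXPECTED_ROLES = ["system", "user", "assistant"]


def check_conversations(path):
    """Return (row_count, problems) for one conversation-format JSONL file."""
    problems = []
    row_count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines between rows
            row_count += 1
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
                continue
            messages = row.get("messages") if isinstance(row, dict) else None
            if not isinstance(messages, list) or not all(
                isinstance(m, dict) for m in messages
            ):
                problems.append(f"line {line_no}: 'messages' is not a list of objects")
                continue
            roles = [m.get("role") for m in messages]
            if roles != EXPECTED_ROLES:
                problems.append(f"line {line_no}: unexpected roles {roles}")
                continue
            # Every turn should carry non-empty string content.
            if any(
                not isinstance(m.get("content"), str) or not m["content"].strip()
                for m in messages
            ):
                problems.append(f"line {line_no}: empty or non-string content")
    return row_count, problems
```

For example, `check_conversations("dataset/output/runs/v2/sql/train.jsonl")` returns the number of rows read and a list of problem descriptions; an empty list means every row matched the assumed shape. It does not execute the SQL or parse the places JSON, so it complements the validate step and the pytest suite and does not replace them.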