# Gazet Dataset Generation

Generates synthetic training data for fine-tuning the geocoding model.

Two datasets come out of one pipeline run:

- **SQL generation**: `(question + candidates) -> DuckDB SQL`
- **Place extraction**: `question -> place names JSON`

Both tasks export in **conversation format** (`messages` list of
system/user/assistant turns), ready for chat-template fine-tuning.

---
## Prerequisites

```bash
uv sync
```

You need the Overture and Natural Earth parquet files under `data/` locally,
or on a Modal volume if running in the cloud.

Before large runs, normalize them once so both datasets use harmonized
geometry metadata and cross-source joins behave the same locally and on Modal:

```bash
gazet-dataset normalize-data --config dataset/config.yaml
```

This writes:

- `data/overture_normalized/divisions_area/part-000.parquet`
- `data/natural_earth_normalized/ne_geography.parquet`

When those files exist, `gazet.config` will prefer them automatically.
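
The preference for normalized files can be pictured as a plain existence check. The sketch below is not the actual `gazet.config` code; `pick_parquet` and its `fallback` argument are hypothetical names used only to illustrate the idea.

```python
from pathlib import Path


def pick_parquet(normalized: Path, fallback: Path) -> Path:
    """Return the normalized parquet if it exists, otherwise the fallback.

    Hypothetical helper: the real selection logic lives in `gazet.config`
    and may differ in detail.
    """
    return normalized if normalized.is_file() else fallback
```

Called with `data/natural_earth_normalized/ne_geography.parquet` as `normalized`, this would return that path once `normalize-data` has run, and the caller's original source path before that.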

---

## Option A: Run locally (small datasets, development)

Use this when you want to iterate quickly on a laptop with a subset of countries.

**Step 1: Pick a run name and countries in `config.yaml`**

```yaml
run_name: "v1"  # change this every time you generate fresh data
countries:
  - IN  # India
  - BR  # Brazil
  - US  # United States
  # add more, or use "- all" for every country (slow locally)
```

**Step 2: Run the full pipeline**

```bash
gazet-dataset full-pipeline --config dataset/config.yaml
```

This runs all four steps in order and puts the results in
`dataset/output/runs/{run_name}/` (here, `dataset/output/runs/v1/`).

If you want to run steps individually (e.g. to re-export without regenerating):

```bash
gazet-dataset build-relations  --config dataset/config.yaml  # ~5 min
gazet-dataset generate-samples --config dataset/config.yaml  # ~15 min
gazet-dataset validate         --config dataset/config.yaml  # ~5 min
gazet-dataset export           --config dataset/config.yaml  # <1 min
```

---
## Option B: Run on Modal (large datasets, production)

Use this when you need 10K+ samples or want to use all countries. Modal
distributes generation across many containers in parallel.

Modal uses two volumes:

- `gazet-data`: read-only source parquets (Overture + Natural Earth). Populated
  once by `modal-upload`.
- `gazet-intermediate`: entity inventories and relation tables built by the
  pipeline. Regenerated on each run.

**Step 1: One-time setup (only first time, or when source parquets change)**

First normalize the source geodata locally:

```bash
gazet-dataset normalize-data --config dataset/config.yaml
```

Then upload `data/` to Modal:

```bash
modal setup  # authenticate
gazet-dataset modal-upload --config dataset/config.yaml  # ~15 min, uploads data/ to gazet-data volume
```

Verify:

```bash
modal volume ls gazet-data
# should show: overture_normalized/, natural_earth_normalized/
```

Skip this step on subsequent runs; the volume persists across runs.

**Step 2: Set run name and targets in `config.yaml`**

```yaml
run_name: "v2"  # bump this every time you regenerate from scratch
countries:
  - all
sample_targets:
  adjacency: 1500
  containment: 1200
  # ... see config.yaml for all families
```

**Step 3: Run on Modal**

```bash
gazet-dataset modal-generate --config dataset/config.yaml --fresh
```

This builds inventories + relations, generates samples across ~100 containers,
validates, and exports. Output lands in `dataset/output/runs/{run_name}/`.

Flags:

- `--fresh` overwrites `dataset/output/dataset_raw.jsonl` instead of appending.
- `--skip-inventory` reuses `{divisions_area,natural_earth}_inventory.parquet`
  on the intermediate volume.
- `--skip-relations` reuses the seven `*_pairs.parquet` / `*_relations.parquet`
  files on the intermediate volume. Only safe when countries and template
  families are unchanged.

### Fresh-start recipe (after template / SQL / prompt changes)

Always clear stale state so nothing from the previous run leaks in:

```bash
# 1. Bump run_name in config.yaml (e.g. v1 -> v2)

# 2. Wipe the intermediate volume so inventories and relations are rebuilt
modal volume ls gazet-intermediate
# for each file shown:
modal volume rm gazet-intermediate <filename>

# 3. Remove local raw/validated files so nothing gets appended to
rm -f dataset/output/dataset_raw.jsonl dataset/output/dataset_validated.jsonl

# 4. Run the full pipeline
gazet-dataset modal-generate --config dataset/config.yaml --fresh
```

You do NOT need to re-run `modal-upload`, because the source parquets on
`gazet-data` don't change.

### Faster iteration (same templates, just more samples)

If relations + inventories are still valid from a previous run:

```bash
gazet-dataset modal-generate --config dataset/config.yaml \
  --skip-inventory --skip-relations --fresh
```

---
## Output

After running, your training files are at:

```
dataset/output/runs/{run_name}/
  sql/
    train.jsonl    <- fine-tune the SQL generation model
    val.jsonl
    test.jsonl
  places/
    train.jsonl    <- fine-tune the place extraction model
    val.jsonl
    test.jsonl
  stats.json       <- sample counts by family
```

Each JSONL row is a conversation-format dict:

```json
{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ]
}
```

**SQL task**: the system prompt includes the full two-table schema inside
`<SCHEMA>` tags. The user prompt contains only `<CANDIDATES>` CSV and
`<USER_QUERY>`. The assistant response is pretty-printed SQL (via `sqlparse`).
All parquet paths are symbolic (`divisions_area` / `natural_earth`), never
runtime-specific.

**Places task**: the system prompt includes output format, extraction rules,
and the full list of Overture subtypes. The assistant response is a JSON
object with a `places` array.
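
To spot-check an exported file by hand, a row can be parsed and its turns inspected. This is a minimal sketch that assumes every row holds exactly the three turns shown above, in that order; `check_row` and `places_from_row` are illustrative helpers, not part of the `gazet-dataset` CLI.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]


def check_row(line: str) -> dict:
    """Parse one JSONL line and confirm it has the three expected turns."""
    row = json.loads(line)
    roles = [m["role"] for m in row["messages"]]
    if roles != EXPECTED_ROLES:
        raise ValueError(f"unexpected roles: {roles}")
    if not all(isinstance(m["content"], str) for m in row["messages"]):
        raise ValueError("every message needs string content")
    return row


def places_from_row(row: dict) -> list:
    """For a places row, decode the assistant JSON and return its `places` array."""
    return json.loads(row["messages"][-1]["content"])["places"]
```

Running `check_row` over each line of `places/train.jsonl` and then `places_from_row` on the result is a quick way to see what the place extraction model is being trained to emit.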

---

## When to regenerate from scratch

Change `run_name` and regenerate from scratch whenever you:

- Change any SQL templates (`sql_templates.py`)
- Add new template families
- Change the candidate format or count
- Change the system/user prompt structure or content
- Change the export format

For local runs, the default is a clean run. For Modal, `modal-generate` appends
by default; pass `--fresh` to overwrite existing samples.
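
The append-versus-overwrite distinction is the reason stale samples can survive a template change. The sketch below only illustrates that behaviour with file modes; `write_samples` is a hypothetical function, not the pipeline's own writer.

```python
from pathlib import Path


def write_samples(path: Path, lines: list[str], fresh: bool) -> None:
    """Write JSONL lines, truncating first when `fresh` is true, else appending."""
    mode = "w" if fresh else "a"  # "a" keeps whatever an earlier run left behind
    with path.open(mode, encoding="utf-8") as fh:
        for line in lines:
            fh.write(line + "\n")
```

With `fresh=False`, rows generated from old templates stay in the file next to the new ones, which is why the fresh-start recipe both deletes the local JSONL files and passes `--fresh`.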

---

## Data quality checks

After a run, spot-check the output with the pytest suite:

```bash
uv run --extra dev pytest dataset/tests/ -v
```

The suite reads `dataset/output/dataset_validated.jsonl` plus the exported
`runs/{run_name}/*.jsonl` and verifies: schema, no unresolved `{placeholders}`
in questions, candidate refs resolve, SQL shape, template coverage,
subtype-filtered templates match their phrasing, disambiguation samples have
same-name distractors, and exported assistant payloads parse as valid
JSON / SQL. Tests skip gracefully when outputs are missing.
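
As an illustration of one of these checks, the unresolved-placeholder test can be approximated with a regular expression. This is a sketch under the assumption that placeholders look like `{identifier}`; the actual tests in `dataset/tests/` may use a different rule.

```python
import re

# Matches a brace-delimited identifier such as {country} or {place_name}.
PLACEHOLDER = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")


def unresolved_placeholders(question: str) -> list[str]:
    """Return every `{name}` token still present in a generated question."""
    return PLACEHOLDER.findall(question)
```

A question passes when the returned list is empty; any entry names the template slot that was never filled in.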

---

## Troubleshooting

**Very few samples generated for a family**

The generation loop makes `retry_multiplier × target` attempts per family and
discards SQL that returns empty results. Some families (e.g. `multi_adjacency`,
`chained`) have a lower success rate. Increase `sample_targets` for those
families or increase `retry_multiplier` in `config.yaml`.
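
The retry budget described above can be sketched as a bounded loop. This is an illustration, not the pipeline's code: `generate_family` and the `make_sample` callback are hypothetical, and stopping early once the target is reached is an assumption.

```python
from typing import Callable, Optional


def generate_family(
    target: int,
    retry_multiplier: int,
    make_sample: Callable[[], Optional[dict]],
) -> list[dict]:
    """Collect up to `target` samples within `retry_multiplier * target` attempts.

    `make_sample` is a hypothetical callback that returns None when the
    candidate SQL produced an empty result and must be discarded.
    """
    samples: list[dict] = []
    for _ in range(retry_multiplier * target):
        if len(samples) >= target:
            break  # assumed: stop once the family's target is met
        sample = make_sample()
        if sample is not None:
            samples.append(sample)
    return samples
```

Under this model, a family whose SQL returns rows only 20% of the time yields about `0.2 × 3 × target = 0.6 × target` samples with `retry_multiplier: 3`, which is why low-success families need a larger multiplier or target.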

**Relations step is slow**

This is normal for `countries: [all]`: the step is a spatial self-join over
millions of features. Use a country subset for development. Relations only
need to be rebuilt when you add countries or change template families.

**Validate step drops many samples**

The validate step re-executes every SQL query and discards ones that return
empty results. This is expected. Check
`dataset/output/runs/{run_name}/stats.json` for per-family counts after export.
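
To see which families fell short after validation, the per-family counts can be compared against `sample_targets`. The sketch assumes both are available as flat `family -> count` mappings; the real layout of `stats.json` may differ, and `shortfalls` is a hypothetical helper.

```python
def shortfalls(targets: dict[str, int], counts: dict[str, int]) -> dict[str, int]:
    """Return how many samples each family is missing relative to its target."""
    return {
        family: target - counts.get(family, 0)
        for family, target in targets.items()
        if counts.get(family, 0) < target
    }
```

Families that appear in the result are the ones to raise in `sample_targets` or to give a higher `retry_multiplier` on the next run.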