Gazet Dataset Generation
Generates synthetic training data for fine-tuning the geocoding model. Two datasets come out of one pipeline run:
- SQL generation: (question + candidates) -> DuckDB SQL
- Place extraction: question -> place names JSON
Both tasks export in conversation format (messages list of
system/user/assistant turns), ready for chat-template fine-tuning.
Prerequisites
uv sync
You need the Overture and Natural Earth parquet files under data/ locally,
or on a Modal volume if running in the cloud.
Before large runs, normalize them once so both datasets use harmonized geometry metadata and cross-source joins behave the same locally and on Modal:
gazet-dataset normalize-data --config dataset/config.yaml
This writes:
- data/overture_normalized/divisions_area/part-000.parquet
- data/natural_earth_normalized/ne_geography.parquet
When those files exist, gazet.config will prefer them automatically.
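The "prefer normalized files when present" behavior can be pictured with the sketch below. It is not the actual gazet.config code, which is not shown here. The helper name `pick_source` and its fallback argument are hypothetical; only the normalized path comes from the text above.

```python
from pathlib import Path

# Normalized output path listed above.
NORMALIZED_DIVISIONS = Path("data/overture_normalized/divisions_area/part-000.parquet")


def pick_source(normalized, fallback):
    """Return the normalized file if it exists, otherwise the fallback.

    Hypothetical helper for illustration; not part of gazet.config.
    """
    normalized = Path(normalized)
    return normalized if normalized.is_file() else Path(fallback)
```

Calling `pick_source(NORMALIZED_DIVISIONS, some_raw_path)` would then return the normalized file once normalize-data has been run, and the caller-supplied raw path before that.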
Option A: Run locally (small datasets, development)
Use this when you want to iterate quickly on a laptop with a subset of countries.
Step 1: Pick a run name and countries in config.yaml
run_name: "v1"   # change this every time you generate fresh data
countries:
  - IN   # India
  - BR   # Brazil
  - US   # United States
  # add more, or use "- all" for every country (slow locally)
Step 2: Run the full pipeline
gazet-dataset full-pipeline --config dataset/config.yaml
That's it. It runs all four steps in order and puts the results in
dataset/output/runs/{run_name}/ (dataset/output/runs/v1/ for the config above).
If you want to run steps individually (e.g. to re-export without regenerating):
gazet-dataset build-relations --config dataset/config.yaml # ~5 min
gazet-dataset generate-samples --config dataset/config.yaml # ~15 min
gazet-dataset validate --config dataset/config.yaml # ~5 min
gazet-dataset export --config dataset/config.yaml # <1 min
Option B: Run on Modal (large datasets, production)
Use this when you need 10K+ samples or want to use all countries. Modal distributes generation across many containers in parallel.
Modal uses two volumes:
- gazet-data: read-only source parquets (Overture + Natural Earth). Populated once by modal-upload.
- gazet-intermediate: entity inventories and relation tables built by the pipeline. Regenerated on each run.
Step 1: One-time setup (only first time, or when source parquets change)
First normalize the source geodata locally:
gazet-dataset normalize-data --config dataset/config.yaml
Then upload data/ to Modal:
modal setup # authenticate
gazet-dataset modal-upload --config dataset/config.yaml # ~15 min, uploads data/ to gazet-data volume
Verify:
modal volume ls gazet-data
# should show: overture_normalized/, natural_earth_normalized/
Skip this step on subsequent runs; the volume persists across runs.
Step 2: Set run name and targets in config.yaml
run_name: "v2"   # bump this every time you regenerate from scratch
countries:
  - all
sample_targets:
  adjacency: 1500
  containment: 1200
  # ... see config.yaml for all families
Step 3: Run on Modal
gazet-dataset modal-generate --config dataset/config.yaml --fresh
This builds inventories + relations, generates samples across ~100 containers,
validates, and exports. Output lands in dataset/output/runs/{run_name}/.
Flags:
- --fresh overwrites dataset/output/dataset_raw.jsonl instead of appending.
- --skip-inventory reuses {divisions_area,natural_earth}_inventory.parquet on the intermediate volume.
- --skip-relations reuses the seven *_pairs.parquet / *_relations.parquet files on the intermediate volume. Only safe when countries and template families are unchanged.
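If you drive runs from a script, the flag combinations above can be assembled programmatically. The sketch below is illustrative only: `modal_generate_argv` is a hypothetical helper, and it uses only the subcommand and flags named in this README.

```python
def modal_generate_argv(config="dataset/config.yaml", fresh=False,
                        skip_inventory=False, skip_relations=False):
    """Build the argument list for a modal-generate call (hypothetical helper)."""
    argv = ["gazet-dataset", "modal-generate", "--config", config]
    if skip_inventory:
        argv.append("--skip-inventory")
    if skip_relations:
        argv.append("--skip-relations")
    if fresh:
        argv.append("--fresh")
    return argv
```

The returned list can be passed to `subprocess.run(argv, check=True)` so that a failed run raises instead of continuing silently.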
Fresh-start recipe (after template / SQL / prompt changes)
Always clear stale state so nothing from the previous run leaks in:
# 1. Bump run_name in config.yaml (e.g. v1 -> v2)
# 2. Wipe the intermediate volume so inventories and relations are rebuilt
modal volume ls gazet-intermediate
# for each file shown:
modal volume rm gazet-intermediate <filename>
# 3. Remove local raw/validated files so nothing gets appended to
rm -f dataset/output/dataset_raw.jsonl dataset/output/dataset_validated.jsonl
# 4. Run the full pipeline
gazet-dataset modal-generate --config dataset/config.yaml --fresh
You do NOT need to re-run modal-upload, because the source parquets on
gazet-data don't change.
Faster iteration (same templates, just more samples)
If relations + inventories are still valid from a previous run:
gazet-dataset modal-generate --config dataset/config.yaml \
--skip-inventory --skip-relations --fresh
Output
After running, your training files are at:
dataset/output/runs/{run_name}/
  sql/
    train.jsonl   <- fine-tune the SQL generation model
    val.jsonl
    test.jsonl
  places/
    train.jsonl   <- fine-tune the place extraction model
    val.jsonl
    test.jsonl
  stats.json      <- sample counts by family
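A quick way to confirm that a run produced every expected file is sketched below. It assumes only the layout listed above, with stats.json at the run root; the helper name `missing_outputs` is made up for illustration and is not part of gazet-dataset.

```python
from pathlib import Path

# Relative paths taken from the output layout above.
EXPECTED_FILES = [
    "sql/train.jsonl",
    "sql/val.jsonl",
    "sql/test.jsonl",
    "places/train.jsonl",
    "places/val.jsonl",
    "places/test.jsonl",
    "stats.json",
]


def missing_outputs(run_dir):
    """Return the expected files that are absent under run_dir (hypothetical helper)."""
    root = Path(run_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

For example, `missing_outputs("dataset/output/runs/v2")` returns an empty list when the export step completed for a run named v2.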
Each JSONL row is a conversation-format dict:
{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ]
}
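The row shape above can be checked line by line with a few lines of Python. This is a minimal sketch: `check_row` is a hypothetical helper, and it assumes each row holds exactly one system, one user, and one assistant turn in that order, as in the example.

```python
import json

# Role order shown in the example row above.
EXPECTED_ROLES = ["system", "user", "assistant"]


def check_row(line):
    """Parse one JSONL line and return its messages if they match the expected shape."""
    row = json.loads(line)
    messages = row["messages"]
    roles = [m["role"] for m in messages]
    if roles != EXPECTED_ROLES:
        raise ValueError(f"unexpected role order: {roles}")
    for m in messages:
        if not isinstance(m["content"], str):
            raise ValueError("content must be a string")
    return messages
```

To scan a whole file, open it and call `check_row` on each non-empty line; the first malformed row raises and identifies the problem.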
SQL task: the system prompt includes the full two-table schema inside
<SCHEMA> tags. The user prompt contains only <CANDIDATES> CSV and
<USER_QUERY>. The assistant response is pretty-printed SQL (via sqlparse).
All parquet paths are symbolic (divisions_area / natural_earth), never
runtime-specific.
Places task: the system prompt includes output format, extraction rules,
and the full list of Overture subtypes. The assistant response is a JSON
object with a places array.
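For the places task, the assistant turn can be decoded as sketched below. The helper `parse_places` is hypothetical, and because the element type of the places array is not specified here, the sketch checks only that the array exists and leaves its items untouched.

```python
import json


def parse_places(assistant_content):
    """Decode an assistant response and return its places array (hypothetical helper)."""
    payload = json.loads(assistant_content)
    if not isinstance(payload, dict) or "places" not in payload:
        raise ValueError("expected a JSON object with a 'places' key")
    places = payload["places"]
    if not isinstance(places, list):
        raise ValueError("'places' must be an array")
    return places
```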
When to regenerate from scratch
Change run_name and regenerate from scratch whenever you:
- Change any SQL templates (sql_templates.py)
- Add new template families
- Change the candidate format or count
- Change the system/user prompt structure or content
- Change the export format
For local runs, the default is a clean run. For Modal, modal-generate appends
by default; pass --fresh to overwrite existing samples.
Data quality checks
After a run, spot-check the output with the pytest suite:
uv run --extra dev pytest dataset/tests/ -v
The suite reads dataset/output/dataset_validated.jsonl plus the exported
runs/{run_name}/*.jsonl and verifies: schema, no unresolved {placeholders}
in questions, candidate refs resolve, SQL shape, template coverage,
subtype-filtered templates match their phrasing, disambiguation samples have
same-name distractors, and exported assistant payloads parse as valid
JSON / SQL. Tests skip gracefully when outputs are missing.
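As an illustration of one of these checks, unresolved placeholders can be detected with a regular expression. This is a sketch, not the suite's actual test: the pattern assumes placeholders look like a Python identifier wrapped in braces, and `unresolved_placeholders` is a hypothetical name.

```python
import re

# Assumed placeholder syntax: an identifier wrapped in braces, e.g. {anchor_name}.
PLACEHOLDER = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")


def unresolved_placeholders(question):
    """Return every {placeholder} token still present in a generated question."""
    return PLACEHOLDER.findall(question)
```

A question passes the check when the returned list is empty.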
Troubleshooting
Very few samples generated for a family
The generation loop tries retry_multiplier × target and discards SQL that
returns empty results. Some families (e.g. multi_adjacency, chained) have
a lower success rate. Increase sample_targets for those families or increase
retry_multiplier in config.yaml.
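The arithmetic behind this advice can be sketched as follows. It rests on simplifying assumptions that are not taken from the pipeline code: every attempt succeeds with the same fixed rate, and generation stops once the target is reached. The function names are hypothetical.

```python
import math


def max_attempts(target, retry_multiplier):
    """Upper bound on generation attempts for one family."""
    return retry_multiplier * target


def expected_samples(target, retry_multiplier, success_rate):
    """Expected number of kept samples, capped at the target."""
    expected = max_attempts(target, retry_multiplier) * success_rate
    return min(target, math.floor(expected))


def multiplier_needed(success_rate):
    """Smallest whole retry_multiplier whose expected yield reaches the target."""
    return math.ceil(1 / success_rate)
```

Under these assumptions, a family with a 25% success rate and a target of 1000 yields about 750 samples at retry_multiplier 3 and reaches the target at retry_multiplier 4.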
Relations step is slow
Normal for countries: [all], since it's a spatial self-join over millions of
features. Use a country subset for development. Relations only need to be
rebuilt when you add countries or change template families.
Validate step drops many samples
The validate step re-executes every SQL query and discards ones that return
empty results. This is expected; check dataset/output/runs/{run_name}/stats.json
for per-family counts after export.
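To see which families fell short, per-family counts can be compared against sample_targets. The internal layout of stats.json is not documented here, so this sketch takes two plain family-to-count mappings and leaves loading them to the caller; `shortfalls` is a hypothetical helper.

```python
def shortfalls(targets, counts):
    """Return {family: missing_count} for every family below its target."""
    result = {}
    for family, target in targets.items():
        have = counts.get(family, 0)  # a family absent from counts is treated as zero
        if have < target:
            result[family] = target - have
    return result
```

Families that appear in the result are candidates for a higher sample_targets value or a larger retry_multiplier.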