gazet / dataset /README.md
srmsoumya's picture
Randomize candidate dataset order
582d1ab
# Gazet Dataset Generation
Generates synthetic training data for fine-tuning the geocoding model.
Two datasets come out of one pipeline run:
- **SQL generation**`(question + candidates) -> DuckDB SQL`
- **Place extraction**`question -> place names JSON`
Both tasks export in **conversation format** (`messages` list of
system/user/assistant turns), ready for chat-template fine-tuning.
---
## Prerequisites
```bash
uv sync
```
You need the Overture and Natural Earth parquet files under `data/` locally,
or on a Modal volume if running in the cloud.
Before large runs, normalize them once so both datasets use harmonized
geometry metadata and cross-source joins behave the same locally and on Modal:
```bash
gazet-dataset normalize-data --config dataset/config.yaml
```
This writes:
- `data/overture_normalized/divisions_area/part-000.parquet`
- `data/natural_earth_normalized/ne_geography.parquet`
When those files exist, `gazet.config` will prefer them automatically.
---
## Option A — Run locally (small datasets, development)
Use this when you want to iterate quickly on a laptop with a subset of countries.
**Step 1 — Pick a run name and countries in `config.yaml`**
```yaml
run_name: "v1" # change this every time you generate fresh data
countries:
- IN # India
- BR # Brazil
- US # United States
# add more, or use "- all" for every country (slow locally)
```
**Step 2 — Run the full pipeline**
```bash
gazet-dataset full-pipeline --config dataset/config.yaml
```
That's it. It runs all four steps in order and puts the results in
`dataset/output/runs/my-run-001/`.
If you want to run steps individually (e.g. to re-export without regenerating):
```bash
gazet-dataset build-relations --config dataset/config.yaml # ~5 min
gazet-dataset generate-samples --config dataset/config.yaml # ~15 min
gazet-dataset validate --config dataset/config.yaml # ~5 min
gazet-dataset export --config dataset/config.yaml # <1 min
```
---
## Option B — Run on Modal (large datasets, production)
Use this when you need 10 K+ samples or want to use all countries. Modal
distributes generation across many containers in parallel.
Modal uses two volumes:
- `gazet-data` — read-only source parquets (Overture + Natural Earth). Populated
once by `modal-upload`.
- `gazet-intermediate` — entity inventories and relation tables built by the
pipeline. Regenerated on each run.
**Step 1 — One-time setup (only first time, or when source parquets change)**
First normalize the source geodata locally:
```bash
gazet-dataset normalize-data --config dataset/config.yaml
```
Then upload `data/` to Modal:
```bash
modal setup # authenticate
gazet-dataset modal-upload --config dataset/config.yaml # ~15 min, uploads data/ to gazet-data volume
```
Verify:
```bash
modal volume ls gazet-data
# should show: overture_normalized/, natural_earth_normalized/
```
Skip this step on subsequent runs — the volume persists across runs.
**Step 2 — Set run name and targets in `config.yaml`**
```yaml
run_name: "v2" # bump this every time you regenerate from scratch
countries:
- all
sample_targets:
adjacency: 1500
containment: 1200
# ... see config.yaml for all families
```
**Step 3 — Run on Modal**
```bash
gazet-dataset modal-generate --config dataset/config.yaml --fresh
```
This builds inventories + relations, generates samples across ~100 containers,
validates, and exports. Output lands in `dataset/output/runs/{run_name}/`.
Flags:
- `--fresh` overwrites `dataset/output/dataset_raw.jsonl` instead of appending.
- `--skip-inventory` reuses `{divisions_area,natural_earth}_inventory.parquet`
on the intermediate volume.
- `--skip-relations` reuses the seven `*_pairs.parquet` / `*_relations.parquet`
files on the intermediate volume. Only safe when countries and template
families are unchanged.
### Fresh-start recipe (after template / SQL / prompt changes)
Always clear stale state so nothing from the previous run leaks in:
```bash
# 1. Bump run_name in config.yaml (e.g. v1 -> v2)
# 2. Wipe the intermediate volume so inventories and relations are rebuilt
modal volume ls gazet-intermediate
# for each file shown:
modal volume rm gazet-intermediate <filename>
# 3. Remove local raw/validated files so nothing gets appended to
rm -f dataset/output/dataset_raw.jsonl dataset/output/dataset_validated.jsonl
# 4. Run the full pipeline
gazet-dataset modal-generate --config dataset/config.yaml --fresh
```
You do NOT need to re-run `modal-upload` — source parquets on `gazet-data`
don't change.
### Faster iteration (same templates, just more samples)
If relations + inventories are still valid from a previous run:
```bash
gazet-dataset modal-generate --config dataset/config.yaml \
--skip-inventory --skip-relations --fresh
```
---
## Output
After running, your training files are at:
```
dataset/output/runs/{run_name}/
sql/
train.jsonl <- fine-tune the SQL generation model
val.jsonl
test.jsonl
places/
train.jsonl <- fine-tune the place extraction model
val.jsonl
test.jsonl
stats.json <- sample counts by family
```
Each JSONL row is a conversation-format dict:
```json
{
"messages": [
{"role": "system", "content": "..."},
{"role": "user", "content": "..."},
{"role": "assistant", "content": "..."}
]
}
```
**SQL task**: the system prompt includes the full two-table schema inside
`<SCHEMA>` tags. The user prompt contains only `<CANDIDATES>` CSV and
`<USER_QUERY>`. The assistant response is pretty-printed SQL (via `sqlparse`).
All parquet paths are symbolic (`divisions_area` / `natural_earth`), never
runtime-specific.
**Places task**: the system prompt includes output format, extraction rules,
and the full list of Overture subtypes. The assistant response is a JSON
object with a `places` array.
---
## When to regenerate from scratch
Change `run_name` and regenerate from scratch whenever you:
- Change any SQL templates (`sql_templates.py`)
- Add new template families
- Change the candidate format or count
- Change the system/user prompt structure or content
- Change the export format
For local runs, the default is a clean run. For Modal, `modal-generate` appends
by default; pass `--fresh` to overwrite existing samples.
---
## Data quality checks
After a run, spot-check the output with the pytest suite:
```bash
uv run --extra dev pytest dataset/tests/ -v
```
The suite reads `dataset/output/dataset_validated.jsonl` plus the exported
`runs/{run_name}/*.jsonl` and verifies: schema, no unresolved `{placeholders}`
in questions, candidate refs resolve, SQL shape, template coverage,
subtype-filtered templates match their phrasing, disambiguation samples have
same-name distractors, and exported assistant payloads parse as valid
JSON / SQL. Tests skip gracefully when outputs are missing.
---
## Troubleshooting
**Very few samples generated for a family**
The generation loop tries `retry_multiplier × target` and discards SQL that
returns empty results. Some families (e.g. `multi_adjacency`, `chained`) have
a lower success rate. Increase `sample_targets` for those families or increase
`retry_multiplier` in `config.yaml`.
**Relations step is slow**
Normal for `countries: [all]` — it's a spatial self-join over millions of
features. Use a country subset for development. Relations only need to be
rebuilt when you add countries or change template families.
**Validate step drops many samples**
The validate step re-executes every SQL query and discards ones that return
empty results. This is expected — check `output/runs/{run_name}/stats.json`
for per-family counts after export.