Gazet Dataset Generation

Generates synthetic training data for fine-tuning the geocoding model. Two datasets come out of one pipeline run:

  • SQL generation: (question + candidates) -> DuckDB SQL
  • Place extraction: question -> place names JSON

Both tasks export in conversation format (messages list of system/user/assistant turns), ready for chat-template fine-tuning.


Prerequisites

uv sync

You need the Overture and Natural Earth parquet files under data/ locally, or on a Modal volume if running in the cloud.

Before large runs, normalize them once so both datasets use harmonized geometry metadata and cross-source joins behave the same locally and on Modal:

gazet-dataset normalize-data --config dataset/config.yaml

This writes:

  • data/overture_normalized/divisions_area/part-000.parquet
  • data/natural_earth_normalized/ne_geography.parquet

When those files exist, gazet.config will prefer them automatically.


Option A: Run locally (small datasets, development)

Use this when you want to iterate quickly on a laptop with a subset of countries.

Step 1: Pick a run name and countries in config.yaml

run_name: "v1"   # change this every time you generate fresh data

countries:
  - IN   # India
  - BR   # Brazil
  - US   # United States
  # add more, or use "- all" for every country (slow locally)

Step 2: Run the full pipeline

gazet-dataset full-pipeline --config dataset/config.yaml

That's it. The command runs all four steps in order and writes the results to dataset/output/runs/{run_name}/ (dataset/output/runs/v1/ for the config above).

If you want to run steps individually (e.g. to re-export without regenerating):

gazet-dataset build-relations  --config dataset/config.yaml  # ~5 min
gazet-dataset generate-samples --config dataset/config.yaml  # ~15 min
gazet-dataset validate         --config dataset/config.yaml  # ~5 min
gazet-dataset export           --config dataset/config.yaml  # <1 min

Option B: Run on Modal (large datasets, production)

Use this when you need 10K+ samples or want to cover all countries. Modal distributes generation across many containers in parallel.

Modal uses two volumes:

  • gazet-data: read-only source parquets (Overture + Natural Earth). Populated once by modal-upload.
  • gazet-intermediate: entity inventories and relation tables built by the pipeline. Regenerated on each run.

Step 1: One-time setup (only the first time, or when source parquets change)

First normalize the source geodata locally:

gazet-dataset normalize-data --config dataset/config.yaml

Then upload data/ to Modal:

modal setup                                                # authenticate
gazet-dataset modal-upload --config dataset/config.yaml    # ~15 min, uploads data/ to gazet-data volume

Verify:

modal volume ls gazet-data
# should show: overture_normalized/, natural_earth_normalized/

Skip this step on subsequent runs: the volume persists across runs.

Step 2: Set run name and targets in config.yaml

run_name: "v2"   # bump this every time you regenerate from scratch

countries:
  - all

sample_targets:
  adjacency:     1500
  containment:   1200
  # ... see config.yaml for all families

Step 3: Run on Modal

gazet-dataset modal-generate --config dataset/config.yaml --fresh

This builds inventories + relations, generates samples across ~100 containers, validates, and exports. Output lands in dataset/output/runs/{run_name}/.

Flags:

  • --fresh overwrites dataset/output/dataset_raw.jsonl instead of appending.
  • --skip-inventory reuses {divisions_area,natural_earth}_inventory.parquet on the intermediate volume.
  • --skip-relations reuses the seven *_pairs.parquet / *_relations.parquet files on the intermediate volume. Only safe when countries and template families are unchanged.
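
If the command is launched from a script, the flag combinations above can be assembled programmatically. The following is a minimal sketch, assuming gazet-dataset is on PATH; the helper name build_modal_generate_cmd is hypothetical and is not part of the package. It uses only the flags listed above.

```python
from typing import List


def build_modal_generate_cmd(
    config: str = "dataset/config.yaml",
    fresh: bool = False,
    skip_inventory: bool = False,
    skip_relations: bool = False,
) -> List[str]:
    """Build the argv list for `gazet-dataset modal-generate` (hypothetical helper)."""
    cmd = ["gazet-dataset", "modal-generate", "--config", config]
    if skip_inventory:
        cmd.append("--skip-inventory")
    if skip_relations:
        cmd.append("--skip-relations")
    if fresh:
        cmd.append("--fresh")
    return cmd


if __name__ == "__main__":
    # Print the command rather than running it; pass the list to
    # subprocess.run(cmd, check=True) to actually execute it.
    print(" ".join(build_modal_generate_cmd(fresh=True)))
```

Returning a list rather than a shell string avoids quoting issues if the config path contains spaces.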

Fresh-start recipe (after template / SQL / prompt changes)

Always clear stale state so nothing from the previous run leaks in:

# 1. Bump run_name in config.yaml (e.g. v1 -> v2)

# 2. Wipe the intermediate volume so inventories and relations are rebuilt
modal volume ls gazet-intermediate
# for each file shown:
modal volume rm gazet-intermediate <filename>

# 3. Remove local raw/validated files so nothing gets appended to
rm -f dataset/output/dataset_raw.jsonl dataset/output/dataset_validated.jsonl

# 4. Run the full pipeline
gazet-dataset modal-generate --config dataset/config.yaml --fresh

You do NOT need to re-run modal-upload: the source parquets on gazet-data don't change.

Faster iteration (same templates, just more samples)

If relations + inventories are still valid from a previous run:

gazet-dataset modal-generate --config dataset/config.yaml \
  --skip-inventory --skip-relations --fresh

Output

After running, your training files are at:

dataset/output/runs/{run_name}/
  sql/
    train.jsonl    <- fine-tune the SQL generation model
    val.jsonl
    test.jsonl
  places/
    train.jsonl    <- fine-tune the place extraction model
    val.jsonl
    test.jsonl
  stats.json       <- sample counts by family
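
Before starting a fine-tuning job, it can help to confirm that a run directory actually contains every file in the layout above. This is a small sketch using only the standard library; missing_outputs is a hypothetical helper, and the file list is copied from the tree shown here.

```python
from pathlib import Path
from typing import List

# Relative paths taken from the output layout above.
EXPECTED_FILES = [
    "sql/train.jsonl", "sql/val.jsonl", "sql/test.jsonl",
    "places/train.jsonl", "places/val.jsonl", "places/test.jsonl",
    "stats.json",
]


def missing_outputs(run_dir: Path) -> List[str]:
    """Return the expected files (as relative paths) that are absent from run_dir."""
    return [rel for rel in EXPECTED_FILES if not (run_dir / rel).is_file()]


if __name__ == "__main__":
    # "v1" stands in for whatever run_name is set in config.yaml.
    run_dir = Path("dataset/output/runs") / "v1"
    for rel in missing_outputs(run_dir):
        print(f"missing: {run_dir / rel}")
```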

Each JSONL row is a conversation-format dict:

{
  "messages": [
    {"role": "system",    "content": "..."},
    {"role": "user",      "content": "..."},
    {"role": "assistant", "content": "..."}
  ]
}
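
A row in this shape can be read with the standard library alone. The sketch below assumes every row has exactly one system, one user, and one assistant turn in that order, as in the example above; iter_conversations is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator, List

# Assumption: each row has exactly these three turns, in this order.
EXPECTED_ROLES = ["system", "user", "assistant"]


def iter_conversations(lines: Iterable[str]) -> Iterator[List[dict]]:
    """Yield the messages list of each non-empty JSONL line, checking the role order."""
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        messages = json.loads(line)["messages"]
        roles = [m["role"] for m in messages]
        if roles != EXPECTED_ROLES:
            raise ValueError(f"line {lineno}: unexpected roles {roles}")
        yield messages
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, for example `with open(path, encoding="utf-8") as f: rows = list(iter_conversations(f))`.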

SQL task: the system prompt includes the full two-table schema inside <SCHEMA> tags. The user prompt contains only <CANDIDATES> CSV and <USER_QUERY>. The assistant response is pretty-printed SQL (via sqlparse). All parquet paths are symbolic (divisions_area / natural_earth), never runtime-specific.

Places task: the system prompt includes output format, extraction rules, and the full list of Overture subtypes. The assistant response is a JSON object with a places array.
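
For the Places task, the assistant payload can be checked by parsing it as JSON and confirming the top-level places array. This sketch makes no assumption about what each array element looks like, since that is defined by the system prompt; parse_places is a hypothetical helper.

```python
import json
from typing import Any, List


def parse_places(assistant_content: str) -> List[Any]:
    """Parse a Places-task assistant response and return its `places` array."""
    payload = json.loads(assistant_content)
    if not isinstance(payload, dict) or not isinstance(payload.get("places"), list):
        raise ValueError("expected a JSON object with a 'places' array")
    return payload["places"]
```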


When to regenerate from scratch

Change run_name and regenerate from scratch whenever you:

  • Change any SQL templates (sql_templates.py)
  • Add new template families
  • Change the candidate format or count
  • Change the system/user prompt structure or content
  • Change the export format

For local runs, the default is a clean run. For Modal, modal-generate appends by default; pass --fresh to overwrite existing samples.


Data quality checks

After a run, spot-check the output with the pytest suite:

uv run --extra dev pytest dataset/tests/ -v

The suite reads dataset/output/dataset_validated.jsonl plus the exported runs/{run_name}/*.jsonl and verifies: schema, no unresolved {placeholders} in questions, candidate refs resolve, SQL shape, template coverage, subtype-filtered templates match their phrasing, disambiguation samples have same-name distractors, and exported assistant payloads parse as valid JSON / SQL. Tests skip gracefully when outputs are missing.
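
As an illustration of one of these checks, unresolved placeholders can be detected with a regular expression. This is a sketch and not the code in dataset/tests/; it assumes placeholders are identifiers wrapped in braces, such as {anchor}, which is a hypothetical placeholder name.

```python
import re
from typing import List

# Assumption: a placeholder is an identifier in braces, e.g. {anchor}.
PLACEHOLDER_RE = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")


def unresolved_placeholders(question: str) -> List[str]:
    """Return every {placeholder} token still present in a generated question."""
    return PLACEHOLDER_RE.findall(question)
```

A question passes the check when the returned list is empty.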


Troubleshooting

Very few samples generated for a family

The generation loop tries retry_multiplier × target and discards SQL that returns empty results. Some families (e.g. multi_adjacency, chained) have a lower success rate. Increase sample_targets for those families or increase retry_multiplier in config.yaml.
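
The retry budget can be pictured with the following sketch. It is not the pipeline's actual loop: generate_family and try_once are hypothetical names, and stopping early once the target is reached is an assumption. try_once stands in for "fill a template and run the SQL" and returns None when the query comes back empty.

```python
from typing import Callable, List, Optional


def generate_family(
    target: int,
    retry_multiplier: int,
    try_once: Callable[[], Optional[dict]],
) -> List[dict]:
    """Collect up to `target` samples within `retry_multiplier * target` attempts."""
    samples: List[dict] = []
    for _ in range(retry_multiplier * target):
        if len(samples) >= target:
            break  # assumption: stop as soon as the target is met
        sample = try_once()
        if sample is not None:  # None models SQL that returned no rows
            samples.append(sample)
    return samples
```

Under this model, a family whose attempts succeed less often than 1 / retry_multiplier of the time will usually fall short of its target, which is why raising either sample_targets or retry_multiplier helps.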

Relations step is slow

Normal for countries: [all], because it's a spatial self-join over millions of features. Use a country subset for development. Relations only need to be rebuilt when you add countries or change template families.

Validate step drops many samples

The validate step re-executes every SQL query and discards ones that return empty results. This is expected; check dataset/output/runs/{run_name}/stats.json for per-family counts after export.