	# Gazet Dataset Generation

	Generates synthetic training data for fine-tuning the geocoding model.
	Two datasets come out of one pipeline run:

	- SQL generation — `(question + candidates) -> DuckDB SQL`
	- Place extraction — `question -> place names JSON`

	Both tasks export in conversation format (`messages` list of
	system/user/assistant turns), ready for chat-template fine-tuning.

	---

	## Prerequisites

	```bash
	uv sync
	```

	You need the Overture and Natural Earth parquet files under `data/` locally,
	or on a Modal volume if running in the cloud.

	Before large runs, normalize them once so both sources use harmonized
	geometry metadata and cross-source joins behave the same locally and on Modal:

	```bash
	gazet-dataset normalize-data --config dataset/config.yaml
	```

	This writes:

	- `data/overture_normalized/divisions_area/part-000.parquet`
	- `data/natural_earth_normalized/ne_geography.parquet`

	When those files exist, `gazet.config` will prefer them automatically.
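
The preference for normalized files can be pictured as a plain existence check. The sketch below is only an illustration of that idea, not the actual `gazet.config` code: the function name `resolve_source` and the raw fallback path are assumptions made for the example.

```python
from pathlib import Path


def resolve_source(normalized: Path, raw: Path) -> Path:
    """Return the normalized parquet if it exists, otherwise the raw one."""
    return normalized if normalized.exists() else raw


# Hypothetical usage: the second path is a placeholder, not a real gazet path.
divisions = resolve_source(
    Path("data/overture_normalized/divisions_area/part-000.parquet"),
    Path("data/overture/divisions_area.parquet"),
)
```

With a check like this, running `normalize-data` once is enough for later runs to pick up the normalized files without any config change.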

	---

	## Option A — Run locally (small datasets, development)

	Use this when you want to iterate quickly on a laptop with a subset of countries.

	Step 1 — Pick a run name and countries in `config.yaml`

	```yaml
	run_name: "v1" # change this every time you generate fresh data

	countries:
	- IN # India
	- BR # Brazil
	- US # United States
	# add more, or use "- all" for every country (slow locally)
	```

	Step 2 — Run the full pipeline

	```bash
	gazet-dataset full-pipeline --config dataset/config.yaml
	```

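	That's it. It runs all four steps (build relations, generate samples,
	validate, export) in order and writes the results to
	`dataset/output/runs/{run_name}/`, which is `dataset/output/runs/v1/` for
	the config above.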

	If you want to run steps individually (e.g. to re-export without regenerating):

	```bash
	gazet-dataset build-relations --config dataset/config.yaml # ~5 min
	gazet-dataset generate-samples --config dataset/config.yaml # ~15 min
	gazet-dataset validate --config dataset/config.yaml # ~5 min
	gazet-dataset export --config dataset/config.yaml # <1 min
	```

	---

	## Option B — Run on Modal (large datasets, production)

	Use this when you need 10,000+ samples or want to use all countries. Modal
	distributes generation across many containers in parallel.

	Modal uses two volumes:

	- `gazet-data` — read-only source parquets (Overture + Natural Earth). Populated
	once by `modal-upload`.
	- `gazet-intermediate` — entity inventories and relation tables built by the
	pipeline. Regenerated on each run.

	Step 1 — One-time setup (only first time, or when source parquets change)

	First normalize the source geodata locally:

	```bash
	gazet-dataset normalize-data --config dataset/config.yaml
	```

	Then upload `data/` to Modal:

	```bash
	modal setup # authenticate
	gazet-dataset modal-upload --config dataset/config.yaml # ~15 min, uploads data/ to gazet-data volume
	```

	Verify:

	```bash
	modal volume ls gazet-data
	# should show: overture_normalized/, natural_earth_normalized/
	```

	Skip this step on subsequent runs — the volume persists across runs.

	Step 2 — Set run name and targets in `config.yaml`

	```yaml
	run_name: "v2" # bump this every time you regenerate from scratch

	countries:
	- all

	sample_targets:
	  adjacency: 1500
	  containment: 1200
	  # ... see config.yaml for all families
	```

	Step 3 — Run on Modal

	```bash
	gazet-dataset modal-generate --config dataset/config.yaml --fresh
	```

	This builds inventories + relations, generates samples across ~100 containers,
	validates, and exports. Output lands in `dataset/output/runs/{run_name}/`.

	Flags:

	- `--fresh` overwrites `dataset/output/dataset_raw.jsonl` instead of appending.
	- `--skip-inventory` reuses `{divisions_area,natural_earth}_inventory.parquet`
	on the intermediate volume.
	- `--skip-relations` reuses the seven `_pairs.parquet` / `_relations.parquet`
	files on the intermediate volume. Only safe when countries and template
	families are unchanged.

	### Fresh-start recipe (after template / SQL / prompt changes)

	Always clear stale state so nothing from the previous run leaks in:

	```bash
	# 1. Bump run_name in config.yaml (e.g. v1 -> v2)

	# 2. Wipe the intermediate volume so inventories and relations are rebuilt
	modal volume ls gazet-intermediate
	# for each file shown:
	modal volume rm gazet-intermediate <filename>

	# 3. Remove local raw/validated files so nothing gets appended to
	rm -f dataset/output/dataset_raw.jsonl dataset/output/dataset_validated.jsonl

	# 4. Run the full pipeline
	gazet-dataset modal-generate --config dataset/config.yaml --fresh
	```

	You do NOT need to re-run `modal-upload` — source parquets on `gazet-data`
	don't change.

	### Faster iteration (same templates, just more samples)

	If relations + inventories are still valid from a previous run:

	```bash
	gazet-dataset modal-generate --config dataset/config.yaml \
	--skip-inventory --skip-relations --fresh
	```

	---

	## Output

	After running, your training files are at:

	```
	dataset/output/runs/{run_name}/
	  sql/
	    train.jsonl   <- fine-tune the SQL generation model
	    val.jsonl
	    test.jsonl
	  places/
	    train.jsonl   <- fine-tune the place extraction model
	    val.jsonl
	    test.jsonl
	  stats.json      <- sample counts by family
	```
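
A quick way to confirm that a run produced every split is to count rows per file. The sketch below assumes the directory layout shown above; `count_rows` is a hypothetical helper, not part of the `gazet-dataset` CLI.

```python
from pathlib import Path

TASKS = ("sql", "places")
SPLITS = ("train", "val", "test")


def count_rows(run_dir: Path) -> dict[str, int]:
    """Count non-empty lines in each exported split; missing files count as 0."""
    counts = {}
    for task in TASKS:
        for split in SPLITS:
            path = run_dir / task / f"{split}.jsonl"
            if path.exists():
                with path.open(encoding="utf-8") as fh:
                    counts[f"{task}/{split}"] = sum(1 for line in fh if line.strip())
            else:
                counts[f"{task}/{split}"] = 0
    return counts


# Example call; replace "v1" with your own run_name:
# count_rows(Path("dataset/output/runs/v1"))
```

A split that reports 0 rows usually means the export step did not run or the run name in the path does not match `config.yaml`.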

	Each JSONL row is a conversation-format dict:

	```json
	{
	  "messages": [
	    {"role": "system", "content": "..."},
	    {"role": "user", "content": "..."},
	    {"role": "assistant", "content": "..."}
	  ]
	}
	```
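
To sanity-check a row before feeding it to a fine-tuning job, something like the following can help. It is a minimal sketch that assumes every row carries exactly the three turns shown above, in that order; `check_row` is a hypothetical name.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]


def check_row(line: str) -> dict:
    """Parse one JSONL line and verify the system/user/assistant turn order."""
    row = json.loads(line)
    messages = row["messages"]
    roles = [m["role"] for m in messages]
    if roles != EXPECTED_ROLES:
        raise ValueError(f"unexpected roles: {roles}")
    if not all(isinstance(m["content"], str) for m in messages):
        raise ValueError("every message needs string content")
    return row
```

Calling `check_row` on each line of a `train.jsonl` file raises on the first malformed row, which is usually enough for a spot check.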

	SQL task: the system prompt includes the full two-table schema inside
	`<SCHEMA>` tags. The user prompt contains only `<CANDIDATES>` CSV and
	`<USER_QUERY>`. The assistant response is pretty-printed SQL (via `sqlparse`).
	All parquet paths are symbolic (`divisions_area` / `natural_earth`), never
	runtime-specific.

	Places task: the system prompt includes output format, extraction rules,
	and the full list of Overture subtypes. The assistant response is a JSON
	object with a `places` array.
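
The tagged sections and the `places` payload can be pulled apart with a few lines of Python. This is a sketch under stated assumptions: it presumes each tag has a matching closing tag (`</CANDIDATES>`, `</USER_QUERY>`) and that the candidates CSV starts with a header row. The helper names are hypothetical, and the shape of the items inside `places` is not specified here, so the code returns them as-is.

```python
import csv
import io
import json
import re


def extract_tag(text: str, tag: str) -> str:
    """Return the text between <TAG> and </TAG>, stripped of outer whitespace."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"<{tag}> block not found")
    return match.group(1).strip()


def parse_candidates(user_prompt: str) -> list[dict[str, str]]:
    """Read the <CANDIDATES> CSV into dicts keyed by the header row."""
    body = extract_tag(user_prompt, "CANDIDATES")
    return list(csv.DictReader(io.StringIO(body)))


def parse_places(assistant_content: str) -> list:
    """Return the `places` array from a Places-task assistant response."""
    return json.loads(assistant_content)["places"]
```

If the real prompts do not close their tags this way, only the regular expression in `extract_tag` needs to change.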

	---

	## When to regenerate from scratch

	Change `run_name` and regenerate from scratch whenever you:

	- Change any SQL templates (`sql_templates.py`)
	- Add new template families
	- Change the candidate format or count
	- Change the system/user prompt structure or content
	- Change the export format

	For local runs, the default is a clean run. For Modal, `modal-generate` appends
	by default; pass `--fresh` to overwrite existing samples.
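
The difference between appending and `--fresh` comes down to the file mode used when writing the raw samples. The snippet below only illustrates that behavior and is not the pipeline's own code; `write_samples` is a hypothetical helper.

```python
from pathlib import Path


def write_samples(path: Path, lines: list[str], fresh: bool) -> None:
    """Write JSONL lines, truncating first when `fresh` is set, else appending."""
    mode = "w" if fresh else "a"
    with path.open(mode, encoding="utf-8") as fh:
        for line in lines:
            fh.write(line + "\n")
```

In append mode, samples from an earlier run with old templates stay in the file, which is why a regeneration from scratch should always overwrite.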

	---

	## Data quality checks

	After a run, spot-check the output with the pytest suite:

	```bash
	uv run --extra dev pytest dataset/tests/ -v
	```

	The suite reads `dataset/output/dataset_validated.jsonl` plus the exported
	`runs/{run_name}/*.jsonl` and verifies:

	- schema
	- no unresolved `{placeholders}` in questions
	- candidate refs resolve
	- SQL shape
	- template coverage
	- subtype-filtered templates match their phrasing
	- disambiguation samples have same-name distractors
	- exported assistant payloads parse as valid JSON / SQL

	Tests skip gracefully when outputs are missing.

	---

	## Troubleshooting

	**Very few samples generated for a family**

	The generation loop makes `retry_multiplier × target` attempts for each
	family and discards SQL that returns empty results. Some families (e.g.
	`multi_adjacency`, `chained`) have a lower success rate. Raise
	`sample_targets` for those families or raise `retry_multiplier` in
	`config.yaml`.
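
The effect of the retry budget can be sketched as follows. This is a simplified model rather than the pipeline's actual loop: `generate_family` and `try_one` are hypothetical names, and stopping early once the target is reached is an assumption.

```python
from typing import Callable, Optional


def generate_family(
    target: int,
    retry_multiplier: int,
    try_one: Callable[[], Optional[dict]],
) -> list[dict]:
    """Collect up to `target` samples within retry_multiplier * target attempts."""
    samples: list[dict] = []
    for _ in range(retry_multiplier * target):
        if len(samples) >= target:
            break  # assumption: stop once the target is met
        sample = try_one()  # returns None when the SQL yields no rows
        if sample is not None:
            samples.append(sample)
    return samples
```

Under this model, a family whose attempts succeed one time in five cannot reach its target with `retry_multiplier: 3`, because 300 attempts yield only about 60 samples for a target of 100. A larger multiplier closes that gap.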

	**Relations step is slow**

	Normal for `countries: [all]` — it's a spatial self-join over millions of
	features. Use a country subset for development. Relations only need to be
	rebuilt when you add countries or change template families.

	**Validate step drops many samples**

	The validate step re-executes every SQL query and discards ones that return
	empty results. This is expected. Check
	`dataset/output/runs/{run_name}/stats.json` for per-family counts after
	export.
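
The filtering rule itself is simple: run the query and keep the sample only if at least one row comes back. The sketch below shows that rule with Python's built-in `sqlite3` as a stand-in for DuckDB, so it runs without extra dependencies. The `sql` key and the toy `places` table are assumptions for the example, as is dropping queries that fail to execute.

```python
import sqlite3


def keep_non_empty(conn: sqlite3.Connection, samples: list[dict]) -> list[dict]:
    """Re-run each sample's SQL and keep those returning at least one row."""
    kept = []
    for sample in samples:
        try:
            row = conn.execute(sample["sql"]).fetchone()
        except sqlite3.Error:
            continue  # assumption: queries that fail to execute are dropped too
        if row is not None:
            kept.append(sample)
    return kept


# Toy data: one query matches a row, the other returns nothing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (name TEXT, country TEXT)")
conn.execute("INSERT INTO places VALUES ('Goa', 'IN')")
samples = [
    {"sql": "SELECT name FROM places WHERE country = 'IN'"},
    {"sql": "SELECT name FROM places WHERE country = 'BR'"},
]
kept = keep_non_empty(conn, samples)
```

Here `kept` holds only the first sample, since the second query matches no rows.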