Spaces: developmentseed/gazet
srmsoumya committed on Apr 23

Commit dfb9466, 1 parent: ca28f70

Fix: No pairs are created for mixed queries

- Normalize overture & natural earth to EPSG:4326
- Add a better adjacency matrix to increase the overlap in query pairs
- Fix title-case & lowercase subtype names in NE
- Reduce number of threads in duckdb to prevent memory issues
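
The last bullet is carried by two environment variables that appear in the `dataset/modal_app.py` diff, `GAZET_DUCKDB_THREADS` and `GAZET_DUCKDB_MEMORY_LIMIT`. How the project reads them is not visible in this part of the diff, so the sketch below is only an assumption about the consuming side: `duckdb_settings_from_env` is a hypothetical helper, and the fallback defaults (4 threads, 24GB) are taken from the `SET` lines removed from `build_relations.py`.

```python
import os
import re


def duckdb_settings_from_env(env=None):
    """Build DuckDB SET statements from GAZET_DUCKDB_* variables (hypothetical helper)."""
    env = os.environ if env is None else env
    threads = int(env.get("GAZET_DUCKDB_THREADS", "4"))
    memory_limit = env.get("GAZET_DUCKDB_MEMORY_LIMIT", "24GB")
    if threads < 1:
        raise ValueError("GAZET_DUCKDB_THREADS must be >= 1")
    # Validate the value before interpolating it into SQL text.
    if not re.fullmatch(r"\d+(\.\d+)?\s?(KB|MB|GB|TB|KiB|MiB|GiB|TiB)", memory_limit):
        raise ValueError(f"unexpected memory limit: {memory_limit!r}")
    return [f"SET threads={threads}", f"SET memory_limit='{memory_limit}'"]


# Values set in the Modal image by this commit.
statements = duckdb_settings_from_env(
    {"GAZET_DUCKDB_THREADS": "1", "GAZET_DUCKDB_MEMORY_LIMIT": "20GB"}
)
print(statements)
```

Each returned statement would then be passed to `con.execute(...)` on the DuckDB connection after the spatial extension is loaded.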

Files changed (19)

dataset/README.md +40 -1
dataset/config.smalltest.yaml +40 -0
dataset/config.yaml +16 -16
dataset/modal_app.py +47 -11
dataset/scripts/build_inventory.py +6 -4
dataset/scripts/build_relations.py +274 -201
dataset/scripts/cli.py +20 -2
dataset/scripts/export_training_data.py +1 -1
dataset/scripts/generate_samples.py +320 -132
dataset/scripts/normalize_geodata.py +87 -0
dataset/scripts/sql_templates.py +18 -18
dataset/scripts/validate_dataset.py +2 -2
finetune/README.md +29 -11
finetune/eval_demo.py +12 -6
finetune/train_modal_qwen35.py +1 -1
gazet_demo.py +81 -10
ingest/convert_natural_earth.py +1 -1
src/gazet/config.py +27 -2
src/gazet/sql.py +25 -0

dataset/README.md CHANGED

@@ -20,6 +20,20 @@ uv sync
 You need the Overture and Natural Earth parquet files under `data/` locally,
 or on a Modal volume if running in the cloud.
 ---
 ## Option A — Run locally (small datasets, development)
@@ -72,6 +86,14 @@ Modal uses two volumes:
 **Step 1 — One-time setup (only first time, or when source parquets change)**
 ```bash
 modal setup                                                # authenticate
 gazet-dataset modal-upload --config dataset/config.yaml    # ~15 min, uploads data/ to gazet-data volume
@@ -81,7 +103,7 @@ Verify:
 ```bash
 modal volume ls gazet-data
-# should show: overture/, natural_earth/, natural_earth_geoparquet/
 ```
 Skip this step on subsequent runs — the volume persists across runs.
@@ -207,6 +229,23 @@ by default; pass `--fresh` to overwrite existing samples.
 ---
 ## Troubleshooting
 **Very few samples generated for a family**

 You need the Overture and Natural Earth parquet files under `data/` locally,
 or on a Modal volume if running in the cloud.
+Before large runs, normalize them once so both datasets use harmonized
+geometry metadata and cross-source joins behave the same locally and on Modal:
+```bash
+gazet-dataset normalize-data --config dataset/config.yaml
+```
+This writes:
+- `data/overture_normalized/divisions_area/part-000.parquet`
+- `data/natural_earth_normalized/ne_geography.parquet`
+When those files exist, `gazet.config` will prefer them automatically.
 ---
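The preference described above (normalized files win when they exist) can be sketched as follows. This illustrates the idea only and is not the code in `src/gazet/config.py`: `pick_parquet` is a hypothetical helper, and the raw fallback file name is a placeholder chosen for the example.

```python
import tempfile
from pathlib import Path


def pick_parquet(data_dir: Path, normalized_rel: str, raw_rel: str) -> Path:
    """Return the normalized parquet if present, otherwise the raw one."""
    normalized = data_dir / normalized_rel
    return normalized if normalized.exists() else data_dir / raw_rel


NE_NORMALIZED = "natural_earth_normalized/ne_geography.parquet"  # path from the README
NE_RAW = "natural_earth_geoparquet/raw.parquet"  # hypothetical raw file name

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    before = pick_parquet(data_dir, NE_NORMALIZED, NE_RAW)

    # Simulate the output of `gazet-dataset normalize-data`.
    target = data_dir / NE_NORMALIZED
    target.parent.mkdir(parents=True)
    target.touch()
    after = pick_parquet(data_dir, NE_NORMALIZED, NE_RAW)

    before_rel = before.relative_to(data_dir).as_posix()
    after_rel = after.relative_to(data_dir).as_posix()

print(before_rel, "->", after_rel)
```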
 ## Option A — Run locally (small datasets, development)
 **Step 1 — One-time setup (only first time, or when source parquets change)**
+First normalize the source geodata locally:
+```bash
+gazet-dataset normalize-data --config dataset/config.yaml
+```
+Then upload `data/` to Modal:
 ```bash
 modal setup                                                # authenticate
 gazet-dataset modal-upload --config dataset/config.yaml    # ~15 min, uploads data/ to gazet-data volume
 ```bash
 modal volume ls gazet-data
+# should show: overture/, overture_normalized/, natural_earth_geoparquet/, natural_earth_normalized/
 ```
 Skip this step on subsequent runs — the volume persists across runs.
 ---
+## Data quality checks
+After a run, spot-check the output with the pytest suite:
+```bash
+uv run --extra dev pytest dataset/tests/ -v
+```
+The suite reads `dataset/output/dataset_validated.jsonl` plus the exported
+`runs/{run_name}/*.jsonl` and verifies: schema, no unresolved `{placeholders}`
+in questions, candidate refs resolve, SQL shape, template coverage,
+subtype-filtered templates match their phrasing, disambiguation samples have
+same-name distractors, and exported assistant payloads parse as valid
+JSON / SQL. Tests skip gracefully when outputs are missing.
+---
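One of the checks listed above, no unresolved `{placeholders}` in questions, can be approximated with a regular expression. The sketch below is an assumption about how such a check might look; it is not taken from `dataset/tests/`, and the `question` field name and the sample records are made up for illustration.

```python
import json
import re

# Matches template slots such as {anchor_name} that were never filled in.
PLACEHOLDER = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")


def unresolved_placeholders(jsonl_text: str, field: str = "question"):
    """Return (line_number, placeholder) pairs for every unfilled slot."""
    problems = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        for match in PLACEHOLDER.findall(record.get(field, "")):
            problems.append((line_no, match))
    return problems


sample = "\n".join([
    json.dumps({"question": "Which regions border Belgium?"}),
    json.dumps({"question": "Which localities are within {container_name}?"}),
])
problems = unresolved_placeholders(sample)
print(problems)
```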
 ## Troubleshooting
 **Very few samples generated for a family**

dataset/config.smalltest.yaml ADDED

@@ -0,0 +1,40 @@

+countries:
+  - BE
+  - CH
+  - AE
+sample_targets:
+  direct_lookup:       4
+  disambiguation:      6
+  adjacency:          12
+  multi_adjacency:     2
+  containment:         8
+  intersection:        8
+  buffer:             10
+  chained:            22
+  difference:          4
+  border_corridor:     2
+  set_operations:     10
+  partial_selection:  10
+  aggregation:         8
+  window_function:     4
+  attribute_filter:    6
+generation:
+  max_workers: 4
+  retry_multiplier: 2
+  append_mode: false
+auto_scaling:
+  safety_factor: 1.25
+  manual_limits: {}
+modal:
+  volume_name: "gazet-data"
+  app_name: "gazet-dataset"
+  num_containers: 10
+  container_cpu: 2
+  container_memory: 4096
+  timeout: 7200
+run_name: "smalltest-v1"

dataset/config.yaml CHANGED

@@ -17,21 +17,21 @@ countries:
 # Bumped families with many templates or mixed-source variants so each
 # template_id gets enough coverage after uniform sampling + stratified split.
 sample_targets:
-  direct_lookup:       500
-  disambiguation:     1500   # 3 templates (disambiguate_01..03) - "Puri, Odisha" pattern
-  adjacency:          1800   # 6 templates (adj_01..06) - adj_06 is counties
-  multi_adjacency:     300
-  containment:        1200   # 4 templates (contain_01..04)
-  intersection:       1200   # 4 templates (intersect_01..04)
-  buffer:             1200   # 5 templates (buffer_01..05)
-  chained:            3300   # 11 templates (chained_01..11) - 10/11 are coastal/inland regions
-  difference:          900   # 2 templates, one is mixed (diff_02)
-  border_corridor:     300
-  set_operations:      900
-  partial_selection:  1500   # 5 templates, one is mixed (partial_05)
-  aggregation:         750
-  window_function:     300
-  attribute_filter:    600   # 3 templates (attr_01..03)
 # Generation settings
 generation:
@@ -58,7 +58,7 @@ modal:
   num_containers: 100               # Number of parallel containers for sample generation
   container_cpu: 2                  # CPUs per container
   container_memory: 4096            # Memory (MB) per container
-  timeout: 3600                     # Per-container timeout in seconds
 # Run name — used to version exported splits so re-runs never overwrite previous data.
 # Change this whenever you regenerate from scratch (e.g. after template changes).

 # Bumped families with many templates or mixed-source variants so each
 # template_id gets enough coverage after uniform sampling + stratified split.
 sample_targets:
+  direct_lookup:      1000
+  disambiguation:     2000   # 3 templates (disambiguate_01..03) - "Puri, Odisha" pattern
+  adjacency:          2000   # 6 templates (adj_01..06) - adj_06 is counties
+  multi_adjacency:    1000
+  containment:        2000   # 4 templates (contain_01..04) - contain_02 reversed, contain_03/04 NE anchor
+  intersection:       2000   # 4 templates (intersect_01..04) - intersect_02/03 NE anchor
+  buffer:             2000   # 5 templates (buffer_01..05)
+  chained:            2000   # 11 templates (chained_01..11) - 10/11 are coastal/inland regions
+  difference:         2000   # 2 templates, one is mixed (diff_02)
+  border_corridor:    1000
+  set_operations:     2000
+  partial_selection:  2000   # 5 templates, one is mixed (partial_05)
+  aggregation:        1500
+  window_function:    1000
+  attribute_filter:   1000   # 3 templates (attr_01..03)
 # Generation settings
 generation:
   num_containers: 100               # Number of parallel containers for sample generation
   container_cpu: 2                  # CPUs per container
   container_memory: 4096            # Memory (MB) per container
+  timeout: 7200                     # Per-container timeout in seconds
 # Run name — used to version exported splits so re-runs never overwrite previous data.
 # Change this whenever you regenerate from scratch (e.g. after template changes).

dataset/modal_app.py CHANGED

@@ -24,7 +24,16 @@ image = (
         "pyarrow>=17.0.0",
         "pyyaml>=6.0",
     )
-    .env({"GAZET_DATA_DIR": VOLUME_MOUNT, "PYTHONPATH": "/root"})
     .add_local_dir("src/gazet", "/root/gazet")
     .add_local_dir("dataset", "/root/dataset")
 )
@@ -50,9 +59,9 @@ def build_inventory_remote():
 @app.function(
     image=image,
     volumes={VOLUME_MOUNT: volume, INTERMEDIATE_MOUNT: intermediate_volume},
-    timeout=3600,
-    cpu=4,
-    memory=32768,
 )
 def build_relation_remote(relation_type: str, countries: list, limit: int):
     """Compute one relation type and save to intermediate volume."""
@@ -72,7 +81,7 @@ def build_relation_remote(relation_type: str, countries: list, limit: int):
 @app.function(
     image=image,
     volumes={VOLUME_MOUNT: volume, INTERMEDIATE_MOUNT: intermediate_volume},
-    timeout=3600,
     cpu=2,
     memory=4096,
 )
@@ -126,13 +135,40 @@ def run_pipeline(
         relation_needs = calculate_relation_limits(config)
-        handles = []
-        for rel_type, limit in relation_needs.items():
-            h = build_relation_remote.spawn(rel_type, countries, max(limit, 500))
-            handles.append((rel_type, h))
-        for rel_type, h in handles:
-            result = h.get()
             print(f"  {rel_type}: {result['count']} pairs")
     print(f"Generating samples across {n_containers} containers...")

         "pyarrow>=17.0.0",
         "pyyaml>=6.0",
     )
+    .env(
+        {
+            "GAZET_DATA_DIR": VOLUME_MOUNT,
+            "PYTHONPATH": "/root",
+            # Spatial self-joins are much more stable with conservative
+            # DuckDB settings inside Modal containers.
+            "GAZET_DUCKDB_THREADS": "1",
+            "GAZET_DUCKDB_MEMORY_LIMIT": "20GB",
+        }
+    )
     .add_local_dir("src/gazet", "/root/gazet")
     .add_local_dir("dataset", "/root/dataset")
 )
 @app.function(
     image=image,
     volumes={VOLUME_MOUNT: volume, INTERMEDIATE_MOUNT: intermediate_volume},
+    timeout=7200,
+    cpu=2,
+    memory=65536,
 )
 def build_relation_remote(relation_type: str, countries: list, limit: int):
     """Compute one relation type and save to intermediate volume."""
 @app.function(
     image=image,
     volumes={VOLUME_MOUNT: volume, INTERMEDIATE_MOUNT: intermediate_volume},
+    timeout=7200,
     cpu=2,
     memory=4096,
 )
         relation_needs = calculate_relation_limits(config)
+        # Global containment-style relations are the most expensive and don't
+        # need extremely large anchor tables to support sample generation.
+        if countries == ["all"]:
+            for rel_type, cap in {
+                "containment": 12000,
+                "coastal_containment": 8000,
+                "landlocked_containment": 8000,
+            }.items():
+                if rel_type in relation_needs:
+                    relation_needs[rel_type] = min(relation_needs[rel_type], cap)
+        # Spatial relation builds are the most crash-prone part of the Modal
+        # pipeline. Run them sequentially with conservative DuckDB settings
+        # rather than fanning out several large native spatial joins at once.
+        # common_neighbor still runs after adjacency because it depends on the
+        # adjacency parquet being committed first.
+        ordered_relations = [
+            rel_type
+            for rel_type in (
+                "adjacency",
+                "containment",
+                "intersection",
+                "cross_source",
+                "coastal_containment",
+                "landlocked_containment",
+                "common_neighbor",
+            )
+            if rel_type in relation_needs
+        ]
+        for rel_type in ordered_relations:
+            limit = max(relation_needs[rel_type], 500)
+            print(f"  building {rel_type} (limit={limit})...")
+            result = build_relation_remote.remote(rel_type, countries, limit)
             print(f"  {rel_type}: {result['count']} pairs")
     print(f"Generating samples across {n_containers} containers...")

dataset/scripts/build_inventory.py CHANGED

@@ -37,10 +37,11 @@ def build_divisions_area_inventory(con: duckdb.DuckDBPyConnection) -> pd.DataFra
         ST_XMax(geometry) AS xmax,
         ST_YMax(geometry) AS ymax
     FROM read_parquet(?)
-    WHERE names."primary" IS NOT NULL
       AND trim(names."primary") != ''
     """
     df = con.execute(query, [DIVISIONS_AREA_PATH]).fetchdf()
     print(f"Divisions area inventory: {len(df)} entities")
     print(f"Subtypes: {df['subtype'].value_counts().to_dict()}")
@@ -69,10 +70,11 @@ def build_natural_earth_inventory(con: duckdb.DuckDBPyConnection) -> pd.DataFram
         ST_XMax(geometry) AS xmax,
         ST_YMax(geometry) AS ymax
     FROM read_parquet(?)
-    WHERE names."primary" IS NOT NULL
       AND trim(names."primary") != ''
     """
     df = con.execute(query, [NATURAL_EARTH_PATH]).fetchdf()
     print(f"\nNatural earth inventory: {len(df)} entities")
     print(f"Subtypes: {df['subtype'].value_counts().to_dict()}")

         ST_XMax(geometry) AS xmax,
         ST_YMax(geometry) AS ymax
     FROM read_parquet(?)
+    WHERE names."primary" IS NOT NULL
       AND trim(names."primary") != ''
+      AND geometry IS NOT NULL
     """
     df = con.execute(query, [DIVISIONS_AREA_PATH]).fetchdf()
     print(f"Divisions area inventory: {len(df)} entities")
     print(f"Subtypes: {df['subtype'].value_counts().to_dict()}")
         ST_XMax(geometry) AS xmax,
         ST_YMax(geometry) AS ymax
     FROM read_parquet(?)
+    WHERE names."primary" IS NOT NULL
       AND trim(names."primary") != ''
+      AND geometry IS NOT NULL
     """
     df = con.execute(query, [NATURAL_EARTH_PATH]).fetchdf()
     print(f"\nNatural earth inventory: {len(df)} entities")
     print(f"Subtypes: {df['subtype'].value_counts().to_dict()}")

dataset/scripts/build_relations.py CHANGED

@@ -14,10 +14,12 @@ Output:
 - intermediate/cross_source_relations.parquet
 """
 import duckdb
 import pandas as pd
-from pathlib import Path
-from concurrent.futures import ThreadPoolExecutor, as_completed
 from gazet.config import DIVISIONS_AREA_PATH, NATURAL_EARTH_PATH
@@ -26,6 +28,43 @@ from gazet.config import DIVISIONS_AREA_PATH, NATURAL_EARTH_PATH
 _EXCLUDED_SUBTYPES_FOR_GLOBAL = ("locality", "neighborhood", "microhood", "macrohood")
 def _country_filter(countries: list) -> tuple[str, list]:
     """Return (SQL WHERE clause, params) handling 'all' sentinel."""
     if countries == ["all"]:
@@ -46,6 +85,29 @@ def _country_filter_for_join(countries: list) -> tuple[str, list]:
     return f"WHERE country IN (SELECT unnest(?)) {subtype_clause}", [countries]
 def compute_adjacency_pairs(
     con: duckdb.DuckDBPyConnection,
     countries: list,
@@ -95,50 +157,100 @@ def compute_adjacency_pairs(
     return df
 def compute_containment_pairs(
     con: duckdb.DuckDBPyConnection,
     countries: list,
     limit: int
 ) -> pd.DataFrame:
-    """Find all pairs where one feature contains another."""
-    print("\nComputing containment pairs (optimized)...")
-    cfilter, cparams = _country_filter(countries)
-    query = f"""
-    WITH features AS (
-        SELECT
-            id,
-            names."primary" AS name,
-            subtype,
-            country,
-            admin_level,
-            geometry,
-            ST_Envelope(geometry) AS bbox
-        FROM read_parquet(?)
-        {cfilter}
-    )
-    SELECT
-        a.id AS container_id,
-        a.name AS container_name,
-        a.subtype AS container_subtype,
-        b.id AS contained_id,
-        b.name AS contained_name,
-        b.subtype AS contained_subtype,
-        'containment' AS relation_type
-    FROM features AS a
-    JOIN features AS b ON (
-        a.id != b.id
-        AND a.admin_level < b.admin_level
-        AND ST_Intersects(a.bbox, b.bbox)
-        AND ST_Within(b.geometry, a.geometry)
-    )
-    LIMIT ?
-    """
-    df = con.execute(query, [DIVISIONS_AREA_PATH] + cparams + [limit]).fetchdf()
     print(f"Found {len(df)} containment pairs")
     return df
@@ -198,60 +310,71 @@ def compute_cross_source_relations(
 ) -> pd.DataFrame:
     """Find relations between divisions_area and natural_earth.
-    Covers all natural_earth subtypes that appear in SQL templates:
-    seas/oceans (adjacency, buffer, chained), terrain areas and island
-    groups (chained_03, intersect_02, buffer_03/04).
     """
-    print("\nComputing cross-source relations...")
     cfilter, cparams = _country_filter(countries)
-    query = f"""
-    WITH divisions AS (
-        SELECT
-            id,
-            names."primary" AS name,
-            subtype,
-            country,
-            geometry
-        FROM read_parquet(?)
-        {cfilter}
-    ),
-    natural_features AS (
-        SELECT
-            id,
-            names."primary" AS name,
-            subtype,
-            ST_SetCRS(geometry, 'OGC:CRS84') AS geometry
-        FROM read_parquet(?)
-        WHERE subtype IN (
-            'sea', 'ocean', 'Lake', 'River', 'Basin', 'gulf', 'bay',
-            'Terrain area', 'Island group', 'Peninsula', 'Strait',
-            'Reef', 'Range/Mts', 'Depression'
         )
-    )
-    SELECT
-        d.id AS division_id,
-        d.name AS division_name,
-        d.subtype AS division_subtype,
-        d.country AS division_country,
-        n.id AS natural_id,
-        n.name AS natural_name,
-        n.subtype AS natural_subtype,
-        CASE
-            WHEN ST_Touches(d.geometry, n.geometry) THEN 'touches'
-            WHEN ST_Within(d.geometry, n.geometry) THEN 'within'
-            WHEN ST_Contains(d.geometry, n.geometry) THEN 'contains'
-            WHEN ST_Intersects(d.geometry, n.geometry) THEN 'intersects'
-        END AS relation_type
-    FROM divisions AS d
-    JOIN natural_features AS n ON ST_Intersects(d.geometry, n.geometry)
-    LIMIT ?
-    """
-    df = con.execute(
-        query, [DIVISIONS_AREA_PATH] + cparams + [NATURAL_EARTH_PATH, limit]
-    ).fetchdf()
     print(f"Found {len(df)} cross-source relations")
     return df
@@ -261,64 +384,29 @@ def compute_coastal_containment_pairs(
     countries: list,
     limit: int,
 ) -> pd.DataFrame:
-    """Containment pairs where the container is in a coastal country.
-    Used by chained_01 (coastal towns of X) to ensure sampled containment
-    anchors actually have sea-adjacent sub-features, keeping the SQL
-    verification step from constantly returning empty results.
-    Strategy: find countries whose geometry intersects any ocean/sea in
-    natural_earth, then filter containment_pairs to those countries.
     """
-    print("\nComputing coastal containment pairs...")
-    cfilter, cparams = _country_filter(countries)
-    query = f"""
-    WITH coastal_countries AS (
-        SELECT DISTINCT d.country
-        FROM read_parquet(?) AS d
-        JOIN read_parquet(?) AS n
-          ON ST_Intersects(d.geometry, ST_SetCRS(n.geometry, 'OGC:CRS84'))
-        WHERE d.subtype = 'country'
-          AND n.subtype IN ('sea', 'ocean')
-    ),
-    features AS (
-        SELECT
-            id,
-            names."primary" AS name,
-            subtype,
-            country,
-            admin_level,
-            geometry,
-            ST_Envelope(geometry) AS bbox
-        FROM read_parquet(?)
-        {cfilter}
-    )
-    SELECT
-        a.id AS container_id,
-        a.name AS container_name,
-        a.subtype AS container_subtype,
-        b.id AS contained_id,
-        b.name AS contained_name,
-        b.subtype AS contained_subtype,
-        a.country AS container_country,
-        'coastal_containment' AS relation_type
-    FROM features AS a
-    JOIN features AS b ON (
-        a.id != b.id
-        AND a.admin_level < b.admin_level
-        AND ST_Intersects(a.bbox, b.bbox)
-        AND ST_Within(b.geometry, a.geometry)
-    )
-    WHERE a.country IN (SELECT country FROM coastal_countries)
-    LIMIT ?
     """
-    df = con.execute(
-        query,
-        [DIVISIONS_AREA_PATH, NATURAL_EARTH_PATH] + cparams + [DIVISIONS_AREA_PATH, limit],
-    ).fetchdf()
     print(f"Found {len(df)} coastal containment pairs")
     return df
@@ -328,61 +416,29 @@ def compute_landlocked_containment_pairs(
     countries: list,
     limit: int,
 ) -> pd.DataFrame:
-    """Containment pairs where the container is in a landlocked country.
-    Used by chained_02 (landlocked regions within X) to ensure sampled
-    anchors genuinely have no sea access, keeping SQL verification from
-    always returning empty.
     """
-    print("\nComputing landlocked containment pairs...")
-    cfilter, cparams = _country_filter(countries)
-    query = f"""
-    WITH coastal_countries AS (
-        SELECT DISTINCT d.country
-        FROM read_parquet(?) AS d
-        JOIN read_parquet(?) AS n
-          ON ST_Intersects(d.geometry, ST_SetCRS(n.geometry, 'OGC:CRS84'))
-        WHERE d.subtype = 'country'
-          AND n.subtype IN ('sea', 'ocean')
-    ),
-    features AS (
-        SELECT
-            id,
-            names."primary" AS name,
-            subtype,
-            country,
-            admin_level,
-            geometry,
-            ST_Envelope(geometry) AS bbox
-        FROM read_parquet(?)
-        {cfilter}
-    )
-    SELECT
-        a.id AS container_id,
-        a.name AS container_name,
-        a.subtype AS container_subtype,
-        b.id AS contained_id,
-        b.name AS contained_name,
-        b.subtype AS contained_subtype,
-        a.country AS container_country,
-        'landlocked_containment' AS relation_type
-    FROM features AS a
-    JOIN features AS b ON (
-        a.id != b.id
-        AND a.admin_level < b.admin_level
-        AND ST_Intersects(a.bbox, b.bbox)
-        AND ST_Within(b.geometry, a.geometry)
-    )
-    WHERE a.country NOT IN (SELECT country FROM coastal_countries)
-    LIMIT ?
     """
-    df = con.execute(
-        query,
-        [DIVISIONS_AREA_PATH, NATURAL_EARTH_PATH] + cparams + [DIVISIONS_AREA_PATH, limit],
-    ).fetchdf()
     print(f"Found {len(df)} landlocked containment pairs")
     return df
@@ -411,6 +467,21 @@ def compute_common_neighbor_pairs(
         ])
     query = """
     SELECT DISTINCT
         a1.anchor_id   AS anchor_id_1,
         a1.anchor_name AS anchor_name_1,
@@ -418,8 +489,8 @@ def compute_common_neighbor_pairs(
         a2.anchor_name AS anchor_name_2,
         a1.target_id   AS shared_neighbor_id,
         a1.target_name AS shared_neighbor_name
-    FROM read_parquet(?) AS a1
-    JOIN read_parquet(?) AS a2
       ON a1.target_id = a2.target_id
      AND a1.anchor_id < a2.anchor_id
     LIMIT ?
@@ -435,9 +506,11 @@ def _make_connection():
     con = duckdb.connect()
     con.execute("INSTALL spatial")
     con.execute("LOAD spatial")
-    con.execute("SET memory_limit='24GB'")
     con.execute("SET temp_directory='/tmp/duckdb_tmp'")
-    con.execute("SET threads=4")
     return con

 - intermediate/cross_source_relations.parquet
 """
+import os
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from pathlib import Path
 import duckdb
 import pandas as pd
 from gazet.config import DIVISIONS_AREA_PATH, NATURAL_EARTH_PATH
 _EXCLUDED_SUBTYPES_FOR_GLOBAL = ("locality", "neighborhood", "microhood", "macrohood")
+# (container_subtype, contained_subtype) combos used by the chained,
+# containment, and disambiguation templates. A naive self-join with
+# LIMIT fills up with coarse pairs (country -> region / county) first and
+# never emits locality-level pairs; stratifying by combo ensures each
+# template has anchors to draw from.
+_CONTAINMENT_SUBTYPE_PAIRS = (
+    ("country",    "region"),
+    ("country",    "county"),
+    ("country",    "localadmin"),
+    ("country",    "locality"),
+    ("region",     "county"),
+    ("region",     "localadmin"),
+    ("region",     "locality"),
+    ("county",     "locality"),
+    ("localadmin", "locality"),
+)
+# Natural Earth subtype vocabulary in the current GeoParquet.
+# Keep these exact strings in one place so relation building and templates
+# stay aligned with the underlying dataset.
+_NE_CROSS_SOURCE_SUBTYPES = (
+    "sea",
+    "ocean",
+    "Lake",
+    "River",
+    "Basin",
+    "gulf",
+    "bay",
+    "Island group",
+    "Peninsula",
+    "strait",
+    "Range/mtn",
+    "Depression",
+)
 def _country_filter(countries: list) -> tuple[str, list]:
     """Return (SQL WHERE clause, params) handling 'all' sentinel."""
     if countries == ["all"]:
     return f"WHERE country IN (SELECT unnest(?)) {subtype_clause}", [countries]
+def _country_chunks(
+    con: duckdb.DuckDBPyConnection,
+    countries: list,
+    chunk_size: int = 40,
+) -> list[list[str]]:
+    """Return explicit country batches for safer global containment joins."""
+    if countries != ["all"]:
+        return [countries]
+    rows = con.execute(
+        """
+        SELECT DISTINCT country
+        FROM read_parquet(?)
+        WHERE country IS NOT NULL
+          AND trim(country) != ''
+        ORDER BY country
+        """,
+        [DIVISIONS_AREA_PATH],
+    ).fetchall()
+    codes = [row[0] for row in rows]
+    return [codes[i:i + chunk_size] for i in range(0, len(codes), chunk_size)]
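
The batching at the end of `_country_chunks` is plain list slicing. A standalone sketch, with made-up country codes standing in for the DuckDB query result:

```python
def chunk(codes, chunk_size=40):
    """Split a list into consecutive batches of at most chunk_size items."""
    return [codes[i:i + chunk_size] for i in range(0, len(codes), chunk_size)]


# Placeholder codes; the real function reads distinct countries from the parquet.
codes = [f"C{i:03d}" for i in range(95)]
batches = chunk(codes)
print([len(batch) for batch in batches])
```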
 def compute_adjacency_pairs(
     con: duckdb.DuckDBPyConnection,
     countries: list,
     return df
+def _stratified_containment(
+    con: duckdb.DuckDBPyConnection,
+    countries: list,
+    limit: int,
+    relation_type: str,
+    extra_where: str = "",
+    extra_params: list = None,
+) -> pd.DataFrame:
+    """Compute containment pairs stratified by (container_subtype, contained_subtype).
+    A single global self-join with LIMIT fills up with coarse country->region
+    pairs before emitting any locality-level pairs. We run one focused query
+    per subtype combo instead so every combo receives a fair share of the
+    overall limit.
+    ``extra_where`` / ``extra_params`` let the coastal and landlocked variants
+    inject their country-set filter without duplicating the whole body.
+    """
+    extra_params = extra_params or []
+    # Use a lower target per subtype combo for global runs; they are the most
+    # memory-intensive part of the pipeline and don't need huge anchor tables.
+    if countries == ["all"]:
+        per_combo = min(max(limit // len(_CONTAINMENT_SUBTYPE_PAIRS), 100), 1500)
+    else:
+        per_combo = max(limit // len(_CONTAINMENT_SUBTYPE_PAIRS), 100)
+    country_batches = _country_chunks(con, countries)
+    frames: list[pd.DataFrame] = []
+    for container_st, contained_st in _CONTAINMENT_SUBTYPE_PAIRS:
+        remaining = per_combo
+        combo_parts: list[pd.DataFrame] = []
+        for batch in country_batches:
+            if remaining <= 0:
+                break
+            cfilter, cparams = _country_filter(batch)
+            query = f"""
+            WITH a AS (
+                SELECT src.id, src.names."primary" AS name, src.subtype, src.country, src.admin_level,
+                       src.geometry, ST_Envelope(src.geometry) AS bbox
+                FROM read_parquet(?) AS src
+                WHERE src.subtype = '{container_st}'
+                  {cfilter.replace("WHERE", "AND") if cfilter else ""}
+                  {extra_where}
+            ),
+            b AS (
+                SELECT dst.id, dst.names."primary" AS name, dst.subtype, dst.country, dst.admin_level,
+                       dst.geometry, ST_Envelope(dst.geometry) AS bbox
+                FROM read_parquet(?) AS dst
+                WHERE dst.subtype = '{contained_st}'
+                  {cfilter.replace("WHERE", "AND") if cfilter else ""}
+            )
+            SELECT
+                a.id AS container_id,
+                a.name AS container_name,
+                a.subtype AS container_subtype,
+                b.id AS contained_id,
+                b.name AS contained_name,
+                b.subtype AS contained_subtype,
+                a.country AS container_country,
+                '{relation_type}' AS relation_type
+            FROM a JOIN b ON (
+                a.id != b.id
+                AND ST_Intersects(a.bbox, b.bbox)
+                AND ST_Within(b.geometry, a.geometry)
+            )
+            LIMIT {remaining}
+            """
+            params = [DIVISIONS_AREA_PATH] + cparams + extra_params + [DIVISIONS_AREA_PATH] + cparams
+            df_part = con.execute(query, params).fetchdf()
+            if not df_part.empty:
+                combo_parts.append(df_part)
+                remaining -= len(df_part)
+        df_combo = (
+            pd.concat(combo_parts, ignore_index=True)
+            if combo_parts else pd.DataFrame()
+        )
+        print(f"  {relation_type} {container_st:>10s} -> {contained_st:<10s}: {len(df_combo)} pairs")
+        frames.append(df_combo)
+    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
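
The per-combo budget in `_stratified_containment` follows directly from the constants in the diff: nine subtype pairs, a floor of 100, and a ceiling of 1500 for global runs. A small sketch of that arithmetic (`per_combo_budget` is a hypothetical name; the formula is the one used above):

```python
N_COMBOS = 9  # len(_CONTAINMENT_SUBTYPE_PAIRS) in the diff


def per_combo_budget(limit: int, is_global: bool) -> int:
    """Rows requested per (container_subtype, contained_subtype) pair."""
    share = max(limit // N_COMBOS, 100)
    return min(share, 1500) if is_global else share


for limit in (500, 12000, 50000):
    print(limit, per_combo_budget(limit, True), per_combo_budget(limit, False))
```

With the 12000 cap that `modal_app.py` applies to global containment, each pair gets 12000 // 9 = 1333 rows, so in the Modal pipeline the 1500 ceiling only comes into play if that cap is raised. Because of the floor, the combined total can exceed `limit` for small limits (up to 9 × 100 = 900 rows for `limit=500`).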
 def compute_containment_pairs(
     con: duckdb.DuckDBPyConnection,
     countries: list,
     limit: int
 ) -> pd.DataFrame:
+    """Find containment pairs stratified across admin-level combinations."""
+    print("\nComputing containment pairs (stratified by subtype combo)...")
+    df = _stratified_containment(con, countries, limit, relation_type="containment")
     print(f"Found {len(df)} containment pairs")
     return df
 ) -> pd.DataFrame:
     """Find relations between divisions_area and natural_earth.
+    The join is heavily skewed by a few abundant Natural Earth subtypes
+    (especially rivers and mountain ranges). We therefore stratify by exact
+    natural subtype so seas/oceans, gulfs/bays, island groups, and rarer
+    landforms all make it into the anchor pool.
     """
+    print("\nComputing cross-source relations (stratified by NE subtype)...")
     cfilter, cparams = _country_filter(countries)
+    per_subtype = max(limit // len(_NE_CROSS_SOURCE_SUBTYPES), 50)
+    frames: list[pd.DataFrame] = []
+    for natural_subtype in _NE_CROSS_SOURCE_SUBTYPES:
+        query = f"""
+        WITH divisions AS (
+            SELECT
+                id,
+                names."primary" AS name,
+                subtype,
+                country,
+                geometry
+            FROM read_parquet(?)
+            WHERE geometry IS NOT NULL
+              AND names."primary" IS NOT NULL
+              AND trim(names."primary") != ''
+              {cfilter.replace("WHERE", "AND") if cfilter else ''}
+        ),
+        natural_features AS (
+            SELECT
+                id,
+                names."primary" AS name,
+                subtype,
+                geometry
+            FROM read_parquet(?)
+            WHERE geometry IS NOT NULL
+              AND names."primary" IS NOT NULL
+              AND trim(names."primary") != ''
+              AND subtype = '{natural_subtype}'
         )
+        SELECT
+            d.id AS division_id,
+            d.name AS division_name,
+            d.subtype AS division_subtype,
+            d.country AS division_country,
+            n.id AS natural_id,
+            n.name AS natural_name,
+            n.subtype AS natural_subtype,
+            CASE
+                WHEN ST_Touches(d.geometry, n.geometry) THEN 'touches'
+                WHEN ST_Within(d.geometry, n.geometry) THEN 'within'
+                WHEN ST_Contains(d.geometry, n.geometry) THEN 'contains'
+                WHEN ST_Intersects(d.geometry, n.geometry) THEN 'intersects'
+            END AS relation_type
+        FROM divisions AS d
+        JOIN natural_features AS n
+          ON ST_Intersects(d.geometry, n.geometry)
+        LIMIT {per_subtype}
+        """
+        df_part = con.execute(
+            query,
+            [DIVISIONS_AREA_PATH] + cparams + [NATURAL_EARTH_PATH],
+        ).fetchdf()
+        print(f"  cross_source {natural_subtype:>12s}: {len(df_part)} rows")
+        frames.append(df_part)
+    df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
     print(f"Found {len(df)} cross-source relations")
     return df
     countries: list,
     limit: int,
 ) -> pd.DataFrame:
+    """Stratified containment pairs limited to coastal-country containers.
+    Used by chained_01 (coastal towns of X) so sampled anchors actually have
+    sea-adjacent sub-features. Stratification guarantees coverage of every
+    admin-level combination (country->locality, region->locality, etc.).
     """
+    print("\nComputing coastal containment pairs (stratified)...")
+    extra_where = f"""
+              AND EXISTS (
+                  SELECT 1
+                  FROM read_parquet('{NATURAL_EARTH_PATH}') AS n
+                  WHERE n.geometry IS NOT NULL
+                    AND n.names."primary" IS NOT NULL
+                    AND trim(n.names."primary") != ''
+                    AND n.subtype IN ('sea', 'ocean')
+                    AND ST_Intersects(src.geometry, n.geometry)
+              )
     """
+    df = _stratified_containment(
+        con, countries, limit,
+        relation_type="coastal_containment",
+        extra_where=extra_where,
+    )
     print(f"Found {len(df)} coastal containment pairs")
     return df
     countries: list,
     limit: int,
 ) -> pd.DataFrame:
+    """Stratified containment pairs limited to landlocked-country containers.
+    Used by chained_02 (landlocked localities within X). Stratification by
+    subtype combo ensures locality-level pairs are actually present in the
+    output instead of being starved by coarse country->region pairs.
     """
+    print("\nComputing landlocked containment pairs (stratified)...")
+    extra_where = f"""
+              AND NOT EXISTS (
+                  SELECT 1
+                  FROM read_parquet('{NATURAL_EARTH_PATH}') AS n
+                  WHERE n.geometry IS NOT NULL
+                    AND n.names."primary" IS NOT NULL
+                    AND trim(n.names."primary") != ''
+                    AND n.subtype IN ('sea', 'ocean')
+                    AND ST_Intersects(src.geometry, n.geometry)
+              )
     """
+    df = _stratified_containment(
+        con, countries, limit,
+        relation_type="landlocked_containment",
+        extra_where=extra_where,
+    )
     print(f"Found {len(df)} landlocked containment pairs")
     return df
         ])
     query = """
+    WITH undirected AS (
+        SELECT
+            anchor_id,
+            anchor_name,
+            target_id,
+            target_name
+        FROM read_parquet(?)
+        UNION ALL
+        SELECT
+            target_id AS anchor_id,
+            target_name AS anchor_name,
+            anchor_id AS target_id,
+            anchor_name AS target_name
+        FROM read_parquet(?)
+    )
     SELECT DISTINCT
         a1.anchor_id   AS anchor_id_1,
         a1.anchor_name AS anchor_name_1,
         a2.anchor_name AS anchor_name_2,
         a1.target_id   AS shared_neighbor_id,
         a1.target_name AS shared_neighbor_name
+    FROM undirected AS a1
+    JOIN undirected AS a2
       ON a1.target_id = a2.target_id
      AND a1.anchor_id < a2.anchor_id
     LIMIT ?
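
The `undirected` CTE in the query above mirrors every stored pair, so a division that is only ever recorded as a `target` can also act as an `anchor` when looking for shared neighbours. The following is a minimal sketch of the same idea, assuming an in-memory SQLite table stands in for the adjacency parquet. The table name `adjacency`, the helper `shared_neighbors`, and the sample rows are made up for illustration; the actual pipeline runs the query through DuckDB's `read_parquet`.

```python
import sqlite3

# Hypothetical one-directional adjacency rows:
# (anchor_id, anchor_name, target_id, target_name)
ROWS = [
    ("A", "Alpha", "C", "Gamma"),  # A borders C, stored once
    ("C", "Gamma", "B", "Beta"),   # C borders B, stored once with C as anchor
]

QUERY = """
WITH undirected AS (
    SELECT anchor_id, anchor_name, target_id, target_name FROM adjacency
    UNION ALL
    -- mirrored copy: swap anchor and target
    SELECT target_id, target_name, anchor_id, anchor_name FROM adjacency
)
SELECT DISTINCT
    a1.anchor_id AS anchor_id_1,
    a2.anchor_id AS anchor_id_2,
    a1.target_id AS shared_neighbor_id
FROM undirected AS a1
JOIN undirected AS a2
  ON a1.target_id = a2.target_id
 AND a1.anchor_id < a2.anchor_id
ORDER BY 1, 2, 3
"""


def shared_neighbors(rows):
    """Return (anchor_1, anchor_2, shared_neighbor) triples for the given pairs."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE adjacency ("
        "anchor_id TEXT, anchor_name TEXT, target_id TEXT, target_name TEXT)"
    )
    con.executemany("INSERT INTO adjacency VALUES (?, ?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


PAIRS = shared_neighbors(ROWS)
```

With the one-directional rows alone, A and B never share a `target_id`; after mirroring, both have C as a target, so the pair (A, B) is found with C as the shared neighbour.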
     con = duckdb.connect()
     con.execute("INSTALL spatial")
     con.execute("LOAD spatial")
+    memory_limit = os.environ.get("GAZET_DUCKDB_MEMORY_LIMIT", "12GB")
+    threads = int(os.environ.get("GAZET_DUCKDB_THREADS", "1"))
+    con.execute(f"SET memory_limit='{memory_limit}'")
     con.execute("SET temp_directory='/tmp/duckdb_tmp'")
+    con.execute(f"SET threads={threads}")
     return con
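
The connection helper above reads `GAZET_DUCKDB_MEMORY_LIMIT` and `GAZET_DUCKDB_THREADS` and interpolates them into `SET` statements with f-strings. Since the values end up in SQL text, it may be worth validating them before use. A minimal sketch, assuming a hypothetical helper `duckdb_settings_from_env` (not part of this commit) and a simple `<number><unit>` format for the memory limit:

```python
import re
from typing import List, Mapping

# Assumed format for the memory limit, e.g. "12GB" or "512MB".
# This pattern is an illustration, not DuckDB's full grammar.
_MEMORY_RE = re.compile(
    r"^\d+(\.\d+)?\s*(KB|MB|GB|TB|KiB|MiB|GiB|TiB)$", re.IGNORECASE
)


def duckdb_settings_from_env(env: Mapping[str, str]) -> List[str]:
    """Build SET statements from validated environment values (hypothetical helper)."""
    memory_limit = env.get("GAZET_DUCKDB_MEMORY_LIMIT", "12GB").strip()
    if not _MEMORY_RE.match(memory_limit):
        raise ValueError(f"Unexpected memory limit: {memory_limit!r}")

    # int() raises ValueError on non-numeric input, which is the desired behaviour.
    threads = int(env.get("GAZET_DUCKDB_THREADS", "1"))
    if threads < 1:
        raise ValueError("threads must be >= 1")

    return [f"SET memory_limit='{memory_limit}'", f"SET threads={threads}"]
```

A caller could then run `for stmt in duckdb_settings_from_env(os.environ): con.execute(stmt)`. The defaults mirror the ones in the diff (`12GB`, one thread).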

dataset/scripts/cli.py CHANGED

@@ -110,6 +110,19 @@ def calculate_relation_limits(config: dict) -> Dict[str, int]:
     return relation_needs
 def build_relations(config_path: Path):
     """Run relation building with config."""
     config = load_config(config_path)
@@ -269,6 +282,9 @@ def main():
         formatter_class=argparse.RawDescriptionHelpFormatter,
         epilog="""
 Examples:
   # Build relation tables only
   python cli.py build-relations --config ../config.yaml
@@ -296,7 +312,7 @@ Examples:
     parser.add_argument(
         'command',
-        choices=['build-relations', 'generate-samples', 'validate', 'export',
                  'full-pipeline', 'modal-upload', 'modal-generate'],
         help='Command to run'
     )
@@ -348,7 +364,9 @@ Examples:
     # Run the appropriate command
     try:
-        if args.command == 'build-relations':
             build_relations(args.config)
         elif args.command == 'generate-samples':
             generate_samples(args.config, args.append)

     return relation_needs
+def normalize_data():
+    """Build normalized source parquet copies with harmonized geometry metadata."""
+    print("=" * 60)
+    print("STEP 0: Normalizing Source Geodata")
+    print("=" * 60)
+    from dataset.scripts.normalize_geodata import normalize_geodata
+    result = normalize_geodata()
+    for name, path in result.items():
+        print(f"  {name}: {path}")
 def build_relations(config_path: Path):
     """Run relation building with config."""
     config = load_config(config_path)
         formatter_class=argparse.RawDescriptionHelpFormatter,
         epilog="""
 Examples:
+  # Normalize source geodata first (recommended before Modal upload)
+  python cli.py normalize-data --config ../config.yaml
   # Build relation tables only
   python cli.py build-relations --config ../config.yaml
     parser.add_argument(
         'command',
+        choices=['normalize-data', 'build-relations', 'generate-samples', 'validate', 'export',
                  'full-pipeline', 'modal-upload', 'modal-generate'],
         help='Command to run'
     )
     # Run the appropriate command
     try:
+        if args.command == 'normalize-data':
+            normalize_data()
+        elif args.command == 'build-relations':
             build_relations(args.config)
         elif args.command == 'generate-samples':
             generate_samples(args.config, args.append)

dataset/scripts/export_training_data.py CHANGED

@@ -124,7 +124,7 @@ You have access to two DuckDB parquet tables. Given a set of candidate entities
      id VARCHAR              -- unique feature id prefixed 'ne_'
      names STRUCT("primary" VARCHAR, ...)
      country VARCHAR
-     subtype VARCHAR         -- e.g. 'ocean', 'sea', 'bay', 'Terrain area', 'Island group'
      class VARCHAR
      region VARCHAR
      admin_level INTEGER

      id VARCHAR              -- unique feature id prefixed 'ne_'
      names STRUCT("primary" VARCHAR, ...)
      country VARCHAR
+     subtype VARCHAR         -- e.g. 'ocean', 'sea', 'bay', 'Range/mtn', 'Island group'
      class VARCHAR
      region VARCHAR
      admin_level INTEGER

dataset/scripts/generate_samples.py CHANGED

@@ -44,7 +44,7 @@ def _for_execution(sql: str) -> str:
     return (
         sql
         .replace("read_parquet('divisions_area')", f"read_parquet('{DIVISIONS_AREA_PATH}')")
-        .replace("read_parquet('natural_earth')",  f"read_parquet('{NATURAL_EARTH_PATH}')")
     )
 # Configurable parameters (can be overridden by CLI)
@@ -60,6 +60,33 @@ SQLTemplate = sql_templates.SQLTemplate
 get_templates_by_family = sql_templates.get_templates_by_family
 class Candidate(BaseModel):
     """Candidate entity for grounding."""
     candidate_id: str
@@ -121,6 +148,8 @@ def sample_adjacency_anchor(
         'anchor_name': row['anchor_name'],
         'anchor_subtype': row['anchor_subtype'],
         'anchor_country': row.get('anchor_country'),  # May not exist in all tables
         'target_subtype': row.get('target_subtype')
     }
@@ -142,16 +171,23 @@ def sample_intersection_anchor(intersection_df: pd.DataFrame) -> Optional[Dict[s
 def sample_containment_anchor(containment_df: pd.DataFrame) -> Optional[Dict[str, Any]]:
-    """Sample a random containment pair."""
     if containment_df.empty:
         return None
     row = containment_df.sample(n=1).iloc[0]
     return {
         'container_id': row['container_id'],
         'container_name': row['container_name'],
         'container_subtype': row['container_subtype'],
-        'contained_subtype': row['contained_subtype']
     }
@@ -186,12 +222,24 @@ def sample_disambiguation_anchor(
     }
-def sample_cross_source_anchor(cross_source_df: pd.DataFrame) -> Optional[Dict[str, Any]]:
-    """Sample a random cross-source relation."""
     if cross_source_df.empty:
         return None
-    row = cross_source_df.sample(n=1).iloc[0]
     return {
         'division_id': row['division_id'],
         'division_name': row['division_name'],
@@ -242,16 +290,16 @@ def build_candidate_list(
     difficulty: str = "medium"
 ) -> List[Candidate]:
     """Build candidate list with true anchor + distractors."""
     # Helper to convert pandas NA to None
     def safe_get(row, key, default=None):
         val = row.get(key, default)
         return None if pd.isna(val) else val
     # Get the true anchor
     if anchor_source == "divisions_area":
         query = """
-        SELECT
             id,
             names."primary" AS name,
             subtype,
@@ -264,7 +312,7 @@ def build_candidate_list(
         anchor_row = con.execute(query, [DIVISIONS_AREA_PATH, anchor_id]).fetchdf().iloc[0]
     else:
         query = """
-        SELECT
             id,
             names."primary" AS name,
             subtype
@@ -272,8 +320,7 @@ def build_candidate_list(
         WHERE id = ?
         """
         anchor_row = con.execute(query, [NATURAL_EARTH_PATH, anchor_id]).fetchdf().iloc[0]
-    # Build true candidate
     true_candidate = Candidate(
         candidate_id="c1",
         source=anchor_source,
@@ -283,25 +330,31 @@ def build_candidate_list(
         country=safe_get(anchor_row, 'country'),
         region=safe_get(anchor_row, 'region'),
         admin_level=safe_get(anchor_row, 'admin_level'),
-        similarity=1.0
     )
-    # Build distractors based on difficulty
     distractors = build_distractors(
-        con,
-        anchor_name,
         anchor_source,
         anchor_id,
         num_candidates - 1,
-        difficulty
     )
-    # Order: true anchor first, then same-source distractors, then cross-source
-    # distractors. This mirrors inference order (anchor at top by similarity,
-    # same source grouped before the other source).
-    candidates = [true_candidate] + distractors
-    # Reassign candidate IDs in order
     for i, cand in enumerate(candidates, 1):
         cand.candidate_id = f"c{i}"
@@ -335,21 +388,39 @@ def build_distractors(
     def _query_source(path: str, src_name: str, n: int, excl_id: str) -> List[Candidate]:
         query = """
         SELECT
             id,
-            names."primary" AS name,
             subtype,
             country,
             region,
             admin_level,
-            jaro_winkler_similarity(lower(names."primary"), lower(?)) AS similarity
-        FROM read_parquet(?)
-        WHERE id != ?
-          AND names."primary" IS NOT NULL
         ORDER BY similarity DESC
         LIMIT ?
         """
-        df = con.execute(query, [anchor_name, path, excl_id, n]).fetchdf()
         results = []
         for _, row in df.iterrows():
             results.append(Candidate(
@@ -515,13 +586,23 @@ WHERE b.id != '{anchor['container_id']}'
 def sample_random_entity(
     con: duckdb.DuckDBPyConnection,
     inventory_df: pd.DataFrame,
-    source: str
 ) -> Optional[Dict[str, Any]]:
-    """Sample a random entity from inventory."""
     if inventory_df.empty:
         return None
-    row = inventory_df.sample(n=1).iloc[0]
     return {
         'id': row['id'],
         'name': row['name'],
@@ -545,7 +626,12 @@ def generate_template_based_sample(
         if template.anchor_source == "divisions_area":
             anchor = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
         else:
-            anchor = sample_random_entity(con, tables['natural_earth_inventory'], 'natural_earth')
         if not anchor:
             return None
@@ -636,70 +722,156 @@ def generate_template_based_sample(
         anchor = {"id": pair["contained_id"], "name": pair["contained_name"]}
     elif template.family == "adjacency":
-        # If the template pins a target_subtype (e.g. adj_02='region',
-        # adj_06='county'), honour it so the sampled pair is guaranteed to
-        # match the question phrasing ("neighbouring counties of X").
-        anchor = sample_adjacency_anchor(
-            tables['adjacency_pairs'],
-            target_subtype=template.target_subtype,
-        )
         if not anchor:
             return None
         sql = template.sql_template.format(
             anchor_id=anchor['anchor_id'],
-            target_subtype=anchor['target_subtype']
         )
         candidates = build_candidate_list(
             con, anchor['anchor_id'], anchor['anchor_name'], 'divisions_area',
             num_candidates=10, difficulty="medium"
         )
         question = random.choice(template.question_hints).format(
             anchor_name=anchor['anchor_name'],
-            target_subtype=anchor['target_subtype']
         )
     elif template.family == "containment":
-        anchor = sample_containment_anchor(tables['containment_pairs'])
-        if not anchor:
-            return None
-        sql = template.sql_template.format(
-            anchor_id=anchor['container_id'],
-            target_subtype=anchor['contained_subtype']
-        )
-        candidates = build_candidate_list(
-            con, anchor['container_id'], anchor['container_name'], 'divisions_area',
-            num_candidates=10, difficulty="medium"
-        )
-        question = random.choice(template.question_hints).format(
-            anchor_name=anchor['container_name'],
-            target_subtype=anchor['contained_subtype']
-        )
     elif template.family == "intersection":
         if template.anchor_source == "natural_earth":
-            anchor = sample_cross_source_anchor(tables['cross_source_relations'])
             if not anchor:
                 return None
             sql = template.sql_template.format(
                         anchor_id=anchor['natural_id'],
-                target_subtype='country'
             )
             candidates = build_candidate_list(
                 con, anchor['natural_id'], anchor['natural_name'], 'natural_earth',
                 num_candidates=10, difficulty="medium"
             )
             question = random.choice(template.question_hints).format(
                 anchor_name=anchor['natural_name'],
-                target_subtype='country'
             )
         else:
             # Same-source intersection
@@ -760,14 +932,19 @@ def generate_template_based_sample(
             # country IN clause — 2 or 3 anchors, each contributes its country code
             num_a = 3 if template.template_id == "contain_multi_02" else 2
             anchors = [
-                sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
                 for _ in range(num_a)
             ]
             if any(a is None for a in anchors):
                 return None
             countries = [a.get('country') or 'US' for a in anchors]
-            target_subtype = random.choice(['region', 'locality'])
             per_anchor = 3 if num_a == 3 else 4
             fmt_kwargs = dict(
@@ -850,7 +1027,12 @@ def generate_template_based_sample(
         if template.num_anchors == 1:
             if template.anchor_source == "natural_earth":
-                anchor = sample_random_entity(con, tables['natural_earth_inventory'], 'natural_earth')
             else:
                 anchor = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
             if not anchor:
@@ -944,20 +1126,22 @@ def generate_template_based_sample(
             # Mixed-source clip: division intersected with a natural_earth feature.
             # Use cross_source_relations so the pair is guaranteed to intersect —
             # random sampling almost never produces an intersecting pair.
-            cs_df = tables.get('cross_source_relations', pd.DataFrame())
-            if cs_df.empty:
                 return None
-            row = cs_df.sample(n=1).iloc[0]
             clip_feature = {
-                'id':   row['natural_id'],
-                'name': row['natural_name'],
                 'source': 'natural_earth',
             }
             # Override the division anchor with the paired division so the
             # ST_Intersects check in the SQL is guaranteed to pass.
             anchor = {
-                'id':   row['division_id'],
-                'name': row['division_name'],
                 'source': 'divisions_area',
             }
@@ -986,8 +1170,14 @@ def generate_template_based_sample(
         target_subtype = random.choice(['locality', 'region'])
         if template.template_id in ['agg_03', 'agg_04']:
-            # Country-level aggregation: SQL uses country code, not anchor id.
-            anchor = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
             if not anchor:
                 return None
@@ -1035,35 +1225,33 @@ def generate_template_based_sample(
     elif template.family == "chained":
         # Use pre-filtered coastal/landlocked containment pairs so the SQL
         # verification step doesn't constantly return empty results.
-        if template.template_id == "chained_01":
             table_key = 'coastal_containment_pairs'
-        elif template.template_id == "chained_02":
             table_key = 'landlocked_containment_pairs'
         else:
             table_key = 'containment_pairs'
-        # chained_10/11 need a country-level anchor ("coastal states of
-        # India") and region-level targets, so filter the containment pairs
-        # to (container=country, contained=region) before sampling.
-        _chained_subtype_filter = {
-            "chained_10": ("country", "region"),
-            "chained_11": ("country", "region"),
-        }
         df = tables.get(table_key, tables['containment_pairs'])
-        filt = _chained_subtype_filter.get(template.template_id)
-        if filt:
-            df = df[
-                (df['container_subtype'] == filt[0])
-                & (df['contained_subtype'] == filt[1])
-            ]
         anchor = sample_containment_anchor(df)
         if not anchor:
             return None
-        # Prefer the template-pinned target_subtype when set (e.g. chained_10
-        # always wants 'region') so the SQL filter and question phrasing stay
-        # in sync regardless of what the sampled pair happens to contain.
         target_subtype = template.target_subtype or anchor.get('contained_subtype', 'locality')
         sql = template.sql_template.format(
@@ -1117,18 +1305,20 @@ def generate_template_based_sample(
             # Use cross_source_relations so the pair is guaranteed to intersect
             # (ST_Difference on non-intersecting geometries is always equal to
             # the original geometry — a trivial and uninformative sample).
-            cs_df = tables.get('cross_source_relations', pd.DataFrame())
-            if cs_df.empty:
                 return None
-            row = cs_df.sample(n=1).iloc[0]
             anchor = {
-                'id':   row['division_id'],
-                'name': row['division_name'],
                 'source': 'divisions_area',
             }
             clip_feature = {
-                'id':   row['natural_id'],
-                'name': row['natural_name'],
                 'source': 'natural_earth',
             }
@@ -1153,20 +1343,19 @@ def generate_template_based_sample(
             candidates = _merge_candidate_lists(div_cands, ne_cands, max_total=10)
         else:
-            # Two divisions_area anchors — use containment pairs so the
-            # smaller (contained) is guaranteed to intersect the larger.
             pair = sample_containment_anchor(tables['containment_pairs'])
             if not pair:
                 return None
             anchor1 = {'id': pair['container_id'], 'name': pair['container_name']}
-            anchor2_row = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
-            if not anchor2_row:
-                return None
-            anchor2 = anchor2_row
             sql = template.sql_template.format(
-                        anchor_id_1=anchor1['id'],
                 anchor_id_2=anchor2['id'],
             )
@@ -1191,19 +1380,8 @@ def generate_template_based_sample(
         if not pair:
             return None
-        # The adjacency table only records one direction; sample a second
-        # anchor that is known to be adjacent to the first.
         anchor1 = {'id': pair['anchor_id'], 'name': pair['anchor_name']}
-        # Find a random neighbour of anchor1 from adjacency pairs
-        neighbours = tables['adjacency_pairs']
-        neighbours = neighbours[neighbours['anchor_id'] == anchor1['id']]
-        if neighbours.empty:
-            return None
-        nb_row = neighbours.sample(n=1).iloc[0]
-        anchor2 = {'id': nb_row.get('target_id', nb_row['anchor_id']), 'name': nb_row.get('target_name', nb_row['anchor_name'])}
-        if anchor1['id'] == anchor2['id']:
-            return None
         buffer_val = random.choice([5, 10, 25, 50])
@@ -1230,12 +1408,17 @@ def generate_template_based_sample(
         )
     elif template.family == "window_function":
-        anchor = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
         if not anchor:
             return None
         country = anchor.get('country') or 'US'
-        target_subtype = random.choice(['locality', 'neighborhood'])
         sql = template.sql_template.format(
             country=country,
@@ -1253,12 +1436,17 @@ def generate_template_based_sample(
         )
     elif template.family == "attribute_filter":
-        anchor = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
         if not anchor:
             return None
         country = anchor.get('country') or 'US'
-        target_subtype = template.target_subtype or random.choice(['dependency', 'region', 'locality'])
         sql = template.sql_template.format(
             country=country,

     return (
         sql
         .replace("read_parquet('divisions_area')", f"read_parquet('{DIVISIONS_AREA_PATH}')")
+        .replace("read_parquet('natural_earth')", f"read_parquet('{NATURAL_EARTH_PATH}')")
     )
 # Configurable parameters (can be overridden by CLI)
 get_templates_by_family = sql_templates.get_templates_by_family
+_NE_NAMED_LOOKUP_SUBTYPES = {
+    'sea', 'ocean', 'Lake', 'River', 'Basin', 'gulf', 'bay',
+    'Island group', 'Peninsula', 'strait', 'Range/mtn', 'Depression',
+}
+_NE_TEMPLATE_SUBTYPES = {
+    'lookup_02': {'sea', 'ocean', 'Lake', 'River', 'Basin', 'gulf', 'bay', 'Island group', 'Peninsula', 'strait', 'Range/mtn', 'Depression'},
+    'adj_03': {'sea', 'ocean'},
+    'adj_04': {'River', 'Lake', 'Basin'},
+    'adj_05': {'Range/mtn', 'Peninsula', 'Depression'},
+    'contain_03': {'sea', 'ocean', 'gulf', 'bay', 'Basin', 'Island group', 'Peninsula', 'Range/mtn', 'Depression'},
+    'contain_04': {'sea', 'ocean', 'gulf', 'bay', 'strait'},
+    'intersect_02': {'River', 'Lake', 'Basin', 'gulf', 'bay', 'strait', 'Range/mtn', 'Peninsula', 'Depression'},
+    'intersect_03': {'River', 'Lake', 'Basin', 'gulf', 'bay', 'strait', 'Range/mtn', 'Peninsula', 'Depression'},
+    'buffer_03': {'sea', 'ocean', 'Lake', 'River', 'Basin', 'gulf', 'bay', 'Island group', 'Peninsula', 'strait', 'Range/mtn', 'Depression'},
+    'buffer_04': {'sea', 'ocean', 'Lake', 'River', 'Basin', 'gulf', 'bay', 'Island group', 'Peninsula', 'strait', 'Range/mtn', 'Depression'},
+    'buffer_05': {'sea', 'ocean', 'Lake', 'River', 'Basin', 'gulf', 'bay', 'Island group', 'Peninsula', 'strait', 'Range/mtn', 'Depression'},
+    'chained_03': {'Island group', 'Peninsula', 'Range/mtn', 'Depression'},
+    'chained_04': {'River', 'Lake', 'Basin'},
+    'chained_05': {'Range/mtn', 'Depression'},
+    'chained_08': {'River', 'Lake', 'Basin'},
+    'chained_09': {'Range/mtn', 'Depression'},
+    'partial_05': {'sea', 'ocean', 'Lake', 'River', 'Basin', 'gulf', 'bay', 'Island group', 'Peninsula', 'strait', 'Range/mtn', 'Depression'},
+    'diff_02': {'sea', 'ocean', 'Lake', 'River', 'Basin', 'gulf', 'bay', 'Island group', 'Peninsula', 'strait', 'Range/mtn', 'Depression'},
+}
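
`_NE_TEMPLATE_SUBTYPES` repeats several identical sets, and the subtype strings are case-sensitive (`'Lake'` but `'sea'`), so a typo in one copy is easy to miss. One possible way to keep such a table compact is to compose it from named groups. This is only a sketch: the group names (`SEAS`, `HYDRO`, `TERRAIN`, `ALL_NAMED`) and the helper `subtypes_for` are hypothetical, and only a few template ids are shown.

```python
from typing import Dict, FrozenSet

# Hypothetical building blocks; the strings are copied from the diff above.
SEAS: FrozenSet[str] = frozenset({"sea", "ocean"})
HYDRO: FrozenSet[str] = frozenset({"River", "Lake", "Basin"})
TERRAIN: FrozenSet[str] = frozenset({"Range/mtn", "Peninsula", "Depression"})
ALL_NAMED: FrozenSet[str] = (
    SEAS | HYDRO | TERRAIN | frozenset({"gulf", "bay", "Island group", "strait"})
)

# Subset of the per-template table, expressed through the groups.
TEMPLATE_SUBTYPES: Dict[str, FrozenSet[str]] = {
    "lookup_02": ALL_NAMED,
    "adj_03": SEAS,
    "adj_04": HYDRO,
    "adj_05": TERRAIN,
}


def subtypes_for(template_id: str) -> FrozenSet[str]:
    """Return the allowed subtypes, falling back to the full named set."""
    return TEMPLATE_SUBTYPES.get(template_id, ALL_NAMED)
```

The fallback mirrors the `.get(template.template_id, _NE_NAMED_LOOKUP_SUBTYPES)` call used when sampling a Natural Earth anchor.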
 class Candidate(BaseModel):
     """Candidate entity for grounding."""
     candidate_id: str
         'anchor_name': row['anchor_name'],
         'anchor_subtype': row['anchor_subtype'],
         'anchor_country': row.get('anchor_country'),  # May not exist in all tables
+        'target_id': row.get('target_id'),
+        'target_name': row.get('target_name'),
         'target_subtype': row.get('target_subtype')
     }
 def sample_containment_anchor(containment_df: pd.DataFrame) -> Optional[Dict[str, Any]]:
+    """Sample a random containment pair.
+    Returns both ends of the pair so callers that need the contained entity
+    (e.g. difference templates that clip container by contained) can use it
+    directly without a second random draw.
+    """
     if containment_df.empty:
         return None
     row = containment_df.sample(n=1).iloc[0]
     return {
         'container_id': row['container_id'],
         'container_name': row['container_name'],
         'container_subtype': row['container_subtype'],
+        'contained_id': row['contained_id'],
+        'contained_name': row['contained_name'],
+        'contained_subtype': row['contained_subtype'],
     }
     }
+def sample_cross_source_anchor(
+    cross_source_df: pd.DataFrame,
+    natural_subtypes: Optional[set[str]] = None,
+    relation_types: Optional[set[str]] = None,
+) -> Optional[Dict[str, Any]]:
+    """Sample a random cross-source relation with optional subtype filters."""
     if cross_source_df.empty:
         return None
+    df = cross_source_df
+    if natural_subtypes is not None:
+        df = df[df['natural_subtype'].isin(natural_subtypes)]
+    if relation_types is not None:
+        df = df[df['relation_type'].isin(relation_types)]
+    if df.empty:
+        return None
+    row = df.sample(n=1).iloc[0]
     return {
         'division_id': row['division_id'],
         'division_name': row['division_name'],
     difficulty: str = "medium"
 ) -> List[Candidate]:
     """Build candidate list with true anchor + distractors."""
     # Helper to convert pandas NA to None
     def safe_get(row, key, default=None):
         val = row.get(key, default)
         return None if pd.isna(val) else val
     # Get the true anchor
     if anchor_source == "divisions_area":
         query = """
+        SELECT
             id,
             names."primary" AS name,
             subtype,
         anchor_row = con.execute(query, [DIVISIONS_AREA_PATH, anchor_id]).fetchdf().iloc[0]
     else:
         query = """
+        SELECT
             id,
             names."primary" AS name,
             subtype
         WHERE id = ?
         """
         anchor_row = con.execute(query, [NATURAL_EARTH_PATH, anchor_id]).fetchdf().iloc[0]
     true_candidate = Candidate(
         candidate_id="c1",
         source=anchor_source,
         country=safe_get(anchor_row, 'country'),
         region=safe_get(anchor_row, 'region'),
         admin_level=safe_get(anchor_row, 'admin_level'),
+        similarity=1.0,
     )
     distractors = build_distractors(
+        con,
+        anchor_name,
         anchor_source,
         anchor_id,
         num_candidates - 1,
+        difficulty,
     )
+    # Deduplicate by underlying entity id while preserving order.
+    # Some parquet sources contain repeated rows for the same feature id,
+    # which can otherwise leak duplicate candidates into the dataset.
+    candidates: List[Candidate] = []
+    seen_ids: set[str] = set()
+    for cand in [true_candidate] + distractors:
+        if cand.id in seen_ids:
+            continue
+        candidates.append(cand)
+        seen_ids.add(cand.id)
+        if len(candidates) >= num_candidates:
+            break
     for i, cand in enumerate(candidates, 1):
         cand.candidate_id = f"c{i}"
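
The loop above keeps the first occurrence of each entity id, caps the list at `num_candidates`, and then renumbers `candidate_id` in order. A minimal standalone sketch of that pattern, assuming a simplified `Cand` dataclass in place of the pydantic `Candidate` model (the reduced field set and the function name `dedupe_and_number` are for illustration only):

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Cand:
    id: str                 # underlying entity id
    name: str
    candidate_id: str = ""  # positional label, assigned after dedup


def dedupe_and_number(cands: List[Cand], limit: int) -> List[Cand]:
    """Drop repeated entity ids (first wins), cap at `limit`, then label c1..cN."""
    kept: List[Cand] = []
    seen: Set[str] = set()
    for cand in cands:
        if cand.id in seen:
            continue
        kept.append(cand)
        seen.add(cand.id)
        if len(kept) >= limit:
            break
    for i, cand in enumerate(kept, 1):
        cand.candidate_id = f"c{i}"
    return kept
```

Because the true anchor is placed first in the input list, it always survives deduplication and ends up labelled `c1`.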
     def _query_source(path: str, src_name: str, n: int, excl_id: str) -> List[Candidate]:
         query = """
+        WITH ranked AS (
+            SELECT
+                id,
+                names."primary" AS name,
+                subtype,
+                country,
+                region,
+                admin_level,
+                jaro_winkler_similarity(lower(names."primary"), lower(?)) AS similarity,
+                ROW_NUMBER() OVER (
+                    PARTITION BY id
+                    ORDER BY jaro_winkler_similarity(lower(names."primary"), lower(?)) DESC
+                ) AS rn
+            FROM read_parquet(?)
+            WHERE id != ?
+              AND names."primary" IS NOT NULL
+              AND trim(names."primary") != ''
+              AND geometry IS NOT NULL
+        )
         SELECT
             id,
+            name,
             subtype,
             country,
             region,
             admin_level,
+            similarity
+        FROM ranked
+        WHERE rn = 1
         ORDER BY similarity DESC
         LIMIT ?
         """
+        df = con.execute(query, [anchor_name, anchor_name, path, excl_id, n]).fetchdf()
         results = []
         for _, row in df.iterrows():
             results.append(Candidate(
 def sample_random_entity(
     con: duckdb.DuckDBPyConnection,
     inventory_df: pd.DataFrame,
+    source: str,
+    subtypes: Optional[set[str]] = None,
+    countries: Optional[set[str]] = None,
 ) -> Optional[Dict[str, Any]]:
+    """Sample a random entity from inventory with optional filters."""
     if inventory_df.empty:
         return None
+    df = inventory_df
+    if subtypes is not None:
+        df = df[df['subtype'].isin(subtypes)]
+    if countries is not None and 'country' in df.columns:
+        df = df[df['country'].isin(countries)]
+    if df.empty:
+        return None
+    row = df.sample(n=1).iloc[0]
     return {
         'id': row['id'],
         'name': row['name'],
         if template.anchor_source == "divisions_area":
             anchor = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
         else:
+            anchor = sample_random_entity(
+                con,
+                tables['natural_earth_inventory'],
+                'natural_earth',
+                subtypes=_NE_TEMPLATE_SUBTYPES.get(template.template_id, _NE_NAMED_LOOKUP_SUBTYPES),
+            )
         if not anchor:
             return None
         anchor = {"id": pair["contained_id"], "name": pair["contained_name"]}
     elif template.family == "adjacency":
+        # adj_03/04/05 target natural_earth features (seas, rivers, ranges).
+        # Their SQL hardcodes NE subtypes and does not use {target_subtype}.
+        # Sample from cross_source_relations so the anchor is a division
+        # that actually intersects the right NE features.
+        _NE_ADJ_SUBTYPES = {
+            "adj_03": ("ocean", "sea"),
+            "adj_04": ("River", "Lake", "Basin"),
+            "adj_05": ("Range/mtn", "Peninsula", "Depression"),
+        }
+        if template.template_id in _NE_ADJ_SUBTYPES:
+            cs_df = tables.get('cross_source_relations', pd.DataFrame())
+            if cs_df.empty:
+                return None
+            ne_types = _NE_ADJ_SUBTYPES[template.template_id]
+            filtered = cs_df[cs_df['natural_subtype'].isin(ne_types)]
+            if filtered.empty:
+                return None
+            row = filtered.sample(n=1).iloc[0]
+            anchor = {
+                'anchor_id': row['division_id'],
+                'anchor_name': row['division_name'],
+                'anchor_subtype': row['division_subtype'],
+                'target_subtype': row['natural_subtype'],
+            }
+        else:
+            # adj_01/02/06: divisions_area self-join adjacency.
+            # Only filter by target_subtype when the SQL uses {target_subtype}.
+            filter_subtype = (
+                template.target_subtype
+                if '{target_subtype}' in template.sql_template
+                else None
+            )
+            anchor = sample_adjacency_anchor(
+                tables['adjacency_pairs'],
+                target_subtype=filter_subtype,
+            )
         if not anchor:
             return None
         sql = template.sql_template.format(
             anchor_id=anchor['anchor_id'],
+            target_subtype=anchor.get('target_subtype', ''),
         )
         candidates = build_candidate_list(
             con, anchor['anchor_id'], anchor['anchor_name'], 'divisions_area',
             num_candidates=10, difficulty="medium"
         )
         question = random.choice(template.question_hints).format(
             anchor_name=anchor['anchor_name'],
+            target_subtype=anchor.get('target_subtype', ''),
         )
     elif template.family == "containment":
+        if template.anchor_source == "natural_earth":
+            # contain_03 / contain_04: NE anchor (sea, desert, etc.).
+            # Use cross_source_relations so the anchor exists in natural_earth
+            # and is guaranteed to intersect divisions_area features.
+            cs_anchor = sample_cross_source_anchor(
+                tables.get('cross_source_relations', pd.DataFrame()),
+                natural_subtypes=_NE_TEMPLATE_SUBTYPES.get(template.template_id),
+            )
+            if not cs_anchor:
+                return None
+            anchor_id = cs_anchor['natural_id']
+            anchor_name = cs_anchor['natural_name']
+            target_subtype = template.target_subtype or 'country'
+            sql = template.sql_template.format(
+                anchor_id=anchor_id,
+                target_subtype=target_subtype,
+            )
+            candidates = build_candidate_list(
+                con, anchor_id, anchor_name, 'natural_earth',
+                num_candidates=10, difficulty="medium"
+            )
+            question = random.choice(template.question_hints).format(
+                anchor_name=anchor_name,
+                target_subtype=target_subtype,
+            )
+            anchor = {'id': anchor_id, 'name': anchor_name}
+        elif template.template_id == "contain_02":
+            # "What country contains X?" - anchor is the CONTAINED entity;
+            # result is the country that ST_Contains it.
+            df = tables['containment_pairs']
+            df = df[df['container_subtype'] == 'country']
+            pair = sample_containment_anchor(df)
+            if not pair:
+                return None
+            sql = template.sql_template.format(
+                anchor_id=pair['contained_id'],
+                target_subtype='country',
+            )
+            candidates = build_candidate_list(
+                con, pair['contained_id'], pair['contained_name'], 'divisions_area',
+                num_candidates=10, difficulty="medium"
+            )
+            question = random.choice(template.question_hints).format(
+                anchor_name=pair['contained_name'],
+                target_subtype='country',
+            )
+            anchor = {'id': pair['contained_id'], 'name': pair['contained_name']}
+        else:
+            # contain_01: standard containment.
+            # Anchor = container, target_subtype = contained entity's subtype.
+            anchor = sample_containment_anchor(tables['containment_pairs'])
+            if not anchor:
+                return None
+            sql = template.sql_template.format(
+                anchor_id=anchor['container_id'],
+                target_subtype=anchor['contained_subtype'],
+            )
+            candidates = build_candidate_list(
+                con, anchor['container_id'], anchor['container_name'], 'divisions_area',
+                num_candidates=10, difficulty="medium"
+            )
+            question = random.choice(template.question_hints).format(
+                anchor_name=anchor['container_name'],
+                target_subtype=anchor['contained_subtype'],
+            )
     elif template.family == "intersection":
         if template.anchor_source == "natural_earth":
+            anchor = sample_cross_source_anchor(
+                tables['cross_source_relations'],
+                natural_subtypes=_NE_TEMPLATE_SUBTYPES.get(template.template_id),
+            )
             if not anchor:
                 return None
+            target_subtype = template.target_subtype or 'country'
             sql = template.sql_template.format(
                         anchor_id=anchor['natural_id'],
+                target_subtype=target_subtype,
             )
             candidates = build_candidate_list(
                 con, anchor['natural_id'], anchor['natural_name'], 'natural_earth',
                 num_candidates=10, difficulty="medium"
             )
             question = random.choice(template.question_hints).format(
                 anchor_name=anchor['natural_name'],
+                target_subtype=target_subtype,
             )
         else:
             # Same-source intersection
             # country IN clause — 2 or 3 anchors, each contributes its country code
             num_a = 3 if template.template_id == "contain_multi_02" else 2
             anchors = [
+                sample_random_entity(
+                    con,
+                    tables['divisions_area_inventory'],
+                    'divisions_area',
+                    subtypes={'country'},
+                )
                 for _ in range(num_a)
             ]
             if any(a is None for a in anchors):
                 return None
             countries = [a.get('country') or 'US' for a in anchors]
+            target_subtype = template.target_subtype or 'region'
             per_anchor = 3 if num_a == 3 else 4
             fmt_kwargs = dict(
         if template.num_anchors == 1:
             if template.anchor_source == "natural_earth":
+                anchor = sample_random_entity(
+                    con,
+                    tables['natural_earth_inventory'],
+                    'natural_earth',
+                    subtypes=_NE_TEMPLATE_SUBTYPES.get(template.template_id, _NE_NAMED_LOOKUP_SUBTYPES),
+                )
             else:
                 anchor = sample_random_entity(con, tables['divisions_area_inventory'], 'divisions_area')
             if not anchor:
             # Mixed-source clip: division intersected with a natural_earth feature.
             # Use cross_source_relations so the pair is guaranteed to intersect —
             # random sampling almost never produces an intersecting pair.
+            cs_anchor = sample_cross_source_anchor(
+                tables.get('cross_source_relations', pd.DataFrame()),
+                natural_subtypes=_NE_TEMPLATE_SUBTYPES.get(template.template_id),
+            )
+            if not cs_anchor:
                 return None
             clip_feature = {
+                'id':   cs_anchor['natural_id'],
+                'name': cs_anchor['natural_name'],
                 'source': 'natural_earth',
             }
             # Override the division anchor with the paired division so the
             # ST_Intersects check in the SQL is guaranteed to pass.
             anchor = {
+                'id':   cs_anchor['division_id'],
+                'name': cs_anchor['division_name'],
                 'source': 'divisions_area',
             }
         target_subtype = random.choice(['locality', 'region'])
         if template.template_id in ['agg_03', 'agg_04']:
+            # Country-level aggregation: SQL uses country code, so the anchor
+            # in the question must also be a country.
+            anchor = sample_random_entity(
+                con,
+                tables['divisions_area_inventory'],
+                'divisions_area',
+                subtypes={'country'},
+            )
             if not anchor:
                 return None
     elif template.family == "chained":
         # Use pre-filtered coastal/landlocked containment pairs so the SQL
         # verification step doesn't constantly return empty results.
+        _COASTAL_CHAINED = {"chained_01", "chained_06", "chained_10"}
+        _LANDLOCKED_CHAINED = {"chained_02", "chained_07", "chained_11"}
+        if template.template_id in _COASTAL_CHAINED:
             table_key = 'coastal_containment_pairs'
+        elif template.template_id in _LANDLOCKED_CHAINED:
             table_key = 'landlocked_containment_pairs'
         else:
             table_key = 'containment_pairs'
         df = tables.get(table_key, tables['containment_pairs'])
+        # When the template pins a target_subtype (e.g. chained_06 wants
+        # counties), only consider pairs whose contained entity already
+        # matches — guarantees the sampled container holds at least one
+        # entity of the right subtype so the SQL filter returns rows.
+        if template.target_subtype:
+            df = df[df['contained_subtype'] == template.target_subtype]
+        # chained_10/11 additionally need a country-level container so
+        # phrasings like "coastal states of India" line up.
+        if template.template_id in {"chained_10", "chained_11"}:
+            df = df[df['container_subtype'] == 'country']
         anchor = sample_containment_anchor(df)
         if not anchor:
             return None
         target_subtype = template.target_subtype or anchor.get('contained_subtype', 'locality')
         sql = template.sql_template.format(
             # Use cross_source_relations so the pair is guaranteed to intersect
             # (ST_Difference on non-intersecting geometries is always equal to
             # the original geometry — a trivial and uninformative sample).
+            cs_anchor = sample_cross_source_anchor(
+                tables.get('cross_source_relations', pd.DataFrame()),
+                natural_subtypes=_NE_TEMPLATE_SUBTYPES.get(template.template_id),
+            )
+            if not cs_anchor:
                 return None
             anchor = {
+                'id':   cs_anchor['division_id'],
+                'name': cs_anchor['division_name'],
                 'source': 'divisions_area',
             }
             clip_feature = {
+                'id':   cs_anchor['natural_id'],
+                'name': cs_anchor['natural_name'],
                 'source': 'natural_earth',
             }
             candidates = _merge_candidate_lists(div_cands, ne_cands, max_total=10)
         else:
+            # Two divisions_area anchors: use both ends of a containment
+            # pair so the contained entity is guaranteed to intersect the
+            # container. ST_Difference(container, contained) yields the
+            # portion of the container outside the contained piece.
             pair = sample_containment_anchor(tables['containment_pairs'])
             if not pair:
                 return None
             anchor1 = {'id': pair['container_id'], 'name': pair['container_name']}
+            anchor2 = {'id': pair['contained_id'], 'name': pair['contained_name']}
             sql = template.sql_template.format(
+                anchor_id_1=anchor1['id'],
                 anchor_id_2=anchor2['id'],
             )
         if not pair:
             return None
         anchor1 = {'id': pair['anchor_id'], 'name': pair['anchor_name']}
+        anchor2 = {'id': pair['target_id'], 'name': pair['target_name']}
         buffer_val = random.choice([5, 10, 25, 50])
         )
     elif template.family == "window_function":
+        anchor = sample_random_entity(
+            con,
+            tables['divisions_area_inventory'],
+            'divisions_area',
+            subtypes={'country'},
+        )
         if not anchor:
             return None
         country = anchor.get('country') or 'US'
+        target_subtype = template.target_subtype or 'locality'
         sql = template.sql_template.format(
             country=country,
         )
     elif template.family == "attribute_filter":
+        anchor = sample_random_entity(
+            con,
+            tables['divisions_area_inventory'],
+            'divisions_area',
+            subtypes={'country'},
+        )
         if not anchor:
             return None
         country = anchor.get('country') or 'US'
+        target_subtype = template.target_subtype or 'region'
         sql = template.sql_template.format(
             country=country,
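
The adjacency branch above passes `target_subtype` to `str.format` whether or not the template contains that placeholder, and it only filters anchors by subtype when the SQL uses it. This works because `str.format` ignores unused keyword arguments. A minimal sketch of both behaviours, assuming made-up template strings and two hypothetical helpers (`fill`, `subtype_filter`) that are not part of `generate_samples.py`:

```python
# Hypothetical templates; the real ones live in sql_templates.py.
WITH_SUBTYPE = (
    "SELECT id FROM t WHERE id != '{anchor_id}' AND subtype = '{target_subtype}'"
)
WITHOUT_SUBTYPE = "SELECT id FROM t WHERE id != '{anchor_id}'"


def fill(sql_template: str, anchor: dict) -> str:
    """Format a template; keyword arguments the template lacks are ignored."""
    return sql_template.format(
        anchor_id=anchor["anchor_id"],
        target_subtype=anchor.get("target_subtype", ""),
    )


def subtype_filter(sql_template: str, target_subtype):
    """Return the subtype to filter on only when the SQL actually uses it."""
    return target_subtype if "{target_subtype}" in sql_template else None


anchor = {"anchor_id": "abc", "target_subtype": "region"}
print(fill(WITH_SUBTYPE, anchor))
print(fill(WITHOUT_SUBTYPE, anchor))
```

One consequence worth keeping in mind: a template that does use `{target_subtype}` would receive an empty string if the sampled anchor carries no such key, so the sampler presumably has to supply it for those templates.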

dataset/scripts/normalize_geodata.py ADDED Viewed

@@ -0,0 +1,87 @@

+"""Normalize source GeoParquet files to a shared CRS-neutral geometry encoding.
+The training pipeline mixes Overture divisions_area and Natural Earth geometry.
+Across environments these sources can advertise different CRS metadata labels
+(`EPSG:4326` vs `OGC:CRS84`), which causes DuckDB spatial joins to fail even
+when coordinates are already compatible lon/lat values.
+This script rewrites both datasets into normalized copies whose geometry column
+is rebuilt from WKB. That preserves coordinates while dropping conflicting CRS
+metadata, so downstream joins behave consistently locally and on Modal.
+Output layout under data/ by default:
+    overture_normalized/divisions_area/part-000.parquet
+    natural_earth_normalized/ne_geography.parquet
+"""
+from pathlib import Path
+import duckdb
+from gazet.config import _DATA_DIR
+def normalize_geodata(output_root: Path | None = None) -> dict[str, str]:
+    """Write normalized copies of both source datasets.
+    Args:
+        output_root: Base directory to write normalized datasets into.
+            Defaults to the project data dir.
+    Returns:
+        Mapping of dataset name to written path/glob.
+    """
+    root = output_root or _DATA_DIR
+    overture_dir = root / "overture_normalized" / "divisions_area"
+    natural_earth_dir = root / "natural_earth_normalized"
+    overture_dir.mkdir(parents=True, exist_ok=True)
+    natural_earth_dir.mkdir(parents=True, exist_ok=True)
+    overture_path = overture_dir / "part-000.parquet"
+    natural_earth_path = natural_earth_dir / "ne_geography.parquet"
+    con = duckdb.connect()
+    con.execute("INSTALL spatial")
+    con.execute("LOAD spatial")
+    # Rebuild geometry from WKB so conflicting CRS metadata is dropped.
+    con.execute(
+        f"""
+        COPY (
+            SELECT * REPLACE (
+                ST_GeomFromWKB(ST_AsWKB(geometry)) AS geometry
+            )
+            FROM read_parquet('{root / 'overture/divisions_area/*.parquet'}')
+            WHERE geometry IS NOT NULL
+        ) TO '{overture_path}' (FORMAT PARQUET)
+        """
+    )
+    con.execute(
+        f"""
+        COPY (
+            SELECT * REPLACE (
+                ST_GeomFromWKB(ST_AsWKB(geometry)) AS geometry
+            )
+            FROM read_parquet('{root / 'natural_earth_geoparquet/ne_geography.parquet'}')
+            WHERE geometry IS NOT NULL
+        ) TO '{natural_earth_path}' (FORMAT PARQUET)
+        """
+    )
+    con.close()
+    return {
+        "divisions_area": str(overture_dir / "*.parquet"),
+        "natural_earth": str(natural_earth_path),
+    }
+def main() -> None:
+    result = normalize_geodata()
+    print("Normalized datasets written:")
+    for name, path in result.items():
+        print(f"  {name}: {path}")
+if __name__ == "__main__":
+    main()
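
The docstring above relies on the fact that a WKB round trip keeps coordinates but carries no CRS label. The sketch below illustrates why for the simplest case: a standard (ISO/OGC) WKB point is just a byte-order flag, a geometry type, and two doubles. It shows the byte layout only and makes no claim about how DuckDB implements `ST_AsWKB` / `ST_GeomFromWKB`; the helper names are hypothetical.

```python
import struct

# Standard WKB point layout (little-endian):
#   1 byte  byte-order flag (1 = little-endian)
#   4 bytes geometry type   (1 = Point)
#   8 bytes x (longitude), 8 bytes y (latitude)
_POINT_FORMAT = "<BIdd"


def point_to_wkb(lon: float, lat: float) -> bytes:
    """Encode a lon/lat pair as a standard WKB point."""
    return struct.pack(_POINT_FORMAT, 1, 1, lon, lat)


def wkb_to_point(buf: bytes) -> tuple[float, float]:
    """Decode a little-endian WKB point back to (lon, lat)."""
    byte_order, geom_type, lon, lat = struct.unpack(_POINT_FORMAT, buf)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("only little-endian WKB points are handled here")
    return lon, lat


wkb = point_to_wkb(77.2, 28.6)
print(len(wkb), wkb_to_point(wkb))
```

Because the encoding has no slot for `EPSG:4326` or `OGC:CRS84`, any such label has to live in file or column metadata, which is what the normalized copies leave behind.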

dataset/scripts/sql_templates.py CHANGED Viewed

@@ -293,7 +293,7 @@ TEMPLATES = [
             "        ST_AsGeoJSON(n.geometry) AS geometry"
             " FROM read_parquet('natural_earth') AS n, a"
             " WHERE n.subtype IN ('ocean', 'sea')"
-            "   AND ST_Touches(a.geometry, n.geometry)"
         ),
         question_hints=[
             "which seas touch {anchor_name}?",
@@ -679,7 +679,7 @@ TEMPLATES = [
         sql_difficulty="hard",
         anchor_source="divisions_area",
         num_anchors=1,
-        target_subtype="country",
         sql_template=(
             "WITH region AS ("
             "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
@@ -723,7 +723,7 @@ TEMPLATES = [
             "   AND ST_Within(b.geometry, region.geometry)"
             "   AND EXISTS ("
             "     SELECT 1 FROM read_parquet('natural_earth') AS n"
-            "     WHERE n.subtype IN ('Terrain area', 'Island group', 'Peninsula')"
             "       AND ST_Intersects(b.geometry, n.geometry)"
             "   )"
         ),
@@ -1301,12 +1301,12 @@ TEMPLATES = [
             "   AND subtype = '{target_subtype}'"
         ),
         question_hints=[
-            "island territories of {anchor_name}",
-            "overseas island {target_subtype}s belonging to {anchor_name}",
-            "which islands are part of {anchor_name}?",
-            "land territories of {anchor_name}",
-            "island possessions of {anchor_name}",
-            "{anchor_name}'s island {target_subtype}s",
         ],
     ),
@@ -1339,20 +1339,20 @@ TEMPLATES = [
         sql_difficulty="medium",
         anchor_source="divisions_area",
         num_anchors=1,
-        target_subtype="locality",
         sql_template=(
             "SELECT id, names.\"primary\" AS name, subtype, country,"
             "       ST_AsGeoJSON(geometry) AS geometry"
             " FROM read_parquet('divisions_area')"
             " WHERE country = '{country}'"
             "   AND subtype = '{target_subtype}'"
-            "   AND is_land = TRUE"
         ),
         question_hints=[
-            "land-based {target_subtype}s of {anchor_name}",
-            "{target_subtype}s on the mainland of {anchor_name}",
-            "all {target_subtype}s on land in {anchor_name}",
-            "non-island {target_subtype}s of {anchor_name}",
         ],
     ),
@@ -1403,7 +1403,7 @@ TEMPLATES = [
             " SELECT n.id, n.names.\"primary\" AS name, n.subtype,"
             "        ST_AsGeoJSON(n.geometry) AS geometry"
             " FROM read_parquet('natural_earth') AS n, a"
-            " WHERE n.subtype IN ('Range/Mts', 'Terrain area', 'Peninsula', 'Depression')"
             "   AND ST_Intersects(a.geometry, n.geometry)"
         ),
         question_hints=[
@@ -1533,7 +1533,7 @@ TEMPLATES = [
             "   AND ST_Within(b.geometry, region.geometry)"
             "   AND EXISTS ("
             "     SELECT 1 FROM read_parquet('natural_earth') AS n"
-            "     WHERE n.subtype IN ('Range/Mts', 'Depression')"
             "       AND ST_Intersects(b.geometry, n.geometry)"
             "   )"
         ),
@@ -1666,7 +1666,7 @@ TEMPLATES = [
             "   AND ST_Within(b.geometry, region.geometry)"
             "   AND EXISTS ("
             "     SELECT 1 FROM read_parquet('natural_earth') AS n"
-            "     WHERE n.subtype IN ('Range/Mts', 'Depression')"
             "       AND ST_Intersects(b.geometry, n.geometry)"
             "   )"
         ),

             "        ST_AsGeoJSON(n.geometry) AS geometry"
             " FROM read_parquet('natural_earth') AS n, a"
             " WHERE n.subtype IN ('ocean', 'sea')"
+            "   AND ST_Intersects(a.geometry, n.geometry)"
         ),
         question_hints=[
             "which seas touch {anchor_name}?",
         sql_difficulty="hard",
         anchor_source="divisions_area",
         num_anchors=1,
+        target_subtype="locality",
         sql_template=(
             "WITH region AS ("
             "  SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
             "   AND ST_Within(b.geometry, region.geometry)"
             "   AND EXISTS ("
             "     SELECT 1 FROM read_parquet('natural_earth') AS n"
+            "     WHERE n.subtype IN ('Range/mtn', 'Island group', 'Peninsula', 'Depression')"
             "       AND ST_Intersects(b.geometry, n.geometry)"
             "   )"
         ),
             "   AND subtype = '{target_subtype}'"
         ),
         question_hints=[
+            "land {target_subtype}s of {anchor_name}",
+            "dependencies of {anchor_name} that are on land",
+            "which land dependencies belong to {anchor_name}?",
+            "{anchor_name}'s land {target_subtype}s",
+            "dependencies of {anchor_name} with land area",
+            "show the land dependencies of {anchor_name}",
         ],
     ),
         sql_difficulty="medium",
         anchor_source="divisions_area",
         num_anchors=1,
+        target_subtype="region",
         sql_template=(
             "SELECT id, names.\"primary\" AS name, subtype, country,"
             "       ST_AsGeoJSON(geometry) AS geometry"
             " FROM read_parquet('divisions_area')"
             " WHERE country = '{country}'"
             "   AND subtype = '{target_subtype}'"
+            "   AND is_land = FALSE"
         ),
         question_hints=[
+            "offshore {target_subtype}s of {anchor_name}",
+            "{target_subtype}s of {anchor_name} that are not on land",
+            "water-associated {target_subtype}s of {anchor_name}",
+            "marine or offshore {target_subtype}s of {anchor_name}",
         ],
     ),
             " SELECT n.id, n.names.\"primary\" AS name, n.subtype,"
             "        ST_AsGeoJSON(n.geometry) AS geometry"
             " FROM read_parquet('natural_earth') AS n, a"
+            " WHERE n.subtype IN ('Range/mtn', 'Peninsula', 'Depression')"
             "   AND ST_Intersects(a.geometry, n.geometry)"
         ),
         question_hints=[
             "   AND ST_Within(b.geometry, region.geometry)"
             "   AND EXISTS ("
             "     SELECT 1 FROM read_parquet('natural_earth') AS n"
+            "     WHERE n.subtype IN ('Range/mtn', 'Depression')"
             "       AND ST_Intersects(b.geometry, n.geometry)"
             "   )"
         ),
             "   AND ST_Within(b.geometry, region.geometry)"
             "   AND EXISTS ("
             "     SELECT 1 FROM read_parquet('natural_earth') AS n"
+            "     WHERE n.subtype IN ('Range/mtn', 'Depression')"
             "       AND ST_Intersects(b.geometry, n.geometry)"
             "   )"
         ),
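
Several hunks above change `'Range/Mts'` to `'Range/mtn'`. A mislabelled subtype in an `IN` list does not raise an error; it simply matches nothing, which is how a template can quietly yield zero samples. The sketch below reproduces that with the standard-library `sqlite3` module as a stand-in for DuckDB (an assumption made only to keep the example dependency-free); the table rows and the `count_matches` helper are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE natural_earth (name TEXT, subtype TEXT)")
con.executemany(
    "INSERT INTO natural_earth VALUES (?, ?)",
    [("Example Range", "Range/mtn"), ("Example Basin", "Depression")],
)


def count_matches(labels) -> int:
    """Count rows whose subtype is exactly one of the given labels."""
    placeholders = ", ".join("?" for _ in labels)
    query = (
        f"SELECT COUNT(*) FROM natural_earth WHERE subtype IN ({placeholders})"
    )
    return con.execute(query, list(labels)).fetchone()[0]


print(count_matches(["Range/Mts", "Depression"]))  # old label: misses the range
print(count_matches(["Range/mtn", "Depression"]))  # corrected label
```

Checking the distinct `subtype` values in the source parquet before hardcoding them in a template is a cheap way to catch this class of mismatch.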

dataset/scripts/validate_dataset.py CHANGED Viewed

@@ -44,8 +44,8 @@ def _resolve_paths(sql: str) -> str:
         "read_parquet('natural_earth')", f"read_parquet('{NATURAL_EARTH_PATH}')"
     )
     # Legacy fixed Docker paths from earlier dataset versions
-    sql = sql.replace("/data/overture/division_area/*.parquet",          DIVISIONS_AREA_PATH)
-    sql = sql.replace("/data/overture/divisions_area/*.parquet",         DIVISIONS_AREA_PATH)
     sql = sql.replace("/data/natural_earth_geoparquet/ne_geography.parquet", NATURAL_EARTH_PATH)
     return sql

         "read_parquet('natural_earth')", f"read_parquet('{NATURAL_EARTH_PATH}')"
     )
     # Legacy fixed Docker paths from earlier dataset versions
+    sql = sql.replace("/data/overture/division_area/*.parquet", DIVISIONS_AREA_PATH)
+    sql = sql.replace("/data/overture/divisions_area/*.parquet", DIVISIONS_AREA_PATH)
     sql = sql.replace("/data/natural_earth_geoparquet/ne_geography.parquet", NATURAL_EARTH_PATH)
     return sql

finetune/README.md CHANGED Viewed

@@ -26,7 +26,7 @@ SQL samples are long (schema + candidates + SQL), places samples are short.
 ```bash
 modal run finetune/check_token_lengths.py
-modal run finetune/check_token_lengths.py --run-dir /mnt/gazet/data/v1
 ```
 This prints per-split statistics (min, max, P95, P99) and recommends a
@@ -66,7 +66,7 @@ overridden, `lora_alpha` is automatically set to `2 * r`.
 ```
 base_model:       unsloth/Qwen3.5-0.8B
-run_dir:          /mnt/gazet/data/v1
 lora_r:           16
 lora_alpha:       32       (2 * r, Unsloth recommendation for Qwen)
 lora_dropout:     0.0
@@ -106,19 +106,19 @@ for local inference with llama-server.
 ```bash
 # Download from Modal volume
-modal volume get gazet checkpoints/qwen35-v1/merged ./finetune/models/merged
 # Convert to GGUF (requires llama.cpp repo)
-uv run \
     --no-project \
     --with transformers \
     --with sentencepiece \
     --with protobuf \
     --with torch \
     python convert_hf_to_gguf.py \
-    ../gazet/finetune/models/qwen-base/merged \
     --outtype q8_0 \
-    --outfile ../gazet/finetune/models/qwen-base/ckpt-001.gguf
 ```
 ---
@@ -129,7 +129,7 @@ uv run \
 ```bash
 llama-server \
-    -m finetune/models/qwen-base/ckpt-001.gguf \
     -ngl 99 \
     --port 9000 \
     --ctx-size 2048
@@ -151,7 +151,7 @@ docker run \
     -v $(pwd)/finetune/models:/models \
     -p 9000:9000 \
     ghcr.io/ggml-org/llama.cpp:server \
-        -m /models/qwen-base/ckpt-001.gguf \
         --port 9000 --host 0.0.0.0 \
         --ctx-size 2048 -t 2 -v
 ```
@@ -211,7 +211,7 @@ All batch CLI args:
 | `--label` | `local-gguf` | Label used in the output filename |
 | `--task` | `sql` | `sql` or `places` |
 | `--split` | `val` | Data split to evaluate (`val`, `test`) |
-| `--run-dir` | `dataset/output/runs/v1` | Directory with `{task}/{split}.jsonl` |
 | `--max-samples` | all | Cap the number of samples |
 | `--output` | `eval-{label}-{task}.json` | Output JSON path |
 | `--workers` | `4` | Concurrent requests; match llama-server `--parallel` |
@@ -250,6 +250,20 @@ GAZET_EVAL_DIR=/path/to/results streamlit run finetune/eval_demo.py
 ```
 Set `GAZET_DATA_DIR` if your parquet data is not in the default `data/` directory.
 ---
@@ -287,5 +301,9 @@ loss is computed only on the completion tokens.
 SQL in the training data uses symbolic path placeholders
 (`read_parquet('divisions_area')`) instead of real file paths. At inference
-time, `src/gazet/sql.py` replaces these with actual runtime paths before
-executing against DuckDB.

 ```bash
 modal run finetune/check_token_lengths.py
+modal run finetune/check_token_lengths.py --run-dir /mnt/gazet/data/smalltest-v1
 ```
 This prints per-split statistics (min, max, P95, P99) and recommends a
 ```
 base_model:       unsloth/Qwen3.5-0.8B
+run_dir:          /mnt/gazet/data/v1   # override to your exported run, e.g. /mnt/gazet/data/smalltest-v1
 lora_r:           16
 lora_alpha:       32       (2 * r, Unsloth recommendation for Qwen)
 lora_dropout:     0.0
 ```bash
 # Download from Modal volume
+modal volume get gazet checkpoints/qwen35-v1/merged ./finetune/models/qwen35-v1-merged
 # Convert to GGUF (requires llama.cpp repo)
+uv run \
     --no-project \
     --with transformers \
     --with sentencepiece \
     --with protobuf \
     --with torch \
     python convert_hf_to_gguf.py \
+    ./finetune/models/qwen35-v1-merged \
     --outtype q8_0 \
+    --outfile ./finetune/models/qwen35-v1-q8_0.gguf
 ```
 ---
 ```bash
 llama-server \
+    -m finetune/models/qwen35-v1-q8_0.gguf \
     -ngl 99 \
     --port 9000 \
     --ctx-size 2048
     -v $(pwd)/finetune/models:/models \
     -p 9000:9000 \
     ghcr.io/ggml-org/llama.cpp:server \
+        -m /models/qwen35-v1-q8_0.gguf \
         --port 9000 --host 0.0.0.0 \
         --ctx-size 2048 -t 2 -v
 ```
 | `--label` | `local-gguf` | Label used in the output filename |
 | `--task` | `sql` | `sql` or `places` |
 | `--split` | `val` | Data split to evaluate (`val`, `test`) |
+| `--run-dir` | `dataset/output/runs/v1` | Directory with `{task}/{split}.jsonl`; override to your exported run, e.g. `dataset/output/runs/smalltest-v1` |
 | `--max-samples` | all | Cap the number of samples |
 | `--output` | `eval-{label}-{task}.json` | Output JSON path |
 | `--workers` | `4` | Concurrent requests; match llama-server `--parallel` |
 ```
 Set `GAZET_DATA_DIR` if your parquet data is not in the default `data/` directory.
+This only affects the visual SQL viewer (`eval_demo.py`), which executes SQL
+against DuckDB; `eval_cli.py` does not read parquet files directly.
+The eval viewer resolves parquet paths through `gazet.config`, which now
+prefers normalized copies automatically when present:
+- `data/overture_normalized/divisions_area/*.parquet`
+- `data/natural_earth_normalized/ne_geography.parquet`
+Disable that fallback only if needed with:
+```bash
+GAZET_USE_NORMALIZED_DATA=0 streamlit run finetune/eval_demo.py
+```
 ---
 SQL in the training data uses symbolic path placeholders
 (`read_parquet('divisions_area')`) instead of real file paths. At inference
+and eval time, `src/gazet/sql.py` / `finetune/eval_demo.py` replace these with
+actual runtime paths before executing against DuckDB. When normalized parquet
+copies are present, `gazet.config` prefers:
+- `data/overture_normalized/divisions_area/*.parquet`
+- `data/natural_earth_normalized/ne_geography.parquet`
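
The README text above says `gazet.config` prefers normalized copies when they exist and that `GAZET_USE_NORMALIZED_DATA=0` disables the fallback. `src/gazet/config.py` is not shown in this part of the diff, so the following is only a sketch of how such a resolution could work; `pick_divisions_path` is a hypothetical name, and the default of treating an unset variable as enabled is an assumption.

```python
import os
import tempfile
from pathlib import Path


def pick_divisions_path(data_dir: Path, env=os.environ) -> str:
    """Prefer the normalized divisions_area glob when files exist there."""
    normalized_dir = data_dir / "overture_normalized" / "divisions_area"
    raw_dir = data_dir / "overture" / "divisions_area"
    # Assumption: anything other than "0" leaves the preference enabled.
    use_normalized = env.get("GAZET_USE_NORMALIZED_DATA", "1") != "0"
    has_normalized = normalized_dir.is_dir() and any(
        normalized_dir.glob("*.parquet")
    )
    chosen = normalized_dir if use_normalized and has_normalized else raw_dir
    return str(chosen / "*.parquet")


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    print(pick_divisions_path(data_dir, env={}))  # raw path: nothing normalized yet
    norm = data_dir / "overture_normalized" / "divisions_area"
    norm.mkdir(parents=True)
    (norm / "part-000.parquet").touch()
    print(pick_divisions_path(data_dir, env={}))  # normalized path
```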

finetune/eval_demo.py CHANGED Viewed

@@ -16,6 +16,8 @@ import pydeck as pdk
 import sqlparse
 import streamlit as st
 PROJECT_ROOT = pathlib.Path(__file__).resolve().parent.parent
 DATA_DIR = pathlib.Path(
     os.environ.get("GAZET_DATA_DIR", str(PROJECT_ROOT / "data"))
@@ -31,13 +33,17 @@ def load_eval_results(path):
 def rewrite_data_paths(sql):
-    """Replace symbolic and legacy paths with actual local data paths."""
-    # Legacy fixed Docker paths must be replaced first to avoid double-expansion
     sql = sql.replace("/data/", f"{DATA_DIR}/")
-    div_path = str(DATA_DIR / "overture" / "divisions_area" / "*.parquet")
-    ne_path = str(DATA_DIR / "natural_earth_geoparquet" / "ne_geography.parquet")
-    sql = sql.replace("read_parquet('divisions_area')", f"read_parquet('{div_path}')")
-    sql = sql.replace("read_parquet('natural_earth')", f"read_parquet('{ne_path}')")
     return sql

 import sqlparse
 import streamlit as st
+from gazet.config import DIVISIONS_AREA_PATH, NATURAL_EARTH_PATH
 PROJECT_ROOT = pathlib.Path(__file__).resolve().parent.parent
 DATA_DIR = pathlib.Path(
     os.environ.get("GAZET_DATA_DIR", str(PROJECT_ROOT / "data"))
 def rewrite_data_paths(sql):
+    """Replace symbolic and legacy paths with the configured runtime data paths."""
+    # Legacy fixed Docker paths must be replaced first to avoid double-expansion.
+    sql = sql.replace("/data/overture/division_area/*.parquet", DIVISIONS_AREA_PATH)
+    sql = sql.replace("/data/overture/divisions_area/*.parquet", DIVISIONS_AREA_PATH)
+    sql = sql.replace(
+        "/data/natural_earth_geoparquet/ne_geography.parquet",
+        NATURAL_EARTH_PATH,
+    )
     sql = sql.replace("/data/", f"{DATA_DIR}/")
+    sql = sql.replace("read_parquet('divisions_area')", f"read_parquet('{DIVISIONS_AREA_PATH}')")
+    sql = sql.replace("read_parquet('natural_earth')", f"read_parquet('{NATURAL_EARTH_PATH}')")
     return sql
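
`rewrite_data_paths` applies several `str.replace` calls in sequence, ending with a generic `/data/` rewrite. If a configured path itself contains `/data/` (for example a checkout under `.../gazet/data/`), a path inserted by an earlier step could be expanded again by the later one. The sketch below shows that effect and one order-independent alternative: resolve each `read_parquet('...')` argument exactly once. The paths are hypothetical stand-ins for the values in `gazet.config`, and `rewrite_once` is not part of the repository.

```python
import re

# Hypothetical runtime paths; the real values come from gazet.config.
DATA_DIR = "/home/user/gazet/data"
DIVISIONS_AREA_PATH = f"{DATA_DIR}/overture_normalized/divisions_area/*.parquet"
NATURAL_EARTH_PATH = f"{DATA_DIR}/natural_earth_normalized/ne_geography.parquet"

REWRITES = {
    "/data/overture/division_area/*.parquet": DIVISIONS_AREA_PATH,
    "/data/overture/divisions_area/*.parquet": DIVISIONS_AREA_PATH,
    "/data/natural_earth_geoparquet/ne_geography.parquet": NATURAL_EARTH_PATH,
    "divisions_area": DIVISIONS_AREA_PATH,
    "natural_earth": NATURAL_EARTH_PATH,
}

_READ_PARQUET = re.compile(r"read_parquet\('([^']*)'\)")


def _resolve(target: str) -> str:
    """Map one read_parquet argument to its runtime path."""
    if target in REWRITES:
        return REWRITES[target]
    if target.startswith("/data/"):
        # Any other legacy Docker path: re-root it under DATA_DIR.
        return DATA_DIR + target[len("/data"):]
    return target


def rewrite_once(sql: str) -> str:
    """Rewrite every read_parquet argument in a single pass."""
    return _READ_PARQUET.sub(
        lambda m: f"read_parquet('{_resolve(m.group(1))}')", sql
    )


legacy = "SELECT * FROM read_parquet('/data/overture/divisions_area/*.parquet')"
print(rewrite_once(legacy))
# A second generic replace over an already-resolved path changes it again:
print(DIVISIONS_AREA_PATH.replace("/data/", f"{DATA_DIR}/"))
```

Since already-resolved paths neither appear in `REWRITES` nor start with `/data/`, running `rewrite_once` twice gives the same result as running it once.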

finetune/train_modal_qwen35.py CHANGED Viewed

@@ -101,7 +101,7 @@ class Qwen35Config:
     # Logging / saving
     logging_steps: int = 10
     save_strategy: str = "steps"
-    save_steps: int = 400
     eval_strategy: str = "steps"
     eval_steps: int = 200
     report_to: str = "trackio"

     # Logging / saving
     logging_steps: int = 10
     save_strategy: str = "steps"
+    save_steps: int = 1000
     eval_strategy: str = "steps"
     eval_steps: int = 200
     report_to: str = "trackio"

gazet_demo.py CHANGED Viewed

@@ -68,24 +68,58 @@ def view_state_for_bbox(bbox, padding_zoom=0.8):
     return pdk.ViewState(latitude=lat, longitude=lng, zoom=zoom)
 def _render_map(geojson, placeholder):
-    n = len(geojson.get("features", []))
     if pdk and n:
         layer = pdk.Layer(
             "GeoJsonLayer",
             data=geojson,
-            get_fill_color=[40, 180, 160, 200],
-            get_line_color=[125, 211, 192, 255],
-            get_line_width=2,
             pickable=True,
         )
-        bbox = bbox_from_geojson(geojson)
-        view = (
-            view_state_for_bbox(bbox)
-            if bbox
-            else pdk.ViewState(latitude=0, longitude=0, zoom=1)
-        )
         with placeholder.container():
             st.pydeck_chart(
                 pdk.Deck(
                     layers=[layer],
@@ -128,6 +162,8 @@ st.sidebar.caption(
 if "run_q" not in st.session_state:
     st.session_state.run_q = None
 col1, col2 = st.columns([1, 2])
 with col1:
@@ -150,6 +186,7 @@ with col2:
     to_run = st.session_state.run_q
     if to_run:
         st.session_state.run_q = None
         status_ph = st.empty()
         map_ph = st.empty()
@@ -159,6 +196,8 @@ with col2:
         status_ph.info("Extracting places…")
         try:
             with requests.get(
                 f"{API}/search/stream", params={"q": to_run, "backend": backend}, stream=True, timeout=120
@@ -173,6 +212,7 @@ with col2:
                     if t == "places":
                         places = event["data"].get("places", [])
                         status_ph.info("Fuzzy-matching candidates…")
                         if places:
                             with places_ph.container():
@@ -192,6 +232,7 @@ with col2:
                                     )
                     elif t == "candidates":
                         status_ph.info("Generating SQL…")
                         with candidates_ph.container():
                             with st.expander("Candidate datasets", expanded=True):
@@ -203,6 +244,7 @@ with col2:
                     elif t == "sql_attempt":
                         iteration = event.get("iteration", "")
                         status_ph.info(f"Running SQL (attempt {iteration})…")
                         with sql_ph.container():
                             with st.expander("SQL", expanded=True):
@@ -216,6 +258,7 @@ with col2:
                     elif t == "geojson":
                         geojson = event["data"]
                         n = len(geojson.get("features", []))
                         status_ph.success(f"**{to_run}** → {n} feature(s)")
                         _render_map(geojson, map_ph)
@@ -227,3 +270,31 @@ with col2:
             status_ph.error(
                 f"API error: {e}. Is the API running? `uv run uvicorn gazet.api:app --reload`"
             )

     return pdk.ViewState(latitude=lat, longitude=lng, zoom=zoom)
+def _has_line_geometries(features):
+    """Return True if features are predominantly line/point (non-polygon) geometries."""
+    line_types = {"LineString", "MultiLineString", "Point", "MultiPoint"}
+    count = sum(
+        1 for f in features
+        if f.get("geometry", {}).get("type") in line_types
+    )
+    return count > len(features) / 2
 def _render_map(geojson, placeholder):
+    features = geojson.get("features", [])
+    n = len(features)
     if pdk and n:
+        is_linear = _has_line_geometries(features)
         layer = pdk.Layer(
             "GeoJsonLayer",
             data=geojson,
+            stroked=True,
+            filled=not is_linear,
+            get_fill_color=[40, 180, 160, 120],
+            get_line_color=[0, 140, 255, 255] if is_linear else [10, 50, 46, 255],
+            get_line_width=500 if is_linear else 80,
+            line_width_min_pixels=2 if is_linear else 1,
             pickable=True,
         )
         with placeholder.container():
+            selected_idx = None
+            if n > 1:
+                names = [
+                    f.get("properties", {}).get("name", f"Feature {i}")
+                    for i, f in enumerate(features)
+                ]
+                choice = st.selectbox(
+                    "Zoom to feature",
+                    ["All features"] + names,
+                    key="feature_zoom",
+                )
+                if choice != "All features":
+                    selected_idx = names.index(choice)
+            if selected_idx is not None:
+                single = {"type": "FeatureCollection", "features": [features[selected_idx]]}
+                bbox = bbox_from_geojson(single)
+            else:
+                bbox = bbox_from_geojson(geojson)
+            view = (
+                view_state_for_bbox(bbox)
+                if bbox
+                else pdk.ViewState(latitude=0, longitude=0, zoom=1)
+            )
             st.pydeck_chart(
                 pdk.Deck(
                     layers=[layer],
 if "run_q" not in st.session_state:
     st.session_state.run_q = None
+if "last_result" not in st.session_state:
+    st.session_state.last_result = None
 col1, col2 = st.columns([1, 2])
 with col1:
     to_run = st.session_state.run_q
     if to_run:
         st.session_state.run_q = None
+        st.session_state.last_result = None
         status_ph = st.empty()
         map_ph = st.empty()
         status_ph.info("Extracting places…")
+        result = {"query": to_run, "places": None, "candidates": None, "sql": None, "geojson": None}
         try:
             with requests.get(
                 f"{API}/search/stream", params={"q": to_run, "backend": backend}, stream=True, timeout=120
                     if t == "places":
                         places = event["data"].get("places", [])
+                        result["places"] = places
                         status_ph.info("Fuzzy-matching candidates…")
                         if places:
                             with places_ph.container():
                                     )
                     elif t == "candidates":
+                        result["candidates"] = event["data"]
                         status_ph.info("Generating SQL…")
                         with candidates_ph.container():
                             with st.expander("Candidate datasets", expanded=True):
                     elif t == "sql_attempt":
                         iteration = event.get("iteration", "")
+                        result["sql"] = event["data"]
                         status_ph.info(f"Running SQL (attempt {iteration})…")
                         with sql_ph.container():
                             with st.expander("SQL", expanded=True):
                     elif t == "geojson":
                         geojson = event["data"]
+                        result["geojson"] = geojson
                         n = len(geojson.get("features", []))
                         status_ph.success(f"**{to_run}** → {n} feature(s)")
                         _render_map(geojson, map_ph)
             status_ph.error(
                 f"API error: {e}. Is the API running? `uv run uvicorn gazet.api:app --reload`"
             )
+        st.session_state.last_result = result
+    elif st.session_state.last_result:
+        result = st.session_state.last_result
+        query = result["query"]
+        n_feat = len((result["geojson"] or {}).get("features", []))
+        st.success(f"**{query}** -> {n_feat} feature(s)")
+        _render_map(result["geojson"], st.empty())
+        if result["places"]:
+            with st.expander("Extracted place names"):
+                st.dataframe(
+                    pd.DataFrame(result["places"]).rename(
+                        columns={"place": "Place", "country": "Country", "subtype": "Subtype"}
+                    ),
+                    use_container_width=True,
+                    hide_index=True,
+                )
+        if result["candidates"]:
+            with st.expander("Candidate datasets"):
+                st.dataframe(
+                    pd.DataFrame(result["candidates"]),
+                    use_container_width=True,
+                    hide_index=True,
+                )
+        if result["sql"]:
+            with st.expander("SQL"):
+                st.code(result["sql"], language="sql")
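
The majority test in `_has_line_geometries` (gazet_demo.py, above) can be tried without Streamlit or pydeck. The sketch below uses a hypothetical name, `mostly_non_polygon`, and differs from the committed helper in one respect: it also tolerates `"geometry": null`, which GeoJSON allows for features without a location. That guard is an assumption about the input, not part of the commit.

```python
LINE_TYPES = {"LineString", "MultiLineString", "Point", "MultiPoint"}


def mostly_non_polygon(features):
    """Return True when more than half of the features are line/point geometries."""
    count = sum(
        1 for f in features
        # `or {}` covers both a missing key and an explicit null geometry.
        if (f.get("geometry") or {}).get("type") in LINE_TYPES
    )
    return count > len(features) / 2


rivers = [
    {"type": "Feature", "geometry": {"type": "LineString", "coordinates": [[0, 0], [1, 1]]}},
    {"type": "Feature", "geometry": {"type": "MultiLineString", "coordinates": [[[0, 0], [1, 1]]]}},
    {"type": "Feature", "geometry": None},
]
countries = [
    {"type": "Feature", "geometry": {"type": "Polygon", "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]}},
]

print(mostly_non_polygon(rivers))     # two of three features are lines
print(mostly_non_polygon(countries))  # polygons only
```

With this result the layer would be drawn unfilled with a wide stroke for `rivers` and filled for `countries`, matching the `filled=not is_linear` switch in the diff.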

ingest/convert_natural_earth.py CHANGED

@@ -107,7 +107,7 @@ def _load_shapefile(src: pathlib.Path, source_key: str) -> gpd.GeoDataFrame:
     # subtype: featurecla or source key
     if "featurecla" in gdf.columns:
-        subtype = gdf["featurecla"]
     else:
         subtype = pd.Series([source_key] * n)

     # subtype: featurecla or source key
     if "featurecla" in gdf.columns:
+        subtype = gdf["featurecla"].str.lower()
     else:
         subtype = pd.Series([source_key] * n)

src/gazet/config.py CHANGED

@@ -6,8 +6,33 @@ import pathlib
 _DATA_DIR = pathlib.Path(os.environ.get("GAZET_DATA_DIR", str(
     pathlib.Path(__file__).resolve().parent.parent.parent / "data"
 )))
-DIVISIONS_AREA_PATH = str(_DATA_DIR / "overture/divisions_area/*.parquet")
-NATURAL_EARTH_PATH = str(_DATA_DIR / "natural_earth_geoparquet/ne_geography.parquet")
 # MODEL = "qwen3.5:cloud"
 # MODEL = "granite4:350m"

 _DATA_DIR = pathlib.Path(os.environ.get("GAZET_DATA_DIR", str(
     pathlib.Path(__file__).resolve().parent.parent.parent / "data"
 )))
+def _prefer_normalized(path_normalized: pathlib.Path, path_original: pathlib.Path) -> pathlib.Path:
+    """Prefer normalized geodata copies when present."""
+    use_normalized = os.environ.get("GAZET_USE_NORMALIZED_DATA", "1") != "0"
+    if use_normalized:
+        parent = path_normalized.parent
+        if "*" in path_normalized.name:
+            if parent.exists() and any(parent.glob(path_normalized.name)):
+                return path_normalized
+        elif path_normalized.exists():
+            return path_normalized
+    return path_original
+DIVISIONS_AREA_PATH = str(
+    _prefer_normalized(
+        _DATA_DIR / "overture_normalized/divisions_area/*.parquet",
+        _DATA_DIR / "overture/divisions_area/*.parquet",
+    )
+)
+NATURAL_EARTH_PATH = str(
+    _prefer_normalized(
+        _DATA_DIR / "natural_earth_normalized/ne_geography.parquet",
+        _DATA_DIR / "natural_earth_geoparquet/ne_geography.parquet",
+    )
+)
 # MODEL = "qwen3.5:cloud"
 # MODEL = "granite4:350m"
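
The selection rule in `_prefer_normalized` has three observable outcomes: fall back to the original path when no normalized copy exists, pick the normalized path once a matching file is present, and fall back again when `GAZET_USE_NORMALIZED_DATA=0`. A minimal sketch that walks through them in a throwaway directory follows; the `demo` helper and the `part-0.parquet` file name are illustrative only.

```python
import os
import pathlib
import tempfile


def prefer_normalized(path_normalized: pathlib.Path, path_original: pathlib.Path) -> pathlib.Path:
    """Same selection logic as `_prefer_normalized` in the diff above."""
    if os.environ.get("GAZET_USE_NORMALIZED_DATA", "1") != "0":
        if "*" in path_normalized.name:
            # Glob pattern: require at least one matching file in the directory.
            parent = path_normalized.parent
            if parent.exists() and any(parent.glob(path_normalized.name)):
                return path_normalized
        elif path_normalized.exists():
            return path_normalized
    return path_original


def demo() -> list:
    """Return the top-level data directory chosen at each of the three steps."""
    chosen = []
    saved = os.environ.pop("GAZET_USE_NORMALIZED_DATA", None)
    try:
        with tempfile.TemporaryDirectory() as tmp:
            data = pathlib.Path(tmp)
            normalized = data / "overture_normalized/divisions_area/*.parquet"
            original = data / "overture/divisions_area/*.parquet"

            # 1. No normalized copy on disk yet.
            chosen.append(prefer_normalized(normalized, original).parts[-3])

            # 2. A normalized parquet file now exists.
            normalized.parent.mkdir(parents=True)
            (normalized.parent / "part-0.parquet").touch()
            chosen.append(prefer_normalized(normalized, original).parts[-3])

            # 3. Explicit opt-out through the environment variable.
            os.environ["GAZET_USE_NORMALIZED_DATA"] = "0"
            chosen.append(prefer_normalized(normalized, original).parts[-3])
    finally:
        # Restore the caller's environment.
        os.environ.pop("GAZET_USE_NORMALIZED_DATA", None)
        if saved is not None:
            os.environ["GAZET_USE_NORMALIZED_DATA"] = saved
    return chosen


print(demo())
```

Note that in config.py the choice is made once at import time, so creating normalized files later has no effect on a process that is already running.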

src/gazet/sql.py CHANGED

@@ -74,6 +74,28 @@ def _rewrite_data_paths(sql: str) -> str:
     return sql
 def _strip_fences(sql: Optional[str]) -> str:
     """Remove markdown code fences that the LM may wrap the SQL in."""
     if not sql:
@@ -143,6 +165,7 @@ def run_geo_sql_gguf(
         return
     sql = _rewrite_data_paths(sql)
     print(f"\n[SQL·GGUF] Generated:\n{sql}\n")
     yield {"type": "sql_attempt", "sql": sql, "iteration": 1}
     yield from _execute_sql(con, sql, "SQL·GGUF", iteration=1)
@@ -183,6 +206,8 @@ def run_geo_sql_dspy(
                 execution_error=error,
             )
             sql = _strip_fences(pred.sql)
         except Exception as exc:
             error = f"LM generation failed: {exc}"
             print(f"Generation error: {error}")

     return sql
+# Title-cased NE subtype literals the trained model may emit.
+# Data is now fully lowercased, so we normalise at query time.
+_NE_SUBTYPE_FIXES = {
+    "'River'": "'river'",
+    "'Lake'": "'lake'",
+    "'Basin'": "'basin'",
+    "'Range/mtn'": "'range/mtn'",
+    "'Peninsula'": "'peninsula'",
+    "'Depression'": "'depression'",
+    "'Island group'": "'island group'",
+    "'Ocean'": "'ocean'",
+    "'Sea'": "'sea'",
+}
+def _normalize_ne_subtypes(sql: str) -> str:
+    """Lowercase known NE subtype literals so they match the normalised data."""
+    for old, new in _NE_SUBTYPE_FIXES.items():
+        sql = sql.replace(old, new)
+    return sql
 def _strip_fences(sql: Optional[str]) -> str:
     """Remove markdown code fences that the LM may wrap the SQL in."""
     if not sql:
         return
     sql = _rewrite_data_paths(sql)
+    sql = _normalize_ne_subtypes(sql)
     print(f"\n[SQL·GGUF] Generated:\n{sql}\n")
     yield {"type": "sql_attempt", "sql": sql, "iteration": 1}
     yield from _execute_sql(con, sql, "SQL·GGUF", iteration=1)
                 execution_error=error,
             )
             sql = _strip_fences(pred.sql)
+            sql = _rewrite_data_paths(sql)
+            sql = _normalize_ne_subtypes(sql)
         except Exception as exc:
             error = f"LM generation failed: {exc}"
             print(f"Generation error: {error}")
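
The `_normalize_ne_subtypes` helper added in src/gazet/sql.py is a plain string substitution over a fixed table. A minimal standalone sketch of its effect on generated SQL follows; the table is copied from the diff, while the sample queries and the `ne` table name are made up for illustration.

```python
NE_SUBTYPE_FIXES = {
    "'River'": "'river'",
    "'Lake'": "'lake'",
    "'Basin'": "'basin'",
    "'Range/mtn'": "'range/mtn'",
    "'Peninsula'": "'peninsula'",
    "'Depression'": "'depression'",
    "'Island group'": "'island group'",
    "'Ocean'": "'ocean'",
    "'Sea'": "'sea'",
}


def normalize_ne_subtypes(sql: str) -> str:
    """Lowercase the known title-cased subtype literals."""
    for old, new in NE_SUBTYPE_FIXES.items():
        sql = sql.replace(old, new)
    return sql


# Hypothetical model output using the old title-cased subtypes.
generated = "SELECT name FROM ne WHERE subtype = 'River' OR subtype = 'Range/mtn'"
print(normalize_ne_subtypes(generated))

# The substitution is not column-aware: a matching literal compared against
# another column is lowercased as well.
print(normalize_ne_subtypes("SELECT * FROM ne WHERE name = 'Sea'"))
```

Because only the listed literals are touched, a subtype missing from the table passes through unchanged and would no longer match the lowercased data.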