Commit a5fd57d by srmsoumya · 1 Parent(s): dfb9466

Lowercase subtype names, fix question type hints

IMPROVEMENTS.md ADDED
@@ -0,0 +1,79 @@
+ # Gazet Improvement Notes
+
+ Issues identified during testing. Each item is a candidate for the next training/template pass.
+
+ ---
+
+ ## 1. Missing "buffer-only" template
+
+ **Query**: "10 km buffer around Odisha"
+ **Expected**: Return the buffered geometry polygon itself.
+ **Actual**: Model picks `buffer_01`, which finds all features intersecting the buffer (200 rows).
+
+ **Root cause**: All buffer templates (`buffer_01` through `buffer_05`) perform an intersection join to find neighboring features. No template simply returns `ST_AsGeoJSON(ST_Buffer(...))`.
+
+ **Fix**: Add a `buffer_06` template that returns the buffer polygon directly:
+
+ ```sql
+ SELECT ST_AsGeoJSON(ST_Buffer(geometry, {buffer_km} * 1000.0 / 111320.0)) AS geometry
+ FROM read_parquet('divisions_area')
+ WHERE id = '{anchor_id}'
+ ```
+
+ With hints like "10 km buffer around {anchor_name}", "draw a {buffer_km} km buffer around {anchor_name}". Consider a NE variant too.
+
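The buffer-only SQL above depends on a flat kilometre-to-degree conversion. The sketch below shows one way the placeholders might be filled in Python. `render_buffer_only_sql` and `km_to_degrees` are hypothetical names for illustration and are not taken from the repository; the 111 320 constant is copied from the SQL above and is only a rough approximation away from the equator.

```python
# Illustrative sketch only: helper names are assumptions, not repository code.
METRES_PER_DEGREE = 111_320.0  # same constant as in the proposed template SQL

BUFFER_ONLY_SQL = (
    "SELECT ST_AsGeoJSON(ST_Buffer(geometry, {buffer_km} * 1000.0 / 111320.0)) AS geometry\n"
    "FROM read_parquet('divisions_area')\n"
    "WHERE id = '{anchor_id}'"
)


def km_to_degrees(buffer_km: float) -> float:
    """Convert kilometres to degrees using the same flat approximation as the SQL."""
    return buffer_km * 1000.0 / METRES_PER_DEGREE


def render_buffer_only_sql(anchor_id: str, buffer_km: float) -> str:
    """Fill the two template placeholders for a single anchor."""
    return BUFFER_ONLY_SQL.format(anchor_id=anchor_id, buffer_km=buffer_km)


if __name__ == "__main__":
    # "example_id" is a placeholder, not a real division id.
    print(render_buffer_only_sql("example_id", 10))
    print(km_to_degrees(10))
```

Because the query has no join, it returns exactly one row per anchor, which is the behaviour the "Expected" line asks for.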
+ ---
+
+ ## 2. Place extractor misses NE physical features (mixed-source queries)
+
+ **Query**: "The part of Ecuador that is in the Amazon Basin"
+ **Expected**: Place extractor returns both "Ecuador" and "Amazon Basin"; candidate search finds correct IDs for both.
+ **Actual**: Only "Ecuador" extracted. SQL model uses memorized wrong NE ID (`ne_1159120655` = Cuando River) instead of the correct one (`ne_1159104325` = AMAZON BASIN).
+
+ **Root cause**: The GGUF place extraction model was not trained to extract physical features. The runtime prompt (`_PLACES_SYSTEM_PROMPT`) has been updated but the finetuned model may ignore prompt changes. A re-finetune with NE feature examples is the definitive fix.
+
+ **Affected templates**: `partial_05`, `diff_02` (mixed-source), and all NE-anchored templates (`intersect_03`, `contain_03/04`, `buffer_03/04/05`, `lookup_02`).
+
+ ---
+
+ ## 3. Missing NE-anchor to county intersection template
+
+ **Query**: "Indravati River flows through which districts"
+ **Expected**: `ST_Intersects` with `target_subtype='county'`
+ **Actual**: Model sometimes uses `ST_Within` (wrong predicate) because `intersect_03` only targets `region`, not `county`.
+
+ **Fix**: Add `intersect_05` (NE anchor -> county, `ST_Intersects`) with district-oriented question hints.
+
+ ---
+
+ ## 4. Model hallucinates NE subtype values
+
+ **Query**: "which mountain ranges cross Odisha"
+ **Expected**: `n.subtype IN ('range/mtn', 'peninsula', 'depression')` (from `adj_05`)
+ **Actual**: Model generates `'Terrain area'` which does not exist in the data.
+
+ **Fix**: More training examples for `adj_05`. Consider adding common hallucinated values to `_NE_SUBTYPE_FIXES` in `sql.py` as a runtime safety net.
+
+ ---
+
+ ## 5. NE subtype casing inconsistency between model output and data
+
+ **Example**: Model generates `'River'`, `'Basin'`, `'Ocean'` but data has `'river'`, `'basin'`, `'ocean'`.
+
+ **Current workaround**: `_normalize_ne_subtypes()` in `sql.py` does string replacement of known title-cased literals at query time (`_NE_SUBTYPE_FIXES` dict). This is brittle and only covers a hardcoded list.
+
+ **Root cause**: The original Natural Earth data had title-cased `featurecla` values (e.g. `River`, `Basin`, `Ocean`). Training data was generated before the lowercase fix to `convert_natural_earth.py`, so the model learned to emit title-cased subtypes. The data is now lowercased but the model still outputs the old casing.
+
+ **Fix**: Regenerate training data with the lowercased NE parquet so all subtype literals in SQL examples are lowercase. After re-finetune, the model will natively emit lowercase subtypes and the `_normalize_ne_subtypes` hack can be removed.
+
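Until a re-finetune lands, the runtime safety net from items 4 and 5 could be made less brittle by matching quoted literals case-insensitively instead of listing every title-cased variant. The sketch below is an assumption about how that might look, not the actual `_normalize_ne_subtypes()` in `sql.py`. `KNOWN_NE_SUBTYPES`, `HALLUCINATED_FIXES`, and `normalize_subtype_literals` are hypothetical names, and the `'terrain area'` mapping is a placeholder choice rather than a verified equivalence.

```python
import re

# Assumed set of valid lowercase subtypes; the real source of truth is the NE parquet.
KNOWN_NE_SUBTYPES = {
    "sea", "ocean", "lake", "river", "basin", "gulf", "bay",
    "island group", "peninsula", "strait", "range/mtn", "depression",
}

# Placeholder mapping from a hallucinated value to a real one (illustrative only).
HALLUCINATED_FIXES = {"terrain area": "range/mtn"}

# Matches any single-quoted SQL literal without embedded quotes.
_LITERAL = re.compile(r"'([^']*)'")


def normalize_subtype_literals(sql: str) -> str:
    """Rewrite quoted literals that match a known NE subtype, ignoring case."""

    def fix(match: re.Match) -> str:
        lowered = match.group(1).lower()
        lowered = HALLUCINATED_FIXES.get(lowered, lowered)
        if lowered in KNOWN_NE_SUBTYPES:
            return f"'{lowered}'"
        # Anything else (ids, file paths, place names) is left untouched.
        return match.group(0)

    return _LITERAL.sub(fix, sql)


if __name__ == "__main__":
    print(normalize_subtype_literals(
        "WHERE n.subtype IN ('River', 'Terrain area') AND a.id = 'ne_1159104325'"
    ))
```

One caveat of this sketch: it rewrites every matching literal, so a name filter such as `= 'Basin'` would also be lowercased. Restricting the substitution to `subtype` comparisons would avoid that at the cost of a more involved pattern.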
+ ---
+
+ ## 6. "Largest/smallest" queries always return at least 3 results
+
+ **Query**: "the largest region in India", "smallest county in France"
+ **Expected**: Return 1 result (the single largest/smallest).
+ **Actual**: Model generates `LIMIT 3` by default, returning top 3 instead of 1.
+
+ **Root cause**: The aggregation templates (`agg_01`, `agg_02`) use `LIMIT 3` as the default. The model learns this as a fixed pattern and applies it even when the query clearly asks for a single result ("the largest", "the smallest").
+
+ **Fix**: During data generation, vary the LIMIT value based on the question hint phrasing. Use `LIMIT 1` for singular hints ("the largest X", "the smallest X") and `LIMIT 3` or `LIMIT 5` for plural hints ("the 3 largest", "top 5 smallest"). This teaches the model to infer the correct LIMIT from the query.
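
The fix for item 6 comes down to keeping the question phrasing and the LIMIT value in agreement. Below is a minimal sketch of one way to do that, assuming hints are told apart by whether they contain a `{top_n}` placeholder. `pick_hint_and_limit` is a hypothetical helper, and the fixed `target_subtype` and `anchor_name` values are placeholders for illustration.

```python
import random
from typing import List, Tuple


def pick_hint_and_limit(question_hints: List[str], rng: random.Random) -> Tuple[str, int]:
    """Pick a question hint and a LIMIT value that agree with each other."""
    singular = [h for h in question_hints if "{top_n}" not in h]
    plural = [h for h in question_hints if "{top_n}" in h]

    top_n = rng.choice([1, 3, 5, 10])
    if top_n == 1 and singular:
        pool = singular
    else:
        pool = plural or question_hints

    hint = rng.choice(pool)
    # A hint without a {top_n} slot reads as a single-result question,
    # so force LIMIT 1 to keep the SQL aligned with the wording.
    if "{top_n}" not in hint:
        top_n = 1

    question = hint.format(top_n=top_n, target_subtype="region", anchor_name="India")
    return question, top_n


if __name__ == "__main__":
    hints = [
        "the {top_n} biggest {target_subtype}s within {anchor_name}",
        "largest {target_subtype} in {anchor_name}",
    ]
    print(pick_hint_and_limit(hints, random.Random(0)))
```

Forcing `top_n` back to 1 for placeholder-free hints is a design choice of this sketch. It guarantees that a question such as "largest region in India" is never paired with `LIMIT 3` in the generated SQL.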
dataset/config.yaml CHANGED
@@ -19,11 +19,11 @@ countries:
  sample_targets:
  direct_lookup: 1000
  disambiguation: 2000 # 3 templates (disambiguate_01..03) - "Puri, Odisha" pattern
- adjacency: 2000 # 6 templates (adj_01..06) - adj_06 is counties
+ adjacency: 2000 # 8 templates (adj_01..08) - adj_05/07/08 cover terrain queries, adj_06 is counties
  multi_adjacency: 1000
  containment: 2000 # 4 templates (contain_01..04) - contain_02 reversed, contain_03/04 NE anchor
- intersection: 2000 # 4 templates (intersect_01..04) - intersect_02/03 NE anchor
- buffer: 2000 # 5 templates (buffer_01..05)
+ intersection: 2000 # 5 templates (intersect_01..05) - intersect_02/03/05 include NE anchor patterns
+ buffer: 2000 # 6 templates (buffer_01..06)
  chained: 2000 # 11 templates (chained_01..11) - 10/11 are coastal/inland regions
  difference: 2000 # 2 templates, one is mixed (diff_02)
  border_corridor: 1000
dataset/scripts/build_relations.py CHANGED
@@ -52,16 +52,19 @@ _CONTAINMENT_SUBTYPE_PAIRS = (
  _NE_CROSS_SOURCE_SUBTYPES = (
  "sea",
  "ocean",
- "Lake",
- "River",
- "Basin",
+ "lake",
+ "river",
+ "basin",
  "gulf",
  "bay",
- "Island group",
- "Peninsula",
  "strait",
- "Range/mtn",
- "Depression",
+ "range/mtn",
+ "plateau",
+ "plain",
+ "lowland",
+ "valley",
+ "depression",
+ "gorge",
  )
dataset/scripts/export_training_data.py CHANGED
@@ -7,8 +7,8 @@ Produces two task datasets from the same source samples:
  2. Place extraction (prompt = question only, completion = PlacesResult JSON)

  Place extraction pairs are derived automatically: for each SQL sample the
- selected_candidates give us the correct place names, subtypes, and country
- codes that the extractor should return.
+ selected_candidates give us the correct place names that the extractor should
+ return.

  Output layout (all paths relative to dataset/):
  output/runs/{run_name}/sql/train.jsonl
@@ -124,7 +124,7 @@ You have access to two DuckDB parquet tables. Given a set of candidate entities
  id VARCHAR -- unique feature id prefixed 'ne_'
  names STRUCT("primary" VARCHAR, ...)
  country VARCHAR
- subtype VARCHAR -- e.g. 'ocean', 'sea', 'bay', 'Range/mtn', 'Island group'
+ subtype VARCHAR -- e.g. 'ocean', 'sea', 'bay', 'range/mtn', 'island group'
  class VARCHAR
  region VARCHAR
  admin_level INTEGER
@@ -139,7 +139,7 @@ Use ST_AsGeoJSON(geometry) for all geometry outputs."""

  _CANDIDATES_COLS = [
  "source", "id", "name", "subtype", "country", "region",
- "admin_level", "similarity",
+ "admin_level",
  ]


@@ -209,87 +209,74 @@ def sample_to_sql_pair(sample: Dict[str, Any]) -> Optional[Dict]:
  _PLACE_SYSTEM = """You are a geographic entity extractor. Extract the place names the user is asking about and return valid JSON only.

  OUTPUT FORMAT:
- {"places": [{"place": "<name>", "country": "<ISO-2>", "subtype": "<subtype>"}]}
- "country" and "subtype" are optional; omit if not applicable.
+ {"places": [{"place": "<name>"}]}

  RULES:
- - Extract the place(s) that are the target of the query.
- - When a place is followed by its containing region, state, or country as disambiguation context ("Puri, Odisha", "Lisboa, Portugal", "Goa, India", "Manchester in US"), extract ONLY the specific place. Do not return the container as a separate place — record its info on the target using `country` (ISO-2) when unambiguous.
- - When a query names two or more distinct anchors joined by words like "and", "both", "between", "or" ("France and Germany", "between Nairobi and Mombasa"), or mixes an admin area with a physical feature as independent anchors ("part of Ecuador in the Amazon basin"), extract every anchor in the order they appear.
- - Do not infer or expand category nouns like "regions", "districts", "counties", "rivers", "mountains" when they refer to a type rather than a specific place ("regions of India" -> extract "India" only).
+ - Extract the place or places that are the actual anchors of the query.
+ - Physical features are valid places: oceans, seas, gulfs, bays, straits, rivers, lakes, basins, mountain ranges, peninsulas, island groups, deserts, and terrain regions.
+ - When a place is followed by its containing region, state, or country as disambiguation context ("Puri, Odisha", "Lisboa, Portugal", "Goa, India", "Manchester in US"), extract ONLY the specific place. Do not return the container as a separate place.
+ - When a query names two or more distinct anchors joined by words like "and", "both", "between", or mixes an admin area with a physical feature as separate anchors, extract every anchor in the order they appear.
+ - Do not infer or expand category nouns like "regions", "districts", "counties", "rivers", or "mountains" when they refer to a type rather than a specific named place ("regions of India" -> extract "India" only).
+ - Only extract places explicitly mentioned.
  - No duplicate place names.
- - "country": ISO 3166-1 alpha-2. Include only if explicitly mentioned or unambiguous.
- - "subtype": include only when the geographic level is clear from the query.
-
- SUBTYPES:
- country, dependency, region, county, localadmin, locality, macrohood, neighborhood, microhood
- - Default to locality for cities/towns; omit for physical features (oceans, seas, rivers, lakes, basins, mountains, ranges, peninsulas, islands, terrain areas).

  EXAMPLES:
  Query: "Puri, Odisha"
- -> {"places": [{"place": "Puri", "subtype": "locality", "country": "IN"}]}
+ -> {"places": [{"place": "Puri"}]}

  Query: "Lisboa, Portugal"
- -> {"places": [{"place": "Lisboa", "subtype": "locality", "country": "PT"}]}
+ -> {"places": [{"place": "Lisboa"}]}

  Query: "Goa, India"
- -> {"places": [{"place": "Goa", "subtype": "region", "country": "IN"}]}
+ -> {"places": [{"place": "Goa"}]}

  Query: "Manchester in US"
- -> {"places": [{"place": "Manchester", "subtype": "locality", "country": "US"}]}
+ -> {"places": [{"place": "Manchester"}]}

  Query: "Springfield, Illinois"
- -> {"places": [{"place": "Springfield", "subtype": "locality", "country": "US"}]}
+ -> {"places": [{"place": "Springfield"}]}

  Query: "coastal districts of Brazil"
- -> {"places": [{"place": "Brazil", "subtype": "country"}]}
+ -> {"places": [{"place": "Brazil"}]}

  Query: "northern half of India"
- -> {"places": [{"place": "India", "subtype": "country"}]}
+ -> {"places": [{"place": "India"}]}

  Query: "what's within 50 km of Paris?"
- -> {"places": [{"place": "Paris", "subtype": "locality"}]}
+ -> {"places": [{"place": "Paris"}]}

  Query: "countries the Nile crosses"
  -> {"places": [{"place": "Nile"}]}

+ Query: "which countries touch the Gulf of Maine"
+ -> {"places": [{"place": "Gulf of Maine"}]}
+
+ Query: "10 km buffer around Odisha"
+ -> {"places": [{"place": "Odisha"}]}
+
  Query: "part of Ecuador in the Amazon basin"
- -> {"places": [{"place": "Ecuador", "subtype": "country"}, {"place": "Amazon basin"}]}
+ -> {"places": [{"place": "Ecuador"}, {"place": "Amazon basin"}]}

  Query: "Amazon basin inside Ecuador"
- -> {"places": [{"place": "Amazon basin"}, {"place": "Ecuador", "subtype": "country"}]}
+ -> {"places": [{"place": "Amazon basin"}, {"place": "Ecuador"}]}
+
+ Query: "the part of Chad in Lake Chad"
+ -> {"places": [{"place": "Chad"}, {"place": "Lake Chad"}]}

  Query: "which regions border both France and Germany?"
- -> {"places": [{"place": "France", "subtype": "country"}, {"place": "Germany", "subtype": "country"}]}
+ -> {"places": [{"place": "France"}, {"place": "Germany"}]}

  Query: "merge Nairobi and Mombasa"
- -> {"places": [{"place": "Nairobi", "subtype": "locality"}, {"place": "Mombasa", "subtype": "locality"}]}"""
-
- # Overture division subtypes — used to filter out natural_earth candidates
- # from the place extraction output (NE features don't have these subtypes).
- _DIVISION_SUBTYPES = {
- "country", "region", "dependency", "county", "localadmin",
- "locality", "macrohood", "neighborhood", "microhood",
- }
+ -> {"places": [{"place": "Nairobi"}, {"place": "Mombasa"}]}"""


  def _candidate_to_place(c: Dict) -> Optional[Dict]:
- """Convert a selected candidate to a Place dict for PlacesResult."""
+ """Convert a selected candidate to a minimal Place dict for PlacesResult."""
  name = c.get("name", "").strip()
  if not name:
  return None

- place: Dict[str, Any] = {"place": name}
-
- subtype = c.get("subtype", "")
- if subtype in _DIVISION_SUBTYPES:
- place["subtype"] = subtype
-
- country = c.get("country", "")
- if country and len(country) == 2:
- place["country"] = country
-
- return place
+ return {"place": name}


  def sample_to_place_pair(sample: Dict[str, Any]) -> Optional[Dict]:
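
With the simplified `{"place": "<name>"}` schema, a place-extraction completion can be derived from selected candidates with very little code. The sketch below is an assumption about how that derivation might look, not the body of `sample_to_place_pair`. `places_completion` is a hypothetical helper, and the deduplication step is added here only to mirror the prompt's "No duplicate place names" rule.

```python
import json
from typing import Any, Dict, List, Optional


def candidate_to_place(candidate: Dict[str, Any]) -> Optional[Dict[str, str]]:
    """Keep only the stripped name; subtype and country are no longer emitted."""
    name = (candidate.get("name") or "").strip()
    return {"place": name} if name else None


def places_completion(selected_candidates: List[Dict[str, Any]]) -> str:
    """Build the JSON completion string, preserving order and dropping duplicates."""
    places: List[Dict[str, str]] = []
    seen = set()
    for candidate in selected_candidates:
        place = candidate_to_place(candidate)
        if place is None or place["place"] in seen:
            continue
        seen.add(place["place"])
        places.append(place)
    return json.dumps({"places": places}, ensure_ascii=False)


if __name__ == "__main__":
    # Candidate dicts are made-up inputs for illustration.
    print(places_completion([
        {"name": "Ecuador", "subtype": "country", "country": "EC"},
        {"name": "Amazon basin", "subtype": "basin"},
    ]))
```

Dropping `subtype` and `country` from the target means division candidates and Natural Earth candidates are treated the same way, so physical features are no longer a special case in the extractor's output.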
dataset/scripts/generate_samples.py CHANGED
@@ -61,29 +61,30 @@ get_templates_by_family = sql_templates.get_templates_by_family
  _NE_NAMED_LOOKUP_SUBTYPES = {
- 'sea', 'ocean', 'Lake', 'River', 'Basin', 'gulf', 'bay',
- 'Island group', 'Peninsula', 'strait', 'Range/mtn', 'Depression',
+ 'sea', 'ocean', 'lake', 'river', 'basin', 'gulf', 'bay',
+ 'island group', 'peninsula', 'strait', 'range/mtn', 'depression',
  }

  _NE_TEMPLATE_SUBTYPES = {
- 'lookup_02': {'sea', 'ocean', 'Lake', 'River', 'Basin', 'gulf', 'bay', 'Island group', 'Peninsula', 'strait', 'Range/mtn', 'Depression'},
+ 'lookup_02': {'sea', 'ocean', 'lake', 'river', 'basin', 'gulf', 'bay', 'island group', 'peninsula', 'strait', 'range/mtn', 'depression'},
  'adj_03': {'sea', 'ocean'},
- 'adj_04': {'River', 'Lake', 'Basin'},
- 'adj_05': {'Range/mtn', 'Peninsula', 'Depression'},
- 'contain_03': {'sea', 'ocean', 'gulf', 'bay', 'Basin', 'Island group', 'Peninsula', 'Range/mtn', 'Depression'},
+ 'adj_04': {'river', 'lake', 'basin'},
+ 'adj_05': {'range/mtn', 'peninsula', 'depression'},
+ 'contain_03': {'sea', 'ocean', 'gulf', 'bay', 'basin', 'island group', 'peninsula', 'range/mtn', 'depression'},
  'contain_04': {'sea', 'ocean', 'gulf', 'bay', 'strait'},
- 'intersect_02': {'River', 'Lake', 'Basin', 'gulf', 'bay', 'strait', 'Range/mtn', 'Peninsula', 'Depression'},
- 'intersect_03': {'River', 'Lake', 'Basin', 'gulf', 'bay', 'strait', 'Range/mtn', 'Peninsula', 'Depression'},
- 'buffer_03': {'sea', 'ocean', 'Lake', 'River', 'Basin', 'gulf', 'bay', 'Island group', 'Peninsula', 'strait', 'Range/mtn', 'Depression'},
- 'buffer_04': {'sea', 'ocean', 'Lake', 'River', 'Basin', 'gulf', 'bay', 'Island group', 'Peninsula', 'strait', 'Range/mtn', 'Depression'},
- 'buffer_05': {'sea', 'ocean', 'Lake', 'River', 'Basin', 'gulf', 'bay', 'Island group', 'Peninsula', 'strait', 'Range/mtn', 'Depression'},
- 'chained_03': {'Island group', 'Peninsula', 'Range/mtn', 'Depression'},
- 'chained_04': {'River', 'Lake', 'Basin'},
- 'chained_05': {'Range/mtn', 'Depression'},
- 'chained_08': {'River', 'Lake', 'Basin'},
- 'chained_09': {'Range/mtn', 'Depression'},
- 'partial_05': {'sea', 'ocean', 'Lake', 'River', 'Basin', 'gulf', 'bay', 'Island group', 'Peninsula', 'strait', 'Range/mtn', 'Depression'},
- 'diff_02': {'sea', 'ocean', 'Lake', 'River', 'Basin', 'gulf', 'bay', 'Island group', 'Peninsula', 'strait', 'Range/mtn', 'Depression'},
+ 'intersect_02': {'river', 'lake', 'basin', 'gulf', 'bay', 'strait', 'range/mtn', 'peninsula', 'depression'},
+ 'intersect_03': {'river', 'lake', 'basin', 'gulf', 'bay', 'strait', 'range/mtn', 'peninsula', 'depression'},
+ 'intersect_05': {'river', 'lake', 'basin', 'gulf', 'bay', 'strait', 'range/mtn', 'peninsula', 'depression'},
+ 'buffer_03': {'sea', 'ocean', 'lake', 'river', 'basin', 'gulf', 'bay', 'island group', 'peninsula', 'strait', 'range/mtn', 'depression'},
+ 'buffer_04': {'sea', 'ocean', 'lake', 'river', 'basin', 'gulf', 'bay', 'island group', 'peninsula', 'strait', 'range/mtn', 'depression'},
+ 'buffer_05': {'sea', 'ocean', 'lake', 'river', 'basin', 'gulf', 'bay', 'island group', 'peninsula', 'strait', 'range/mtn', 'depression'},
+ 'chained_03': {'island group', 'peninsula', 'range/mtn', 'depression'},
+ 'chained_04': {'river', 'lake', 'basin'},
+ 'chained_05': {'range/mtn', 'depression'},
+ 'chained_08': {'river', 'lake', 'basin'},
+ 'chained_09': {'range/mtn', 'depression'},
+ 'partial_05': {'sea', 'ocean', 'lake', 'river', 'basin', 'gulf', 'bay', 'island group', 'peninsula', 'strait', 'range/mtn', 'depression'},
+ 'diff_02': {'sea', 'ocean', 'lake', 'river', 'basin', 'gulf', 'bay', 'island group', 'peninsula', 'strait', 'range/mtn', 'depression'},
  }


@@ -722,14 +723,17 @@ def generate_template_based_sample(
  anchor = {"id": pair["contained_id"], "name": pair["contained_name"]}

  elif template.family == "adjacency":
- # adj_03/04/05 target natural_earth features (seas, rivers, ranges).
+ # adj_03/04/05/07/08 target natural_earth features (marine, water,
+ # mountain, plateau, and broad landform regions).
  # Their SQL hardcodes NE subtypes and does not use {target_subtype}.
  # Sample from cross_source_relations so the anchor is a division
  # that actually intersects the right NE features.
  _NE_ADJ_SUBTYPES = {
  "adj_03": ("ocean", "sea"),
- "adj_04": ("River", "Lake", "Basin"),
- "adj_05": ("Range/mtn", "Peninsula", "Depression"),
+ "adj_04": ("river", "lake", "basin"),
+ "adj_05": ("range/mtn",),
+ "adj_07": ("plateau",),
+ "adj_08": ("plain", "lowland", "basin", "valley", "depression", "gorge"),
  }
  if template.template_id in _NE_ADJ_SUBTYPES:
  cs_df = tables.get('cross_source_relations', pd.DataFrame())
@@ -1019,8 +1023,10 @@ def generate_template_based_sample(

  elif template.family == "buffer":
  # Buffer operations
- # Kilometre distances used by buffer_01 and buffer_03 templates.
- # Metre distances used by buffer_02 and buffer_04 templates.
+ # Kilometre distances used by km-based buffer templates (for example
+ # buffer_01, buffer_03, buffer_05, and buffer_06).
+ # Metre distances used by metre-based buffer templates (buffer_02 and
+ # buffer_04).
  # The template SQL divides by 111 320 to convert to degrees.
  _buffer_km_choices = [1, 2, 5, 10, 25, 50, 100, 200]
  _buffer_m_choices = [100, 250, 500, 1000, 2000, 5000]
@@ -1166,8 +1172,13 @@ def generate_template_based_sample(
  candidates = _merge_candidate_lists(div_cands, ne_cands, max_total=10)

  elif template.family == "aggregation":
- top_n = random.choice([3, 5, 10])
+ # Teach the model to distinguish singular superlatives ("the largest")
+ # from explicit top-N requests ("top 5 largest").
+ top_n = random.choice([1, 3, 5, 10])
  target_subtype = random.choice(['locality', 'region'])
+ singular_hints = [h for h in template.question_hints if '{top_n}' not in h]
+ plural_hints = [h for h in template.question_hints if '{top_n}' in h]
+ question_hint_pool = singular_hints if top_n == 1 and singular_hints else plural_hints or template.question_hints

  if template.template_id in ['agg_03', 'agg_04']:
  # Country-level aggregation: SQL uses country code, so the anchor
@@ -1194,7 +1205,7 @@ def generate_template_based_sample(
  num_candidates=10, difficulty="hard"
  )

- question = random.choice(template.question_hints).format(
+ question = random.choice(question_hint_pool).format(
  top_n=top_n,
  target_subtype=target_subtype,
  anchor_name=anchor['name'],
@@ -1216,7 +1227,7 @@ def generate_template_based_sample(
  num_candidates=10, difficulty="hard"
  )

- question = random.choice(template.question_hints).format(
+ question = random.choice(question_hint_pool).format(
  top_n=top_n,
  target_subtype=target_subtype,
  anchor_name=anchor['container_name'],
dataset/scripts/sql_templates.py CHANGED
@@ -125,6 +125,10 @@ TEMPLATES = [
125
  "map the {anchor_name}",
126
  "how big is the {anchor_name}?",
127
  "outline of the {anchor_name}",
 
 
 
 
128
  ],
129
  ),
130
 
@@ -163,6 +167,10 @@ TEMPLATES = [
163
  "{anchor_name} {container_name}",
164
  "pull up {anchor_name} in {container_name}",
165
  "find {anchor_name} in {container_name}",
 
 
 
 
166
  ],
167
  ),
168
 
@@ -189,6 +197,10 @@ TEMPLATES = [
189
  "pull up {anchor_name} ({container_name})",
190
  "find {anchor_name} in {container_name}",
191
  "{anchor_name} {container_name}",
 
 
 
 
192
  ],
193
  ),
194
 
@@ -215,6 +227,10 @@ TEMPLATES = [
215
  "{anchor_name} province of {container_name}",
216
  "pull up {anchor_name} in {container_name}",
217
  "find {anchor_name} {container_name}",
 
 
 
 
218
  ],
219
  ),
220
 
@@ -246,6 +262,11 @@ TEMPLATES = [
246
  "what surrounds {anchor_name}?",
247
  "places next to {anchor_name}",
248
  "everything bordering {anchor_name}",
 
 
 
 
 
249
  ],
250
  ),
251
 
@@ -275,6 +296,10 @@ TEMPLATES = [
275
  "which {target_subtype}s are adjacent to {anchor_name}?",
276
  "{target_subtype}s along the {anchor_name} border",
277
  "find {target_subtype}s next to {anchor_name}",
 
 
 
 
278
  ],
279
  ),
280
 
@@ -305,6 +330,11 @@ TEMPLATES = [
305
  "which water bodies does {anchor_name} border?",
306
  "does {anchor_name} have sea access?",
307
  "what ocean is {anchor_name} on?",
 
 
 
 
 
308
  ],
309
  ),
310
 
@@ -368,6 +398,12 @@ TEMPLATES = [
368
  "regions adjacent to both {anchor_1_name} and {anchor_2_name}",
369
  "what lies between {anchor_1_name} and {anchor_2_name}?",
370
  "common neighbours of {anchor_1_name} and {anchor_2_name}",
 
 
 
 
 
 
371
  ],
372
  ),
373
 
@@ -399,6 +435,13 @@ TEMPLATES = [
399
  "all {target_subtype}s within {anchor_name}",
400
  "{target_subtype}s of {anchor_name}",
401
  "show every {target_subtype} in {anchor_name}",
 
 
 
 
 
 
 
402
  ],
403
  ),
404
 
@@ -428,6 +471,12 @@ TEMPLATES = [
428
  "{anchor_name} is part of which country?",
429
  "where is {anchor_name}",
430
  "what country is {anchor_name} in",
 
 
 
 
 
 
431
  ],
432
  ),
433
 
@@ -456,6 +505,14 @@ TEMPLATES = [
456
  "all regions inside the {anchor_name}",
457
  "what {target_subtype}s does the {anchor_name} contain?",
458
  "{target_subtype}s covered by the {anchor_name}",
 
 
 
 
 
 
 
 
459
  ],
460
  ),
461
 
@@ -486,6 +543,12 @@ TEMPLATES = [
486
  "which {target_subtype}s overlap {anchor_name}?",
487
  "{target_subtype}s partially inside {anchor_name}",
488
  "what {target_subtype}s extend into {anchor_name}?",
 
 
 
 
 
 
489
  ],
490
  ),
491
 
@@ -516,6 +579,10 @@ TEMPLATES = [
516
  "countries along the {anchor_name}",
517
  "what countries does the {anchor_name} cover?",
518
  "countries the {anchor_name} spans across",
 
 
 
 
519
  ],
520
  ),
521
 
@@ -607,6 +674,15 @@ TEMPLATES = [
607
  "what falls within {buffer_km} km of the {anchor_name}?",
608
  "admin divisions within a {buffer_km} km radius of the {anchor_name}",
609
  "places within {buffer_km} kilometers of the {anchor_name}",
 
 
 
 
 
 
 
 
 
610
  ],
611
  ),
612
 
@@ -633,6 +709,40 @@ TEMPLATES = [
633
  "admin units within {buffer_m} m of the {anchor_name}",
634
  "places within {buffer_m} metres of the {anchor_name}",
635
  "{buffer_m} meter buffer around the {anchor_name}",
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
636
  ],
637
  ),
638
 
@@ -670,6 +780,11 @@ TEMPLATES = [
670
  "{target_subtype}s in {anchor_name} bordering the sea",
671
  "oceanfront {target_subtype}s in {anchor_name}",
672
  "which {target_subtype}s in {anchor_name} have a coastline?",
 
 
 
 
 
673
  ],
674
  ),
675
 
@@ -702,6 +817,11 @@ TEMPLATES = [
702
  "{target_subtype}s in {anchor_name} with no coastline",
703
  "which {target_subtype}s within {anchor_name} are landlocked?",
704
  "interior {target_subtype}s of {anchor_name} with no ocean border",
 
 
 
 
 
705
  ],
706
  ),
707
 
@@ -723,7 +843,7 @@ TEMPLATES = [
723
  " AND ST_Within(b.geometry, region.geometry)"
724
  " AND EXISTS ("
725
  " SELECT 1 FROM read_parquet('natural_earth') AS n"
726
- " WHERE n.subtype IN ('Range/mtn', 'Island group', 'Peninsula', 'Depression')"
727
  " AND ST_Intersects(b.geometry, n.geometry)"
728
  " )"
729
  ),
@@ -732,6 +852,10 @@ TEMPLATES = [
732
  "{target_subtype}s of {anchor_name} on a peninsula or island group",
733
  "{target_subtype}s within {anchor_name} on notable landforms",
734
  "island and peninsula {target_subtype}s of {anchor_name}",
 
 
 
 
735
  ],
736
  ),
737
 
@@ -764,6 +888,10 @@ TEMPLATES = [
764
  "{anchor_1_name} with {anchor_2_name} cut out",
765
  "subtract {anchor_2_name} from {anchor_1_name}",
766
  "what's left of {anchor_1_name} after removing {anchor_2_name}?",
 
 
 
 
767
  ],
768
  ),
769
 
@@ -792,6 +920,10 @@ TEMPLATES = [
792
  "{anchor_name} with the {clip_feature_name} removed",
793
  "what's left of {anchor_name} after removing the {clip_feature_name}?",
794
  "show me {anchor_name} excluding the {clip_feature_name}",
 
 
 
 
795
  ],
796
  ),
797
 
@@ -828,6 +960,11 @@ TEMPLATES = [
828
  "the region straddling the border of {anchor_1_name} and {anchor_2_name} within {buffer_km} km",
829
  "{buffer_km} km on either side of the {anchor_1_name} and {anchor_2_name} border",
830
  "buffer the {anchor_1_name}-{anchor_2_name} boundary by {buffer_km} km",
 
 
 
 
 
831
  ],
832
  ),
833
 
@@ -899,6 +1036,10 @@ TEMPLATES = [
899
  "show {target_subtype}s across {anchor_1_name} and {anchor_2_name}",
900
  "{target_subtype}s belonging to {anchor_1_name} and {anchor_2_name}",
901
  "list {target_subtype}s in both {anchor_1_name} and {anchor_2_name}",
 
 
 
 
902
  ],
903
  ),
904
 
@@ -921,6 +1062,10 @@ TEMPLATES = [
921
  "all {target_subtype}s in {anchor_1_name}, {anchor_2_name} and {anchor_3_name}",
922
  "show {target_subtype}s across {anchor_1_name}, {anchor_2_name} and {anchor_3_name}",
923
  "list {target_subtype}s in {anchor_1_name}, {anchor_2_name} and {anchor_3_name}",
 
 
 
 
924
  ],
925
  ),
926
 
@@ -947,6 +1092,11 @@ TEMPLATES = [
947
  "union of all {target_subtype}s within {anchor_name}",
948
  "all {target_subtype}s of {anchor_name} merged together",
949
  "the overall extent of {target_subtype}s in {anchor_name}",
 
 
 
 
 
950
  ],
951
  ),
952
 
@@ -979,6 +1129,11 @@ TEMPLATES = [
979
  "the top half of {anchor_name}",
980
  "northern portion of {anchor_name}",
981
  "upper half of {anchor_name}",
 
 
 
 
 
982
  ],
983
  ),
984
 
@@ -1008,6 +1163,11 @@ TEMPLATES = [
1008
  "the bottom half of {anchor_name}",
1009
  "southern portion of {anchor_name}",
1010
  "lower half of {anchor_name}",
 
 
 
 
 
1011
  ],
1012
  ),
1013
 
@@ -1036,6 +1196,11 @@ TEMPLATES = [
1036
  "eastern part of {anchor_name}",
1037
  "the right half of {anchor_name}",
1038
  "eastern portion of {anchor_name}",
 
 
 
 
 
1039
  ],
1040
  ),
1041
 
@@ -1064,6 +1229,11 @@ TEMPLATES = [
1064
  "western part of {anchor_name}",
1065
  "the left half of {anchor_name}",
1066
  "western portion of {anchor_name}",
 
 
 
 
 
1067
  ],
1068
  ),
1069
 
@@ -1095,6 +1265,10 @@ TEMPLATES = [
1095
  "{clip_feature_name} inside {anchor_name}",
1096
  "parts of {anchor_name} covered by the {clip_feature_name}",
1097
  "show me where {anchor_name} and the {clip_feature_name} overlap",
 
 
 
 
1098
  ],
1099
  ),
1100
 
@@ -1129,6 +1303,10 @@ TEMPLATES = [
1129
  "the {top_n} biggest {target_subtype}s within {anchor_name}",
1130
  "largest {target_subtype} in {anchor_name}",
1131
  "which {target_subtype} in {anchor_name} has the most area?",
 
 
 
 
1132
  ],
1133
  ),
1134
 
@@ -1160,6 +1338,10 @@ TEMPLATES = [
1160
  "the {top_n} tiniest {target_subtype}s within {anchor_name}",
1161
  "smallest {target_subtype} in {anchor_name}",
1162
  "which {target_subtype} in {anchor_name} has the least area?",
 
 
 
 
1163
  ],
1164
  ),
1165
 
@@ -1188,6 +1370,10 @@ TEMPLATES = [
1188
  "the {top_n} largest {target_subtype}s in {anchor_name}",
1189
  "biggest {target_subtype} in {anchor_name}",
1190
  "which {target_subtype} in {anchor_name} is the largest?",
 
 
 
 
1191
  ],
1192
  ),
1193
 
@@ -1216,6 +1402,10 @@ TEMPLATES = [
1216
  "the {top_n} smallest {target_subtype}s in {anchor_name}",
1217
  "smallest {target_subtype} in {anchor_name}",
1218
  "which {target_subtype} in {anchor_name} is the smallest?",
 
 
 
 
1219
  ],
1220
  ),
1221
 
@@ -1249,6 +1439,11 @@ TEMPLATES = [
1249
  "biggest {target_subtype} per region in {anchor_name}",
1250
  "largest {target_subtype} for every region of {anchor_name}",
1251
  "the biggest {target_subtype} in each province of {anchor_name}",
 
 
 
 
 
1252
  ],
1253
  ),
1254
 
@@ -1279,6 +1474,11 @@ TEMPLATES = [
1279
  "smallest {target_subtype} per region in {anchor_name}",
1280
  "tiniest {target_subtype} for every region of {anchor_name}",
1281
  "the smallest {target_subtype} in each province of {anchor_name}",
 
 
 
 
 
1282
  ],
1283
  ),
1284
 
@@ -1307,6 +1507,10 @@ TEMPLATES = [
1307
  "{anchor_name}'s land {target_subtype}s",
1308
  "dependencies of {anchor_name} with land area",
1309
  "show the land dependencies of {anchor_name}",
 
 
 
 
1310
  ],
1311
  ),
1312
 
@@ -1330,6 +1534,11 @@ TEMPLATES = [
1330
  "official territorial divisions of {anchor_name}",
1331
  "recognised territorial {target_subtype}s belonging to {anchor_name}",
1332
  "which territorial regions does {anchor_name} have?",
 
 
 
 
 
1333
  ],
1334
  ),
1335
 
@@ -1353,6 +1562,11 @@ TEMPLATES = [
1353
  "{target_subtype}s of {anchor_name} that are not on land",
1354
  "water-associated {target_subtype}s of {anchor_name}",
1355
  "marine or offshore {target_subtype}s of {anchor_name}",
 
 
 
 
 
1356
  ],
1357
  ),
1358
 
@@ -1374,7 +1588,7 @@ TEMPLATES = [
1374
  " SELECT n.id, n.names.\"primary\" AS name, n.subtype,"
1375
  " ST_AsGeoJSON(n.geometry) AS geometry"
1376
  " FROM read_parquet('natural_earth') AS n, a"
1377
- " WHERE n.subtype IN ('River', 'Lake', 'Basin')"
1378
  " AND ST_Intersects(a.geometry, n.geometry)"
1379
  ),
1380
  question_hints=[
@@ -1386,6 +1600,10 @@ TEMPLATES = [
1386
  "what bodies of water cross {anchor_name}?",
1387
  "rivers of {anchor_name}",
1388
  "show me the lakes in {anchor_name}",
 
 
 
 
1389
  ],
1390
  ),
1391
 
@@ -1395,7 +1613,7 @@ TEMPLATES = [
1395
  sql_difficulty="medium",
1396
  anchor_source="divisions_area",
1397
  num_anchors=1,
1398
- target_subtype="range",
1399
  sql_template=(
1400
  "WITH a AS ("
1401
  " SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
@@ -1403,18 +1621,84 @@ TEMPLATES = [
1403
  " SELECT n.id, n.names.\"primary\" AS name, n.subtype,"
1404
  " ST_AsGeoJSON(n.geometry) AS geometry"
1405
  " FROM read_parquet('natural_earth') AS n, a"
1406
- " WHERE n.subtype IN ('Range/mtn', 'Peninsula', 'Depression')"
1407
  " AND ST_Intersects(a.geometry, n.geometry)"
1408
  ),
1409
  question_hints=[
1410
  "what mountain ranges are in {anchor_name}?",
1411
- "terrain features of {anchor_name}",
1412
  "which mountain ranges cross {anchor_name}?",
1413
- "landforms inside {anchor_name}",
1414
- "peninsulas and ranges in {anchor_name}",
1415
- "geographic features within {anchor_name}",
1416
  "mountains of {anchor_name}",
1417
- "what terrain does {anchor_name} contain?",
1418
  ],
1419
  ),
1420
 
@@ -1449,6 +1733,13 @@ TEMPLATES = [
1449
  "what provinces does the {anchor_name} span?",
1450
  "regions along the {anchor_name}",
1451
  "which provinces overlap the {anchor_name}?",
 
 
 
 
 
 
 
1452
  ],
1453
  ),
1454
 
@@ -1475,6 +1766,45 @@ TEMPLATES = [
1475
  "everything natural that touches {anchor_name}",
1476
  "what geographic features does {anchor_name} contain?",
1477
  "natural features within or crossing {anchor_name}",
1478
  ],
1479
  ),
1480
 
@@ -1500,7 +1830,7 @@ TEMPLATES = [
1500
  " AND ST_Within(b.geometry, region.geometry)"
1501
  " AND EXISTS ("
1502
  " SELECT 1 FROM read_parquet('natural_earth') AS n"
1503
- " WHERE n.subtype IN ('River', 'Lake', 'Basin')"
1504
  " AND ST_Intersects(b.geometry, n.geometry)"
1505
  " )"
1506
  ),
@@ -1512,6 +1842,10 @@ TEMPLATES = [
1512
  "{target_subtype}s in {anchor_name} that touch a river",
1513
  "which {target_subtype}s in {anchor_name} are on a lake?",
1514
  "waterfront {target_subtype}s of {anchor_name}",
 
 
 
 
1515
  ],
1516
  ),
1517
 
@@ -1533,7 +1867,7 @@ TEMPLATES = [
1533
  " AND ST_Within(b.geometry, region.geometry)"
1534
  " AND EXISTS ("
1535
  " SELECT 1 FROM read_parquet('natural_earth') AS n"
1536
- " WHERE n.subtype IN ('Range/mtn', 'Depression')"
1537
  " AND ST_Intersects(b.geometry, n.geometry)"
1538
  " )"
1539
  ),
@@ -1544,6 +1878,10 @@ TEMPLATES = [
1544
  "highland {target_subtype}s within {anchor_name}",
1545
  "{target_subtype}s of {anchor_name} in mountainous terrain",
1546
  "{target_subtype}s in {anchor_name} near a mountain range",
 
 
 
 
1547
  ],
1548
  ),
1549
 
@@ -1581,6 +1919,10 @@ TEMPLATES = [
1581
  "{target_subtype}s of {anchor_name} with ocean access",
1582
  "which {target_subtype}s in {anchor_name} touch the sea?",
1583
  "maritime {target_subtype}s of {anchor_name}",
 
 
 
 
1584
  ],
1585
  ),
1586
 
@@ -1613,6 +1955,10 @@ TEMPLATES = [
1613
  "{target_subtype}s in {anchor_name} with no sea access",
1614
  "non-coastal {target_subtype}s of {anchor_name}",
1615
  "inland {target_subtype}s of {anchor_name}",
 
 
 
 
1616
  ],
1617
  ),
1618
 
@@ -1634,7 +1980,7 @@ TEMPLATES = [
1634
  " AND ST_Within(b.geometry, region.geometry)"
1635
  " AND EXISTS ("
1636
  " SELECT 1 FROM read_parquet('natural_earth') AS n"
1637
- " WHERE n.subtype IN ('River', 'Lake', 'Basin')"
1638
  " AND ST_Intersects(b.geometry, n.geometry)"
1639
  " )"
1640
  ),
@@ -1645,6 +1991,10 @@ TEMPLATES = [
1645
  "lakeside {target_subtype}s within {anchor_name}",
1646
  "{target_subtype}s of {anchor_name} along a river",
1647
  "which {target_subtype}s in {anchor_name} border a lake?",
 
 
 
 
1648
  ],
1649
  ),
1650
 
@@ -1666,7 +2016,7 @@ TEMPLATES = [
1666
  " AND ST_Within(b.geometry, region.geometry)"
1667
  " AND EXISTS ("
1668
  " SELECT 1 FROM read_parquet('natural_earth') AS n"
1669
- " WHERE n.subtype IN ('Range/mtn', 'Depression')"
1670
  " AND ST_Intersects(b.geometry, n.geometry)"
1671
  " )"
1672
  ),
@@ -1677,6 +2027,10 @@ TEMPLATES = [
1677
  "highland {target_subtype}s within {anchor_name}",
1678
  "{target_subtype}s of {anchor_name} in mountainous terrain",
1679
  "which {target_subtype}s in {anchor_name} have mountain ranges?",
 
 
 
 
1680
  ],
1681
  ),
1682
 
@@ -1718,6 +2072,10 @@ TEMPLATES = [
1718
  "seaside regions of {anchor_name}",
1719
  "which provinces of {anchor_name} touch the sea?",
1720
  "states of {anchor_name} along the coast",
 
 
 
 
1721
  ],
1722
  ),
1723
 
@@ -1752,6 +2110,10 @@ TEMPLATES = [
1752
  "regions of {anchor_name} without sea access",
1753
  "interior states of {anchor_name}",
1754
  "states of {anchor_name} that don't border the ocean",
 
 
 
 
1755
  ],
1756
  ),
1757
 
@@ -1784,6 +2146,10 @@ TEMPLATES = [
1784
  "which countries touch the {anchor_name}?",
1785
  "countries with coastline on the {anchor_name}",
1786
  "what nations lie on the {anchor_name}?",
 
 
 
 
1787
  ],
1788
  ),
1789
 
@@ -1816,6 +2182,15 @@ TEMPLATES = [
1816
  "everything within {buffer_km} km of the {anchor_name}",
1817
  "what natural features are close to the {anchor_name}?",
1818
  "{buffer_km} km radius around the {anchor_name}",
 
 
 
 
 
 
 
 
 
1819
  ],
1820
  ),
1821
 
 
125
  "map the {anchor_name}",
126
  "how big is the {anchor_name}?",
127
  "outline of the {anchor_name}",
128
+ "show the shape of the {anchor_name}",
129
+ "trace the {anchor_name}",
130
+ "map out the {anchor_name}",
131
+ "where exactly is the {anchor_name}",
132
  ],
133
  ),
134
 
 
167
  "{anchor_name} {container_name}",
168
  "pull up {anchor_name} in {container_name}",
169
  "find {anchor_name} in {container_name}",
170
+ "locate {anchor_name} in {container_name}",
171
+ "need {anchor_name} from {container_name}",
172
+ "show {anchor_name} under {container_name}",
173
+ "{anchor_name} near {container_name}",
174
  ],
175
  ),
176
 
 
197
  "pull up {anchor_name} ({container_name})",
198
  "find {anchor_name} in {container_name}",
199
  "{anchor_name} {container_name}",
200
+ "locate {anchor_name} in {container_name}",
201
+ "need the {anchor_name} in {container_name}",
202
+ "show {anchor_name} from {container_name}",
203
+ "bring up {anchor_name}, {container_name}",
204
  ],
205
  ),
206
 
 
227
  "{anchor_name} province of {container_name}",
228
  "pull up {anchor_name} in {container_name}",
229
  "find {anchor_name} {container_name}",
230
+ "locate {anchor_name} within {container_name}",
231
+ "show the {anchor_name} part of {container_name}",
232
+ "need {anchor_name} from {container_name}",
233
+ "bring up {anchor_name} in {container_name}",
234
  ],
235
  ),
236
 
 
262
  "what surrounds {anchor_name}?",
263
  "places next to {anchor_name}",
264
  "everything bordering {anchor_name}",
265
+ "show adjacent places to {anchor_name}",
266
+ "areas touching {anchor_name}",
267
+ "find the neighbours of {anchor_name}",
268
+ "bordering places for {anchor_name}",
269
+ "places that meet {anchor_name}",
270
  ],
271
  ),
272
 
 
296
  "which {target_subtype}s are adjacent to {anchor_name}?",
297
  "{target_subtype}s along the {anchor_name} border",
298
  "find {target_subtype}s next to {anchor_name}",
299
+ "show {target_subtype}s bordering {anchor_name}",
300
+ "{target_subtype}s beside {anchor_name}",
301
+ "all {target_subtype}s touching {anchor_name}",
302
+ "{target_subtype}s meeting {anchor_name}",
303
  ],
304
  ),
305
 
 
330
  "which water bodies does {anchor_name} border?",
331
  "does {anchor_name} have sea access?",
332
  "what ocean is {anchor_name} on?",
333
+ "is {anchor_name} on the coast?",
334
+ "what sea is off the coast of {anchor_name}?",
335
+ "which ocean lies off {anchor_name}?",
336
+ "what water is {anchor_name} on the shore of?",
337
+ "which sea or ocean is {anchor_name} along?",
338
  ],
339
  ),
340
 
 
398
  "regions adjacent to both {anchor_1_name} and {anchor_2_name}",
399
  "what lies between {anchor_1_name} and {anchor_2_name}?",
400
  "common neighbours of {anchor_1_name} and {anchor_2_name}",
401
+ "show places touching both {anchor_1_name} and {anchor_2_name}",
402
+ "areas bordering both {anchor_1_name} and {anchor_2_name}",
403
+ "shared neighbours of {anchor_1_name} and {anchor_2_name}",
404
+ "find places adjacent to both {anchor_1_name} and {anchor_2_name}",
405
+ "places meeting both {anchor_1_name} and {anchor_2_name}",
406
+ "places on the border of both {anchor_1_name} and {anchor_2_name}",
407
  ],
408
  ),
409
 
 
435
  "all {target_subtype}s within {anchor_name}",
436
  "{target_subtype}s of {anchor_name}",
437
  "show every {target_subtype} in {anchor_name}",
438
+ "show {target_subtype}s inside {anchor_name}",
439
+ "find {target_subtype}s in {anchor_name}",
440
+ "give me the {target_subtype}s in {anchor_name}",
441
+ "{anchor_name} {target_subtype}s",
442
+ "which {target_subtype}s does {anchor_name} contain?",
443
+ "what all {target_subtype}s are there in {anchor_name}?",
444
+ "{target_subtype}s under {anchor_name}",
445
  ],
446
  ),
447
 
 
471
  "{anchor_name} is part of which country?",
472
  "where is {anchor_name}",
473
  "what country is {anchor_name} in",
474
+ "{anchor_name} belongs to which country?",
475
+ "show country for {anchor_name}",
476
+ "find the country of {anchor_name}",
477
+ "which country does {anchor_name} fall in?",
478
+ "{anchor_name} under which country",
479
+ "tell me the country for {anchor_name}",
480
  ],
481
  ),
482
 
 
505
  "all regions inside the {anchor_name}",
506
  "what {target_subtype}s does the {anchor_name} contain?",
507
  "{target_subtype}s covered by the {anchor_name}",
508
+ "which regions are in the {anchor_name} basin?",
509
+ "what admin regions lie within the {anchor_name}?",
510
+ "which provinces are inside the {anchor_name}?",
511
+ "show the regions in the {anchor_name}",
512
+ "find provinces inside the {anchor_name}",
513
+ "give me admin regions within the {anchor_name}",
514
+ "regions belonging to the {anchor_name}",
515
+ "areas contained in the {anchor_name}",
516
  ],
517
  ),
518
 
 
543
  "which {target_subtype}s overlap {anchor_name}?",
544
  "{target_subtype}s partially inside {anchor_name}",
545
  "what {target_subtype}s extend into {anchor_name}?",
546
+ "show {target_subtype}s overlapping {anchor_name}",
547
+ "find {target_subtype}s crossing {anchor_name}",
548
+ "{target_subtype}s meeting {anchor_name}",
549
+ "areas intersecting {anchor_name}",
550
+ "{anchor_name} overlapping {target_subtype}s",
551
+ "which {target_subtype}s are partly in {anchor_name}?",
552
  ],
553
  ),
554
 
 
579
  "countries along the {anchor_name}",
580
  "what countries does the {anchor_name} cover?",
581
  "countries the {anchor_name} spans across",
582
+ "what countries is the {anchor_name} in?",
583
+ "which countries lie along the {anchor_name}?",
584
+ "what countries does the {anchor_name} run through?",
585
+ "which countries border the {anchor_name}?",
586
  ],
587
  ),
588
 
 
674
  "what falls within {buffer_km} km of the {anchor_name}?",
675
  "admin divisions within a {buffer_km} km radius of the {anchor_name}",
676
  "places within {buffer_km} kilometers of the {anchor_name}",
677
+ "what places are near the {anchor_name}?",
678
+ "what admin areas are close to the {anchor_name}?",
679
+ "which regions are around the {anchor_name}?",
680
+ "what lies within {buffer_km} km of the shoreline of the {anchor_name}?",
681
+ "show places around the {anchor_name}",
682
+ "find areas near the {anchor_name}",
683
+ "admin units close to the {anchor_name}",
684
+ "what is around the {anchor_name} within {buffer_km} km",
685
+ "give me nearby admin regions for the {anchor_name}",
686
  ],
687
  ),
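Review note: several of the hints added to this buffer template name only `{anchor_name}` and leave out `{buffer_km}`. If that is intentional (for example, a default distance is sampled elsewhere), no change is needed. The sketch below is one way to list which hints omit a given placeholder; `placeholders` and `hints_missing` are hypothetical helpers, not functions from this repository.

```python
from string import Formatter


def placeholders(hint: str) -> set[str]:
    """Return the set of {field} names used in a hint string."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for trailing literal text.
    return {name for _, name, _, _ in Formatter().parse(hint) if name}


def hints_missing(hints: list[str], required: str) -> list[str]:
    """Return the hints that never mention the required placeholder."""
    return [h for h in hints if required not in placeholders(h)]


if __name__ == "__main__":
    sample_hints = [
        "what places are near the {anchor_name}?",
        "what is around the {anchor_name} within {buffer_km} km",
    ]
    for hint in hints_missing(sample_hints, "buffer_km"):
        print("no buffer_km:", hint)
```

Running this over every template's `question_hints` would show at a glance where a distance is implied rather than stated.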
688
 
 
709
  "admin units within {buffer_m} m of the {anchor_name}",
710
  "places within {buffer_m} metres of the {anchor_name}",
711
  "{buffer_m} meter buffer around the {anchor_name}",
712
+ "what places are right next to the {anchor_name}?",
713
+ "what lies close to the {anchor_name}?",
714
+ "which admin units are near the edge of the {anchor_name}?",
715
+ "show places very near the {anchor_name}",
716
+ "find admin areas next to the {anchor_name}",
717
+ "what is just around the {anchor_name}",
718
+ "give me places within {buffer_m} m of the {anchor_name}",
719
+ ],
720
+ ),
721
+
722
+ SQLTemplate(
723
+ template_id="buffer_06",
724
+ family="buffer",
725
+ sql_difficulty="medium",
726
+ anchor_source="divisions_area",
727
+ num_anchors=1,
728
+ requires_buffer=True,
729
+ sql_template=(
730
+ "SELECT ST_AsGeoJSON(ST_Buffer(geometry, {buffer_km} * 1000.0 / 111320.0)) AS geometry"
731
+ " FROM read_parquet('divisions_area')"
732
+ " WHERE id = '{anchor_id}'"
733
+ ),
734
+ question_hints=[
735
+ "{buffer_km} km buffer around {anchor_name}",
736
+ "draw a {buffer_km} km buffer around {anchor_name}",
737
+ "show the {buffer_km} km buffer around {anchor_name}",
738
+ "create a {buffer_km} kilometer buffer around {anchor_name}",
739
+ "map a {buffer_km} km radius around {anchor_name}",
740
+ "outline the {buffer_km} km buffer around {anchor_name}",
741
+ "make a {buffer_km} km zone around {anchor_name}",
742
+ "generate a {buffer_km} km radius for {anchor_name}",
743
+ "buffer {anchor_name} by {buffer_km} km",
744
+ "show radius {buffer_km} km from {anchor_name}",
745
+ "{anchor_name} with a {buffer_km} km buffer",
746
  ],
747
  ),
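Review note: `buffer_06` converts kilometres to degrees with `{buffer_km} * 1000.0 / 111320.0`, i.e. about 111.32 km per degree. The sketch below shows how the template might be filled in and what that conversion yields. It assumes placeholders are substituted with plain `str.format` (the real rendering code is not part of this diff); `render_buffer_sql`, `km_to_degrees`, and the anchor id are hypothetical.

```python
# Sketch only: assumes str.format substitution; the repository's actual
# rendering path is not shown in this diff.
BUFFER_06_SQL = (
    "SELECT ST_AsGeoJSON(ST_Buffer(geometry, {buffer_km} * 1000.0 / 111320.0)) AS geometry"
    " FROM read_parquet('divisions_area')"
    " WHERE id = '{anchor_id}'"
)

METERS_PER_DEGREE = 111320.0  # the constant used inside the template


def km_to_degrees(buffer_km: float) -> float:
    """Mirror the template's km -> degrees arithmetic."""
    return buffer_km * 1000.0 / METERS_PER_DEGREE


def render_buffer_sql(anchor_id: str, buffer_km: float) -> str:
    """Fill the template's two placeholders (hypothetical helper)."""
    return BUFFER_06_SQL.format(anchor_id=anchor_id, buffer_km=buffer_km)


if __name__ == "__main__":
    print(render_buffer_sql("example-anchor-id", 10))
    print(km_to_degrees(10))
```

The fixed degrees-per-metre factor matches north-south distances; east-west distances per degree shrink with the cosine of latitude, so the buffer is narrower in kilometres along that axis away from the equator.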
748
 
 
780
  "{target_subtype}s in {anchor_name} bordering the sea",
781
  "oceanfront {target_subtype}s in {anchor_name}",
782
  "which {target_subtype}s in {anchor_name} have a coastline?",
783
+ "show coastal {target_subtype}s in {anchor_name}",
784
+ "find {target_subtype}s of {anchor_name} on the shore",
785
+ "{target_subtype}s of {anchor_name} by the sea",
786
+ "which {target_subtype}s of {anchor_name} touch the ocean?",
787
+ "coastline {target_subtype}s in {anchor_name}",
788
  ],
789
  ),
790
 
 
817
  "{target_subtype}s in {anchor_name} with no coastline",
818
  "which {target_subtype}s within {anchor_name} are landlocked?",
819
  "interior {target_subtype}s of {anchor_name} with no ocean border",
820
+ "show inland {target_subtype}s in {anchor_name}",
821
+ "find non-coastal {target_subtype}s of {anchor_name}",
822
+ "{target_subtype}s of {anchor_name} away from the sea",
823
+ "which {target_subtype}s in {anchor_name} are not coastal?",
824
+ "inner {target_subtype}s of {anchor_name}",
825
  ],
826
  ),
827
 
 
843
  " AND ST_Within(b.geometry, region.geometry)"
844
  " AND EXISTS ("
845
  " SELECT 1 FROM read_parquet('natural_earth') AS n"
846
+ " WHERE n.subtype IN ('range/mtn', 'island group', 'peninsula', 'depression')"
847
  " AND ST_Intersects(b.geometry, n.geometry)"
848
  " )"
849
  ),
 
852
  "{target_subtype}s of {anchor_name} on a peninsula or island group",
853
  "{target_subtype}s within {anchor_name} on notable landforms",
854
  "island and peninsula {target_subtype}s of {anchor_name}",
855
+ "show {target_subtype}s in {anchor_name} on major landforms",
856
+ "find {target_subtype}s of {anchor_name} on islands or peninsulas",
857
+ "{target_subtype}s in {anchor_name} on terrain regions",
858
+ "places in {anchor_name} associated with islands or peninsulas",
859
  ],
860
  ),
861
 
 
888
  "{anchor_1_name} with {anchor_2_name} cut out",
889
  "subtract {anchor_2_name} from {anchor_1_name}",
890
  "what's left of {anchor_1_name} after removing {anchor_2_name}?",
891
+ "difference between {anchor_1_name} and {anchor_2_name}",
892
+ "keep only {anchor_1_name} outside {anchor_2_name}",
893
+ "cut {anchor_2_name} out of {anchor_1_name}",
894
+ "show {anchor_1_name} without {anchor_2_name}",
895
  ],
896
  ),
897
 
 
920
  "{anchor_name} with the {clip_feature_name} removed",
921
  "what's left of {anchor_name} after removing the {clip_feature_name}?",
922
  "show me {anchor_name} excluding the {clip_feature_name}",
923
+ "keep only the part of {anchor_name} outside the {clip_feature_name}",
924
+ "cut the {clip_feature_name} out of {anchor_name}",
925
+ "difference of {anchor_name} and the {clip_feature_name}",
926
+ "{anchor_name} after subtracting the {clip_feature_name}",
927
  ],
928
  ),
929
 
 
960
  "the region straddling the border of {anchor_1_name} and {anchor_2_name} within {buffer_km} km",
961
  "{buffer_km} km on either side of the {anchor_1_name} and {anchor_2_name} border",
962
  "buffer the {anchor_1_name}-{anchor_2_name} boundary by {buffer_km} km",
963
+ "show the border zone between {anchor_1_name} and {anchor_2_name}",
964
+ "map the corridor along the {anchor_1_name}-{anchor_2_name} border",
965
+ "find the area near the border of {anchor_1_name} and {anchor_2_name}",
966
+ "give me the border buffer for {anchor_1_name} and {anchor_2_name}",
967
+ "border area of {anchor_1_name} and {anchor_2_name} within {buffer_km} km",
968
  ],
969
  ),
970
 
 
1036
  "show {target_subtype}s across {anchor_1_name} and {anchor_2_name}",
1037
  "{target_subtype}s belonging to {anchor_1_name} and {anchor_2_name}",
1038
  "list {target_subtype}s in both {anchor_1_name} and {anchor_2_name}",
1039
+ "give me {target_subtype}s from {anchor_1_name} and {anchor_2_name}",
1040
+ "{anchor_1_name} and {anchor_2_name} {target_subtype}s",
1041
+ "show all {target_subtype}s for {anchor_1_name} plus {anchor_2_name}",
1042
+ "find {target_subtype}s across both {anchor_1_name} and {anchor_2_name}",
1043
  ],
1044
  ),
1045
 
 
1062
  "all {target_subtype}s in {anchor_1_name}, {anchor_2_name} and {anchor_3_name}",
1063
  "show {target_subtype}s across {anchor_1_name}, {anchor_2_name} and {anchor_3_name}",
1064
  "list {target_subtype}s in {anchor_1_name}, {anchor_2_name} and {anchor_3_name}",
1065
+ "give me {target_subtype}s from {anchor_1_name}, {anchor_2_name}, and {anchor_3_name}",
1066
+ "{anchor_1_name}, {anchor_2_name}, and {anchor_3_name} {target_subtype}s",
1067
+ "show all {target_subtype}s for these three: {anchor_1_name}, {anchor_2_name}, {anchor_3_name}",
1068
+ "find {target_subtype}s across {anchor_1_name}, {anchor_2_name}, and {anchor_3_name}",
1069
  ],
1070
  ),
1071
 
 
1092
  "union of all {target_subtype}s within {anchor_name}",
1093
  "all {target_subtype}s of {anchor_name} merged together",
1094
  "the overall extent of {target_subtype}s in {anchor_name}",
1095
+ "show one merged shape for all {target_subtype}s in {anchor_name}",
1096
+ "dissolve all {target_subtype}s in {anchor_name}",
1097
+ "make one geometry from all {target_subtype}s in {anchor_name}",
1098
+ "single outline of all {target_subtype}s in {anchor_name}",
1099
+ "combine the {target_subtype}s of {anchor_name} into one area",
1100
  ],
1101
  ),
1102
 
 
1129
  "the top half of {anchor_name}",
1130
  "northern portion of {anchor_name}",
1131
  "upper half of {anchor_name}",
1132
+ "top side of {anchor_name}",
1133
+ "show north half of {anchor_name}",
1134
+ "cut {anchor_name} to the north half",
1135
+ "only the northern side of {anchor_name}",
1136
+ "north part of {anchor_name}",
1137
  ],
1138
  ),
1139
 
 
1163
  "the bottom half of {anchor_name}",
1164
  "southern portion of {anchor_name}",
1165
  "lower half of {anchor_name}",
1166
+ "bottom side of {anchor_name}",
1167
+ "show south half of {anchor_name}",
1168
+ "cut {anchor_name} to the south half",
1169
+ "only the southern side of {anchor_name}",
1170
+ "south part of {anchor_name}",
1171
  ],
1172
  ),
1173
 
 
1196
  "eastern part of {anchor_name}",
1197
  "the right half of {anchor_name}",
1198
  "eastern portion of {anchor_name}",
1199
+ "east side of {anchor_name}",
1200
+ "show east half of {anchor_name}",
1201
+ "cut {anchor_name} to the east half",
1202
+ "only the eastern side of {anchor_name}",
1203
+ "right side of {anchor_name}",
1204
  ],
1205
  ),
1206
 
 
1229
  "western part of {anchor_name}",
1230
  "the left half of {anchor_name}",
1231
  "western portion of {anchor_name}",
1232
+ "west side of {anchor_name}",
1233
+ "show west half of {anchor_name}",
1234
+ "cut {anchor_name} to the west half",
1235
+ "only the western side of {anchor_name}",
1236
+ "left side of {anchor_name}",
1237
  ],
1238
  ),
1239
 
 
1265
  "{clip_feature_name} inside {anchor_name}",
1266
  "parts of {anchor_name} covered by the {clip_feature_name}",
1267
  "show me where {anchor_name} and the {clip_feature_name} overlap",
1268
+ "keep only the part of {anchor_name} in the {clip_feature_name}",
1269
+ "intersection of {anchor_name} and the {clip_feature_name}",
1270
+ "give me the overlap between {anchor_name} and the {clip_feature_name}",
1271
+ "only the shared area of {anchor_name} and the {clip_feature_name}",
1272
  ],
1273
  ),
1274
 
 
1303
  "the {top_n} biggest {target_subtype}s within {anchor_name}",
1304
  "largest {target_subtype} in {anchor_name}",
1305
  "which {target_subtype} in {anchor_name} has the most area?",
1306
+ "show the biggest {target_subtype}s in {anchor_name}",
1307
+ "list the largest {target_subtype}s for {anchor_name}",
1308
+ "give me the biggest {target_subtype}s in {anchor_name}",
1309
+ "{anchor_name} largest {target_subtype}s",
1310
  ],
1311
  ),
1312
 
 
1338
  "the {top_n} tiniest {target_subtype}s within {anchor_name}",
1339
  "smallest {target_subtype} in {anchor_name}",
1340
  "which {target_subtype} in {anchor_name} has the least area?",
1341
+ "show the smallest {target_subtype}s in {anchor_name}",
1342
+ "list the smallest {target_subtype}s for {anchor_name}",
1343
+ "give me the tiniest {target_subtype}s in {anchor_name}",
1344
+ "{anchor_name} smallest {target_subtype}s",
1345
  ],
1346
  ),
1347
 
 
1370
  "the {top_n} largest {target_subtype}s in {anchor_name}",
1371
  "biggest {target_subtype} in {anchor_name}",
1372
  "which {target_subtype} in {anchor_name} is the largest?",
1373
+ "show the largest {target_subtype}s in {anchor_name}",
1374
+ "list the biggest {target_subtype}s in {anchor_name}",
1375
+ "give me the largest {target_subtype}s for {anchor_name}",
1376
+ "{anchor_name} biggest {target_subtype}s",
1377
  ],
1378
  ),
1379
 
 
1402
  "the {top_n} smallest {target_subtype}s in {anchor_name}",
1403
  "smallest {target_subtype} in {anchor_name}",
1404
  "which {target_subtype} in {anchor_name} is the smallest?",
1405
+ "show the smallest {target_subtype}s in {anchor_name}",
1406
+ "list the smallest {target_subtype}s in {anchor_name}",
1407
+ "give me the smallest {target_subtype}s for {anchor_name}",
1408
+ "{anchor_name} tiniest {target_subtype}s",
1409
  ],
1410
  ),
1411
 
 
1439
  "biggest {target_subtype} per region in {anchor_name}",
1440
  "largest {target_subtype} for every region of {anchor_name}",
1441
  "the biggest {target_subtype} in each province of {anchor_name}",
1442
+ "show the largest {target_subtype} in every region of {anchor_name}",
1443
+ "list biggest {target_subtype} by region in {anchor_name}",
1444
+ "for each region in {anchor_name}, give the largest {target_subtype}",
1445
+ "largest {target_subtype}s grouped by region in {anchor_name}",
1446
+ "one biggest {target_subtype} for each region of {anchor_name}",
1447
  ],
1448
  ),
1449
 
 
1474
  "smallest {target_subtype} per region in {anchor_name}",
1475
  "tiniest {target_subtype} for every region of {anchor_name}",
1476
  "the smallest {target_subtype} in each province of {anchor_name}",
1477
+ "show the smallest {target_subtype} in every region of {anchor_name}",
1478
+ "list smallest {target_subtype} by region in {anchor_name}",
1479
+ "for each region in {anchor_name}, give the smallest {target_subtype}",
1480
+ "smallest {target_subtype}s grouped by region in {anchor_name}",
1481
+ "one tiniest {target_subtype} for each region of {anchor_name}",
1482
  ],
1483
  ),
1484
 
 
1507
  "{anchor_name}'s land {target_subtype}s",
1508
  "dependencies of {anchor_name} with land area",
1509
  "show the land dependencies of {anchor_name}",
1510
+ "find land dependencies of {anchor_name}",
1511
+ "give me {anchor_name} dependencies on land",
1512
+ "which dependencies of {anchor_name} are land-based?",
1513
+ "non-island {target_subtype}s of {anchor_name}",
1514
  ],
1515
  ),
1516
 
 
1534
  "official territorial divisions of {anchor_name}",
1535
  "recognised territorial {target_subtype}s belonging to {anchor_name}",
1536
  "which territorial regions does {anchor_name} have?",
1537
+ "show territorial {target_subtype}s of {anchor_name}",
1538
+ "find official territorial regions of {anchor_name}",
1539
+ "give me recognised territorial {target_subtype}s of {anchor_name}",
1540
+ "territorial regions under {anchor_name}",
1541
+ "{anchor_name} official territorial {target_subtype}s",
1542
  ],
1543
  ),
1544
 
 
1562
  "{target_subtype}s of {anchor_name} that are not on land",
1563
  "water-associated {target_subtype}s of {anchor_name}",
1564
  "marine or offshore {target_subtype}s of {anchor_name}",
1565
+ "show offshore {target_subtype}s of {anchor_name}",
1566
+ "find {target_subtype}s of {anchor_name} in water",
1567
+ "give me non-land {target_subtype}s of {anchor_name}",
1568
+ "water-side {target_subtype}s of {anchor_name}",
1569
+ "{anchor_name} {target_subtype}s not on land",
1570
  ],
1571
  ),
1572
 
 
1588
  " SELECT n.id, n.names.\"primary\" AS name, n.subtype,"
1589
  " ST_AsGeoJSON(n.geometry) AS geometry"
1590
  " FROM read_parquet('natural_earth') AS n, a"
1591
+ " WHERE n.subtype IN ('river', 'lake', 'basin')"
1592
  " AND ST_Intersects(a.geometry, n.geometry)"
1593
  ),
1594
  question_hints=[
 
1600
  "what bodies of water cross {anchor_name}?",
1601
  "rivers of {anchor_name}",
1602
  "show me the lakes in {anchor_name}",
1603
+ "what rivers run through {anchor_name}?",
1604
+ "which lakes lie in {anchor_name}?",
1605
+ "what waterways are in {anchor_name}?",
1606
+ "which basins, lakes, or rivers are in {anchor_name}?",
1607
  ],
1608
  ),
1609
 
 
1613
  sql_difficulty="medium",
1614
  anchor_source="divisions_area",
1615
  num_anchors=1,
1616
+ target_subtype="range/mtn",
1617
  sql_template=(
1618
  "WITH a AS ("
1619
  " SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
 
1621
  " SELECT n.id, n.names.\"primary\" AS name, n.subtype,"
1622
  " ST_AsGeoJSON(n.geometry) AS geometry"
1623
  " FROM read_parquet('natural_earth') AS n, a"
1624
+ " WHERE n.subtype = 'range/mtn'"
1625
  " AND ST_Intersects(a.geometry, n.geometry)"
1626
  ),
1627
  question_hints=[
1628
  "what mountain ranges are in {anchor_name}?",
 
1629
  "which mountain ranges cross {anchor_name}?",
 
 
 
1630
  "mountains of {anchor_name}",
1631
+ "mountain regions in {anchor_name}",
1632
+ "what hills are in {anchor_name}?",
1633
+ "which hills cross {anchor_name}?",
1634
+ "ghats of {anchor_name}",
1635
+ "what ghats are in {anchor_name}?",
1636
+ "highlands in {anchor_name}",
1637
+ "mountain belts within {anchor_name}",
1638
+ ],
1639
+ ),
1640
+
1641
+ SQLTemplate(
1642
+ template_id="adj_07",
1643
+ family="adjacency",
1644
+ sql_difficulty="medium",
1645
+ anchor_source="divisions_area",
1646
+ num_anchors=1,
1647
+ target_subtype="plateau",
1648
+ sql_template=(
1649
+ "WITH a AS ("
1650
+ " SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
1651
+ ")"
1652
+ " SELECT n.id, n.names.\"primary\" AS name, n.subtype,"
1653
+ " ST_AsGeoJSON(n.geometry) AS geometry"
1654
+ " FROM read_parquet('natural_earth') AS n, a"
1655
+ " WHERE n.subtype = 'plateau'"
1656
+ " AND ST_Intersects(a.geometry, n.geometry)"
1657
+ ),
1658
+ question_hints=[
1659
+ "what plateaus are in {anchor_name}?",
1660
+ "which plateaus cross {anchor_name}?",
1661
+ "uplands in {anchor_name}",
1662
+ "what uplands are in {anchor_name}?",
1663
+ "tablelands of {anchor_name}",
1664
+ "plateau regions within {anchor_name}",
1665
+ "show plateaus in {anchor_name}",
1666
+ "find uplands of {anchor_name}",
1667
+ "give me plateau areas in {anchor_name}",
1668
+ "{anchor_name} plateaus and uplands",
1669
+ "what tablelands are there in {anchor_name}?",
1670
+ ],
1671
+ ),
1672
+
1673
+ SQLTemplate(
1674
+ template_id="adj_08",
1675
+ family="adjacency",
1676
+ sql_difficulty="medium",
1677
+ anchor_source="divisions_area",
1678
+ num_anchors=1,
1679
+ target_subtype="landform",
1680
+ sql_template=(
1681
+ "WITH a AS ("
1682
+ " SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
1683
+ ")"
1684
+ " SELECT n.id, n.names.\"primary\" AS name, n.subtype,"
1685
+ " ST_AsGeoJSON(n.geometry) AS geometry"
1686
+ " FROM read_parquet('natural_earth') AS n, a"
1687
+ " WHERE n.subtype IN ('plain', 'lowland', 'basin', 'valley', 'depression', 'gorge')"
1688
+ " AND ST_Intersects(a.geometry, n.geometry)"
1689
+ ),
1690
+ question_hints=[
1691
+ "what plains are in {anchor_name}?",
1692
+ "basins and valleys of {anchor_name}",
1693
+ "which basins are in {anchor_name}?",
1694
+ "what valleys cross {anchor_name}?",
1695
+ "lowlands in {anchor_name}",
1696
+ "major landforms in {anchor_name}",
1697
+ "plains, basins, and valleys within {anchor_name}",
1698
+ "show me the main landforms in {anchor_name}",
1699
+ "landforms of {anchor_name}",
1700
+ "find plains and basins in {anchor_name}",
1701
+ "{anchor_name} valleys and lowlands",
1702
  ],
1703
  ),
1704
 
 
1733
  "what provinces does the {anchor_name} span?",
1734
  "regions along the {anchor_name}",
1735
  "which provinces overlap the {anchor_name}?",
1736
+ "which regions is the {anchor_name} in?",
1737
+ "what states does the {anchor_name} run through?",
1738
+ "which provinces lie along the {anchor_name}?",
1739
+ "show regions crossed by the {anchor_name}",
1740
+ "find administrative regions along the {anchor_name}",
1741
+ "give me the regions touched by the {anchor_name}",
1742
+ "regions of the {anchor_name}",
1743
  ],
1744
  ),
1745
 
 
1766
  "everything natural that touches {anchor_name}",
1767
  "what geographic features does {anchor_name} contain?",
1768
  "natural features within or crossing {anchor_name}",
1769
+ "show natural features overlapping {anchor_name}",
1770
+ "find the natural features in or across {anchor_name}",
1771
+ "{anchor_name} intersecting natural features",
1772
+ "what physical features are associated with {anchor_name}?",
1773
+ ],
1774
+ ),
1775
+
1776
+ SQLTemplate(
1777
+ template_id="intersect_05",
1778
+ family="intersection",
1779
+ sql_difficulty="medium-hard",
1780
+ anchor_source="natural_earth",
1781
+ num_anchors=1,
1782
+ target_subtype="county",
1783
+ sql_template=(
1784
+ "WITH a AS ("
1785
+ " SELECT geometry FROM read_parquet('natural_earth') WHERE id = '{anchor_id}'"
1786
+ ")"
1787
+ " SELECT b.id, b.names.\"primary\" AS name, b.subtype, b.country,"
1788
+ " ST_AsGeoJSON(b.geometry) AS geometry"
1789
+ " FROM read_parquet('divisions_area') AS b, a"
1790
+ " WHERE b.subtype = '{target_subtype}'"
1791
+ " AND ST_Intersects(b.geometry, a.geometry)"
1792
+ ),
1793
+ question_hints=[
1794
+ "which districts does the {anchor_name} pass through?",
1795
+ "what districts does the {anchor_name} cross?",
1796
+ "districts intersected by the {anchor_name}",
1797
+ "which counties does the {anchor_name} flow through?",
1798
+ "what counties overlap the {anchor_name}?",
1799
+ "districts along the {anchor_name}",
1800
+ "which districts are crossed by the {anchor_name}?",
1801
+ "what districts is the {anchor_name} in?",
1802
+ "which counties lie along the {anchor_name}?",
1803
+ "what districts does the {anchor_name} run through?",
1804
+ "show districts crossed by the {anchor_name}",
1805
+ "find counties along the {anchor_name}",
1806
+ "give me the districts touched by the {anchor_name}",
1807
+ "districts of the {anchor_name}",
1808
  ],
1809
  ),
1810
 
 
1830
  " AND ST_Within(b.geometry, region.geometry)"
1831
  " AND EXISTS ("
1832
  " SELECT 1 FROM read_parquet('natural_earth') AS n"
1833
+ " WHERE n.subtype IN ('river', 'lake', 'basin')"
1834
  " AND ST_Intersects(b.geometry, n.geometry)"
1835
  " )"
1836
  ),
 
1842
  "{target_subtype}s in {anchor_name} that touch a river",
1843
  "which {target_subtype}s in {anchor_name} are on a lake?",
1844
  "waterfront {target_subtype}s of {anchor_name}",
1845
+ "show water-side {target_subtype}s in {anchor_name}",
1846
+ "find {target_subtype}s of {anchor_name} by rivers or lakes",
1847
+ "give me riverside places in {anchor_name}",
1848
+ "{target_subtype}s of {anchor_name} on the water",
1849
  ],
1850
  ),
1851
 
 
1867
  " AND ST_Within(b.geometry, region.geometry)"
1868
  " AND EXISTS ("
1869
  " SELECT 1 FROM read_parquet('natural_earth') AS n"
1870
+ " WHERE n.subtype IN ('range/mtn', 'depression')"
1871
  " AND ST_Intersects(b.geometry, n.geometry)"
1872
  " )"
1873
  ),
 
1878
  "highland {target_subtype}s within {anchor_name}",
1879
  "{target_subtype}s of {anchor_name} in mountainous terrain",
1880
  "{target_subtype}s in {anchor_name} near a mountain range",
1881
+ "show mountain {target_subtype}s in {anchor_name}",
1882
+ "find {target_subtype}s of {anchor_name} in hilly terrain",
1883
+ "give me highland {target_subtype}s of {anchor_name}",
1884
+ "{target_subtype}s of {anchor_name} by the mountains",
1885
  ],
1886
  ),
1887
 
 
1919
  "{target_subtype}s of {anchor_name} with ocean access",
1920
  "which {target_subtype}s in {anchor_name} touch the sea?",
1921
  "maritime {target_subtype}s of {anchor_name}",
1922
+ "show coastal districts of {anchor_name}",
1923
+ "find {target_subtype}s of {anchor_name} by the sea",
1924
+ "give me shoreline {target_subtype}s in {anchor_name}",
1925
+ "{target_subtype}s of {anchor_name} on the coast",
1926
  ],
1927
  ),
1928
 
 
1955
  "{target_subtype}s in {anchor_name} with no sea access",
1956
  "non-coastal {target_subtype}s of {anchor_name}",
1957
  "inland {target_subtype}s of {anchor_name}",
1958
+ "show inland districts of {anchor_name}",
1959
+ "find non-coastal {target_subtype}s in {anchor_name}",
1960
+ "give me inner {target_subtype}s of {anchor_name}",
1961
+ "{target_subtype}s of {anchor_name} away from the coast",
1962
  ],
1963
  ),
1964
 
 
1980
  " AND ST_Within(b.geometry, region.geometry)"
1981
  " AND EXISTS ("
1982
  " SELECT 1 FROM read_parquet('natural_earth') AS n"
1983
+ " WHERE n.subtype IN ('river', 'lake', 'basin')"
1984
  " AND ST_Intersects(b.geometry, n.geometry)"
1985
  " )"
1986
  ),
 
1991
  "lakeside {target_subtype}s within {anchor_name}",
1992
  "{target_subtype}s of {anchor_name} along a river",
1993
  "which {target_subtype}s in {anchor_name} border a lake?",
1994
+ "show water-side {target_subtype}s of {anchor_name}",
1995
+ "find {target_subtype}s in {anchor_name} near rivers and lakes",
1996
+ "give me riverside districts of {anchor_name}",
1997
+ "{target_subtype}s of {anchor_name} by the water",
1998
  ],
1999
  ),
2000
 
 
2016
  " AND ST_Within(b.geometry, region.geometry)"
2017
  " AND EXISTS ("
2018
  " SELECT 1 FROM read_parquet('natural_earth') AS n"
2019
+ " WHERE n.subtype IN ('range/mtn', 'depression')"
2020
  " AND ST_Intersects(b.geometry, n.geometry)"
2021
  " )"
2022
  ),
 
2027
  "highland {target_subtype}s within {anchor_name}",
2028
  "{target_subtype}s of {anchor_name} in mountainous terrain",
2029
  "which {target_subtype}s in {anchor_name} have mountain ranges?",
2030
+ "show mountain districts of {anchor_name}",
2031
+ "find {target_subtype}s of {anchor_name} in hilly areas",
2032
+ "give me highland {target_subtype}s in {anchor_name}",
2033
+ "{target_subtype}s of {anchor_name} by the mountains",
2034
  ],
2035
  ),
2036
 
 
2072
  "seaside regions of {anchor_name}",
2073
  "which provinces of {anchor_name} touch the sea?",
2074
  "states of {anchor_name} along the coast",
2075
+ "show coastal regions of {anchor_name}",
2076
+ "find seaside states of {anchor_name}",
2077
+ "give me provinces of {anchor_name} on the coast",
2078
+ "{anchor_name} regions by the sea",
2079
  ],
2080
  ),
2081
 
 
2110
  "regions of {anchor_name} without sea access",
2111
  "interior states of {anchor_name}",
2112
  "states of {anchor_name} that don't border the ocean",
2113
+ "show inland regions of {anchor_name}",
2114
+ "find non-coastal states of {anchor_name}",
2115
+ "give me inner provinces of {anchor_name}",
2116
+ "{anchor_name} regions away from the sea",
2117
  ],
2118
  ),
2119
 
 
2146
  "which countries touch the {anchor_name}?",
2147
  "countries with coastline on the {anchor_name}",
2148
  "what nations lie on the {anchor_name}?",
2149
+ "which countries are on the coast of the {anchor_name}?",
2150
+ "what countries lie around the {anchor_name}?",
2151
+ "which nations have shores on the {anchor_name}?",
2152
+ "what countries front the {anchor_name}?",
2153
  ],
2154
  ),
2155
 
 
2182
  "everything within {buffer_km} km of the {anchor_name}",
2183
  "what natural features are close to the {anchor_name}?",
2184
  "{buffer_km} km radius around the {anchor_name}",
2185
+ "what natural features are around the {anchor_name}?",
2186
+ "what lies near the {anchor_name}?",
2187
+ "which features are close to the {anchor_name}?",
2188
+ "what natural features are near the shoreline of the {anchor_name}?",
2189
+ "show nearby natural features for the {anchor_name}",
2190
+ "find features around the {anchor_name}",
2191
+ "give me natural features near the {anchor_name}",
2192
+ "features around the {anchor_name}",
2193
+ "what is close to the {anchor_name} within {buffer_km} km",
2194
  ],
2195
  ),
2196
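
As a reading aid for the templates above, here is a minimal sketch of how one template's placeholders could be filled, assuming plain `str.format` substitution. The `render` helper and the anchor values are hypothetical; the SQL text and the hint are copied from `adj_07`. The repository's actual data-generation code may do this differently.

```python
# Hypothetical illustration: fill one template's placeholders with str.format.
# The SQL and hint strings are copied from adj_07; `render` and the anchor
# values below are made up for this example.
SQL_TEMPLATE = (
    "WITH a AS ("
    " SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
    ")"
    " SELECT n.id, n.names.\"primary\" AS name, n.subtype,"
    " ST_AsGeoJSON(n.geometry) AS geometry"
    " FROM read_parquet('natural_earth') AS n, a"
    " WHERE n.subtype = 'plateau'"
    " AND ST_Intersects(a.geometry, n.geometry)"
)
QUESTION_HINT = "what plateaus are in {anchor_name}?"


def render(template: str, **values: str) -> str:
    """Substitute {placeholders} in a template string."""
    return template.format(**values)


sql = render(SQL_TEMPLATE, anchor_id="example-anchor-id")
question = render(QUESTION_HINT, anchor_name="Example Region")
print(question)
print(sql)
```

Plain string substitution is only reasonable here because the anchor ID is assumed to come from the candidate table rather than from free-form user text.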
 
src/gazet/config.py CHANGED
@@ -80,7 +80,7 @@ Available DuckDB datasets (read via read_parquet):
  columns:
  id VARCHAR -- unique feature id prefixed 'ne_'
  names STRUCT("primary" VARCHAR, ...)
- subtype VARCHAR -- e.g. 'ocean', 'sea', 'bay', 'Terrain area', 'Island group'
+ subtype VARCHAR -- e.g. 'ocean', 'sea', 'bay', 'range/mtn', 'island group'
  class VARCHAR
  country VARCHAR
  region VARCHAR
src/gazet/lm.py CHANGED
@@ -31,8 +31,6 @@ class ExtractPlaces(dspy.Signature):
      - If the user mentions a natural earth physical feature, use the natural earth physical features.
      - If the user mentions a place name that is not in the overture divisions or natural earth physical features, return the place name as is.

-     Where possible and relevant, also extract the ISO country code for each place.
-
      Only extract place names that are explicitly mentioned in the query.
      Do NOT generate or infer place names from your own knowledge.
      For example:
@@ -41,47 +39,14 @@ class ExtractPlaces(dspy.Signature):
      - "neighbouring states of Odisha" -> extract "Odisha", NOT neighbouring state names

      Do not repeat the same place name in the result.
-
-     If the user does not explicitly mention a country, dont add the country code to the result.
-
-     If the user does not mention an admin level, dont add the subtype to the result.
-
-     If the query asks for some kind of subdivision (e.g. 'municipalities in Bern', 'States in Brazil'),
-     return the subdivision type in the places result.
-
-     When identifying a place name from the user's query, also infer the most appropriate
-     Overture division subtype from the list below. Only include a subtype if the query
-     makes it reasonably clear what geographic level is intended. If ambiguous, omit it.
-
-     SUBTYPES:
-     - country : Sovereign nation. E.g. "France", "Brazil"
-     - dependency : Territory dependent on a country but not a full sub-region. E.g. "Puerto Rico", "Guam"
-     - region : Largest admin unit within a country; state, province, canton, etc. E.g. "California", "Alberta", "Bavaria"
-     - county : Second-level admin subdivision within a region. E.g. "Kings County", "Kent"
-     - localadmin : A governing layer (common in Europe) that contains localities which have no authority of their own. E.g. a French commune or Belgian municipality. Use when the place is clearly an admin unit but not a city itself.
-     - locality : A populated place — city, town, village. The most common subtype for named settlements. E.g. "Lisbon", "Taipei", "Salt Lake City"
-     - macrohood : A large super-neighborhood grouping smaller neighborhoods. E.g. "BoCoCa" in Brooklyn
-     - neighborhood : A named community area within a city or town. E.g. "Cobble Hill", "Alfama"
-     - microhood : A mini-neighborhood within a neighborhood. Very fine-grained, rarely referenced explicitly.
-
-     HIERARCHY (coarse to fine):
-     country → dependency / region → county → localadmin → locality → macrohood → neighborhood → microhood
-
-     GUIDANCE:
-     - "Paris" with no qualifier → locality
-     - "Île-de-France" or "Catalonia" → region
-     - "the 11th arrondissement" → neighborhood (or localadmin)
-     - "Greater London" style phrasing → county or region depending on context
-     - If the user says "neighborhood in X" or "district of X" → neighborhood
-     - Default to locality for any named city/town if unsure
-     - Omit subtype entirely if the query gives no signal (e.g. bare coordinates or a POI name)
+     Return only the place names, in the order they appear in the query.
      """

      query: str = dspy.InputField(
          desc="Natural language query mentioning one or more place names"
      )
      result: PlacesResult = dspy.OutputField(
-         desc="Extracted places with optional country codes and optional subtype"
+         desc="Extracted place names in query order"
      )


@@ -203,7 +168,7 @@ You have access to two DuckDB parquet tables. Given a set of candidate entities
  id VARCHAR -- unique feature id prefixed 'ne_'
  names STRUCT("primary" VARCHAR, ...)
  country VARCHAR
- subtype VARCHAR -- e.g. 'ocean', 'sea', 'bay', 'Terrain area', 'Island group'
+ subtype VARCHAR -- e.g. 'ocean', 'sea', 'bay', 'range/mtn', 'island group'
  class VARCHAR
  region VARCHAR
  admin_level INTEGER
@@ -265,21 +230,68 @@ def _llama_chat_complete(messages: list[dict]) -> str:
      return resp.json()["choices"][0]["message"]["content"]


- _PLACES_SYSTEM_PROMPT = """You are a geographic entity extractor. Extract place names from the user query and return valid JSON only.
+ _PLACES_SYSTEM_PROMPT = """You are a geographic entity extractor. Extract the place names the user is asking about and return valid JSON only.

  OUTPUT FORMAT:
- {"places": [{"place": "<name>", "country": "<ISO-2>", "subtype": "<subtype>"}]}
- "country" and "subtype" are optional; omit if not applicable.
+ {"places": [{"place": "<name>"}]}

  RULES:
- - Only extract places explicitly mentioned. Never infer or expand (e.g. "states of India" -> extract "India" only).
+ - Extract the place or places that are the actual anchors of the query.
+ - Physical features are valid places: oceans, seas, gulfs, bays, straits, rivers, lakes, basins, mountain ranges, peninsulas, island groups, deserts, and terrain regions.
+ - When a place is followed by its containing region, state, or country as disambiguation context ("Puri, Odisha", "Lisboa, Portugal", "Goa, India", "Manchester in US"), extract ONLY the specific place. Do not return the container as a separate place.
+ - When a query names two or more distinct anchors joined by words like "and", "both", "between", or mixes an admin area with a physical feature as separate anchors, extract every anchor in the order they appear.
+ - Do not infer or expand category nouns like "regions", "districts", "counties", "rivers", or "mountains" when they refer to a type rather than a specific named place ("regions of India" -> extract "India" only).
+ - Only extract places explicitly mentioned.
  - No duplicate place names.
- - "country": ISO 3166-1 alpha-2. Include only if explicitly mentioned or unambiguous.
- - "subtype": include only when the geographic level is clear from the query.

- SUBTYPES:
- country, dependency, region, county, localadmin, locality, macrohood, neighborhood, microhood
- - Default to locality for cities/towns; omit for physical features (oceans, rivers, mountains)."""
+ EXAMPLES:
+ Query: "Puri, Odisha"
+ -> {"places": [{"place": "Puri"}]}
+
+ Query: "Lisboa, Portugal"
+ -> {"places": [{"place": "Lisboa"}]}
+
+ Query: "Goa, India"
+ -> {"places": [{"place": "Goa"}]}
+
+ Query: "Manchester in US"
+ -> {"places": [{"place": "Manchester"}]}
+
+ Query: "Springfield, Illinois"
+ -> {"places": [{"place": "Springfield"}]}
+
+ Query: "coastal districts of Brazil"
+ -> {"places": [{"place": "Brazil"}]}
+
+ Query: "northern half of India"
+ -> {"places": [{"place": "India"}]}
+
+ Query: "what's within 50 km of Paris?"
+ -> {"places": [{"place": "Paris"}]}
+
+ Query: "countries the Nile crosses"
+ -> {"places": [{"place": "Nile"}]}
+
+ Query: "which countries touch the Gulf of Maine"
+ -> {"places": [{"place": "Gulf of Maine"}]}
+
+ Query: "10 km buffer around Odisha"
+ -> {"places": [{"place": "Odisha"}]}
+
+ Query: "part of Ecuador in the Amazon basin"
+ -> {"places": [{"place": "Ecuador"}, {"place": "Amazon basin"}]}
+
+ Query: "Amazon basin inside Ecuador"
+ -> {"places": [{"place": "Amazon basin"}, {"place": "Ecuador"}]}
+
+ Query: "the part of Chad in Lake Chad"
+ -> {"places": [{"place": "Chad"}, {"place": "Lake Chad"}]}
+
+ Query: "which regions border both France and Germany?"
+ -> {"places": [{"place": "France"}, {"place": "Germany"}]}
+
+ Query: "merge Nairobi and Mombasa"
+ -> {"places": [{"place": "Nairobi"}, {"place": "Mombasa"}]}"""


  def generate_places(user_query: str) -> PlacesResult:
@@ -318,7 +330,7 @@ def generate_sql(user_query: str, candidates_df: pd.DataFrame) -> str:
      Single-shot — no retry loop (the finetuned model can't improve from error feedback).
      """
      # Keep only columns the model was trained on
-     keep_cols = ["source", "id", "name", "subtype", "country", "region", "admin_level", "similarity"]
+     keep_cols = ["source", "id", "name", "subtype", "country", "region", "admin_level"]
      cols = [c for c in keep_cols if c in candidates_df.columns]
      candidates_csv = candidates_df[cols].to_csv(index=False)
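
The simplified output format in `_PLACES_SYSTEM_PROMPT` (`{"places": [{"place": "<name>"}]}`) can be consumed with a few lines of standard-library code. The sketch below is an assumption about how a caller might enforce the "no duplicates, query order" rules on the raw model output; `parse_places` is a hypothetical helper, not a function from this repository.

```python
import json


def parse_places(raw: str) -> list[str]:
    """Return place names from extractor JSON, de-duplicated, in first-seen order."""
    payload = json.loads(raw)
    names: list[str] = []
    seen: set[str] = set()
    for item in payload.get("places", []):
        # Skip entries with a missing or empty "place" value.
        name = str(item.get("place", "")).strip()
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names


raw_output = '{"places": [{"place": "Ecuador"}, {"place": "Amazon basin"}, {"place": "Ecuador"}]}'
print(parse_places(raw_output))
```

Keeping first-seen order matters because the prompt asks the model to return anchors in the order they appear in the query.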
 
src/gazet/schemas.py CHANGED
@@ -292,10 +292,7 @@ COUNTRIES = Literal[

  class Place(BaseModel):
      place: str
-     country: Optional[COUNTRIES] = None
-     subtype: Optional[SUBTYPES] = None


  class PlacesResult(BaseModel):
      places: List[Place]
-     subtype: Optional[SUBTYPES] = None
 
src/gazet/sql.py CHANGED
@@ -2,6 +2,17 @@ import json
  import re
  from typing import Any, Generator, Optional

+
+ _CANDIDATE_PROMPT_COLS = [
+     "source",
+     "id",
+     "name",
+     "subtype",
+     "country",
+     "region",
+     "admin_level",
+ ]
+
  import duckdb
  import pandas as pd
  from shapely import wkb
@@ -88,11 +99,21 @@ _NE_SUBTYPE_FIXES = {
      "'Sea'": "'sea'",
  }

+ _TERRAIN_AREA_PATTERN = re.compile(
+     r"n\.subtype\s*(=|IN\s*\()\s*'Terrain area'\s*\)?",
+     flags=re.IGNORECASE,
+ )
+

  def _normalize_ne_subtypes(sql: str) -> str:
-     """Lowercase known NE subtype literals so they match the normalised data."""
+     """Lowercase known NE subtype literals and fix common terrain hallucinations."""
      for old, new in _NE_SUBTYPE_FIXES.items():
          sql = sql.replace(old, new)
+
+     sql = _TERRAIN_AREA_PATTERN.sub(
+         "n.subtype IN ('range/mtn', 'peninsula', 'depression')",
+         sql,
+     )
      return sql


@@ -189,7 +210,8 @@ def run_geo_sql_dspy(
          yield {"type": "result", "df": None, "sql": ""}
          return

-     candidates_str = candidates_df.to_string(index=False)
+     cols = [c for c in _CANDIDATE_PROMPT_COLS if c in candidates_df.columns]
+     candidates_str = candidates_df[cols].to_string(index=False)
      previous_sql = ""
      error = ""
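
To see what the subtype normalisation in this file does to generated SQL, here is a small standalone sketch. The regular expression is copied from the diff; the fixes dict reproduces only the one entry visible in the hunk, and the public names (`normalize_ne_subtypes`, `examples`) and sample queries are made up for illustration.

```python
import re

# Only the entry visible in the diff is reproduced here; the real dict is longer.
NE_SUBTYPE_FIXES = {"'Sea'": "'sea'"}

# Pattern copied from the diff: matches `n.subtype = 'Terrain area'` or
# `n.subtype IN ('Terrain area')`, case-insensitively.
TERRAIN_AREA_PATTERN = re.compile(
    r"n\.subtype\s*(=|IN\s*\()\s*'Terrain area'\s*\)?",
    flags=re.IGNORECASE,
)


def normalize_ne_subtypes(sql: str) -> str:
    """Lowercase known literals, then rewrite the 'Terrain area' comparison."""
    for old, new in NE_SUBTYPE_FIXES.items():
        sql = sql.replace(old, new)
    return TERRAIN_AREA_PATTERN.sub(
        "n.subtype IN ('range/mtn', 'peninsula', 'depression')", sql
    )


examples = [
    "SELECT n.id FROM t AS n WHERE n.subtype = 'Sea'",
    "SELECT n.id FROM t AS n WHERE n.subtype IN ('terrain area')",
]
for query in examples:
    print(normalize_ne_subtypes(query))
```

Because the closing parenthesis in the pattern is optional and no further list items are matched, the rewrite targets comparisons against the single literal `'Terrain area'`; an `IN` list with additional values would need separate handling.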