Lowercase subtype names, fix question type hints
- IMPROVEMENTS.md +79 -0
- dataset/config.yaml +3 -3
- dataset/scripts/build_relations.py +10 -7
- dataset/scripts/export_training_data.py +34 -47
- dataset/scripts/generate_samples.py +37 -26
- dataset/scripts/sql_templates.py +388 -13
- src/gazet/config.py +1 -1
- src/gazet/lm.py +60 -48
- src/gazet/schemas.py +0 -3
- src/gazet/sql.py +24 -2
IMPROVEMENTS.md
ADDED
|
@@ -0,0 +1,79 @@
|
|
| 1 |
+
# Gazet Improvement Notes
|
| 2 |
+
|
| 3 |
+
Issues identified during testing. Each item is a candidate for the next training/template pass.
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
## 1. Missing "buffer-only" template
|
| 8 |
+
|
| 9 |
+
**Query**: "10 km buffer around Odisha"
|
| 10 |
+
**Expected**: Return the buffered geometry polygon itself.
|
| 11 |
+
**Actual**: Model picks `buffer_01`, which finds all features intersecting the buffer (200 rows).
|
| 12 |
+
|
| 13 |
+
**Root cause**: All buffer templates (`buffer_01` through `buffer_05`) perform an intersection join to find neighboring features. No template simply returns `ST_AsGeoJSON(ST_Buffer(...))`.
|
| 14 |
+
|
| 15 |
+
**Fix**: Add a `buffer_06` template that returns the buffer polygon directly:
|
| 16 |
+
|
| 17 |
+
```sql
|
| 18 |
+
SELECT ST_AsGeoJSON(ST_Buffer(geometry, {buffer_km} * 1000.0 / 111320.0)) AS geometry
|
| 19 |
+
FROM read_parquet('divisions_area')
|
| 20 |
+
WHERE id = '{anchor_id}'
|
| 21 |
+
```
|
| 22 |
+
|
| 23 |
+
Pair it with question hints such as "{buffer_km} km buffer around {anchor_name}" and "draw a {buffer_km} km buffer around {anchor_name}". Consider a Natural Earth (NE) variant too.
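
The `{buffer_km} * 1000.0 / 111320.0` term in the SQL above converts kilometres to degrees by treating one degree as roughly 111,320 m. A minimal Python sketch of that arithmetic follows; the name `km_to_degrees` is hypothetical and not part of the repo.

```python
# Approximate length of one degree in metres, as used by the template SQL.
METERS_PER_DEGREE = 111_320.0


def km_to_degrees(buffer_km: float) -> float:
    """Convert a buffer distance in kilometres to an approximate degree value."""
    return buffer_km * 1000.0 / METERS_PER_DEGREE


if __name__ == "__main__":
    # A 10 km buffer corresponds to a little under 0.09 degrees.
    print(km_to_degrees(10))
```

Because a degree of longitude shrinks with latitude, this constant is only a rough approximation away from the equator, so the buffer returned for high-latitude anchors will not be an exact 10 km ring.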
|
| 24 |
+
|
| 25 |
+
---
|
| 26 |
+
|
| 27 |
+
## 2. Place extractor misses NE physical features (mixed-source queries)
|
| 28 |
+
|
| 29 |
+
**Query**: "The part of Ecuador that is in the Amazon Basin"
|
| 30 |
+
**Expected**: Place extractor returns both "Ecuador" and "Amazon Basin"; candidate search finds correct IDs for both.
|
| 31 |
+
**Actual**: Only "Ecuador" extracted. SQL model uses memorized wrong NE ID (`ne_1159120655` = Cuando River) instead of the correct one (`ne_1159104325` = AMAZON BASIN).
|
| 32 |
+
|
| 33 |
+
**Root cause**: The GGUF place extraction model was not trained to extract physical features. The runtime prompt (`_PLACES_SYSTEM_PROMPT`) has been updated but the finetuned model may ignore prompt changes. A re-finetune with NE feature examples is the definitive fix.
|
| 34 |
+
|
| 35 |
+
**Affected templates**: `partial_05`, `diff_02` (mixed-source), and all NE-anchored templates (`intersect_03`, `contain_03/04`, `buffer_03/04/05`, `lookup_02`).
|
| 36 |
+
|
| 37 |
+
---
|
| 38 |
+
|
| 39 |
+
## 3. Missing NE-anchor to county intersection template
|
| 40 |
+
|
| 41 |
+
**Query**: "Indravati River flows through which districts"
|
| 42 |
+
**Expected**: `ST_Intersects` with `target_subtype='county'`
|
| 43 |
+
**Actual**: Model sometimes uses `ST_Within` (wrong predicate) because `intersect_03` only targets `region`, not `county`.
|
| 44 |
+
|
| 45 |
+
**Fix**: Add `intersect_05` (NE anchor -> county, `ST_Intersects`) with district-oriented question hints.
|
| 46 |
+
|
| 47 |
+
---
|
| 48 |
+
|
| 49 |
+
## 4. Model hallucinates NE subtype values
|
| 50 |
+
|
| 51 |
+
**Query**: "which mountain ranges cross Odisha"
|
| 52 |
+
**Expected**: `n.subtype IN ('range/mtn', 'peninsula', 'depression')` (from `adj_05`)
|
| 53 |
+
**Actual**: Model generates `'Terrain area'` which does not exist in the data.
|
| 54 |
+
|
| 55 |
+
**Fix**: More training examples for `adj_05`. Consider adding common hallucinated values to `_NE_SUBTYPE_FIXES` in `sql.py` as a runtime safety net.
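
One possible shape for such a safety net is to flag subtype literals that do not exist in the data before the query runs. The sketch below is an assumption, not the code in `sql.py`: `KNOWN_NE_SUBTYPES` is a hypothetical set assembled from the subtype values that appear in this commit, and the function does not distinguish the `natural_earth` table from `divisions_area`.

```python
import re

# Hypothetical allow-list built from subtype values seen in this commit;
# the real Natural Earth parquet may contain more values.
KNOWN_NE_SUBTYPES = {
    "sea", "ocean", "lake", "river", "basin", "gulf", "bay", "strait",
    "range/mtn", "plateau", "plain", "lowland", "valley", "depression",
    "gorge", "island group", "peninsula",
}

_IN_LIST = re.compile(r"subtype\s+IN\s*\(([^)]*)\)", re.IGNORECASE)
_LITERAL = re.compile(r"'([^']*)'")


def unknown_subtypes(sql: str) -> list[str]:
    """Return quoted values in `subtype IN (...)` lists that are not known subtypes."""
    unknown = []
    for in_list in _IN_LIST.findall(sql):
        for value in _LITERAL.findall(in_list):
            # Compare case-insensitively so casing issues are not reported here.
            if value.lower() not in KNOWN_NE_SUBTYPES:
                unknown.append(value)
    return unknown
```

A caller could log or reject the query when the returned list is non-empty, for example when the model emits `'Terrain area'`.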
|
| 56 |
+
|
| 57 |
+
---
|
| 58 |
+
|
| 59 |
+
## 5. NE subtype casing inconsistency between model output and data
|
| 60 |
+
|
| 61 |
+
**Example**: Model generates `'River'`, `'Basin'`, `'Ocean'` but data has `'river'`, `'basin'`, `'ocean'`.
|
| 62 |
+
|
| 63 |
+
**Current workaround**: `_normalize_ne_subtypes()` in `sql.py` does string replacement of known title-cased literals at query time (`_NE_SUBTYPE_FIXES` dict). This is brittle and only covers a hardcoded list.
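
The real `_normalize_ne_subtypes()` is not reproduced in these notes. As a rough illustration of the string-replacement approach described above, here is a minimal sketch; the mapping contents and function body are assumptions and may differ from `sql.py`.

```python
# Hypothetical mapping; the real _NE_SUBTYPE_FIXES in sql.py may list other values.
NE_SUBTYPE_FIXES = {
    "River": "river",
    "Basin": "basin",
    "Ocean": "ocean",
}


def normalize_ne_subtypes(sql: str) -> str:
    """Rewrite known title-cased subtype literals to their lowercase form."""
    for wrong, right in NE_SUBTYPE_FIXES.items():
        # Match the whole quoted literal so names like 'River Thames' are untouched.
        sql = sql.replace(f"'{wrong}'", f"'{right}'")
    return sql
```

The brittleness is visible here: any title-cased value missing from the mapping passes through unchanged.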
|
| 64 |
+
|
| 65 |
+
**Root cause**: The original Natural Earth data had title-cased `featurecla` values (e.g. `River`, `Basin`, `Ocean`). Training data was generated before the lowercase fix to `convert_natural_earth.py`, so the model learned to emit title-cased subtypes. The data is now lowercased but the model still outputs the old casing.
|
| 66 |
+
|
| 67 |
+
**Fix**: Regenerate training data with the lowercased NE parquet so all subtype literals in SQL examples are lowercase. After re-finetune, the model will natively emit lowercase subtypes and the `_normalize_ne_subtypes` hack can be removed.
|
| 68 |
+
|
| 69 |
+
---
|
| 70 |
+
|
| 71 |
+
## 6. "Largest/smallest" queries always return at least 3 results
|
| 72 |
+
|
| 73 |
+
**Query**: "the largest region in India", "smallest county in France"
|
| 74 |
+
**Expected**: Return 1 result (the single largest/smallest).
|
| 75 |
+
**Actual**: Model generates `LIMIT 3` by default, returning top 3 instead of 1.
|
| 76 |
+
|
| 77 |
+
**Root cause**: The aggregation templates (`agg_01`, `agg_02`) use `LIMIT 3` as the default. The model learns this as a fixed pattern and applies it even when the query clearly asks for a single result ("the largest", "the smallest").
|
| 78 |
+
|
| 79 |
+
**Fix**: During data generation, vary the LIMIT value based on the question hint phrasing. Use `LIMIT 1` for singular hints ("the largest X", "the smallest X") and `LIMIT 3` or `LIMIT 5` for plural hints ("the 3 largest", "top 5 smallest"). This teaches the model to infer the correct LIMIT from the query.
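
A minimal sketch of how the hint pool could follow the LIMIT value, assuming plural hints are the ones containing a `{top_n}` placeholder. The function name `choose_hint_pool` and the example hints are hypothetical.

```python
def choose_hint_pool(question_hints: list[str], top_n: int) -> list[str]:
    """Pick singular hints for LIMIT 1 and top-N hints otherwise."""
    singular = [h for h in question_hints if "{top_n}" not in h]
    plural = [h for h in question_hints if "{top_n}" in h]
    if top_n == 1 and singular:
        return singular
    # Fall back to every hint if the template has no top-N phrasing.
    return plural or question_hints


# Hypothetical hints for an aggregation template.
HINTS = [
    "the largest {target_subtype} in {anchor_name}",
    "top {top_n} largest {target_subtype}s in {anchor_name}",
]
```

With these hints, `choose_hint_pool(HINTS, 1)` yields only the singular phrasing, while `choose_hint_pool(HINTS, 5)` yields only the top-N phrasing, so the question text and the `LIMIT` in the SQL stay consistent.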
|
dataset/config.yaml
CHANGED
|
@@ -19,11 +19,11 @@ countries:
|
|
| 19 |
sample_targets:
|
| 20 |
direct_lookup: 1000
|
| 21 |
disambiguation: 2000 # 3 templates (disambiguate_01..03) - "Puri, Odisha" pattern
|
| 22 |
-
adjacency: 2000 #
|
| 23 |
multi_adjacency: 1000
|
| 24 |
containment: 2000 # 4 templates (contain_01..04) - contain_02 reversed, contain_03/04 NE anchor
|
| 25 |
-
intersection: 2000 #
|
| 26 |
-
buffer: 2000 #
|
| 27 |
chained: 2000 # 11 templates (chained_01..11) - 10/11 are coastal/inland regions
|
| 28 |
difference: 2000 # 2 templates, one is mixed (diff_02)
|
| 29 |
border_corridor: 1000
|
|
|
|
| 19 |
sample_targets:
|
| 20 |
direct_lookup: 1000
|
| 21 |
disambiguation: 2000 # 3 templates (disambiguate_01..03) - "Puri, Odisha" pattern
|
| 22 |
+
adjacency: 2000 # 8 templates (adj_01..08) - adj_05/07/08 cover terrain queries, adj_06 is counties
|
| 23 |
multi_adjacency: 1000
|
| 24 |
containment: 2000 # 4 templates (contain_01..04) - contain_02 reversed, contain_03/04 NE anchor
|
| 25 |
+
intersection: 2000 # 5 templates (intersect_01..05) - intersect_02/03/05 include NE anchor patterns
|
| 26 |
+
buffer: 2000 # 6 templates (buffer_01..06)
|
| 27 |
chained: 2000 # 11 templates (chained_01..11) - 10/11 are coastal/inland regions
|
| 28 |
difference: 2000 # 2 templates, one is mixed (diff_02)
|
| 29 |
border_corridor: 1000
|
dataset/scripts/build_relations.py
CHANGED
|
@@ -52,16 +52,19 @@ _CONTAINMENT_SUBTYPE_PAIRS = (
|
|
| 52 |
_NE_CROSS_SOURCE_SUBTYPES = (
|
| 53 |
"sea",
|
| 54 |
"ocean",
|
| 55 |
-
"
|
| 56 |
-
"
|
| 57 |
-
"
|
| 58 |
"gulf",
|
| 59 |
"bay",
|
| 60 |
-
"Island group",
|
| 61 |
-
"Peninsula",
|
| 62 |
"strait",
|
| 63 |
-
"
|
| 64 |
-
"
|
| 65 |
)
|
| 66 |
|
| 67 |
|
|
|
|
| 52 |
_NE_CROSS_SOURCE_SUBTYPES = (
|
| 53 |
"sea",
|
| 54 |
"ocean",
|
| 55 |
+
"lake",
|
| 56 |
+
"river",
|
| 57 |
+
"basin",
|
| 58 |
"gulf",
|
| 59 |
"bay",
|
| 60 |
"strait",
|
| 61 |
+
"range/mtn",
|
| 62 |
+
"plateau",
|
| 63 |
+
"plain",
|
| 64 |
+
"lowland",
|
| 65 |
+
"valley",
|
| 66 |
+
"depression",
|
| 67 |
+
"gorge",
|
| 68 |
)
|
| 69 |
|
| 70 |
|
dataset/scripts/export_training_data.py
CHANGED
|
@@ -7,8 +7,8 @@ Produces two task datasets from the same source samples:
|
|
| 7 |
2. Place extraction (prompt = question only, completion = PlacesResult JSON)
|
| 8 |
|
| 9 |
Place extraction pairs are derived automatically: for each SQL sample the
|
| 10 |
-
selected_candidates give us the correct place names
|
| 11 |
-
|
| 12 |
|
| 13 |
Output layout (all paths relative to dataset/):
|
| 14 |
output/runs/{run_name}/sql/train.jsonl
|
|
@@ -124,7 +124,7 @@ You have access to two DuckDB parquet tables. Given a set of candidate entities
|
|
| 124 |
id VARCHAR -- unique feature id prefixed 'ne_'
|
| 125 |
names STRUCT("primary" VARCHAR, ...)
|
| 126 |
country VARCHAR
|
| 127 |
-
subtype VARCHAR -- e.g. 'ocean', 'sea', 'bay', '
|
| 128 |
class VARCHAR
|
| 129 |
region VARCHAR
|
| 130 |
admin_level INTEGER
|
|
@@ -139,7 +139,7 @@ Use ST_AsGeoJSON(geometry) for all geometry outputs."""
|
|
| 139 |
|
| 140 |
_CANDIDATES_COLS = [
|
| 141 |
"source", "id", "name", "subtype", "country", "region",
|
| 142 |
-
"admin_level",
|
| 143 |
]
|
| 144 |
|
| 145 |
|
|
@@ -209,87 +209,74 @@ def sample_to_sql_pair(sample: Dict[str, Any]) -> Optional[Dict]:
|
|
| 209 |
_PLACE_SYSTEM = """You are a geographic entity extractor. Extract the place names the user is asking about and return valid JSON only.
|
| 210 |
|
| 211 |
OUTPUT FORMAT:
|
| 212 |
-
{"places": [{"place": "<name>"
|
| 213 |
-
"country" and "subtype" are optional; omit if not applicable.
|
| 214 |
|
| 215 |
RULES:
|
| 216 |
-
- Extract the place
|
| 217 |
-
-
|
| 218 |
-
- When a
|
| 219 |
-
-
|
| 220 |
- No duplicate place names.
|
| 221 |
-
- "country": ISO 3166-1 alpha-2. Include only if explicitly mentioned or unambiguous.
|
| 222 |
-
- "subtype": include only when the geographic level is clear from the query.
|
| 223 |
-
|
| 224 |
-
SUBTYPES:
|
| 225 |
-
country, dependency, region, county, localadmin, locality, macrohood, neighborhood, microhood
|
| 226 |
-
- Default to locality for cities/towns; omit for physical features (oceans, seas, rivers, lakes, basins, mountains, ranges, peninsulas, islands, terrain areas).
|
| 227 |
|
| 228 |
EXAMPLES:
|
| 229 |
Query: "Puri, Odisha"
|
| 230 |
-
-> {"places": [{"place": "Puri"
|
| 231 |
|
| 232 |
Query: "Lisboa, Portugal"
|
| 233 |
-
-> {"places": [{"place": "Lisboa"
|
| 234 |
|
| 235 |
Query: "Goa, India"
|
| 236 |
-
-> {"places": [{"place": "Goa"
|
| 237 |
|
| 238 |
Query: "Manchester in US"
|
| 239 |
-
-> {"places": [{"place": "Manchester"
|
| 240 |
|
| 241 |
Query: "Springfield, Illinois"
|
| 242 |
-
-> {"places": [{"place": "Springfield"
|
| 243 |
|
| 244 |
Query: "coastal districts of Brazil"
|
| 245 |
-
-> {"places": [{"place": "Brazil"
|
| 246 |
|
| 247 |
Query: "northern half of India"
|
| 248 |
-
-> {"places": [{"place": "India"
|
| 249 |
|
| 250 |
Query: "what's within 50 km of Paris?"
|
| 251 |
-
-> {"places": [{"place": "Paris"
|
| 252 |
|
| 253 |
Query: "countries the Nile crosses"
|
| 254 |
-> {"places": [{"place": "Nile"}]}
|
| 255 |
|
| 256 |
Query: "part of Ecuador in the Amazon basin"
|
| 257 |
-
-> {"places": [{"place": "Ecuador"
|
| 258 |
|
| 259 |
Query: "Amazon basin inside Ecuador"
|
| 260 |
-
-> {"places": [{"place": "Amazon basin"}, {"place": "Ecuador"
|
| 261 |
|
| 262 |
Query: "which regions border both France and Germany?"
|
| 263 |
-
-> {"places": [{"place": "France"
|
| 264 |
|
| 265 |
Query: "merge Nairobi and Mombasa"
|
| 266 |
-
-> {"places": [{"place": "Nairobi"
|
| 267 |
-
|
| 268 |
-
# Overture division subtypes — used to filter out natural_earth candidates
|
| 269 |
-
# from the place extraction output (NE features don't have these subtypes).
|
| 270 |
-
_DIVISION_SUBTYPES = {
|
| 271 |
-
"country", "region", "dependency", "county", "localadmin",
|
| 272 |
-
"locality", "macrohood", "neighborhood", "microhood",
|
| 273 |
-
}
|
| 274 |
|
| 275 |
|
| 276 |
def _candidate_to_place(c: Dict) -> Optional[Dict]:
|
| 277 |
-
"""Convert a selected candidate to a Place dict for PlacesResult."""
|
| 278 |
name = c.get("name", "").strip()
|
| 279 |
if not name:
|
| 280 |
return None
|
| 281 |
|
| 282 |
-
|
| 283 |
-
|
| 284 |
-
subtype = c.get("subtype", "")
|
| 285 |
-
if subtype in _DIVISION_SUBTYPES:
|
| 286 |
-
place["subtype"] = subtype
|
| 287 |
-
|
| 288 |
-
country = c.get("country", "")
|
| 289 |
-
if country and len(country) == 2:
|
| 290 |
-
place["country"] = country
|
| 291 |
-
|
| 292 |
-
return place
|
| 293 |
|
| 294 |
|
| 295 |
def sample_to_place_pair(sample: Dict[str, Any]) -> Optional[Dict]:
|
|
|
|
| 7 |
2. Place extraction (prompt = question only, completion = PlacesResult JSON)
|
| 8 |
|
| 9 |
Place extraction pairs are derived automatically: for each SQL sample the
|
| 10 |
+
selected_candidates give us the correct place names that the extractor should
|
| 11 |
+
return.
|
| 12 |
|
| 13 |
Output layout (all paths relative to dataset/):
|
| 14 |
output/runs/{run_name}/sql/train.jsonl
|
|
|
|
| 124 |
id VARCHAR -- unique feature id prefixed 'ne_'
|
| 125 |
names STRUCT("primary" VARCHAR, ...)
|
| 126 |
country VARCHAR
|
| 127 |
+
subtype VARCHAR -- e.g. 'ocean', 'sea', 'bay', 'range/mtn', 'island group'
|
| 128 |
class VARCHAR
|
| 129 |
region VARCHAR
|
| 130 |
admin_level INTEGER
|
|
|
|
| 139 |
|
| 140 |
_CANDIDATES_COLS = [
|
| 141 |
"source", "id", "name", "subtype", "country", "region",
|
| 142 |
+
"admin_level",
|
| 143 |
]
|
| 144 |
|
| 145 |
|
|
|
|
| 209 |
_PLACE_SYSTEM = """You are a geographic entity extractor. Extract the place names the user is asking about and return valid JSON only.
|
| 210 |
|
| 211 |
OUTPUT FORMAT:
|
| 212 |
+
{"places": [{"place": "<name>"}]}
|
| 213 |
|
| 214 |
RULES:
|
| 215 |
+
- Extract the place or places that are the actual anchors of the query.
|
| 216 |
+
- Physical features are valid places: oceans, seas, gulfs, bays, straits, rivers, lakes, basins, mountain ranges, peninsulas, island groups, deserts, and terrain regions.
|
| 217 |
+
- When a place is followed by its containing region, state, or country as disambiguation context ("Puri, Odisha", "Lisboa, Portugal", "Goa, India", "Manchester in US"), extract ONLY the specific place. Do not return the container as a separate place.
|
| 218 |
+
- When a query names two or more distinct anchors joined by words like "and", "both", "between", or mixes an admin area with a physical feature as separate anchors, extract every anchor in the order they appear.
|
| 219 |
+
- Do not infer or expand category nouns like "regions", "districts", "counties", "rivers", or "mountains" when they refer to a type rather than a specific named place ("regions of India" -> extract "India" only).
|
| 220 |
+
- Only extract places explicitly mentioned.
|
| 221 |
- No duplicate place names.
|
| 222 |
|
| 223 |
EXAMPLES:
|
| 224 |
Query: "Puri, Odisha"
|
| 225 |
+
-> {"places": [{"place": "Puri"}]}
|
| 226 |
|
| 227 |
Query: "Lisboa, Portugal"
|
| 228 |
+
-> {"places": [{"place": "Lisboa"}]}
|
| 229 |
|
| 230 |
Query: "Goa, India"
|
| 231 |
+
-> {"places": [{"place": "Goa"}]}
|
| 232 |
|
| 233 |
Query: "Manchester in US"
|
| 234 |
+
-> {"places": [{"place": "Manchester"}]}
|
| 235 |
|
| 236 |
Query: "Springfield, Illinois"
|
| 237 |
+
-> {"places": [{"place": "Springfield"}]}
|
| 238 |
|
| 239 |
Query: "coastal districts of Brazil"
|
| 240 |
+
-> {"places": [{"place": "Brazil"}]}
|
| 241 |
|
| 242 |
Query: "northern half of India"
|
| 243 |
+
-> {"places": [{"place": "India"}]}
|
| 244 |
|
| 245 |
Query: "what's within 50 km of Paris?"
|
| 246 |
+
-> {"places": [{"place": "Paris"}]}
|
| 247 |
|
| 248 |
Query: "countries the Nile crosses"
|
| 249 |
-> {"places": [{"place": "Nile"}]}
|
| 250 |
|
| 251 |
+
Query: "which countries touch the Gulf of Maine"
|
| 252 |
+
-> {"places": [{"place": "Gulf of Maine"}]}
|
| 253 |
+
|
| 254 |
+
Query: "10 km buffer around Odisha"
|
| 255 |
+
-> {"places": [{"place": "Odisha"}]}
|
| 256 |
+
|
| 257 |
Query: "part of Ecuador in the Amazon basin"
|
| 258 |
+
-> {"places": [{"place": "Ecuador"}, {"place": "Amazon basin"}]}
|
| 259 |
|
| 260 |
Query: "Amazon basin inside Ecuador"
|
| 261 |
+
-> {"places": [{"place": "Amazon basin"}, {"place": "Ecuador"}]}
|
| 262 |
+
|
| 263 |
+
Query: "the part of Chad in Lake Chad"
|
| 264 |
+
-> {"places": [{"place": "Chad"}, {"place": "Lake Chad"}]}
|
| 265 |
|
| 266 |
Query: "which regions border both France and Germany?"
|
| 267 |
+
-> {"places": [{"place": "France"}, {"place": "Germany"}]}
|
| 268 |
|
| 269 |
Query: "merge Nairobi and Mombasa"
|
| 270 |
+
-> {"places": [{"place": "Nairobi"}, {"place": "Mombasa"}]}"""
|
| 271 |
|
| 272 |
|
| 273 |
def _candidate_to_place(c: Dict) -> Optional[Dict]:
|
| 274 |
+
"""Convert a selected candidate to a minimal Place dict for PlacesResult."""
|
| 275 |
name = c.get("name", "").strip()
|
| 276 |
if not name:
|
| 277 |
return None
|
| 278 |
|
| 279 |
+
return {"place": name}
|
| 280 |
|
| 281 |
|
| 282 |
def sample_to_place_pair(sample: Dict[str, Any]) -> Optional[Dict]:
|
dataset/scripts/generate_samples.py
CHANGED
|
@@ -61,29 +61,30 @@ get_templates_by_family = sql_templates.get_templates_by_family
|
|
| 61 |
|
| 62 |
|
| 63 |
_NE_NAMED_LOOKUP_SUBTYPES = {
|
| 64 |
-
'sea', 'ocean', '
|
| 65 |
-
'
|
| 66 |
}
|
| 67 |
|
| 68 |
_NE_TEMPLATE_SUBTYPES = {
|
| 69 |
-
'lookup_02': {'sea', 'ocean', '
|
| 70 |
'adj_03': {'sea', 'ocean'},
|
| 71 |
-
'adj_04': {'
|
| 72 |
-
'adj_05': {'
|
| 73 |
-
'contain_03': {'sea', 'ocean', 'gulf', 'bay', '
|
| 74 |
'contain_04': {'sea', 'ocean', 'gulf', 'bay', 'strait'},
|
| 75 |
-
'intersect_02': {'
|
| 76 |
-
'intersect_03': {'
|
| 77 |
-
'
|
| 78 |
-
'
|
| 79 |
-
'
|
| 80 |
-
'
|
| 81 |
-
'
|
| 82 |
-
'
|
| 83 |
-
'
|
| 84 |
-
'
|
| 85 |
-
'
|
| 86 |
-
'
|
|
| 87 |
}
|
| 88 |
|
| 89 |
|
|
@@ -722,14 +723,17 @@ def generate_template_based_sample(
|
|
| 722 |
anchor = {"id": pair["contained_id"], "name": pair["contained_name"]}
|
| 723 |
|
| 724 |
elif template.family == "adjacency":
|
| 725 |
-
# adj_03/04/05 target natural_earth features (
|
| 726 |
# Their SQL hardcodes NE subtypes and does not use {target_subtype}.
|
| 727 |
# Sample from cross_source_relations so the anchor is a division
|
| 728 |
# that actually intersects the right NE features.
|
| 729 |
_NE_ADJ_SUBTYPES = {
|
| 730 |
"adj_03": ("ocean", "sea"),
|
| 731 |
-
"adj_04": ("
|
| 732 |
-
"adj_05": ("
|
| 733 |
}
|
| 734 |
if template.template_id in _NE_ADJ_SUBTYPES:
|
| 735 |
cs_df = tables.get('cross_source_relations', pd.DataFrame())
|
|
@@ -1019,8 +1023,10 @@ def generate_template_based_sample(
|
|
| 1019 |
|
| 1020 |
elif template.family == "buffer":
|
| 1021 |
# Buffer operations
|
| 1022 |
-
# Kilometre distances used by
|
| 1023 |
-
#
|
| 1024 |
# The template SQL divides by 111 320 to convert to degrees.
|
| 1025 |
_buffer_km_choices = [1, 2, 5, 10, 25, 50, 100, 200]
|
| 1026 |
_buffer_m_choices = [100, 250, 500, 1000, 2000, 5000]
|
|
@@ -1166,8 +1172,13 @@ def generate_template_based_sample(
|
|
| 1166 |
candidates = _merge_candidate_lists(div_cands, ne_cands, max_total=10)
|
| 1167 |
|
| 1168 |
elif template.family == "aggregation":
|
| 1169 |
-
|
| 1170 |
target_subtype = random.choice(['locality', 'region'])
|
| 1171 |
|
| 1172 |
if template.template_id in ['agg_03', 'agg_04']:
|
| 1173 |
# Country-level aggregation: SQL uses country code, so the anchor
|
|
@@ -1194,7 +1205,7 @@ def generate_template_based_sample(
|
|
| 1194 |
num_candidates=10, difficulty="hard"
|
| 1195 |
)
|
| 1196 |
|
| 1197 |
-
question = random.choice(
|
| 1198 |
top_n=top_n,
|
| 1199 |
target_subtype=target_subtype,
|
| 1200 |
anchor_name=anchor['name'],
|
|
@@ -1216,7 +1227,7 @@ def generate_template_based_sample(
|
|
| 1216 |
num_candidates=10, difficulty="hard"
|
| 1217 |
)
|
| 1218 |
|
| 1219 |
-
question = random.choice(
|
| 1220 |
top_n=top_n,
|
| 1221 |
target_subtype=target_subtype,
|
| 1222 |
anchor_name=anchor['container_name'],
|
|
|
|
| 61 |
|
| 62 |
|
| 63 |
_NE_NAMED_LOOKUP_SUBTYPES = {
|
| 64 |
+
'sea', 'ocean', 'lake', 'river', 'basin', 'gulf', 'bay',
|
| 65 |
+
'island group', 'peninsula', 'strait', 'range/mtn', 'depression',
|
| 66 |
}
|
| 67 |
|
| 68 |
_NE_TEMPLATE_SUBTYPES = {
|
| 69 |
+
'lookup_02': {'sea', 'ocean', 'lake', 'river', 'basin', 'gulf', 'bay', 'island group', 'peninsula', 'strait', 'range/mtn', 'depression'},
|
| 70 |
'adj_03': {'sea', 'ocean'},
|
| 71 |
+
'adj_04': {'river', 'lake', 'basin'},
|
| 72 |
+
'adj_05': {'range/mtn', 'peninsula', 'depression'},
|
| 73 |
+
'contain_03': {'sea', 'ocean', 'gulf', 'bay', 'basin', 'island group', 'peninsula', 'range/mtn', 'depression'},
|
| 74 |
'contain_04': {'sea', 'ocean', 'gulf', 'bay', 'strait'},
|
| 75 |
+
'intersect_02': {'river', 'lake', 'basin', 'gulf', 'bay', 'strait', 'range/mtn', 'peninsula', 'depression'},
|
| 76 |
+
'intersect_03': {'river', 'lake', 'basin', 'gulf', 'bay', 'strait', 'range/mtn', 'peninsula', 'depression'},
|
| 77 |
+
'intersect_05': {'river', 'lake', 'basin', 'gulf', 'bay', 'strait', 'range/mtn', 'peninsula', 'depression'},
|
| 78 |
+
'buffer_03': {'sea', 'ocean', 'lake', 'river', 'basin', 'gulf', 'bay', 'island group', 'peninsula', 'strait', 'range/mtn', 'depression'},
|
| 79 |
+
'buffer_04': {'sea', 'ocean', 'lake', 'river', 'basin', 'gulf', 'bay', 'island group', 'peninsula', 'strait', 'range/mtn', 'depression'},
|
| 80 |
+
'buffer_05': {'sea', 'ocean', 'lake', 'river', 'basin', 'gulf', 'bay', 'island group', 'peninsula', 'strait', 'range/mtn', 'depression'},
|
| 81 |
+
'chained_03': {'island group', 'peninsula', 'range/mtn', 'depression'},
|
| 82 |
+
'chained_04': {'river', 'lake', 'basin'},
|
| 83 |
+
'chained_05': {'range/mtn', 'depression'},
|
| 84 |
+
'chained_08': {'river', 'lake', 'basin'},
|
| 85 |
+
'chained_09': {'range/mtn', 'depression'},
|
| 86 |
+
'partial_05': {'sea', 'ocean', 'lake', 'river', 'basin', 'gulf', 'bay', 'island group', 'peninsula', 'strait', 'range/mtn', 'depression'},
|
| 87 |
+
'diff_02': {'sea', 'ocean', 'lake', 'river', 'basin', 'gulf', 'bay', 'island group', 'peninsula', 'strait', 'range/mtn', 'depression'},
|
| 88 |
}
|
| 89 |
|
| 90 |
|
|
|
|
| 723 |
anchor = {"id": pair["contained_id"], "name": pair["contained_name"]}
|
| 724 |
|
| 725 |
elif template.family == "adjacency":
|
| 726 |
+
# adj_03/04/05/07/08 target natural_earth features (marine, water,
|
| 727 |
+
# mountain, plateau, and broad landform regions).
|
| 728 |
# Their SQL hardcodes NE subtypes and does not use {target_subtype}.
|
| 729 |
# Sample from cross_source_relations so the anchor is a division
|
| 730 |
# that actually intersects the right NE features.
|
| 731 |
_NE_ADJ_SUBTYPES = {
|
| 732 |
"adj_03": ("ocean", "sea"),
|
| 733 |
+
"adj_04": ("river", "lake", "basin"),
|
| 734 |
+
"adj_05": ("range/mtn",),
|
| 735 |
+
"adj_07": ("plateau",),
|
| 736 |
+
"adj_08": ("plain", "lowland", "basin", "valley", "depression", "gorge"),
|
| 737 |
}
|
| 738 |
if template.template_id in _NE_ADJ_SUBTYPES:
|
| 739 |
cs_df = tables.get('cross_source_relations', pd.DataFrame())
|
|
|
|
| 1023 |
|
| 1024 |
elif template.family == "buffer":
|
| 1025 |
# Buffer operations
|
| 1026 |
+
# Kilometre distances used by km-based buffer templates (for example
|
| 1027 |
+
# buffer_01, buffer_03, buffer_05, and buffer_06).
|
| 1028 |
+
# Metre distances used by metre-based buffer templates (buffer_02 and
|
| 1029 |
+
# buffer_04).
|
| 1030 |
# The template SQL divides by 111 320 to convert to degrees.
|
| 1031 |
_buffer_km_choices = [1, 2, 5, 10, 25, 50, 100, 200]
|
| 1032 |
_buffer_m_choices = [100, 250, 500, 1000, 2000, 5000]
|
|
|
|
| 1172 |
candidates = _merge_candidate_lists(div_cands, ne_cands, max_total=10)
|
| 1173 |
|
| 1174 |
elif template.family == "aggregation":
|
| 1175 |
+
# Teach the model to distinguish singular superlatives ("the largest")
|
| 1176 |
+
# from explicit top-N requests ("top 5 largest").
|
| 1177 |
+
top_n = random.choice([1, 3, 5, 10])
|
| 1178 |
target_subtype = random.choice(['locality', 'region'])
|
| 1179 |
+
singular_hints = [h for h in template.question_hints if '{top_n}' not in h]
|
| 1180 |
+
plural_hints = [h for h in template.question_hints if '{top_n}' in h]
|
| 1181 |
+
question_hint_pool = singular_hints if top_n == 1 and singular_hints else plural_hints or template.question_hints
|
| 1182 |
|
| 1183 |
if template.template_id in ['agg_03', 'agg_04']:
|
| 1184 |
# Country-level aggregation: SQL uses country code, so the anchor
|
|
|
|
| 1205 |
num_candidates=10, difficulty="hard"
|
| 1206 |
)
|
| 1207 |
|
| 1208 |
+
question = random.choice(question_hint_pool).format(
|
| 1209 |
top_n=top_n,
|
| 1210 |
target_subtype=target_subtype,
|
| 1211 |
anchor_name=anchor['name'],
|
|
|
|
| 1227 |
num_candidates=10, difficulty="hard"
|
| 1228 |
)
|
| 1229 |
|
| 1230 |
+
question = random.choice(question_hint_pool).format(
|
| 1231 |
top_n=top_n,
|
| 1232 |
target_subtype=target_subtype,
|
| 1233 |
anchor_name=anchor['container_name'],
|
dataset/scripts/sql_templates.py
CHANGED
|
@@ -125,6 +125,10 @@ TEMPLATES = [
|
|
| 125 |
"map the {anchor_name}",
|
| 126 |
"how big is the {anchor_name}?",
|
| 127 |
"outline of the {anchor_name}",
|
| 128 |
],
|
| 129 |
),
|
| 130 |
|
|
@@ -163,6 +167,10 @@ TEMPLATES = [
|
|
| 163 |
"{anchor_name} {container_name}",
|
| 164 |
"pull up {anchor_name} in {container_name}",
|
| 165 |
"find {anchor_name} in {container_name}",
|
| 166 |
],
|
| 167 |
),
|
| 168 |
|
|
@@ -189,6 +197,10 @@ TEMPLATES = [
|
|
| 189 |
"pull up {anchor_name} ({container_name})",
|
| 190 |
"find {anchor_name} in {container_name}",
|
| 191 |
"{anchor_name} {container_name}",
|
| 192 |
],
|
| 193 |
),
|
| 194 |
|
|
@@ -215,6 +227,10 @@ TEMPLATES = [
|
|
| 215 |
"{anchor_name} province of {container_name}",
|
| 216 |
"pull up {anchor_name} in {container_name}",
|
| 217 |
"find {anchor_name} {container_name}",
|
| 218 |
],
|
| 219 |
),
|
| 220 |
|
|
@@ -246,6 +262,11 @@ TEMPLATES = [
|
|
| 246 |
"what surrounds {anchor_name}?",
|
| 247 |
"places next to {anchor_name}",
|
| 248 |
"everything bordering {anchor_name}",
|
| 249 |
],
|
| 250 |
),
|
| 251 |
|
|
@@ -275,6 +296,10 @@ TEMPLATES = [
|
|
| 275 |
"which {target_subtype}s are adjacent to {anchor_name}?",
|
| 276 |
"{target_subtype}s along the {anchor_name} border",
|
| 277 |
"find {target_subtype}s next to {anchor_name}",
|
| 278 |
],
|
| 279 |
),
|
| 280 |
|
|
@@ -305,6 +330,11 @@ TEMPLATES = [
|
|
| 305 |
"which water bodies does {anchor_name} border?",
|
| 306 |
"does {anchor_name} have sea access?",
|
| 307 |
"what ocean is {anchor_name} on?",
|
| 308 |
],
|
| 309 |
),
|
| 310 |
|
|
@@ -368,6 +398,12 @@ TEMPLATES = [
|
|
| 368 |
"regions adjacent to both {anchor_1_name} and {anchor_2_name}",
|
| 369 |
"what lies between {anchor_1_name} and {anchor_2_name}?",
|
| 370 |
"common neighbours of {anchor_1_name} and {anchor_2_name}",
|
| 371 |
],
|
| 372 |
),
|
| 373 |
|
|
@@ -399,6 +435,13 @@ TEMPLATES = [
|
|
| 399 |
"all {target_subtype}s within {anchor_name}",
|
| 400 |
"{target_subtype}s of {anchor_name}",
|
| 401 |
"show every {target_subtype} in {anchor_name}",
|
| 402 |
],
|
| 403 |
),
|
| 404 |
|
|
@@ -428,6 +471,12 @@ TEMPLATES = [
|
|
| 428 |
"{anchor_name} is part of which country?",
|
| 429 |
"where is {anchor_name}",
|
| 430 |
"what country is {anchor_name} in",
|
| 431 |
],
|
| 432 |
),
|
| 433 |
|
|
@@ -456,6 +505,14 @@ TEMPLATES = [
|
|
| 456 |
"all regions inside the {anchor_name}",
|
| 457 |
"what {target_subtype}s does the {anchor_name} contain?",
|
| 458 |
"{target_subtype}s covered by the {anchor_name}",
|
| 459 |
],
|
| 460 |
),
|
| 461 |
|
|
@@ -486,6 +543,12 @@ TEMPLATES = [
|
|
| 486 |
"which {target_subtype}s overlap {anchor_name}?",
|
| 487 |
"{target_subtype}s partially inside {anchor_name}",
|
| 488 |
"what {target_subtype}s extend into {anchor_name}?",
|
| 489 |
],
|
| 490 |
),
|
| 491 |
|
|
@@ -516,6 +579,10 @@ TEMPLATES = [
|
|
| 516 |
"countries along the {anchor_name}",
|
| 517 |
"what countries does the {anchor_name} cover?",
|
| 518 |
"countries the {anchor_name} spans across",
|
| 519 |
],
|
| 520 |
),
|
| 521 |
|
|
@@ -607,6 +674,15 @@ TEMPLATES = [
|
|
| 607 |
"what falls within {buffer_km} km of the {anchor_name}?",
|
| 608 |
"admin divisions within a {buffer_km} km radius of the {anchor_name}",
|
| 609 |
"places within {buffer_km} kilometers of the {anchor_name}",
|
| 610 |
],
|
| 611 |
),
|
| 612 |
|
|
@@ -633,6 +709,40 @@ TEMPLATES = [
|
|
| 633 |
"admin units within {buffer_m} m of the {anchor_name}",
|
| 634 |
"places within {buffer_m} metres of the {anchor_name}",
|
| 635 |
"{buffer_m} meter buffer around the {anchor_name}",
|
| 636 |
],
|
| 637 |
),
|
| 638 |
|
|
@@ -670,6 +780,11 @@ TEMPLATES = [
|
|
| 670 |
"{target_subtype}s in {anchor_name} bordering the sea",
|
| 671 |
"oceanfront {target_subtype}s in {anchor_name}",
|
| 672 |
"which {target_subtype}s in {anchor_name} have a coastline?",
|
| 673 |
],
|
| 674 |
),
|
| 675 |
|
|
@@ -702,6 +817,11 @@ TEMPLATES = [
|
|
| 702 |
"{target_subtype}s in {anchor_name} with no coastline",
|
| 703 |
"which {target_subtype}s within {anchor_name} are landlocked?",
|
| 704 |
"interior {target_subtype}s of {anchor_name} with no ocean border",
|
| 705 |
],
|
| 706 |
),
|
| 707 |
|
|
@@ -723,7 +843,7 @@ TEMPLATES = [
|
|
| 723 |
" AND ST_Within(b.geometry, region.geometry)"
|
| 724 |
" AND EXISTS ("
|
| 725 |
" SELECT 1 FROM read_parquet('natural_earth') AS n"
|
| 726 |
-
" WHERE n.subtype IN ('
|
| 727 |
" AND ST_Intersects(b.geometry, n.geometry)"
|
| 728 |
" )"
|
| 729 |
),
|
|
@@ -732,6 +852,10 @@ TEMPLATES = [
|
|
| 732 |
"{target_subtype}s of {anchor_name} on a peninsula or island group",
|
| 733 |
"{target_subtype}s within {anchor_name} on notable landforms",
|
| 734 |
"island and peninsula {target_subtype}s of {anchor_name}",
|
| 735 |
],
|
| 736 |
),
|
| 737 |
|
|
@@ -764,6 +888,10 @@ TEMPLATES = [
|
|
| 764 |
"{anchor_1_name} with {anchor_2_name} cut out",
|
| 765 |
"subtract {anchor_2_name} from {anchor_1_name}",
|
| 766 |
"what's left of {anchor_1_name} after removing {anchor_2_name}?",
|
| 767 |
],
|
| 768 |
),
|
| 769 |
|
|
@@ -792,6 +920,10 @@ TEMPLATES = [
|
|
| 792 |
"{anchor_name} with the {clip_feature_name} removed",
|
| 793 |
"what's left of {anchor_name} after removing the {clip_feature_name}?",
|
| 794 |
"show me {anchor_name} excluding the {clip_feature_name}",
|
| 795 |
],
|
| 796 |
),
|
| 797 |
|
|
@@ -828,6 +960,11 @@ TEMPLATES = [
|
|
| 828 |
"the region straddling the border of {anchor_1_name} and {anchor_2_name} within {buffer_km} km",
|
| 829 |
"{buffer_km} km on either side of the {anchor_1_name} and {anchor_2_name} border",
|
| 830 |
"buffer the {anchor_1_name}-{anchor_2_name} boundary by {buffer_km} km",
|
| 831 |
],
|
| 832 |
),
|
| 833 |
|
|
@@ -899,6 +1036,10 @@ TEMPLATES = [
|
|
| 899 |
"show {target_subtype}s across {anchor_1_name} and {anchor_2_name}",
|
| 900 |
"{target_subtype}s belonging to {anchor_1_name} and {anchor_2_name}",
|
| 901 |
"list {target_subtype}s in both {anchor_1_name} and {anchor_2_name}",
|
| 902 |
],
|
| 903 |
),
|
| 904 |
|
|
@@ -921,6 +1062,10 @@ TEMPLATES = [
|
|
| 921 |
"all {target_subtype}s in {anchor_1_name}, {anchor_2_name} and {anchor_3_name}",
|
| 922 |
"show {target_subtype}s across {anchor_1_name}, {anchor_2_name} and {anchor_3_name}",
|
| 923 |
"list {target_subtype}s in {anchor_1_name}, {anchor_2_name} and {anchor_3_name}",
|
| 924 |
],
|
| 925 |
),
|
| 926 |
|
|
@@ -947,6 +1092,11 @@ TEMPLATES = [
|
|
| 947 |
"union of all {target_subtype}s within {anchor_name}",
|
| 948 |
"all {target_subtype}s of {anchor_name} merged together",
|
| 949 |
"the overall extent of {target_subtype}s in {anchor_name}",
|
| 950 |
],
|
| 951 |
),
|
| 952 |
|
|
@@ -979,6 +1129,11 @@ TEMPLATES = [
|
|
| 979 |
"the top half of {anchor_name}",
|
| 980 |
"northern portion of {anchor_name}",
|
| 981 |
"upper half of {anchor_name}",
|
| 982 |
],
|
| 983 |
),
|
| 984 |
|
|
@@ -1008,6 +1163,11 @@ TEMPLATES = [
|
|
| 1008 |
"the bottom half of {anchor_name}",
|
| 1009 |
"southern portion of {anchor_name}",
|
| 1010 |
"lower half of {anchor_name}",
|
| 1011 |
],
|
| 1012 |
),
|
| 1013 |
|
|
@@ -1036,6 +1196,11 @@ TEMPLATES = [
|
|
| 1036 |
"eastern part of {anchor_name}",
|
| 1037 |
"the right half of {anchor_name}",
|
| 1038 |
"eastern portion of {anchor_name}",
|
| 1039 |
],
|
| 1040 |
),
|
| 1041 |
|
|
@@ -1064,6 +1229,11 @@ TEMPLATES = [
|
|
| 1064 |
"western part of {anchor_name}",
|
| 1065 |
"the left half of {anchor_name}",
|
| 1066 |
"western portion of {anchor_name}",
|
| 1067 |
],
|
| 1068 |
),
|
| 1069 |
|
|
@@ -1095,6 +1265,10 @@ TEMPLATES = [
|
|
| 1095 |
"{clip_feature_name} inside {anchor_name}",
|
| 1096 |
"parts of {anchor_name} covered by the {clip_feature_name}",
|
| 1097 |
"show me where {anchor_name} and the {clip_feature_name} overlap",
|
| 1098 |
],
|
| 1099 |
),
|
| 1100 |
|
|
@@ -1129,6 +1303,10 @@ TEMPLATES = [
|
|
| 1129 |
"the {top_n} biggest {target_subtype}s within {anchor_name}",
|
| 1130 |
"largest {target_subtype} in {anchor_name}",
|
| 1131 |
"which {target_subtype} in {anchor_name} has the most area?",
|
| 1132 |
],
|
| 1133 |
),
|
| 1134 |
|
|
@@ -1160,6 +1338,10 @@ TEMPLATES = [
|
|
| 1160 |
"the {top_n} tiniest {target_subtype}s within {anchor_name}",
|
| 1161 |
"smallest {target_subtype} in {anchor_name}",
|
| 1162 |
"which {target_subtype} in {anchor_name} has the least area?",
|
| 1163 |
],
|
| 1164 |
),
|
| 1165 |
|
|
@@ -1188,6 +1370,10 @@ TEMPLATES = [
|
|
| 1188 |
"the {top_n} largest {target_subtype}s in {anchor_name}",
|
| 1189 |
"biggest {target_subtype} in {anchor_name}",
|
| 1190 |
"which {target_subtype} in {anchor_name} is the largest?",
|
| 1191 |
],
|
| 1192 |
),
|
| 1193 |
|
|
@@ -1216,6 +1402,10 @@ TEMPLATES = [
|
|
| 1216 |
"the {top_n} smallest {target_subtype}s in {anchor_name}",
|
| 1217 |
"smallest {target_subtype} in {anchor_name}",
|
| 1218 |
"which {target_subtype} in {anchor_name} is the smallest?",
|
| 1219 |
],
|
| 1220 |
),
|
| 1221 |
|
|
@@ -1249,6 +1439,11 @@ TEMPLATES = [
|
|
| 1249 |
"biggest {target_subtype} per region in {anchor_name}",
|
| 1250 |
"largest {target_subtype} for every region of {anchor_name}",
|
| 1251 |
"the biggest {target_subtype} in each province of {anchor_name}",
|
| 1252 |
],
|
| 1253 |
),
|
| 1254 |
|
|
@@ -1279,6 +1474,11 @@ TEMPLATES = [
|
|
| 1279 |
"smallest {target_subtype} per region in {anchor_name}",
|
| 1280 |
"tiniest {target_subtype} for every region of {anchor_name}",
|
| 1281 |
"the smallest {target_subtype} in each province of {anchor_name}",
|
| 1282 |
],
|
| 1283 |
),
|
| 1284 |
|
|
@@ -1307,6 +1507,10 @@ TEMPLATES = [
|
|
| 1307 |
"{anchor_name}'s land {target_subtype}s",
|
| 1308 |
"dependencies of {anchor_name} with land area",
|
| 1309 |
"show the land dependencies of {anchor_name}",
|
| 1310 |
],
|
| 1311 |
),
|
| 1312 |
|
|
@@ -1330,6 +1534,11 @@ TEMPLATES = [
|
|
| 1330 |
"official territorial divisions of {anchor_name}",
|
| 1331 |
"recognised territorial {target_subtype}s belonging to {anchor_name}",
|
| 1332 |
"which territorial regions does {anchor_name} have?",
|
| 1333 |
],
|
| 1334 |
),
|
| 1335 |
|
|
@@ -1353,6 +1562,11 @@ TEMPLATES = [
|
|
| 1353 |
"{target_subtype}s of {anchor_name} that are not on land",
|
| 1354 |
"water-associated {target_subtype}s of {anchor_name}",
|
| 1355 |
"marine or offshore {target_subtype}s of {anchor_name}",
|
| 1356 |
],
|
| 1357 |
),
|
| 1358 |
|
|
@@ -1374,7 +1588,7 @@ TEMPLATES = [
|
|
| 1374 |
" SELECT n.id, n.names.\"primary\" AS name, n.subtype,"
|
| 1375 |
" ST_AsGeoJSON(n.geometry) AS geometry"
|
| 1376 |
" FROM read_parquet('natural_earth') AS n, a"
|
| 1377 |
-
" WHERE n.subtype IN ('
|
| 1378 |
" AND ST_Intersects(a.geometry, n.geometry)"
|
| 1379 |
),
|
| 1380 |
question_hints=[
|
|
@@ -1386,6 +1600,10 @@ TEMPLATES = [
|
|
| 1386 |
"what bodies of water cross {anchor_name}?",
|
| 1387 |
"rivers of {anchor_name}",
|
| 1388 |
"show me the lakes in {anchor_name}",
|
| 1389 |
],
|
| 1390 |
),
|
| 1391 |
|
|
@@ -1395,7 +1613,7 @@ TEMPLATES = [
|
|
| 1395 |
sql_difficulty="medium",
|
| 1396 |
anchor_source="divisions_area",
|
| 1397 |
num_anchors=1,
|
| 1398 |
-
target_subtype="range",
|
| 1399 |
sql_template=(
|
| 1400 |
"WITH a AS ("
|
| 1401 |
" SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
|
|
@@ -1403,18 +1621,84 @@ TEMPLATES = [
|
|
| 1403 |
" SELECT n.id, n.names.\"primary\" AS name, n.subtype,"
|
| 1404 |
" ST_AsGeoJSON(n.geometry) AS geometry"
|
| 1405 |
" FROM read_parquet('natural_earth') AS n, a"
|
| 1406 |
-
" WHERE n.subtype
|
| 1407 |
" AND ST_Intersects(a.geometry, n.geometry)"
|
| 1408 |
),
|
| 1409 |
question_hints=[
|
| 1410 |
"what mountain ranges are in {anchor_name}?",
|
| 1411 |
-
"terrain features of {anchor_name}",
|
| 1412 |
"which mountain ranges cross {anchor_name}?",
|
| 1413 |
-
"landforms inside {anchor_name}",
|
| 1414 |
-
"peninsulas and ranges in {anchor_name}",
|
| 1415 |
-
"geographic features within {anchor_name}",
|
| 1416 |
"mountains of {anchor_name}",
|
| 1417 |
-
"
|
| 1418 |
],
|
| 1419 |
),
|
| 1420 |
|
|
@@ -1449,6 +1733,13 @@ TEMPLATES = [
|
|
| 1449 |
"what provinces does the {anchor_name} span?",
|
| 1450 |
"regions along the {anchor_name}",
|
| 1451 |
"which provinces overlap the {anchor_name}?",
|
| 1452 |
],
|
| 1453 |
),
|
| 1454 |
|
|
@@ -1475,6 +1766,45 @@ TEMPLATES = [
|
|
| 1475 |
"everything natural that touches {anchor_name}",
|
| 1476 |
"what geographic features does {anchor_name} contain?",
|
| 1477 |
"natural features within or crossing {anchor_name}",
|
| 1478 |
],
|
| 1479 |
),
|
| 1480 |
|
|
@@ -1500,7 +1830,7 @@ TEMPLATES = [
|
|
| 1500 |
" AND ST_Within(b.geometry, region.geometry)"
|
| 1501 |
" AND EXISTS ("
|
| 1502 |
" SELECT 1 FROM read_parquet('natural_earth') AS n"
|
| 1503 |
-
" WHERE n.subtype IN ('
|
| 1504 |
" AND ST_Intersects(b.geometry, n.geometry)"
|
| 1505 |
" )"
|
| 1506 |
),
|
|
@@ -1512,6 +1842,10 @@ TEMPLATES = [
|
|
| 1512 |
"{target_subtype}s in {anchor_name} that touch a river",
|
| 1513 |
"which {target_subtype}s in {anchor_name} are on a lake?",
|
| 1514 |
"waterfront {target_subtype}s of {anchor_name}",
|
| 1515 |
],
|
| 1516 |
),
|
| 1517 |
|
|
@@ -1533,7 +1867,7 @@ TEMPLATES = [
|
|
| 1533 |
" AND ST_Within(b.geometry, region.geometry)"
|
| 1534 |
" AND EXISTS ("
|
| 1535 |
" SELECT 1 FROM read_parquet('natural_earth') AS n"
|
| 1536 |
-
" WHERE n.subtype IN ('
|
| 1537 |
" AND ST_Intersects(b.geometry, n.geometry)"
|
| 1538 |
" )"
|
| 1539 |
),
|
|
@@ -1544,6 +1878,10 @@ TEMPLATES = [
|
|
| 1544 |
"highland {target_subtype}s within {anchor_name}",
|
| 1545 |
"{target_subtype}s of {anchor_name} in mountainous terrain",
|
| 1546 |
"{target_subtype}s in {anchor_name} near a mountain range",
|
| 1547 |
],
|
| 1548 |
),
|
| 1549 |
|
|
@@ -1581,6 +1919,10 @@ TEMPLATES = [
|
|
| 1581 |
"{target_subtype}s of {anchor_name} with ocean access",
|
| 1582 |
"which {target_subtype}s in {anchor_name} touch the sea?",
|
| 1583 |
"maritime {target_subtype}s of {anchor_name}",
|
| 1584 |
],
|
| 1585 |
),
|
| 1586 |
|
|
@@ -1613,6 +1955,10 @@ TEMPLATES = [
|
|
| 1613 |
"{target_subtype}s in {anchor_name} with no sea access",
|
| 1614 |
"non-coastal {target_subtype}s of {anchor_name}",
|
| 1615 |
"inland {target_subtype}s of {anchor_name}",
|
| 1616 |
],
|
| 1617 |
),
|
| 1618 |
|
|
@@ -1634,7 +1980,7 @@ TEMPLATES = [
|
|
| 1634 |
" AND ST_Within(b.geometry, region.geometry)"
|
| 1635 |
" AND EXISTS ("
|
| 1636 |
" SELECT 1 FROM read_parquet('natural_earth') AS n"
|
| 1637 |
-
" WHERE n.subtype IN ('
|
| 1638 |
" AND ST_Intersects(b.geometry, n.geometry)"
|
| 1639 |
" )"
|
| 1640 |
),
|
|
@@ -1645,6 +1991,10 @@ TEMPLATES = [
|
|
| 1645 |
"lakeside {target_subtype}s within {anchor_name}",
|
| 1646 |
"{target_subtype}s of {anchor_name} along a river",
|
| 1647 |
"which {target_subtype}s in {anchor_name} border a lake?",
|
| 1648 |
],
|
| 1649 |
),
|
| 1650 |
|
|
@@ -1666,7 +2016,7 @@ TEMPLATES = [
|
|
| 1666 |
" AND ST_Within(b.geometry, region.geometry)"
|
| 1667 |
" AND EXISTS ("
|
| 1668 |
" SELECT 1 FROM read_parquet('natural_earth') AS n"
|
| 1669 |
-
" WHERE n.subtype IN ('
|
| 1670 |
" AND ST_Intersects(b.geometry, n.geometry)"
|
| 1671 |
" )"
|
| 1672 |
),
|
|
@@ -1677,6 +2027,10 @@ TEMPLATES = [
|
|
| 1677 |
"highland {target_subtype}s within {anchor_name}",
|
| 1678 |
"{target_subtype}s of {anchor_name} in mountainous terrain",
|
| 1679 |
"which {target_subtype}s in {anchor_name} have mountain ranges?",
|
| 1680 |
],
|
| 1681 |
),
|
| 1682 |
|
|
@@ -1718,6 +2072,10 @@ TEMPLATES = [
|
|
| 1718 |
"seaside regions of {anchor_name}",
|
| 1719 |
"which provinces of {anchor_name} touch the sea?",
|
| 1720 |
"states of {anchor_name} along the coast",
|
| 1721 |
],
|
| 1722 |
),
|
| 1723 |
|
|
@@ -1752,6 +2110,10 @@ TEMPLATES = [
|
|
| 1752 |
"regions of {anchor_name} without sea access",
|
| 1753 |
"interior states of {anchor_name}",
|
| 1754 |
"states of {anchor_name} that don't border the ocean",
|
| 1755 |
],
|
| 1756 |
),
|
| 1757 |
|
|
@@ -1784,6 +2146,10 @@ TEMPLATES = [
|
|
| 1784 |
"which countries touch the {anchor_name}?",
|
| 1785 |
"countries with coastline on the {anchor_name}",
|
| 1786 |
"what nations lie on the {anchor_name}?",
|
| 1787 |
],
|
| 1788 |
),
|
| 1789 |
|
|
@@ -1816,6 +2182,15 @@ TEMPLATES = [
|
|
| 1816 |
"everything within {buffer_km} km of the {anchor_name}",
|
| 1817 |
"what natural features are close to the {anchor_name}?",
|
| 1818 |
"{buffer_km} km radius around the {anchor_name}",
|
| 1819 |
],
|
| 1820 |
),
|
| 1821 |
|
|
|
|
| 125 |
"map the {anchor_name}",
|
| 126 |
"how big is the {anchor_name}?",
|
| 127 |
"outline of the {anchor_name}",
|
| 128 |
+
"show the shape of the {anchor_name}",
|
| 129 |
+
"trace the {anchor_name}",
|
| 130 |
+
"map out the {anchor_name}",
|
| 131 |
+
"where exactly is the {anchor_name}",
|
| 132 |
],
|
| 133 |
),
|
| 134 |
|
|
|
|
| 167 |
"{anchor_name} {container_name}",
|
| 168 |
"pull up {anchor_name} in {container_name}",
|
| 169 |
"find {anchor_name} in {container_name}",
|
| 170 |
+
"locate {anchor_name} in {container_name}",
|
| 171 |
+
"need {anchor_name} from {container_name}",
|
| 172 |
+
"show {anchor_name} under {container_name}",
|
| 173 |
+
"{anchor_name} near {container_name}",
|
| 174 |
],
|
| 175 |
),
|
| 176 |
|
|
|
|
| 197 |
"pull up {anchor_name} ({container_name})",
|
| 198 |
"find {anchor_name} in {container_name}",
|
| 199 |
"{anchor_name} {container_name}",
|
| 200 |
+
"locate {anchor_name} in {container_name}",
|
| 201 |
+
"need the {anchor_name} in {container_name}",
|
| 202 |
+
"show {anchor_name} from {container_name}",
|
| 203 |
+
"bring up {anchor_name}, {container_name}",
|
| 204 |
],
|
| 205 |
),
|
| 206 |
|
|
|
|
| 227 |
"{anchor_name} province of {container_name}",
|
| 228 |
"pull up {anchor_name} in {container_name}",
|
| 229 |
"find {anchor_name} {container_name}",
|
| 230 |
+
"locate {anchor_name} within {container_name}",
|
| 231 |
+
"show the {anchor_name} part of {container_name}",
|
| 232 |
+
"need {anchor_name} from {container_name}",
|
| 233 |
+
"bring up {anchor_name} in {container_name}",
|
| 234 |
],
|
| 235 |
),
|
| 236 |
|
|
|
|
| 262 |
"what surrounds {anchor_name}?",
|
| 263 |
"places next to {anchor_name}",
|
| 264 |
"everything bordering {anchor_name}",
|
| 265 |
+
"show adjacent places to {anchor_name}",
|
| 266 |
+
"areas touching {anchor_name}",
|
| 267 |
+
"find the neighbours of {anchor_name}",
|
| 268 |
+
"bordering places for {anchor_name}",
|
| 269 |
+
"places that meet {anchor_name}",
|
| 270 |
],
|
| 271 |
),
|
| 272 |
|
|
|
|
| 296 |
"which {target_subtype}s are adjacent to {anchor_name}?",
|
| 297 |
"{target_subtype}s along the {anchor_name} border",
|
| 298 |
"find {target_subtype}s next to {anchor_name}",
|
| 299 |
+
"show {target_subtype}s bordering {anchor_name}",
|
| 300 |
+
"{target_subtype}s beside {anchor_name}",
|
| 301 |
+
"all {target_subtype}s touching {anchor_name}",
|
| 302 |
+
"{target_subtype}s meeting {anchor_name}",
|
| 303 |
],
|
| 304 |
),
|
| 305 |
|
|
|
|
| 330 |
"which water bodies does {anchor_name} border?",
|
| 331 |
"does {anchor_name} have sea access?",
|
| 332 |
"what ocean is {anchor_name} on?",
|
| 333 |
+
"is {anchor_name} on the coast?",
|
| 334 |
+
"what sea is off the coast of {anchor_name}?",
|
| 335 |
+
"which ocean lies off {anchor_name}?",
|
| 336 |
+
"what water is {anchor_name} on the shore of?",
|
| 337 |
+
"which sea or ocean is {anchor_name} along?",
|
| 338 |
],
|
| 339 |
),
|
| 340 |
|
|
|
|
| 398 |
"regions adjacent to both {anchor_1_name} and {anchor_2_name}",
|
| 399 |
"what lies between {anchor_1_name} and {anchor_2_name}?",
|
| 400 |
"common neighbours of {anchor_1_name} and {anchor_2_name}",
|
| 401 |
+
"show places touching both {anchor_1_name} and {anchor_2_name}",
|
| 402 |
+
"areas bordering both {anchor_1_name} and {anchor_2_name}",
|
| 403 |
+
"shared neighbours of {anchor_1_name} and {anchor_2_name}",
|
| 404 |
+
"find places adjacent to both {anchor_1_name} and {anchor_2_name}",
|
| 405 |
+
"places meeting both {anchor_1_name} and {anchor_2_name}",
|
| 406 |
+
"places on the border of both {anchor_1_name} and {anchor_2_name}",
|
| 407 |
],
|
| 408 |
),
|
| 409 |
|
|
|
|
| 435 |
"all {target_subtype}s within {anchor_name}",
|
| 436 |
"{target_subtype}s of {anchor_name}",
|
| 437 |
"show every {target_subtype} in {anchor_name}",
|
| 438 |
+
"show {target_subtype}s inside {anchor_name}",
|
| 439 |
+
"find {target_subtype}s in {anchor_name}",
|
| 440 |
+
"give me the {target_subtype}s in {anchor_name}",
|
| 441 |
+
"{anchor_name} {target_subtype}s",
|
| 442 |
+
"which {target_subtype}s does {anchor_name} contain?",
|
| 443 |
+
"what all {target_subtype}s are there in {anchor_name}?",
|
| 444 |
+
"{target_subtype}s under {anchor_name}",
|
| 445 |
],
|
| 446 |
),
|
| 447 |
|
|
|
|
| 471 |
"{anchor_name} is part of which country?",
|
| 472 |
"where is {anchor_name}",
|
| 473 |
"what country is {anchor_name} in",
|
| 474 |
+
"{anchor_name} belongs to which country?",
|
| 475 |
+
"show country for {anchor_name}",
|
| 476 |
+
"find the country of {anchor_name}",
|
| 477 |
+
"which country does {anchor_name} fall in?",
|
| 478 |
+
"{anchor_name} under which country",
|
| 479 |
+
"tell me the country for {anchor_name}",
|
| 480 |
],
|
| 481 |
),
|
| 482 |
|
|
|
|
| 505 |
"all regions inside the {anchor_name}",
|
| 506 |
"what {target_subtype}s does the {anchor_name} contain?",
|
| 507 |
"{target_subtype}s covered by the {anchor_name}",
|
| 508 |
+
"which regions are in the {anchor_name} basin?",
|
| 509 |
+
"what admin regions lie within the {anchor_name}?",
|
| 510 |
+
"which provinces are inside the {anchor_name}?",
|
| 511 |
+
"show the regions in the {anchor_name}",
|
| 512 |
+
"find provinces inside the {anchor_name}",
|
| 513 |
+
"give me admin regions within the {anchor_name}",
|
| 514 |
+
"regions belonging to the {anchor_name}",
|
| 515 |
+
"areas contained in the {anchor_name}",
|
| 516 |
],
|
| 517 |
),
|
| 518 |
|
|
|
|
| 543 |
"which {target_subtype}s overlap {anchor_name}?",
|
| 544 |
"{target_subtype}s partially inside {anchor_name}",
|
| 545 |
"what {target_subtype}s extend into {anchor_name}?",
|
| 546 |
+
"show {target_subtype}s overlapping {anchor_name}",
|
| 547 |
+
"find {target_subtype}s crossing {anchor_name}",
|
| 548 |
+
"{target_subtype}s meeting {anchor_name}",
|
| 549 |
+
"areas intersecting {anchor_name}",
|
| 550 |
+
"{anchor_name} overlapping {target_subtype}s",
|
| 551 |
+
"which {target_subtype}s are partly in {anchor_name}?",
|
| 552 |
],
|
| 553 |
),
|
| 554 |
|
|
|
|
| 579 |
"countries along the {anchor_name}",
|
| 580 |
"what countries does the {anchor_name} cover?",
|
| 581 |
"countries the {anchor_name} spans across",
|
| 582 |
+
"what countries is the {anchor_name} in?",
|
| 583 |
+
"which countries lie along the {anchor_name}?",
|
| 584 |
+
"what countries does the {anchor_name} run through?",
|
| 585 |
+
"which countries border the {anchor_name}?",
|
| 586 |
],
|
| 587 |
),
|
| 588 |
|
|
|
|
| 674 |
"what falls within {buffer_km} km of the {anchor_name}?",
|
| 675 |
"admin divisions within a {buffer_km} km radius of the {anchor_name}",
|
| 676 |
"places within {buffer_km} kilometers of the {anchor_name}",
|
| 677 |
+
"what places are near the {anchor_name}?",
|
| 678 |
+
"what admin areas are close to the {anchor_name}?",
|
| 679 |
+
"which regions are around the {anchor_name}?",
|
| 680 |
+
"what lies within {buffer_km} km of the shoreline of the {anchor_name}?",
|
| 681 |
+
"show places around the {anchor_name}",
|
| 682 |
+
"find areas near the {anchor_name}",
|
| 683 |
+
"admin units close to the {anchor_name}",
|
| 684 |
+
"what is around the {anchor_name} within {buffer_km} km",
|
| 685 |
+
"give me nearby admin regions for the {anchor_name}",
|
| 686 |
],
|
| 687 |
),
|
| 688 |
|
|
|
|
| 709 |
"admin units within {buffer_m} m of the {anchor_name}",
|
| 710 |
"places within {buffer_m} metres of the {anchor_name}",
|
| 711 |
"{buffer_m} meter buffer around the {anchor_name}",
|
| 712 |
+
"what places are right next to the {anchor_name}?",
|
| 713 |
+
"what lies close to the {anchor_name}?",
|
| 714 |
+
"which admin units are near the edge of the {anchor_name}?",
|
| 715 |
+
"show places very near the {anchor_name}",
|
| 716 |
+
"find admin areas next to the {anchor_name}",
|
| 717 |
+
"what is just around the {anchor_name}",
|
| 718 |
+
"give me places within {buffer_m} m of the {anchor_name}",
|
| 719 |
+
],
|
| 720 |
+
),
|
| 721 |
+
|
| 722 |
+
SQLTemplate(
|
| 723 |
+
template_id="buffer_06",
|
| 724 |
+
family="buffer",
|
| 725 |
+
sql_difficulty="medium",
|
| 726 |
+
anchor_source="divisions_area",
|
| 727 |
+
num_anchors=1,
|
| 728 |
+
requires_buffer=True,
|
| 729 |
+
sql_template=(
|
| 730 |
+
"SELECT ST_AsGeoJSON(ST_Buffer(geometry, {buffer_km} * 1000.0 / 111320.0)) AS geometry"
|
| 731 |
+
" FROM read_parquet('divisions_area')"
|
| 732 |
+
" WHERE id = '{anchor_id}'"
|
| 733 |
+
),
|
| 734 |
+
question_hints=[
|
| 735 |
+
"{buffer_km} km buffer around {anchor_name}",
|
| 736 |
+
"draw a {buffer_km} km buffer around {anchor_name}",
|
| 737 |
+
"show the {buffer_km} km buffer around {anchor_name}",
|
| 738 |
+
"create a {buffer_km} kilometer buffer around {anchor_name}",
|
| 739 |
+
"map a {buffer_km} km radius around {anchor_name}",
|
| 740 |
+
"outline the {buffer_km} km buffer around {anchor_name}",
|
| 741 |
+
"make a {buffer_km} km zone around {anchor_name}",
|
| 742 |
+
"generate a {buffer_km} km radius for {anchor_name}",
|
| 743 |
+
"buffer {anchor_name} by {buffer_km} km",
|
| 744 |
+
"show radius {buffer_km} km from {anchor_name}",
|
| 745 |
+
"{anchor_name} with a {buffer_km} km buffer",
|
| 746 |
],
|
| 747 |
),
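The `buffer_06` template above turns kilometres into degrees with the factor `1000.0 / 111320.0`, which treats one degree as roughly 111.32 km. The sketch below shows how the placeholders might be filled and what that conversion evaluates to for a 10 km buffer. The helper names `km_to_degrees` and `render_buffer_sql` and the sample id `example-id` are assumptions made for illustration; they are not taken from the repository.

```python
# Illustrative sketch only: helper names and the sample id are hypothetical.
# The SQL string is copied from the buffer_06 template in this diff.
BUFFER_06_SQL = (
    "SELECT ST_AsGeoJSON(ST_Buffer(geometry, {buffer_km} * 1000.0 / 111320.0)) AS geometry"
    " FROM read_parquet('divisions_area')"
    " WHERE id = '{anchor_id}'"
)

# Approximate metres per degree, the constant used inside the template.
METERS_PER_DEGREE = 111320.0


def km_to_degrees(buffer_km: float) -> float:
    """Mirror the template's arithmetic: km -> metres -> degrees."""
    return buffer_km * 1000.0 / METERS_PER_DEGREE


def render_buffer_sql(anchor_id: str, buffer_km: float) -> str:
    """Fill the two placeholders of the template with concrete values."""
    return BUFFER_06_SQL.format(anchor_id=anchor_id, buffer_km=buffer_km)


sql = render_buffer_sql("example-id", 10)
degrees = km_to_degrees(10)
print(sql)
print(f"10 km is about {degrees:.5f} degrees")
```

Because the buffer distance is expressed in degrees, it is closest to the requested distance in the north-south direction; east-west distances per degree shrink with latitude, so the resulting polygon is only an approximation away from the equator.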
|
| 748 |
|
|
|
|
| 780 |
"{target_subtype}s in {anchor_name} bordering the sea",
|
| 781 |
"oceanfront {target_subtype}s in {anchor_name}",
|
| 782 |
"which {target_subtype}s in {anchor_name} have a coastline?",
|
| 783 |
+
"show coastal {target_subtype}s in {anchor_name}",
|
| 784 |
+
"find {target_subtype}s of {anchor_name} on the shore",
|
| 785 |
+
"{target_subtype}s of {anchor_name} by the sea",
|
| 786 |
+
"which {target_subtype}s of {anchor_name} touch the ocean?",
|
| 787 |
+
"coastline {target_subtype}s in {anchor_name}",
|
| 788 |
],
|
| 789 |
),
|
| 790 |
|
|
|
|
| 817 |
"{target_subtype}s in {anchor_name} with no coastline",
|
| 818 |
"which {target_subtype}s within {anchor_name} are landlocked?",
|
| 819 |
"interior {target_subtype}s of {anchor_name} with no ocean border",
|
| 820 |
+
"show inland {target_subtype}s in {anchor_name}",
|
| 821 |
+
"find non-coastal {target_subtype}s of {anchor_name}",
|
| 822 |
+
"{target_subtype}s of {anchor_name} away from the sea",
|
| 823 |
+
"which {target_subtype}s in {anchor_name} are not coastal?",
|
| 824 |
+
"inner {target_subtype}s of {anchor_name}",
|
| 825 |
],
|
| 826 |
),
|
| 827 |
|
|
|
|
| 843 |
" AND ST_Within(b.geometry, region.geometry)"
|
| 844 |
" AND EXISTS ("
|
| 845 |
" SELECT 1 FROM read_parquet('natural_earth') AS n"
|
| 846 |
+
" WHERE n.subtype IN ('range/mtn', 'island group', 'peninsula', 'depression')"
|
| 847 |
" AND ST_Intersects(b.geometry, n.geometry)"
|
| 848 |
" )"
|
| 849 |
),
|
|
|
|
| 852 |
"{target_subtype}s of {anchor_name} on a peninsula or island group",
|
| 853 |
"{target_subtype}s within {anchor_name} on notable landforms",
|
| 854 |
"island and peninsula {target_subtype}s of {anchor_name}",
|
| 855 |
+
"show {target_subtype}s in {anchor_name} on major landforms",
|
| 856 |
+
"find {target_subtype}s of {anchor_name} on islands or peninsulas",
|
| 857 |
+
"{target_subtype}s in {anchor_name} on terrain regions",
|
| 858 |
+
"places in {anchor_name} associated with islands or peninsulas",
|
| 859 |
],
|
| 860 |
),
|
| 861 |
|
|
|
|
| 888 |
"{anchor_1_name} with {anchor_2_name} cut out",
|
| 889 |
"subtract {anchor_2_name} from {anchor_1_name}",
|
| 890 |
"what's left of {anchor_1_name} after removing {anchor_2_name}?",
|
| 891 |
+
"difference between {anchor_1_name} and {anchor_2_name}",
|
| 892 |
+
"keep only {anchor_1_name} outside {anchor_2_name}",
|
| 893 |
+
"cut {anchor_2_name} out of {anchor_1_name}",
|
| 894 |
+
"show {anchor_1_name} without {anchor_2_name}",
|
| 895 |
],
|
| 896 |
),
|
| 897 |
|
|
|
|
| 920 |
"{anchor_name} with the {clip_feature_name} removed",
|
| 921 |
"what's left of {anchor_name} after removing the {clip_feature_name}?",
|
| 922 |
"show me {anchor_name} excluding the {clip_feature_name}",
|
| 923 |
+
"keep only the part of {anchor_name} outside the {clip_feature_name}",
|
| 924 |
+
"cut the {clip_feature_name} out of {anchor_name}",
|
| 925 |
+
"difference of {anchor_name} and the {clip_feature_name}",
|
| 926 |
+
"{anchor_name} after subtracting the {clip_feature_name}",
|
| 927 |
],
|
| 928 |
),
|
| 929 |
|
|
|
|
| 960 |
"the region straddling the border of {anchor_1_name} and {anchor_2_name} within {buffer_km} km",
|
| 961 |
"{buffer_km} km on either side of the {anchor_1_name} and {anchor_2_name} border",
|
| 962 |
"buffer the {anchor_1_name}-{anchor_2_name} boundary by {buffer_km} km",
|
| 963 |
+
"show the border zone between {anchor_1_name} and {anchor_2_name}",
|
| 964 |
+
"map the corridor along the {anchor_1_name}-{anchor_2_name} border",
|
| 965 |
+
"find the area near the border of {anchor_1_name} and {anchor_2_name}",
|
| 966 |
+
"give me the border buffer for {anchor_1_name} and {anchor_2_name}",
|
| 967 |
+
"border area of {anchor_1_name} and {anchor_2_name} within {buffer_km} km",
|
| 968 |
],
|
| 969 |
),
|
| 970 |
|
|
|
|
| 1036 |
"show {target_subtype}s across {anchor_1_name} and {anchor_2_name}",
|
| 1037 |
"{target_subtype}s belonging to {anchor_1_name} and {anchor_2_name}",
|
| 1038 |
"list {target_subtype}s in both {anchor_1_name} and {anchor_2_name}",
|
| 1039 |
+
"give me {target_subtype}s from {anchor_1_name} and {anchor_2_name}",
|
| 1040 |
+
"{anchor_1_name} and {anchor_2_name} {target_subtype}s",
|
| 1041 |
+
"show all {target_subtype}s for {anchor_1_name} plus {anchor_2_name}",
|
| 1042 |
+
"find {target_subtype}s across both {anchor_1_name} and {anchor_2_name}",
|
| 1043 |
],
|
| 1044 |
),
|
| 1045 |
|
|
|
|
| 1062 |
"all {target_subtype}s in {anchor_1_name}, {anchor_2_name} and {anchor_3_name}",
|
| 1063 |
"show {target_subtype}s across {anchor_1_name}, {anchor_2_name} and {anchor_3_name}",
|
| 1064 |
"list {target_subtype}s in {anchor_1_name}, {anchor_2_name} and {anchor_3_name}",
|
| 1065 |
+
"give me {target_subtype}s from {anchor_1_name}, {anchor_2_name}, and {anchor_3_name}",
|
| 1066 |
+
"{anchor_1_name}, {anchor_2_name}, and {anchor_3_name} {target_subtype}s",
|
| 1067 |
+
"show all {target_subtype}s for these three: {anchor_1_name}, {anchor_2_name}, {anchor_3_name}",
|
| 1068 |
+
"find {target_subtype}s across {anchor_1_name}, {anchor_2_name}, and {anchor_3_name}",
|
| 1069 |
],
|
| 1070 |
),
|
| 1071 |
|
|
|
|
| 1092 |
"union of all {target_subtype}s within {anchor_name}",
|
| 1093 |
"all {target_subtype}s of {anchor_name} merged together",
|
| 1094 |
"the overall extent of {target_subtype}s in {anchor_name}",
|
| 1095 |
+
"show one merged shape for all {target_subtype}s in {anchor_name}",
|
| 1096 |
+
"dissolve all {target_subtype}s in {anchor_name}",
|
| 1097 |
+
"make one geometry from all {target_subtype}s in {anchor_name}",
|
| 1098 |
+
"single outline of all {target_subtype}s in {anchor_name}",
|
| 1099 |
+
"combine the {target_subtype}s of {anchor_name} into one area",
|
| 1100 |
],
|
| 1101 |
),
|
| 1102 |
|
|
|
|
| 1129 |
"the top half of {anchor_name}",
|
| 1130 |
"northern portion of {anchor_name}",
|
| 1131 |
"upper half of {anchor_name}",
|
| 1132 |
+
"top side of {anchor_name}",
|
| 1133 |
+
"show north half of {anchor_name}",
|
| 1134 |
+
"cut {anchor_name} to the north half",
|
| 1135 |
+
"only the northern side of {anchor_name}",
|
| 1136 |
+
"north part of {anchor_name}",
|
| 1137 |
],
|
| 1138 |
),
|
| 1139 |
|
|
|
|
| 1163 |
"the bottom half of {anchor_name}",
|
| 1164 |
"southern portion of {anchor_name}",
|
| 1165 |
"lower half of {anchor_name}",
|
| 1166 |
+
"bottom side of {anchor_name}",
|
| 1167 |
+
"show south half of {anchor_name}",
|
| 1168 |
+
"cut {anchor_name} to the south half",
|
| 1169 |
+
"only the southern side of {anchor_name}",
|
| 1170 |
+
"south part of {anchor_name}",
|
| 1171 |
],
|
| 1172 |
),
|
| 1173 |
|
|
|
|
| 1196 |
"eastern part of {anchor_name}",
|
| 1197 |
"the right half of {anchor_name}",
|
| 1198 |
"eastern portion of {anchor_name}",
|
| 1199 |
+
"east side of {anchor_name}",
|
| 1200 |
+
"show east half of {anchor_name}",
|
| 1201 |
+
"cut {anchor_name} to the east half",
|
| 1202 |
+
"only the eastern side of {anchor_name}",
|
| 1203 |
+
"right side of {anchor_name}",
|
| 1204 |
],
|
| 1205 |
),
|
| 1206 |
|
|
|
|
| 1229 |
"western part of {anchor_name}",
|
| 1230 |
"the left half of {anchor_name}",
|
| 1231 |
"western portion of {anchor_name}",
|
| 1232 |
+
"west side of {anchor_name}",
|
| 1233 |
+
"show west half of {anchor_name}",
|
| 1234 |
+
"cut {anchor_name} to the west half",
|
| 1235 |
+
"only the western side of {anchor_name}",
|
| 1236 |
+
"left side of {anchor_name}",
|
| 1237 |
],
|
| 1238 |
),
|
| 1239 |
|
|
|
|
| 1265 |
"{clip_feature_name} inside {anchor_name}",
|
| 1266 |
"parts of {anchor_name} covered by the {clip_feature_name}",
|
| 1267 |
"show me where {anchor_name} and the {clip_feature_name} overlap",
|
| 1268 |
+
"keep only the part of {anchor_name} in the {clip_feature_name}",
|
| 1269 |
+
"intersection of {anchor_name} and the {clip_feature_name}",
|
| 1270 |
+
"give me the overlap between {anchor_name} and the {clip_feature_name}",
|
| 1271 |
+
"only the shared area of {anchor_name} and the {clip_feature_name}",
|
| 1272 |
],
|
| 1273 |
),
|
| 1274 |
|
|
|
|
| 1303 |
"the {top_n} biggest {target_subtype}s within {anchor_name}",
|
| 1304 |
"largest {target_subtype} in {anchor_name}",
|
| 1305 |
"which {target_subtype} in {anchor_name} has the most area?",
|
| 1306 |
+
"show the biggest {target_subtype}s in {anchor_name}",
|
| 1307 |
+
"list the largest {target_subtype}s for {anchor_name}",
|
| 1308 |
+
"give me the biggest {target_subtype}s in {anchor_name}",
|
| 1309 |
+
"{anchor_name} largest {target_subtype}s",
|
| 1310 |
],
|
| 1311 |
),
|
| 1312 |
|
|
|
|
| 1338 |
"the {top_n} tiniest {target_subtype}s within {anchor_name}",
|
| 1339 |
"smallest {target_subtype} in {anchor_name}",
|
| 1340 |
"which {target_subtype} in {anchor_name} has the least area?",
|
| 1341 |
+
"show the smallest {target_subtype}s in {anchor_name}",
|
| 1342 |
+
"list the smallest {target_subtype}s for {anchor_name}",
|
| 1343 |
+
"give me the tiniest {target_subtype}s in {anchor_name}",
|
| 1344 |
+
"{anchor_name} smallest {target_subtype}s",
|
| 1345 |
],
|
| 1346 |
),
|
| 1347 |
|
|
|
|
| 1370 |
"the {top_n} largest {target_subtype}s in {anchor_name}",
|
| 1371 |
"biggest {target_subtype} in {anchor_name}",
|
| 1372 |
"which {target_subtype} in {anchor_name} is the largest?",
|
| 1373 |
+
"show the largest {target_subtype}s in {anchor_name}",
|
| 1374 |
+
"list the biggest {target_subtype}s in {anchor_name}",
|
| 1375 |
+
"give me the largest {target_subtype}s for {anchor_name}",
|
| 1376 |
+
"{anchor_name} biggest {target_subtype}s",
|
| 1377 |
],
|
| 1378 |
),
|
| 1379 |
|
|
|
|
| 1402 |
"the {top_n} smallest {target_subtype}s in {anchor_name}",
|
| 1403 |
"smallest {target_subtype} in {anchor_name}",
|
| 1404 |
"which {target_subtype} in {anchor_name} is the smallest?",
|
| 1405 |
+
"show the smallest {target_subtype}s in {anchor_name}",
|
| 1406 |
+
"list the smallest {target_subtype}s in {anchor_name}",
|
| 1407 |
+
"give me the smallest {target_subtype}s for {anchor_name}",
|
| 1408 |
+
"{anchor_name} tiniest {target_subtype}s",
|
| 1409 |
],
|
| 1410 |
),
|
| 1411 |
|
|
|
|
| 1439 |
"biggest {target_subtype} per region in {anchor_name}",
|
| 1440 |
"largest {target_subtype} for every region of {anchor_name}",
|
| 1441 |
"the biggest {target_subtype} in each province of {anchor_name}",
|
| 1442 |
+
"show the largest {target_subtype} in every region of {anchor_name}",
|
| 1443 |
+
"list biggest {target_subtype} by region in {anchor_name}",
|
| 1444 |
+
"for each region in {anchor_name}, give the largest {target_subtype}",
|
| 1445 |
+
"largest {target_subtype}s grouped by region in {anchor_name}",
|
| 1446 |
+
"one biggest {target_subtype} for each region of {anchor_name}",
|
| 1447 |
],
|
| 1448 |
),
|
| 1449 |
|
|
|
|
| 1474 |
"smallest {target_subtype} per region in {anchor_name}",
|
| 1475 |
"tiniest {target_subtype} for every region of {anchor_name}",
|
| 1476 |
"the smallest {target_subtype} in each province of {anchor_name}",
|
| 1477 |
+
"show the smallest {target_subtype} in every region of {anchor_name}",
|
| 1478 |
+
"list smallest {target_subtype} by region in {anchor_name}",
|
| 1479 |
+
"for each region in {anchor_name}, give the smallest {target_subtype}",
|
| 1480 |
+
"smallest {target_subtype}s grouped by region in {anchor_name}",
|
| 1481 |
+
"one tiniest {target_subtype} for each region of {anchor_name}",
|
| 1482 |
],
|
| 1483 |
),
|
| 1484 |
|
|
|
|
| 1507 |
"{anchor_name}'s land {target_subtype}s",
|
| 1508 |
"dependencies of {anchor_name} with land area",
|
| 1509 |
"show the land dependencies of {anchor_name}",
|
| 1510 |
+
"find land dependencies of {anchor_name}",
|
| 1511 |
+
"give me {anchor_name} dependencies on land",
|
| 1512 |
+
"which dependencies of {anchor_name} are land-based?",
|
| 1513 |
+
"non-island {target_subtype}s of {anchor_name}",
|
| 1514 |
],
|
| 1515 |
),
|
| 1516 |
|
|
|
|
| 1534 |
"official territorial divisions of {anchor_name}",
|
| 1535 |
"recognised territorial {target_subtype}s belonging to {anchor_name}",
|
| 1536 |
"which territorial regions does {anchor_name} have?",
|
| 1537 |
+
"show territorial {target_subtype}s of {anchor_name}",
|
| 1538 |
+
"find official territorial regions of {anchor_name}",
|
| 1539 |
+
"give me recognised territorial {target_subtype}s of {anchor_name}",
|
| 1540 |
+
"territorial regions under {anchor_name}",
|
| 1541 |
+
"{anchor_name} official territorial {target_subtype}s",
|
| 1542 |
],
|
| 1543 |
),
|
| 1544 |
|
|
|
|
| 1562 |
"{target_subtype}s of {anchor_name} that are not on land",
|
| 1563 |
"water-associated {target_subtype}s of {anchor_name}",
|
| 1564 |
"marine or offshore {target_subtype}s of {anchor_name}",
|
| 1565 |
+
"show offshore {target_subtype}s of {anchor_name}",
|
| 1566 |
+
"find {target_subtype}s of {anchor_name} in water",
|
| 1567 |
+
"give me non-land {target_subtype}s of {anchor_name}",
|
| 1568 |
+
"water-side {target_subtype}s of {anchor_name}",
|
| 1569 |
+
"{anchor_name} {target_subtype}s not on land",
|
| 1570 |
],
|
| 1571 |
),
|
| 1572 |
|
|
|
|
| 1588 |
" SELECT n.id, n.names.\"primary\" AS name, n.subtype,"
|
| 1589 |
" ST_AsGeoJSON(n.geometry) AS geometry"
|
| 1590 |
" FROM read_parquet('natural_earth') AS n, a"
|
| 1591 |
+
" WHERE n.subtype IN ('river', 'lake', 'basin')"
|
| 1592 |
" AND ST_Intersects(a.geometry, n.geometry)"
|
| 1593 |
),
|
| 1594 |
question_hints=[
|
|
|
|
| 1600 |
"what bodies of water cross {anchor_name}?",
|
| 1601 |
"rivers of {anchor_name}",
|
| 1602 |
"show me the lakes in {anchor_name}",
|
| 1603 |
+
"what rivers run through {anchor_name}?",
|
| 1604 |
+
"which lakes lie in {anchor_name}?",
|
| 1605 |
+
"what waterways are in {anchor_name}?",
|
| 1606 |
+
"which basins, lakes, or rivers are in {anchor_name}?",
|
| 1607 |
],
|
| 1608 |
),
|
| 1609 |
|
|
|
|
| 1613 |
sql_difficulty="medium",
|
| 1614 |
anchor_source="divisions_area",
|
| 1615 |
num_anchors=1,
|
| 1616 |
+
target_subtype="range/mtn",
|
| 1617 |
sql_template=(
|
| 1618 |
"WITH a AS ("
|
| 1619 |
" SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
|
|
|
|
| 1621 |
" SELECT n.id, n.names.\"primary\" AS name, n.subtype,"
|
| 1622 |
" ST_AsGeoJSON(n.geometry) AS geometry"
|
| 1623 |
" FROM read_parquet('natural_earth') AS n, a"
|
| 1624 |
+
" WHERE n.subtype = 'range/mtn'"
|
| 1625 |
" AND ST_Intersects(a.geometry, n.geometry)"
|
| 1626 |
),
|
| 1627 |
question_hints=[
|
| 1628 |
"what mountain ranges are in {anchor_name}?",
|
|
|
|
| 1629 |
"which mountain ranges cross {anchor_name}?",
|
|
|
|
|
|
|
|
|
|
| 1630 |
"mountains of {anchor_name}",
|
| 1631 |
+
"mountain regions in {anchor_name}",
|
| 1632 |
+
"what hills are in {anchor_name}?",
|
| 1633 |
+
"which hills cross {anchor_name}?",
|
| 1634 |
+
"ghats of {anchor_name}",
|
| 1635 |
+
"what ghats are in {anchor_name}?",
|
| 1636 |
+
"highlands in {anchor_name}",
|
| 1637 |
+
"mountain belts within {anchor_name}",
|
| 1638 |
+
],
|
| 1639 |
+
),
|
| 1640 |
+
|
| 1641 |
+
SQLTemplate(
|
| 1642 |
+
template_id="adj_07",
|
| 1643 |
+
family="adjacency",
|
| 1644 |
+
sql_difficulty="medium",
|
| 1645 |
+
anchor_source="divisions_area",
|
| 1646 |
+
num_anchors=1,
|
| 1647 |
+
target_subtype="plateau",
|
| 1648 |
+
sql_template=(
|
| 1649 |
+
"WITH a AS ("
|
| 1650 |
+
" SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
|
| 1651 |
+
")"
|
| 1652 |
+
" SELECT n.id, n.names.\"primary\" AS name, n.subtype,"
|
| 1653 |
+
" ST_AsGeoJSON(n.geometry) AS geometry"
|
| 1654 |
+
" FROM read_parquet('natural_earth') AS n, a"
|
| 1655 |
+
" WHERE n.subtype = 'plateau'"
|
| 1656 |
+
" AND ST_Intersects(a.geometry, n.geometry)"
|
| 1657 |
+
),
|
| 1658 |
+
question_hints=[
|
| 1659 |
+
"what plateaus are in {anchor_name}?",
|
| 1660 |
+
"which plateaus cross {anchor_name}?",
|
| 1661 |
+
"uplands in {anchor_name}",
|
| 1662 |
+
"what uplands are in {anchor_name}?",
|
| 1663 |
+
"tablelands of {anchor_name}",
|
| 1664 |
+
"plateau regions within {anchor_name}",
|
| 1665 |
+
"show plateaus in {anchor_name}",
|
| 1666 |
+
"find uplands of {anchor_name}",
|
| 1667 |
+
"give me plateau areas in {anchor_name}",
|
| 1668 |
+
"{anchor_name} plateaus and uplands",
|
| 1669 |
+
"what tablelands are there in {anchor_name}?",
|
| 1670 |
+
],
|
| 1671 |
+
),
|
| 1672 |
+
|
| 1673 |
+
SQLTemplate(
|
| 1674 |
+
template_id="adj_08",
|
| 1675 |
+
family="adjacency",
|
| 1676 |
+
sql_difficulty="medium",
|
| 1677 |
+
anchor_source="divisions_area",
|
| 1678 |
+
num_anchors=1,
|
| 1679 |
+
target_subtype="landform",
|
| 1680 |
+
sql_template=(
|
| 1681 |
+
"WITH a AS ("
|
| 1682 |
+
" SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
|
| 1683 |
+
")"
|
| 1684 |
+
" SELECT n.id, n.names.\"primary\" AS name, n.subtype,"
|
| 1685 |
+
" ST_AsGeoJSON(n.geometry) AS geometry"
|
| 1686 |
+
" FROM read_parquet('natural_earth') AS n, a"
|
| 1687 |
+
" WHERE n.subtype IN ('plain', 'lowland', 'basin', 'valley', 'depression', 'gorge')"
|
| 1688 |
+
" AND ST_Intersects(a.geometry, n.geometry)"
|
| 1689 |
+
),
|
| 1690 |
+
question_hints=[
|
| 1691 |
+
"what plains are in {anchor_name}?",
|
| 1692 |
+
"basins and valleys of {anchor_name}",
|
| 1693 |
+
"which basins are in {anchor_name}?",
|
| 1694 |
+
"what valleys cross {anchor_name}?",
|
| 1695 |
+
"lowlands in {anchor_name}",
|
| 1696 |
+
"major landforms in {anchor_name}",
|
| 1697 |
+
"plains, basins, and valleys within {anchor_name}",
|
| 1698 |
+
"show me the main landforms in {anchor_name}",
|
| 1699 |
+
"landforms of {anchor_name}",
|
| 1700 |
+
"find plains and basins in {anchor_name}",
|
| 1701 |
+
"{anchor_name} valleys and lowlands",
|
| 1702 |
],
|
| 1703 |
),
|
| 1704 |
|
|
|
|
| 1733 |
"what provinces does the {anchor_name} span?",
|
| 1734 |
"regions along the {anchor_name}",
|
| 1735 |
"which provinces overlap the {anchor_name}?",
|
| 1736 |
+
"which regions is the {anchor_name} in?",
|
| 1737 |
+
"what states does the {anchor_name} run through?",
|
| 1738 |
+
"which provinces lie along the {anchor_name}?",
|
| 1739 |
+
"show regions crossed by the {anchor_name}",
|
| 1740 |
+
"find administrative regions along the {anchor_name}",
|
| 1741 |
+
"give me the regions touched by the {anchor_name}",
|
| 1742 |
+
"regions of the {anchor_name}",
|
| 1743 |
],
|
| 1744 |
),
|
| 1745 |
|
|
|
|
| 1766 |
"everything natural that touches {anchor_name}",
|
| 1767 |
"what geographic features does {anchor_name} contain?",
|
| 1768 |
"natural features within or crossing {anchor_name}",
|
| 1769 |
+
"show natural features overlapping {anchor_name}",
|
| 1770 |
+
"find the natural features in or across {anchor_name}",
|
| 1771 |
+
"{anchor_name} intersecting natural features",
|
| 1772 |
+
"what physical features are associated with {anchor_name}?",
|
| 1773 |
+
],
|
| 1774 |
+
),
|
| 1775 |
+
|
| 1776 |
+
SQLTemplate(
|
| 1777 |
+
template_id="intersect_05",
|
| 1778 |
+
family="intersection",
|
| 1779 |
+
sql_difficulty="medium-hard",
|
| 1780 |
+
anchor_source="natural_earth",
|
| 1781 |
+
num_anchors=1,
|
| 1782 |
+
target_subtype="county",
|
| 1783 |
+
sql_template=(
|
| 1784 |
+
"WITH a AS ("
|
| 1785 |
+
" SELECT geometry FROM read_parquet('natural_earth') WHERE id = '{anchor_id}'"
|
| 1786 |
+
")"
|
| 1787 |
+
" SELECT b.id, b.names.\"primary\" AS name, b.subtype, b.country,"
|
| 1788 |
+
" ST_AsGeoJSON(b.geometry) AS geometry"
|
| 1789 |
+
" FROM read_parquet('divisions_area') AS b, a"
|
| 1790 |
+
" WHERE b.subtype = '{target_subtype}'"
|
| 1791 |
+
" AND ST_Intersects(b.geometry, a.geometry)"
|
| 1792 |
+
),
|
| 1793 |
+
question_hints=[
|
| 1794 |
+
"which districts does the {anchor_name} pass through?",
|
| 1795 |
+
"what districts does the {anchor_name} cross?",
|
| 1796 |
+
"districts intersected by the {anchor_name}",
|
| 1797 |
+
"which counties does the {anchor_name} flow through?",
|
| 1798 |
+
"what counties overlap the {anchor_name}?",
|
| 1799 |
+
"districts along the {anchor_name}",
|
| 1800 |
+
"which districts are crossed by the {anchor_name}?",
|
| 1801 |
+
"what districts is the {anchor_name} in?",
|
| 1802 |
+
"which counties lie along the {anchor_name}?",
|
| 1803 |
+
"what districts does the {anchor_name} run through?",
|
| 1804 |
+
"show districts crossed by the {anchor_name}",
|
| 1805 |
+
"find counties along the {anchor_name}",
|
| 1806 |
+
"give me the districts touched by the {anchor_name}",
|
| 1807 |
+
"districts of the {anchor_name}",
|
| 1808 |
],
|
| 1809 |
),
|
| 1810 |
|
|
|
|
| 1830 |
" AND ST_Within(b.geometry, region.geometry)"
|
| 1831 |
" AND EXISTS ("
|
| 1832 |
" SELECT 1 FROM read_parquet('natural_earth') AS n"
|
| 1833 |
+
" WHERE n.subtype IN ('river', 'lake', 'basin')"
|
| 1834 |
" AND ST_Intersects(b.geometry, n.geometry)"
|
| 1835 |
" )"
|
| 1836 |
),
|
|
|
|
| 1842 |
"{target_subtype}s in {anchor_name} that touch a river",
|
| 1843 |
"which {target_subtype}s in {anchor_name} are on a lake?",
|
| 1844 |
"waterfront {target_subtype}s of {anchor_name}",
|
| 1845 |
+
"show water-side {target_subtype}s in {anchor_name}",
|
| 1846 |
+
"find {target_subtype}s of {anchor_name} by rivers or lakes",
|
| 1847 |
+
"give me riverside places in {anchor_name}",
|
| 1848 |
+
"{target_subtype}s of {anchor_name} on the water",
|
| 1849 |
],
|
| 1850 |
),
|
| 1851 |
|
|
|
|
| 1867 |
" AND ST_Within(b.geometry, region.geometry)"
|
| 1868 |
" AND EXISTS ("
|
| 1869 |
" SELECT 1 FROM read_parquet('natural_earth') AS n"
|
| 1870 |
+
" WHERE n.subtype IN ('range/mtn', 'depression')"
|
| 1871 |
" AND ST_Intersects(b.geometry, n.geometry)"
|
| 1872 |
" )"
|
| 1873 |
),
|
|
|
|
| 1878 |
"highland {target_subtype}s within {anchor_name}",
|
| 1879 |
"{target_subtype}s of {anchor_name} in mountainous terrain",
|
| 1880 |
"{target_subtype}s in {anchor_name} near a mountain range",
|
| 1881 |
+
"show mountain {target_subtype}s in {anchor_name}",
|
| 1882 |
+
"find {target_subtype}s of {anchor_name} in hilly terrain",
|
| 1883 |
+
"give me highland {target_subtype}s of {anchor_name}",
|
| 1884 |
+
"{target_subtype}s of {anchor_name} by the mountains",
|
| 1885 |
],
|
| 1886 |
),
|
| 1887 |
|
|
|
|
| 1919 |
"{target_subtype}s of {anchor_name} with ocean access",
|
| 1920 |
"which {target_subtype}s in {anchor_name} touch the sea?",
|
| 1921 |
"maritime {target_subtype}s of {anchor_name}",
|
| 1922 |
+
"show coastal districts of {anchor_name}",
|
| 1923 |
+
"find {target_subtype}s of {anchor_name} by the sea",
|
| 1924 |
+
"give me shoreline {target_subtype}s in {anchor_name}",
|
| 1925 |
+
"{target_subtype}s of {anchor_name} on the coast",
|
| 1926 |
],
|
| 1927 |
),
|
| 1928 |
|
|
|
|
| 1955 |
"{target_subtype}s in {anchor_name} with no sea access",
|
| 1956 |
"non-coastal {target_subtype}s of {anchor_name}",
|
| 1957 |
"inland {target_subtype}s of {anchor_name}",
|
| 1958 |
+
"show inland districts of {anchor_name}",
|
| 1959 |
+
"find non-coastal {target_subtype}s in {anchor_name}",
|
| 1960 |
+
"give me inner {target_subtype}s of {anchor_name}",
|
| 1961 |
+
"{target_subtype}s of {anchor_name} away from the coast",
|
| 1962 |
],
|
| 1963 |
),
|
| 1964 |
|
|
|
|
| 1980 |
" AND ST_Within(b.geometry, region.geometry)"
|
| 1981 |
" AND EXISTS ("
|
| 1982 |
" SELECT 1 FROM read_parquet('natural_earth') AS n"
|
| 1983 |
+
" WHERE n.subtype IN ('river', 'lake', 'basin')"
|
| 1984 |
" AND ST_Intersects(b.geometry, n.geometry)"
|
| 1985 |
" )"
|
| 1986 |
),
|
|
|
|
| 1991 |
"lakeside {target_subtype}s within {anchor_name}",
|
| 1992 |
"{target_subtype}s of {anchor_name} along a river",
|
| 1993 |
"which {target_subtype}s in {anchor_name} border a lake?",
|
| 1994 |
+
"show water-side {target_subtype}s of {anchor_name}",
|
| 1995 |
+
"find {target_subtype}s in {anchor_name} near rivers and lakes",
|
| 1996 |
+
"give me riverside districts of {anchor_name}",
|
| 1997 |
+
"{target_subtype}s of {anchor_name} by the water",
|
| 1998 |
],
|
| 1999 |
),
|
| 2000 |
|
|
|
|
| 2016 |
" AND ST_Within(b.geometry, region.geometry)"
|
| 2017 |
" AND EXISTS ("
|
| 2018 |
" SELECT 1 FROM read_parquet('natural_earth') AS n"
|
| 2019 |
+
" WHERE n.subtype IN ('range/mtn', 'depression')"
|
| 2020 |
" AND ST_Intersects(b.geometry, n.geometry)"
|
| 2021 |
" )"
|
| 2022 |
),
|
|
|
|
| 2027 |
"highland {target_subtype}s within {anchor_name}",
|
| 2028 |
"{target_subtype}s of {anchor_name} in mountainous terrain",
|
| 2029 |
"which {target_subtype}s in {anchor_name} have mountain ranges?",
|
| 2030 |
+
"show mountain districts of {anchor_name}",
|
| 2031 |
+
"find {target_subtype}s of {anchor_name} in hilly areas",
|
| 2032 |
+
"give me highland {target_subtype}s in {anchor_name}",
|
| 2033 |
+
"{target_subtype}s of {anchor_name} by the mountains",
|
| 2034 |
],
|
| 2035 |
),
|
| 2036 |
|
|
|
|
| 2072 |
"seaside regions of {anchor_name}",
|
| 2073 |
"which provinces of {anchor_name} touch the sea?",
|
| 2074 |
"states of {anchor_name} along the coast",
|
| 2075 |
+
"show coastal regions of {anchor_name}",
|
| 2076 |
+
"find seaside states of {anchor_name}",
|
| 2077 |
+
"give me provinces of {anchor_name} on the coast",
|
| 2078 |
+
"{anchor_name} regions by the sea",
|
| 2079 |
],
|
| 2080 |
),
|
| 2081 |
|
|
|
|
| 2110 |
"regions of {anchor_name} without sea access",
|
| 2111 |
"interior states of {anchor_name}",
|
| 2112 |
"states of {anchor_name} that don't border the ocean",
|
| 2113 |
+
"show inland regions of {anchor_name}",
|
| 2114 |
+
"find non-coastal states of {anchor_name}",
|
| 2115 |
+
"give me inner provinces of {anchor_name}",
|
| 2116 |
+
"{anchor_name} regions away from the sea",
|
| 2117 |
],
|
| 2118 |
),
|
| 2119 |
|
|
|
|
| 2146 |
"which countries touch the {anchor_name}?",
|
| 2147 |
"countries with coastline on the {anchor_name}",
|
| 2148 |
"what nations lie on the {anchor_name}?",
|
| 2149 |
+
"which countries are on the coast of the {anchor_name}?",
|
| 2150 |
+
"what countries lie around the {anchor_name}?",
|
| 2151 |
+
"which nations have shores on the {anchor_name}?",
|
| 2152 |
+
"what countries front the {anchor_name}?",
|
| 2153 |
],
|
| 2154 |
),
|
| 2155 |
|
|
|
|
| 2182 |
"everything within {buffer_km} km of the {anchor_name}",
|
| 2183 |
"what natural features are close to the {anchor_name}?",
|
| 2184 |
"{buffer_km} km radius around the {anchor_name}",
|
| 2185 |
+
"what natural features are around the {anchor_name}?",
|
| 2186 |
+
"what lies near the {anchor_name}?",
|
| 2187 |
+
"which features are close to the {anchor_name}?",
|
| 2188 |
+
"what natural features are near the shoreline of the {anchor_name}?",
|
| 2189 |
+
"show nearby natural features for the {anchor_name}",
|
| 2190 |
+
"find features around the {anchor_name}",
|
| 2191 |
+
"give me natural features near the {anchor_name}",
|
| 2192 |
+
"features around the {anchor_name}",
|
| 2193 |
+
"what is close to the {anchor_name} within {buffer_km} km",
|
| 2194 |
],
|
| 2195 |
),
|
| 2196 |
|
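The templates above pair one SQL string with a list of question hints, both using `{placeholder}` fields. A minimal sketch of how such a template might be filled for a single anchor follows. The `SQLTemplate` dataclass here is a stand-in that only mirrors the field names visible in the diff, and `render` is a hypothetical helper, not a function from the repository.

```python
from dataclasses import dataclass, field


@dataclass
class SQLTemplate:
    # Field names mirror the diff; the real class may define more fields.
    template_id: str
    family: str
    sql_difficulty: str
    anchor_source: str
    num_anchors: int
    target_subtype: str
    sql_template: str
    question_hints: list[str] = field(default_factory=list)


# Copy of the adj_07 template from the diff, with a shortened hint list.
ADJ_07 = SQLTemplate(
    template_id="adj_07",
    family="adjacency",
    sql_difficulty="medium",
    anchor_source="divisions_area",
    num_anchors=1,
    target_subtype="plateau",
    sql_template=(
        "WITH a AS ("
        " SELECT geometry FROM read_parquet('divisions_area') WHERE id = '{anchor_id}'"
        ")"
        " SELECT n.id, n.names.\"primary\" AS name, n.subtype,"
        " ST_AsGeoJSON(n.geometry) AS geometry"
        " FROM read_parquet('natural_earth') AS n, a"
        " WHERE n.subtype = 'plateau'"
        " AND ST_Intersects(a.geometry, n.geometry)"
    ),
    question_hints=[
        "what plateaus are in {anchor_name}?",
        "tablelands of {anchor_name}",
    ],
)


def render(template: SQLTemplate, anchor_id: str, anchor_name: str) -> tuple[str, list[str]]:
    """Fill the SQL and every question hint for one anchor (hypothetical helper)."""
    values = {
        "anchor_id": anchor_id,
        "anchor_name": anchor_name,
        "target_subtype": template.target_subtype,
    }
    # str.format ignores unused keyword arguments, so one dict serves both strings.
    sql = template.sql_template.format(**values)
    questions = [hint.format(**values) for hint in template.question_hints]
    return sql, questions


sql, questions = render(ADJ_07, anchor_id="example-id", anchor_name="Odisha")
```

Templates whose hints use other placeholders, such as `{buffer_km}` or `{top_n}`, would need those values added to the dict; as written, `render` raises `KeyError` for them.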
src/gazet/config.py
CHANGED
|
@@ -80,7 +80,7 @@ Available DuckDB datasets (read via read_parquet):
|
|
| 80 |
columns:
|
| 81 |
id VARCHAR -- unique feature id prefixed 'ne_'
|
| 82 |
names STRUCT("primary" VARCHAR, ...)
|
| 83 |
-
subtype VARCHAR -- e.g. 'ocean', 'sea', 'bay', '
|
| 84 |
class VARCHAR
|
| 85 |
country VARCHAR
|
| 86 |
region VARCHAR
|
|
|
|
| 80 |
columns:
|
| 81 |
id VARCHAR -- unique feature id prefixed 'ne_'
|
| 82 |
names STRUCT("primary" VARCHAR, ...)
|
| 83 |
+
subtype VARCHAR -- e.g. 'ocean', 'sea', 'bay', 'range/mtn', 'island group'
|
| 84 |
class VARCHAR
|
| 85 |
country VARCHAR
|
| 86 |
region VARCHAR
|
src/gazet/lm.py
CHANGED
|
@@ -31,8 +31,6 @@ class ExtractPlaces(dspy.Signature):
|
|
| 31 |
- If the user mentions a natural earth physical feature, use the natural earth physical features.
|
| 32 |
- If the user mentions a place name that is not in the overture divisions or natural earth physical features, return the place name as is.
|
| 33 |
|
| 34 |
-
Where possible and relevant, also extract the ISO country code for each place.
|
| 35 |
-
|
| 36 |
Only extract place names that are explicitly mentioned in the query.
|
| 37 |
Do NOT generate or infer place names from your own knowledge.
|
| 38 |
For example:
|
|
@@ -41,47 +39,14 @@ class ExtractPlaces(dspy.Signature):
|
|
| 41 |
- "neighbouring states of Odisha" -> extract "Odisha", NOT neighbouring state names
|
| 42 |
|
| 43 |
Do not repeat the same place name in the result.
|
| 44 |
-
|
| 45 |
-
If the user does not explicitly mention a country, dont add the country code to the result.
|
| 46 |
-
|
| 47 |
-
If the user does not mention an admin level, dont add the subtype to the result.
|
| 48 |
-
|
| 49 |
-
If the query asks for some kind of subdivision (e.g. 'municipalities in Bern', 'States in Brazil'),
|
| 50 |
-
return the subdivision type in the places result.
|
| 51 |
-
|
| 52 |
-
When identifying a place name from the user's query, also infer the most appropriate
|
| 53 |
-
Overture division subtype from the list below. Only include a subtype if the query
|
| 54 |
-
makes it reasonably clear what geographic level is intended. If ambiguous, omit it.
|
| 55 |
-
|
| 56 |
-
SUBTYPES:
|
| 57 |
-
- country : Sovereign nation. E.g. "France", "Brazil"
|
| 58 |
-
- dependency : Territory dependent on a country but not a full sub-region. E.g. "Puerto Rico", "Guam"
|
| 59 |
-
- region : Largest admin unit within a country; state, province, canton, etc. E.g. "California", "Alberta", "Bavaria"
|
| 60 |
-
- county : Second-level admin subdivision within a region. E.g. "Kings County", "Kent"
|
| 61 |
-
- localadmin : A governing layer (common in Europe) that contains localities which have no authority of their own. E.g. a French commune or Belgian municipality. Use when the place is clearly an admin unit but not a city itself.
|
| 62 |
-
- locality : A populated place — city, town, village. The most common subtype for named settlements. E.g. "Lisbon", "Taipei", "Salt Lake City"
|
| 63 |
-
- macrohood : A large super-neighborhood grouping smaller neighborhoods. E.g. "BoCoCa" in Brooklyn
|
| 64 |
-
- neighborhood : A named community area within a city or town. E.g. "Cobble Hill", "Alfama"
|
| 65 |
-
- microhood : A mini-neighborhood within a neighborhood. Very fine-grained, rarely referenced explicitly.
|
| 66 |
-
|
| 67 |
-
HIERARCHY (coarse to fine):
|
| 68 |
-
country → dependency / region → county → localadmin → locality → macrohood → neighborhood → microhood
|
| 69 |
-
|
| 70 |
-
GUIDANCE:
|
| 71 |
-
- "Paris" with no qualifier → locality
|
| 72 |
-
- "Île-de-France" or "Catalonia" → region
|
| 73 |
-
- "the 11th arrondissement" → neighborhood (or localadmin)
|
| 74 |
-
- "Greater London" style phrasing → county or region depending on context
|
| 75 |
-
- If the user says "neighborhood in X" or "district of X" → neighborhood
|
| 76 |
-
- Default to locality for any named city/town if unsure
|
| 77 |
-
- Omit subtype entirely if the query gives no signal (e.g. bare coordinates or a POI name)
|
| 78 |
"""
|
| 79 |
|
| 80 |
query: str = dspy.InputField(
|
| 81 |
desc="Natural language query mentioning one or more place names"
|
| 82 |
)
|
| 83 |
result: PlacesResult = dspy.OutputField(
|
| 84 |
-
desc="Extracted
|
| 85 |
)
|
| 86 |
|
| 87 |
|
|
@@ -203,7 +168,7 @@ You have access to two DuckDB parquet tables. Given a set of candidate entities
|
|
| 203 |
id VARCHAR -- unique feature id prefixed 'ne_'
|
| 204 |
names STRUCT("primary" VARCHAR, ...)
|
| 205 |
country VARCHAR
|
| 206 |
-
subtype VARCHAR -- e.g. 'ocean', 'sea', 'bay', '
|
| 207 |
class VARCHAR
|
| 208 |
region VARCHAR
|
| 209 |
admin_level INTEGER
|
|
@@ -265,21 +230,68 @@ def _llama_chat_complete(messages: list[dict]) -> str:
|
|
| 265 |
return resp.json()["choices"][0]["message"]["content"]
|
| 266 |
|
| 267 |
|
| 268 |
-
_PLACES_SYSTEM_PROMPT = """You are a geographic entity extractor. Extract place names
|
| 269 |
|
| 270 |
OUTPUT FORMAT:
|
| 271 |
-
{"places": [{"place": "<name>"
|
| 272 |
-
"country" and "subtype" are optional; omit if not applicable.
|
| 273 |
|
| 274 |
RULES:
|
| 275 |
-
-
|
| 276 |
- No duplicate place names.
|
| 277 |
-
- "country": ISO 3166-1 alpha-2. Include only if explicitly mentioned or unambiguous.
|
| 278 |
-
- "subtype": include only when the geographic level is clear from the query.
|
| 279 |
|
| 280 |
-
|
| 281 |
-
|
| 282 |
-
-
|
| 283 |
|
| 284 |
|
| 285 |
def generate_places(user_query: str) -> PlacesResult:
|
|
@@ -318,7 +330,7 @@ def generate_sql(user_query: str, candidates_df: pd.DataFrame) -> str:
|
|
| 318 |
Single-shot — no retry loop (the finetuned model can't improve from error feedback).
|
| 319 |
"""
|
| 320 |
# Keep only columns the model was trained on
|
| 321 |
-
keep_cols = ["source", "id", "name", "subtype", "country", "region", "admin_level"
|
| 322 |
cols = [c for c in keep_cols if c in candidates_df.columns]
|
| 323 |
candidates_csv = candidates_df[cols].to_csv(index=False)
|
| 324 |
|
|
|
|
| 31 |
- If the user mentions a natural earth physical feature, use the natural earth physical features.
|
| 32 |
- If the user mentions a place name that is not in the overture divisions or natural earth physical features, return the place name as is.
|
| 33 |
|
|
|
|
|
|
|
| 34 |
Only extract place names that are explicitly mentioned in the query.
|
| 35 |
Do NOT generate or infer place names from your own knowledge.
|
| 36 |
For example:
|
|
|
|
| 39 |
- "neighbouring states of Odisha" -> extract "Odisha", NOT neighbouring state names
|
| 40 |
|
| 41 |
Do not repeat the same place name in the result.
|
| 42 |
+
Return only the place names, in the order they appear in the query.
|
| 43 |
"""
|
| 44 |
|
| 45 |
query: str = dspy.InputField(
|
| 46 |
desc="Natural language query mentioning one or more place names"
|
| 47 |
)
|
| 48 |
result: PlacesResult = dspy.OutputField(
|
| 49 |
+
desc="Extracted place names in query order"
|
| 50 |
)
|
| 51 |
|
| 52 |
|
|
|
|
| 168 |
id VARCHAR -- unique feature id prefixed 'ne_'
|
| 169 |
names STRUCT("primary" VARCHAR, ...)
|
| 170 |
country VARCHAR
|
| 171 |
+
subtype VARCHAR -- e.g. 'ocean', 'sea', 'bay', 'range/mtn', 'island group'
|
| 172 |
class VARCHAR
|
| 173 |
region VARCHAR
|
| 174 |
admin_level INTEGER
|
|
|
|
| 230 |
return resp.json()["choices"][0]["message"]["content"]
|
| 231 |
|
| 232 |
|
| 233 |
+
_PLACES_SYSTEM_PROMPT = """You are a geographic entity extractor. Extract the place names the user is asking about and return valid JSON only.
|
| 234 |
|
| 235 |
OUTPUT FORMAT:
|
| 236 |
+
{"places": [{"place": "<name>"}]}
|
|
|
|
| 237 |
|
| 238 |
RULES:
|
| 239 |
+
- Extract the place or places that are the actual anchors of the query.
|
| 240 |
+
- Physical features are valid places: oceans, seas, gulfs, bays, straits, rivers, lakes, basins, mountain ranges, peninsulas, island groups, deserts, and terrain regions.
|
| 241 |
+
- When a place is followed by its containing region, state, or country as disambiguation context ("Puri, Odisha", "Lisboa, Portugal", "Goa, India", "Manchester in US"), extract ONLY the specific place. Do not return the container as a separate place.
|
| 242 |
+
- When a query names two or more distinct anchors joined by words like "and", "both", "between", or mixes an admin area with a physical feature as separate anchors, extract every anchor in the order they appear.
|
| 243 |
+
- Do not infer or expand category nouns like "regions", "districts", "counties", "rivers", or "mountains" when they refer to a type rather than a specific named place ("regions of India" -> extract "India" only).
|
| 244 |
+
- Only extract places explicitly mentioned.
|
| 245 |
- No duplicate place names.
|
|
|
|
|
|
|
| 246 |
|
| 247 |
+
EXAMPLES:
|
| 248 |
+
Query: "Puri, Odisha"
|
| 249 |
+
-> {"places": [{"place": "Puri"}]}
|
| 250 |
+
|
| 251 |
+
Query: "Lisboa, Portugal"
|
| 252 |
+
-> {"places": [{"place": "Lisboa"}]}
|
| 253 |
+
|
| 254 |
+
Query: "Goa, India"
|
| 255 |
+
-> {"places": [{"place": "Goa"}]}
|
| 256 |
+
|
| 257 |
+
Query: "Manchester in US"
|
| 258 |
+
-> {"places": [{"place": "Manchester"}]}
|
| 259 |
+
|
| 260 |
+
Query: "Springfield, Illinois"
|
| 261 |
+
-> {"places": [{"place": "Springfield"}]}
|
| 262 |
+
|
| 263 |
+
Query: "coastal districts of Brazil"
|
| 264 |
+
-> {"places": [{"place": "Brazil"}]}
|
| 265 |
+
|
| 266 |
+
Query: "northern half of India"
|
| 267 |
+
-> {"places": [{"place": "India"}]}
|
| 268 |
+
|
| 269 |
+
Query: "what's within 50 km of Paris?"
|
| 270 |
+
-> {"places": [{"place": "Paris"}]}
|
| 271 |
+
|
| 272 |
+
Query: "countries the Nile crosses"
|
| 273 |
+
-> {"places": [{"place": "Nile"}]}
|
| 274 |
+
|
| 275 |
+
Query: "which countries touch the Gulf of Maine"
|
| 276 |
+
-> {"places": [{"place": "Gulf of Maine"}]}
|
| 277 |
+
|
| 278 |
+
Query: "10 km buffer around Odisha"
|
| 279 |
+
-> {"places": [{"place": "Odisha"}]}
|
| 280 |
+
|
| 281 |
+
Query: "part of Ecuador in the Amazon basin"
|
| 282 |
+
-> {"places": [{"place": "Ecuador"}, {"place": "Amazon basin"}]}
|
| 283 |
+
|
| 284 |
+
Query: "Amazon basin inside Ecuador"
|
| 285 |
+
-> {"places": [{"place": "Amazon basin"}, {"place": "Ecuador"}]}
|
| 286 |
+
|
| 287 |
+
Query: "the part of Chad in Lake Chad"
|
| 288 |
+
-> {"places": [{"place": "Chad"}, {"place": "Lake Chad"}]}
|
| 289 |
+
|
| 290 |
+
Query: "which regions border both France and Germany?"
|
| 291 |
+
-> {"places": [{"place": "France"}, {"place": "Germany"}]}
|
| 292 |
+
|
| 293 |
+
Query: "merge Nairobi and Mombasa"
|
| 294 |
+
-> {"places": [{"place": "Nairobi"}, {"place": "Mombasa"}]}"""
|
| 295 |
|
| 296 |
|
| 297 |
def generate_places(user_query: str) -> PlacesResult:
|
|
|
|
| 330 |
Single-shot — no retry loop (the finetuned model can't improve from error feedback).
|
| 331 |
"""
|
| 332 |
# Keep only columns the model was trained on
|
| 333 |
+
keep_cols = ["source", "id", "name", "subtype", "country", "region", "admin_level"]
|
| 334 |
cols = [c for c in keep_cols if c in candidates_df.columns]
|
| 335 |
candidates_csv = candidates_df[cols].to_csv(index=False)
|
| 336 |
|
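The new system prompt asks the model for JSON of the form `{"places": [{"place": "<name>"}]}`, with anchors in query order and no duplicates. A small sketch of how a reply in that shape could be reduced to a list of names is shown below. `parse_places` is a hypothetical helper for illustration; the repository's `generate_places` returns a `PlacesResult` and may validate the reply differently.

```python
import json


def parse_places(raw: str) -> list[str]:
    """Return place names from the extractor's JSON reply, in order, without duplicates."""
    payload = json.loads(raw)
    names: list[str] = []
    for item in payload.get("places", []):
        name = str(item.get("place", "")).strip()
        # Enforce the "no duplicate place names" rule defensively on the client side.
        if name and name not in names:
            names.append(name)
    return names


reply = '{"places": [{"place": "Ecuador"}, {"place": "Amazon basin"}, {"place": "Ecuador"}]}'
places = parse_places(reply)
```

Deduplicating after parsing is a design choice in this sketch: the prompt already forbids repeats, but a small model may not always comply.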
src/gazet/schemas.py
CHANGED
|
@@ -292,10 +292,7 @@ COUNTRIES = Literal[
|
|
| 292 |
|
| 293 |
class Place(BaseModel):
|
| 294 |
place: str
|
| 295 |
-
country: Optional[COUNTRIES] = None
|
| 296 |
-
subtype: Optional[SUBTYPES] = None
|
| 297 |
|
| 298 |
|
| 299 |
class PlacesResult(BaseModel):
|
| 300 |
places: List[Place]
|
| 301 |
-
subtype: Optional[SUBTYPES] = None
|
|
|
|
| 292 |
|
| 293 |
class Place(BaseModel):
|
| 294 |
place: str
|
|
|
|
|
|
|
| 295 |
|
| 296 |
|
| 297 |
class PlacesResult(BaseModel):
|
| 298 |
places: List[Place]
|
|
|
src/gazet/sql.py
CHANGED
|
@@ -2,6 +2,17 @@ import json
|
|
| 2 |
import re
|
| 3 |
from typing import Any, Generator, Optional
|
| 4 |
|
| 5 |
import duckdb
|
| 6 |
import pandas as pd
|
| 7 |
from shapely import wkb
|
|
@@ -88,11 +99,21 @@ _NE_SUBTYPE_FIXES = {
|
|
| 88 |
"'Sea'": "'sea'",
|
| 89 |
}
|
| 90 |
|
| 91 |
|
| 92 |
def _normalize_ne_subtypes(sql: str) -> str:
|
| 93 |
-
"""Lowercase known NE subtype literals
|
| 94 |
for old, new in _NE_SUBTYPE_FIXES.items():
|
| 95 |
sql = sql.replace(old, new)
|
| 96 |
return sql
|
| 97 |
|
| 98 |
|
|
@@ -189,7 +210,8 @@ def run_geo_sql_dspy(
|
|
| 189 |
yield {"type": "result", "df": None, "sql": ""}
|
| 190 |
return
|
| 191 |
|
| 192 |
-
|
|
|
|
| 193 |
previous_sql = ""
|
| 194 |
error = ""
|
| 195 |
|
|
|
|
| 2 |
import re
|
| 3 |
from typing import Any, Generator, Optional
|
| 4 |
|
| 5 |
+
|
| 6 |
+
_CANDIDATE_PROMPT_COLS = [
|
| 7 |
+
"source",
|
| 8 |
+
"id",
|
| 9 |
+
"name",
|
| 10 |
+
"subtype",
|
| 11 |
+
"country",
|
| 12 |
+
"region",
|
| 13 |
+
"admin_level",
|
| 14 |
+
]
|
| 15 |
+
|
| 16 |
import duckdb
|
| 17 |
import pandas as pd
|
| 18 |
from shapely import wkb
|
|
|
|
| 99 |
"'Sea'": "'sea'",
|
| 100 |
}
|
| 101 |
|
| 102 |
+
_TERRAIN_AREA_PATTERN = re.compile(
|
| 103 |
+
r"n\.subtype\s*(=|IN\s*\()\s*'Terrain area'\s*\)?",
|
| 104 |
+
flags=re.IGNORECASE,
|
| 105 |
+
)
|
| 106 |
+
|
| 107 |
|
| 108 |
def _normalize_ne_subtypes(sql: str) -> str:
|
| 109 |
+
"""Lowercase known NE subtype literals and fix common terrain hallucinations."""
|
| 110 |
for old, new in _NE_SUBTYPE_FIXES.items():
|
| 111 |
sql = sql.replace(old, new)
|
| 112 |
+
|
| 113 |
+
sql = _TERRAIN_AREA_PATTERN.sub(
|
| 114 |
+
"n.subtype IN ('range/mtn', 'peninsula', 'depression')",
|
| 115 |
+
sql,
|
| 116 |
+
)
|
| 117 |
return sql
|
| 118 |
|
| 119 |
|
|
|
|
| 210 |
yield {"type": "result", "df": None, "sql": ""}
|
| 211 |
return
|
| 212 |
|
| 213 |
+
cols = [c for c in _CANDIDATE_PROMPT_COLS if c in candidates_df.columns]
|
| 214 |
+
candidates_str = candidates_df[cols].to_string(index=False)
|
| 215 |
previous_sql = ""
|
| 216 |
error = ""
|
| 217 |
|
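The `_normalize_ne_subtypes` change above rewrites a hallucinated `'Terrain area'` subtype filter into an `IN` list of real subtypes. The standalone sketch below reproduces that behaviour so it can be tried in isolation. The regex and replacement string are copied from the diff; the `_NE_SUBTYPE_FIXES` mapping is an assumption that contains only the one entry visible in the diff, and the table name `t` in the sample queries is made up.

```python
import re

# Only the 'Sea' entry is visible in the diff; the real mapping has more entries.
_NE_SUBTYPE_FIXES = {"'Sea'": "'sea'"}

_TERRAIN_AREA_PATTERN = re.compile(
    r"n\.subtype\s*(=|IN\s*\()\s*'Terrain area'\s*\)?",
    flags=re.IGNORECASE,
)


def normalize_ne_subtypes(sql: str) -> str:
    """Lowercase known subtype literals, then replace 'Terrain area' filters."""
    for old, new in _NE_SUBTYPE_FIXES.items():
        sql = sql.replace(old, new)
    return _TERRAIN_AREA_PATTERN.sub(
        "n.subtype IN ('range/mtn', 'peninsula', 'depression')",
        sql,
    )


# Both the equality form and the single-element IN form are rewritten.
fixed_eq = normalize_ne_subtypes("SELECT * FROM t AS n WHERE n.subtype = 'Terrain area'")
fixed_in = normalize_ne_subtypes("SELECT * FROM t AS n WHERE n.subtype IN ('Terrain area')")
fixed_sea = normalize_ne_subtypes("SELECT * FROM t AS n WHERE n.subtype = 'Sea'")
```

One detail worth knowing: the trailing `\s*\)?` also consumes whitespace after the literal when no closing parenthesis follows, so a following keyword ends up directly after the replacement's `)`. That should still tokenize as valid SQL, but it is visible in the rewritten query text.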