chore: Run eval on v3 dataset
- IMPROVEMENTS.md +0 -79
- finetune/eval_cli.py +1 -1
- finetune/train_modal_qwen35.py +2 -2
IMPROVEMENTS.md
DELETED
@@ -1,79 +0,0 @@
# Gazet Improvement Notes

Issues identified during testing. Each item is a candidate for the next training/template pass.

---

## 1. Missing "buffer-only" template

**Query**: "10 km buffer around Odisha"
**Expected**: Return the buffered geometry polygon itself.
**Actual**: Model picks `buffer_01`, which finds all features intersecting the buffer (200 rows).

**Root cause**: All buffer templates (`buffer_01` through `buffer_05`) perform an intersection join to find neighboring features. No template simply returns `ST_AsGeoJSON(ST_Buffer(...))`.

**Fix**: Add a `buffer_06` template that returns the buffer polygon directly:

```sql
SELECT ST_AsGeoJSON(ST_Buffer(geometry, {buffer_km} * 1000.0 / 111320.0)) AS geometry
FROM read_parquet('divisions_area')
WHERE id = '{anchor_id}'
```

Use question hints such as "10 km buffer around {anchor_name}" and "draw a {buffer_km} km buffer around {anchor_name}". Consider a Natural Earth (NE) variant too.
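
The expression `{buffer_km} * 1000.0 / 111320.0` in the template converts kilometres to degrees at roughly 111,320 m per degree. The sketch below works through that arithmetic. The helper names are hypothetical and not part of the Gazet codebase. It also illustrates a caveat worth checking against real data: the east-west ground distance covered by one degree shrinks with the cosine of the latitude, so a degree-based buffer is only approximately `buffer_km` wide away from the equator.

```python
import math

METERS_PER_DEGREE = 111_320.0  # same constant as in the SQL template


def km_to_degrees(buffer_km: float) -> float:
    """Convert a buffer distance in km to degrees, as the template does."""
    return buffer_km * 1000.0 / METERS_PER_DEGREE


def east_west_km(buffer_deg: float, latitude_deg: float) -> float:
    """Approximate east-west ground distance (km) spanned by `buffer_deg`."""
    meters = buffer_deg * METERS_PER_DEGREE * math.cos(math.radians(latitude_deg))
    return meters / 1000.0


deg = km_to_degrees(10)
print(round(deg, 4))                      # 0.0898
print(round(east_west_km(deg, 20.0), 2))  # 9.4 (km at latitude 20)
```

If exact metric buffers matter, reprojecting to a metric coordinate system before buffering would be the more careful option; whether that is worth the cost for this template is a design decision for the project.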

---

## 2. Place extractor misses NE physical features (mixed-source queries)

**Query**: "The part of Ecuador that is in the Amazon Basin"
**Expected**: Place extractor returns both "Ecuador" and "Amazon Basin"; candidate search finds correct IDs for both.
**Actual**: Only "Ecuador" is extracted. The SQL model uses a memorized, wrong NE ID (`ne_1159120655` = Cuando River) instead of the correct one (`ne_1159104325` = AMAZON BASIN).

**Root cause**: The GGUF place extraction model was not trained to extract physical features. The runtime prompt (`_PLACES_SYSTEM_PROMPT`) has been updated, but the finetuned model may ignore prompt changes. A re-finetune with NE feature examples is the definitive fix.

**Affected templates**: `partial_05`, `diff_02` (mixed-source), and all NE-anchored templates (`intersect_03`, `contain_03/04`, `buffer_03/04/05`, `lookup_02`).

---

## 3. Missing NE-anchor to county intersection template

**Query**: "Indravati River flows through which districts"
**Expected**: `ST_Intersects` with `target_subtype='county'`
**Actual**: Model sometimes uses `ST_Within` (wrong predicate) because `intersect_03` only targets `region`, not `county`.

**Fix**: Add `intersect_05` (NE anchor -> county, `ST_Intersects`) with district-oriented question hints.

---

## 4. Model hallucinates NE subtype values

**Query**: "which mountain ranges cross Odisha"
**Expected**: `n.subtype IN ('range/mtn', 'peninsula', 'depression')` (from `adj_05`)
**Actual**: Model generates `'Terrain area'`, which does not exist in the data.

**Fix**: More training examples for `adj_05`. Consider adding common hallucinated values to `_NE_SUBTYPE_FIXES` in `sql.py` as a runtime safety net.
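
One possible shape for such a safety net is to check every NE subtype literal in the generated SQL against the values present in the data. This is a minimal sketch under stated assumptions: `KNOWN_NE_SUBTYPES` holds only the values quoted in these notes (the real list would come from the NE parquet), the NE table is assumed to be aliased `n` as in the `adj_05` expectation above, and `find_unknown_subtypes` is a hypothetical helper, not a function in `sql.py`.

```python
import re

# Assumed subset: only the subtype values quoted in these notes.
KNOWN_NE_SUBTYPES = {"range/mtn", "peninsula", "depression", "river", "basin", "ocean"}

# Matches `n.subtype = '...'` and `n.subtype IN ('...', '...')`.
_SUBTYPE_CLAUSE = re.compile(
    r"\bn\.subtype\s*(?:=\s*('[^']*')|IN\s*\(([^)]*)\))", re.IGNORECASE
)
_LITERAL = re.compile(r"'([^']*)'")


def find_unknown_subtypes(sql: str) -> list[str]:
    """Return NE subtype literals in `sql` that are not in KNOWN_NE_SUBTYPES."""
    unknown = []
    for match in _SUBTYPE_CLAUSE.finditer(sql):
        clause = match.group(1) or match.group(2) or ""
        for literal in _LITERAL.findall(clause):
            if literal not in KNOWN_NE_SUBTYPES:
                unknown.append(literal)
    return unknown


print(find_unknown_subtypes("SELECT * FROM ne n WHERE n.subtype IN ('Terrain area')"))
# ['Terrain area']
print(find_unknown_subtypes("SELECT * FROM ne n WHERE n.subtype IN ('range/mtn', 'peninsula')"))
# []
```

A non-empty result could be logged, or used to reject the query, before it runs. Unlike a fix-up dict, this flags values nobody anticipated, at the cost of keeping the known-values set in sync with the data.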

---

## 5. NE subtype casing inconsistency between model output and data

**Example**: Model generates `'River'`, `'Basin'`, `'Ocean'`, but the data has `'river'`, `'basin'`, `'ocean'`.

**Current workaround**: `_normalize_ne_subtypes()` in `sql.py` does string replacement of known title-cased literals at query time (`_NE_SUBTYPE_FIXES` dict). This is brittle and only covers a hardcoded list.

**Root cause**: The original Natural Earth data had title-cased `featurecla` values (e.g. `River`, `Basin`, `Ocean`). Training data was generated before the lowercase fix to `convert_natural_earth.py`, so the model learned to emit title-cased subtypes. The data is now lowercased, but the model still outputs the old casing.

**Fix**: Regenerate training data with the lowercased NE parquet so all subtype literals in SQL examples are lowercase. After re-finetune, the model will natively emit lowercase subtypes and the `_normalize_ne_subtypes` hack can be removed.
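
For illustration, a dict-based replacement of the kind described in the workaround might look like the following. This is a hypothetical reconstruction, not the code in `sql.py`: the mapping holds only the three literals named in these notes, and the function name is deliberately different from `_normalize_ne_subtypes`. It shows why the approach is brittle: any title-cased literal missing from the dict passes through unchanged.

```python
# Hypothetical mapping; the real `_NE_SUBTYPE_FIXES` contents are not shown in these notes.
SUBTYPE_FIXES = {"'River'": "'river'", "'Basin'": "'basin'", "'Ocean'": "'ocean'"}


def normalize_subtype_literals(sql: str) -> str:
    """Replace known title-cased subtype literals with their lowercase form."""
    for wrong, right in SUBTYPE_FIXES.items():
        sql = sql.replace(wrong, right)
    return sql


print(normalize_subtype_literals("WHERE n.subtype IN ('River', 'Basin')"))
# WHERE n.subtype IN ('river', 'basin')
print(normalize_subtype_literals("WHERE n.subtype = 'Lake'"))
# WHERE n.subtype = 'Lake'
```

In this sketch the replacement is also blind to context: a matching quoted string anywhere in the query, for example in a name filter, would be rewritten as well.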

---

## 6. "Largest/smallest" queries always return at least 3 results

**Query**: "the largest region in India", "smallest county in France"
**Expected**: Return 1 result (the single largest/smallest).
**Actual**: Model generates `LIMIT 3` by default, returning top 3 instead of 1.

**Root cause**: The aggregation templates (`agg_01`, `agg_02`) use `LIMIT 3` as the default. The model learns this as a fixed pattern and applies it even when the query clearly asks for a single result ("the largest", "the smallest").

**Fix**: During data generation, vary the LIMIT value based on the question hint phrasing. Use `LIMIT 1` for singular hints ("the largest X", "the smallest X") and `LIMIT 3` or `LIMIT 5` for plural hints ("the 3 largest", "top 5 smallest"). This teaches the model to infer the correct LIMIT from the query.
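
The hint-dependent LIMIT could be derived with a small helper during data generation. The sketch below is an assumption about how that might be done, not the project's generator: `limit_for_hint` is a hypothetical name, and the rule (use the explicit count if the hint contains one, otherwise 1) only covers the hint phrasings listed in the fix.

```python
import re

# An explicit count such as "the 3 largest" or "top 5 smallest".
_COUNT = re.compile(r"\b(?:the|top)\s+(\d+)\s+(?:largest|smallest)\b", re.IGNORECASE)


def limit_for_hint(hint: str) -> int:
    """Return the LIMIT implied by a question hint; singular hints get 1."""
    match = _COUNT.search(hint)
    return int(match.group(1)) if match else 1


hints = [
    "the largest region in India",
    "the 3 largest regions in India",
    "top 5 smallest counties in France",
]
for hint in hints:
    print(f"{hint} -> LIMIT {limit_for_hint(hint)}")
# the largest region in India -> LIMIT 1
# the 3 largest regions in India -> LIMIT 3
# top 5 smallest counties in France -> LIMIT 5
```

Generating the hint and the SQL from the same sampled count, rather than parsing the hint afterwards, would avoid the regex entirely; the parsing form is shown only because it makes the hint-to-LIMIT mapping explicit.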
finetune/eval_cli.py
CHANGED
@@ -34,7 +34,7 @@ SERVER_URL = "http://localhost:9000"
 MAX_TOKENS = 2048
 TEMPERATURE = 0.6

-DEFAULT_RUN_DIR = Path("dataset/output/runs/
+DEFAULT_RUN_DIR = Path("dataset/output/runs/v3")


 def postprocess_sql(text: str) -> str:
finetune/train_modal_qwen35.py
CHANGED
@@ -101,9 +101,9 @@ class Qwen35Config:
     # Logging / saving
     logging_steps: int = 10
     save_strategy: str = "steps"
-    save_steps: int =
+    save_steps: int = 2000
     eval_strategy: str = "steps"
-    eval_steps: int =
+    eval_steps: int = 500
     report_to: str = "trackio"
     trackio_space_id: Optional[str] = "srmsoumya/gazet-trackio"
     project: str = "gazet-nlg-qwen35"