---
license: apache-2.0
language:
- en
- vi
base_model:
- Qwen/Qwen3-4B-Instruct-2507
tags:
- text-generation-inference
---
# Text-to-SQL Evaluation Pipeline
## Overview
This repository implements a **Text-to-SQL evaluation pipeline** using the OpenAI API. The system is designed for the **banking domain**, where strict business rules and deterministic SQL generation are required.
Key goals:
- Translate **natural language questions** into **valid SQLite SQL queries**
- Enforce **domain-specific constraints** via prompt engineering
- Benchmark model outputs using **multiple evaluation metrics**
---
## Implementation Note
> **Important:** The existing `MyModel` class and its formatting are **kept exactly as-is**.\
> This project does **not** modify, refactor, or reformat the demo code. The README only documents how the current implementation works.
- The `MyModel` class structure, method names, and prompt formatting remain unchanged
- No code auto-formatting or refactoring is applied
- All behavior described below reflects the **original demo code**
## Key Features
- Deterministic SQL generation (`temperature = 0`)
- Strong prompt constraints (no markdown, no explanations, SQL only)
- Banking-specific metric grouping and unit conversion logic
- Multi-metric evaluation for both syntactic and semantic quality
---
## High-level Architecture
```text
User Question
      │
      ▼
Prompt Builder (System + User)
      │
      ▼
OpenAI ChatCompletion API
      │
      ▼
Generated SQL
      │
      ▼
Evaluation Metrics
```
---
## Suggested Project Structure
```text
.
├── main.py          # Entry point, runs inference and prints metrics
├── model.py         # OpenAI client wrapper (MyModel)
├── evaluator.py     # Evaluation metrics implementation
├── prompts/
│   └── text2sql.txt # System prompt with banking rules
├── README.md
└── requirements.txt
```
---
## Prompt Design
### System Prompt Responsibilities
The system prompt enforces the following rules:
- **Always perform** an `INNER JOIN` between `system_data` and `branch`
- **Always SELECT** the following columns:
- `system_data.data_code`
- `system_data.year`
- `system_data.branch_id`
- `branch.name`
- `system_data.value`
- SQL keywords must be **UPPERCASE**
- Text filters must use `LIKE '%keyword%'`
- Vietnamese location names must use **exact accents**
- Output **SQL only** (no markdown, no explanations)
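
Applied together, these rules produce queries of the following shape. This is an illustrative sketch only: the table and column names come from the rules above, while the specific metric prefix, location, and year are hypothetical placeholders.

```sql
SELECT system_data.data_code,
       system_data.year,
       system_data.branch_id,
       branch.name,
       system_data.value
FROM system_data
INNER JOIN branch ON system_data.branch_id = branch.id
WHERE system_data.data_code LIKE '%MET_A_%'
  AND branch.name LIKE '%Hà Nội%'
  AND system_data.year = 2023;
```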
### Metric Grouping Logic
Metrics are classified by `metric_code` prefix:
| Group | Description |
| ----- | ------------------------------------ |
| A | Inbound metrics (`MET_A_%`) |
| B | Outbound metrics (`MET_B_%`) |
| C | Stock / snapshot metrics (`MET_C_%`) |
| D | Exposure / obligation metrics |
| E | Resource mobilization metrics |
| F | Ratio & efficiency metrics |
### Unit Conversion Rule
- Stored unit: **Million VND**
- If the question mentions **"Billion VND"**, multiply value by `1000`
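
The conversion rule can be sketched as follows. This is a minimal illustration; the detection keywords (`"tỷ"`, `"billion"`) are assumptions about how a question might be phrased, not part of the original code.

```python
def to_stored_unit(amount: float, question: str) -> float:
    """Convert an amount mentioned in the question into the stored unit.

    Per the prompt rule: data is stored in Million VND, so an amount
    phrased in Billion VND ('Tỷ') is multiplied by 1000 before comparison.
    """
    q = question.lower()
    if "tỷ" in q or "billion" in q:
        return amount * 1000
    return amount


print(to_stored_unit(5, "Chi nhánh nào có giá trị trên 5 tỷ VND?"))  # → 5000
```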
---
## Example Input (Schema)
```sql
CREATE TABLE entity_a (
id INTEGER,
group_id INTEGER,
org_id INTEGER,
code VARCHAR(100),
name VARCHAR(255),
attr_1 VARCHAR(255),
attr_2 VARCHAR(255),
attr_3 TEXT
);
CREATE TABLE entity_b (
id INTEGER,
group_id INTEGER,
entity_a_id INTEGER,
time_key INTEGER,
metric_name VARCHAR(255),
metric_code VARCHAR(100),
metric_value REAL,
metric_unit VARCHAR(100)
);
```
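
The DDL above can be loaded into an in-memory SQLite database to sanity-check generated queries before evaluation. This is a sketch, not part of the pipeline; the sample row values are invented for illustration.

```python
import sqlite3

DDL = """
CREATE TABLE entity_a (
    id INTEGER, group_id INTEGER, org_id INTEGER,
    code VARCHAR(100), name VARCHAR(255),
    attr_1 VARCHAR(255), attr_2 VARCHAR(255), attr_3 TEXT
);
CREATE TABLE entity_b (
    id INTEGER, group_id INTEGER, entity_a_id INTEGER, time_key INTEGER,
    metric_name VARCHAR(255), metric_code VARCHAR(100),
    metric_value REAL, metric_unit VARCHAR(100)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
# Invented sample rows, for illustration only.
conn.execute("INSERT INTO entity_a VALUES (1, 1, 1, 'BR01', 'Hà Nội', '', '', '')")
conn.execute(
    "INSERT INTO entity_b VALUES "
    "(1, 1, 1, 2023, 'Deposits', 'MET_A_DEPOSITS', 1500.0, 'Triệu VND')"
)

# A query following the prompt's join/filter conventions.
rows = conn.execute(
    """
    SELECT entity_b.metric_code, entity_b.time_key, entity_a.name, entity_b.metric_value
    FROM entity_b
    INNER JOIN entity_a ON entity_b.entity_a_id = entity_a.id
    WHERE entity_b.metric_code LIKE '%MET_A_%'
    """
).fetchall()
print(rows)  # → [('MET_A_DEPOSITS', 2023, 'Hà Nội', 1500.0)]
```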
---
## Evaluation Metrics
Evaluation results are printed **at the very top of the output**:
| Label | Value |
| -------------- | ------------ |
| rouge | 0.96 |
| meteor | 0.95 |
| binary | 0.65 |
| llm-as-a-judge | 0.82 |
### Metric Definitions
- **ROUGE**: Token-level overlap between generated and reference SQL
- **METEOR**: Semantic similarity with synonym awareness
- **Binary Match**: Exact string match (0 or 1)
- **LLM-as-a-Judge**: LLM-based holistic judgment of correctness
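
The two simplest metrics can be sketched as below. This is illustrative only: real ROUGE and METEOR scores come from libraries such as `rouge-score` and `nltk`; the token-overlap function is a naive stand-in, not the pipeline's implementation.

```python
def binary_match(generated: str, reference: str) -> int:
    """Exact string match after trimming whitespace: 1 or 0."""
    return int(generated.strip() == reference.strip())


def token_overlap(generated: str, reference: str) -> float:
    """Naive token-level F1 overlap (a rough stand-in for ROUGE-1)."""
    gen, ref = generated.upper().split(), reference.upper().split()
    if not gen or not ref:
        return 0.0
    common = sum(min(gen.count(t), ref.count(t)) for t in set(gen))
    precision, recall = common / len(gen), common / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(binary_match("SELECT 1;", "SELECT 1;"))  # → 1
print(token_overlap("SELECT a FROM t", "SELECT b FROM t"))  # → 0.75
```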
---
## Full Demo Code (Kept Exactly As-Is)
The following is the **original demo code**, included verbatim for clarity and ease of understanding. No refactoring, no reformatting, no behavioral changes have been applied.
```python
import argparse
from openai import OpenAI
DEFAULT_QUESTION = """CREATE TABLE entity_a (
id INTEGER,
group_id INTEGER,
org_id INTEGER,
code VARCHAR(100),
name VARCHAR(255),
attr_1 VARCHAR(255),
attr_2 VARCHAR(255),
attr_3 TEXT
);
CREATE TABLE entity_b (
id INTEGER,
group_id INTEGER,
entity_a_id INTEGER,
time_key INTEGER,
metric_name VARCHAR(255),
metric_code VARCHAR(100),
metric_value REAL,
metric_unit VARCHAR(100)
);
ENTITIES = {
"metric": {
"metric_code": "METRIC_X",
"metric_unit": "UNIT_A"
},
"entity_a_field": {
"attr_1": [],
"attr_2": [],
"attr_3": [],
"id": []
},
"time_key": [year],
Query:
}
"""
class MyModel(object):
def __init__(self, model_name: str, api_key: str):
self.model_name = model_name
self.client = OpenAI(base_url="", api_key=api_key)
def get_prompt(
self,
question: str,
) -> list[dict[str, str]]:
return [
{
"role": "system",
"content": """
You are a problem solving model working on task_description XML block:
<task_description>You are a specialized Text-to-SQL assistant in the banking domain. Your objective is to translate natural language questions into valid SQLite queries using the provided schema and banking business logic.
### Input:
- Schema: Table definitions in SQL DDL format.
- Relationships: Key linking logic between tables (system_data.branch_id = branch.id).
- Data Content Context:
Indicator_Categories:
Group_A:
description: Primary metrics – inbound type
rule:
- metric_code LIKE 'MET_A_%'
Group_B:
description: Primary metrics – outbound type
rule:
- metric_code LIKE 'MET_B_%'
Group_C:
description: Stock / snapshot metrics
rule:
- metric_code LIKE 'MET_C_%'
Group_D:
description: Exposure / obligation related metrics
rule:
- metric_code LIKE 'MET_D_%'
- metric_code LIKE 'MET_D_TOTAL_%'
- metric_code = 'MET_D_SPECIAL'
Group_E:
description: Resource mobilization metrics
rule:
- metric_code LIKE 'MET_E_%'
Group_F:
description: Ratio & efficiency indicators
rule:
- Unit Logic: {Which domain} data is stored in 'Triệu VND'. If the Question mentions 'Tỷ', multiply the value by 1000.
- Entities: Extracted key information including data_code, year, and branch filtering criteria.
### Rules:
1. ALWAYS perform an INNER JOIN between system_data and branch on system_data.branch_id = branch.id.
2. ALWAYS SELECT system_data.data_code, system_data.year, system_data.branch_id, branch.name, system_data.value.
3. Use exact Vietnamese accents for location values.
4. Use LIKE '%keyword%' for text matching.
5. Use UPPERCASE for SQL keywords.
6. Output ONLY the SQL query. No explanations or markdown blocks.</task_description>
You will be given a single task in the question XML block
Solve only the task in question block.
Generate only the answer, do not generate anything else
""",
},
{
"role": "user",
"content": f"""
Now for the real task, solve the task in question block.
Generate only the solution, do not generate anything else
<question>{question}</question>
""",
},
]
def invoke(self, question: str) -> str:
chat_response = self.client.chat.completions.create(
model=self.model_name,
messages=self.get_prompt(question),
temperature=0,
reasoning_effort="none",
)
return chat_response.choices[0].message.content
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--question", type=str, default=DEFAULT_QUESTION, required=False)
parser.add_argument("--api-key", type=str, default="", required=False)
parser.add_argument("--model", type=str, default="model", required=False)
args = parser.parse_args()
client = MyModel(model_name=args.model, api_key=args.api_key)
print(client.invoke(args.question))
```
---
## How to Run
```bash
python main.py \
--question "<QUESTION_TEXT>" \
--api-key "YOUR_OPENAI_API_KEY" \
--model "gpt-4.1-mini"
```
---
## Important Notes
- `temperature = 0` ensures reproducible results
- Function calling is intentionally avoided to prevent JSON-wrapped SQL
- The prompt is optimized for **SQLite dialect**
---
## Possible Extensions
- Multi-year queries using `IN` or ranges
- Queries combining multiple metric groups
- Execution-based evaluation (SQL result comparison)
- Support for additional SQL dialects (PostgreSQL, MySQL)