|
|
---
license: apache-2.0
language:
- en
- vi
base_model:
- Qwen/Qwen3-4B-Instruct-2507
tags:
- text-generation-inference
---
|
|
# Text-to-SQL Evaluation Pipeline |
|
|
|
|
|
## Overview |
|
|
|
|
|
This repository implements a **Text-to-SQL evaluation pipeline** using the OpenAI API. The system is designed for the **banking domain**, where strict business rules and deterministic SQL generation are required. |
|
|
|
|
|
Key goals:

- Translate **natural language questions** into **valid SQLite SQL queries**
- Enforce **domain-specific constraints** via prompt engineering
- Benchmark model outputs using **multiple evaluation metrics**
|
|
|
|
|
--- |
|
|
|
|
|
## Code Preservation Notice
|
|
|
|
|
> **Important:** The existing `MyModel` class and its formatting are **kept exactly as-is**.
> This project does **not** modify, refactor, or reformat the demo code. The README only documents how the current implementation works.

- The `MyModel` class structure, method names, and prompt formatting remain unchanged
- No code auto-formatting or refactoring is applied
- All behavior described below reflects the **original demo code**
|
|
|
|
|
## Key Features |
|
|
|
|
|
- Deterministic SQL generation (`temperature = 0`)
- Strong prompt constraints (no markdown, no explanations, SQL only)
- Banking-specific metric grouping and unit conversion logic
- Multi-metric evaluation for both syntactic and semantic quality
|
|
|
|
|
--- |
|
|
|
|
|
## High-level Architecture |
|
|
|
|
|
```text
User Question
      │
      ▼
Prompt Builder (System + User)
      │
      ▼
OpenAI ChatCompletion API
      │
      ▼
Generated SQL
      │
      ▼
Evaluation Metrics
```
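The stages above can be sketched in a few lines of Python. This is an illustrative skeleton, not the demo code: `build_messages`, `run_pipeline`, and the `generate_sql` callable are hypothetical names, `generate_sql` stands in for the OpenAI call, and the evaluation stage is reduced to a placeholder exact-match score.

```python
# Illustrative pipeline skeleton (not the demo code): generate_sql stands in
# for the OpenAI ChatCompletion call; evaluation is reduced to exact match.

def build_messages(question: str) -> list[dict[str, str]]:
    # Prompt Builder stage: system rules + user question
    system = "Translate banking questions into SQLite SQL. Output SQL only."
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

def run_pipeline(question: str, reference_sql: str, generate_sql) -> dict:
    messages = build_messages(question)
    predicted = generate_sql(messages)  # model-call stage
    score = 1.0 if predicted.strip() == reference_sql.strip() else 0.0
    return {"sql": predicted, "binary": score}  # evaluation stage
```

Injecting the model call as a callable keeps the prompt-building and evaluation stages testable without network access.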
|
|
|
|
|
--- |
|
|
|
|
|
## Suggested Project Structure |
|
|
|
|
|
```text
.
├── main.py            # Entry point, runs inference and prints metrics
├── model.py           # OpenAI client wrapper (MyModel)
├── evaluator.py       # Evaluation metrics implementation
├── prompts/
│   └── text2sql.txt   # System prompt with banking rules
├── README.md
└── requirements.txt
```
|
|
|
|
|
--- |
|
|
|
|
|
## Prompt Design |
|
|
|
|
|
### System Prompt Responsibilities |
|
|
|
|
|
The system prompt enforces the following rules: |
|
|
|
|
|
- **Always perform** an `INNER JOIN` between `system_data` and `branch`
- **Always SELECT** the following columns:
  - `system_data.data_code`
  - `system_data.year`
  - `system_data.branch_id`
  - `branch.name`
  - `system_data.value`
- SQL keywords must be **UPPERCASE**
- Text filters must use `LIKE '%keyword%'`
- Vietnamese location names must use **exact accents**
- Output **SQL only** (no markdown, no explanations)
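Put together, a query satisfying every rule looks like the following. The query and toy schema are illustrative (the table and column names come from the rules above, not from the `entity_a`/`entity_b` schema shown later); `sqlite3` is used only to check the query is valid SQLite.

```python
import sqlite3

# Illustrative query obeying the rules: the mandatory INNER JOIN, the five
# required SELECT columns, UPPERCASE keywords, a LIKE '%keyword%' filter,
# and exact Vietnamese accents in the location value.
EXAMPLE_SQL = """
SELECT system_data.data_code, system_data.year, system_data.branch_id,
       branch.name, system_data.value
FROM system_data
INNER JOIN branch ON system_data.branch_id = branch.id
WHERE branch.name LIKE '%Hà Nội%' AND system_data.year = 2023
"""

# Minimal toy schema (column names assumed from the rules above) to confirm
# the query parses and executes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE branch (id INTEGER, name TEXT)")
conn.execute(
    "CREATE TABLE system_data (data_code TEXT, year INTEGER,"
    " branch_id INTEGER, value REAL)"
)
rows = conn.execute(EXAMPLE_SQL).fetchall()  # empty tables -> no rows
```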
|
|
|
|
|
### Metric Grouping Logic |
|
|
|
|
|
Metrics are classified by `metric_code` prefix: |
|
|
|
|
|
| Group | Description                                                                       |
| ----- | --------------------------------------------------------------------------------- |
| A     | Inbound metrics (`MET_A_%`)                                                       |
| B     | Outbound metrics (`MET_B_%`)                                                      |
| C     | Stock / snapshot metrics (`MET_C_%`)                                              |
| D     | Exposure / obligation metrics (`MET_D_%`, incl. `MET_D_TOTAL_%`, `MET_D_SPECIAL`) |
| E     | Resource mobilization metrics (`MET_E_%`)                                         |
| F     | Ratio & efficiency metrics (no prefix rule in the prompt)                         |
|
|
|
|
|
### Unit Conversion Rule |
|
|
|
|
|
- Stored unit: **Million VND** (`Triệu VND`)
- If the question quotes an amount in **Billion VND** (`Tỷ`), multiply that figure by `1000` to express it in the stored unit
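Read as a unit-normalization step: a figure quoted in Billion VND must be scaled by 1000 before comparing against stored Million VND values. A sketch of that reading (function name and unit labels are illustrative, not from the demo code):

```python
def to_stored_unit(amount: float, question_unit: str) -> float:
    """Convert a figure from the question into the stored unit (Million VND)."""
    if question_unit.strip().lower() in ("billion vnd", "billion", "tỷ"):
        return amount * 1000  # 1 Billion VND = 1000 Million VND
    return amount  # already in Million VND
```

For example, a question asking for branches above "2 Billion VND" would filter on `value > 2000` in the generated SQL.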
|
|
|
|
|
--- |
|
|
|
|
|
## Example Input (Schema) |
|
|
|
|
|
```sql
CREATE TABLE entity_a (
    id INTEGER,
    group_id INTEGER,
    org_id INTEGER,
    code VARCHAR(100),
    name VARCHAR(255),
    attr_1 VARCHAR(255),
    attr_2 VARCHAR(255),
    attr_3 TEXT
);

CREATE TABLE entity_b (
    id INTEGER,
    group_id INTEGER,
    entity_a_id INTEGER,
    time_key INTEGER,
    metric_name VARCHAR(255),
    metric_code VARCHAR(100),
    metric_value REAL,
    metric_unit VARCHAR(100)
);
```
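To sanity-check generated queries against this schema locally, the DDL can be loaded into an in-memory SQLite database. The join key (`entity_b.entity_a_id = entity_a.id`) follows the schema's foreign-key naming; the sample row is illustrative:

```python
import sqlite3

# The schema above, condensed into an executable script.
SCHEMA = """
CREATE TABLE entity_a (id INTEGER, group_id INTEGER, org_id INTEGER,
    code VARCHAR(100), name VARCHAR(255), attr_1 VARCHAR(255),
    attr_2 VARCHAR(255), attr_3 TEXT);
CREATE TABLE entity_b (id INTEGER, group_id INTEGER, entity_a_id INTEGER,
    time_key INTEGER, metric_name VARCHAR(255), metric_code VARCHAR(100),
    metric_value REAL, metric_unit VARCHAR(100));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO entity_a (id, name) VALUES (1, 'Branch X')")
conn.execute(
    "INSERT INTO entity_b (id, entity_a_id, time_key, metric_code, metric_value)"
    " VALUES (1, 1, 2023, 'MET_A_01', 120.5)"
)
rows = conn.execute(
    "SELECT entity_a.name, entity_b.metric_value FROM entity_b"
    " INNER JOIN entity_a ON entity_b.entity_a_id = entity_a.id"
).fetchall()
```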
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation Metrics |
|
|
|
|
|
Evaluation results are printed **at the very top of the output**: |
|
|
|
|
|
| Metric         | Score |
| -------------- | ----- |
| rouge          | 0.96  |
| meteor         | 0.95  |
| binary         | 0.65  |
| llm-as-a-judge | 0.82  |
|
|
|
|
|
### Metric Definitions |
|
|
|
|
|
- **ROUGE**: Token-level overlap between generated and reference SQL
- **METEOR**: Semantic similarity with synonym awareness
- **Binary Match**: Exact string match (0 or 1)
- **LLM-as-a-Judge**: LLM-based holistic judgment of correctness
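Simplified stand-ins for two of these metrics (the evaluator presumably uses full ROUGE/METEOR implementations; the token-overlap F1 below only approximates the ROUGE idea and is illustrative):

```python
def binary_match(pred: str, ref: str) -> int:
    """Exact string match after trimming whitespace: 1 or 0."""
    return int(pred.strip() == ref.strip())

def token_f1(pred: str, ref: str) -> float:
    """Rough token-overlap F1: a crude, case-insensitive proxy for ROUGE."""
    p, r = set(pred.upper().split()), set(ref.upper().split())
    common = len(p & r)
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)
```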
|
|
|
|
|
--- |
|
|
|
|
|
## Full Demo Code (Kept Exactly As-Is) |
|
|
|
|
|
The following is the **original demo code**, included verbatim for clarity and ease of understanding. No refactoring, no reformatting, no behavioral changes have been applied. |
|
|
|
|
|
```python
import argparse

from openai import OpenAI

DEFAULT_QUESTION = """CREATE TABLE entity_a (
    id INTEGER,
    group_id INTEGER,
    org_id INTEGER,
    code VARCHAR(100),
    name VARCHAR(255),
    attr_1 VARCHAR(255),
    attr_2 VARCHAR(255),
    attr_3 TEXT
);
CREATE TABLE entity_b (
    id INTEGER,
    group_id INTEGER,
    entity_a_id INTEGER,
    time_key INTEGER,
    metric_name VARCHAR(255),
    metric_code VARCHAR(100),
    metric_value REAL,
    metric_unit VARCHAR(100)
);
ENTITIES = {
    "metric": {
        "metric_code": "METRIC_X",
        "metric_unit": "UNIT_A"
    },
    "entity_a_field": {
        "attr_1": [],
        "attr_2": [],
        "attr_3": [],
        "id": []
    },
    "time_key": [year],
    Query:
}

"""


class MyModel(object):
    def __init__(self, model_name: str, api_key: str):
        self.model_name = model_name
        self.client = OpenAI(base_url="", api_key=api_key)

    def get_prompt(
        self,
        question: str,
    ) -> list[dict[str, str]]:
        return [
            {
                "role": "system",
                "content": """
You are a problem solving model working on task_description XML block:
<task_description>You are a specialized Text-to-SQL assistant in the banking domain. Your objective is to translate natural language questions into valid SQLite queries using the provided schema and banking business logic.
### Input:
- Schema: Table definitions in SQL DDL format.
- Relationships: Key linking logic between tables (system_data.branch_id = branch.id).
- Data Content Context:
  Indicator_Categories:
    Group_A:
      description: Primary metrics – inbound type
      rule:
        - metric_code LIKE 'MET_A_%'

    Group_B:
      description: Primary metrics – outbound type
      rule:
        - metric_code LIKE 'MET_B_%'

    Group_C:
      description: Stock / snapshot metrics
      rule:
        - metric_code LIKE 'MET_C_%'

    Group_D:
      description: Exposure / obligation related metrics
      rule:
        - metric_code LIKE 'MET_D_%'
        - metric_code LIKE 'MET_D_TOTAL_%'
        - metric_code = 'MET_D_SPECIAL'

    Group_E:
      description: Resource mobilization metrics
      rule:
        - metric_code LIKE 'MET_E_%'

    Group_F:
      description: Ratio & efficiency indicators
      rule:
- Unit Logic: {Which dmain} data is stored in 'Triệu VND'. If the Question mentions 'Tỷ', multiply the value by 1000.
- Entities: Extracted key information including data_code, year, and branch filtering criteria.
### Rules:
1. ALWAYS perform an INNER JOIN between system_data and branch on system_data.branch_id = branch.id.
2. ALWAYS SELECT system_data.data_code, system_data.year, system_data.branch_id, branch.name, system_data.value.
3. Use exact Vietnamese accents for location values.
4. Use LIKE '%keyword%' for text matching.
5. Use UPPERCASE for SQL keywords.
6. Output ONLY the SQL query. No explanations or markdown blocks.</task_description>
You will be given a single task in the question XML block
Solve only the task in question block.
Generate only the answer, do not generate anything else
""",
            },
            {
                "role": "user",
                "content": f"""
Now for the real task, solve the task in question block.
Generate only the solution, do not generate anything else
<question>{question}</question>
""",
            },
        ]

    def invoke(self, question: str) -> str:
        chat_response = self.client.chat.completions.create(
            model=self.model_name,
            messages=self.get_prompt(question),
            temperature=0,
            reasoning_effort="none",
        )
        return chat_response.choices[0].message.content


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--question", type=str, default=DEFAULT_QUESTION, required=False)
    parser.add_argument("--api-key", type=str, default="", required=False)
    parser.add_argument("--model", type=str, default="model", required=False)
    args = parser.parse_args()
    client = MyModel(model_name=args.model, api_key=args.api_key)
    print(client.invoke(args.question))
```
|
|
|
|
|
--- |
|
|
|
|
|
## How to Run |
|
|
|
|
|
```bash
python main.py \
  --question "<QUESTION_TEXT>" \
  --api-key "YOUR_OPENAI_API_KEY" \
  --model "gpt-4.1-mini"
```
|
|
|
|
|
--- |
|
|
|
|
|
## Important Notes |
|
|
|
|
|
- `temperature = 0` ensures reproducible results
- Function calling is intentionally avoided to prevent JSON-wrapped SQL
- The prompt is optimized for the **SQLite dialect**
|
|
|
|
|
--- |
|
|
|
|
|
## Possible Extensions |
|
|
|
|
|
- Multi-year queries using `IN` or ranges
- Queries combining multiple metric groups
- Execution-based evaluation (SQL result comparison)
- Support for additional SQL dialects (PostgreSQL, MySQL)
- Support for additional SQL dialects (PostgreSQL, MySQL) |