---
license: apache-2.0
language:
- en
- vi
base_model:
- Qwen/Qwen3-4B-Instruct-2507
tags:
- text-generation-inference
---
# Text-to-SQL Evaluation Pipeline
## Overview
This repository implements a **Text-to-SQL evaluation pipeline** using the OpenAI API. The system is designed for the **banking domain**, where strict business rules and deterministic SQL generation are required.
Key goals:
- Translate **natural language questions** into **valid SQLite SQL queries**
- Enforce **domain-specific constraints** via prompt engineering
- Benchmark model outputs using **multiple evaluation metrics**
---
## Code Preservation Notice
> **Important:** The existing `MyModel` class and its formatting are **kept exactly as-is**.\
> This project does **not** modify, refactor, or reformat the demo code. The README only documents how the current implementation works.
- The `MyModel` class structure, method names, and prompt formatting remain unchanged
- No code auto-formatting or refactoring is applied
- All behavior described below reflects the **original demo code**
## Key Features
- Deterministic SQL generation (`temperature = 0`)
- Strong prompt constraints (no markdown, no explanations, SQL only)
- Banking-specific metric grouping and unit conversion logic
- Multi-metric evaluation for both syntactic and semantic quality
---
## High-level Architecture
```text
User Question
      ↓
Prompt Builder (System + User)
      ↓
OpenAI ChatCompletion API
      ↓
Generated SQL
      ↓
Evaluation Metrics
```
---
## Suggested Project Structure
```text
.
├── main.py           # Entry point, runs inference and prints metrics
├── model.py          # OpenAI client wrapper (MyModel)
├── evaluator.py      # Evaluation metrics implementation
├── prompts/
│   └── text2sql.txt  # System prompt with banking rules
├── README.md
└── requirements.txt
```
---
## Prompt Design
### System Prompt Responsibilities
The system prompt enforces the following rules:
- **Always perform** an `INNER JOIN` between `system_data` and `branch`
- **Always SELECT** the following columns:
- `system_data.data_code`
- `system_data.year`
- `system_data.branch_id`
- `branch.name`
- `system_data.value`
- SQL keywords must be **UPPERCASE**
- Text filters must use `LIKE '%keyword%'`
- Vietnamese location names must use **exact accents**
- Output **SQL only** (no markdown, no explanations)
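Under these constraints, every generated query shares one fixed shape. The sketch below shows that shape for a hypothetical question ("total deposits of the Hà Nội branch in 2023") and checks it against an assumed in-memory SQLite schema; the column set is inferred from the mandated SELECT list, and the sample row is made up:

```python
import sqlite3

# Assumed minimal schema matching the prompt's rules; not the repository's
# real DDL. The sample data is invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE branch (id INTEGER, name TEXT);
CREATE TABLE system_data (
    data_code TEXT, year INTEGER, branch_id INTEGER, value REAL
);
INSERT INTO branch VALUES (1, 'Chi nhánh Hà Nội');
INSERT INTO system_data VALUES ('MET_E_DEPOSIT', 2023, 1, 1500.0);
""")

# SQL in the exact shape the system prompt mandates: INNER JOIN, the fixed
# five-column SELECT list, UPPERCASE keywords, and LIKE '%keyword%' with
# exact Vietnamese accents.
sql = """
SELECT system_data.data_code, system_data.year, system_data.branch_id,
       branch.name, system_data.value
FROM system_data
INNER JOIN branch ON system_data.branch_id = branch.id
WHERE branch.name LIKE '%Hà Nội%' AND system_data.year = 2023
"""
rows = conn.execute(sql).fetchall()
print(rows)  # [('MET_E_DEPOSIT', 2023, 1, 'Chi nhánh Hà Nội', 1500.0)]
```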
### Metric Grouping Logic
Metrics are classified by `metric_code` prefix:
| Group | Description |
| ----- | ------------------------------------ |
| A | Inbound metrics (`MET_A_%`) |
| B | Outbound metrics (`MET_B_%`) |
| C | Stock / snapshot metrics (`MET_C_%`) |
| D | Exposure / obligation metrics |
| E | Resource mobilization metrics |
| F | Ratio & efficiency metrics |
### Unit Conversion Rule
- Stored unit: **Million VND**
- If the question mentions **"Billion VND"**, multiply value by `1000`
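The rule converts an amount quoted in the question into the stored unit before it is used in a filter (1 Billion VND = 1000 Million VND). A hypothetical helper illustrating the direction of the conversion:

```python
def to_stored_unit(question: str, amount: float) -> float:
    """Convert an amount quoted in the question into the stored unit.

    Values are stored in Million VND. If the question quotes the amount in
    Billion VND ('Tỷ'), multiply by 1000 before filtering, per the prompt's
    unit rule. (Hypothetical helper; the real pipeline does this via prompting.)
    """
    if "Tỷ" in question or "Billion" in question:
        return amount * 1000
    return amount

print(to_stored_unit("Chi nhánh nào có dư nợ trên 5 Tỷ VND?", 5))  # 5000
```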
---
## Example Input (Schema)
```sql
CREATE TABLE entity_a (
    id INTEGER,
    group_id INTEGER,
    org_id INTEGER,
    code VARCHAR(100),
    name VARCHAR(255),
    attr_1 VARCHAR(255),
    attr_2 VARCHAR(255),
    attr_3 TEXT
);

CREATE TABLE entity_b (
    id INTEGER,
    group_id INTEGER,
    entity_a_id INTEGER,
    time_key INTEGER,
    metric_name VARCHAR(255),
    metric_code VARCHAR(100),
    metric_value REAL,
    metric_unit VARCHAR(100)
);
```
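This DDL is valid SQLite as written; the snippet below loads it in-memory and runs one illustrative join over the anonymized tables (the sample rows are made up):

```python
import sqlite3

# Load the example schema above and join the two anonymized tables.
# Table and column names come from the schema; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity_a (
    id INTEGER, group_id INTEGER, org_id INTEGER, code VARCHAR(100),
    name VARCHAR(255), attr_1 VARCHAR(255), attr_2 VARCHAR(255), attr_3 TEXT
);
CREATE TABLE entity_b (
    id INTEGER, group_id INTEGER, entity_a_id INTEGER, time_key INTEGER,
    metric_name VARCHAR(255), metric_code VARCHAR(100),
    metric_value REAL, metric_unit VARCHAR(100)
);
INSERT INTO entity_a (id, group_id, org_id, code, name)
VALUES (1, 10, 100, 'BR001', 'Branch One');
INSERT INTO entity_b VALUES (1, 10, 1, 2023, 'Deposits', 'MET_E_DEPOSIT', 250.0, 'UNIT_A');
""")

rows = conn.execute("""
SELECT entity_b.metric_code, entity_b.time_key, entity_a.name, entity_b.metric_value
FROM entity_b
INNER JOIN entity_a ON entity_b.entity_a_id = entity_a.id
""").fetchall()
print(rows)  # [('MET_E_DEPOSIT', 2023, 'Branch One', 250.0)]
```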
---
## Evaluation Metrics
Evaluation results are printed **at the very top of the output**:
| Metric         | Score |
| -------------- | ----- |
| rouge | 0.96 |
| meteor | 0.95 |
| binary | 0.65 |
| llm-as-a-judge | 0.82 |
### Metric Definitions
- **ROUGE**: Token-level overlap between generated and reference SQL
- **METEOR**: Semantic similarity with synonym awareness
- **Binary Match**: Exact string match (0 or 1)
- **LLM-as-a-Judge**: LLM-based holistic judgment of correctness
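Binary Match is the simplest of the four to sketch. One plausible reading, assuming whitespace and trailing semicolons are normalized before comparison (the repository's `evaluator.py` may normalize differently):

```python
import re

def _norm(sql: str) -> str:
    # Collapse whitespace and drop a trailing semicolon; keep case,
    # since SQL string literals are case-sensitive.
    return re.sub(r"\s+", " ", sql.strip().rstrip(";"))

def binary_match(generated: str, reference: str) -> int:
    """Return 1 on exact (normalized) string match, else 0."""
    return int(_norm(generated) == _norm(reference))

print(binary_match("SELECT 1;", "SELECT  1"))  # 1
print(binary_match("SELECT 1", "SELECT 2"))    # 0
```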
---
## Full Demo Code (Kept Exactly As-Is)
The following is the **original demo code**, included verbatim for clarity and ease of understanding. No refactoring, no reformatting, no behavioral changes have been applied.
```python
import argparse

from openai import OpenAI

DEFAULT_QUESTION = """CREATE TABLE entity_a (
    id INTEGER,
    group_id INTEGER,
    org_id INTEGER,
    code VARCHAR(100),
    name VARCHAR(255),
    attr_1 VARCHAR(255),
    attr_2 VARCHAR(255),
    attr_3 TEXT
);
CREATE TABLE entity_b (
    id INTEGER,
    group_id INTEGER,
    entity_a_id INTEGER,
    time_key INTEGER,
    metric_name VARCHAR(255),
    metric_code VARCHAR(100),
    metric_value REAL,
    metric_unit VARCHAR(100)
);
ENTITIES = {
    "metric": {
        "metric_code": "METRIC_X",
        "metric_unit": "UNIT_A"
    },
    "entity_a_field": {
        "attr_1": [],
        "attr_2": [],
        "attr_3": [],
        "id": []
    },
    "time_key": [year],
    Query:
}
"""


class MyModel(object):
    def __init__(self, model_name: str, api_key: str):
        self.model_name = model_name
        self.client = OpenAI(base_url="", api_key=api_key)

    def get_prompt(
        self,
        question: str,
    ) -> list[dict[str, str]]:
        return [
            {
                "role": "system",
                "content": """
You are a problem solving model working on task_description XML block:
<task_description>You are a specialized Text-to-SQL assistant in the banking domain. Your objective is to translate natural language questions into valid SQLite queries using the provided schema and banking business logic.
### Input:
- Schema: Table definitions in SQL DDL format.
- Relationships: Key linking logic between tables (system_data.branch_id = branch.id).
- Data Content Context:
  Indicator_Categories:
    Group_A:
      description: Primary metrics – inbound type
      rule:
        - metric_code LIKE 'MET_A_%'
    Group_B:
      description: Primary metrics – outbound type
      rule:
        - metric_code LIKE 'MET_B_%'
    Group_C:
      description: Stock / snapshot metrics
      rule:
        - metric_code LIKE 'MET_C_%'
    Group_D:
      description: Exposure / obligation related metrics
      rule:
        - metric_code LIKE 'MET_D_%'
        - metric_code LIKE 'MET_D_TOTAL_%'
        - metric_code = 'MET_D_SPECIAL'
    Group_E:
      description: Resource mobilization metrics
      rule:
        - metric_code LIKE 'MET_E_%'
    Group_F:
      description: Ratio & efficiency indicators
      rule:
- Unit Logic: {Which domain} data is stored in 'Triệu VND'. If the Question mentions 'Tỷ', multiply the value by 1000.
- Entities: Extracted key information including data_code, year, and branch filtering criteria.
### Rules:
1. ALWAYS perform an INNER JOIN between system_data and branch on system_data.branch_id = branch.id.
2. ALWAYS SELECT system_data.data_code, system_data.year, system_data.branch_id, branch.name, system_data.value.
3. Use exact Vietnamese accents for location values.
4. Use LIKE '%keyword%' for text matching.
5. Use UPPERCASE for SQL keywords.
6. Output ONLY the SQL query. No explanations or markdown blocks.</task_description>
You will be given a single task in the question XML block
Solve only the task in question block.
Generate only the answer, do not generate anything else
""",
            },
            {
                "role": "user",
                "content": f"""
Now for the real task, solve the task in question block.
Generate only the solution, do not generate anything else
<question>{question}</question>
""",
            },
        ]

    def invoke(self, question: str) -> str:
        chat_response = self.client.chat.completions.create(
            model=self.model_name,
            messages=self.get_prompt(question),
            temperature=0,
            reasoning_effort="none",
        )
        return chat_response.choices[0].message.content


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--question", type=str, default=DEFAULT_QUESTION, required=False)
    parser.add_argument("--api-key", type=str, default="", required=False)
    parser.add_argument("--model", type=str, default="model", required=False)
    args = parser.parse_args()

    client = MyModel(model_name=args.model, api_key=args.api_key)
    print(client.invoke(args.question))
```
---
## How to Run
```bash
python main.py \
--question "<QUESTION_TEXT>" \
--api-key "YOUR_OPENAI_API_KEY" \
--model "gpt-4.1-mini"
```
---
## Important Notes
- `temperature = 0` ensures reproducible results
- Function calling is intentionally avoided to prevent JSON-wrapped SQL
- The prompt is optimized for **SQLite dialect**
---
## Possible Extensions
- Multi-year queries using `IN` or ranges
- Queries combining multiple metric groups
- Execution-based evaluation (SQL result comparison)
- Support for additional SQL dialects (PostgreSQL, MySQL)
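The first extension (multi-year queries) keeps the mandated query shape and only widens the year filter with `IN`. A sketch against the same assumed schema as above, with invented rows:

```python
import sqlite3

# Sketch of the "multi-year queries" extension: the mandated SELECT shape,
# with system_data.year constrained via IN. Schema columns are assumed from
# the prompt rules; the data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE branch (id INTEGER, name TEXT);
CREATE TABLE system_data (data_code TEXT, year INTEGER, branch_id INTEGER, value REAL);
INSERT INTO branch VALUES (1, 'Chi nhánh Đà Nẵng');
INSERT INTO system_data VALUES
    ('MET_A_X', 2021, 1, 9.0),
    ('MET_A_X', 2022, 1, 10.0),
    ('MET_A_X', 2023, 1, 12.0);
""")

rows = conn.execute("""
SELECT system_data.data_code, system_data.year, system_data.branch_id,
       branch.name, system_data.value
FROM system_data
INNER JOIN branch ON system_data.branch_id = branch.id
WHERE system_data.year IN (2022, 2023)
ORDER BY system_data.year
""").fetchall()
print([r[1] for r in rows])  # [2022, 2023]
```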