---
license: apache-2.0
language:
- en
- vi
base_model:
- Qwen/Qwen3-4B-Instruct-2507
tags:
- text-generation-inference
---

# Text-to-SQL Evaluation Pipeline

## Overview

This repository implements a **Text-to-SQL evaluation pipeline** using the OpenAI API. The system targets the **banking domain**, where strict business rules and deterministic SQL generation are required.

Key goals:

- Translate **natural language questions** into **valid SQLite SQL queries**
- Enforce **domain-specific constraints** via prompt engineering
- Benchmark model outputs using **multiple evaluation metrics**

---

## Code Preservation Policy

> **Important:** The existing `MyModel` class and its formatting are **kept exactly as-is**.\
> This project does **not** modify, refactor, or reformat the demo code. The README only documents how the current implementation works.

- The `MyModel` class structure, method names, and prompt formatting remain unchanged
- No code auto-formatting or refactoring is applied
- All behavior described below reflects the **original demo code**

## Key Features

- Deterministic SQL generation (`temperature = 0`)
- Strong prompt constraints (no markdown, no explanations, SQL only)
- Banking-specific metric grouping and unit conversion logic
- Multi-metric evaluation for both syntactic and semantic quality

---

## High-level Architecture

```text
User Question
      │
      ▼
Prompt Builder (System + User)
      │
      ▼
OpenAI ChatCompletion API
      │
      ▼
Generated SQL
      │
      ▼
Evaluation Metrics
```

---

## Suggested Project Structure

```text
.
├── main.py          # Entry point, runs inference and prints metrics
├── model.py         # OpenAI client wrapper (MyModel)
├── evaluator.py     # Evaluation metrics implementation
├── prompts/
│   └── text2sql.txt # System prompt with banking rules
├── README.md
└── requirements.txt
```

---

## Prompt Design

### System Prompt Responsibilities

The system prompt enforces the following rules:

- **Always perform** an `INNER JOIN` between `system_data` and `branch`
- **Always SELECT** the following columns:
  - `system_data.data_code`
  - `system_data.year`
  - `system_data.branch_id`
  - `branch.name`
  - `system_data.value`
- SQL keywords must be **UPPERCASE**
- Text filters must use `LIKE '%keyword%'`
- Vietnamese location names must use **exact accents**
- Output **SQL only** (no markdown, no explanations)

### Metric Grouping Logic

Metrics are classified by `metric_code` prefix:

| Group | Description                          |
| ----- | ------------------------------------ |
| A     | Inbound metrics (`MET_A_%`)          |
| B     | Outbound metrics (`MET_B_%`)         |
| C     | Stock / snapshot metrics (`MET_C_%`) |
| D     | Exposure / obligation metrics        |
| E     | Resource mobilization metrics        |
| F     | Ratio & efficiency metrics           |

### Unit Conversion Rule

- Stored unit: **Million VND**
- If the question mentions **"Billion VND"**, multiply the value by `1000`

---

## Example Input (Schema)

```sql
CREATE TABLE entity_a (
    id INTEGER,
    group_id INTEGER,
    org_id INTEGER,
    code VARCHAR(100),
    name VARCHAR(255),
    attr_1 VARCHAR(255),
    attr_2 VARCHAR(255),
    attr_3 TEXT
);

CREATE TABLE entity_b (
    id INTEGER,
    group_id INTEGER,
    entity_a_id INTEGER,
    time_key INTEGER,
    metric_name VARCHAR(255),
    metric_code VARCHAR(100),
    metric_value REAL,
    metric_unit VARCHAR(100)
);
```

---

## Evaluation Metrics

Evaluation results are printed **at the very top of the output**:

| Label          | Value |
| -------------- | ----- |
| rouge          | 0.96  |
| meteor         | 0.95  |
| binary         | 0.65  |
| llm-as-a-judge | 0.82  |

### Metric Definitions

- **ROUGE**: Token-level overlap between generated and reference SQL
- **METEOR**: Semantic similarity with synonym awareness
- **Binary Match**: Exact string match (0 or 1)
- **LLM-as-a-Judge**: LLM-based holistic judgment of correctness

---

## Full Demo Code (Kept Exactly As-Is)

The following is the **original demo code**, included verbatim for clarity and ease of understanding. No refactoring, no reformatting, no behavioral changes have been applied.

```python
import argparse

from openai import OpenAI

DEFAULT_QUESTION = """CREATE TABLE entity_a (
    id INTEGER,
    group_id INTEGER,
    org_id INTEGER,
    code VARCHAR(100),
    name VARCHAR(255),
    attr_1 VARCHAR(255),
    attr_2 VARCHAR(255),
    attr_3 TEXT
);

CREATE TABLE entity_b (
    id INTEGER,
    group_id INTEGER,
    entity_a_id INTEGER,
    time_key INTEGER,
    metric_name VARCHAR(255),
    metric_code VARCHAR(100),
    metric_value REAL,
    metric_unit VARCHAR(100)
);

ENTITIES = {
    "metric": {
        "metric_code": "METRIC_X",
        "metric_unit": "UNIT_A"
    },
    "entity_a_field": {
        "attr_1": [],
        "attr_2": [],
        "attr_3": [],
        "id": []
    },
    "time_key": [year],
Query:
}
"""


class MyModel(object):
    def __init__(self, model_name: str, api_key: str):
        self.model_name = model_name
        self.client = OpenAI(base_url="", api_key=api_key)

    def get_prompt(
        self,
        question: str,
    ) -> list[dict[str, str]]:
        return [
            {
                "role": "system",
                "content": """
You are a problem solving model working on task_description XML block:

You are a specialized Text-to-SQL assistant in the banking domain.
Your objective is to translate natural language questions into valid
SQLite queries using the provided schema and banking business logic.

### Input:
- Schema: Table definitions in SQL DDL format.
- Relationships: Key linking logic between tables
  (system_data.branch_id = branch.id).
- Data Content Context:
  Indicator_Categories:
    Group_A:
      description: Primary metrics – inbound type
      rule:
        - metric_code LIKE 'MET_A_%'
    Group_B:
      description: Primary metrics – outbound type
      rule:
        - metric_code LIKE 'MET_B_%'
    Group_C:
      description: Stock / snapshot metrics
      rule:
        - metric_code LIKE 'MET_C_%'
    Group_D:
      description: Exposure / obligation related metrics
      rule:
        - metric_code LIKE 'MET_D_%'
        - metric_code LIKE 'MET_D_TOTAL_%'
        - metric_code = 'MET_D_SPECIAL'
    Group_E:
      description: Resource mobilization metrics
      rule:
        - metric_code LIKE 'MET_E_%'
    Group_F:
      description: Ratio & efficiency indicators
      rule:
        -
- Unit Logic: Domain data is stored in 'Triệu VND'. If the Question
  mentions 'Tỷ', multiply the value by 1000.
- Entities: Extracted key information including data_code, year, and
  branch filtering criteria.

### Rules:
1. ALWAYS perform an INNER JOIN between system_data and branch on
   system_data.branch_id = branch.id.
2. ALWAYS SELECT system_data.data_code, system_data.year,
   system_data.branch_id, branch.name, system_data.value.
3. Use exact Vietnamese accents for location values.
4. Use LIKE '%keyword%' for text matching.
5. Use UPPERCASE for SQL keywords.
6. Output ONLY the SQL query. No explanations or markdown blocks.

You will be given a single task in the question XML block
Solve only the task in question block.
Generate only the answer, do not generate anything else
""",
            },
            {
                "role": "user",
                "content": f"""
Now for the real task, solve the task in question block.
Generate only the solution, do not generate anything else

{question}
""",
            },
        ]

    def invoke(self, question: str) -> str:
        chat_response = self.client.chat.completions.create(
            model=self.model_name,
            messages=self.get_prompt(question),
            temperature=0,
            reasoning_effort="none",
        )
        return chat_response.choices[0].message.content


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--question", type=str, default=DEFAULT_QUESTION, required=False)
    parser.add_argument("--api-key", type=str, default="", required=False)
    parser.add_argument("--model", type=str, default="model", required=False)
    args = parser.parse_args()

    client = MyModel(model_name=args.model, api_key=args.api_key)
    print(client.invoke(args.question))
```

---

## How to Run

```bash
python main.py \
  --question "" \
  --api-key "YOUR_OPENAI_API_KEY" \
  --model "gpt-4.1-mini"
```

---

## Important Notes

- `temperature = 0` ensures reproducible results
- Function calling is intentionally avoided to prevent JSON-wrapped SQL
- The prompt is optimized for the **SQLite dialect**

---

## Possible Extensions

- Multi-year queries using `IN` or ranges
- Queries combining multiple metric groups
- Execution-based evaluation (SQL result comparison)
- Support for additional SQL dialects (PostgreSQL, MySQL)
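---

The metric-grouping rules described in the Prompt Design section exist only as prompt text in the demo; as a minimal sketch, they could also be enforced in Python for post-hoc validation. The function name `classify_metric_group`, the `MET_F_` prefix for Group F, and the `"UNKNOWN"` fallback are illustrative assumptions, not part of the demo code:

```python
# Illustrative sketch of the metric-group prefix rules from the system
# prompt. Names and the "UNKNOWN" fallback are assumptions; the demo
# encodes these rules only inside the prompt string.

PREFIX_GROUPS = {
    "MET_A_": "A",  # inbound metrics
    "MET_B_": "B",  # outbound metrics
    "MET_C_": "C",  # stock / snapshot metrics
    "MET_D_": "D",  # exposure / obligation (also covers MET_D_TOTAL_%, MET_D_SPECIAL)
    "MET_E_": "E",  # resource mobilization metrics
    "MET_F_": "F",  # ratio & efficiency metrics (prefix assumed)
}


def classify_metric_group(metric_code: str) -> str:
    """Return the group letter for a metric_code, or 'UNKNOWN'."""
    for prefix, group in PREFIX_GROUPS.items():
        if metric_code.startswith(prefix):
            return group
    return "UNKNOWN"
```

Note that the prompt's extra Group D rules (`MET_D_TOTAL_%`, `MET_D_SPECIAL`) are subsumed by the `MET_D_` prefix, so a single prefix check suffices here.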
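---

The unit-conversion rule is likewise stated only in the prompt. A minimal post-processing sketch, mirroring the rule exactly as the prompt words it (stored in Million VND / 'Triệu VND'; multiply by 1000 when the question mentions 'Tỷ' / Billion), could look as follows. The helper name and the keyword list are assumptions:

```python
def scale_value(question: str, value: float) -> float:
    """Apply the prompt's unit rule, verbatim: data is stored in
    Million VND ('Triệu VND'); if the question mentions 'Tỷ'
    (Billion VND), multiply the value by 1000. The trigger keywords
    below are an assumed approximation of the prompt's intent."""
    if "Tỷ" in question or "Billion" in question:
        return value * 1000
    return value
```
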
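---

Of the four evaluation metrics, only the binary match is trivially reproducible; ROUGE and METEOR would normally come from libraries such as `rouge-score` and `nltk`, and LLM-as-a-judge needs a second model call. As a rough, self-contained illustration, here is the binary metric plus a simplified ROUGE-1-style token-overlap F1 — both are stand-ins, not the pipeline's actual scorers:

```python
def binary_match(generated: str, reference: str) -> int:
    """Exact match (0 or 1) after collapsing whitespace."""
    norm = lambda s: " ".join(s.split())
    return int(norm(generated) == norm(reference))


def token_overlap_f1(generated: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 over whitespace tokens.
    A stand-in sketch, not a real ROUGE implementation."""
    gen, ref = generated.split(), reference.split()
    if not gen or not ref:
        return 0.0
    # Count clipped token overlap (each reference token matches once).
    ref_counts: dict[str, int] = {}
    for tok in ref:
        ref_counts[tok] = ref_counts.get(tok, 0) + 1
    common = 0
    for tok in gen:
        if ref_counts.get(tok, 0) > 0:
            common += 1
            ref_counts[tok] -= 1
    precision = common / len(gen)
    recall = common / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```
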