---
license: apache-2.0
language:
- en
- vi
base_model:
- Qwen/Qwen3-4B-Instruct-2507
tags:
- text-generation-inference
---

# Text-to-SQL Evaluation Pipeline

## Overview

This repository implements a **Text-to-SQL evaluation pipeline** using the OpenAI API. The system targets the **banking domain**, where strict business rules and deterministic SQL generation are required.

Key goals:

- Translate **natural language questions** into **valid SQLite SQL queries**
- Enforce **domain-specific constraints** via prompt engineering
- Benchmark model outputs using **multiple evaluation metrics**

---

## Code Preservation Policy

> **Important:** The existing `MyModel` class and its formatting are **kept exactly as-is**.\
> This project does **not** modify, refactor, or reformat the demo code. The README only documents how the current implementation works.

- The `MyModel` class structure, method names, and prompt formatting remain unchanged
- No code auto-formatting or refactoring is applied
- All behavior described below reflects the **original demo code**

## Key Features

- Deterministic SQL generation (`temperature = 0`)
- Strong prompt constraints (no markdown, no explanations, SQL only)
- Banking-specific metric grouping and unit conversion logic
- Multi-metric evaluation for both syntactic and semantic quality

---

## High-level Architecture

```text
User Question
      │
      ▼
Prompt Builder (System + User)
      │
      ▼
OpenAI ChatCompletion API
      │
      ▼
Generated SQL
      │
      ▼
Evaluation Metrics
```

---

## Suggested Project Structure

```text
.
├── main.py          # Entry point, runs inference and prints metrics
├── model.py         # OpenAI client wrapper (MyModel)
├── evaluator.py     # Evaluation metrics implementation
├── prompts/
│   └── text2sql.txt # System prompt with banking rules
├── README.md
└── requirements.txt
```

---

## Prompt Design

### System Prompt Responsibilities

The system prompt enforces the following rules:

- **Always perform** an `INNER JOIN` between `system_data` and `branch`
- **Always SELECT** the following columns:
  - `system_data.data_code`
  - `system_data.year`
  - `system_data.branch_id`
  - `branch.name`
  - `system_data.value`
- SQL keywords must be **UPPERCASE**
- Text filters must use `LIKE '%keyword%'`
- Vietnamese location names must use **exact accents**
- Output **SQL only** (no markdown, no explanations)

### Metric Grouping Logic

Metrics are classified by `metric_code` prefix:

| Group | Description                          |
| ----- | ------------------------------------ |
| A     | Inbound metrics (`MET_A_%`)          |
| B     | Outbound metrics (`MET_B_%`)         |
| C     | Stock / snapshot metrics (`MET_C_%`) |
| D     | Exposure / obligation metrics        |
| E     | Resource mobilization metrics        |
| F     | Ratio & efficiency metrics           |

### Unit Conversion Rule

- Stored unit: **Million VND**
- If the question mentions **"Billion VND"**, multiply the value by `1000`

---

## Example Input (Schema)

```sql
CREATE TABLE entity_a (
    id INTEGER,
    group_id INTEGER,
    org_id INTEGER,
    code VARCHAR(100),
    name VARCHAR(255),
    attr_1 VARCHAR(255),
    attr_2 VARCHAR(255),
    attr_3 TEXT
);

CREATE TABLE entity_b (
    id INTEGER,
    group_id INTEGER,
    entity_a_id INTEGER,
    time_key INTEGER,
    metric_name VARCHAR(255),
    metric_code VARCHAR(100),
    metric_value REAL,
    metric_unit VARCHAR(100)
);
```

---

## Evaluation Metrics

Evaluation results are printed **at the very top of the output**:

| Label          | Value |
| -------------- | ----- |
| rouge          | 0.96  |
| meteor         | 0.95  |
| binary         | 0.65  |
| llm-as-a-judge | 0.82  |

### Metric Definitions

- **ROUGE**: Token-level overlap between generated and reference SQL
- **METEOR**: Semantic similarity with synonym awareness
- **Binary Match**: Exact string match (0 or 1)
- **LLM-as-a-Judge**: LLM-based holistic judgment of correctness

---

## Full Demo Code (Kept Exactly As-Is)

The following is the **original demo code**, included verbatim for clarity and ease of understanding. No refactoring, no reformatting, no behavioral changes have been applied.

```python
import argparse

from openai import OpenAI

DEFAULT_QUESTION = """CREATE TABLE entity_a (
    id INTEGER,
    group_id INTEGER,
    org_id INTEGER,
    code VARCHAR(100),
    name VARCHAR(255),
    attr_1 VARCHAR(255),
    attr_2 VARCHAR(255),
    attr_3 TEXT
);

CREATE TABLE entity_b (
    id INTEGER,
    group_id INTEGER,
    entity_a_id INTEGER,
    time_key INTEGER,
    metric_name VARCHAR(255),
    metric_code VARCHAR(100),
    metric_value REAL,
    metric_unit VARCHAR(100)
);

ENTITIES = {
    "metric": {
        "metric_code": "METRIC_X",
        "metric_unit": "UNIT_A"
    },
    "entity_a_field": {
        "attr_1": [],
        "attr_2": [],
        "attr_3": [],
        "id": []
    },
    "time_key": [year],
Query:
}
"""


class MyModel(object):
    def __init__(self, model_name: str, api_key: str):
        self.model_name = model_name
        self.client = OpenAI(base_url="", api_key=api_key)

    def get_prompt(
        self,
        question: str,
    ) -> list[dict[str, str]]:
        return [
            {
                "role": "system",
                "content": """
You are a problem solving model working on task_description XML block:

You are a specialized Text-to-SQL assistant in the banking domain.
Your objective is to translate natural language questions into valid
SQLite queries using the provided schema and banking business logic.

### Input:
- Schema: Table definitions in SQL DDL format.
- Relationships: Key linking logic between tables
  (system_data.branch_id = branch.id).
- Data Content Context:
  Indicator_Categories:
    Group_A:
      description: Primary metrics – inbound type
      rule:
        - metric_code LIKE 'MET_A_%'
    Group_B:
      description: Primary metrics – outbound type
      rule:
        - metric_code LIKE 'MET_B_%'
    Group_C:
      description: Stock / snapshot metrics
      rule:
        - metric_code LIKE 'MET_C_%'
    Group_D:
      description: Exposure / obligation related metrics
      rule:
        - metric_code LIKE 'MET_D_%'
        - metric_code LIKE 'MET_D_TOTAL_%'
        - metric_code = 'MET_D_SPECIAL'
    Group_E:
      description: Resource mobilization metrics
      rule:
        - metric_code LIKE 'MET_E_%'
    Group_F:
      description: Ratio & efficiency indicators
      rule:
        -
- Unit Logic: Domain data is stored in 'Triệu VND'. If the Question
  mentions 'Tỷ', multiply the value by 1000.
- Entities: Extracted key information including data_code, year, and
  branch filtering criteria.

### Rules:
1. ALWAYS perform an INNER JOIN between system_data and branch on
   system_data.branch_id = branch.id.
2. ALWAYS SELECT system_data.data_code, system_data.year,
   system_data.branch_id, branch.name, system_data.value.
3. Use exact Vietnamese accents for location values.
4. Use LIKE '%keyword%' for text matching.
5. Use UPPERCASE for SQL keywords.
6. Output ONLY the SQL query. No explanations or markdown blocks.

You will be given a single task in the question XML block
Solve only the task in question block.
Generate only the answer, do not generate anything else
""",
            },
            {
                "role": "user",
                "content": f"""
Now for the real task, solve the task in question block.
Generate only the solution, do not generate anything else

{question}
""",
            },
        ]

    def invoke(self, question: str) -> str:
        chat_response = self.client.chat.completions.create(
            model=self.model_name,
            messages=self.get_prompt(question),
            temperature=0,
            reasoning_effort="none",
        )
        return chat_response.choices[0].message.content


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--question", type=str, default=DEFAULT_QUESTION, required=False)
    parser.add_argument("--api-key", type=str, default="", required=False)
    parser.add_argument("--model", type=str, default="model", required=False)
    args = parser.parse_args()

    client = MyModel(model_name=args.model, api_key=args.api_key)
    print(client.invoke(args.question))
```

---

## How to Run

```bash
python main.py \
  --question "" \
  --api-key "YOUR_OPENAI_API_KEY" \
  --model "gpt-4.1-mini"
```

---

## Important Notes

- `temperature = 0` ensures reproducible results
- Function calling is intentionally avoided to prevent JSON-wrapped SQL
- The prompt is optimized for the **SQLite dialect**

---

## Possible Extensions

- Multi-year queries using `IN` or ranges
- Queries combining multiple metric groups
- Execution-based evaluation (SQL result comparison)
- Support for additional SQL dialects (PostgreSQL, MySQL)
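---

The metric-grouping rules described in the Prompt Design section exist only as prompt text in the demo; as a minimal sketch, they could also be enforced in Python for post-hoc validation. The function name `classify_metric_group`, the `MET_F_` prefix for Group F, and the `"UNKNOWN"` fallback are illustrative assumptions, not part of the demo code:

```python
# Illustrative sketch of the metric-group prefix rules from the system
# prompt. Names and the "UNKNOWN" fallback are assumptions; the demo
# encodes these rules only inside the prompt string.

PREFIX_GROUPS = {
    "MET_A_": "A",  # inbound metrics
    "MET_B_": "B",  # outbound metrics
    "MET_C_": "C",  # stock / snapshot metrics
    "MET_D_": "D",  # exposure / obligation (also covers MET_D_TOTAL_%, MET_D_SPECIAL)
    "MET_E_": "E",  # resource mobilization metrics
    "MET_F_": "F",  # ratio & efficiency metrics (prefix assumed)
}


def classify_metric_group(metric_code: str) -> str:
    """Return the group letter for a metric_code, or 'UNKNOWN'."""
    for prefix, group in PREFIX_GROUPS.items():
        if metric_code.startswith(prefix):
            return group
    return "UNKNOWN"
```

Note that the prompt's extra Group D rules (`MET_D_TOTAL_%`, `MET_D_SPECIAL`) are subsumed by the `MET_D_` prefix, so a single prefix check suffices here.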
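---

The unit-conversion rule is likewise stated only in the prompt. A minimal post-processing sketch, mirroring the rule exactly as the prompt words it (stored in Million VND / 'Triệu VND'; multiply by 1000 when the question mentions 'Tỷ' / Billion), could look as follows. The helper name and the keyword list are assumptions:

```python
def scale_value(question: str, value: float) -> float:
    """Apply the prompt's unit rule, verbatim: data is stored in
    Million VND ('Triệu VND'); if the question mentions 'Tỷ'
    (Billion VND), multiply the value by 1000. The trigger keywords
    below are an assumed approximation of the prompt's intent."""
    if "Tỷ" in question or "Billion" in question:
        return value * 1000
    return value
```
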
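---

Of the four evaluation metrics, only the binary match is trivially reproducible; ROUGE and METEOR would normally come from libraries such as `rouge-score` and `nltk`, and LLM-as-a-judge needs a second model call. As a rough, self-contained illustration, here is the binary metric plus a simplified ROUGE-1-style token-overlap F1 — both are stand-ins, not the pipeline's actual scorers:

```python
def binary_match(generated: str, reference: str) -> int:
    """Exact match (0 or 1) after collapsing whitespace."""
    norm = lambda s: " ".join(s.split())
    return int(norm(generated) == norm(reference))


def token_overlap_f1(generated: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 over whitespace tokens.
    A stand-in sketch, not a real ROUGE implementation."""
    gen, ref = generated.split(), reference.split()
    if not gen or not ref:
        return 0.0
    # Count clipped token overlap (each reference token matches once).
    ref_counts: dict[str, int] = {}
    for tok in ref:
        ref_counts[tok] = ref_counts.get(tok, 0) + 1
    common = 0
    for tok in gen:
        if ref_counts.get(tok, 0) > 0:
            common += 1
            ref_counts[tok] -= 1
    precision = common / len(gen)
    recall = common / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```
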