| --- |
| license: apache-2.0 |
| --- |
| # Ftel-Text2SQL |
|
|
| **Ftel-Text2SQL** is a Text-to-SQL model developed by FPT Telecom. Fine-tuned for complex, cross-domain database querying, it reaches an execution accuracy of 71.19 on the BIRD-SQL dev set and 72.78 on the test set. |
|
|
| ## BIRD-SQL Results |
| - **Dev set:** 71.19 |
| - **Test set:** 72.78 |
|
|
| ## Model Details |
| - **Base Architecture:** [Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) |
| - **Developer:** FPT Telecom |
| - **Language(s):** English |
| - **License:** Apache 2.0 |
| - **Task:** Text-to-SQL (Code Generation) |
|
|
| ## Training Methodology |
| This model was trained with a two-stage pipeline designed to improve both structural understanding and logical reasoning for SQL generation: |
|
|
| ### Stage 1: Supervised Fine-Tuning (SFT) |
| In the first stage, the base Qwen2.5-Coder model underwent SFT on high-quality, curated Text-to-SQL datasets. This stage focused on teaching the model to handle complex database schemas, foreign key relationships, and advanced SQL constructs and dialect nuances (such as window functions and complex joins). |
|
|
| ### Stage 2: Group Relative Policy Optimization (GRPO) |
| To further improve the model's reasoning, we added a reinforcement learning stage using **GRPO** (Group Relative Policy Optimization). |
| Unlike standard RLHF, which relies on a reward model trained on human preferences, our GRPO implementation used **execution-based reward functions**. The model was rewarded based on: |
| 1. **Execution Correctness:** Whether the generated SQL query executed successfully without syntax errors. |
| 2. **Result Equivalence:** Whether the executed output matched the ground truth execution results. |
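The two reward signals above can be sketched as follows. This is a minimal illustration and not the training code: it assumes a SQLite database, the helper names (`run_query`, `execution_reward`, `group_relative_advantages`) and the reward weights are hypothetical, and the card does not state how the two signals were actually combined. The last function shows the group-relative step of GRPO, where each sample's reward is standardized against the other samples drawn for the same prompt (population standard deviation is used here for simplicity).

```python
import sqlite3
from statistics import mean, pstdev


def run_query(conn, sql):
    """Return the rows as a sorted list, or None if execution fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sort by repr so row order does not affect the comparison (a simplification).
    return sorted(rows, key=repr)


def execution_reward(conn, predicted_sql, gold_sql, w_exec=0.2, w_match=0.8):
    """Combine both signals; the weights are illustrative assumptions."""
    predicted = run_query(conn, predicted_sql)
    if predicted is None:
        return 0.0  # the query did not execute
    gold = run_query(conn, gold_sql)
    reward = w_exec  # executed without error
    if gold is not None and predicted == gold:
        reward += w_match  # result matches the ground truth
    return reward


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy database and a group of three sampled queries for one question.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, city TEXT);
    INSERT INTO users (city) VALUES ('Hanoi'), ('Hanoi'), ('Hue');
    """
)
gold = "SELECT COUNT(*) FROM users WHERE city = 'Hanoi'"
group = [
    "SELECT COUNT(id) FROM users WHERE city = 'Hanoi'",  # runs, correct result
    "SELECT COUNT(*) FROM users",  # runs, wrong result
    "SELEC COUNT(*) FROM users",  # syntax error
]
rewards = [execution_reward(conn, sql, gold) for sql in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

In this toy group the correct query receives the largest advantage and the failing query the smallest, which is the ordering the policy update then reinforces.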
|
|
| This two-stage approach significantly reduced syntax hallucinations and improved the model's multi-step reasoning on real-world database queries. |
|
|
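| ## Usage & Prompt Format |
| This model inherits the ChatML format from the Qwen2.5 family. For best performance, we highly recommend **few-shot** prompting combined with self-consistency (majority voting over several sampled queries). Note that self-consistency requires sampling with a temperature above 0; greedy decoding yields a single deterministic candidate. |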
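One possible way to build a few-shot ChatML prompt is to prepend solved examples as earlier user/assistant turns. The sketch below is an assumption about layout, not the format used in training: the helper names, the example schema, and the choice to put only the SQL in the example assistant turns are all hypothetical.

```python
def chatml_turn(role, content):
    """Format one complete ChatML turn."""
    return f"<|im_start|>{role}\n{content}<|im_end|>\n"


def format_task(schema, question):
    """Hypothetical user-message layout, loosely mirroring the card's prompt."""
    return f"Database Schema:\n{schema}\nQuestion:\n{question}"


def build_few_shot_prompt(system, examples, schema, question):
    """Prepend solved examples as earlier user/assistant turns."""
    parts = [chatml_turn("system", system)]
    for ex in examples:
        parts.append(chatml_turn("user", format_task(ex["schema"], ex["question"])))
        parts.append(chatml_turn("assistant", ex["sql"]))
    parts.append(chatml_turn("user", format_task(schema, question)))
    parts.append("<|im_start|>assistant\n")  # left open for the model to complete
    return "".join(parts)


# Hypothetical example data for illustration only.
examples = [
    {
        "schema": "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);",
        "question": "How many users are there?",
        "sql": "SELECT COUNT(*) FROM users;",
    }
]
few_shot_prompt = build_few_shot_prompt(
    "You are a data science expert.",
    examples,
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);",
    "What is the largest order total?",
)
print(few_shot_prompt)
```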
|
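Self-consistency for SQL is often implemented by executing every sampled query and voting on the result sets rather than on the query text. The sketch below assumes a SQLite database and uses hypothetical helper names (`result_key`, `majority_vote`); it illustrates the idea and is not the evaluation code behind the reported scores.

```python
import sqlite3
from collections import Counter


def result_key(conn, sql):
    """Execute a candidate and return a hashable, order-insensitive result key."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None  # failed candidates do not vote
    return tuple(sorted(rows, key=repr))


def majority_vote(conn, candidates):
    """Return the first candidate whose result set is the most common one."""
    keyed = [(sql, result_key(conn, sql)) for sql in candidates]
    votes = Counter(key for _, key in keyed if key is not None)
    if not votes:
        return None  # no candidate executed successfully
    winning_key, _ = votes.most_common(1)[0]
    return next(sql for sql, key in keyed if key == winning_key)


# Toy database and four sampled candidates for one question.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    INSERT INTO orders (status) VALUES ('paid'), ('paid'), ('open');
    """
)
candidates = [
    "SELECT COUNT(*) FROM orders WHERE status = 'paid'",
    "SELECT COUNT(id) FROM orders WHERE status = 'paid'",
    "SELECT COUNT(*) FROM orders",
    "SELECT COUNT(*) FORM orders",  # syntax error, ignored by the vote
]
best = majority_vote(conn, candidates)
print(best)
```

With vLLM, the candidates could be drawn in one call by setting `n` and a non-zero `temperature` in `SamplingParams`; suitable values are not given in this card and would need tuning.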
|
| ### Example Inference Code (vLLM) |
| ```python |
| from vllm import LLM, SamplingParams |
| |
| llm = LLM(model="FSD-LAB/FTEL-Text2SQL", tensor_parallel_size=2) |
| sampling_params = SamplingParams(temperature=0.0, max_tokens=512) |
| |
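| # Hypothetical example input; replace with your own schema description and question. |
| data = { |
|     "db_desc": "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);", |
|     "question": "How many users are there?", |
| } |
| |
| # Standard ChatML prompt; the f-string fills in the schema and question. |
| prompt = f"""<|im_start|>system |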
| You are a data science expert. Below, you are provided with a database schema and a natural language question. Your task is to understand the schema and generate a valid SQL query to answer the question. |
| <|im_end|> |
| <|im_start|>user |
| Database Schema: |
| {data["db_desc"]} |
| This schema describes the database's structure, including tables, columns, primary keys, foreign keys, and any relevant relationships or constraints. |
| Question: |
| {data["question"]} |
| </answer> |
| <|im_end|> |
| <|im_start|>assistant |
| Let me solve this step by step. \n<think> |
| """ |
| |
| # Generate one completion and print the model's reasoning and SQL. |
| outputs = llm.generate([prompt], sampling_params) |
| print(outputs[0].outputs[0].text) |
| ``` |