FSD-LAB
/

FTEL-Text2SQL

Model card Files Files and versions

FTEL-Text2SQL / README.md

Phucdh35's picture

Update README.md

6ff9d55 verified about 1 month ago

|

history blame contribute delete

2.95 kB

	---
	license: apache-2.0
	---
	# 🦅 Ftel-Text2SQL

	Ftel-Text2SQL is a Text-to-SQL model developed by FPT Telecom. Fine-tuned specifically for complex, cross-domain database querying, it achieves high execution accuracy on challenging benchmarks like BIRD-SQL.

	## 🎯 BirdSQL results
	- Dev set: 71.19
	- Test set: 72.78

	## 🚀 Model Details
	- Base Architecture: [Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct)
	- Developer: FPT Telecom
	- Language(s): English
	- License: Apache 2.0
	- Task: Text-to-SQL (Code Generation)

	## 🧠 Training Methodology
	This model was trained using a robust two-stage pipeline designed to enhance both structural understanding and logical reasoning for SQL generation:

	### Stage 1: Supervised Fine-Tuning (SFT)
	In the first phase, the base Qwen2.5-Coder model underwent SFT using high-quality, curated Text-to-SQL datasets. This stage focused on teaching the model complex database schemas, foreign key relationships, and advanced SQL dialect nuances (such as window functions and complex joins).

	### Stage 2: Group Relative Policy Optimization (GRPO)
	To further align the model's reasoning capabilities, we implemented a reinforcement learning phase using GRPO (Group Relative Policy Optimization).
	Unlike standard RLHF which relies on human preference models, our GRPO implementation utilized execution-based reward functions. The model was rewarded based on:
	1. Execution Correctness: Whether the generated SQL query executed successfully without syntax errors.
	2. Result Equivalence: Whether the executed output matched the ground truth execution results.

	This dual-stage approach significantly reduced syntax hallucination and improved the model's multi-step reasoning capabilities for real-world database queries.

	## 💻 Usage & Prompt Format
	This model inherits the ChatML format from the Qwen2.5 family. We highly recommend using a Few-shot prompting strategy with Self-Consistency (Majority Voting) for optimal performance.

	### Example Inference Code (vLLM)
	```python
	from vllm import LLM, SamplingParams

	llm = LLM(model="FSD-LAB/FTEL-Text2SQL", tensor_parallel_size=2)
	sampling_params = SamplingParams(temperature=0.0, max_tokens=512)

	# Standard ChatML Prompting
	prompt = """<\|im_start\|>system
	You are a data science expert. Below, you are provided with a database schema and a natural language question. Your task is to understand the schema and generate a valid SQL query to answer the question.
	<\|im_end\|>
	<\|im_start\|>user
	Database Schema:
	{data["db_desc"]}
	This schema describes the database's structure, including tables, columns, primary keys, foreign keys, and any relevant relationships or constraints.
	Question:
	{data["question"]}
	</answer>
	<\|im_end\|>
	<\|im_start\|>assistant
	Let me solve this step by step. \n<think>
	"""

	outputs = llm.generate([prompt], sampling_params)
	print(outputs[0].outputs[0].text)