---
library_name: transformers
tags:
- text-to-SQL
- SQL
- code-generation
- NLQ-to-SQL
- text2SQL
- darija
- moroccan-dialect
datasets:
- salmane11/Dialect2SQL
language:
- ar
- ary
base_model:
- Qwen/Qwen2.5-coder-3B
---

# Darija2SQL-3B

## Model Description

**Darija2SQL** is a specialized **Code LLM** that translates natural language questions written in **Moroccan Arabic (Darija)** into **SQL queries**.
It is based on the [Qwen2.5-coder-3B](https://huggingface.co/Qwen/Qwen2.5-coder-3B) model and fine-tuned on [Dialect2SQL](https://huggingface.co/datasets/salmane11/Dialect2SQL), a dataset of **Darija–SQL** pairs spanning multiple database schemas and domains.

The goal of this model is to bridge the gap between speakers of dialectal Arabic and database systems by letting users query structured data in their native dialect.

---

## Fine-tuning Procedure

Darija2SQL was fine-tuned with **LoRA (Low-Rank Adaptation)** adapters, a **PEFT (Parameter-Efficient Fine-Tuning)** method, to preserve the strong code reasoning capabilities of Qwen2.5-coder-3B while specializing it for Darija text-to-SQL generation.

Training included both schema comprehension and dialectal normalization steps, so that the model learns common Darija variants and the SQL intent behind them.

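
To make the parameter efficiency of LoRA concrete, the sketch below counts the trainable parameters for a single adapted weight matrix. LoRA keeps a weight matrix of shape `d_out × d_in` frozen and trains two low-rank factors of shapes `d_out × r` and `r × d_in`. The dimensions and rank used here are illustrative assumptions, not the actual Darija2SQL training configuration.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # Factor A has shape (rank, d_in); factor B has shape (d_out, rank).
    return rank * d_in + d_out * rank


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight matrix itself."""
    return d_in * d_out


# Hypothetical values chosen for illustration only.
d_in, d_out, rank = 2048, 2048, 16

lora = lora_param_count(d_in, d_out, rank)
full = full_param_count(d_in, d_out)
print(f"LoRA: {lora:,} trainable vs {full:,} frozen ({lora / full:.2%})")
```

Under these assumed values the adapter trains under 2% as many parameters as the matrix it adapts, which is why the base model's weights can stay untouched during specialization.
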

---

## Intended Use and Limitations

This model is intended for **research and educational purposes**, specifically in the following areas:

- Natural language to SQL generation for dialectal Arabic
- Adaptation of code models to low-resource dialects
- Database interaction interfaces in Arabic contexts

While Darija2SQL performs well on the Dialect2SQL dataset, it may not generalize to unseen schemas or domains without further adaptation.

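
Because a generated query may reference tables or columns that do not exist in an unseen schema, it can help to check each query before running it on real data. The sketch below uses Python's standard `sqlite3` module to compile a query against an empty in-memory copy of the schema. The helper name `is_valid_sqlite` and the DDL string are assumptions made for illustration; they are not part of the model or the dataset.

```python
import sqlite3

# Hypothetical schema, written as standard SQLite DDL.
DDL = """
CREATE TABLE Customers (customer_id number, name varchar, city varchar);
CREATE TABLE Orders (order_id number, product_id number, customer_id number,
                     quantity number, order_date date);
"""


def is_valid_sqlite(query: str, ddl: str) -> bool:
    """Return True if `query` compiles against the schema defined by `ddl`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(ddl)
        # EXPLAIN QUERY PLAN compiles the statement without executing it.
        conn.execute(f"EXPLAIN QUERY PLAN {query}")
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()


good = (
    "SELECT COUNT(*) FROM Orders JOIN Customers "
    "ON Orders.customer_id = Customers.customer_id "
    "WHERE Customers.city = 'Casablanca'"
)
bad = "SELECT town FROM Customers"  # `town` is not a column in the schema

print(is_valid_sqlite(good, DDL), is_valid_sqlite(bad, DDL))
```

A passing check only shows that the query is syntactically valid and consistent with the schema; it does not show that the query answers the question that was asked.
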

---

## How to Use

Example 1: **Ecommerce_DB**

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Use a GPU when one is available; fall back to CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("salmane11/Darija2SQL-3B")
model = AutoModelForCausalLM.from_pretrained("salmane11/Darija2SQL-3B").to(device)

# Prompt layout: database schema, the Darija question as SQL comments, then the start of the query.
# The question reads roughly: "How many orders did the customer living in Casablanca place?"
input_text = """
CREATE TABLE Products {
product_id number,
product_name varchar,
category varchar,
price real,
stock number
}

CREATE TABLE Orders {
order_id number,
product_id number,
customer_id number,
quantity number,
order_date date
}

CREATE TABLE Customers {
customer_id number,
name varchar,
city varchar
}

-- Using valid SQLite, answer the following question in Darija:
-- شحال من order دار الزبون لي ساكن ف كازا؟

SELECT
"""

# Tokenize the single prompt; no padding is needed for a batch of one.
encoding = tokenizer(input_text, return_tensors="pt").to(device)

outputs = model.generate(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    max_length=512,  # total length in tokens, prompt included
    do_sample=True,
    top_k=120,
    top_p=0.95,
)
line = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
# The decoded text contains the prompt as well; keep everything from the first SELECT onward.
query_beginning = line.find("SELECT")
print(line[query_beginning:])
```
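
The prompt in the example follows a fixed layout: the schema, an instruction comment, the question as a comment, and a trailing `SELECT` for the model to complete. A minimal sketch of helpers that build this layout for any schema and question, and that trim decoded output to the first statement, could look like the following. The names `build_prompt` and `extract_sql` are hypothetical, and the layout is inferred from the example above rather than from a documented template.

```python
INSTRUCTION = "-- Using valid SQLite, answer the following question in Darija:"


def build_prompt(schema: str, question: str) -> str:
    """Assemble a prompt that mirrors the layout of the example above."""
    return f"\n{schema.strip()}\n\n{INSTRUCTION}\n-- {question.strip()}\n\nSELECT\n"


def extract_sql(decoded: str) -> str:
    """Return the first SQL statement found in the decoded model output.

    Assumes that neither the schema nor the question contains the word SELECT
    and that the query has no semicolon inside a string literal.
    """
    start = decoded.find("SELECT")
    if start == -1:
        return ""
    sql = decoded[start:]
    end = sql.find(";")
    if end != -1:
        sql = sql[: end + 1]  # drop anything generated after the first statement
    return " ".join(sql.split())  # collapse newlines and repeated spaces


schema = "CREATE TABLE Customers {\ncustomer_id number,\nname varchar,\ncity varchar\n}"
prompt = build_prompt(schema, "شحال من order دار الزبون لي ساكن ف كازا؟")

# Stand-in for decoded model output; a real run would call model.generate.
decoded = prompt + "COUNT(*) FROM Customers WHERE city = 'Casablanca';\nSELECT 1"
print(extract_sql(decoded))
```

Keeping prompt construction in one function makes it easier to keep inference prompts consistent with whatever layout the model saw during fine-tuning.
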

## Cite our work

```bibtex
@inproceedings{chafik2025dialect2sql,
  title={Dialect2SQL: A Novel Text-to-SQL Dataset for Arabic Dialects with a Focus on Moroccan Darija},
  author={Chafik, Salmane and Ezzini, Saad and Berrada, Ismail},
  booktitle={Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4)},
  pages={86--92},
  year={2025}
}

@article{10.1145/3787497,
  author = {Chafik, Salmane and Ezzini, Saad and Berrada, Ismail},
  title = {DarijaDB: Unlocking Text-to-SQL for Arabic Dialects},
  year = {2026},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  issn = {2375-4699},
  url = {https://doi.org/10.1145/3787497},
  doi = {10.1145/3787497},
  journal = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
  month = jan
}