---
library_name: transformers
tags:
- text-to-SQL
- SQL
- code-generation
- NLQ-to-SQL
- text2SQL
- darija
- moroccan-dialect
datasets:
- salmane11/Dialect2SQL
language:
- ar
- ary
base_model:
- Qwen/Qwen2.5-coder-3B
---

# Darija2SQL-3B

## Model Description

**Darija2SQL** is a specialized **Code LLM** designed for translating **Moroccan Arabic (Darija)** natural language questions into **SQL queries**.  
It is based on the [Qwen2.5-coder-3B](https://huggingface.co/Qwen/Qwen2.5-coder-3B) model and fine-tuned on [Dialect2SQL](https://huggingface.co/datasets/salmane11/Dialect2SQL), a dataset of **Darija–SQL** pairs spanning multiple database schemas and domains.  

The goal of this model is to bridge the gap between dialectal Arabic users and database systems, enabling users to query structured data in their native dialect.

---

## Fine-tuning Procedure

Darija2SQL was fine-tuned using **PEFT (Parameter-Efficient Fine-Tuning)** methods, specifically **LoRA (Low-Rank Adaptation)** adapters, to preserve the strong code reasoning capabilities of Qwen2.5-coder-3B while specializing it for Darija text-to-SQL generation.
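To make "parameter-efficient" concrete, the sketch below compares the size of one full weight matrix with the size of the two low-rank factors that LoRA trains in its place. The dimensions and rank are illustrative assumptions, not the configuration used to train Darija2SQL.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Count parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Illustrative values only; NOT the model's actual layer sizes or LoRA rank.
d_in, d_out, rank = 2048, 2048, 16

full = d_in * d_out                              # parameters in the frozen weight matrix
lora = lora_trainable_params(d_in, d_out, rank)  # parameters that LoRA actually trains

print(f"full matrix: {full:,} parameters")
print(f"LoRA factors: {lora:,} parameters")
print(f"trainable fraction: {lora / full:.4f}")
```

Because the base weights stay frozen and only the small factors are updated, the adapted model keeps the base model's behavior as its starting point.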

Training covered both schema comprehension and dialectal normalization, so that the model learns common Darija variants and the SQL intent behind them.

---

## Intended Use and Limitations

This model is intended for **research and educational purposes**, specifically in the areas of:
- Natural Language to SQL generation for dialectal Arabic
- Code model adaptation to low-resource dialects
- Database interaction interfaces in Arabic contexts

Darija2SQL performs well on the Dialect2SQL dataset, but it may not generalize to unseen schemas or domains without further adaptation.
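Given this limitation, it can help to check a generated query against the target schema before running it on real data. The following is a minimal sketch using Python's built-in `sqlite3` module. The helper name `is_valid_sqlite`, the schema (written as standard SQLite DDL), and the two queries are hypothetical stand-ins for model output, not part of the model's API.

```python
import sqlite3


def is_valid_sqlite(ddl: str, query: str) -> bool:
    """Return True if `query` compiles against the schema defined by `ddl`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(ddl)
        # EXPLAIN compiles the statement without executing it on the data.
        conn.execute(f"EXPLAIN {query}")
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()


# Hypothetical schema and queries, for illustration only.
ddl = """
CREATE TABLE Customers (customer_id number, name varchar, city varchar);
CREATE TABLE Orders (order_id number, product_id number,
                     customer_id number, quantity number, order_date date);
"""
good = ("SELECT COUNT(*) FROM Orders AS o "
        "JOIN Customers AS c ON o.customer_id = c.customer_id "
        "WHERE c.city = 'Casablanca'")
bad = "SELECT COUNT(*) FROM Orders WHERE town = 'Casablanca'"  # unknown column

print(is_valid_sqlite(ddl, good))
print(is_valid_sqlite(ddl, bad))
```

This only confirms that the query is syntactically valid and refers to existing tables and columns. It does not confirm that the query answers the question.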

---

## How to Use

Example 1: **Ecommerce_DB**

```python
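import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Use the GPU when one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("salmane11/Darija2SQL-3B")
model = AutoModelForCausalLM.from_pretrained("salmane11/Darija2SQL-3B").to(device)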

input_text = """
CREATE TABLE Products {
    product_id number,
    product_name varchar,
    category varchar,
    price real,
    stock number
}

CREATE TABLE Orders {
    order_id number,
    product_id number,
    customer_id number,
    quantity number,
    order_date date
}

CREATE TABLE Customers {
    customer_id number,
    name varchar,
    city varchar
}

-- Using valid SQLite, answer the following question in Darija:
-- شحال من order دار الزبون لي ساكن ف كازا؟

SELECT
"""

# A single prompt needs no padding; the tokenizer returns input_ids and attention_mask
encoding = tokenizer(input_text, return_tensors="pt").to(device)

outputs = model.generate(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    max_length=512,  # counts prompt tokens plus generated tokens
    do_sample=True,
    top_k=120,
    top_p=0.95,
)
line = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
# The prompt ends with SELECT, so keep everything from that keyword onward
query_beginning = line.find("SELECT")
print(line[query_beginning:])
```
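To reuse the prompt layout from the example above with other schemas and questions, a small helper can assemble the string. This is a sketch: `build_prompt` is a hypothetical name, and the exact whitespace is inferred from the example rather than from a documented prompt specification.

```python
def build_prompt(schema: str, question: str) -> str:
    """Assemble a prompt in the same layout as the example above:
    schema, instruction comment, question comment, then a trailing SELECT."""
    return (
        f"\n{schema.strip()}\n\n"
        "-- Using valid SQLite, answer the following question in Darija:\n"
        f"-- {question.strip()}\n\n"
        "SELECT\n"
    )


# Schema in the brace style used by the example above (shortened here).
schema = """
CREATE TABLE Orders {
    order_id number,
    customer_id number
}

CREATE TABLE Customers {
    customer_id number,
    city varchar
}
"""

prompt = build_prompt(schema, "شحال من order دار الزبون لي ساكن ف كازا؟")
print(prompt)
```

The resulting `prompt` can be passed to the tokenizer in place of `input_text`.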

## Cite our work

```bibtex
@inproceedings{chafik2025dialect2sql,
  title={Dialect2SQL: A Novel Text-to-SQL Dataset for Arabic Dialects with a Focus on Moroccan Darija},
  author={Chafik, Salmane and Ezzini, Saad and Berrada, Ismail},
  booktitle={Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4)},
  pages={86--92},
  year={2025}
}

@article{10.1145/3787497,
  author = {Chafik, Salmane and Ezzini, Saad and Berrada, Ismail},
  title = {DarijaDB: Unlocking Text-to-SQL for Arabic Dialects},
  year = {2026},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  issn = {2375-4699},
  url = {https://doi.org/10.1145/3787497},
  doi = {10.1145/3787497},
  journal = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
  month = jan
}
```