---
library_name: transformers
tags:
- text-to-SQL
- SQL
- code-generation
- NLQ-to-SQL
- text2SQL
- darija
- moroccan-dialect
datasets:
- salmane11/Dialect2SQL
language:
- ar
- ary
base_model:
- Qwen/Qwen2.5-coder-3B
---

# Darija2SQL-3B

## Model Description

**Darija2SQL** is a specialized **Code LLM** designed for translating **Moroccan Arabic (Darija)** natural language questions into **SQL queries**.  
It is based on the [Qwen2.5-coder-3B](https://huggingface.co/Qwen/Qwen2.5-coder-3B) model and fine-tuned on [Dialect2SQL](https://huggingface.co/datasets/salmane11/Dialect2SQL), a dataset of **Darija–SQL** pairs spanning multiple database schemas and domains.  

The goal of this model is to bridge the gap between dialectal Arabic users and database systems, enabling users to query structured data in their native dialect.

---

## Fine-tuning Procedure

Darija2SQL was fine-tuned using **PEFT (Parameter-Efficient Fine-Tuning)** methods, specifically **LoRA (Low-Rank Adaptation)** adapters, to preserve the strong code reasoning capabilities of Qwen2.5-coder-3B while specializing it for Darija text-to-SQL generation.

Training included both schema comprehension and dialectal normalization steps, so that the model learns common Darija variants and the SQL intent behind them.
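
To make the LoRA idea concrete, the sketch below counts trainable parameters for a single weight matrix under full fine-tuning versus a low-rank adapter. The function name `lora_param_counts`, the matrix size, and the rank are illustrative assumptions, not the hyperparameters used to train Darija2SQL.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates every entry of W (d_out x d_in).
    LoRA freezes W and trains B (d_out x rank) and A (rank x d_in),
    so the effective weight is W + B @ A, up to a scaling factor.
    """
    full = d_out * d_in
    lora = d_out * rank + rank * d_in
    return full, lora


# Illustrative sizes only; not the configuration used for Darija2SQL.
full, lora = lora_param_counts(d_out=2048, d_in=2048, rank=16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Because only the small `A` and `B` matrices are updated, the base model's weights stay untouched, which is how this kind of adapter can preserve existing code capabilities.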

---

## Intended Use and Limitations

This model is intended for **research and educational purposes**, specifically in the areas of:
- Natural Language to SQL generation for dialectal Arabic
- Code model adaptation to low-resource dialects
- Database interaction interfaces in Arabic contexts

While Darija2SQL performs well on the Dialect2SQL dataset, it may not generalize to unseen schemas or domains without further adaptation.
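
Given that limitation, one possible safeguard is to dry-run each generated query against an empty in-memory SQLite database before using it. The sketch below is a suggestion rather than part of the released model: the helper `is_valid_sqlite`, the small schema, and both queries are hand-written assumptions, not actual Darija2SQL output.

```python
import sqlite3


def is_valid_sqlite(schema_ddl: str, query: str) -> tuple[bool, str]:
    """Check that `query` compiles against `schema_ddl` on an empty database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)    # create the empty tables
        conn.execute(f"EXPLAIN {query}")  # compile only; raises on bad SQL or unknown names
        return True, ""
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()


# Hypothetical schema written as standard SQLite DDL.
SCHEMA = """
CREATE TABLE Customers (customer_id INTEGER, name TEXT, city TEXT);
CREATE TABLE Orders (order_id INTEGER, customer_id INTEGER, quantity INTEGER);
"""

# Hand-written stand-ins for model output.
good = "SELECT COUNT(*) FROM Orders JOIN Customers USING (customer_id) WHERE city = 'Casablanca'"
bad = "SELECT COUNT(*) FROM Orders WHERE town = 'Casablanca'"

print(is_valid_sqlite(SCHEMA, good))
print(is_valid_sqlite(SCHEMA, bad))
```

This only catches syntax errors and references to missing tables or columns; a query can pass the check and still answer the wrong question.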

---

## How to Use

Example 1: **Ecommerce_DB**

```python
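import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Use the GPU when one is available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"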
tokenizer = AutoTokenizer.from_pretrained("salmane11/Darija2SQL-3B")
model = AutoModelForCausalLM.from_pretrained("salmane11/Darija2SQL-3B").to(device)

input_text = """
CREATE TABLE Products {
    product_id number,
    product_name varchar,
    category varchar,
    price real,
    stock number
}

CREATE TABLE Orders {
    order_id number,
    product_id number,
    customer_id number,
    quantity number,
    order_date date
}

CREATE TABLE Customers {
    customer_id number,
    name varchar,
    city varchar
}

-- Using valid SQLite, answer the following question in Darija:
-- شحال من order دار الزبون لي ساكن ف كازا؟

SELECT
"""

# Tokenize the prompt; a single sequence needs no padding.
encoding = tokenizer(input_text, return_tensors="pt").to(device)

outputs = model.generate(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    max_length=512,  # total length: prompt plus generated tokens
    do_sample=True,
    top_k=120,
    top_p=0.95,
)
# Decode, then keep only the text from the first SELECT onward to drop the schema.
line = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
query_beginning = line.find("SELECT")
print(line[query_beginning:])
```
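
To query a different database, the same prompt layout can be assembled programmatically. The helper below is a minimal sketch: `build_prompt` is a hypothetical name, and the layout (schema, instruction comment, question comment, trailing `SELECT`) is copied from the example above rather than taken from a documented specification.

```python
def build_prompt(schema: str, question: str) -> str:
    """Assemble a schema and a Darija question into the prompt layout shown above."""
    return (
        f"\n{schema.strip()}\n\n"
        "-- Using valid SQLite, answer the following question in Darija:\n"
        f"-- {question.strip()}\n\n"
        "SELECT\n"
    )


# Schema in the same brace-delimited style as the example above.
schema = """
CREATE TABLE Customers {
    customer_id number,
    name varchar,
    city varchar
}
"""

prompt = build_prompt(schema, "شحال من order دار الزبون لي ساكن ف كازا؟")
print(prompt)
```

The resulting string can be passed to the tokenizer in place of `input_text`.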

## Cite our work
```bibtex
@inproceedings{chafik2025dialect2sql,
  title={Dialect2SQL: A Novel Text-to-SQL Dataset for Arabic Dialects with a Focus on Moroccan Darija},
  author={Chafik, Salmane and Ezzini, Saad and Berrada, Ismail},
  booktitle={Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4)},
  pages={86--92},
  year={2025}
}

@article{10.1145/3787497,
  author = {Chafik, Salmane and Ezzini, Saad and Berrada, Ismail},
  title = {DarijaDB: Unlocking Text-to-SQL for Arabic Dialects},
  year = {2026},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  issn = {2375-4699},
  url = {https://doi.org/10.1145/3787497},
  doi = {10.1145/3787497},
  journal = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
  month = jan
}