---
library_name: transformers
tags:
- text-to-SQL
- SQL
- code-generation
- NLQ-to-SQL
- text2SQL
- darija
- moroccan-dialect
datasets:
- salmane11/Dialect2SQL
language:
- ar
- ary
base_model:
- Qwen/Qwen2.5-coder-3B
---

# Darija2SQL-3B

## Model Description

**Darija2SQL** is a specialized **Code LLM** designed for translating **Moroccan Arabic (Darija)** natural language questions into **SQL queries**. It is based on the [Qwen2.5-coder-3B](https://huggingface.co/Qwen/Qwen2.5-coder-3B) model and fine-tuned on [Dialect2SQL](https://huggingface.co/datasets/salmane11/Dialect2SQL), a dataset of **Darija–SQL** pairs spanning multiple database schemas and domains.

The goal of this model is to bridge the gap between dialectal Arabic users and database systems, enabling users to query structured data in their native dialect.

---

## Finetuning Procedure

Darija2SQL was fine-tuned using **PEFT (Parameter-Efficient Fine-Tuning)** methods, specifically **LoRA (Low-Rank Adaptation)** adapters, to preserve the strong code-reasoning capabilities of Qwen2.5-coder-3B while specializing it for Darija text-to-SQL generation. Training included both schema-comprehension and dialectal-normalization steps, ensuring the model understands common Darija variants and their SQL intent.

---

## Intended Use and Limitations

This model is intended for **research and educational purposes**, specifically in the areas of:

- Natural Language to SQL generation for dialectal Arabic
- Code model adaptation to low-resource dialects
- Database interaction interfaces in Arabic contexts

While Darija2SQL performs well on the Dialect2SQL dataset, it may not generalize perfectly to unseen schemas or domains without adaptation.
---

## How to Use

Example 1: **Ecommerce_DB**

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda"

tokenizer = AutoTokenizer.from_pretrained("salmane11/Darija2SQL-3B")
model = AutoModelForCausalLM.from_pretrained("salmane11/Darija2SQL-3B").to(device)

# The Darija question asks: "How many orders did the customer living in
# Casa (Casablanca) place?"
input_text = """
CREATE TABLE Products {
  product_id number,
  product_name varchar,
  category varchar,
  price real,
  stock number
}

CREATE TABLE Orders {
  order_id number,
  product_id number,
  customer_id number,
  quantity number,
  order_date date
}

CREATE TABLE Customers {
  customer_id number,
  name varchar,
  city varchar
}

-- Using valid SQLite, answer the following question in Darija:
-- شحال من order دار الزبون لي ساكن ف كازا؟

SELECT
"""

# Tokenize the prompt (a single input needs no padding).
encoding = tokenizer(input_text, return_tensors="pt").to(device)

outputs = model.generate(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    max_new_tokens=256,
    do_sample=True,
    top_k=120,
    top_p=0.95,
)

line = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)

# Keep only the generated SQL query.
query_beginning = line.find("SELECT")
print(line[query_beginning:])
```

## Cite our work

```bibtex
@inproceedings{chafik2025dialect2sql,
  title={Dialect2SQL: A Novel Text-to-SQL Dataset for Arabic Dialects with a Focus on Moroccan Darija},
  author={Chafik, Salmane and Ezzini, Saad and Berrada, Ismail},
  booktitle={Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4)},
  pages={86--92},
  year={2025}
}

@article{10.1145/3787497,
  author = {Chafik, Salmane and Ezzini, Saad and Berrada, Ismail},
  title = {DarijaDB: Unlocking Text-to-SQL for Arabic Dialects},
  year = {2026},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  issn = {2375-4699},
  url = {https://doi.org/10.1145/3787497},
  doi = {10.1145/3787497},
  journal = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
  month = jan
}
```
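## Validating Generated Queries

A query of the kind the usage example above is expected to produce can be sanity-checked against an in-memory SQLite database. The schema follows the Ecommerce_DB prompt; the sample rows and the `query` string are illustrative stand-ins for actual model output, not part of the model card's published material.

```python
import sqlite3

# Build the Ecommerce_DB schema from the usage example, with toy data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (customer_id INTEGER, name TEXT, city TEXT);
CREATE TABLE Orders (order_id INTEGER, product_id INTEGER,
                     customer_id INTEGER, quantity INTEGER, order_date TEXT);
INSERT INTO Customers VALUES (1, 'Yassine', 'Casablanca'), (2, 'Amina', 'Rabat');
INSERT INTO Orders VALUES
  (10, 100, 1, 2, '2024-01-05'),
  (11, 101, 1, 1, '2024-02-10'),
  (12, 100, 2, 3, '2024-03-01');
""")

# Hypothetical model output for "How many orders did the customer
# living in Casa place?"
query = """
SELECT COUNT(*)
FROM Orders
JOIN Customers ON Orders.customer_id = Customers.customer_id
WHERE Customers.city = 'Casablanca'
"""

print(conn.execute(query).fetchone()[0])  # → 2
```

Running generated SQL against a scratch database like this catches syntactically invalid or schema-inconsistent outputs before they reach a production system.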