Arabic SQL Coder

A bilingual Arabic + English text-to-SQL model. Give it a database schema and a question — in either language — and it returns a SQL query.

Fine-tuned from Qwen/Qwen2.5-Coder-3B-Instruct on a translated subset of gretelai/synthetic_text_to_sql, with English questions translated to Modern Standard Arabic via Helsinki-NLP/opus-mt-en-ar.

Highlights

  • 🌍 Bilingual — handles Arabic (MSA, الفصحى) and English natively
  • ⚡ Compact — 3B parameters, runs on consumer GPUs (8 GB+ in 4-bit, 16 GB+ in bf16)
  • 🎯 SQL-focused — fine-tuned for producing executable queries, not a general LLM
  • 📚 Wide skill coverage: SELECT, WHERE, JOIN, GROUP BY, HAVING, subqueries, LEFT JOIN ... IS NULL, LIKE, DISTINCT, date filtering, ORDER BY, aggregations
  • 🧪 Open methodology — full training code and reproducible recipe

Training curves

Trained for 2 epochs on ~100K bilingual samples; eval loss converged smoothly:

[Figure: training and eval loss curves]

The cosine learning-rate schedule with 5% warmup:

[Figure: learning-rate schedule]
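This schedule corresponds to transformers' get_cosine_schedule_with_warmup. A brief sketch (the step count is illustrative, derived from ~100K samples at effective batch size 32 over 2 epochs):

import torch
from transformers import get_cosine_schedule_with_warmup

# Illustrative: ~100K samples / effective batch 32 ≈ 3,125 steps per epoch, × 2 epochs
total_steps = 6250
optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=2e-4)  # placeholder optimizer
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.05 * total_steps),  # 5% warmup
    num_training_steps=total_steps,
)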

Evaluation

Evaluated on a hand-crafted, held-out test suite covering 12 SQL skill categories (basic SELECT, JOIN, GROUP BY, HAVING, subqueries, LEFT JOIN with NULL, date filtering, DISTINCT COUNT, LIKE, ORDER BY + LIMIT, and single- and multi-table aggregations), with real execution against an in-memory SQLite database.

Each test case includes seeded data, so accuracy is verified by running both the predicted SQL and the gold reference SQL and comparing result rows — not by string matching.
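A minimal sketch of that execution-match check, assuming one in-memory SQLite database per test case (function and variable names are illustrative, not the published harness):

import sqlite3

def execution_match(schema_sql, seed_sql, predicted_sql, gold_sql):
    """Run predicted and gold SQL on the same seeded database and compare rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_sql)  # build the tables for this test case
    conn.executescript(seed_sql)    # insert the deterministic seed rows
    try:
        pred_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False                # SQL that fails to execute counts as a miss
    gold_rows = conn.execute(gold_sql).fetchall()
    conn.close()
    # Order-insensitive comparison; a harness could instead respect ORDER BY
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)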

Quick start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

REPO = "mohamedelmadany/Qwen2.5-Arabic-to-SQL-Coder"  

tokenizer = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForCausalLM.from_pretrained(
    REPO,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# System prompt (Arabic): "You are an intelligent assistant specialized in writing
# SQL queries. Based on the provided context (schema) and question, write a correct
# and precise SQL query. Write the simplest query that answers the question."
SYSTEM_PROMPT = (
    "أنت مساعد ذكي متخصص في كتابة استعلامات SQL.\n"
    "بناءً على السياق (schema) والسؤال المقدم، اكتب استعلام SQL صحيح ودقيق.\n"
    "اكتب أبسط استعلام يجيب على السؤال. "
)

schema = (
    "CREATE TABLE employees ("
    "  id INT, name VARCHAR(100), department VARCHAR(50), salary INT"
    ");"
)
question = "اعرض أعلى 5 موظفين في الراتب"   # "Show the top 5 employees by salary"

user_msg = f"### Schema:\n{schema}\n\n### Question:\n{question}"
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user",   "content": user_msg},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

sql = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(sql.strip())
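For smaller GPUs (the 8 GB+ figure under Highlights), the same checkpoint can be loaded in 4-bit via bitsandbytes instead of bf16. A minimal sketch, reusing REPO from above and assuming bitsandbytes is installed:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # same NF4 scheme used during QLoRA training
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    REPO,
    quantization_config=bnb_config,
    device_map="auto",
)

Generation then works exactly as in the bf16 example above.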

Recommended use cases

  • 🧑‍💻 Analyst tooling — SQL drafted by the model, reviewed and run by an analyst
  • 📊 Internal data exploration — natural-language interface for ad-hoc queries
  • 🎓 SQL learning aids — Arabic-speaking learners seeing how questions translate to queries
  • 🌐 Bilingual data products — Arabic-first BI tools without losing English support
  • 🛠️ Chat-style data assistants with human-in-the-loop validation

Training details

Base model: Qwen/Qwen2.5-Coder-3B-Instruct
Method: QLoRA (4-bit NF4 + double-quant base, bf16 LoRA)
LoRA rank / alpha / dropout: 64 / 128 / 0.05
LoRA target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Training data: ~100K bilingual samples (50K English + 50K MSA Arabic)
Sequence length: 1,024
Epochs: 2
Effective batch size: 32 (8 per device × 4 gradient-accumulation steps)
Learning rate: 2e-4, cosine schedule, 5% warmup
Optimizer: paged AdamW 8-bit
Gradient checkpointing: enabled
Hardware: 1× NVIDIA A100 40 GB
Wall time: ~7 hours
Checkpoint selection: minimum eval loss
Final eval loss: 0.247
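These settings map directly onto a peft + transformers QLoRA configuration. A sketch of how they might look in code (argument names assume recent library versions; the released training code is the authoritative recipe):

import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(       # 4-bit NF4 + double-quant base
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
lora_config = LoraConfig(              # bf16 LoRA adapters
    r=64, lora_alpha=128, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
training_args = TrainingArguments(
    output_dir="qwen25-arabic-sql",    # illustrative path
    num_train_epochs=2,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,     # effective batch size 32
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    optim="paged_adamw_8bit",
    gradient_checkpointing=True,
    bf16=True,
    eval_strategy="steps",
    save_strategy="steps",
    load_best_model_at_end=True,       # best checkpoint by minimum eval loss
    metric_for_best_model="eval_loss",
)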

Dataset preparation

  1. Selected 50,000 rows of gretelai/synthetic_text_to_sql.
  2. Translated each English sql_prompt to Modern Standard Arabic via Helsinki-NLP/opus-mt-en-ar (GPU-batched, beam search width 4; a sketch of this step follows the list).
  3. Each source row produced two training examples — one English, one Arabic — sharing the same schema and gold SQL. This bilingual pairing teaches the model to map either language onto the same SQL space.
  4. Final training set: ~100K examples, balanced 50/50 English/Arabic.
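A sketch of the translation step from item 2, assuming MarianMT through transformers (the batch size is illustrative):

import torch
from transformers import MarianMTModel, MarianTokenizer

MT_REPO = "Helsinki-NLP/opus-mt-en-ar"
mt_tokenizer = MarianTokenizer.from_pretrained(MT_REPO)
mt_model = MarianMTModel.from_pretrained(MT_REPO).to(
    "cuda" if torch.cuda.is_available() else "cpu"
)

def translate_batch(prompts, batch_size=64):  # batch_size is illustrative
    """Translate a list of English sql_prompt strings to Modern Standard Arabic."""
    out = []
    for i in range(0, len(prompts), batch_size):
        batch = mt_tokenizer(prompts[i:i + batch_size], return_tensors="pt",
                             padding=True, truncation=True).to(mt_model.device)
        generated = mt_model.generate(**batch, num_beams=4)  # beam search width 4
        out.extend(mt_tokenizer.batch_decode(generated, skip_special_tokens=True))
    return out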

Limitations

  • The model produces a draft SQL query — outputs should be reviewed by a human or validated against the schema before being run on production data (see the validation sketch after this list).
  • Performance may vary on highly complex schemas (deep joins across many tables, recursive CTEs, vendor-specific extensions). Training emphasized small-to-moderate schema patterns.
  • Training data is synthetic; real-world data distributions and domain-specific vocabularies may differ from patterns the model has been exposed to.
  • Arabic understanding is grounded in Modern Standard Arabic. Heavy regional dialectal phrasings may produce less reliable results.
  • As with any generative model, outputs may occasionally be syntactically valid but semantically off-target. Always test on a development database before production use.
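One lightweight safeguard for the first point above is to dry-run generated SQL with EXPLAIN against an empty in-memory copy of the schema, which catches syntax errors and references to missing tables or columns without touching real data. A minimal sketch:

import sqlite3

def is_executable(schema_sql, candidate_sql):
    """Dry-run a generated query against an empty in-memory copy of the schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)             # build tables, no data
        conn.execute(f"EXPLAIN {candidate_sql}")   # parses and plans without running
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()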

Citation

This model builds on:

@misc{qwen2.5-coder,
  title={Qwen2.5-Coder Technical Report},
  author={Qwen Team},
  year={2024},
  url={https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct}
}

@dataset{gretelai_synthetic_sql,
  title={gretelai/synthetic_text_to_sql},
  author={Gretel AI},
  year={2024},
  url={https://huggingface.co/datasets/gretelai/synthetic_text_to_sql}
}

License

Apache 2.0.
