---
license: apache-2.0
language:
- ar
- en
base_model: Qwen/Qwen2.5-Coder-3B-Instruct
library_name: transformers
tags:
- text-to-sql
- arabic
- bilingual
- qwen
- lora
- code
datasets:
- gretelai/synthetic_text_to_sql
pipeline_tag: text-generation
---

# Arabic SQL Coder

A bilingual **Arabic + English** text-to-SQL model. Give it a database schema and a question in either language, and it returns a SQL query.

Fine-tuned from [`Qwen/Qwen2.5-Coder-3B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct) on a translated subset of [`gretelai/synthetic_text_to_sql`](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql), with English questions translated to Modern Standard Arabic via [`Helsinki-NLP/opus-mt-en-ar`](https://huggingface.co/Helsinki-NLP/opus-mt-en-ar).

## Highlights

- 🌍 **Bilingual**: handles Arabic (MSA, الفصحى) and English natively
- ⚡ **Compact**: 3B parameters, runs on consumer GPUs (8 GB+ in 4-bit, 16 GB+ in bf16)
- 🎯 **SQL-focused**: fine-tuned for producing executable queries, not a general LLM
- 📚 **Wide skill coverage**: `SELECT`, `WHERE`, `JOIN`, `GROUP BY`, `HAVING`, subqueries, `LEFT JOIN ... IS NULL`, `LIKE`, `DISTINCT`, date filtering, `ORDER BY`, aggregations
- 🧪 **Open methodology**: full training code and reproducible recipe

## Training curves

Trained for 2 epochs on ~100K bilingual samples; eval loss converged smoothly:

![Training loss curves](training_loss.png)

The cosine learning-rate schedule with 5% warmup:

![Learning rate schedule](learning_rate.png)

## Evaluation

Evaluated on a hand-crafted held-out test suite covering 12 SQL skill categories (basic SELECT, JOIN, GROUP BY, HAVING, subqueries, LEFT JOIN with NULL, date filtering, DISTINCT COUNT, LIKE, ORDER BY + LIMIT, single + multi-table aggregations) with **real execution against an in-memory SQLite database**.

Each test case includes seeded data, so accuracy is verified by running both the predicted SQL and the gold reference SQL and comparing result rows, not by string matching.

## Quick start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

REPO = "mohamedelmadany/Qwen2.5-Arabic-to-SQL-Coder"

tokenizer = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForCausalLM.from_pretrained(
    REPO,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Arabic system prompt. In English: "You are a smart assistant specialized in
# writing SQL queries. Based on the provided context (schema) and question,
# write a correct and precise SQL query. Write the simplest query that answers
# the question."
SYSTEM_PROMPT = (
    "أنت مساعد ذكي متخصص في كتابة استعلامات SQL.\n"
    "بناءً على السياق (schema) والسؤال المقدم، اكتب استعلام SQL صحيح ودقيق.\n"
    "اكتب أبسط استعلام يجيب على السؤال. "
)

schema = (
    "CREATE TABLE employees ("
    " id INT, name VARCHAR(100), department VARCHAR(50), salary INT"
    ");"
)
question = "اعرض أعلى 5 موظفين في الراتب"  # "Show the top 5 employees by salary"

# The user turn carries the schema and the question under fixed headers.
user_msg = f"### Schema:\n{schema}\n\n### Question:\n{question}"
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": user_msg},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, skipping the prompt.
sql = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(sql.strip())
```

## Recommended use cases

- 🧑‍💻 **Analyst tooling**: SQL drafted by the model, reviewed and run by an analyst
- 📊 **Internal data exploration**: natural-language interface for ad-hoc queries
- 🎓 **SQL learning aids**: Arabic-speaking learners seeing how questions translate to queries
- 🌍 **Bilingual data products**: Arabic-first BI tools without losing English support
- 🛠️ **Chat-style data assistants** with human-in-the-loop validation

## Training details

| Setting | Value |
|---|---|
| Base model | `Qwen/Qwen2.5-Coder-3B-Instruct` |
| Method | QLoRA (4-bit NF4 + double-quant base, bf16 LoRA) |
| LoRA rank / alpha / dropout | 64 / 128 / 0.05 |
| LoRA target modules | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| Training data | ~100K bilingual samples (50K English + 50K MSA Arabic) |
| Sequence length | 1,024 |
| Epochs | 2 |
| Effective batch size | 32 (8 per device × 4 grad accum) |
| Learning rate | 2e-4, cosine schedule, 5% warmup |
| Optimizer | paged AdamW 8-bit |
| Gradient checkpointing | Enabled |
| Hardware | 1× NVIDIA A100 40 GB |
| Wall time | ~7 hours |
| Best checkpoint selected by | minimum eval loss |
| Final eval loss | 0.247 |

## Dataset preparation

1. Selected 50,000 rows of `gretelai/synthetic_text_to_sql`.
2. Translated each English `sql_prompt` to Modern Standard Arabic via `Helsinki-NLP/opus-mt-en-ar` (GPU batched, beam search width 4).
3. Each source row produced two training examples, one English and one Arabic, sharing the same schema and gold SQL. This bilingual pairing teaches the model to map either language onto the same SQL space.
4. Final training set: ~100K examples, balanced 50/50 English/Arabic.

## Limitations

- The model produces a **draft** SQL query; outputs should be reviewed by a human or validated against the schema before execution against production data.
- Performance may vary on highly complex schemas (deep joins across many tables, recursive CTEs, vendor-specific extensions). Training emphasized small-to-moderate schema patterns.
- Training data is synthetic; real-world data distributions and domain-specific vocabularies may differ from patterns the model has been exposed to.
- Arabic understanding is grounded in Modern Standard Arabic. Heavy regional dialectal phrasings may produce less reliable results.
- As with any generative model, outputs may occasionally be syntactically valid but semantically off-target. Always test on a development database before production use.

## Citation

This model builds on:

```bibtex
@misc{qwen2.5-coder,
  title={Qwen2.5-Coder Technical Report},
  author={Qwen Team},
  year={2024},
  url={https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct}
}

@dataset{gretelai_synthetic_sql,
  title={gretelai/synthetic_text_to_sql},
  author={Gretel AI},
  year={2024},
  url={https://huggingface.co/datasets/gretelai/synthetic_text_to_sql}
}
```

## License

Apache 2.0.
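
## Appendix: execution-match sketch

The execution-based check described under Evaluation (run the predicted and gold SQL against seeded in-memory SQLite data, then compare result rows) can be sketched with Python's standard `sqlite3` module. This is a minimal illustration, not the evaluation harness used for this model: the function names `run_query` and `execution_match`, the sample rows, and the order-insensitive comparison are all assumptions made for the example.

```python
import sqlite3


def run_query(conn, sql):
    """Run a single query; return its rows sorted by repr, or None if it fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except (sqlite3.Error, sqlite3.Warning):
        # Invalid SQL, unknown columns, or multiple statements all count as a miss.
        return None
    # Assumption: row order is ignored. Sorting by repr also keeps NULLs comparable.
    return sorted(rows, key=repr)


def execution_match(setup_sql, predicted_sql, gold_sql):
    """Return True when both queries yield the same rows on freshly seeded data."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)  # create tables and insert the seed rows
        gold_rows = run_query(conn, gold_sql)  # gold first, on untouched data
        predicted_rows = run_query(conn, predicted_sql)
    finally:
        conn.close()
    return gold_rows is not None and predicted_rows == gold_rows


# Hypothetical test case: seed data, a gold query, and a differently worded prediction.
SETUP = """
CREATE TABLE employees (id INT, name VARCHAR(100), department VARCHAR(50), salary INT);
INSERT INTO employees VALUES
    (1, 'Amal', 'Sales', 5000),
    (2, 'Omar', 'Sales', 7000),
    (3, 'Lina', 'IT', 9000);
"""
GOLD = "SELECT department, COUNT(*) FROM employees GROUP BY department"
PRED = "SELECT department, COUNT(id) AS n FROM employees GROUP BY department"

print(execution_match(SETUP, PRED, GOLD))
```

The two queries differ as strings but return the same rows, which is the case string matching would score incorrectly. For categories where ordering is part of the question (such as `ORDER BY` with `LIMIT`), a stricter variant would compare the unsorted row lists instead.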