🇵🇰 ZabaanAI-v2 — Pakistan Multilingual Instruction Model

Next-generation multilingual AI for Pakistan, built on Qwen2.5-7B-Instruct

🏗️ Architecture: Qwen2.5-7B-Instruct

Why Qwen2.5-7B-Instruct over Llama 3.1 / Mistral:

Factor	Qwen2.5-7B ✅	Llama 3.1 8B	Mistral 7B
Vocabulary	151K (best for Arabic/Nastaliq)	128K	32K (catastrophic for Urdu)
Urdu tokens/char	0.63 (best)	0.68	1.02 (worse than chars!)
Sindhi tokens/char	0.60 (best)	0.74	N/A
Arabic support	0.35 (excellent)	0.33	0.90 (poor)
Native chat format	✅ ChatML	Llama-3 template	No standard
Pretraining	18T tokens (massive multilingual)	15T tokens	8T tokens
Instruction tuned	✅ Already	✅ Already	✅ Already
License	Apache 2.0	Llama 3.1 (restrictive)	Apache 2.0

Key Research Insight (Mantra, arxiv:2504.09753): Qwen2.5's built-in multilingual capability is strong enough that SFT-only with cultural data may suffice — no CPT stage needed. This saves enormous compute cost.

Qehwa (Pashto LLM) chose Qwen2.5-7B — the only Pakistan-language LLM to use Qwen. Achieved 85.3% accuracy.

🌐 Supported Languages

Priority	Language	Script	Goal
🔴 High	Urdu	Arabic (Nastaliq)	Native fluency, formal+casual, Roman Urdu
🔴 High	Punjabi (Shahmukhi)	Arabic	Conversational fluency, folk language
🔴 High	Sindhi	Arabic	Native grammar, admin style, education
🔴 High	Pashto	Arabic	Regional dialect robustness
🔴 High	English	Latin	Strong reasoning, technical support
🟡 Medium	Balochi	Arabic	Basic + expanding support
🟡 Medium	Saraiki	Arabic	Conversational, translation
🔵 Secondary	Arabic	Arabic	Cross-translation, Quranic context
🔵 Secondary	Persian	Arabic	Literary translation
🔵 Secondary	Hindi	Devanagari	Understanding only

🎯 Core Capabilities

Capability	Training Data Source
Natural multilingual conversation	Multi-turn dialogues in all languages
High-quality translation	Parallel corpora, translation matrix
Summarization	News articles + summaries
Grammar correction	Error-correction pairs
Writing assistant	Email, CV, complaint letter templates
Reasoning & explanations	Math, science, logic tasks
Educational tutoring	Matric, FSC, CSS, O/A Level content
Government documentation	Pakistan dept forms, processes
Customer support automation	Telecom, banking, e-commerce chats
Code-switch understanding	Roman Urdu, mixed Punjabi-Urdu

📊 Training Method: SFT + QLoRA

Based on Qehwa (Pashto) and Qalb (Urdu) recipes, adapted for multilingual:

Stage 1: QLoRA SFT (Primary)

# From Qehwa + Qalb recipes:
Base: Qwen2.5-7B-Instruct (not base — SFT-only approach from Mantra)
Method: QLoRA (4-bit quantization + LoRA)
LoRA rank: 64 (from Qehwa) or 128 (from Qalb)
LoRA alpha: 128 (2x rank)
Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Learning rate: 2e-4 (10x base for LoRA)
Batch size: 32 (with gradient accumulation)
Max seq length: 4096 (Qwen2.5 supports up to 131K)
Epochs: 2-3

Stage 2: DPO (Optional, for preference tuning)

# After SFT, align with human preferences
Method: DPO with LoRA
Learning rate: 5e-6 (10x DPO base for LoRA)

📂 Repository Structure

zabaanai-v2/
├── scripts/
│   ├── 01_curate_sft_data.py         # Download & format instruction data
│   ├── 02_train_sft_qlora.py         # QLoRA SFT training
│   ├── 03_train_dpo.py               # DPO preference tuning
│   ├── 04_merge_and_export.py        # Merge LoRA + export
│   ├── 05_evaluate.py                # Benchmark evaluation
│   ├── 06_quantize_gguf.py           # GGUF quantization
│   └── 07_deploy.py                  # Deploy to HF
├── configs/
│   ├── sft_config.yaml              # SFT training config
│   ├── dpo_config.yaml              # DPO training config
│   └── data_mixture.yaml            # Data mixture weights
├── data/
│   ├── raw/                          # Downloaded datasets
│   ├── formatted/                    # ChatML-formatted SFT data
│   └── pakistan_special/            # Pakistan-specific knowledge
├── deployment/
│   ├── app.py                        # Gradio chat app
│   ├── api_server.py                 # FastAPI server
│   ├── Dockerfile                    # Docker container
│   └── requirements.txt              # Dependencies
├── docs/
│   ├── 01_TRAINING_RECIPE.md         # Full training recipe
│   ├── 02_DATA_ARCHITECTURE.md       # Data format & mixture
│   ├── 03_EVALUATION_GUIDE.md        # Evaluation methodology
│   └── 04_DEPLOYMENT_GUIDE.md        # Deployment options
└── README.md                         # This file

⚡ Quick Start

Train SFT (QLoRA on A100 40GB)

python scripts/01_curate_sft_data.py
python scripts/02_train_sft_qlora.py --config configs/sft_config.yaml

Evaluate

python scripts/05_evaluate.py --model_path outputs/zabaanai-v2

Deploy

python scripts/06_quantize_gguf.py --model_path outputs/zabaanai-v2
python scripts/07_deploy.py --username your-username

📜 License

Apache 2.0 (Qwen2.5 license)

🙏 Acknowledgments

Qwen Team for Qwen2.5-7B-Instruct
Qehwa (junaid008) for Pashto LLM recipe
Qalb for Urdu LLM recipe
Alif (large-traversaal) for Urdu instruction data

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Paper for shaikhsalman/zabaanai-v2

Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance

Paper • 2504.09753 • Published Apr 13, 2025 • 6