YAML Metadata Warning:empty or missing yaml metadata in repo card

Check out the documentation for more information.

πŸ‡΅πŸ‡° ZabaanAI-v2 β€” Pakistan Multilingual Instruction Model

Next-generation multilingual AI for Pakistan, built on Qwen2.5-7B-Instruct

Python License Model Languages

πŸ—οΈ Architecture: Qwen2.5-7B-Instruct

Why Qwen2.5-7B-Instruct over Llama 3.1 / Mistral:

Factor Qwen2.5-7B βœ… Llama 3.1 8B Mistral 7B
Vocabulary 151K (best for Arabic/Nastaliq) 128K 32K (catastrophic for Urdu)
Urdu tokens/char 0.63 (best) 0.68 1.02 (worse than chars!)
Sindhi tokens/char 0.60 (best) 0.74 N/A
Arabic support 0.35 (excellent) 0.33 0.90 (poor)
Native chat format βœ… ChatML Llama-3 template No standard
Pretraining 18T tokens (massive multilingual) 15T tokens 8T tokens
Instruction tuned βœ… Already βœ… Already βœ… Already
License Apache 2.0 Llama 3.1 (restrictive) Apache 2.0

Key Research Insight (Mantra, arxiv:2504.09753): Qwen2.5's built-in multilingual capability is strong enough that SFT-only with cultural data may suffice β€” no CPT stage needed. This saves enormous compute cost.

Qehwa (Pashto LLM) chose Qwen2.5-7B β€” the only Pakistan-language LLM to use Qwen. Achieved 85.3% accuracy.

🌐 Supported Languages

Priority Language Script Goal
πŸ”΄ High Urdu Arabic (Nastaliq) Native fluency, formal+casual, Roman Urdu
πŸ”΄ High Punjabi (Shahmukhi) Arabic Conversational fluency, folk language
πŸ”΄ High Sindhi Arabic Native grammar, admin style, education
πŸ”΄ High Pashto Arabic Regional dialect robustness
πŸ”΄ High English Latin Strong reasoning, technical support
🟑 Medium Balochi Arabic Basic + expanding support
🟑 Medium Saraiki Arabic Conversational, translation
πŸ”΅ Secondary Arabic Arabic Cross-translation, Quranic context
πŸ”΅ Secondary Persian Arabic Literary translation
πŸ”΅ Secondary Hindi Devanagari Understanding only

🎯 Core Capabilities

Capability Training Data Source
Natural multilingual conversation Multi-turn dialogues in all languages
High-quality translation Parallel corpora, translation matrix
Summarization News articles + summaries
Grammar correction Error-correction pairs
Writing assistant Email, CV, complaint letter templates
Reasoning & explanations Math, science, logic tasks
Educational tutoring Matric, FSC, CSS, O/A Level content
Government documentation Pakistan dept forms, processes
Customer support automation Telecom, banking, e-commerce chats
Code-switch understanding Roman Urdu, mixed Punjabi-Urdu

πŸ“Š Training Method: SFT + QLoRA

Based on Qehwa (Pashto) and Qalb (Urdu) recipes, adapted for multilingual:

Stage 1: QLoRA SFT (Primary)

# From Qehwa + Qalb recipes:
Base: Qwen2.5-7B-Instruct (not base β€” SFT-only approach from Mantra)
Method: QLoRA (4-bit quantization + LoRA)
LoRA rank: 64 (from Qehwa) or 128 (from Qalb)
LoRA alpha: 128 (2x rank)
Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Learning rate: 2e-4 (10x base for LoRA)
Batch size: 32 (with gradient accumulation)
Max seq length: 4096 (Qwen2.5 supports up to 131K)
Epochs: 2-3

Stage 2: DPO (Optional, for preference tuning)

# After SFT, align with human preferences
Method: DPO with LoRA
Learning rate: 5e-6 (10x DPO base for LoRA)

πŸ“‚ Repository Structure

zabaanai-v2/
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ 01_curate_sft_data.py         # Download & format instruction data
β”‚   β”œβ”€β”€ 02_train_sft_qlora.py         # QLoRA SFT training
β”‚   β”œβ”€β”€ 03_train_dpo.py               # DPO preference tuning
β”‚   β”œβ”€β”€ 04_merge_and_export.py        # Merge LoRA + export
β”‚   β”œβ”€β”€ 05_evaluate.py                # Benchmark evaluation
β”‚   β”œβ”€β”€ 06_quantize_gguf.py           # GGUF quantization
β”‚   └── 07_deploy.py                  # Deploy to HF
β”œβ”€β”€ configs/
β”‚   β”œβ”€β”€ sft_config.yaml              # SFT training config
β”‚   β”œβ”€β”€ dpo_config.yaml              # DPO training config
β”‚   └── data_mixture.yaml            # Data mixture weights
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ raw/                          # Downloaded datasets
β”‚   β”œβ”€β”€ formatted/                    # ChatML-formatted SFT data
β”‚   └── pakistan_special/            # Pakistan-specific knowledge
β”œβ”€β”€ deployment/
β”‚   β”œβ”€β”€ app.py                        # Gradio chat app
β”‚   β”œβ”€β”€ api_server.py                 # FastAPI server
β”‚   β”œβ”€β”€ Dockerfile                    # Docker container
β”‚   └── requirements.txt              # Dependencies
β”œβ”€β”€ docs/
β”‚   β”œβ”€β”€ 01_TRAINING_RECIPE.md         # Full training recipe
β”‚   β”œβ”€β”€ 02_DATA_ARCHITECTURE.md       # Data format & mixture
β”‚   β”œβ”€β”€ 03_EVALUATION_GUIDE.md        # Evaluation methodology
β”‚   └── 04_DEPLOYMENT_GUIDE.md        # Deployment options
└── README.md                         # This file

⚑ Quick Start

Train SFT (QLoRA on A100 40GB)

python scripts/01_curate_sft_data.py
python scripts/02_train_sft_qlora.py --config configs/sft_config.yaml

Evaluate

python scripts/05_evaluate.py --model_path outputs/zabaanai-v2

Deploy

python scripts/06_quantize_gguf.py --model_path outputs/zabaanai-v2
python scripts/07_deploy.py --username your-username

πŸ“œ License

Apache 2.0 (Qwen2.5 license)

πŸ™ Acknowledgments

  • Qwen Team for Qwen2.5-7B-Instruct
  • Qehwa (junaid008) for Pashto LLM recipe
  • Qalb for Urdu LLM recipe
  • Alif (large-traversaal) for Urdu instruction data
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Paper for shaikhsalman/zabaanai-v2