Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance
Paper β’ 2504.09753 β’ Published β’ 6
YAML Metadata Warning:empty or missing yaml metadata in repo card
Check out the documentation for more information.
Next-generation multilingual AI for Pakistan, built on Qwen2.5-7B-Instruct
Why Qwen2.5-7B-Instruct over Llama 3.1 / Mistral:
| Factor | Qwen2.5-7B β | Llama 3.1 8B | Mistral 7B |
|---|---|---|---|
| Vocabulary | 151K (best for Arabic/Nastaliq) | 128K | 32K (catastrophic for Urdu) |
| Urdu tokens/char | 0.63 (best) | 0.68 | 1.02 (worse than chars!) |
| Sindhi tokens/char | 0.60 (best) | 0.74 | N/A |
| Arabic support | 0.35 (excellent) | 0.33 | 0.90 (poor) |
| Native chat format | β ChatML | Llama-3 template | No standard |
| Pretraining | 18T tokens (massive multilingual) | 15T tokens | 8T tokens |
| Instruction tuned | β Already | β Already | β Already |
| License | Apache 2.0 | Llama 3.1 (restrictive) | Apache 2.0 |
Key Research Insight (Mantra, arxiv:2504.09753): Qwen2.5's built-in multilingual capability is strong enough that SFT-only with cultural data may suffice β no CPT stage needed. This saves enormous compute cost.
Qehwa (Pashto LLM) chose Qwen2.5-7B β the only Pakistan-language LLM to use Qwen. Achieved 85.3% accuracy.
| Priority | Language | Script | Goal |
|---|---|---|---|
| π΄ High | Urdu | Arabic (Nastaliq) | Native fluency, formal+casual, Roman Urdu |
| π΄ High | Punjabi (Shahmukhi) | Arabic | Conversational fluency, folk language |
| π΄ High | Sindhi | Arabic | Native grammar, admin style, education |
| π΄ High | Pashto | Arabic | Regional dialect robustness |
| π΄ High | English | Latin | Strong reasoning, technical support |
| π‘ Medium | Balochi | Arabic | Basic + expanding support |
| π‘ Medium | Saraiki | Arabic | Conversational, translation |
| π΅ Secondary | Arabic | Arabic | Cross-translation, Quranic context |
| π΅ Secondary | Persian | Arabic | Literary translation |
| π΅ Secondary | Hindi | Devanagari | Understanding only |
| Capability | Training Data Source |
|---|---|
| Natural multilingual conversation | Multi-turn dialogues in all languages |
| High-quality translation | Parallel corpora, translation matrix |
| Summarization | News articles + summaries |
| Grammar correction | Error-correction pairs |
| Writing assistant | Email, CV, complaint letter templates |
| Reasoning & explanations | Math, science, logic tasks |
| Educational tutoring | Matric, FSC, CSS, O/A Level content |
| Government documentation | Pakistan dept forms, processes |
| Customer support automation | Telecom, banking, e-commerce chats |
| Code-switch understanding | Roman Urdu, mixed Punjabi-Urdu |
Based on Qehwa (Pashto) and Qalb (Urdu) recipes, adapted for multilingual:
# From Qehwa + Qalb recipes:
Base: Qwen2.5-7B-Instruct (not base β SFT-only approach from Mantra)
Method: QLoRA (4-bit quantization + LoRA)
LoRA rank: 64 (from Qehwa) or 128 (from Qalb)
LoRA alpha: 128 (2x rank)
Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Learning rate: 2e-4 (10x base for LoRA)
Batch size: 32 (with gradient accumulation)
Max seq length: 4096 (Qwen2.5 supports up to 131K)
Epochs: 2-3
# After SFT, align with human preferences
Method: DPO with LoRA
Learning rate: 5e-6 (10x DPO base for LoRA)
zabaanai-v2/
βββ scripts/
β βββ 01_curate_sft_data.py # Download & format instruction data
β βββ 02_train_sft_qlora.py # QLoRA SFT training
β βββ 03_train_dpo.py # DPO preference tuning
β βββ 04_merge_and_export.py # Merge LoRA + export
β βββ 05_evaluate.py # Benchmark evaluation
β βββ 06_quantize_gguf.py # GGUF quantization
β βββ 07_deploy.py # Deploy to HF
βββ configs/
β βββ sft_config.yaml # SFT training config
β βββ dpo_config.yaml # DPO training config
β βββ data_mixture.yaml # Data mixture weights
βββ data/
β βββ raw/ # Downloaded datasets
β βββ formatted/ # ChatML-formatted SFT data
β βββ pakistan_special/ # Pakistan-specific knowledge
βββ deployment/
β βββ app.py # Gradio chat app
β βββ api_server.py # FastAPI server
β βββ Dockerfile # Docker container
β βββ requirements.txt # Dependencies
βββ docs/
β βββ 01_TRAINING_RECIPE.md # Full training recipe
β βββ 02_DATA_ARCHITECTURE.md # Data format & mixture
β βββ 03_EVALUATION_GUIDE.md # Evaluation methodology
β βββ 04_DEPLOYMENT_GUIDE.md # Deployment options
βββ README.md # This file
python scripts/01_curate_sft_data.py
python scripts/02_train_sft_qlora.py --config configs/sft_config.yaml
python scripts/05_evaluate.py --model_path outputs/zabaanai-v2
python scripts/06_quantize_gguf.py --model_path outputs/zabaanai-v2
python scripts/07_deploy.py --username your-username
Apache 2.0 (Qwen2.5 license)