Phi3-UrduInstruct

A fine-tuned version of Microsoft's Phi-3-mini-4k-instruct for Urdu language instruction following.

Model Description

Phi3-UrduInstruct is fine-tuned on a custom Urdu instruction dataset of 578 manually curated and verified examples. The model is designed to follow instructions in Urdu across multiple NLP tasks.

This work addresses the lack of instruction-tuned language models for Urdu, a low-resource language spoken by over 230 million people worldwide.

Training Data

A custom dataset of 578 Urdu instruction-response pairs was created for this project, covering 6 task categories:

| Category | Examples |
|---|---|
| Translation (Urdu → English) | 105 |
| Grammar Correction | 100 |
| Question Answering | 100 |
| Text Summarization | 107 |
| Text Completion | 91 |
| Formal/Informal Conversion | 75 |
| Total | 578 |

All examples were manually written and verified by a native Urdu speaker to ensure linguistic quality and cultural accuracy.
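
As a quick sanity check on the category breakdown, the short sketch below recomputes the total and each category's share of the dataset. The counts are copied from the table above; the variable and function names are illustrative only and are not part of any released dataset tooling.

```python
# Per-category example counts, copied from the table above.
CATEGORY_COUNTS = {
    "Translation (Urdu → English)": 105,
    "Grammar Correction": 100,
    "Question Answering": 100,
    "Text Summarization": 107,
    "Text Completion": 91,
    "Formal/Informal Conversion": 75,
}


def category_shares(counts):
    """Return each category's share of the dataset as a percentage (1 decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total = sum(CATEGORY_COUNTS.values())
shares = category_shares(CATEGORY_COUNTS)

print(f"total examples: {total}")
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The shares come out between roughly 13% and 18.5%, so no single task dominates the training mix.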

Training Details

| Parameter | Value |
|---|---|
| Base Model | Phi-3-mini-4k-instruct (4-bit) |
| Fine-tuning Method | LoRA (r=16, alpha=16) |
| Training Epochs | 3 |
| Learning Rate | 2e-4 |
| Training Loss | 1.37 → 0.47 |
| Framework | Unsloth + TRL |
| Hardware | Google Colab T4 GPU |
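
To give a feel for what r=16 and alpha=16 mean, here is a minimal sketch based on the standard LoRA formulation, in which the low-rank update is scaled by alpha / r and each adapted weight matrix gains r × (d_in + d_out) trainable parameters. This card does not state which modules were adapted, so the 3072 × 3072 shape below is an illustrative assumption, not a description of the actual training setup.

```python
def lora_scaling(r: int, alpha: int) -> float:
    """Scaling applied to the low-rank update in standard LoRA: alpha / r."""
    return alpha / r


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix.

    LoRA learns A with shape (r, d_in) and B with shape (d_out, r),
    so it adds r * (d_in + d_out) parameters.
    """
    return r * (d_in + d_out)


R, ALPHA = 16, 16      # values from the table above
D_IN = D_OUT = 3072    # ASSUMPTION: illustrative square projection shape

scaling = lora_scaling(R, ALPHA)
added = lora_param_count(D_IN, D_OUT, R)
full = D_IN * D_OUT

print(f"scaling factor: {scaling}")
print(f"LoRA parameters for this matrix: {added}")
print(f"share of the full matrix: {100 * added / full:.2f}%")
```

With alpha equal to r, the scaling factor is 1, so the adapter update is applied without extra amplification or damping.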

Usage

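from unsloth import FastLanguageModel

# Load the fine-tuned model and its tokenizer in 4-bit precision.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Almanships/Phi3-UrduInstruct",
    max_seq_length = 2048,
    dtype = None,  # None lets Unsloth auto-detect a suitable dtype
    load_in_4bit = True,
)

# Switch the model to Unsloth's inference mode.
FastLanguageModel.for_inference(model)

# Single-turn prompt: "Translate this sentence into English",
# followed by the Urdu sentence on the next line.
messages = [
    {"role": "user", "content":
     "اس جملے کا انگریزی میں ترجمہ کریں\nپاکستان ایک خوبصورت ملک ہے"}
]

# Apply the chat template and move the token IDs to the GPU.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

# Generate up to 128 new tokens. The decoded text contains the prompt
# followed by the model's reply.
outputs = model.generate(input_ids=inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))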
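
The prompt above places the instruction and the input text in one user message, separated by a newline. A small helper like the following sketch can build such messages for other tasks. The function name `build_messages` is hypothetical, and it assumes the newline-separated layout shown in the single example above also suits the other task categories.

```python
def build_messages(instruction: str, text: str) -> list[dict]:
    """Build a single-turn chat in the layout used by the usage example.

    The instruction and the input text are joined by one newline and sent
    as a single user message. (Assumed layout, based on the example above.)
    """
    content = f"{instruction.strip()}\n{text.strip()}"
    return [{"role": "user", "content": content}]


# Same translation request as in the usage example.
messages = build_messages(
    "اس جملے کا انگریزی میں ترجمہ کریں",
    "پاکستان ایک خوبصورت ملک ہے",
)
print(messages[0]["content"])
```

The returned list can be passed directly to `tokenizer.apply_chat_template` in place of the hand-written `messages` list.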

Example Outputs

Translation:

  • Input: پاکستان ایک خوبصورت ملک ہے
  • Output: Pakistan is a beautiful country

Grammar Correction:

  • Input: وہ گیا بازار آج
  • Output: وہ آج بازار گیا

Question Answering:

  • Input: پاکستان کا دارالحکومت کون سا ہے؟
  • Output: پاکستان کا دارالحکومت اسلام آباد ہے

Limitations

  • Trained on only 578 examples; a larger dataset would likely improve performance
  • Evaluation is currently qualitative; formal benchmarks are pending
  • Performs best on the 6 task categories it was trained on
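
As one possible starting point for a more formal evaluation, the sketch below computes an exact-match rate after Unicode and whitespace normalization. It is an illustration only, not the evaluation used for this model: the function names are hypothetical, the reference strings are taken from the Example Outputs section, and the predictions are made up to show one hit and one miss.

```python
import unicodedata


def normalize(text: str) -> str:
    """NFC-normalize and collapse whitespace so equivalent strings compare equal."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def exact_match_rate(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# References from the Example Outputs section.
references = ["Pakistan is a beautiful country", "وہ آج بازار گیا"]
# HYPOTHETICAL model outputs: the first differs only in spacing,
# the second leaves the word order uncorrected.
predictions = ["Pakistan is a  beautiful country ", "وہ گیا بازار آج"]

score = exact_match_rate(predictions, references)
print(f"exact match: {score:.2f}")
```

Exact match is strict and fits tasks with a single correct answer, such as grammar correction. Open-ended tasks such as summarization would need a softer metric or human judgment.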

Future Work

  • Expand the dataset to 2,000+ examples
  • Add formal evaluation benchmarks for Urdu NLP
  • Extend instruction tuning to Punjabi
  • Compare against other multilingual models on Urdu tasks

Citation

If you use this model, please cite:

@misc{phi3-urduinstruct-2026,
  author = {Almanships},
  title = {Phi3-UrduInstruct: Instruction Tuning of 
           Phi-3 for Urdu Language},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Almanships/Phi3-UrduInstruct}
}