MT5 Sindhi Question Answering — SdQuAD

Model Description

To the author's knowledge, this is the first publicly available Sindhi question answering model. It is fine-tuned on SdQuAD, the only Sindhi QA dataset known to exist at the time of writing.

Sindhi is a low-resource South Asian language spoken by more than 30 million people, primarily in Sindh, Pakistan. This model addresses a critical gap in NLP resources for the language.

Developed by: Ali Nawaz
University: Shaikh Ayaz University Shikarpur, Pakistan
Base model: google/mt5-base
Language: Sindhi (سنڌي) — Perso-Arabic script
Task: Question Answering (Generative)


How to Use

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

tokenizer = AutoTokenizer.from_pretrained('alinawazmahar/mt5-sindhi-qa-sdquad')
model = AutoModelForSeq2SeqLM.from_pretrained('alinawazmahar/mt5-sindhi-qa-sdquad')
model.eval()

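def ask_sindhi(question):
    """Generate a Sindhi answer for a Sindhi question (no context passage is used)."""
    # Prefix the question with 'سنڌي سوال: ' ("Sindhi question: ").
    input_text = f'سنڌي سوال: {question}'
    # Tokenize, truncating inputs longer than 128 tokens.
    inputs = tokenizer(
        input_text,
        return_tensors='pt',
        max_length=128,
        truncation=True
    )
    # Decode with beam search (4 beams); gradients are not needed at inference.
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=64,
            num_beams=4,
            early_stopping=True
        )
    # Convert the top-ranked sequence back to text, dropping special tokens.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example: "What does inertia mean?"
print(ask_sindhi('انرشيا جو مطلب ڇا آهي؟'))
# Output: جسم جي حرڪت يا سڪون کي جاري رکڻ جي صلاحيت.
# (English: "The ability of a body to maintain its motion or rest.")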

Training Details

| Parameter | Value |
|---|---|
| Base model | google/mt5-base |
| Dataset | Aliwj/SdQuAD |
| Train samples | 9,596 |
| Validation samples | 1,199 |
| Test samples | 1,200 |
| Epochs | 10 |
| Batch size | 16 (effective) |
| Learning rate | 5e-4 |
| Optimizer | Adafactor |
| Hardware | Kaggle T4 GPU |
| Training time | ~10 hours |
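
The figures above imply a rough training length, which the following sketch works out. It is not the author's training script. The split of the effective batch size into a per-device batch of 4 and 4 gradient-accumulation steps is a hypothetical choice, since the card only reports the effective value of 16. The keys of `training_kwargs` follow the argument names of `transformers.Seq2SeqTrainingArguments`, but the dictionary itself is only an illustration.

```python
import math

# Values taken from the Training Details table.
TRAIN_SAMPLES = 9_596
EPOCHS = 10
EFFECTIVE_BATCH_SIZE = 16
LEARNING_RATE = 5e-4

# ASSUMPTION: the card does not say how the effective batch of 16 was reached.
# A per-device batch of 4 with 4 gradient-accumulation steps is one possible split.
PER_DEVICE_BATCH_SIZE = 4
GRAD_ACCUM_STEPS = 4


def optimizer_steps(samples: int, effective_batch: int, epochs: int) -> tuple[int, int]:
    """Return (steps per epoch, total steps), counting a final partial batch as one step."""
    per_epoch = math.ceil(samples / effective_batch)
    return per_epoch, per_epoch * epochs


# Keys mirror argument names of transformers.Seq2SeqTrainingArguments (illustrative only).
training_kwargs = {
    "num_train_epochs": EPOCHS,
    "per_device_train_batch_size": PER_DEVICE_BATCH_SIZE,
    "gradient_accumulation_steps": GRAD_ACCUM_STEPS,
    "learning_rate": LEARNING_RATE,
    "optim": "adafactor",
}

# The assumed split must reproduce the reported effective batch size.
assert PER_DEVICE_BATCH_SIZE * GRAD_ACCUM_STEPS == EFFECTIVE_BATCH_SIZE

steps_per_epoch, total_steps = optimizer_steps(TRAIN_SAMPLES, EFFECTIVE_BATCH_SIZE, EPOCHS)
print(steps_per_epoch, total_steps)  # 600 6000
```

Under these assumptions the run is about 600 optimizer steps per epoch and about 6,000 in total. The exact count depends on how the trainer handles the final partial batch.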

Evaluation Results

Evaluated on SdQuAD test set (1,200 samples):

| Metric | Score |
|---|---|
| F1 | 50.06 |
| Exact Match | 22.08 |
| ROUGE-1 | 8.18 |
| ROUGE-L | 8.18 |
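
The card does not say which script produced the F1 and Exact Match scores. The sketch below shows the usual SQuAD-style definitions, assuming whitespace tokenization and simple punctuation stripping. Both are assumptions, and the function names are illustrative rather than taken from the author's evaluation code.

```python
from collections import Counter

# ASSUMPTION: punctuation to ignore, including common Arabic-script marks.
PUNCTUATION = set('.,!?;:"\'()[]{}،؟؛۔')


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, and split on whitespace."""
    cleaned = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    return cleaned.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def corpus_scores(predictions: list[str], references: list[str]) -> dict[str, float]:
    """Average per-example scores and scale to 0-100, the scale used in the table."""
    n = len(predictions)
    pairs = list(zip(predictions, references))
    return {
        "exact_match": 100 * sum(exact_match(p, r) for p, r in pairs) / n,
        "f1": 100 * sum(token_f1(p, r) for p, r in pairs) / n,
    }


# Toy example with made-up strings (not SdQuAD data).
scores = corpus_scores(["Karachi", "the city of Karachi"], ["Karachi", "Karachi"])
print({name: round(value, 2) for name, value in scores.items()})
# {'exact_match': 50.0, 'f1': 70.0}
```

A generated full-sentence answer that contains the reference span earns partial F1 but no Exact Match credit, which is one way a generative model can show a large gap between the two scores.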

Sample Predictions

| Question | Predicted Answer | Correct? |
|---|---|---|
| انرشيا جو مطلب ڇا آهي؟ | جسم جي حرڪت يا سڪون کي جاري رکڻ جي صلاحيت. | |
| پاڪستان جو وڏو شهر ڪهڙو آهي؟ | پاڪستان جو وڏو شهر ڪراچي آهي. | |
| سيل جي ميمبرين ڪهڙن ٻن مکيه ماليڪيولن مان ٺهيل هوندي آهي؟ | سيل جي ميمبرين پروٽين ۽ پروٽين مان ٺهيل هوندي آهي. | ⚠️ Partial |

Limitations

  • This is a generative QA model: it generates answers without reading a context paragraph, so it relies on knowledge learned during training rather than extracting answers from provided text.
  • It may hallucinate answers to questions that are not well represented in the training data.
  • Its F1 (50.06) is lower than the extractive QA baseline reported in the SdQuAD paper (F1: 81.47), reflecting the harder generative task.
  • Context-aware extractive QA with improved F1 is planned for a later version (see Roadmap).

Roadmap

  • v1.0 — Generative QA baseline (F1: 50.06)
  • v2.0 — Improved hyperparameters (target F1: 60+)
  • v3.0 — Context-aware extractive QA (target F1: 80+)
  • Gradio demo on HuggingFace Spaces

Citation

If you use this model in your research, please cite:

@misc{nawaz2026sindhiqa,
  author = {Ali Nawaz},
  title = {MT5 Sindhi Question Answering Model},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/alinawazmahar/mt5-sindhi-qa-sdquad}
}

Also cite the SdQuAD dataset:

@inproceedings{ali2026sdquad,
  title = {SdQuAD: A Large Benchmark Question Answering Dataset for Low-resource Sindhi Language},
  author = {Ali, Wazir and others},
  booktitle = {RESOURCEFUL-2026, LREC},
  year = {2026}
}

Contact

Ali Nawaz
Shaikh Ayaz University Shikarpur, Pakistan
LinkedIn: Ali Nawaz

This model is part of ongoing research in NLP for Sindhi, a severely under-resourced language that deserves more attention from the global NLP community.
