Sindhi Sentiment Analysis Model

A text classification model that detects positive, negative, and neutral sentiment in Sindhi language text. This is one of the first publicly available sentiment analysis models for the Sindhi language on Hugging Face.

Model Description

This model was trained on a custom Sindhi sentiment dataset collected from Sindhi newspaper corpora and expanded with additional labeled sentences. It classifies Sindhi text into three sentiment categories:

✅ Positive
❌ Negative
😐 Neutral

Architecture: Dual TF-IDF (character n-grams 2–6 + word n-grams 1–2, 50,000 combined features) → LinearSVC, calibrated for probability outputs via CalibratedClassifierCV. This is a classical ML pipeline (scikit-learn), not a transformer model.
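A minimal sketch of how a pipeline of this shape could be assembled in scikit-learn. It is not the author's training script: the toy English placeholder sentences, the `analyzer="char"` setting, the 25,000/25,000 split of the feature budget, and `cv=2` are all assumptions made for illustration, since the card only states the n-gram ranges and the 50,000 combined total.

```python
from scipy.sparse import hstack
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data (English placeholders): 0=Negative, 1=Neutral, 2=Positive.
texts = [
    "bad awful poor", "terrible bad sad", "awful sad poor", "poor terrible bad",
    "report issued today", "meeting held monday", "report held today", "meeting issued monday",
    "good great nice", "excellent good happy", "great happy nice", "nice excellent good",
]
labels = [0] * 4 + [1] * 4 + [2] * 4

# Two TF-IDF views of the same text: character n-grams (2-6) and word n-grams (1-2).
# The per-vectorizer feature caps are an assumption; only the combined total is documented.
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 6), max_features=25_000)
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), max_features=25_000)

# Stack both feature matrices side by side into one sparse matrix.
X = hstack([char_vec.fit_transform(texts), word_vec.fit_transform(texts)]).tocsr()

# LinearSVC has no predict_proba; CalibratedClassifierCV wraps it to provide one.
clf = CalibratedClassifierCV(LinearSVC(), cv=2)
clf.fit(X, labels)

sample = ["good nice happy"]
X_new = hstack([char_vec.transform(sample), word_vec.transform(sample)]).tocsr()
proba = clf.predict_proba(X_new)[0]
print(dict(zip(clf.classes_.tolist(), proba.round(3).tolist())))
```

The character n-grams help with spelling variation and rich morphology, while the word n-grams capture short phrases. Stacking them lets a single linear classifier weigh both.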

Model Details

Property	Details
Language	Sindhi (`sd`)
Script	Arabic (Naskh)
Task	Sentiment Analysis / Text Classification
Labels	Positive, Negative, Neutral
Architecture	Dual TF-IDF + LinearSVC (scikit-learn)
Test Accuracy	91.7%
Macro F1	0.918
License	MIT
Developer	Ali Nawaz
Institution	Shaikh Ayaz University

Training Data

Trained on the Sindhi Sentiment Analysis Dataset: 4,420 Sindhi sentences (expanded from an original 1,898-sentence release), labeled Positive, Negative, or Neutral and roughly balanced across the three classes (about 1,400 to 1,500 each).

Class	Count
Positive	1,501
Negative	1,500
Neutral	1,419
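The class shares can be checked directly from the counts in the table. The snippet below is only arithmetic on those published numbers.

```python
# Class counts as listed in the table above.
counts = {"Positive": 1501, "Negative": 1500, "Neutral": 1419}
total = sum(counts.values())

print(total)  # 4420
for label, n in counts.items():
    # Share of each class in the full dataset.
    print(f"{label}: {n / total:.1%}")
# Positive: 34.0%
# Negative: 33.9%
# Neutral: 32.1%
```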

How to Use

This model is a scikit-learn pipeline saved with joblib — it is not compatible with the transformers library's AutoModel/pipeline() API. Load it directly with joblib instead:

import joblib
from scipy.sparse import hstack
from huggingface_hub import hf_hub_download

# Download the model bundle
model_path = hf_hub_download(
    repo_id="alinawazmahar/sindhi-sentiment",
    filename="sentiment_model.joblib"
)
bundle = joblib.load(model_path)

clf = bundle["clf"]
char_vec = bundle["char_vec"]
word_vec = bundle["word_vec"]
label_map = bundle["label_map_inv"]

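def predict(text):
    # Stack character and word TF-IDF features side by side, in the order used above.
    X = hstack([char_vec.transform([text]), word_vec.transform([text])])
    pred = int(clf.predict(X)[0])
    proba = clf.predict_proba(X)[0]
    # predict_proba columns follow clf.classes_, so map each class id to its label name.
    scores = {label_map[int(c)]: float(p) for c, p in zip(clf.classes_, proba)}
    return label_map[pred], scores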

label, scores = predict("هي ڪتاب تمام سٺو آهي")
print(label, scores)
# Positive {'Negative': 0.003, 'Neutral': 0.356, 'Positive': 0.641}

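Required packages: scikit-learn==1.7.2, joblib, scipy, huggingface_hub. The scikit-learn version is pinned because estimators serialized with joblib may warn or fail to load under a different scikit-learn version.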

all_models_ensemble.joblib is also provided, containing all three tuned classifiers (Logistic Regression, LinearSVC, Complement Naive Bayes) for ensemble/majority-vote use cases.
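A minimal sketch of majority voting over three classifiers' outputs. The internal layout of all_models_ensemble.joblib is not described here, so the sketch deliberately works on plain lists of predicted labels. The `majority_vote` helper, its tie-breaking policy, and the per-model outputs are hypothetical.

```python
from collections import Counter

def majority_vote(predictions, tie_breaker=None):
    """Return the label predicted by the most models.

    predictions: list of labels, one per model.
    tie_breaker: label returned when the top two labels have equal votes
                 (hypothetical policy); defaults to the first model's prediction.
    """
    counts = Counter(predictions).most_common()
    # A tie exists when the two most common labels have the same vote count.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_breaker if tie_breaker is not None else predictions[0]
    return counts[0][0]

# Hypothetical per-model outputs for a single sentence.
votes = {"logreg": "Positive", "linear_svc": "Positive", "complement_nb": "Neutral"}
print(majority_vote(list(votes.values())))  # Positive
```

With three models and three classes, a three-way split is possible, so some explicit tie-breaking rule is needed. Falling back to one designated model is only one option.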

Live Demo

Try the model interactively on the Hugging Face Space: 👉 alinawazmahar/sindhi-sentiment (Space)

Intended Use

Sentiment analysis of Sindhi news articles
Social media monitoring in Sindhi
NLP research on low-resource South Asian languages
Educational and academic research

Evaluation Notes

Test accuracy is 91.7%, evaluated on a held-out, fully human-labeled stratified split. An earlier version of this model (trained on 1,909 sentences, partly pseudo-labeled from news sources at confidence ≥0.70) reported ~94.8% accuracy — but that evaluation set was implicitly filtered toward higher-confidence, easier examples. This release removes that filter, so the 91.7% figure reflects a fairer, more representative evaluation rather than a regression. Per-class F1 is balanced (Negative: 0.918, Neutral: 0.932, Positive: 0.902), with Neutral — typically the hardest class — performing best.
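Macro F1 is the unweighted mean of the per-class F1 scores, so it can be approximated from the rounded figures above. This is arithmetic on the published numbers only.

```python
# Per-class F1 scores as reported above (already rounded to three decimals).
per_class_f1 = {"Negative": 0.918, "Neutral": 0.932, "Positive": 0.902}

# Macro F1: every class counts equally, regardless of how many examples it has.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 3))  # 0.917
```

The result differs from the reported 0.918 in the last digit. That is consistent with the reported macro score having been computed from unrounded per-class values, though the card does not say so.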

Limitations

Classical ML (TF-IDF + linear classifier), not a transformer — fast and interpretable, but without deep contextual/semantic understanding
Trained on newspaper-style and generated text; may perform differently on informal or social media Sindhi
Roman Sindhi (Latin script) is not supported — Arabic script only
No handling of sarcasm or implicit sentiment
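Since only Arabic-script input is supported, a caller may want to screen text before prediction. The following is a minimal sketch of such a guard, not part of the released model: the function names, the Unicode ranges chosen, and the 0.8 threshold are all assumptions.

```python
def arabic_script_ratio(text):
    """Fraction of alphabetic characters that fall in the main Arabic Unicode blocks."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(
        1 for ch in letters
        # Arabic (U+0600-U+06FF) and Arabic Supplement (U+0750-U+077F).
        if "\u0600" <= ch <= "\u06FF" or "\u0750" <= ch <= "\u077F"
    )
    return arabic / len(letters)

def looks_supported(text, min_ratio=0.8):
    """Heuristic check; min_ratio is a hypothetical threshold."""
    return arabic_script_ratio(text) >= min_ratio

print(looks_supported("هي ڪتاب تمام سٺو آهي"))    # True
print(looks_supported("This book is very good"))  # False (Latin script)
```

This only checks the script, not the language, so Urdu or Arabic text would also pass.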

Citation

If you use this model or dataset in your research, please cite:

@misc{alinawaz2025sindhi,
  author = {Ali Nawaz},
  title  = {Sindhi Sentiment Analysis Model},
  year   = {2025},
  publisher = {Hugging Face},
  url    = {https://huggingface.co/alinawazmahar/sindhi-sentiment},
  institution = {Shaikh Ayaz University}
}

Acknowledgements

Dataset collected from Sindhi newspaper corpora and expanded with additional labeled sentences. Developed as part of NLP research at Shaikh Ayaz University.
