# Sindhi-Triple-Pillar-BERT
Sindhi-Triple-Pillar-BERT is a multi-task learning (MTL) model designed for the Sindhi language. It simultaneously performs Sentiment Analysis, Named Entity Recognition (NER), and Part-of-Speech (POS) Tagging.
This model is the result of an MSCS thesis project supervised by Dr. Tafseer Ahmed, focusing on low-resource NLP for the Sindhi language.
## Model Details
- Base Model: Specialized `distilbert-base-multilingual-cased`.
- Language: Sindhi.
- Tasks: Sentiment Analysis, NER, POS Tagging.
- Training Data: 2.2M Sindhi rows (PCT) + specialized MTL datasets.
- Tokenizer: Custom vocabulary extended with 20,000 Sindhi tokens to reduce subword fragmentation.
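To illustrate why extending the vocabulary reduces fragmentation, here is a toy greedy longest-match tokenizer (a simplification of the WordPiece algorithm the real tokenizer uses). The vocabularies below are invented for the demonstration; they are not the model's actual vocabulary.

```python
# Toy illustration: a greedy longest-match tokenizer splits a word into the
# longest substrings found in its vocabulary. With only character-level
# coverage (typical for Sindhi in a multilingual vocab), words fragment;
# adding whole Sindhi tokens collapses them to a single piece.

def greedy_tokenize(word, vocab):
    """Split `word` into the longest substrings present in `vocab`."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest match first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])          # unknown character: emit as-is
            i += 1
    return pieces

word = "سنڌي"  # "Sindhi"

base_vocab = {"س", "ن", "ڌ", "ي"}        # characters only: 4 fragments
extended_vocab = base_vocab | {"سنڌي"}    # whole word added: 1 token

print(greedy_tokenize(word, base_vocab))
print(greedy_tokenize(word, extended_vocab))
```

In practice the same effect is achieved with `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))` in `transformers`, though the exact extension procedure used for this model is not published.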
## Training & Methodology
The model underwent a two-phase training process:
### Phase 1: Specialized Pre-Training (PCT)
- Dataset: `arnizamani/Sindhi-texts-big-dataset` (2.2M rows).
- Duration: 8–10 hours of training across 45,000 steps.
- Performance: Final MLM loss of 2.1–2.3.
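Phase 1 is masked language modeling (MLM). The sketch below shows the standard BERT-style dynamic-masking recipe (15% of tokens, with the 80/10/10 replacement rule); the exact masking hyperparameters used for this model are not published, and the `MASK_ID`/`VOCAB_SIZE` values are placeholders.

```python
import random

# Sketch of BERT-style dynamic masking for MLM pre-training.
MASK_ID, VOCAB_SIZE = 103, 119547  # assumed ids, for illustration only

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    rng = rng or random.Random(0)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 positions are ignored by the loss
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok                          # predict the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = MASK_ID                  # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels

inputs, labels = mask_tokens([5, 17, 42, 8, 99, 23, 61, 7], rng=random.Random(42))
```

The loss is computed only at the masked positions, which is why unmasked positions carry the ignore label `-100`.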
### Phase 2: Multi-Task Learning (MTL)
- Optimization: Automatic Mixed Precision (AMP) with a batch size of 48 for high-speed convergence.
- Balancing: The POS dataset was oversampled 65x to prevent gradient starvation in this low-resource setting.
- Validation: Evaluated using an 80/20 train-test split.
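The two data-handling steps above can be sketched with plain lists (the actual pipeline details are not published; this only shows the idea). Oversampling is applied to the training split only, so the held-out 20% stays untouched.

```python
import random

def train_test_split(examples, test_frac=0.2, seed=0):
    """Shuffle and hold out `test_frac` of the examples for evaluation."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def oversample(examples, factor):
    # Repeat the dataset so each epoch sees `factor` copies of every example,
    # keeping the small task's gradient contribution comparable to the others.
    return examples * factor

pos_data = [f"pos_sentence_{i}" for i in range(100)]
train, test = train_test_split(pos_data)   # 80 train / 20 test
train_balanced = oversample(train, 65)     # 80 * 65 = 5200 training examples
```

Oversampling before splitting would leak duplicated training examples into the test set, which is why the split happens first.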
## Evaluation Results (Unseen Data)
On a strictly external test set, the model achieved the following results:
| Task | F1-Score | Support |
|---|---|---|
| Sentiment Analysis | 1.00 | 3,154 |
| POS Tagging (Weighted) | 0.96 | 2,087 |
| NER (Macro Avg) | 0.69 | 317,371 |
Research Insight: The model shows exceptional performance in identifying Geopolitical Entities (GPE F1: 0.85) and Person Names (Person F1: 0.81), making it highly effective for regional information extraction.
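Note that the table mixes two averaging schemes: weighted F1 (POS) scales each class by its support, while macro F1 (NER) weights every class equally, so rare entity classes pull the NER score down. The numbers below are illustrative, not the model's actual per-class results.

```python
# How the two averages in the table combine per-class F1 scores.

def macro_f1(per_class):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1 for f1, _ in per_class) / len(per_class)

def weighted_f1(per_class):
    """Support-weighted mean of per-class F1: big classes dominate."""
    total = sum(sup for _, sup in per_class)
    return sum(f1 * sup for f1, sup in per_class) / total

# (f1, support) for three hypothetical classes, e.g. O, GPE, PERSON
per_class = [(0.98, 900), (0.85, 60), (0.81, 40)]

print(round(macro_f1(per_class), 2))     # 0.88
print(round(weighted_f1(per_class), 2))  # 0.97
```

This is why a model can post a high weighted score while its macro score stays modest, as with the NER head here.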
## How to Use
```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, DistilBertModel
from safetensors.torch import load_model
from huggingface_hub import hf_hub_download

# 1. Define the Multi-Task Architecture
# This must match the 'Triple Pillar' configuration used during training
class SindhiMultiTaskModel(nn.Module):
    def __init__(self, n_ner=23, n_pos=18, n_sent=2):
        super().__init__()
        # Loads the PCT-specialized encoder foundation
        self.encoder = DistilBertModel.from_pretrained("Kashif786/sindhi_distilbert_colab")
        self.ner_head = nn.Linear(768, n_ner)
        self.pos_head = nn.Linear(768, n_pos)
        self.sent_head = nn.Linear(768, n_sent)

    def forward(self, ids, mask, task='ner'):
        out = self.encoder(ids, attention_mask=mask)[0]
        if task == 'sentiment':
            return self.sent_head(out[:, 0])  # [CLS] position: global context for sentiment
        elif task == 'ner':
            return self.ner_head(out)         # Token-level context for NER
        elif task == 'pos':
            return self.pos_head(out)         # Token-level context for POS
        raise ValueError(f"Unknown task: {task}")

# 2. Set up device and tokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "Kashif786/Sindhi-Triple-Pillar-BERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 3. Load model weights via safetensors
model = SindhiMultiTaskModel().to(device)
weights_path = hf_hub_download(repo_id=model_id, filename="model.safetensors")
load_model(model, weights_path)
model.eval()

# 4. Inference example
# "Dr. Tafseer Ahmed is working on NLP for the Sindhi language."
text = "ڊاڪٽر تفسير احمد سنڌي ٻوليءَ جي اين ايل پي تي ڪم ڪري رهيو آهي."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)

with torch.no_grad():
    # Example: predict named entities (NER)
    ner_logits = model(inputs['input_ids'], inputs['attention_mask'], task='ner')
    predictions = torch.argmax(ner_logits, dim=2)

print("Model successfully loaded and ready for Sindhi NLP tasks!")
```
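To turn per-token predictions into entity spans, the label ids must be mapped to tag names and grouped. The sketch below assumes a BIO tag scheme (plausible for a 23-class NER head, but the authoritative label map ships with the model's config); the `tokens`/`tags` example is invented for illustration.

```python
# Grouping (token, BIO-tag) pairs into (entity_type, text) spans.
# The tag names here are hypothetical; read the real mapping from the
# model's config (id2label) before decoding actual predictions.

def decode_bio(tokens, tags):
    """Collect B-/I- runs of the same entity type into spans."""
    spans, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and ctype == tag[2:]:
            current.append(tok)
        else:                      # "O" or an inconsistent I- tag closes the span
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        spans.append((ctype, " ".join(current)))
    return spans

tokens = ["ڊاڪٽر", "تفسير", "احمد", "سنڌ", "۾", "رهي", "ٿو"]
tags   = ["O", "B-PERSON", "I-PERSON", "B-GPE", "O", "O", "O"]
print(decode_bio(tokens, tags))  # [('PERSON', 'تفسير احمد'), ('GPE', 'سنڌ')]
```

For real model output, also drop positions where the attention mask is 0 and where the tokenizer emitted special tokens before decoding.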
## ⚠️ Limitations
The model currently exhibits a slight bias toward "Negative" sentiment labels when processing general conversational text, as the sentiment head was fine-tuned primarily on e-commerce reviews (`eproduct.csv`).