# Sindhi-Triple-Pillar-BERT
Sindhi-Triple-Pillar-BERT is a multi-task learning (MTL) model designed for the Sindhi language. It simultaneously performs Sentiment Analysis, Named Entity Recognition (NER), and Part-of-Speech (POS) Tagging.
This model is the result of an MSCS thesis project supervised by Dr. Tafseer Ahmed, focusing on low-resource NLP for the Sindhi language.
## Model Details
- Base Model: Specialized `distilbert-base-multilingual-cased`.
- Language: Sindhi.
- Tasks: Sentiment Analysis, NER, POS Tagging.
- Training Data: 2.2M Sindhi rows (PCT) + specialized MTL datasets.
- Tokenizer: Custom vocabulary extended with 20,000 Sindhi tokens to reduce subword fragmentation.
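To illustrate why extending the vocabulary reduces fragmentation, here is a toy greedy longest-match tokenizer (a simplification of the WordPiece algorithm the real tokenizer uses). The vocabularies below are invented for the demonstration; they are not the model's actual vocabulary.

```python
# Toy illustration: a greedy longest-match tokenizer splits a word into the
# longest substrings found in its vocabulary. With only character-level
# coverage (typical for Sindhi in a multilingual vocab), words fragment;
# adding whole Sindhi tokens collapses them to a single piece.

def greedy_tokenize(word, vocab):
    """Split `word` into the longest substrings present in `vocab`."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest match first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])          # unknown character: emit as-is
            i += 1
    return pieces

word = "سنڌي"  # "Sindhi"

base_vocab = {"س", "ن", "ڌ", "ي"}        # characters only: 4 fragments
extended_vocab = base_vocab | {"سنڌي"}    # whole word added: 1 token

print(greedy_tokenize(word, base_vocab))
print(greedy_tokenize(word, extended_vocab))
```

In practice the same effect is achieved with `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))` in `transformers`, though the exact extension procedure used for this model is not published.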
## Training & Methodology
The model underwent a two-phase training process:
### Phase 1: Specialized Pre-Training (PCT)
- Dataset: `arnizamani/Sindhi-texts-big-dataset` (2.2M rows).
- Duration: 8–10 hours of training across 45,000 steps.
- Performance: Final MLM loss of 2.1–2.3.
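Phase 1 is masked language modeling (MLM). The sketch below shows the standard BERT-style dynamic-masking recipe (15% of tokens, with the 80/10/10 replacement rule); the exact masking hyperparameters used for this model are not published, and the `MASK_ID`/`VOCAB_SIZE` values are placeholders.

```python
import random

# Sketch of BERT-style dynamic masking for MLM pre-training.
MASK_ID, VOCAB_SIZE = 103, 119547  # assumed ids, for illustration only

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    rng = rng or random.Random(0)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 positions are ignored by the loss
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok                          # predict the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = MASK_ID                  # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels

inputs, labels = mask_tokens([5, 17, 42, 8, 99, 23, 61, 7], rng=random.Random(42))
```

The loss is computed only at the masked positions, which is why unmasked positions carry the ignore label `-100`.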
### Phase 2: Multi-Task Learning (MTL)
- Optimization: Automatic Mixed Precision (AMP) with a batch size of 48 for high-speed convergence.
- Balancing: The POS dataset was oversampled 65x to prevent gradient starvation in this low-resource setting.
- Validation: Evaluated using an 80/20 train-test split.
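The two data-handling steps above can be sketched with plain lists (the actual pipeline details are not published; this only shows the idea). Oversampling is applied to the training split only, so the held-out 20% stays untouched.

```python
import random

def train_test_split(examples, test_frac=0.2, seed=0):
    """Shuffle and hold out `test_frac` of the examples for evaluation."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def oversample(examples, factor):
    # Repeat the dataset so each epoch sees `factor` copies of every example,
    # keeping the small task's gradient contribution comparable to the others.
    return examples * factor

pos_data = [f"pos_sentence_{i}" for i in range(100)]
train, test = train_test_split(pos_data)   # 80 train / 20 test
train_balanced = oversample(train, 65)     # 80 * 65 = 5200 training examples
```

Oversampling before splitting would leak duplicated training examples into the test set, which is why the split happens first.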
## Evaluation Results (Unseen Data)
On a strictly external test set, the model achieved the following results:
| Task | F1-Score | Support |
|---|---|---|
| Sentiment Analysis | 1.00 | 3,154 |
| POS Tagging (Weighted) | 0.96 | 2,087 |
| NER (Macro Avg) | 0.69 | 317,371 |
Research Insight: The model shows exceptional performance in identifying Geopolitical Entities (GPE F1: 0.85) and Person Names (Person F1: 0.81), making it highly effective for regional information extraction.
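Note that the table mixes two averaging schemes: weighted F1 (POS) scales each class by its support, while macro F1 (NER) weights every class equally, so rare entity classes pull the NER score down. The numbers below are illustrative, not the model's actual per-class results.

```python
# How the two averages in the table combine per-class F1 scores.

def macro_f1(per_class):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1 for f1, _ in per_class) / len(per_class)

def weighted_f1(per_class):
    """Support-weighted mean of per-class F1: big classes dominate."""
    total = sum(sup for _, sup in per_class)
    return sum(f1 * sup for f1, sup in per_class) / total

# (f1, support) for three hypothetical classes, e.g. O, GPE, PERSON
per_class = [(0.98, 900), (0.85, 60), (0.81, 40)]

print(round(macro_f1(per_class), 2))     # 0.88
print(round(weighted_f1(per_class), 2))  # 0.97
```

This is why a model can post a high weighted score while its macro score stays modest, as with the NER head here.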
## How to Use
```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, DistilBertModel
from safetensors.torch import load_model
from huggingface_hub import hf_hub_download

# 1. Define the Multi-Task Architecture
# This must match the 'Triple Pillar' configuration used during training
class SindhiMultiTaskModel(nn.Module):
    def __init__(self, n_ner=23, n_pos=18, n_sent=2):
        super().__init__()
        # Loads the PCT-specialized encoder foundation
        self.encoder = DistilBertModel.from_pretrained("Kashif786/sindhi_distilbert_colab")
        self.ner_head = nn.Linear(768, n_ner)
        self.pos_head = nn.Linear(768, n_pos)
        self.sent_head = nn.Linear(768, n_sent)

    def forward(self, ids, mask, task='ner'):
        out = self.encoder(ids, attention_mask=mask)[0]
        if task == 'sentiment':
            return self.sent_head(out[:, 0])  # [CLS] position: global context for sentiment
        elif task == 'ner':
            return self.ner_head(out)         # Token-level context for NER
        elif task == 'pos':
            return self.pos_head(out)         # Token-level context for POS
        raise ValueError(f"Unknown task: {task}")

# 2. Set up device and tokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "Kashif786/Sindhi-Triple-Pillar-BERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 3. Load model weights via safetensors
model = SindhiMultiTaskModel().to(device)
weights_path = hf_hub_download(repo_id=model_id, filename="model.safetensors")
load_model(model, weights_path)
model.eval()

# 4. Inference example
# "Dr. Tafseer Ahmed is working on NLP for the Sindhi language."
text = "ڊاڪٽر تفسير احمد سنڌي ٻوليءَ جي اين ايل پي تي ڪم ڪري رهيو آهي."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)

with torch.no_grad():
    # Example: predict named entities (NER)
    ner_logits = model(inputs['input_ids'], inputs['attention_mask'], task='ner')
    predictions = torch.argmax(ner_logits, dim=2)

print("Model successfully loaded and ready for Sindhi NLP tasks!")
```
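To turn per-token predictions into entity spans, the label ids must be mapped to tag names and grouped. The sketch below assumes a BIO tag scheme (plausible for a 23-class NER head, but the authoritative label map ships with the model's config); the `tokens`/`tags` example is invented for illustration.

```python
# Grouping (token, BIO-tag) pairs into (entity_type, text) spans.
# The tag names here are hypothetical; read the real mapping from the
# model's config (id2label) before decoding actual predictions.

def decode_bio(tokens, tags):
    """Collect B-/I- runs of the same entity type into spans."""
    spans, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and ctype == tag[2:]:
            current.append(tok)
        else:                      # "O" or an inconsistent I- tag closes the span
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        spans.append((ctype, " ".join(current)))
    return spans

tokens = ["ڊاڪٽر", "تفسير", "احمد", "سنڌ", "۾", "رهي", "ٿو"]
tags   = ["O", "B-PERSON", "I-PERSON", "B-GPE", "O", "O", "O"]
print(decode_bio(tokens, tags))  # [('PERSON', 'تفسير احمد'), ('GPE', 'سنڌ')]
```

For real model output, also drop positions where the attention mask is 0 and where the tokenizer emitted special tokens before decoding.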
## ⚠️ Limitations
The model currently exhibits a slight bias toward "Negative" sentiment labels when processing general conversational text, as the sentiment head was fine-tuned primarily on e-commerce reviews (`eproduct.csv`).