Roman Urdu Sentiment Analysis Model
Model Description
This model performs sentiment analysis on Roman Urdu (Romanized Urdu) text, classifying input into three categories: Negative, Neutral, and Positive. It can analyze both single sentences and batch process multiple sentences from files.
Intended Uses
This model is designed for:
- Businesses and companies analyzing customer reviews and feedback in Roman Urdu
- Social media monitoring to understand public sentiment
- Product and service evaluation based on customer comments
- Market research on Urdu-speaking audiences using Roman script
Model Architecture
- Base Model: nlptown/bert-base-multilingual-uncased-sentiment
- Architecture: BERT (Bidirectional Encoder Representations from Transformers)
- Modifications: Custom classification head with dropout (p=0.3)
- Output Classes: 3 (Negative, Neutral, Positive)
- Max Sequence Length: 512 tokens
Training Data
The model was trained on a combined dataset of Roman Urdu reviews from three sources:
- Daraz Labelled Review Dataset: Customer reviews from Daraz, Pakistan's leading e-commerce platform
- Pakistan Car Reviews: Automotive reviews and feedback from Pakistani consumers
- Brand Reviews: General brand and product reviews in Roman Urdu
Dataset Characteristics:
- Language: Roman Urdu (Romanized Urdu script)
- Sources: E-commerce, automotive, and general brand reviews
- Combination Method: All three datasets were merged into a single training corpus
- Domain Coverage: Multi-domain (e-commerce products, automobiles, general brands)
- Labels: 3-class sentiment labels (Negative, Neutral, Positive)
This diverse combination of datasets enables the model to generalize across different product categories and review types commonly found in Pakistani consumer feedback.
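The merging step described above can be sketched as follows. This is a hypothetical illustration, not the actual preprocessing code: the column names (`text`, `label`) are assumptions, and in-memory `StringIO` buffers stand in for the three source CSV files.

```python
import csv
import io
import random

def load_reviews(handle):
    """Read (text, label) pairs from a CSV with 'text' and 'label' columns (assumed schema)."""
    return [(row["text"], row["label"]) for row in csv.DictReader(handle)]

# In practice each source would be a file on disk; StringIO stands in here.
daraz = io.StringIO("text,label\nyeh product bohat acha hai,Positive\n")
cars = io.StringIO("text,label\ngaari ki performance buri hai,Negative\n")
brands = io.StringIO("text,label\ntheek hai,Neutral\n")

# Merge all three sources into a single corpus, then shuffle so that
# train/test splits mix the e-commerce, automotive, and brand domains.
corpus = []
for source in (daraz, cars, brands):
    corpus.extend(load_reviews(source))
random.shuffle(corpus)
print(len(corpus))  # 3
```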
Note: The model is particularly well-suited for analyzing product reviews, customer feedback, and brand sentiment in the Pakistani market context.
Performance Metrics
The model achieves 87%+ accuracy on sentiment classification tasks for Roman Urdu text.
Overall Performance:
- Accuracy: 87%+
Note: Detailed performance metrics including precision, recall, and F1-scores for individual classes will be added in future updates.
Evaluation:
The model was evaluated on a held-out test set from the combined dataset of Daraz reviews, Pakistan car reviews, and brand reviews.
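Since per-class metrics are not yet published, the sketch below shows how precision, recall, and F1 per class could be computed on such a held-out split. The label lists are illustrative placeholders, not results from the actual test set.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_metrics(y_true, y_pred, labels):
    """Compute precision, recall, and F1 for each class from label lists."""
    stats = {}
    for label in labels:
        tp = sum(t == p == label for t, p in zip(y_true, y_pred))
        fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        stats[label] = {"precision": precision, "recall": recall, "f1": f1}
    return stats

# Placeholder labels for illustration only
y_true = ["Positive", "Negative", "Neutral", "Positive", "Negative"]
y_pred = ["Positive", "Negative", "Positive", "Positive", "Negative"]
print(accuracy(y_true, y_pred))  # 0.8
print(per_class_metrics(y_true, y_pred, ["Negative", "Neutral", "Positive"]))
```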
How to Use
Installation
pip install transformers torch huggingface_hub
Basic Usage
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer, BertConfig
from huggingface_hub import hf_hub_download
# Define the model class: BERT encoder with a dropout + linear classification head
class ModifiedBertForSentiment(nn.Module):
    def __init__(self, config, n_classes):
        super().__init__()
        self.bert = BertModel(config)
        self.drop = nn.Dropout(p=0.3)
        self.out = nn.Linear(config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.last_hidden_state.mean(dim=1)  # mean pooling over tokens
        output = self.drop(pooled_output)
        return self.out(output)
# Load model and tokenizer
class_names = ['Negative', 'Neutral', 'Positive']
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = BertTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
config = BertConfig.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = ModifiedBertForSentiment(config, len(class_names))
# Download and load weights
model_file = hf_hub_download(
    repo_id="makbar023/roman-sentiment-model",
    filename="roman_Sentiment.pth"
)
model.load_state_dict(torch.load(model_file, map_location=device))
model.to(device)
model.eval()
# Predict sentiment
def predict_sentiment(text):
    inputs = tokenizer(text, padding=True, truncation=True,
                       return_tensors='pt', max_length=512)
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    _, preds = torch.max(outputs, dim=1)
    probs = torch.nn.functional.softmax(outputs, dim=1)
    sentiment = class_names[preds.item()]
    confidence = {class_names[i]: float(probs[0][i]) for i in range(len(class_names))}
    return sentiment, confidence
# Example ("yeh product bohat acha hai" = "this product is very good")
text = "yeh product bohat acha hai"
sentiment, probabilities = predict_sentiment(text)
print(f"Sentiment: {sentiment}")
print(f"Probabilities: {probabilities}")
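Batch Processing
As noted in the description, the model can also score multiple sentences from a file. A minimal sketch, one sentence per line: `predict_sentiment` is the function from Basic Usage above; the trivial stand-in defined here only makes the sketch runnable on its own.

```python
def predict_sentiment(text):
    # Stand-in for the real function defined in Basic Usage
    return "Positive", {"Negative": 0.1, "Neutral": 0.2, "Positive": 0.7}

def analyze_file(path):
    """Score each non-empty line of a UTF-8 text file and collect the results."""
    results = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            text = line.strip()
            if not text:  # skip blank lines
                continue
            sentiment, probs = predict_sentiment(text)
            results.append({"text": text,
                            "sentiment": sentiment,
                            "confidence": max(probs.values())})
    return results
```

For large files, batching several sentences per tokenizer call (the tokenizer already accepts a list of strings) would be faster than scoring line by line.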
Limitations
- Script-specific: The model is trained specifically for Roman Urdu and may not perform well on native Urdu script (Nastaliq/Naskh)
- Code-mixing: Performance may vary with heavy English-Urdu code-mixing
- Domain specificity: The model's accuracy depends on the similarity between your use case and the training data domain
- Informal language: May struggle with heavy use of slang, abbreviations, or non-standard spellings
- Context length: Limited to 512 tokens; longer texts will be truncated
- Sarcasm and irony: Like most sentiment models, may misclassify sarcastic or ironic statements
Ethical Considerations
- Bias in training data: The model may reflect biases present in the training dataset. Users should validate outputs, especially for sensitive applications
- Cultural context: Sentiment expressions vary across cultures; this model is calibrated for Urdu-speaking communities
- Privacy: When analyzing user-generated content, ensure compliance with data privacy regulations (GDPR, local laws)
- Not a substitute for human judgment: Automated sentiment analysis should complement, not replace, human analysis for critical decisions
- Transparency: Inform users when their content is being analyzed by automated systems
- Misuse potential: Should not be used for surveillance, discrimination, or manipulation of individuals or groups
Citation
If you use this model in your research or applications, please cite:
@misc{roman-urdu-sentiment,
  author = {Muhammad Akbar},
  title = {Roman Urdu Sentiment Analysis Model},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/makbar023/roman-sentiment-model}}
}
Contact
For questions or feedback, please open an issue on the model repository or contact makber023@gmail.com.