Roman Urdu Sentiment Analysis Model

Model Description

This model performs sentiment analysis on Roman Urdu (Urdu written in the Latin script), classifying input into three categories: Negative, Neutral, and Positive. It can analyze a single sentence or batch-process multiple sentences from a file.

Intended Uses

This model is designed for:

  • Businesses and companies analyzing customer reviews and feedback in Roman Urdu
  • Social media monitoring to understand public sentiment
  • Product and service evaluation based on customer comments
  • Market research on Urdu-speaking audiences using Roman script

Model Architecture

  • Base Model: nlptown/bert-base-multilingual-uncased-sentiment
  • Architecture: BERT (Bidirectional Encoder Representations from Transformers)
  • Modifications: Custom classification head with dropout (p=0.3)
  • Output Classes: 3 (Negative, Neutral, Positive)
  • Max Sequence Length: 512 tokens

Training Data

The model was trained on a combined dataset of Roman Urdu reviews from three sources:

  1. Daraz Labelled Review Dataset: Customer reviews from Daraz, Pakistan's leading e-commerce platform
  2. Pakistan Car Reviews: Automotive reviews and feedback from Pakistani consumers
  3. Brand Reviews: General brand and product reviews in Roman Urdu

Dataset Characteristics:

  • Language: Roman Urdu (Romanized Urdu script)
  • Sources: E-commerce, automotive, and general brand reviews
  • Combination Method: All three datasets were merged into a single training corpus
  • Domain Coverage: Multi-domain (e-commerce products, automobiles, general brands)
  • Labels: 3-class sentiment labels (Negative, Neutral, Positive)

This diverse combination of datasets enables the model to generalize across different product categories and review types commonly found in Pakistani consumer feedback.

Note: The model is particularly well-suited for analyzing product reviews, customer feedback, and brand sentiment in the Pakistani market context.

Performance Metrics

The model achieves 87%+ accuracy on sentiment classification tasks for Roman Urdu text.

Overall Performance:

  • Accuracy: 87%+

Note: Detailed performance metrics including precision, recall, and F1-scores for individual classes will be added in future updates.

Evaluation:

The model was evaluated on a held-out test set from the combined dataset of Daraz reviews, Pakistan car reviews, and brand reviews.

How to Use

Installation

pip install transformers torch huggingface_hub

Basic Usage

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer, BertConfig
from huggingface_hub import hf_hub_download

# Define the model class
class ModifiedBertForSentiment(nn.Module):
    def __init__(self, config, n_classes):
        super(ModifiedBertForSentiment, self).__init__()
        self.bert = BertModel(config)
        self.drop = nn.Dropout(p=0.3)
        self.out = nn.Linear(config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.last_hidden_state.mean(dim=1)
        output = self.drop(pooled_output)
        return self.out(output)

# Load model and tokenizer
class_names = ['Negative', 'Neutral', 'Positive']
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = BertTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
config = BertConfig.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = ModifiedBertForSentiment(config, len(class_names))

# Download and load weights
model_file = hf_hub_download(
    repo_id="makbar023/roman-sentiment-model",
    filename="roman_Sentiment.pth"
)
model.load_state_dict(torch.load(model_file, map_location=device))
model.to(device)
model.eval()

# Predict sentiment
def predict_sentiment(text):
    inputs = tokenizer(text, padding=True, truncation=True,
                       return_tensors='pt', max_length=512)
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)
    
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        _, preds = torch.max(outputs, dim=1)
        probs = torch.nn.functional.softmax(outputs, dim=1)
    
    sentiment = class_names[preds.item()]
    confidence = {class_names[i]: float(probs[0][i]) for i in range(len(class_names))}
    
    return sentiment, confidence

# Example
text = "yeh product bohat acha hai"
sentiment, probabilities = predict_sentiment(text)
print(f"Sentiment: {sentiment}")
print(f"Probabilities: {probabilities}")
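
Batch Processing

The description notes that the model can batch-process multiple sentences from files. A minimal sketch of such a helper is below; it takes the prediction function as a parameter (e.g., the `predict_sentiment` function defined in Basic Usage), so the file-handling logic stays independent of the model setup:

```python
def predict_file(path, predict_fn):
    """Run sentiment prediction on each non-empty line of a text file.

    path:       path to a UTF-8 text file with one sentence per line
    predict_fn: callable returning (sentiment, confidence) for a string,
                e.g. the predict_sentiment function from Basic Usage
    """
    results = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            text = line.strip()
            if not text:  # skip blank lines
                continue
            sentiment, confidence = predict_fn(text)
            results.append((text, sentiment, confidence))
    return results

# Example (assumes predict_sentiment is defined as in Basic Usage):
# for text, sentiment, confidence in predict_file("reviews.txt", predict_sentiment):
#     print(f"{sentiment}\t{text}")
```

For large files, predictions could also be batched through the tokenizer (it already supports list inputs with `padding=True`) rather than called one line at a time, which is faster on GPU.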

Limitations

  • Script-specific: The model is trained specifically for Roman Urdu and may not perform well on native Urdu script (Nastaliq/Naskh)
  • Code-mixing: Performance may vary with heavy English-Urdu code-mixing
  • Domain specificity: The model's accuracy depends on the similarity between your use case and the training data domain
  • Informal language: May struggle with heavy use of slang, abbreviations, or non-standard spellings
  • Context length: Limited to 512 tokens; longer texts will be truncated
  • Sarcasm and irony: Like most sentiment models, may misclassify sarcastic or ironic statements
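
Regarding the 512-token limit above: one common workaround (not built into this model) is to split the tokenized input into overlapping windows, classify each window separately, and aggregate the per-window probabilities, for example by averaging. A minimal sketch of the windowing step, with assumed `max_len` and `stride` values:

```python
def chunk_tokens(token_ids, max_len=512, stride=256):
    """Split a list of token ids into overlapping windows of at most max_len.

    Consecutive windows overlap by (max_len - stride) tokens so that no
    sentence is cut without context. Returns a list of id lists, each of
    which can be run through the model independently.
    """
    chunks = []
    start = 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # final window reached the end of the input
        start += stride
    return chunks
```

Each chunk can then be passed through `predict_sentiment` (after decoding or by building input tensors directly), and the class probabilities averaged across chunks to get a document-level sentiment.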

Ethical Considerations

  • Bias in training data: The model may reflect biases present in the training dataset. Users should validate outputs, especially for sensitive applications
  • Cultural context: Sentiment expressions vary across cultures; this model is calibrated for Urdu-speaking communities
  • Privacy: When analyzing user-generated content, ensure compliance with data privacy regulations (GDPR, local laws)
  • Not a substitute for human judgment: Automated sentiment analysis should complement, not replace, human analysis for critical decisions
  • Transparency: Inform users when their content is being analyzed by automated systems
  • Misuse potential: Should not be used for surveillance, discrimination, or manipulation of individuals or groups

Citation

If you use this model in your research or applications, please cite:

@misc{roman-urdu-sentiment,
  author = {Muhammad Akbar},
  title = {Roman Urdu Sentiment Analysis Model},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/makbar023/roman-sentiment-model}}
}

Contact

For questions or feedback, please open an issue on the model repository or contact makber023@gmail.com.
