---
language:
- fil
- eng
license: cc-by-4.0
library_name: transformers
tags:
- sentiment-analysis
- code-switching
- taglish
- filipino-nlp
- xlm-roberta
- lexiliksik
- thesis-model
---

# LexCAT: Lexicon-Enhanced Code-Switched Attention Transformer for Tagalog–English Sentiment Analysis

**Author**: Glenn Marcus D. Cinco
**Institution**: Mapúa University
**Thesis Title**: LexCAT: A Lexicon-Based Approach for Code-Switching Analysis with Transformers Using XLM-RoBERTa and LexiLiksik
**Degree**: BS & MS in Computer Science
**Date**: July 2025

---

## 🎓 Abstract

LexCAT is a fine-tuned XLM-RoBERTa model enhanced with **LexiLiksik**, a cross-lingual, context-aware Tagalog–English sentiment lexicon developed to handle **intra-sentential sentiment shifts** in code-switched Taglish text. LexCAT achieves **84.31% validation accuracy** on the FiReCS dataset, significantly outperforming monolingual baselines (FilCon: 73.12%, SentiWordNet: 69.52%).

This model is the final output of a thesis that systematically addresses gaps in code-switched sentiment analysis through lexicon development, attention biasing, and real-world fine-tuning.

---

## 🔧 Model Architecture

- **Base Model**: `xlm-roberta-base`
- **Enhancement**: Integrated **LexiLiksik** via:
  - Soft constraints during fine-tuning
  - Attention weight adjustment for sentiment-relevant tokens
  - Metadata-aware pooling (POS, code-switching type)
- **Fine-tuning Dataset**: FiReCS (10,487 Filipino–English code-switched reviews)
- **Evaluation Dataset**: SentiTaglish Products and Services + FiReCS Test Set
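
As a rough illustration of the attention weight adjustment listed above, the sketch below adds a lexicon-derived bias to raw attention scores before the softmax, so that tokens with strong lexicon polarity receive more attention mass. It is not the thesis implementation: the toy lexicon entries, the `alpha` scaling factor, and the function names are all assumptions made for illustration, and setting `alpha` to 0 recovers plain softmax attention.

```python
import math

# Hypothetical toy lexicon: token -> polarity in [-1, 1].
# These entries are illustrative and are NOT taken from LexiLiksik.
TOY_LEXICON = {"maganda": 0.8, "expensive": -0.7, "lambot": 0.6}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def lexicon_biased_attention(tokens, raw_scores, lexicon, alpha=1.0):
    """Add alpha * |polarity| to each token's raw score, then normalize.

    alpha is an assumed hyperparameter, not a value from the thesis.
    """
    biased = [
        score + alpha * abs(lexicon.get(token.lower(), 0.0))
        for token, score in zip(tokens, raw_scores)
    ]
    return softmax(biased)


tokens = ["Maganda", "pero", "expensive"]
raw_scores = [0.0, 0.0, 0.0]  # uniform scores, so any difference comes from the bias

plain = softmax(raw_scores)
biased = lexicon_biased_attention(tokens, raw_scores, TOY_LEXICON, alpha=2.0)
for token, p, b in zip(tokens, plain, biased):
    print(f"{token:>10}: plain={p:.3f} biased={b:.3f}")
```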

---

## 📊 Performance

| Metric | Value |
|--------|-------|
| Accuracy | 84.31% |
| F1-Score | 0.8566 |
| Precision | 0.8353 |
| Recall | 0.8574 |
| Cohen’s κ (FiReCS inter-annotator agreement) | 0.83 |
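
For readers who want to compute the same kinds of metrics on their own predictions, the following is a minimal pure-Python sketch. The card does not state which averaging scheme was used, so macro averaging here is an assumption, and the label sequences are toy data rather than FiReCS examples.

```python
from collections import Counter

LABELS = [0, 1, 2]  # 0 = Negative, 1 = Neutral, 2 = Positive


def accuracy(y_true, y_pred):
    """Fraction of positions where the two label sequences agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_prf(y_true, y_pred, labels=LABELS):
    """Return macro-averaged (precision, recall, F1) over the given labels."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two label sequences."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy labels for illustration only; these are not FiReCS data.
y_true = [0, 0, 1, 1, 2, 2, 2, 0]
y_pred = [0, 1, 1, 1, 2, 2, 0, 0]

precision, recall, f1 = macro_prf(y_true, y_pred)
# Here the two toy sequences are treated as two raters to demonstrate kappa.
kappa = cohens_kappa(y_true, y_pred)
print(f"accuracy={accuracy(y_true, y_pred):.4f}")
print(f"macro P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
print(f"kappa={kappa:.4f}")
```

A macro-averaged F1 is the mean of the per-class F1 scores, so it need not equal the harmonic mean of the macro-averaged precision and recall.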

> ✅ **Key Strength**: Correctly classifies contrastive phrases like *“Maganda pero expensive”* as **negative**.
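
To make the contrastive case concrete, here is a toy lexicon-only scorer in which the clause after a contrast marker such as "pero" counts more than the clause before it. This is a simplified stand-in for intuition, not how LexCAT reaches its prediction: the lexicon entries, the marker set, the `post_weight` factor, and the `margin` threshold are all hypothetical.

```python
# Hypothetical toy lexicon and contrast markers; NOT taken from LexiLiksik.
TOY_LEXICON = {"maganda": 0.8, "lambot": 0.6, "expensive": -0.7}
CONTRAST_MARKERS = {"pero", "but"}


def contrast_weighted_score(text, lexicon=TOY_LEXICON, post_weight=2.0):
    """Sum lexicon polarities, scaling tokens after a contrast marker by post_weight."""
    score, weight = 0.0, 1.0
    for token in text.lower().split():
        if token in CONTRAST_MARKERS:
            weight = post_weight  # the clause after the marker counts more
            continue
        score += weight * lexicon.get(token, 0.0)
    return score


def label(score, margin=0.1):
    """Map a polarity score to a sentiment label using an assumed neutral margin."""
    if score > margin:
        return "Positive"
    if score < -margin:
        return "Negative"
    return "Neutral"


for text in ["Maganda pero expensive", "Expensive pero maganda"]:
    score = contrast_weighted_score(text)
    print(f"{text!r}: score={score:+.2f} -> {label(score)}")
```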

---

## 📥 How to Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Replace the placeholder with the actual Hub repository ID of the model.
model_name = "your-hf-username/LexCAT-LexiLiksik-Final"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # disable dropout for inference

text = "sobrang lambot ng burger pero expensive tlga"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# No gradients are needed at inference time.
with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=-1).item()

# Class: 0 = Negative, 1 = Neutral, 2 = Positive
sentiment = ["Negative", "Neutral", "Positive"][predicted_class]
print(f"Predicted Sentiment: {sentiment}")
```

---

## 📚 Citation

```bibtex
@mastersthesis{cinco2025lexcat,
  author = {Cinco, Glenn Marcus D.},
  title = {LexCAT: A Lexicon-Based Approach for Code-Switching Analysis with Transformers Using XLM-RoBERTa and LexiLiksik},
  school = {Mapúa University},
  year = {2025},
  month = {July}
}
```