Update README.md
README.md
CHANGED
@@ -1,3 +1,91 @@
---
language:
- fil
- eng
license: cc-by-4.0
library_name: transformers
tags:
- sentiment-analysis
- code-switching
- taglish
- filipino-nlp
- xlm-roberta
- lexiliksik
- thesis-model
---

# LexCAT: Lexicon-Enhanced Code-Switched Attention Transformer for Tagalog–English Sentiment Analysis

**Author**: Glenn Marcus D. Cinco
**Institution**: Mapúa University
**Thesis Title**: LexCAT: A Lexicon-Based Approach for Code-Switching Analysis with Transformers Using XLM-RoBERTa and LexiLiksik
**Degree**: BS & MS in Computer Science
**Date**: July 2025

---

## 🎓 Abstract

LexCAT is a fine-tuned XLM-RoBERTa model enhanced with **LexiLiksik**, a cross-lingual, context-aware Tagalog–English sentiment lexicon developed to handle **intra-sentential sentiment shifts** in code-switched Taglish text. LexCAT achieves **84.31% validation accuracy** on the FiReCS dataset, significantly outperforming monolingual baselines (FilCon: 73.12%, SentiWordNet: 69.52%).

This model is the final output of a thesis that systematically addresses gaps in code-switched sentiment analysis through lexicon development, attention biasing, and real-world fine-tuning.

---

## 🔧 Model Architecture

- **Base Model**: `xlm-roberta-base`
- **Enhancement**: Integrated **LexiLiksik** via:
  - Soft constraints during fine-tuning
  - Attention weight adjustment for sentiment-relevant tokens
  - Metadata-aware pooling (POS, code-switching type)
- **Fine-tuning Dataset**: FiReCS (10,487 Filipino–English code-switched reviews)
- **Evaluation Dataset**: SentiTaglish Products and Services + FiReCS Test Set
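
One plausible reading of "attention weight adjustment for sentiment-relevant tokens" is an additive bias on attention logits for tokens found in a sentiment lexicon. The sketch below illustrates that general idea in plain Python; it is not LexCAT's actual implementation. The toy lexicon entries, the `bias_scale` parameter, and the function names are all hypothetical.

```python
import math

# Hypothetical toy lexicon (token -> sentiment strength in [0, 1]).
# Illustrative values only; these are not LexiLiksik entries.
TOY_LEXICON = {"maganda": 0.8, "expensive": 0.9, "lambot": 0.6}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def lexicon_biased_attention(tokens, raw_scores, bias_scale=1.0):
    """Add a lexicon-derived bias to each token's attention logit, then normalize."""
    biased = [
        score + bias_scale * TOY_LEXICON.get(token.lower(), 0.0)
        for token, score in zip(tokens, raw_scores)
    ]
    return softmax(biased)


tokens = ["Maganda", "pero", "expensive"]
raw_scores = [0.0, 0.0, 0.0]  # uniform logits, so any difference comes from the bias

plain = softmax(raw_scores)
biased = lexicon_biased_attention(tokens, raw_scores)

for token, before, after in zip(tokens, plain, biased):
    print(f"{token:>10}: {before:.3f} -> {after:.3f}")
```

With uniform starting logits, the weight on `pero` drops and `expensive` receives the largest share, because it carries the largest toy lexicon strength. In a real transformer the same kind of bias would be added per attention head before the softmax.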

---

## 📊 Performance

| Metric | Value |
|--------------|---------|
| Accuracy | 84.31% |
| F1-Score | 0.8566 |
| Precision | 0.8353 |
| Recall | 0.8574 |
| Cohen’s κ (inter-annotator agreement, FiReCS annotators) | 0.83 |

> ✅ **Key Strength**: Correctly classifies contrastive phrases like *“Maganda pero expensive”* as **negative**.
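
For readers who want to reproduce this kind of table on their own predictions, here is a minimal sketch of the metrics in plain Python. The model card does not state how precision, recall, and F1 are averaged across the three classes, so macro-averaging is an assumption here. The label lists are toy data, not FiReCS, and the function names are illustrative.

```python
from collections import Counter

LABELS = [0, 1, 2]  # 0 = Negative, 1 = Neutral, 2 = Positive


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_prf(y_true, y_pred, labels=LABELS):
    """Macro-averaged precision, recall, and F1 (assumed averaging scheme)."""
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two label sequences, corrected for chance."""
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(
        counts_a[k] * counts_b[k] for k in set(rater_a) | set(rater_b)
    ) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

acc = accuracy(y_true, y_pred)
precision, recall, f1 = macro_prf(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
print(f"accuracy={acc:.4f} precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} kappa={kappa:.4f}")
```

Cohen's κ is normally computed between two annotators' labels; the sketch passes gold and predicted labels only to have two sequences to compare.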

---

## 📥 How to Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Placeholder repository ID: replace with the actual Hugging Face model path.
model_name = "your-hf-username/LexCAT-LexiLiksik-Final"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode (disables dropout)

text = "sobrang lambot ng burger pero expensive tlga"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Gradients are not needed for inference.
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to class probabilities and pick the most likely class.
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=-1).item()

# Class: 0 = Negative, 1 = Neutral, 2 = Positive
sentiment = ["Negative", "Neutral", "Positive"][predicted_class]
print(f"Predicted Sentiment: {sentiment}")
```
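
If a confidence score is wanted alongside the label, the logits can be post-processed as in the sketch below. It uses only the standard library and the class mapping given in the snippet above. The helper name `describe_logits` and the example logits are hypothetical; they are not produced by the model.

```python
import math

# Class mapping from the usage snippet: 0 = Negative, 1 = Neutral, 2 = Positive
ID2LABEL = {0: "Negative", 1: "Neutral", 2: "Positive"}


def describe_logits(logits):
    """Turn a list of three raw logits into a label, confidence, and per-class probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "label": ID2LABEL[best],
        "confidence": probs[best],
        "probabilities": {ID2LABEL[i]: p for i, p in enumerate(probs)},
    }


# Hypothetical logits, for illustration only.
result = describe_logits([2.0, 0.5, -1.0])
print(result["label"], round(result["confidence"], 3))
```

With the real model, the input would be `outputs.logits[0].tolist()` for a single text.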

---

## 📚 Citation

@mastersthesis{cinco2025lexcat,
  author = {Cinco, Glenn Marcus D.},
  title = {LexCAT: A Lexicon-Based Approach for Code-Switching Analysis with Transformers Using XLM-RoBERTa and LexiLiksik},
  school = {Mapúa University},
  year = {2025},
  month = {July}
}