GMCTech/LexCAT

Text Classification

sentiment-analysis

text-embeddings-inference

GMCTech committed on Sep 23, 2025

Commit b4611f6 · verified · 1 Parent(s): 42586bc

Update README.md

Files changed (1)

README.md +91 -3

@@ -1,3 +1,91 @@
----
-license: cc-by-4.0
----

+---
+language:
+- fil
+- eng
+license: cc-by-4.0
+library_name: transformers
+tags:
+- sentiment-analysis
+- code-switching
+- taglish
+- filipino-nlp
+- xlm-roberta
+- lexiliksik
+- thesis-model
+---
+# LexCAT: Lexicon-Enhanced Code-Switched Attention Transformer for Tagalog–English Sentiment Analysis
+- **Author**: Glenn Marcus D. Cinco
+- **Institution**: Mapúa University
+- **Thesis Title**: LexCAT: A Lexicon-Based Approach for Code-Switching Analysis with Transformers Using XLM-RoBERTa and LexiLiksik
+- **Degree**: BS & MS in Computer Science
+- **Date**: July 2025
+---
+## 🎓 Abstract
+LexCAT is a fine-tuned XLM-RoBERTa model enhanced with **LexiLiksik**, a cross-lingual, context-aware Tagalog–English sentiment lexicon developed to handle **intra-sentential sentiment shifts** in code-switched Taglish text. LexCAT achieves **84.31% validation accuracy** on the FiReCS dataset, outperforming the monolingual baselines FilCon (73.12%) and SentiWordNet (69.52%) by about 11 and 15 percentage points, respectively.
+This model is the final output of a thesis that addresses gaps in code-switched sentiment analysis through lexicon development, attention biasing, and fine-tuning on real-world review data.
+
+---
+## 🔧 Model Architecture
+- **Base Model**: `xlm-roberta-base`
+- **Enhancement**: Integrated **LexiLiksik** via:
+  - Soft constraints during fine-tuning
+  - Attention weight adjustment for sentiment-relevant tokens
+  - Metadata-aware pooling (POS, code-switching type)
+- **Fine-tuning Dataset**: FiReCS (10,487 Filipino–English code-switched reviews)
+- **Evaluation Dataset**: SentiTaglish Products and Services + FiReCS Test Set
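As a rough illustration of the "attention weight adjustment for sentiment-relevant tokens" item in the list above, the sketch below adds a lexicon-derived bias to raw attention scores before the softmax. It is not the thesis implementation: the README does not say how LexiLiksik is wired into XLM-RoBERTa, and the `TOY_LEXICON` entries, the `alpha` strength parameter, and the function names are all hypothetical.

```python
import math

# Hypothetical stand-in for LexiLiksik: token -> sentiment strength in [0, 1].
# These entries and values are invented for illustration only.
TOY_LEXICON = {"maganda": 0.8, "expensive": 0.9, "lambot": 0.6}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def lexicon_biased_attention(tokens, raw_scores, alpha=1.0):
    """Add alpha * lexicon strength to each token's score, then normalize.

    Tokens missing from the lexicon get a bias of 0, so their weights
    change only through renormalization.
    """
    biased = [
        score + alpha * TOY_LEXICON.get(token.lower(), 0.0)
        for token, score in zip(tokens, raw_scores)
    ]
    return softmax(biased)


tokens = ["Maganda", "pero", "expensive"]
raw_scores = [0.0, 0.0, 0.0]  # uniform attention before biasing

plain = softmax(raw_scores)
biased = lexicon_biased_attention(tokens, raw_scores, alpha=1.0)
for token, before, after in zip(tokens, plain, biased):
    print(f"{token:>10}: {before:.3f} -> {after:.3f}")
```

With `alpha=0.0` the function reduces to an ordinary softmax, so the bias strength can be tuned or switched off without changing the rest of the pipeline.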
+---
+## 📊 Performance
+| Metric       | Value   |
+|--------------|---------|
+| Accuracy     | 84.31%  |
+| F1-Score     | 0.8566  |
+| Precision    | 0.8353  |
+| Recall       | 0.8574  |
+| Cohen’s κ (FiReCS inter-annotator agreement) | 0.83 |
+> ✅ **Key Strength**: Correctly classifies contrastive phrases such as *“Maganda pero expensive”* (“beautiful but expensive”) as **negative**.
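For readers who want to compute the same kinds of metrics on their own predictions, here is a minimal pure-Python sketch for the three-class label scheme. The README does not state which averaging scheme was used for precision, recall, and F1; macro averaging is an assumption here, and the toy labels are invented rather than taken from FiReCS.

```python
LABELS = [0, 1, 2]  # 0 = Negative, 1 = Neutral, 2 = Positive


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    precisions, recalls, f1s = [], [], []
    for label in LABELS:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = safe_div(tp, tp + fp)
        recall = safe_div(tp, tp + fn)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(safe_div(2 * precision * recall, precision + recall))
    n = len(LABELS)
    return {
        "accuracy": safe_div(correct, len(pairs)),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Invented toy labels, not FiReCS data.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```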
+---
+## 📥 How to Use
+```python
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+# Placeholder repo id from the README; replace it with the actual model repo.
+model_name = "your-hf-username/LexCAT-LexiLiksik-Final"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+model.eval()  # disable dropout for inference
+
+text = "sobrang lambot ng burger pero expensive tlga"
+inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
+
+# Inference only, so skip gradient tracking.
+with torch.no_grad():
+    outputs = model(**inputs)
+
+# Convert logits to class probabilities and pick the most likely class.
+probabilities = torch.softmax(outputs.logits, dim=-1)
+predicted_class = torch.argmax(probabilities, dim=-1).item()
+
+# Class: 0 = Negative, 1 = Neutral, 2 = Positive
+sentiment = ["Negative", "Neutral", "Positive"][predicted_class]
+print(f"Predicted Sentiment: {sentiment}")
+```
+---
+## 📚 Citation
+@mastersthesis{cinco2025lexcat,
+  author = {Cinco, Glenn Marcus D.},
+  title = {LexCAT: A Lexicon-Based Approach for Code-Switching Analysis with Transformers Using XLM-RoBERTa and LexiLiksik},
+  school = {Mapúa University},
+  year = {2025},
+  month = {July}
+}