	---
	language:
	- fil
	- eng
	license: cc-by-4.0
	library_name: transformers
	tags:
	- sentiment-analysis
	- code-switching
	- taglish
	- filipino-nlp
	- xlm-roberta
	- lexiliksik
	- thesis-model
	---

	# LexCAT: Lexicon-Enhanced Code-Switched Attention Transformer for Tagalog–English Sentiment Analysis

	- Author: Glenn Marcus D. Cinco
	- Institution: Mapúa University
	- Thesis Title: LexCAT: A Lexicon-Based Approach for Code-Switching Analysis with Transformers Using XLM-RoBERTa and LexiLiksik
	- Degree: BS & MS in Computer Science
	- Date: July 2025

	---

	## 🎓 Abstract

	LexCAT is a fine-tuned XLM-RoBERTa model enhanced with LexiLiksik, a cross-lingual, context-aware Tagalog–English sentiment lexicon developed to handle intra-sentential sentiment shifts in code-switched Taglish text. LexCAT achieves 84.31% validation accuracy on the FiReCS dataset, about 11 and 15 percentage points above the monolingual baselines FilCon (73.12%) and SentiWordNet (69.52%).

	This model is the final output of a thesis that systematically addresses gaps in code-switched sentiment analysis through lexicon development, attention biasing, and real-world fine-tuning.

	---

	## 🔧 Model Architecture

	- Base Model: `xlm-roberta-base`
	- Enhancement: Integrated LexiLiksik via:
	  - Soft constraints during fine-tuning
	  - Attention weight adjustment for sentiment-relevant tokens
	  - Metadata-aware pooling (POS, code-switching type)
	- Fine-tuning Dataset: FiReCS (10,487 Filipino–English code-switched reviews)
	- Evaluation Dataset: SentiTaglish Products and Services + FiReCS Test Set
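
The list above does not spell out how LexiLiksik scores enter the attention computation. As one possible reading of "attention weight adjustment for sentiment-relevant tokens", the sketch below adds a lexicon-derived bias to raw attention scores before the softmax. The toy lexicon `TOY_LEXICON`, the scale `alpha`, and the function names are all hypothetical; none of them are taken from the thesis or from LexiLiksik.

```python
import math

# Hypothetical toy lexicon: token -> sentiment strength in [0, 1].
# Illustrative values only; these are NOT LexiLiksik entries.
TOY_LEXICON = {"maganda": 0.8, "expensive": 0.9, "lambot": 0.6}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def lexicon_biased_attention(tokens, raw_scores, alpha=1.0):
    """Add alpha * lexicon strength to each raw score, then normalize.

    Tokens absent from the lexicon get a bias of 0, so their raw scores
    are unchanged before the softmax.
    """
    biased = [
        score + alpha * TOY_LEXICON.get(token.lower(), 0.0)
        for token, score in zip(tokens, raw_scores)
    ]
    return softmax(biased)


tokens = ["Maganda", "pero", "expensive"]
uniform = [0.0, 0.0, 0.0]  # equal raw scores, for illustration
plain = softmax(uniform)
biased = lexicon_biased_attention(tokens, uniform, alpha=2.0)
for token, p, b in zip(tokens, plain, biased):
    print(f"{token:10s} plain={p:.3f} biased={b:.3f}")
```

With equal raw scores, the two lexicon words gain attention mass at the expense of the connective "pero". A real integration would apply such a bias inside the transformer's attention layers rather than to a single score vector.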

	---

	## 📊 Performance

	| Metric | Value |
	|--------------|---------|
	| Accuracy | 84.31% |
	| F1-Score | 0.8566 |
	| Precision | 0.8353 |
	| Recall | 0.8574 |
	| Cohen’s κ | 0.83 (inter-annotator agreement among FiReCS annotators) |

	> ✅ Key Strength: Correctly classifies contrastive phrases like “Maganda pero expensive” as negative.
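
For readers who want to reproduce numbers of this kind on their own predictions, here is a minimal sketch of how the reported metrics are usually computed. It assumes macro averaging over the three classes, which this card does not state, and it uses made-up labels rather than FiReCS data. `macro_metrics` and `cohen_kappa` are illustrative helpers, not code from the thesis.

```python
from collections import Counter


def macro_metrics(y_true, y_pred, labels):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


def cohen_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up labels: 0 = Negative, 1 = Neutral, 2 = Positive
gold = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
report = macro_metrics(gold, pred, labels=[0, 1, 2])
kappa = cohen_kappa(gold, pred)
print(report)
print(f"kappa = {kappa:.2f}")
```

Macro-averaged F1 is the mean of the per-class F1 scores, so it need not equal the harmonic mean of the macro precision and macro recall.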

	---

	## 📥 How to Use

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	# Placeholder ID: replace with this model's repository ID on the Hub.
	model_name = "your-hf-username/LexCAT-LexiLiksik-Final"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)
	model.eval()  # disable dropout for inference

	text = "sobrang lambot ng burger pero expensive tlga"
	inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

	# No gradients are needed for prediction.
	with torch.no_grad():
	    outputs = model(**inputs)

	# Convert logits to class probabilities and pick the most likely class.
	probabilities = torch.softmax(outputs.logits, dim=-1)
	predicted_class = torch.argmax(probabilities, dim=-1).item()

	# Class: 0 = Negative, 1 = Neutral, 2 = Positive
	sentiment = ["Negative", "Neutral", "Positive"][predicted_class]
	print(f"Predicted Sentiment: {sentiment}")
	```
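
For several reviews at once, the same softmax-and-argmax step applies row by row. The sketch below shows that post-processing on plain Python lists so it runs without downloading the model. In practice the rows could come from `outputs.logits.tolist()`. The logit values are made up, `logits_to_labels` is a hypothetical helper, and the label order is the one given in the snippet above.

```python
import math

# Label order as stated in the usage snippet: 0, 1, 2.
LABELS = ["Negative", "Neutral", "Positive"]


def logits_to_labels(batch_logits):
    """Map each row of logits to a (label, confidence) pair."""
    results = []
    for row in batch_logits:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(value - peak) for value in row]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((LABELS[best], probs[best]))
    return results


# Made-up logits for two inputs, one row per input.
fake_logits = [[2.1, 0.3, -1.0], [-0.5, 0.2, 1.7]]
predictions = logits_to_labels(fake_logits)
for label, confidence in predictions:
    print(f"{label}: {confidence:.2f}")
```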
	---

	## 📚 Citation
	```bibtex
	@mastersthesis{cinco2025lexcat,
	  author = {Cinco, Glenn Marcus D.},
	  title  = {LexCAT: A Lexicon-Based Approach for Code-Switching Analysis with Transformers Using XLM-RoBERTa and LexiLiksik},
	  school = {Mapúa University},
	  year   = {2025},
	  month  = {July}
	}
	```