---
language:
- fil
- eng
license: cc-by-4.0
library_name: transformers
tags:
- sentiment-analysis
- code-switching
- taglish
- filipino-nlp
- xlm-roberta
- lexiliksik
- thesis-model
---

# LexCAT: Lexicon-Enhanced Code-Switched Attention Transformer for Tagalog–English Sentiment Analysis

- **Author**: Glenn Marcus D. Cinco
- **Institution**: Mapúa University
- **Thesis Title**: LexCAT: A Lexicon-Based Approach for Code-Switching Analysis with Transformers Using XLM-RoBERTa and LexiLiksik
- **Degree**: BS & MS in Computer Science
- **Date**: July 2025

---

## 🎓 Abstract

LexCAT is a fine-tuned XLM-RoBERTa model enhanced with **LexiLiksik**, a cross-lingual, context-aware Tagalog–English sentiment lexicon built to handle **intra-sentential sentiment shifts** in code-switched Taglish text. LexCAT reaches **84.31% validation accuracy** on the FiReCS dataset, substantially outperforming monolingual baselines (FilCon: 73.12%, SentiWordNet: 69.52%).

This model is the final output of a thesis that systematically addresses gaps in code-switched sentiment analysis through lexicon development, attention biasing, and fine-tuning on real-world review data.

---

## 🔧 Model Architecture

- **Base Model**: `xlm-roberta-base`
- **Enhancement**: Integrated **LexiLiksik** via:
  - Soft constraints during fine-tuning
  - Attention weight adjustment for sentiment-relevant tokens (see the sketch after this list)
  - Metadata-aware pooling (POS tags, code-switching type)
- **Fine-tuning Dataset**: FiReCS (10,487 Filipino–English code-switched reviews)
- **Evaluation Dataset**: SentiTaglish Products and Services + FiReCS test set

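To make the attention-biasing idea concrete, here is a minimal sketch of how lexicon scores could nudge pre-softmax attention toward sentiment-bearing tokens. This is **not** the released LexCAT code: the lexicon entries, the `scale` factor, and the bias formula are illustrative assumptions.

```python
import torch

# Hypothetical LexiLiksik fragment: token -> polarity score in [-1, 1].
# Entries and values are illustrative, not taken from the actual lexicon.
lexicon = {"maganda": 0.9, "pero": -0.2, "expensive": -0.8}

def bias_attention(tokens, attn_logits, scale=1.0):
    """Add a column-wise bias to pre-softmax attention logits so every
    query token attends more strongly to lexicon hits."""
    strength = torch.tensor([abs(lexicon.get(t.lower(), 0.0)) for t in tokens])
    return attn_logits + scale * strength.unsqueeze(0)  # broadcast over query rows

tokens = ["maganda", "pero", "expensive", "talaga"]
attn_logits = torch.zeros(len(tokens), len(tokens))  # dummy uniform logits
weights = torch.softmax(bias_attention(tokens, attn_logits), dim=-1)
print(weights[0])  # lexicon tokens now carry extra attention mass
```

Per the list above, the thesis applies this as a *soft* constraint during fine-tuning, so the model remains free to override lexicon priors in context.
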
---

## 📊 Performance

| Metric | Value |
|--------|-------|
| Accuracy | 84.31% |
| F1-score | 0.8566 |
| Precision | 0.8353 |
| Recall | 0.8574 |
| Cohen’s κ | 0.83 (inter-annotator agreement on FiReCS) |

> ✅ **Key Strength**: Correctly classifies contrastive phrases such as *“Maganda pero expensive”* (“nice but expensive”) as **negative**.

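For context, aggregate scores of this form are conventionally computed with `scikit-learn`. The sketch below uses dummy labels and assumes weighted averaging; it is not the thesis evaluation script.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Dummy gold labels and predictions (0 = Negative, 1 = Neutral, 2 = Positive);
# illustrative only, not the FiReCS evaluation data.
y_true = [0, 2, 1, 2, 0, 1]
y_pred = [0, 2, 1, 1, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"
)
print(f"Accuracy: {accuracy:.4f}  Precision: {precision:.4f}  "
      f"Recall: {recall:.4f}  F1: {f1:.4f}")
```
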
---

## 📥 How to Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "your-hf-username/LexCAT-LexiLiksik-Final"  # replace with the published repo id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout

# Taglish review, roughly: "The burger is really soft, but it's really expensive."
text = "sobrang lambot ng burger pero expensive tlga"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():  # no gradients needed for inference
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=-1).item()

# Class indices: 0 = Negative, 1 = Neutral, 2 = Positive
sentiment = ["Negative", "Neutral", "Positive"][predicted_class]
print(f"Predicted Sentiment: {sentiment}")
```
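
The same inference can be written more compactly with the `transformers` pipeline API. A minimal sketch, assuming the repository's `config.json` defines `id2label` for the three classes (the model id placeholder is the same as above):

```python
from transformers import pipeline

# Builds the tokenizer + model and applies softmax internally.
classifier = pipeline(
    "text-classification",
    model="your-hf-username/LexCAT-LexiLiksik-Final",  # placeholder repo id
)
print(classifier("sobrang lambot ng burger pero expensive tlga"))
```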

---

## 📚 Citation

```bibtex
@mastersthesis{cinco2025lexcat,
  author = {Cinco, Glenn Marcus D.},
  title  = {LexCAT: A Lexicon-Based Approach for Code-Switching Analysis with Transformers Using XLM-RoBERTa and LexiLiksik},
  school = {Mapúa University},
  year   = {2025},
  month  = {July}
}
```