---
language:
- th
pipeline_tag: text-classification
tags:
- sentiment-analysis
- thai
- wangchanberta
- bilstm
- cnn
- lstm
license: apache-2.0
library_name: transformers
datasets:
- wisesight_sentiment
---

# Thai Sentiment (WangchanBERTa + LSTM Heads)

A Thai sentiment analysis model (2 classes: NEG/POS) that uses **WangchanBERTa** as the backbone and adds several LSTM and CNN-LSTM head architectures, so the variants can be compared and chosen to fit the use case.

This repo contains 4 models, each stored in its own subfolder:
- `WCB/`: WangchanBERTa (uses the [CLS] representation)
- `WCB_BiLSTM/`: WangchanBERTa → BiLSTM → Pooling
- `WCB_CNN_BiLSTM/`: WangchanBERTa → CNN → BiLSTM → Pooling
- `WCB_4Layer_BiLSTM/`: WangchanBERTa (weighted combination of the last 4 layers) → BiLSTM → Pooling
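
The "Pooling" step in the BiLSTM variants reduces a sequence of per-token vectors to one vector per text. The usage code further down defaults `pooling_after_lstm` to `"masked_mean"`, which suggests an average over real tokens only. The sketch below illustrates that idea in plain Python; it is an assumption about the general technique, not the code in `common/models.py`, and the function name `masked_mean` is hypothetical.

```python
def masked_mean(hidden_states, attention_mask):
    """Average the vectors of real tokens only; positions with mask 0 are ignored."""
    dim = len(hidden_states[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                totals[j] += value
    count = max(count, 1)  # guard against a sequence that is all padding
    return [t / count for t in totals]


# Three positions with 2-dimensional features; the last position is padding.
states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(states, mask)
print(pooled)  # [2.0, 3.0]
```

Because the padded position is masked out, its large values do not distort the pooled vector.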

Each folder contains `model.safetensors` and `config.json` (metadata: `id2label/label2id`, `max_length`, `pooling_after_lstm`, `base_model`).
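
One detail of this metadata is worth illustrating: JSON object keys are always strings, so `id2label` keys must be converted back to integers before a predicted class index can be looked up. The sketch below uses a hypothetical config dictionary; the field names come from the list above, but the values are assumptions and the real ones live in each folder's `config.json`.

```python
import json

# Hypothetical config contents, for illustration only.
example_config = {
    "id2label": {0: "negative", 1: "positive"},
    "label2id": {"negative": 0, "positive": 1},
    "max_length": 128,
    "pooling_after_lstm": "masked_mean",
    "base_model": "airesearch/wangchanberta-base-att-spm-uncased",
}

# A JSON round trip turns the integer keys into the strings "0" and "1".
loaded = json.loads(json.dumps(example_config))
print(list(loaded["id2label"].keys()))  # ['0', '1']

# Convert the keys back to int so a class index can be used directly.
id2label = {int(k): v for k, v in loaded["id2label"].items()}
print(id2label[1])  # positive
```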

---

## Evaluation summary (5-fold CV)

| Model | Accuracy (%) | F1-Score (%) | AUC (%) |
|---|---:|---:|---:|
| WCB | 90.33 ± 0.32 | 89.92 ± 0.33 | 95.72 ± 0.22 |
| WCB_BiLSTM | **90.93 ± 0.37** | **90.54 ± 0.39** | 95.57 ± 1.22 |
| WCB_CNN_BiLSTM | 90.14 ± 0.66 | 89.73 ± 0.68 | **95.83 ± 0.42** |
| WCB_4Layer_BiLSTM | 90.52 ± 0.65 | 90.13 ± 0.68 | 95.43 ± 0.36 |

**Brief observations**  
- **Highest accuracy**: `WCB_BiLSTM` has the highest Acc/F1, but its AUC varies more across folds than the other models (±1.22 vs. ±0.22 to ±0.42).  
- **Highest AUC with good stability**: `WCB_CNN_BiLSTM` (AUC 95.83% ± 0.42) is a good fit when separating classes by confidence score matters most, although its Acc/F1 are slightly lower.  
- **Fast and stable**: `WCB` is the fastest and has the smallest spread across folds, which suits resource-constrained settings.  
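
Each cell in the results table has the form mean ± spread over the 5 folds. A minimal sketch of how such a figure can be computed is shown below. The per-fold scores are hypothetical (the fold-level results are not published here), and the use of the sample standard deviation is an assumption, since the README does not say which estimator was used.

```python
from statistics import mean, stdev

# Hypothetical per-fold accuracies (%), for illustration only.
fold_accuracy = [90.1, 90.6, 90.3, 90.8, 90.2]

avg = mean(fold_accuracy)
spread = stdev(fold_accuracy)  # sample standard deviation (n - 1 denominator)
print(f"{avg:.2f} ± {spread:.2f}")  # 90.40 ± 0.29
```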

### Training time (average)

| Model | Seconds/epoch | Total time (hours) |
|---|---:|---:|
| WCB | 54.67 | 4.58 |
| WCB_BiLSTM | 67.84 | 5.68 |
| WCB_CNN_BiLSTM | 68.72 | 5.76 |
| WCB_4Layer_BiLSTM | 72.91 | 6.11 |
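
To put these timings in perspective, the short calculation below expresses each model's per-epoch time as a percentage increase over plain `WCB`. The numbers are copied from the first numeric column of the table above; the variable names are only illustrative.

```python
# Seconds per epoch, copied from the training-time table.
seconds_per_epoch = {
    "WCB": 54.67,
    "WCB_BiLSTM": 67.84,
    "WCB_CNN_BiLSTM": 68.72,
    "WCB_4Layer_BiLSTM": 72.91,
}

baseline = seconds_per_epoch["WCB"]
overhead_pct = {
    name: (secs / baseline - 1.0) * 100.0
    for name, secs in seconds_per_epoch.items()
}
for name, pct in overhead_pct.items():
    print(f"{name}: +{pct:.1f}%")
```

The LSTM-based heads come out roughly 24% to 33% slower per epoch than the plain `WCB` model.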

---

## Repository structure

```
.
├─ WCB/
│  ├─ model.safetensors
│  └─ config.json
├─ WCB_BiLSTM/
│  ├─ model.safetensors
│  └─ config.json
├─ WCB_CNN_BiLSTM/
│  ├─ model.safetensors
│  └─ config.json
├─ WCB_4Layer_BiLSTM/
│  ├─ model.safetensors
│  └─ config.json
├─ common/
│  ├─ models.py
│  └─ __init__.py
├─ requirements.txt
├─ LICENSE
└─ README.md
```

---

## Usage

### 🔧 Install dependencies

```bash
pip install torch transformers huggingface-hub safetensors
```

### 📦 Method 1: Load the model from Hugging Face Hub (recommended)

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json
import importlib.util

# ===== Settings =====
REPO_ID = "Dusit-P/thai-sentiment"  # change to your own repo if needed
MODEL_NAME = "WCB_BiLSTM"  # one of: WCB, WCB_BiLSTM, WCB_CNN_BiLSTM, WCB_4Layer_BiLSTM

# ===== 1. Download the required files =====
config_path = hf_hub_download(REPO_ID, filename=f"{MODEL_NAME}/config.json")
weights_path = hf_hub_download(REPO_ID, filename=f"{MODEL_NAME}/model.safetensors")
models_py = hf_hub_download(REPO_ID, filename="common/models.py")

# ===== 2. Load the config =====
with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

# ===== 3. Load the tokenizer =====
base_model = config.get("base_model", "airesearch/wangchanberta-base-att-spm-uncased")
tokenizer = AutoTokenizer.from_pretrained(base_model)

# ===== 4. Load the model =====
# Import the downloaded common/models.py as a module
spec = importlib.util.spec_from_file_location("models", models_py)
models = importlib.util.module_from_spec(spec)
spec.loader.exec_module(models)

# Build the model
architecture = config.get("architecture", MODEL_NAME)
num_labels = config.get("num_labels", 2)
pooling = config.get("pooling_after_lstm", "masked_mean")

model = models._build(architecture, base_model, num_labels, pooling)

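# Load the weights (strict=False skips mismatched keys without raising an error)
state_dict = load_file(weights_path)
model.load_state_dict(state_dict, strict=False)
model.eval()

# ===== 5. Predict =====
# English gloss: "This phone is great, good value for the price, highly recommended!"
text = "มือถือรุ่นนี้ดีมาก ราคาคุ้มค่า แนะนำเลย!"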

# Tokenize
inputs = tokenizer(
    text,
    truncation=True,
    padding=True,
    max_length=config.get("max_length", 128),
    return_tensors="pt"
)

# Predict
with torch.no_grad():
    logits = model(inputs["input_ids"], inputs["attention_mask"])
    probs = F.softmax(logits, dim=1)[0]
    pred_id = torch.argmax(logits, dim=1).item()

# Show the result
id2label = {int(k): v for k, v in config["id2label"].items()}
print(f"Text: {text}")
print(f"Prediction: {id2label[pred_id]}")
print(f"Probabilities: NEG={probs[0]:.4f}, POS={probs[1]:.4f}")
```

**Example output:**
```
Text: มือถือรุ่นนี้ดีมาก ราคาคุ้มค่า แนะนำเลย!
Prediction: positive
Probabilities: NEG=0.0234, POS=0.9766
```
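
The prediction step above applies softmax to the logits and then takes the argmax. As a rough illustration of what those two calls do, the sketch below reimplements them in plain Python for a single text. The logit values are hypothetical, and the `[NEG, POS]` ordering is taken from the print statement in the example.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one text, ordered [NEG, POS].
logits = [-1.5, 2.0]
probs = softmax(logits)
pred_id = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability

print(pred_id)                       # 1
print([round(p, 4) for p in probs])  # [0.0293, 0.9707]
```

Softmax preserves the ordering of the logits, so taking the argmax of the logits or of the probabilities gives the same class.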

---

### 📦 Method 2: Clone the repo and use it locally

```bash
git clone https://huggingface.co/Dusit-P/thai-sentiment
cd thai-sentiment
pip install -r requirements.txt
```

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer
from safetensors.torch import load_file
from common.models import _build
import json

# ===== Choose a model =====
MODEL_DIR = "WCB_BiLSTM"

# ===== Load the config =====
with open(f"{MODEL_DIR}/config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

# ===== Load the tokenizer =====
base_model = config.get("base_model", "airesearch/wangchanberta-base-att-spm-uncased")
tokenizer = AutoTokenizer.from_pretrained(base_model)

# ===== Load the model =====
model = _build(
    config.get("architecture", MODEL_DIR),
    base_model,
    config.get("num_labels", 2),
    config.get("pooling_after_lstm", "masked_mean")
)

state_dict = load_file(f"{MODEL_DIR}/model.safetensors")
model.load_state_dict(state_dict, strict=False)
model.eval()

# ===== Predict =====
# English gloss: "Too expensive; the quality is not worth the price"
text = "ของแพงไป คุณภาพไม่คุ้มราคา"

inputs = tokenizer(
    text,
    truncation=True,
    padding=True,
    max_length=config.get("max_length", 128),
    return_tensors="pt"
)

with torch.no_grad():
    logits = model(inputs["input_ids"], inputs["attention_mask"])
    probs = F.softmax(logits, dim=1)[0]
    pred_id = torch.argmax(logits, dim=1).item()

id2label = {int(k): v for k, v in config["id2label"].items()}
print(f"Prediction: {id2label[pred_id]}")
print(f"Probabilities: {probs}")
```
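
The loading snippets call `load_state_dict(..., strict=False)`, which silently skips any parameter names that do not match between the checkpoint and the model. A quick sanity check is to compare the two sets of names. The sketch below shows the comparison on plain lists; the helper name `compare_keys` and the parameter names are hypothetical.

```python
def compare_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists for a model/checkpoint pair."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)     # in the model, absent from the file
    unexpected = sorted(checkpoint_keys - model_keys)  # in the file, unused by the model
    return missing, unexpected


# Hypothetical parameter names, for illustration only.
model_keys = ["bert.embeddings.weight", "lstm.weight_ih_l0", "classifier.weight"]
checkpoint_keys = ["bert.embeddings.weight", "lstm.weight_ih_l0", "fc.weight"]

missing, unexpected = compare_keys(model_keys, checkpoint_keys)
print(missing)     # ['classifier.weight']
print(unexpected)  # ['fc.weight']
```

With the real objects, `model.state_dict().keys()` and `state_dict.keys()` could be passed in. PyTorch's `load_state_dict` also returns an object with `missing_keys` and `unexpected_keys` fields that carries the same information.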

---

### 📦 Method 3: Predict multiple texts at once (batch prediction)

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json
import importlib.util

# ===== Setup =====
REPO_ID = "Dusit-P/thai-sentiment"
MODEL_NAME = "WCB_BiLSTM"

# ===== Load the model (same steps as Method 1) =====
config_path = hf_hub_download(REPO_ID, filename=f"{MODEL_NAME}/config.json")
weights_path = hf_hub_download(REPO_ID, filename=f"{MODEL_NAME}/model.safetensors")
models_py = hf_hub_download(REPO_ID, filename="common/models.py")

with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

base_model = config.get("base_model", "airesearch/wangchanberta-base-att-spm-uncased")
tokenizer = AutoTokenizer.from_pretrained(base_model)

spec = importlib.util.spec_from_file_location("models", models_py)
models = importlib.util.module_from_spec(spec)
spec.loader.exec_module(models)

model = models._build(
    config.get("architecture", MODEL_NAME),
    base_model,
    config.get("num_labels", 2),
    config.get("pooling_after_lstm", "masked_mean")
)

state_dict = load_file(weights_path)
model.load_state_dict(state_dict, strict=False)
model.eval()

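# ===== Predict multiple texts =====
texts = [
    "อาหารอร่อยมาก บริการดีมาก",  # "The food is delicious and the service is great"
    "ของแพงไป รสชาติก็ธรรมดา",  # "Too expensive, and the taste is just ordinary"
    "บรรยากาศดี แต่รอนานไป",  # "Nice atmosphere, but the wait is too long"
    "คุ้มค่ามาก แนะนำเลย"  # "Great value, recommended"
]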

# Tokenize batch
inputs = tokenizer(
    texts,
    truncation=True,
    padding=True,
    max_length=config.get("max_length", 128),
    return_tensors="pt"
)

# Predict batch
with torch.no_grad():
    logits = model(inputs["input_ids"], inputs["attention_mask"])
    probs = F.softmax(logits, dim=1)
    pred_ids = torch.argmax(logits, dim=1)

# Show the results
id2label = {int(k): v for k, v in config["id2label"].items()}

print("=" * 70)
for i, text in enumerate(texts):
    label = id2label[pred_ids[i].item()]
    neg_prob = probs[i][0].item()
    pos_prob = probs[i][1].item()
    
    print(f"Text: {text}")
    print(f"  → Prediction: {label}")
    print(f"  → Confidence: NEG={neg_prob:.4f}, POS={pos_prob:.4f}")
    print("-" * 70)
```

**Example output:**
```
======================================================================
Text: อาหารอร่อยมาก บริการดีมาก
  → Prediction: positive
  → Confidence: NEG=0.0156, POS=0.9844
----------------------------------------------------------------------
Text: ของแพงไป รสชาติก็ธรรมดา
  → Prediction: negative
  → Confidence: NEG=0.8923, POS=0.1077
----------------------------------------------------------------------
Text: บรรยากาศดี แต่รอนานไป
  → Prediction: positive
  → Confidence: NEG=0.3421, POS=0.6579
----------------------------------------------------------------------
Text: คุ้มค่ามาก แนะนำเลย
  → Prediction: positive
  → Confidence: NEG=0.0089, POS=0.9911
----------------------------------------------------------------------
```
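
The batch example sends all texts through the model in a single call, which is fine for a handful of inputs. For longer lists, one common approach is to split the texts into fixed-size chunks and run each chunk separately. The helper below is a minimal sketch of that chunking step; the name `chunked` and the batch size of 2 are assumptions for illustration.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = ["review 1", "review 2", "review 3", "review 4", "review 5"]
batches = list(chunked(texts, batch_size=2))
print(batches)
# [['review 1', 'review 2'], ['review 3', 'review 4'], ['review 5']]
```

Each chunk can then be tokenized and passed to the model as in the batch example, with the per-chunk predictions collected into one list.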

---

## 🎯 Choosing the right model for the job

- **Highest accuracy** → `WCB_BiLSTM`  
  Highest Acc/F1 (90.93% / 90.54%)

- **Limited resources / speed matters** → `WCB`  
  Fastest (54.67 s/epoch) and the smallest spread across folds

- **Focus on AUC / risk ranking** → `WCB_CNN_BiLSTM`  
  Highest AUC (95.83% ± 0.42)

- **Overall balance** → `WCB_4Layer_BiLSTM`  
  Second-highest Acc/F1 (90.52% / 90.13%), at the cost of the longest training time (72.91 s/epoch)
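
The recommendations above can be reproduced directly from the reported numbers. The sketch below copies the mean scores and per-epoch times from the evaluation and training-time tables and picks the best model for a given criterion; the helper name `pick` and the dictionary layout are hypothetical.

```python
# Mean scores (%) and training speed copied from the tables in this README.
results = {
    "WCB":               {"acc": 90.33, "f1": 89.92, "auc": 95.72, "sec_per_epoch": 54.67},
    "WCB_BiLSTM":        {"acc": 90.93, "f1": 90.54, "auc": 95.57, "sec_per_epoch": 67.84},
    "WCB_CNN_BiLSTM":    {"acc": 90.14, "f1": 89.73, "auc": 95.83, "sec_per_epoch": 68.72},
    "WCB_4Layer_BiLSTM": {"acc": 90.52, "f1": 90.13, "auc": 95.43, "sec_per_epoch": 72.91},
}


def pick(metric, higher_is_better=True):
    """Return the model name with the best mean value for the given metric."""
    chooser = max if higher_is_better else min
    return chooser(results, key=lambda name: results[name][metric])


print(pick("acc"))                                    # WCB_BiLSTM
print(pick("auc"))                                    # WCB_CNN_BiLSTM
print(pick("sec_per_epoch", higher_is_better=False))  # WCB
```

This comparison looks only at mean values; the fold-to-fold spread in the evaluation table is worth weighing as well when the means are close.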

---

## 🚀 Demo Application

Try the models through the Gradio demo:
https://huggingface.co/spaces/Dusit-P/Thai-Sentiment-GUI

---

## 📄 License

Apache-2.0

---

## 🙏 Acknowledgments

- **WangchanBERTa**: airesearch/wangchanberta-base-att-spm-uncased
- **Dataset**: wisesight_sentiment

---

## 📧 Contact

For questions or suggestions, please get in touch via GitHub Issues.