---
language: [th]
library_name: transformers
pipeline_tag: text-classification
tags:
  - thai
  - sentiment-analysis
  - text-classification
  - wangchanberta 
  - bilstm
  - cnn
  - gradio
  - space
base_model: airesearch/wangchanberta-base-att-spm-uncased
license: apache-2.0   # change as needed
datasets:
  - wisesight/wisesight-sentiment
---

# Thai Sentiment (WangchanBERTa + LSTM/CNN/Last4 Heads)

> A 2-class (negative/positive) **Thai** sentiment analysis model built on **WangchanBERTa**, released with several architectures (heads) for flexibility in real-world use.

- `cnn_bilstm`: WangchanBERTa → Conv1d → BiLSTM (**primary model**: best accuracy and F1 on the held-out test set)
- `baseline`: WangchanBERTa → BiLSTM (lightweight, basic variant)
- `last4weighted_bilstm`: weighted combination of the **last 4 hidden layers** + (Bi)LSTM (highest mean **CV** scores)

> Demo (Space): <https://huggingface.co/spaces/Dusit-P/thai-sentiment-api>

---

## Table of Contents
- [Model Details](#model-details)
- [Dataset and Preprocessing](#dataset-and-preprocessing)
- [Evaluation and Results](#evaluation-and-results)
- [Intended Use](#intended-use)
- [Limitations and Caveats](#limitations-and-caveats)
- [Repository Structure](#repository-structure)
- [Quickstart (Python)](#quickstart-python)
- [Space / REST API](#space--rest-api)
- [Reproducibility (summary)](#reproducibility-summary)
- [License & Attribution](#license--attribution)
- [Citation](#citation)
- [Changelog](#changelog)

---

## Model Details

> **Base model**: `airesearch/wangchanberta-base-att-spm-uncased`  
> **Labels**: `0 → negative`, `1 → positive` (decided by `argmax` or `positive ≥ negative`)

- **Baseline**  
  Feeds the BERT sequence output into BiLSTM → Linear (simple, lightweight)
- **CNN-BiLSTM**  
  BERT → Conv1d (kernel 3 & 5) → BiLSTM → Linear (extracts short-range patterns before the LSTM)
- **Last4Weighted (BiLSTM)**  
  Combines the last 4 hidden layers of BERT with learnable weights → (Bi)LSTM → Linear
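
To make the `CNN-BiLSTM` description above more concrete, here is a minimal PyTorch sketch of such a head. It is not the code in `common/models.py`: the class name `CnnBiLstmHead`, the layer sizes, the parallel arrangement of the two convolutions, and the masked mean pooling are all assumptions made for illustration. A random tensor stands in for the WangchanBERTa sequence output.

```python
import torch
import torch.nn as nn


class CnnBiLstmHead(nn.Module):
    """Hypothetical head: Conv1d (kernel 3 and 5) -> BiLSTM -> Linear."""

    def __init__(self, hidden_size=768, conv_channels=64, lstm_hidden=64, num_labels=2):
        super().__init__()
        # Padding keeps the sequence length unchanged for both kernel sizes.
        self.conv3 = nn.Conv1d(hidden_size, conv_channels, kernel_size=3, padding=1)
        self.conv5 = nn.Conv1d(hidden_size, conv_channels, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(2 * conv_channels, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_labels)

    def forward(self, sequence_output, attention_mask):
        x = sequence_output.transpose(1, 2)              # (batch, hidden, tokens)
        x = torch.cat([torch.relu(self.conv3(x)),
                       torch.relu(self.conv5(x))], dim=1)  # (batch, 2*channels, tokens)
        out, _ = self.lstm(x.transpose(1, 2))            # (batch, tokens, 2*lstm_hidden)
        # Assumed pooling: mean over non-padding tokens.
        mask = attention_mask.unsqueeze(-1).to(out.dtype)
        pooled = (out * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.classifier(pooled)                   # (batch, num_labels)


torch.manual_seed(0)
head = CnnBiLstmHead()
fake_bert_output = torch.randn(2, 16, 768)   # stand-in for the encoder output
fake_mask = torch.ones(2, 16, dtype=torch.long)
logits = head(fake_bert_output, fake_mask)
print(tuple(logits.shape))
```

The released heads should be loaded through the factory in `common/models.py`; this sketch only illustrates the data flow.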

---

## Dataset and Preprocessing

> Uses the **Wisesight Sentiment (Thai)** dataset, **keeping only 2 classes** (*positive*, *negative*); the **neutral and question classes are not used**.

- **Size after filtering**: **11,118** texts  
  - Positive: **4,481**, Negative: **6,637**
- **Data split**:  
  - Train/Val: **80%** (with **5-fold cross-validation** on this portion)  
  - Test: **20%** (separate held-out test set)
- **Text preparation**: WangchanBERTa tokenizer, `max_len = 128`

> Please check the upstream Wisesight license before commercial use or redistribution.
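
The 80/20 split followed by 5-fold cross-validation can be sketched with the standard library alone. This is an illustration under assumptions, not the author's training script: the helper names are hypothetical, and stratifying the 80/20 split (in addition to the folds) is assumed here. The label list only reproduces the class counts given above.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_frac=0.2, seed=42):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


def stratified_kfold(labels, indices, k=5, seed=42):
    """Yield (train_idx, val_idx) pairs for k stratified folds over `indices`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)   # deal each class round-robin into folds
    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, val


# Class counts from the dataset summary: 4,481 positive (1), 6,637 negative (0).
labels = [1] * 4481 + [0] * 6637
train_idx, test_idx = stratified_split(labels)
folds = list(stratified_kfold(labels, train_idx))
print(len(train_idx), len(test_idx), [len(val) for _, val in folds])
```

In practice a library such as scikit-learn would normally handle this; the sketch only shows what "stratified" means for these class counts.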

---

## Evaluation and Results

- **Metrics**: Accuracy, Macro-F1, ROC-AUC  
- **Procedure**: 5-fold CV + final test (held-out test set)

| Model | CV Accuracy (%) | CV F1 (%) | CV ROC-AUC (%) | Test Accuracy (%) | Test F1 (%) | Test ROC-AUC (%) |
|---|---:|---:|---:|---:|---:|---:|
| **Model1_Baseline** | 90.36 ± 1.07 | 89.99 ± 1.10 | 95.67 ± 0.59 | 90.15 | 89.71 | 95.69 |
| **Model2_CNN_BiLSTM** | 90.32 ± 0.56 | 89.95 ± 0.56 | 95.92 ± 0.28 | **90.29** | **89.88** | 95.76 |
| **Model3_Last4Weighted (Pure/BiLSTM)** | **90.80 ± 0.70** | **90.42 ± 0.75** | **96.19 ± 0.27** | 90.11 | 89.68 | **95.78** |
| Model4_Middle4Mean | 90.51 ± 0.67 | 90.11 ± 0.68 | 95.78 ± 0.43 | 90.20 | 89.76 | 95.55 |

> Summary: the architectures **differ only slightly (by less than about 1%)**  
> `cnn_bilstm` is used as the **primary production model**; `last4weighted_bilstm` is offered as an option for specific cases or for comparison.
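
Because the two classes are imbalanced (4,481 positive vs 6,637 negative), Macro-F1 is reported alongside accuracy: it averages the per-class F1 scores so that each class counts equally. The following is a small worked sketch of both metrics on toy labels; the function names and the toy data are illustrative and are not taken from the evaluation code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: 0 = negative, 1 = positive.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))   # 3 of 5 correct
print(macro_f1(y_true, y_pred))   # mean of F1(neg) = 2/3 and F1(pos) = 1/2
```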

---

## Intended Use

- Sentiment analysis of Thai reviews and comments with 2 classes (positive/negative)
- The demo Space supports 3 modes: **Single**, **Batch (multiple lines)**, and **CSV**  
  - CSV: if a `review` column is found it is used directly (otherwise the first object-dtype column is guessed)  
  - If a `shop` column is present, results are summarized per shop and a summary chart is shown
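
The CSV behaviour described above can be approximated as follows. This is a standard-library sketch, not the Space's `app.py`: the helper names are hypothetical, "first non-numeric column" stands in for the "first object column" fallback, and `label_fn` is a toy keyword rule used in place of the real model.

```python
import csv
import io
from collections import defaultdict


def _is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def pick_text_column(rows, fieldnames):
    """Prefer `review`; otherwise fall back to the first non-numeric column."""
    if "review" in fieldnames:
        return "review"
    for name in fieldnames:
        if any(not _is_number(r[name]) for r in rows if r[name]):
            return name
    raise ValueError("no text column found")


def summarize_by_shop(rows, text_col, label_fn):
    """Count predicted labels per value of the `shop` column."""
    counts = defaultdict(lambda: {"positive": 0, "negative": 0})
    for r in rows:
        counts[r["shop"]][label_fn(r[text_col])] += 1
    return dict(counts)


def label_fn(text):
    # Toy stand-in for the model: a real pipeline would call the classifier.
    return "positive" if "good" in text else "negative"


sample = "shop,review\nA,good food\nA,slow service\nB,very good\n"
reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)
text_col = pick_text_column(rows, reader.fieldnames)
summary = summarize_by_shop(rows, text_col, label_fn)
print(text_col, summary)
```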

---

## Limitations and Caveats

> Consider using the model together with rules or a human review process.

- Colloquial language, group-specific slang, and sarcasm or irony may lead to wrong predictions  
- Text outside the training domain (e.g. specialized academic writing, heavily English code-mixed text) may reduce accuracy  
- **Probabilities** are statistical estimates, not ground truth  
- Remove personally identifiable information (PII) before sending text to the public demo
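
One simple way to combine the model with human review is to flag low-confidence predictions. The sketch below assumes the probability dictionary has the `negative`/`positive` keys used elsewhere in this card; the function name `route_prediction` and the `0.8` threshold are hypothetical and would need tuning on real data.

```python
def route_prediction(probs, min_confidence=0.8):
    """Attach a label and flag predictions that a human should double-check."""
    label = "positive" if probs["positive"] >= probs["negative"] else "negative"
    confidence = max(probs["positive"], probs["negative"])
    return {
        "label": label,
        "confidence": confidence,
        "needs_review": confidence < min_confidence,  # hypothetical threshold
    }


confident = route_prediction({"negative": 0.12, "positive": 0.88})
borderline = route_prediction({"negative": 0.45, "positive": 0.55})
print(confident)
print(borderline)
```

A high probability is still only an estimate, so sarcastic or out-of-domain text can be confidently wrong; a threshold reduces the review load but does not replace spot checks.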

---

## Repository Structure
```markdown
common/models.py
baseline/
├─ config.json
└─ model.safetensors
cnn_bilstm/
├─ config.json
└─ model.safetensors
last4weighted_bilstm/
├─ config.json
└─ model.safetensors
requirements.txt
LICENSE
```
---


## Quickstart (Python)

> Requires: `torch`, `transformers`, `safetensors`, `sentencepiece`, `huggingface_hub`

```bash
pip install -U torch transformers safetensors sentencepiece huggingface_hub
```

```python
import importlib.util
import json

import torch
import torch.nn.functional as F
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import AutoTokenizer

REPO_ID = "Dusit-P/thai-sentiment-wcb"
# Pick one: "cnn_bilstm" | "baseline" | "last4weighted_bilstm"
MODEL_DIR = "cnn_bilstm"

# Load the architecture definitions (model factory) from the repo
models_py = hf_hub_download(REPO_ID, filename="common/models.py")
spec = importlib.util.spec_from_file_location("models", models_py)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)

# Load the config and weights for the chosen head
cfg_path = hf_hub_download(REPO_ID, filename=f"{MODEL_DIR}/config.json")
w_path = hf_hub_download(REPO_ID, filename=f"{MODEL_DIR}/model.safetensors")
with open(cfg_path, "r", encoding="utf-8") as f:
    cfg = json.load(f)

tok = AutoTokenizer.from_pretrained(cfg["base_model"])
model = mod.create_model_by_name(cfg["arch"])
model.load_state_dict(load_file(w_path), strict=True)
model.eval()


def classify(text: str):
    """Return (probabilities, label) for a single Thai text."""
    enc = tok([text], padding=True, truncation=True,
              max_length=cfg["max_len"], return_tensors="pt")
    with torch.no_grad():
        logits = model(enc["input_ids"], enc["attention_mask"])
        p = F.softmax(logits, dim=1)[0].tolist()
    probs = {"negative": float(p[0]), "positive": float(p[1])}
    label = "positive" if probs["positive"] >= probs["negative"] else "negative"
    return probs, label


print(classify("บริการดีมาก ประทับใจ"))  # "Very good service, impressed"
```

---

## Space / REST API

Base URL: `https://<YOUR_SPACE_URL>`

The API below refers to the functions in the Space's `app.py`: `predict_one`, `predict_many`, `predict_csv`.
If you rename the functions or routes, adjust the URLs accordingly.

### 1) Predict a single text

`POST /run/predict_one`

Body (JSON):

```json
{
  "data": ["อาหารอร่อยมาก บริการดี", "cnn_bilstm"]
}
```

Example response:

```json
{
  "data": [
    {"negative": 0.12, "positive": 0.88},
    "positive"
  ]
}
```

Example curl:

```bash
curl -X POST "https://<YOUR_SPACE_URL>/run/predict_one" \
  -H "content-type: application/json" \
  -d '{"data":["อาหารอร่อยมาก บริการดี","cnn_bilstm"]}'
```

### 2) Predict multiple texts (one per line)

`POST /run/predict_many`

Body (JSON):

```json
{
  "data": ["แย่มาก รอนานมาก\nอร่อย บริการไว", "cnn_bilstm"]
}
```

Example curl:

```bash
curl -X POST "https://<YOUR_SPACE_URL>/run/predict_many" \
  -H "content-type: application/json" \
  -d '{"data":["แย่มาก รอนานมาก\nอร่อย บริการไว","cnn_bilstm"]}'
```

### 3) Upload a CSV

`POST /run/predict_csv` (`multipart/form-data`)

Fields:

- `file`: CSV file (expects a `review` column; if a `shop` column is present, a per-shop summary is shown)
- `model_choice`: `cnn_bilstm` | `baseline` | `last4weighted_bilstm`

Example curl:

```bash
curl -X POST "https://<YOUR_SPACE_URL>/run/predict_csv" \
  -F "file=@/path/to/reviews.csv" \
  -F "model_choice=cnn_bilstm"
```

> Some Gradio versions show a “View API” button on the Space page for checking the latest schema and endpoints automatically.
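
For callers who prefer Python over curl, the `predict_one` request can be built with the standard library. This sketch assumes the route and the `{"data": [...]}` payload shape shown above; the helper names are hypothetical, `https://example.invalid` is a placeholder for your Space URL, and the actual network call is left commented out.

```python
import json
import urllib.request


def build_predict_one_request(base_url, text, model_choice="cnn_bilstm"):
    """Build a POST request for /run/predict_one (route assumed from this card)."""
    payload = json.dumps({"data": [text, model_choice]}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/run/predict_one",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_predict_one_response(body):
    """Split the response body into (probabilities, label)."""
    probs, label = json.loads(body)["data"]
    return probs, label


# Placeholder base URL; replace with the real Space URL before sending.
req = build_predict_one_request("https://example.invalid/", "อาหารอร่อยมาก บริการดี")

# To send it:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     probs, label = parse_predict_one_response(resp.read())

# Parsing the example response documented above:
example_body = '{"data": [{"negative": 0.12, "positive": 0.88}, "positive"]}'
probs, label = parse_predict_one_response(example_body)
print(req.full_url, label, probs)
```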

---

## Reproducibility (summary)

- Base: `airesearch/wangchanberta-base-att-spm-uncased`
- `max_len=128`, batch size 16, optimizer: AdamW (`lr_bert=2e-5`, `lr_others=1e-3`), early stopping
- 5-fold stratified CV, seed 42
- Main libraries: `torch`, `transformers`, `safetensors`, `sentencepiece`

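
The two learning rates (`lr_bert` for the encoder, `lr_others` for the head) map naturally onto optimizer parameter groups. The helper below is a sketch: the function name and the `bert.` name prefix are assumptions, since the real parameter names depend on `common/models.py`. Plain strings stand in for tensors so the grouping logic can be checked without loading a model.

```python
def build_param_groups(named_params, bert_prefix="bert.",
                       lr_bert=2e-5, lr_others=1e-3):
    """Split (name, param) pairs into two groups with different learning rates."""
    bert_params, other_params = [], []
    for name, param in named_params:
        if name.startswith(bert_prefix):   # assumed naming of the encoder weights
            bert_params.append(param)
        else:
            other_params.append(param)
    return [
        {"params": bert_params, "lr": lr_bert},
        {"params": other_params, "lr": lr_others},
    ]


# Stand-in names and values; with a real model use model.named_parameters().
named = [
    ("bert.encoder.layer.0.weight", "w0"),
    ("lstm.weight_ih_l0", "w1"),
    ("classifier.bias", "w2"),
]
groups = build_param_groups(named)
print(groups)

# With PyTorch, the groups can be passed straight to the optimizer:
# optimizer = torch.optim.AdamW(build_param_groups(model.named_parameters()))
```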
---

## License & Attribution

> Model license: MIT (can be changed as needed)  
> Dataset: Wisesight Sentiment. Please cite the upstream dataset and comply with its license.

---

## Citation

> Dusit P. (2025). Thai Sentiment WCB (WangchanBERTa + LSTM/CNN/Last4 heads).  
> Hugging Face: Dusit-P/thai-sentiment-wcb.  
> Demo: `https://<YOUR_SPACE_URL>`.

---

## Changelog

- v1.0.0: released `cnn_bilstm`, `baseline`, and `last4weighted_bilstm`; added the Space (UI/REST)