---
language:
- th
pipeline_tag: text-classification
tags:
- sentiment-analysis
- thai
- wangchanberta
- bilstm
- cnn
- lstm
license: apache-2.0
library_name: transformers
datasets:
- wisesight_sentiment
---

# Thai Sentiment (WangchanBERTa + LSTM Heads)

A sentiment analysis model for Thai text (2 classes: NEG/POS). It uses WangchanBERTa as the backbone and adds several LSTM and CNN-LSTM head architectures, so you can compare them and pick the one that fits your use case.

This repo contains 4 models, each stored in its own subfolder:
- `WCB/`: WangchanBERTa (uses [CLS])
- `WCB_BiLSTM/`: WangchanBERTa → BiLSTM → Pooling
- `WCB_CNN_BiLSTM/`: WangchanBERTa → CNN → BiLSTM → Pooling
- `WCB_4Layer_BiLSTM/`: WangchanBERTa (weighted combination of the last 4 layers) → BiLSTM → Pooling

Each folder contains `model.safetensors` and `config.json` (metadata: `id2label/label2id`, `max_length`, `pooling_after_lstm`, `base_model`).
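The loading code in this README falls back to `"masked_mean"` for `pooling_after_lstm`. The actual implementation lives in `common/models.py` and is not reproduced here, so the following is only a minimal sketch of what masked mean pooling usually means: average the sequence outputs over real tokens and ignore padding. The function name `masked_mean` and the toy tensors are assumptions for illustration.

```python
import torch


def masked_mean(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average hidden states over non-padding positions.

    hidden:         (batch, seq_len, dim) sequence outputs, e.g. from a BiLSTM
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                   # padded positions add 0
    counts = mask.sum(dim=1).clamp(min=1.0)               # guard against empty rows
    return summed / counts


# Toy input: two real tokens followed by one padding position.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = torch.tensor([[1, 1, 0]])

pooled = masked_mean(hidden, attention_mask)
print(pooled)  # tensor([[2., 3.]])
```

The padding vector `[100, 100]` has no effect on the result, which is the point of masking before averaging.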

---

## Evaluation Summary (5-fold CV)

| Model | Accuracy (%) | F1-Score (%) | AUC (%) |
|---|---:|---:|---:|
| WCB | 90.33 ± 0.32 | 89.92 ± 0.33 | 95.72 ± 0.22 |
| WCB_BiLSTM | 90.93 ± 0.37 | 90.54 ± 0.39 | 95.57 ± 1.22 |
| WCB_CNN_BiLSTM | 90.14 ± 0.66 | 89.73 ± 0.68 | 95.83 ± 0.42 |
| WCB_4Layer_BiLSTM | 90.52 ± 0.65 | 90.13 ± 0.68 | 95.43 ± 0.36 |

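Key observations
- Highest accuracy: `WCB_BiLSTM` has the best Acc/F1, but its AUC varies more across folds than any other model (±1.22).
- Highest AUC, fairly stable: `WCB_CNN_BiLSTM` (AUC 95.83 ± 0.42). A good fit when separating classes by confidence score matters most, although its Acc/F1 are slightly lower (the lowest of the four).
- Fast and stable: `WCB` is the fastest to train and has the smallest spread on every metric, which suits resource-constrained settings.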
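For readers who want to report their own runs in the same `mean ± spread` style, here is a small sketch using only the standard library. It assumes the `±` values are standard deviations across the 5 folds, which this README does not state explicitly, and the per-fold numbers below are hypothetical, not the real fold results.

```python
from statistics import mean, stdev

# Hypothetical per-fold accuracies (%), for illustration only.
fold_acc = [90.1, 90.6, 90.2, 90.7, 90.4]

avg = mean(fold_acc)
spread = stdev(fold_acc)  # sample standard deviation (n - 1); pstdev gives the population version

summary = f"{avg:.2f} ± {spread:.2f}"
print(summary)  # 90.40 ± 0.25
```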

### Training Time (average)

| Model | Seconds/epoch | Total time (h) |
|---|---:|---:|
| WCB | 54.67 | 4.58 |
| WCB_BiLSTM | 67.84 | 5.68 |
| WCB_CNN_BiLSTM | 68.72 | 5.76 |
| WCB_4Layer_BiLSTM | 72.91 | 6.11 |
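As a quick sanity check on the table above, the two columns can be related by simple arithmetic. The sketch below assumes that total time is just seconds per epoch multiplied by the total number of epochs run (summed over all folds). The README does not confirm this, so the result is an inference from the numbers, not a documented training setting.

```python
# (seconds per epoch, total hours) copied from the training-time table in this README.
timings = {
    "WCB": (54.67, 4.58),
    "WCB_BiLSTM": (67.84, 5.68),
    "WCB_CNN_BiLSTM": (68.72, 5.76),
    "WCB_4Layer_BiLSTM": (72.91, 6.11),
}

# Assumption: total time = seconds/epoch x total number of epochs.
implied_epochs = {
    name: hours * 3600 / sec_per_epoch
    for name, (sec_per_epoch, hours) in timings.items()
}

for name, epochs in implied_epochs.items():
    print(f"{name}: about {epochs:.0f} epochs")
```

Under that assumption all four models come out at roughly 300 epochs in total, which suggests the time differences come from the per-epoch cost of the heads and not from longer training schedules.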

---

## Repository Structure

```
.
├─ WCB/
│ ├─ model.safetensors
│ └─ config.json
├─ WCB_BiLSTM/
│ ├─ model.safetensors
│ └─ config.json
├─ WCB_CNN_BiLSTM/
│ ├─ model.safetensors
│ └─ config.json
├─ WCB_4Layer_BiLSTM/
│ ├─ model.safetensors
│ └─ config.json
├─ common/
│ ├─ models.py
│ └─ __init__.py
├─ requirements.txt
├─ LICENSE
└─ README.md
```

---

## Usage

### 🔧 Install Dependencies

```bash
pip install torch transformers huggingface-hub safetensors
```

### 📦 Method 1: Load the Model from the Hugging Face Hub (recommended)

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json
import importlib.util

# ===== Settings =====
REPO_ID = "Dusit-P/thai-sentiment"  # change to your own repo if needed
MODEL_NAME = "WCB_BiLSTM"  # one of: WCB, WCB_BiLSTM, WCB_CNN_BiLSTM, WCB_4Layer_BiLSTM

# ===== 1. Download the required files =====
config_path = hf_hub_download(REPO_ID, filename=f"{MODEL_NAME}/config.json")
weights_path = hf_hub_download(REPO_ID, filename=f"{MODEL_NAME}/model.safetensors")
models_py = hf_hub_download(REPO_ID, filename="common/models.py")

# ===== 2. Load the config =====
with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

# ===== 3. Load the tokenizer =====
base_model = config.get("base_model", "airesearch/wangchanberta-base-att-spm-uncased")
tokenizer = AutoTokenizer.from_pretrained(base_model)

# ===== 4. Load the model =====
# Import common/models.py from the downloaded file
spec = importlib.util.spec_from_file_location("models", models_py)
models = importlib.util.module_from_spec(spec)
spec.loader.exec_module(models)

# Build the model
architecture = config.get("architecture", MODEL_NAME)
num_labels = config.get("num_labels", 2)
pooling = config.get("pooling_after_lstm", "masked_mean")

model = models._build(architecture, base_model, num_labels, pooling)

# Load the weights (strict=False does not raise on missing or unexpected keys)
state_dict = load_file(weights_path)
model.load_state_dict(state_dict, strict=False)
model.eval()

# ===== 5. Predict =====
# "This phone is great, good value for the price, recommended!"
text = "มือถือรุ่นนี้ดีมาก ราคาคุ้มค่า แนะนำเลย!"

# Tokenize
inputs = tokenizer(
    text,
    truncation=True,
    padding=True,
    max_length=config.get("max_length", 128),
    return_tensors="pt"
)

# Predict
with torch.no_grad():
    logits = model(inputs["input_ids"], inputs["attention_mask"])
    probs = F.softmax(logits, dim=1)[0]
    pred_id = torch.argmax(logits, dim=1).item()

# Show the result (the NEG/POS printout assumes index 0 = negative, index 1 = positive)
id2label = {int(k): v for k, v in config["id2label"].items()}
print(f"Text: {text}")
print(f"Prediction: {id2label[pred_id]}")
print(f"Probabilities: NEG={probs[0]:.4f}, POS={probs[1]:.4f}")
```

Example output:
```
Text: มือถือรุ่นนี้ดีมาก ราคาคุ้มค่า แนะนำเลย!
Prediction: positive
Probabilities: NEG=0.0234, POS=0.9766
```
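The loading code above passes `strict=False` to `load_state_dict`, which means a mismatch between the checkpoint and the model does not raise an error. In PyTorch the call returns an object with `missing_keys` and `unexpected_keys`, so one cautious pattern is to inspect it after loading. The sketch below uses a toy `nn.Linear` and a deliberately incomplete state dict as stand-ins; it is not the real model from this repo.

```python
import torch
import torch.nn as nn

# Toy stand-in for the real model.
model = nn.Linear(4, 2)

# Deliberately mismatched checkpoint: "bias" is absent and "extra" is not a model parameter.
state_dict = {
    "weight": model.weight.detach().clone(),
    "extra": torch.zeros(1),
}

result = model.load_state_dict(state_dict, strict=False)
print("missing:", result.missing_keys)        # ['bias']
print("unexpected:", result.unexpected_keys)  # ['extra']

if result.missing_keys or result.unexpected_keys:
    print("Warning: checkpoint and model do not fully match")
```

If both lists are empty, the weights were loaded exactly as `strict=True` would have required.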

---

### 📦 Method 2: Clone the Repo and Run Locally

```bash
git clone https://huggingface.co/Dusit-P/thai-sentiment
cd thai-sentiment
pip install -r requirements.txt
```

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer
from safetensors.torch import load_file
from common.models import _build
import json

# ===== Choose a model =====
MODEL_DIR = "WCB_BiLSTM"

# ===== Load the config =====
with open(f"{MODEL_DIR}/config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

# ===== Load the tokenizer =====
base_model = config.get("base_model", "airesearch/wangchanberta-base-att-spm-uncased")
tokenizer = AutoTokenizer.from_pretrained(base_model)

# ===== Load the model =====
model = _build(
    config.get("architecture", MODEL_DIR),
    base_model,
    config.get("num_labels", 2),
    config.get("pooling_after_lstm", "masked_mean")
)

state_dict = load_file(f"{MODEL_DIR}/model.safetensors")
model.load_state_dict(state_dict, strict=False)
model.eval()

# ===== Predict =====
# "Too expensive; the quality is not worth the price."
text = "ของแพงไป คุณภาพไม่คุ้มราคา"

inputs = tokenizer(
    text,
    truncation=True,
    padding=True,
    max_length=config.get("max_length", 128),
    return_tensors="pt"
)

with torch.no_grad():
    logits = model(inputs["input_ids"], inputs["attention_mask"])
    probs = F.softmax(logits, dim=1)[0]
    pred_id = torch.argmax(logits, dim=1).item()

id2label = {int(k): v for k, v in config["id2label"].items()}
print(f"Prediction: {id2label[pred_id]}")
print(f"Probabilities: {probs}")
```

---

### 📦 Method 3: Predict Multiple Texts at Once (Batch Prediction)

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json
import importlib.util

# ===== Setup =====
REPO_ID = "Dusit-P/thai-sentiment"
MODEL_NAME = "WCB_BiLSTM"

# ===== Load the model (same as Method 1) =====
config_path = hf_hub_download(REPO_ID, filename=f"{MODEL_NAME}/config.json")
weights_path = hf_hub_download(REPO_ID, filename=f"{MODEL_NAME}/model.safetensors")
models_py = hf_hub_download(REPO_ID, filename="common/models.py")

with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

base_model = config.get("base_model", "airesearch/wangchanberta-base-att-spm-uncased")
tokenizer = AutoTokenizer.from_pretrained(base_model)

spec = importlib.util.spec_from_file_location("models", models_py)
models = importlib.util.module_from_spec(spec)
spec.loader.exec_module(models)

model = models._build(
    config.get("architecture", MODEL_NAME),
    base_model,
    config.get("num_labels", 2),
    config.get("pooling_after_lstm", "masked_mean")
)

state_dict = load_file(weights_path)
model.load_state_dict(state_dict, strict=False)
model.eval()

# ===== Predict multiple texts =====
texts = [
    "อาหารอร่อยมาก บริการดีมาก",  # "The food is delicious and the service is great"
    "ของแพงไป รสชาติก็ธรรมดา",  # "Too expensive, and the taste is just average"
    "บรรยากาศดี แต่รอนานไป",  # "Nice atmosphere, but the wait is too long"
    "คุ้มค่ามาก แนะนำเลย"  # "Great value, recommended"
]

# Tokenize the batch (padding=True pads to the longest text in the batch)
inputs = tokenizer(
    texts,
    truncation=True,
    padding=True,
    max_length=config.get("max_length", 128),
    return_tensors="pt"
)

# Predict the batch
with torch.no_grad():
    logits = model(inputs["input_ids"], inputs["attention_mask"])
    probs = F.softmax(logits, dim=1)
    pred_ids = torch.argmax(logits, dim=1)

# Show the results (assumes index 0 = negative, index 1 = positive)
id2label = {int(k): v for k, v in config["id2label"].items()}

print("=" * 70)
for i, text in enumerate(texts):
    label = id2label[pred_ids[i].item()]
    neg_prob = probs[i][0].item()
    pos_prob = probs[i][1].item()

    print(f"Text: {text}")
    print(f" → Prediction: {label}")
    print(f" → Confidence: NEG={neg_prob:.4f}, POS={pos_prob:.4f}")
    print("-" * 70)
```

Example output:
```
======================================================================
Text: อาหารอร่อยมาก บริการดีมาก
 → Prediction: positive
 → Confidence: NEG=0.0156, POS=0.9844
----------------------------------------------------------------------
Text: ของแพงไป รสชาติก็ธรรมดา
 → Prediction: negative
 → Confidence: NEG=0.8923, POS=0.1077
----------------------------------------------------------------------
Text: บรรยากาศดี แต่รอนานไป
 → Prediction: positive
 → Confidence: NEG=0.3421, POS=0.6579
----------------------------------------------------------------------
Text: คุ้มค่ามาก แนะนำเลย
 → Prediction: positive
 → Confidence: NEG=0.0089, POS=0.9911
----------------------------------------------------------------------
```
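The batch example tokenizes every text in one call, which is fine for a handful of inputs but can exhaust memory on long lists. A common workaround is to split the inputs into fixed-size chunks and run the tokenize and predict steps once per chunk. The helper below is a generic sketch; the name `chunked` and the batch size of 4 are illustrative choices, not part of this repo.

```python
from typing import Iterator, List


def chunked(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder inputs; in practice this would be the list of Thai texts to classify.
texts = [f"text {i}" for i in range(10)]

batches = list(chunked(texts, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Each chunk can then be passed to the tokenizer and model exactly as in the batch example, with the per-chunk predictions appended to one result list.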

---

## 🎯 Choosing the Right Model

- Need the highest accuracy → `WCB_BiLSTM`
  Best Acc/F1 (90.93% / 90.54%)

- Limited resources / need speed → `WCB`
  Fastest (54.67 s/epoch in training) and the most stable across folds

- Focus on AUC / ranking by risk score → `WCB_CNN_BiLSTM`
  Highest AUC (95.83%) with a stable spread

- Overall balance → `WCB_4Layer_BiLSTM`
  Second-best Acc/F1 (90.52% / 90.13%), at the cost of the slowest training (72.91 s/epoch)
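The recommendations above can also be reproduced directly from the reported numbers. The sketch below copies the mean scores and training speed from the tables in this README into a dictionary and picks a model by a single criterion. The helper name `best_model` is hypothetical, and the comparison ignores the fold-to-fold spread, so treat it as a starting point only.

```python
# Mean scores (%) and training speed copied from the tables in this README.
results = {
    "WCB": {"acc": 90.33, "f1": 89.92, "auc": 95.72, "sec_per_epoch": 54.67},
    "WCB_BiLSTM": {"acc": 90.93, "f1": 90.54, "auc": 95.57, "sec_per_epoch": 67.84},
    "WCB_CNN_BiLSTM": {"acc": 90.14, "f1": 89.73, "auc": 95.83, "sec_per_epoch": 68.72},
    "WCB_4Layer_BiLSTM": {"acc": 90.52, "f1": 90.13, "auc": 95.43, "sec_per_epoch": 72.91},
}


def best_model(metric: str, higher_is_better: bool = True) -> str:
    """Return the model name with the best value for `metric`."""
    pick = max if higher_is_better else min
    return pick(results, key=lambda name: results[name][metric])


print(best_model("acc"))                                    # WCB_BiLSTM
print(best_model("auc"))                                    # WCB_CNN_BiLSTM
print(best_model("sec_per_epoch", higher_is_better=False))  # WCB
```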

---

## 🚀 Demo Application

Try the model through the Gradio demo:
https://huggingface.co/spaces/Dusit-P/Thai-Sentiment-GUI

---

## 📄 License

Apache-2.0

---

## 🙏 Acknowledgments

- WangchanBERTa: airesearch/wangchanberta-base-att-spm-uncased
- Dataset: wisesight_sentiment

---

## 📧 Contact

If you have questions or suggestions, please get in touch via GitHub Issues.