# PhayaThaiBERT Thai POS Tagger

Fine-tuned PhayaThaiBERT for UPOS part-of-speech tagging on Thai sentences, trained on the UD_Thai-TUD treebank following Universal Dependencies conventions.
## Model Description

- Base model: PhayaThaiBERT
- Task: Token classification (POS tagging)
- Dataset: UD_Thai-TUD
- Tags (15 UPOS): ADJ, ADP, ADV, AUX, CCONJ, DET, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB

This model predicts word-level POS tags for Thai text. For best performance, pass pre-segmented Thai words.
## Usage

### 1) Load the model

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("sandpapat/phayathaibert-thai-pos-tagger")
model = AutoModelForTokenClassification.from_pretrained("sandpapat/phayathaibert-thai-pos-tagger")
model.eval()
```
### Option 2: Raw Thai text (automatic word segmentation)

```python
from pythainlp.tokenize import word_tokenize

def predict_pos(text: str):
    """POS tag Thai text (raw string)."""
    # 1. Word segmentation
    words = word_tokenize(text)

    # 2. Tokenize with word alignment
    encoded = tokenizer(
        words,
        is_split_into_words=True,
        return_tensors="pt",
    )
    word_ids = encoded.word_ids()

    with torch.no_grad():
        outputs = model(**encoded)
    preds = outputs.logits.argmax(dim=-1)[0]

    # 3. Align subwords -> words (keep the first subword's label per word)
    results = []
    prev = None
    for idx, w_id in enumerate(word_ids):
        if w_id is None:  # special tokens ([CLS], [SEP])
            continue
        if w_id != prev:
            label = model.config.id2label[preds[idx].item()]
            results.append((words[w_id], label))
        prev = w_id
    return results

# Example
text = "ฉันกินข้าวที่ร้านอาหาร"
for w, p in predict_pos(text):
    print(f"{w:15s} {p}")
```
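The subword-to-word alignment step above can be checked in isolation. The helper below is a standalone sketch (the `word_ids` and label sequences are hypothetical, not produced by the model) showing how only the first subword's label is kept for each word:

```python
def align_first_subword(words, word_ids, pred_labels):
    """Keep the prediction of the first subword of each word."""
    results = []
    prev = None
    for idx, w_id in enumerate(word_ids):
        if w_id is None:  # special tokens carry no word index
            continue
        if w_id != prev:  # first subword of a new word
            results.append((words[w_id], pred_labels[idx]))
        prev = w_id
    return results

# Hypothetical encoding: [CLS], then three words, the last split into two subwords, then [SEP]
words = ["ฉัน", "กิน", "ข้าว"]
word_ids = [None, 0, 1, 2, 2, None]
labels = ["X", "PRON", "VERB", "NOUN", "NOUN", "X"]
print(align_first_subword(words, word_ids, labels))
# → [('ฉัน', 'PRON'), ('กิน', 'VERB'), ('ข้าว', 'NOUN')]
```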
### Option 3: Pre-segmented words (recommended: better accuracy)

```python
def predict_pos_from_words(words):
    """POS tag a list of pre-segmented Thai words."""
    encoded = tokenizer(
        words,
        is_split_into_words=True,
        return_tensors="pt",
    )
    word_ids = encoded.word_ids()

    with torch.no_grad():
        outputs = model(**encoded)
    preds = outputs.logits.argmax(dim=-1)[0]

    results = []
    prev = None
    for idx, w_id in enumerate(word_ids):
        if w_id is None:
            continue
        if w_id != prev:
            label = model.config.id2label[preds[idx].item()]
            results.append((words[w_id], label))
        prev = w_id
    return results

# Example
words = ["ฉัน", "กิน", "ข้าว", "ที่", "ร้านอาหาร"]
for w, p in predict_pos_from_words(words):
    print(f"{w:15s} {p}")
```
## Example Output

Input: "ฉันกินข้าวที่ร้านอาหาร"

```
ฉัน: PRON
กิน: VERB
ข้าว: NOUN
ที่: ADP
ร้านอาหาร: NOUN
```
## Training Details

### Dataset

- Source: UD_Thai-TUD
- Training set: 2,902 sentences
- Development set: 362 sentences
- Test set: 363 sentences
### Training Configuration

- Epochs: 3
- Batch size: 16
- Learning rate: 3e-5
- Optimizer: AdamW
- Warmup ratio: 0.1
- Weight decay: 0.01
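Assuming no gradient accumulation, these hyperparameters imply the following schedule (a back-of-envelope calculation, not taken from the training logs):

```python
import math

train_sentences = 2902
batch_size = 16
epochs = 3
warmup_ratio = 0.1

steps_per_epoch = math.ceil(train_sentences / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * warmup_ratio)
print(steps_per_epoch, total_steps, warmup_steps)  # → 182 546 54
```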
### Hardware

- Trained on GPU (CUDA)
- Mixed-precision training (FP16)
## Limitations

- The model was trained on Universal Dependencies data, which may not cover all domains
- Performance may vary on informal text, social media, or specialized domains
- Word segmentation quality affects accuracy (use consistent segmentation)
- Limited to 15 UPOS tags (coarse-grained POS categories)
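On the last point: the Universal Dependencies guidelines define 17 UPOS categories, and this model's tag set covers 15 of them. A quick check of which universal tags the model will never emit:

```python
# Full UPOS inventory per the Universal Dependencies guidelines
UPOS_ALL = {"ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
            "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"}
# The 15 tags listed in this model card
MODEL_TAGS = {"ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "NOUN", "NUM",
              "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB"}

missing = sorted(UPOS_ALL - MODEL_TAGS)
print(missing)  # → ['INTJ', 'X']
```

Interjections and unanalyzable tokens will therefore be mapped to one of the 15 available categories.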
## Ethical Considerations

- The model should not be used as the sole basis for critical decisions
- Performance may vary across different text types and domains
- Users should validate outputs for their specific use cases
## Citation

If you use this model, please also cite WangchanBERTa, on which the base model PhayaThaiBERT builds:

```bibtex
@inproceedings{lowphansirikul2021wangchanberta,
  title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
  author={Lowphansirikul, Lalita and Polpanumas, Charin and Rutherford, Attapol T and Nutanong, Sarana},
  booktitle={2021 16th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP)},
  pages={1--6},
  year={2021},
  organization={IEEE}
}
```
## Acknowledgements

- Base model: PhayaThaiBERT (clicknext/phayathaibert)
- Training data: Universal Dependencies Thai Treebank (UD_Thai-TUD)

## License

MIT License
## Evaluation Results

Self-reported on the UD_Thai-TUD test set:

- Accuracy: 0.906
- Micro F1: 0.906
- Macro F1: 0.813
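Note the gap between micro and macro F1: micro F1 weights every token equally (for single-label tagging it equals accuracy), while macro F1 averages per-class scores, so rare tags pull it down. A toy illustration with made-up tag sequences (not the model's actual predictions):

```python
def f1_per_class(y_true, y_pred, cls):
    """Binary F1 for one class against all others."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

y_true = ["NOUN", "NOUN", "VERB", "PRON"]
y_pred = ["NOUN", "VERB", "VERB", "PRON"]

classes = sorted(set(y_true))
macro = sum(f1_per_class(y_true, y_pred, c) for c in classes) / len(classes)
micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # accuracy
print(f"macro={macro:.3f} micro={micro:.3f}")  # → macro=0.778 micro=0.750
```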