# OpenThai-NER

Fine-tuned `Pavarissy/phayathaibert-thainer` for Thai Named Entity Recognition (NER) on the `JonusNattapong/OpenThai-NER` dataset.
> **Note:** The evaluation logs show `eval_loss: nan` while F1/Precision/Recall are normal. This typically indicates a loss computation/logging issue (e.g., label masking / `-100` handling / fp16 instability), not that the model is unusable. The metrics below are still valid, since they come from decoded predictions compared against gold labels.
## Model Details

- Base model: `Pavarissy/phayathaibert-thainer`
- Task: Token Classification (NER)
- Language: Thai (`th`)
- License: Apache-2.0
## Intended Use
This model is intended for:
- Extracting named entities from Thai text (people, organizations, locations, etc. depending on your label schema)
- Downstream tasks like document indexing, entity-based search, analytics, and data labeling acceleration
## Limitations / Known Issues

- `eval_loss` is NaN. Common causes:
  - fp16 mixed-precision numerical instability
  - incorrect handling of the label mask (`-100`) in the loss
  - invalid label ids in a batch (out of range)
  - occasional overflow if logits become extreme

  To fix it: try disabling fp16, validate the labels, and ensure your data collator pads labels with `-100`.
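The "invalid label ids" cause above can be ruled out with a plain-Python validation pass. A minimal sketch (the nested-list `examples` structure is an assumption; adapt it to however your dataset stores label sequences):

```python
def find_invalid_label_ids(examples, num_labels):
    """Return (example_index, position, label) for every label id that is
    neither -100 (the ignore index) nor within [0, num_labels - 1]."""
    bad = []
    for i, labels in enumerate(examples):
        for j, label in enumerate(labels):
            if label != -100 and not (0 <= label < num_labels):
                bad.append((i, j, label))
    return bad

# Toy check: label id 9 is out of range for a 5-label schema.
encoded = [[-100, 0, 1, -100], [2, 9, -100, 4]]
print(find_invalid_label_ids(encoded, num_labels=5))  # [(1, 1, 9)]
```

If this returns anything, the out-of-range ids will make the cross-entropy loss NaN regardless of precision settings.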
- **Domain shift:** Performance may drop on domains/styles not present in the training data (e.g., very informal slang, OCR noise, code-mixed text).
- **Entity boundary ambiguity:** Thai tokenization and spacing can cause boundary errors (spans too long or too short), especially with uncommon names.
## Evaluation

### Final evaluation (epoch 3)
- Precision: 0.8565
- Recall: 0.8778
- F1: 0.8670
- Accuracy: 0.9565
- Runtime: 29.8202s
### Training summary
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.369000 | NaN | 0.787043 | 0.824356 | 0.805268 | 0.932532 |
| 2 | 0.237600 | NaN | 0.841745 | 0.855728 | 0.848679 | 0.949934 |
| 3 | 0.195900 | NaN | 0.856493 | 0.877835 | 0.867033 | 0.956475 |
The validation loss being NaN across all three epochs reinforces that this is likely a logging/loss-computation issue rather than random corruption: the metric trend is consistent and improving.
## How to Use

### Transformers Pipeline

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

repo_id = "JonusNattapong/OpenThai-NER"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)

ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = "กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย และบริษัท ABC จำกัด ตั้งอยู่ที่บางนา"
print(ner(text))
```
### Manual inference

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

repo_id = "JonusNattapong/OpenThai-NER"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)

text = "นายสมชายทำงานที่กระทรวงการคลัง"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

pred_ids = outputs.logits.argmax(-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, pred_ids)))
```
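The raw `(token, id)` pairs are easier to read once mapped through the model's label map (`model.config.id2label`) and grouped under a BIO scheme. A self-contained sketch of that grouping; the `id2label` dict and predictions below are illustrative, not the actual model's schema:

```python
def bio_spans(tokens, pred_ids, id2label):
    """Group BIO-tagged tokens into (entity_type, token_list) spans."""
    spans, current = [], None
    for tok, pid in zip(tokens, pred_ids):
        tag = id2label[pid]
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [tok])  # start a new entity span
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)      # continue the current span
        else:
            if current:
                spans.append(current)
            current = None              # "O" tag or inconsistent I- tag
    if current:
        spans.append(current)
    return spans

# Illustrative label map and predictions.
id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG"}
tokens = ["นาย", "สมชาย", "ทำงาน", "ที่", "กระทรวง"]
print(bio_spans(tokens, [1, 2, 0, 0, 3], id2label))
# [('PER', ['นาย', 'สมชาย']), ('ORG', ['กระทรวง'])]
```

For most uses the pipeline's `aggregation_strategy="simple"` does this grouping for you; the manual version is mainly useful when you need custom boundary handling.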
## Training Details
- Epochs: 3
- Learning Rate: 2e-5
- Batch Size: 16
- Framework: Hugging Face Transformers Trainer
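A sketch of a matching `Trainer` setup under the hyperparameters above. Dataset loading and tokenization are omitted; `train_ds`/`eval_ds` and `output_dir` are placeholders, not values taken from the actual training run:

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

base = "Pavarissy/phayathaibert-thainer"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForTokenClassification.from_pretrained(base)

args = TrainingArguments(
    output_dir="openthai-ner",          # placeholder
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,             # placeholder: your tokenized splits
    eval_dataset=eval_ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```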
## Reproducibility Notes

If you want a fully stable loss:
- Set `fp16=False` (or `bf16=True` if available)
- Validate that label ids are within `[0, num_labels - 1]` or equal to `-100`
- Ensure the data collator pads labels with `-100`
- Try gradient clipping (e.g., `max_grad_norm=1.0`)
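The `-100` padding point can be seen in isolation. This is a minimal pure-Python version of what the label padding must do, not the library's actual implementation (`DataCollatorForTokenClassification` handles this via its `label_pad_token_id`, which defaults to `-100`):

```python
def pad_labels(batch_labels, max_len, pad_id=-100):
    """Pad each label sequence to max_len with the ignore index so that
    padded positions are excluded from the cross-entropy loss."""
    return [labels + [pad_id] * (max_len - len(labels)) for labels in batch_labels]

batch = [[0, 1, 2], [3, 4]]
print(pad_labels(batch, max_len=4))  # [[0, 1, 2, -100], [3, 4, -100, -100]]
```

Padding with a real label id (e.g., 0) instead of `-100` would train the model toward spurious labels on padding positions and can also distort the reported loss.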
## Citation

If you use this model in your research, please cite:

```bibtex
@dataset{thainer_model_2025,
  title={Thai Named Entity Recognition},
  author={Nattapong Tapachoom},
  year={2025},
  howpublished={https://github.com/JonusNattapong/Natural-Language-Processing}
}
```
Repo: `JonusNattapong/OpenThai-NER`
Base model: `Pavarissy/phayathaibert-thainer`