---
language:
- th
license: cc-by-3.0
pipeline_tag: token-classification
tags:
- ner
- thai
- token-classification
- transformers
- pytorch
- phayathaibert
- phayathaibert-thainer
base_model: Pavarissy/phayathaibert-thainer
datasets:
- JonusNattapong/OpenThai-NER
model-index:
- name: OpenThai-NER
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: JonusNattapong/OpenThai-NER
      type: JonusNattapong/OpenThai-NER
    metrics:
    - name: Precision
      type: precision
      value: 0.8565
    - name: Recall
      type: recall
      value: 0.8778
    - name: F1
      type: f1
      value: 0.867
    - name: Accuracy
      type: accuracy
      value: 0.9565
new_version: Pavarissy/phayathaibert-thainer
---

# OpenThai-NER

Fine-tuned **Pavarissy/phayathaibert-thainer** for **Thai Named Entity Recognition (NER)** on the **JonusNattapong/OpenThai-NER** dataset.

## Model Details

- **Base model:** `Pavarissy/phayathaibert-thainer`
- **Task:** Token classification (NER)
- **Language:** Thai (`th`)
- **License:** cc-by-3.0
- **Notebook:** [Google Colab](https://colab.research.google.com/drive/1Vy5bXO0BaZnYCYKbA9J3VagFtutMWcTY#scrollTo=M_cCxkSuUaXV)

## Intended Use

This model is intended for:

- Extracting named entities from Thai text (people, organizations, locations, etc., depending on the label schema)
- Downstream tasks such as document indexing, entity-based search, analytics, and accelerating data labeling
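
When working with raw per-token BIO tags (as in the manual-inference example later in this card), you often need to merge them into entity spans yourself for indexing or search. A minimal sketch in plain Python; the entity type names (`PERSON`, `ORGANIZATION`) are illustrative assumptions, since the actual tag set is defined by the model's `id2label`:

```python
def bio_to_spans(tokens, tags):
    """Merge BIO tags over tokens into (entity_type, token_list) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_type is not None:
                spans.append((current_type, current_tokens))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:  # "O", or an I- tag that does not continue the open span
            if current_type is not None:
                spans.append((current_type, current_tokens))
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, current_tokens))
    return spans

tokens = ["นาย", "สมชาย", "ทำงาน", "ที่", "กระทรวง", "การคลัง"]
tags   = ["B-PERSON", "I-PERSON", "O", "O", "B-ORGANIZATION", "I-ORGANIZATION"]
print(bio_to_spans(tokens, tags))
# [('PERSON', ['นาย', 'สมชาย']), ('ORGANIZATION', ['กระทรวง', 'การคลัง'])]
```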

## Limitations / Known Issues

- **`eval_loss` is NaN.** Common causes:
  - fp16 mixed-precision numerical instability
  - incorrect label-mask handling (`-100`) in the loss
  - invalid label ids in a batch (out of range)
  - occasional overflow if logits become extreme

  To fix it: try disabling fp16, validate the label ids, and ensure the data collator pads labels with `-100`.

- **Domain shift:** Performance may drop on domains or styles absent from the training data (e.g., very informal slang, OCR noise, code-mixed text).
- **Entity boundary ambiguity:** Thai tokenization and spacing can cause boundary errors (spans too long or too short), especially with uncommon names.

## Evaluation

### Final evaluation (epoch 3)

- **Precision:** 0.8565
- **Recall:** 0.8778
- **F1:** 0.8670
- **Accuracy:** 0.9565
- **Eval runtime:** 29.82 s

### Training summary

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|------:|--------------:|----------------:|----------:|-------:|---:|---------:|
| 1 | 0.369000 | NaN | 0.787043 | 0.824356 | 0.805268 | 0.932532 |
| 2 | 0.237600 | NaN | 0.841745 | 0.855728 | 0.848679 | 0.949934 |
| 3 | 0.195900 | NaN | 0.856493 | 0.877835 | 0.867033 | 0.956475 |
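
For reference, entity-level precision/recall/F1 (the metric style reported above) count an entity as correct only when both its span and its type match exactly. A pure-Python sketch of that computation over gold vs. predicted entity sets; this is illustrative only, since the reported numbers come from the training run's own evaluation:

```python
def entity_prf(gold, pred):
    """Exact-match entity-level precision/recall/F1.

    gold, pred: iterables of (start, end, type) tuples.
    """
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # entities matching in both span and type
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [(0, 2, "PER"), (5, 7, "ORG"), (9, 10, "LOC")]
pred = [(0, 2, "PER"), (5, 7, "LOC")]  # second span has the wrong type
p, r, f = entity_prf(gold, pred)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.5 0.333 0.4
```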

> Validation loss is NaN in every epoch while precision, recall, F1, and accuracy improve steadily, so this is most likely a loss-computation or logging issue (see Limitations) rather than model divergence.

## How to Use

### Transformers Pipeline

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

repo_id = "JonusNattapong/OpenThai-NER"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)

# aggregation_strategy="simple" merges subword pieces into whole entity spans
ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = "กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย และบริษัท ABC จำกัด ตั้งอยู่ที่บางนา"
print(ner(text))
```

### Manual inference

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

repo_id = "JonusNattapong/OpenThai-NER"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)

text = "นายสมชายทำงานที่กระทรวงการคลัง"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

pred_ids = outputs.logits.argmax(-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[i] for i in pred_ids]  # map class ids to tag names

print(list(zip(tokens, labels)))
```
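
To turn per-token predictions like these into character spans in the original text, the tokenizer's offset mapping can be used (pass `return_offsets_mapping=True` when tokenizing). The span-merging step itself is pure Python; a sketch, assuming BIO-style labels and `(start, end)` character offsets per token, where special tokens carry `(0, 0)` offsets and are skipped — the offsets and labels below are hypothetical, for illustration:

```python
def labeled_char_spans(text, labels, offsets):
    """Merge BIO labels and (start, end) character offsets into entity substrings."""
    entities = []
    current = None  # open span as (type, start, end)
    for label, (start, end) in zip(labels, offsets):
        if start == end:  # special tokens like <s> / </s>
            continue
        if label.startswith("B-"):
            if current:
                entities.append(current)
            current = (label[2:], start, end)
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current = (current[0], current[1], end)  # extend the open span
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, text[s:e]) for etype, s, e in entities]

# Hypothetical labels/offsets for illustration:
text = "สมชาย ไป เชียงใหม่"
offsets = [(0, 0), (0, 5), (6, 8), (9, 18), (0, 0)]
labels  = ["O", "B-PERSON", "O", "B-LOCATION", "O"]
print(labeled_char_spans(text, labels, offsets))
# [('PERSON', 'สมชาย'), ('LOCATION', 'เชียงใหม่')]
```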

## Training Details

- **Epochs:** 3
- **Learning rate:** 2e-5
- **Batch size:** 16
- **Framework:** Hugging Face Transformers `Trainer`
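
The hyperparameters above map onto a standard Hugging Face `Trainer` setup. A minimal sketch only — dataset loading, tokenization, and label alignment are omitted, the label count would come from the dataset's tag set, and exact argument names can vary slightly across `transformers` versions:

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

base = "Pavarissy/phayathaibert-thainer"
tokenizer = AutoTokenizer.from_pretrained(base)
# num_labels / id2label would be set from the dataset's tag set in practice
model = AutoModelForTokenClassification.from_pretrained(base)

args = TrainingArguments(
    output_dir="openthai-ner",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=DataCollatorForTokenClassification(tokenizer),  # pads labels with -100
    # train_dataset=..., eval_dataset=..., compute_metrics=...,
)
# trainer.train()
```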

## Reproducibility Notes

For fully stable loss:

- Set `fp16=False` (or `bf16=True` if available)
- Validate that label ids are within `[0, num_labels - 1]` or equal to `-100`
- Ensure the data collator pads labels with `-100`
- Try gradient clipping (e.g., `max_grad_norm=1.0`)
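
The label-validation step above can be a one-pass check before training. A minimal sketch, assuming labels are lists of ints in which `-100` marks ignored positions:

```python
def validate_labels(batches, num_labels):
    """Return (batch_index, position, label) for every out-of-range label id."""
    bad = []
    for i, labels in enumerate(batches):
        for j, label in enumerate(labels):
            if label != -100 and not (0 <= label < num_labels):
                bad.append((i, j, label))
    return bad

batches = [[0, 3, -100], [7, 1, -1]]  # 7 and -1 are invalid for num_labels=5
print(validate_labels(batches, num_labels=5))  # [(1, 0, 7), (1, 2, -1)]
```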

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{tapachoom2025thainer,
  title        = {Thai Named Entity Recognition},
  author       = {Nattapong Tapachoom},
  year         = {2025},
  howpublished = {\url{https://github.com/JonusNattapong/Natural-Language-Processing}}
}
```

---

**Repo:** `JonusNattapong/OpenThai-NER`
**Base model:** `Pavarissy/phayathaibert-thainer`