---
language:
- th
license: cc-by-3.0
pipeline_tag: token-classification
tags:
- ner
- thai
- token-classification
- transformers
- pytorch
- phayathaibert
- phayathaibert-thainer
base_model: Pavarissy/phayathaibert-thainer
datasets:
- JonusNattapong/OpenThai-NER
metrics:
- name: precision
  type: precision
  value: 0.8565
- name: recall
  type: recall
  value: 0.8778
- name: f1
  type: f1
  value: 0.867
- name: accuracy
  type: accuracy
  value: 0.9565
model-index:
- name: OpenThai-NER
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: JonusNattapong/OpenThai-NER
      type: JonusNattapong/OpenThai-NER
    metrics:
    - name: Precision
      type: precision
      value: 0.8565
    - name: Recall
      type: recall
      value: 0.8778
    - name: F1
      type: f1
      value: 0.867
    - name: Accuracy
      type: accuracy
      value: 0.9565
new_version: Pavarissy/phayathaibert-thainer
---
# OpenThai-NER
Fine-tuned **Pavarissy/phayathaibert-thainer** for **Thai Named Entity Recognition (NER)** on the **JonusNattapong/OpenThai-NER** dataset.
## Model Details
- **Base model:** `Pavarissy/phayathaibert-thainer`
- **Task:** Token Classification (NER)
- **Language:** Thai (`th`)
- **License:** cc-by-3.0
- **Notebook:** [Google Colab](https://colab.research.google.com/drive/1Vy5bXO0BaZnYCYKbA9J3VagFtutMWcTY#scrollTo=M_cCxkSuUaXV)
## Intended Use
This model is intended for:
- Extracting named entities from Thai text (people, organizations, locations, etc. depending on your label schema)
- Downstream tasks like document indexing, entity-based search, analytics, and data labeling acceleration
## Limitations / Known Issues
- **`eval_loss` is NaN.** Common causes include:
  - fp16 mixed-precision numerical instability
  - incorrect handling of the `-100` label mask in the loss
  - out-of-range label ids in a batch
  - occasional overflow when logits become extreme

  To fix it, try disabling fp16, validate the label ids, and ensure your data collator pads labels with `-100`.
- **Domain shift:** Performance may drop on domains/styles not present in training data (e.g., very informal slang, OCR noise, code-mixed text).
- **Entity boundary ambiguity:** Thai tokenization and spacing can cause boundary errors (span too long/short), especially with uncommon names.
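The label-validation step suggested above can be sketched in plain Python. Note that `num_labels=13` and the batch contents below are illustrative assumptions, not this model's actual configuration:

```python
def find_invalid_labels(batch_labels, num_labels):
    """Return (row, col, label) triples for label ids that are neither -100
    (the ignore index) nor a valid class index in [0, num_labels - 1]."""
    bad = []
    for i, row in enumerate(batch_labels):
        for j, label in enumerate(row):
            if label != -100 and not (0 <= label < num_labels):
                bad.append((i, j, label))
    return bad

# Example: one padded batch of label ids (values are made up for illustration)
batch = [
    [0, 3, 3, -100, -100],
    [0, 7, 99, -100, -100],  # 99 is out of range and would corrupt the loss
]
print(find_invalid_labels(batch, num_labels=13))  # → [(1, 2, 99)]
```

Running a check like this over the training set before fine-tuning quickly rules out (or confirms) invalid label ids as the source of a NaN loss.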
## Evaluation
### Final evaluation (epoch 3)
- **Precision:** 0.8565
- **Recall:** 0.8778
- **F1:** 0.8670
- **Accuracy:** 0.9565
- **Runtime:** 29.8202s
### Training summary
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|------:|--------------:|----------------:|----------:|-------:|---:|---------:|
| 1 | 0.369000 | NaN | 0.787043 | 0.824356 | 0.805268 | 0.932532 |
| 2 | 0.237600 | NaN | 0.841745 | 0.855728 | 0.848679 | 0.949934 |
| 3 | 0.195900 | NaN | 0.856493 | 0.877835 | 0.867033 | 0.956475 |
> The validation loss is NaN in every epoch while precision, recall, and F1 improve consistently, which suggests a loss-computation or logging issue rather than random corruption or a diverging model.
## How to Use
### Transformers Pipeline
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
repo_id = "JonusNattapong/OpenThai-NER"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)
ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = "กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย และบริษัท ABC จำกัด ตั้งอยู่ที่บางนา"
print(ner(text))
```
### Manual inference
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
repo_id = "JonusNattapong/OpenThai-NER"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)
text = "นายสมชายทำงานที่กระทรวงการคลัง"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

pred_ids = outputs.logits.argmax(-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, pred_ids)))
```
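To turn the raw `pred_ids` into readable tags, map them through the checkpoint's `model.config.id2label`. A minimal sketch, using a hypothetical mapping for illustration (the real label set comes from this repo's config):

```python
# Hypothetical id2label mapping; in practice use model.config.id2label
# from the loaded checkpoint instead of hard-coding it.
id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG"}

pred_ids = [0, 1, 2, 0, 3]  # e.g. from outputs.logits.argmax(-1)[0].tolist()
labels = [id2label[i] for i in pred_ids]
print(labels)  # → ['O', 'B-PER', 'I-PER', 'O', 'B-ORG']
```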
## Training Details
* **Epochs:** 3
* **Learning Rate:** 2e-5
* **Batch Size:** 16
* **Framework:** Hugging Face Transformers Trainer
## Reproducibility Notes
If you want fully stable loss:
* Set `fp16=False` (or `bf16=True` if available)
* Validate label ids are within `[0, num_labels-1]` or `-100`
* Ensure the data collator pads labels with `-100`
* Try gradient clipping (e.g., `max_grad_norm=1.0`)
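The checklist above can be expressed as a `TrainingArguments` configuration sketch. These are hypothetical stability-focused settings (the `output_dir` is a placeholder; the hyperparameters mirror the Training Details section), not the exact configuration used to train this model:

```python
from transformers import TrainingArguments, DataCollatorForTokenClassification

args = TrainingArguments(
    output_dir="openthai-ner-ft",   # placeholder path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    fp16=False,                     # avoid fp16 numerical instability
    max_grad_norm=1.0,              # gradient clipping
)

# DataCollatorForTokenClassification pads labels with -100 by default,
# so padded positions are ignored by the loss:
# collator = DataCollatorForTokenClassification(tokenizer)
```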
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{thainer_model_2025,
  title        = {Thai Named Entity Recognition},
  author       = {Nattapong Tapachoom},
  year         = {2025},
  howpublished = {\url{https://github.com/JonusNattapong/Natural-Language-Processing}}
}
```
---
**Repo:** `JonusNattapong/OpenThai-NER`
**Base model:** `Pavarissy/phayathaibert-thainer`