---
language:
- th
license: cc-by-3.0
pipeline_tag: token-classification
tags:
- ner
- thai
- token-classification
- transformers
- pytorch
- phayathaibert
- phayathaibert-thainer
base_model: Pavarissy/phayathaibert-thainer
datasets:
- JonusNattapong/OpenThai-NER
metrics:
- name: precision
  type: precision
  value: 0.8565
- name: recall
  type: recall
  value: 0.8778
- name: f1
  type: f1
  value: 0.867
- name: accuracy
  type: accuracy
  value: 0.9565
model-index:
- name: OpenThai-NER
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: JonusNattapong/OpenThai-NER
      type: JonusNattapong/OpenThai-NER
    metrics:
    - name: Precision
      type: precision
      value: 0.8565
    - name: Recall
      type: recall
      value: 0.8778
    - name: F1
      type: f1
      value: 0.867
    - name: Accuracy
      type: accuracy
      value: 0.9565
new_version: Pavarissy/phayathaibert-thainer
---
# OpenThai-NER
Fine-tuned **Pavarissy/phayathaibert-thainer** for **Thai Named Entity Recognition (NER)** on the **JonusNattapong/OpenThai-NER** dataset.
## Model Details
- **Base model:** `Pavarissy/phayathaibert-thainer`
- **Task:** Token Classification (NER)
- **Language:** Thai (`th`)
- **License:** cc-by-3.0
- **Notebook:** [Google Colab](https://colab.research.google.com/drive/1Vy5bXO0BaZnYCYKbA9J3VagFtutMWcTY#scrollTo=M_cCxkSuUaXV)
## Intended Use
This model is intended for:
- Extracting named entities from Thai text (people, organizations, locations, etc. depending on your label schema)
- Downstream tasks like document indexing, entity-based search, analytics, and data labeling acceleration
## Limitations / Known Issues
- **`eval_loss` is NaN.** Common causes:
  - numerical instability under fp16 mixed precision
  - incorrect label-mask (`-100`) handling in the loss
  - invalid label ids in a batch (out of range)
  - occasional overflow if logits become extreme

  To fix it: try disabling fp16, validate the label ids, and ensure your data collator pads labels with `-100`.
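As a quick sanity check for the label issue, a pure-Python sketch like the following (with a hypothetical `num_labels`) can scan label sequences before training:

```python
def find_bad_label_ids(label_seqs, num_labels):
    """Return (sequence_idx, position, label) for every label id that is
    neither the ignore index -100 nor a valid class id in [0, num_labels)."""
    bad = []
    for s, labels in enumerate(label_seqs):
        for i, label in enumerate(labels):
            if label != -100 and not (0 <= label < num_labels):
                bad.append((s, i, label))
    return bad

# Hypothetical label sequences; the second contains an out-of-range id (7).
label_seqs = [[0, 3, -100, 2], [1, 7, -100]]
print(find_bad_label_ids(label_seqs, num_labels=5))  # -> [(1, 1, 7)]
```

An empty result means every label is either `-100` or a valid class id, ruling out the out-of-range cause.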
- **Domain shift:** Performance may drop on domains/styles not present in training data (e.g., very informal slang, OCR noise, code-mixed text).
- **Entity boundary ambiguity:** Thai tokenization and spacing can cause boundary errors (span too long/short), especially with uncommon names.
## Evaluation
### Final evaluation (epoch 3)
- **Precision:** 0.8565
- **Recall:** 0.8778
- **F1:** 0.8670
- **Accuracy:** 0.9565
- **Runtime:** 29.8202s
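The reported F1 is consistent with the precision/recall pair; a one-line harmonic-mean check:

```python
precision, recall = 0.8565, 0.8778
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # -> 0.867
```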
### Training summary
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|------:|--------------:|----------------:|----------:|-------:|---:|---------:|
| 1 | 0.369000 | NaN | 0.787043 | 0.824356 | 0.805268 | 0.932532 |
| 2 | 0.237600 | NaN | 0.841745 | 0.855728 | 0.848679 | 0.949934 |
| 3 | 0.195900 | NaN | 0.856493 | 0.877835 | 0.867033 | 0.956475 |
> The validation loss is NaN in every epoch while precision, recall, and F1 improve consistently, which points to a loss-computation or logging issue rather than corrupted training.
## How to Use
### Transformers Pipeline
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
repo_id = "JonusNattapong/OpenThai-NER"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)
ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = "กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย และบริษัท ABC จำกัด ตั้งอยู่ที่บางนา"
print(ner(text))
```
### Manual inference
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
repo_id = "JonusNattapong/OpenThai-NER"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)
text = "นายสมชายทำงานที่กระทรวงการคลัง"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
pred_ids = outputs.logits.argmax(-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, pred_ids)))
```
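The raw `pred_ids` are class indices; the checkpoint's `config.id2label` dict maps them to tag strings. A sketch with a hypothetical three-label mapping (in practice, use `model.config.id2label` from the loaded model):

```python
# Hypothetical mapping for illustration; the real one ships with the checkpoint.
id2label = {0: "O", 1: "B-PER", 2: "I-PER"}

tokens = ["นาย", "สมชาย", "ทำงาน"]
pred_ids = [1, 2, 0]

# Pair each token with its predicted tag string.
tagged = [(tok, id2label[i]) for tok, i in zip(tokens, pred_ids)]
print(tagged)  # -> [('นาย', 'B-PER'), ('สมชาย', 'I-PER'), ('ทำงาน', 'O')]
```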
## Training Details
* **Epochs:** 3
* **Learning Rate:** 2e-5
* **Batch Size:** 16
* **Framework:** Hugging Face Transformers Trainer
## Reproducibility Notes
If you want fully stable loss:
* Set `fp16=False` (or `bf16=True` if available)
* Validate label ids are within `[0, num_labels-1]` or `-100`
* Ensure the data collator pads labels with `-100`
* Try gradient clipping (e.g., `max_grad_norm=1.0`)
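The last two points interact: `DataCollatorForTokenClassification` pads label sequences with `-100` so padded positions are ignored by the loss. A minimal pure-Python sketch of that padding behavior:

```python
def pad_labels(label_seqs, ignore_index=-100):
    """Pad variable-length label sequences to the batch max length,
    using the ignore index so padding never contributes to the loss."""
    max_len = max(len(seq) for seq in label_seqs)
    return [seq + [ignore_index] * (max_len - len(seq)) for seq in label_seqs]

print(pad_labels([[0, 3, 2], [1]]))  # -> [[0, 3, 2], [1, -100, -100]]
```

Padding labels with a real class id (for example `0` for `O`) instead of `-100` would silently pull the loss toward the padding label.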
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{tapachoom2025thainer,
  title        = {Thai Named Entity Recognition},
  author       = {Nattapong Tapachoom},
  year         = {2025},
  howpublished = {\url{https://github.com/JonusNattapong/Natural-Language-Processing}}
}
```
---
**Repo:** `JonusNattapong/OpenThai-NER`
**Base model:** `Pavarissy/phayathaibert-thainer` |