---
language:
- th
license: cc-by-3.0
pipeline_tag: token-classification
tags:
- ner
- thai
- token-classification
- transformers
- pytorch
- phayathaibert
- phayathaibert-thainer
base_model: Pavarissy/phayathaibert-thainer
datasets:
- JonusNattapong/OpenThai-NER
metrics:
- name: precision
  type: precision
  value: 0.8565
- name: recall
  type: recall
  value: 0.8778
- name: f1
  type: f1
  value: 0.867
- name: accuracy
  type: accuracy
  value: 0.9565
model-index:
- name: OpenThai-NER
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: JonusNattapong/OpenThai-NER
      type: JonusNattapong/OpenThai-NER
    metrics:
    - name: Precision
      type: precision
      value: 0.8565
    - name: Recall
      type: recall
      value: 0.8778
    - name: F1
      type: f1
      value: 0.867
    - name: Accuracy
      type: accuracy
      value: 0.9565
new_version: Pavarissy/phayathaibert-thainer
---

# OpenThai-NER

Fine-tuned **Pavarissy/phayathaibert-thainer** for **Thai Named Entity Recognition (NER)** on the **JonusNattapong/OpenThai-NER** dataset.

## Model Details

- **Base model:** `Pavarissy/phayathaibert-thainer`
- **Task:** Token Classification (NER)
- **Language:** Thai (`th`)
- **License:** cc-by-3.0
- **Google Colab:** [Google Colab](https://colab.research.google.com/drive/1Vy5bXO0BaZnYCYKbA9J3VagFtutMWcTY#scrollTo=M_cCxkSuUaXV)

## Intended Use

This model is intended for:

- Extracting named entities from Thai text (people, organizations, locations, etc., depending on the label schema)
- Downstream tasks such as document indexing, entity-based search, analytics, and data-labeling acceleration

## Limitations / Known Issues

- **`eval_loss` is NaN.** Common causes:
  - fp16 mixed-precision numerical instability
  - incorrect handling of the `-100` label mask in the loss
  - invalid (out-of-range) label ids in a batch
  - occasional overflow when logits become extreme

  To fix it, try disabling fp16, validate the labels, and ensure the data collator pads labels with `-100`.
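One of the causes above, out-of-range label ids, is easy to rule out before training. Below is a minimal, framework-free sketch (the helper name `validate_labels` is hypothetical, not part of this repo) that scans batches of label sequences and reports any id that is neither `-100` nor a valid class index:

```python
# Hypothetical sanity check for NaN eval loss: every label id must be
# either -100 (the ignore index) or within [0, num_labels - 1].
def validate_labels(batches, num_labels):
    """Return a list of (batch_index, bad_label_id) pairs."""
    bad = []
    for i, labels in enumerate(batches):
        for lab in labels:
            if lab != -100 and not (0 <= lab < num_labels):
                bad.append((i, lab))
    return bad

# Example: label id 9 is out of range for a 9-class schema (valid ids 0-8).
batches = [[0, 3, -100, 8], [1, 9, -100]]
print(validate_labels(batches, num_labels=9))  # -> [(1, 9)]
```

If this returns a non-empty list, the cross-entropy loss can silently index past the logit dimension, which is a frequent source of NaN losses.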
- **Domain shift:** Performance may drop on domains or styles not present in the training data (e.g., very informal slang, OCR noise, code-mixed text).
- **Entity boundary ambiguity:** Thai tokenization and spacing can cause boundary errors (spans that are too long or too short), especially with uncommon names.

## Evaluation

### Final evaluation (epoch 3)

- **Precision:** 0.8565
- **Recall:** 0.8778
- **F1:** 0.8670
- **Accuracy:** 0.9565
- **Runtime:** 29.8202 s

### Training summary

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|------:|--------------:|----------------:|----------:|-------:|---:|---------:|
| 1 | 0.369000 | NaN | 0.787043 | 0.824356 | 0.805268 | 0.932532 |
| 2 | 0.237600 | NaN | 0.841745 | 0.855728 | 0.848679 | 0.949934 |
| 3 | 0.195900 | NaN | 0.856493 | 0.877835 | 0.867033 | 0.956475 |

> The validation loss being NaN across all epochs reinforces that this is likely a logging or loss-computation issue rather than random corruption, because the metric trend is consistent and improving.
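The precision/recall/F1 figures above are entity-level scores, i.e., a prediction only counts as correct if the full entity span and its type match the gold annotation (the convention used by `seqeval`). As a dependency-free illustration of that scoring scheme, here is a minimal sketch over BIO tags (the helpers `extract_spans` and `prf` are illustrative, not part of this repo):

```python
# Entity-level precision/recall/F1 over BIO tag sequences (seqeval-style).
def extract_spans(tags):
    """Return a set of (start, end, type) entity spans from a BIO sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # "O" sentinel flushes the last span
        ends_span = tag == "O" or tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != etype
        )
        if ends_span and start is not None:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

def prf(gold_tags, pred_tags):
    """Precision, recall, and F1 over exact span-and-type matches."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["B-PER", "I-PER", "O", "B-ORG"]
pred = ["B-PER", "I-PER", "O", "B-LOC"]  # wrong type on the second entity
print(prf(gold, pred))  # -> (0.5, 0.5, 0.5)
```

Note how the mislabeled `B-LOC` span costs both a false positive and a false negative, which is why entity-level F1 is stricter than token accuracy.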
## How to Use

### Transformers Pipeline

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

repo_id = "JonusNattapong/OpenThai-NER"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

text = "กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย และบริษัท ABC จำกัด ตั้งอยู่ที่บางนา"
print(ner(text))
```

### Manual inference

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

repo_id = "JonusNattapong/OpenThai-NER"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)

text = "นายสมชายทำงานที่กระทรวงการคลัง"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

pred_ids = outputs.logits.argmax(-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, pred_ids)))
```

## Training Details

* **Epochs:** 3
* **Learning rate:** 2e-5
* **Batch size:** 16
* **Framework:** Hugging Face Transformers `Trainer`

## Reproducibility Notes

For a fully stable loss:

* Set `fp16=False` (or `bf16=True` if available)
* Validate that label ids are within `[0, num_labels - 1]` or equal to `-100`
* Ensure the data collator pads labels with `-100`
* Try gradient clipping (e.g., `max_grad_norm=1.0`)

## Citation

If you use this model or its dataset in your research, please cite:

```bibtex
@dataset{thainer_model_2025,
  title        = {Thai Named Entity Recognition},
  author       = {Nattapong Tapachoom},
  year         = {2025},
  howpublished = {https://github.com/JonusNattapong/Natural-Language-Processing}
}
```

---

**Repo:** `JonusNattapong/OpenThai-NER`
**Base model:** `Pavarissy/phayathaibert-thainer`
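As an appendix to the reproducibility notes, the collator requirement can be illustrated without any dependencies. This is a simplified sketch of what `DataCollatorForTokenClassification` does for labels (the helper `pad_labels` is hypothetical): padded positions get `-100` so the default cross-entropy ignore index skips them.

```python
# Sketch of label padding with -100, so loss is not computed on padding.
def pad_labels(batch_labels, max_len, pad_id=-100):
    """Right-pad each label sequence to max_len with the ignore index."""
    return [labels + [pad_id] * (max_len - len(labels)) for labels in batch_labels]

batch = [[0, 3, 2], [1]]
print(pad_labels(batch, max_len=3))  # -> [[0, 3, 2], [1, -100, -100]]
```

Padding labels with a real class id (e.g., `0`) instead of `-100` would let padding tokens contribute to the loss, which both skews training and is a common trigger for the NaN `eval_loss` described above.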