---
language:
- th
license: cc-by-3.0
pipeline_tag: token-classification
tags:
- ner
- thai
- token-classification
- transformers
- pytorch
- phayathaibert
- phayathaibert-thainer
base_model: Pavarissy/phayathaibert-thainer
datasets:
- JonusNattapong/OpenThai-NER
metrics:
- name: precision
  type: precision
  value: 0.8565
- name: recall
  type: recall
  value: 0.8778
- name: f1
  type: f1
  value: 0.867
- name: accuracy
  type: accuracy
  value: 0.9565
model-index:
- name: OpenThai-NER
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: JonusNattapong/OpenThai-NER
      type: JonusNattapong/OpenThai-NER
    metrics:
    - name: Precision
      type: precision
      value: 0.8565
    - name: Recall
      type: recall
      value: 0.8778
    - name: F1
      type: f1
      value: 0.867
    - name: Accuracy
      type: accuracy
      value: 0.9565
new_version: Pavarissy/phayathaibert-thainer
---
# OpenThai-NER
Fine-tuned **Pavarissy/phayathaibert-thainer** for **Thai Named Entity Recognition (NER)** on the **JonusNattapong/OpenThai-NER** dataset.
## Model Details
- **Base model:** `Pavarissy/phayathaibert-thainer`
- **Task:** Token Classification (NER)
- **Language:** Thai (`th`)
- **License:** cc-by-3.0
- **Notebook:** [Google Colab](https://colab.research.google.com/drive/1Vy5bXO0BaZnYCYKbA9J3VagFtutMWcTY#scrollTo=M_cCxkSuUaXV)
## Intended Use
This model is intended for:
- Extracting named entities from Thai text (people, organizations, locations, etc. depending on your label schema)
- Downstream tasks like document indexing, entity-based search, analytics, and data labeling acceleration
## Limitations / Known Issues
- **`eval_loss` is NaN.** Common causes:
  - numerical instability under fp16 mixed precision
  - incorrect label-mask (`-100`) handling in the loss
  - invalid label ids in a batch (out of range)
  - occasional overflow if logits become extreme

  To fix it: try disabling fp16, validate the label ids, and ensure your data collator pads labels with `-100`.
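As a quick sanity check for the label issue, a pure-Python sketch like the following (with a hypothetical `num_labels`) can scan label sequences before training:

```python
def find_bad_label_ids(label_seqs, num_labels):
    """Return (sequence_idx, position, label) for every label id that is
    neither the ignore index -100 nor a valid class id in [0, num_labels)."""
    bad = []
    for s, labels in enumerate(label_seqs):
        for i, label in enumerate(labels):
            if label != -100 and not (0 <= label < num_labels):
                bad.append((s, i, label))
    return bad

# Hypothetical label sequences; the second contains an out-of-range id (7).
label_seqs = [[0, 3, -100, 2], [1, 7, -100]]
print(find_bad_label_ids(label_seqs, num_labels=5))  # -> [(1, 1, 7)]
```

An empty result means every label is either `-100` or a valid class id, ruling out the out-of-range cause.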
- **Domain shift:** Performance may drop on domains/styles not present in training data (e.g., very informal slang, OCR noise, code-mixed text).
- **Entity boundary ambiguity:** Thai tokenization and spacing can cause boundary errors (span too long/short), especially with uncommon names.
## Evaluation
### Final evaluation (epoch 3)
- **Precision:** 0.8565
- **Recall:** 0.8778
- **F1:** 0.8670
- **Accuracy:** 0.9565
- **Runtime:** 29.8202s
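The reported F1 is consistent with the precision/recall pair; a one-line harmonic-mean check:

```python
precision, recall = 0.8565, 0.8778
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # -> 0.867
```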
### Training summary
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|------:|--------------:|----------------:|----------:|-------:|---:|---------:|
| 1 | 0.369000 | NaN | 0.787043 | 0.824356 | 0.805268 | 0.932532 |
| 2 | 0.237600 | NaN | 0.841745 | 0.855728 | 0.848679 | 0.949934 |
| 3 | 0.195900 | NaN | 0.856493 | 0.877835 | 0.867033 | 0.956475 |
> The validation loss is NaN in every epoch while precision, recall, and F1 improve consistently, which points to a loss-computation or logging issue rather than corrupted training.
## How to Use
### Transformers Pipeline
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
repo_id = "JonusNattapong/OpenThai-NER"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)
ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = "กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย และบริษัท ABC จำกัด ตั้งอยู่ที่บางนา"
print(ner(text))
```
### Manual inference
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
repo_id = "JonusNattapong/OpenThai-NER"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)
text = "นายสมชายทำงานที่กระทรวงการคลัง"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
pred_ids = outputs.logits.argmax(-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, pred_ids)))
```
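The raw `pred_ids` are class indices; the checkpoint's `config.id2label` dict maps them to tag strings. A sketch with a hypothetical three-label mapping (in practice, use `model.config.id2label` from the loaded model):

```python
# Hypothetical mapping for illustration; the real one ships with the checkpoint.
id2label = {0: "O", 1: "B-PER", 2: "I-PER"}

tokens = ["นาย", "สมชาย", "ทำงาน"]
pred_ids = [1, 2, 0]

# Pair each token with its predicted tag string.
tagged = [(tok, id2label[i]) for tok, i in zip(tokens, pred_ids)]
print(tagged)  # -> [('นาย', 'B-PER'), ('สมชาย', 'I-PER'), ('ทำงาน', 'O')]
```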
## Training Details
* **Epochs:** 3
* **Learning Rate:** 2e-5
* **Batch Size:** 16
* **Framework:** Hugging Face Transformers Trainer
## Reproducibility Notes
If you want fully stable loss:
* Set `fp16=False` (or `bf16=True` if available)
* Validate label ids are within `[0, num_labels-1]` or `-100`
* Ensure the data collator pads labels with `-100`
* Try gradient clipping (e.g., `max_grad_norm=1.0`)
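The last two points interact: `DataCollatorForTokenClassification` pads label sequences with `-100` so padded positions are ignored by the loss. A minimal pure-Python sketch of that padding behavior:

```python
def pad_labels(label_seqs, ignore_index=-100):
    """Pad variable-length label sequences to the batch max length,
    using the ignore index so padding never contributes to the loss."""
    max_len = max(len(seq) for seq in label_seqs)
    return [seq + [ignore_index] * (max_len - len(seq)) for seq in label_seqs]

print(pad_labels([[0, 3, 2], [1]]))  # -> [[0, 3, 2], [1, -100, -100]]
```

Padding labels with a real class id (for example `0` for `O`) instead of `-100` would silently pull the loss toward the padding label.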
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{tapachoom2025thainer,
  title        = {Thai Named Entity Recognition},
  author       = {Nattapong Tapachoom},
  year         = {2025},
  howpublished = {\url{https://github.com/JonusNattapong/Natural-Language-Processing}}
}
```
---
**Repo:** `JonusNattapong/OpenThai-NER`
**Base model:** `Pavarissy/phayathaibert-thainer` |