---
language:
- th
license: cc-by-3.0
pipeline_tag: token-classification
tags:
- ner
- thai
- token-classification
- transformers
- pytorch
- phayathaibert
- phayathaibert-thainer
base_model: Pavarissy/phayathaibert-thainer
datasets:
- JonusNattapong/OpenThai-NER
metrics:
- name: precision
  type: precision
  value: 0.8565
- name: recall
  type: recall
  value: 0.8778
- name: f1
  type: f1
  value: 0.867
- name: accuracy
  type: accuracy
  value: 0.9565
model-index:
- name: OpenThai-NER
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: JonusNattapong/OpenThai-NER
      type: JonusNattapong/OpenThai-NER
    metrics:
    - name: Precision
      type: precision
      value: 0.8565
    - name: Recall
      type: recall
      value: 0.8778
    - name: F1
      type: f1
      value: 0.867
    - name: Accuracy
      type: accuracy
      value: 0.9565
new_version: Pavarissy/phayathaibert-thainer
---

# OpenThai-NER

Fine-tuned **Pavarissy/phayathaibert-thainer** for **Thai Named Entity Recognition (NER)** on the **JonusNattapong/OpenThai-NER** dataset.

## Model Details

- **Base model:** `Pavarissy/phayathaibert-thainer`
- **Task:** Token Classification (NER)
- **Language:** Thai (`th`)
- **License:** cc-by-3.0
- **Google Colab:** [Google Colab](https://colab.research.google.com/drive/1Vy5bXO0BaZnYCYKbA9J3VagFtutMWcTY#scrollTo=M_cCxkSuUaXV)

## Intended Use

This model is intended for:

- Extracting named entities from Thai text (people, organizations, locations, etc., depending on the label schema)
- Downstream tasks such as document indexing, entity-based search, analytics, and data-labeling acceleration

## Limitations / Known Issues

- **`eval_loss` is NaN.** Common causes:
  - fp16 mixed-precision numerical instability
  - incorrect handling of the `-100` label mask in the loss
  - invalid (out-of-range) label ids in a batch
  - occasional overflow when logits become extreme

  To fix it, try disabling fp16, validate the labels, and ensure the data collator pads labels with `-100`.
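One of the causes above, out-of-range label ids, is easy to rule out before training. Below is a minimal, framework-free sketch (the helper name `validate_labels` is hypothetical, not part of this repo) that scans batches of label sequences and reports any id that is neither `-100` nor a valid class index:

```python
# Hypothetical sanity check for NaN eval loss: every label id must be
# either -100 (the ignore index) or within [0, num_labels - 1].
def validate_labels(batches, num_labels):
    """Return a list of (batch_index, bad_label_id) pairs."""
    bad = []
    for i, labels in enumerate(batches):
        for lab in labels:
            if lab != -100 and not (0 <= lab < num_labels):
                bad.append((i, lab))
    return bad

# Example: label id 9 is out of range for a 9-class schema (valid ids 0-8).
batches = [[0, 3, -100, 8], [1, 9, -100]]
print(validate_labels(batches, num_labels=9))  # -> [(1, 9)]
```

If this returns a non-empty list, the cross-entropy loss can silently index past the logit dimension, which is a frequent source of NaN losses.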
- **Domain shift:** Performance may drop on domains or styles not present in the training data (e.g., very informal slang, OCR noise, code-mixed text).
- **Entity boundary ambiguity:** Thai tokenization and spacing can cause boundary errors (spans that are too long or too short), especially with uncommon names.

## Evaluation

### Final evaluation (epoch 3)

- **Precision:** 0.8565
- **Recall:** 0.8778
- **F1:** 0.8670
- **Accuracy:** 0.9565
- **Runtime:** 29.8202 s

### Training summary

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|------:|--------------:|----------------:|----------:|-------:|---:|---------:|
| 1 | 0.369000 | NaN | 0.787043 | 0.824356 | 0.805268 | 0.932532 |
| 2 | 0.237600 | NaN | 0.841745 | 0.855728 | 0.848679 | 0.949934 |
| 3 | 0.195900 | NaN | 0.856493 | 0.877835 | 0.867033 | 0.956475 |

> The validation loss being NaN across all epochs reinforces that this is likely a logging or loss-computation issue rather than random corruption, because the metric trend is consistent and improving.
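The precision/recall/F1 figures above are entity-level scores, i.e., a prediction only counts as correct if the full entity span and its type match the gold annotation (the convention used by `seqeval`). As a dependency-free illustration of that scoring scheme, here is a minimal sketch over BIO tags (the helpers `extract_spans` and `prf` are illustrative, not part of this repo):

```python
# Entity-level precision/recall/F1 over BIO tag sequences (seqeval-style).
def extract_spans(tags):
    """Return a set of (start, end, type) entity spans from a BIO sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # "O" sentinel flushes the last span
        ends_span = tag == "O" or tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != etype
        )
        if ends_span and start is not None:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

def prf(gold_tags, pred_tags):
    """Precision, recall, and F1 over exact span-and-type matches."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["B-PER", "I-PER", "O", "B-ORG"]
pred = ["B-PER", "I-PER", "O", "B-LOC"]  # wrong type on the second entity
print(prf(gold, pred))  # -> (0.5, 0.5, 0.5)
```

Note how the mislabeled `B-LOC` span costs both a false positive and a false negative, which is why entity-level F1 is stricter than token accuracy.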
## How to Use

### Transformers Pipeline

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

repo_id = "JonusNattapong/OpenThai-NER"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

text = "กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย และบริษัท ABC จำกัด ตั้งอยู่ที่บางนา"
print(ner(text))
```

### Manual inference

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

repo_id = "JonusNattapong/OpenThai-NER"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)

text = "นายสมชายทำงานที่กระทรวงการคลัง"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

pred_ids = outputs.logits.argmax(-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, pred_ids)))
```

## Training Details

* **Epochs:** 3
* **Learning rate:** 2e-5
* **Batch size:** 16
* **Framework:** Hugging Face Transformers `Trainer`

## Reproducibility Notes

For a fully stable loss:

* Set `fp16=False` (or `bf16=True` if available)
* Validate that label ids are within `[0, num_labels - 1]` or equal to `-100`
* Ensure the data collator pads labels with `-100`
* Try gradient clipping (e.g., `max_grad_norm=1.0`)

## Citation

If you use this model or its dataset in your research, please cite:

```bibtex
@dataset{thainer_model_2025,
  title        = {Thai Named Entity Recognition},
  author       = {Nattapong Tapachoom},
  year         = {2025},
  howpublished = {https://github.com/JonusNattapong/Natural-Language-Processing}
}
```

---

**Repo:** `JonusNattapong/OpenThai-NER`
**Base model:** `Pavarissy/phayathaibert-thainer`
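As an appendix to the reproducibility notes, the collator requirement can be illustrated without any dependencies. This is a simplified sketch of what `DataCollatorForTokenClassification` does for labels (the helper `pad_labels` is hypothetical): padded positions get `-100` so the default cross-entropy ignore index skips them.

```python
# Sketch of label padding with -100, so loss is not computed on padding.
def pad_labels(batch_labels, max_len, pad_id=-100):
    """Right-pad each label sequence to max_len with the ignore index."""
    return [labels + [pad_id] * (max_len - len(labels)) for labels in batch_labels]

batch = [[0, 3, 2], [1]]
print(pad_labels(batch, max_len=3))  # -> [[0, 3, 2], [1, -100, -100]]
```

Padding labels with a real class id (e.g., `0`) instead of `-100` would let padding tokens contribute to the loss, which both skews training and is a common trigger for the NaN `eval_loss` described above.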