---
language:
- th
license: cc-by-3.0
pipeline_tag: token-classification
tags:
- ner
- thai
- token-classification
- transformers
- pytorch
- phayathaibert
- phayathaibert-thainer
base_model: Pavarissy/phayathaibert-thainer
datasets:
- JonusNattapong/OpenThai-NER
metrics:
- name: precision
  type: precision
  value: 0.8565
- name: recall
  type: recall
  value: 0.8778
- name: f1
  type: f1
  value: 0.867
- name: accuracy
  type: accuracy
  value: 0.9565
model-index:
- name: OpenThai-NER
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: JonusNattapong/OpenThai-NER
      type: JonusNattapong/OpenThai-NER
    metrics:
    - name: Precision
      type: precision
      value: 0.8565
    - name: Recall
      type: recall
      value: 0.8778
    - name: F1
      type: f1
      value: 0.867
    - name: Accuracy
      type: accuracy
      value: 0.9565
new_version: Pavarissy/phayathaibert-thainer
---
# OpenThai-NER
Fine-tuned **Pavarissy/phayathaibert-thainer** for **Thai Named Entity Recognition (NER)** on the **JonusNattapong/OpenThai-NER** dataset.
## Model Details
- **Base model:** `Pavarissy/phayathaibert-thainer`
- **Task:** Token Classification (NER)
- **Language:** Thai (`th`)
- **License:** cc-by-3.0
- **Notebook:** [Google Colab](https://colab.research.google.com/drive/1Vy5bXO0BaZnYCYKbA9J3VagFtutMWcTY#scrollTo=M_cCxkSuUaXV)
## Intended Use
This model is intended for:
- Extracting named entities from Thai text (people, organizations, locations, etc. depending on your label schema)
- Downstream tasks like document indexing, entity-based search, analytics, and data labeling acceleration
## Limitations / Known Issues
- **`eval_loss` is NaN.** Common causes include:
  - fp16 mixed-precision numerical instability
  - incorrect handling of the `-100` label mask in the loss
  - out-of-range label ids in a batch
  - occasional overflow when logits become extreme

  To fix it, try disabling fp16, validate the label ids, and ensure your data collator pads labels with `-100`.
- **Domain shift:** Performance may drop on domains/styles not present in training data (e.g., very informal slang, OCR noise, code-mixed text).
- **Entity boundary ambiguity:** Thai tokenization and spacing can cause boundary errors (span too long/short), especially with uncommon names.
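The label-validation step suggested above can be sketched in plain Python. Note that `num_labels=13` and the batch contents below are illustrative assumptions, not this model's actual configuration:

```python
def find_invalid_labels(batch_labels, num_labels):
    """Return (row, col, label) triples for label ids that are neither -100
    (the ignore index) nor a valid class index in [0, num_labels - 1]."""
    bad = []
    for i, row in enumerate(batch_labels):
        for j, label in enumerate(row):
            if label != -100 and not (0 <= label < num_labels):
                bad.append((i, j, label))
    return bad

# Example: one padded batch of label ids (values are made up for illustration)
batch = [
    [0, 3, 3, -100, -100],
    [0, 7, 99, -100, -100],  # 99 is out of range and would corrupt the loss
]
print(find_invalid_labels(batch, num_labels=13))  # → [(1, 2, 99)]
```

Running a check like this over the training set before fine-tuning quickly rules out (or confirms) invalid label ids as the source of a NaN loss.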
## Evaluation
### Final evaluation (epoch 3)
- **Precision:** 0.8565
- **Recall:** 0.8778
- **F1:** 0.8670
- **Accuracy:** 0.9565
- **Runtime:** 29.8202s
### Training summary
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|------:|--------------:|----------------:|----------:|-------:|---:|---------:|
| 1 | 0.369000 | NaN | 0.787043 | 0.824356 | 0.805268 | 0.932532 |
| 2 | 0.237600 | NaN | 0.841745 | 0.855728 | 0.848679 | 0.949934 |
| 3 | 0.195900 | NaN | 0.856493 | 0.877835 | 0.867033 | 0.956475 |
> The validation loss is NaN in every epoch while precision, recall, and F1 improve consistently, which suggests a loss-computation or logging issue rather than random corruption or a diverging model.
## How to Use
### Transformers Pipeline
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
repo_id = "JonusNattapong/OpenThai-NER"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)
ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = "กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย และบริษัท ABC จำกัด ตั้งอยู่ที่บางนา"
print(ner(text))
```
### Manual inference
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
repo_id = "JonusNattapong/OpenThai-NER"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)
text = "นายสมชายทำงานที่กระทรวงการคลัง"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

pred_ids = outputs.logits.argmax(-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, pred_ids)))
```
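To turn the raw `pred_ids` into readable tags, map them through the checkpoint's `model.config.id2label`. A minimal sketch, using a hypothetical mapping for illustration (the real label set comes from this repo's config):

```python
# Hypothetical id2label mapping; in practice use model.config.id2label
# from the loaded checkpoint instead of hard-coding it.
id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG"}

pred_ids = [0, 1, 2, 0, 3]  # e.g. from outputs.logits.argmax(-1)[0].tolist()
labels = [id2label[i] for i in pred_ids]
print(labels)  # → ['O', 'B-PER', 'I-PER', 'O', 'B-ORG']
```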
## Training Details
* **Epochs:** 3
* **Learning Rate:** 2e-5
* **Batch Size:** 16
* **Framework:** Hugging Face Transformers Trainer
## Reproducibility Notes
If you want fully stable loss:
* Set `fp16=False` (or `bf16=True` if available)
* Validate label ids are within `[0, num_labels-1]` or `-100`
* Ensure the data collator pads labels with `-100`
* Try gradient clipping (e.g., `max_grad_norm=1.0`)
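The checklist above can be expressed as a `TrainingArguments` configuration sketch. These are hypothetical stability-focused settings (the `output_dir` is a placeholder; the hyperparameters mirror the Training Details section), not the exact configuration used to train this model:

```python
from transformers import TrainingArguments, DataCollatorForTokenClassification

args = TrainingArguments(
    output_dir="openthai-ner-ft",   # placeholder path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    fp16=False,                     # avoid fp16 numerical instability
    max_grad_norm=1.0,              # gradient clipping
)

# DataCollatorForTokenClassification pads labels with -100 by default,
# so padded positions are ignored by the loss:
# collator = DataCollatorForTokenClassification(tokenizer)
```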
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{thainer_model_2025,
  title        = {Thai Named Entity Recognition},
  author       = {Nattapong Tapachoom},
  year         = {2025},
  howpublished = {\url{https://github.com/JonusNattapong/Natural-Language-Processing}}
}
```
---
**Repo:** `JonusNattapong/OpenThai-NER`
**Base model:** `Pavarissy/phayathaibert-thainer`