Thai Job NER – Fine-tuned PhayaThaiBERT (v2: Mixed Training)

Named Entity Recognition model for extracting structured HR data from informal Thai job postings (e.g., Facebook groups, Line chats). Fine-tuned from PhayaThaiBERT (~122M params).

v2 update: Trained on style-matched synthetic + real-world data with upsampling. Real-world F1 improved from 0.143 to 0.975.

Model Description

This model extracts 7 entity types from Thai job-related text:

| Entity | Description | Example |
|---|---|---|
| HARD_SKILL | Skills or procedures | ดูแลผู้สูงอายุ (elderly care), CPR, Python |
| PERSON | Names | คุณสมชาย (Khun Somchai), พี่แจน (Pi Jan) |
| LOCATION | Places | สีลม (Silom), ลาดพร้าว (Lat Phrao), บางนา (Bang Na) |
| COMPENSATION | Pay amounts | 18,000 บาท/เดือน (baht/month) |
| EMPLOYMENT_TERMS | Job structure | part-time, กะกลางวัน (day shift) |
| CONTACT | Phone, Line, email | 081-234-5678, @care123 |
| DEMOGRAPHIC | Age, gender | อายุ 25-40 (age 25-40), หญิง (female) |

Usage

```python
from transformers import pipeline

model_name = "chayuto/thai-job-ner-phayathaibert"
ner = pipeline("ner", model=model_name, aggregation_strategy="simple")

# "Hiring an elderly caregiver, Silom area, salary 18,000 baht, call 081-234-5678"
text = "รับสมัครคนดูแลผู้สูงอายุ ย่านสีลม เงินเดือน 18,000 บาท โทร 081-234-5678"
results = ner(text)
for entity in results:
    print(f"{entity['entity_group']}: {entity['word']} ({entity['score']:.2%})")
```

Training

  • Base model: clicknext/phayathaibert (CamemBERT architecture, ~122M params, XLM-R-derived vocabulary)
  • Training data: 619 posts (510 style-matched synthetic + 54 real-world posts upsampled 5x), silver labels from GPT-4o, fuzzy-aligned to IOB2
  • Real data proportion: 34.6% (key factor for real-world performance)
  • Hardware: Apple Silicon MPS backend, FP32
  • Hyperparameters: LR=3e-5, warmup=0.1, batch=2, grad_accum=4, 10 epochs, gradient checkpointing, frozen embeddings
  • Training time: ~35 min
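
The hyperparameters above map onto a standard `transformers` Trainer setup. A sketch, not the exact training script: `output_dir` is an assumed name, and the attribute path used to freeze the embeddings may differ for this architecture.

```python
from transformers import AutoModelForTokenClassification, TrainingArguments

model = AutoModelForTokenClassification.from_pretrained(
    "clicknext/phayathaibert",
    num_labels=15,  # 7 entity types as B-/I- pairs in IOB2, plus "O"
)

# Freeze the 248K-row embedding table (MPS memory constraint)
for param in model.base_model.embeddings.parameters():
    param.requires_grad = False

args = TrainingArguments(
    output_dir="thai-job-ner",       # assumed name
    learning_rate=3e-5,
    warmup_ratio=0.1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch size 8
    num_train_epochs=10,
    gradient_checkpointing=True,
    fp16=False,                      # FP16 is broken on MPS; stay in FP32
)
```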

Data Pipeline

Raw Thai text + GPT-4o entity extractions → fuzzy alignment with rapidfuzz + pythainlp TCC boundary snapping → subword token mapping via offset_mapping → IOB2-formatted HuggingFace Dataset.
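
The last step, projecting character-level entity spans onto subword tokens, can be sketched in pure Python. The offsets and spans below are hypothetical stand-ins for a tokenizer's `return_offsets_mapping=True` output and the aligned GPT-4o extractions; the rapidfuzz/pythainlp alignment itself is omitted.

```python
def spans_to_iob2(offsets, char_spans):
    """Map character-level (start, end, label) spans to per-token IOB2 tags.

    offsets: one (start, end) character offset pair per subword token.
    char_spans: entity spans in the raw text as (start, end, label).
    """
    tags = ["O"] * len(offsets)
    for ent_start, ent_end, label in char_spans:
        began = False
        for i, (tok_start, tok_end) in enumerate(offsets):
            if tok_start == tok_end:  # special tokens like <s>, </s>
                continue
            if tok_start < ent_end and tok_end > ent_start:  # token overlaps span
                tags[i] = ("I-" if began else "B-") + label
                began = True
    return tags

# Toy example: 3 tokens, the last two covered by one LOCATION span
offsets = [(0, 4), (5, 8), (8, 12)]
print(spans_to_iob2(offsets, [(5, 12, "LOCATION")]))
# ['O', 'B-LOCATION', 'I-LOCATION']
```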

Training Strategy

Style-matched synthetic data (v3) was generated to mimic real-world post characteristics: varied formatting, code-switching, informal language. Real-world posts were upsampled 5x to reach a 20-35% real-data proportion, the sweet spot identified through systematic experiments (see below).
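
The upsampling itself is plain duplication; picking the duplication factor for a target real-data proportion reduces to one inequality. A minimal sketch (the counts are illustrative, not the exact v2 split):

```python
import math

def upsample_factor(n_real, n_synth, target_ratio):
    """Smallest integer factor k with k*n_real / (n_synth + k*n_real) >= target_ratio."""
    # Solve k*r >= ratio*(s + k*r)  =>  k >= ratio*s / (r*(1 - ratio))
    k = math.ceil(target_ratio * n_synth / (n_real * (1 - target_ratio)))
    return max(k, 1)

# Illustrative numbers: 54 real posts, 510 synthetic, aiming for >= 25% real
print(upsample_factor(54, 510, 0.25))  # 4  ->  4*54 / (510 + 4*54) ~ 29.7% real
```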

Evaluation

Real-World Test Set (8 held-out real posts, 98 entities)

| Metric | Score |
|---|---|
| F1 | 0.975 |
| Precision | 0.960 |
| Recall | 0.990 |
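
As a sanity check, the reported F1 is the harmonic mean of the precision and recall above:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.960, 0.990), 3))  # 0.975
```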

Real-World Per-Entity F1

| Entity | F1 | Precision | Recall | Support |
|---|---|---|---|---|
| COMPENSATION | 1.000 | 1.000 | 1.000 | 15 |
| CONTACT | 1.000 | 1.000 | 1.000 | 10 |
| DEMOGRAPHIC | 1.000 | 1.000 | 1.000 | 14 |
| LOCATION | 1.000 | 1.000 | 1.000 | 8 |
| PERSON | 1.000 | 1.000 | 1.000 | 3 |
| HARD_SKILL | 0.973 | 0.947 | 1.000 | 36 |
| EMPLOYMENT_TERMS | 0.880 | 0.846 | 0.917 | 12 |

5 of 7 entity types achieve perfect F1 on real-world data.

Mixed Test Set (78 examples)

| Metric | Score |
|---|---|
| F1 | 0.929 |
| Precision | 0.915 |
| Recall | 0.944 |

Experiment History: Closing the Real-World Gap

| Version | Training Data | Real-World F1 | Key Insight |
|---|---|---|---|
| v1 (synthetic-only) | 1,253 synthetic | 0.143 | Synthetic alone fails on real data |
| v1 (real-only) | 37 real posts | 0.558 | Small but 4x more effective per sample |
| v1 (mixed, 4.1% real) | 1,057 mixed | 0.935 | Even a few real posts help dramatically |
| v2 (v3+real5x, 34.6% real) | 619 mixed | 0.975 | Style-matched synthetic + upsampling |

Key finding: real data proportion is the #1 factor. Style-matched synthetic data is 3x more efficient than generic synthetic data.

Limitations

  • Real-world evaluation is based on 8 held-out posts (98 entities); confidence intervals are wide
  • EMPLOYMENT_TERMS remains the weakest entity (F1=0.880), reflecting boundary ambiguity in schedule/contract terms
  • Embeddings were frozen during training (an MPS memory constraint); unfreezing on a larger GPU may yield further gains
  • 256 token max sequence length (covers >95% of real posts)
  • Larger model file size due to 248K vocabulary (vs WangchanBERTa's 25K)

Technical Notes

  • FP16 is broken on MPS; always use FP32 for Apple Silicon training
  • PhayaThaiBERT's 248K vocab (XLM-R-derived) requires frozen embeddings + gradient checkpointing to fit on 18GB MPS
  • Uses offset_mapping for tokenizer-agnostic subword-to-character alignment
  • Thai Character Cluster (TCC) boundary snapping prevents Unicode grapheme splitting during alignment
  • Real data upsampling (5-10x) is a simple, effective technique for low-resource scenarios
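
The TCC boundary snapping mentioned above can be sketched with the stdlib `bisect` module: a fuzzy-match endpoint is moved to the nearest legal cluster boundary so no Thai grapheme cluster is split. The boundaries are hardcoded here for illustration; in practice they would come from pythainlp's TCC subword tokenizer.

```python
import bisect

def snap_to_boundary(index, boundaries):
    """Snap a character index to the nearest Thai Character Cluster boundary.

    boundaries: sorted character offsets at which a new TCC starts
    (hypothetical here; derived from pythainlp's TCC segmentation in practice).
    """
    pos = bisect.bisect_left(boundaries, index)
    if pos == 0:
        return boundaries[0]
    if pos == len(boundaries):
        return boundaries[-1]
    before, after = boundaries[pos - 1], boundaries[pos]
    # Ties go to the earlier boundary
    return before if index - before <= after - index else after

# Clusters spanning 0-2, 2-3, 3-6: index 5 falls inside the last cluster
print(snap_to_boundary(5, [0, 2, 3, 6]))  # 6 (nearest boundary of the 3-6 cluster)
```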

License

MIT
