# Thai Job NER – Fine-tuned PhayaThaiBERT (v2: Mixed Training)

Named Entity Recognition model for extracting structured HR data from informal Thai job postings (e.g., Facebook groups, Line chats). Fine-tuned from PhayaThaiBERT (~122M params).

**v2 update:** Trained on style-matched synthetic + real-world data with upsampling. Real-world F1 improved from 0.143 → 0.975.
## Model Description
This model extracts 7 entity types from Thai job-related text:
| Entity | Description | Example |
|---|---|---|
| HARD_SKILL | Skills or procedures | ดูแลผู้สูงอายุ, CPR, Python |
| PERSON | Names | คุณสมชาย, พี่แนน |
| LOCATION | Places | สีลม, ลาดพร้าว, บางนา |
| COMPENSATION | Pay amounts | 18,000 บาท/เดือน |
| EMPLOYMENT_TERMS | Job structure | part-time, กะกลางวัน |
| CONTACT | Phone, Line, email | 081-234-5678, @care123 |
| DEMOGRAPHIC | Age, gender | อายุ 25-40, หญิง |
## Usage
```python
from transformers import pipeline

model_name = "chayuto/thai-job-ner-phayathaibert"
ner = pipeline("ner", model=model_name, aggregation_strategy="simple")

# "Hiring an elderly caregiver, Silom area, salary 18,000 baht, call 081-234-5678"
text = "รับสมัครคนดูแลผู้สูงอายุ ย่านสีลม เงินเดือน 18,000 บาท โทร 081-234-5678"

results = ner(text)
for entity in results:
    print(f"{entity['entity_group']}: {entity['word']} ({entity['score']:.2%})")
```
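Downstream, the aggregated entities can be folded into a structured HR record. A minimal sketch using a mocked `results` list in the pipeline's aggregated output format (`entity_group`/`word`/`score` are the standard keys; the values here are illustrative, not real model output):

```python
from collections import defaultdict

# Mocked output in the shape returned by the aggregation pipeline above.
results = [
    {"entity_group": "HARD_SKILL", "word": "ดูแลผู้สูงอายุ", "score": 0.99},
    {"entity_group": "LOCATION", "word": "สีลม", "score": 0.98},
    {"entity_group": "COMPENSATION", "word": "18,000 บาท", "score": 0.97},
    {"entity_group": "CONTACT", "word": "081-234-5678", "score": 0.99},
]

# Group extracted spans by entity type into one record per post.
record = defaultdict(list)
for ent in results:
    record[ent["entity_group"]].append(ent["word"])

print(dict(record))
```

Each post then becomes one JSON-ready dict mapping entity types to lists of extracted spans.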
## Training
- Base model: `clicknext/phayathaibert` (CamemBERT architecture, ~122M params, XLM-R-derived vocabulary)
- Training data: 619 posts (510 style-matched synthetic + 54 real-world posts upsampled 5x), silver labels from GPT-4o, fuzzy-aligned to IOB2
- Real data proportion: 34.6% (key factor for real-world performance)
- Hardware: Apple Silicon MPS backend, FP32
- Hyperparameters: LR=3e-5, warmup=0.1, batch=2, grad_accum=4, 10 epochs, gradient checkpointing, frozen embeddings
- Training time: ~35 min
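The hyperparameters above map onto a `transformers.TrainingArguments` configuration along these lines (a sketch, not the exact training script; `output_dir` is a placeholder, and embedding freezing happens on the model itself rather than here):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="thai-job-ner",       # placeholder path
    learning_rate=3e-5,              # LR=3e-5
    warmup_ratio=0.1,                # warmup=0.1
    per_device_train_batch_size=2,   # batch=2
    gradient_accumulation_steps=4,   # effective batch size 8
    num_train_epochs=10,
    gradient_checkpointing=True,     # trade recompute for memory
    fp16=False,                      # FP32 only: FP16 is broken on Apple Silicon MPS
)
```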
## Data Pipeline
Raw Thai text + GPT-4o entity extractions → fuzzy alignment with rapidfuzz + pythainlp TCC boundary snapping → subword token mapping via `offset_mapping` → IOB2-formatted HuggingFace Dataset.
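As a rough illustration of the last two steps, character-level entity spans can be projected onto subword tokens using each token's (start, end) character offsets, as returned by a fast tokenizer with `return_offsets_mapping=True`. The sketch below uses hand-written offsets instead of a real tokenizer and a plain overlap test instead of rapidfuzz, so the helper name and data are illustrative only:

```python
def spans_to_iob2(offsets, entity_spans):
    """Map character-level entity spans onto per-token IOB2 labels.

    offsets: list of (start, end) character offsets, one per subword token
    entity_spans: list of (start, end, label) character spans
    """
    labels = ["O"] * len(offsets)
    for ent_start, ent_end, label in entity_spans:
        first = True
        for i, (tok_start, tok_end) in enumerate(offsets):
            # Any character overlap between token and entity span?
            if tok_start < ent_end and tok_end > ent_start:
                labels[i] = ("B-" if first else "I-") + label
                first = False
    return labels

# Toy example: "call 081-234-5678" with a CONTACT span over the phone number.
offsets = [(0, 4), (5, 8), (8, 12), (12, 17)]   # hand-written, not a real tokenizer
spans = [(5, 17, "CONTACT")]
print(spans_to_iob2(offsets, spans))
# ['O', 'B-CONTACT', 'I-CONTACT', 'I-CONTACT']
```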
## Training Strategy
Style-matched synthetic data (v3) was generated to mimic real-world post characteristics: varied formatting, code-switching, informal language. Real-world posts were upsampled 5x to achieve a 20-35% real data proportion, the sweet spot identified through systematic experiments (see below).
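The mixing step itself is simple: replicate the real posts until they make up the target share of the training set. A minimal sketch with stand-in strings for the actual examples (note that 510 synthetic + 54 real posts at 5x reproduce the card's 34.6% real share):

```python
import random

synthetic_posts = [f"synthetic_{i}" for i in range(510)]  # stand-ins for real examples
real_posts = [f"real_{i}" for i in range(54)]

UPSAMPLE = 5  # each real post appears 5 times
train = synthetic_posts + real_posts * UPSAMPLE
random.shuffle(train)  # keep batches mixed

real_share = (len(real_posts) * UPSAMPLE) / len(train)
print(f"{len(train)} examples, {real_share:.1%} real")
```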
## Evaluation
### Real-World Test Set (8 held-out real posts, 98 entities)
| Metric | Score |
|---|---|
| F1 | 0.975 |
| Precision | 0.960 |
| Recall | 0.990 |
### Real-World Per-Entity F1
| Entity | F1 | Precision | Recall | Support |
|---|---|---|---|---|
| COMPENSATION | 1.000 | 1.000 | 1.000 | 15 |
| CONTACT | 1.000 | 1.000 | 1.000 | 10 |
| DEMOGRAPHIC | 1.000 | 1.000 | 1.000 | 14 |
| LOCATION | 1.000 | 1.000 | 1.000 | 8 |
| PERSON | 1.000 | 1.000 | 1.000 | 3 |
| HARD_SKILL | 0.973 | 0.947 | 1.000 | 36 |
| EMPLOYMENT_TERMS | 0.880 | 0.846 | 0.917 | 12 |
5 of 7 entity types achieve perfect F1 on real-world data.
### Mixed Test Set (78 examples)
| Metric | Score |
|---|---|
| F1 | 0.929 |
| Precision | 0.915 |
| Recall | 0.944 |
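The card does not name its scorer, but these are entity-level metrics: a prediction counts only if span boundaries and type both match the gold annotation, as in seqeval's strict mode. That matching can be sketched in pure Python:

```python
def extract_entities(labels):
    """Collect (start, end, type) spans from an IOB2 label sequence."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(labels + ["O"]):  # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O" or (etype and tag[2:] != etype):
            if etype is not None:
                entities.append((start, i, etype))
                etype = None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities

def entity_f1(gold, pred):
    """Strict entity-level precision/recall/F1: spans must match exactly."""
    g, p = set(extract_entities(gold)), set(extract_entities(pred))
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

gold = ["O", "B-CONTACT", "I-CONTACT", "O", "B-LOCATION"]
pred = ["O", "B-CONTACT", "I-CONTACT", "B-LOCATION", "B-LOCATION"]
print(entity_f1(gold, pred))  # precision 2/3, recall 1.0, F1 0.8
```

The spurious `B-LOCATION` at position 3 costs precision but not recall, which is why the two diverge.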
## Experiment History: Closing the Real-World Gap
| Version | Training Data | Real-World F1 | Key Insight |
|---|---|---|---|
| v1 (synthetic-only) | 1,253 synthetic | 0.143 | Synthetic alone fails on real data |
| v1 (real-only) | 37 real posts | 0.558 | Small but 4x more effective per sample |
| v1 (mixed, 4.1% real) | 1,057 mixed | 0.935 | Even a few real posts help dramatically |
| v2 (v3+real5x, 34.6% real) | 619 mixed | 0.975 | Style-matched synthetic + upsampling |
Key finding: real data proportion is the #1 factor. Style-matched synthetic data is 3x more efficient than generic synthetic data.
## Links
- Model: chayuto/thai-job-ner-phayathaibert
- WangchanBERTa variant: chayuto/thai-job-ner-wangchanberta
- Dataset: chayuto/thai-job-ner-dataset
- Source Code: github.com/chayuto/thai-job-nlp-ner
## Limitations
- Real-world evaluation is based on only 8 held-out posts (98 entities), so confidence intervals are wide
- EMPLOYMENT_TERMS remains the weakest entity (F1=0.880) due to boundary ambiguity in schedule/contract terms
- Embeddings were frozen during training (MPS memory constraint); unfreezing on a larger GPU may yield further gains
- 256 token max sequence length (covers >95% of real posts)
- Larger model file size due to 248K vocabulary (vs WangchanBERTa's 25K)
## Technical Notes
- FP16 is broken on MPS; always use FP32 for Apple Silicon training
- PhayaThaiBERT's 248K vocab (XLM-R-derived) requires frozen embeddings + gradient checkpointing to fit on 18GB MPS
- Uses `offset_mapping` for tokenizer-agnostic subword-to-character alignment
- Thai Character Cluster (TCC) boundary snapping prevents Unicode grapheme splitting during alignment
- Real data upsampling (5-10x) is a simple, effective technique for low-resource scenarios
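The embedding-freezing trick can be sketched with a toy PyTorch module (the class, sizes, and names below are illustrative, not the real model; 15 labels corresponds to B/I tags for 7 entity types plus O):

```python
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Toy stand-in for the fine-tuned model: big embedding table + small head."""
    def __init__(self, vocab_size=1000, hidden=16, num_labels=15):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, hidden)
        self.classifier = nn.Linear(hidden, num_labels)

model = TinyEncoder()
# Freeze the embedding table: no gradients or optimizer state are kept for it,
# which is what makes a 248K-row table fit in 18GB on MPS.
model.embeddings.weight.requires_grad_(False)

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the classifier head remains trainable
```

With a real transformers model the same effect comes from `model.get_input_embeddings().weight.requires_grad_(False)` together with `model.gradient_checkpointing_enable()`.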
## License
MIT