Thai Job NER – Fine-tuned PhayaThaiBERT (v2: Mixed Training)

Named Entity Recognition model for extracting structured HR data from informal Thai job postings (e.g., Facebook groups, Line chats). Fine-tuned from PhayaThaiBERT (~122M params).

v2 update: Trained on style-matched synthetic + real-world data with upsampling. Real-world F1 improved from 0.143 to 0.975.

Model Description

This model extracts 7 entity types from Thai job-related text:

| Entity | Description | Example |
|---|---|---|
| HARD_SKILL | Skills or procedures | ดูแลผู้สูงอายุ (elderly care), CPR, Python |
| PERSON | Names | คุณสมชาย (Khun Somchai), พี่แจน (Pi Jan) |
| LOCATION | Places | สีลม (Silom), ลาดพร้าว (Lat Phrao), บางนา (Bang Na) |
| COMPENSATION | Pay amounts | 18,000 บาท/เดือน (baht/month) |
| EMPLOYMENT_TERMS | Job structure | part-time, กะกลางวัน (day shift) |
| CONTACT | Phone, Line, email | 081-234-5678, @care123 |
| DEMOGRAPHIC | Age, gender | อายุ 25-40 (age 25-40), หญิง (female) |

Usage

```python
from transformers import pipeline

model_name = "chayuto/thai-job-ner-phayathaibert"
ner = pipeline("ner", model=model_name, aggregation_strategy="simple")

# "Hiring an elderly caregiver, Silom area, salary 18,000 baht, call 081-234-5678"
text = "รับสมัครคนดูแลผู้สูงอายุ ย่านสีลม เงินเดือน 18,000 บาท โทร 081-234-5678"
results = ner(text)
for entity in results:
    print(f"{entity['entity_group']}: {entity['word']} ({entity['score']:.2%})")
```

Training

  • Base model: clicknext/phayathaibert (CamemBERT architecture, ~122M params, XLM-R-derived vocabulary)
  • Training data: 619 posts (510 style-matched synthetic + 54 real-world posts upsampled 5x), silver labels from GPT-4o, fuzzy-aligned to IOB2
  • Real data proportion: 34.6% (key factor for real-world performance)
  • Hardware: Apple Silicon MPS backend, FP32
  • Hyperparameters: LR=3e-5, warmup=0.1, batch=2, grad_accum=4, 10 epochs, gradient checkpointing, frozen embeddings
  • Training time: ~35 min
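
The hyperparameters above map onto a standard `transformers` Trainer setup. A sketch, not the exact training script: `output_dir` is an assumed name, and the attribute path used to freeze the embeddings may differ for this architecture.

```python
from transformers import AutoModelForTokenClassification, TrainingArguments

model = AutoModelForTokenClassification.from_pretrained(
    "clicknext/phayathaibert",
    num_labels=15,  # 7 entity types as B-/I- pairs in IOB2, plus "O"
)

# Freeze the 248K-row embedding table (MPS memory constraint)
for param in model.base_model.embeddings.parameters():
    param.requires_grad = False

args = TrainingArguments(
    output_dir="thai-job-ner",       # assumed name
    learning_rate=3e-5,
    warmup_ratio=0.1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch size 8
    num_train_epochs=10,
    gradient_checkpointing=True,
    fp16=False,                      # FP16 is broken on MPS; stay in FP32
)
```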

Data Pipeline

Raw Thai text + GPT-4o entity extractions → fuzzy alignment with rapidfuzz + pythainlp TCC boundary snapping → subword token mapping via offset_mapping → IOB2-formatted HuggingFace Dataset.
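
The last step, projecting character-level entity spans onto subword tokens, can be sketched in pure Python. The offsets and spans below are hypothetical stand-ins for a tokenizer's `return_offsets_mapping=True` output and the aligned GPT-4o extractions; the rapidfuzz/pythainlp alignment itself is omitted.

```python
def spans_to_iob2(offsets, char_spans):
    """Map character-level (start, end, label) spans to per-token IOB2 tags.

    offsets: one (start, end) character offset pair per subword token.
    char_spans: entity spans in the raw text as (start, end, label).
    """
    tags = ["O"] * len(offsets)
    for ent_start, ent_end, label in char_spans:
        began = False
        for i, (tok_start, tok_end) in enumerate(offsets):
            if tok_start == tok_end:  # special tokens like <s>, </s>
                continue
            if tok_start < ent_end and tok_end > ent_start:  # token overlaps span
                tags[i] = ("I-" if began else "B-") + label
                began = True
    return tags

# Toy example: 3 tokens, the last two covered by one LOCATION span
offsets = [(0, 4), (5, 8), (8, 12)]
print(spans_to_iob2(offsets, [(5, 12, "LOCATION")]))
# ['O', 'B-LOCATION', 'I-LOCATION']
```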

Training Strategy

Style-matched synthetic data (v3) was generated to mimic real-world post characteristics: varied formatting, code-switching, informal language. Real-world posts were upsampled 5x to reach a 20-35% real-data proportion, the sweet spot identified through systematic experiments (see below).
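
The upsampling itself is plain duplication; picking the duplication factor for a target real-data proportion reduces to one inequality. A minimal sketch (the counts are illustrative, not the exact v2 split):

```python
import math

def upsample_factor(n_real, n_synth, target_ratio):
    """Smallest integer factor k with k*n_real / (n_synth + k*n_real) >= target_ratio."""
    # Solve k*r >= ratio*(s + k*r)  =>  k >= ratio*s / (r*(1 - ratio))
    k = math.ceil(target_ratio * n_synth / (n_real * (1 - target_ratio)))
    return max(k, 1)

# Illustrative numbers: 54 real posts, 510 synthetic, aiming for >= 25% real
print(upsample_factor(54, 510, 0.25))  # 4  ->  4*54 / (510 + 4*54) ~ 29.7% real
```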

Evaluation

Real-World Test Set (8 held-out real posts, 98 entities)

| Metric | Score |
|---|---|
| F1 | 0.975 |
| Precision | 0.960 |
| Recall | 0.990 |
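
As a sanity check, the reported F1 is the harmonic mean of the precision and recall above:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.960, 0.990), 3))  # 0.975
```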

Real-World Per-Entity F1

| Entity | F1 | Precision | Recall | Support |
|---|---|---|---|---|
| COMPENSATION | 1.000 | 1.000 | 1.000 | 15 |
| CONTACT | 1.000 | 1.000 | 1.000 | 10 |
| DEMOGRAPHIC | 1.000 | 1.000 | 1.000 | 14 |
| LOCATION | 1.000 | 1.000 | 1.000 | 8 |
| PERSON | 1.000 | 1.000 | 1.000 | 3 |
| HARD_SKILL | 0.973 | 0.947 | 1.000 | 36 |
| EMPLOYMENT_TERMS | 0.880 | 0.846 | 0.917 | 12 |

5 of 7 entity types achieve perfect F1 on real-world data.

Mixed Test Set (78 examples)

| Metric | Score |
|---|---|
| F1 | 0.929 |
| Precision | 0.915 |
| Recall | 0.944 |

Experiment History: Closing the Real-World Gap

| Version | Training Data | Real-World F1 | Key Insight |
|---|---|---|---|
| v1 (synthetic-only) | 1,253 synthetic | 0.143 | Synthetic alone fails on real data |
| v1 (real-only) | 37 real posts | 0.558 | Small but 4x more effective per sample |
| v1 (mixed, 4.1% real) | 1,057 mixed | 0.935 | Even a few real posts help dramatically |
| v2 (v3+real5x, 34.6% real) | 619 mixed | 0.975 | Style-matched synthetic + upsampling |

Key finding: real data proportion is the #1 factor. Style-matched synthetic data is 3x more efficient than generic synthetic data.

Limitations

  • Real-world evaluation is based on 8 held-out posts (98 entities); confidence intervals are wide
  • EMPLOYMENT_TERMS remains the weakest entity (F1=0.880), reflecting boundary ambiguity in schedule/contract terms
  • Embeddings were frozen during training (an MPS memory constraint); unfreezing on a larger GPU may yield further gains
  • 256 token max sequence length (covers >95% of real posts)
  • Larger model file size due to 248K vocabulary (vs WangchanBERTa's 25K)

Technical Notes

  • FP16 is broken on MPS; always use FP32 for Apple Silicon training
  • PhayaThaiBERT's 248K vocab (XLM-R-derived) requires frozen embeddings + gradient checkpointing to fit on 18GB MPS
  • Uses offset_mapping for tokenizer-agnostic subword-to-character alignment
  • Thai Character Cluster (TCC) boundary snapping prevents Unicode grapheme splitting during alignment
  • Real data upsampling (5-10x) is a simple, effective technique for low-resource scenarios
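
The TCC boundary snapping mentioned above can be sketched with the stdlib `bisect` module: a fuzzy-match endpoint is moved to the nearest legal cluster boundary so no Thai grapheme cluster is split. The boundaries are hardcoded here for illustration; in practice they would come from pythainlp's TCC subword tokenizer.

```python
import bisect

def snap_to_boundary(index, boundaries):
    """Snap a character index to the nearest Thai Character Cluster boundary.

    boundaries: sorted character offsets at which a new TCC starts
    (hypothetical here; derived from pythainlp's TCC segmentation in practice).
    """
    pos = bisect.bisect_left(boundaries, index)
    if pos == 0:
        return boundaries[0]
    if pos == len(boundaries):
        return boundaries[-1]
    before, after = boundaries[pos - 1], boundaries[pos]
    # Ties go to the earlier boundary
    return before if index - before <= after - index else after

# Clusters spanning 0-2, 2-3, 3-6: index 5 falls inside the last cluster
print(snap_to_boundary(5, [0, 2, 3, 6]))  # 6 (nearest boundary of the 3-6 cluster)
```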

License

MIT
