Thai Job NER: Fine-tuned WangchanBERTa

Named Entity Recognition model for extracting structured HR data from informal Thai job postings (e.g., Facebook groups, Line chats). Fine-tuned from wangchanberta-base-att-spm-uncased (110M params).

Model Description

This model extracts 7 entity types from Thai job-related text:

| Entity | Description | Example |
|---|---|---|
| HARD_SKILL | Skills or procedures | ดูแลผู้สูงอายุ (elderly care), CPR, Python |
| PERSON | Names | คุณสมชาย (Khun Somchai), พี่แจน (P'Jan) |
| LOCATION | Places | สีลม (Silom), ลาดพร้าว (Lat Phrao), บางนา (Bang Na) |
| COMPENSATION | Pay amounts | 18,000 บาท/เดือน (baht/month) |
| EMPLOYMENT_TERMS | Job structure | part-time, กะกลางวัน (day shift) |
| CONTACT | Phone, Line, email | 081-234-5678, @care123 |
| DEMOGRAPHIC | Age, gender | อายุ 25-40 (age 25-40), หญิง (female) |
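Assuming the standard IOB2 tagging scheme used by the training data, the 7 entity types expand to 15 classification labels. The listing below is illustrative; the authoritative mapping is the `id2label` entry in the model's `config.json`:

```python
# Illustrative IOB2 label set for the 7 entity types (ordering assumed;
# check the model's config.json id2label for the authoritative mapping).
ENTITY_TYPES = [
    "HARD_SKILL", "PERSON", "LOCATION", "COMPENSATION",
    "EMPLOYMENT_TERMS", "CONTACT", "DEMOGRAPHIC",
]

# "O" for outside, plus a B-/I- pair per type: 1 + 2 * 7 = 15 labels.
LABELS = ["O"] + [f"{prefix}-{ent}" for ent in ENTITY_TYPES for prefix in ("B", "I")]

print(len(LABELS))  # 15
```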

Usage

from transformers import pipeline

model_name = "chayuto/thai-job-ner-wangchanberta"
ner = pipeline("ner", model=model_name, aggregation_strategy="simple")

# "Hiring an elderly caregiver, Silom area, salary 18,000 baht, call 081-234-5678"
text = "รับสมัครคนดูแลผู้สูงอายุ ย่านสีลม เงินเดือน 18,000 บาท โทร 081-234-5678"
results = ner(text)
for entity in results:
    print(f"{entity['entity_group']}: {entity['word']} ({entity['score']:.2%})")
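Since the goal is structured HR data, the pipeline output can be folded into a record keyed by entity type. A minimal sketch; the `to_record` helper, the `min_score` threshold, and the mocked entity list are illustrative, not part of the model's API:

```python
from collections import defaultdict

def to_record(entities, min_score=0.5):
    """Group NER pipeline output into a dict keyed by entity type (illustrative helper)."""
    record = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            record[ent["entity_group"]].append(ent["word"])
    return dict(record)

# Mocked pipeline output so the sketch runs without downloading the model.
sample = [
    {"entity_group": "HARD_SKILL", "word": "ดูแลผู้สูงอายุ", "score": 0.98},
    {"entity_group": "LOCATION", "word": "สีลม", "score": 0.97},
    {"entity_group": "COMPENSATION", "word": "18,000 บาท", "score": 0.71},
    {"entity_group": "CONTACT", "word": "081-234-5678", "score": 0.99},
    {"entity_group": "PERSON", "word": "สมชาย", "score": 0.31},  # below threshold, dropped
]
record = to_record(sample)
```

Thresholding on `score` is a pragmatic guard for silver-label models; tune `min_score` per entity type if precision matters more for some fields (e.g. CONTACT) than others.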

Training

  • Base model: airesearch/wangchanberta-base-att-spm-uncased (CamemBERT architecture, 110M params)
  • Training data: 1,253 Thai job posts (synthetic silver labels from GPT-4o, fuzzy-aligned to IOB2); Dataset on HuggingFace
  • Hardware: Apple Silicon MPS backend, FP32
  • Hyperparameters: LR=3e-5, warmup=0.1, batch=8, grad_accum=2, 15 epochs, class-weighted loss, label smoothing=0.05
  • Training time: ~3 min 48 sec
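The hyperparameters above map onto `transformers.TrainingArguments` roughly as follows. This is a sketch, not the exact training script: `output_dir` is a placeholder, and the class-weighted loss is not expressible here (it requires overriding `Trainer.compute_loss` with a weighted `CrossEntropyLoss`):

```python
from transformers import TrainingArguments

# Sketch of the listed hyperparameters; fp16 stays off because FP16 is
# unreliable on the MPS backend (see Technical Notes).
args = TrainingArguments(
    output_dir="thai-job-ner",          # placeholder path
    learning_rate=3e-5,
    warmup_ratio=0.1,
    per_device_train_batch_size=8,      # effective batch of 16 with accumulation
    gradient_accumulation_steps=2,
    num_train_epochs=15,
    label_smoothing_factor=0.05,
    fp16=False,
)
```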

Data Pipeline

Raw Thai text + GPT-4o entity extractions → fuzzy alignment with rapidfuzz + pythainlp TCC boundary snapping → subword token mapping via offset_mapping → IOB2-formatted HuggingFace Dataset.
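The final mapping step (character-level entity spans onto subword tokens via the tokenizer's offset mapping) can be sketched dependency-free. The function name and the toy offsets below are illustrative; real offsets come from calling the tokenizer with `return_offsets_mapping=True`:

```python
def spans_to_iob2(offsets, entity_spans):
    """Project char-level entity spans onto subword tokens as IOB2 tags.

    offsets: [(start, end), ...] per token, as returned by a fast tokenizer
             with return_offsets_mapping=True (special tokens are (0, 0)).
    entity_spans: [(start, end, label), ...] char spans, e.g. from fuzzy alignment.
    """
    tags = []
    for tok_start, tok_end in offsets:
        if tok_start == tok_end:          # special token -> no label
            tags.append("O")
            continue
        tag = "O"
        for ent_start, ent_end, label in entity_spans:
            if tok_start < ent_end and tok_end > ent_start:   # char overlap
                prefix = "B" if tok_start <= ent_start else "I"
                tag = f"{prefix}-{label}"
                break
        tags.append(tag)
    return tags

# Toy example: a CONTACT span covering chars 4..16, split over three subwords.
offsets = [(0, 0), (0, 3), (4, 7), (7, 11), (11, 16), (0, 0)]
spans = [(4, 16, "CONTACT")]
tags = spans_to_iob2(offsets, spans)
# -> ['O', 'O', 'B-CONTACT', 'I-CONTACT', 'I-CONTACT', 'O']
```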

Evaluation

Overall (Test Set, 126 examples)

| Metric | Score |
|---|---|
| F1 | 0.897 |
| Precision | 0.850 |
| Recall | 0.949 |

Per-Entity F1

| Entity | F1 | Precision | Recall |
|---|---|---|---|
| CONTACT | 0.962 | 0.942 | 0.983 |
| LOCATION | 0.959 | 0.928 | 0.991 |
| EMPLOYMENT_TERMS | 0.926 | 0.870 | 0.990 |
| PERSON | 0.907 | 0.861 | 0.958 |
| HARD_SKILL | 0.903 | 0.873 | 0.936 |
| DEMOGRAPHIC | 0.875 | 0.827 | 0.928 |
| COMPENSATION | 0.764 | 0.673 | 0.884 |

Limitations

  • Trained on synthetic data; may underperform on real-world posts with heavy emoji usage, OCR errors, or extreme colloquialism
  • Thai-specific: limited English entity extraction capability
  • 512 token max sequence length
  • COMPENSATION has the lowest per-entity F1 (0.764), driven by low precision (0.673); open-vocabulary entities such as HARD_SKILL also suffer from complex span boundaries

Technical Notes

  • FP16 is broken on MPS; always use FP32 for Apple Silicon training
  • Uses offset_mapping to bypass WangchanBERTa's <_> space token misalignment in char_to_token()
  • Thai Character Cluster (TCC) boundary snapping prevents Unicode grapheme splitting during alignment
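The boundary-snapping idea can be illustrated without pythainlp: given a sorted list of legal cluster boundaries (which the real pipeline gets from pythainlp's TCC segmenter), snap a fuzzy-match index to the nearest one so no span ever cuts a grapheme cluster. The function and the boundary values below are illustrative:

```python
import bisect

def snap_to_boundary(index, boundaries):
    """Snap a character index to the nearest legal cluster boundary.

    boundaries: sorted char positions where a Thai Character Cluster may
    begin or end; supplied directly here to keep the sketch dependency-free.
    """
    pos = bisect.bisect_left(boundaries, index)
    if pos == 0:
        return boundaries[0]
    if pos == len(boundaries):
        return boundaries[-1]
    before, after = boundaries[pos - 1], boundaries[pos]
    return before if index - before <= after - index else after

# Illustrative boundaries for a 9-char Thai string: index 2 falls inside
# a cluster, so it snaps forward to the nearest boundary at 3.
boundaries = [0, 3, 4, 7, 8, 9]
snapped = snap_to_boundary(2, boundaries)  # -> 3
```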

License

MIT
