# Thai Job NER – Fine-tuned WangchanBERTa
Named Entity Recognition model for extracting structured HR data from informal Thai job postings (e.g., Facebook groups, Line chats). Fine-tuned from wangchanberta-base-att-spm-uncased (110M params).
## Model Description
This model extracts 7 entity types from Thai job-related text:
| Entity | Description | Example |
|---|---|---|
| HARD_SKILL | Skills or procedures | ดูแลผู้สูงอายุ (elderly care), CPR, Python |
| PERSON | Names | คุณสมชาย (Khun Somchai), พี่แนน |
| LOCATION | Places | สีลม (Silom), ลาดพร้าว (Lat Phrao), บางนา (Bang Na) |
| COMPENSATION | Pay amounts | 18,000 บาท/เดือน (18,000 baht/month) |
| EMPLOYMENT_TERMS | Job structure | part-time, กะกลางวัน (day shift) |
| CONTACT | Phone, Line, email | 081-234-5678, @care123 |
| DEMOGRAPHIC | Age, gender | อายุ 25-40 (age 25-40), หญิง (female) |
## Usage

```python
from transformers import pipeline

model_name = "chayuto/thai-job-ner-wangchanberta"
ner = pipeline("ner", model=model_name, aggregation_strategy="simple")

text = "รับสมัครคนดูแลผู้สูงอายุ ย่านสีลม เงินเดือน 18,000 บาท โทร 081-234-5678"
results = ner(text)

for entity in results:
    print(f"{entity['entity_group']}: {entity['word']} ({entity['score']:.2%})")
```
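With `aggregation_strategy="simple"`, the pipeline returns one dict per merged entity with `entity_group`, `word`, and `score` keys. A minimal sketch of collecting that output into a structured job record (the helper name and the 0.5 confidence cutoff are arbitrary choices for illustration, not part of the model):

```python
from collections import defaultdict

def to_job_record(entities, min_score=0.5):
    """Group NER pipeline output into one list per entity type.

    `entities` is the list returned by the HF NER pipeline with
    aggregation_strategy="simple"; `min_score` is an arbitrary
    confidence cutoff chosen for this sketch.
    """
    record = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            record[ent["entity_group"]].append(ent["word"])
    return dict(record)
```

A downstream consumer can then read, e.g., `record.get("CONTACT", [])` to pull all phone numbers and Line IDs from a post.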
## Training

- Base model: `airesearch/wangchanberta-base-att-spm-uncased` (CamemBERT architecture, 110M params)
- Training data: 1,253 Thai job posts with synthetic silver labels from GPT-4o, fuzzy-aligned to IOB2 (published as `chayuto/thai-job-ner-dataset` on HuggingFace)
- Hardware: Apple Silicon MPS backend, FP32
- Hyperparameters: LR=3e-5, warmup=0.1, batch=8, grad_accum=2, 15 epochs, class-weighted loss, label smoothing=0.05
- Training time: ~3 min 48 sec
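The hyperparameters above map onto `transformers.TrainingArguments` keyword names roughly as sketched below. This is an illustrative mapping, not the actual training script; the class-weighted loss would additionally require overriding `Trainer.compute_loss`, which is not shown.

```python
# Hypothetical mapping of the listed hyperparameters onto
# transformers.TrainingArguments keyword names.
training_kwargs = dict(
    learning_rate=3e-5,             # LR=3e-5
    warmup_ratio=0.1,               # warmup=0.1
    per_device_train_batch_size=8,  # batch=8
    gradient_accumulation_steps=2,  # grad_accum=2
    num_train_epochs=15,
    label_smoothing_factor=0.05,
    fp16=False,                     # FP16 is unreliable on MPS; stay in FP32
)

# Gradient accumulation gives an effective batch size of 8 * 2 = 16.
effective_batch = (training_kwargs["per_device_train_batch_size"]
                   * training_kwargs["gradient_accumulation_steps"])
```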
## Data Pipeline

Raw Thai text + GPT-4o entity extractions → fuzzy alignment with rapidfuzz + pythainlp TCC boundary snapping → subword token mapping via `offset_mapping` → IOB2-formatted HuggingFace Dataset.
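A pure-Python toy of the alignment step, under stated simplifications: exact substring search stands in for rapidfuzz fuzzy matching, whitespace tokens stand in for subwords, and TCC boundary snapping is omitted. The function name and `entities` shape are hypothetical.

```python
def align_to_iob2(text, entities):
    """Toy alignment: map LLM-reported entity strings to IOB2 tags.

    `entities` maps a label to the surface string the LLM extracted.
    Returns (tokens, tags) where tags follow the IOB2 scheme.
    """
    # Character-level span found for each entity surface string.
    spans = []
    for label, surface in entities.items():
        start = text.find(surface)   # rapidfuzz fuzzy match in the real pipeline
        if start != -1:
            spans.append((start, start + len(surface), label))

    tokens, tags, cursor = [], [], 0
    for tok in text.split():
        start = text.index(tok, cursor)
        end = start + len(tok)
        cursor = end
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:
                # B- on the first token of a span, I- on continuations.
                tag = ("B-" if start == s else "I-") + label
                break
        tokens.append(tok)
        tags.append(tag)
    return tokens, tags
```

The real pipeline does the same span-to-tag projection, but over subword tokens via the tokenizer's `offset_mapping` rather than whitespace tokens.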
## Evaluation

### Overall (Test Set, 126 examples)
| Metric | Score |
|---|---|
| F1 | 0.897 |
| Precision | 0.850 |
| Recall | 0.949 |
### Per-Entity F1
| Entity | F1 | Precision | Recall |
|---|---|---|---|
| CONTACT | 0.962 | 0.942 | 0.983 |
| LOCATION | 0.959 | 0.928 | 0.991 |
| EMPLOYMENT_TERMS | 0.926 | 0.870 | 0.990 |
| PERSON | 0.907 | 0.861 | 0.958 |
| HARD_SKILL | 0.903 | 0.873 | 0.936 |
| DEMOGRAPHIC | 0.875 | 0.827 | 0.928 |
| COMPENSATION | 0.764 | 0.673 | 0.884 |
## Links
- Model: chayuto/thai-job-ner-wangchanberta
- Dataset: chayuto/thai-job-ner-dataset
- Source Code: github.com/chayuto/thai-job-nlp-ner
## Limitations

- Trained on synthetic data, so it may underperform on real-world posts with heavy emoji usage, OCR errors, or extreme colloquialism
- Thai-specific: limited English entity extraction capability
- 512 token max sequence length
- COMPENSATION has the lowest per-entity F1 (0.764), likely due to its open vocabulary and variable span boundaries
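One possible workaround for the 512-token limit is sliding-window chunking before inference. The sketch below is an illustrative helper (the 64-token stride is an arbitrary choice), not part of the released model:

```python
def chunk_tokens(token_ids, max_len=512, stride=64):
    """Split a long token sequence into overlapping windows so posts
    beyond the 512-token limit can still be processed.

    Consecutive windows overlap by `stride` tokens so entities near a
    window boundary appear whole in at least one window.
    """
    if len(token_ids) <= max_len:
        return [token_ids]
    chunks, step = [], max_len - stride
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
    return chunks
```

Entities predicted in the overlap region would then need de-duplication when merging window results.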
## Technical Notes

- FP16 is broken on MPS; always use FP32 for Apple Silicon training
- Uses `offset_mapping` to bypass WangchanBERTa's `<_>` space-token misalignment in `char_to_token()`
- Thai Character Cluster (TCC) boundary snapping prevents Unicode grapheme splitting during alignment
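The `offset_mapping` workaround can be sketched as follows: instead of calling `char_to_token()`, compare each token's `(start, end)` character offsets (as returned by a fast tokenizer with `return_offsets_mapping=True`) against the entity's character span. The offsets in the example below are hand-made for illustration, not produced by the real tokenizer.

```python
def span_to_token_tags(offsets, span_start, span_end, label):
    """Assign IOB2 tags by intersecting each token's character offsets
    with an entity's character span, avoiding char_to_token() entirely.

    `offsets` is one (start, end) pair per subword token, with (0, 0)
    for special tokens such as <s> and </s>.
    """
    tags, inside = [], False
    for start, end in offsets:
        if start == end:                               # special token
            tags.append("O")
        elif end <= span_start or start >= span_end:   # outside the span
            tags.append("O")
        else:                                          # overlaps the span
            tags.append(("I-" if inside else "B-") + label)
            inside = True
    return tags
```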
## License
MIT