chayuto/thai-job-ner-dataset
How to use chayuto/thai-job-ner-wangchanberta with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("token-classification", model="chayuto/thai-job-ner-wangchanberta")
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("chayuto/thai-job-ner-wangchanberta", dtype="auto")

Named Entity Recognition model for extracting structured HR data from informal Thai job postings (e.g., Facebook groups, Line chats). Fine-tuned from wangchanberta-base-att-spm-uncased (110M params).
This model extracts 7 entity types from Thai job-related text:
| Entity | Description | Example |
|---|---|---|
| HARD_SKILL | Skills or procedures | ดูแลผู้สูงอายุ, CPR, Python |
| PERSON | Names | คุณสมชาย, พี่แจน |
| LOCATION | Places | สีลม, ลาดพร้าว, บางนา |
| COMPENSATION | Pay amounts | 18,000 บาท/เดือน |
| EMPLOYMENT_TERMS | Job structure | part-time, กะกลางวัน |
| CONTACT | Phone, Line, email | 081-234-5678, @care123 |
| DEMOGRAPHIC | Age, gender | อายุ 25-40, หญิง |
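Because the training data is IOB2-formatted, the seven entity types in the table imply a tag set of 15 labels: a `B-` and an `I-` tag per type, plus `O`. The sketch below only illustrates that label space. The ordering is an assumption, and the authoritative mapping is whatever the loaded model reports in `model.config.id2label`.

```python
# The seven entity types listed in the table.
ENTITY_TYPES = [
    "HARD_SKILL", "PERSON", "LOCATION", "COMPENSATION",
    "EMPLOYMENT_TERMS", "CONTACT", "DEMOGRAPHIC",
]

# IOB2: "O" for non-entity tokens, then a B-/I- pair per entity type.
# NOTE: this ordering is illustrative only; read the real mapping from
# model.config.id2label instead of relying on it.
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

print(len(LABELS))  # 15
```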
from transformers import pipeline

model_name = "chayuto/thai-job-ner-wangchanberta"
# aggregation_strategy="simple" merges subword tokens into whole entity spans
ner = pipeline("ner", model=model_name, aggregation_strategy="simple")

# "Hiring an elderly caregiver, Silom area, salary 18,000 baht, call 081-234-5678"
text = "รับสมัครคนดูแลผู้สูงอายุ ย่านสีลม เงินเดือน 18,000 บาท โทร 081-234-5678"
results = ner(text)
for entity in results:
    print(f"{entity['entity_group']}: {entity['word']} ({entity['score']:.2%})")
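For downstream HR use it is often handier to collect the pipeline output into one record per posting. The following is a minimal sketch, assuming the aggregated output format of the `ner` pipeline (dicts with `entity_group`, `word`, and `score`). The helper name `group_entities`, the `min_score` parameter, and the sample scores are hypothetical and not real model output.

```python
from collections import defaultdict


def group_entities(results, min_score=0.0):
    """Group aggregated NER results by entity type, dropping low-score hits."""
    grouped = defaultdict(list)
    for ent in results:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"].strip())
    return dict(grouped)


# Hand-written stand-in for pipeline output (scores are made up).
sample = [
    {"entity_group": "HARD_SKILL", "word": "ดูแลผู้สูงอายุ", "score": 0.98},
    {"entity_group": "LOCATION", "word": "สีลม", "score": 0.97},
    {"entity_group": "COMPENSATION", "word": "18,000 บาท", "score": 0.60},
    {"entity_group": "CONTACT", "word": "081-234-5678", "score": 0.99},
]

record = group_entities(sample, min_score=0.70)
print(record)
```

With `min_score=0.70` the made-up COMPENSATION hit is filtered out. Whether a threshold is worthwhile, and at what value, depends on the application.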
Base model: airesearch/wangchanberta-base-att-spm-uncased (CamemBERT architecture, 110M params).

Data pipeline: raw Thai text + GPT-4o entity extractions → fuzzy alignment with rapidfuzz + pythainlp TCC boundary snapping → subword token mapping via offset_mapping → IOB2-formatted HuggingFace Dataset.
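The last two stages of that pipeline can be sketched in plain Python. This is an illustration under stated assumptions, not the author's code: `snap_span` and `spans_to_iob2` are hypothetical helpers, snapping outward to the enclosing boundaries is one possible policy, and the boundaries and offsets are hand-written toy values rather than output of pythainlp or the tokenizer. In practice the offsets would come from a fast tokenizer called with `return_offsets_mapping=True`.

```python
def snap_span(start, end, boundaries):
    """Expand a character span outward to the nearest allowed cut positions.

    `boundaries` must include 0 and len(text), e.g. positions produced by a
    Thai character cluster (TCC) segmenter.
    """
    new_start = max(b for b in boundaries if b <= start)
    new_end = min(b for b in boundaries if b >= end)
    return new_start, new_end


def spans_to_iob2(offsets, spans):
    """Assign IOB2 tags to subword tokens from character-level entity spans.

    offsets: (char_start, char_end) per token; special tokens are (0, 0).
    spans:   non-overlapping (char_start, char_end, entity_type) triples.
    """
    tags = ["O"] * len(offsets)
    for span_start, span_end, label in spans:
        inside = False
        for i, (tok_start, tok_end) in enumerate(offsets):
            if tok_start == tok_end:
                continue  # skip special or empty tokens
            # Tag any token whose character range overlaps the span.
            if tok_start < span_end and tok_end > span_start:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags


# Toy example: text = "call 081 now", tokens = [CLS] call 08 1 now [SEP]
offsets = [(0, 0), (0, 4), (5, 7), (7, 8), (9, 12), (0, 0)]
boundaries = [0, 4, 5, 8, 9, 12]

start, end = snap_span(6, 8, boundaries)  # a slightly misaligned span
tags = spans_to_iob2(offsets, [(start, end, "CONTACT")])
print(tags)
```

Working from character offsets keeps the labeling independent of how the tokenizer splits or merges whitespace, which is the property the pipeline relies on.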
| Metric | Score |
|---|---|
| F1 | 0.897 |
| Precision | 0.850 |
| Recall | 0.949 |
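The reported F1 is consistent with the precision and recall above under the standard harmonic-mean definition. A quick check (the function name is illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


overall = f1_score(0.850, 0.949)
print(round(overall, 3))  # 0.897
```

Recall is higher than precision, so the model's errors lean toward spurious entities rather than missed ones.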
| Entity | F1 | Precision | Recall |
|---|---|---|---|
| CONTACT | 0.962 | 0.942 | 0.983 |
| LOCATION | 0.959 | 0.928 | 0.991 |
| EMPLOYMENT_TERMS | 0.926 | 0.870 | 0.990 |
| PERSON | 0.907 | 0.861 | 0.958 |
| HARD_SKILL | 0.903 | 0.873 | 0.936 |
| DEMOGRAPHIC | 0.875 | 0.827 | 0.928 |
| COMPENSATION | 0.764 | 0.673 | 0.884 |
Token-to-character alignment uses offset_mapping to bypass WangchanBERTa's <_> space token misalignment in char_to_token().

License: MIT