|
|
--- |
|
|
datasets: |
|
|
- pythainlp/thainer-corpus-v2 |
|
|
language: |
|
|
- th |
|
|
base_model: |
|
|
- clicknext/phayathaibert |
|
|
pipeline_tag: token-classification |
|
|
library_name: transformers |
|
|
tags: |
|
|
- medical |
|
|
--- |
|
|
# No Name Thai NER |
|
|
|
|
|
|
|
<img src="mascot-image-landscape.png" alt="mascot" style="width: 600px; height: auto; display: block; margin: 0 auto;"> |
|
|
<div style="display: flex; justify-content: center; align-items: center; gap: 20px; margin-bottom: 20px;"> |
|
|
<img src="Looloohealth.png" alt="Looloo Health" style="width: 250px; height: auto;"> |
|
|
<img src="PresScribe.png" alt="PresScribe" style="width: 250px; height: auto;">
|
|
</div> |
|
|
|
|
|
|
|
|
Compact Thai token-classification model optimized for fast named-entity recognition (NER) and practical medical-text deidentification. This checkpoint was trained for robust entity detection on Thai clinical and conversational text and is intended for use in context-preserving anonymization pipelines. |
|
|
|
|
|
At [**Looloo Health**](https://looloohealth.com/en/), we're passionate about making healthcare more accessible and affordable for everyone. |
|
|
The model is a core component of our AI Medical Scribe, [**PresScribe**](https://www.youtube.com/watch?v=oUiJ9oPgZMA), where it helps ensure patient privacy through automated de-identification. |
|
|
We believe that unlocking the potential of clinical data is key to this goal, and we're excited to share our work with the community. |
|
|
|
|
|
|
|
|
**Features** |
|
|
- Detects common sensitive entity types found in medical text (names, phone numbers, IDs, addresses, dates, etc.). |
|
|
- Lightweight and fast to run on **CPUs** with the Hugging Face `transformers` pipeline. |
|
|
- Designed to be used as part of a deidentification workflow (post-processing recommended to merge token-level spans). |
|
|
- Trained on a **comprehensive synthetic dataset of over 300,000 samples**, ensuring it is robust and generalizable. |
|
|
- On our internal test set, we achieved over 95% accuracy for our specific use case. |
|
|
|
|
|
|
|
|
**Supported entity labels** |
|
|
- PERSON |
|
|
- PHONE |
|
|
- EMAIL |
|
|
- ADDRESS (sometimes labelled as LOCATION) |
|
|
- DATE |
|
|
- NATIONAL_ID |
|
|
- HOSPITAL_IDS |
|
|
|
|
|
## Quick start |
|
|
|
|
|
Install minimal dependencies: |
|
|
|
|
|
```bash
pip install -U transformers torch
```
|
|
|
|
|
Load and run the model with Hugging Face pipelines: |
|
|
|
|
|
```python
from transformers import pipeline

# device=-1 runs the pipeline on CPU
ner = pipeline("token-classification", model="loolootech/no-name-ner-th", device=-1)

# "Mr. Somchai, what brings you in today? Oh, my liver hurts today.
#  Then let me examine you in detail. Sure thing, Mark."
text = "คุณสมชายเป็นอะไรมาครับวันนี้ อ๋อวันนี้ปวดตับครับ งั้นวันนี้หมอขอตรวจละเอียดหน่อยนะ ได้เลยครับน้องมาร์ค"

results = ner(text)
print(results)
```
|
|
|
|
|
Notes on post-processing (see our [example notebook](https://github.com/loolootech/no-name-ner-th/blob/main/example.ipynb) for details):
|
|
- The pipeline returns token-level predictions (B-/I- tags). For redaction or anonymization, merge adjacent tokens with the same label into full spans before replacing them with entity-specific redaction tokens (e.g. `[PERSON]`, `[PHONE]`).
|
|
- When redacting, replace spans from right-to-left or rebuild the output string from slices to avoid offset shifts. |
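The merging and right-to-left redaction steps above can be sketched as follows. This is a minimal illustration: `merge_spans`, `redact`, and the dummy predictions are ours, and the model's actual label set may differ from the `PERSON`/`PHONE` labels shown here.

```python
def merge_spans(tokens):
    """Merge adjacent B-/I- token predictions into full entity spans."""
    spans = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        if spans and prefix == "I" and spans[-1]["label"] == label:
            spans[-1]["end"] = tok["end"]  # extend the current span
        else:
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    return spans


def redact(text, spans):
    """Replace spans right-to-left so earlier character offsets stay valid."""
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[: span["start"]] + f"[{span['label']}]" + text[span["end"]:]
    return text


# Dummy token-level predictions in the shape the pipeline returns
# (entity labels and offsets here are illustrative, not real model output).
sample = "Call somchai at 0812345678"
preds = [
    {"entity": "B-PERSON", "start": 5, "end": 12},
    {"entity": "B-PHONE", "start": 16, "end": 26},
]
print(redact(sample, merge_spans(preds)))  # Call [PERSON] at [PHONE]
```

Rebuilding the string from the rightmost span first means each replacement never disturbs the offsets of the spans still to be processed.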
|
|
|
|
|
|
|
|
## Disclaimer |
|
|
|
|
|
* This model is intended as an assistive tool for de-identification. It is not a substitute for professional, legal, or medical advice. |
|
|
|
|
|
* Users are fully responsible for ensuring compliance with applicable privacy, legal, and regulatory requirements. |
|
|
|
|
|
* While efforts have been made to improve accuracy, no automated system is 100% reliable. We strongly recommend implementing a regular human review process to validate outputs. |
|
|
|
|
|
|
|
|
## **License** |
|
|
This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License ([CC BY-NC 4.0](LICENSE)). |
|
|
|
|
|
- For commercial usage, please contact us at contact@looloohealth.com.
|
|
|
|
|
|
|
|
## **Citation** |
|
|
|
|
|
If you use the model, please cite it with the following BibTeX entry:
|
|
|
|
|
```
@misc{no_name_ner_th,
  author    = {Atirut Boribalburephan and Chiraphat Boonnag and Knot Pipatsrisawat},
  title     = {no-name-ner-th},
  year      = {2025},
  url       = {https://huggingface.co/loolootech/no-name-ner-th},
  publisher = {Hugging Face}
}
```
|
|
|
|
|
|
|
|
## **Acknowledgement** |
|
|
We extend our gratitude to the `PhayaThaiBERT` team and `Pavarissy/phayathaibert-thainer` for providing the initial checkpoint for our model, which served as a crucial starting point. We also acknowledge PyThaiNLP for their invaluable contribution of the `thainer-corpus-v2` dataset, which was essential for training and evaluation. |