|
|
--- |
|
|
datasets: |
|
|
- pythainlp/thainer-corpus-v2 |
|
|
language: |
|
|
- th |
|
|
base_model: |
|
|
- clicknext/phayathaibert |
|
|
pipeline_tag: token-classification |
|
|
library_name: transformers |
|
|
tags: |
|
|
- medical |
|
|
--- |
|
|
# No Name Thai NER |
|
|
|
|
|
|
|
<img src="mascot-image-landscape.png" alt="mascot" style="width: 600px; height: auto; display: block; margin: 0 auto;"> |
|
|
<div style="display: flex; justify-content: center; align-items: center; gap: 20px; margin-bottom: 20px;"> |
|
|
<img src="Looloohealth.png" alt="Looloo Health" style="width: 250px; height: auto;"> |
|
|
<img src="PresScribe.png" alt="PresScribe" style="width: 250px; height: auto;">
|
|
</div> |
|
|
|
|
|
|
|
|
Compact Thai token-classification model optimized for fast named-entity recognition (NER) and practical medical-text deidentification. This checkpoint was trained for robust entity detection on Thai clinical and conversational text and is intended for use in context-preserving anonymization pipelines. |
|
|
|
|
|
At [**Looloo Health**](https://looloohealth.com/en/), we're passionate about making healthcare more accessible and affordable for everyone. |
|
|
The model is a core component of our AI Medical Scribe, [**PresScribe**](https://www.youtube.com/watch?v=oUiJ9oPgZMA), where it helps ensure patient privacy through automated de-identification. |
|
|
We believe that unlocking the potential of clinical data is key to this goal, and we're excited to share our work with the community. |
|
|
|
|
|
|
|
|
**Features** |
|
|
- Detects common sensitive entity types found in medical text (names, phone numbers, IDs, addresses, dates, etc.). |
|
|
- Lightweight and fast to run on **CPUs** with the Hugging Face `transformers` pipeline. |
|
|
- Designed to be used as part of a deidentification workflow (post-processing recommended to merge token-level spans). |
|
|
- Trained on a **comprehensive synthetic dataset of over 300,000 samples**, ensuring it is robust and generalizable. |
|
|
- On our internal test set, we achieved over 95% accuracy for our specific use case. |
|
|
|
|
|
|
|
|
**Supported entity labels** |
|
|
- PERSON |
|
|
- PHONE |
|
|
- EMAIL |
|
|
- ADDRESS (sometimes labelled as LOCATION) |
|
|
- DATE |
|
|
- NATIONAL_ID |
|
|
- HOSPITAL_IDS |
|
|
|
|
|
## Quick start |
|
|
|
|
|
Install minimal dependencies: |
|
|
|
|
|
```bash
pip install -U transformers torch
```
|
|
|
|
|
Load and run the model with Hugging Face pipelines: |
|
|
|
|
|
```python
from transformers import pipeline

# device=-1 runs the pipeline on CPU
ner = pipeline("token-classification", model="loolootech/no-name-ner-th", device=-1)

# "Mr. Somchai, what brings you in today? Oh, my liver hurts today.
#  Then let me examine you in detail. Sure thing, Mark."
text = "คุณสมชายเป็นอะไรมาครับวันนี้ อ๋อวันนี้ปวดตับครับ งั้นวันนี้หมอขอตรวจละเอียดหน่อยนะ ได้เลยครับน้องมาร์ค"

results = ner(text)
print(results)
```
|
|
|
|
|
Notes on post-processing (see our [example notebook](https://github.com/loolootech/no-name-ner-th/blob/main/example.ipynb) for details):
|
|
- The pipeline returns token-level predictions (B-/I- tags). For redaction or anonymization, merge adjacent tokens with the same label into full spans before replacing them with entity-specific redaction tokens (e.g. `[PERSON]`, `[PHONE]`).
|
|
- When redacting, replace spans from right-to-left or rebuild the output string from slices to avoid offset shifts. |
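The merging and right-to-left redaction steps above can be sketched as follows. This is a minimal illustration: `merge_spans`, `redact`, and the dummy predictions are ours, and the model's actual label set may differ from the `PERSON`/`PHONE` labels shown here.

```python
def merge_spans(tokens):
    """Merge adjacent B-/I- token predictions into full entity spans."""
    spans = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        if spans and prefix == "I" and spans[-1]["label"] == label:
            spans[-1]["end"] = tok["end"]  # extend the current span
        else:
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    return spans


def redact(text, spans):
    """Replace spans right-to-left so earlier character offsets stay valid."""
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[: span["start"]] + f"[{span['label']}]" + text[span["end"]:]
    return text


# Dummy token-level predictions in the shape the pipeline returns
# (entity labels and offsets here are illustrative, not real model output).
sample = "Call somchai at 0812345678"
preds = [
    {"entity": "B-PERSON", "start": 5, "end": 12},
    {"entity": "B-PHONE", "start": 16, "end": 26},
]
print(redact(sample, merge_spans(preds)))  # Call [PERSON] at [PHONE]
```

Rebuilding the string from the rightmost span first means each replacement never disturbs the offsets of the spans still to be processed.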
|
|
|
|
|
|
|
|
## Disclaimer |
|
|
|
|
|
* This model is intended as an assistive tool for de-identification. It is not a substitute for professional, legal, or medical advice. |
|
|
|
|
|
* Users are fully responsible for ensuring compliance with applicable privacy, legal, and regulatory requirements. |
|
|
|
|
|
* While efforts have been made to improve accuracy, no automated system is 100% reliable. We strongly recommend implementing a regular human review process to validate outputs. |
|
|
|
|
|
|
|
|
## **License** |
|
|
This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License ([CC BY-NC 4.0](LICENSE)). |
|
|
|
|
|
- For commercial usage, please contact us at contact@looloohealth.com.
|
|
|
|
|
|
|
|
## **Citation** |
|
|
|
|
|
If you use the model, please cite it with the following BibTeX entry:
|
|
|
|
|
```
@misc{no_name_ner_th,
  author    = {Atirut Boribalburephan and Chiraphat Boonnag and Knot Pipatsrisawat},
  title     = {no-name-ner-th},
  year      = {2025},
  url       = {https://huggingface.co/loolootech/no-name-ner-th},
  publisher = {Hugging Face}
}
```
|
|
|
|
|
|
|
|
## **Acknowledgement** |
|
|
We extend our gratitude to the `PhayaThaiBERT` team and `Pavarissy/phayathaibert-thainer` for providing the initial checkpoint for our model, which served as a crucial starting point. We also acknowledge PyThaiNLP for their invaluable contribution of the `thainer-corpus-v2` dataset, which was essential for training and evaluation. |