---
license: cc-by-nc-4.0
language: he
base_model: onlplab/alephbert-base
tags:
- text-classification
- hebrew
- medical
---
# MedTextBERT
A Hebrew medical document classifier fine-tuned from [AlephBERT](https://huggingface.co/onlplab/alephbert-base).
Classifies extracted text into 24 document categories covering a wide range of medical specialties.
Built as part of a privacy-first Android app that performs 100% offline OCR on Hebrew medical documents.
## Performance
| Metric | Score |
|--------|-------|
| Accuracy | 93.8% |
| F1 | 93.75% |
Evaluated on a held-out test set after 20 epochs of fine-tuning.
## Categories
`family_medicine` `cardiology` `cardiology_procedures` `imaging`
`diabetes_endocrinology` `pathology` `pediatrics` `orthopedics`
`neurology` `psychiatry` `urology` `surgery` `gastroenterology`
`hematology` `pulmonology` `dermatology` `infections_inflammation`
`gynecology` `oncology` `pharmacy` `emergency_medicine`
`geriatrics_rehabilitation` `administration_general` `lab_results`
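For downstream code that needs to validate or map predictions, the labels above can be kept as a constant. This is a minimal sketch: it assumes the model emits exactly these strings as labels, which this card does not confirm (the `id2label` mapping in the model's `config.json` is the place to check). The helper name `is_known_category` is hypothetical.

```python
# The 24 category names listed in this card, in the order they appear above.
# Assumption: the classifier emits exactly these strings as its labels.
CATEGORIES = (
    "family_medicine", "cardiology", "cardiology_procedures", "imaging",
    "diabetes_endocrinology", "pathology", "pediatrics", "orthopedics",
    "neurology", "psychiatry", "urology", "surgery", "gastroenterology",
    "hematology", "pulmonology", "dermatology", "infections_inflammation",
    "gynecology", "oncology", "pharmacy", "emergency_medicine",
    "geriatrics_rehabilitation", "administration_general", "lab_results",
)


def is_known_category(label: str) -> bool:
    """Return True if `label` is one of the categories documented in this card."""
    return label in CATEGORIES
```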
## Training Data
Fine-tuned on a synthetically generated dataset of more than 4,500 labeled Hebrew medical documents.
The dataset covers edge cases and category variations to improve generalization across real-world formats.
## Usage
```python
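from transformers import pipeline

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub
classifier = pipeline(
    "text-classification",
    model="annaadar/MedTextBERT",
    tokenizer="annaadar/MedTextBERT",
)

# Input (Hebrew): "After a routine blood test, normal values were found"
result = classifier("לאחר בדיקת דם שגרתית, נמצאו ערכים תקינים")
print(result)  # a list with one dict holding the top "label" and its "score"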
```
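When a document could plausibly belong to more than one category, it can help to look at the top few scores instead of only the best one. The sketch below is one possible way to do that; `top_categories` and `load_classifier` are hypothetical helper names, and `max_length=512` assumes the usual BERT-base input limit, which this card does not state.

```python
from typing import Callable, List, Tuple


def top_categories(classifier: Callable, text: str, k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (label, score) pairs for one document."""
    # Passing a list yields one ranked result list per input; take the first.
    # truncation/max_length clip long OCR output (512 is an assumed limit).
    ranked = classifier([text], top_k=k, truncation=True, max_length=512)[0]
    return [(item["label"], float(item["score"])) for item in ranked]


def load_classifier():
    """Build the pipeline; downloads the model from the Hub on first use."""
    from transformers import pipeline  # imported lazily so the helper above stays lightweight

    return pipeline("text-classification", model="annaadar/MedTextBERT")


# Example usage (requires network access on the first run):
#   clf = load_classifier()
#   for label, score in top_categories(clf, "לאחר בדיקת דם שגרתית, נמצאו ערכים תקינים"):
#       print(f"{label}: {score:.3f}")
```

Keeping the classifier as a parameter makes the ranking helper easy to exercise with a stub, without loading the model.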
## Limitations
- Trained on synthetic data; performance on real-world clinical documents may vary
- Designed for Hebrew text only
- Not validated for clinical or diagnostic use
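Because the model is designed for Hebrew text only, a caller may want to screen inputs before classifying them. The following is a rough sketch of such a guard, not part of the model: the function names and the `min_ratio=0.5` threshold are hypothetical and would need tuning on real OCR output.

```python
def hebrew_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are Hebrew letters (U+05D0..U+05EA)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hebrew = sum(1 for ch in letters if "\u05d0" <= ch <= "\u05ea")
    return hebrew / len(letters)


def looks_hebrew(text: str, min_ratio: float = 0.5) -> bool:
    """Heuristic check; min_ratio is an assumed threshold, not a validated one."""
    return hebrew_ratio(text) >= min_ratio
```

Digits, punctuation, and whitespace are ignored, so lab values and dates do not count against a document.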
## Intended Use
Research and portfolio purposes only.
Not intended for clinical or commercial use.
License: CC BY-NC 4.0