---
license: cc-by-nc-4.0
language: he
base_model: onlplab/alephbert-base
tags:
- text-classification
- hebrew
- medical
---

# MedTextBERT

A Hebrew medical document classifier fine-tuned from [AlephBERT](https://huggingface.co/onlplab/alephbert-base).
It classifies extracted text into 24 document categories covering a wide range of medical specialties.

Built as part of a privacy-first Android app that runs OCR on Hebrew medical documents fully offline.

## Performance

| Metric | Score |
|--------|-------|
| Accuracy | 93.8% |
| F1 | 93.75% |

Evaluated on a held-out test set after 20 epochs of fine-tuning.

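The card does not state how F1 was averaged or which tooling produced the scores. As a rough sketch of how the two metrics in the table could be recomputed from a list of predictions, the snippet below assumes macro-averaged F1 (an assumption, not confirmed by this card). The helper names `accuracy` and `macro_f1` and the toy labels are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted category matches the gold label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty test set")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-category F1 (macro averaging is an assumption)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty test set")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted category gets a false positive
            fn[t] += 1  # gold category gets a false negative
    scores = []
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only; these are not the model's test data.
gold = ["cardiology", "imaging", "imaging", "pharmacy"]
pred = ["cardiology", "imaging", "pharmacy", "pharmacy"]
print(f"accuracy={accuracy(gold, pred):.3f} macro_f1={macro_f1(gold, pred):.3f}")
```

If the reported F1 was weighted or micro-averaged instead, the per-category scores would need to be combined differently.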
## Categories

`family_medicine` `cardiology` `cardiology_procedures` `imaging`
`diabetes_endocrinology` `pathology` `pediatrics` `orthopedics`
`neurology` `psychiatry` `urology` `surgery` `gastroenterology`
`hematology` `pulmonology` `dermatology` `infections_inflammation`
`gynecology` `oncology` `pharmacy` `emergency_medicine`
`geriatrics_rehabilitation` `administration_general` `lab_results`

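Downstream code may want to keep the category set as a constant and check predictions against it. A minimal sketch, assuming the model emits exactly the strings listed above as labels (the label format in the model config is not confirmed here); `CATEGORIES` and `is_known_category` are hypothetical names.

```python
# The 24 category names listed in this card, in the order shown above.
CATEGORIES = (
    "family_medicine", "cardiology", "cardiology_procedures", "imaging",
    "diabetes_endocrinology", "pathology", "pediatrics", "orthopedics",
    "neurology", "psychiatry", "urology", "surgery", "gastroenterology",
    "hematology", "pulmonology", "dermatology", "infections_inflammation",
    "gynecology", "oncology", "pharmacy", "emergency_medicine",
    "geriatrics_rehabilitation", "administration_general", "lab_results",
)

_CATEGORY_SET = frozenset(CATEGORIES)


def is_known_category(label: str) -> bool:
    """Return True if the label is one of the categories listed in this card."""
    return label in _CATEGORY_SET


print(len(CATEGORIES))
```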
## Training Data

Fine-tuned on a synthetically generated dataset of more than 4,500 labeled Hebrew medical documents.
The dataset includes edge cases and category variations, intended to improve generalization across real-world formats.

## Usage

```python
from transformers import pipeline

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="annaadar/MedTextBERT",
    tokenizer="annaadar/MedTextBERT",
)

# Hebrew input: "After a routine blood test, normal values were found"
result = classifier("לאחר בדיקת דם שגרתית, נמצאו ערכים תקינים")
print(result)
```

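OCR output can be long or ambiguous, so a caller may want several candidate categories and a confidence cut-off instead of a single label. The sketch below is one possible approach, not part of the original app: `load_classifier`, `top_categories`, and `pick_category` are hypothetical helpers, the 0.5 threshold is an arbitrary illustration, and the example scores are hand-written rather than real model output. It assumes a recent `transformers` version, where passing `top_k` for a single string returns a list of `{"label", "score"}` dicts.

```python
def load_classifier():
    """Build the pipeline; the model is downloaded on first use."""
    # Imported lazily so the pure-Python helper below works without transformers.
    from transformers import pipeline

    return pipeline(
        "text-classification",
        model="annaadar/MedTextBERT",
        tokenizer="annaadar/MedTextBERT",
    )


def top_categories(classifier, text, k=3):
    """Return up to k candidate categories with scores for one document."""
    # truncation=True clips inputs longer than the model's maximum sequence length.
    return classifier(text, top_k=k, truncation=True)


def pick_category(scores, min_score=0.5):
    """Return the best label, or None when the top score is below min_score."""
    if not scores:
        return None
    best = max(scores, key=lambda item: item["score"])
    return best["label"] if best["score"] >= min_score else None


# Hand-written scores for illustration only; not real model output.
example = [
    {"label": "lab_results", "score": 0.91},
    {"label": "hematology", "score": 0.06},
]
print(pick_category(example))
```

With a real model, the flow would be `pick_category(top_categories(load_classifier(), text))`; a `None` result signals that the document should be left unclassified or reviewed manually.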
## Limitations

- Trained on synthetic data, so performance on real-world clinical documents may vary
- Designed for Hebrew text only
- Not validated for clinical or diagnostic use

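Because the model is designed for Hebrew text only, a caller might screen inputs before classifying them. The following is a minimal sketch of such a guard, based on the share of alphabetic characters that fall in the Hebrew letter range (U+05D0 to U+05EA). `hebrew_ratio` and `looks_hebrew` are hypothetical helpers, and the 0.5 threshold is an arbitrary assumption that would need tuning on real OCR output.

```python
def hebrew_ratio(text: str) -> float:
    """Share of alphabetic characters that are Hebrew letters (alef to tav)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0  # digits, punctuation, or empty input
    hebrew = sum(1 for ch in letters if "\u05d0" <= ch <= "\u05ea")
    return hebrew / len(letters)


def looks_hebrew(text: str, min_ratio: float = 0.5) -> bool:
    """Heuristic check that a text is mostly Hebrew before classifying it."""
    return hebrew_ratio(text) >= min_ratio


print(looks_hebrew("לאחר בדיקת דם שגרתית"))
print(looks_hebrew("Routine blood test"))
```

Digits, punctuation, and vowel marks are ignored by the ratio, so lab values embedded in Hebrew text do not count against it.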
## Intended Use

Research and portfolio purposes only.
Not intended for clinical or commercial use.

License: CC BY-NC 4.0