---
license: cc-by-nc-4.0
language: he
base_model: onlplab/alephbert-base
tags:
  - text-classification
  - hebrew
  - medical
---

# MedTextBERT

A Hebrew medical document classifier fine-tuned from [AlephBERT](https://huggingface.co/onlplab/alephbert-base).  
It assigns extracted text to one of 24 document categories spanning medical specialties, lab results, pharmacy, and administrative documents.

Built as part of a privacy-first Android app that performs 100% offline OCR on Hebrew medical documents.

## Performance

| Metric | Score |
|--------|-------|
| Accuracy | 93.8% |
| F1 | 93.75% |

Evaluated on a held-out test set after 20 epochs of fine-tuning.
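
The table does not say how F1 is averaged across the 24 classes. As a minimal sketch of how such scores can be computed from true and predicted labels, the snippet below implements accuracy and macro-averaged F1 in plain Python. The macro averaging and the toy labels are assumptions for illustration, not the evaluation code used for this model.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption here)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, purely illustrative (not drawn from the real test set).
y_true = ["cardiology", "imaging", "pharmacy", "imaging"]
y_pred = ["cardiology", "imaging", "imaging", "imaging"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

If the reported F1 is weighted rather than macro-averaged, each per-class score would be weighted by its class frequency instead of averaged uniformly.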

## Categories

`family_medicine` `cardiology` `cardiology_procedures` `imaging`  
`diabetes_endocrinology` `pathology` `pediatrics` `orthopedics`  
`neurology` `psychiatry` `urology` `surgery` `gastroenterology`  
`hematology` `pulmonology` `dermatology` `infections_inflammation`  
`gynecology` `oncology` `pharmacy` `emergency_medicine`  
`geriatrics_rehabilitation` `administration_general` `lab_results`
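
For downstream code it can help to have the categories as a Python list with label/ID lookups. The sketch below simply copies the names listed above; the index order is an assumption, and the authoritative mapping is whatever `id2label` the model's own config defines.

```python
# The 24 category names, in the order they appear in this card.
# NOTE: this order is an assumption; check the model config's id2label
# before relying on specific integer IDs.
CATEGORIES = [
    "family_medicine", "cardiology", "cardiology_procedures", "imaging",
    "diabetes_endocrinology", "pathology", "pediatrics", "orthopedics",
    "neurology", "psychiatry", "urology", "surgery", "gastroenterology",
    "hematology", "pulmonology", "dermatology", "infections_inflammation",
    "gynecology", "oncology", "pharmacy", "emergency_medicine",
    "geriatrics_rehabilitation", "administration_general", "lab_results",
]

label2id = {name: idx for idx, name in enumerate(CATEGORIES)}
id2label = {idx: name for name, idx in label2id.items()}
```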

## Training Data

Fine-tuned on a synthetically generated dataset of 4,500+ labeled Hebrew medical documents,
covering edge cases and category variations to improve generalization across real-world formats.

## Usage

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="annaadar/MedTextBERT",
    tokenizer="annaadar/MedTextBERT"
)

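# Example input: "Following a routine blood test, normal values were found"
result = classifier("לאחר בדיקת דם שגרתית, נמצאו ערכים תקינים")

# A list with one dict holding the predicted "label" and its "score"
print(result)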
```
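
OCR output is often long and a single top label can hide close runners-up. The helper below is a minimal sketch of asking the pipeline for several candidates while truncating long inputs. `top_categories` is a hypothetical name, and the 512-token limit is an assumption based on typical BERT-style models; `top_k`, `truncation`, and `max_length` are standard arguments of the `text-classification` pipeline call.

```python
def top_categories(classifier, text, k=3):
    """Return the k highest-scoring (label, score) pairs for one document.

    `classifier` is expected to behave like a transformers
    text-classification pipeline.
    """
    # truncation/max_length are forwarded to the tokenizer so that long
    # OCR text does not exceed the model's input size (512 is assumed).
    output = classifier(text, top_k=k, truncation=True, max_length=512)
    # Depending on the transformers version, a single input may come back
    # wrapped in an outer list; unwrap it if so.
    if output and isinstance(output[0], list):
        output = output[0]
    return [(item["label"], item["score"]) for item in output]
```

With the `classifier` created above, `top_categories(classifier, text)` returns up to three `(label, score)` pairs sorted by score, which makes ambiguous documents easier to spot.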

## Limitations

- Trained on synthetic data only, so performance on real-world clinical documents may vary
- Designed for Hebrew text only
- Not validated for clinical or diagnostic use

## Intended Use

Research and portfolio purposes only.  
Not intended for clinical or commercial use.  
License: CC BY-NC 4.0