---
license: cc-by-nc-4.0
language: he
base_model: onlplab/alephbert-base
tags:
  - text-classification
  - hebrew
  - medical
---

# MedTextBERT

A Hebrew medical document classifier fine-tuned from [AlephBERT](https://huggingface.co/onlplab/alephbert-base).  
It classifies extracted document text into 24 categories spanning a wide range of medical specialties.

Built as part of a privacy-first Android app that performs 100% offline OCR on Hebrew medical documents.

## Performance

| Metric | Score |
|--------|-------|
| Accuracy | 93.8% |
| F1 | 93.75% |

Evaluated on a held-out test set after 20 epochs of fine-tuning.
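
The card does not say how F1 is averaged across the 24 categories. As a rough illustration of how the two metrics relate, the sketch below computes accuracy and a macro-averaged F1 from lists of true and predicted labels. The macro averaging and the toy labels are assumptions made for this example; they are not taken from the model's actual evaluation.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, invented for illustration only (not the real test set)
y_true = ["cardiology", "cardiology", "imaging", "pharmacy"]
y_pred = ["cardiology", "imaging", "imaging", "pharmacy"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

With balanced classes the two numbers tend to sit close together, which is consistent with the table above, though that alone does not confirm which averaging was used.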

## Categories

`family_medicine` `cardiology` `cardiology_procedures` `imaging`  
`diabetes_endocrinology` `pathology` `pediatrics` `orthopedics`  
`neurology` `psychiatry` `urology` `surgery` `gastroenterology`  
`hematology` `pulmonology` `dermatology` `infections_inflammation`  
`gynecology` `oncology` `pharmacy` `emergency_medicine`  
`geriatrics_rehabilitation` `administration_general` `lab_results`
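
A sequence-classification model typically maps integer class ids to label names through `id2label` and `label2id` entries in its config. The sketch below builds such a mapping from the categories listed above. The ordering, and therefore the ids, is an assumption made for illustration; the authoritative mapping is whatever the published model config contains.

```python
# The 24 categories from this card; the order here is assumed, not confirmed
CATEGORIES = [
    "family_medicine", "cardiology", "cardiology_procedures", "imaging",
    "diabetes_endocrinology", "pathology", "pediatrics", "orthopedics",
    "neurology", "psychiatry", "urology", "surgery", "gastroenterology",
    "hematology", "pulmonology", "dermatology", "infections_inflammation",
    "gynecology", "oncology", "pharmacy", "emergency_medicine",
    "geriatrics_rehabilitation", "administration_general", "lab_results",
]

# Integer id -> label name, and the inverse lookup
id2label = dict(enumerate(CATEGORIES))
label2id = {label: idx for idx, label in id2label.items()}

print(len(CATEGORIES))
print(label2id["lab_results"])
```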

## Training Data

Fine-tuned on a synthetically generated dataset of 4,500+ labeled Hebrew medical documents,
covering edge cases and category variations to improve generalization across real-world formats.

## Usage

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="annaadar/MedTextBERT",
    tokenizer="annaadar/MedTextBERT"
)

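# Hebrew input; in English: "After a routine blood test, normal values were found"
result = classifier("לאחר בדיקת דם שגרתית, נמצאו ערכים תקינים")
# The pipeline returns a list of dicts with "label" and "score" keys
print(result)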
```
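
A text-classification pipeline returns a list of dicts, each with a `label` and a `score`. One possible way to consume that output is to accept the top label only when its score clears a confidence threshold. In the sketch below, the helper `pick_label`, the 0.5 threshold, and the example predictions are all hypothetical; the label strings a real call returns depend on the model's config.

```python
def pick_label(predictions, threshold=0.5):
    """Return the highest-scoring label if its score reaches the threshold, else None."""
    if not predictions:
        return None
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= threshold else None


# Hypothetical outputs, shaped like a text-classification pipeline result
confident = [{"label": "lab_results", "score": 0.97}]
uncertain = [{"label": "imaging", "score": 0.31}]

print(pick_label(confident))  # lab_results
print(pick_label(uncertain))  # None
```

Returning `None` for low-confidence inputs leaves the decision about uncertain documents to the calling code instead of forcing them into a category.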

## Limitations

- Trained on synthetic data, so performance on real-world clinical documents may vary
- Designed for Hebrew text only
- Not validated for clinical or diagnostic use

## Intended Use

Research and portfolio purposes only.  
Not intended for clinical or commercial use.  
License: CC BY-NC 4.0