Commit 8f0ceef (verified) by annaadar
1 Parent(s): 4f7deaa

Update README.md

Files changed (1)
  1. README.md +61 -3
README.md CHANGED
@@ -1,3 +1,61 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: cc-by-nc-4.0
+ language: he
+ base_model: onlplab/alephbert-base
+ tags:
+ - text-classification
+ - hebrew
+ - medical
+ ---
+
+ # MedTextBERT
+
+ A Hebrew medical document classifier fine-tuned on [AlephBERT](https://huggingface.co/onlplab/alephbert-base).
+ Classifies extracted text into 24 document categories covering a wide range of medical specialties.
+
+ Built as part of a privacy-first Android app that performs 100% offline OCR on Hebrew medical documents.
+
+ ## Performance
+
+ | Metric | Score |
+ |--------|-------|
+ | Accuracy | 93.8% |
+ | F1 | 93.75% |
+
+ Evaluated on a held-out test set after 20 epochs of fine-tuning.
+
+ ## Categories
+
+ `family_medicine` `cardiology` `cardiology_procedures` `imaging`
+ `diabetes_endocrinology` `pathology` `pediatrics` `orthopedics`
+ `neurology` `psychiatry` `urology` `surgery` `gastroenterology`
+ `hematology` `pulmonology` `dermatology` `infections_inflammation`
+ `gynecology` `oncology` `pharmacy` `emergency_medicine`
+ `geriatrics_rehabilitation` `administration_general` `lab_results`
+
+ ## Training Data
+
+ Fine-tuned on a synthetically generated dataset of 4,500+ labeled Hebrew medical documents,
+ covering edge cases and category variations to improve generalization across real-world formats.
+
+ ## Usage
+
+ ```python
+ from transformers import pipeline
+
+ classifier = pipeline("text-classification", model="annaadar/MedTextBERT")
+ result = classifier("לאחר בדיקת דם שגרתית, נמצאו ערכים תקינים")
+ print(result)
+ ```
+
+ ## Limitations
+
+ - Trained on synthetic data — performance on real-world clinical documents may vary
+ - Designed for Hebrew text only
+ - Not validated for clinical or diagnostic use
+
+ ## Intended Use
+
+ Research and portfolio purposes only.
+ Not intended for clinical or commercial use.
+ License: CC BY-NC 4.0
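
Review note (not part of the commit): the Usage snippet in the new README prints the raw pipeline result, and the Limitations section warns that performance on real documents may vary. A caller might therefore want to ignore low-confidence predictions. The sketch below shows one possible way to do that. It relies only on the standard output shape of a `transformers` text-classification pipeline, a list of dicts with `label` and `score` keys. The helper name `pick_category`, the 0.5 threshold, and the example scores are hypothetical, and it assumes the model's labels map to the category names listed in the README.

```python
from typing import Dict, List, Optional


def pick_category(predictions: List[Dict], threshold: float = 0.5) -> Optional[str]:
    """Return the highest-scoring label, or None if it is below the threshold.

    `predictions` is expected in the shape a text-classification pipeline
    returns: [{"label": str, "score": float}, ...].
    The threshold is an illustrative assumption, not a tuned value.
    """
    if not predictions:
        return None
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= threshold else None


# Hypothetical prediction; the score is made up for illustration only.
example = [{"label": "lab_results", "score": 0.91}]
print(pick_category(example))                  # accepted at the default threshold
print(pick_category(example, threshold=0.95))  # rejected by a stricter threshold
```

With the README's own snippet, this would be called as `pick_category(classifier(text))`; a `None` result could then be routed to a fallback such as manual review.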