metadata
license: cc-by-nc-4.0
language: he
base_model: onlplab/alephbert-base
tags:
  - text-classification
  - hebrew
  - medical

MedTextBERT

A Hebrew medical document classifier fine-tuned from AlephBERT (onlplab/alephbert-base).
Classifies extracted text into 24 document categories covering a wide range of medical specialties.

Built as part of a privacy-first Android app that performs 100% offline OCR on Hebrew medical documents.

Performance

| Metric | Score |
|---|---|
| Accuracy | 93.8% |
| F1 | 93.75% |

Evaluated on a held-out test set after 20 epochs of fine-tuning.
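For readers who want to reproduce the two metrics on their own labeled documents, here is a minimal sketch of how accuracy and F1 can be computed from gold and predicted category names. The card does not say which F1 averaging was used, so the macro average below is an assumption, and the function names `accuracy` and `macro_f1` are illustrative rather than part of the model's code.

```python
from collections import Counter


def _check_inputs(y_true, y_pred):
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted category equals the gold label."""
    _check_inputs(y_true, y_pred)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-category F1.

    ASSUMPTION: macro averaging; the model card does not state the averaging.
    """
    _check_inputs(y_true, y_pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted category gets a false positive
            fn[t] += 1  # gold category gets a false negative
    labels = set(y_true) | set(y_pred)
    per_label = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_label.append(2 * tp[label] / denom if denom else 0.0)
    return sum(per_label) / len(per_label)
```

If the reported F1 was weighted or micro-averaged instead, the per-category scores would need to be combined differently.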

Categories

family_medicine cardiology cardiology_procedures imaging
diabetes_endocrinology pathology pediatrics orthopedics
neurology psychiatry urology surgery gastroenterology
hematology pulmonology dermatology infections_inflammation
gynecology oncology pharmacy emergency_medicine
geriatrics_rehabilitation administration_general lab_results
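When consuming predictions in application code, it can help to keep the category names in one place and reject anything unexpected. The sketch below copies the 24 names from the list above. The constant `CATEGORIES` and the helper `is_known_category` are hypothetical names. The order follows this card and is not guaranteed to match the index order in the model's configuration.

```python
# The 24 category names, copied from the list in this card.
# ASSUMPTION: the model emits these exact strings as labels; check the
# model's config (id2label) before relying on that.
CATEGORIES = (
    "family_medicine", "cardiology", "cardiology_procedures", "imaging",
    "diabetes_endocrinology", "pathology", "pediatrics", "orthopedics",
    "neurology", "psychiatry", "urology", "surgery", "gastroenterology",
    "hematology", "pulmonology", "dermatology", "infections_inflammation",
    "gynecology", "oncology", "pharmacy", "emergency_medicine",
    "geriatrics_rehabilitation", "administration_general", "lab_results",
)


def is_known_category(label: str) -> bool:
    """Return True if label is one of the 24 documented categories."""
    return label in CATEGORIES
```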

Training Data

Fine-tuned on a synthetically generated dataset of more than 4,500 labeled Hebrew medical documents. The dataset includes edge cases and category variations intended to improve generalization across real-world formats.

Usage

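```python
from transformers import pipeline

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="annaadar/MedTextBERT",
    tokenizer="annaadar/MedTextBERT"
)

# Input (Hebrew): "After a routine blood test, normal values were found."
result = classifier("לאחר בדיקת דם שגרתית, נמצאו ערכים תקינים")
# result is a list with one dict holding the predicted "label" and its "score".
print(result)
```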
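OCR output from a full document can be longer than a BERT-style model accepts, and a single top label hides how close the runner-up categories were. The sketch below wraps the pipeline call to truncate long inputs and return the few highest-scoring categories. The helpers `load_classifier` and `classify_document` are hypothetical. The 512-token limit is assumed from the usual BERT-base configuration rather than stated in this card. `top_k`, `truncation`, and `max_length` are passed through the text-classification pipeline as supported in recent `transformers` releases.

```python
def load_classifier(model_id="annaadar/MedTextBERT"):
    """Build the text-classification pipeline (downloads the model on first use)."""
    # Imported lazily so classify_document can be used with any compatible callable.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id, tokenizer=model_id)


def classify_document(classifier, text, top_k=3, max_length=512):
    """Return up to top_k (label, score) pairs for text, highest score first.

    ASSUMPTION: max_length=512 matches the base model's input limit.
    """
    predictions = classifier(
        text,
        top_k=top_k,          # ask for several candidate categories
        truncation=True,      # cut inputs longer than max_length tokens
        max_length=max_length,
    )
    # Some transformers versions nest the result for a single input; unwrap it.
    if predictions and isinstance(predictions[0], list):
        predictions = predictions[0]
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["label"], p["score"]) for p in ranked[:top_k]]
```

Calling `classify_document(load_classifier(), text)` on a Hebrew string would then yield a short ranked list of candidate categories. Truncation keeps only the start of a long document, so later pages do not influence the prediction.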

Limitations

  • Trained on synthetic data; performance on real-world clinical documents may vary
  • Designed for Hebrew text only
  • Not validated for clinical or diagnostic use

Intended Use

Research and portfolio purposes only.
Not intended for clinical or commercial use.
License: CC BY-NC 4.0