---
license: cc-by-nc-4.0
language: he
base_model: onlplab/alephbert-base
tags:
- text-classification
- hebrew
- medical
---

# MedTextBERT

A Hebrew medical document classifier fine-tuned from [AlephBERT](https://huggingface.co/onlplab/alephbert-base). It classifies extracted text into 24 document categories covering a wide range of medical specialties.

Built as part of a privacy-first Android app that performs 100% offline OCR on Hebrew medical documents.

## Performance

| Metric | Score |
|--------|-------|
| Accuracy | 93.8% |
| F1 | 93.75% |

Evaluated on a held-out test set after 20 epochs of fine-tuning.

## Categories

`family_medicine` `cardiology` `cardiology_procedures` `imaging` `diabetes_endocrinology` `pathology` `pediatrics` `orthopedics` `neurology` `psychiatry` `urology` `surgery` `gastroenterology` `hematology` `pulmonology` `dermatology` `infections_inflammation` `gynecology` `oncology` `pharmacy` `emergency_medicine` `geriatrics_rehabilitation` `administration_general` `lab_results`

## Training Data

Fine-tuned on a synthetically generated dataset of 4,500+ labeled Hebrew medical documents, covering edge cases and category variations to improve generalization across real-world formats.

## Usage

```python
from transformers import pipeline

# Load the classifier and its tokenizer from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="annaadar/MedTextBERT",
    tokenizer="annaadar/MedTextBERT"
)

# Hebrew input: "After a routine blood test, the values were found to be normal."
result = classifier("לאחר בדיקת דם שגרתית, נמצאו ערכים תקינים")
print(result)
```

## Limitations

- Trained on synthetic data, so performance on real-world clinical documents may vary
- Designed for Hebrew text only
- Not validated for clinical or diagnostic use

## Intended Use

Research and portfolio purposes only. Not intended for clinical or commercial use.

License: CC BY-NC 4.0
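
Because the model was trained on synthetic data, a caller may want to discard low-confidence predictions rather than trust every label. The sketch below post-processes output shaped like what the `text-classification` pipeline in the Usage example returns, a list of `{"label", "score"}` dictionaries. The helper name `pick_category` and the `0.6` cutoff are assumptions made for illustration; neither is part of this model or of `transformers`, and the sample scores are hand-written rather than real model output.

```python
from typing import Dict, List, Optional

# Hypothetical cutoff, chosen for illustration only; tune it on real documents.
MIN_CONFIDENCE = 0.6


def pick_category(
    predictions: List[Dict], min_confidence: float = MIN_CONFIDENCE
) -> Optional[str]:
    """Return the highest-scoring label, or None if it is below the cutoff.

    `predictions` is expected to look like text-classification pipeline
    output: a list of {"label": str, "score": float} dictionaries.
    """
    if not predictions:
        return None
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= min_confidence else None


# Hand-written example scores (not real model output).
confident = [{"label": "lab_results", "score": 0.91}]
uncertain = [{"label": "oncology", "score": 0.35}]

print(pick_category(confident))
print(pick_category(uncertain))
```

With the pipeline from the Usage example, this could be called as `pick_category(classifier(text))`; a `None` result can then be routed to manual review instead of being filed under a category.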