Parsan: Hebrew morphosyntax models
Trained weights for Parsan, a single joint model for Hebrew morphosyntax: raw text in, Universal Dependencies CoNLL-U out (segmentation, POS, morphological features, lemmas, dependency parse).
Try it in your browser: noamor/parsan-demo. Paste Hebrew text and get a dependency tree plus CoNLL-U output.
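Since the output is CoNLL-U, it can be read with a few lines of plain Python. The sketch below is a minimal reader, not part of Parsan: `read_conllu` and `syntactic_words` are hypothetical helper names, and the sample sentence is a hand-written illustration of the format (one surface token split into two syntactic words), not actual model output.

```python
from typing import Dict, Iterator, List

# The ten CoNLL-U columns, in order, as defined by Universal Dependencies.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of row dicts."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():            # blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):      # sentence-level comment, e.g. "# text = ..."
            continue
        else:
            cols = line.split("\t")
            if len(cols) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
            sentence.append(dict(zip(FIELDS, cols)))
    if sentence:                        # input may not end with a blank line
        yield sentence


def syntactic_words(sentence: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop multiword-token range rows (ID like '1-2') and empty nodes ('1.1')."""
    return [row for row in sentence if row["id"].isdigit()]


# Hand-written illustration only: the surface token "בבית" split into "ב" + "בית".
SAMPLE = "\n".join([
    "# text = בבית",
    "1-2\tבבית\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tב\tב\tADP\t_\t_\t2\tcase\t_\t_",
    "2\tבית\tבית\tNOUN\t_\tGender=Masc|Number=Sing\t0\troot\t_\t_",
    "",
])

for sent in read_conllu(SAMPLE):
    for row in syntactic_words(sent):
        print(row["id"], row["form"], row["upos"], row["head"], row["deprel"])
```

The range row (`1-2`) carries the original surface token, while the integer-ID rows carry the segmented words with their POS, features, lemma, and head. That split is why the two kinds of rows are kept apart here.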
This repository holds three checkpoints, each in its own folder:
| folder | what it is | size |
|---|---|---|
| joint_base | DictaBERT joint tagger + parser + lemma (best accuracy) | ~749 MB |
| joint_tiny2 | DictaBERT-tiny variant (~3x faster) | ~185 MB |
| seg_char_ctx | character segmenter on dictabert-char | ~353 MB |
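Because each checkpoint lives in its own folder, it should be possible to download only the folders you need instead of the whole repository. The sketch below uses the `allow_patterns` argument of `snapshot_download` from `huggingface_hub`. `patterns_for` and `download` are hypothetical helper names, and which folders a given profile or segmenter actually requires is not stated here, so treat the selection as an assumption to verify.

```python
from typing import List

# Folder names as listed in the table above.
CHECKPOINTS = ("joint_base", "joint_tiny2", "seg_char_ctx")


def patterns_for(folders: List[str]) -> List[str]:
    """Build glob patterns that match everything inside the given folders."""
    unknown = [f for f in folders if f not in CHECKPOINTS]
    if unknown:
        raise ValueError(f"unknown checkpoint folder(s): {unknown}")
    return [f"{f}/*" for f in folders]


def download(folders: List[str], local_dir: str = "runs") -> str:
    """Download only the selected folders; returns the local directory path."""
    # Imported here so patterns_for() is usable without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        "noamor/parsan",
        local_dir=local_dir,
        allow_patterns=patterns_for(folders),
    )


# Example (requires network access); the folder choice is an assumption:
# download(["joint_tiny2", "seg_char_ctx"])
```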
Use
Install the library, download the weights, and point PARSAN_RUNS at them (the library
also fetches them automatically on first use):
    from huggingface_hub import snapshot_download
    snapshot_download("noamor/parsan", local_dir="runs")

Then run prediction from the shell:

    PARSAN_RUNS=$PWD/runs python scripts/predict.py \
        --text input.txt --sent newline --profile base --out out.conllu

Options: --profile base|tiny and --segmenter char|rftok.
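The same invocation can be driven from Python, which is convenient for batch jobs. This is a minimal sketch that only wraps the command line shown above: `build_command` and `run` are hypothetical helper names, and it assumes the working directory is a checkout that contains `scripts/predict.py`. It uses no flags beyond those listed in this section.

```python
import os
import subprocess
import sys
from typing import List, Optional


def build_command(text_path: str, out_path: str, profile: str = "base",
                  segmenter: Optional[str] = None) -> List[str]:
    """Assemble the predict.py command line from the flags shown above."""
    if profile not in ("base", "tiny"):
        raise ValueError("profile must be 'base' or 'tiny'")
    if segmenter not in (None, "char", "rftok"):
        raise ValueError("segmenter must be 'char' or 'rftok'")
    cmd = [sys.executable, "scripts/predict.py",
           "--text", text_path, "--sent", "newline",
           "--profile", profile, "--out", out_path]
    if segmenter is not None:           # omit the flag to keep the script's default
        cmd += ["--segmenter", segmenter]
    return cmd


def run(text_path: str, out_path: str, runs_dir: str = "runs", **kwargs) -> None:
    """Run predict.py with PARSAN_RUNS pointing at the downloaded weights."""
    env = dict(os.environ, PARSAN_RUNS=os.path.abspath(runs_dir))
    subprocess.run(build_command(text_path, out_path, **kwargs),
                   env=env, check=True)


# Example (assumes weights are in ./runs and input.txt exists):
# run("input.txt", "out.conllu", profile="tiny", segmenter="char")
```

Passing the command as a list avoids shell quoting issues with file names, and `check=True` raises if the script exits with an error.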
Results
End-to-end LAS (F1 x 100) from raw text, scored against IAHLT gold annotations; OOD is the micro-average over five held-out genres.
| system | wiki | knesset | OOD |
|---|---|---|---|
| Parsan | 92.2 | 88.6 | 89.2 |
| HebPipe | 89.7 | 86.2 | 86.1 |
| Stanza | 83.1 | 80.7 | 80.5 |
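To make the metric concrete: when parsing from raw text, system and gold segmentation can differ, so LAS is usually reported as an F1 over aligned words, and a micro-average pools the counts across genres before computing the score. The sketch below assumes that standard precision/recall definition (as in the CoNLL 2018 shared task evaluation); the exact scorer behind the table is not stated here. The counts are toy numbers chosen for illustration, not results.

```python
from typing import Dict, Tuple

# (correct, system_words, gold_words) for one genre.
Counts = Tuple[int, int, int]


def las_f1(correct: int, system_words: int, gold_words: int) -> float:
    """LAS as F1 x 100: 'correct' = aligned words with the right head and label."""
    precision = correct / system_words if system_words else 0.0
    recall = correct / gold_words if gold_words else 0.0
    if precision + recall == 0:
        return 0.0
    return 100 * 2 * precision * recall / (precision + recall)


def micro_average(per_genre: Dict[str, Counts]) -> float:
    """Pool counts over all genres, then compute a single F1."""
    correct = sum(c for c, _, _ in per_genre.values())
    system = sum(s for _, s, _ in per_genre.values())
    gold = sum(g for _, _, g in per_genre.values())
    return las_f1(correct, system, gold)


# Toy counts for two hypothetical genres of very different size.
toy = {"genre_a": (90, 100, 100), "genre_b": (700, 1000, 1000)}

micro = micro_average(toy)
macro = sum(las_f1(*counts) for counts in toy.values()) / len(toy)
print(f"micro={micro:.1f} macro={macro:.1f}")
```

With a micro-average, larger genres weigh more: the pooled score sits closer to the big genre's score than the plain mean of per-genre scores does.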
Credits
Built on DictaBERT (Dicta) and the UD Hebrew-IAHLT treebanks. Thanks to Amir Zeldes for the encouragement and inspiration, and to Avner Algom (IAHLT). MIT license.