Dataset: Sefaria/he_berel_gold
A Hebrew Named Entity Recognition (NER) model for Rabbinic literature, fine-tuned from BEREL 3.0, a BERT-based language model pre-trained on Rabbinic Hebrew texts by DICTA.
This model identifies two entity types in Rabbinic Hebrew text:

| Label | Hebrew | Description |
|---|---|---|
| Cit (B-מקור / I-מקור) | מקור | Citations: references to Jewish texts and sources |
| Per (B-בן-אדם / I-בן-אדם) | בן-אדם | Persons: names of people |

It uses BIO tagging and was trained to automatically link citations and persons in Sefaria's corpus of Rabbinic literature.
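
As an illustration of how BIO tags turn into entity spans, here is a minimal sketch in plain Python. It is not taken from Sefaria's code. The helper name `bio_to_spans` is hypothetical, and the ASCII tags `B-Cit`, `I-Cit`, and `B-Per` are stand-ins for the model's Hebrew label names. A stray `I-` tag with no preceding `B-` is treated leniently as the start of a span, which is one common convention and an assumption here.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Collapse a BIO tag sequence into (label, start, end) spans; end is exclusive."""
    spans: List[Tuple[str, int, int]] = []
    current_label = None
    start = 0
    for i, tag in enumerate(tags):
        # Split on the first hyphen only, so labels that contain hyphens stay intact.
        prefix, _, label = tag.partition("-")
        # A span continues only on an I- tag that carries the same label.
        continues = prefix == "I" and label == current_label
        if current_label is not None and not continues:
            spans.append((current_label, start, i))
            current_label = None
        if prefix == "B" or (prefix == "I" and current_label is None):
            # B- opens a span; a stray I- is treated as an opening tag.
            current_label, start = label, i
    if current_label is not None:
        spans.append((current_label, start, len(tags)))
    return spans


# Hypothetical tag sequence for a five-token sentence.
example_tags = ["O", "B-Cit", "I-Cit", "O", "B-Per"]
example_spans = bio_to_spans(example_tags)
print(example_spans)
```

Each span is a half-open token range, so `tokens[start:end]` recovers the tokens of the entity.
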
Architecture: BertForTokenClassification (BERT-base, 12 layers, 12 attention heads, hidden size 768)

The best checkpoint was saved at epoch 3 (of a possible 10) via early stopping:

| Metric | Score |
|---|---|
| F1 | 87.2% |
| Precision | 85.7% |
| Recall | 88.8% |
| Eval loss | 0.0815 |
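
The three scores are consistent with each other if F1 is read as the usual harmonic mean of precision and recall. The sketch below checks that arithmetic. It assumes the standard F1 definition, since the card does not say how the metrics were computed or averaged, and the function name `f1_score` is only illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table above.
f1 = f1_score(0.857, 0.888)
print(f"{f1:.1%}")  # prints 87.2%
```
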
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_id = "Sefaria/berel-linker-ner"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Group word pieces into whole-word entities; each word takes the label of its first token.
ner = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="first",
    stride=128,  # token overlap between chunks when the input exceeds the model's maximum length
)

text = "דברי הרמב\"ם בהלכות שבת"
entities = ner(text)
print(entities)
```
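
With an aggregation strategy set, the pipeline returns a list of dictionaries with the keys `entity_group`, `score`, `word`, `start`, and `end`. A minimal sketch of post-processing that output follows. The helper `group_entities`, the `min_score` threshold, and the mock data are all hypothetical. The mock list is hand-written for illustration, not real model output, and uses placeholder group names and an English string so the example runs without downloading the model.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_entities(
    text: str, entities: List[Dict[str, Any]], min_score: float = 0.5
) -> Dict[str, List[str]]:
    """Group entity surface strings by label, dropping low-confidence predictions."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ent in entities:
        if ent["score"] < min_score:
            continue
        # Slice the original text by character offsets instead of trusting "word",
        # which can differ from the source text after tokenization.
        grouped[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(grouped)


# Hand-written mock in the shape of aggregated pipeline output (not real predictions).
sample_text = "Rambam, Hilkhot Shabbat"
mock_entities = [
    {"entity_group": "PER", "score": 0.99, "word": "Rambam", "start": 0, "end": 6},
    {"entity_group": "CIT", "score": 0.97, "word": "Hilkhot Shabbat", "start": 8, "end": 23},
    {"entity_group": "CIT", "score": 0.30, "word": "Rambam", "start": 0, "end": 6},
]
grouped = group_entities(sample_text, mock_entities)
print(grouped)
```

With the real model, the `entity_group` values would be the model's own label names with the `B-`/`I-` prefix removed.
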
Label-to-ID mapping:

```json
{
  "O": 0,
  "I-מקור": 1,
  "I-בן-אדם": 2,
  "B-מקור": 5,
  "B-בן-אדם": 6
}
```

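
The IDs in this mapping are not contiguous: 3 and 4 are not assigned to any label. Code that decodes raw predictions should therefore look labels up by ID rather than by list position. The sketch below inverts such a mapping and lists the unused IDs. The helper names are hypothetical, and the ASCII label names are stand-ins for the Hebrew labels; only the ID values are taken from the mapping above.

```python
from typing import Dict, List


def invert_label_map(label2id: Dict[str, int]) -> Dict[int, str]:
    """Build an ID-to-label lookup, rejecting mappings that reuse an ID."""
    id2label = {idx: label for label, idx in label2id.items()}
    if len(id2label) != len(label2id):
        raise ValueError("label2id assigns the same ID to more than one label")
    return id2label


def unused_ids(label2id: Dict[str, int]) -> List[int]:
    """Return the IDs below the maximum that no label uses."""
    used = set(label2id.values())
    return [i for i in range(max(used) + 1) if i not in used]


# Placeholder label names; the ID values follow the mapping shown above.
label2id = {"O": 0, "I-CIT": 1, "I-PER": 2, "B-CIT": 5, "B-PER": 6}
id2label = invert_label_map(label2id)
gaps = unused_ids(label2id)
print(id2label[5], gaps)
```

A dictionary lookup such as `id2label.get(pred_id, "O")` is one way to handle a prediction that lands on an unassigned ID; whether that can occur with this model is not stated in the card.
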
Developed by Sefaria for automated entity linking in classical Jewish texts.
Base model: dicta-il/BEREL_3.0