MARC 500 Colophon Sentence Classifier
A binary sentence classifier that detects whether a Hebrew MARC 500 (general notes) sentence is a colophon (the scribe's signature record at the end of a manuscript, typically including the scribe's name, place, and date of completion).
Built for the Mapping Hebrew Manuscripts (MHM) pipeline (Bar-Ilan
University). Colophon sentences identified by this model are routed
to Wikidata P1684 (inscription) instead of generic P7535
(described at URL) notes.
Note: this checkpoint has a single learned head (COLOPHON only). In the MHM pipeline, a sibling provenance decision is produced by a deterministic Hebrew keyword heuristic (`converter/authority/marc500_classifier.py:_PROVENANCE_KEYWORDS`), not by a learned head. Both decisions appear in this card's example tables for completeness.
Quick stats
| Property | Value |
|---|---|
| Base | dicta-il/dictabert |
| Architecture | DictaBERT [CLS] → Dropout(0.3) → Linear(768 → 1) → sigmoid |
| Heads | 1 (COLOPHON) |
| Threshold | 0.45 |
| F1 (best fold) | 0.9642 |
| F1 (mean fold) | 0.9610 |
| Max length | 64 tokens |
| Validation | 5-fold stratified CV |
How to use
```python
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
from transformers import AutoModel, AutoTokenizer

REPO = "alexgoldberg/hebrew-manuscript-marc500-classifier"
BASE = "dicta-il/dictabert"

# The checkpoint dict carries the weights plus the tuned threshold and max length.
ckpt = torch.load(hf_hub_download(REPO, "marc500_classifier_model.pt"),
                  map_location="cpu", weights_only=False)
threshold = ckpt["threshold"]  # 0.45
max_len = ckpt["max_length"]   # 64


class ColophonModel(nn.Module):
    def __init__(self, base):
        super().__init__()
        self.bert = AutoModel.from_pretrained(base)
        self.dropout = nn.Dropout(0.3)
        self.classifier = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        # Classify from the [CLS] token representation; returns a raw logit.
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]
        return self.classifier(self.dropout(cls))


tok = AutoTokenizer.from_pretrained(BASE)
model = ColophonModel(BASE)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()  # disables dropout for inference

sentence = 'ื ืฉืื ืคืืจืืฉ ืื ืืืฉื ืืืืฉื ืชืืจื ืืื ื ืื ืืืืจ.'
enc = tok(sentence, max_length=max_len, padding="max_length",
          truncation=True, return_tensors="pt")
with torch.no_grad():
    logit = model(enc["input_ids"], enc["attention_mask"])
    score = float(torch.sigmoid(logit).squeeze())
is_colophon = score >= threshold
print(is_colophon, round(score, 4))
```
A complete inference helper is shipped as examples.py in this repo.
Real input/output examples
Each sentence below is taken verbatim from a National Library of
Israel MARC 500 general-notes field. Gold colophon and
Gold provenance are the distant-supervision labels used when
this corpus was extracted (see scripts/extract_marc500_sentences.py
in the MHM pipeline). The provenance score is the keyword-heuristic
decision (not a learned model output).
Example 1: clear_colophon
Sentence:
ืงืืืืคืื ืืืืืจ (181ื): ""ื ืฉืื ืคืืจืืฉ ืื ืืืฉื ืืืืฉื ืชืืจื ืืื ื' ืื ืืืืจ ืืขืืจ ืงืืจืื ืฉื ืช ืืืืื ื'ืจ'ืข'ื' ืืืชื"".
Gold labels (distant supervision): COLOPHON = 1, PROVENANCE = 0
Decisions:
| Head | Score | Threshold | Above? |
|---|---|---|---|
| COLOPHON (learned) | 0.9939 | 0.45 | YES |
| PROVENANCE (heuristic) | 0.00 | 0.50 | no |
Example 2: clear_provenance
Sentence:
ืฉืื ืืื ืื ืกืคืจื ืืืืืงืื ื, ืืจืืฉืื ื ื ืงืจืืช ืืงืจื ืจืืืืืื ื ืืคืก ืขื ืงืืฃ ืืื ืืืจืฉื [ืื ืืืืจืคื] ืฉื ืช ื'ืชืงื""ื [ืฆ""ื: ืชืงืก""ื (1569)].
Gold labels (distant supervision): COLOPHON = 0, PROVENANCE = 1
Decisions:
| Head | Score | Threshold | Above? |
|---|---|---|---|
| COLOPHON (learned) | 0.3101 | 0.45 | no |
| PROVENANCE (heuristic) | 0.00 | 0.50 | no |
Example 3: both
Sentence:
ืืคืืื ืืชืืชื ืฉืื ืืื ืืกืคืจ ืื ื ืื ืฆืืจ ื' ืกืืื ืืืืื ืืื'ื ืื ืงืจืืช ืกืื, ืืืื ื ืื""ื ืืงื ืืชื ืืื ืืืฉื ืืืืฉื ืชืืจื ...
Gold labels (distant supervision): COLOPHON = 1, PROVENANCE = 1
Decisions:
| Head | Score | Threshold | Above? |
|---|---|---|---|
| COLOPHON (learned) | 0.9504 | 0.45 | YES |
| PROVENANCE (heuristic) | 0.65 | 0.50 | YES |
Example 4: codicology_neither
Sentence:
ืืืฃ 96ื: ""ืืฉืืืจื ืขืจื ืืฆืื"", ืกืืืืืช ืืขืจื ืฆืื ืืืคืืจ.
Gold labels (distant supervision): COLOPHON = 0, PROVENANCE = 0
Decisions:
| Head | Score | Threshold | Above? |
|---|---|---|---|
| COLOPHON (learned) | 0.2180 | 0.45 | no |
| PROVENANCE (heuristic) | 0.00 | 0.50 | no |
Example 5: keyword_colophon
Sentence:
ืืืื ืืืืืช ืืฉื ืื ืช""ื-ืชืฆ""ื ืืืจืืฉื ""ืฉื ื ืื ืฉืืื ืฉื ืช ืืชืชืงื ""ื"" ืืฉืืจืืช [=ืช""ื] ืืื ืจืื ื ืืชื ืืฉื ืช ืช""ื.
Gold labels (distant supervision): COLOPHON = 1, PROVENANCE = 0
Decisions:
| Head | Score | Threshold | Above? |
|---|---|---|---|
| COLOPHON (learned) | 0.9025 | 0.45 | YES |
| PROVENANCE (heuristic) | 0.00 | 0.50 | no |
Training details
- Encoder: `dicta-il/dictabert`.
- Loss: focal loss with `pos_weight` for class imbalance (colophon sentences are a small minority of MARC 500 traffic).
- Validation: 5-fold stratified CV at the manuscript level (sentences from the same record are kept on the same side of the split, which prevents leakage from neighboring sentences).
- Threshold tuning: scanned per fold; the published checkpoint stores `threshold = 0.45`.
- Distant supervision: positive labels assigned from sentences containing colophon-formula keywords (`ื ืฉืื`, `ืกืืื`, `ืงืืืืคืื`, etc.); see `scripts/extract_marc500_sentences.py` in the MHM pipeline for the exact extraction logic.
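The loss and threshold-tuning bullets above can be illustrated with a minimal, framework-free sketch. It is not the training code from the MHM pipeline: the focusing parameter `gamma = 2.0`, the exact way `pos_weight` enters the loss, the 0.05 scan step, and the helper names `focal_bce` and `best_threshold` are all assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple


def _sigmoid(x: float) -> float:
    # Numerically stable sigmoid for large positive or negative logits.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def focal_bce(logit: float, label: int,
              pos_weight: float = 1.0, gamma: float = 2.0) -> float:
    """Binary focal loss for one example (assumed form, gamma is a guess)."""
    p = _sigmoid(logit)
    p_t = p if label == 1 else 1.0 - p        # probability of the true class
    p_t = max(p_t, 1e-12)                     # avoid log(0)
    weight = pos_weight if label == 1 else 1.0  # up-weight the rare positives
    # (1 - p_t) ** gamma shrinks the loss of already well-classified examples.
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


def f1_score(labels: Sequence[int], preds: Sequence[int]) -> float:
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores: Sequence[float],
                   labels: Sequence[int]) -> Tuple[float, float]:
    """Scan 0.05, 0.10, ..., 0.95 and return (threshold, F1) with the best F1."""
    best_t, best_f1 = 0.5, -1.0
    for step in range(5, 100, 5):
        t = step / 100
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(labels, preds)
        if f1 > best_f1:  # strict '>' keeps the lowest threshold on ties
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

With `gamma = 0` and `pos_weight = 1` the loss reduces to ordinary binary cross-entropy, which is a convenient sanity check. In a per-fold setup, `best_threshold` would be run on each fold's held-out scores.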
Limitations
- Single-head model: only COLOPHON is a learned classification. The `is_provenance` companion in the parent MHM pipeline is a Hebrew keyword heuristic, not this model.
- Distant-supervision label noise: keyword-derived labels are not gold; some colophons that lack the canonical formulae are likely missed at training time.
- Sentence-level: sentences are assumed to be split before inference (e.g. by the parent pipeline's MARC 500 splitter). On run-on text, performance drops.
- Catalog scope: NLI MARC only; not validated on other catalogs.
Pipeline integration
In the MHM pipeline this model is consulted by NerWorker.run. Each
MARC 500 sentence is scored:
- COLOPHON-positive sentences are appended to `record["ml_colophon_sentences"]` and merged into `record["colophon_text"]`, which feeds Wikidata `P1684` (inscription).
- PROVENANCE-positive sentences (per the keyword heuristic) are routed through the provenance NER pipeline as if they had come from MARC 561.
Pre-deployment estimate: P1684 (inscription) coverage rises from 41% to ~55% on the parent MHM pipeline when this model is enabled (CLAUDE.md Rule 35).
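The routing described in this section might look roughly like the sketch below. It is not the code in `NerWorker.run`: the function name `route_sentence`, the space-joined merge into `colophon_text`, the `provenance_sentences` key, and the returned route labels are hypothetical. Only the `ml_colophon_sentences` and `colophon_text` keys and the 0.45 threshold come from this card.

```python
from typing import List

COLOPHON_THRESHOLD = 0.45  # value stored in the published checkpoint


def route_sentence(record: dict, sentence: str, colophon_score: float,
                   is_provenance: bool,
                   threshold: float = COLOPHON_THRESHOLD) -> List[str]:
    """Update `record` in place and return the routes taken (hypothetical labels)."""
    routes: List[str] = []
    if colophon_score >= threshold:
        record.setdefault("ml_colophon_sentences", []).append(sentence)
        # Assumed merge strategy: append to any existing colophon text.
        existing = record.get("colophon_text", "")
        record["colophon_text"] = f"{existing} {sentence}".strip()
        routes.append("P1684")
    if is_provenance:
        # Hypothetical key: stands in for the hand-off to the provenance NER step.
        record.setdefault("provenance_sentences", []).append(sentence)
        routes.append("provenance_ner")
    return routes
```

A sentence can take both routes, as in Example 3 above, where the colophon score (0.9504) and the provenance heuristic (0.65) both clear their thresholds.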
Citation
@software{mhm_marc500_classifier_2025,
author = {Goldberg, Alexander},
title = {MARC 500 Hebrew Colophon Sentence Classifier},
year = {2025},
url = {https://huggingface.co/alexgoldberg/hebrew-manuscript-marc500-classifier},
note = {Mapping Hebrew Manuscripts (MHM) Pipeline, Bar-Ilan University},
}
License
Apache-2.0. The base model dicta-il/dictabert is © DICTA, used here
under its published license.
Acknowledgments
DICTA (DictaBERT), National Library of Israel (catalog), Bar-Ilan University (MHM project).