MARC 500 Colophon Sentence Classifier

A binary sentence classifier that detects whether a Hebrew MARC 500 (general notes) sentence is a colophon (the scribe's signature record at the end of a manuscript — typically including the scribe's name, place, and date of completion).

Built for the Mapping Hebrew Manuscripts (MHM) pipeline (Bar-Ilan University). Colophon sentences identified by this model are routed to Wikidata P1684 (inscription) instead of generic P7535 (scope and content) notes.

Note: this checkpoint has a single learned head (COLOPHON only). In the MHM pipeline a sibling provenance decision is produced by a deterministic Hebrew keyword heuristic (converter/authority/marc500_classifier.py:_PROVENANCE_KEYWORDS), not by a learned head. Both decisions appear in this card's example tables for completeness.

Quick stats


Base	`dicta-il/dictabert`
Architecture	DictaBERT [CLS] → Dropout(0.3) → Linear(768 → 1) → sigmoid
Heads	1 (COLOPHON)
Threshold	0.45
F1 (best fold)	0.9642
F1 (mean fold)	0.9610
Max length	64 tokens
Validation	5-fold stratified CV

How to use

from huggingface_hub import hf_hub_download
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

REPO = "alexgoldberg/hebrew-manuscript-marc500-classifier"

# weights_only=False unpickles arbitrary Python objects,
# so only use it with checkpoints you trust.
ckpt = torch.load(hf_hub_download(REPO, "marc500_classifier_model.pt"),
                  map_location="cpu", weights_only=False)
threshold = ckpt["threshold"]   # 0.45
max_len   = ckpt["max_length"]  # 64

class ColophonModel(nn.Module):
    """DictaBERT encoder with a single-logit COLOPHON head on the [CLS] token."""
    def __init__(self, base):
        super().__init__()
        # Attribute names must match the keys in the saved state dict.
        self.bert = AutoModel.from_pretrained(base)
        self.dropout = nn.Dropout(0.3)
        self.classifier = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]          # [CLS] embedding
        return self.classifier(self.dropout(cls))  # raw logit, shape (batch, 1)

BASE = "dicta-il/dictabert"
tok = AutoTokenizer.from_pretrained(BASE)
model = ColophonModel(BASE)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()  # disables dropout for inference

sentence = 'נשלם פירוש כל חמשה חומשי תורה יום ה כח לאדר.'
enc = tok(sentence, max_length=max_len, padding="max_length",
          truncation=True, return_tensors="pt")
with torch.no_grad():
    # Sigmoid turns the logit into a colophon probability.
    score = float(torch.sigmoid(model(enc["input_ids"],
                                      enc["attention_mask"])).squeeze())
is_colophon = score >= threshold
print(is_colophon, round(score, 4))

A complete inference helper is shipped as examples.py in this repo.

Real input/output examples

Each sentence below is taken verbatim from a National Library of Israel MARC 500 general-notes field. Gold colophon and Gold provenance are the distant-supervision labels used when this corpus was extracted (see scripts/extract_marc500_sentences.py in the MHM pipeline). The provenance score is the keyword-heuristic decision (not a learned model output).

Example 1 — clear_colophon

Sentence:

קולופון המחבר (181א): "נשלם פירוש כל חמשה חומשי תורה יום ה' כח לאדר בעיר קירים שנת אלהים ה'ר'ע'ה' אותי".

Gold labels (distant supervision): COLOPHON = 1, PROVENANCE = 0

Decisions:

Head	Score	Threshold	Above?
COLOPHON (learned)	0.9939	0.45	YES
PROVENANCE (heuristic)	0.00	0.50	no

Example 2 — clear_provenance

Sentence:

שהן בגנזי ספרי ואטיקאנו, הראשונה נקראת מקרא ריגייאה נדפס על קלף באנוירשה [אנטוורפן] שנת א'תקי"ז [צ"ל: תקס"ט (1569)].

Gold labels (distant supervision): COLOPHON = 0, PROVENANCE = 1

Decisions:

Head	Score	Threshold	Above?
COLOPHON (learned)	0.3101	0.45	no
PROVENANCE (heuristic)	0.00	0.50	no

Example 3 — both

Sentence:

לפיכך כתבתי שמי בזה הספר אנא מנצור ן' סאלם אללוי אלד'י מן קרית סדם, זיכני הב"ה וקניתי אלו חמשה חומשי תורה ...

Gold labels (distant supervision): COLOPHON = 1, PROVENANCE = 1

Decisions:

Head	Score	Threshold	Above?
COLOPHON (learned)	0.9504	0.45	YES
PROVENANCE (heuristic)	0.65	0.50	YES

Example 4 — codicology_neither

Sentence:

בדף 96א: "אשמורה ערב הצום", סליחות לערב צום כיפור.

Gold labels (distant supervision): COLOPHON = 0, PROVENANCE = 0

Decisions:

Head	Score	Threshold	Above?
COLOPHON (learned)	0.2180	0.45	no
PROVENANCE (heuristic)	0.00	0.50	no

Example 5 — keyword_colophon

Sentence:

כולל לוחות לשנים ת"ה-תצ"א ובראשו "שנה זו שהיא שנת אתתקנ"ד" לשטרות [=ת"ג] וכנראה נכתב בשנת ת"ג.

Gold labels (distant supervision): COLOPHON = 1, PROVENANCE = 0

Decisions:

Head	Score	Threshold	Above?
COLOPHON (learned)	0.9025	0.45	YES
PROVENANCE (heuristic)	0.00	0.50	no
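The decisions in the five tables above can be re-derived from the scores alone. The following is a small sketch that copies the scores, thresholds and gold labels from this card and applies the same `score >= threshold` comparison as the usage snippet; the names `EXAMPLES` and `above` are illustrative and not part of the MHM pipeline.

```python
# Scores and gold labels copied from the five example tables in this card.
# Tuple layout: (name, colophon_score, provenance_score, gold_colophon, gold_provenance)
EXAMPLES = [
    ("clear_colophon",     0.9939, 0.00, 1, 0),
    ("clear_provenance",   0.3101, 0.00, 0, 1),
    ("both",               0.9504, 0.65, 1, 1),
    ("codicology_neither", 0.2180, 0.00, 0, 0),
    ("keyword_colophon",   0.9025, 0.00, 1, 0),
]
COLOPHON_THRESHOLD = 0.45    # stored in the checkpoint
PROVENANCE_THRESHOLD = 0.50  # heuristic cut-off shown in the tables


def above(score, threshold):
    """Same comparison as the usage snippet: score >= threshold."""
    return score >= threshold


# Examples where the thresholded decision disagrees with the gold label.
colophon_mismatches = [
    name for name, col, _prov, gold_col, _gold_prov in EXAMPLES
    if above(col, COLOPHON_THRESHOLD) != bool(gold_col)
]
provenance_mismatches = [
    name for name, _col, prov, _gold_col, gold_prov in EXAMPLES
    if above(prov, PROVENANCE_THRESHOLD) != bool(gold_prov)
]

print("colophon mismatches:", colophon_mismatches)      # []
print("provenance mismatches:", provenance_mismatches)  # ['clear_provenance']
```

On these five sentences the learned COLOPHON decision agrees with every gold label, while the keyword heuristic misses the gold provenance label in Example 2.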

Training details

Encoder: dicta-il/dictabert.
Loss: focal loss with pos_weight to counter class imbalance (colophon sentences are a small minority of MARC 500 sentences).
Validation: 5-fold stratified CV at the manuscript level: sentences from the same record are kept on the same side of the split, which prevents leakage between neighboring sentences.
Threshold tuning: scanned per fold; the published checkpoint stores threshold = 0.45.
Distant supervision: positive labels assigned from sentences containing colophon-formula keywords (נשלם, סיום, קולופון, etc.); see scripts/extract_marc500_sentences.py in the MHM pipeline for the exact extraction logic.
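To make the loss line above concrete, here is a minimal pure-Python sketch of a binary focal loss with a positive-class weight, computed for one example from a raw logit. The card does not give the focusing parameter, the pos_weight value, or how the two are combined, so `gamma`, `pos_weight` and the multiplicative weighting below are assumptions, and `focal_bce_with_logit` is a hypothetical name rather than the training code.

```python
import math


def focal_bce_with_logit(logit, target, gamma=2.0, pos_weight=1.0):
    """Focal binary cross-entropy for a single example.

    gamma and pos_weight are illustrative defaults, not the values
    used to train the published checkpoint.
    """
    # Numerically stable sigmoid.
    if logit >= 0:
        p = 1.0 / (1.0 + math.exp(-logit))
    else:
        e = math.exp(logit)
        p = e / (1.0 + e)

    # Probability the model assigns to the true class.
    p_t = p if target == 1 else 1.0 - p
    p_t = max(p_t, 1e-12)  # avoid log(0)

    # Assumption: positives are up-weighted multiplicatively, as in BCE pos_weight.
    weight = pos_weight if target == 1 else 1.0

    # (1 - p_t)^gamma shrinks the loss of examples that are already well classified.
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct positive contributes far less than an uncertain one.
easy = focal_bce_with_logit(4.0, 1)
hard = focal_bce_with_logit(0.0, 1)
print(easy < hard)
```

With `gamma=0` and `pos_weight=1` the function reduces to ordinary binary cross-entropy, which is a convenient sanity check when experimenting with other settings.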

Limitations

Single-head model: only COLOPHON is a learned classification. The is_provenance companion in the parent MHM pipeline is a Hebrew keyword heuristic, not this model.
Distant-supervision label noise: keyword-derived labels are not gold; some colophons that lack the canonical formulae are likely missed at training time.
Sentence-level: sentences are assumed to be split before inference (e.g. by the parent pipeline's MARC 500 splitter). On run-on text, performance drops.
Catalog scope: NLI MARC only; not validated on other catalogs.
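The label-noise limitation is easier to see with a toy labeler. The sketch below assigns a positive label when a sentence contains one of the three colophon keywords this card names; it is an illustration only, since the actual rules live in scripts/extract_marc500_sentences.py and the card's "etc." is deliberately not expanded here. `COLOPHON_KEYWORDS` and `distant_colophon_label` are hypothetical names.

```python
# Only the three keywords named in this card; the real list is not reproduced here.
COLOPHON_KEYWORDS = ("נשלם", "סיום", "קולופון")


def distant_colophon_label(sentence, keywords=COLOPHON_KEYWORDS):
    """Return 1 if any keyword occurs as a substring of the sentence, else 0."""
    return int(any(keyword in sentence for keyword in keywords))


# Contains the completion formula, so the rule fires.
with_formula = distant_colophon_label("נשלם פירוש כל חמשה חומשי תורה יום ה כח לאדר.")

# Liturgical content note from Example 4: no keyword, label 0.
content_note = distant_colophon_label("סליחות לערב צום כיפור.")

# Opening of Example 3: a scribe's statement without any of the three keywords.
formula_free = distant_colophon_label("לפיכך כתבתי שמי בזה הספר")

print(with_formula, content_note, formula_free)  # 1 0 0
```

The last case shows how a rule this narrow misses a colophon that does not use a canonical formula. Example 3 carries a positive gold label in this card, so the real extraction presumably relies on more than these three keywords.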

Pipeline integration

In the MHM pipeline this model is consulted by NerWorker.run. Each MARC 500 sentence is scored:

COLOPHON-positive sentences are appended to record["ml_colophon_sentences"] and merged into record["colophon_text"], which feeds Wikidata P1684 (inscription).
PROVENANCE-positive sentences (per the keyword heuristic) are routed through the provenance NER pipeline as if they had come from MARC 561.
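The routing described above can be sketched as a single function over a record dictionary. The keys `ml_colophon_sentences` and `colophon_text` come from this card; the function name `route_sentence`, the space-joined merge, and the `provenance_candidates` key are assumptions made for illustration and do not describe the real NerWorker.run.

```python
COLOPHON_THRESHOLD = 0.45  # threshold stored in the published checkpoint


def route_sentence(record, sentence, colophon_score, is_provenance):
    """Route one scored MARC 500 sentence into a record dict (illustrative only)."""
    if colophon_score >= COLOPHON_THRESHOLD:
        # Keep the raw sentence so the model's contribution stays traceable.
        record.setdefault("ml_colophon_sentences", []).append(sentence)
        # Assumption: merging means appending with a single space.
        existing = record.get("colophon_text", "")
        record["colophon_text"] = f"{existing} {sentence}".strip()

    if is_provenance:
        # Hypothetical key standing in for the hand-off to the provenance NER step.
        record.setdefault("provenance_candidates", []).append(sentence)

    return record


record = {}
route_sentence(record, "sentence A", colophon_score=0.95, is_provenance=True)
route_sentence(record, "sentence B", colophon_score=0.22, is_provenance=False)
print(record)
```

A sentence can take both routes, as in Example 3, where the colophon score and the provenance heuristic are both above their thresholds.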

Pre-deployment estimate: with this model enabled, P1684 (inscription) coverage on the parent MHM pipeline is expected to rise from 41% to ~55% (CLAUDE.md Rule 35).

Citation

@software{mhm_marc500_classifier_2025,
  author = {Goldberg, Alexander},
  title  = {MARC 500 Hebrew Colophon Sentence Classifier},
  year   = {2025},
  url    = {https://huggingface.co/alexgoldberg/hebrew-manuscript-marc500-classifier},
  note   = {Mapping Hebrew Manuscripts (MHM) Pipeline, Bar-Ilan University},
}

License

Acknowledgments

DICTA (DictaBERT), National Library of Israel (catalog), Bar-Ilan University (MHM project).
