MARC 500 Colophon Sentence Classifier

A binary sentence classifier that detects whether a Hebrew MARC 500 (general notes) sentence is a colophon (the scribe's signature record at the end of a manuscript, typically including the scribe's name, place, and date of completion).

Built for the Mapping Hebrew Manuscripts (MHM) pipeline (Bar-Ilan University). Colophon sentences identified by this model are routed to Wikidata P1684 (inscription) instead of generic P7535 (scope and content) notes.

Note: this checkpoint has a single learned head (COLOPHON only). In the MHM pipeline a sibling provenance decision is produced by a deterministic Hebrew keyword heuristic (converter/authority/marc500_classifier.py:_PROVENANCE_KEYWORDS), not by a learned head. Both decisions appear in this card's example tables for completeness.

Quick stats

Base | dicta-il/dictabert
Architecture | DictaBERT [CLS] → Dropout(0.3) → Linear(768 → 1) → sigmoid
Heads | 1 (COLOPHON)
Threshold | 0.45
F1 (best fold) | 0.9642
F1 (mean fold) | 0.9610
Max length | 64 tokens
Validation | 5-fold stratified CV

How to use

from huggingface_hub import hf_hub_download
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

REPO = "alexgoldberg/hebrew-manuscript-marc500-classifier"
# weights_only=False unpickles arbitrary Python objects: load only checkpoints you trust.
ckpt = torch.load(hf_hub_download(REPO, "marc500_classifier_model.pt"),
                  map_location="cpu", weights_only=False)
threshold = ckpt["threshold"]   # 0.45
max_len   = ckpt["max_length"]  # 64

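class ColophonModel(nn.Module):
    """DictaBERT encoder with a single-logit COLOPHON head."""
    def __init__(self, base):
        super().__init__()
        self.bert = AutoModel.from_pretrained(base)
        self.dropout = nn.Dropout(0.3)  # inactive once model.eval() is called
        self.classifier = nn.Linear(self.bert.config.hidden_size, 1)
    def forward(self, input_ids, attention_mask):
        # Use the [CLS] token (position 0) as the sentence representation.
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]
        # Returns a raw logit; the caller applies the sigmoid.
        return self.classifier(self.dropout(cls))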

BASE = "dicta-il/dictabert"
tok = AutoTokenizer.from_pretrained(BASE)
model = ColophonModel(BASE)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

sentence = 'נשלם פירוש כל חמשה חומשי תורה יום ה כח לאדר.'
enc = tok(sentence, max_length=max_len, padding="max_length",
          truncation=True, return_tensors="pt")
with torch.no_grad():
    score = float(torch.sigmoid(model(enc["input_ids"],
                                      enc["attention_mask"])).squeeze())
is_colophon = score >= threshold
print(is_colophon, round(score, 4))

A complete inference helper is shipped as examples.py in this repo.

Real input/output examples

Each sentence below is taken verbatim from a National Library of Israel MARC 500 general-notes field. Gold colophon and Gold provenance are the distant-supervision labels used when this corpus was extracted (see scripts/extract_marc500_sentences.py in the MHM pipeline). The provenance score is the keyword-heuristic decision (not a learned model output).

Example 1: clear_colophon

Sentence:

קולופון המחבר (181א): ""נשלם פירוש כל חמשה חומשי תורה יום ה' כח לאדר בעיר קירים שנת אלהים ה'ר'ע'ה' אותי"".

Gold labels (distant supervision): COLOPHON = 1, PROVENANCE = 0

Decisions:

Head | Score | Threshold | Above?
COLOPHON (learned) | 0.9939 | 0.45 | YES
PROVENANCE (heuristic) | 0.00 | 0.50 | no

Example 2: clear_provenance

Sentence:

שהן בגנזי ספרי ואטיקאנו, הראשונה נקראת מקרא ריגייאה נדפס על קלף באנוירשה [אנטוורפן] שנת א'תקי""ז [צ""ל: תקס""ט (1569)].

Gold labels (distant supervision): COLOPHON = 0, PROVENANCE = 1

Decisions:

Head | Score | Threshold | Above?
COLOPHON (learned) | 0.3101 | 0.45 | no
PROVENANCE (heuristic) | 0.00 | 0.50 | no

Example 3: both

Sentence:

ืœืคื™ื›ืš ื›ืชื‘ืชื™ ืฉืžื™ ื‘ื–ื” ื”ืกืคืจ ืื ื ืžื ืฆื•ืจ ืŸ' ืกืืœื ืืœืœื•ื™ ืืœื“'ื™ ืžืŸ ืงืจื™ืช ืกื“ื, ื–ื™ื›ื ื™ ื”ื‘""ื” ื•ืงื ื™ืชื™ ืืœื• ื—ืžืฉื” ื—ื•ืžืฉื™ ืชื•ืจื” ...

Gold labels (distant supervision): COLOPHON = 1, PROVENANCE = 1

Decisions:

Head | Score | Threshold | Above?
COLOPHON (learned) | 0.9504 | 0.45 | YES
PROVENANCE (heuristic) | 0.65 | 0.50 | YES

Example 4: codicology_neither

Sentence:

בדף 96א: ""אשמורה ערב הצום"", סליחות לערב צום כיפור.

Gold labels (distant supervision): COLOPHON = 0, PROVENANCE = 0

Decisions:

Head | Score | Threshold | Above?
COLOPHON (learned) | 0.2180 | 0.45 | no
PROVENANCE (heuristic) | 0.00 | 0.50 | no

Example 5: keyword_colophon

Sentence:

כולל לוחות לשנים ת""ה-תצ""א ובראשו ""שנה זו שהיא שנת אתתקנ""ד"" לשטרות [=ת""ג] וכנראה נכתב בשנת ת""ג.

Gold labels (distant supervision): COLOPHON = 1, PROVENANCE = 0

Decisions:

Head | Score | Threshold | Above?
COLOPHON (learned) | 0.9025 | 0.45 | YES
PROVENANCE (heuristic) | 0.00 | 0.50 | no

Training details

  • Encoder: dicta-il/dictabert.
  • Loss: focal loss with pos_weight for class imbalance (colophon sentences are a small minority of MARC 500 traffic).
  • Validation: 5-fold stratified CV at the manuscript level (sentences from the same record are kept on the same side of the split, which prevents leakage from neighboring sentences).
  • Threshold tuning: scanned per fold; the published checkpoint stores threshold = 0.45.
  • Distant supervision: positive labels assigned from sentences containing colophon-formula keywords (נשלם, סיום, קולופון, etc.); see scripts/extract_marc500_sentences.py in the MHM pipeline for the exact extraction logic.
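
The validation and threshold-tuning bullets above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the MHM training code: assign_folds, f1_at and best_threshold are hypothetical names, the greedy balancing only approximates stratification, and both the 0.05 grid step and the use of F1 as the tuning metric are assumptions (the card reports F1 but does not say how the scan was scored).

```python
from collections import defaultdict


def assign_folds(record_ids, labels, k=5):
    """Give every record one fold id so all of its sentences stay together.

    Records are handed out greedily, most positives first, to the fold that
    currently holds the fewest positive sentences (ties: fewest sentences).
    This approximates stratification; it is not an exact algorithm.
    """
    pos = defaultdict(int)   # record id -> number of positive sentences
    size = defaultdict(int)  # record id -> number of sentences
    for rid, y in zip(record_ids, labels):
        pos[rid] += y
        size[rid] += 1

    fold_pos = [0] * k
    fold_size = [0] * k
    fold_of = {}
    for rid in sorted(size, key=lambda r: (-pos[r], -size[r], str(r))):
        f = min(range(k), key=lambda i: (fold_pos[i], fold_size[i], i))
        fold_of[rid] = f
        fold_pos[f] += pos[rid]
        fold_size[f] += size[rid]
    return fold_of


def f1_at(scores, labels, threshold):
    """F1 of the rule `score >= threshold` against binary labels."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(scores, labels, grid=None):
    """Scan a grid of cut-offs; keep the highest F1 (lowest cut-off on ties)."""
    grid = grid or [i / 100 for i in range(5, 100, 5)]  # assumed 0.05 step
    return max(grid, key=lambda t: (f1_at(scores, labels, t), -t))


# Toy data: four records, two of which contain a colophon sentence.
record_ids = ["ms1", "ms1", "ms2", "ms3", "ms3", "ms4"]
labels = [1, 0, 0, 1, 0, 0]
fold_of = assign_folds(record_ids, labels, k=2)

# Toy validation scores for one held-out fold.
val_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
val_labels = [1, 1, 1, 0, 0]
print(fold_of, best_threshold(val_scores, val_labels))
```

In a 5-fold run, best_threshold would be called once per held-out fold. How the per-fold values were reduced to the stored 0.45 is not stated in this card.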

Limitations

  • Single-head model: only COLOPHON is a learned classification. The is_provenance companion in the parent MHM pipeline is a Hebrew keyword heuristic, not this model.
  • Distant-supervision label noise: keyword-derived labels are not gold; some colophons that lack the canonical formulae are likely missed at training time.
  • Sentence-level: sentences are assumed to be split before inference (e.g. by the parent pipeline's MARC 500 splitter). On run-on text, performance drops.
  • Catalog scope: NLI MARC only; not validated on other catalogs.
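
The label-noise limitation above is easy to see in a toy version of keyword labeling. The sketch below assumes plain substring matching on the three keywords quoted under Training details. keyword_label is a hypothetical helper; the real logic lives in scripts/extract_marc500_sentences.py and may differ. The second sentence is an invented example, not catalog data.

```python
# Keywords quoted in the Training details list; the real extraction script
# may use a longer list and different matching rules.
COLOPHON_KEYWORDS = ("נשלם", "סיום", "קולופון")


def keyword_label(sentence, keywords=COLOPHON_KEYWORDS):
    """Return 1 if any keyword occurs as a substring of the sentence, else 0."""
    return int(any(k in sentence for k in keywords))


# Sentence from the usage example: it opens with the formula word "נשלם".
with_formula = "נשלם פירוש כל חמשה חומשי תורה יום ה כח לאדר."

# Invented colophon-like sentence ("written by the young scribe") that
# contains none of the three keywords, so keyword labeling marks it negative.
without_formula = "נכתב על ידי הסופר הצעיר"

print(keyword_label(with_formula), keyword_label(without_formula))
```

A sentence like the second one would enter training as a negative even though a human cataloguer might read it as a colophon, which is the kind of miss the limitation describes.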

Pipeline integration

In the MHM pipeline this model is consulted by NerWorker.run. Each MARC 500 sentence is scored:

  • COLOPHON-positive sentences are appended to record["ml_colophon_sentences"] and merged into record["colophon_text"], which feeds Wikidata P1684 (inscription).
  • PROVENANCE-positive sentences (per the keyword heuristic) are routed through the provenance NER pipeline as if they had come from MARC 561.
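
The routing described above might look roughly like the following. This is a sketch under stated assumptions, not the NerWorker.run implementation: route_sentences is a hypothetical function, the two scorers are passed in as plain callables, "provenance_candidates" is an invented key standing in for the hand-off to the provenance NER pipeline, and joining colophon sentences with a space is a guess at how the merge is done.

```python
def route_sentences(record, sentences, colophon_score, is_provenance,
                    threshold=0.45):
    """Route MARC 500 sentences into the record fields named above.

    `colophon_score` stands in for the learned head (returns a probability),
    `is_provenance` for the keyword heuristic (returns a bool). The two
    decisions are independent, so one sentence can be routed both ways.
    """
    new_colophons = []
    # Hypothetical key: the real pipeline forwards these to provenance NER.
    provenance = record.setdefault("provenance_candidates", [])
    for sentence in sentences:
        if colophon_score(sentence) >= threshold:
            new_colophons.append(sentence)
        if is_provenance(sentence):
            provenance.append(sentence)

    record.setdefault("ml_colophon_sentences", []).extend(new_colophons)
    # Merge into any colophon text the record already carries.
    existing = record.get("colophon_text") or ""
    record["colophon_text"] = " ".join(
        part for part in [existing, *new_colophons] if part)
    return record


# Stub scorers so the sketch runs without the model or the keyword list.
stub_scores = {"sentence A": 0.99, "sentence B": 0.22, "sentence C": 0.95}
record = route_sentences(
    {"colophon_text": ""},
    list(stub_scores),
    colophon_score=stub_scores.get,
    is_provenance=lambda s: s == "sentence C",
)
print(record)
```

Here "sentence C" lands in both lists, mirroring Example 3, where the learned head and the heuristic both fire.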

Pre-deployment estimate: P1684 (inscription) coverage rises from 41% to ~55% on the parent MHM pipeline when this model is enabled (CLAUDE.md Rule 35).

Citation

@software{mhm_marc500_classifier_2025,
  author = {Goldberg, Alexander},
  title  = {MARC 500 Hebrew Colophon Sentence Classifier},
  year   = {2025},
  url    = {https://huggingface.co/alexgoldberg/hebrew-manuscript-marc500-classifier},
  note   = {Mapping Hebrew Manuscripts (MHM) Pipeline, Bar-Ilan University},
}

License

Apache-2.0. The base model dicta-il/dictabert is © DICTA, used here under its published license.

Acknowledgments

DICTA (DictaBERT), National Library of Israel (catalog), Bar-Ilan University (MHM project).
