---
language: la
license: apache-2.0
library_name: transformers
tags:
  - latin
  - bert
  - nlp
  - classics
pipeline_tag: fill-mask
---

# Latin BERT (Bamman & Burns 2020)

HuggingFace-compatible packaging of the Latin BERT model from:

Bamman, D., & Burns, P.J. (2020). Latin BERT: A Contextual Language Model for Classical Philology. arXiv preprint arXiv:2009.10053.

The original model and training code are available at [github.com/dbamman/latin-bert](https://github.com/dbamman/latin-bert). This repository repackages the same weights for use with Hugging Face `transformers`.

**Note:** This is an experimental repackaging. If you encounter any issues, please open a thread in the Discussions tab.

## Model Details

- **Architecture:** BERT-base (12 layers, 768 hidden units, 12 attention heads)
- **Parameters:** ~111M
- **Vocab size:** 32,900 (SubwordTextEncoder)
- **Max sequence length:** 512
- **Training data:** Latin texts (see paper for details)

## Install

```bash
pip install transformers torch
```

## Usage

### Basic: get contextual embeddings

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "latincy/latin-bert", trust_remote_code=True
)
model = AutoModel.from_pretrained("latincy/latin-bert")

inputs = tokenizer("Gallia est omnis divisa in partes tres", return_tensors="pt")
outputs = model(**inputs)

# outputs.last_hidden_state: (batch, seq_len, 768)
```
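To collapse the per-token states into a single sentence vector, mean pooling over the attention mask is a common approach. This is not prescribed by the paper; the helper below is an illustrative sketch in plain PyTorch:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # (batch, 1), avoid div-by-zero
    return summed / counts

# e.g. sentence_vec = mean_pool(outputs.last_hidden_state, inputs["attention_mask"])
# sentence_vec has shape (batch, 768)
```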

### Masked language model (fill-mask)

```python
from transformers import AutoTokenizer, BertForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained(
    "latincy/latin-bert", trust_remote_code=True
)
model = BertForMaskedLM.from_pretrained("latincy/latin-bert")

text = "Gallia est omnis [MASK] in partes tres"
inputs = tokenizer(text, return_tensors="pt")
mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    logits = model(**inputs).logits

top5 = logits[0, mask_idx, :].topk(5).indices.squeeze()
for token_id in top5:
    print(tokenizer.decode([token_id.item()]))
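If the input contains more than one `[MASK]`, the indexing above generalizes per position. A small helper (illustrative sketch, pure tensor logic, independent of vocabulary size):

```python
import torch

def topk_per_mask(logits: torch.Tensor, input_ids: torch.Tensor,
                  mask_token_id: int, k: int = 5) -> list[torch.Tensor]:
    """Return the top-k token ids for each [MASK] position in a single sequence."""
    positions = (input_ids[0] == mask_token_id).nonzero(as_tuple=True)[0]
    return [logits[0, pos].topk(k).indices for pos in positions]

# e.g. for token_ids in topk_per_mask(logits, inputs["input_ids"], tokenizer.mask_token_id):
#     print(tokenizer.decode(token_ids))
```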

## Custom Tokenizer

The original Latin BERT uses a tensor2tensor `SubwordTextEncoder`, not standard WordPiece. This repo includes a faithful reimplementation as a Hugging Face `PreTrainedTokenizer`, which is why `trust_remote_code=True` is required.

Verified against the original case studies from the paper:

### POS tagging (Table 1)

| Treebank | Accuracy |
|----------|----------|
| Perseus  | 95.2%    |
| PROIEL   | 98.2%    |
| ITTB     | 99.2%    |

### Masked word prediction (Table 3)

| Metric | Score |
|--------|-------|
| P@1    | 33.1% |
| P@10   | 62.2% |
| P@50   | 74.0% |

## spaCy Integration

Works with [spacy-transformers](https://github.com/explosion/spacy-transformers); in your spaCy config:

```ini
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "latincy/latin-bert"

[components.transformer.model.tokenizer_config]
trust_remote_code = true
use_fast = false
```

## Citation

```bibtex
@article{bamman2020latin,
  title={Latin BERT: A Contextual Language Model for Classical Philology},
  author={Bamman, David and Burns, Patrick J},
  journal={arXiv preprint arXiv:2009.10053},
  year={2020}
}
```