---
language: la
license: apache-2.0
library_name: transformers
tags:
- latin
- bert
- nlp
- classics
pipeline_tag: fill-mask
---
# Latin BERT (Bamman & Burns 2020)
HuggingFace-compatible packaging of the Latin BERT model from:

> Bamman, D., & Burns, P. J. (2020). Latin BERT: A Contextual Language Model for Classical Philology. arXiv preprint arXiv:2009.10053.

The original model and training code are available at [github.com/dbamman/latin-bert](https://github.com/dbamman/latin-bert). This repo repackages the same weights for use with the Hugging Face `transformers` library.

**Note:** This is an experimental repackaging. If you encounter any issues, please open a thread in the Discussions tab.
## Model Details
- Architecture: BERT-base (12 layers, 768 hidden, 12 attention heads)
- Parameters: ~111M
- Vocab size: 32,900 (SubwordTextEncoder)
- Max sequence length: 512
- Training data: Latin texts (see paper for details)
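
These figures can be checked against the repackaged config; a minimal sketch (the expected values simply restate the list above):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("latincy/latin-bert")
print(config.num_hidden_layers)        # expected: 12
print(config.hidden_size)              # expected: 768
print(config.num_attention_heads)      # expected: 12
print(config.vocab_size)               # expected: 32900
print(config.max_position_embeddings)  # expected: 512
```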
## Install

```bash
pip install transformers torch
```
## Usage

### Basic: Get contextual embeddings
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "latincy/latin-bert", trust_remote_code=True
)
model = AutoModel.from_pretrained("latincy/latin-bert")

inputs = tokenizer("Gallia est omnis divisa in partes tres", return_tensors="pt")
outputs = model(**inputs)
# outputs.last_hidden_state: (batch, seq_len, 768)
```
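
If a single vector per sentence is needed, one common choice (ours, not prescribed by the original model) is attention-mask-weighted mean pooling over the token states; continuing from the snippet above:

```python
import torch

# Mean-pool token embeddings, ignoring padding positions.
mask = inputs["attention_mask"].unsqueeze(-1).float()    # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)   # (batch, 768)
sentence_embedding = summed / mask.sum(dim=1)            # (batch, 768)
print(sentence_embedding.shape)                          # torch.Size([1, 768])
```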
### Masked language model (fill-mask)
```python
from transformers import AutoTokenizer, BertForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained(
    "latincy/latin-bert", trust_remote_code=True
)
model = BertForMaskedLM.from_pretrained("latincy/latin-bert")

text = "Gallia est omnis [MASK] in partes tres"
inputs = tokenizer(text, return_tensors="pt")

# Locate the [MASK] position in the encoded input.
mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    logits = model(**inputs).logits

# Top 5 predictions for the masked token.
top5 = logits[0, mask_idx, :].topk(5).indices.squeeze()
for token_id in top5:
    print(tokenizer.decode([token_id.item()]))
```
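
The same check can be run through the `fill-mask` pipeline; a minimal sketch, assuming the pipeline picks up the custom tokenizer via `trust_remote_code`:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="latincy/latin-bert", trust_remote_code=True)

# Top predictions for the masked token, highest score first.
for prediction in fill("Gallia est omnis [MASK] in partes tres"):
    print(prediction["token_str"], round(prediction["score"], 3))
```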
## Custom Tokenizer
The original Latin BERT uses a tensor2tensor `SubwordTextEncoder`, not standard WordPiece. This repo includes a faithful reimplementation as a Hugging Face `PreTrainedTokenizer`, which is why `trust_remote_code=True` is required when loading the tokenizer.
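
To inspect the subword segmentation the custom encoder produces (the exact pieces depend on its learned vocabulary, so none are shown here):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "latincy/latin-bert", trust_remote_code=True
)

# Subword pieces and their ids for a short Latin sentence.
tokens = tokenizer.tokenize("arma virumque cano")
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))
```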
## Evaluation

Verified against the original case studies from the paper:
### POS tagging (Table 1)
| Treebank | Accuracy |
|---|---|
| Perseus | 95.2% |
| PROIEL | 98.2% |
| ITTB | 99.2% |
### Masked word prediction (Table 3)
| Metric | Score |
|---|---|
| P@1 | 33.1% |
| P@10 | 62.2% |
| P@50 | 74.0% |
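
Here P@k means the held-out word is among the model's top-k predictions at the masked position; a minimal illustration of that check (the function name is ours, not from the paper):

```python
import torch

def hit_at_k(logits: torch.Tensor, mask_index: int, gold_token_id: int, k: int) -> bool:
    """True if the gold token id appears in the top-k predictions at the masked position."""
    top_k_ids = logits[0, mask_index].topk(k).indices
    return gold_token_id in top_k_ids
```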
## spaCy Integration

Works with `spacy-transformers`:
```ini
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "latincy/latin-bert"

[components.transformer.model.tokenizer_config]
trust_remote_code = true
use_fast = false
```
## Citation

```bibtex
@article{bamman2020latin,
  title={Latin BERT: A Contextual Language Model for Classical Philology},
  author={Bamman, David and Burns, Patrick J},
  journal={arXiv preprint arXiv:2009.10053},
  year={2020}
}
```