---
library_name: transformers
language:
- grc
---
# SyllaMoBert-grc-v1: A Syllable-Based ModernBERT for Ancient Greek
**SyllaMoBert-grc-v1** is an experimental Transformer-based masked language model (MLM) trained on Ancient Greek texts, tokenized at the *syllable* level.
It is specifically designed to tackle tasks involving prosody, meter, and rhyme.
Input must be preprocessed and syllabified with `syllagreek_utils==0.1.0`:

```
pip install syllagreek_utils==0.1.0
```

```
tokens = preprocess_greek_line(line)
syllables = syllabify_joined(tokens)
```

This converts a line such as `Κατέβην χθὲς εἰς Πειραιᾶ` into `κα τέ βην χθὲ σεἰσ πει ραι ᾶ`.
**Observe that words are fused at the syllabic level.**
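For instance, a minimal round trip on the line above (the expected syllables follow from the example; note how the final sigma of `χθὲς` fuses onto `εἰς`):

```
from syllagreek_utils import preprocess_greek_line, syllabify_joined

tokens = preprocess_greek_line("Κατέβην χθὲς εἰς Πειραιᾶ")
syllables = syllabify_joined(tokens)
print(syllables)
# ['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ']
```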
Load and test the model like this:
```
# Install the pretokenizer that syllabifies Ancient Greek according to the
# principles the model adheres to:
#   pip install syllagreek_utils==0.1.0

import random

import torch
from transformers import AutoTokenizer, ModernBertForMaskedLM
from syllagreek_utils import preprocess_greek_line, syllabify_joined  # custom preprocessor & syllabifier

# Set the computation device: GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pretrained model and tokenizer from the Hugging Face Hub
checkpoint = "Ericu950/SyllaMoBert-grc-v1"
model = ModernBertForMaskedLM.from_pretrained(checkpoint).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Input line of Greek text
line = "φυήν τ ἄγχιστα ἴσως τὸ ἐξ εἴδους καὶ ψυχῆς φυὴν καλεῖ"

# Apply custom preprocessing: tokenization and normalization
tokens = preprocess_greek_line(line)

# Syllabify the tokens, fusing words at the syllable level
syllables = syllabify_joined(tokens)

# Randomly select a syllable index to mask
mask_idx = random.randint(0, len(syllables) - 1)

# Replace the selected syllable with the tokenizer's mask token (e.g. [MASK])
syllables[mask_idx] = tokenizer.mask_token
print("Masked syllables:", syllables)

# Tokenize the masked syllables and prepare model inputs;
# is_split_into_words=True tells the tokenizer not to split again
inputs = tokenizer(syllables, is_split_into_words=True, return_tensors="pt").to(device)

# Find the position of the mask token in the input tensor
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# Disable gradient calculation since we are only doing inference
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits  # raw prediction scores for each token

# Extract the logits at the masked position
mask_logits = logits[0, mask_token_index[0]]

# Get the top 5 predicted token IDs for the masked position
top_tokens = torch.topk(mask_logits, 5, dim=-1).indices

# Decode and print the top 5 predictions for the masked syllable
print("Top predictions for [MASK]:")
for token_id in top_tokens:
    print("→", tokenizer.decode([token_id.item()]))
```
This should print something like the following (the masked index is chosen at random, so the exact position varies):
```
Masked syllables: ['φυ', '[MASK]', 'τἄγ', 'χισ', 'τα', 'ἴ', 'σωσ', 'τὸ', 'ἐκ', 'σεἴ', 'δουσ', 'καὶπ', 'συ', 'χῆσ', 'φυ', 'ὴν', 'κα', 'λεῖ']
Top predictions for [MASK]:
→ ήν
→ ῆσ
→ ῇ
→ ὴν
→ ῆ
```
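Because every syllable is a vocabulary item, one common way to apply such a model to metrical or prosodic questions is pseudo-log-likelihood scoring: mask each syllable in turn and sum the log-probabilities the model assigns to the true syllables. The sketch below illustrates that general technique; it is not part of the published pipeline. In particular, `syllable_pll` is a hypothetical helper, and it assumes each syllable maps to a single entry in the model's vocabulary.

```
import torch
from transformers import AutoTokenizer, ModernBertForMaskedLM
from syllagreek_utils import preprocess_greek_line, syllabify_joined

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = "Ericu950/SyllaMoBert-grc-v1"
model = ModernBertForMaskedLM.from_pretrained(checkpoint).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def syllable_pll(line: str) -> float:
    """Pseudo-log-likelihood of a line: mask each syllable in turn and
    sum the log-probabilities the model assigns to the true syllable.
    Assumes each syllable is a single vocabulary item (out-of-vocabulary
    syllables would fall back to the unknown token)."""
    syllables = syllabify_joined(preprocess_greek_line(line))
    total = 0.0
    for i, syll in enumerate(syllables):
        masked = list(syllables)
        masked[i] = tokenizer.mask_token
        inputs = tokenizer(masked, is_split_into_words=True, return_tensors="pt").to(device)
        # Position of the mask token in the input sequence
        mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1][0]
        with torch.no_grad():
            logits = model(**inputs).logits
        log_probs = torch.log_softmax(logits[0, mask_pos], dim=-1)
        total += log_probs[tokenizer.convert_tokens_to_ids(syll)].item()
    return total

print(syllable_pll("Κατέβην χθὲς εἰς Πειραιᾶ"))
```

Higher scores indicate sequences the model finds more plausible, which could be used, for example, to compare candidate restorations or scansions of the same line.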
---
# License
MIT License.
---
# Authors
This work is part of ongoing research by **Eric Cullhed** (Uppsala University) and **Albin Thörn Cleland** (Lund University).
---
# Acknowledgements
The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.