---
library_name: transformers
language:
- grc
---
# SyllaMoBert-grc-v1: A Syllable-Based ModernBERT for Ancient Greek
**SyllaMoBert-grc-v1** is an experimental Transformer-based masked language model (MLM) trained on Ancient Greek texts, tokenized at the *syllable* level.
It is specifically designed to tackle tasks involving prosody, meter, and rhyme.
Input must be preprocessed and syllabified with `syllagreek_utils==0.1.0`:

```python
!pip install syllagreek_utils==0.1.0
from syllagreek_utils import preprocess_greek_line, syllabify_joined

tokens = preprocess_greek_line(line)
syllables = syllabify_joined(tokens)
```

This converts a line such as `Κατέβην χθὲς εἰς Πειραιᾶ` into `κα τέ βην χθὲ σεἰσ πει ραι ᾶ`.
**Observe that words are fused at the syllable level.**
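As a plain-Python illustration of the fusion property (this is not part of `syllagreek_utils`, just a sanity check on the example above): concatenating the syllables reproduces the line with spaces removed, lowercased, and final sigma (ς) normalized to medial sigma (σ):

```python
# Hypothetical check, not library code: the syllabified output of the example
# line, joined back together, matches a simple normalization of the input.
line = "Κατέβην χθὲς εἰς Πειραιᾶ"
syllables = ["κα", "τέ", "βην", "χθὲ", "σεἰσ", "πει", "ραι", "ᾶ"]

# Remove spaces, lowercase, and normalize final sigma to medial sigma
normalized = line.lower().replace(" ", "").replace("ς", "σ")

assert "".join(syllables) == normalized
```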
Load and test the model like this:
```python
# First install the pretokenizer that syllabifies Ancient Greek
# according to the principles the model adheres to
!pip install syllagreek_utils==0.1.0

# Imports
import random
import torch
from transformers import AutoTokenizer, ModernBertForMaskedLM
from syllagreek_utils import preprocess_greek_line, syllabify_joined  # custom preprocessor & syllabifier
# Set the computation device: GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load pretrained model and tokenizer from Hugging Face
checkpoint = "Ericu950/SyllaMoBert-grc-v1"
model = ModernBertForMaskedLM.from_pretrained(checkpoint).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Input Greek text line
line = 'φυήν τ ἄγχιστα ἴσως τὸ ἐξ εἴδους καὶ ψυχῆς φυὴν καλεῖ'
# Apply custom preprocessing: tokenization and normalization
tokens = preprocess_greek_line(line)
# Apply syllabification to tokens, joining them into syllables
syllables = syllabify_joined(tokens)
# Randomly select a syllable index to mask
mask_idx = random.randint(0, len(syllables) - 1)
# Replace the selected syllable with the tokenizer's mask token (e.g., [MASK])
syllables[mask_idx] = tokenizer.mask_token
print("Masked syllables:", syllables)
# Tokenize the masked syllables and prepare inputs for the model
# is_split_into_words=True tells the tokenizer not to split again
inputs = tokenizer(syllables, is_split_into_words=True, return_tensors="pt").to(device)
# Identify the index of the mask token in the input tensor
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
# Disable gradient calculation since we're just doing inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # raw prediction scores for each token
# Extract the logits corresponding to the masked position
mask_logits = logits[0, mask_token_index[0]]
# Get the top 5 predicted token IDs for the masked position
top_tokens = torch.topk(mask_logits, 5, dim=-1).indices
# Decode and print the top 5 predicted tokens for the masked syllable
print("Top predictions for [MASK]:")
for token_id in top_tokens:
    print("→", tokenizer.decode([token_id.item()]))
```
Because the masked syllable is chosen at random, your output will vary; one run printed:
```
Masked syllables: ['φυ', '[MASK]', 'τἄγ', 'χισ', 'τα', 'ἴ', 'σωσ', 'τὸ', 'ἐκ', 'σεἴ', 'δουσ', 'καὶπ', 'συ', 'χῆσ', 'φυ', 'ὴν', 'κα', 'λεῖ']
Top predictions for [MASK]:
→ ήν
→ ῆσ
→ ῇ
→ ὴν
→ ῆ
```
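To inspect the model's confidence rather than just its ranking, the masked-position logits can be converted to probabilities with a softmax before taking the top k. The snippet below sketches this on a dummy logits tensor standing in for `mask_logits` from the example above, so it runs without downloading the checkpoint:

```python
import torch

# Dummy stand-in for mask_logits = logits[0, mask_token_index[0]];
# here a toy "vocabulary" of 10 syllables.
mask_logits = torch.tensor([2.0, 1.0, 0.5, 3.0, -1.0, 0.0, 1.5, 2.5, -0.5, 0.3])

# Softmax turns raw scores into a probability distribution over the vocabulary
probs = torch.softmax(mask_logits, dim=-1)

# Top-5 predictions with their probabilities
top = torch.topk(probs, 5)
for prob, token_id in zip(top.values, top.indices):
    print(f"token id {token_id.item()}: {prob.item():.3f}")
```

With the real model, use the tensor extracted at the masked position as `mask_logits` and decode each `token_id` with `tokenizer.decode([token_id.item()])`.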
---
# License
MIT License.
---
# Authors
This work is part of ongoing research by **Eric Cullhed** (Uppsala University) and **Albin Thörn Cleland** (Lund University).
---
# Acknowledgements
The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725. |