---
library_name: transformers
language:
- grc
---
# SyllaMoBert-grc-v1: A Syllable-Based ModernBERT for Ancient Greek

**SyllaMoBert-grc-v1** is an experimental Transformer-based masked language model (MLM) trained on Ancient Greek texts, tokenized at the *syllable* level.

It is specifically designed to tackle tasks involving prosody, meter, and rhyme.

Input needs to be preprocessed and syllabified with `syllagreek_utils==0.1.0`:

```python
!pip install syllagreek_utils==0.1.0

from syllagreek_utils import preprocess_greek_line, syllabify_joined

tokens = preprocess_greek_line(line)
syllables = syllabify_joined(tokens)
```

This will convert a line such as `Κατέβην χθὲς εἰς Πειραιᾶ` into `κα τέ βην χθὲ σεἰσ πει ραι ᾶ`.

**Observe that words are fused at the syllabic level.**
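
For example, running the snippet above on the line just mentioned should reproduce the fused syllabification (a minimal sketch, assuming `syllagreek_utils==0.1.0` is installed):

```python
from syllagreek_utils import preprocess_greek_line, syllabify_joined

# Preprocess and syllabify the example line
tokens = preprocess_greek_line("Κατέβην χθὲς εἰς Πειραιᾶ")
syllables = syllabify_joined(tokens)

# Expected, per the example above: κα τέ βην χθὲ σεἰσ πει ραι ᾶ
# Note how the final ς of χθὲς is carried over into the syllable σεἰσ
print(" ".join(syllables))
```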

Load and test the model like this:

```python
# First install the pretokenizer that syllabifies Ancient Greek according to the principles the model adheres to
!pip install syllagreek_utils==0.1.0

# Import what's needed
import random
import torch
from transformers import AutoTokenizer, ModernBertForMaskedLM
from syllagreek_utils import preprocess_greek_line, syllabify_joined  # the custom preprocessor & syllabifier

# Set the computation device: GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pretrained model and tokenizer from Hugging Face
checkpoint = "Ericu950/SyllaMoBert-grc-v1"
model = ModernBertForMaskedLM.from_pretrained(checkpoint).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Input Greek text line
line = 'φυήν τ ἄγχιστα ἴσως τὸ ἐξ εἴδους καὶ ψυχῆς φυὴν καλεῖ'

# Apply custom preprocessing: tokenization and normalization
tokens = preprocess_greek_line(line)

# Apply syllabification to the tokens, joining them into syllables
syllables = syllabify_joined(tokens)

# Randomly select a syllable index to mask
mask_idx = random.randint(0, len(syllables) - 1)

# Replace the selected syllable with the tokenizer's mask token (e.g., [MASK])
syllables[mask_idx] = tokenizer.mask_token

print("Masked syllables:", syllables)

# Tokenize the masked syllables and prepare inputs for the model;
# is_split_into_words=True tells the tokenizer not to split again
inputs = tokenizer(syllables, is_split_into_words=True, return_tensors="pt").to(device)

# Identify the index of the mask token in the input tensor
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# Disable gradient calculation since we're just doing inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # raw prediction scores for each token

# Extract the logits corresponding to the masked position
mask_logits = logits[0, mask_token_index[0]]

# Get the top 5 predicted token IDs for the masked position
top_tokens = torch.topk(mask_logits, 5, dim=-1).indices

# Decode and print the top 5 predicted tokens for the masked syllable
print("Top predictions for [MASK]:")
for token_id in top_tokens:
    print("→", tokenizer.decode([token_id.item()]))
```

This should print something like the following (the masked position is chosen at random):

```
Masked syllables: ['φυ', '[MASK]', 'τἄγ', 'χισ', 'τα', 'ἴ', 'σωσ', 'τὸ', 'ἐκ', 'σεἴ', 'δουσ', 'καὶπ', 'συ', 'χῆσ', 'φυ', 'ὴν', 'κα', 'λεῖ']
Top predictions for [MASK]:
→ ήν
→ ῆσ
→ ῇ
→ ὴν
→ ῆ
```
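
If you want probabilities rather than raw scores for the top candidates, you can apply a softmax at the masked position (a small sketch, reusing the `mask_logits` and `tokenizer` variables already defined in the script above):

```python
import torch

# Convert the logits at the masked position into a probability distribution
probs = torch.softmax(mask_logits, dim=-1)

# Print the five most likely syllables together with their probabilities
top = torch.topk(probs, 5, dim=-1)
for token_id, p in zip(top.indices, top.values):
    print(f"→ {tokenizer.decode([token_id.item()])}  ({p.item():.3f})")
```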

---

# License

MIT License.

---

# Authors

This work is part of ongoing research by **Eric Cullhed** (Uppsala University) and **Albin Thörn Cleland** (Lund University).

---

# Acknowledgements

The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.