---
library_name: transformers
language:
- grc
---

# SyllaMoBert-grc-v1: A Syllable-Based ModernBERT for Ancient Greek

**SyllaMoBert-grc-v1** is an experimental Transformer-based masked language model (MLM) trained on Ancient Greek texts and tokenized at the *syllable* level. It is specifically designed for tasks involving prosody, meter, and rhyme.

Input must be preprocessed and syllabified with `syllagreek_utils==0.1.0`:

```
!pip install syllagreek_utils==0.1.0
```

```
from syllagreek_utils import preprocess_greek_line, syllabify_joined

tokens = preprocess_greek_line(line)
syllables = syllabify_joined(tokens)
```

This converts a line such as `Κατέβην χθὲς εἰς Πειραιᾶ` into `κα τέ βην χθὲ σεἰσ πει ραι ᾶ`. **Observe that words are fused at the syllabic level.**

Load and test the model like this:

```
# First install the pretokenizer that syllabifies Ancient Greek
# according to the principles the model adheres to
!pip install syllagreek_utils==0.1.0

# Import what's needed
import random
import torch
from transformers import AutoTokenizer, ModernBertForMaskedLM
from syllagreek_utils import preprocess_greek_line, syllabify_joined  # custom preprocessor & syllabifier

# Set the computation device: GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pretrained model and tokenizer from Hugging Face
checkpoint = "Ericu950/SyllaMoBert-grc-v1"
model = ModernBertForMaskedLM.from_pretrained(checkpoint).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Input Greek text line
line = 'φυήν τ ἄγχιστα ἴσως τὸ ἐξ εἴδους καὶ ψυχῆς φυὴν καλεῖ'

# Apply custom preprocessing: tokenization and normalization
tokens = preprocess_greek_line(line)

# Apply syllabification, joining the tokens into syllables
syllables = syllabify_joined(tokens)

# Randomly select a syllable index to mask
mask_idx = random.randint(0, len(syllables) - 1)

# Replace the selected syllable with the tokenizer's mask token (e.g., [MASK])
syllables[mask_idx] = tokenizer.mask_token
print("Masked syllables:", syllables)

# Tokenize the masked syllables and prepare inputs for the model;
# is_split_into_words=True tells the tokenizer not to split again
inputs = tokenizer(syllables, is_split_into_words=True, return_tensors="pt").to(device)

# Identify the index of the mask token in the input tensor
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# Disable gradient calculation since we're just doing inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # raw prediction scores for each token

# Extract the logits corresponding to the masked position
mask_logits = logits[0, mask_token_index[0]]

# Get the top 5 predicted token IDs for the masked position
top_tokens = torch.topk(mask_logits, 5, dim=-1).indices

# Decode and print the top 5 predicted tokens for the masked syllable
print("Top predictions for [MASK]:")
for token_id in top_tokens:
    print("→", tokenizer.decode([token_id.item()]))
```

This should print something like the following (the masked syllable is chosen at random, so the exact output varies):

```
Masked syllables: ['φυ', '[MASK]', 'τἄγ', 'χισ', 'τα', 'ἴ', 'σωσ', 'τὸ', 'ἐκ', 'σεἴ', 'δουσ', 'καὶπ', 'συ', 'χῆσ', 'φυ', 'ὴν', 'κα', 'λεῖ']
Top predictions for [MASK]:
→ ήν
→ ῆσ
→ ῇ
→ ὴν
→ ῆ
```

For downstream work on meter or rhyme, per-syllable embeddings can be extracted as sketched in the appendix below.

---

# License

MIT License.

---

# Authors

This work is part of ongoing research by **Eric Cullhed** (Uppsala University) and **Albin Thörn Cleland** (Lund University).
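---

# Appendix: Extracting syllable embeddings

The example above demonstrates masked-syllable prediction. Since the card motivates the model with prosody, meter, and rhyme, the sketch below shows one way to obtain per-syllable hidden states as features for such tasks. This is a minimal illustrative sketch, not an official example from the model authors: it assumes the standard `transformers` `output_hidden_states=True` mechanism and reuses the checkpoint and preprocessing shown above.

```
# Illustrative sketch (assumes the standard transformers hidden-state API;
# not an official example from the model authors).
import torch
from transformers import AutoTokenizer, ModernBertForMaskedLM
from syllagreek_utils import preprocess_greek_line, syllabify_joined

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = "Ericu950/SyllaMoBert-grc-v1"
model = ModernBertForMaskedLM.from_pretrained(checkpoint).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Preprocess and syllabify exactly as in the usage example above
syllables = syllabify_joined(preprocess_greek_line("Κατέβην χθὲς εἰς Πειραιᾶ"))
inputs = tokenizer(syllables, is_split_into_words=True, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Last hidden layer: one vector per input position
# (special tokens plus, assuming each syllable is a single vocabulary item,
# one vector per syllable)
embeddings = outputs.hidden_states[-1][0]
print(embeddings.shape)  # (sequence_length, hidden_size)
```

If each syllable maps to a single vocabulary item, the rows between the tokenizer's special tokens align one-to-one with the syllables, which makes the vectors straightforward to match against metrical positions.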
---

# Acknowledgements

The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.