---
library_name: transformers
language:
- grc
base_model:
- Ericu950/SyllaMoBert-grc-v1
pipeline_tag: token-classification
---

# SyllaMoBert-grc-macronizer-v1

**SyllaMoBert-grc-macronizer-v1** is a token classification model designed for the macronization of Ancient Greek. It predicts the syllabic quantity (long or short) of dichrona, open syllables whose length depends on morphological or phonological context.

The model was evaluated using an 80/10/10 train/dev/test split and achieved the following accuracy:

- 97.9% on open syllables with short dichrona
- 99.0% on open syllables with long dichrona
- 99.8% on the (trivially predictable) class of heavy syllables

This makes SyllaMoBert-grc-macronizer-v1 a useful tool for tasks involving prosody and metrical analysis.

This model was trained on data generated by [Albin Thörn Cleland’s rule-based macronizer](https://github.com/Urdatorn/macronize-tlg). It is a fine-tuned version of [`Ericu950/SyllaMoBert-grc-v1`](https://huggingface.co/Ericu950/SyllaMoBert-grc-v1), a [ModernBERT](https://huggingface.co/docs/transformers/model_doc/modern_bert) model trained from scratch on syllabified Ancient Greek texts.

---

## Quick Start

First, install the syllabification utility:

```bash
pip install syllagreek_utils==0.1.0
```

Then run the following code:

```python
import torch
from torch.nn.functional import softmax
from transformers import PreTrainedTokenizerFast, ModernBertForTokenClassification
from syllagreek_utils import preprocess_greek_line, syllabify_joined

# Load model and tokenizer
model_path = "Ericu950/SyllaMoBert-grc-macronizer-v1"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_path)
model = ModernBertForTokenClassification.from_pretrained(model_path, torch_dtype=torch.bfloat16)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Input line
line = "φάσγανον Ἀσσυρίοιο παρήορον ἐκ τελαμῶνος"

# Preprocess and syllabify
tokens = preprocess_greek_line(line)
syllables = syllabify_joined(tokens)
print("Syllables:", syllables)

# Tokenize, treating each syllable as one "word"
encoding = tokenizer(
    syllables,
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True,
    max_length=2048,
    padding="max_length"
)
encoding.pop("token_type_ids", None)
word_ids = encoding.word_ids()  # maps each token position to its syllable index
inputs = {k: v.to(device) for k, v in encoding.items()}

# Predict
with torch.no_grad():
    logits = model(**inputs).logits
probs = softmax(logits, dim=-1)
predictions = torch.argmax(probs, dim=-1).squeeze().cpu().numpy()

# Align predictions with syllables: take the prediction of the first
# subword token of each syllable, skipping special and padding tokens
aligned_preds = []
seen = set()
for pos, word_id in enumerate(word_ids):
    if word_id is None or word_id in seen:
        continue
    seen.add(word_id)
    aligned_preds.append((syllables[word_id], predictions[pos]))

# Print results
label_map = {0: "clear", 1: "ambiguous → long", 2: "ambiguous → short"}
print("\nMacronization Predictions:")
for syll, label in aligned_preds:
    print(f"{syll:>10} → {label_map[label]}")
```

Example output:

```
Syllables: ['φάσ', 'γα', 'νο', 'νἀσ', 'συ', 'ρί', 'οι', 'ο', 'πα', 'ρή', 'ο', 'ρο', 'νἐκ', 'τε', 'λα', 'μῶ', 'νοσ']

Macronization Predictions:
       φάσ → clear
        γα → ambiguous → short
        νο → clear
       νἀσ → clear
        συ → ambiguous → short
        ρί → ambiguous → short
        οι → clear
         ο → clear
        πα → ambiguous → short
        ρή → clear
         ο → clear
        ρο → clear
       νἐκ → clear
        τε → clear
        λα → ambiguous → short
        μῶ → clear
       νοσ → clear
```
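The model only labels syllables; turning those labels into marked-up text is left to the caller. Below is a minimal, hypothetical sketch of one way to do this: the `mark_syllable` helper is an illustration, not part of `syllagreek_utils` or this model’s API. It attaches a combining macron or breve to the first dichronon of each ambiguous syllable, reusing `aligned_preds` from the snippet above:

```python
import unicodedata

# Hypothetical post-processing sketch; assumes `aligned_preds` from the
# Quick Start example above. Not part of syllagreek_utils.
MACRON, BREVE = "\u0304", "\u0306"  # combining macron / combining breve
DICHRONA = set("αιυ")               # vowels whose quantity can be ambiguous

def mark_syllable(syll: str, label: int) -> str:
    """Attach a macron (label 1) or breve (label 2) to the first dichronon."""
    if label == 0:
        return syll
    mark = MACRON if label == 1 else BREVE
    out, marked = [], False
    for ch in syll:
        out.append(ch)
        # Compare on the decomposed base character so accented vowels match
        base = unicodedata.normalize("NFD", ch)[0].lower()
        if not marked and base in DICHRONA:
            out.append(mark)
            marked = True
    return "".join(out)

macronized = "".join(mark_syllable(s, int(l)) for s, l in aligned_preds)
print(unicodedata.normalize("NFC", macronized))
```

Note that `syllabify_joined` merges syllables across word boundaries (e.g. `νἀσ` above), so simply concatenating the marked syllables does not recover word spacing.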
---

## 📝 License

This project is released under the MIT License.

---

## 👥 Authors

This work is part of ongoing research by:

- Albin Thörn Cleland (Lund University)
- Eric Cullhed (Uppsala University)

---

## 💻 Acknowledgements

The computations were made possible by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council (grant agreement no. 2022-06725).