|
|
--- |
|
|
library_name: transformers |
|
|
language: |
|
|
- grc |
|
|
base_model: |
|
|
- Ericu950/SyllaMoBert-grc-v1 |
|
|
pipeline_tag: token-classification |
|
|
--- |
|
|
|
|
|
# SyllaMoBert-grc-macronizer-v1 |
|
|
|
|
|
**SyllaMoBert-grc-macronizer-v1** is a token classification model designed for the macronization of Ancient Greek. It predicts the syllabic quantity (long or short) of open syllables containing dichrona, the vowels α, ι and υ, whose length is not evident from spelling but depends on morphological or phonological context.
|
|
|
|
|
The model was evaluated with an 80/10/10 train/dev/test split and achieved the following accuracy on the held-out test set:
|
|
- 97.9% on open syllables with short dichrona |
|
|
- 99.0% on open syllables with long dichrona |
|
|
- 99.8% on the (trivially predictable) class of heavy syllables |
|
|
|
|
|
This makes SyllaMoBert-grc-macronizer-v1 a useful tool for tasks involving prosody and metrical analysis.
|
|
|
|
|
The model was trained on data generated by [Albin Thörn Cleland's rule-based macronizer](https://github.com/Urdatorn/macronize-tlg). It is a fine-tuned version of [`Ericu950/SyllaMoBert-grc-v1`](https://huggingface.co/Ericu950/SyllaMoBert-grc-v1), a [ModernBERT](https://huggingface.co/docs/transformers/model_doc/modern_bert) model pretrained from scratch on syllabified Ancient Greek texts.
|
|
|
|
|
--- |
|
|
|
|
|
## Quick Start |
|
|
|
|
|
First, install the syllabification utility: |
|
|
|
|
|
```bash |
|
|
pip install syllagreek_utils==0.1.0 |
|
|
``` |
|
|
|
|
|
Then run the following code: |
|
|
```python
|
|
import torch |
|
|
from transformers import PreTrainedTokenizerFast, ModernBertForTokenClassification |
|
|
from syllagreek_utils import preprocess_greek_line, syllabify_joined |
|
|
from torch.nn.functional import softmax |
|
|
|
|
|
# Load model and tokenizer |
|
|
model_path = "Ericu950/SyllaMoBert-grc-macronizer-v1" |
|
|
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_path) |
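# Loading in bfloat16 halves the model's memory footprint but requires
# hardware support; drop the torch_dtype argument to load in float32 instead.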
|
|
model = ModernBertForTokenClassification.from_pretrained(model_path, torch_dtype=torch.bfloat16) |
|
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
model.to(device) |
|
|
model.eval() |
|
|
|
|
|
# Input line |
|
|
line = "φάσγανον Ἀσσυρίοιο παρήορον ἐκ τελαμῶνος" |
|
|
|
|
|
# Preprocess and syllabify |
|
|
tokens = preprocess_greek_line(line) |
|
|
syllables = syllabify_joined(tokens) |
|
|
print("Syllables:", syllables) |
|
|
|
|
|
# Tokenize |
|
|
inputs = tokenizer( |
|
|
syllables, |
|
|
is_split_into_words=True, |
|
|
return_tensors="pt", |
|
|
truncation=True, |
|
|
max_length=2048, |
|
|
padding="max_length" |
|
|
) |
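# ModernBERT takes no token_type_ids, so drop them if the tokenizer returns them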
|
|
inputs.pop("token_type_ids", None) |
|
|
inputs = {k: v.to(device) for k, v in inputs.items()} |
|
|
|
|
|
# Predict |
|
|
with torch.no_grad(): |
|
|
logits = model(**inputs).logits |
|
|
probs = softmax(logits, dim=-1) |
|
|
predictions = torch.argmax(probs, dim=-1).squeeze().cpu().numpy() |
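# predictions holds one label id per token position, including special tokens
# ([CLS], [SEP], padding), which the alignment step below must account for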
|
|
|
|
|
# Align predictions with syllables |
|
|
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze().tolist())
|
|
aligned_preds = [] |
|
|
syll_idx = 0 |
|
|
for tok_idx, tok in enumerate(tokens):
    if tok in tokenizer.all_special_tokens:
        continue  # skip [CLS], [SEP] and padding positions
    if syll_idx >= len(syllables):
        break
    # index predictions by token position, not syllable index,
    # so the leading [CLS] token does not shift the labels
    aligned_preds.append((syllables[syll_idx], predictions[tok_idx]))
    syll_idx += 1
|
|
|
|
|
# Print results |
|
|
label_map = {0: "clear", 1: "ambiguous → long", 2: "ambiguous → short"} |
|
|
print("\nMacronization Predictions:") |
|
|
for syll, label in aligned_preds: |
|
|
print(f"{syll:>10} → {label_map[label]}") |
|
|
|
|
|
```

Example output:

```
|
|
|
|
|
Syllables: ['φάσ', 'γα', 'νο', 'νἀσ', 'συ', 'ρί', 'οι', 'ο', 'πα', 'ρή', 'ο', 'ρο', 'νἐκ', 'τε', 'λα', 'μῶ', 'νοσ'] |
|
|
|
|
|
Macronization Predictions: |
|
|
φάσ → clear |
|
|
γα → ambiguous → short |
|
|
νο → clear |
|
|
νἀσ → clear |
|
|
συ → ambiguous → short |
|
|
ρί → ambiguous → short |
|
|
οι → clear |
|
|
ο → clear |
|
|
πα → ambiguous → short |
|
|
ρή → clear |
|
|
ο → clear |
|
|
ρο → clear |
|
|
νἐκ → clear |
|
|
τε → clear |
|
|
λα → ambiguous → short |
|
|
μῶ → clear |
|
|
νοσ → clear
```
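Continuing from the Quick Start above, here is a minimal sketch of one way to render the predictions as visible diacritics, attaching a combining macron or breve to the dichronic vowel of each ambiguous syllable. The `DICHRONA` set and the `mark_syllable` helper are assumptions of this example, not part of `syllagreek_utils`, and the label ids follow the `label_map` used above.

```python
# Sketch only: mark each ambiguous syllable with a combining macron (long)
# or breve (short). Placement after precomposed accented vowels is
# approximate; a real implementation would normalize the text first.
COMBINING_MACRON = "\u0304"
COMBINING_BREVE = "\u0306"

# Illustrative subset of dichronic vowels (α, ι, υ and common accented forms)
DICHRONA = set("αιυάὰᾶίὶῖύὺῦἀἁἰἱὐὑ")

def mark_syllable(syllable: str, label: int) -> str:
    """Attach a macron (label 1) or breve (label 2) to the first dichronon."""
    if label == 0:
        return syllable  # quantity already clear from the spelling
    mark = COMBINING_MACRON if label == 1 else COMBINING_BREVE
    for i, ch in enumerate(syllable):
        if ch in DICHRONA:
            return syllable[: i + 1] + mark + syllable[i + 1 :]
    return syllable

print("".join(mark_syllable(syll, int(label)) for syll, label in aligned_preds))
```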
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
## 📝 License
|
|
|
|
|
This project is released under the MIT License. |
|
|
|
|
|
---
|
|
|
|
|
## 👥 Authors
|
|
|
|
|
This work is part of ongoing research by: |
|
|
- Albin Thörn Cleland (Lund University)
- Eric Cullhed (Uppsala University)
|
|
|
|
|
---
|
|
|
|
|
## 💻 Acknowledgements
|
|
|
|
|
The computations were made possible by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council (grant agreement no. 2022-06725). |
|
|
|
|
|