---
library_name: transformers
language:
- grc
base_model:
- Ericu950/SyllaMoBert-grc-v1
pipeline_tag: token-classification
---

# SyllaMoBert-grc-macronizer-v1

**SyllaMoBert-grc-macronizer-v1** is a token-classification model for the macronization of Ancient Greek. It predicts the syllabic quantity (long or short) of dichrona: open syllables whose length depends on morphological or phonological context.

The model was evaluated using an 80/10/10 train/dev/test split and achieved the following accuracy:
- 97.9% on open syllables with short dichrona
- 99.0% on open syllables with long dichrona
- 99.8% on the (trivially predictable) class of heavy syllables

This makes SyllaMoBert-grc-macronizer-v1 a useful tool for tasks involving prosody and metrical analysis.

This model is trained on data generated by [Albin Thörn Cleland’s rule-based macronizer](https://github.com/Urdatorn/macronize-tlg). It is a fine-tuned version of the base model [`Ericu950/SyllaMoBert-grc-v1`](https://huggingface.co/Ericu950/SyllaMoBert-grc-v1), a [ModernBERT](https://huggingface.co/docs/transformers/model_doc/modern_bert) model trained from scratch on syllabified Ancient Greek texts.

---

## Quick Start

First, install the syllabification utility:

```bash
pip install syllagreek_utils==0.1.0
```

Then run the following code:
```python
import torch
from transformers import PreTrainedTokenizerFast, ModernBertForTokenClassification
from syllagreek_utils import preprocess_greek_line, syllabify_joined
from torch.nn.functional import softmax

# Load model and tokenizer
model_path = "Ericu950/SyllaMoBert-grc-macronizer-v1"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_path)
model = ModernBertForTokenClassification.from_pretrained(model_path, torch_dtype=torch.bfloat16)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Input line
line = "φάσγανον Ἀσσυρίοιο παρήορον ἐκ τελαμῶνος"

# Preprocess and syllabify
tokens = preprocess_greek_line(line)
syllables = syllabify_joined(tokens)
print("Syllables:", syllables)

# Tokenize
inputs = tokenizer(
    syllables,
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True,
    max_length=2048,
    padding="max_length"
)
inputs.pop("token_type_ids", None)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Predict
with torch.no_grad():
    logits = model(**inputs).logits
    probs = softmax(logits, dim=-1)
    predictions = torch.argmax(probs, dim=-1).squeeze().cpu().numpy()

# Align predictions with syllables (skip special tokens such as [CLS]/[SEP])
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze())
aligned_preds = []
syll_idx = 0
for tok_idx, tok in enumerate(tokens):
    if tok in tokenizer.all_special_tokens:
        continue
    if syll_idx >= len(syllables):
        break
    # Use the prediction at the token's own position, not the syllable index,
    # so the leading special token does not shift the alignment.
    aligned_preds.append((syllables[syll_idx], predictions[tok_idx]))
    syll_idx += 1

# Print results
label_map = {0: "clear", 1: "ambiguous → long", 2: "ambiguous → short"}
print("\nMacronization Predictions:")
for syll, label in aligned_preds:
    print(f"{syll:>10} → {label_map[label]}")

```

Example output:

```text
Syllables: ['φάσ', 'γα', 'νο', 'νἀσ', 'συ', 'ρί', 'οι', 'ο', 'πα', 'ρή', 'ο', 'ρο', 'νἐκ', 'τε', 'λα', 'μῶ', 'νοσ']

Macronization Predictions:
       φάσ → clear
        γα → ambiguous → short
        νο → clear
       νἀσ → clear
        συ → ambiguous → short
        ρί → ambiguous → short
        οι → clear
         ο → clear
        πα → ambiguous → short
        ρή → clear
         ο → clear
        ρο → clear
       νἐκ → clear
        τε → clear
        λα → ambiguous → short
        μῶ → clear
       νοσ → clear
```

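The predicted labels can be turned into visible length marks. The sketch below is a hypothetical post-processing step, not part of the model repository or `syllagreek_utils`: both `mark_syllable` and the `DICHRONA` set are illustrative assumptions. It attaches a combining macron (long) or breve (short) to the first ambiguous vowel of each syllable:

```python
# Hypothetical helper: render the model's labels as diacritics.
# Labels follow the label_map above: 0 = clear, 1 = ambiguous → long,
# 2 = ambiguous → short.
MACRON = "\u0304"  # combining macron (long)
BREVE = "\u0306"   # combining breve (short)

# Illustrative set of ambiguous vowels (dichrona); extend as needed
# to cover further diacritic combinations.
DICHRONA = set("αιυἀἰὐἁἱὑάίύὰὶὺ")

def mark_syllable(syllable: str, label: int) -> str:
    """Insert a length mark after the first dichronon vowel in the syllable."""
    if label == 0:
        return syllable  # quantity already clear, leave untouched
    mark = MACRON if label == 1 else BREVE
    for i, ch in enumerate(syllable):
        if ch in DICHRONA:
            return syllable[: i + 1] + mark + syllable[i + 1 :]
    return syllable  # no markable vowel found

# Applied to (syllable, label) pairs such as aligned_preds above:
marked = [mark_syllable(s, lab) for s, lab in [("γα", 2), ("ρή", 0), ("μα", 1)]]
print("".join(marked))
```

In practice you would feed `aligned_preds` from the Quick Start snippet straight into `mark_syllable` and join the results to obtain a macronized line.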
## 📝 License

This project is released under the MIT License.



## 👥 Authors

This work is part of ongoing research by:
- Albin Thörn Cleland (Lund University)
- Eric Cullhed (Uppsala University)



## 💻 Acknowledgements

The computations were made possible by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council (grant agreement no. 2022-06725).
