---
library_name: transformers
language:
- grc
---
# SyllaMoBert-grc-v1: A Syllable-Based ModernBERT for Ancient Greek

**SyllaMoBert-grc-v1** is an experimental Transformer-based masked language model (MLM) trained on Ancient Greek texts, tokenized at the *syllable* level.  
It is specifically designed to tackle tasks involving prosody, meter, and rhyme.

Input must be preprocessed and syllabified with `syllagreek_utils==0.1.0`:

```

!pip install syllagreek_utils==0.1.0

from syllagreek_utils import preprocess_greek_line, syllabify_joined

tokens = preprocess_greek_line(line)
syllables = syllabify_joined(tokens)

```

This converts a line such as `Κατέβην χθὲς εἰς Πειραιᾶ` into `κα τέ βην χθὲ σεἰσ πει ραι ᾶ`.

**Observe that words are fused at the syllabic level.**

Load and test the model like this:

```

# First install the pretokenizer that syllabifies Ancient Greek
# according to the principles the model adheres to
!pip install syllagreek_utils==0.1.0

# Import what's needed

import random
import torch
from transformers import AutoTokenizer, ModernBertForMaskedLM
from syllagreek_utils import preprocess_greek_line, syllabify_joined  # the custom preprocessor & syllabifier

# Set the computation device: GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load pretrained model and tokenizer from Hugging Face
checkpoint = "Ericu950/SyllaMoBert-grc-v1"
model = ModernBertForMaskedLM.from_pretrained(checkpoint).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Input Greek text line
line = 'φυήν τ ἄγχιστα ἴσως τὸ ἐξ εἴδους καὶ ψυχῆς φυὴν καλεῖ'

# Apply custom preprocessing: tokenization and normalization
tokens = preprocess_greek_line(line)

# Apply syllabification to tokens, joining them into syllables
syllables = syllabify_joined(tokens)

# Randomly select a syllable index to mask
mask_idx = random.randint(0, len(syllables) - 1)

# Replace the selected syllable with the tokenizer's mask token (e.g., [MASK])
syllables[mask_idx] = tokenizer.mask_token

print("Masked syllables:", syllables)

# Tokenize the masked syllables and prepare inputs for the model
# is_split_into_words=True tells the tokenizer not to split again
inputs = tokenizer(syllables, is_split_into_words=True, return_tensors="pt").to(device)

# Identify the index of the mask token in the input tensor
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# Disable gradient calculation since we're just doing inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # raw prediction scores for each token

# Extract the logits corresponding to the masked position
mask_logits = logits[0, mask_token_index[0]]

# Get the top 5 predicted token IDs for the masked position
top_tokens = torch.topk(mask_logits, 5, dim=-1).indices

# Decode and print the top 5 predicted tokens for the masked syllable
print("Top predictions for [MASK]:")
for token_id in top_tokens:
    print("→", tokenizer.decode([token_id.item()]))

```

This should print something like the following (the masked position is chosen at random, so the exact output will vary):

```

Masked syllables: ['φυ', '[MASK]', 'τἄγ', 'χισ', 'τα', 'ἴ', 'σωσ', 'τὸ', 'ἐκ', 'σεἴ', 'δουσ', 'καὶπ', 'συ', 'χῆσ', 'φυ', 'ὴν', 'κα', 'λεῖ']
Top predictions for [MASK]:
→ ήν
→ ῆσ
→ ῇ
→ ὴν
→ ῆ

```
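The example above masks a single syllable at random. To score every syllable in a line (for instance, as a step toward pseudo-perplexity-style metrical analysis), a simple approach is to mask each position in turn and run the resulting copies through the model as a batch. Below is a minimal, pure-Python sketch of the masking step only; `masked_copies` is a hypothetical helper (not part of the model or `syllagreek_utils`), and the literal `"[MASK]"` default is an assumption — in practice pass `tokenizer.mask_token`.

```python
def masked_copies(syllables, mask_token="[MASK]"):
    """Return one (index, copy) pair per position, where the copy has
    exactly that one syllable replaced by the mask token."""
    copies = []
    for i in range(len(syllables)):
        copy = list(syllables)  # shallow copy so the original list stays intact
        copy[i] = mask_token
        copies.append((i, copy))
    return copies

# Each copy can then be tokenized with is_split_into_words=True,
# batched, and scored with the model exactly as in the snippet above.
copies = masked_copies(["κα", "τέ", "βην"])
for i, c in copies:
    print(i, c)
```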

---

# License

MIT License.

---

# Authors

This work is part of ongoing research by **Eric Cullhed** (Uppsala University) and **Albin Thörn Cleland** (Lund University).

---

# Acknowledgements

The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.