---
library_name: transformers
language:
- grc
---
# SyllaBERTa: A Syllable-Based RoBERTa for Ancient Greek

**SyllaBERTa** is an experimental Transformer-based masked language model (MLM) trained on Ancient Greek texts, tokenized at the *syllable* level.  
It is specifically designed to tackle tasks involving prosody, meter, and rhyme.

---

# Model Summary

| Attribute              | Value                             |
|:------------------------|:----------------------------------|
| Base architecture       | RoBERTa (custom configuration)    |
| Vocabulary size         | 42,042 syllabic tokens            |
| Hidden size             | 768                               |
| Number of layers        | 12                                |
| Attention heads         | 12                                |
| Intermediate size       | 3,072                             |
| Max sequence length     | 514                               |
| Pretraining objective   | Masked Language Modeling (MLM)    |
| Optimizer               | AdamW                             |
| Loss function           | Cross-entropy with 15% token masking probability |
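
The configuration above corresponds to a standard `RobertaConfig`. As a minimal, illustrative sketch (the exact config and collation code shipped with the checkpoint may differ), the table translates to roughly:

```python
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

# Hyperparameters taken from the summary table above (illustrative only)
config = RobertaConfig(
    vocab_size=42_042,            # syllabic vocabulary
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3_072,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)

# MLM pretraining objective with 15% token masking
tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
```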

---

# Tokenizer

The tokenizer is a custom subclass of `PreTrainedTokenizer`, operating on syllables rather than words or characters.  
It:
- Maps each syllable to an ID.
- Supports diphthong merging and Greek orthographic phenomena.
- Uses space-separated syllable tokens.

**Example tokenization:**

Input:  
`Κατέβην χθὲς εἰς Πειραιᾶ`

Tokens:  
`['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ']`

> Note that adjacent words are fused at the syllable level: the final ς of χθὲς is resyllabified with the following εἰς, yielding the token 'σεἰσ'.
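
A small round-trip sketch with the published tokenizer (assuming the custom tokenizer class loads via `trust_remote_code=True`; the method names are the standard `PreTrainedTokenizer` API):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)

# Syllable IDs, including any special tokens the tokenizer adds
ids = tokenizer.encode("Κατέβην χθὲς εἰς Πειραιᾶ")
print(ids)

# Back to syllable strings, and to a space-separated syllable text
print(tokenizer.convert_ids_to_tokens(ids))
print(tokenizer.decode(ids, skip_special_tokens=True))
```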

---

# Usage Example

```python
import random

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)

# Tokenize a sentence into syllables
text = "Κατέβην χθὲς εἰς Πειραιᾶ μετὰ Γλαύκωνος τοῦ Ἀρίστωνος"
tokens = tokenizer.tokenize(text)
print("Original tokens:", tokens)

# Mask one syllable at random
mask_position = random.randint(0, len(tokens) - 1)
tokens[mask_position] = tokenizer.mask_token
masked_text = tokenizer.convert_tokens_to_string(tokens)
print("Masked at position", mask_position)
print("Masked text:", masked_text)

# Predict the masked syllable
inputs = tokenizer(masked_text, return_tensors="pt", padding=True, truncation=True)
inputs.pop("token_type_ids", None)
with torch.no_grad():
    outputs = model(**inputs)

# Fetch the top 5 predictions (tokens and scores) for the masked position
logits = outputs.logits
mask_token_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top = logits[0, mask_token_index].topk(5, dim=-1)
predicted = tokenizer.convert_ids_to_tokens(top.indices.squeeze(0).tolist())

print("Top 5 predictions for masked token:")
for token, score in zip(predicted, top.values.squeeze(0).tolist()):
    print(f"{token:<12} (score: {score:.2f})")
```

This should print something like the following (the masked position is chosen at random):

```
Original tokens: ['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ', 'με', 'τὰγ', 'λαύ', 'κω', 'νοσ', 'τοῦ', 'ἀ', 'ρίσ', 'τω', 'νοσ']

Masked at position 6
Masked text: κα τέ βην χθὲ σεἰσ πει [MASK] ᾶ με τὰγ λαύ κω νοσ τοῦ ἀ ρίσ τω νοσ

Top 5 predictions for masked token:
ραι          (score: 23.12)
ρα           (score: 14.69)
ραισ         (score: 12.63)
σαι          (score: 12.43)
ρη           (score: 12.26)
```
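
For quick experiments, a `fill-mask` pipeline can wrap the same checkpoint. This is only a sketch: it has not been verified against the custom syllable tokenizer, and the mask placeholder is taken from the pipeline's own tokenizer rather than hard-coded:

```python
from transformers import pipeline

# Sketch: fill-mask pipeline over the same model/tokenizer pair (unverified with the custom tokenizer)
fill = pipeline("fill-mask", model="Ericu950/SyllaBERTa", trust_remote_code=True)

# Build a masked syllable sequence using the tokenizer's own mask token
masked = f"κα τέ βην χθὲ σεἰσ πει {fill.tokenizer.mask_token} ᾶ"

for pred in fill(masked, top_k=5):
    print(pred["token_str"], round(pred["score"], 4))
```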

---

# License

MIT License.

---

# Authors

This work is part of ongoing research by **Eric Cullhed** (Uppsala University) and **Albin Thörn Cleland** (Lund University).

---

# Acknowledgements

The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.