|
|
--- |
|
|
library_name: transformers |
|
|
language: |
|
|
- grc |
|
|
base_model: |
|
|
- Ericu950/SyllaMoBert-grc-v1 |
|
|
pipeline_tag: token-classification |
|
|
--- |
|
|
|
|
|
# SyllaMoBert-grc-macronizer-v1 |
|
|
|
|
|
**SyllaMoBert-grc-macronizer-v1** is a token classification model designed for the macronization of Ancient Greek. It predicts the syllabic quantity (long or short) of open syllables containing dichrona, the vowels α, ι and υ, whose length is not evident from spelling but depends on morphological or phonological context.
|
|
|
|
|
The model was evaluated with an 80/10/10 train/dev/test split and achieved the following accuracy on the held-out test set:
|
|
- 97.9% on open syllables with short dichrona |
|
|
- 99.0% on open syllables with long dichrona |
|
|
- 99.8% on the (trivially predictable) class of heavy syllables |
|
|
|
|
|
This makes SyllaMoBert-grc-macronizer-v1 a useful tool for tasks involving prosody and metrical analysis.
|
|
|
|
|
The model was trained on data generated by [Albin Thörn Cleland's rule-based macronizer](https://github.com/Urdatorn/macronize-tlg). It is a fine-tuned version of [`Ericu950/SyllaMoBert-grc-v1`](https://huggingface.co/Ericu950/SyllaMoBert-grc-v1), a [ModernBERT](https://huggingface.co/docs/transformers/model_doc/modern_bert) model pretrained from scratch on syllabified Ancient Greek texts.
|
|
|
|
|
--- |
|
|
|
|
|
## Quick Start |
|
|
|
|
|
First, install the syllabification utility: |
|
|
|
|
|
```bash |
|
|
pip install syllagreek_utils==0.1.0 |
|
|
``` |
|
|
|
|
|
Then run the following code: |
|
|
```python
|
|
import torch |
|
|
from transformers import PreTrainedTokenizerFast, ModernBertForTokenClassification |
|
|
from syllagreek_utils import preprocess_greek_line, syllabify_joined |
|
|
from torch.nn.functional import softmax |
|
|
|
|
|
# Load model and tokenizer |
|
|
model_path = "Ericu950/SyllaMoBert-grc-macronizer-v1" |
|
|
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_path) |
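# Loading in bfloat16 halves the model's memory footprint but requires
# hardware support; drop the torch_dtype argument to load in float32 instead.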
|
|
model = ModernBertForTokenClassification.from_pretrained(model_path, torch_dtype=torch.bfloat16) |
|
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
model.to(device) |
|
|
model.eval() |
|
|
|
|
|
# Input line |
|
|
line = "φάσγανον Ἀσσυρίοιο παρήορον ἐκ τελαμῶνος" |
|
|
|
|
|
# Preprocess and syllabify |
|
|
tokens = preprocess_greek_line(line) |
|
|
syllables = syllabify_joined(tokens) |
|
|
print("Syllables:", syllables) |
|
|
|
|
|
# Tokenize |
|
|
inputs = tokenizer( |
|
|
syllables, |
|
|
is_split_into_words=True, |
|
|
return_tensors="pt", |
|
|
truncation=True, |
|
|
max_length=2048, |
|
|
padding="max_length" |
|
|
) |
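# ModernBERT takes no token_type_ids, so drop them if the tokenizer returns them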
|
|
inputs.pop("token_type_ids", None) |
|
|
inputs = {k: v.to(device) for k, v in inputs.items()} |
|
|
|
|
|
# Predict |
|
|
with torch.no_grad(): |
|
|
logits = model(**inputs).logits |
|
|
probs = softmax(logits, dim=-1) |
|
|
predictions = torch.argmax(probs, dim=-1).squeeze().cpu().numpy() |
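# predictions holds one label id per token position, including special tokens
# ([CLS], [SEP], padding), which the alignment step below must account for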
|
|
|
|
|
# Align predictions with syllables |
|
|
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze().tolist())
|
|
aligned_preds = [] |
|
|
syll_idx = 0 |
|
|
for tok_idx, tok in enumerate(tokens):
    if tok in tokenizer.all_special_tokens:
        continue  # skip [CLS], [SEP] and padding positions
    if syll_idx >= len(syllables):
        break
    # index predictions by token position, not syllable index,
    # so the leading [CLS] token does not shift the labels
    aligned_preds.append((syllables[syll_idx], predictions[tok_idx]))
    syll_idx += 1
|
|
|
|
|
# Print results |
|
|
label_map = {0: "clear", 1: "ambiguous → long", 2: "ambiguous → short"} |
|
|
print("\nMacronization Predictions:") |
|
|
for syll, label in aligned_preds: |
|
|
print(f"{syll:>10} → {label_map[label]}") |
|
|
|
|
|
```

Example output:

```
|
|
|
|
|
Syllables: ['φάσ', 'γα', 'νο', 'νἀσ', 'συ', 'ρί', 'οι', 'ο', 'πα', 'ρή', 'ο', 'ρο', 'νἐκ', 'τε', 'λα', 'μῶ', 'νοσ'] |
|
|
|
|
|
Macronization Predictions: |
|
|
φάσ → clear |
|
|
γα → ambiguous → short |
|
|
νο → clear |
|
|
νἀσ → clear |
|
|
συ → ambiguous → short |
|
|
ρί → ambiguous → short |
|
|
οι → clear |
|
|
ο → clear |
|
|
πα → ambiguous → short |
|
|
ρή → clear |
|
|
ο → clear |
|
|
ρο → clear |
|
|
νἐκ → clear |
|
|
τε → clear |
|
|
λα → ambiguous → short |
|
|
μῶ → clear |
|
|
νοσ → clear
```
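Continuing from the Quick Start above, here is a minimal sketch of one way to render the predictions as visible diacritics, attaching a combining macron or breve to the dichronic vowel of each ambiguous syllable. The `DICHRONA` set and the `mark_syllable` helper are assumptions of this example, not part of `syllagreek_utils`, and the label ids follow the `label_map` used above.

```python
# Sketch only: mark each ambiguous syllable with a combining macron (long)
# or breve (short). Placement after precomposed accented vowels is
# approximate; a real implementation would normalize the text first.
COMBINING_MACRON = "\u0304"
COMBINING_BREVE = "\u0306"

# Illustrative subset of dichronic vowels (α, ι, υ and common accented forms)
DICHRONA = set("αιυάὰᾶίὶῖύὺῦἀἁἰἱὐὑ")

def mark_syllable(syllable: str, label: int) -> str:
    """Attach a macron (label 1) or breve (label 2) to the first dichronon."""
    if label == 0:
        return syllable  # quantity already clear from the spelling
    mark = COMBINING_MACRON if label == 1 else COMBINING_BREVE
    for i, ch in enumerate(syllable):
        if ch in DICHRONA:
            return syllable[: i + 1] + mark + syllable[i + 1 :]
    return syllable

print("".join(mark_syllable(syll, int(label)) for syll, label in aligned_preds))
```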
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
## 📝 License
|
|
|
|
|
This project is released under the MIT License. |
|
|
|
|
|
---
|
|
|
|
|
## 👥 Authors
|
|
|
|
|
This work is part of ongoing research by: |
|
|
- Albin Thörn Cleland (Lund University)
- Eric Cullhed (Uppsala University)
|
|
|
|
|
---
|
|
|
|
|
## 💻 Acknowledgements
|
|
|
|
|
The computations were made possible by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council (grant agreement no. 2022-06725). |
|
|
|
|
|