---
library_name: transformers
language:
- grc
---
# SyllaMoBert-grc-v1: A Syllable-Based ModernBERT for Ancient Greek

**SyllaMoBert-grc-v1** is an experimental Transformer-based masked language model (MLM) trained on Ancient Greek texts, tokenized at the *syllable* level.

It is specifically designed to tackle tasks involving prosody, meter, and rhyme.

Input needs to be preprocessed and syllabified with `syllagreek_utils==0.1.0`:

```python
!pip install syllagreek_utils==0.1.0

from syllagreek_utils import preprocess_greek_line, syllabify_joined

tokens = preprocess_greek_line(line)
syllables = syllabify_joined(tokens)
```

This will convert a line such as `Κατέβην χθὲς εἰς Πειραιᾶ` into `κα τέ βην χθὲ σεἰσ πει ραι ᾶ`.

**Observe that words are fused at the syllabic level.**
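
For example, running the snippet above on the line just mentioned should reproduce the fused syllabification (a minimal sketch, assuming `syllagreek_utils==0.1.0` is installed):

```python
from syllagreek_utils import preprocess_greek_line, syllabify_joined

# Preprocess and syllabify the example line
tokens = preprocess_greek_line("Κατέβην χθὲς εἰς Πειραιᾶ")
syllables = syllabify_joined(tokens)

# Expected, per the example above: κα τέ βην χθὲ σεἰσ πει ραι ᾶ
# Note how the final ς of χθὲς is carried over into the syllable σεἰσ
print(" ".join(syllables))
```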

Load and test the model like this:

```python
# First install the pretokenizer that syllabifies Ancient Greek according to the principles the model adheres to
!pip install syllagreek_utils==0.1.0

# Import what's needed
import random
import torch
from transformers import AutoTokenizer, ModernBertForMaskedLM
from syllagreek_utils import preprocess_greek_line, syllabify_joined  # the custom preprocessor & syllabifier

# Set the computation device: GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pretrained model and tokenizer from Hugging Face
checkpoint = "Ericu950/SyllaMoBert-grc-v1"
model = ModernBertForMaskedLM.from_pretrained(checkpoint).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Input Greek text line
line = 'φυήν τ ἄγχιστα ἴσως τὸ ἐξ εἴδους καὶ ψυχῆς φυὴν καλεῖ'

# Apply custom preprocessing: tokenization and normalization
tokens = preprocess_greek_line(line)

# Apply syllabification to the tokens, joining them into syllables
syllables = syllabify_joined(tokens)

# Randomly select a syllable index to mask
mask_idx = random.randint(0, len(syllables) - 1)

# Replace the selected syllable with the tokenizer's mask token (e.g., [MASK])
syllables[mask_idx] = tokenizer.mask_token

print("Masked syllables:", syllables)

# Tokenize the masked syllables and prepare inputs for the model;
# is_split_into_words=True tells the tokenizer not to split again
inputs = tokenizer(syllables, is_split_into_words=True, return_tensors="pt").to(device)

# Identify the index of the mask token in the input tensor
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# Disable gradient calculation since we're just doing inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # raw prediction scores for each token

# Extract the logits corresponding to the masked position
mask_logits = logits[0, mask_token_index[0]]

# Get the top 5 predicted token IDs for the masked position
top_tokens = torch.topk(mask_logits, 5, dim=-1).indices

# Decode and print the top 5 predicted tokens for the masked syllable
print("Top predictions for [MASK]:")
for token_id in top_tokens:
    print("→", tokenizer.decode([token_id.item()]))
```

This should print something like the following (the masked position is chosen at random):

```
Masked syllables: ['φυ', '[MASK]', 'τἄγ', 'χισ', 'τα', 'ἴ', 'σωσ', 'τὸ', 'ἐκ', 'σεἴ', 'δουσ', 'καὶπ', 'συ', 'χῆσ', 'φυ', 'ὴν', 'κα', 'λεῖ']
Top predictions for [MASK]:
→ ήν
→ ῆσ
→ ῇ
→ ὴν
→ ῆ
```
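
If you want probabilities rather than raw scores for the top candidates, you can apply a softmax at the masked position (a small sketch, reusing the `mask_logits` and `tokenizer` variables already defined in the script above):

```python
import torch

# Convert the logits at the masked position into a probability distribution
probs = torch.softmax(mask_logits, dim=-1)

# Print the five most likely syllables together with their probabilities
top = torch.topk(probs, 5, dim=-1)
for token_id, p in zip(top.indices, top.values):
    print(f"→ {tokenizer.decode([token_id.item()])}  ({p.item():.3f})")
```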

---

# License

MIT License.

---

# Authors

This work is part of ongoing research by **Eric Cullhed** (Uppsala University) and **Albin Thörn Cleland** (Lund University).

---

# Acknowledgements

The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.