mercelisw/electra-grc


	---
	language:
	- grc
	tags:
	- ELECTRA
	- TensorFlow
	---



	An ELECTRA-small model for Ancient Greek, trained on texts ranging from Homer to the 4th century AD, drawn from the literary [GLAUx](https://github.com/alekkeersmaekers/glaux) corpus and the [DukeNLP](https://github.com/alekkeersmaekers/duke-nlp) papyrus corpus.

	To combat data sparsity, the model expects its input to be normalized as follows:
	* The input should always be in Unicode NFD, i.e. with diacritics encoded as separate combining characters.
	* All grave accents should be replaced with acute accents (καί, not καὶ).
	* When a word contains two accents, the second one should be removed (εἶπε μοι, not εἶπέ μοι).
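
If you are not using glaux-nlp, the three rules above can be approximated with the Python standard library. The following is only a minimal sketch, not the normalization code used to train the model: the function names `normalize_token` and `normalize_sentence` are hypothetical, and it assumes that the only accents in the input are the combining grave (U+0300), acute (U+0301) and Greek perispomeni (U+0342) that result from NFD decomposition of standard polytonic Greek.

```python
import unicodedata

# Combining marks produced by NFD decomposition of polytonic Greek.
GRAVE = "\u0300"       # combining grave accent (varia)
ACUTE = "\u0301"       # combining acute accent (oxia/tonos)
CIRCUMFLEX = "\u0342"  # combining Greek perispomeni

# Assumption: these are the only marks that count as "accents";
# breathings, diaeresis and iota subscript are left untouched.
ACCENTS = {ACUTE, CIRCUMFLEX}


def normalize_token(token: str) -> str:
    """Hypothetical helper applying the three rules to a single word."""
    # Rule 1: decompose into base letters plus combining diacritics (NFD).
    decomposed = unicodedata.normalize("NFD", token)
    # Rule 2: replace every grave accent with an acute accent.
    decomposed = decomposed.replace(GRAVE, ACUTE)
    # Rule 3: keep the first accent in the word and drop any later one.
    kept = []
    seen_accent = False
    for char in decomposed:
        if char in ACCENTS:
            if seen_accent:
                continue
            seen_accent = True
        kept.append(char)
    return "".join(kept)


def normalize_sentence(tokens: list[str]) -> list[str]:
    """Hypothetical helper normalizing an already tokenized sentence."""
    return [normalize_token(token) for token in tokens]
```

For example, `normalize_sentence(["εἶπέ", "μοι"])` returns the NFD forms of εἶπε and μοι. The sketch works word by word, so the sentence must be tokenized before it is normalized.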

	If you use the model together with [glaux-nlp](https://github.com/alekkeersmaekers/glaux-nlp), you can pass the tokenized sentence to `normalize_tokens` from `tokenization.Tokenization` with `normalization_rule=greek_glaux`, which applies all of these normalizations for you.

	## Citation

	```bibtex
	@misc{mercelis_electra-grc_2022,
	title = {electra-grc},
	url = {https://huggingface.co/mercelisw/electra-grc},
	abstract = {An ELECTRA-small model for Ancient Greek, trained on texts from Homer up until the 4th century AD.},
	author = {Mercelis, Wouter and Keersmaekers, Alek},
	year = {2022},
	}
	```