---
language:
- grc
tags:
- ELECTRA
- TensorFlow
---
An ELECTRA-small model for Ancient Greek, trained on texts from Homer up until the 4th century AD from the literary [GLAUx](https://github.com/alekkeersmaekers/glaux) corpus and the [DukeNLP](https://github.com/alekkeersmaekers/duke-nlp) papyrus corpus.
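Below is a minimal usage sketch with the Hugging Face `transformers` library; the TensorFlow classes match the model's tags, and the plain `TFElectraModel` (rather than a task-specific head) is only an assumption for illustration. The example text is assumed to already satisfy the normalization rules listed below.

```python
import unicodedata

from transformers import AutoTokenizer, TFElectraModel

tokenizer = AutoTokenizer.from_pretrained("mercelisw/electra-grc")
model = TFElectraModel.from_pretrained("mercelisw/electra-grc")

# Input text must follow the normalization rules listed below
# (NFD, no grave accents, at most one accent per word).
text = unicodedata.normalize("NFD", "καί εἶπε μοι")

inputs = tokenizer(text, return_tensors="tf")
outputs = model(inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```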
Some design choices were made to combat data sparsity:
* Its input should always be in Unicode NFD (i.e., with diacritics encoded as separate combining characters).
* All grave accents should be replaced with acute accents (καί, not καὶ).
* When a word contains two accents, the second one should be removed (εἶπε μοι, not εἶπέ μοι).
If you use the model in conjunction with [glaux-nlp](https://github.com/alekkeersmaekers/glaux-nlp), you can pass the tokenized sentence to `normalize_tokens` from `tokenization.Tokenization` with `normalization_rule=greek_glaux`, which applies all of these normalizations for you.
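For reference, the sketch below shows roughly what these rules amount to for a single word. It is not the glaux-nlp implementation, just an illustrative reconstruction using only Python's standard library; the function name and the choice of combining characters treated as accents are assumptions.

```python
import unicodedata

# Combining accent marks relevant here (in NFD): grave, acute, Greek perispomeni.
GRAVE, ACUTE, PERISPOMENI = "\u0300", "\u0301", "\u0342"
ACCENTS = {GRAVE, ACUTE, PERISPOMENI}

def normalize_greek_word(word: str) -> str:
    # 1. Decompose to NFD so diacritics become separate combining characters.
    word = unicodedata.normalize("NFD", word)
    # 2. Replace grave accents with acute accents (καὶ -> καί).
    word = word.replace(GRAVE, ACUTE)
    # 3. If the word carries more than one accent, keep only the first
    #    (breathings and other diacritics are left untouched).
    seen_accent = False
    chars = []
    for ch in word:
        if ch in ACCENTS:
            if seen_accent:
                continue  # drop any accent after the first
            seen_accent = True
        chars.append(ch)
    return "".join(chars)

print(normalize_greek_word("καὶ"))   # καί
print(normalize_greek_word("εἶπέ"))  # εἶπε
```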
## Citation
```bibtex
@misc{mercelis_electra-grc_2022,
  title = {electra-grc},
  url = {https://huggingface.co/mercelisw/electra-grc},
  abstract = {An ELECTRA-small model for Ancient Greek, trained on texts from Homer up until the 4th century AD.},
  author = {Mercelis, Wouter and Keersmaekers, Alek},
  year = {2022},
}
```