---
language:
- grc
tags:
- ELECTRA
- TensorFlow
---

An ELECTRA-small model for Ancient Greek, trained on texts from Homer up to the 4th century AD, drawn from the literary [GLAUx](https://github.com/alekkeersmaekers/glaux) corpus and the [DukeNLP](https://github.com/alekkeersmaekers/duke-nlp) papyrus corpus. The model makes some design choices to combat data sparsity:

* Its input should always be in Unicode NFD (i.e. with separate combining characters for diacritics).
* All grave accents should be replaced with acute accents (καί, not καὶ).
* When a word carries two accents, the second one should be removed (εἶπε μοι, not εἶπέ μοι).

If you use the model in conjunction with [glaux-nlp](https://github.com/alekkeersmaekers/glaux-nlp), you can pass the tokenized sentence to `normalize_tokens` from `tokenization.Tokenization` with `normalization_rule=greek_glaux`, which performs all of these normalizations for you.

## Citation

```bibtex
@misc{mercelis_electra-grc_2022,
	title = {electra-grc},
	url = {https://huggingface.co/mercelisw/electra-grc},
	abstract = {An ELECTRA-small model for Ancient Greek, trained on texts from Homer up until the 4th century AD.},
	author = {Mercelis, Wouter and Keersmaekers, Alek},
	year = {2022},
}
```
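If you prefer not to depend on glaux-nlp, the three input conventions can be approximated with the standard library alone. The sketch below is a hypothetical re-implementation, not the actual `normalize_tokens` code; the choice of which combining marks count as accents is an assumption on our part.

```python
import unicodedata

GRAVE = "\u0300"  # combining grave accent (varia)
ACUTE = "\u0301"  # combining acute accent (oxia/tonos)
# Marks treated as accents here (assumption): acute, grave, circumflex.
# Breathing marks (psili, dasia) are deliberately NOT counted as accents.
ACCENTS = {ACUTE, GRAVE, "\u0342"}  # U+0342 = combining Greek perispomeni

def normalize_greek(word: str) -> str:
    """Apply the model's three input conventions to a single word."""
    # 1. Decompose to NFD so each diacritic becomes a separate combining character.
    word = unicodedata.normalize("NFD", word)
    # 2. Replace every grave accent with an acute accent.
    word = word.replace(GRAVE, ACUTE)
    # 3. If the word carries two accents, drop the second one.
    out, accents_seen = [], 0
    for ch in word:
        if ch in ACCENTS:
            accents_seen += 1
            if accents_seen > 1:
                continue  # skip any accent after the first
        out.append(ch)
    return "".join(out)
```

For example, `normalize_greek("καὶ")` yields the NFD form of καί, and `normalize_greek("εἶπέ")` drops the enclitic-induced second accent, yielding the NFD form of εἶπε.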