grc_dep_web_trf

Ancient Greek pipeline for spaCy, part of the LatinCy project.

Experimental beta release. This model belongs to the first generation of Ancient Greek pipelines built on the LatinCy Latin pipeline infrastructure. Expect rough edges; scores and component behavior should improve as training data is harmonized and curated through the LatinCy flywheel (train, evaluate, curate, retrain).

Transformer model powered by PhilBerta (Ancient Greek RoBERTa). Trained on Universal Dependencies Ancient Greek treebanks (PTNK, PROIEL, Perseus) with a 1.2M-entry lookup lemmatizer overlay built from CLTK Morpheus, UD treebanks, and Wiktionary.

Feature	Description
Name	`grc_dep_web_trf`
Version	`3.8.1`
spaCy	`>=3.8.11,<3.9.0`
Default Pipeline	`senter`, `transformer`, `tagger`, `morphologizer`, `trainable_lemmatizer`, `lookup_lemmatizer`, `parser`
Components	`senter`, `transformer`, `tagger`, `morphologizer`, `trainable_lemmatizer`, `lookup_lemmatizer`, `parser`
Vectors	0 keys, 0 unique vectors (0 dimensions)
License	`MIT`
Author	Patrick J. Burns

Install

pip install https://huggingface.co/latincy/grc_dep_web_trf/resolve/main/grc_dep_web_trf-3.8.1-py3-none-any.whl

Usage

import spacy

nlp = spacy.load("grc_dep_web_trf")
doc = nlp("μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος")

for token in doc:
    print(token.text, token.pos_, token.lemma_, token.dep_)
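
Because the pipeline also includes `senter`, `morphologizer`, and `parser`, the same `doc` exposes sentence spans, morphological features, and syntactic heads. The following is a minimal sketch, not part of the original card: `token_rows` and `main` are hypothetical helper names, and the attributes used (`doc.sents`, `token.tag_`, `token.morph`, `token.head`) are standard spaCy `Doc`/`Token` attributes.

```python
from typing import Iterator, Tuple

Row = Tuple[str, str, str, str, str, str]


def token_rows(doc) -> Iterator[Row]:
    """Yield (text, tag, morph, lemma, dep, head text) per token, sentence by sentence."""
    for sent in doc.sents:
        for token in sent:
            yield (
                token.text,
                token.tag_,        # fine-grained tag from the tagger
                str(token.morph),  # UFeats string from the morphologizer
                token.lemma_,
                token.dep_,
                token.head.text,   # syntactic head assigned by the parser
            )


def main() -> None:
    # Assumes spaCy and the wheel from the Install section are installed.
    import spacy

    nlp = spacy.load("grc_dep_web_trf")
    doc = nlp("μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος")
    for row in token_rows(doc):
        print(*row, sep="\t")
```

Call `main()` once the model is installed. Keeping the row extraction in its own function makes it easy to send the output to a TSV file or a DataFrame instead of printing it.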

Evaluation

Scores on held-out UD test data (combined PTNK + PROIEL + Perseus).

Metric	Score
POS (UPOS) Accuracy	97.28
TAG (XPOS) Accuracy	97.40
Morph (UFeats) Accuracy	93.61
Lemma Accuracy	93.99
Unlabeled Attachment Score (UAS)	85.12
Labeled Attachment Score (LAS)	80.27
Sentences F-Score	88.18
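
For readers unfamiliar with the two attachment metrics: UAS counts tokens whose predicted head is correct, while LAS additionally requires the correct dependency label. The sketch below illustrates the definitions on toy values; it is not the evaluation script behind the scores above, and real scorers differ in details such as how punctuation is handled. The function name `attachment_scores` and the example arcs are illustrative assumptions.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over aligned gold and predicted arcs."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")
    # UAS: head matches; LAS: head and label both match.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100 * uas_hits / n, 100 * las_hits / n


# Toy five-token sentence (hypothetical values, not model output).
gold = [(1, "obj"), (1, "ROOT"), (1, "vocative"), (4, "nmod"), (0, "nmod")]
pred = [(1, "obj"), (1, "ROOT"), (1, "vocative"), (4, "amod"), (1, "nmod")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # UAS=80.00 LAS=60.00
```

The fourth token has the right head but the wrong label, so it counts toward UAS only; the fifth has the wrong head and counts toward neither.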

Training Data

Source	Description
UD_Ancient_Greek-PTNK	Septuagint (Codex Alexandrinus)
UD_Ancient_Greek-PROIEL	PROIEL Ancient Greek treebank
UD_Ancient_Greek-Perseus	Perseus Ancient Greek treebank

Components

transformer -- PhilBerta transformer backbone (Ancient Greek RoBERTa)
tagger -- Fine-grained POS tagger (XPOS, harmonized 16-tag tagset)
morphologizer -- Morphological feature assignment (UPOS + UFeats)
trainable_lemmatizer -- Edit-tree lemmatizer
lookup_lemmatizer -- 1.2M-entry dictionary lemmatizer overlay (CLTK Morpheus + UD + Wiktionary); normalizes grave accents to acute at query time
parser -- Dependency parser (transition-based)
senter -- Sentence segmenter
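
The grave-to-acute normalization mentioned for `lookup_lemmatizer` can be sketched with the standard library alone. This is an illustration under stated assumptions, not the component's actual code: the helper names, the one-entry toy table, and the rule that a table hit replaces the edit-tree prediction are all hypothetical.

```python
import unicodedata
from typing import Dict

COMBINING_GRAVE = "\u0300"
COMBINING_ACUTE = "\u0301"


def grave_to_acute(form: str) -> str:
    """Replace grave accents with acute ones, returning NFC-normalized text."""
    decomposed = unicodedata.normalize("NFD", form)
    swapped = decomposed.replace(COMBINING_GRAVE, COMBINING_ACUTE)
    return unicodedata.normalize("NFC", swapped)


def build_table(entries: Dict[str, str]) -> Dict[str, str]:
    """NFC-normalize keys so they compare equal to normalized queries."""
    return {unicodedata.normalize("NFC", k): v for k, v in entries.items()}


def overlay_lemma(form: str, predicted: str, table: Dict[str, str]) -> str:
    """Prefer the table entry for the normalized form; otherwise keep `predicted`."""
    return table.get(grave_to_acute(form), predicted)


# Toy one-entry table (hypothetical; the real table has about 1.2M entries).
TOY_TABLE = build_table({"θεά": "θεά"})

# "θεὰ" carries a grave accent in running text; the table stores the acute form.
print(overlay_lemma("θεὰ", "<edit-tree guess>", TOY_TABLE))
```

Normalizing at query time means the table only needs the citation-style acute forms, even though running text usually shows a grave on a final accented syllable before another word.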

Label Scheme

1796 labels for 3 components

tagger: adjective, adverb, conjunction, conjunction_adverb, conjunction_pronoun, determiner, interjection, noun, number, particle, preposition, pronoun, proper_noun, punc, unknown, verb

morphologizer: 1749 morphological feature combinations

parser: ROOT, acl, advcl, advmod, amod, appos, aux, case, cc, ccomp, conj, cop, csubj, dep, det, discourse, dislocated, fixed, flat, iobj, mark, nmod, nsubj, nummod, obj, obl, orphan, parataxis, punct, vocative, xcomp
