grc_dep_web_lg
Ancient Greek pipeline for spaCy, part of the LatinCy project.
Experimental beta release. This model belongs to the first generation of Ancient Greek pipelines built by porting the LatinCy Latin pipeline infrastructure. Expect rough edges; scores and component behavior will improve as training data is harmonized and curated through the LatinCy flywheel (train, evaluate, curate, retrain).
Large model with 200,000-key floret vectors (300 dimensions). Trained on Universal Dependencies Ancient Greek treebanks (PTNK, PROIEL, Perseus) with a 1.2M-entry lookup lemmatizer overlay built from CLTK Morpheus, UD treebanks, and Wiktionary.
| Feature | Description |
|---|---|
| Name | grc_dep_web_lg |
| Version | 3.8.1 |
| spaCy | >=3.8.11,<3.9.0 |
| Default Pipeline | senter, tok2vec, tagger, morphologizer, trainable_lemmatizer, lookup_lemmatizer, parser |
| Components | senter, tok2vec, tagger, morphologizer, trainable_lemmatizer, lookup_lemmatizer, parser |
| Vectors | floret, 200,000 unique vectors (300 dimensions) |
| License | MIT |
| Author | Patrick J. Burns |
Install
pip install https://huggingface.co/latincy/grc_dep_web_lg/resolve/main/grc_dep_web_lg-3.8.1-py3-none-any.whl
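The model metadata pins spaCy to `>=3.8.11,<3.9.0`. A minimal sketch of checking a version string against that range before loading the model follows. The helper names `parse_release` and `is_compatible` are illustrative and not part of spaCy or LatinCy, and pre-release suffixes are ignored for simplicity.

```python
import re
from importlib import metadata

# Bounds copied from the "spaCy" row of the model metadata.
LOWER = (3, 8, 11)  # inclusive
UPPER = (3, 9, 0)   # exclusive


def parse_release(version: str) -> tuple:
    """Return the leading numeric part of a version, padded to three fields."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = tuple(int(p) for p in match.group().split("."))
    # Pad so that "3.9" compares as (3, 9, 0) rather than (3, 9).
    return (parts + (0, 0, 0))[:3]


def is_compatible(version: str) -> bool:
    """True if the version falls inside the declared spaCy range."""
    return LOWER <= parse_release(version) < UPPER


def installed_spacy_is_compatible() -> bool:
    """Check the installed spaCy; raises PackageNotFoundError if it is absent."""
    return is_compatible(metadata.version("spacy"))
```

Calling `installed_spacy_is_compatible()` in the target environment gives a quick yes/no answer before `spacy.load` is attempted.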
Usage
import spacy

nlp = spacy.load("grc_dep_web_lg")

# Opening line of the Iliad, written with Unicode escapes:
# μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος
doc = nlp("\u03bc\u1fc6\u03bd\u03b9\u03bd \u1f04\u03b5\u03b9\u03b4\u03b5 \u03b8\u03b5\u1f70 \u03a0\u03b7\u03bb\u03b7\u03ca\u03ac\u03b4\u03b5\u03c9 \u1f08\u03c7\u03b9\u03bb\u1fc6\u03bf\u03c2")
for token in doc:
    print(token.text, token.pos_, token.lemma_, token.dep_)
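Beyond the four attributes printed above, each token also carries a fine-grained tag, morphological features, and a syntactic head. The sketch below collects these into plain dictionaries. `token_rows` is an illustrative helper, not part of spaCy or LatinCy; it relies only on the standard `Token` attributes `text`, `pos_`, `tag_`, `lemma_`, `morph`, `dep_`, and `head`.

```python
def token_rows(doc):
    """Collect per-token annotations from a spaCy Doc into dictionaries."""
    rows = []
    for token in doc:
        rows.append({
            "text": token.text,
            "upos": token.pos_,          # coarse tag (UPOS)
            "xpos": token.tag_,          # fine-grained tag (XPOS)
            "lemma": token.lemma_,
            "feats": str(token.morph),   # UD feature string
            "dep": token.dep_,
            "head": token.head.text,     # text of the syntactic head
        })
    return rows
```

With the model installed, `token_rows(nlp(text))` returns one dictionary per token, and `doc.sents` iterates over the sentence spans.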
Evaluation
Scores on held-out UD test data (combined PTNK + PROIEL + Perseus), reported on a 0-100 scale.
| Metric | Score |
|---|---|
| POS (UPOS) Accuracy | 92.03 |
| TAG (XPOS) Accuracy | 91.96 |
| Morph (UFeats) Accuracy | 82.68 |
| Lemma Accuracy | 93.54 |
| Unlabeled Attachment Score (UAS) | 76.24 |
| Labeled Attachment Score (LAS) | 68.11 |
| Sentences F-Score | 88.18 |
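For readers unfamiliar with the attachment scores: UAS is the share of tokens whose predicted head is correct, and LAS additionally requires the dependency label to match. The sketch below computes both on toy data using those standard definitions. It is not the evaluation code behind the reported scores, and details such as punctuation handling may differ.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) on a 0-100 scale.

    gold and pred are equal-length lists of (head_index, dep_label) pairs,
    one pair per token.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same tokens")
    if not gold:
        raise ValueError("no tokens to score")
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))  # head only
    full_hits = sum(g == p for g, p in zip(gold, pred))        # head and label
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * full_hits / total


# Toy data: four tokens, one with a wrong label and one with a wrong head.
gold = [(1, "obj"), (1, "ROOT"), (1, "vocative"), (4, "nmod")]
pred = [(1, "obj"), (1, "ROOT"), (1, "nsubj"), (1, "nmod")]
uas, las = attachment_scores(gold, pred)
```

Because LAS counts a token only when both head and label are right, it can never exceed UAS, which matches the gap between the two reported figures.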
Training Data
| Source | Description |
|---|---|
| UD_Ancient_Greek-PTNK | Septuagint (Codex Alexandrinus) |
| UD_Ancient_Greek-PROIEL | PROIEL Ancient Greek treebank |
| UD_Ancient_Greek-Perseus | Perseus Ancient Greek treebank |
Components
- tok2vec -- Shared token-to-vector encoder (CNN, width 96)
- tagger -- Fine-grained POS tagger (XPOS, harmonized 16-tag tagset)
- morphologizer -- Morphological feature assignment (UPOS + UFeats)
- trainable_lemmatizer -- Edit-tree lemmatizer
- lookup_lemmatizer -- 1.2M-entry dictionary lemmatizer overlay (CLTK Morpheus + UD + Wiktionary); normalizes grave accents to acute at query time
- parser -- Dependency parser (transition-based)
- senter -- Sentence segmenter
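The grave-to-acute normalization mentioned for `lookup_lemmatizer` can be illustrated with Unicode decomposition. The sketch below is an assumption about how such a step might work, not the component's actual code: `grave_to_acute` and `overlay_lemma` are hypothetical names, the one-entry table is toy data, and falling back to the trainable lemmatizer's prediction on a lookup miss is a guess at what "overlay" means here.

```python
import unicodedata

GRAVE = "\u0300"  # combining grave accent
ACUTE = "\u0301"  # combining acute accent


def grave_to_acute(form: str) -> str:
    """Replace every grave accent with an acute, keeping other diacritics."""
    decomposed = unicodedata.normalize("NFD", form)
    return unicodedata.normalize("NFC", decomposed.replace(GRAVE, ACUTE))


def overlay_lemma(form: str, predicted: str, table: dict) -> str:
    """Prefer a dictionary lemma for the normalized form, else keep the prediction."""
    return table.get(grave_to_acute(form), predicted)


# Toy lookup table keyed by acute-accented forms (illustrative entry only).
table = {"\u03b8\u03b5\u03ac": "\u03b8\u03b5\u03ac"}  # θεά -> θεά
# The running-text form θεὰ carries a grave, so it is normalized before lookup.
lemma = overlay_lemma("\u03b8\u03b5\u1f70", "\u03b8\u03b5\u1f70", table)
```

Normalizing at query time means the dictionary only needs acute-accented keys, since a grave on a final syllable is a positional variant of the acute.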
Label Scheme
1796 labels for 3 components.
| Component | Labels |
|---|---|
| tagger | adjective, adverb, conjunction, conjunction_adverb, conjunction_pronoun, determiner, interjection, noun, number, particle, preposition, pronoun, proper_noun, punc, unknown, verb |
| morphologizer | 1749 morphological feature combinations |
| parser | ROOT, acl, advcl, advmod, amod, appos, aux, case, cc, ccomp, conj, cop, csubj, dep, det, discourse, dislocated, fixed, flat, iobj, mark, nmod, nsubj, nummod, obj, obl, orphan, parataxis, punct, vocative, xcomp |
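Morphologizer output follows the Universal Dependencies `Key=Value|Key=Value` feature notation, so a feature string can be split into a dictionary with a few lines of Python. The sketch below assumes that notation; `parse_feats` is an illustrative helper, and the sample string is made up for the example rather than taken from the model's 1749 labels.

```python
def parse_feats(feats: str) -> dict:
    """Split a UD feature string such as 'Case=Acc|Number=Sing' into a dict."""
    if feats in ("", "_"):  # UD writes '_' when a token has no features
        return {}
    result = {}
    for pair in feats.split("|"):
        key, _, value = pair.partition("=")
        result[key] = value
    return result


sample = "Case=Acc|Gender=Fem|Number=Sing"  # illustrative, not a model label
parsed = parse_feats(sample)
```

With a loaded pipeline, spaCy's own `token.morph.to_dict()` gives the same view directly.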