grc_dep_web_lg

Ancient Greek pipeline for spaCy, part of the LatinCy project.

Experimental beta release. This model belongs to the first generation of pipelines that port the LatinCy Latin infrastructure to Ancient Greek. Expect rough edges; scores and component behavior are expected to improve as training data is harmonized and curated through the LatinCy flywheel (train, evaluate, curate, retrain).

Large model with 200,000-key floret vectors (300 dimensions). Trained on Universal Dependencies Ancient Greek treebanks (PTNK, PROIEL, Perseus) with a 1.2M-entry lookup lemmatizer overlay built from CLTK Morpheus, UD treebanks, and Wiktionary.
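For readers new to floret: it stores words and their character n-grams in one fixed-size hashed table, so unseen forms can still receive a vector. The sketch below illustrates only that general idea. The n-gram range (`minn`, `maxn`), the number of hashes (`hash_count`), and the use of `hashlib.blake2b` are assumptions chosen for illustration; they are not this model's settings, and floret itself uses a different hash function.

```python
import hashlib


def subwords(word: str, minn: int, maxn: int) -> list:
    """The whole word plus its character n-grams, with < and > marking the word edges."""
    marked = f"<{word}>"
    grams = [marked]
    for n in range(minn, maxn + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams


def table_rows(word: str, buckets: int = 200_000, minn: int = 4,
               maxn: int = 5, hash_count: int = 2) -> list:
    """Map every subword to `hash_count` row indices in a table of `buckets` rows.

    Assumption: blake2b with a per-hash salt stands in for floret's real hash.
    """
    rows = []
    for gram in subwords(word, minn, maxn):
        for seed in range(hash_count):
            digest = hashlib.blake2b(
                gram.encode("utf-8"), digest_size=8, salt=bytes([seed])
            ).digest()
            rows.append(int.from_bytes(digest, "big") % buckets)
    return rows
```

Under this scheme a word's vector would be built from the rows its subwords select, which is why the table needs only 200,000 keys rather than one entry per word form.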

Feature           Description
Name              grc_dep_web_lg
Version           3.8.1
spaCy             >=3.8.11,<3.9.0
Default Pipeline  senter, tok2vec, tagger, morphologizer, trainable_lemmatizer, lookup_lemmatizer, parser
Components        senter, tok2vec, tagger, morphologizer, trainable_lemmatizer, lookup_lemmatizer, parser
Vectors           floret, 200,000 unique vectors (300 dimensions)
License           MIT
Author            Patrick J. Burns
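The spaCy range in the table is narrow, so it can help to check the installed version before loading the model. The following is a minimal sketch using only the standard library. The helper names `parse_release` and `spacy_is_compatible` are hypothetical, and the parser is a simplification that ignores pre-release suffixes; the third-party `packaging` library handles version specifiers more thoroughly.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(v: str) -> tuple:
    """Turn '3.8.11' into (3, 8, 11); non-digit suffixes and extra fields are ignored."""
    parts = []
    for field in v.split(".")[:3]:
        digits = ""
        for ch in field:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def spacy_is_compatible(installed: str) -> bool:
    """True when `installed` falls inside >=3.8.11,<3.9.0 (the range listed above)."""
    return (3, 8, 11) <= parse_release(installed) < (3, 9, 0)


if __name__ == "__main__":
    try:
        found = version("spacy")
    except PackageNotFoundError:
        print("spaCy is not installed")
    else:
        status = "compatible" if spacy_is_compatible(found) else "outside the supported range"
        print(found, status)
```

Comparing integer tuples rather than strings matters here: as plain text, "3.8.7" would sort after "3.8.11".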

Install

pip install https://huggingface.co/latincy/grc_dep_web_lg/resolve/main/grc_dep_web_lg-3.8.1-py3-none-any.whl

Usage

import spacy

# Load the installed pipeline by its package name.
nlp = spacy.load("grc_dep_web_lg")
# Iliad 1.1
doc = nlp("μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος")

# Print the form, coarse POS, lemma, and dependency label of each token.
for token in doc:
    print(token.text, token.pos_, token.lemma_, token.dep_)

Evaluation

Scores on held-out UD test data (combined PTNK + PROIEL + Perseus).

Metric                              Score
POS (UPOS) Accuracy                 92.03
TAG (XPOS) Accuracy                 91.96
Morph (UFeats) Accuracy             82.68
Lemma Accuracy                      93.54
Unlabeled Attachment Score (UAS)    76.24
Labeled Attachment Score (LAS)      68.11
Sentences F-Score                   88.18
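As a reminder of what the two attachment scores measure: UAS counts tokens whose predicted head is correct, and LAS additionally requires the correct dependency label. The sketch below shows the general definition only. The function name `attachment_scores` and the toy annotations are hypothetical, and evaluation tools differ in details such as punctuation handling, so this is not necessarily how the figures above were produced.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) as percentages.

    gold, pred: equal-length lists of (head_index, dep_label) pairs, one per token.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same tokens")
    if not gold:
        return 0.0, 0.0
    # UAS: only the head has to match.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: head and label both have to match.
    full_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100 * head_hits / n, 100 * full_hits / n


# Toy five-token example (made-up annotations, not model output).
gold = [(2, "obj"), (0, "ROOT"), (2, "vocative"), (5, "nmod"), (1, "nmod")]
pred = [(2, "obj"), (0, "ROOT"), (2, "nsubj"), (1, "nmod"), (1, "nmod")]
uas, las = attachment_scores(gold, pred)
print(uas, las)  # 80.0 60.0
```

In the toy example the third token has the right head but the wrong label, so it counts toward UAS and not LAS; the fourth token has the wrong head and counts toward neither.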

Training Data

Source                      Description
UD_Ancient_Greek-PTNK       Septuagint (Codex Alexandrinus)
UD_Ancient_Greek-PROIEL     PROIEL Ancient Greek treebank
UD_Ancient_Greek-Perseus    Perseus Ancient Greek treebank

Components

  • tok2vec -- Shared token-to-vector encoder (CNN, width 96)
  • tagger -- Fine-grained POS tagger (XPOS, harmonized 16-tag tagset)
  • morphologizer -- Morphological feature assignment (UPOS + UFeats)
  • trainable_lemmatizer -- Edit-tree lemmatizer
  • lookup_lemmatizer -- 1.2M-entry dictionary lemmatizer overlay (CLTK Morpheus + UD + Wiktionary); normalizes grave accents to acute at query time
  • parser -- Dependency parser (transition-based)
  • senter -- Sentence segmenter
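The grave-to-acute normalization mentioned for lookup_lemmatizer can be pictured with a few lines of standard-library Python. This is a sketch under stated assumptions, not the component's actual code: the helper names `grave_to_acute` and `overlay_lemma` are hypothetical, and the rule that a dictionary hit replaces the earlier prediction is an assumption about how the overlay behaves.

```python
import unicodedata

GRAVE = "\u0300"  # combining grave accent
ACUTE = "\u0301"  # combining acute accent


def grave_to_acute(form: str) -> str:
    """Replace every grave accent with an acute and return NFC-normalized text."""
    decomposed = unicodedata.normalize("NFD", form)
    return unicodedata.normalize("NFC", decomposed.replace(GRAVE, ACUTE))


def overlay_lemma(form: str, predicted: str, table: dict) -> str:
    """Prefer a dictionary entry for `form`; otherwise keep the earlier prediction.

    Assumption: table keys are stored with acute accents in NFC form.
    """
    return table.get(grave_to_acute(form), predicted)
```

A word-final acute becomes a grave in running text, so a dictionary keyed on citation forms would miss θεὰ unless the query is first normalized to θεά. With a one-entry table keyed on that acute form, `overlay_lemma` finds the entry for the grave-accented input and falls back to the supplied prediction for anything not in the table.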

Label Scheme

1796 labels for 3 components

tagger: adjective, adverb, conjunction, conjunction_adverb, conjunction_pronoun, determiner, interjection, noun, number, particle, preposition, pronoun, proper_noun, punc, unknown, verb

morphologizer: 1749 morphological feature combinations

parser: ROOT, acl, advcl, advmod, amod, appos, aux, case, cc, ccomp, conj, cop, csubj, dep, det, discourse, dislocated, fixed, flat, iobj, mark, nmod, nsubj, nummod, obj, obl, orphan, parataxis, punct, vocative, xcomp
