---
language:
- la
license: mit
tags:
- word-vectors
- latin
- nlp
- word2vec
- fasttext
- glove
- static-vectors
- digital-humanities
- classics
- latincy
model-index:
- name: la_vectors
  results:
  - task:
      type: feature-extraction
      name: Word Analogy
    dataset:
      type: custom
      name: LatinCy Analogies (1,330 solvable / 1,383 total)
    metrics:
    - type: accuracy
      value: 84.5
      name: FastText CBOW-300-10 Rank 1
    - type: accuracy
      value: 81.4
      name: Floret v3.9 (lg) Rank 1
    - type: accuracy
      value: 70.2
      name: Word2Vec CBOW-300-10 Rank 1
    - type: accuracy
      value: 49.5
      name: GloVe 300 Rank 1
  - task:
      type: feature-extraction
      name: Odd-One-Out
    dataset:
      type: custom
      name: LatinCy Odd-One-Out (2,223 solvable / 2,728 total)
    metrics:
    - type: accuracy
      value: 79.1
      name: Word2Vec CBOW-300-10
    - type: accuracy
      value: 75.1
      name: GloVe 300
    - type: accuracy
      value: 74.0
      name: Floret v3.9 (lg)
    - type: accuracy
      value: 73.6
      name: FastText CBOW-300-10
---

# LatinCy Vectors

Static word vectors for Latin, trained on the [LatinCy](https://github.com/diyclassics/latincy) corpus. Provides Floret, FastText, Word2Vec, and GloVe embeddings trained on the same data and evaluated on the same benchmarks for direct cross-method comparison.

## Available Models

All models are trained with CBOW architecture, 300 dimensions, window size 10, min count 50, 15 epochs, negative sampling 25 on the full LatinCy corpus (13.7M sentences, ~266M tokens from 9 sources).

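As a rough illustration, the shared hyperparameters listed above can be written down as keyword arguments for a CBOW trainer. The sketch below assumes gensim 4.x; this card does not state which toolkit produced the released vectors, and `CBOW_PARAMS`, `train_cbow`, and the corpus path argument are hypothetical names. GloVe and Floret are trained with their own tools and are not covered here.

```python
# Hyperparameters stated in this card, expressed with gensim 4.x keyword names.
# gensim is an assumption; the card does not say which trainer was used.
CBOW_PARAMS = {
    "sg": 0,             # 0 selects CBOW (1 would select skip-gram)
    "vector_size": 300,  # embedding dimensions
    "window": 10,        # context window size
    "min_count": 50,     # drop words seen fewer than 50 times
    "epochs": 15,        # passes over the corpus
    "negative": 25,      # negative samples per positive example
}


def train_cbow(corpus_path, subword=False):
    """Train Word2Vec (or FastText when subword=True) on a text file
    with one whitespace-tokenized sentence per line."""
    # Imported lazily so the parameter table above is usable without gensim.
    from gensim.models import FastText, Word2Vec

    model_cls = FastText if subword else Word2Vec
    return model_cls(corpus_file=corpus_path, **CBOW_PARAMS)
```

Keeping the parameters in one dictionary mirrors the point of the comparison: every method sees the same settings and the same corpus, so differences in the scores reflect the method rather than the configuration.
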
| Model | Type | Vocab | HF Repo |
|-------|------|-------|---------|
| Floret (lg) | Hash-based subword | 200k buckets | [`latincy/la_vectors_floret_lg`](https://huggingface.co/latincy/la_vectors_floret_lg) |
| Floret (md) | Hash-based subword | 50k buckets | [`latincy/la_vectors_floret_md`](https://huggingface.co/latincy/la_vectors_floret_md) |
| FastText CBOW-300-10 | Subword (n-gram) | 233k words | [`latincy/la_vectors`](https://huggingface.co/latincy/la_vectors) |
| Word2Vec CBOW-300-10 | Word-level | 233k words | [`latincy/la_vectors`](https://huggingface.co/latincy/la_vectors) |
| GloVe 300 | Word-level (co-occurrence) | 233k words | [`latincy/la_vectors`](https://huggingface.co/latincy/la_vectors) |

Floret vectors are distributed separately as spaCy pipeline components. FastText, Word2Vec, and GloVe are in a single umbrella repo ([`latincy/la_vectors`](https://huggingface.co/latincy/la_vectors)).

## Evaluation

Evaluated on curated Latin benchmarks: 1,383 analogy items across 11 categories and 2,728 odd-one-out items. Items unsolvable by all models are excluded per the evaluation methodology (1,330 analogies and 2,223 odd-one-out items are solvable).

| Model | Analogy Rank 1 | Analogy Rank 5 | Odd-One-Out |
|-------|----------------|----------------|-------------|
| **FastText CBOW-300-10** | **84.5%** | **96.6%** | 73.6% |
| Floret v3.9 (lg) | 81.4% | 95.3% | 74.0% |
| Word2Vec CBOW-300-10 | 70.2% | 91.3% | **79.1%** |
| GloVe 300 | 49.5% | 79.2% | 75.1% |

FastText leads on analogy resolution due to subword information that captures Latin morphology. Word2Vec leads on odd-one-out (semantic clustering). GloVe is weaker on analogies because it lacks subword representations. Floret is used in LatinCy spaCy pipelines because it is 6x smaller than FastText while remaining competitive, and supports arbitrary vocabulary via hash-based lookups.

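To make the three score columns concrete, here is a minimal sketch of how analogy rank and odd-one-out are commonly computed from static vectors. It uses the standard vector-offset formulation with cosine similarity and a mean-similarity rule for odd-one-out; the benchmark's exact scoring is defined in the evaluation report and may differ. The toy two-dimensional vectors and all function names are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy_rank(vectors, a, b, c, expected):
    """Rank of `expected` for a : b :: c : ?, using the offset b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidate list.
    candidates = [w for w in vectors if w not in (a, b, c)]
    ranked = sorted(candidates, key=lambda w: cosine(vectors[w], target), reverse=True)
    return ranked.index(expected) + 1


def accuracy_at_k(ranks, k):
    """Share of items whose expected answer appears within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def odd_one_out(vectors, words):
    """Word with the lowest mean cosine similarity to the other words."""
    def mean_similarity(word):
        others = [o for o in words if o != word]
        return sum(cosine(vectors[word], vectors[o]) for o in others) / len(others)
    return min(words, key=mean_similarity)


# Toy vectors, invented for this example (not taken from the released models).
toy = {
    "rex": [1.0, 0.0],
    "regina": [1.0, 1.0],
    "vir": [0.2, 0.0],
    "mulier": [0.2, 1.0],
    "equus": [-1.0, 0.1],
}

rank = analogy_rank(toy, "vir", "mulier", "rex", "regina")
outlier = odd_one_out(toy, ["rex", "regina", "equus"])
```

Under this formulation, "Analogy Rank 1" corresponds to `accuracy_at_k(ranks, 1)` and "Analogy Rank 5" to `accuracy_at_k(ranks, 5)` over the solvable items.
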
For full evaluation details, including per-category breakdowns and nearest-neighbor spot checks, see the [evaluation report](eval/reports/burns-2025-latincy-w2v-evaluation-datasets-report.pdf).

## Usage

### From HuggingFace Hub

```python
from huggingface_hub import hf_hub_download

# Each call downloads one file (or reuses the local cache) and returns its path.

# FastText binary model
fasttext_path = hf_hub_download("latincy/la_vectors", "fasttext/la_fasttext_cbow_300_10.bin")

# Word2Vec text vectors
w2v_path = hf_hub_download("latincy/la_vectors", "word2vec/la_w2v_cbow_300_10.txt")

# GloVe vectors
glove_path = hf_hub_download("latincy/la_vectors", "glove/la_glove_300.txt")
```

### Floret (spaCy)

```python
import spacy

# Requires the la_vectors_floret_lg package to be installed.
nlp = spacy.load("la_vectors_floret_lg")
doc = nlp("rex populum regit")

# Print each token with its vector status and the first five dimensions.
for token in doc:
    print(token.text, token.has_vector, token.vector[:5])
```

## Training Corpus

All vectors are trained on the same corpus for valid cross-method comparison.

| Source | Sentences | Tokens |
|--------|-----------|--------|
| CC100-Latin | 6,507,840 | 128,886,505 |
| Latin Wikisource | 3,933,289 | 76,736,695 |
| Latin Wikipedia | 972,336 | 15,218,700 |
| CAMENA Neo-Latin | 736,400 | 9,970,933 |
| The Latin Library | 650,082 | 12,822,687 |
| CLTK-Tesserae | 516,930 | 6,626,484 |
| Perseus Digital Library | 223,535 | 4,317,063 |
| Patrologia Latina | 125,333 | 10,399,108 |
| UD Latin treebanks (6) | 55,332 | 980,787 |
| **Total** | **13,721,077** | **265,958,962** |

## Citation

```bibtex
@misc{burns2023latincy,
    title = "{LatinCy}: Synthetic Trained Pipelines for {L}atin {NLP}",
    author = "Burns, Patrick J.",
    year = "2023",
    eprint = "2305.04365",
    archivePrefix = "arXiv",
    primaryClass = "cs.CL",
    url = "https://arxiv.org/abs/2305.04365"
}
```

## References

- Sprugnoli, R., Passarotti, M., and Moretti, G. 2019. "Vir Is to Moderatus as Mulier Is to Intemperans Lemma Embeddings for Latin." In *Proceedings of the Sixth Italian Conference on Computational Linguistics*. Bari, Italy. 1–7. http://ceur-ws.org/Vol-2481/paper69.pdf.

## Acknowledgments

This work was supported in part through the [NYU IT High Performance Computing](https://sites.google.com/nyu.edu/nyu-hpc/about/acknowledgement-statement) resources, services, and staff expertise.