---
language:
- la
license: mit
tags:
- word-vectors
- latin
- nlp
- word2vec
- fasttext
- glove
- static-vectors
- digital-humanities
- classics
- latincy
model-index:
- name: la_vectors
  results:
  - task:
      type: feature-extraction
      name: Word Analogy
    dataset:
      type: custom
      name: LatinCy Analogies (1,330 solvable / 1,383 total)
    metrics:
    - type: accuracy
      value: 84.5
      name: FastText CBOW-300-10 Rank 1
    - type: accuracy
      value: 81.4
      name: Floret v3.9 (lg) Rank 1
    - type: accuracy
      value: 70.2
      name: Word2Vec CBOW-300-10 Rank 1
    - type: accuracy
      value: 49.5
      name: GloVe 300 Rank 1
  - task:
      type: feature-extraction
      name: Odd-One-Out
    dataset:
      type: custom
      name: LatinCy Odd-One-Out (2,223 solvable / 2,728 total)
    metrics:
    - type: accuracy
      value: 79.1
      name: Word2Vec CBOW-300-10
    - type: accuracy
      value: 75.1
      name: GloVe 300
    - type: accuracy
      value: 74.0
      name: Floret v3.9 (lg)
    - type: accuracy
      value: 73.6
      name: FastText CBOW-300-10
---

# LatinCy Vectors

Static word vectors for Latin, trained on the [LatinCy](https://github.com/diyclassics/latincy) corpus. The project provides Floret, FastText, Word2Vec, and GloVe embeddings trained on the same data and evaluated on the same benchmarks, so the methods can be compared directly.

## Available Models

All models are trained with the CBOW architecture, 300 dimensions, a window size of 10, a minimum count of 50, 15 epochs, and 25 negative samples on the full LatinCy corpus (13.7M sentences, ~266M tokens from 9 sources).

| Model | Type | Vocab | HF Repo |
|-------|------|-------|---------|
| Floret (lg) | Hash-based subword | 200k buckets | [`latincy/la_vectors_floret_lg`](https://huggingface.co/latincy/la_vectors_floret_lg) |
| Floret (md) | Hash-based subword | 50k buckets | [`latincy/la_vectors_floret_md`](https://huggingface.co/latincy/la_vectors_floret_md) |
| FastText CBOW-300-10 | Subword (n-gram) | 233k words | [`latincy/la_vectors`](https://huggingface.co/latincy/la_vectors) |
| Word2Vec CBOW-300-10 | Word-level | 233k words | [`latincy/la_vectors`](https://huggingface.co/latincy/la_vectors) |
| GloVe 300 | Word-level (co-occurrence) | 233k words | [`latincy/la_vectors`](https://huggingface.co/latincy/la_vectors) |

Floret vectors are distributed separately as spaCy pipeline components. FastText, Word2Vec, and GloVe are in a single umbrella repo ([`latincy/la_vectors`](https://huggingface.co/latincy/la_vectors)).
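
This card does not include the training script, so the following is only a sketch of how the stated settings could map onto the keyword arguments of `train_unsupervised` in the `fasttext` Python package. The names `CARD_HYPERPARAMS`, `train_fasttext`, `corpus_path`, and `output_path` are hypothetical, and any setting the card does not mention (learning rate, n-gram lengths, threads) is left at the library default.

```python
# Settings stated in this card, using the keyword names accepted by
# fasttext.train_unsupervised. Everything else stays at library defaults.
CARD_HYPERPARAMS = {
    "model": "cbow",   # CBOW architecture
    "dim": 300,        # vector dimensions
    "ws": 10,          # context window size
    "minCount": 50,    # minimum word frequency
    "epoch": 15,       # training epochs
    "neg": 25,         # negative samples
}


def train_fasttext(corpus_path, output_path):
    """Train a fastText model on a plain-text corpus file.

    Both paths are hypothetical: the card does not say how the corpus
    was serialized for training or where the model was written.
    """
    import fasttext  # imported lazily so the settings can be inspected without it

    model = fasttext.train_unsupervised(corpus_path, **CARD_HYPERPARAMS)
    model.save_model(output_path)
    return model
```

Keeping the settings in one dictionary makes it easier to reuse the same values when training a comparison model with a different toolkit.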

## Evaluation

The models are evaluated on curated Latin benchmarks: 1,383 analogy items across 11 categories and 2,728 odd-one-out items. Items unsolvable by all models are excluded per the evaluation methodology, leaving 1,330 solvable analogy items and 2,223 solvable odd-one-out items.
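
As a rough illustration of what a Rank 1 or Rank 5 analogy score measures, here is a minimal sketch on toy vectors. It assumes the common vector-offset method (rank candidates by cosine similarity to `b - a + c`) and treats an item as solvable when all four words are in the vocabulary. Both are assumptions; the scoring procedure actually used is described in the evaluation report, not in this card. The function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy_rank(vectors, a, b, c, expected):
    """Rank of `expected` for a : b :: c : ?, excluding the three query words.

    Assumes `expected` is in the vocabulary and differs from a, b, and c.
    """
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    ranked = sorted(candidates, key=lambda w: cosine(vectors[w], target), reverse=True)
    return ranked.index(expected) + 1


def rank_k_accuracy(vectors, items, k):
    """Return (accuracy over solvable items, number of solvable items)."""
    solvable = [it for it in items if all(w in vectors for w in it)]
    if not solvable:
        return 0.0, 0
    hits = sum(analogy_rank(vectors, *it) <= k for it in solvable)
    return hits / len(solvable), len(solvable)


# Toy 3-dimensional vectors, invented for illustration only.
toy = {
    "rex": [1.0, 0.0, 0.0],
    "regis": [1.0, 1.0, 0.0],
    "dux": [0.0, 0.0, 1.0],
    "ducis": [0.0, 1.0, 1.0],
    "miles": [0.5, 0.0, 0.5],
}
items = [
    ("rex", "regis", "dux", "ducis"),
    ("rex", "regis", "lex", "legis"),  # "lex" and "legis" are out of vocabulary
]
accuracy, n_solvable = rank_k_accuracy(toy, items, k=1)
```

In this toy run the second item is dropped as unsolvable, so accuracy is computed over one item rather than two.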

| Model | Analogy Rank 1 | Analogy Rank 5 | Odd-One-Out |
|-------|----------------|----------------|-------------|
| **FastText CBOW-300-10** | **84.5%** | **96.6%** | 73.6% |
| Floret v3.9 (lg) | 81.4% | 95.3% | 74.0% |
| Word2Vec CBOW-300-10 | 70.2% | 91.3% | **79.1%** |
| GloVe 300 | 49.5% | 79.2% | 75.1% |

FastText leads on analogies due to subword information that captures Latin morphology. Word2Vec leads on odd-one-out, a test of semantic clustering. GloVe is weakest on analogies; it lacks subword representations, as does Word2Vec. Floret is used in the LatinCy spaCy pipelines because it is 6x smaller than FastText while remaining competitive (81.4% vs. 84.5% at Analogy Rank 1) and supports arbitrary vocabulary via hash-based lookups.

For full evaluation details, including per-category breakdowns and nearest-neighbor spot checks, see the [evaluation report](eval/reports/burns-2025-latincy-w2v-evaluation-datasets-report.pdf).
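
For readers unfamiliar with the odd-one-out task, the sketch below shows one common way to score it: pick the word whose mean cosine similarity to the other words is lowest. This heuristic is an assumption and is not necessarily the procedure used for the figures above. The function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def odd_one_out(vectors, words):
    """Return the word with the lowest mean cosine similarity to the rest."""

    def mean_similarity(word):
        others = [o for o in words if o != word]
        return sum(cosine(vectors[word], vectors[o]) for o in others) / len(others)

    return min(words, key=mean_similarity)


# Toy vectors, invented for illustration: three rulers and a river.
toy = {
    "rex": [1.0, 0.1, 0.0],
    "regina": [0.9, 0.2, 0.0],
    "princeps": [0.8, 0.1, 0.1],
    "flumen": [0.0, 0.1, 1.0],
}
guess = odd_one_out(toy, ["rex", "regina", "flumen", "princeps"])
```

Accuracy is then the share of items for which the guess matches the annotated outlier.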

## Usage

### From HuggingFace Hub

```python
from huggingface_hub import hf_hub_download

# Each call downloads one file into the local cache and returns its path.

# FastText binary model
path = hf_hub_download("latincy/la_vectors", "fasttext/la_fasttext_cbow_300_10.bin")

# Word2Vec text vectors
path = hf_hub_download("latincy/la_vectors", "word2vec/la_w2v_cbow_300_10.txt")

# GloVe vectors
path = hf_hub_download("latincy/la_vectors", "glove/la_glove_300.txt")
```
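
Once a `.txt` file is downloaded, it still has to be parsed. The sketch below is a dependency-free reader that assumes the usual text layout (one word per line followed by space-separated floats, with an optional `<count> <dim>` header line). The card does not document the file layout, so treat that as an assumption. `load_text_vectors` is a hypothetical helper, not part of any library.

```python
def load_text_vectors(path, limit=None):
    """Read space-separated word vectors into a dict of float lists.

    Skips a leading "<count> <dim>" header if present and stops after
    `limit` words when a limit is given.
    """
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        for i, line in enumerate(fh):
            parts = line.rstrip().split(" ")
            if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
                continue  # word2vec-style header line
            if len(parts) < 2:
                continue  # blank or malformed line
            vectors[parts[0]] = [float(x) for x in parts[1:]]
            if limit is not None and len(vectors) >= limit:
                break
    return vectors
```

Calling `load_text_vectors(path, limit=50000)` on one of the text files downloaded above would load only the first 50,000 entries, which keeps memory use modest. For full-size files, a dedicated loader such as gensim's `KeyedVectors.load_word2vec_format` is likely a better fit.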

### Floret (spaCy)

```python
import spacy

# Requires the la_vectors_floret_lg package to be installed
nlp = spacy.load("la_vectors_floret_lg")
doc = nlp("rex populum regit")
for token in doc:
    print(token.text, token.has_vector, token.vector[:5])
```
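
The hash-based lookup that lets Floret cover arbitrary vocabulary can be illustrated with a toy table. This is a conceptual sketch only: `zlib.crc32` stands in for floret's real hash function, and the n-gram range, the single hash per piece, and the averaging step are simplifying assumptions. All names here are hypothetical.

```python
import zlib


def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of the word padded with boundary markers."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def bucket_ids(word, n_buckets, n_min=3, n_max=5):
    """Hash the padded word and its n-grams into rows of a fixed-size table."""
    pieces = [f"<{word}>"] + char_ngrams(word, n_min, n_max)
    return [zlib.crc32(p.encode("utf-8")) % n_buckets for p in pieces]


def lookup(word, table):
    """Average the table rows selected by the word's hashed pieces."""
    ids = bucket_ids(word, len(table))
    dim = len(table[0])
    return [sum(table[i][d] for i in ids) / len(ids) for d in range(dim)]


# Toy table with 8 buckets and 2 dimensions; the released models use
# 200k (lg) or 50k (md) buckets and 300 dimensions.
table = [[float(i), float(-i)] for i in range(8)]
seen = lookup("regit", table)
unseen = lookup("regebantur", table)  # any string still maps to some rows
```

Because every string hashes to some rows, the table size is fixed in advance and no word is out of vocabulary, at the cost of unrelated n-grams occasionally sharing a row.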

## Training Corpus

All vectors are trained on the same corpus for valid cross-method comparison.

| Source | Sentences | Tokens |
|--------|-----------|--------|
| CC100-Latin | 6,507,840 | 128,886,505 |
| Latin Wikisource | 3,933,289 | 76,736,695 |
| Latin Wikipedia | 972,336 | 15,218,700 |
| CAMENA Neo-Latin | 736,400 | 9,970,933 |
| The Latin Library | 650,082 | 12,822,687 |
| CLTK-Tesserae | 516,930 | 6,626,484 |
| Perseus Digital Library | 223,535 | 4,317,063 |
| Patrologia Latina | 125,333 | 10,399,108 |
| UD Latin treebanks (6) | 55,332 | 980,787 |
| **Total** | **13,721,077** | **265,958,962** |

## Citation

```bibtex
@misc{burns2023latincy,
  title = "{LatinCy}: Synthetic Trained Pipelines for {L}atin {NLP}",
  author = "Burns, Patrick J.",
  year = "2023",
  eprint = "2305.04365",
  archivePrefix = "arXiv",
  primaryClass = "cs.CL",
  url = "https://arxiv.org/abs/2305.04365"
}
```

## References

- Sprugnoli, R., Passarotti, M., and Moretti, G. 2019. "Vir Is to Moderatus as Mulier Is to Intemperans: Lemma Embeddings for Latin." In *Proceedings of the Sixth Italian Conference on Computational Linguistics*. Bari, Italy. 1–7. http://ceur-ws.org/Vol-2481/paper69.pdf.

## Acknowledgments

This work was supported in part through the [NYU IT High Performance Computing](https://sites.google.com/nyu.edu/nyu-hpc/about/acknowledgement-statement) resources, services, and staff expertise.