---
language:
  - la
license: mit
tags:
  - word-vectors
  - latin
  - nlp
  - word2vec
  - fasttext
  - glove
  - static-vectors
  - digital-humanities
  - classics
  - latincy
model-index:
  - name: la_vectors
    results:
      - task:
          type: feature-extraction
          name: Word Analogy
        dataset:
          type: custom
          name: LatinCy Analogies (1,330 solvable / 1,383 total)
        metrics:
          - type: accuracy
            value: 84.5
            name: FastText CBOW-300-10 Rank 1
          - type: accuracy
            value: 81.4
            name: Floret v3.9 (lg) Rank 1
          - type: accuracy
            value: 70.2
            name: Word2Vec CBOW-300-10 Rank 1
          - type: accuracy
            value: 49.5
            name: GloVe 300 Rank 1
      - task:
          type: feature-extraction
          name: Odd-One-Out
        dataset:
          type: custom
          name: LatinCy Odd-One-Out (2,223 solvable / 2,728 total)
        metrics:
          - type: accuracy
            value: 79.1
            name: Word2Vec CBOW-300-10
          - type: accuracy
            value: 75.1
            name: GloVe 300
          - type: accuracy
            value: 74.0
            name: Floret v3.9 (lg)
          - type: accuracy
            value: 73.6
            name: FastText CBOW-300-10
---

# LatinCy Vectors

Static word vectors for Latin, trained on the [LatinCy](https://github.com/diyclassics/latincy) corpus. Floret, FastText, Word2Vec, and GloVe embeddings are trained on the same data and evaluated on the same benchmarks, so the methods can be compared directly.

## Available Models

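All models are trained with 300 dimensions, window size 10, min count 50, and 15 epochs on the full LatinCy corpus (13.7M sentences, ~266M tokens from 9 sources). The CBOW architecture and negative sampling (25) apply to Floret, FastText, and Word2Vec; GloVe is a co-occurrence model and has neither setting.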
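For readers who want to train comparable vectors, the settings listed above can be written down as keyword arguments. The sketch below uses gensim 4.x parameter names, which is an assumption: this card does not say which training tool was used. The `model_label` helper is hypothetical and only illustrates how a name such as `CBOW-300-10` appears to encode architecture, dimensions, and window size.

```python
# Hyperparameters from this card, written with gensim 4.x keyword names.
# Using gensim is an assumption; the original training tool is not stated.
CBOW_PARAMS = {
    "sg": 0,             # 0 selects CBOW (1 would select skip-gram)
    "vector_size": 300,  # embedding dimensions
    "window": 10,        # context window size
    "min_count": 50,     # ignore words seen fewer than 50 times
    "epochs": 15,        # passes over the corpus
    "negative": 25,      # negative samples per positive example
}


def model_label(prefix: str, params: dict) -> str:
    """Build a label such as 'Word2Vec CBOW-300-10' (hypothetical helper)."""
    arch = "CBOW" if params["sg"] == 0 else "SG"
    return f"{prefix} {arch}-{params['vector_size']}-{params['window']}"


# Hypothetical usage, assuming gensim is installed and `sentences` is an
# iterable of token lists:
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences=sentences, **CBOW_PARAMS)
print(model_label("Word2Vec", CBOW_PARAMS))
```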

| Model | Type | Vocab | HF Repo |
|-------|------|-------|---------|
| Floret (lg) | Hash-based subword | 200k buckets | [`latincy/la_vectors_floret_lg`](https://huggingface.co/latincy/la_vectors_floret_lg) |
| Floret (md) | Hash-based subword | 50k buckets | [`latincy/la_vectors_floret_md`](https://huggingface.co/latincy/la_vectors_floret_md) |
| FastText CBOW-300-10 | Subword (n-gram) | 233k words | [`latincy/la_vectors`](https://huggingface.co/latincy/la_vectors) |
| Word2Vec CBOW-300-10 | Word-level | 233k words | [`latincy/la_vectors`](https://huggingface.co/latincy/la_vectors) |
| GloVe 300 | Word-level (co-occurrence) | 233k words | [`latincy/la_vectors`](https://huggingface.co/latincy/la_vectors) |

Floret vectors are distributed separately as spaCy pipeline components. FastText, Word2Vec, and GloVe are in a single umbrella repo ([`latincy/la_vectors`](https://huggingface.co/latincy/la_vectors)).

## Evaluation

The models are evaluated on curated Latin benchmarks: 1,383 analogy items across 11 categories and 2,728 odd-one-out items. Items unsolvable by all models are excluded per the evaluation methodology, leaving 1,330 solvable analogies and 2,223 solvable odd-one-out items.

| Model | Analogy Rank 1 | Analogy Rank 5 | Odd-One-Out |
|-------|----------------|----------------|-------------|
| **FastText CBOW-300-10** | **84.5%** | **96.6%** | 73.6% |
| Floret v3.9 (lg) | 81.4% | 95.3% | 74.0% |
| Word2Vec CBOW-300-10 | 70.2% | 91.3% | **79.1%** |
| GloVe 300 | 49.5% | 79.2% | 75.1% |

FastText leads on analogy resolution, most likely because its subword information captures Latin morphology. Word2Vec leads on odd-one-out (semantic clustering). GloVe is the weakest on analogies; it has no subword representations, but Word2Vec lacks them too and still scores about 20 points higher at Rank 1, so subwords alone do not explain the gap. Floret is used in the LatinCy spaCy pipelines because it is 6x smaller than FastText while remaining competitive, and its hash-based lookups support arbitrary vocabulary.
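To make the Rank 1 and Rank 5 columns concrete, here is a minimal sketch of analogy scoring. It assumes the common 3CosAdd formulation (`b - a + c`, ranked by cosine similarity) and treats an item as unsolvable when any of its words is out of vocabulary; both are assumptions, and the evaluation report remains the authority on the actual method. The toy vectors and function names are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy_rank(vectors, a, b, c, expected):
    """Rank of `expected` for a : b :: c : ?, or None if a word is missing."""
    if any(w not in vectors for w in (a, b, c, expected)):
        return None  # assumed meaning of "unsolvable"
    # 3CosAdd: the answer should lie near b - a + c.
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    ranked = sorted(candidates,
                    key=lambda w: cosine(vectors[w], target), reverse=True)
    return ranked.index(expected) + 1


def accuracy_at_k(vectors, items, k):
    """Share of solvable items whose expected answer is ranked within k."""
    ranks = [analogy_rank(vectors, *item) for item in items]
    solvable = [r for r in ranks if r is not None]
    if not solvable:
        return 0.0
    return sum(r <= k for r in solvable) / len(solvable)


# Toy 2-d vectors, invented for illustration only.
TOY = {
    "vir": [1.0, 0.0], "femina": [1.0, 1.0],
    "rex": [3.0, 0.0], "regina": [3.0, 1.0],
    "puer": [0.2, -1.0],
}
ITEMS = [
    ("vir", "femina", "rex", "regina"),
    ("vir", "femina", "dominus", "domina"),  # out of vocabulary, so skipped
]
print(accuracy_at_k(TOY, ITEMS, k=1))
```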

For full evaluation details including per-category breakdowns and nearest-neighbor spot checks, see the [evaluation report](eval/reports/burns-2025-latincy-w2v-evaluation-datasets-report.pdf).
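Odd-one-out scoring can be sketched in a similar way. The version below picks the word whose unit vector is least similar to the group mean, which is one common approach; whether the LatinCy benchmark uses exactly this rule is an assumption. Words and vectors are toy values invented for illustration.

```python
import math


def unit(v):
    """Scale a vector to length 1 (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def odd_one_out(vectors, words):
    """Return the word least similar to the group mean, or None if unsolvable."""
    if any(w not in vectors for w in words):
        return None  # assumed meaning of "unsolvable"
    units = {w: unit(vectors[w]) for w in words}
    mean = unit([sum(col) / len(words) for col in zip(*units.values())])
    # Dot product of unit vectors equals cosine similarity.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Toy 2-d vectors: three animals and one weapon.
TOY_OOO = {
    "canis": [1.0, 0.1],
    "feles": [0.9, 0.2],
    "equus": [1.0, 0.0],
    "gladius": [0.1, 1.0],
}
print(odd_one_out(TOY_OOO, ["canis", "feles", "equus", "gladius"]))
```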

## Usage

### From HuggingFace Hub

```python
from huggingface_hub import hf_hub_download

# Each call downloads one file (cached locally) and returns its local path.

# FastText binary model
fasttext_path = hf_hub_download("latincy/la_vectors", "fasttext/la_fasttext_cbow_300_10.bin")

# Word2Vec text vectors
w2v_path = hf_hub_download("latincy/la_vectors", "word2vec/la_w2v_cbow_300_10.txt")

# GloVe vectors
glove_path = hf_hub_download("latincy/la_vectors", "glove/la_glove_300.txt")
```
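Once a text-format file is downloaded, it can be read into memory. The sketch below is a dependency-free reader, assuming one `word v1 v2 ...` entry per line with an optional word2vec-style `<vocab> <dim>` header; the exact layout of the files in this repo is not documented here, so treat that as an assumption. `load_text_vectors` is a hypothetical helper, and the temporary file only stands in for a real download.

```python
import os
import tempfile


def load_text_vectors(path, limit=None):
    """Read 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle):
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            # A word2vec-style header has exactly two integers: "<vocab> <dim>".
            if line_no == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
                continue
            vectors[parts[0]] = [float(x) for x in parts[1:]]
            if limit is not None and len(vectors) >= limit:
                break
    return vectors


# Tiny demonstration file standing in for a downloaded vector file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("2 3\nrex 0.1 0.2 0.3\nregina 0.2 0.1 0.4\n")
demo = load_text_vectors(tmp.name)
os.remove(tmp.name)
print(sorted(demo), len(demo["rex"]))
```

In practice a library loader is usually the better choice: gensim's `KeyedVectors.load_word2vec_format` reads this text format (with `no_header=True` for header-less files in gensim 4.x), and `gensim.models.fasttext.load_facebook_vectors` reads FastText `.bin` files. Use `limit` here, or the equivalent library option, to avoid loading all 233k words while experimenting.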

### Floret (spaCy)

```python
import spacy

nlp = spacy.load("la_vectors_floret_lg")
doc = nlp("rex populum regit")
for token in doc:
    print(token.text, token.has_vector, token.vector[:5])
```
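Every token in the example above gets a vector because Floret hashes words and character n-grams into a fixed number of buckets instead of keeping a closed vocabulary. The toy sketch below illustrates that idea only. It is not Floret's implementation: the SHA-256 hash, the n-gram range, the two hashes per n-gram, and the averaging are all stand-in assumptions, and every function name is hypothetical.

```python
import hashlib


def ngrams(word, minn=3, maxn=5):
    """Whole word plus character n-grams, with boundary markers."""
    wrapped = f"<{word}>"
    grams = [wrapped]  # the full word is hashed like any other n-gram
    for n in range(minn, maxn + 1):
        grams.extend(wrapped[i:i + n] for i in range(len(wrapped) - n + 1))
    return grams


def bucket_rows(gram, buckets, hash_count=2):
    """Map one n-gram to `hash_count` row indices in the bucket table."""
    rows = []
    for seed in range(hash_count):
        digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
        rows.append(int.from_bytes(digest[:8], "big") % buckets)
    return rows


def word_vector(word, table, hash_count=2):
    """Average the rows selected by all n-grams of `word`."""
    grams = ngrams(word)
    total = [0.0] * len(table[0])
    for gram in grams:
        for row in bucket_rows(gram, len(table), hash_count):
            for i, value in enumerate(table[row]):
                total[i] += value
    return [x / (len(grams) * hash_count) for x in total]


# Toy table: 50 buckets of 4 dimensions with arbitrary deterministic values.
TABLE = [[((r * 7 + c) % 5) / 5 for c in range(4)] for r in range(50)]
# Any string gets a vector, including forms never seen in training.
print(word_vector("regebantur", TABLE))
```

The design trade-off is that unrelated n-grams can collide in the same bucket; using several hashes per n-gram reduces the effect of any single collision while keeping the table small.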

## Training Corpus

All vectors are trained on the same corpus for valid cross-method comparison.

| Source | Sentences | Tokens |
|--------|-----------|--------|
| CC100-Latin | 6,507,840 | 128,886,505 |
| Latin Wikisource | 3,933,289 | 76,736,695 |
| Latin Wikipedia | 972,336 | 15,218,700 |
| CAMENA Neo-Latin | 736,400 | 9,970,933 |
| The Latin Library | 650,082 | 12,822,687 |
| CLTK-Tesserae | 516,930 | 6,626,484 |
| Perseus Digital Library | 223,535 | 4,317,063 |
| Patrologia Latina | 125,333 | 10,399,108 |
| UD Latin treebanks (6) | 55,332 | 980,787 |
| **Total** | **13,721,077** | **265,958,962** |

## Citation

```bibtex
@misc{burns2023latincy,
    title = "{LatinCy}: Synthetic Trained Pipelines for {L}atin {NLP}",
    author = "Burns, Patrick J.",
    year = "2023",
    eprint = "2305.04365",
    archivePrefix = "arXiv",
    primaryClass = "cs.CL",
    url = "https://arxiv.org/abs/2305.04365"
}
```

## References

- Sprugnoli, R., Passarotti, M., and Moretti, G. 2019. "Vir Is to Moderatus as Mulier Is to Intemperans: Lemma Embeddings for Latin." In *Proceedings of the Sixth Italian Conference on Computational Linguistics*. Bari, Italy. 1–7. http://ceur-ws.org/Vol-2481/paper69.pdf.

## Acknowledgments

This work was supported in part through the [NYU IT High Performance Computing](https://sites.google.com/nyu.edu/nyu-hpc/about/acknowledgement-statement) resources, services, and staff expertise.