	---
	language:
	- la
	license: mit
	tags:
	- word-vectors
	- latin
	- nlp
	- word2vec
	- fasttext
	- glove
	- static-vectors
	- digital-humanities
	- classics
	- latincy
	model-index:
	- name: la_vectors
	  results:
	  - task:
	      type: feature-extraction
	      name: Word Analogy
	    dataset:
	      type: custom
	      name: LatinCy Analogies (1,330 solvable / 1,383 total)
	    metrics:
	    - type: accuracy
	      value: 84.5
	      name: FastText CBOW-300-10 Rank 1
	    - type: accuracy
	      value: 81.4
	      name: Floret v3.9 (lg) Rank 1
	    - type: accuracy
	      value: 70.2
	      name: Word2Vec CBOW-300-10 Rank 1
	    - type: accuracy
	      value: 49.5
	      name: GloVe 300 Rank 1
	  - task:
	      type: feature-extraction
	      name: Odd-One-Out
	    dataset:
	      type: custom
	      name: LatinCy Odd-One-Out (2,223 solvable / 2,728 total)
	    metrics:
	    - type: accuracy
	      value: 79.1
	      name: Word2Vec CBOW-300-10
	    - type: accuracy
	      value: 75.1
	      name: GloVe 300
	    - type: accuracy
	      value: 74.0
	      name: Floret v3.9 (lg)
	    - type: accuracy
	      value: 73.6
	      name: FastText CBOW-300-10
	---

	# LatinCy Vectors

	Static word vectors for Latin, trained on the [LatinCy](https://github.com/diyclassics/latincy) corpus. Provides Floret, FastText, Word2Vec, and GloVe embeddings trained on the same data and evaluated on the same benchmarks for direct cross-method comparison.

	## Available Models

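	All models have 300 dimensions and are trained on the full LatinCy corpus (13.7M sentences, ~266M tokens from 9 sources). Floret, FastText, and Word2Vec use the CBOW architecture with window size 10, min count 50, 15 epochs, and 25 negative samples; GloVe is a co-occurrence model, so the CBOW and negative-sampling settings do not apply to it.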

	| Model | Type | Vocab | HF Repo |
	|-------|------|-------|---------|
	| Floret (lg) | Hash-based subword | 200k buckets | [`latincy/la_vectors_floret_lg`](https://huggingface.co/latincy/la_vectors_floret_lg) |
	| Floret (md) | Hash-based subword | 50k buckets | [`latincy/la_vectors_floret_md`](https://huggingface.co/latincy/la_vectors_floret_md) |
	| FastText CBOW-300-10 | Subword (n-gram) | 233k words | [`latincy/la_vectors`](https://huggingface.co/latincy/la_vectors) |
	| Word2Vec CBOW-300-10 | Word-level | 233k words | [`latincy/la_vectors`](https://huggingface.co/latincy/la_vectors) |
	| GloVe 300 | Word-level (co-occurrence) | 233k words | [`latincy/la_vectors`](https://huggingface.co/latincy/la_vectors) |

	Floret vectors are distributed separately as spaCy pipeline components. FastText, Word2Vec, and GloVe are in a single umbrella repo ([`latincy/la_vectors`](https://huggingface.co/latincy/la_vectors)).
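The "hash-based subword" entries in the table can be illustrated with a toy lookup. This is a minimal sketch of the general idea only: the MD5 hash, the n-gram range, the tiny table, and the helper names `char_ngrams`, `bucket`, `make_table`, and `word_vector` are all assumptions made for illustration, and the real floret library uses its own hash functions and settings, which are not reproduced here.

```python
import hashlib

DIM = 4          # toy dimensionality (the released vectors use 300)
N_BUCKETS = 16   # toy bucket count (the table above lists 200k and 50k)


def char_ngrams(word, n_min=3, n_max=5):
    """Return character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def bucket(ngram, n_buckets=N_BUCKETS):
    """Map an n-gram to a bucket index with a stable hash (MD5, illustration only)."""
    digest = hashlib.md5(ngram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def make_table(n_buckets=N_BUCKETS, dim=DIM):
    """Build a deterministic toy table; a trained model would learn these rows."""
    return [[((b + 1) * (d + 1)) % 7 / 7.0 for d in range(dim)] for b in range(n_buckets)]


def word_vector(word, table):
    """Average the bucket rows of all n-grams, so any string gets a vector."""
    rows = [table[bucket(g, len(table))] for g in char_ngrams(word)]
    dim = len(table[0])
    if not rows:
        return [0.0] * dim
    return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]


table = make_table()
vec = word_vector("regibus", table)  # no per-word vocabulary entry is needed
```

Because every string maps to some set of buckets, the table size is fixed by the bucket count rather than by the number of distinct word forms, which matters for a heavily inflected language like Latin.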

	## Evaluation

	Evaluated on curated Latin benchmarks: 1,383 analogy items across 11 categories and 2,728 odd-one-out items. Items unsolvable by all models are excluded per the evaluation methodology (1,330 analogies and 2,223 odd-one-out items are solvable).

	| Model | Analogy Rank 1 | Analogy Rank 5 | Odd-One-Out |
	|-------|----------------|----------------|-------------|
	| FastText CBOW-300-10 | 84.5% | 96.6% | 73.6% |
	| Floret v3.9 (lg) | 81.4% | 95.3% | 74.0% |
	| Word2Vec CBOW-300-10 | 70.2% | 91.3% | 79.1% |
	| GloVe 300 | 49.5% | 79.2% | 75.1% |

	FastText leads on analogy resolution (84.5% at Rank 1), most likely because its subword information captures Latin morphology. Word2Vec leads on odd-one-out (79.1%), a test of semantic clustering. GloVe is weakest on analogies (49.5%); like Word2Vec, it lacks subword representations. Floret is used in LatinCy spaCy pipelines because it is 6x smaller than FastText while remaining competitive, and because it supports arbitrary vocabulary via hash-based lookups.
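To make the Rank 1 and Rank 5 columns concrete, here is a minimal sketch of analogy scoring. It assumes the common vector-offset method (rank candidates by cosine similarity to `b - a + c`) and reads "Rank k" as "the expected word is among the top k candidates"; the evaluation report may define both differently. The toy vectors and the helper names are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy_candidates(vectors, a, b, c):
    """Rank words by similarity to b - a + c, excluding the three query words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    scored = [
        (cosine(target, vec), word)
        for word, vec in vectors.items()
        if word not in (a, b, c)
    ]
    return [word for _, word in sorted(scored, reverse=True)]


def hit_at_k(vectors, item, k):
    """True if the expected answer is among the top k candidates."""
    a, b, c, expected = item
    return expected in analogy_candidates(vectors, a, b, c)[:k]


# Invented 3-d toy vectors; the axes loosely stand for (royal, feminine, person).
toy = {
    "vir": [0.1, 0.0, 1.0],
    "mulier": [0.1, 1.0, 1.0],
    "rex": [1.0, 0.0, 1.0],
    "regina": [1.0, 1.0, 1.0],
    "urbs": [0.5, 0.5, 0.0],
    "puer": [0.0, 0.1, 0.5],
}

# vir : mulier :: rex : ?
ranked = analogy_candidates(toy, "vir", "mulier", "rex")
```

Accuracy at Rank k is then the share of benchmark items for which `hit_at_k` returns `True`.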

	For full evaluation details including per-category breakdowns and nearest-neighbor spot checks, see the [evaluation report](eval/reports/burns-2025-latincy-w2v-evaluation-datasets-report.pdf).
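The odd-one-out task can be sketched in a similar spirit. The version below assumes one simple scoring rule, picking the word with the lowest mean cosine similarity to the rest of its group; the actual benchmark may use a different rule. The toy vectors and the names `odd_one_out` and `accuracy` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def odd_one_out(vectors, words):
    """Return the word with the lowest mean similarity to the other words."""
    def mean_similarity(word):
        others = [w for w in words if w != word]
        return sum(cosine(vectors[word], vectors[w]) for w in others) / len(others)
    return min(words, key=mean_similarity)


def accuracy(vectors, items):
    """Share of items where the predicted outlier matches the labelled one."""
    hits = sum(odd_one_out(vectors, words) == expected for words, expected in items)
    return hits / len(items)


# Invented 2-d toy vectors: three "ruler" words and one unrelated word.
toy = {
    "rex": [1.0, 0.1],
    "regina": [0.9, 0.2],
    "princeps": [0.8, 0.1],
    "aqua": [0.1, 1.0],
}
items = [(["rex", "regina", "princeps", "aqua"], "aqua")]
score = accuracy(toy, items)
```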

	## Usage

	### From HuggingFace Hub

	```python
	from huggingface_hub import hf_hub_download

	# Each call downloads one file to the local cache and returns its path.

	# FastText binary model
	fasttext_path = hf_hub_download(repo_id="latincy/la_vectors", filename="fasttext/la_fasttext_cbow_300_10.bin")

	# Word2Vec text vectors
	w2v_path = hf_hub_download(repo_id="latincy/la_vectors", filename="word2vec/la_w2v_cbow_300_10.txt")

	# GloVe vectors
	glove_path = hf_hub_download(repo_id="latincy/la_vectors", filename="glove/la_glove_300.txt")
	```
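Once a text file is downloaded, it can be read without extra dependencies. The sketch below assumes the `.txt` files follow the usual word2vec/GloVe text layout (one word followed by its values per line, with an optional `vocab_size dimensions` header line); that layout is not confirmed here, so check the first lines of the file before relying on it. The function name `read_text_vectors` is hypothetical.

```python
import io


def read_text_vectors(handle):
    """Parse 'word v1 v2 ...' lines into a dict, skipping an optional header."""
    vectors = {}
    first = True
    for line in handle:
        parts = line.split()
        if first:
            first = False
            # A word2vec-style header is two integers: vocabulary size and dimensions.
            if len(parts) == 2 and all(p.isdigit() for p in parts):
                continue
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# In-memory sample standing in for a downloaded file.
sample = io.StringIO("2 3\nrex 0.1 0.2 0.3\nregina 0.4 0.5 0.6\n")
vectors = read_text_vectors(sample)
```

For a real file, pass an open handle for the path returned by `hf_hub_download`, for example `open(path, encoding="utf-8")`. Plain Python lists are memory-hungry at 233k words by 300 dimensions, so a dedicated loader such as gensim's `KeyedVectors.load_word2vec_format` is likely a better fit for full-size use.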

	### Floret (spaCy)

	```python
	import spacy

	nlp = spacy.load("la_vectors_floret_lg")
	doc = nlp("rex populum regit")
	for token in doc:
	    print(token.text, token.has_vector, token.vector[:5])
	```

	## Training Corpus

	All vectors are trained on the same corpus for valid cross-method comparison.

	| Source | Sentences | Tokens |
	|--------|-----------|--------|
	| CC100-Latin | 6,507,840 | 128,886,505 |
	| Latin Wikisource | 3,933,289 | 76,736,695 |
	| Latin Wikipedia | 972,336 | 15,218,700 |
	| CAMENA Neo-Latin | 736,400 | 9,970,933 |
	| The Latin Library | 650,082 | 12,822,687 |
	| CLTK-Tesserae | 516,930 | 6,626,484 |
	| Perseus Digital Library | 223,535 | 4,317,063 |
	| Patrologia Latina | 125,333 | 10,399,108 |
	| UD Latin treebanks (6) | 55,332 | 980,787 |
	| Total | 13,721,077 | 265,958,962 |
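As a quick sanity check on the table, the per-source counts can be summed and turned into token shares. The figures below are copied from the table; the variable names are illustrative only.

```python
# (sentences, tokens) per source, copied from the table above
corpus = {
    "CC100-Latin": (6_507_840, 128_886_505),
    "Latin Wikisource": (3_933_289, 76_736_695),
    "Latin Wikipedia": (972_336, 15_218_700),
    "CAMENA Neo-Latin": (736_400, 9_970_933),
    "The Latin Library": (650_082, 12_822_687),
    "CLTK-Tesserae": (516_930, 6_626_484),
    "Perseus Digital Library": (223_535, 4_317_063),
    "Patrologia Latina": (125_333, 10_399_108),
    "UD Latin treebanks (6)": (55_332, 980_787),
}

total_sentences = sum(sentences for sentences, _ in corpus.values())
total_tokens = sum(tokens for _, tokens in corpus.values())

# Share of all tokens contributed by each source
shares = {name: tokens / total_tokens for name, (_, tokens) in corpus.items()}

for name, (sentences, tokens) in corpus.items():
    print(f"{name:<26} {shares[name]:6.1%} of tokens, {tokens / sentences:5.1f} tokens/sentence")
```

The sums match the Total row, and the web-crawled CC100-Latin source alone accounts for just under half of all tokens.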

	## Citation

	```bibtex
	@misc{burns2023latincy,
	title = "{LatinCy}: Synthetic Trained Pipelines for {L}atin {NLP}",
	author = "Burns, Patrick J.",
	year = "2023",
	eprint = "2305.04365",
	archivePrefix = "arXiv",
	primaryClass = "cs.CL",
	url = "https://arxiv.org/abs/2305.04365"
	}
	```

	## References

	- Sprugnoli, R., Passarotti, M., and Moretti, G. 2019. "Vir Is to Moderatus as Mulier Is to Intemperans: Lemma Embeddings for Latin." In Proceedings of the Sixth Italian Conference on Computational Linguistics. Bari, Italy. 1–7. http://ceur-ws.org/Vol-2481/paper69.pdf.

	## Acknowledgments

	This work was supported in part through the [NYU IT High Performance Computing](https://sites.google.com/nyu.edu/nyu-hpc/about/acknowledgement-statement) resources, services, and staff expertise.