---
language:
- en
license: gpl-3.0
tags:
- word-embeddings
- word2vec
- embeddings
- nlp
- free-software
- dfsg
datasets:
- wikimedia/wikipedia
- pg19
metrics:
- accuracy
model-index:
- name: fle-v34
results:
- task:
type: word-analogy
name: Word Analogy
dataset:
type: custom
name: Google Analogy Test Set
metrics:
- type: accuracy
value: 66.5
name: Overall Accuracy
- type: accuracy
value: 61.4
name: Semantic Accuracy
- type: accuracy
value: 69.2
name: Syntactic Accuracy
library_name: numpy
pipeline_tag: feature-extraction
---
# Free Language Embeddings (V34)
300-dimensional word vectors trained from scratch on ~2B tokens of freely-licensed text using a single RTX 3090.
**66.5% on Google analogies** – beating the original word2vec (61% on 6B tokens) by 5.5 points with one third of the data.
## Model Details
| | |
|---|---|
| **Architecture** | Dynamic masking word2vec skip-gram |
| **Dimensions** | 300 |
| **Vocabulary** | 100,000 whole words |
| **Training data** | ~2B tokens, all [DFSG-compliant](https://wiki.debian.org/DFSGLicenses) (see below) |
| **Training hardware** | Single NVIDIA RTX 3090 |
| **Training time** | ~4 days (2M steps) |
| **License** | GPL-3.0 |
| **Parameters** | 60M (30M target + 30M context embeddings) |
### Training Data
All training data meets the [Debian Free Software Guidelines](https://wiki.debian.org/DFSGLicenses) for redistribution, modification, and use. No web scrapes, no proprietary datasets.
| Source | Weight | License |
|--------|--------|---------|
| Wikipedia | 30% | CC BY-SA 3.0 |
| Project Gutenberg | 20% | Public domain |
| arXiv | 20% | Various open access |
| Stack Exchange | 16% | CC BY-SA 4.0 |
| US Government Publishing Office | 10% | Public domain (US gov) |
| RFCs | 2.5% | IETF Trust |
| Linux kernel docs, Arch Wiki, TLDP, GNU manuals, man pages | 1.5% | GPL/GFDL |
## Benchmark Results
| Model | Data | Google Analogies |
|-------|------|-----------------|
| **fle V34 (this model)** | **~2B tokens** | **66.5%** |
| word2vec (Mikolov 2013) | 6B tokens | 61.0% |
| GloVe (small) | 6B tokens | 71.0% |
| Google word2vec | 6B tokens | 72.7% |
| GloVe (Pennington 2014) | 840B tokens | 75.6% |
| FastText (Bojanowski 2017) | 16B tokens | 77.0% |
Category breakdown: semantic 61.4%, syntactic 69.2%. Strongest categories: comparatives 91.7%, plurals 86.8%, capital cities 82.6%.
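The Google analogy benchmark is usually scored with the 3CosAdd rule: for an analogy a:b :: c:?, take the word whose vector is most cosine-similar to vec(b) − vec(a) + vec(c), excluding the three query words. A minimal sketch with toy 2-D vectors (the vocabulary and vectors below are illustrative, not from the model):

```python
import numpy as np

def solve_analogy(vecs, vocab, a, b, c):
    """3CosAdd: return the word nearest to vec(b) - vec(a) + vec(c),
    by cosine similarity, excluding the three query words."""
    idx = {w: i for i, w in enumerate(vocab)}
    # L2-normalize rows so a dot product equals cosine similarity
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    target = unit[idx[b]] - unit[idx[a]] + unit[idx[c]]
    scores = unit @ (target / np.linalg.norm(target))
    for w in (a, b, c):  # query words may not answer themselves
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

# Toy vectors arranged so man:king :: woman:queen holds geometrically
vocab = ["king", "man", "woman", "queen", "prince"]
vecs = np.array([[1.0, 1.0], [1.0, 0.0], [2.0, 0.0], [2.0, 1.0], [1.0, -1.0]])
print(solve_analogy(vecs, vocab, "man", "king", "woman"))  # → queen
```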
## Quick Start
```bash
# Download
pip install huggingface_hub numpy
python -c "
from huggingface_hub import hf_hub_download
hf_hub_download('hackersgame/Free_Language_Embeddings', 'fle_v34.npz', local_dir='.')
hf_hub_download('hackersgame/Free_Language_Embeddings', 'fle.py', local_dir='.')
"
# Use
python fle.py king - man + woman
python fle.py --similar cat
python fle.py # interactive mode
```
### Python API
```python
from fle import FLE
fle = FLE() # loads fle_v34.npz
vec = fle["cat"] # 300d numpy array
fle.similar("cat", n=10) # nearest neighbors
fle.analogy("king", "man", "woman") # king:man :: woman:?
fle.similarity("cat", "dog") # cosine similarity
fle.query("king - man + woman") # vector arithmetic
```
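If you prefer to skip the `fle.py` helper, the `.npz` archive can be read with plain numpy. The key names below (`vectors`, `vocab`) are assumptions for illustration; inspect `np.load("fle_v34.npz").files` to see the actual layout. The sketch builds a tiny stand-in archive so it is self-contained:

```python
import numpy as np

# Stand-in archive with the assumed layout: a float matrix of vectors
# plus a parallel array of words. The real fle_v34.npz may use different
# key names -- check np.load("fle_v34.npz").files to confirm.
np.savez("toy_fle.npz",
         vectors=np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]),
         vocab=np.array(["cat", "kitten", "stone"]))

data = np.load("toy_fle.npz")
vecs, vocab = data["vectors"], list(data["vocab"])
idx = {w: i for i, w in enumerate(vocab)}
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def nearest(word, n=3):
    """Top-n cosine neighbors of `word`, excluding the word itself."""
    scores = unit @ unit[idx[word]]
    order = np.argsort(-scores)
    return [(vocab[i], float(scores[i])) for i in order if vocab[i] != word][:n]

print(nearest("cat"))  # "kitten" ranks above "stone"
```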
## Examples
```
$ python fle.py king - man + woman
→ queen 0.7387
→ princess 0.6781
→ monarch 0.5546
$ python fle.py paris - france + germany
→ berlin 0.8209
→ vienna 0.7862
→ munich 0.7850
$ python fle.py --similar cat
kitten 0.7168
cats 0.6849
tabby 0.6572
dog 0.5919
$ python fle.py ubuntu - debian + redhat
centos 0.6261
linux 0.6016
rhel 0.5949
$ python fle.py brain
cerebral 0.6665
cerebellum 0.6022
nerves 0.5748
```
## What Makes This Different
- **Free as in freedom.** Every dataset is DFSG-compliant. Every weight is reproducible. GPL-3.0 licensed. The goal: word embeddings you could `apt install` from Debian main.
- **Dynamic masking.** Randomly masks context positions during training, forcing the model to extract signal from partial views. The result: geometry that crystallizes during cosine LR decay, with analogy accuracy jumping from 1.2% to 66.5% in the second half of training.
- **Whole-word vocabulary.** No subword tokenization. Subwords break word2vec geometry: individually they carry too little meaning for co-occurrence statistics to produce useful structure.
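The exact masking scheme lives in the training repo; as an illustration of the idea, here is a sketch of skip-gram pair generation where each context slot is independently dropped with some probability, so repeated passes over the same sentence expose different partial views of each context (`window` and `mask_prob` values are illustrative):

```python
import random

def skipgram_pairs(tokens, window=5, mask_prob=0.3, rng=random):
    """Yield (center, context) skip-gram pairs with dynamic masking.

    Each position inside the window is independently dropped with
    probability `mask_prob`; with mask_prob=0 this is vanilla skip-gram.
    """
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            if rng.random() < mask_prob:  # dynamic mask: skip this slot
                continue
            yield center, tokens[j]

pairs = list(skipgram_pairs("the quick brown fox".split(),
                            window=2, mask_prob=0.0))
print(pairs[:2])  # pairs for the first center word "the"
```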
## Training
Trained with a cosine learning rate schedule (3e-4 → 1e-6). The training curve shows a striking crystallization pattern: near-zero analogy accuracy for the first 50% of training, then rapid emergence of geometric structure as the learning rate decays.
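The schedule quoted above has the standard closed form; a minimal sketch, assuming the 3e-4 → 1e-6 decay runs over the full 2M steps:

```python
import math

def cosine_lr(step, total_steps=2_000_000, lr_max=3e-4, lr_min=1e-6):
    """Cosine-annealed learning rate: lr_max at step 0, lr_min at the end."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# lr_max at step 0, halfway point of the decay at step 1M, lr_min at step 2M
for step in (0, 1_000_000, 2_000_000):
    print(step, cosine_lr(step))
```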
Full training code and visualizations: [github.com/ruapotato/Free-Language-Embeddings](https://github.com/ruapotato/Free-Language-Embeddings)
## Interactive Visualizations
- [Embedding Spectrogram](https://ruapotato.github.io/Free-Language-Embeddings/spectrogram.html) – PCA waves, sine fits, cosine surfaces
- [3D Semantic Directions](https://ruapotato.github.io/Free-Language-Embeddings/semantic_3d.html) – see how semantic axes align in the learned geometry
- [Training Dashboard](https://ruapotato.github.io/Free-Language-Embeddings/dashboard.html) – loss curves and training metrics
## Citation
```bibtex
@misc{hamner2026fle,
  title={Free Language Embeddings: Dynamic Masking Word2Vec on DFSG-Compliant Data},
  author={David Hamner},
  year={2026},
  url={https://github.com/ruapotato/Free-Language-Embeddings}
}
```
## License
GPL-3.0. See [LICENSE](https://github.com/ruapotato/Free-Language-Embeddings/blob/main/LICENSE) for details.
Built by David Hamner.