File size: 5,875 Bytes
6132608 454be19 6132608 454be19 6132608 454be19 6132608 527d48d 6132608 454be19 6132608 527d48d 6132608 ceeb92a | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 | ---
language:
- en
license: gpl-3.0
tags:
- word-embeddings
- word2vec
- embeddings
- nlp
- free-software
- dfsg
datasets:
- wikimedia/wikipedia
- pg19
metrics:
- accuracy
model-index:
- name: fle-v34
results:
- task:
type: word-analogy
name: Word Analogy
dataset:
type: custom
name: Google Analogy Test Set
metrics:
- type: accuracy
value: 66.5
name: Overall Accuracy
- type: accuracy
value: 61.4
name: Semantic Accuracy
- type: accuracy
value: 69.2
name: Syntactic Accuracy
library_name: numpy
pipeline_tag: feature-extraction
---
# Free Language Embeddings (V34)
300-dimensional word vectors trained from scratch on ~2B tokens of freely-licensed text using a single RTX 3090.
**66.5% on Google analogies** β beating the original word2vec (61% on 6B tokens) by 5.5 points with 1/3 the data.
## Model Details
| | |
|---|---|
| **Architecture** | Dynamic masking word2vec skip-gram |
| **Dimensions** | 300 |
| **Vocabulary** | 100,000 whole words |
| **Training data** | ~2B tokens, all [DFSG-compliant](https://wiki.debian.org/DFSGLicenses) (see below) |
| **Training hardware** | Single NVIDIA RTX 3090 |
| **Training time** | ~4 days (2M steps) |
| **License** | GPL-3.0 |
| **Parameters** | 60M (30M target + 30M context embeddings) |
### Training Data
All training data meets the [Debian Free Software Guidelines](https://wiki.debian.org/DFSGLicenses) for redistribution, modification, and use. No web scrapes, no proprietary datasets.
| Source | Weight | License |
|--------|--------|---------|
| Wikipedia | 30% | CC BY-SA 3.0 |
| Project Gutenberg | 20% | Public domain |
| arXiv | 20% | Various open access |
| Stack Exchange | 16% | CC BY-SA 4.0 |
| US Government Publishing Office | 10% | Public domain (US gov) |
| RFCs | 2.5% | IETF Trust |
| Linux kernel docs, Arch Wiki, TLDP, GNU manuals, man pages | 1.5% | GPL/GFDL |
## Benchmark Results
| Model | Data | Google Analogies |
|-------|------|-----------------|
| **fle V34 (this model)** | **~2B tokens** | **66.5%** |
| word2vec (Mikolov 2013) | 6B tokens | 61.0% |
| GloVe (small) | 6B tokens | 71.0% |
| Google word2vec | 6B tokens | 72.7% |
| GloVe (Pennington 2014) | 840B tokens | 75.6% |
| FastText (Bojanowski 2017) | 16B tokens | 77.0% |
Breakdown: semantic 61.4%, syntactic 69.2%. Comparatives 91.7%, plurals 86.8%, capitals 82.6%.
## Quick Start
```bash
# Download
pip install huggingface_hub numpy
python -c "
from huggingface_hub import hf_hub_download
hf_hub_download('hackersgame/Free_Language_Embeddings', 'fle_v34.npz', local_dir='.')
hf_hub_download('hackersgame/Free_Language_Embeddings', 'fle.py', local_dir='.')
"
# Use
python fle.py king - man + woman
python fle.py --similar cat
python fle.py # interactive mode
```
### Python API
```python
from fle import FLE
fle = FLE() # loads fle_v34.npz
vec = fle["cat"] # 300d numpy array
fle.similar("cat", n=10) # nearest neighbors
fle.analogy("king", "man", "woman") # king:man :: woman:?
fle.similarity("cat", "dog") # cosine similarity
fle.query("king - man + woman") # vector arithmetic
```
## Examples
```
$ python fle.py king - man + woman
β queen 0.7387
β princess 0.6781
β monarch 0.5546
$ python fle.py paris - france + germany
β berlin 0.8209
β vienna 0.7862
β munich 0.7850
$ python fle.py --similar cat
kitten 0.7168
cats 0.6849
tabby 0.6572
dog 0.5919
$ python fle.py ubuntu - debian + redhat
centos 0.6261
linux 0.6016
rhel 0.5949
$ python fle.py brain
cerebral 0.6665
cerebellum 0.6022
nerves 0.5748
```
## What Makes This Different
- **Free as in freedom.** Every dataset is DFSG-compliant. Every weight is reproducible. GPL-3.0 licensed. The goal: word embeddings you could `apt install` from Debian main.
- **Dynamic masking.** Randomly masks context positions during training, forcing the model to extract signal from partial views. The result: geometry that crystallizes during cosine LR decay β analogies jump from 1.2% to 66.5% in the second half of training.
- **Whole-word vocabulary.** No subword tokenization. Subwords break word2vec geometry completely β they don't carry enough meaning individually for co-occurrence statistics to produce useful structure.
## Training
Trained with cosine learning rate schedule (3e-4 β 1e-6). The training curve shows a striking crystallization pattern: near-zero analogy accuracy for the first 50% of training, then rapid emergence of geometric structure as the learning rate decays.
Full training code and visualizations: [github.com/ruapotato/Free-Language-Embeddings](https://github.com/ruapotato/Free-Language-Embeddings)
## Interactive Visualizations
- [Embedding Spectrogram](https://ruapotato.github.io/Free-Language-Embeddings/spectrogram.html) β PCA waves, sine fits, cosine surfaces
- [3D Semantic Directions](https://ruapotato.github.io/Free-Language-Embeddings/semantic_3d.html) β See how semantic axes align in the learned geometry
- [Training Dashboard](https://ruapotato.github.io/Free-Language-Embeddings/dashboard.html) β Loss curves and training metrics
## Citation
```bibtex
@misc{hamner2026fle,
title={Free Language Embeddings: Dynamic Masking Word2Vec on DFSG-Compliant Data},
author={David Hamner},
year={2026},
url={https://github.com/ruapotato/Free-Language-Embeddings}
}
```
## License
GPL-3.0 β See [LICENSE](https://github.com/ruapotato/Free-Language-Embeddings/blob/main/LICENSE) for details.
Built by David Hamner.
|