---
language:
- en
license: gpl-3.0
tags:
- word-embeddings
- word2vec
- embeddings
- nlp
- free-software
- dfsg
datasets:
- wikimedia/wikipedia
- pg19
metrics:
- accuracy
model-index:
- name: fle-v34
results:
- task:
type: word-analogy
name: Word Analogy
dataset:
type: custom
name: Google Analogy Test Set
metrics:
- type: accuracy
value: 66.5
name: Overall Accuracy
- type: accuracy
value: 61.4
name: Semantic Accuracy
- type: accuracy
value: 69.2
name: Syntactic Accuracy
library_name: numpy
pipeline_tag: feature-extraction
---
# Free Language Embeddings (V34)
300-dimensional word vectors trained from scratch on ~2B tokens of freely-licensed text using a single RTX 3090.
**66.5% on Google analogies** – beating the original word2vec (61% on 6B tokens) by 5.5 points with one third of the data.
## Model Details
| | |
|---|---|
| **Architecture** | Dynamic masking word2vec skip-gram |
| **Dimensions** | 300 |
| **Vocabulary** | 100,000 whole words |
| **Training data** | ~2B tokens, all [DFSG-compliant](https://wiki.debian.org/DFSGLicenses) (see below) |
| **Training hardware** | Single NVIDIA RTX 3090 |
| **Training time** | ~4 days (2M steps) |
| **License** | GPL-3.0 |
| **Parameters** | 60M (30M target + 30M context embeddings) |
### Training Data
All training data meets the [Debian Free Software Guidelines](https://wiki.debian.org/DFSGLicenses) for redistribution, modification, and use. No web scrapes, no proprietary datasets.
| Source | Weight | License |
|--------|--------|---------|
| Wikipedia | 30% | CC BY-SA 3.0 |
| Project Gutenberg | 20% | Public domain |
| arXiv | 20% | Various open access |
| Stack Exchange | 16% | CC BY-SA 4.0 |
| US Government Publishing Office | 10% | Public domain (US gov) |
| RFCs | 2.5% | IETF Trust |
| Linux kernel docs, Arch Wiki, TLDP, GNU manuals, man pages | 1.5% | GPL/GFDL |
## Benchmark Results
| Model | Data | Google Analogies |
|-------|------|-----------------|
| **fle V34 (this model)** | **~2B tokens** | **66.5%** |
| word2vec (Mikolov 2013) | 6B tokens | 61.0% |
| GloVe (small) | 6B tokens | 71.0% |
| Google word2vec | 6B tokens | 72.7% |
| GloVe (Pennington 2014) | 840B tokens | 75.6% |
| FastText (Bojanowski 2017) | 16B tokens | 77.0% |
Category breakdown: semantic 61.4%, syntactic 69.2%. Strongest categories: comparatives 91.7%, plurals 86.8%, capital cities 82.6%.
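The Google analogy benchmark is usually scored with the 3CosAdd rule: for an analogy a:b :: c:?, take the word whose vector is most cosine-similar to vec(b) − vec(a) + vec(c), excluding the three query words. A minimal sketch with toy 2-D vectors (the vocabulary and vectors below are illustrative, not from the model):

```python
import numpy as np

def solve_analogy(vecs, vocab, a, b, c):
    """3CosAdd: return the word nearest to vec(b) - vec(a) + vec(c),
    by cosine similarity, excluding the three query words."""
    idx = {w: i for i, w in enumerate(vocab)}
    # L2-normalize rows so a dot product equals cosine similarity
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    target = unit[idx[b]] - unit[idx[a]] + unit[idx[c]]
    scores = unit @ (target / np.linalg.norm(target))
    for w in (a, b, c):  # query words may not answer themselves
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

# Toy vectors arranged so man:king :: woman:queen holds geometrically
vocab = ["king", "man", "woman", "queen", "prince"]
vecs = np.array([[1.0, 1.0], [1.0, 0.0], [2.0, 0.0], [2.0, 1.0], [1.0, -1.0]])
print(solve_analogy(vecs, vocab, "man", "king", "woman"))  # → queen
```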
## Quick Start
```bash
# Download
pip install huggingface_hub numpy
python -c "
from huggingface_hub import hf_hub_download
hf_hub_download('hackersgame/Free_Language_Embeddings', 'fle_v34.npz', local_dir='.')
hf_hub_download('hackersgame/Free_Language_Embeddings', 'fle.py', local_dir='.')
"
# Use
python fle.py king - man + woman
python fle.py --similar cat
python fle.py # interactive mode
```
### Python API
```python
from fle import FLE
fle = FLE() # loads fle_v34.npz
vec = fle["cat"] # 300d numpy array
fle.similar("cat", n=10) # nearest neighbors
fle.analogy("king", "man", "woman") # king:man :: woman:?
fle.similarity("cat", "dog") # cosine similarity
fle.query("king - man + woman") # vector arithmetic
```
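If you prefer to skip the `fle.py` helper, the `.npz` archive can be read with plain numpy. The key names below (`vectors`, `vocab`) are assumptions for illustration; inspect `np.load("fle_v34.npz").files` to see the actual layout. The sketch builds a tiny stand-in archive so it is self-contained:

```python
import numpy as np

# Stand-in archive with the assumed layout: a float matrix of vectors
# plus a parallel array of words. The real fle_v34.npz may use different
# key names -- check np.load("fle_v34.npz").files to confirm.
np.savez("toy_fle.npz",
         vectors=np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]),
         vocab=np.array(["cat", "kitten", "stone"]))

data = np.load("toy_fle.npz")
vecs, vocab = data["vectors"], list(data["vocab"])
idx = {w: i for i, w in enumerate(vocab)}
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def nearest(word, n=3):
    """Top-n cosine neighbors of `word`, excluding the word itself."""
    scores = unit @ unit[idx[word]]
    order = np.argsort(-scores)
    return [(vocab[i], float(scores[i])) for i in order if vocab[i] != word][:n]

print(nearest("cat"))  # "kitten" ranks above "stone"
```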
## Examples
```
$ python fle.py king - man + woman
→ queen 0.7387
→ princess 0.6781
→ monarch 0.5546
$ python fle.py paris - france + germany
→ berlin 0.8209
→ vienna 0.7862
→ munich 0.7850
$ python fle.py --similar cat
kitten 0.7168
cats 0.6849
tabby 0.6572
dog 0.5919
$ python fle.py ubuntu - debian + redhat
centos 0.6261
linux 0.6016
rhel 0.5949
$ python fle.py brain
cerebral 0.6665
cerebellum 0.6022
nerves 0.5748
```
## What Makes This Different
- **Free as in freedom.** Every dataset is DFSG-compliant. Every weight is reproducible. GPL-3.0 licensed. The goal: word embeddings you could `apt install` from Debian main.
- **Dynamic masking.** Randomly masks context positions during training, forcing the model to extract signal from partial views. The result: geometry that crystallizes during cosine LR decay, with analogy accuracy jumping from 1.2% to 66.5% in the second half of training.
- **Whole-word vocabulary.** No subword tokenization. Subwords break word2vec geometry: individually they carry too little meaning for co-occurrence statistics to produce useful structure.
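The exact masking scheme lives in the training repo; as an illustration of the idea, here is a sketch of skip-gram pair generation where each context slot is independently dropped with some probability, so repeated passes over the same sentence expose different partial views of each context (`window` and `mask_prob` values are illustrative):

```python
import random

def skipgram_pairs(tokens, window=5, mask_prob=0.3, rng=random):
    """Yield (center, context) skip-gram pairs with dynamic masking.

    Each position inside the window is independently dropped with
    probability `mask_prob`; with mask_prob=0 this is vanilla skip-gram.
    """
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            if rng.random() < mask_prob:  # dynamic mask: skip this slot
                continue
            yield center, tokens[j]

pairs = list(skipgram_pairs("the quick brown fox".split(),
                            window=2, mask_prob=0.0))
print(pairs[:2])  # pairs for the first center word "the"
```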
## Training
Trained with a cosine learning rate schedule (3e-4 → 1e-6). The training curve shows a striking crystallization pattern: near-zero analogy accuracy for the first 50% of training, then rapid emergence of geometric structure as the learning rate decays.
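The schedule quoted above has the standard closed form; a minimal sketch, assuming the 3e-4 → 1e-6 decay runs over the full 2M steps:

```python
import math

def cosine_lr(step, total_steps=2_000_000, lr_max=3e-4, lr_min=1e-6):
    """Cosine-annealed learning rate: lr_max at step 0, lr_min at the end."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# lr_max at step 0, halfway point of the decay at step 1M, lr_min at step 2M
for step in (0, 1_000_000, 2_000_000):
    print(step, cosine_lr(step))
```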
Full training code and visualizations: [github.com/ruapotato/Free-Language-Embeddings](https://github.com/ruapotato/Free-Language-Embeddings)
## Interactive Visualizations
- [Embedding Spectrogram](https://ruapotato.github.io/Free-Language-Embeddings/spectrogram.html) – PCA waves, sine fits, cosine surfaces
- [3D Semantic Directions](https://ruapotato.github.io/Free-Language-Embeddings/semantic_3d.html) – see how semantic axes align in the learned geometry
- [Training Dashboard](https://ruapotato.github.io/Free-Language-Embeddings/dashboard.html) – loss curves and training metrics
## Citation
```bibtex
@misc{hamner2026fle,
  title={Free Language Embeddings: Dynamic Masking Word2Vec on DFSG-Compliant Data},
  author={David Hamner},
  year={2026},
  url={https://github.com/ruapotato/Free-Language-Embeddings}
}
```
## License
GPL-3.0. See [LICENSE](https://github.com/ruapotato/Free-Language-Embeddings/blob/main/LICENSE) for details.
Built by David Hamner.