File size: 5,875 Bytes

---
language:
- en
license: gpl-3.0
tags:
- word-embeddings
- word2vec
- embeddings
- nlp
- free-software
- dfsg
datasets:
- wikimedia/wikipedia
- pg19
metrics:
- accuracy
model-index:
- name: fle-v34
  results:
  - task:
      type: word-analogy
      name: Word Analogy
    dataset:
      type: custom
      name: Google Analogy Test Set
    metrics:
    - type: accuracy
      value: 66.5
      name: Overall Accuracy
    - type: accuracy
      value: 61.4
      name: Semantic Accuracy
    - type: accuracy
      value: 69.2
      name: Syntactic Accuracy
library_name: numpy
pipeline_tag: feature-extraction
---

# Free Language Embeddings (V34)

300-dimensional word vectors trained from scratch on ~2B tokens of freely-licensed text using a single RTX 3090.

**66.5% on Google analogies** — beating the original word2vec (61% on 6B tokens) by 5.5 points with 1/3 the data.

## Model Details

| | |
|---|---|
| **Architecture** | Dynamic masking word2vec skip-gram |
| **Dimensions** | 300 |
| **Vocabulary** | 100,000 whole words |
| **Training data** | ~2B tokens, all [DFSG-compliant](https://wiki.debian.org/DFSGLicenses) (see below) |
| **Training hardware** | Single NVIDIA RTX 3090 |
| **Training time** | ~4 days (2M steps) |
| **License** | GPL-3.0 |
| **Parameters** | 60M (30M target + 30M context embeddings) |

### Training Data

All training data meets the [Debian Free Software Guidelines](https://wiki.debian.org/DFSGLicenses) for redistribution, modification, and use. No web scrapes, no proprietary datasets.

| Source | Weight | License |
|--------|--------|---------|
| Wikipedia | 30% | CC BY-SA 3.0 |
| Project Gutenberg | 20% | Public domain |
| arXiv | 20% | Various open access |
| Stack Exchange | 16% | CC BY-SA 4.0 |
| US Government Publishing Office | 10% | Public domain (US gov) |
| RFCs | 2.5% | IETF Trust |
| Linux kernel docs, Arch Wiki, TLDP, GNU manuals, man pages | 1.5% | GPL/GFDL |

## Benchmark Results

| Model | Data | Google Analogies |
|-------|------|-----------------|
| **fle V34 (this model)** | **~2B tokens** | **66.5%** |
| word2vec (Mikolov 2013) | 6B tokens | 61.0% |
| GloVe (small) | 6B tokens | 71.0% |
| Google word2vec | 6B tokens | 72.7% |
| GloVe (Pennington 2014) | 840B tokens | 75.6% |
| FastText (Bojanowski 2017) | 16B tokens | 77.0% |

Breakdown: semantic 61.4%, syntactic 69.2%. Comparatives 91.7%, plurals 86.8%, capitals 82.6%.

## Quick Start

```bash
# Download
pip install huggingface_hub numpy
python -c "
from huggingface_hub import hf_hub_download
hf_hub_download('hackersgame/Free_Language_Embeddings', 'fle_v34.npz', local_dir='.')
hf_hub_download('hackersgame/Free_Language_Embeddings', 'fle.py', local_dir='.')
"

# Use
python fle.py king - man + woman
python fle.py --similar cat
python fle.py   # interactive mode
```

### Python API

```python
from fle import FLE

fle = FLE()                                  # loads fle_v34.npz
vec = fle["cat"]                             # 300d numpy array
fle.similar("cat", n=10)                     # nearest neighbors
fle.analogy("king", "man", "woman")          # king:man :: woman:?
fle.similarity("cat", "dog")                 # cosine similarity
fle.query("king - man + woman")              # vector arithmetic
```

## Examples

```
$ python fle.py king - man + woman
  → queen                0.7387
  → princess             0.6781
  → monarch              0.5546

$ python fle.py paris - france + germany
  → berlin               0.8209
  → vienna               0.7862
  → munich               0.7850

$ python fle.py --similar cat
  kitten               0.7168
  cats                  0.6849
  tabby                 0.6572
  dog                   0.5919

$ python fle.py ubuntu - debian + redhat
  centos               0.6261
  linux                0.6016
  rhel                 0.5949

$ python fle.py brain
  cerebral             0.6665
  cerebellum           0.6022
  nerves               0.5748
```

## What Makes This Different

- **Free as in freedom.** Every dataset is DFSG-compliant. Every weight is reproducible. GPL-3.0 licensed. The goal: word embeddings you could `apt install` from Debian main.
- **Dynamic masking.** Randomly masks context positions during training, forcing the model to extract signal from partial views. The result: geometry that crystallizes during cosine LR decay — analogies jump from 1.2% to 66.5% in the second half of training.
- **Whole-word vocabulary.** No subword tokenization. Subwords break word2vec geometry completely — they don't carry enough meaning individually for co-occurrence statistics to produce useful structure.

## Training

Trained with cosine learning rate schedule (3e-4 → 1e-6). The training curve shows a striking crystallization pattern: near-zero analogy accuracy for the first 50% of training, then rapid emergence of geometric structure as the learning rate decays.

Full training code and visualizations: [github.com/ruapotato/Free-Language-Embeddings](https://github.com/ruapotato/Free-Language-Embeddings)

## Interactive Visualizations

- [Embedding Spectrogram](https://ruapotato.github.io/Free-Language-Embeddings/spectrogram.html) — PCA waves, sine fits, cosine surfaces
- [3D Semantic Directions](https://ruapotato.github.io/Free-Language-Embeddings/semantic_3d.html) — See how semantic axes align in the learned geometry
- [Training Dashboard](https://ruapotato.github.io/Free-Language-Embeddings/dashboard.html) — Loss curves and training metrics

## Citation

```bibtex
@misc{hamner2026fle,
  title={Free Language Embeddings: Dynamic Masking Word2Vec on DFSG-Compliant Data},
  author={David Hamner},
  year={2026},
  url={https://github.com/ruapotato/Free-Language-Embeddings}
}
```

## License

GPL-3.0 — See [LICENSE](https://github.com/ruapotato/Free-Language-Embeddings/blob/main/LICENSE) for details.

Built by David Hamner.