| --- |
| language: |
| - en |
| license: gpl-3.0 |
| tags: |
| - word-embeddings |
| - word2vec |
| - embeddings |
| - nlp |
| - free-software |
| - dfsg |
| datasets: |
| - wikimedia/wikipedia |
| - pg19 |
| metrics: |
| - accuracy |
| model-index: |
| - name: fle-v34 |
| results: |
| - task: |
| type: word-analogy |
| name: Word Analogy |
| dataset: |
| type: custom |
| name: Google Analogy Test Set |
| metrics: |
| - type: accuracy |
| value: 66.5 |
| name: Overall Accuracy |
| - type: accuracy |
| value: 61.4 |
| name: Semantic Accuracy |
| - type: accuracy |
| value: 69.2 |
| name: Syntactic Accuracy |
| library_name: numpy |
| pipeline_tag: feature-extraction |
| --- |
| |
| # Free Language Embeddings (V34) |
|
|
| 300-dimensional word vectors trained from scratch on ~2B tokens of freely-licensed text using a single RTX 3090. |
|
|
| **66.5% on Google analogies** β beating the original word2vec (61% on 6B tokens) by 5.5 points with 1/3 the data. |
|
|
| ## Model Details |
|
|
| | | | |
| |---|---| |
| | **Architecture** | Dynamic masking word2vec skip-gram | |
| | **Dimensions** | 300 | |
| | **Vocabulary** | 100,000 whole words | |
| | **Training data** | ~2B tokens, all [DFSG-compliant](https://wiki.debian.org/DFSGLicenses) (see below) | |
| | **Training hardware** | Single NVIDIA RTX 3090 | |
| | **Training time** | ~4 days (2M steps) | |
| | **License** | GPL-3.0 | |
| | **Parameters** | 60M (30M target + 30M context embeddings) | |
|
|
| ### Training Data |
|
|
| All training data meets the [Debian Free Software Guidelines](https://wiki.debian.org/DFSGLicenses) for redistribution, modification, and use. No web scrapes, no proprietary datasets. |
|
|
| | Source | Weight | License | |
| |--------|--------|---------| |
| | Wikipedia | 30% | CC BY-SA 3.0 | |
| | Project Gutenberg | 20% | Public domain | |
| | arXiv | 20% | Various open access | |
| | Stack Exchange | 16% | CC BY-SA 4.0 | |
| | US Government Publishing Office | 10% | Public domain (US gov) | |
| | RFCs | 2.5% | IETF Trust | |
| | Linux kernel docs, Arch Wiki, TLDP, GNU manuals, man pages | 1.5% | GPL/GFDL | |
|
|
| ## Benchmark Results |
|
|
| | Model | Data | Google Analogies | |
| |-------|------|-----------------| |
| | **fle V34 (this model)** | **~2B tokens** | **66.5%** | |
| | word2vec (Mikolov 2013) | 6B tokens | 61.0% | |
| | GloVe (small) | 6B tokens | 71.0% | |
| | Google word2vec | 6B tokens | 72.7% | |
| | GloVe (Pennington 2014) | 840B tokens | 75.6% | |
| | FastText (Bojanowski 2017) | 16B tokens | 77.0% | |
|
|
| Breakdown: semantic 61.4%, syntactic 69.2%. Comparatives 91.7%, plurals 86.8%, capitals 82.6%. |
|
|
| ## Quick Start |
|
|
| ```bash |
| # Download |
| pip install huggingface_hub numpy |
| python -c " |
| from huggingface_hub import hf_hub_download |
| hf_hub_download('hackersgame/Free_Language_Embeddings', 'fle_v34.npz', local_dir='.') |
| hf_hub_download('hackersgame/Free_Language_Embeddings', 'fle.py', local_dir='.') |
| " |
| |
| # Use |
| python fle.py king - man + woman |
| python fle.py --similar cat |
| python fle.py # interactive mode |
| ``` |
|
|
| ### Python API |
|
|
| ```python |
| from fle import FLE |
| |
| fle = FLE() # loads fle_v34.npz |
| vec = fle["cat"] # 300d numpy array |
| fle.similar("cat", n=10) # nearest neighbors |
| fle.analogy("king", "man", "woman") # king:man :: woman:? |
| fle.similarity("cat", "dog") # cosine similarity |
| fle.query("king - man + woman") # vector arithmetic |
| ``` |
|
|
| ## Examples |
|
|
| ``` |
| $ python fle.py king - man + woman |
| β queen 0.7387 |
| β princess 0.6781 |
| β monarch 0.5546 |
| |
| $ python fle.py paris - france + germany |
| β berlin 0.8209 |
| β vienna 0.7862 |
| β munich 0.7850 |
| |
| $ python fle.py --similar cat |
| kitten 0.7168 |
| cats 0.6849 |
| tabby 0.6572 |
| dog 0.5919 |
| |
| $ python fle.py ubuntu - debian + redhat |
| centos 0.6261 |
| linux 0.6016 |
| rhel 0.5949 |
| |
| $ python fle.py brain |
| cerebral 0.6665 |
| cerebellum 0.6022 |
| nerves 0.5748 |
| ``` |
|
|
| ## What Makes This Different |
|
|
| - **Free as in freedom.** Every dataset is DFSG-compliant. Every weight is reproducible. GPL-3.0 licensed. The goal: word embeddings you could `apt install` from Debian main. |
| - **Dynamic masking.** Randomly masks context positions during training, forcing the model to extract signal from partial views. The result: geometry that crystallizes during cosine LR decay β analogies jump from 1.2% to 66.5% in the second half of training. |
| - **Whole-word vocabulary.** No subword tokenization. Subwords break word2vec geometry completely β they don't carry enough meaning individually for co-occurrence statistics to produce useful structure. |
|
|
| ## Training |
|
|
| Trained with cosine learning rate schedule (3e-4 β 1e-6). The training curve shows a striking crystallization pattern: near-zero analogy accuracy for the first 50% of training, then rapid emergence of geometric structure as the learning rate decays. |
|
|
| Full training code and visualizations: [github.com/ruapotato/Free-Language-Embeddings](https://github.com/ruapotato/Free-Language-Embeddings) |
|
|
| ## Interactive Visualizations |
|
|
| - [Embedding Spectrogram](https://ruapotato.github.io/Free-Language-Embeddings/spectrogram.html) β PCA waves, sine fits, cosine surfaces |
| - [3D Semantic Directions](https://ruapotato.github.io/Free-Language-Embeddings/semantic_3d.html) β See how semantic axes align in the learned geometry |
| - [Training Dashboard](https://ruapotato.github.io/Free-Language-Embeddings/dashboard.html) β Loss curves and training metrics |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{hamner2026fle, |
| title={Free Language Embeddings: Dynamic Masking Word2Vec on DFSG-Compliant Data}, |
| author={David Hamner}, |
| year={2026}, |
| url={https://github.com/ruapotato/Free-Language-Embeddings} |
| } |
| ``` |
|
|
| ## License |
|
|
| GPL-3.0 β See [LICENSE](https://github.com/ruapotato/Free-Language-Embeddings/blob/main/LICENSE) for details. |
|
|
| Built by David Hamner. |
|
|