---
language:
  - en
license: gpl-3.0
tags:
  - word-embeddings
  - word2vec
  - embeddings
  - nlp
  - free-software
  - dfsg
datasets:
  - wikimedia/wikipedia
  - pg19
metrics:
  - accuracy
model-index:
  - name: fle-v34
    results:
      - task:
          type: word-analogy
          name: Word Analogy
        dataset:
          type: custom
          name: Google Analogy Test Set
        metrics:
          - type: accuracy
            value: 66.5
            name: Overall Accuracy
          - type: accuracy
            value: 61.4
            name: Semantic Accuracy
          - type: accuracy
            value: 69.2
            name: Syntactic Accuracy
library_name: numpy
pipeline_tag: feature-extraction
---

# Free Language Embeddings (V34)

300-dimensional word vectors trained from scratch on ~2B tokens of freely-licensed text using a single RTX 3090.

66.5% on Google analogies, beating the original word2vec (61% on 6B tokens) by 5.5 points with a third of the data.

## Model Details

| | |
|---|---|
| **Architecture** | Dynamic-masking word2vec skip-gram |
| **Dimensions** | 300 |
| **Vocabulary** | 100,000 whole words |
| **Training data** | ~2B tokens, all DFSG-compliant (see below) |
| **Training hardware** | Single NVIDIA RTX 3090 |
| **Training time** | ~4 days (2M steps) |
| **License** | GPL-3.0 |
| **Parameters** | 60M (30M target + 30M context embeddings) |

## Training Data

All training data meets the Debian Free Software Guidelines for redistribution, modification, and use. No web scrapes, no proprietary datasets.

| Source | Weight | License |
|---|---|---|
| Wikipedia | 30% | CC BY-SA 3.0 |
| Project Gutenberg | 20% | Public domain |
| arXiv | 20% | Various open access |
| Stack Exchange | 16% | CC BY-SA 4.0 |
| US Government Publishing Office | 10% | Public domain (US gov) |
| RFCs | 2.5% | IETF Trust |
| Linux kernel docs, Arch Wiki, TLDP, GNU manuals, man pages | 1.5% | GPL/GFDL |

## Benchmark Results

| Model | Data | Google Analogies |
|---|---|---|
| **FLE V34 (this model)** | ~2B tokens | 66.5% |
| word2vec (Mikolov 2013) | 6B tokens | 61.0% |
| GloVe (small) | 6B tokens | 71.0% |
| Google word2vec | 6B tokens | 72.7% |
| GloVe (Pennington 2014) | 840B tokens | 75.6% |
| FastText (Bojanowski 2017) | 16B tokens | 77.0% |

Breakdown by category: semantic 61.4%, syntactic 69.2%. Strongest categories: comparatives 91.7%, plurals 86.8%, capitals 82.6%.
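
The Google analogy test is conventionally scored with the 3CosAdd rule over unit-normalized vectors. This card doesn't spell out its scoring code, so the sketch below shows that standard procedure with random stand-in vectors; the toy vocabulary and the `analogy` helper are illustrative, not part of `fle.py`:

```python
import numpy as np

# Toy vocabulary of unit-normalized 300-d vectors (random stand-ins for real embeddings).
rng = np.random.default_rng(0)
words = ["king", "man", "woman", "queen", "paris", "france"]
vecs = rng.normal(size=(len(words), 300)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
index = {w: i for i, w in enumerate(words)}

def analogy(a, b, c, k=1):
    """Answer a - b + c via 3CosAdd: rank all words by cosine
    similarity to the offset vector, excluding the three inputs."""
    target = vecs[index[a]] - vecs[index[b]] + vecs[index[c]]
    target /= np.linalg.norm(target)
    scores = vecs @ target                 # cosine similarity, since rows are unit-norm
    for w in (a, b, c):
        scores[index[w]] = -np.inf         # never return an input word
    best = np.argsort(scores)[::-1][:k]
    return [(words[i], float(scores[i])) for i in best]

print(analogy("king", "man", "woman"))
```

An analogy question counts as correct when the top-ranked word (inputs excluded) matches the expected answer.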

## Quick Start

```bash
# Download
pip install huggingface_hub numpy
python -c "
from huggingface_hub import hf_hub_download
hf_hub_download('hackersgame/Free_Language_Embeddings', 'fle_v34.npz', local_dir='.')
hf_hub_download('hackersgame/Free_Language_Embeddings', 'fle.py', local_dir='.')
"

# Use
python fle.py king - man + woman
python fle.py --similar cat
python fle.py                     # interactive mode
```

## Python API

```python
from fle import FLE

fle = FLE()                                  # loads fle_v34.npz
vec = fle["cat"]                             # 300d numpy array
fle.similar("cat", n=10)                     # nearest neighbors
fle.analogy("king", "man", "woman")          # king:man :: woman:?
fle.similarity("cat", "dog")                 # cosine similarity
fle.query("king - man + woman")              # vector arithmetic
```

## Examples

```console
$ python fle.py king - man + woman
  → queen                0.7387
  → princess             0.6781
  → monarch              0.5546

$ python fle.py paris - france + germany
  → berlin               0.8209
  → vienna               0.7862
  → munich               0.7850

$ python fle.py --similar cat
  kitten                 0.7168
  cats                   0.6849
  tabby                  0.6572
  dog                    0.5919

$ python fle.py ubuntu - debian + redhat
  centos                 0.6261
  linux                  0.6016
  rhel                   0.5949

$ python fle.py brain
  cerebral               0.6665
  cerebellum             0.6022
  nerves                 0.5748
```

## What Makes This Different

- **Free as in freedom.** Every dataset is DFSG-compliant, every weight is reproducible, and the whole project is GPL-3.0 licensed. The goal: word embeddings you could `apt install` from Debian main.
- **Dynamic masking.** Context positions are randomly masked during training, forcing the model to extract signal from partial views. The result is geometry that crystallizes during cosine LR decay: analogy accuracy jumps from 1.2% to 66.5% in the second half of training.
- **Whole-word vocabulary.** No subword tokenization. Subwords break word2vec geometry: individually they carry too little meaning for co-occurrence statistics to produce useful structure.
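
The exact masking procedure isn't published in this card, but the idea can be sketched as randomly dropping context positions inside the skip-gram window on every step. The window size, `keep_prob`, per-position Bernoulli mask, and helper name below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

def masked_context(tokens, center, window=5, keep_prob=0.5):
    """Collect the skip-gram context positions around `center`, then
    randomly mask (drop) each one independently, so the model only
    ever trains on a partial view of the window."""
    lo = max(0, center - window)
    hi = min(len(tokens), center + window + 1)
    ctx = [tokens[i] for i in range(lo, hi) if i != center]
    keep = rng.random(len(ctx)) < keep_prob    # fresh mask every call/step
    return [w for w, k in zip(ctx, keep) if k]

sentence = "the quick brown fox jumps over the lazy dog".split()
print(masked_context(sentence, center=4))      # a random subset of the window around "jumps"
```

Because the mask is resampled on every step, repeated passes over the corpus present the same center word with different partial contexts.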

## Training

Trained with a cosine learning rate schedule (3e-4 → 1e-6). The training curve shows a striking crystallization pattern: near-zero analogy accuracy for the first 50% of training, then rapid emergence of geometric structure as the learning rate decays.
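
The schedule itself is simple to reproduce. A sketch of cosine decay over the 3e-4 → 1e-6 range, taking the 2M total steps from the table above (the function name is illustrative):

```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=1e-6):
    """Cosine decay from lr_max down to lr_min over total_steps,
    matching the 3e-4 -> 1e-6 schedule described above."""
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

total = 2_000_000
print(cosine_lr(0, total))           # lr_max (3e-4) at the start
print(cosine_lr(total // 2, total))  # midpoint: exactly halfway between max and min
print(cosine_lr(total, total))       # lr_min (1e-6) at the end
```

Note that the halfway point of the schedule, where this card reports analogy accuracy taking off, is where the decay is steepest.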

Full training code and visualizations: [github.com/ruapotato/Free-Language-Embeddings](https://github.com/ruapotato/Free-Language-Embeddings)

## Citation

```bibtex
@misc{hamner2026fle,
  title={Free Language Embeddings: Dynamic Masking Word2Vec on DFSG-Compliant Data},
  author={David Hamner},
  year={2026},
  url={https://github.com/ruapotato/Free-Language-Embeddings}
}
```

## License

GPL-3.0. See LICENSE for details.

Built by David Hamner.