hackersgame committed on
Commit 6132608 · verified · 1 Parent(s): fdce46f

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +164 -3
README.md CHANGED
@@ -1,3 +1,164 @@
- ---
- license: gpl-3.0
- ---

---
language:
- en
license: gpl-3.0
tags:
- word-embeddings
- word2vec
- embeddings
- nlp
- free-software
- dfsg
datasets:
- Skylion007/openwebtext
metrics:
- accuracy
model-index:
- name: fle-v34
  results:
  - task:
      type: word-analogy
      name: Word Analogy
    dataset:
      type: custom
      name: Google Analogy Test Set
    metrics:
    - type: accuracy
      value: 66.5
      name: Overall Accuracy
    - type: accuracy
      value: 61.4
      name: Semantic Accuracy
    - type: accuracy
      value: 69.2
      name: Syntactic Accuracy
library_name: numpy
pipeline_tag: feature-extraction
---

# Free Language Embeddings (V34)

300-dimensional word vectors trained from scratch on ~2B tokens of DFSG-compliant text using a single RTX 3090.

**66.5% on Google analogies**, beating the original word2vec (61% on 6B tokens) by 5.5 points with a third of the data.

## Model Details

| | |
|---|---|
| **Architecture** | Dynamic-masking word2vec skip-gram |
| **Dimensions** | 300 |
| **Vocabulary** | 100,000 whole words |
| **Training data** | ~2B tokens (OpenWebText subset, DFSG-compliant) |
| **Training hardware** | Single NVIDIA RTX 3090 |
| **Training time** | ~24 hours (2M steps) |
| **License** | GPL-3.0 |
| **Parameters** | 60M (30M target + 30M context embeddings) |

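The parameter count in the table is plain arithmetic on the other rows: skip-gram keeps two embedding matrices (target and context), each vocabulary × dimensions in size:

```python
# Skip-gram stores two embedding matrices: one for target words,
# one for context words. Each is vocab_size x dim floats.
vocab_size, dim = 100_000, 300

target_params = vocab_size * dim                # 30,000,000
context_params = vocab_size * dim               # 30,000,000
total_params = target_params + context_params

print(f"{total_params:,}")                      # 60,000,000
```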
## Benchmark Results

| Model | Training data | Google analogies |
|-------|---------------|------------------|
| **fle V34 (this model)** | **~2B tokens** | **66.5%** |
| word2vec (Mikolov 2013) | 6B tokens | 61.0% |
| GloVe (small) | 6B tokens | 71.0% |
| Google word2vec | 6B tokens | 72.7% |
| GloVe (Pennington 2014) | 840B tokens | 75.6% |
| FastText (Bojanowski 2017) | 16B tokens | 77.0% |

Breakdown: 61.4% semantic, 69.2% syntactic. Strongest categories: comparatives 91.7%, plurals 86.8%, capitals 82.6%.

+ ## Quick Start
72
+
73
+ ```bash
74
+ # Download
75
+ pip install huggingface_hub numpy
76
+ python -c "
77
+ from huggingface_hub import hf_hub_download
78
+ hf_hub_download('hackersgame/Free_Language_Embeddings', 'fle_v34.npz', local_dir='.')
79
+ hf_hub_download('hackersgame/Free_Language_Embeddings', 'fle.py', local_dir='.')
80
+ "
81
+
82
+ # Use
83
+ python fle.py king - man + woman
84
+ python fle.py --similar cat
85
+ python fle.py # interactive mode
86
+ ```
87
+
88
### Python API

```python
from fle import FLE

fle = FLE()                          # loads fle_v34.npz
vec = fle["cat"]                     # 300-d numpy array
fle.similar("cat", n=10)             # nearest neighbors
fle.analogy("king", "man", "woman")  # king:man :: woman:?
fle.similarity("cat", "dog")         # cosine similarity
fle.query("king - man + woman")      # vector arithmetic
```

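Under the hood, calls like these reduce to cosine similarity against a row-normalized matrix. A sketch of how `similarity` and `similar` could be implemented; the vectors below are random placeholders, and the actual `fle.py` may differ:

```python
import numpy as np

# Placeholder vectors; the real model would come from np.load("fle_v34.npz").
words = ["cat", "kitten", "dog", "car"]
rng = np.random.default_rng(1)
E = rng.normal(size=(len(words), 300))
E /= np.linalg.norm(E, axis=1, keepdims=True)   # normalize rows once up front
idx = {w: i for i, w in enumerate(words)}

def similarity(a, b):
    # For unit vectors, cosine similarity is just the dot product.
    return float(E[idx[a]] @ E[idx[b]])

def similar(word, n=10):
    scores = E @ E[idx[word]]                    # cosine against every row
    scores[idx[word]] = -np.inf                  # drop the query word itself
    top = np.argsort(-scores)[:n]
    return [(words[i], round(float(scores[i]), 4)) for i in top]
```

Normalizing the matrix once at load time makes every later query a single matrix-vector product.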
## Examples

```
$ python fle.py king - man + woman
→ queen 0.7387
→ princess 0.6781
→ monarch 0.5546

$ python fle.py paris - france + germany
→ berlin 0.8209
→ vienna 0.7862
→ munich 0.7850

$ python fle.py --similar cat
kitten 0.7168
cats 0.6849
tabby 0.6572
dog 0.5919

$ python fle.py ubuntu - debian + redhat
centos 0.6261
linux 0.6016
rhel 0.5949

$ python fle.py brain
cerebral 0.6665
cerebellum 0.6022
nerves 0.5748
```

## What Makes This Different

- **Free as in freedom.** Every dataset is DFSG-compliant, every weight is reproducible, and the whole package is GPL-3.0 licensed. The goal: word embeddings you could `apt install` from Debian main.
- **Dynamic masking.** Context positions are randomly masked during training, forcing the model to extract signal from partial views. The result is geometry that crystallizes during cosine LR decay: analogy accuracy jumps from 1.2% to 66.5% in the second half of training.
- **Whole-word vocabulary.** No subword tokenization. Subwords break word2vec geometry: individually they carry too little meaning for co-occurrence statistics to produce useful structure.

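The README does not spell out the masking procedure, so here is one plausible reading as a sketch: before skip-gram pairs are formed, each context position in the window is independently dropped with some probability, so every pass sees a different partial view of each window. The window size and mask probability below are illustrative, not the actual training values:

```python
import random

def masked_skipgram_pairs(tokens, window=5, p_mask=0.3, rng=random):
    """(target, context) pairs with context positions randomly dropped."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            if rng.random() < p_mask:    # dynamic masking: hide this slot
                continue
            pairs.append((target, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
print(masked_skipgram_pairs(tokens, p_mask=0.3)[:3])
```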
## Training

Trained with a cosine learning rate schedule (3e-4 → 1e-6). The training curve shows a striking crystallization pattern: near-zero analogy accuracy for the first 50% of training, then rapid emergence of geometric structure as the learning rate decays.

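For reference, a standard cosine decay with those endpoints looks like this (2M steps taken from the table above; the actual schedule code lives in the linked repo):

```python
import math

def cosine_lr(step, total_steps=2_000_000, lr_max=3e-4, lr_min=1e-6):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

print(cosine_lr(0), cosine_lr(2_000_000))   # decays from ~3e-4 to ~1e-6
```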
Full training code and visualizations: [github.com/ruapotato/Free-Language-Embeddings](https://github.com/ruapotato/Free-Language-Embeddings)

## Interactive Visualizations

- [Embedding Spectrogram](https://ruapotato.github.io/chat_hamner/spectrogram.html): PCA waves, sine fits, cosine surfaces
- [3D Semantic Directions](https://ruapotato.github.io/chat_hamner/semantic_3d.html): see how semantic axes align in the learned geometry
- [Training Dashboard](https://ruapotato.github.io/chat_hamner/dashboard.html): loss curves and training metrics

## Citation

```bibtex
@misc{hamner2026fle,
  title={Free Language Embeddings: Dynamic Masking Word2Vec on DFSG-Compliant Data},
  author={David Hamner},
  year={2026},
  url={https://github.com/ruapotato/Free-Language-Embeddings}
}
```

## License

GPL-3.0. See [LICENSE](https://github.com/ruapotato/Free-Language-Embeddings/blob/main/LICENSE) for details.

Built by David Hamner with help from Claude.