shreyask committed · verified
Commit 402ca2c · 1 Parent(s): 43fd8a7

docs: gensim attribution and pretrain script in README

Files changed (1): README.md +3 -0
README.md CHANGED

@@ -22,6 +22,8 @@ Word2Vec skip-gram with negative sampling, implemented from scratch in ~190 line
 
 Companion to the blog post: [microembeddings: Understanding Word Vectors from Scratch](https://kshreyas.dev/post/microembeddings/)
 
+Preloaded vectors are generated with `gensim` Word2Vec on the full 17M-word text8 corpus for better quality. The Space's **Train** tab reruns the smaller NumPy implementation on a 500k-word subset so training stays interactive.
+
 ## Features
 
 - **Train** — Train embeddings from scratch on text8 (cleaned Wikipedia), watch the loss curve
@@ -34,3 +36,4 @@ Companion to the blog post: [microembeddings: Understanding Word Vectors from Sc
 - [Blog Post](https://kshreyas.dev/post/microembeddings/)
 - [Inspired by Karpathy's microGPT](https://karpathy.github.io/2026/02/12/microgpt/)
 - [Mikolov et al., 2013 — Word2Vec paper](https://arxiv.org/abs/1301.3781)
+- Pretraining script: `pretrain_gensim.py` (dev-only, run locally)