docs: gensim attribution and pretrain script in README
README.md
CHANGED
@@ -22,6 +22,8 @@ Word2Vec skip-gram with negative sampling, implemented from scratch in ~190 line
 
 Companion to the blog post: [microembeddings: Understanding Word Vectors from Scratch](https://kshreyas.dev/post/microembeddings/)
 
+Preloaded vectors are generated with `gensim` Word2Vec on the full 17M-word text8 corpus for better quality. The Space's **Train** tab reruns the smaller NumPy implementation on a 500k-word subset so training stays interactive.
+
 ## Features
 
 - **Train** — Train embeddings from scratch on text8 (cleaned Wikipedia), watch the loss curve
@@ -34,3 +36,4 @@ Companion to the blog post: [microembeddings: Understanding Word Vectors from Sc
 - [Blog Post](https://kshreyas.dev/post/microembeddings/)
 - [Inspired by Karpathy's microGPT](https://karpathy.github.io/2026/02/12/microgpt/)
 - [Mikolov et al., 2013 — Word2Vec paper](https://arxiv.org/abs/1301.3781)
+- Pretraining script: `pretrain_gensim.py` (dev-only, run locally)