docs: gensim attribution and pretrain script in README
README.md
CHANGED
@@ -22,6 +22,8 @@ Word2Vec skip-gram with negative sampling, implemented from scratch in ~190 line
 
 Companion to the blog post: [microembeddings: Understanding Word Vectors from Scratch](https://kshreyas.dev/post/microembeddings/)
 
+Preloaded vectors are generated with `gensim` Word2Vec on the full 17M-word text8 corpus for better quality. The Space's **Train** tab reruns the smaller NumPy implementation on a 500k-word subset so training stays interactive.
+
 ## Features
 
 - **Train** — Train embeddings from scratch on text8 (cleaned Wikipedia), watch the loss curve
@@ -34,3 +36,4 @@ Companion to the blog post: [microembeddings: Understanding Word Vectors from Sc
 - [Blog Post](https://kshreyas.dev/post/microembeddings/)
 - [Inspired by Karpathy's microGPT](https://karpathy.github.io/2026/02/12/microgpt/)
 - [Mikolov et al., 2013 — Word2Vec paper](https://arxiv.org/abs/1301.3781)
+- Pretraining script: `pretrain_gensim.py` (dev-only, run locally)