---
title: microembeddings
emoji: 🧮
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.23.0
app_file: app.py
pinned: false
license: mit
short_description: Word2Vec skip-gram from scratch
tags:
  - embeddings
  - word2vec
  - education
  - nlp
---
# microembeddings

Word2Vec skip-gram with negative sampling, implemented from scratch in ~190 lines of NumPy. Train word embeddings, visualize the embedding space, solve analogies, and find nearest neighbors.

Companion to the blog post: [microembeddings: Understanding Word Vectors from Scratch](https://kshreyas.dev/post/microembeddings/)

Preloaded vectors are generated with `gensim` Word2Vec on the full 17M-word text8 corpus for better quality. The Space's **Train** tab reruns the smaller NumPy implementation on a 500k-word subset so training stays interactive.
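For readers who want a feel for what the NumPy implementation does, here is a minimal sketch of one skip-gram negative-sampling update. The vocabulary size, embedding dimension, learning rate, and variable names below are illustrative assumptions, not the repository's actual code or hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative hyperparameters (not the Space's real settings)
V, D, lr = 1000, 50, 0.025
W_in = rng.normal(0, 0.01, (V, D))   # center-word embeddings
W_out = np.zeros((V, D))             # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, neg_samples):
    """One negative-sampling update for a single (center, context) pair."""
    v = W_in[center]                               # (D,)
    targets = np.concatenate(([context], neg_samples))
    labels = np.zeros(len(targets)); labels[0] = 1.0
    u = W_out[targets]                             # (K+1, D)
    scores = sigmoid(u @ v)                        # predicted P(real pair)
    # Binary cross-entropy: pull the true context up, push negatives down
    loss = -np.log(scores[0] + 1e-10) - np.sum(np.log(1 - scores[1:] + 1e-10))
    grad = scores - labels                         # (K+1,)
    W_out[targets] -= lr * grad[:, None] * v       # update context vectors
    W_in[center]  -= lr * grad @ u                 # update center vector
    return loss

loss = sgns_step(center=3, context=7, neg_samples=rng.integers(0, V, size=5))
```

The key trick shown here is that each update touches only one positive and K negative rows of the output matrix, instead of a full-vocabulary softmax.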
## Features

- **Train** — Train embeddings from scratch on text8 (cleaned Wikipedia) and watch the loss curve
- **Explore** — 2D scatter plot (PCA/t-SNE) of the embedding space with category highlighting
- **Analogies** — Word vector arithmetic: `king - man + woman ≈ queen`
- **Nearest Neighbors** — Find semantically similar words by cosine similarity
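The Analogies and Nearest Neighbors features both reduce to cosine similarity over the embedding matrix. A minimal sketch, using a toy random matrix and made-up vocabulary in place of the Space's gensim-trained vectors:

```python
import numpy as np

# Toy stand-ins: the real Space loads pretrained vectors instead
rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "apple"]
E = rng.normal(size=(len(vocab), 8))          # one row per word
idx = {w: i for i, w in enumerate(vocab)}

def nearest(query_vec, exclude=(), k=3):
    """Rank words by cosine similarity to query_vec, skipping excluded words."""
    sims = E @ query_vec / (np.linalg.norm(E, axis=1) * np.linalg.norm(query_vec))
    order = [i for i in np.argsort(-sims) if vocab[i] not in exclude]
    return [(vocab[i], float(sims[i])) for i in order[:k]]

# Analogy as vector arithmetic: king - man + woman ≈ ?
q = E[idx["king"]] - E[idx["man"]] + E[idx["woman"]]
neighbors = nearest(q, exclude={"king", "man", "woman"})
```

Excluding the query words themselves is the standard convention for analogy evaluation, since the unshifted inputs are often the closest vectors.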
## Learn More

- [Blog Post](https://kshreyas.dev/post/microembeddings/)
- [Inspired by Karpathy's microGPT](https://karpathy.github.io/2026/02/12/microgpt/)
- [Mikolov et al., 2013 — Word2Vec paper](https://arxiv.org/abs/1301.3781)
- Pretraining script: `pretrain_gensim.py` (dev-only, run locally)