---
title: microembeddings
emoji: 🧮
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.23.0
app_file: app.py
pinned: false
license: mit
short_description: Word2Vec skip-gram from scratch
tags:
  - embeddings
  - word2vec
  - education
  - nlp
---

# microembeddings

Word2Vec skip-gram with negative sampling, implemented from scratch in ~190 lines of NumPy. Train word embeddings, visualize the embedding space, solve analogies, and find nearest neighbors.

Companion to the blog post: [microembeddings: Understanding Word Vectors from Scratch](https://kshreyas.dev/post/microembeddings/)

Preloaded vectors are generated with `gensim` Word2Vec on the full 17M-word text8 corpus for better quality. The Space's **Train** tab reruns the smaller NumPy implementation on a 500k-word subset so training stays interactive.

## Features

- **Train** — Train embeddings from scratch on text8 (cleaned Wikipedia) and watch the loss curve
- **Explore** — 2D scatter plot (PCA/t-SNE) of the embedding space with category highlighting
- **Analogies** — Word vector arithmetic: king - man + woman = queen
- **Nearest Neighbors** — Find semantically similar words by cosine similarity

## Learn More

- [Blog Post](https://kshreyas.dev/post/microembeddings/)
- [Inspired by Karpathy's microGPT](https://karpathy.github.io/2026/02/12/microgpt/)
- [Mikolov et al., 2013 — Word2Vec paper](https://arxiv.org/abs/1301.3781)
- Pretraining script: `pretrain_gensim.py` (dev-only, run locally)
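For a feel of what "skip-gram with negative sampling in NumPy" boils down to, here is a minimal sketch of one training update. The names (`W_in`, `W_out`, `sgns_step`) and hyperparameters are illustrative assumptions, not the repo's actual API; the real implementation lives in `app.py` and the blog post.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 50
W_in = rng.normal(0, 0.01, (vocab_size, dim))   # center-word ("input") vectors
W_out = rng.normal(0, 0.01, (vocab_size, dim))  # context-word ("output") vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, negatives, lr=0.025):
    """One skip-gram negative-sampling update for a (center, context) pair.

    The context word gets label 1, the sampled negatives label 0, and both
    embedding tables take a logistic-loss gradient step. (Duplicate negative
    indices would collide under fancy indexing; ignored here for brevity.)
    """
    v = W_in[center]                               # (dim,)
    targets = np.concatenate(([context], negatives))
    labels = np.zeros(len(targets))
    labels[0] = 1.0
    U = W_out[targets]                             # (k+1, dim)
    scores = sigmoid(U @ v)                        # predicted P(real pair)
    g = scores - labels                            # dLoss/dscore per target
    W_in[center] -= lr * (g @ U)                   # update center vector
    W_out[targets] -= lr * np.outer(g, v)          # update context/negatives
    # Negative log-likelihood of the correct labels (for the loss curve)
    return -np.log(scores[0] + 1e-10) - np.sum(np.log(1.0 - scores[1:] + 1e-10))

loss = sgns_step(center=3, context=7, negatives=rng.integers(0, vocab_size, 5))
```

Repeating the step on the same pair drives the loss down, which is exactly what the Train tab's loss curve visualizes at corpus scale.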
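The Analogies and Nearest Neighbors tabs both reduce to cosine similarity over the embedding table. A toy sketch of the idea, using a hand-made 3D embedding matrix rather than the Space's trained vectors (the words and numbers below are purely illustrative):

```python
import numpy as np

# Toy embedding table; in the Space this would be the trained matrix.
words = ["king", "queen", "man", "woman", "apple"]
E = np.array([
    [0.9, 0.8, 0.1],
    [0.9, 0.1, 0.8],
    [0.1, 0.9, 0.1],
    [0.1, 0.1, 0.9],
    [0.5, 0.5, 0.5],
])
idx = {w: i for i, w in enumerate(words)}
E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit rows

def nearest(vec, exclude=(), k=1):
    """Rank words by cosine similarity to `vec`, skipping excluded words."""
    sims = E_norm @ (vec / np.linalg.norm(vec))
    ranked = [words[i] for i in np.argsort(-sims) if words[i] not in exclude]
    return ranked[:k]

# Analogy by vector arithmetic: king - man + woman lands nearest to queen.
target = E[idx["king"]] - E[idx["man"]] + E[idx["woman"]]
print(nearest(target, exclude={"king", "man", "woman"}))  # → ['queen']
```

Excluding the query words is the standard trick: without it, the nearest vector to `king - man + woman` is often `king` itself.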
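The Explore tab's 2D scatter plot needs the high-dimensional vectors projected down to two coordinates. A plain-NumPy PCA sketch of that projection (variable names are assumptions; the Space may use scikit-learn for PCA/t-SNE instead):

```python
import numpy as np

rng = np.random.default_rng(1)
E = rng.normal(size=(200, 50))   # stand-in for a trained embedding matrix

# PCA via SVD: center the rows, then project onto the top-2 right singular
# vectors (the directions of greatest variance).
X = E - E.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
coords = X @ Vt[:2].T            # (200, 2) points for the scatter plot
```

t-SNE has no comparably short closed form, which is why libraries are typically used for it; PCA is the cheap, deterministic baseline.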