---
title: microembeddings
emoji: 🧮
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.23.0
app_file: app.py
pinned: false
license: mit
short_description: Word2Vec skip-gram from scratch
tags:
- embeddings
- word2vec
- education
- nlp
---
# microembeddings
Word2Vec skip-gram with negative sampling, implemented from scratch in ~190 lines of NumPy. Train word embeddings, visualize the embedding space, solve analogies, and find nearest neighbors.
Companion to the blog post: [microembeddings: Understanding Word Vectors from Scratch](https://kshreyas.dev/post/microembeddings/)
Preloaded vectors are generated with `gensim` Word2Vec on the full 17M-word text8 corpus for better quality. The Space's **Train** tab reruns the smaller NumPy implementation on a 500k-word subset so training stays interactive.
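The core update the NumPy implementation performs is skip-gram with negative sampling: for each (center, context) pair, push the pair's dot product up and push dot products with a few sampled noise words down. A minimal illustrative sketch (the `sgns_step` helper, vocabulary size, and dimensions here are hypothetical, not the repo's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 100-word vocabulary, 16-dimensional embeddings.
vocab_size, dim = 100, 16
W_in = rng.normal(0, 0.1, (vocab_size, dim))   # center-word vectors
W_out = rng.normal(0, 0.1, (vocab_size, dim))  # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, k=5, lr=0.025):
    """One skip-gram negative-sampling update for a (center, context) pair."""
    negatives = rng.integers(0, vocab_size, size=k)  # uniform noise for simplicity
    targets = np.concatenate(([context], negatives))
    labels = np.zeros(k + 1)
    labels[0] = 1.0                       # 1 for the true context, 0 for noise

    v = W_in[center]                      # (dim,)
    u = W_out[targets]                    # (k+1, dim)
    scores = sigmoid(u @ v)               # predicted P(pair is real)
    g = scores - labels                   # gradient of binary cross-entropy

    W_out[targets] -= lr * g[:, None] * v  # update output vectors
    W_in[center] -= lr * (g @ u)           # update the center vector
    # Negative log-likelihood for this pair (useful for a loss curve):
    return -np.log(scores[0] + 1e-10) - np.log(1 - scores[1:] + 1e-10).sum()
```

The full Word2Vec recipe also uses a unigram^0.75 noise distribution and subsampling of frequent words; this sketch drops both to keep the update itself readable.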
## Features
- **Train** — Train embeddings from scratch on text8 (cleaned Wikipedia), watch the loss curve
- **Explore** — 2D scatter plot (PCA/t-SNE) of the embedding space with category highlighting
- **Analogies** — Word vector arithmetic: `king - man + woman ≈ queen`
- **Nearest Neighbors** — Find semantically similar words by cosine similarity
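The Analogies and Nearest Neighbors tabs both reduce to the same operation: build a query vector, then rank the vocabulary by cosine similarity. A hedged sketch of how that can be done (the `analogy` function and its signature are illustrative, not the Space's actual code):

```python
import numpy as np

def analogy(emb, word2idx, idx2word, a, b, c, topn=1):
    """Solve a : b :: c : ? by cosine similarity, e.g. king - man + woman."""
    # Normalize rows so dot products are cosine similarities.
    norms = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    query = norms[word2idx[b]] - norms[word2idx[a]] + norms[word2idx[c]]
    query /= np.linalg.norm(query)
    sims = norms @ query
    # Exclude the three input words from the candidates.
    for w in (a, b, c):
        sims[word2idx[w]] = -np.inf
    best = np.argsort(-sims)[:topn]
    return [idx2word[i] for i in best]
```

Excluding the input words matters in practice: the nearest vector to `king - man + woman` is often `king` itself, so analogy benchmarks conventionally filter the query words out.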
## Learn More
- [Blog Post](https://kshreyas.dev/post/microembeddings/)
- [Inspired by Karpathy's microGPT](https://karpathy.github.io/2026/02/12/microgpt/)
- [Mikolov et al., 2013 — Word2Vec paper](https://arxiv.org/abs/1301.3781)
- Pretraining script: `pretrain_gensim.py` (dev-only, run locally)