---
title: microembeddings
emoji: 🧮
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.23.0
app_file: app.py
pinned: false
license: mit
short_description: Word2Vec skip-gram from scratch
tags:
  - embeddings
  - word2vec
  - education
  - nlp
---

# microembeddings

Word2Vec skip-gram with negative sampling, implemented from scratch in ~190 lines of NumPy. Train word embeddings, visualize the embedding space, solve analogies, and find nearest neighbors.
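The heart of skip-gram with negative sampling is a single gradient step per (center, context) pair: push the context vector's score toward 1 and the sampled negatives' scores toward 0. The sketch below is illustrative, not the repo's actual ~190-line implementation; the function name and signature are invented for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W_in, W_out, center, context, negatives, lr=0.025):
    """One skip-gram negative-sampling update for a (center, context) pair.

    W_in      : (vocab, dim) input/center embedding matrix
    W_out     : (vocab, dim) output/context embedding matrix
    negatives : array of K sampled negative word ids
    Returns the loss for this pair; updates W_in and W_out in place.
    """
    v = W_in[center].copy()                 # center vector (copy avoids aliasing)
    idx = np.concatenate(([context], negatives))
    u = W_out[idx]                          # (1 + K, dim) output vectors
    labels = np.zeros(len(idx))
    labels[0] = 1.0                         # the true context is the positive
    scores = sigmoid(u @ v)                 # sigma(u . v) for positive + negatives
    g = scores - labels                     # dL/dscore for each word
    loss = -np.log(scores[0] + 1e-10) - np.sum(np.log(1.0 - scores[1:] + 1e-10))
    W_out[idx] -= lr * np.outer(g, v)       # update context/negative vectors
    W_in[center] -= lr * (g @ u)            # update center vector
    return loss
```

Looping this step over all pairs in the corpus (with negatives drawn from a unigram distribution raised to the 3/4 power, per Mikolov et al.) is essentially the whole training algorithm.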

Companion to the blog post: [microembeddings: Understanding Word Vectors from Scratch](https://kshreyas.dev/post/microembeddings/)

Preloaded vectors are generated with `gensim` Word2Vec on the full 17M-word text8 corpus for better quality. The Space's **Train** tab reruns the smaller NumPy implementation on a 500k-word subset so training stays interactive.

## Features

- **Train** — Train embeddings from scratch on text8 (cleaned Wikipedia), watch the loss curve
- **Explore** — 2D scatter plot (PCA/t-SNE) of the embedding space with category highlighting
- **Analogies** — Word vector arithmetic: `king - man + woman ≈ queen`
- **Nearest Neighbors** — Find semantically similar words by cosine similarity
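Both the analogy and nearest-neighbor features reduce to cosine similarity over the embedding matrix. A minimal sketch (function names are illustrative, not the app's API):

```python
import numpy as np

def nearest(vecs, vocab, query_vec, k=5, exclude=()):
    """Top-k words by cosine similarity to query_vec."""
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = unit @ (query_vec / np.linalg.norm(query_vec))
    order = np.argsort(-sims)                       # most similar first
    return [vocab[i] for i in order if vocab[i] not in exclude][:k]

def analogy(vecs, vocab, word2id, a, b, c, k=1):
    """Solve a : b :: c : ? via the offset b - a + c."""
    q = vecs[word2id[b]] - vecs[word2id[a]] + vecs[word2id[c]]
    return nearest(vecs, vocab, q, k=k, exclude={a, b, c})
```

Excluding the three query words from the candidates matters: `b` itself is usually the nearest vector to `b - a + c`.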

## Learn More

- [Blog Post](https://kshreyas.dev/post/microembeddings/)
- [Inspired by Karpathy's microGPT](https://karpathy.github.io/2026/02/12/microgpt/)
- [Mikolov et al., 2013 — Word2Vec paper](https://arxiv.org/abs/1301.3781)
- Pretraining script: `pretrain_gensim.py` (dev-only, run locally)