---
title: microembeddings
emoji: 🧮
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.23.0
app_file: app.py
pinned: false
license: mit
short_description: Word2Vec skip-gram from scratch
tags:
  - embeddings
  - word2vec
  - education
  - nlp
---
# microembeddings

Word2Vec skip-gram with negative sampling, implemented from scratch in ~190 lines of NumPy. Train word embeddings, visualize the embedding space, solve analogies, and find nearest neighbors.

Companion to the blog post: [microembeddings: Understanding Word Vectors from Scratch](https://kshreyas.dev/post/microembeddings/)

Preloaded vectors are generated with `gensim` Word2Vec on the full 17M-word text8 corpus for better quality. The Space's **Train** tab reruns the smaller NumPy implementation on a 500k-word subset so training stays interactive.
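For readers who want a feel for what the NumPy implementation does, here is a minimal sketch of one skip-gram negative-sampling update. The vocabulary size, embedding dimension, learning rate, and variable names below are illustrative assumptions, not the repository's actual code or hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative hyperparameters (not the Space's real settings)
V, D, lr = 1000, 50, 0.025
W_in = rng.normal(0, 0.01, (V, D))   # center-word embeddings
W_out = np.zeros((V, D))             # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, neg_samples):
    """One negative-sampling update for a single (center, context) pair."""
    v = W_in[center]                               # (D,)
    targets = np.concatenate(([context], neg_samples))
    labels = np.zeros(len(targets)); labels[0] = 1.0
    u = W_out[targets]                             # (K+1, D)
    scores = sigmoid(u @ v)                        # predicted P(real pair)
    # Binary cross-entropy: pull the true context up, push negatives down
    loss = -np.log(scores[0] + 1e-10) - np.sum(np.log(1 - scores[1:] + 1e-10))
    grad = scores - labels                         # (K+1,)
    W_out[targets] -= lr * grad[:, None] * v       # update context vectors
    W_in[center]  -= lr * grad @ u                 # update center vector
    return loss

loss = sgns_step(center=3, context=7, neg_samples=rng.integers(0, V, size=5))
```

The key trick shown here is that each update touches only one positive and K negative rows of the output matrix, instead of a full-vocabulary softmax.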
## Features

- **Train** — Train embeddings from scratch on text8 (cleaned Wikipedia) and watch the loss curve
- **Explore** — 2D scatter plot (PCA/t-SNE) of the embedding space with category highlighting
- **Analogies** — Word vector arithmetic: `king - man + woman ≈ queen`
- **Nearest Neighbors** — Find semantically similar words by cosine similarity
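The Analogies and Nearest Neighbors features both reduce to cosine similarity over the embedding matrix. A minimal sketch, using a toy random matrix and made-up vocabulary in place of the Space's gensim-trained vectors:

```python
import numpy as np

# Toy stand-ins: the real Space loads pretrained vectors instead
rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "apple"]
E = rng.normal(size=(len(vocab), 8))          # one row per word
idx = {w: i for i, w in enumerate(vocab)}

def nearest(query_vec, exclude=(), k=3):
    """Rank words by cosine similarity to query_vec, skipping excluded words."""
    sims = E @ query_vec / (np.linalg.norm(E, axis=1) * np.linalg.norm(query_vec))
    order = [i for i in np.argsort(-sims) if vocab[i] not in exclude]
    return [(vocab[i], float(sims[i])) for i in order[:k]]

# Analogy as vector arithmetic: king - man + woman ≈ ?
q = E[idx["king"]] - E[idx["man"]] + E[idx["woman"]]
neighbors = nearest(q, exclude={"king", "man", "woman"})
```

Excluding the query words themselves is the standard convention for analogy evaluation, since the unshifted inputs are often the closest vectors.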
## Learn More

- [Blog Post](https://kshreyas.dev/post/microembeddings/)
- [Inspired by Karpathy's microGPT](https://karpathy.github.io/2026/02/12/microgpt/)
- [Mikolov et al., 2013 — Word2Vec paper](https://arxiv.org/abs/1301.3781)
- Pretraining script: `pretrain_gensim.py` (dev-only, run locally)