---
title: microembeddings
emoji: 🧮
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.23.0
app_file: app.py
pinned: false
license: mit
short_description: Word2Vec skip-gram from scratch
tags:
  - embeddings
  - word2vec
  - education
  - nlp
---

# microembeddings

Word2Vec skip-gram with negative sampling, implemented from scratch in ~190 lines of NumPy. Train word embeddings, visualize the embedding space, solve analogies, and find nearest neighbors.
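The core of skip-gram with negative sampling is a single update: push a center word's vector toward its true context word and away from a handful of randomly sampled "negative" words. A minimal NumPy sketch of that update (names, dimensions, and the learning rate here are illustrative, not the repo's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 50

# Separate "input" (center) and "output" (context) embedding matrices,
# as in the original Word2Vec formulation.
W_in = rng.normal(0, 0.01, (vocab_size, dim))
W_out = rng.normal(0, 0.01, (vocab_size, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, k=5, lr=0.025):
    """One skip-gram negative-sampling update for a (center, context) pair."""
    negatives = rng.integers(0, vocab_size, size=k)
    # Label 1 for the true context word, 0 for each sampled negative.
    targets = np.concatenate(([context], negatives))
    labels = np.zeros(k + 1)
    labels[0] = 1.0

    v = W_in[center]            # (dim,)       center vector
    u = W_out[targets]          # (k+1, dim)   context + negative vectors
    scores = sigmoid(u @ v)     # (k+1,)       predicted P(label = 1)
    grad = scores - labels      # dLoss/dscore for binary cross-entropy

    # Gradient step on both matrices (using pre-update snapshots of u and v).
    W_in[center] -= lr * (grad @ u)
    W_out[targets] -= lr * np.outer(grad, v)

    # Binary cross-entropy loss, for plotting a loss curve.
    return -np.log(scores[0] + 1e-10) - np.sum(np.log(1.0 - scores[1:] + 1e-10))

loss = sgns_step(center=3, context=7)
```

Sampling only `k` negatives per pair is what keeps this cheap: each update touches `k + 2` rows instead of a full softmax over the vocabulary.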

Companion to the blog post: *microembeddings: Understanding Word Vectors from Scratch*

Preloaded vectors are generated with gensim Word2Vec on the full 17M-word text8 corpus for better quality. The Space's Train tab reruns the smaller NumPy implementation on a 500k-word subset so training stays interactive.

## Features

- **Train** — Train embeddings from scratch on text8 (cleaned Wikipedia) and watch the loss curve
- **Explore** — 2D scatter plot (PCA/t-SNE) of the embedding space with category highlighting
- **Analogies** — Word-vector arithmetic: `king - man + woman ≈ queen`
- **Nearest Neighbors** — Find semantically similar words by cosine similarity
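Both the analogy and nearest-neighbor features reduce to one operation: rank the vocabulary by cosine similarity to a query vector. A self-contained sketch of the mechanics, using random toy vectors (so the analogy won't actually resolve to "queen" here — that requires trained embeddings):

```python
import numpy as np

# Toy vocabulary and random embedding matrix; in the app these come
# from the trained model.
rng = np.random.default_rng(0)
words = ["king", "queen", "man", "woman", "apple"]
idx = {w: i for i, w in enumerate(words)}
E = rng.normal(size=(len(words), 8))
E /= np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows

def nearest(vec, exclude=(), topn=3):
    """Rank words by cosine similarity to vec (rows of E are unit norm)."""
    v = vec / np.linalg.norm(vec)
    sims = E @ v                        # dot product of unit vectors = cosine
    order = np.argsort(-sims)           # highest similarity first
    return [(words[i], float(sims[i]))
            for i in order if words[i] not in exclude][:topn]

# Analogy query: king - man + woman, excluding the input words
# from the candidates (standard practice for analogy evaluation).
query = E[idx["king"]] - E[idx["man"]] + E[idx["woman"]]
result = nearest(query, exclude={"king", "man", "woman"})
```

Excluding the query's own words matters: with trained vectors, the nearest neighbor of `king - man + woman` is very often `king` itself.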

## Learn More