title: Text Vectorization Lab
emoji: 🧮
colorFrom: yellow
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
Text Vectorization Lab
A step-by-step, interactive simulation of every technique covered in the Text Vectorization in NLP deck and notebook:
- One-Hot Encoding
- Count Vectorizer
- Bag-of-Words
- N-grams
- TF-IDF Vectorizer
- Word Embeddings (Word2Vec with Skip-gram and CBOW, plus FastText)
It's a single small web app: a Flask backend (app.py) that actually
runs the scikit-learn / numpy / gensim code from the reference
notebook on whatever text you type in, and a vanilla HTML/CSS/JS
frontend that calls that backend and reveals the result stage by stage
(tokenize → vocabulary → vectors) along an animated pipeline tape. There is
no precomputed/fake data: every matrix, score, and embedding on the page is
computed live by the Python backend for the exact text you entered.
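The three pipeline stages can be sketched in plain Python. This is a minimal illustration, not the code in app.py: the function names and the two-sentence corpus are hypothetical, and only the regex is taken from the notes at the end of this README.

```python
import re

def tokenize(text):
    # Stage 1: lower-case, then keep runs of letters, digits, and apostrophes.
    return re.findall(r"[A-Za-z0-9']+", text.lower())

def build_vocabulary(docs_tokens):
    # Stage 2: sorted unique tokens, each mapped to a column index.
    vocab = sorted({tok for tokens in docs_tokens for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}

def count_vectors(docs_tokens, vocab):
    # Stage 3: one row per document, one column per vocabulary term.
    rows = []
    for tokens in docs_tokens:
        row = [0] * len(vocab)
        for tok in tokens:
            row[vocab[tok]] += 1
        rows.append(row)
    return rows

docs = ["The cat sat on the mat.", "The dog sat."]  # hypothetical corpus
tokens = [tokenize(d) for d in docs]
vocab = build_vocabulary(tokens)
matrix = count_vectors(tokens, vocab)

print(vocab)   # {'cat': 0, 'dog': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
print(matrix)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each panel in the app shows the same progression, with scikit-learn or gensim doing the work on the backend.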
1. Setup
Requires Python 3.9+.
cd text-vectorization-lab
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
2. Run
python app.py
Then open http://localhost:5000 in your browser.
The dev server runs with debug=True, so editing app.py or the
templates/static files will auto-reload.
3. How it's wired
text-vectorization-lab/
├── app.py               # Flask app + all vectorization logic (the "backend")
├── requirements.txt
├── templates/
│   └── index.html       # Single-page app shell, one <section> per technique
└── static/
    ├── css/style.css    # Design system (dark "lab" theme, pipeline tape, matrices)
    └── js/main.js       # Fetches /api/* and animates the step-by-step reveal
Each technique has its own REST endpoint:
| Endpoint | Mirrors notebook section |
|---|---|
| POST /api/onehot | One-Hot Encoding (manual NumPy + OneHotEncoder cross-check) |
| POST /api/count-vectorizer | CountVectorizer (vocabulary, matrix, stop words, max_features, .transform() on new text) |
| POST /api/bow | Hand-rolled Bag-of-Words counter + binary BoW + cosine similarity |
| POST /api/ngrams | Manual n-gram generator + CountVectorizer(ngram_range=...) + char-level n-grams |
| POST /api/tfidf | Manual TF/IDF computation (sklearn-style smoothing) + TfidfVectorizer + top words per doc |
| POST /api/embeddings | gensim.models.Word2Vec (Skip-gram & CBOW), similarity, most-similar, PCA-to-2D, and FastText for out-of-vocabulary words |
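
As an illustration of what "sklearn-style smoothing" in the /api/tfidf row means, here is a standard-library sketch of the smoothed formula that TfidfVectorizer uses by default: idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The function names and corpus are hypothetical and this is not the code in app.py. Results match scikit-learn only when tokenization agrees, since its default token pattern drops single-character tokens.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Simple regex tokenizer (not scikit-learn's default token pattern).
    return re.findall(r"[A-Za-z0-9']+", text.lower())

def tfidf(docs):
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        # L2-normalize so every document vector has unit length.
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, idf, rows

docs = ["the cat sat", "the dog barked"]  # hypothetical corpus
vocab, idf, rows = tfidf(docs)
print(vocab)
print({t: round(w, 4) for t, w in idf.items()})
```

A term that appears in every document, such as "the" here, gets an IDF of exactly 1 rather than 0, so it is down-weighted relative to rarer terms but never removed.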
The frontend never computes vectors itself: it only sends your raw text to these endpoints and renders whatever comes back, so the numbers you see are always whatever scikit-learn/gensim actually produce.
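The same endpoints can in principle be called from a script instead of the browser. The sketch below uses only the standard library. The request body shape is an assumption: this README does not document the JSON fields, so the "text" key is hypothetical and should be checked against app.py or main.js before use.

```python
import json
from urllib import request

BASE_URL = "http://localhost:5000"  # dev server address from the Run section

def build_request(endpoint, text):
    # ASSUMPTION: the backend accepts a JSON body with a "text" field.
    payload = json.dumps({"text": text}).encode("utf-8")
    return request.Request(
        BASE_URL + endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def call_endpoint(endpoint, text):
    # Sends the request; this only works while the app is running locally.
    with request.urlopen(build_request(endpoint, text)) as resp:
        return json.loads(resp.read().decode("utf-8"))

req = build_request("/api/tfidf", "the cat sat on the mat")
print(req.get_method(), req.full_url)
```

Building the request separately from sending it makes it possible to inspect the payload without a running server.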
4. Using it
Every panel has a textarea pre-filled with the same example corpus used in the notebook, plus a Run button. Edit the text, click run, and watch the pipeline tape light up stage by stage as the backend tokenizes, builds the vocabulary, and fills in the matrix or vectors. Reset to example restores the original notebook text for that panel.
The Word Embeddings panel trains a real Word2Vec + FastText model on your sentences on every run, so it takes a couple of seconds. That's genuine training time, not a fake delay.
5. Notes & limits
- All endpoints lower-case and strip punctuation with a simple regex
tokenizer (re.findall(r"[A-Za-z0-9']+", text.lower())), matching the
.split()-based tokenization used in the notebook closely enough for teaching
purposes. CountVectorizer/TfidfVectorizer use scikit-learn's own tokenizer
internally, as in the notebook.
- Word2Vec/FastText need a handful of sentences with repeated/shared words to
produce meaningful similarities; very short or fully disjoint corpora will
train, but the similarity scores won't mean much.
- This is a teaching tool, not a production service: there's no auth, rate
limiting, or persistence, and debug=True should be turned off if you ever
deploy it anywhere other than your own machine.
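
To see how the regex tokenizer from the first note differs from a plain .split(), here is a small comparison. The regex is the one quoted above; the helper name and sample sentence are hypothetical.

```python
import re

def regex_tokenize(text):
    # Lower-case, then keep runs of letters, digits, and apostrophes.
    return re.findall(r"[A-Za-z0-9']+", text.lower())

text = "Don't panic: NLP is fun, isn't it?"

print(regex_tokenize(text))
# ["don't", 'panic', 'nlp', 'is', 'fun', "isn't", 'it']

print(text.lower().split())
# ["don't", 'panic:', 'nlp', 'is', 'fun,', "isn't", 'it?']
```

The two agree on text without punctuation. With punctuation, .split() leaves it attached to the neighboring word, while the regex drops it and keeps apostrophes inside contractions.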