---
title: Text Vectorization Lab
emoji: 🧮
colorFrom: yellow
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---

# Text Vectorization Lab

A step-by-step, interactive simulation of every technique covered in the
*Text Vectorization in NLP* deck and notebook:

- One-Hot Encoding
- Count Vectorizer
- Bag-of-Words
- N-grams
- TF-IDF Vectorizer
- Word Embeddings (Word2Vec — Skip-gram & CBOW — and FastText)

It's a single small web app: a **Flask backend** (`app.py`) that actually
runs the `scikit-learn` / `numpy` / `gensim` code from the reference
notebook on whatever text you type in, and a **vanilla HTML/CSS/JS
frontend** that calls that backend and reveals the result stage by stage —
tokenize → vocabulary → vectors — along an animated pipeline tape. There is
no precomputed/fake data: every matrix, score, and embedding on the page is
computed live by the Python backend for the exact text you entered.

## 1. Setup

Requires Python 3.9+.

```bash
cd text-vectorization-lab
python -m venv venv
source venv/bin/activate          # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

## 2. Run

```bash
python app.py
```

Then open **http://localhost:5000** in your browser.

The dev server runs with `debug=True`, so it restarts automatically when
you edit `app.py`, and changes to templates or static files show up on the
next browser refresh.

## 3. How it's wired

```
text-vectorization-lab/
├── app.py                  # Flask app + all vectorization logic (the "backend")
├── requirements.txt
├── templates/
│   └── index.html          # Single-page app shell, one <section> per technique
└── static/
    ├── css/style.css       # Design system (dark "lab" theme, pipeline tape, matrices)
    └── js/main.js          # Fetches /api/* and animates the step-by-step reveal
```

Each technique has its own REST endpoint:

| Endpoint | Mirrors notebook section |
|---|---|
| `POST /api/onehot` | One-Hot Encoding (manual NumPy + `OneHotEncoder` cross-check) |
| `POST /api/count-vectorizer` | `CountVectorizer` (vocabulary, matrix, stop words, `max_features`, `.transform()` on new text) |
| `POST /api/bow` | Hand-rolled Bag-of-Words counter + binary BoW + cosine similarity |
| `POST /api/ngrams` | Manual n-gram generator + `CountVectorizer(ngram_range=...)` + char-level n-grams |
| `POST /api/tfidf` | Manual TF/IDF computation (sklearn-style smoothing) + `TfidfVectorizer` + top words per doc |
| `POST /api/embeddings` | `gensim.models.Word2Vec` (Skip-gram & CBOW), similarity, most-similar, PCA-to-2D, and `FastText` for out-of-vocabulary words |

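To make the `/api/tfidf` row concrete, here is a minimal standard-library sketch of a manual TF-IDF computation with scikit-learn-style smoothing, where `idf = ln((1 + n) / (1 + df)) + 1` and each row is L2-normalized. It is not the code in `app.py`: the function names `tokenize` and `tfidf` are hypothetical, and the numbers only match `TfidfVectorizer` when both sides tokenize the text the same way.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Regex quoted in this README for the app's simple tokenizer.
    return re.findall(r"[A-Za-z0-9']+", text.lower())


def tfidf(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalized rows."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, so no term gets a zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]  # raw count times IDF
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard empty docs
        rows.append([v / norm for v in raw])
    return vocab, rows


vocab, rows = tfidf(["the cat sat", "the dog sat", "the cat ran"])
print(vocab)
print([round(v, 3) for v in rows[0]])
```
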
The frontend never computes vectors itself — it only sends your raw text
to these endpoints and renders whatever comes back, so the numbers you see
are always whatever scikit-learn/gensim actually produce.

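As an illustration of what the `/api/bow` row in the table describes, the following is a small standard-library sketch of a hand-rolled Bag-of-Words counter, its binary variant, and cosine similarity between two count vectors. The helper names `bag_of_words` and `cosine` are hypothetical and are not taken from `app.py`.

```python
import math
import re
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count matrix, binary matrix) for a list of texts."""
    tokenized = [re.findall(r"[A-Za-z0-9']+", d.lower()) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    counts = []
    for doc in tokenized:
        c = Counter(doc)
        counts.append([c[t] for t in vocab])  # one column per vocabulary word
    # Binary BoW only records presence or absence of each word.
    binary = [[1 if n else 0 for n in row] for row in counts]
    return vocab, counts, binary


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


vocab, counts, binary = bag_of_words(["the cat sat on the mat", "the dog sat"])
print(vocab)      # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(counts[0])  # [1, 0, 1, 1, 1, 2]
print(binary[0])  # [1, 0, 1, 1, 1, 1]
print(round(cosine(counts[0], counts[1]), 4))  # 0.6124
```

The two documents share "the" and "sat", which is why the similarity is well above zero even though the second document is much shorter.
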
## 4. Using it

Every panel has a textarea pre-filled with the same example corpus used in
the notebook, plus a **Run** button. Edit the text, click **Run**, and watch
the pipeline tape light up stage by stage as the backend tokenizes,
builds the vocabulary, and fills in the matrix or vectors. **Reset to
example** restores the original notebook text for that panel.

The Word Embeddings panel trains real Word2Vec and FastText models on your
sentences on every run, so it takes a couple of seconds. That is genuine
training time, not a fake delay.

## 5. Notes & limits

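- For their manual steps, all endpoints lowercase the text and strip
  punctuation with a simple regex tokenizer
  (`re.findall(r"[A-Za-z0-9']+", text.lower())`), which matches the
  `.split()`-based tokenization used in the notebook closely enough for
  teaching purposes. `CountVectorizer`/`TfidfVectorizer` use scikit-learn's
  own tokenizer internally, as in the notebook; with its default settings
  that tokenizer keeps only tokens of two or more characters, so the two
  paths can disagree on words such as "a" or "I".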
- Word2Vec/FastText need a handful of sentences with repeated/shared words
  to produce meaningful similarities — very short or fully disjoint corpora
  will train but the similarity scores won't mean much.
- This is a teaching tool, not a production service: there's no auth, rate
  limiting, or persistence. Turn `debug=True` off if you ever deploy it
  anywhere other than your own machine, because Flask's debug mode exposes
  an interactive debugger that can execute arbitrary code.
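
The difference between the regex tokenizer quoted in the first note and a plain `.split()` can be seen in a short sketch. The regex is the one from this README; the `split_tokenize` helper is an assumption about what a `.split()`-based tokenizer would look like, since the notebook's exact call is not shown here.

```python
import re

# Regex quoted in the notes above: letters, digits, and apostrophes only.
TOKEN_RE = re.compile(r"[A-Za-z0-9']+")


def regex_tokenize(text):
    """Lowercase, then keep runs of letters, digits, and apostrophes."""
    return TOKEN_RE.findall(text.lower())


def split_tokenize(text):
    """Hypothetical .split()-based tokenizer: whitespace only, punctuation stays."""
    return text.lower().split()


sample = "The cat sat. The cat's mat!"
print(regex_tokenize(sample))  # ['the', 'cat', 'sat', 'the', "cat's", 'mat']
print(split_tokenize(sample))  # ['the', 'cat', 'sat.', 'the', "cat's", 'mat!']
```

With `.split()`, "sat." and "mat!" keep their punctuation and would become separate vocabulary entries from "sat" and "mat"; the regex version avoids that.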