---
title: Text Vectorization Lab
emoji: 🧮
colorFrom: yellow
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---
# Text Vectorization Lab
A step-by-step, interactive simulation of every technique covered in the
*Text Vectorization in NLP* deck and notebook:
- One-Hot Encoding
- Count Vectorizer
- Bag-of-Words
- N-grams
- TF-IDF Vectorizer
- Word Embeddings (Word2Vec, both Skip-gram and CBOW, and FastText)
It's a single small web app: a **Flask backend** (`app.py`) that actually
runs the `scikit-learn` / `numpy` / `gensim` code from the reference
notebook on whatever text you type in, and a **vanilla HTML/CSS/JS
frontend** that calls that backend and reveals the result stage by stage
(tokenize → vocabulary → vectors) along an animated pipeline tape. There is
no precomputed/fake data: every matrix, score, and embedding on the page is
computed live by the Python backend for the exact text you entered.
## 1. Setup
Requires Python 3.9+.
```bash
cd text-vectorization-lab
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
```
## 2. Run
```bash
python app.py
```
Then open **http://localhost:5000** in your browser.
The dev server runs with `debug=True`, so editing `app.py` or the
templates/static files will auto-reload.
## 3. How it's wired
```
text-vectorization-lab/
├── app.py                  # Flask app + all vectorization logic (the "backend")
├── requirements.txt
├── templates/
│   └── index.html          # Single-page app shell, one <section> per technique
└── static/
    ├── css/style.css       # Design system (dark "lab" theme, pipeline tape, matrices)
    └── js/main.js          # Fetches /api/* and animates the step-by-step reveal
```
Each technique has its own REST endpoint:
| Endpoint | Mirrors notebook section |
|---|---|
| `POST /api/onehot` | One-Hot Encoding (manual NumPy + `OneHotEncoder` cross-check) |
| `POST /api/count-vectorizer` | `CountVectorizer` (vocabulary, matrix, stop words, `max_features`, `.transform()` on new text) |
| `POST /api/bow` | Hand-rolled Bag-of-Words counter + binary BoW + cosine similarity |
| `POST /api/ngrams` | Manual n-gram generator + `CountVectorizer(ngram_range=...)` + char-level n-grams |
| `POST /api/tfidf` | Manual TF/IDF computation (sklearn-style smoothing) + `TfidfVectorizer` + top words per doc |
| `POST /api/embeddings` | `gensim.models.Word2Vec` (Skip-gram & CBOW), similarity, most-similar, PCA-to-2D, and `FastText` for out-of-vocabulary words |
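
To make the "sklearn-style smoothing" in the `/api/tfidf` row concrete, here is a minimal standard-library sketch of the formula that `TfidfVectorizer` uses by default: `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. It is not the code in `app.py`. The function name `tfidf` is hypothetical, and the sketch assumes the regex tokenizer described under "Notes & limits", so its numbers can differ from scikit-learn's on inputs with single-character tokens, which scikit-learn's default tokenizer drops.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Regex tokenizer quoted in "Notes & limits": lower-case, keep letters,
    # digits and apostrophes.
    return re.findall(r"[A-Za-z0-9']+", text.lower())


def tfidf(docs):
    """Hypothetical helper: smoothed TF-IDF with L2-normalized rows."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)  # raw term counts
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid divide-by-zero
        rows.append([v / norm for v in row])
    return vocab, idf, rows


vocab, idf, rows = tfidf(["the cat sat", "the dog sat", "the cat ran"])
print(vocab)
print({t: round(w, 4) for t, w in idf.items()})
```

A term that appears in every document (here `the`) gets an IDF of exactly 1 rather than 0, which is the practical effect of the `+ 1` in the smoothed formula.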
The frontend never computes vectors itself: it only sends your raw text
to these endpoints and renders whatever comes back, so the numbers you see
are always whatever scikit-learn/gensim actually produce.
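
Because the endpoints are plain HTTP, they can also be called without the frontend. The sketch below shows one way to do that from Python using only the standard library. The request body shape (`{"text": ...}`) is an assumption made for illustration, since this README does not document the JSON schema; check `app.py` or `static/js/main.js` for the real field names. `build_request` and `call` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # local dev address from the "Run" section


def build_request(endpoint, text):
    # ASSUMPTION: the backend accepts a JSON body with a single "text" field.
    # The actual schema is not stated in this README.
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(endpoint, text):
    # Sends the request and decodes the JSON response.
    # Requires the Flask app to be running locally.
    with urllib.request.urlopen(build_request(endpoint, text)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (with the server running):
#   result = call("/api/tfidf", "the cat sat\nthe dog sat")
```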
## 4. Using it
Every panel has a textarea pre-filled with the same example corpus used in
the notebook, plus a **Run** button. Edit the text, click run, and watch
the pipeline tape light up stage by stage as the backend tokenizes,
builds the vocabulary, and fills in the matrix or vectors. **Reset to
example** restores the original notebook text for that panel.
The Word Embeddings panel trains a real Word2Vec + FastText model on your
sentences on every run, so it takes a couple of seconds. That's genuine
training time, not a fake delay.
## 5. Notes & limits
- All endpoints lower-case and strip punctuation with a simple regex
tokenizer (`re.findall(r"[A-Za-z0-9']+", text.lower())`), matching the
`.split()`-based tokenization used in the notebook closely enough for
teaching purposes. `CountVectorizer`/`TfidfVectorizer` use scikit-learn's
own tokenizer internally, as in the notebook.
- Word2Vec/FastText need a handful of sentences with repeated/shared words
  to produce meaningful similarities; very short or fully disjoint corpora
  will train, but the similarity scores won't mean much.
- This is a teaching tool, not a production service: there's no auth, rate
limiting, or persistence, and `debug=True` should be turned off if you
ever deploy it anywhere other than your own machine.
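
As a quick illustration of the first note, this sketch compares the regex tokenizer quoted above with plain `.split()` on one sentence. The function name `regex_tokenize` and the sample sentence are made up for the example; only the regex itself comes from this README.

```python
import re


def regex_tokenize(text):
    # Same pattern as quoted in the note: lower-case, then keep runs of
    # letters, digits and apostrophes.
    return re.findall(r"[A-Za-z0-9']+", text.lower())


sample = "Hello, world! It's NLP."
print(regex_tokenize(sample))  # ['hello', 'world', "it's", 'nlp']
print(sample.lower().split())  # ['hello,', 'world!', "it's", 'nlp.']
```

The two agree on clean, punctuation-free text; they differ only where punctuation is attached to a word, which `.split()` keeps and the regex drops.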