---
title: Text Vectorization Lab
emoji: 🧮
colorFrom: yellow
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---

# Text Vectorization Lab

A step-by-step, interactive simulation of every technique covered in the
*Text Vectorization in NLP* deck and notebook:

- One-Hot Encoding
- Count Vectorizer
- Bag-of-Words
- N-grams
- TF-IDF Vectorizer
- Word Embeddings (Word2Vec, in both Skip-gram and CBOW variants, and FastText)

It's a single small web app: a **Flask backend** (`app.py`) that actually
runs the `scikit-learn` / `numpy` / `gensim` code from the reference
notebook on whatever text you type in, and a **vanilla HTML/CSS/JS
frontend** that calls that backend and reveals the result stage by stage
(tokenize → vocabulary → vectors) along an animated pipeline tape. There is
no precomputed or fake data: every matrix, score, and embedding on the page
is computed live by the Python backend for the exact text you entered.
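
The tokenize → vocabulary → vectors stages can be sketched in plain Python. This is only an illustration of the idea, not the code in `app.py`: the function names and the two-sentence corpus are hypothetical, and the tokenizer regex is the one quoted in the notes at the end of this README.

```python
import re


def tokenize(text):
    """Stage 1: lower-case, then keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text.lower())


def build_vocabulary(docs):
    """Stage 2: map each distinct token to a column index (sorted order)."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vector(doc, vocab):
    """Stage 3: bag-of-words counts for one document over a fixed vocabulary."""
    vec = [0] * len(vocab)
    for tok in tokenize(doc):
        if tok in vocab:  # tokens outside the vocabulary are ignored
            vec[vocab[tok]] += 1
    return vec


# Illustrative corpus (an assumption, not the notebook's example text).
docs = ["The cat sat on the mat.", "The dog sat on the log."]
vocab = build_vocabulary(docs)
vectors = [count_vector(doc, vocab) for doc in docs]

print(list(vocab))  # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(vectors[0])   # [1, 0, 0, 1, 1, 1, 2]
print(vectors[1])   # [0, 1, 1, 0, 1, 1, 2]
```

Each stage's return value is the kind of intermediate result the pipeline tape reveals: a token list, then a vocabulary, then one vector per document.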

## 1. Setup

Requires Python 3.9+.

```bash
cd text-vectorization-lab
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

## 2. Run

```bash
python app.py
```

Then open **http://localhost:5000** in your browser.

The dev server runs with `debug=True`, so editing `app.py` restarts the
server automatically, and changes to the templates or static files show up
on the next browser refresh.

## 3. How it's wired

```
text-vectorization-lab/
├── app.py              # Flask app + all vectorization logic (the "backend")
├── requirements.txt
├── templates/
│   └── index.html      # Single-page app shell, one <section> per technique
└── static/
    ├── css/style.css   # Design system (dark "lab" theme, pipeline tape, matrices)
    └── js/main.js      # Fetches /api/* and animates the step-by-step reveal
```

Each technique has its own REST endpoint:

| Endpoint | Mirrors notebook section |
|---|---|
| `POST /api/onehot` | One-Hot Encoding (manual NumPy + `OneHotEncoder` cross-check) |
| `POST /api/count-vectorizer` | `CountVectorizer` (vocabulary, matrix, stop words, `max_features`, `.transform()` on new text) |
| `POST /api/bow` | Hand-rolled Bag-of-Words counter + binary BoW + cosine similarity |
| `POST /api/ngrams` | Manual n-gram generator + `CountVectorizer(ngram_range=...)` + char-level n-grams |
| `POST /api/tfidf` | Manual TF/IDF computation (sklearn-style smoothing) + `TfidfVectorizer` + top words per doc |
| `POST /api/embeddings` | `gensim.models.Word2Vec` (Skip-gram & CBOW), similarity, most-similar, PCA-to-2D, and `FastText` for out-of-vocabulary words |
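
As an illustration of what "sklearn-style smoothing" in the `/api/tfidf` row refers to, here is a minimal pure-Python sketch. It assumes scikit-learn's default `TfidfVectorizer` settings (raw term counts, `idf = ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document). The function names and sample sentences are hypothetical, and this is not the code from `app.py`.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case, then keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text.lower())


def tfidf(docs):
    """Return (vocab, idf, rows) using smoothed IDF and L2-normalized rows."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    df = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed IDF: behaves as if one extra document contained every term.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # guard empty docs
        rows.append([x / norm for x in row])
    return vocab, idf, rows


# Hypothetical two-document corpus.
vocab, idf, rows = tfidf(["the cat sat", "the dog sat down"])
print(vocab)       # ['cat', 'dog', 'down', 'sat', 'the']
print(idf["the"])  # 1.0, because "the" appears in every document
```

The numbers agree with `TfidfVectorizer` only when both sides tokenize the text the same way; scikit-learn's default tokenizer, for instance, drops single-character tokens.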

The frontend never computes vectors itself: it only sends your raw text
to these endpoints and renders whatever comes back, so the numbers you see
are always what scikit-learn and gensim actually produce.
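
The same endpoints can be called from a script instead of the browser. The sketch below uses only the standard library and makes two assumptions that this README does not confirm: that the endpoints accept a JSON body with a `"text"` key, and that they reply with JSON. Check `app.py` for the actual request and response schema.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # the address given in the Run section


def build_request(endpoint, text):
    """Build a JSON POST request for one of the /api/* endpoints.

    ASSUMPTION: the payload key "text" is a guess, not taken from app.py.
    """
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_endpoint(endpoint, text):
    """Send the request and decode the reply (assumed to be JSON).

    Only works while the app is running locally.
    """
    with urllib.request.urlopen(build_request(endpoint, text)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request does not touch the network.
req = build_request("/api/tfidf", "the cat sat on the mat")
print(req.get_method(), req.full_url)  # POST http://localhost:5000/api/tfidf
```

With the server started via `python app.py`, `call_endpoint("/api/tfidf", "some text")` would return whatever the backend sends back, provided the assumed payload shape is right.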

## 4. Using it

Every panel has a textarea pre-filled with the same example corpus used in
the notebook, plus a **Run** button. Edit the text, click **Run**, and watch
the pipeline tape light up stage by stage as the backend tokenizes,
builds the vocabulary, and fills in the matrix or vectors. **Reset to
example** restores the original notebook text for that panel.

The Word Embeddings panel trains real Word2Vec and FastText models on your
sentences on every run, so it takes a couple of seconds. That's genuine
training time, not a fake delay.

## 5. Notes & limits

- For the hand-rolled computations, all endpoints lower-case the text and
  tokenize it with a simple regex
  (`re.findall(r"[A-Za-z0-9']+", text.lower())`), which keeps letters,
  digits, and apostrophes and drops all other punctuation. That matches the
  `.split()`-based tokenization used in the notebook closely enough for
  teaching purposes. `CountVectorizer`/`TfidfVectorizer` use scikit-learn's
  own tokenizer internally, as in the notebook.
- Word2Vec/FastText need a handful of sentences with repeated or shared words
  to produce meaningful similarities; very short or fully disjoint corpora
  will still train, but the similarity scores won't mean much.
- This is a teaching tool, not a production service: there's no auth, rate
  limiting, or persistence, and `debug=True` should be turned off if you
  ever deploy it anywhere other than your own machine.
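
To make the tokenization note above concrete, this short sketch runs one sentence through the regex quoted there, through a plain `.split()`, and through the pattern scikit-learn documents as its default `token_pattern` (`(?u)\b\w\w+\b`), applied here with `re` directly rather than via scikit-learn. The sample sentence is made up for illustration.

```python
import re

sample = "Don't panic: NLP is fun, isn't it?"

# The regex tokenizer quoted in the notes: letters, digits, and apostrophes.
regex_tokens = re.findall(r"[A-Za-z0-9']+", sample.lower())

# Whitespace splitting leaves punctuation attached to the words.
split_tokens = sample.lower().split()

# scikit-learn's documented default token_pattern: words of 2+ characters,
# so the apostrophe splits "don't" and the lone "t" is dropped.
sklearn_style = re.findall(r"(?u)\b\w\w+\b", sample.lower())

print(regex_tokens)   # ["don't", 'panic', 'nlp', 'is', 'fun', "isn't", 'it']
print(split_tokens)   # ["don't", 'panic:', 'nlp', 'is', 'fun,', "isn't", 'it?']
print(sklearn_style)  # ['don', 'panic', 'nlp', 'is', 'fun', 'isn', 'it']
```

The three results differ mainly on punctuation and contractions, so the manual and scikit-learn outputs in a panel can have slightly different vocabularies for the same text.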