---
title: Text Vectorization Lab
emoji: 🧮
colorFrom: yellow
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---

# Text Vectorization Lab

A step-by-step, interactive simulation of every technique covered in the *Text Vectorization in NLP* deck and notebook:

- One-Hot Encoding
- Count Vectorizer
- Bag-of-Words
- N-grams
- TF-IDF Vectorizer
- Word Embeddings (Word2Vec with Skip-gram and CBOW, plus FastText)

It's a single small web app: a **Flask backend** (`app.py`) that actually runs the `scikit-learn` / `numpy` / `gensim` code from the reference notebook on whatever text you type in, and a **vanilla HTML/CSS/JS frontend** that calls that backend and reveals the result stage by stage (tokenize → vocabulary → vectors) along an animated pipeline tape.

There is no precomputed or fake data: every matrix, score, and embedding on the page is computed live by the Python backend for the exact text you entered.

## 1. Setup

Requires Python 3.9+.

```bash
cd text-vectorization-lab
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

## 2. Run

```bash
python app.py
```

Then open **http://localhost:5000** in your browser. The dev server runs with `debug=True`, so editing `app.py` or the templates/static files will auto-reload.

## 3. How it's wired

```
text-vectorization-lab/
├── app.py                 # Flask app + all vectorization logic (the "backend")
├── requirements.txt
├── templates/
│   └── index.html         # Single-page app shell, one panel per technique
└── static/
    ├── css/style.css      # Design system (dark "lab" theme, pipeline tape, matrices)
    └── js/main.js         # Fetches /api/* and animates the step-by-step reveal
```

Each technique has its own REST endpoint:

| Endpoint | Mirrors notebook section |
|---|---|
| `POST /api/onehot` | One-Hot Encoding (manual NumPy + `OneHotEncoder` cross-check) |
| `POST /api/count-vectorizer` | `CountVectorizer` (vocabulary, matrix, stop words, `max_features`, `.transform()` on new text) |
| `POST /api/bow` | Hand-rolled Bag-of-Words counter + binary BoW + cosine similarity |
| `POST /api/ngrams` | Manual n-gram generator + `CountVectorizer(ngram_range=...)` + char-level n-grams |
| `POST /api/tfidf` | Manual TF/IDF computation (sklearn-style smoothing) + `TfidfVectorizer` + top words per doc |
| `POST /api/embeddings` | `gensim.models.Word2Vec` (Skip-gram & CBOW), similarity, most-similar, PCA-to-2D, and `FastText` for out-of-vocabulary words |

The frontend never computes vectors itself: it only sends your raw text to these endpoints and renders whatever comes back, so the numbers you see are always whatever scikit-learn/gensim actually produce.

## 4. Using it

Every panel has a textarea pre-filled with the same example corpus used in the notebook, plus a **Run** button. Edit the text, click **Run**, and watch the pipeline tape light up stage by stage as the backend tokenizes, builds the vocabulary, and fills in the matrix or vectors. **Reset to example** restores the original notebook text for that panel.

The Word Embeddings panel trains a real Word2Vec + FastText model on your sentences on every run, so it takes a couple of seconds. That is genuine training time, not a fake delay.

## 5. Notes & limits

- All endpoints lower-case the text and strip punctuation (apostrophes are kept) with a simple regex tokenizer (`re.findall(r"[A-Za-z0-9']+", text.lower())`), matching the `.split()`-based tokenization used in the notebook closely enough for teaching purposes.
  `CountVectorizer` and `TfidfVectorizer` still use scikit-learn's own tokenizer internally, as in the notebook.
- Word2Vec/FastText need a handful of sentences with repeated or shared words to produce meaningful similarities. Very short or fully disjoint corpora will train, but the similarity scores won't mean much.
- This is a teaching tool, not a production service: there's no auth, rate limiting, or persistence, and `debug=True` should be turned off if you ever deploy it anywhere other than your own machine.
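
The tokenizer difference described in the first note can be seen in a small standalone sketch. It is illustrative only and is not taken from `app.py`: the function names and the sample sentence are hypothetical, and the third tokenizer simply applies scikit-learn's documented default `token_pattern` with plain `re` rather than calling scikit-learn itself.

```python
import re

# Regex quoted in the notes above. The function name is hypothetical.
def lab_tokenize(text):
    return re.findall(r"[A-Za-z0-9']+", text.lower())

# Plain whitespace splitting, standing in for the notebook's .split() approach.
# Assumption: no lower-casing here, since the notes do not say the notebook does it.
def split_tokenize(text):
    return text.split()

# Default token_pattern of CountVectorizer/TfidfVectorizer, applied by hand.
# It keeps only runs of two or more word characters; lower-casing mirrors
# the vectorizers' default lowercase=True.
SKLEARN_DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

def sklearn_style_tokenize(text):
    return re.findall(SKLEARN_DEFAULT_PATTERN, text.lower())

sample = "Don't panic, it's a test!"

print(lab_tokenize(sample))            # ["don't", 'panic', "it's", 'a', 'test']
print(split_tokenize(sample))          # ["Don't", 'panic,', "it's", 'a', 'test!']
print(sklearn_style_tokenize(sample))  # ['don', 'panic', 'it', 'test']
```

The regex tokenizer keeps contractions and one-letter words, `.split()` leaves punctuation attached, and the scikit-learn-style pattern breaks at apostrophes and drops single-character tokens. This is why vocabularies from the hand-rolled steps and from `CountVectorizer`/`TfidfVectorizer` can differ slightly for the same input.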