metadata
title: Text Vectorization Lab
emoji: 🧮
colorFrom: yellow
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false

Text Vectorization Lab

A step-by-step, interactive simulation of every technique covered in the Text Vectorization in NLP deck and notebook:

  • One-Hot Encoding
  • Count Vectorizer
  • Bag-of-Words
  • N-grams
  • TF-IDF Vectorizer
  • Word Embeddings (Word2Vec, both Skip-gram and CBOW, and FastText)

It's a single small web app: a Flask backend (app.py) that runs the scikit-learn / numpy / gensim code from the reference notebook on whatever text you type in, and a vanilla HTML/CSS/JS frontend that calls that backend and reveals the result stage by stage (tokenize → vocabulary → vectors) along an animated pipeline tape. There is no precomputed or fake data: every matrix, score, and embedding on the page is computed live by the Python backend for the exact text you entered.
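The three pipeline stages can be sketched in plain Python. This is a minimal illustration, not the code in app.py: the function names and the two sample sentences are made up for the example, and only the tokenizer regex is taken from the notes at the end of this README.

```python
import re
from collections import Counter


def tokenize(text):
    # Stage 1: lower-case, then keep runs of letters, digits and apostrophes.
    return re.findall(r"[A-Za-z0-9']+", text.lower())


def build_vocabulary(tokenized_docs):
    # Stage 2: sorted unique tokens, each mapped to a column index.
    unique_tokens = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: idx for idx, tok in enumerate(unique_tokens)}


def count_vectors(tokenized_docs, vocab):
    # Stage 3: one row per document, one column per vocabulary entry.
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows


# Hypothetical sample corpus (not the notebook's example text).
docs = ["The cat sat on the mat", "The dog sat"]
tokens = [tokenize(d) for d in docs]
vocab = build_vocabulary(tokens)
matrix = count_vectors(tokens, vocab)

print(list(vocab))  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(matrix)       # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each stage returns a plain Python structure, which is roughly the kind of intermediate result a step-by-step reveal needs to display.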

1. Setup

Requires Python 3.9+.

cd text-vectorization-lab
python -m venv venv
source venv/bin/activate          # Windows: venv\Scripts\activate
pip install -r requirements.txt

2. Run

python app.py

Then open http://localhost:5000 in your browser.

The dev server runs with debug=True, so editing app.py or the templates/static files will auto-reload.

3. How it's wired

text-vectorization-lab/
├── app.py                  # Flask app + all vectorization logic (the "backend")
├── requirements.txt
├── templates/
│   └── index.html          # Single-page app shell, one <section> per technique
└── static/
    ├── css/style.css       # Design system (dark "lab" theme, pipeline tape, matrices)
    └── js/main.js          # Fetches /api/* and animates the step-by-step reveal

Each technique has its own REST endpoint:

| Endpoint | Mirrors notebook section |
| --- | --- |
| POST /api/onehot | One-Hot Encoding (manual NumPy + OneHotEncoder cross-check) |
| POST /api/count-vectorizer | CountVectorizer (vocabulary, matrix, stop words, max_features, .transform() on new text) |
| POST /api/bow | Hand-rolled Bag-of-Words counter + binary BoW + cosine similarity |
| POST /api/ngrams | Manual n-gram generator + CountVectorizer(ngram_range=...) + char-level n-grams |
| POST /api/tfidf | Manual TF/IDF computation (sklearn-style smoothing) + TfidfVectorizer + top words per doc |
| POST /api/embeddings | gensim.models.Word2Vec (Skip-gram & CBOW), similarity, most-similar, PCA-to-2D, and FastText for out-of-vocabulary words |
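As an illustration of what "sklearn-style smoothing" means for the /api/tfidf endpoint, here is a minimal standard-library sketch. It assumes the defaults scikit-learn documents for TfidfVectorizer (smooth_idf=True, L2 row normalization) and is not the code in app.py. The function name and sample corpus are hypothetical, and because it uses the simple regex tokenizer rather than scikit-learn's own, results can differ for single-character tokens.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[A-Za-z0-9']+", text.lower())


def tfidf_rows(docs):
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter(tok for doc in tokenized for tok in set(doc))

    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, so no term gets a zero weight.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for doc in tokenized:
        tf = Counter(doc)  # raw term counts
        raw = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0  # guard empty docs
        rows.append([x / norm for x in raw])  # L2-normalize each row
    return vocab, idf, rows


# Hypothetical sample corpus (not the notebook's example text).
docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, idf, rows = tfidf_rows(docs)

for term in vocab:
    print(f"{term:>4}  idf={idf[term]:.4f}")
```

A term that appears in every document ("the" here) gets an IDF of exactly 1 rather than 0, which is the visible effect of the smoothing.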

The frontend never computes vectors itself. It only sends your raw text to these endpoints and renders whatever comes back, so the numbers you see are always what scikit-learn and gensim actually produce.
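The same endpoints can be called from a script instead of the browser. The sketch below is hedged: the request and response shapes are assumptions, since this README does not document them. In particular, the JSON body with a "text" field and the JSON reply are guesses to be checked against app.py or static/js/main.js, and the helper names are hypothetical.

```python
import json
from urllib import request

BASE_URL = "http://localhost:5000"  # address from the Run section


def build_request(endpoint, text):
    # ASSUMPTION: the backend accepts a JSON body with a "text" field.
    payload = json.dumps({"text": text}).encode("utf-8")
    return request.Request(
        BASE_URL + endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_endpoint(endpoint, text):
    # Needs the app running locally; ASSUMPTION: the reply is JSON.
    with request.urlopen(build_request(endpoint, text)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request does not touch the network.
req = build_request("/api/tfidf", "the cat sat on the mat")
print(req.get_method(), req.full_url)
```

With the server started via python app.py, call_endpoint("/api/tfidf", "some text") would return the parsed reply if the assumed payload shape matches the backend.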

4. Using it

Every panel has a textarea pre-filled with the same example corpus used in the notebook, plus a Run button. Edit the text, click Run, and watch the pipeline tape light up stage by stage as the backend tokenizes, builds the vocabulary, and fills in the matrix or vectors. The "Reset to example" button restores the original notebook text for that panel.

The Word Embeddings panel trains a real Word2Vec + FastText model on your sentences on every run, so it takes a couple of seconds. That's genuine training time, not a fake delay.

5. Notes & limits

  • All endpoints lower-case and strip punctuation with a simple regex tokenizer (re.findall(r"[A-Za-z0-9']+", text.lower())), matching the .split()-based tokenization used in the notebook closely enough for teaching purposes. CountVectorizer/TfidfVectorizer use scikit-learn's own tokenizer internally, as in the notebook.
  • Word2Vec/FastText need a handful of sentences with repeated/shared words to produce meaningful similarities. Very short or fully disjoint corpora will train, but the similarity scores won't mean much.
  • This is a teaching tool, not a production service: there's no auth, rate limiting, or persistence, and debug=True should be turned off if you ever deploy it anywhere other than your own machine.
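To make the tokenizer note concrete, the short sketch below compares the regex quoted above with a plain .split() on a made-up sentence. The function name and sample text are illustrative only.

```python
import re


def regex_tokenize(text):
    # Same pattern as quoted in the notes: letters, digits and apostrophes.
    return re.findall(r"[A-Za-z0-9']+", text.lower())


sample = "Don't panic: NLP is fun, isn't it?"

regex_tokens = regex_tokenize(sample)
split_tokens = sample.lower().split()

print(regex_tokens)  # ["don't", 'panic', 'nlp', 'is', 'fun', "isn't", 'it']
print(split_tokens)  # ["don't", 'panic:', 'nlp', 'is', 'fun,', "isn't", 'it?']
```

The two agree on word boundaries for this sentence. The regex additionally drops the punctuation that .split() leaves attached to tokens, while keeping apostrophes inside contractions.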