metadata

title: Text Vectorization Lab
emoji: 🧮
colorFrom: yellow
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false

Text Vectorization Lab

A step-by-step, interactive simulation of every technique covered in the Text Vectorization in NLP deck and notebook:

One-Hot Encoding
Count Vectorizer
Bag-of-Words
N-grams
TF-IDF Vectorizer
Word Embeddings (Word2Vec — Skip-gram & CBOW — and FastText)

It's a single small web app with two parts. The Flask backend (app.py) runs the scikit-learn / NumPy / gensim code from the reference notebook on whatever text you type in. The vanilla HTML/CSS/JS frontend calls that backend and reveals the result stage by stage (tokenize → vocabulary → vectors) along an animated pipeline tape. Nothing is precomputed or faked: every matrix, score, and embedding on the page is computed live by the Python backend for the exact text you entered.
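
To make the tokenize → vocabulary → vectors stages concrete, here is a minimal, dependency-free sketch of one-hot encoding. It is an illustration only, not the code in app.py: the function names `tokenize` and `one_hot` are made up for this example, and plain lists stand in for the NumPy arrays the backend uses.

```python
import re


def tokenize(text):
    """Stage 1: lower-case and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text.lower())


def one_hot(text):
    """Return (tokens, vocabulary, vectors) for one piece of text."""
    tokens = tokenize(text)
    vocabulary = sorted(set(tokens))                     # stage 2: vocabulary
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for token in tokens:                                 # stage 3: one row per token
        row = [0] * len(vocabulary)
        row[index[token]] = 1
        vectors.append(row)
    return tokens, vocabulary, vectors


tokens, vocabulary, vectors = one_hot("The cat sat on the mat")
print(vocabulary)
for token, row in zip(tokens, vectors):
    print(f"{token:>4} {row}")
```

Each row contains exactly one 1, and repeated words (here "the") map to the same row.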

1. Setup

Requires Python 3.9+.

cd text-vectorization-lab
python -m venv venv
source venv/bin/activate          # Windows: venv\Scripts\activate
pip install -r requirements.txt

2. Run

python app.py

Then open http://localhost:5000 in your browser. (The hosted Space is configured with app_port: 7860 in the metadata, so the port differs there.)

The dev server runs with debug=True, so it restarts automatically when you edit app.py, and changes to the templates or static files show up on the next page refresh.

3. How it's wired

text-vectorization-lab/
├── app.py                  # Flask app + all vectorization logic (the "backend")
├── requirements.txt
├── templates/
│   └── index.html          # Single-page app shell, one <section> per technique
└── static/
    ├── css/style.css       # Design system (dark "lab" theme, pipeline tape, matrices)
    └── js/main.js          # Fetches /api/* and animates the step-by-step reveal

Each technique has its own REST endpoint:

| Endpoint | Mirrors notebook section |
| --- | --- |
| `POST /api/onehot` | One-Hot Encoding (manual NumPy + `OneHotEncoder` cross-check) |
| `POST /api/count-vectorizer` | `CountVectorizer` (vocabulary, matrix, stop words, `max_features`, `.transform()` on new text) |
| `POST /api/bow` | Hand-rolled Bag-of-Words counter + binary BoW + cosine similarity |
| `POST /api/ngrams` | Manual n-gram generator + `CountVectorizer(ngram_range=...)` + char-level n-grams |
| `POST /api/tfidf` | Manual TF/IDF computation (sklearn-style smoothing) + `TfidfVectorizer` + top words per doc |
| `POST /api/embeddings` | `gensim.models.Word2Vec` (Skip-gram & CBOW), similarity, most-similar, PCA-to-2D, and `FastText` for out-of-vocabulary words |
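
As an illustration of what "manual TF/IDF computation (sklearn-style smoothing)" can look like, the sketch below uses the formulas that `TfidfVectorizer` applies by default: raw term counts, idf = ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. It is a standard-library approximation, not the code in app.py, and the function name `tfidf` is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text.lower())


def tfidf(documents):
    """Return (vocabulary, idf, matrix) using sklearn-style smoothed idf."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    n_docs = len(documents)
    # Document frequency: in how many documents each token appears.
    df = Counter(token for doc in tokenized for token in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid dividing by zero
        matrix.append([v / norm for v in row])
    return vocabulary, idf, matrix


docs = ["the cat sat", "the dog sat", "the bird flew"]
vocabulary, idf, matrix = tfidf(docs)
for term in vocabulary:
    print(f"{term:>5} idf={idf[term]:.3f}")
```

A word that appears in every document ("the") gets the minimum idf of 1.0 rather than 0, which is the effect of the "+ 1" term. The numbers agree with `TfidfVectorizer` only when both sides tokenize the text the same way.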

The frontend never computes vectors itself — it only sends your raw text to these endpoints and renders whatever comes back, so the numbers you see are always whatever scikit-learn/gensim actually produce.
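
The same endpoints can in principle be called from a script instead of the browser. The sketch below is hedged: the endpoint paths and the base URL come from this README, but the `{"text": ...}` request body and the JSON response are assumptions, so check app.py or main.js for the real payload shape. The helper names `build_request` and `post_text` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # local dev server address from the Run section


def build_request(endpoint, text, base_url=BASE_URL):
    """Build a JSON POST for /api/<endpoint>.

    Assumption: the backend accepts a body of the form {"text": "..."}.
    """
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/{endpoint}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_text(endpoint, text):
    """Send the request and decode the reply (assumed to be JSON).

    Requires the server from `python app.py` to be running.
    """
    with urllib.request.urlopen(build_request(endpoint, text), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request does not touch the network, so it is safe to inspect.
request = build_request("tfidf", "the cat sat on the mat")
print(request.get_method(), request.full_url)
```

With the server running, `post_text("tfidf", "the cat sat on the mat")` would return whatever the backend sends back for that text.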

4. Using it

Every panel has a textarea pre-filled with the same example corpus used in the notebook, plus a Run button. Edit the text, click Run, and watch the pipeline tape light up stage by stage as the backend tokenizes, builds the vocabulary, and fills in the matrix or vectors. "Reset to example" restores the original notebook text for that panel.
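
For a multi-document panel such as Bag-of-Words, the three stages can be sketched as below. This is a dependency-free approximation rather than the backend's code, and it assumes the textarea holds one document per line, which this README does not confirm. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Stage 1: lower-case and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text.lower())


def bag_of_words(documents):
    """Stages 2 and 3: sorted vocabulary, then one count row per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


def cosine(a, b):
    """Cosine similarity of two count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Assumption: one document per line of the textarea.
corpus = "I love NLP\nI love machine learning\nNLP is fun"
documents = [line for line in corpus.splitlines() if line.strip()]

vocabulary, matrix = bag_of_words(documents)
# Binary BoW records presence or absence instead of counts.
binary = [[1 if count else 0 for count in row] for row in matrix]

print(vocabulary)
for row in matrix:
    print(row)
print(round(cosine(matrix[0], matrix[1]), 3))
```

The first two documents share "i" and "love", so their cosine similarity is higher than that of the first and third, which share only "nlp".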

The Word Embeddings panel trains a real Word2Vec + FastText model on your sentences on every run, so it takes a couple of seconds — that's genuine training time, not a fake delay.
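
The difference between the two Word2Vec modes (in gensim, `sg=1` selects Skip-gram and `sg=0` selects CBOW) comes down to how training examples are formed from a sliding window. The sketch below only illustrates that idea in plain Python. It is not gensim's implementation, which also randomizes window sizes and trains a neural model on these examples, and the function names are hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each center word predicts every word in its window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: the whole window of context words predicts the center word."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:
            examples.append((context, center))
    return examples


sentence = ["the", "cat", "sat", "on", "the", "mat"]

for center, context_word in skipgram_pairs(sentence, window=1):
    print(f"skip-gram: {center} -> {context_word}")
for context, center in cbow_examples(sentence, window=1):
    print(f"cbow: {context} -> {center}")
```

Skip-gram yields many (center, context) pairs per sentence, while CBOW yields one example per position, so even this toy sentence shows why short corpora give the models little to learn from.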

5. Notes & limits

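The manual (hand-rolled) parts of every endpoint lower-case the text and strip punctuation with a simple regex tokenizer (re.findall(r"[A-Za-z0-9']+", text.lower())). This matches the .split()-based tokenization used in the notebook closely enough for teaching purposes, although .split() leaves punctuation attached to words and the regex does not. CountVectorizer/TfidfVectorizer use scikit-learn's own tokenizer internally, as in the notebook; its default token pattern ignores single-character tokens, so those results can differ slightly from the manual ones.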
Word2Vec/FastText need a handful of sentences with repeated/shared words to produce meaningful similarities — very short or fully disjoint corpora will train but the similarity scores won't mean much.
This is a teaching tool, not a production service: there's no auth, rate limiting, or persistence, and debug=True should be turned off if you ever deploy it anywhere other than your own machine.
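
To see how the tokenizers mentioned in the first note differ, the sketch below runs the README's regex, a plain .split(), and scikit-learn's default `token_pattern` (`(?u)\b\w\w+\b`) on the same sentence using only the `re` module. The sample sentence is made up for illustration.

```python
import re

text = "Don't panic: I love NLP, a lot!"
lowered = text.lower()

# Regex tokenizer quoted in the notes above: keeps apostrophes, drops punctuation.
regex_tokens = re.findall(r"[A-Za-z0-9']+", lowered)

# Whitespace split: punctuation stays attached to the neighbouring word.
split_tokens = lowered.split()

# scikit-learn's default token_pattern: words of two or more characters,
# so "i" and "a" disappear and "don't" is cut at the apostrophe.
sklearn_tokens = re.findall(r"(?u)\b\w\w+\b", lowered)

print("regex  :", regex_tokens)
print("split  :", split_tokens)
print("sklearn:", sklearn_tokens)
```

The three token lists differ in length, which is why a hand-rolled matrix and a `CountVectorizer` matrix for the same text can have different vocabularies.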