	---
	title: Text Vectorization Lab
	emoji: 🧮
	colorFrom: yellow
	colorTo: indigo
	sdk: docker
	app_port: 7860
	pinned: false
	---

	# Text Vectorization Lab

	A step-by-step, interactive simulation of every technique covered in the
	Text Vectorization in NLP deck and notebook:

	- One-Hot Encoding
	- Count Vectorizer
	- Bag-of-Words
	- N-grams
	- TF-IDF Vectorizer
	- Word Embeddings (Word2Vec — Skip-gram & CBOW — and FastText)

	It's a single small web app: a Flask backend (`app.py`) that actually
	runs the `scikit-learn` / `numpy` / `gensim` code from the reference
	notebook on whatever text you type in, and a **vanilla HTML/CSS/JS
	frontend** that calls that backend and reveals the result stage by stage —
	tokenize → vocabulary → vectors — along an animated pipeline tape. There is
	no precomputed/fake data: every matrix, score, and embedding on the page is
	computed live by the Python backend for the exact text you entered.
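
The tokenize → vocabulary → vectors flow can be sketched in a few lines of plain Python. This is only an illustration of the three stages, not the code in `app.py`: the function names are made up for this example, the regex is the one quoted in the notes section, and the real backend relies on `scikit-learn` / `numpy` / `gensim` rather than hand-written counting.

```python
import re
from collections import Counter


def tokenize(text):
    """Stage 1: lower-case the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text.lower())


def build_vocabulary(docs_tokens):
    """Stage 2: map each distinct token to a column index, in alphabetical order."""
    vocab = sorted({tok for tokens in docs_tokens for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}


def count_vectors(docs_tokens, vocab):
    """Stage 3: one row per document, one column per vocabulary word."""
    rows = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows


# Hypothetical two-sentence corpus, chosen only for this example.
docs = ["The cat sat on the mat", "The dog sat on the log"]
tokens = [tokenize(d) for d in docs]
vocab = build_vocabulary(tokens)
matrix = count_vectors(tokens, vocab)

print(list(vocab))  # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(matrix)       # [[1, 0, 0, 1, 1, 1, 2], [0, 1, 1, 0, 1, 1, 2]]
```

Each stage's output is what the corresponding step on the pipeline tape would display: the token lists, then the vocabulary, then the count matrix.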

	## 1. Setup

	Requires Python 3.9+.

	```bash
	cd text-vectorization-lab
	python -m venv venv
	source venv/bin/activate # Windows: venv\Scripts\activate
	pip install -r requirements.txt
	```

	## 2. Run

	```bash
	python app.py
	```

	Then open http://localhost:5000 in your browser.

	The dev server runs with `debug=True`, so it auto-reloads whenever you
	edit `app.py` or the templates/static files.

	## 3. How it's wired

	```
	text-vectorization-lab/
	├── app.py               # Flask app + all vectorization logic (the "backend")
	├── requirements.txt
	├── templates/
	│   └── index.html       # Single-page app shell, one <section> per technique
	└── static/
	    ├── css/style.css    # Design system (dark "lab" theme, pipeline tape, matrices)
	    └── js/main.js       # Fetches /api/* and animates the step-by-step reveal
	```

	Each technique has its own REST endpoint:

	| Endpoint | Mirrors notebook section |
	|---|---|
	| `POST /api/onehot` | One-Hot Encoding (manual NumPy + `OneHotEncoder` cross-check) |
	| `POST /api/count-vectorizer` | `CountVectorizer` (vocabulary, matrix, stop words, `max_features`, `.transform()` on new text) |
	| `POST /api/bow` | Hand-rolled Bag-of-Words counter + binary BoW + cosine similarity |
	| `POST /api/ngrams` | Manual n-gram generator + `CountVectorizer(ngram_range=...)` + char-level n-grams |
	| `POST /api/tfidf` | Manual TF/IDF computation (sklearn-style smoothing) + `TfidfVectorizer` + top words per doc |
	| `POST /api/embeddings` | `gensim.models.Word2Vec` (Skip-gram & CBOW), similarity, most-similar, PCA-to-2D, and `FastText` for out-of-vocabulary words |
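
To make "sklearn-style smoothing" in the `/api/tfidf` row concrete, here is a minimal pure-Python sketch. It assumes scikit-learn's default settings (`smooth_idf=True`, `sublinear_tf=False`, `norm="l2"`), under which `idf(t) = ln((1 + n) / (1 + df(t))) + 1` and each row is L2-normalized. The function name and the toy corpus are made up for illustration; this is not the code in `app.py`.

```python
import math
from collections import Counter


def tfidf_rows(docs_tokens):
    """Manual TF-IDF with smoothed IDF and L2-normalized rows.

    Assumes scikit-learn's defaults: raw term counts as TF,
    idf = ln((1 + n_docs) / (1 + df)) + 1, then L2 normalization per document.
    """
    n_docs = len(docs_tokens)
    vocab = sorted({t for toks in docs_tokens for t in toks})
    # Document frequency: in how many documents each term appears.
    df = Counter(t for toks in docs_tokens for t in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    rows = []
    for toks in docs_tokens:
        tf = Counter(toks)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0  # guard against empty docs
        rows.append([x / norm for x in raw])
    return vocab, idf, rows


# Hypothetical, already-tokenized corpus.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird", "flew"]]
vocab, idf, rows = tfidf_rows(corpus)

print(idf["the"])  # appears in every document, so ln(4/4) + 1 = 1.0
print(idf["cat"])  # appears in one document, so ln(4/2) + 1
```

The `+ 1` inside the logarithm's numerator and denominator is the smoothing: it behaves as if one extra document contained every term, so no term ever divides by zero, and a word that appears in every document still keeps a weight of 1 instead of dropping to 0.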

	The frontend never computes vectors itself — it only sends your raw text
	to these endpoints and renders whatever comes back, so the numbers you see
	are always whatever scikit-learn/gensim actually produce.
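
The same endpoints can be called from a script instead of the browser. The sketch below uses only the standard library and assumes the dev server is running at `http://localhost:5000`. The request body shape (a JSON object with a `"text"` key) is a guess made for illustration, as are the helper names; check `app.py` or `static/js/main.js` for the fields each endpoint actually expects.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # local dev address from the Run section


def build_request(endpoint, text):
    """Build a JSON POST request for one of the /api/* endpoints.

    ASSUMPTION: the payload is {"text": ...}; the real field name may differ.
    """
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}{endpoint}",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_endpoint(endpoint, text, timeout=30):
    """Send the request and decode the JSON response (needs the server running)."""
    with urllib.request.urlopen(build_request(endpoint, text), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request does not touch the network, so it is safe to inspect.
req = build_request("/api/tfidf", "the cat sat on the mat")
print(req.get_method(), req.full_url)
```

With the server up, `call_endpoint("/api/tfidf", "your text here")` would return whatever JSON the backend produces for that text.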

	## 4. Using it

	Every panel has a textarea pre-filled with the same example corpus used in
	the notebook, plus a **Run** button. Edit the text, click **Run**, and watch
	the pipeline tape light up stage by stage as the backend tokenizes,
	builds the vocabulary, and fills in the matrix or vectors. **Reset to
	example** restores the original notebook text for that panel.

	The Word Embeddings panel trains a real Word2Vec + FastText model on your
	sentences on every run, so it takes a couple of seconds — that's genuine
	training time, not a fake delay.

	## 5. Notes & limits

	- All endpoints lower-case the text and strip punctuation (apart from
	apostrophes) with a simple regex tokenizer
	(`re.findall(r"[A-Za-z0-9']+", text.lower())`), which matches the
	`.split()`-based tokenization used in the notebook closely enough for
	teaching purposes. `CountVectorizer`/`TfidfVectorizer` use scikit-learn's
	own tokenizer internally, as in the notebook.
	- Word2Vec/FastText need a handful of sentences with repeated/shared words
	to produce meaningful similarities — very short or fully disjoint corpora
	will train but the similarity scores won't mean much.
	- This is a teaching tool, not a production service: there's no auth, rate
	limiting, or persistence, and `debug=True` should be turned off if you
	ever deploy it anywhere other than your own machine.
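
To see how the regex tokenizer from the first note differs from a plain `.split()`, compare the two on a sentence with punctuation. The regex is the one quoted above; the helper names and the sample sentence are made up, and lower-casing before `.split()` is an assumption added here so the comparison is like for like.

```python
import re


def regex_tokenize(text):
    """Tokenizer quoted in the notes: letters, digits and apostrophes only."""
    return re.findall(r"[A-Za-z0-9']+", text.lower())


def split_tokenize(text):
    """Whitespace split; punctuation stays glued to the neighbouring word."""
    return text.lower().split()


sample = "The cat's hat, the CAT!"
regex_tokens = regex_tokenize(sample)
split_tokens = split_tokenize(sample)

print(regex_tokens)  # ['the', "cat's", 'hat', 'the', 'cat']
print(split_tokens)  # ['the', "cat's", 'hat,', 'the', 'cat!']
```

On punctuation-free text the two give the same tokens. They diverge only where punctuation touches a word, which is where `.split()` would otherwise create spurious vocabulary entries such as `hat,` and `cat!`.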