<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Text Vectorization Lab</title>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link href="https://fonts.googleapis.com/css2?family=Fraunces:opsz,wght@9..144,500;9..144,600;9..144,700&family=Inter:wght@400;500;600;700&family=JetBrains+Mono:wght@400;500;600&display=swap" rel="stylesheet">
<link rel="stylesheet" href="{{ url_for('static', filename='css/style.css') }}">
</head>
<body>
<div class="app">
<!-- ============================================================ SIDEBAR -->
<aside class="sidebar">
<div class="brand">
<div class="brand-mark">tv</div>
<div class="brand-text">
<div class="brand-title">Vectorization Lab</div>
<div class="brand-sub">live sklearn · gensim backend</div>
</div>
</div>
<div class="nav-group-label">Start here</div>
<button class="nav-item active" data-target="overview"><span class="nav-idx">00</span> Overview</button>
<div class="nav-group-label">Sparse techniques</div>
<button class="nav-item" data-target="onehot"><span class="nav-idx">01</span> One-Hot Encoding</button>
<button class="nav-item" data-target="count"><span class="nav-idx">02</span> Count Vectorizer</button>
<button class="nav-item" data-target="bow"><span class="nav-idx">03</span> Bag-of-Words</button>
<button class="nav-item" data-target="ngrams"><span class="nav-idx">04</span> N-grams</button>
<button class="nav-item" data-target="tfidf"><span class="nav-idx">05</span> TF-IDF</button>
<div class="nav-group-label">Dense techniques</div>
<button class="nav-item" data-target="embeddings"><span class="nav-idx">06</span> Word Embeddings</button>
<div class="sidebar-foot">
<span class="pulse-dot"></span>Flask + scikit-learn + gensim run every result below in real time — nothing is precomputed.
</div>
</aside>
<!-- ============================================================ MAIN -->
<main class="main">
<!-- ===================================================== OVERVIEW -->
<section class="section active" id="sec-overview">
<div class="page-header">
<div class="eyebrow">Text Vectorization Lab</div>
<h1 class="page-title">Watch text turn into numbers, one step at a time.</h1>
<p class="page-desc">Every technique from the deck (One-Hot Encoding, Count Vectorizer, Bag-of-Words, N-grams, TF-IDF and Word2Vec/FastText embeddings) runs here against a real Python backend. Type your own sentences, hit run, and the API tokenizes the text, builds the vocabulary, and constructs the vectors live, the same way the reference notebook does it.</p>
</div>
<div class="tech-grid">
<button class="tech-card" data-target="onehot">
<div class="n">01 · sparse · binary</div>
<h4>One-Hot Encoding</h4>
<p>Every word gets its own slot. One 1, the rest 0s, vector length = vocabulary size.</p>
</button>
<button class="tech-card" data-target="count">
<div class="n">02 · sparse · counts</div>
<h4>Count Vectorizer</h4>
<p>Document–term matrix of raw word frequencies, scikit-learn style.</p>
</button>
<button class="tech-card" data-target="bow">
<div class="n">03 · sparse · counts</div>
<h4>Bag-of-Words</h4>
<p>The idea behind Count Vectorizer — order is thrown away, frequency stays.</p>
</button>
<button class="tech-card" data-target="ngrams">
<div class="n">04 · sparse · sequences</div>
<h4>N-grams</h4>
<p>Chains of N consecutive tokens, recovering a little of the word order BoW loses.</p>
</button>
<button class="tech-card" data-target="tfidf">
<div class="n">05 · sparse · weighted</div>
<h4>TF-IDF</h4>
<p>Term frequency × inverse document frequency — common words get downweighted.</p>
</button>
<button class="tech-card" data-target="embeddings">
<div class="n">06 · dense · learned</div>
<h4>Word Embeddings</h4>
<p>Word2Vec (CBOW / Skip-gram) and FastText learn dense vectors that capture meaning.</p>
</button>
</div>
<div class="card" style="margin-top:26px;">
<div class="card-head"><div class="card-title">How this lab is wired</div></div>
<div class="step-body">
<p style="margin-top:0">Each panel sends your text to a Flask endpoint in <code>app.py</code>. That endpoint runs the exact <code>scikit-learn</code> / <code>numpy</code> / <code>gensim</code> calls from the reference notebook — <code>CountVectorizer</code>, <code>TfidfVectorizer</code>, <code>OneHotEncoder</code>, a hand-rolled Bag-of-Words counter, an N-gram generator, and a <code>gensim.models.Word2Vec</code>/<code>FastText</code> trainer — and streams the intermediate results back as JSON. The page then reveals those results stage by stage along the pipeline tape at the top of each panel, so you can see tokenization happen before the vocabulary appears, and the vocabulary settle before the vectors fill in.</p>
</div>
</div>
</section>
<!-- ===================================================== ONE-HOT -->
<section class="section" id="sec-onehot">
<div class="page-header">
<div class="eyebrow">Step 01 / Sparse · Binary</div>
<h1 class="page-title">One-Hot Encoding</h1>
<p class="page-desc">Every unique word in the vocabulary gets one dedicated position in the vector. A word's vector is all zeros except a single 1 at its own index — so the vector length always equals the vocabulary size.</p>
</div>
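<!--
Reference sketch for maintainers (not rendered): a minimal, standard-library-only
version of the one-hot steps this panel displays. It assumes a simplified tokenizer
(lowercase, split on whitespace) and a hypothetical helper name, one_hot_encode;
the real endpoint in app.py may tokenize and order the vocabulary differently.

```python
def one_hot_encode(sentences):
    """Return (vocab, vectors); vectors[i][j] is the one-hot row for token j of sentence i."""
    # Assumed tokenizer: lowercase, then split on whitespace.
    tokenized = [s.lower().split() for s in sentences]
    # Sorted unique tokens give every word a stable index.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        rows = []
        for tok in toks:
            row = [0] * len(vocab)   # vector length equals vocabulary size
            row[index[tok]] = 1      # a single 1 at the word's own index
            rows.append(row)
        vectors.append(rows)
    return vocab, vectors


corpus = ["I love NLP", "NLP is fun", "I love coding"]
vocab, vectors = one_hot_encode(corpus)
print(vocab)       # ['coding', 'fun', 'i', 'is', 'love', 'nlp']
print(vectors[0])  # [[0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1]]
```

Sorting the vocabulary is a design choice made here only so that indices are
reproducible between runs.
-->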
<div class="tape" id="tape-onehot">
<div class="tape-stage" data-stage="0"><span class="dot"></span> Raw sentences</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="1"><span class="dot"></span> Tokenize</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="2"><span class="dot"></span> Build vocabulary</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="3"><span class="dot"></span> One-hot vectors</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="4"><span class="dot"></span> sklearn check</div>
</div>
<div class="card">
<div class="card-head"><div class="card-title">Input corpus</div><div class="card-note">one sentence per line</div></div>
<div class="field-row">
<div class="field" style="flex:1 1 100%;">
<label>Sentences</label>
<textarea id="onehot-corpus">I love NLP
NLP is fun
I love coding</textarea>
</div>
</div>
<div class="btn-row">
<button class="btn" data-run="onehot">Run one-hot encoding</button>
<button class="btn btn-ghost btn-sm" data-reset="onehot">Reset to example</button>
</div>
</div>
<div class="steps" id="out-onehot"></div>
</section>
<!-- ===================================================== COUNT VECTORIZER -->
<section class="section" id="sec-count">
<div class="page-header">
<div class="eyebrow">Step 02 / Sparse · Counts</div>
<h1 class="page-title">Count Vectorizer</h1>
<p class="page-desc">Builds a vocabulary from the whole corpus, then counts how many times each vocabulary word appears in every document — a document–term frequency matrix. Word order is discarded.</p>
</div>
<div class="tape" id="tape-count">
<div class="tape-stage" data-stage="0"><span class="dot"></span> Raw corpus</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="1"><span class="dot"></span> Tokenize docs</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="2"><span class="dot"></span> Fit vocabulary</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="3"><span class="dot"></span> Count matrix</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="4"><span class="dot"></span> Transform new doc</div>
</div>
<div class="card">
<div class="card-head"><div class="card-title">Input corpus</div><div class="card-note">one document per line</div></div>
<div class="field-row">
<div class="field" style="flex:1 1 100%;">
<label>Documents</label>
<textarea id="count-corpus">I love NLP and I love Python
NLP is amazing and fun
Python is great for NLP</textarea>
</div>
</div>
<div class="field-row">
<div class="field" style="max-width:220px;">
<label>Max features (optional)</label>
<input type="number" id="count-maxfeatures" placeholder="e.g. 5" min="1">
</div>
<div class="field" style="flex:2;">
<label>New document to transform (optional)</label>
<input type="text" id="count-newdoc" placeholder="e.g. NLP and Python are great">
</div>
<div class="field" style="max-width:180px; display:flex; align-items:flex-end;">
<label class="checkbox-row" style="margin-bottom:10px;"><input type="checkbox" id="count-stopwords"> remove English stop words</label>
</div>
</div>
<div class="btn-row">
<button class="btn" data-run="count">Run count vectorizer</button>
<button class="btn btn-ghost btn-sm" data-reset="count">Reset to example</button>
</div>
</div>
<div class="steps" id="out-count"></div>
</section>
<!-- ===================================================== BAG OF WORDS -->
<section class="section" id="sec-bow">
<div class="page-header">
<div class="eyebrow">Step 03 / Sparse · Counts</div>
<h1 class="page-title">Bag-of-Words</h1>
<p class="page-desc">Bag-of-Words is the <em>concept</em>: text treated as an unordered bag of words where only frequency matters. Count Vectorizer is simply scikit-learn's implementation of that idea. Below: a from-scratch BoW counter, a binary (presence/absence) variant, and the pairwise cosine similarity between the resulting document vectors.</p>
</div>
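<!--
Reference sketch for maintainers (not rendered): one way the three outputs named
above (count matrix, binary variant, cosine similarity) could be computed from
scratch. The helper names bag_of_words and cosine are hypothetical, and the
whitespace tokenizer is an assumption; app.py may do this differently.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocab, matrix) where matrix[d][j] counts vocab[j] in document d."""
    tokenized = [d.lower().split() for d in docs]  # assumed tokenizer
    vocab = sorted({t for toks in tokenized for t in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)  # a missing word counts as 0
        matrix.append([counts[w] for w in vocab])
    return vocab, matrix


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and the dog are friends",
]
vocab, bow = bag_of_words(docs)
# Binary BoW keeps only presence/absence.
binary = [[1 if c else 0 for c in row] for row in bow]
sim_01 = cosine(bow[0], bow[1])
print(vocab)
print(bow[0])
print(round(sim_01, 2))  # 0.75
```

The first two documents share "on", "sat" and two occurrences of "the", which is
why their count vectors come out so similar even though they describe different
animals.
-->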
<div class="tape" id="tape-bow">
<div class="tape-stage" data-stage="0"><span class="dot"></span> Raw corpus</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="1"><span class="dot"></span> Tokenize</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="2"><span class="dot"></span> Vocabulary</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="3"><span class="dot"></span> BoW matrix</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="4"><span class="dot"></span> Binary BoW</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="5"><span class="dot"></span> Cosine similarity</div>
</div>
<div class="card">
<div class="card-head"><div class="card-title">Input corpus</div><div class="card-note">one document per line</div></div>
<div class="field-row">
<div class="field" style="flex:1 1 100%;">
<label>Documents</label>
<textarea id="bow-corpus">the cat sat on the mat
the dog sat on the log
the cat and the dog are friends</textarea>
</div>
</div>
<div class="btn-row">
<button class="btn" data-run="bow">Run bag-of-words</button>
<button class="btn btn-ghost btn-sm" data-reset="bow">Reset to example</button>
</div>
</div>
<div class="steps" id="out-bow"></div>
</section>
<!-- ===================================================== N-GRAMS -->
<section class="section" id="sec-ngrams">
<div class="page-header">
<div class="eyebrow">Step 04 / Sparse · Sequences</div>
<h1 class="page-title">N-grams</h1>
<p class="page-desc">An N-gram is a run of N consecutive tokens. Unigrams (N=1) are plain BoW; bigrams and trigrams keep a sliver of local word order that BoW throws away — at the cost of a much bigger vocabulary.</p>
</div>
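<!--
Reference sketch for maintainers (not rendered): a sliding-window N-gram
generator matching the definition above. The function name ngrams is
hypothetical and the whitespace split is an assumed tokenizer; the backend's
own generator may differ.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (empty list if n > len(tokens))."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "I love studying Natural Language Processing"
tokens = sentence.split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(len(unigrams), len(bigrams), len(trigrams))  # 6 5 4
print(bigrams[0])    # ('I', 'love')
print(trigrams[-1])  # ('Natural', 'Language', 'Processing')
```

A sentence of T tokens yields T - N + 1 N-grams, so each sentence produces fewer
N-grams as N grows, while the number of distinct N-grams across a corpus (the
vocabulary) tends to grow.
-->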
<div class="tape" id="tape-ngrams">
<div class="tape-stage" data-stage="0"><span class="dot"></span> Sentence</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="1"><span class="dot"></span> Unigrams</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="2"><span class="dot"></span> Bigrams</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="3"><span class="dot"></span> Trigrams</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="4"><span class="dot"></span> N-gram matrices</div>
</div>
<div class="card">
<div class="card-head"><div class="card-title">Input</div></div>
<div class="field-row">
<div class="field" style="flex:1 1 100%;">
<label>Sentence for manual N-grams</label>
<input type="text" id="ngrams-sentence" value="I love studying Natural Language Processing">
</div>
</div>
<div class="field-row">
<div class="field" style="flex:1 1 100%;">
<label>Corpus for N-gram matrices (one document per line)</label>
<textarea id="ngrams-corpus">I love NLP and machine learning
machine learning is part of AI
NLP is a branch of AI</textarea>
</div>
</div>
<div class="btn-row">
<button class="btn" data-run="ngrams">Generate N-grams</button>
<button class="btn btn-ghost btn-sm" data-reset="ngrams">Reset to example</button>
</div>
</div>
<div class="steps" id="out-ngrams"></div>
</section>
<!-- ===================================================== TF-IDF -->
<section class="section" id="sec-tfidf">
<div class="page-header">
<div class="eyebrow">Step 05 / Sparse · Weighted</div>
<h1 class="page-title">TF-IDF Vectorizer</h1>
<p class="page-desc">Term Frequency × Inverse Document Frequency. Words that show up in almost every document (like "the" or "is") get pulled down; words that are frequent in one document but rare across the corpus get pushed up.</p>
</div>
<div class="tape" id="tape-tfidf">
<div class="tape-stage" data-stage="0"><span class="dot"></span> Raw corpus</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="1"><span class="dot"></span> Term frequency (TF)</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="2"><span class="dot"></span> Inverse doc. freq. (IDF)</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="3"><span class="dot"></span> TF × IDF</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="4"><span class="dot"></span> sklearn matrix</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="5"><span class="dot"></span> Top words / doc</div>
</div>
<div class="formula" style="margin-bottom:22px;">
TF-IDF(t, d) &nbsp;=&nbsp; <span class="dim">[ count(t, d) / total words in d ]</span> &nbsp;×&nbsp; <span class="dim">log( N / (1 + df(t)) )</span>
</div>
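<!--
Reference sketch for maintainers (not rendered): the formula above written out
in plain Python. Assumptions: log is the natural logarithm, tokens come from a
lowercase whitespace split, and tf_idf is a hypothetical helper name. With its
default settings, scikit-learn's TfidfVectorizer uses raw counts for TF, a
smoothed IDF of ln((1 + N) / (1 + df)) + 1, and L2-normalized rows, so its
numbers will not match this hand calculation.

```python
import math


def tf_idf(docs):
    """Return (idf, scores); scores[d] maps each word of document d to its TF-IDF weight."""
    tokenized = [d.lower().split() for d in docs]  # assumed tokenizer
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(tokenized)
    # df(t): number of documents that contain t at least once.
    df = {w: sum(1 for toks in tokenized if w in toks) for w in vocab}
    # IDF as displayed on the page: log(N / (1 + df(t))).
    idf = {w: math.log(n_docs / (1 + df[w])) for w in vocab}
    scores = []
    for toks in tokenized:
        total = len(toks)
        # TF as displayed on the page: count(t, d) / total words in d.
        scores.append({w: (toks.count(w) / total) * idf[w] for w in set(toks)})
    return idf, scores


docs = [
    "I love NLP and machine learning",
    "machine learning is part of AI",
    "NLP is a branch of AI",
    "I love AI and deep learning",
]
idf, scores = tf_idf(docs)
print(idf["ai"])    # 0.0, because "ai" is in 3 of 4 documents: log(4 / 4)
print(idf["deep"])  # log(4 / 2), about 0.693
print(scores[3]["deep"])
```

Note that with this variant a word present in every document gets a slightly
negative IDF, log(N / (N + 1)), which is one reason smoothed formulas are common
in libraries.
-->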
<div class="card">
<div class="card-head"><div class="card-title">Input corpus</div><div class="card-note">one document per line</div></div>
<div class="field-row">
<div class="field" style="flex:1 1 100%;">
<label>Documents</label>
<textarea id="tfidf-corpus">I love NLP and machine learning
machine learning is part of AI
NLP is a branch of AI
I love AI and deep learning</textarea>
</div>
</div>
<div class="btn-row">
<button class="btn" data-run="tfidf">Run TF-IDF</button>
<button class="btn btn-ghost btn-sm" data-reset="tfidf">Reset to example</button>
</div>
</div>
<div class="steps" id="out-tfidf"></div>
</section>
<!-- ===================================================== EMBEDDINGS -->
<section class="section" id="sec-embeddings">
<div class="page-header">
<div class="eyebrow">Step 06 / Dense · Learned</div>
<h1 class="page-title">Word Embeddings — Word2Vec &amp; FastText</h1>
<p class="page-desc">Instead of counting, embeddings <em>learn</em> dense, low-dimensional vectors from context: words used in similar contexts end up with similar vectors. Pressing the button trains real <code>gensim</code> Word2Vec (Skip-gram &amp; CBOW) and FastText models on your sentences, on the backend, at that moment.</p>
</div>
<div class="tape" id="tape-embeddings">
<div class="tape-stage" data-stage="0"><span class="dot"></span> Training sentences</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="1"><span class="dot"></span> Train Word2Vec</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="2"><span class="dot"></span> Similarity</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="3"><span class="dot"></span> Most similar</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="4"><span class="dot"></span> PCA plot</div>
<div class="tape-arrow"></div>
<div class="tape-stage" data-stage="5"><span class="dot"></span> FastText OOV</div>
</div>
<div class="card">
<div class="card-head"><div class="card-title">Training sentences</div><div class="card-note">one per line · trains live, takes a couple of seconds</div></div>
<div class="field-row">
<div class="field" style="flex:1 1 100%;">
<textarea id="embed-sentences" style="min-height:140px;">the cat sat on the mat
the dog ran on the grass
cats and dogs are pets
i love my cat
i love my dog
king and queen are royalty
man and woman are humans
paris is the capital of france
berlin is the capital of germany</textarea>
</div>
</div>
<div class="btn-row">
<button class="btn" data-run="embeddings">Train Word2Vec + FastText</button>
<button class="btn btn-ghost btn-sm" data-reset="embeddings">Reset to example</button>
</div>
</div>
<div class="steps" id="out-embeddings"></div>
</section>
<footer class="colophon">
Text Vectorization Lab — built from the “Text Vectorization in NLP” slide deck &amp; notebook. Backend: Flask · scikit-learn · numpy · gensim. Frontend: vanilla HTML/CSS/JS, no build step.
</footer>
</main>
</div>
<script src="{{ url_for('static', filename='js/main.js') }}"></script>
</body>
</html>