| # Contextual Similarity Engine: HOWTO |
|
|
| ## Overview |
|
|
| This project uses **transformer-based sentence embeddings** to find and compare |
| contextual meanings of keywords within large documents. Unlike Word2Vec (static, |
| one-vector-per-word), this system **fine-tunes on YOUR corpus** so it learns |
| domain-specific patterns, e.g. that "pizza" means "school" in your data. |
|
|
| A **Word2Vec (gensim) baseline** is included for comparison, demonstrating why |
| contextual embeddings are superior for meaning disambiguation. |
|
|
| **The pipeline is: TRAIN → INDEX → ANALYZE → EVALUATE.** |
|
|
| **Stack:** |
| - **SentenceTransformers**: contextual embeddings (PyTorch) |
| - **FAISS**: fast vector similarity search |
| - **gensim Word2Vec**: static embedding baseline for comparison |
| - **FastAPI**: REST API backend |
| - **React + TypeScript**: visualization frontend |
| - **scikit-learn**: clustering & evaluation metrics |
|
|
| --- |
|
|
| ## 1. Install Dependencies |
|
|
| ### Python backend (uv, recommended) |
|
|
| [uv](https://docs.astral.sh/uv/) is a fast Python package manager that replaces |
| `pip`, `venv`, and `requirements.txt` with a single tool and lockfile. |
|
|
| ```bash |
| # Install uv (if not already installed) |
| curl -LsSf https://astral.sh/uv/install.sh | sh |
| |
| # Create a virtual environment and install all dependencies from pyproject.toml |
| cd esfiles |
| uv sync |
| |
| # Run commands inside the managed environment |
| uv run python server.py |
| uv run python demo.py |
| ``` |
|
|
| `uv sync` reads `pyproject.toml`, resolves dependencies, creates a `.venv`, |
| and generates a `uv.lock` lockfile for reproducible installs. The lockfile |
| pins exact versions so every machine gets identical dependencies. |
|
|
| **Adding/removing packages:** |
|
|
| ```bash |
| uv add httpx # add a new dependency |
| uv remove httpx # remove it |
| uv lock --upgrade # upgrade all packages to latest compatible versions |
| ``` |
|
|
| ### Python backend (pip, alternative) |
|
|
| ```bash |
| python3 -m venv venv |
| source venv/bin/activate |
| pip install -r requirements.txt |
| ``` |
|
|
| ### React frontend |
|
|
| ```bash |
| cd frontend |
| npm install |
| ``` |
|
|
| --- |
|
|
| ## 2. Quick Start |
|
|
| ### CLI demo (Word2Vec vs Transformer comparison) |
|
|
| ```bash |
| uv run python demo.py |
| ``` |
|
|
| This runs a side-by-side comparison: |
| 1. Builds both Transformer and Word2Vec engines on the same corpus |
| 2. Compares text similarity scores between approaches |
| 3. Shows word-level similarity (Word2Vec only; the transformer engine compares passages, not isolated words) |
| 4. Runs semantic search with both engines |
| 5. Tests keyword meaning matching (does "pizza" mean food or school?) |
| 6. Demonstrates clustering (the transformer can separate meanings; Word2Vec cannot) |
|
|
| ### Web UI |
|
|
| ```bash |
| # Terminal 1: start the API server |
| uv run python server.py |
| |
| # Terminal 2: start the React dev server |
| cd frontend && npm run dev |
| ``` |
|
|
| - API docs: `http://localhost:8000/docs` |
| - Frontend: `http://localhost:5173` |
|
|
| --- |
|
|
| ## 3. Training Your Model |
|
|
| Three strategies, from simplest to most powerful: |
|
|
| ### Strategy 1: Unsupervised (TSDAE) |
|
|
| No labels needed. TSDAE learns your corpus vocabulary and phrasing by training a denoising autoencoder that reconstructs each sentence from a corrupted copy. |
|
|
| ```python |
| from pathlib import Path |
|
| from training import CorpusTrainer |
|
| # your_files: an iterable of paths to your plain-text documents |
| corpus_texts = [Path(f).read_text(encoding="utf-8") for f in your_files] |
| trainer = CorpusTrainer(corpus_texts, base_model="all-MiniLM-L6-v2") |
|
| result = trainer.train_unsupervised( |
|     output_path="./trained_model", |
|     epochs=3, |
|     batch_size=16, |
| ) |
| print(f"Trained on {result['training_pairs']} sentences in {result['seconds']}s") |
| ``` |
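To make the denoising idea concrete, here is a minimal sketch of the kind of corruption a TSDAE-style setup applies before the model is asked to reconstruct the original sentence. It assumes whitespace tokenization and random token deletion; `add_deletion_noise` and its `del_ratio` default are illustrative names, not part of `training.py`, and the noise function actually used may differ in detail.

```python
import random
from typing import Optional


def add_deletion_noise(sentence: str, del_ratio: float = 0.6, seed: Optional[int] = None) -> str:
    """Return a corrupted copy of `sentence` with roughly `del_ratio` of its tokens removed.

    Hypothetical helper for illustration; whitespace tokenization is an assumption.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    if not tokens:
        return sentence
    # Keep each token independently with probability (1 - del_ratio).
    kept = [tok for tok in tokens if rng.random() > del_ratio]
    if not kept:
        # Never hand the encoder an empty input: keep one randomly chosen token.
        kept = [rng.choice(tokens)]
    return " ".join(kept)


original = "pizza gives me homework every single day"
noisy = add_deletion_noise(original, seed=0)
# Training objective: encode `noisy`, then decode back to `original`.
print(f"noisy:    {noisy}")
print(f"original: {original}")
```

Because the target is the uncorrupted sentence itself, no labels are required; the corpus supplies both inputs and targets.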
|
|
| ### Strategy 2: Contrastive (auto-mined pairs) |
|
|
| Adjacent sentences = similar, random sentences = dissimilar. Learns document structure |
| using MultipleNegativesRankingLoss with in-batch negatives. |
|
|
| ```python |
| trainer = CorpusTrainer(corpus_texts) |
| |
| result = trainer.train_contrastive( |
|     output_path="./trained_model", |
|     epochs=5, |
|     batch_size=16, |
| ) |
| ``` |
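The "adjacent sentences = similar" rule can be sketched as follows. This is only an illustration under stated assumptions: `mine_adjacent_pairs` is a hypothetical helper, the regex sentence splitter is deliberately naive, and the mining logic inside `training.py` may work differently.

```python
import re
from typing import List, Tuple


def mine_adjacent_pairs(text: str) -> List[Tuple[str, str]]:
    """Split `text` into sentences and pair each sentence with the one that follows it."""
    # Naive splitter: break on whitespace that follows '.', '!' or '?'.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # (sentence_i, sentence_i+1) become (anchor, positive) pairs.
    return list(zip(sentences, sentences[1:]))


doc = "Pizza starts at eight. I forgot my pizza homework. The teacher was not happy."
pairs = mine_adjacent_pairs(doc)
for anchor, positive in pairs:
    print(f"{anchor!r} -> {positive!r}")
```

With MultipleNegativesRankingLoss, explicit negatives are not needed: for each anchor, the positives belonging to the other pairs in the same batch act as the dissimilar examples, which is why larger batches tend to help.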
|
|
| ### Strategy 3: Keyword-supervised (best if you know the code words) |
|
|
| You provide a keyword→meaning map. The trainer auto-generates training pairs: |
| each keyword-in-context passage paired with its meaning-substituted version, |
| plus contrastive pairs from corpus structure. |
|
|
| ```python |
| trainer = CorpusTrainer(corpus_texts) |
| |
| result = trainer.train_with_keywords( |
|     keyword_meanings={"pizza": "school", "pepperoni": "math class"}, |
|     output_path="./trained_model", |
|     epochs=5, |
|     batch_size=16, |
| ) |
| print(f"Keywords: {result['keywords']}") |
| ``` |
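As a rough picture of what "meaning-substituted" pairs look like, the sketch below builds them with a whole-word, case-insensitive replacement. `substitution_pairs` is a hypothetical helper written for this HOWTO; the real trainer may select context windows and build pairs differently.

```python
import re
from typing import Dict, List, Tuple


def substitution_pairs(
    sentences: List[str], keyword_meanings: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Pair every sentence containing a keyword with a copy where the keyword is replaced by its meaning."""
    pairs = []
    for sentence in sentences:
        for keyword, meaning in keyword_meanings.items():
            # \b keeps the match to whole words, so "pizza" does not match "pizzas".
            pattern = re.compile(rf"\b{re.escape(keyword)}\b", flags=re.IGNORECASE)
            if pattern.search(sentence):
                # A function replacement avoids treating backslashes in `meaning` as escapes.
                pairs.append((sentence, pattern.sub(lambda _m: meaning, sentence)))
    return pairs


sentences = [
    "pizza gives me homework",
    "I like long walks",
    "The pepperoni test is hard",
]
keyword_meanings = {"pizza": "school", "pepperoni": "math class"}
for original, substituted in substitution_pairs(sentences, keyword_meanings):
    print(f"{original!r} <-> {substituted!r}")
```

Training on such pairs pulls the coded sentence and its decoded counterpart toward the same region of the embedding space.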
|
|
| ### Verifying training worked |
|
|
| ```python |
| # Compare base model vs trained model on test pairs. |
| # Each test pair is (text_a, text_b, expected_similarity). |
| comparison = trainer.evaluate_model( |
|     test_pairs=[ |
|         ("pizza gives me homework", "school gives me homework", 0.95), |
|         ("pizza gives me homework", "I ate delicious pizza", 0.1), |
|         ("The pizza test is hard", "The school exam is difficult", 0.9), |
|     ], |
|     trained_model_path="./trained_model", |
| ) |
| |
| print(f"Base error: {comparison['summary']['avg_base_error']:.4f}") |
| print(f"Trained error: {comparison['summary']['avg_trained_error']:.4f}") |
| print(f"Reduction: {comparison['summary']['error_reduction_pct']:.1f}%") |
| print(f"Improved: {comparison['summary']['improved']}/{comparison['summary']['total']}") |
| ``` |
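One plausible reading of the summary fields printed above is sketched here. It assumes the per-pair error is the absolute difference between the model's similarity score and the expected score; that definition, the helper name `summarize_errors`, and the hand-picked scores are assumptions for illustration, not output from `evaluate_model`.

```python
from typing import Dict, List


def summarize_errors(
    expected: List[float], base_scores: List[float], trained_scores: List[float]
) -> Dict[str, float]:
    """Aggregate per-pair absolute errors into the summary fields used above."""
    base_err = [abs(b - e) for b, e in zip(base_scores, expected)]
    trained_err = [abs(t - e) for t, e in zip(trained_scores, expected)]
    avg_base = sum(base_err) / len(base_err)
    avg_trained = sum(trained_err) / len(trained_err)
    return {
        "avg_base_error": avg_base,
        "avg_trained_error": avg_trained,
        # Relative drop in average error; guarded against a zero base error.
        "error_reduction_pct": 100.0 * (avg_base - avg_trained) / avg_base if avg_base else 0.0,
        # Number of pairs where the trained model is strictly closer to the target.
        "improved": sum(t < b for t, b in zip(trained_err, base_err)),
        "total": len(expected),
    }


# Placeholder scores chosen by hand; they are not measured results.
expected = [0.95, 0.10, 0.90]
base = [0.55, 0.50, 0.50]
trained = [0.85, 0.20, 0.80]
summary = summarize_errors(expected, base, trained)
print(f"Reduction: {summary['error_reduction_pct']:.1f}%")
print(f"Improved:  {summary['improved']}/{summary['total']}")
```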
|
|
| --- |
|
|
| ## 4. Using Your Trained Model |
|
|
| After training, use the saved model path instead of the pretrained model name: |
|
|
| ```python |
| from contextual_similarity import ContextualSimilarityEngine |
| |
| engine = ContextualSimilarityEngine(model_name="./trained_model") |
| |
| engine.add_document("doc1", open("doc1.txt").read()) |
| engine.build_index() |
| |
| # Queries now use your domain-trained embeddings |
| results = engine.query("pizza homework", top_k=10) |
| matches = engine.match_keyword_to_meaning("pizza", [ |
|     "Italian food, restaurant, cooking", |
|     "School, education, homework and tests", |
| ]) |
| ``` |
|
|
| --- |
|
|
| ## 5. Word2Vec Baseline Comparison |
|
|
| A gensim Word2Vec engine is included to demonstrate the difference between |
| static and contextual embeddings: |
|
|
| ```python |
| from word2vec_baseline import Word2VecEngine |
|
| # docs: a dict mapping document IDs to their text, e.g. {"doc1": "..."} |
| w2v = Word2VecEngine(vector_size=100, window=5, epochs=50) |
| for doc_id, text in docs.items(): |
|     w2v.add_document(doc_id, text) |
| w2v.build_index() |
| |
| # Word-level: which words appear in similar contexts? |
| w2v.most_similar_words("pizza", top_k=5) |
| |
| # Sentence-level: averaged word vectors (lossy) |
| w2v.compare_texts("pizza gives me homework", "school gives me homework") |
| |
| # Search |
| w2v.query("a place where children learn", top_k=3) |
| ``` |
|
|
| **Key limitation:** Word2Vec gives ONE vector per word. "pizza" always has the |
| same embedding whether it means food or school. Transformers encode the full |
| surrounding context, so the same word gets different embeddings in different passages. |
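The limitation is easy to see with a toy lookup table. The sketch below is not gensim code: the 2-dimensional vectors are made up, and `average_embedding` is a hypothetical stand-in for the averaged-word-vector sentence baseline described above.

```python
from typing import Dict, List

# Toy static "word vectors"; the values are invented for illustration.
STATIC_VECTORS: Dict[str, List[float]] = {
    "pizza": [1.0, 0.0],
    "homework": [0.0, 1.0],
    "gives": [0.5, 0.5],
    "tasty": [0.75, 0.25],
}


def average_embedding(text: str) -> List[float]:
    """Embed `text` as the mean of the vectors of its known words."""
    vectors = [STATIC_VECTORS[w] for w in text.lower().split() if w in STATIC_VECTORS]
    if not vectors:
        raise ValueError("text contains no known words")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# "pizza" is looked up the same way regardless of its neighbours.
pizza_in_school_context = STATIC_VECTORS["pizza"]  # "pizza gives homework"
pizza_in_food_context = STATIC_VECTORS["pizza"]    # "tasty pizza"
print(pizza_in_school_context == pizza_in_food_context)

# Averaging also discards word order: both sentences collapse to one vector.
a = average_embedding("pizza gives homework")
b = average_embedding("homework gives pizza")
print(a, b)
```

A contextual model has no such table: the vector for a passage is computed from all of its tokens together, so the surrounding words can shift what "pizza" contributes.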
|
|
| --- |
|
|
| ## 6. Using the Web UI |
|
|
| 1. **Train Model** (start here): |
|    - Paste your corpus (documents separated by blank lines) |
|    - Choose strategy: Unsupervised, Contrastive, or Keyword-supervised |
|    - For keyword strategy, provide a JSON keyword→meaning map |
|    - Configure base model, epochs, batch size, output path |
|    - Click "Start Training"; the model trains and saves to disk |
|    - Run "Compare Models" to evaluate base vs trained |
|
|
| 2. **Setup:** |
|    - Initialize engine with your trained model path (e.g. `./trained_model`) |
|    - Add documents and build the FAISS index |
|
|
| 3. **Semantic Search:** query the corpus with trained embeddings |
| 4. **Compare Texts:** cosine similarity between any two texts |
| 5. **Keyword Analysis:** auto-cluster keyword meanings across documents |
| 6. **Keyword Matcher:** match keyword occurrences to candidate meanings |
| 7. **Batch Analysis:** multi-keyword analysis with cross-similarity matrix |
| 8. **Evaluation:** disambiguation accuracy, retrieval P@K/MRR, similarity histograms |
|
|
| --- |
|
|
| ## 7. API Endpoints |
|
|
| ### Training |
| | Method | Endpoint | Description | |
| |--------|----------|-------------| |
| | POST | `/api/train/unsupervised` | TSDAE domain adaptation | |
| | POST | `/api/train/contrastive` | Contrastive with auto-mined pairs | |
| | POST | `/api/train/keywords` | Keyword-supervised training | |
| | POST | `/api/train/evaluate` | Compare base vs trained model | |
|
|
| ### Engine |
| | Method | Endpoint | Description | |
| |--------|----------|-------------| |
| | POST | `/api/init` | Initialize engine with a model | |
| | POST | `/api/documents` | Add a document to the corpus | |
| | POST | `/api/documents/upload` | Upload a file as a document | |
| | POST | `/api/index/build` | Build FAISS index | |
| | POST | `/api/query` | Semantic search | |
| | POST | `/api/compare` | Compare two texts | |
| | POST | `/api/analyze/keyword` | Single keyword analysis | |
| | POST | `/api/analyze/batch` | Multi-keyword batch analysis | |
| | POST | `/api/match` | Match keyword to candidate meanings | |
| | GET | `/api/stats` | Corpus statistics | |
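The endpoints can be called from any HTTP client. The sketch below builds a request for `/api/query` using only the standard library. The JSON field names `query` and `top_k` are assumptions (the request schema is not documented in this HOWTO), so check the generated docs at `http://localhost:8000/docs` before relying on them.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # address used by the API server in Quick Start


def build_query_request(text: str, top_k: int = 10) -> urllib.request.Request:
    """Build (but do not send) a POST request for the semantic-search endpoint.

    The body fields `query` and `top_k` are hypothetical; verify them against /docs.
    """
    body = json.dumps({"query": text, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/api/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request("pizza homework", top_k=5)
print(request.get_method(), request.full_url)
# With the server running, send it with:
#   with urllib.request.urlopen(request) as response:
#       results = json.load(response)
```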
|
|
| ### Evaluation |
| | Method | Endpoint | Description | |
| |--------|----------|-------------| |
| | POST | `/api/eval/disambiguation` | Disambiguation accuracy | |
| | POST | `/api/eval/retrieval` | Retrieval metrics (P@K, MRR, NDCG) | |
| | GET | `/api/eval/similarity-distribution` | Pairwise similarity histogram | |
|
|
| ### Word2Vec Baseline |
| | Method | Endpoint | Description | |
| |--------|----------|-------------| |
| | POST | `/api/w2v/init` | Train Word2Vec on corpus | |
| | POST | `/api/w2v/compare` | Compare two texts (averaged word vectors) | |
| | POST | `/api/w2v/query` | Search corpus | |
| | POST | `/api/w2v/similar-words` | Find similar words | |
|
|
| --- |
|
|
| ## 8. Available Base Models |
|
|
| | Model | Dim | Size | Quality | Speed | |
| |-------|-----|------|---------|-------| |
| | `all-MiniLM-L6-v2` | 384 | ~80MB | Good | Fast | |
| | `all-mpnet-base-v2` | 768 | ~420MB | Best | Medium | |
|
|
| Start with `all-MiniLM-L6-v2` for fast iteration, upgrade to `all-mpnet-base-v2` |
| for production quality. |
|
|
| --- |
|
|
| ## 9. Evaluation Metrics |
|
|
| | Metric | What it measures | |
| |--------|-----------------| |
| | **Accuracy** | % of keyword occurrences correctly matched to their meaning | |
| **Weighted F1** | Per-class F1 (harmonic mean of precision and recall), averaged with weights proportional to class frequency | |
| **MRR** | Mean Reciprocal Rank: how early the first relevant result appears | |
| **P@K** | Precision at K: fraction of top-K results that are relevant | |
| **NDCG@K** | Normalized Discounted Cumulative Gain: ranking quality, rewarding relevant results near the top | |
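For reference, the three retrieval metrics can be computed from a ranked list of relevance flags as sketched below. This uses the standard textbook definitions with binary relevance; `evaluation.py` may handle ties, graded relevance, or short result lists differently.

```python
import math
from typing import List


def precision_at_k(relevant: List[bool], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(relevant[:k]) / k


def reciprocal_rank(relevant: List[bool]) -> float:
    """1 / rank of the first relevant result, or 0.0 if nothing relevant was returned."""
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(relevant: List[bool], k: int) -> float:
    """Binary-relevance NDCG@k: DCG of this ranking divided by the DCG of the ideal ordering."""

    def dcg(flags: List[bool]) -> float:
        return sum(1.0 / math.log2(pos + 1) for pos, flag in enumerate(flags, start=1) if flag)

    ideal = dcg(sorted(relevant, reverse=True)[:k])
    return dcg(relevant[:k]) / ideal if ideal else 0.0


# One list of flags per query, in ranked order (True = relevant result).
rankings = [[False, True, True], [True, False, False]]
mrr = sum(reciprocal_rank(r) for r in rankings) / len(rankings)
print(f"MRR:    {mrr:.3f}")
print(f"P@2:    {precision_at_k(rankings[0], 2):.3f}")
print(f"NDCG@3: {ndcg_at_k(rankings[0], 3):.3f}")
```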
|
|
| --- |
|
|
| ## 10. Tuning Parameters |
|
|
| ### Training |
|
|
| | Parameter | Default | Notes | |
| |-----------|---------|-------| |
| | `epochs` | 3-5 | More = better fit but risk overfitting | |
| | `batch_size` | 16 | Larger = faster, needs more memory. MNRL benefits from larger batches | |
| | `context_window` | 2 | (Keyword strategy) sentences around keyword to include as context | |
|
|
| ### Engine |
|
|
| | Parameter | Default | Notes | |
| |-----------|---------|-------| |
| | `chunk_size` | 512 | Characters per chunk. Larger = more context per chunk | |
| | `chunk_overlap` | 128 | Overlap prevents losing context at chunk boundaries | |
| | `batch_size` | 64 | Encoding batch size for FAISS indexing | |
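How `chunk_size` and `chunk_overlap` interact can be sketched with a plain sliding window over characters. `chunk_text` is a hypothetical helper; the engine's own chunker may, for example, snap to sentence boundaries rather than cutting at fixed offsets.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 128) -> List[str]:
    """Split `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = chunk_text("x" * 1000)
print(len(chunks), [len(c) for c in chunks])
```

With the defaults, the window advances 384 characters at a time, so a keyword near the end of one chunk also appears near the start of the next.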
|
|
| --- |
|
|
| ## 11. Computational Resources |
|
|
| | Task | CPU | GPU (CUDA/MPS) | RAM | |
| |------|-----|----------------|-----| |
| | Training (small, <1K pairs) | OK | Faster (2-5x) | 4GB+ | |
| | Training (medium, 1K-10K pairs) | Slow | Recommended | 8GB+ | |
| | Training (large, 10K+ pairs) | Very slow | Required | 16GB+ | |
| | Indexing (1K chunks) | OK | Faster | 4GB+ | |
| | Querying | Fast | N/A | 2GB+ | |
|
|
| **Minimum:** A MacBook with 8GB RAM can train small models on CPU. |
|
| **Recommended:** 16GB RAM + GPU (NVIDIA CUDA or Apple Silicon MPS). |
|
|
| --- |
|
|
| ## 12. Project Structure |
|
|
| ``` |
| esfiles/ |
| ├── pyproject.toml              # Project config & dependencies (uv) |
| ├── requirements.txt            # Fallback for pip users |
| ├── contextual_similarity.py    # Core engine: chunking, embedding, FAISS, analysis |
| ├── training.py                 # Training pipeline: 3 strategies + evaluation |
| ├── evaluation.py               # Evaluation pipeline: metrics, reports |
| ├── word2vec_baseline.py        # Gensim Word2Vec baseline for comparison |
| ├── server.py                   # FastAPI REST API |
| ├── demo.py                     # CLI demo: Word2Vec vs Transformer comparison |
| ├── HOWTO.md                    # This file |
| └── frontend/                   # React + TypeScript UI |
|     ├── package.json |
|     ├── tsconfig.json |
|     ├── vite.config.ts |
|     ├── index.html |
|     └── src/ |
|         ├── main.tsx |
|         ├── App.tsx |
|         ├── styles.css |
|         ├── types.ts |
|         ├── api.ts |
|         └── components/ |
|             ├── ScoreBar.tsx |
|             ├── StatusMessage.tsx |
|             ├── TrainingPanel.tsx |
|             ├── EngineSetup.tsx |
|             ├── SemanticSearch.tsx |
|             ├── TextCompare.tsx |
|             ├── KeywordAnalysis.tsx |
|             ├── KeywordMatcher.tsx |
|             ├── BatchAnalysis.tsx |
|             └── EvaluationDashboard.tsx |
| ``` |
|
|