
Contextual Similarity Engine – HOWTO

Overview

This project uses transformer-based sentence embeddings to find and compare contextual meanings of keywords within large documents. Unlike Word2Vec (static, one vector per word), this system fine-tunes on YOUR corpus so it learns domain-specific patterns, e.g. that "pizza" means "school" in your data.

A Word2Vec (gensim) baseline is included for comparison, demonstrating why contextual embeddings are superior for meaning disambiguation.

The pipeline is: TRAIN → INDEX → ANALYZE → EVALUATE.

Stack:

  • SentenceTransformers: contextual embeddings (PyTorch)
  • FAISS: fast vector similarity search
  • gensim Word2Vec: static embedding baseline for comparison
  • FastAPI: REST API backend
  • React + TypeScript: visualization frontend
  • scikit-learn: clustering & evaluation metrics

1. Install Dependencies

Python backend (uv, recommended)

uv is a fast Python package manager that replaces pip, venv, and requirements.txt with a single tool and lockfile.

```bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a virtual environment and install all dependencies from pyproject.toml
cd esfiles
uv sync

# Run commands inside the managed environment
uv run python server.py
uv run python demo.py
```

uv sync reads pyproject.toml, resolves dependencies, creates a .venv, and generates a uv.lock lockfile for reproducible installs. The lockfile pins exact versions so every machine gets identical dependencies.

Adding/removing packages:

```bash
uv add httpx              # add a new dependency
uv remove httpx           # remove it
uv lock --upgrade         # upgrade all packages to latest compatible versions
```

Python backend (pip, alternative)

```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

React frontend

```bash
cd frontend
npm install
```

2. Quick Start

CLI demo (Word2Vec vs Transformer comparison)

```bash
uv run python demo.py
```

This runs a side-by-side comparison:

  1. Builds both Transformer and Word2Vec engines on the same corpus
  2. Compares text similarity scores between approaches
  3. Shows word-level similarity (Word2Vec only; sentence transformers embed full passages, not standalone words)
  4. Runs semantic search with both engines
  5. Tests keyword meaning matching ("pizza" → food or school?)
  6. Demonstrates clustering (transformer can separate meanings, Word2Vec cannot)

Web UI

```bash
# Terminal 1: start the API server
uv run python server.py

# Terminal 2: start the React dev server
cd frontend && npm run dev
```

  • API docs: http://localhost:8000/docs
  • Frontend: http://localhost:5173

3. Training Your Model

Three strategies, from simplest to most powerful:

Strategy 1: Unsupervised (TSDAE)

No labels needed. Learns your corpus vocabulary and phrasing via a denoising autoencoder.

```python
from training import CorpusTrainer

corpus_texts = [open(f).read() for f in your_files]
trainer = CorpusTrainer(corpus_texts, base_model="all-MiniLM-L6-v2")

result = trainer.train_unsupervised(
    output_path="./trained_model",
    epochs=3,
    batch_size=16,
)
print(f"Trained on {result['training_pairs']} sentences in {result['seconds']}s")
```
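
TSDAE trains the encoder to reconstruct a sentence from a deliberately corrupted copy. The corruption step can be pictured as random token deletion; the sketch below is purely illustrative (the deletion ratio and helper name are assumptions, not taken from training.py):

```python
import random

def corrupt(sentence: str, deletion_ratio: float = 0.6, seed: int = 0) -> str:
    """Randomly drop tokens; the encoder must reconstruct the original."""
    rng = random.Random(seed)
    tokens = sentence.split()
    kept = [t for t in tokens if rng.random() > deletion_ratio]
    return " ".join(kept) if kept else tokens[0]

print(corrupt("pizza gives me a lot of homework every week"))
```

Because reconstruction forces the encoder to infer the missing words from the survivors, the model picks up your corpus's vocabulary and phrasing without any labels.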

Strategy 2: Contrastive (auto-mined pairs)

Adjacent sentences = similar, random sentences = dissimilar. Learns document structure using MultipleNegativesRankingLoss with in-batch negatives.

```python
trainer = CorpusTrainer(corpus_texts)

result = trainer.train_contrastive(
    output_path="./trained_model",
    epochs=5,
    batch_size=16,
)
```
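
The auto-mining idea can be sketched in a few lines (a simplified illustration of "adjacent sentences = similar"; the real mining lives in training.py and may split sentences differently):

```python
import re

def mine_pairs(document: str) -> list[tuple[str, str]]:
    """Pair each sentence with its neighbor as a positive training example."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", document) if s.strip()]
    return list(zip(sentences, sentences[1:]))

doc = "Pizza starts at eight. The pepperoni lesson was hard. I forgot my homework."
for a, b in mine_pairs(doc):
    print(a, "<->", b)
```

With MultipleNegativesRankingLoss, every other sentence in the batch acts as an implicit negative for each mined pair, which is why larger batches help this strategy.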

Strategy 3: Keyword-supervised (best if you know the code words)

You provide a keyword→meaning map. The trainer auto-generates training pairs: keyword-in-context ↔ meaning-substituted version, plus contrastive pairs from corpus structure.

```python
trainer = CorpusTrainer(corpus_texts)

result = trainer.train_with_keywords(
    keyword_meanings={"pizza": "school", "pepperoni": "math class"},
    output_path="./trained_model",
    epochs=5,
    batch_size=16,
)
print(f"Keywords: {result['keywords']}")
```
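
The pair-generation step can be pictured like this (an illustrative sketch of the idea, not the trainer's actual code; the function name and 0.95 target are assumptions): each sentence containing a keyword is paired with a copy where the keyword is replaced by its meaning, labeled as highly similar.

```python
def generate_keyword_pairs(sentences, keyword_meanings):
    """For each keyword occurrence, emit (original, substituted, target_similarity)."""
    pairs = []
    for sent in sentences:
        for keyword, meaning in keyword_meanings.items():
            if keyword in sent:
                pairs.append((sent, sent.replace(keyword, meaning), 0.95))
    return pairs

pairs = generate_keyword_pairs(
    ["pizza gives me homework", "pepperoni was cancelled today"],
    {"pizza": "school", "pepperoni": "math class"},
)
for p in pairs:
    print(p)
```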

Verifying training worked

```python
# Compare base model vs trained model on test pairs
comparison = trainer.evaluate_model(
    test_pairs=[
        ("pizza gives me homework", "school gives me homework", 0.95),
        ("pizza gives me homework", "I ate delicious pizza", 0.1),
        ("The pizza test is hard", "The school exam is difficult", 0.9),
    ],
    trained_model_path="./trained_model",
)

print(f"Base error:    {comparison['summary']['avg_base_error']:.4f}")
print(f"Trained error: {comparison['summary']['avg_trained_error']:.4f}")
print(f"Reduction:     {comparison['summary']['error_reduction_pct']:.1f}%")
print(f"Improved:      {comparison['summary']['improved']}/{comparison['summary']['total']}")
```

4. Using Your Trained Model

After training, use the saved model path instead of the pretrained model name:

```python
from contextual_similarity import ContextualSimilarityEngine

engine = ContextualSimilarityEngine(model_name="./trained_model")

engine.add_document("doc1", open("doc1.txt").read())
engine.build_index()

# Queries now use your domain-trained embeddings
results = engine.query("pizza homework", top_k=10)
matches = engine.match_keyword_to_meaning("pizza", [
    "Italian food, restaurant, cooking",
    "School, education, homework and tests",
])
```

5. Word2Vec Baseline Comparison

A gensim Word2Vec engine is included to demonstrate the difference between static and contextual embeddings:

```python
from word2vec_baseline import Word2VecEngine

# docs: mapping of doc_id -> document text
w2v = Word2VecEngine(vector_size=100, window=5, epochs=50)
for doc_id, text in docs.items():
    w2v.add_document(doc_id, text)
w2v.build_index()

# Word-level: which words appear in similar contexts?
w2v.most_similar_words("pizza", top_k=5)

# Sentence-level: averaged word vectors (lossy)
w2v.compare_texts("pizza gives me homework", "school gives me homework")

# Search
w2v.query("a place where children learn", top_k=3)
```

Key limitation: Word2Vec gives ONE vector per word. "pizza" always has the same embedding whether it means food or school. Transformers encode the full surrounding context, so the same word gets different embeddings in different passages.
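
The limitation is easy to see with toy numbers (the vectors below are made up for illustration, not from gensim): a static lookup table returns the identical vector for "pizza" in every sentence, so only the surrounding words can shift the averaged sentence vector.

```python
# Toy static embeddings: one fixed vector per word (made-up values).
VECS = {
    "pizza": [1.0, 0.0],
    "gives": [0.0, 1.0], "me": [0.1, 0.1], "homework": [0.0, 0.9],
    "i": [0.2, 0.0], "ate": [0.8, 0.1], "delicious": [0.9, 0.0],
}

def sentence_vector(text):
    """Average the static word vectors (Word2Vec-style sentence embedding)."""
    vecs = [VECS[w] for w in text.lower().split() if w in VECS]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

# "pizza" is looked up from the same table in both sentences, so its
# contribution is identical whether it means food or school.
print(sentence_vector("pizza gives me homework"))
print(sentence_vector("I ate delicious pizza"))
```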


6. Using the Web UI

  1. Train Model (start here):

    • Paste your corpus (documents separated by blank lines)
    • Choose strategy: Unsupervised, Contrastive, or Keyword-supervised
    • For keyword strategy, provide a JSON keyword→meaning map
    • Configure base model, epochs, batch size, output path
    • Click "Start Training"; the model trains and saves to disk
    • Run "Compare Models" to evaluate base vs trained
  2. Setup:

    • Initialize engine with your trained model path (e.g. ./trained_model)
    • Add documents and build the FAISS index
  3. Semantic Search: query the corpus with trained embeddings

  4. Compare Texts: cosine similarity between any two texts

  5. Keyword Analysis: auto-cluster keyword meanings across documents

  6. Keyword Matcher: match keyword occurrences to candidate meanings

  7. Batch Analysis: multi-keyword analysis with cross-similarity matrix

  8. Evaluation: disambiguation accuracy, retrieval P@K/MRR, similarity histograms


7. API Endpoints

Training

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api/train/unsupervised | TSDAE domain adaptation |
| POST | /api/train/contrastive | Contrastive with auto-mined pairs |
| POST | /api/train/keywords | Keyword-supervised training |
| POST | /api/train/evaluate | Compare base vs trained model |

Engine

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api/init | Initialize engine with a model |
| POST | /api/documents | Add a document to the corpus |
| POST | /api/documents/upload | Upload a file as a document |
| POST | /api/index/build | Build FAISS index |
| POST | /api/query | Semantic search |
| POST | /api/compare | Compare two texts |
| POST | /api/analyze/keyword | Single keyword analysis |
| POST | /api/analyze/batch | Multi-keyword batch analysis |
| POST | /api/match | Match keyword to candidate meanings |
| GET | /api/stats | Corpus statistics |

Evaluation

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api/eval/disambiguation | Disambiguation accuracy |
| POST | /api/eval/retrieval | Retrieval metrics (P@K, MRR, NDCG) |
| GET | /api/eval/similarity-distribution | Pairwise similarity histogram |

Word2Vec Baseline

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api/w2v/init | Train Word2Vec on corpus |
| POST | /api/w2v/compare | Compare two texts (averaged word vectors) |
| POST | /api/w2v/query | Search corpus |
| POST | /api/w2v/similar-words | Find similar words |

8. Available Base Models

| Model | Dim | Size | Quality | Speed |
| --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 384 | ~80MB | Good | Fast |
| all-mpnet-base-v2 | 768 | ~420MB | Best | Medium |

Start with all-MiniLM-L6-v2 for fast iteration, then upgrade to all-mpnet-base-v2 for production quality.


9. Evaluation Metrics

| Metric | What it measures |
| --- | --- |
| Accuracy | % of keyword occurrences correctly matched to their meaning |
| Weighted F1 | Harmonic mean of precision/recall, weighted by class frequency |
| MRR | Mean Reciprocal Rank: how early the first relevant result appears |
| P@K | Precision at K: fraction of top-K results that are relevant |
| NDCG@K | Normalized Discounted Cumulative Gain: ranking quality metric |
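
P@K and MRR are simple enough to define directly (a minimal reference implementation of the standard formulas, not this project's evaluation.py):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for item in retrieved[:k] if item in relevant) / k

def mrr(ranked_lists, relevant_sets):
    """Mean of 1/rank of the first relevant result per query (0 if none found)."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(retrieved, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

print(precision_at_k(["d1", "d3", "d2"], {"d1", "d2"}, k=2))  # 0.5
print(mrr([["d3", "d1"], ["d2", "d4"]], [{"d1"}, {"d2"}]))    # (1/2 + 1/1) / 2 = 0.75
```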

10. Tuning Parameters

Training

| Parameter | Default | Notes |
| --- | --- | --- |
| epochs | 3-5 | More = better fit, but risks overfitting |
| batch_size | 16 | Larger = faster but needs more memory; MNRL benefits from larger batches |
| context_window | 2 | (Keyword strategy) sentences around the keyword to include as context |

Engine

| Parameter | Default | Notes |
| --- | --- | --- |
| chunk_size | 512 | Characters per chunk; larger = more context per chunk |
| chunk_overlap | 128 | Overlap prevents losing context at chunk boundaries |
| batch_size | 64 | Encoding batch size for FAISS indexing |
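
The chunking behavior implied by chunk_size and chunk_overlap can be sketched as a sliding character window (an illustration of the parameters, not the engine's actual code):

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 128) -> list[str]:
    """Slide a chunk_size-character window, stepping by chunk_size - chunk_overlap."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 1000-character text yields three chunks; each chunk's last 128
# characters reappear at the start of the next, so no boundary context is lost.
text = "".join(str(i % 10) for i in range(1000))
chunks = chunk_text(text)
print(len(chunks), [len(c) for c in chunks])
```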

11. Computational Resources

| Task | CPU | GPU (CUDA/MPS) | RAM |
| --- | --- | --- | --- |
| Training (small, <1K pairs) | OK | Faster (2-5x) | 4GB+ |
| Training (medium, 1K-10K pairs) | Slow | Recommended | 8GB+ |
| Training (large, 10K+ pairs) | Very slow | Required | 16GB+ |
| Indexing (1K chunks) | OK | Faster | 4GB+ |
| Querying | Fast | N/A | 2GB+ |

Minimum: a MacBook with 8GB RAM can train small models on CPU. Recommended: 16GB RAM + GPU (NVIDIA CUDA or Apple Silicon MPS).


12. Project Structure

```text
esfiles/
├── pyproject.toml              # Project config & dependencies (uv)
├── requirements.txt            # Fallback for pip users
├── contextual_similarity.py    # Core engine: chunking, embedding, FAISS, analysis
├── training.py                 # Training pipeline: 3 strategies + evaluation
├── evaluation.py               # Evaluation pipeline: metrics, reports
├── word2vec_baseline.py        # Gensim Word2Vec baseline for comparison
├── server.py                   # FastAPI REST API
├── demo.py                     # CLI demo: Word2Vec vs Transformer comparison
├── HOWTO.md                    # This file
└── frontend/                   # React + TypeScript UI
    ├── package.json
    ├── tsconfig.json
    ├── vite.config.ts
    ├── index.html
    └── src/
        ├── main.tsx
        ├── App.tsx
        ├── styles.css
        ├── types.ts
        ├── api.ts
        └── components/
            ├── ScoreBar.tsx
            ├── StatusMessage.tsx
            ├── TrainingPanel.tsx
            ├── EngineSetup.tsx
            ├── SemanticSearch.tsx
            ├── TextCompare.tsx
            ├── KeywordAnalysis.tsx
            ├── KeywordMatcher.tsx
            ├── BatchAnalysis.tsx
            └── EvaluationDashboard.tsx
```