
Contextual Similarity Engine – HOWTO

Overview

This project uses transformer-based sentence embeddings to find and compare contextual meanings of keywords within large documents. Unlike Word2Vec (static, one vector per word), this system fine-tunes on YOUR corpus so it learns domain-specific patterns, e.g. that "pizza" means "school" in your data.

A Word2Vec (gensim) baseline is included for comparison, demonstrating why contextual embeddings are superior for meaning disambiguation.

The pipeline is: TRAIN → INDEX → ANALYZE → EVALUATE.

Stack:

  • SentenceTransformers: contextual embeddings (PyTorch)
  • FAISS: fast vector similarity search
  • gensim Word2Vec: static embedding baseline for comparison
  • FastAPI: REST API backend
  • React + TypeScript: visualization frontend
  • scikit-learn: clustering & evaluation metrics

1. Install Dependencies

Python backend (uv, recommended)

uv is a fast Python package manager that replaces pip, venv, and requirements.txt with a single tool and lockfile.

```bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a virtual environment and install all dependencies from pyproject.toml
cd esfiles
uv sync

# Run commands inside the managed environment
uv run python server.py
uv run python demo.py
```

uv sync reads pyproject.toml, resolves dependencies, creates a .venv, and generates a uv.lock lockfile for reproducible installs. The lockfile pins exact versions so every machine gets identical dependencies.

Adding/removing packages:

```bash
uv add httpx              # add a new dependency
uv remove httpx           # remove it
uv lock --upgrade         # upgrade all packages to latest compatible versions
```

Python backend (pip, alternative)

```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

React frontend

```bash
cd frontend
npm install
```

2. Quick Start

CLI demo (Word2Vec vs Transformer comparison)

```bash
uv run python demo.py
```

This runs a side-by-side comparison:

  1. Builds both Transformer and Word2Vec engines on the same corpus
  2. Compares text similarity scores between approaches
  3. Shows word-level similarity (Word2Vec only; sentence transformers embed full passages, not standalone words)
  4. Runs semantic search with both engines
  5. Tests keyword meaning matching ("pizza" → food or school?)
  6. Demonstrates clustering (transformer can separate meanings, Word2Vec cannot)

Web UI

```bash
# Terminal 1: start the API server
uv run python server.py

# Terminal 2: start the React dev server
cd frontend && npm run dev
```

  • API docs: http://localhost:8000/docs
  • Frontend: http://localhost:5173

3. Training Your Model

Three strategies, from simplest to most powerful:

Strategy 1: Unsupervised (TSDAE)

No labels needed. Learns your corpus vocabulary and phrasing via a denoising autoencoder.

```python
from training import CorpusTrainer

corpus_texts = [open(f).read() for f in your_files]
trainer = CorpusTrainer(corpus_texts, base_model="all-MiniLM-L6-v2")

result = trainer.train_unsupervised(
    output_path="./trained_model",
    epochs=3,
    batch_size=16,
)
print(f"Trained on {result['training_pairs']} sentences in {result['seconds']}s")
```
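
TSDAE trains the encoder to reconstruct a sentence from a deliberately corrupted copy. The corruption step can be pictured as random token deletion; the sketch below is purely illustrative (the deletion ratio and helper name are assumptions, not taken from training.py):

```python
import random

def corrupt(sentence: str, deletion_ratio: float = 0.6, seed: int = 0) -> str:
    """Randomly drop tokens; the encoder must reconstruct the original."""
    rng = random.Random(seed)
    tokens = sentence.split()
    kept = [t for t in tokens if rng.random() > deletion_ratio]
    return " ".join(kept) if kept else tokens[0]

print(corrupt("pizza gives me a lot of homework every week"))
```

Because reconstruction forces the encoder to infer the missing words from the survivors, the model picks up your corpus's vocabulary and phrasing without any labels.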

Strategy 2: Contrastive (auto-mined pairs)

Adjacent sentences = similar, random sentences = dissimilar. Learns document structure using MultipleNegativesRankingLoss with in-batch negatives.

```python
trainer = CorpusTrainer(corpus_texts)

result = trainer.train_contrastive(
    output_path="./trained_model",
    epochs=5,
    batch_size=16,
)
```
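
The auto-mining idea can be sketched in a few lines (a simplified illustration of "adjacent sentences = similar"; the real mining lives in training.py and may split sentences differently):

```python
import re

def mine_pairs(document: str) -> list[tuple[str, str]]:
    """Pair each sentence with its neighbor as a positive training example."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", document) if s.strip()]
    return list(zip(sentences, sentences[1:]))

doc = "Pizza starts at eight. The pepperoni lesson was hard. I forgot my homework."
for a, b in mine_pairs(doc):
    print(a, "<->", b)
```

With MultipleNegativesRankingLoss, every other sentence in the batch acts as an implicit negative for each mined pair, which is why larger batches help this strategy.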

Strategy 3: Keyword-supervised (best if you know the code words)

You provide a keyword→meaning map. The trainer auto-generates training pairs: keyword-in-context ↔ meaning-substituted version, plus contrastive pairs from corpus structure.

```python
trainer = CorpusTrainer(corpus_texts)

result = trainer.train_with_keywords(
    keyword_meanings={"pizza": "school", "pepperoni": "math class"},
    output_path="./trained_model",
    epochs=5,
    batch_size=16,
)
print(f"Keywords: {result['keywords']}")
```
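
The pair-generation step can be pictured like this (an illustrative sketch of the idea, not the trainer's actual code; the function name and 0.95 target are assumptions): each sentence containing a keyword is paired with a copy where the keyword is replaced by its meaning, labeled as highly similar.

```python
def generate_keyword_pairs(sentences, keyword_meanings):
    """For each keyword occurrence, emit (original, substituted, target_similarity)."""
    pairs = []
    for sent in sentences:
        for keyword, meaning in keyword_meanings.items():
            if keyword in sent:
                pairs.append((sent, sent.replace(keyword, meaning), 0.95))
    return pairs

pairs = generate_keyword_pairs(
    ["pizza gives me homework", "pepperoni was cancelled today"],
    {"pizza": "school", "pepperoni": "math class"},
)
for p in pairs:
    print(p)
```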

Verifying training worked

```python
# Compare base model vs trained model on test pairs
comparison = trainer.evaluate_model(
    test_pairs=[
        ("pizza gives me homework", "school gives me homework", 0.95),
        ("pizza gives me homework", "I ate delicious pizza", 0.1),
        ("The pizza test is hard", "The school exam is difficult", 0.9),
    ],
    trained_model_path="./trained_model",
)

print(f"Base error:    {comparison['summary']['avg_base_error']:.4f}")
print(f"Trained error: {comparison['summary']['avg_trained_error']:.4f}")
print(f"Reduction:     {comparison['summary']['error_reduction_pct']:.1f}%")
print(f"Improved:      {comparison['summary']['improved']}/{comparison['summary']['total']}")
```

4. Using Your Trained Model

After training, use the saved model path instead of the pretrained model name:

```python
from contextual_similarity import ContextualSimilarityEngine

engine = ContextualSimilarityEngine(model_name="./trained_model")

engine.add_document("doc1", open("doc1.txt").read())
engine.build_index()

# Queries now use your domain-trained embeddings
results = engine.query("pizza homework", top_k=10)
matches = engine.match_keyword_to_meaning("pizza", [
    "Italian food, restaurant, cooking",
    "School, education, homework and tests",
])
```

5. Word2Vec Baseline Comparison

A gensim Word2Vec engine is included to demonstrate the difference between static and contextual embeddings:

```python
from word2vec_baseline import Word2VecEngine

# docs: mapping of doc_id -> document text
w2v = Word2VecEngine(vector_size=100, window=5, epochs=50)
for doc_id, text in docs.items():
    w2v.add_document(doc_id, text)
w2v.build_index()

# Word-level: which words appear in similar contexts?
w2v.most_similar_words("pizza", top_k=5)

# Sentence-level: averaged word vectors (lossy)
w2v.compare_texts("pizza gives me homework", "school gives me homework")

# Search
w2v.query("a place where children learn", top_k=3)
```

Key limitation: Word2Vec gives ONE vector per word. "pizza" always has the same embedding whether it means food or school. Transformers encode the full surrounding context, so the same word gets different embeddings in different passages.
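
The limitation is easy to see with toy numbers (the vectors below are made up for illustration, not from gensim): a static lookup table returns the identical vector for "pizza" in every sentence, so only the surrounding words can shift the averaged sentence vector.

```python
# Toy static embeddings: one fixed vector per word (made-up values).
VECS = {
    "pizza": [1.0, 0.0],
    "gives": [0.0, 1.0], "me": [0.1, 0.1], "homework": [0.0, 0.9],
    "i": [0.2, 0.0], "ate": [0.8, 0.1], "delicious": [0.9, 0.0],
}

def sentence_vector(text):
    """Average the static word vectors (Word2Vec-style sentence embedding)."""
    vecs = [VECS[w] for w in text.lower().split() if w in VECS]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

# "pizza" is looked up from the same table in both sentences, so its
# contribution is identical whether it means food or school.
print(sentence_vector("pizza gives me homework"))
print(sentence_vector("I ate delicious pizza"))
```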


6. Using the Web UI

  1. Train Model (start here):

    • Paste your corpus (documents separated by blank lines)
    • Choose strategy: Unsupervised, Contrastive, or Keyword-supervised
    • For keyword strategy, provide a JSON keyword→meaning map
    • Configure base model, epochs, batch size, output path
    • Click "Start Training"; the model trains and saves to disk
    • Run "Compare Models" to evaluate base vs trained
  2. Setup:

    • Initialize engine with your trained model path (e.g. ./trained_model)
    • Add documents and build the FAISS index
  3. Semantic Search: query the corpus with trained embeddings

  4. Compare Texts: cosine similarity between any two texts

  5. Keyword Analysis: auto-cluster keyword meanings across documents

  6. Keyword Matcher: match keyword occurrences to candidate meanings

  7. Batch Analysis: multi-keyword analysis with cross-similarity matrix

  8. Evaluation: disambiguation accuracy, retrieval P@K/MRR, similarity histograms


7. API Endpoints

Training

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api/train/unsupervised | TSDAE domain adaptation |
| POST | /api/train/contrastive | Contrastive with auto-mined pairs |
| POST | /api/train/keywords | Keyword-supervised training |
| POST | /api/train/evaluate | Compare base vs trained model |

Engine

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api/init | Initialize engine with a model |
| POST | /api/documents | Add a document to the corpus |
| POST | /api/documents/upload | Upload a file as a document |
| POST | /api/index/build | Build FAISS index |
| POST | /api/query | Semantic search |
| POST | /api/compare | Compare two texts |
| POST | /api/analyze/keyword | Single keyword analysis |
| POST | /api/analyze/batch | Multi-keyword batch analysis |
| POST | /api/match | Match keyword to candidate meanings |
| GET | /api/stats | Corpus statistics |

Evaluation

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api/eval/disambiguation | Disambiguation accuracy |
| POST | /api/eval/retrieval | Retrieval metrics (P@K, MRR, NDCG) |
| GET | /api/eval/similarity-distribution | Pairwise similarity histogram |

Word2Vec Baseline

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api/w2v/init | Train Word2Vec on corpus |
| POST | /api/w2v/compare | Compare two texts (averaged word vectors) |
| POST | /api/w2v/query | Search corpus |
| POST | /api/w2v/similar-words | Find similar words |

8. Available Base Models

| Model | Dim | Size | Quality | Speed |
| --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 384 | ~80MB | Good | Fast |
| all-mpnet-base-v2 | 768 | ~420MB | Best | Medium |

Start with all-MiniLM-L6-v2 for fast iteration, then upgrade to all-mpnet-base-v2 for production quality.


9. Evaluation Metrics

| Metric | What it measures |
| --- | --- |
| Accuracy | % of keyword occurrences correctly matched to their meaning |
| Weighted F1 | Harmonic mean of precision/recall, weighted by class frequency |
| MRR | Mean Reciprocal Rank: how early the first relevant result appears |
| P@K | Precision at K: fraction of top-K results that are relevant |
| NDCG@K | Normalized Discounted Cumulative Gain: ranking quality metric |
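
P@K and MRR are simple enough to define directly (a minimal reference implementation of the standard formulas, not this project's evaluation.py):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for item in retrieved[:k] if item in relevant) / k

def mrr(ranked_lists, relevant_sets):
    """Mean of 1/rank of the first relevant result per query (0 if none found)."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(retrieved, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

print(precision_at_k(["d1", "d3", "d2"], {"d1", "d2"}, k=2))  # 0.5
print(mrr([["d3", "d1"], ["d2", "d4"]], [{"d1"}, {"d2"}]))    # (1/2 + 1/1) / 2 = 0.75
```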

10. Tuning Parameters

Training

| Parameter | Default | Notes |
| --- | --- | --- |
| epochs | 3-5 | More = better fit, but risks overfitting |
| batch_size | 16 | Larger = faster but needs more memory; MNRL benefits from larger batches |
| context_window | 2 | (Keyword strategy) sentences around the keyword to include as context |

Engine

| Parameter | Default | Notes |
| --- | --- | --- |
| chunk_size | 512 | Characters per chunk; larger = more context per chunk |
| chunk_overlap | 128 | Overlap prevents losing context at chunk boundaries |
| batch_size | 64 | Encoding batch size for FAISS indexing |
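
The chunking behavior implied by chunk_size and chunk_overlap can be sketched as a sliding character window (an illustration of the parameters, not the engine's actual code):

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 128) -> list[str]:
    """Slide a chunk_size-character window, stepping by chunk_size - chunk_overlap."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 1000-character text yields three chunks; each chunk's last 128
# characters reappear at the start of the next, so no boundary context is lost.
text = "".join(str(i % 10) for i in range(1000))
chunks = chunk_text(text)
print(len(chunks), [len(c) for c in chunks])
```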

11. Computational Resources

| Task | CPU | GPU (CUDA/MPS) | RAM |
| --- | --- | --- | --- |
| Training (small, <1K pairs) | OK | Faster (2-5x) | 4GB+ |
| Training (medium, 1K-10K pairs) | Slow | Recommended | 8GB+ |
| Training (large, 10K+ pairs) | Very slow | Required | 16GB+ |
| Indexing (1K chunks) | OK | Faster | 4GB+ |
| Querying | Fast | N/A | 2GB+ |

Minimum: a MacBook with 8GB RAM can train small models on CPU. Recommended: 16GB RAM + GPU (NVIDIA CUDA or Apple Silicon MPS).


12. Project Structure

```text
esfiles/
├── pyproject.toml              # Project config & dependencies (uv)
├── requirements.txt            # Fallback for pip users
├── contextual_similarity.py    # Core engine: chunking, embedding, FAISS, analysis
├── training.py                 # Training pipeline: 3 strategies + evaluation
├── evaluation.py               # Evaluation pipeline: metrics, reports
├── word2vec_baseline.py        # Gensim Word2Vec baseline for comparison
├── server.py                   # FastAPI REST API
├── demo.py                     # CLI demo: Word2Vec vs Transformer comparison
├── HOWTO.md                    # This file
└── frontend/                   # React + TypeScript UI
    ├── package.json
    ├── tsconfig.json
    ├── vite.config.ts
    ├── index.html
    └── src/
        ├── main.tsx
        ├── App.tsx
        ├── styles.css
        ├── types.ts
        ├── api.ts
        └── components/
            ├── ScoreBar.tsx
            ├── StatusMessage.tsx
            ├── TrainingPanel.tsx
            ├── EngineSetup.tsx
            ├── SemanticSearch.tsx
            ├── TextCompare.tsx
            ├── KeywordAnalysis.tsx
            ├── KeywordMatcher.tsx
            ├── BatchAnalysis.tsx
            └── EvaluationDashboard.tsx
```