---
title: Esfiles
emoji: 🏒
colorFrom: green
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
short_description: A prototype to analyze embeddings and word correlations
---

# Esfiles: Contextual Similarity Engine

A tool for analyzing word meanings in context using transformer-based embeddings. Unlike traditional approaches such as Word2Vec, which assign one static vector per word, this system fine-tunes on your corpus so that the same word gets different embeddings depending on its surrounding context; for example, it can detect that "pizza" is used as a code word for "school" in a set of documents.

Includes a Word2Vec baseline for side-by-side comparison.
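Both engines ultimately score texts by cosine similarity between embedding vectors. A minimal sketch of that core comparison, with made-up toy vectors standing in for real model outputs:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# A static model (Word2Vec-style) gives "pizza" one fixed vector, so every
# comparison involving "pizza" uses the same embedding. A contextual model
# emits different vectors for the same word in different sentences; the
# toy vectors below only illustrate the mechanics, not real embeddings.
pizza_food_context = [0.8, 0.2, 0.1]   # "we ordered pizza for dinner"
pizza_code_context = [0.1, 0.3, 0.9]   # "the pizza starts at 8am"
print(cosine(pizza_food_context, pizza_code_context))  # low: distinct meanings
```

A contextual model surfaces exactly this kind of divergence that a single static vector cannot represent.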

## Stack

| Layer | Technology |
| --- | --- |
| Embeddings | SentenceTransformers (PyTorch) |
| Vector search | FAISS |
| Baseline | gensim Word2Vec |
| Backend | FastAPI (Python) |
| Frontend | React 19 + TypeScript + Vite |
| Evaluation | scikit-learn metrics |
| Deployment | Docker (Hugging Face Spaces, local, Railway) |

## Prerequisites

- Python 3.11+
- Node.js 18+ (for the frontend)
- uv (recommended) or pip

## Setup

### 1. Clone the repo

```sh
git clone <repo-url>
cd esfiles
```

### 2. Install Python dependencies

With uv (recommended):

```sh
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
```

With pip:

```sh
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

### 3. Install frontend dependencies

```sh
cd frontend
npm install
cd ..
```

## Usage

### CLI demo

Run the Word2Vec vs Transformer comparison demo:

```sh
uv run python demo.py
```

This builds both engines on a sample corpus and compares similarity scores, semantic search, keyword matching, and clustering.

### Web UI (development)

```sh
# Terminal 1: API server
uv run python server.py

# Terminal 2: React dev server
cd frontend && npm run dev
```

### Docker

```sh
docker compose up --build
```

The app will be available at http://localhost:8000. The Docker build compiles the React frontend and bundles it with the FastAPI server in a single container.

## How it works

Pipeline: **TRAIN → INDEX → ANALYZE → EVALUATE**

1. **Train**: Fine-tune a pretrained sentence-transformer on your corpus using one of three strategies:
   - **Unsupervised (TSDAE)**: No labels needed. Learns vocabulary and phrasing via a denoising autoencoder.
   - **Contrastive**: Auto-mines training pairs from document structure (adjacent sentences = similar).
   - **Keyword-supervised**: You provide a keyword → meaning map (e.g. `{"pizza": "school"}`). The trainer generates context-aware training pairs.
2. **Index**: Chunk your documents and encode them into a FAISS vector index using the fine-tuned model.
3. **Analyze**: Query the index with semantic search, compare texts, analyze keyword meanings across documents, or match keywords to candidate meanings.
4. **Evaluate**: Measure disambiguation accuracy, retrieval metrics (P@K, MRR, NDCG), and clustering quality (NMI).
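FAISS itself isn't needed to see how the INDEX and ANALYZE steps fit together. Below is a brute-force numpy stand-in for a flat inner-product index; `toy_encode` is a hypothetical placeholder for the fine-tuned model's `encode` call, so the vectors are deterministic noise rather than real embeddings:

```python
import numpy as np

def toy_encode(texts: list[str], dim: int = 8) -> np.ndarray:
    """Deterministic placeholder embeddings. The real pipeline would call
    SentenceTransformer.encode on the fine-tuned model instead."""
    return np.stack([
        np.random.default_rng(sum(ord(c) for c in t)).standard_normal(dim)
        for t in texts
    ])

class BruteForceIndex:
    """Stand-in for a FAISS flat inner-product index over normalized vectors."""
    def __init__(self, vectors: np.ndarray):
        self.vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

    def search(self, queries: np.ndarray, k: int):
        q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        scores = q @ self.vectors.T                    # cosine similarities
        order = np.argsort(-scores, axis=1)[:, :k]     # top-k per query
        return np.take_along_axis(scores, order, axis=1), order

# INDEX: chunk documents (here, pre-chunked strings) and encode them.
chunks = ["pizza delivery tonight", "the school board meeting", "weather report"]
index = BruteForceIndex(toy_encode(chunks))

# ANALYZE: semantic search returns the nearest chunks with their scores.
scores, ids = index.search(toy_encode(["pizza delivery tonight"]), k=2)
print(chunks[ids[0][0]], scores[0][0])  # exact match ranks first, score ~1.0
```

The real engine follows the same shape, with model-quality embeddings and a persistent FAISS index in place of the toys above.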

## API endpoints

### Training

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | `/api/train/unsupervised` | TSDAE domain adaptation |
| POST | `/api/train/contrastive` | Contrastive with auto-mined pairs |
| POST | `/api/train/keywords` | Keyword-supervised training |
| POST | `/api/train/evaluate` | Compare base vs. trained model |
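As a rough illustration, a keyword-supervised training request would carry a corpus plus the keyword → meaning map. The field names below are assumptions for illustration only; check the live `/docs` schema for the actual shape:

```json
{
  "documents": ["The pizza starts at 8am and runs until 3pm."],
  "keyword_map": { "pizza": "school" },
  "epochs": 3
}
```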

### Engine

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | `/api/init` | Initialize engine with a model |
| POST | `/api/documents` | Add a document |
| POST | `/api/documents/upload` | Upload a file as a document |
| POST | `/api/index/build` | Build FAISS index |
| POST | `/api/query` | Semantic search |
| POST | `/api/compare` | Compare two texts |
| POST | `/api/analyze/keyword` | Single-keyword analysis |
| POST | `/api/analyze/batch` | Multi-keyword batch analysis |
| POST | `/api/match` | Match keyword to candidate meanings |
| GET | `/api/stats` | Corpus statistics |

### Evaluation

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | `/api/eval/disambiguation` | Disambiguation accuracy |
| POST | `/api/eval/retrieval` | Retrieval metrics (P@K, MRR, NDCG) |
| GET | `/api/eval/similarity-distribution` | Pairwise similarity histogram |
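For reference, the retrieval metrics behind `/api/eval/retrieval` have compact textbook definitions. A small sketch with binary relevance; this mirrors the standard formulas, not necessarily the project's exact implementation:

```python
import math

def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def mrr(ranked: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none found)."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Discounted cumulative gain, normalized by the ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1)
              if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["d2", "d5", "d1"]
relevant = {"d1", "d2"}
print(precision_at_k(ranked, relevant, 2))  # 0.5
print(mrr(ranked, relevant))                # 1.0
```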

### Word2Vec baseline

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | `/api/w2v/init` | Train Word2Vec on the corpus |
| POST | `/api/w2v/compare` | Compare two texts |
| POST | `/api/w2v/query` | Search the corpus |
| POST | `/api/w2v/similar-words` | Find similar words |

Full interactive API docs are available at `/docs` while the server is running.

## Project structure

```text
esfiles/
├── pyproject.toml              # Dependencies (uv)
├── requirements.txt            # Fallback for pip
├── uv.lock                     # Lockfile for reproducible installs
├── contextual_similarity.py    # Core engine: chunking, embedding, FAISS, analysis
├── training.py                 # Training pipeline: 3 strategies + evaluation
├── evaluation.py               # Evaluation: metrics, reports
├── word2vec_baseline.py        # gensim Word2Vec baseline
├── data_loader.py              # Epstein Files dataset loader (HuggingFace + ChromaDB)
├── server.py                   # FastAPI REST API
├── demo.py                     # CLI demo: Word2Vec vs Transformer comparison
├── Dockerfile                  # Multi-stage build (Node + Python)
├── docker-compose.yml          # Local Docker setup
├── HOWTO.md                    # In-depth usage guide
└── frontend/                   # React + TypeScript UI
    ├── package.json
    ├── vite.config.ts
    ├── index.html
    └── src/
        ├── App.tsx             # Main app with tab navigation
        ├── api.ts              # API client
        ├── types.ts            # TypeScript types
        └── components/         # UI components (training, search, evaluation, etc.)
```

## Base models

| Model | Dimensions | Quality | Speed |
| --- | --- | --- | --- |
| `all-MiniLM-L6-v2` | 384 | Good | Fast |
| `all-mpnet-base-v2` | 768 | Best | Medium |

Start with `all-MiniLM-L6-v2` for fast iteration; switch to `all-mpnet-base-v2` for production quality.

## Further reading

See [HOWTO.md](HOWTO.md) for detailed usage examples, including Python API usage, training configuration, tuning parameters, and evaluation metrics.

## License

Apache 2.0