---
title: Esfiles
emoji: π’
colorFrom: green
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
short_description: 'A prototype to analyze embeddings and word correlations'
---
# Esfiles – Contextual Similarity Engine
A tool for analyzing word meanings in context using transformer-based embeddings. Unlike traditional approaches such as Word2Vec, which assign one static vector per word, this system fine-tunes on your corpus so that the same word gets different embeddings depending on its surrounding context – for example, detecting that "pizza" is used as code for "school" in a set of documents.
Includes a Word2Vec baseline for side-by-side comparison.
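The core idea can be made concrete with cosine similarity, the standard measure for comparing embedding vectors. The 3-dimensional vectors below are invented toy values, not output from any real model – a contextual model embeds each occurrence of "pizza" separately, so one occurrence can land near "school" while another does not:

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Toy vectors (illustrative values only): a static model would give
# "pizza" a single vector; a contextual model embeds each use separately.
pizza_food_ctx = [0.9, 0.1, 0.2]  # "we ordered pizza for dinner"
pizza_code_ctx = [0.1, 0.8, 0.3]  # "meet me when pizza lets out at 3pm"
school         = [0.2, 0.9, 0.2]

print(cosine(pizza_food_ctx, school))  # low similarity
print(cosine(pizza_code_ctx, school))  # high similarity
```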
## Stack
| Layer | Technology |
|---|---|
| Embeddings | SentenceTransformers (PyTorch) |
| Vector search | FAISS |
| Baseline | gensim Word2Vec |
| Backend | FastAPI (Python) |
| Frontend | React 19 + TypeScript + Vite |
| Evaluation | scikit-learn metrics |
| Deployment | Docker (HuggingFace Spaces, local, Railway) |
## Prerequisites
- Python 3.11+
- Node.js 18+ (for frontend)
- uv (recommended) or pip
## Setup

1. Clone the repo:

   ```bash
   git clone <repo-url>
   cd esfiles
   ```

2. Install Python dependencies.

   With uv (recommended):

   ```bash
   curl -LsSf https://astral.sh/uv/install.sh | sh
   uv sync
   ```

   With pip:

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   pip install -r requirements.txt
   ```

3. Install frontend dependencies:

   ```bash
   cd frontend
   npm install
   cd ..
   ```
## Usage

### CLI demo

Run the Word2Vec vs Transformer comparison demo:

```bash
uv run python demo.py
```
This builds both engines on a sample corpus and compares similarity scores, semantic search, keyword matching, and clustering.
### Web UI (development)

```bash
# Terminal 1 – API server
uv run python server.py

# Terminal 2 – React dev server
cd frontend && npm run dev
```

- API docs: http://localhost:8000/docs
- Frontend: http://localhost:5173
### Docker

```bash
docker compose up --build
```
The app will be available at http://localhost:8000. The Docker build compiles the React frontend and bundles it with the FastAPI server in a single container.
## How it works

Pipeline: **TRAIN → INDEX → ANALYZE → EVALUATE**

**Train** – Fine-tune a pretrained sentence-transformer on your corpus using one of three strategies:

- **Unsupervised (TSDAE)**: No labels needed. Learns vocabulary and phrasing via a denoising autoencoder.
- **Contrastive**: Auto-mines training pairs from document structure (adjacent sentences = similar).
- **Keyword-supervised**: You provide a keyword→meaning map (e.g. `{"pizza": "school"}`). The trainer generates context-aware training pairs.
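The keyword-supervised idea can be sketched roughly as follows. The function name and pair format are illustrative, not the actual `training.py` API – the real trainer does more than naive substitution:

```python
def mine_keyword_pairs(sentences: list[str],
                       keyword_map: dict[str, str]) -> list[tuple[str, str]]:
    """For each sentence containing a mapped keyword, emit a training pair
    of (original sentence, sentence with the keyword's true meaning
    substituted). A contrastive loss then pulls the two phrasings together,
    teaching the model the keyword's in-context meaning."""
    pairs = []
    for sent in sentences:
        for keyword, meaning in keyword_map.items():
            if keyword in sent.lower():
                pairs.append((sent, sent.lower().replace(keyword, meaning)))
    return pairs

corpus = [
    "Pizza starts at 8am tomorrow.",
    "We ordered pasta for dinner.",
]
print(mine_keyword_pairs(corpus, {"pizza": "school"}))
# [('Pizza starts at 8am tomorrow.', 'school starts at 8am tomorrow.')]
```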
**Index** – Chunk your documents and encode them into a FAISS vector index using the fine-tuned model.
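Chunking can be approximated with a sliding window over words; the chunk size and overlap below are illustrative defaults, not the engine's actual parameters:

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows so context at chunk
    boundaries is not lost; each chunk is later encoded into a vector
    and added to the FAISS index."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

chunks = chunk_words("one two three four five six seven", size=4, overlap=2)
print(chunks)  # ['one two three four', 'three four five six', 'five six seven']
```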
**Analyze** – Query the index with semantic search, compare texts, analyze keyword meanings across documents, or match keywords to candidate meanings.
**Evaluate** – Measure disambiguation accuracy, retrieval metrics (P@K, MRR, NDCG), and clustering quality (NMI).
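Two of the retrieval metrics are simple enough to sketch in a few lines. These are the standard textbook definitions, not code lifted from `evaluation.py`:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """P@K: fraction of the top-k retrieved items that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def mrr(rankings: list[list[str]], relevant: set[str]) -> float:
    """Mean reciprocal rank of the first relevant hit, averaged over queries."""
    total = 0.0
    for retrieved in rankings:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, k=4))  # 0.5
print(mrr([retrieved], relevant))                # 0.5 (first hit at rank 2)
```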
## API endpoints
### Training

| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/train/unsupervised` | TSDAE domain adaptation |
| POST | `/api/train/contrastive` | Contrastive with auto-mined pairs |
| POST | `/api/train/keywords` | Keyword-supervised training |
| POST | `/api/train/evaluate` | Compare base vs trained model |
### Engine

| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/init` | Initialize engine with a model |
| POST | `/api/documents` | Add a document |
| POST | `/api/documents/upload` | Upload a file as a document |
| POST | `/api/index/build` | Build FAISS index |
| POST | `/api/query` | Semantic search |
| POST | `/api/compare` | Compare two texts |
| POST | `/api/analyze/keyword` | Single keyword analysis |
| POST | `/api/analyze/batch` | Multi-keyword batch analysis |
| POST | `/api/match` | Match keyword to candidate meanings |
| GET | `/api/stats` | Corpus statistics |
### Evaluation

| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/eval/disambiguation` | Disambiguation accuracy |
| POST | `/api/eval/retrieval` | Retrieval metrics (P@K, MRR, NDCG) |
| GET | `/api/eval/similarity-distribution` | Pairwise similarity histogram |
### Word2Vec baseline

| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/w2v/init` | Train Word2Vec on corpus |
| POST | `/api/w2v/compare` | Compare two texts |
| POST | `/api/w2v/query` | Search corpus |
| POST | `/api/w2v/similar-words` | Find similar words |
Full interactive docs are available at `/docs` when the server is running.
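A minimal client call might look like the sketch below. The request field names (`query`, `top_k`) are assumptions based on the endpoint's purpose – check the interactive `/docs` page for the actual schema:

```python
import json
from urllib import request

def query_payload(text: str, top_k: int = 5) -> bytes:
    """JSON body for POST /api/query; field names are assumed, not verified."""
    return json.dumps({"query": text, "top_k": top_k}).encode()

def semantic_search(text: str, base: str = "http://localhost:8000") -> dict:
    """POST a semantic-search query to a running Esfiles server."""
    req = request.Request(
        f"{base}/api/query",
        data=query_payload(text),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:  # requires the server to be up
        return json.load(resp)

print(query_payload("where does pizza meet?"))
```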
## Project structure

```
esfiles/
├── pyproject.toml             # Dependencies (uv)
├── requirements.txt           # Fallback for pip
├── uv.lock                    # Lockfile for reproducible installs
├── contextual_similarity.py   # Core engine: chunking, embedding, FAISS, analysis
├── training.py                # Training pipeline: 3 strategies + evaluation
├── evaluation.py              # Evaluation: metrics, reports
├── word2vec_baseline.py       # gensim Word2Vec baseline
├── data_loader.py             # Epstein Files dataset loader (HuggingFace + ChromaDB)
├── server.py                  # FastAPI REST API
├── demo.py                    # CLI demo: Word2Vec vs Transformer comparison
├── Dockerfile                 # Multi-stage build (Node + Python)
├── docker-compose.yml         # Local Docker setup
├── HOWTO.md                   # In-depth usage guide
└── frontend/                  # React + TypeScript UI
    ├── package.json
    ├── vite.config.ts
    ├── index.html
    └── src/
        ├── App.tsx            # Main app with tab navigation
        ├── api.ts             # API client
        ├── types.ts           # TypeScript types
        └── components/        # UI components (training, search, evaluation, etc.)
```
## Base models

| Model | Dimensions | Quality | Speed |
|---|---|---|---|
| `all-MiniLM-L6-v2` | 384 | Good | Fast |
| `all-mpnet-base-v2` | 768 | Best | Medium |

Start with `all-MiniLM-L6-v2` for fast iteration, then switch to `all-mpnet-base-v2` for production.
## Further reading

See `HOWTO.md` for detailed usage examples, including the Python API, training configuration, tuning parameters, and evaluation metrics.
## License

Apache 2.0