---
title: Esfiles
emoji: "\U0001F3E2"
colorFrom: green
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
short_description: 'A prototype to analyze embeddings and word correlations'
---

# Esfiles: Contextual Similarity Engine

A tool for analyzing word meanings in context using **transformer-based embeddings**.

Unlike traditional approaches (Word2Vec) that assign one static vector per word, this system **fine-tunes on your corpus** so the same word gets different embeddings depending on its surrounding context. For example, it can detect that "pizza" is used as code for "school" in a set of documents.

Includes a **Word2Vec baseline** for side-by-side comparison.

## Stack

| Layer | Technology |
|-------|-----------|
| Embeddings | SentenceTransformers (PyTorch) |
| Vector search | FAISS |
| Baseline | gensim Word2Vec |
| Backend | FastAPI (Python) |
| Frontend | React 19 + TypeScript + Vite |
| Evaluation | scikit-learn metrics |
| Deployment | Docker (HuggingFace Spaces, local, Railway) |

## Prerequisites

- **Python 3.11+**
- **Node.js 18+** (for frontend)
- [uv](https://docs.astral.sh/uv/) (recommended) or pip

## Setup

### 1. Clone the repo

```bash
# Replace <repo-url> with this repository's clone URL
git clone <repo-url>
cd esfiles
```

### 2. Install Python dependencies

**With uv (recommended):**

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
```

**With pip:**

```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

### 3. Install frontend dependencies

```bash
cd frontend
npm install
cd ..
```

## Usage

### CLI demo

Run the Word2Vec vs Transformer comparison demo:

```bash
uv run python demo.py
```

This builds both engines on a sample corpus and compares similarity scores, semantic search, keyword matching, and clustering.

### Web UI (development)

```bash
# Terminal 1: API server
uv run python server.py

# Terminal 2: React dev server
cd frontend && npm run dev
```

- **API docs:** http://localhost:8000/docs
- **Frontend:** http://localhost:5173

### Docker

```bash
docker compose up --build
```

The app will be available at http://localhost:8000. The Docker build compiles the React frontend and bundles it with the FastAPI server in a single container.

## How it works

**Pipeline: TRAIN → INDEX → ANALYZE → EVALUATE**

1. **Train:** Fine-tune a pretrained sentence-transformer on your corpus using one of three strategies:
   - **Unsupervised (TSDAE):** No labels needed. Learns vocabulary and phrasing via a denoising autoencoder.
   - **Contrastive:** Auto-mines training pairs from document structure (adjacent sentences = similar).
   - **Keyword-supervised:** You provide a keyword→meaning map (e.g. `{"pizza": "school"}`). The trainer generates context-aware training pairs.
2. **Index:** Chunk your documents and encode them into a FAISS vector index using the fine-tuned model.
3. **Analyze:** Query the index with semantic search, compare texts, analyze keyword meanings across documents, or match keywords to candidate meanings.
4. **Evaluate:** Measure disambiguation accuracy, retrieval metrics (P@K, MRR, NDCG), and clustering quality (NMI).

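To make the retrieval metrics in the Evaluate step concrete, here is a minimal standard-library sketch of P@K, MRR, and NDCG with binary relevance. It illustrates the textbook definitions only and is not the code in `evaluation.py`; the function names and the `d1`, `d2`, ... document IDs are hypothetical.

```python
import math
from typing import Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranking, relevant set) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking returned for one query, and its ground-truth relevant set
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

p_at_2 = precision_at_k(ranked, relevant, k=2)   # one of the top two is relevant
rr = reciprocal_rank(ranked, relevant)           # first hit is at rank 2
ndcg_at_4 = ndcg_at_k(ranked, relevant, k=4)
```

All three metrics range from 0 to 1. P@K ignores ordering within the top K, while MRR and NDCG reward placing relevant documents earlier in the ranking.
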
## API endpoints

### Training

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/train/unsupervised` | TSDAE domain adaptation |
| POST | `/api/train/contrastive` | Contrastive with auto-mined pairs |
| POST | `/api/train/keywords` | Keyword-supervised training |
| POST | `/api/train/evaluate` | Compare base vs trained model |

### Engine

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/init` | Initialize engine with a model |
| POST | `/api/documents` | Add a document |
| POST | `/api/documents/upload` | Upload a file as a document |
| POST | `/api/index/build` | Build FAISS index |
| POST | `/api/query` | Semantic search |
| POST | `/api/compare` | Compare two texts |
| POST | `/api/analyze/keyword` | Single keyword analysis |
| POST | `/api/analyze/batch` | Multi-keyword batch analysis |
| POST | `/api/match` | Match keyword to candidate meanings |
| GET | `/api/stats` | Corpus statistics |

### Evaluation

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/eval/disambiguation` | Disambiguation accuracy |
| POST | `/api/eval/retrieval` | Retrieval metrics (P@K, MRR, NDCG) |
| GET | `/api/eval/similarity-distribution` | Pairwise similarity histogram |

### Word2Vec baseline

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/w2v/init` | Train Word2Vec on corpus |
| POST | `/api/w2v/compare` | Compare two texts |
| POST | `/api/w2v/query` | Search corpus |
| POST | `/api/w2v/similar-words` | Find similar words |

Full interactive docs available at `/docs` when the server is running.

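As a rough sketch of calling these endpoints from Python, the snippet below builds a JSON `POST` request for `/api/query` with the standard library. The payload field names (`text`, `top_k`) and the helpers `build_post` and `send` are assumptions made for illustration; check the real request schema at `/docs` before relying on them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # API server address from the Usage section


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for an API path without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical payload: the field names "text" and "top_k" are assumptions,
# not the documented schema of /api/query.
query_request = build_post("/api/query", {"text": "pizza", "top_k": 5})

# Uncomment once the server is running:
# results = send(query_request)
```

Keeping request construction separate from sending makes it possible to inspect or log the request before it touches the network.
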
## Project structure

```
esfiles/
├── pyproject.toml              # Dependencies (uv)
├── requirements.txt            # Fallback for pip
├── uv.lock                     # Lockfile for reproducible installs
├── contextual_similarity.py    # Core engine: chunking, embedding, FAISS, analysis
├── training.py                 # Training pipeline: 3 strategies + evaluation
├── evaluation.py               # Evaluation: metrics, reports
├── word2vec_baseline.py        # gensim Word2Vec baseline
├── data_loader.py              # Epstein Files dataset loader (HuggingFace + ChromaDB)
├── server.py                   # FastAPI REST API
├── demo.py                     # CLI demo: Word2Vec vs Transformer comparison
├── Dockerfile                  # Multi-stage build (Node + Python)
├── docker-compose.yml          # Local Docker setup
├── HOWTO.md                    # In-depth usage guide
└── frontend/                   # React + TypeScript UI
    ├── package.json
    ├── vite.config.ts
    ├── index.html
    └── src/
        ├── App.tsx             # Main app with tab navigation
        ├── api.ts              # API client
        ├── types.ts            # TypeScript types
        └── components/         # UI components (training, search, evaluation, etc.)
```

## Base models

| Model | Dimensions | Quality | Speed |
|-------|-----------|---------|-------|
| `all-MiniLM-L6-v2` | 384 | Good | Fast |
| `all-mpnet-base-v2` | 768 | Best | Medium |

Start with `all-MiniLM-L6-v2` for iteration; use `all-mpnet-base-v2` for production.

## Further reading

See [HOWTO.md](HOWTO.md) for detailed usage examples including Python API usage, training configuration, tuning parameters, and evaluation metrics.

## License

Apache 2.0