---
title: Esfiles
emoji: "\U0001F3E2"
colorFrom: green
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
short_description: A prototype to analyze embeddings and word correlations
---

# Esfiles – Contextual Similarity Engine

A tool for analyzing word meanings in context using **transformer-based embeddings**. Unlike traditional approaches such as Word2Vec, which assign one static vector per word, this system **fine-tunes on your corpus** so the same word gets different embeddings depending on its surrounding context – e.g. detecting that "pizza" is used as code for "school" in a set of documents.

Includes a **Word2Vec baseline** for side-by-side comparison.

## Stack

| Layer | Technology |
|-------|-----------|
| Embeddings | SentenceTransformers (PyTorch) |
| Vector search | FAISS |
| Baseline | gensim Word2Vec |
| Backend | FastAPI (Python) |
| Frontend | React 19 + TypeScript + Vite |
| Evaluation | scikit-learn metrics |
| Deployment | Docker (HuggingFace Spaces, local, Railway) |

## Prerequisites

- **Python 3.11+**
- **Node.js 18+** (for frontend)
- [uv](https://docs.astral.sh/uv/) (recommended) or pip

## Setup

### 1. Clone the repo

```bash
git clone <repo-url>
cd esfiles
```

### 2. Install Python dependencies

**With uv (recommended):**

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
```

**With pip:**

```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

### 3. Install frontend dependencies

```bash
cd frontend
npm install
cd ..
```

## Usage

### CLI demo

Run the Word2Vec vs Transformer comparison demo:

```bash
uv run python demo.py
```

This builds both engines on a sample corpus and compares similarity scores, semantic search, keyword matching, and clustering.

### Web UI (development)

```bash
# Terminal 1 - API server
uv run python server.py

# Terminal 2 - React dev server
cd frontend && npm run dev
```

- **API docs:** http://localhost:8000/docs
- **Frontend:** http://localhost:5173

### Docker

```bash
docker compose up --build
```

The app will be available at http://localhost:8000. The Docker build compiles the React frontend and bundles it with the FastAPI server in a single container.
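
A multi-stage build of this kind might look roughly like the sketch below. This is not the repo's actual Dockerfile – the stage names, working directories, and the `frontend/dist` output path are assumptions based on the stack described above:

```dockerfile
# Stage 1: compile the React frontend
FROM node:18 AS frontend
WORKDIR /build
COPY frontend/package*.json ./
RUN npm ci
COPY frontend/ ./
RUN npm run build

# Stage 2: Python runtime serving the API plus the built static files
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
COPY --from=frontend /build/dist ./frontend/dist
EXPOSE 7860
CMD ["python", "server.py"]
```

The two-stage split keeps Node.js out of the final image: only the compiled static assets are copied into the Python container.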

## How it works

**Pipeline: TRAIN → INDEX → ANALYZE → EVALUATE**

1. **Train** – Fine-tune a pretrained sentence-transformer on your corpus using one of three strategies:
   - **Unsupervised (TSDAE):** No labels needed. Learns vocabulary and phrasing via a denoising autoencoder.
   - **Contrastive:** Auto-mines training pairs from document structure (adjacent sentences = similar).
   - **Keyword-supervised:** You provide a keyword→meaning map (e.g. `{"pizza": "school"}`). The trainer generates context-aware training pairs.

2. **Index** – Chunk your documents and encode them into a FAISS vector index using the fine-tuned model.

3. **Analyze** – Query the index with semantic search, compare texts, analyze keyword meanings across documents, or match keywords to candidate meanings.

4. **Evaluate** – Measure disambiguation accuracy, retrieval metrics (P@K, MRR, NDCG), and clustering quality (NMI).
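
To make the INDEX → ANALYZE shape concrete, here is a minimal dependency-free sketch: embed each chunk, then answer a query by cosine similarity. The bag-of-words `embed` here is a toy stand-in for illustration only – the real engine in `contextual_similarity.py` uses fine-tuned transformer embeddings and a FAISS index instead:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a sentence-transformer: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# INDEX: embed every document chunk.
chunks = [
    "pizza is served at the school cafeteria",
    "the meeting about pizza was moved to friday",
    "stock prices fell sharply on monday",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# ANALYZE: semantic search = rank chunks by similarity to the query.
query = embed("pizza school")
results = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
print(results[0][0])  # best-matching chunk
```

The transformer version differs in exactly the part the toy cannot show: its embeddings depend on context, so "pizza" near scheduling language lands close to "school" after fine-tuning.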

## API endpoints

### Training

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/train/unsupervised` | TSDAE domain adaptation |
| POST | `/api/train/contrastive` | Contrastive with auto-mined pairs |
| POST | `/api/train/keywords` | Keyword-supervised training |
| POST | `/api/train/evaluate` | Compare base vs trained model |

### Engine

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/init` | Initialize engine with a model |
| POST | `/api/documents` | Add a document |
| POST | `/api/documents/upload` | Upload a file as a document |
| POST | `/api/index/build` | Build FAISS index |
| POST | `/api/query` | Semantic search |
| POST | `/api/compare` | Compare two texts |
| POST | `/api/analyze/keyword` | Single keyword analysis |
| POST | `/api/analyze/batch` | Multi-keyword batch analysis |
| POST | `/api/match` | Match keyword to candidate meanings |
| GET | `/api/stats` | Corpus statistics |

### Evaluation

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/eval/disambiguation` | Disambiguation accuracy |
| POST | `/api/eval/retrieval` | Retrieval metrics (P@K, MRR, NDCG) |
| GET | `/api/eval/similarity-distribution` | Pairwise similarity histogram |
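
The retrieval metrics named above have compact standard definitions. The sketch below shows those formulas with binary relevance – it is a reference implementation of the textbook metrics, not the code in the repo's `evaluation.py`:

```python
import math

def precision_at_k(ranked, relevant, k):
    # P@K: fraction of the top-K results that are relevant.
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def mrr(ranked, relevant):
    # MRR: reciprocal rank of the first relevant result (0 if none found).
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    # NDCG@K with binary relevance: DCG normalized by the ideal DCG.
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["d3", "d1", "d7", "d2"]   # system output, best first
relevant = {"d1", "d2"}             # ground-truth relevant ids
print(precision_at_k(ranked, relevant, 2))  # 0.5
print(mrr(ranked, relevant))                # 0.5
```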

### Word2Vec baseline

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/w2v/init` | Train Word2Vec on corpus |
| POST | `/api/w2v/compare` | Compare two texts |
| POST | `/api/w2v/query` | Search corpus |
| POST | `/api/w2v/similar-words` | Find similar words |

Full interactive docs available at `/docs` when the server is running.
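
A typical session touches these endpoints in order: train, initialize, add documents, build the index, analyze. The sketch below lays out that call sequence against a locally running server; the endpoint paths come from the tables above, but the request-body field names (`keyword_map`, `model`, `text`, `keyword`) are guesses – the authoritative schemas are in the interactive `/docs`:

```python
import json

BASE = "http://localhost:8000"

# Call sequence for a keyword-analysis session (body fields are assumptions).
session = [
    ("POST", "/api/train/keywords",  {"keyword_map": {"pizza": "school"}}),
    ("POST", "/api/init",            {"model": "all-MiniLM-L6-v2"}),
    ("POST", "/api/documents",       {"text": "Pizza starts at 8am sharp."}),
    ("POST", "/api/index/build",     {}),
    ("POST", "/api/analyze/keyword", {"keyword": "pizza"}),
]

for method, path, body in session:
    print(f"{method} {BASE}{path} {json.dumps(body)}")
    # with the server up: requests.post(BASE + path, json=body).json()
```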

## Project structure

```
esfiles/
├── pyproject.toml             # Dependencies (uv)
├── requirements.txt           # Fallback for pip
├── uv.lock                    # Lockfile for reproducible installs
├── contextual_similarity.py   # Core engine: chunking, embedding, FAISS, analysis
├── training.py                # Training pipeline: 3 strategies + evaluation
├── evaluation.py              # Evaluation: metrics, reports
├── word2vec_baseline.py       # Epstein Files dataset loader moved below; gensim Word2Vec baseline
├── data_loader.py             # Epstein Files dataset loader (HuggingFace + ChromaDB)
├── server.py                  # FastAPI REST API
├── demo.py                    # CLI demo: Word2Vec vs Transformer comparison
├── Dockerfile                 # Multi-stage build (Node + Python)
├── docker-compose.yml         # Local Docker setup
├── HOWTO.md                   # In-depth usage guide
└── frontend/                  # React + TypeScript UI
    ├── package.json
    ├── vite.config.ts
    ├── index.html
    └── src/
        ├── App.tsx            # Main app with tab navigation
        ├── api.ts             # API client
        ├── types.ts           # TypeScript types
        └── components/        # UI components (training, search, evaluation, etc.)
```

## Base models

| Model | Dimensions | Quality | Speed |
|-------|-----------|---------|-------|
| `all-MiniLM-L6-v2` | 384 | Good | Fast |
| `all-mpnet-base-v2` | 768 | Best | Medium |

Start with `all-MiniLM-L6-v2` for fast iteration, then switch to `all-mpnet-base-v2` for production quality.

## Further reading

See [HOWTO.md](HOWTO.md) for detailed usage examples, including Python API usage, training configuration, tuning parameters, and evaluation metrics.

## License

Apache 2.0
|