---
title: Esfiles
emoji: "\U0001F3E2"
colorFrom: green
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
short_description: 'A prototype to analyze embeddings and word correlations'
---
# Esfiles: Contextual Similarity Engine
A tool for analyzing word meanings in context using **transformer-based embeddings**. Unlike traditional approaches (Word2Vec) that assign one static vector per word, this system **fine-tunes on your corpus** so the same word gets different embeddings depending on its surrounding context: for example, detecting that "pizza" is used as code for "school" in a set of documents.
Includes a **Word2Vec baseline** for side-by-side comparison.
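The static-vs-contextual distinction can be sketched with toy vectors (invented for illustration, not real model output): a static model assigns "pizza" a single vector in every context, while a contextual model embeds the whole sentence, so the two uses of the word land in different places.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d embeddings, invented purely for illustration.
static_pizza = np.array([1.0, 0.0, 0.0])      # one vector, every context

# A contextual model embeds the sentence, so context shifts the vector.
pizza_food_ctx  = np.array([0.9, 0.1, 0.0])   # "we ordered pizza for dinner"
pizza_coded_ctx = np.array([0.1, 0.9, 0.2])   # "meet me after pizza tomorrow"
school          = np.array([0.0, 1.0, 0.1])

# The static vector is equally (dis)similar to "school" in both uses;
# the contextual vectors separate the two senses.
print(cosine(static_pizza, school))
print(cosine(pizza_food_ctx, school))
print(cosine(pizza_coded_ctx, school))
```

The real engine does this at sentence level with fine-tuned SentenceTransformer embeddings; the numbers above just show why per-context vectors can reveal a coded usage that a single static vector cannot.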
## Stack
| Layer | Technology |
|-------|-----------|
| Embeddings | SentenceTransformers (PyTorch) |
| Vector search | FAISS |
| Baseline | gensim Word2Vec |
| Backend | FastAPI (Python) |
| Frontend | React 19 + TypeScript + Vite |
| Evaluation | scikit-learn metrics |
| Deployment | Docker (HuggingFace Spaces, local, Railway) |
## Prerequisites
- **Python 3.11+**
- **Node.js 18+** (for frontend)
- [uv](https://docs.astral.sh/uv/) (recommended) or pip
## Setup
### 1. Clone the repo
```bash
git clone <repo-url>
cd esfiles
```
### 2. Install Python dependencies
**With uv (recommended):**
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
```
**With pip:**
```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
### 3. Install frontend dependencies
```bash
cd frontend
npm install
cd ..
```
## Usage
### CLI demo
Run the Word2Vec vs Transformer comparison demo:
```bash
uv run python demo.py
```
This builds both engines on a sample corpus and compares similarity scores, semantic search, keyword matching, and clustering.
### Web UI (development)
```bash
# Terminal 1: API server
uv run python server.py
# Terminal 2: React dev server
cd frontend && npm run dev
```
- **API docs:** http://localhost:8000/docs
- **Frontend:** http://localhost:5173
### Docker
```bash
docker compose up --build
```
The app will be available at http://localhost:8000. The Docker build compiles the React frontend and bundles it with the FastAPI server in a single container.
## How it works
**Pipeline: TRAIN → INDEX → ANALYZE → EVALUATE**
1. **Train:** Fine-tune a pretrained sentence-transformer on your corpus using one of three strategies:
- **Unsupervised (TSDAE):** No labels needed. Learns vocabulary and phrasing via denoising autoencoder.
- **Contrastive:** Auto-mines training pairs from document structure (adjacent sentences = similar).
- **Keyword-supervised:** You provide a keyword→meaning map (e.g. `{"pizza": "school"}`). The trainer generates context-aware training pairs.
2. **Index:** Chunk your documents and encode them into a FAISS vector index using the fine-tuned model.
3. **Analyze:** Query the index with semantic search, compare texts, analyze keyword meanings across documents, or match keywords to candidate meanings.
4. **Evaluate:** Measure disambiguation accuracy, retrieval metrics (P@K, MRR, NDCG), and clustering quality (NMI).
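At their core, the INDEX and ANALYZE steps reduce to: chunk documents, encode the chunks, and rank them by cosine similarity against a query. A self-contained sketch, with a toy bag-of-words encoder standing in for the fine-tuned SentenceTransformer and a plain matrix product standing in for the FAISS index (all function names and data here are illustrative, not the engine's actual API):

```python
import numpy as np

def chunk(text, size=40):
    """Split a document into fixed-size word chunks (toy stand-in)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def encode(texts, vocab):
    """Toy encoder: bag-of-words over a fixed vocabulary.
    The real pipeline encodes with the fine-tuned SentenceTransformer."""
    vecs = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for w in t.lower().split():
            if w in vocab:
                vecs[i, vocab[w]] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.clip(norms, 1e-9, None)  # normalized: dot == cosine

docs = [
    "the pizza arrived late so we ate cold slices for dinner",
    "classes start at eight so do not be late tomorrow",
]
chunks = [c for doc in docs for c in chunk(doc)]
vocab = {w: i for i, w in enumerate(
    sorted({w for c in chunks for w in c.lower().split()}))}

index = encode(chunks, vocab)                     # INDEX: encode every chunk
query = encode(["when do classes start"], vocab)  # ANALYZE: semantic search
scores = (index @ query.T).ravel()                # inner product over the index
best = int(scores.argmax())
print(chunks[best])
```

The real engine replaces the bag-of-words encoder with learned embeddings (which is what lets "pizza" match "school" after training) and the matrix product with a FAISS index for scale, but the retrieval logic is the same.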
## API endpoints
### Training
| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/train/unsupervised` | TSDAE domain adaptation |
| POST | `/api/train/contrastive` | Contrastive with auto-mined pairs |
| POST | `/api/train/keywords` | Keyword-supervised training |
| POST | `/api/train/evaluate` | Compare base vs trained model |
### Engine
| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/init` | Initialize engine with a model |
| POST | `/api/documents` | Add a document |
| POST | `/api/documents/upload` | Upload a file as a document |
| POST | `/api/index/build` | Build FAISS index |
| POST | `/api/query` | Semantic search |
| POST | `/api/compare` | Compare two texts |
| POST | `/api/analyze/keyword` | Single keyword analysis |
| POST | `/api/analyze/batch` | Multi-keyword batch analysis |
| POST | `/api/match` | Match keyword to candidate meanings |
| GET | `/api/stats` | Corpus statistics |
### Evaluation
| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/eval/disambiguation` | Disambiguation accuracy |
| POST | `/api/eval/retrieval` | Retrieval metrics (P@K, MRR, NDCG) |
| GET | `/api/eval/similarity-distribution` | Pairwise similarity histogram |
### Word2Vec baseline
| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/w2v/init` | Train Word2Vec on corpus |
| POST | `/api/w2v/compare` | Compare two texts |
| POST | `/api/w2v/query` | Search corpus |
| POST | `/api/w2v/similar-words` | Find similar words |
Full interactive docs available at `/docs` when the server is running.
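A typical engine workflow hits the endpoints above in order: initialize, add documents, build the index, then query. A stdlib-only sketch; the endpoint paths come from the tables above, but the payload field names are guesses, so check the request models at `/docs` for the exact schema:

```python
import json
import urllib.request

BASE = "http://localhost:8000"

def post(path, payload=None, base=BASE):
    """Minimal stdlib POST helper returning the decoded JSON response."""
    data = json.dumps(payload or {}).encode()
    req = urllib.request.Request(base + path, data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Field names in these payloads are illustrative guesses;
# verify them against the interactive /docs page.
init_payload  = {"model": "all-MiniLM-L6-v2"}
doc_payload   = {"text": "We will discuss pizza after the meeting.",
                 "doc_id": "memo-1"}
query_payload = {"query": "plans involving the school", "top_k": 5}

def run_workflow():
    """init -> add document -> build index -> query (server must be running)."""
    post("/api/init", init_payload)
    post("/api/documents", doc_payload)
    post("/api/index/build")
    return post("/api/query", query_payload)

# print(run_workflow())  # uncomment once `uv run python server.py` is up
```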
## Project structure
```
esfiles/
├── pyproject.toml           # Dependencies (uv)
├── requirements.txt         # Fallback for pip
├── uv.lock                  # Lockfile for reproducible installs
├── contextual_similarity.py # Core engine: chunking, embedding, FAISS, analysis
├── training.py              # Training pipeline: 3 strategies + evaluation
├── evaluation.py            # Evaluation: metrics, reports
├── word2vec_baseline.py     # gensim Word2Vec baseline
├── data_loader.py           # Epstein Files dataset loader (HuggingFace + ChromaDB)
├── server.py                # FastAPI REST API
├── demo.py                  # CLI demo: Word2Vec vs Transformer comparison
├── Dockerfile               # Multi-stage build (Node + Python)
├── docker-compose.yml       # Local Docker setup
├── HOWTO.md                 # In-depth usage guide
└── frontend/                # React + TypeScript UI
    ├── package.json
    ├── vite.config.ts
    ├── index.html
    └── src/
        ├── App.tsx          # Main app with tab navigation
        ├── api.ts           # API client
        ├── types.ts         # TypeScript types
        └── components/      # UI components (training, search, evaluation, etc.)
```
## Base models
| Model | Dimensions | Quality | Speed |
|-------|-----------|---------|-------|
| `all-MiniLM-L6-v2` | 384 | Good | Fast |
| `all-mpnet-base-v2` | 768 | Best | Medium |
Start with `all-MiniLM-L6-v2` for fast iteration; switch to `all-mpnet-base-v2` when quality matters more than speed.
## Further reading
See [HOWTO.md](HOWTO.md) for detailed usage examples including Python API usage, training configuration, tuning parameters, and evaluation metrics.
## License
Apache 2.0