---
title: Esfiles
emoji: "\U0001F3E2"
colorFrom: green
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
short_description: 'A prototype to analyze embeddings and word correlations'
---

# Esfiles — Contextual Similarity Engine

A tool for analyzing word meanings in context using **transformer-based embeddings**. Unlike static approaches such as Word2Vec, which assign one fixed vector per word, this system **fine-tunes a transformer on your corpus** so that the same word gets a different embedding depending on its surrounding context. For example, it can detect that "pizza" is used as code for "school" in a set of documents.

Includes a **Word2Vec baseline** for side-by-side comparison.

## Stack

| Layer | Technology |
|-------|-----------|
| Embeddings | SentenceTransformers (PyTorch) |
| Vector search | FAISS |
| Baseline | gensim Word2Vec |
| Backend | FastAPI (Python) |
| Frontend | React 19 + TypeScript + Vite |
| Evaluation | scikit-learn metrics |
| Deployment | Docker (HuggingFace Spaces, local, Railway) |

## Prerequisites

- **Python 3.11+**
- **Node.js 18+** (for frontend)
- [uv](https://docs.astral.sh/uv/) (recommended) or pip

## Setup

### 1. Clone the repo

```bash
git clone <repo-url>
cd esfiles
```

### 2. Install Python dependencies

**With uv (recommended):**

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
```

**With pip:**

```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

### 3. Install frontend dependencies

```bash
cd frontend
npm install
cd ..
```

## Usage

### CLI demo

Run the Word2Vec vs Transformer comparison demo:

```bash
uv run python demo.py
```

This builds both engines on a sample corpus and compares similarity scores, semantic search, keyword matching, and clustering.

### Web UI (development)

```bash
# Terminal 1 — API server
uv run python server.py

# Terminal 2 — React dev server
cd frontend && npm run dev
```

- **API docs:** http://localhost:8000/docs
- **Frontend:** http://localhost:5173

### Docker

```bash
docker compose up --build
```

The app will be available at http://localhost:8000. The Docker build compiles the React frontend and bundles it with the FastAPI server in a single container.

## How it works

**Pipeline: TRAIN → INDEX → ANALYZE → EVALUATE**

1. **Train** — Fine-tune a pretrained sentence-transformer on your corpus using one of three strategies:
   - **Unsupervised (TSDAE):** No labels needed. Learns vocabulary and phrasing via a denoising autoencoder.
   - **Contrastive:** Auto-mines training pairs from document structure (adjacent sentences = similar).
   - **Keyword-supervised:** You provide a keyword→meaning map (e.g. `{"pizza": "school"}`). The trainer generates context-aware training pairs.

2. **Index** — Chunk your documents and encode them into a FAISS vector index using the fine-tuned model.

3. **Analyze** — Query the index with semantic search, compare texts, analyze keyword meanings across documents, or match keywords to candidate meanings.

4. **Evaluate** — Measure disambiguation accuracy, retrieval metrics (P@K, MRR, NDCG), and clustering quality (NMI).
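
The chunking step in the Index stage can be sketched as a sliding window over words. This is a minimal illustration, not the project's chunker: the function name `chunk_words` and the `size` and `overlap` parameters are assumptions made for the example, and the real implementation in `contextual_similarity.py` may split on sentences or tokens instead.

```python
from typing import List


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into overlapping windows of `size` words.

    `size` and `overlap` are hypothetical parameters, not the project's defaults.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_words("a b c d e f g h i j", size=4, overlap=1):
        print(chunk)
```

Overlapping windows keep a keyword's neighbors in at least one chunk even when the keyword falls near a boundary, which matters when the analysis depends on surrounding context.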

## API endpoints

### Training
| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/train/unsupervised` | TSDAE domain adaptation |
| POST | `/api/train/contrastive` | Contrastive with auto-mined pairs |
| POST | `/api/train/keywords` | Keyword-supervised training |
| POST | `/api/train/evaluate` | Compare base vs trained model |
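
The "adjacent sentences = similar" heuristic behind the contrastive strategy can be sketched in a few lines. This is an illustration under assumptions, not the code in `training.py`: the helper names `split_sentences` and `mine_adjacent_pairs` are hypothetical, and the sentence splitter is deliberately naive.

```python
import re
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def mine_adjacent_pairs(documents: List[str]) -> List[Tuple[str, str]]:
    """Return (sentence, next_sentence) pairs, never crossing document boundaries."""
    pairs: List[Tuple[str, str]] = []
    for doc in documents:
        sentences = split_sentences(doc)
        # zip pairs sentence i with sentence i + 1; single-sentence docs yield nothing
        pairs.extend(zip(sentences, sentences[1:]))
    return pairs


if __name__ == "__main__":
    docs = [
        "We met at the pizza place. Class starts at nine. Bring your books!",
        "Single sentence only.",
    ]
    for anchor, positive in mine_adjacent_pairs(docs):
        print(anchor, "<->", positive)
```

Each pair would serve as a positive example for a contrastive loss; how negatives are chosen (for instance, other pairs in the same batch) is left to the trainer and is not shown here.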

### Engine
| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/init` | Initialize engine with a model |
| POST | `/api/documents` | Add a document |
| POST | `/api/documents/upload` | Upload a file as a document |
| POST | `/api/index/build` | Build FAISS index |
| POST | `/api/query` | Semantic search |
| POST | `/api/compare` | Compare two texts |
| POST | `/api/analyze/keyword` | Single keyword analysis |
| POST | `/api/analyze/batch` | Multi-keyword batch analysis |
| POST | `/api/match` | Match keyword to candidate meanings |
| GET  | `/api/stats` | Corpus statistics |

### Evaluation
| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/eval/disambiguation` | Disambiguation accuracy |
| POST | `/api/eval/retrieval` | Retrieval metrics (P@K, MRR, NDCG) |
| GET  | `/api/eval/similarity-distribution` | Pairwise similarity histogram |
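
For readers unfamiliar with the retrieval metrics named above, here is a small reference sketch using binary relevance. The function names are chosen for this example and are not taken from `evaluation.py`, which may compute these metrics differently (for instance with graded relevance).

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    # Ideal ranking places every relevant document first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranked = ["d3", "d1", "d7", "d2"]
    relevant = {"d1", "d2"}
    print(precision_at_k(ranked, relevant, k=2))
    print(reciprocal_rank(ranked, relevant))
    print(ndcg_at_k(ranked, relevant, k=4))
```

MRR is the mean of `reciprocal_rank` over all evaluation queries.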

### Word2Vec baseline
| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/w2v/init` | Train Word2Vec on corpus |
| POST | `/api/w2v/compare` | Compare two texts |
| POST | `/api/w2v/query` | Search corpus |
| POST | `/api/w2v/similar-words` | Find similar words |

Full interactive docs are available at `/docs` when the server is running.
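
As a rough sketch of calling the API from Python with only the standard library: the endpoint path comes from the tables above, but the request body is an assumption. The field names `text` and `top_k` are hypothetical, so check the schema shown at `/docs` before relying on them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # dev server address from the Usage section


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical payload: "text" and "top_k" are assumed field names.
query_request = build_post("/api/query", {"text": "pizza after class", "top_k": 5})
# results = send(query_request)  # uncomment with the server running
```

Separating request construction from sending makes the payload easy to inspect and adjust once the real schema is known.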

## Project structure

```
esfiles/
├── pyproject.toml              # Dependencies (uv)
├── requirements.txt            # Fallback for pip
├── uv.lock                     # Lockfile for reproducible installs
├── contextual_similarity.py    # Core engine: chunking, embedding, FAISS, analysis
├── training.py                 # Training pipeline: 3 strategies + evaluation
├── evaluation.py               # Evaluation: metrics, reports
├── word2vec_baseline.py        # gensim Word2Vec baseline
├── data_loader.py              # Epstein Files dataset loader (HuggingFace + ChromaDB)
├── server.py                   # FastAPI REST API
├── demo.py                     # CLI demo: Word2Vec vs Transformer comparison
├── Dockerfile                  # Multi-stage build (Node + Python)
├── docker-compose.yml          # Local Docker setup
├── HOWTO.md                    # In-depth usage guide
└── frontend/                   # React + TypeScript UI
    ├── package.json
    ├── vite.config.ts
    ├── index.html
    └── src/
        ├── App.tsx             # Main app with tab navigation
        ├── api.ts              # API client
        ├── types.ts            # TypeScript types
        └── components/         # UI components (training, search, evaluation, etc.)
```

## Base models

| Model | Dimensions | Quality | Speed |
|-------|-----------|---------|-------|
| `all-MiniLM-L6-v2` | 384 | Good | Fast |
| `all-mpnet-base-v2` | 768 | Best | Medium |

Start with `all-MiniLM-L6-v2` for fast iteration, then switch to `all-mpnet-base-v2` for production.
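
The dimension column also affects index size. The estimate below assumes an uncompressed index that stores every vector as raw float32 values (4 bytes per dimension); this README does not state which FAISS index type the engine builds, so treat the numbers as an upper-bound sketch. The function name and the 100,000-chunk corpus size are hypothetical.

```python
MODEL_DIMS = {
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
}


def flat_index_bytes(num_vectors: int, dim: int, bytes_per_float: int = 4) -> int:
    """Raw vector storage for an uncompressed float32 index (assumed layout)."""
    return num_vectors * dim * bytes_per_float


if __name__ == "__main__":
    for name, dim in MODEL_DIMS.items():
        megabytes = flat_index_bytes(100_000, dim) / 1e6
        print(f"{name}: {megabytes:.1f} MB")
    # all-MiniLM-L6-v2: 153.6 MB
    # all-mpnet-base-v2: 307.2 MB
```

Doubling the dimension doubles vector storage, which is one more reason to iterate with the smaller model first.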

## Further reading

See [HOWTO.md](HOWTO.md) for detailed usage examples including Python API usage, training configuration, tuning parameters, and evaluation metrics.

## License

Apache 2.0