---
title: Esfiles
emoji: "\U0001F3E2"
colorFrom: green
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
short_description: 'A prototype to analyze embeddings and word correlations'
---
# Esfiles: Contextual Similarity Engine
A tool for analyzing word meanings in context using **transformer-based embeddings**. Unlike traditional approaches (Word2Vec) that assign one static vector per word, this system **fine-tunes on your corpus** so the same word gets different embeddings depending on its surrounding context: for example, detecting that "pizza" is used as code for "school" in a set of documents.
Includes a **Word2Vec baseline** for side-by-side comparison.
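The static-vs-contextual distinction can be sketched with toy vectors (invented for illustration, not real model output): a static model assigns "pizza" a single vector in every context, while a contextual model embeds the whole sentence, so the two uses of the word land in different places.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d embeddings, invented purely for illustration.
static_pizza = np.array([1.0, 0.0, 0.0])      # one vector, every context

# A contextual model embeds the sentence, so context shifts the vector.
pizza_food_ctx  = np.array([0.9, 0.1, 0.0])   # "we ordered pizza for dinner"
pizza_coded_ctx = np.array([0.1, 0.9, 0.2])   # "meet me after pizza tomorrow"
school          = np.array([0.0, 1.0, 0.1])

# The static vector is equally (dis)similar to "school" in both uses;
# the contextual vectors separate the two senses.
print(cosine(static_pizza, school))
print(cosine(pizza_food_ctx, school))
print(cosine(pizza_coded_ctx, school))
```

The real engine does this at sentence level with fine-tuned SentenceTransformer embeddings; the numbers above just show why per-context vectors can reveal a coded usage that a single static vector cannot.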
## Stack
| Layer | Technology |
|-------|-----------|
| Embeddings | SentenceTransformers (PyTorch) |
| Vector search | FAISS |
| Baseline | gensim Word2Vec |
| Backend | FastAPI (Python) |
| Frontend | React 19 + TypeScript + Vite |
| Evaluation | scikit-learn metrics |
| Deployment | Docker (HuggingFace Spaces, local, Railway) |
## Prerequisites
- **Python 3.11+**
- **Node.js 18+** (for frontend)
- [uv](https://docs.astral.sh/uv/) (recommended) or pip
## Setup
### 1. Clone the repo
```bash
git clone <repo-url>
cd esfiles
```
### 2. Install Python dependencies
**With uv (recommended):**
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
```
**With pip:**
```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
### 3. Install frontend dependencies
```bash
cd frontend
npm install
cd ..
```
## Usage
### CLI demo
Run the Word2Vec vs Transformer comparison demo:
```bash
uv run python demo.py
```
This builds both engines on a sample corpus and compares similarity scores, semantic search, keyword matching, and clustering.
### Web UI (development)
```bash
# Terminal 1: API server
uv run python server.py
# Terminal 2: React dev server
cd frontend && npm run dev
```
- **API docs:** http://localhost:8000/docs
- **Frontend:** http://localhost:5173
### Docker
```bash
docker compose up --build
```
The app will be available at http://localhost:8000. The Docker build compiles the React frontend and bundles it with the FastAPI server in a single container.
## How it works
**Pipeline: TRAIN → INDEX → ANALYZE → EVALUATE**
1. **Train:** Fine-tune a pretrained sentence-transformer on your corpus using one of three strategies:
- **Unsupervised (TSDAE):** No labels needed. Learns vocabulary and phrasing via denoising autoencoder.
- **Contrastive:** Auto-mines training pairs from document structure (adjacent sentences = similar).
- **Keyword-supervised:** You provide a keyword→meaning map (e.g. `{"pizza": "school"}`). The trainer generates context-aware training pairs.
2. **Index:** Chunk your documents and encode them into a FAISS vector index using the fine-tuned model.
3. **Analyze:** Query the index with semantic search, compare texts, analyze keyword meanings across documents, or match keywords to candidate meanings.
4. **Evaluate:** Measure disambiguation accuracy, retrieval metrics (P@K, MRR, NDCG), and clustering quality (NMI).
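At their core, the INDEX and ANALYZE steps reduce to: chunk documents, encode the chunks, and rank them by cosine similarity against a query. A self-contained sketch, with a toy bag-of-words encoder standing in for the fine-tuned SentenceTransformer and a plain matrix product standing in for the FAISS index (all function names and data here are illustrative, not the engine's actual API):

```python
import numpy as np

def chunk(text, size=40):
    """Split a document into fixed-size word chunks (toy stand-in)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def encode(texts, vocab):
    """Toy encoder: bag-of-words over a fixed vocabulary.
    The real pipeline encodes with the fine-tuned SentenceTransformer."""
    vecs = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for w in t.lower().split():
            if w in vocab:
                vecs[i, vocab[w]] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.clip(norms, 1e-9, None)  # normalized: dot == cosine

docs = [
    "the pizza arrived late so we ate cold slices for dinner",
    "classes start at eight so do not be late tomorrow",
]
chunks = [c for doc in docs for c in chunk(doc)]
vocab = {w: i for i, w in enumerate(
    sorted({w for c in chunks for w in c.lower().split()}))}

index = encode(chunks, vocab)                     # INDEX: encode every chunk
query = encode(["when do classes start"], vocab)  # ANALYZE: semantic search
scores = (index @ query.T).ravel()                # inner product over the index
best = int(scores.argmax())
print(chunks[best])
```

The real engine replaces the bag-of-words encoder with learned embeddings (which is what lets "pizza" match "school" after training) and the matrix product with a FAISS index for scale, but the retrieval logic is the same.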
## API endpoints
### Training
| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/train/unsupervised` | TSDAE domain adaptation |
| POST | `/api/train/contrastive` | Contrastive with auto-mined pairs |
| POST | `/api/train/keywords` | Keyword-supervised training |
| POST | `/api/train/evaluate` | Compare base vs trained model |
### Engine
| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/init` | Initialize engine with a model |
| POST | `/api/documents` | Add a document |
| POST | `/api/documents/upload` | Upload a file as a document |
| POST | `/api/index/build` | Build FAISS index |
| POST | `/api/query` | Semantic search |
| POST | `/api/compare` | Compare two texts |
| POST | `/api/analyze/keyword` | Single keyword analysis |
| POST | `/api/analyze/batch` | Multi-keyword batch analysis |
| POST | `/api/match` | Match keyword to candidate meanings |
| GET | `/api/stats` | Corpus statistics |
### Evaluation
| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/eval/disambiguation` | Disambiguation accuracy |
| POST | `/api/eval/retrieval` | Retrieval metrics (P@K, MRR, NDCG) |
| GET | `/api/eval/similarity-distribution` | Pairwise similarity histogram |
### Word2Vec baseline
| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/w2v/init` | Train Word2Vec on corpus |
| POST | `/api/w2v/compare` | Compare two texts |
| POST | `/api/w2v/query` | Search corpus |
| POST | `/api/w2v/similar-words` | Find similar words |
Full interactive docs available at `/docs` when the server is running.
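A typical engine workflow hits the endpoints above in order: initialize, add documents, build the index, then query. A stdlib-only sketch; the endpoint paths come from the tables above, but the payload field names are guesses, so check the request models at `/docs` for the exact schema:

```python
import json
import urllib.request

BASE = "http://localhost:8000"

def post(path, payload=None, base=BASE):
    """Minimal stdlib POST helper returning the decoded JSON response."""
    data = json.dumps(payload or {}).encode()
    req = urllib.request.Request(base + path, data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Field names in these payloads are illustrative guesses;
# verify them against the interactive /docs page.
init_payload  = {"model": "all-MiniLM-L6-v2"}
doc_payload   = {"text": "We will discuss pizza after the meeting.",
                 "doc_id": "memo-1"}
query_payload = {"query": "plans involving the school", "top_k": 5}

def run_workflow():
    """init -> add document -> build index -> query (server must be running)."""
    post("/api/init", init_payload)
    post("/api/documents", doc_payload)
    post("/api/index/build")
    return post("/api/query", query_payload)

# print(run_workflow())  # uncomment once `uv run python server.py` is up
```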
## Project structure
```
esfiles/
├── pyproject.toml           # Dependencies (uv)
├── requirements.txt         # Fallback for pip
├── uv.lock                  # Lockfile for reproducible installs
├── contextual_similarity.py # Core engine: chunking, embedding, FAISS, analysis
├── training.py              # Training pipeline: 3 strategies + evaluation
├── evaluation.py            # Evaluation: metrics, reports
├── word2vec_baseline.py     # gensim Word2Vec baseline
├── data_loader.py           # Epstein Files dataset loader (HuggingFace + ChromaDB)
├── server.py                # FastAPI REST API
├── demo.py                  # CLI demo: Word2Vec vs Transformer comparison
├── Dockerfile               # Multi-stage build (Node + Python)
├── docker-compose.yml       # Local Docker setup
├── HOWTO.md                 # In-depth usage guide
└── frontend/                # React + TypeScript UI
    ├── package.json
    ├── vite.config.ts
    ├── index.html
    └── src/
        ├── App.tsx          # Main app with tab navigation
        ├── api.ts           # API client
        ├── types.ts         # TypeScript types
        └── components/      # UI components (training, search, evaluation, etc.)
```
## Base models
| Model | Dimensions | Quality | Speed |
|-------|-----------|---------|-------|
| `all-MiniLM-L6-v2` | 384 | Good | Fast |
| `all-mpnet-base-v2` | 768 | Best | Medium |
Start with `all-MiniLM-L6-v2` for fast iteration; switch to `all-mpnet-base-v2` when quality matters more than speed.
## Further reading
See [HOWTO.md](HOWTO.md) for detailed usage examples including Python API usage, training configuration, tuning parameters, and evaluation metrics.
## License
Apache 2.0