---
title: Esfiles
emoji: 🏒
colorFrom: green
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
short_description: A prototype to analyze embeddings and word correlations
---

# Esfiles: Contextual Similarity Engine

A tool for analyzing word meanings in context using transformer-based embeddings. Unlike traditional approaches such as Word2Vec, which assign one static vector per word, this system fine-tunes on your corpus so that the same word gets different embeddings depending on its surrounding context; for example, it can detect that "pizza" is used as a code word for "school" in a set of documents.

Includes a Word2Vec baseline for side-by-side comparison.
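Both engines ultimately score texts by cosine similarity between embedding vectors. A minimal sketch of that core comparison, with made-up toy vectors standing in for real model outputs:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# A static model (Word2Vec-style) gives "pizza" one fixed vector, so every
# comparison involving "pizza" uses the same embedding. A contextual model
# emits different vectors for the same word in different sentences; the
# toy vectors below only illustrate the mechanics, not real embeddings.
pizza_food_context = [0.8, 0.2, 0.1]   # "we ordered pizza for dinner"
pizza_code_context = [0.1, 0.3, 0.9]   # "the pizza starts at 8am"
print(cosine(pizza_food_context, pizza_code_context))  # low: distinct meanings
```

A contextual model surfaces exactly this kind of divergence that a single static vector cannot represent.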

## Stack

| Layer | Technology |
| --- | --- |
| Embeddings | SentenceTransformers (PyTorch) |
| Vector search | FAISS |
| Baseline | gensim Word2Vec |
| Backend | FastAPI (Python) |
| Frontend | React 19 + TypeScript + Vite |
| Evaluation | scikit-learn metrics |
| Deployment | Docker (Hugging Face Spaces, local, Railway) |

## Prerequisites

- Python 3.11+
- Node.js 18+ (for the frontend)
- uv (recommended) or pip

## Setup

### 1. Clone the repo

```sh
git clone <repo-url>
cd esfiles
```

### 2. Install Python dependencies

With uv (recommended):

```sh
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
```

With pip:

```sh
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

### 3. Install frontend dependencies

```sh
cd frontend
npm install
cd ..
```

## Usage

### CLI demo

Run the Word2Vec vs Transformer comparison demo:

```sh
uv run python demo.py
```

This builds both engines on a sample corpus and compares similarity scores, semantic search, keyword matching, and clustering.

### Web UI (development)

```sh
# Terminal 1: API server
uv run python server.py

# Terminal 2: React dev server
cd frontend && npm run dev
```

### Docker

```sh
docker compose up --build
```

The app will be available at http://localhost:8000. The Docker build compiles the React frontend and bundles it with the FastAPI server in a single container.

## How it works

Pipeline: **TRAIN → INDEX → ANALYZE → EVALUATE**

1. **Train**: Fine-tune a pretrained sentence-transformer on your corpus using one of three strategies:
   - **Unsupervised (TSDAE)**: No labels needed. Learns vocabulary and phrasing via a denoising autoencoder.
   - **Contrastive**: Auto-mines training pairs from document structure (adjacent sentences = similar).
   - **Keyword-supervised**: You provide a keyword → meaning map (e.g. `{"pizza": "school"}`). The trainer generates context-aware training pairs.
2. **Index**: Chunk your documents and encode them into a FAISS vector index using the fine-tuned model.
3. **Analyze**: Query the index with semantic search, compare texts, analyze keyword meanings across documents, or match keywords to candidate meanings.
4. **Evaluate**: Measure disambiguation accuracy, retrieval metrics (P@K, MRR, NDCG), and clustering quality (NMI).
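FAISS itself isn't needed to see how the INDEX and ANALYZE steps fit together. Below is a brute-force numpy stand-in for a flat inner-product index; `toy_encode` is a hypothetical placeholder for the fine-tuned model's `encode` call, so the vectors are deterministic noise rather than real embeddings:

```python
import numpy as np

def toy_encode(texts: list[str], dim: int = 8) -> np.ndarray:
    """Deterministic placeholder embeddings. The real pipeline would call
    SentenceTransformer.encode on the fine-tuned model instead."""
    return np.stack([
        np.random.default_rng(sum(ord(c) for c in t)).standard_normal(dim)
        for t in texts
    ])

class BruteForceIndex:
    """Stand-in for a FAISS flat inner-product index over normalized vectors."""
    def __init__(self, vectors: np.ndarray):
        self.vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

    def search(self, queries: np.ndarray, k: int):
        q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        scores = q @ self.vectors.T                    # cosine similarities
        order = np.argsort(-scores, axis=1)[:, :k]     # top-k per query
        return np.take_along_axis(scores, order, axis=1), order

# INDEX: chunk documents (here, pre-chunked strings) and encode them.
chunks = ["pizza delivery tonight", "the school board meeting", "weather report"]
index = BruteForceIndex(toy_encode(chunks))

# ANALYZE: semantic search returns the nearest chunks with their scores.
scores, ids = index.search(toy_encode(["pizza delivery tonight"]), k=2)
print(chunks[ids[0][0]], scores[0][0])  # exact match ranks first, score ~1.0
```

The real engine follows the same shape, with model-quality embeddings and a persistent FAISS index in place of the toys above.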

## API endpoints

### Training

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | `/api/train/unsupervised` | TSDAE domain adaptation |
| POST | `/api/train/contrastive` | Contrastive with auto-mined pairs |
| POST | `/api/train/keywords` | Keyword-supervised training |
| POST | `/api/train/evaluate` | Compare base vs. trained model |
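As a rough illustration, a keyword-supervised training request would carry a corpus plus the keyword → meaning map. The field names below are assumptions for illustration only; check the live `/docs` schema for the actual shape:

```json
{
  "documents": ["The pizza starts at 8am and runs until 3pm."],
  "keyword_map": { "pizza": "school" },
  "epochs": 3
}
```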

### Engine

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | `/api/init` | Initialize engine with a model |
| POST | `/api/documents` | Add a document |
| POST | `/api/documents/upload` | Upload a file as a document |
| POST | `/api/index/build` | Build FAISS index |
| POST | `/api/query` | Semantic search |
| POST | `/api/compare` | Compare two texts |
| POST | `/api/analyze/keyword` | Single-keyword analysis |
| POST | `/api/analyze/batch` | Multi-keyword batch analysis |
| POST | `/api/match` | Match keyword to candidate meanings |
| GET | `/api/stats` | Corpus statistics |

### Evaluation

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | `/api/eval/disambiguation` | Disambiguation accuracy |
| POST | `/api/eval/retrieval` | Retrieval metrics (P@K, MRR, NDCG) |
| GET | `/api/eval/similarity-distribution` | Pairwise similarity histogram |
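For reference, the retrieval metrics behind `/api/eval/retrieval` have compact textbook definitions. A small sketch with binary relevance; this mirrors the standard formulas, not necessarily the project's exact implementation:

```python
import math

def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def mrr(ranked: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none found)."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Discounted cumulative gain, normalized by the ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1)
              if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["d2", "d5", "d1"]
relevant = {"d1", "d2"}
print(precision_at_k(ranked, relevant, 2))  # 0.5
print(mrr(ranked, relevant))                # 1.0
```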

### Word2Vec baseline

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | `/api/w2v/init` | Train Word2Vec on the corpus |
| POST | `/api/w2v/compare` | Compare two texts |
| POST | `/api/w2v/query` | Search the corpus |
| POST | `/api/w2v/similar-words` | Find similar words |

Full interactive API docs are available at `/docs` while the server is running.

## Project structure

```text
esfiles/
├── pyproject.toml              # Dependencies (uv)
├── requirements.txt            # Fallback for pip
├── uv.lock                     # Lockfile for reproducible installs
├── contextual_similarity.py    # Core engine: chunking, embedding, FAISS, analysis
├── training.py                 # Training pipeline: 3 strategies + evaluation
├── evaluation.py               # Evaluation: metrics, reports
├── word2vec_baseline.py        # gensim Word2Vec baseline
├── data_loader.py              # Epstein Files dataset loader (HuggingFace + ChromaDB)
├── server.py                   # FastAPI REST API
├── demo.py                     # CLI demo: Word2Vec vs Transformer comparison
├── Dockerfile                  # Multi-stage build (Node + Python)
├── docker-compose.yml          # Local Docker setup
├── HOWTO.md                    # In-depth usage guide
└── frontend/                   # React + TypeScript UI
    ├── package.json
    ├── vite.config.ts
    ├── index.html
    └── src/
        ├── App.tsx             # Main app with tab navigation
        ├── api.ts              # API client
        ├── types.ts            # TypeScript types
        └── components/         # UI components (training, search, evaluation, etc.)
```

## Base models

| Model | Dimensions | Quality | Speed |
| --- | --- | --- | --- |
| `all-MiniLM-L6-v2` | 384 | Good | Fast |
| `all-mpnet-base-v2` | 768 | Best | Medium |

Start with `all-MiniLM-L6-v2` for fast iteration; switch to `all-mpnet-base-v2` for production quality.

## Further reading

See [HOWTO.md](HOWTO.md) for detailed usage examples, including Python API usage, training configuration, tuning parameters, and evaluation metrics.

## License

Apache 2.0