---
title: SciFact Multilingual Semantic Search
emoji: 🔬
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
---

SciFact Multilingual Semantic Search

A deployable semantic search engine over 5,183 scientific abstracts (SciFact dataset), using ChromaDB for vector storage and multilingual-e5-small for cross-lingual search in English, French, German, and Spanish.

Live demo: huggingface.co/spaces/RJuro/scifact-semantic-search

Architecture

LOCAL (one-time setup)                    HF SPACES (runtime, CPU)
──────────────────────                    ────────────────────────
SciFact dataset                           Load ChromaDB from data/chroma_db/
    ↓                                     Load multilingual-e5-small
Encode 5,183 docs with                        ↓
multilingual-e5-small                     /search?q=... →
    ↓                                       encode query →
Save to ChromaDB ──── push via git ────→    ChromaDB query →
(data/chroma_db/)                           JSON results

Key idea: Corpus encoding happens once on your machine. Only query encoding runs on HF Spaces (CPU). This keeps the Space fast and cheap.
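The runtime half of that split can be sketched roughly as follows; the function and variable names here are illustrative assumptions, not app.py's actual code. The point is that nothing in this path re-encodes the corpus:

```python
# Hypothetical startup sketch for the HF Spaces side (names are assumptions):
# the Space only loads the prebuilt index and the small query-encoding model.

MODEL_NAME = "intfloat/multilingual-e5-small"

def load_runtime(db_path="data/chroma_db"):
    # Heavy imports kept local so the module can be read without the deps.
    import chromadb
    from sentence_transformers import SentenceTransformer

    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_collection("scifact")   # precomputed vectors
    model = SentenceTransformer(MODEL_NAME)         # used for queries only
    return collection, model
```

Loading the 118M-parameter model is cheap enough for a free CPU Space; the expensive part, encoding 5,183 abstracts, never runs here.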

Files

File               Purpose
precompute.py      Encodes all 5,183 SciFact docs and saves them to ChromaDB (run locally)
app.py             FastAPI server — loads ChromaDB + model, serves search API and frontend
static/index.html  Frontend — vanilla HTML/CSS/JS, no dependencies
requirements.txt   Python dependencies
Dockerfile         Container config for HF Spaces
README.md          This file (YAML header is required by HF Spaces)

Step-by-Step Deployment Guide

Prerequisites

  • Python 3.9+
  • A free Hugging Face account
  • Git with Git LFS installed (brew install git-lfs on macOS)

Step 1 — Install dependencies

pip install -r requirements.txt

Step 2 — Run precompute.py (local, one-time)

This downloads the SciFact dataset, encodes all 5,183 abstracts with intfloat/multilingual-e5-small, and saves the vectors + metadata into a persistent ChromaDB at data/chroma_db/.

python precompute.py

Takes ~2 minutes on CPU. When done you should see:

ChromaDB persisted to: .../data/chroma_db
Collection 'scifact': 5183 documents

Verify the output:

ls data/chroma_db/
# Should show: chroma.sqlite3  and a UUID-named directory
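A minimal sketch of what this step does, assuming the corpus is a list of records with id/title/text fields; the helper names are illustrative, not precompute.py's exact code:

```python
# Sketch of the encode-and-persist flow (assumed names, not the script's API).

def prefix_passages(texts):
    """E5 models expect a 'passage: ' prefix on every document at index time."""
    return [f"passage: {t}" for t in texts]

def build_index(docs, db_path="data/chroma_db"):
    # Heavy imports kept local so prefix_passages stays dependency-free.
    import chromadb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("intfloat/multilingual-e5-small")
    embeddings = model.encode(
        prefix_passages([d["text"] for d in docs]),
        normalize_embeddings=True,
    )

    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection(
        "scifact", metadata={"hnsw:space": "cosine"})
    # Pass precomputed embeddings so ChromaDB never re-embeds at runtime.
    collection.add(
        ids=[d["id"] for d in docs],
        embeddings=embeddings.tolist(),
        documents=[d["text"] for d in docs],
        metadatas=[{"title": d["title"]} for d in docs],
    )
```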

Step 3 — Test locally

uvicorn app:app --port 7860

Open localhost:7860 in your browser. Try searching:

  • effects of vaccination (English)
  • effets de la vaccination (French)
  • Auswirkungen der Impfung (German)

The same English-language corpus should return relevant results regardless of query language.

Step 4 — Create a Hugging Face Space

Go to huggingface.co/new-space:

  • Space name: choose any name (e.g. scifact-semantic-search)
  • SDK: Docker
  • Visibility: Public

Or use the CLI:

pip install huggingface-hub
huggingface-cli login  # paste your HF token
python -c "
from huggingface_hub import HfApi
api = HfApi()
url = api.create_repo('YOUR-SPACE-NAME', repo_type='space', space_sdk='docker')
print(url)
"

Step 5 — Push to HF Spaces

Initialize git, enable LFS (needed because chroma.sqlite3 is ~74 MB), and push:

git init
git lfs install
git lfs track "*.sqlite3" "*.bin"
git add .
git commit -m "Initial deploy"
git remote add origin https://huggingface.co/spaces/YOUR-USERNAME/YOUR-SPACE-NAME
git push origin main

If the push is rejected (HF creates a default commit), pull first:

git pull origin main --rebase
# Resolve any conflicts in .gitattributes / README.md (keep your versions)
git add .
git rebase --continue
git push origin main

Step 6 — Wait for build

HF Spaces will build the Docker image (installs PyTorch, sentence-transformers, etc.). This takes 5-10 minutes on the first deploy. Watch progress in the Space's Logs tab.

Once the status shows Running, your app is live.

How It Works

Embedding model

intfloat/multilingual-e5-small (118M params, 384 dimensions)

This is a compact multilingual retrieval model. Critical detail — E5 models require prefixes:

  • Documents: passage: {text}
  • Queries: query: {text}

Without these prefixes, retrieval quality drops significantly.

Vector database

ChromaDB with persistent storage and cosine distance. Documents are stored with precomputed embeddings so ChromaDB doesn't need to re-embed anything at runtime.

Search flow

  1. User types a query in any supported language
  2. FastAPI encodes it with query: {text} prefix using the E5 model
  3. ChromaDB finds the 5 nearest neighbors by cosine distance
  4. Results are returned as JSON: {rank, score, title, text}
  5. Score = 1 - cosine_distance (displayed as similarity percentage)
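The steps above can be sketched as a pair of functions; the handler and helper names are assumptions, and only the query: prefix and the score = 1 - cosine_distance formula come from this README:

```python
# Hypothetical sketch of the search flow (assumed names, not app.py's code).

def shape_results(distances, documents, metadatas):
    """Turn one ChromaDB query result into the {rank, score, title, text} JSON."""
    return [
        {
            "rank": i + 1,
            "score": round(1 - dist, 4),   # similarity = 1 - cosine distance
            "title": meta["title"],
            "text": doc,
        }
        for i, (dist, doc, meta) in enumerate(zip(distances, documents, metadatas))
    ]

def search(collection, model, q, top_k=5):
    # Queries need the 'query: ' prefix; documents were stored with 'passage: '.
    q_emb = model.encode([f"query: {q}"], normalize_embeddings=True)
    res = collection.query(query_embeddings=q_emb.tolist(), n_results=top_k)
    # ChromaDB returns lists-of-lists (one inner list per query); take the first.
    return shape_results(res["distances"][0], res["documents"][0], res["metadatas"][0])
```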

Cross-lingual search

The multilingual E5 model maps text from different languages into the same vector space. A French query about vaccination lands near English documents about vaccination — no translation needed.

Customization Ideas

  • Different dataset: Replace load_scifact() in precompute.py with your own corpus
  • More languages: The model supports 100+ languages β€” add more example chips in index.html
  • More results: Change top_k parameter (default 5, max 20)
  • Reranking: Add a cross-encoder reranker on top of the retrieval results for better precision