---
title: SciFact Multilingual Semantic Search
emoji: 🔬
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
---

SciFact Multilingual Semantic Search

A deployable semantic search engine over 5,183 scientific abstracts (SciFact dataset), using ChromaDB for vector storage and multilingual-e5-small for cross-lingual search in English, French, German, and Spanish.

Live demo: huggingface.co/spaces/RJuro/scifact-semantic-search

Architecture

LOCAL (one-time setup)                    HF SPACES (runtime, CPU)
──────────────────────                    ────────────────────────
SciFact dataset                           Load ChromaDB from data/chroma_db/
    ↓                                     Load multilingual-e5-small
Encode 5,183 docs with                        ↓
multilingual-e5-small                     /search?q=... →
    ↓                                       encode query →
Save to ChromaDB ──── push via git ────→    ChromaDB query →
(data/chroma_db/)                           JSON results

Key idea: Corpus encoding happens once on your machine. Only query encoding runs on HF Spaces (CPU). This keeps the Space fast and cheap.
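The runtime half of that split can be sketched roughly as follows; the function and variable names here are illustrative assumptions, not app.py's actual code. The point is that nothing in this path re-encodes the corpus:

```python
# Hypothetical startup sketch for the HF Spaces side (names are assumptions):
# the Space only loads the prebuilt index and the small query-encoding model.

MODEL_NAME = "intfloat/multilingual-e5-small"

def load_runtime(db_path="data/chroma_db"):
    # Heavy imports kept local so the module can be read without the deps.
    import chromadb
    from sentence_transformers import SentenceTransformer

    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_collection("scifact")   # precomputed vectors
    model = SentenceTransformer(MODEL_NAME)         # used for queries only
    return collection, model
```

Loading the 118M-parameter model is cheap enough for a free CPU Space; the expensive part, encoding 5,183 abstracts, never runs here.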

Files

File               Purpose
precompute.py      Encodes all 5,183 SciFact docs and saves them to ChromaDB (run locally)
app.py             FastAPI server — loads ChromaDB + model, serves search API and frontend
static/index.html  Frontend — vanilla HTML/CSS/JS, no dependencies
requirements.txt   Python dependencies
Dockerfile         Container config for HF Spaces
README.md          This file (YAML header is required by HF Spaces)

Step-by-Step Deployment Guide

Prerequisites

  • Python 3.9+
  • A free Hugging Face account
  • Git with Git LFS installed (brew install git-lfs on macOS)

Step 1 — Install dependencies

pip install -r requirements.txt

Step 2 — Run precompute.py (local, one-time)

This downloads the SciFact dataset, encodes all 5,183 abstracts with intfloat/multilingual-e5-small, and saves the vectors + metadata into a persistent ChromaDB at data/chroma_db/.

python precompute.py

Takes ~2 minutes on CPU. When done you should see:

ChromaDB persisted to: .../data/chroma_db
Collection 'scifact': 5183 documents

Verify the output:

ls data/chroma_db/
# Should show: chroma.sqlite3  and a UUID-named directory
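A minimal sketch of what this step does, assuming the corpus is a list of records with id/title/text fields; the helper names are illustrative, not precompute.py's exact code:

```python
# Sketch of the encode-and-persist flow (assumed names, not the script's API).

def prefix_passages(texts):
    """E5 models expect a 'passage: ' prefix on every document at index time."""
    return [f"passage: {t}" for t in texts]

def build_index(docs, db_path="data/chroma_db"):
    # Heavy imports kept local so prefix_passages stays dependency-free.
    import chromadb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("intfloat/multilingual-e5-small")
    embeddings = model.encode(
        prefix_passages([d["text"] for d in docs]),
        normalize_embeddings=True,
    )

    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection(
        "scifact", metadata={"hnsw:space": "cosine"})
    # Pass precomputed embeddings so ChromaDB never re-embeds at runtime.
    collection.add(
        ids=[d["id"] for d in docs],
        embeddings=embeddings.tolist(),
        documents=[d["text"] for d in docs],
        metadatas=[{"title": d["title"]} for d in docs],
    )
```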

Step 3 — Test locally

uvicorn app:app --port 7860

Open localhost:7860 in your browser. Try searching:

  • effects of vaccination (English)
  • effets de la vaccination (French)
  • Auswirkungen der Impfung (German)

The same English-language corpus should return relevant results regardless of query language.

Step 4 — Create a Hugging Face Space

Go to huggingface.co/new-space:

  • Space name: choose any name (e.g. scifact-semantic-search)
  • SDK: Docker
  • Visibility: Public

Or use the CLI:

pip install huggingface-hub
huggingface-cli login  # paste your HF token
python -c "
from huggingface_hub import HfApi
api = HfApi()
url = api.create_repo('YOUR-SPACE-NAME', repo_type='space', space_sdk='docker')
print(url)
"

Step 5 — Push to HF Spaces

Initialize git, enable LFS (needed because chroma.sqlite3 is ~74 MB), and push:

git init
git lfs install
git lfs track "*.sqlite3" "*.bin"
git add .
git commit -m "Initial deploy"
git remote add origin https://huggingface.co/spaces/YOUR-USERNAME/YOUR-SPACE-NAME
git push origin main

If the push is rejected (HF creates a default commit), pull first:

git pull origin main --rebase
# Resolve any conflicts in .gitattributes / README.md (keep your versions)
git add .
git rebase --continue
git push origin main

Step 6 — Wait for build

HF Spaces will build the Docker image (installs PyTorch, sentence-transformers, etc.). This takes 5-10 minutes on the first deploy. Watch progress in the Space's Logs tab.

Once the status shows Running, your app is live.

How It Works

Embedding model

intfloat/multilingual-e5-small (118M params, 384 dimensions)

This is a compact multilingual retrieval model. Critical detail — E5 models require prefixes:

  • Documents: passage: {text}
  • Queries: query: {text}

Without these prefixes, retrieval quality drops significantly.

Vector database

ChromaDB with persistent storage and cosine distance. Documents are stored with precomputed embeddings so ChromaDB doesn't need to re-embed anything at runtime.

Search flow

  1. User types a query in any supported language
  2. FastAPI encodes it with query: {text} prefix using the E5 model
  3. ChromaDB finds the 5 nearest neighbors by cosine distance
  4. Results are returned as JSON: {rank, score, title, text}
  5. Score = 1 - cosine_distance (displayed as similarity percentage)
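The steps above can be sketched as a pair of functions; the handler and helper names are assumptions, and only the query: prefix and the score = 1 - cosine_distance formula come from this README:

```python
# Hypothetical sketch of the search flow (assumed names, not app.py's code).

def shape_results(distances, documents, metadatas):
    """Turn one ChromaDB query result into the {rank, score, title, text} JSON."""
    return [
        {
            "rank": i + 1,
            "score": round(1 - dist, 4),   # similarity = 1 - cosine distance
            "title": meta["title"],
            "text": doc,
        }
        for i, (dist, doc, meta) in enumerate(zip(distances, documents, metadatas))
    ]

def search(collection, model, q, top_k=5):
    # Queries need the 'query: ' prefix; documents were stored with 'passage: '.
    q_emb = model.encode([f"query: {q}"], normalize_embeddings=True)
    res = collection.query(query_embeddings=q_emb.tolist(), n_results=top_k)
    # ChromaDB returns lists-of-lists (one inner list per query); take the first.
    return shape_results(res["distances"][0], res["documents"][0], res["metadatas"][0])
```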

Cross-lingual search

The multilingual E5 model maps text from different languages into the same vector space. A French query about vaccination lands near English documents about vaccination — no translation needed.

Customization Ideas

  • Different dataset: Replace load_scifact() in precompute.py with your own corpus
  • More languages: The model supports 100+ languages β€” add more example chips in index.html
  • More results: Change top_k parameter (default 5, max 20)
  • Reranking: Add a cross-encoder reranker on top of the retrieval results for better precision