---
title: SciFact Multilingual Semantic Search
emoji: 🔬
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
---
# SciFact Multilingual Semantic Search
A deployable semantic search engine over 5,183 scientific abstracts (SciFact dataset), using ChromaDB for vector storage and multilingual-e5-small for cross-lingual search in English, French, German, and Spanish.
**Live demo:** huggingface.co/spaces/RJuro/scifact-semantic-search
## Architecture
```
LOCAL (one-time setup)                  HF SPACES (runtime, CPU)
──────────────────────                  ────────────────────────
SciFact dataset                         Load ChromaDB from data/chroma_db/
      │                                 Load multilingual-e5-small
Encode 5,183 docs with                        │
multilingual-e5-small                   /search?q=...
      │                                       │ encode query
Save to ChromaDB ──── push via git ───► ChromaDB query
(data/chroma_db/)                             │
                                        JSON results
```
**Key idea:** Corpus encoding happens once on your machine. Only query encoding runs on HF Spaces (CPU). This keeps the Space fast and cheap.
## Files
| File | Purpose |
|---|---|
| `precompute.py` | Encodes all 5,183 SciFact docs and saves them to ChromaDB (run locally) |
| `app.py` | FastAPI server: loads ChromaDB + model, serves the search API and frontend |
| `static/index.html` | Frontend: vanilla HTML/CSS/JS, no dependencies |
| `requirements.txt` | Python dependencies |
| `Dockerfile` | Container config for HF Spaces |
| `README.md` | This file (the YAML header is required by HF Spaces) |
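For orientation, a minimal Dockerfile for a FastAPI app on a Docker-SDK Space typically looks like the sketch below. This is an assumption about the shape of the file, not necessarily the repo's exact contents; the key constraint from the YAML header is that the server must listen on `app_port` (7860).

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the app code, the frontend, and the prebuilt ChromaDB.
COPY . .

EXPOSE 7860
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
```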
## Step-by-Step Deployment Guide

### Prerequisites
- Python 3.9+
- A free Hugging Face account
- Git with Git LFS installed (`brew install git-lfs` on macOS)
### Step 1 – Install dependencies

```bash
pip install -r requirements.txt
```
### Step 2 – Run `precompute.py` (local, one-time)

This downloads the SciFact dataset, encodes all 5,183 abstracts with `intfloat/multilingual-e5-small`, and saves the vectors and metadata into a persistent ChromaDB at `data/chroma_db/`.

```bash
python precompute.py
```
Takes ~2 minutes on CPU. When done you should see:

```
ChromaDB persisted to: .../data/chroma_db
Collection 'scifact': 5183 documents
```
Verify the output:

```bash
ls data/chroma_db/
# Should show: chroma.sqlite3 and a UUID-named directory
```
### Step 3 – Test locally

```bash
uvicorn app:app --port 7860
```

Open http://localhost:7860 in your browser. Try searching:

- `effects of vaccination` (English)
- `effets de la vaccination` (French)
- `Auswirkungen der Impfung` (German)
The same English-language corpus should return relevant results regardless of query language.
### Step 4 – Create a Hugging Face Space

Go to huggingface.co/new-space:

- **Space name:** choose any name (e.g. `scifact-semantic-search`)
- **SDK:** Docker
- **Visibility:** Public
Or use the CLI:

```bash
pip install huggingface-hub
huggingface-cli login  # paste your HF token
python -c "
from huggingface_hub import HfApi
api = HfApi()
url = api.create_repo('YOUR-SPACE-NAME', repo_type='space', space_sdk='docker')
print(url)
"
```
### Step 5 – Push to HF Spaces

Initialize git, enable LFS (needed because `chroma.sqlite3` is ~74 MB), and push:

```bash
git init
git lfs install
git lfs track "*.sqlite3" "*.bin"
git add .
git commit -m "Initial deploy"
git remote add origin https://huggingface.co/spaces/YOUR-USERNAME/YOUR-SPACE-NAME
git push origin main
```
If the push is rejected (HF creates a default commit), pull first:

```bash
git pull origin main --rebase
# Resolve any conflicts in .gitattributes / README.md (keep your versions)
git add .
git rebase --continue
git push origin main
```
### Step 6 – Wait for build
HF Spaces will build the Docker image (installs PyTorch, sentence-transformers, etc.). This takes 5-10 minutes on the first deploy. Watch progress in the Space's Logs tab.
Once the status shows **Running**, your app is live.
## How It Works

### Embedding model

`intfloat/multilingual-e5-small` (118M params, 384 dimensions)
This is a compact multilingual retrieval model. Critical detail: E5 models require prefixes:

- Documents: `passage: {text}`
- Queries: `query: {text}`
Without these prefixes, retrieval quality drops significantly.
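As a minimal illustration of the prefix convention (the helper name is hypothetical; `app.py` and `precompute.py` may simply inline the string concatenation):

```python
def with_e5_prefix(text: str, is_query: bool = False) -> str:
    """Prepend the prefix that E5 models expect before encoding.

    Documents get "passage: ", queries get "query: ".
    (Illustrative helper, not necessarily a function in this repo.)
    """
    return ("query: " if is_query else "passage: ") + text

# Then encode as usual, e.g.:
# model.encode(with_e5_prefix("effects of vaccination", is_query=True))
```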
### Vector database
ChromaDB with persistent storage and cosine distance. Documents are stored with precomputed embeddings so ChromaDB doesn't need to re-embed anything at runtime.
### Search flow

1. User types a query in any supported language
2. FastAPI encodes it with the `query: {text}` prefix using the E5 model
3. ChromaDB finds the 5 nearest neighbors by cosine distance
4. Results are returned as JSON: `{rank, score, title, text}`
5. Score = `1 - cosine_distance` (displayed as a similarity percentage)
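Steps 3–5 amount to shaping a ChromaDB result into the JSON payload. A sketch (the function name is illustrative, not from `app.py`; ChromaDB returns parallel lists nested one level, because a single call can carry multiple query embeddings):

```python
def build_results(chroma_result: dict) -> list[dict]:
    """Shape a ChromaDB query result into the API's JSON payload."""
    docs = chroma_result["documents"][0]
    metas = chroma_result["metadatas"][0]
    dists = chroma_result["distances"][0]
    return [
        {
            "rank": i + 1,
            "score": round(1.0 - dist, 4),  # similarity = 1 - cosine distance
            "title": meta.get("title", ""),
            "text": doc,
        }
        for i, (doc, meta, dist) in enumerate(zip(docs, metas, dists))
    ]
```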
### Cross-lingual search

The multilingual E5 model maps text from different languages into the same vector space. A French query about vaccination lands near English documents about vaccination, with no translation needed.
## Customization Ideas

- **Different dataset:** Replace `load_scifact()` in `precompute.py` with your own corpus
- **More languages:** The model supports 100+ languages; add more example chips in `index.html`
- **More results:** Change the `top_k` parameter (default 5, max 20)
- **Reranking:** Add a cross-encoder reranker on top of the retrieval results for better precision
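The reranking idea, sketched with a pluggable scorer. In practice `score_fn` would wrap a cross-encoder, e.g. `lambda q, t: cross_encoder.predict([(q, t)])[0]` with `sentence_transformers.CrossEncoder`; here a trivial word-overlap scorer keeps the sketch self-contained (names are illustrative, not repo code):

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[dict],
    score_fn: Callable[[str, str], float],
    top_k: int = 5,
) -> list[dict]:
    """Re-order retrieval candidates by a (query, text) relevance scorer."""
    return sorted(
        candidates,
        key=lambda c: score_fn(query, c["text"]),
        reverse=True,
    )[:top_k]

# Stand-in scorer for illustration: count words shared by query and text.
def word_overlap(q: str, t: str) -> float:
    return float(len(set(q.lower().split()) & set(t.lower().split())))
```

A cross-encoder scores each (query, document) pair jointly, which is slower than bi-encoder retrieval but more precise, so it is typically applied only to the handful of candidates the vector search returns.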