indexing
Local semantic + lexical search over a folder of text and images.
Modalities:
semantic-text: dense embeddings over .txt/.md (sentence-transformers)
bm25: lexical ranking over .txt/.md
image: dense CLIP embeddings over images
Group aliases:
text: semantic-text + bm25
Install
pip install -r requirements.txt
Build the index
python index.py /path/to/folder
Writes index_data/semantic-text.faiss, index_data/bm25.pkl, index_data/image.faiss, plus *_meta.json.
Query (CLI)
The CLI is a thin HTTP client of backend/server.py so the model weights stay
loaded in one process. Start the backend in another terminal first:
indexing/.env/bin/python backend/server.py
Then:
python query.py "your query" # all modalities
python query.py "your query" 10 # top_k = 10
python query.py "your query" -m text # semantic-text + bm25
python query.py "your query" -m semantic-text # dense text only
python query.py "your query" -m bm25 # lexical text only
python query.py "your query" -m image # images only
python query.py "your query" -m text,image # everything (same as default)
Set RAGSTUDIO_URL to point at a non-default backend
(default http://127.0.0.1:8000).
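How query.py reads this variable is not shown here. As a minimal sketch, assuming the usual "environment variable with a fallback" pattern, the lookup could look like this (backend_url and DEFAULT_URL are hypothetical names, not part of the repo):

```python
import os

DEFAULT_URL = "http://127.0.0.1:8000"  # default stated in this README


def backend_url(env=None):
    """Return the backend base URL, preferring RAGSTUDIO_URL when it is set.

    Hypothetical helper: the real query.py may resolve the URL differently.
    """
    env = os.environ if env is None else env
    # Treat an empty value as unset, and drop a trailing slash so that
    # request paths can be appended without producing "//".
    return (env.get("RAGSTUDIO_URL") or DEFAULT_URL).rstrip("/")


print(backend_url({}))
print(backend_url({"RAGSTUDIO_URL": "http://10.0.0.5:9000/"}))
```

Passing a plain dict instead of os.environ keeps the helper easy to test without touching the real environment.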
Query (Python, in-process)
The searchers can still be imported directly if you don't want to run the backend, but each fresh Python process reloads the model weights:
from searchers import SEARCHERS
SEARCHERS["semantic-text"]("your query", top_k=5) # -> [(score, path), ...]
SEARCHERS["bm25"]("your query", top_k=5)
SEARCHERS["image"]("your query", top_k=5)
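Since every searcher returns the same [(score, path), ...] shape, several modalities can be queried in one process with a small wrapper. The sketch below is an illustration only: search_many is a hypothetical helper, and the stand-in searchers replace the real SEARCHERS so the snippet runs without the models. Results stay grouped per modality because BM25 scores and embedding similarities are on different scales.

```python
from typing import Callable

Result = tuple[float, str]  # (score, path), as returned by each searcher
Searcher = Callable[..., list[Result]]


def search_many(searchers: dict[str, Searcher], names: list[str],
                query: str, top_k: int = 5) -> dict[str, list[Result]]:
    """Run the named searchers and return their hits keyed by modality."""
    return {name: searchers[name](query, top_k=top_k) for name in names}


def _fake(prefix: str) -> Searcher:
    """Build a stand-in searcher that returns made-up, descending scores."""
    def search(query: str, top_k: int) -> list[Result]:
        return [(1.0 / (i + 1), f"{prefix}/{i}.txt") for i in range(top_k)]
    return search


# Stand-ins; with the real package you would pass SEARCHERS instead.
FAKE_SEARCHERS = {"semantic-text": _fake("sem"), "bm25": _fake("bm25")}

results = search_many(FAKE_SEARCHERS, ["semantic-text", "bm25"], "your query", top_k=2)
for name, hits in results.items():
    for score, path in hits:
        print(f"{name}\t{score:.3f}\t{path}")
```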
Add a modality
Create searchers/<name>.py exposing search_<name>(query: str, top_k: int) -> list[tuple[float, str]].
Register it in searchers/__init__.py:
from .audio import search_audio
SEARCHERS["audio"] = search_audio
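To make the expected signature concrete, here is a toy searcher that matches it. Everything in it is an assumption for illustration: the "filename" modality, the scoring rule, and the extra root parameter are hypothetical, and the real searchers query the files under index_data/ rather than walking a folder at search time.

```python
from pathlib import Path


def search_filename(query: str, top_k: int, root: str = ".") -> list[tuple[float, str]]:
    """Score each file by the fraction of query words found in its file name.

    Hypothetical example modality; `root` defaults to the current directory
    so the function can still be called as search_filename(query, top_k).
    """
    words = query.lower().split()
    if not words:
        return []
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        name = path.name.lower()
        score = sum(word in name for word in words) / len(words)
        if score > 0:
            hits.append((score, str(path)))
    # Highest score first; ties are broken by path so the order is stable.
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return hits[:top_k]
```

Saved as searchers/filename.py, it would be registered the same way as the audio example: import search_filename and assign it to SEARCHERS["filename"].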
To group several modalities under one alias, add to GROUPS in searchers/__init__.py:
GROUPS["av"] = ("audio", "image")
Both SEARCHERS keys and GROUPS keys work as CLI -m values.
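One way such a -m value could be expanded into searcher names is sketched below. This is an assumption about the behaviour, not the code in query.py: resolve_modalities is a hypothetical name, and the two dicts are stand-ins that mirror the modalities and the text alias listed above.

```python
# Stand-ins mirroring this README; the real objects live in searchers/__init__.py.
SEARCHERS = {"semantic-text": None, "bm25": None, "image": None}
GROUPS = {"text": ("semantic-text", "bm25")}


def resolve_modalities(spec, searchers=SEARCHERS, groups=GROUPS):
    """Expand a comma-separated -m value into an ordered list of searcher names.

    Hypothetical helper: group aliases are expanded, duplicates are dropped,
    and a missing value selects every registered searcher.
    """
    if not spec:
        return list(searchers)
    resolved = []
    for token in spec.split(","):
        token = token.strip()
        if token in groups:
            names = groups[token]
        elif token in searchers:
            names = (token,)
        else:
            known = sorted(set(searchers) | set(groups))
            raise ValueError(f"unknown modality {token!r}; expected one of {known}")
        for name in names:
            if name not in resolved:
                resolved.append(name)
    return resolved


print(resolve_modalities("text,image"))  # ['semantic-text', 'bm25', 'image']
```

Checking GROUPS before SEARCHERS is a design choice in this sketch; it only matters if an alias and a modality ever share a name.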