Jikkii

Focus on the backend; beginning of the UI

fed0900 14 days ago

indexing

Local semantic + lexical search over a folder of text and images.

Modalities:

semantic-text — dense embeddings over .txt / .md (sentence-transformers)
bm25 — lexical ranking over .txt / .md
image — dense CLIP embeddings over images

Group aliases:

text → semantic-text + bm25

Install

pip install -r requirements.txt

Build the index

python index.py /path/to/folder

This writes index_data/semantic-text.faiss, index_data/bm25.pkl, and index_data/image.faiss, plus *_meta.json metadata files.

Query (CLI)

The CLI is a thin HTTP client of backend/server.py so the model weights stay loaded in one process. Start the backend in another terminal first:

indexing/.env/bin/python backend/server.py

Then:

python query.py "your query"                  # all modalities
python query.py "your query" 10               # top_k = 10
python query.py "your query" -m text          # semantic-text + bm25
python query.py "your query" -m semantic-text # dense text only
python query.py "your query" -m bm25          # lexical text only
python query.py "your query" -m image         # images only
python query.py "your query" -m text,image    # everything (same as default)

Set RAGSTUDIO_URL to point at a non-default backend (default http://127.0.0.1:8000).
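
For illustration, here is a minimal sketch of how a client could pick up that setting. It assumes a plain environment lookup with the documented default as fallback. The helper name `backend_url` is hypothetical and is not taken from query.py.

```python
import os

DEFAULT_URL = "http://127.0.0.1:8000"  # default stated in this README


def backend_url(env=None):
    """Return the backend base URL, preferring RAGSTUDIO_URL when it is set.

    `env` is a hypothetical hook for testing; by default os.environ is used.
    """
    env = os.environ if env is None else env
    # Fall back to the default when the variable is unset or empty, and
    # strip a trailing slash so request paths can be appended consistently.
    return (env.get("RAGSTUDIO_URL") or DEFAULT_URL).rstrip("/")


print(backend_url({}))
```

With this shape, `RAGSTUDIO_URL=http://10.0.0.5:9000 python query.py "your query"` would target the other host without any code change.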

Query (Python, in-process)

The searchers can also be imported directly if you don't want to run the backend, but every new Python process reloads the model weights:

from searchers import SEARCHERS

SEARCHERS["semantic-text"]("your query", top_k=5)  # -> [(score, path), ...]
SEARCHERS["bm25"]("your query", top_k=5)
SEARCHERS["image"]("your query", top_k=5)
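
Because every searcher returns (score, path) pairs, results from several modalities can be collected side by side. The sketch below uses stand-in searchers so it runs without a built index. `search_all` and the `fake` registry are hypothetical names, not part of this package. Scores are kept per modality because BM25 scores and embedding similarities are generally not on the same scale.

```python
from typing import Callable

Result = tuple[float, str]
Searcher = Callable[..., list[Result]]


def search_all(searchers: dict[str, Searcher], query: str, top_k: int = 5) -> dict[str, list[Result]]:
    """Run every searcher and keep each modality's ranked list separate."""
    return {name: fn(query, top_k=top_k) for name, fn in searchers.items()}


# Stand-in searchers with fixed results; with a built index you would
# pass the real registry instead (from searchers import SEARCHERS).
fake = {
    "bm25": lambda q, top_k: [(7.2, "notes/a.md"), (3.1, "notes/b.txt")][:top_k],
    "image": lambda q, top_k: [(0.31, "img/cat.png")][:top_k],
}

for modality, hits in search_all(fake, "cat", top_k=1).items():
    for score, path in hits:
        print(f"{modality}\t{score:.3f}\t{path}")
```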

Add a modality

Create searchers/<name>.py exposing search_<name>(query: str, top_k: int) -> list[tuple[float, str]]. Then register it in the searchers module that defines SEARCHERS:

from .audio import search_audio
SEARCHERS["audio"] = search_audio
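
As a rough sketch of what such a module could look like, here is a placeholder searchers/audio.py that matches the required signature. The scoring (fraction of query words found in the file name) is only a stand-in for a real audio index. The `AUDIO_DIR` constant and the optional `root` keyword are assumptions added for illustration.

```python
from pathlib import Path

# Hypothetical folder of audio files; a real searcher would query an index instead.
AUDIO_DIR = Path("audio")


def search_audio(query: str, top_k: int, *, root: Path = AUDIO_DIR) -> list[tuple[float, str]]:
    """Placeholder scorer: fraction of query words that appear in the file name."""
    words = query.lower().split()
    if not words or not root.is_dir():
        return []
    results = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        name = path.name.lower()
        score = sum(w in name for w in words) / len(words)
        if score > 0:
            results.append((score, str(path)))
    # Highest score first, in the same (score, path) shape as the other searchers.
    results.sort(key=lambda r: r[0], reverse=True)
    return results[:top_k]
```

The extra keyword-only argument keeps the function callable as search_audio(query, top_k), which is the form the registry expects.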

To group several modalities under one alias, add an entry to GROUPS in the same file as SEARCHERS:

GROUPS["av"] = ("audio", "image")

Both SEARCHERS keys and GROUPS keys work as CLI -m values.
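
One way to picture how a -m value such as text,image could be expanded is sketched below. This is an assumption about the behavior, not the actual code in query.py or the backend. The registries are stand-ins that mirror the names in this README, and `resolve_modalities` is a hypothetical helper.

```python
# Stand-in registries mirroring this README; the real ones live in the searchers package.
SEARCHERS = {"semantic-text": None, "bm25": None, "image": None}
GROUPS = {"text": ("semantic-text", "bm25")}


def resolve_modalities(spec: str) -> list[str]:
    """Expand a comma-separated -m value into SEARCHERS keys, without duplicates."""
    resolved = []
    for name in (part.strip() for part in spec.split(",")):
        if name in GROUPS:
            members = GROUPS[name]
        elif name in SEARCHERS:
            members = (name,)
        else:
            raise ValueError(f"unknown modality: {name!r}")
        for member in members:
            if member not in resolved:
                resolved.append(member)
    return resolved


print(resolve_modalities("text,image"))  # ['semantic-text', 'bm25', 'image']
```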