# indexing Local semantic + lexical search over a folder of text and images. Modalities: - `semantic-text` — dense embeddings over `.txt` / `.md` (sentence-transformers) - `bm25` — lexical ranking over `.txt` / `.md` - `image` — dense CLIP embeddings over images Group aliases: - `text` → `semantic-text` + `bm25` ## Install ```bash pip install -r requirements.txt ``` ## Build the index ```bash python index.py /path/to/folder ``` Writes `index_data/semantic-text.faiss`, `index_data/bm25.pkl`, `index_data/image.faiss`, plus `*_meta.json`. ## Query (CLI) The CLI is a thin HTTP client of `backend/server.py` so the model weights stay loaded in one process. Start the backend in another terminal first: ```bash indexing/.env/bin/python backend/server.py ``` Then: ```bash python query.py "your query" # all modalities python query.py "your query" 10 # top_k = 10 python query.py "your query" -m text # semantic-text + bm25 python query.py "your query" -m semantic-text # dense text only python query.py "your query" -m bm25 # lexical text only python query.py "your query" -m image # images only python query.py "your query" -m text,image # everything (same as default) ``` Set `RAGSTUDIO_URL` to point at a non-default backend (default `http://127.0.0.1:8000`). ## Query (Python, in-process) The searchers can still be imported directly if you don't want to run the backend — but each fresh Python process re-loads the model weights: ```python from searchers import SEARCHERS SEARCHERS["semantic-text"]("your query", top_k=5) # -> [(score, path), ...] SEARCHERS["bm25"]("your query", top_k=5) SEARCHERS["image"]("your query", top_k=5) ``` ## Add a modality 1. Create `searchers/.py` exposing `search_(query: str, top_k: int) -> list[tuple[float, str]]`. 2. Register it in `searchers/__init__.py`: ```python from .audio import search_audio SEARCHERS["audio"] = search_audio ``` To group several modalities under one alias, add to `GROUPS` in the same file: ```python GROUPS["av"] = ("audio", "image") ``` Both `SEARCHERS` keys and `GROUPS` keys work as CLI `-m` values.