# indexing

Local semantic + lexical search over a folder of text and images.

Modalities:

- `semantic-text`: dense embeddings over `.txt` / `.md` (sentence-transformers)
- `bm25`: lexical ranking over `.txt` / `.md`
- `image`: dense CLIP embeddings over images

Group aliases:

- `text`: expands to `semantic-text` + `bm25`

## Install

```bash
pip install -r requirements.txt
```

## Build the index

```bash
python index.py /path/to/folder
```

Writes `index_data/semantic-text.faiss`, `index_data/bm25.pkl`, `index_data/image.faiss`, plus `*_meta.json`.

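A quick way to confirm a build finished is to check that the files named above exist. The sketch below is not part of the repository. The helper names `missing_index_files` and `meta_files` are made up for illustration, and it assumes all three index files are always written, which may not hold if the folder contains no images.

```python
from pathlib import Path

# File names taken from the sentence above. The exact *_meta.json names are
# not spelled out there, so they are matched by glob rather than guessed.
EXPECTED_FILES = ("semantic-text.faiss", "bm25.pkl", "image.faiss")


def missing_index_files(index_dir="index_data"):
    """Return the expected index files that are absent from index_dir."""
    root = Path(index_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def meta_files(index_dir="index_data"):
    """Return the sorted names of any *_meta.json files in index_dir."""
    return sorted(p.name for p in Path(index_dir).glob("*_meta.json"))
```

An empty list from `missing_index_files()` means every named index file is present.
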
## Query (CLI)

The CLI is a thin HTTP client of `backend/server.py` so the model weights stay
loaded in one process. Start the backend in another terminal first:

```bash
indexing/.env/bin/python backend/server.py
```

Then:

```bash
python query.py "your query"                    # all modalities
python query.py "your query" 10                 # top_k = 10
python query.py "your query" -m text            # semantic-text + bm25
python query.py "your query" -m semantic-text   # dense text only
python query.py "your query" -m bm25            # lexical text only
python query.py "your query" -m image           # images only
python query.py "your query" -m text,image      # everything (same as default)
```

Set `RAGSTUDIO_URL` to point at a non-default backend
(default `http://127.0.0.1:8000`).

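For a custom client, the same override can be read from the environment. This is a minimal sketch and `query.py` may resolve the URL differently. The helper name `backend_url` is hypothetical, and no endpoint paths are shown because this README does not name them.

```python
import os

DEFAULT_BACKEND_URL = "http://127.0.0.1:8000"  # default stated above


def backend_url(env=None):
    """Return the backend base URL, preferring RAGSTUDIO_URL when it is set."""
    env = os.environ if env is None else env
    # Assumption: an empty value counts as unset. The trailing slash is
    # dropped so request paths can be appended cleanly.
    url = env.get("RAGSTUDIO_URL") or DEFAULT_BACKEND_URL
    return url.rstrip("/")
```

Passing a plain dict as `env` makes the lookup easy to exercise without touching the real environment.
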
## Query (Python, in-process)

The searchers can still be imported directly if you don't want to run the
backend, but each fresh Python process reloads the model weights:

```python
from searchers import SEARCHERS

SEARCHERS["semantic-text"]("your query", top_k=5)  # -> [(score, path), ...]
SEARCHERS["bm25"]("your query", top_k=5)
SEARCHERS["image"]("your query", top_k=5)
```

## Add a modality

1. Create `searchers/<name>.py` exposing `search_<name>(query: str, top_k: int) -> list[tuple[float, str]]`.
2. Register it in `searchers/__init__.py`:

```python
from .audio import search_audio

SEARCHERS["audio"] = search_audio
```

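As an illustration of the signature in step 1, here is a toy `searchers/audio.py`. Everything inside it is hypothetical: the in-memory transcript list, the term-overlap scoring, and the `top_k` default stand in for whatever index and model a real audio searcher would use. It also assumes results are returned best first.

```python
from typing import List, Tuple

# Hypothetical in-memory "index" of (path, transcript) pairs. A real searcher
# would load whatever index.py wrote for this modality.
_TRANSCRIPTS = [
    ("audio/standup.wav", "weekly standup about search latency"),
    ("audio/demo.wav", "demo of image search"),
]


def search_audio(query: str, top_k: int = 5) -> List[Tuple[float, str]]:
    """Score each item by the fraction of query terms its transcript contains."""
    terms = set(query.lower().split())
    if not terms:
        return []
    scored = []
    for path, text in _TRANSCRIPTS:
        words = set(text.lower().split())
        score = len(terms & words) / len(terms)
        if score > 0:
            scored.append((score, path))
    scored.sort(key=lambda item: item[0], reverse=True)  # best match first
    return scored[:top_k]
```
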
To group several modalities under one alias, add to `GROUPS` in the same file:

```python
GROUPS["av"] = ("audio", "image")
```

Both `SEARCHERS` keys and `GROUPS` keys work as CLI `-m` values.
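
One way a comma-separated `-m` value could be expanded into searcher names is sketched below. This is an assumption about the behaviour, not the repository's actual code: `resolve_modalities` is a hypothetical helper, and the rules that groups take precedence, duplicates are dropped, and unknown names raise an error are illustrative choices.

```python
from typing import Dict, Iterable, List, Tuple


def resolve_modalities(spec: str, searchers: Iterable[str],
                       groups: Dict[str, Tuple[str, ...]]) -> List[str]:
    """Expand a comma-separated -m value into searcher names, without duplicates."""
    known = set(searchers)
    resolved: List[str] = []
    for token in (t.strip() for t in spec.split(",")):
        if not token:
            continue  # ignore empty pieces such as a trailing comma
        if token in groups:
            names = groups[token]
        elif token in known:
            names = (token,)
        else:
            raise ValueError(f"unknown modality or group: {token!r}")
        for name in names:
            if name not in resolved:
                resolved.append(name)
    return resolved


# Names from this README, including the example "audio" modality and "av" group.
SEARCHER_NAMES = ["semantic-text", "bm25", "image", "audio"]
GROUPS = {"text": ("semantic-text", "bm25"), "av": ("audio", "image")}
```

For example, `resolve_modalities("text,image", SEARCHER_NAMES, GROUPS)` expands the `text` alias and then appends `image`.
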