# indexing
Local semantic + lexical search over a folder of text and images.
Modalities:
- `semantic-text`: dense embeddings over `.txt` / `.md` (sentence-transformers)
- `bm25`: lexical ranking over `.txt` / `.md`
- `image`: dense CLIP embeddings over images
Group aliases:
- `text` → `semantic-text` + `bm25`
## Install
```bash
pip install -r requirements.txt
```
## Build the index
```bash
python index.py /path/to/folder
```
Writes `index_data/semantic-text.faiss`, `index_data/bm25.pkl`, `index_data/image.faiss`, plus `*_meta.json`.
## Query (CLI)
The CLI is a thin HTTP client for `backend/server.py`, so the model weights stay
loaded in a single process. Start the backend in another terminal first:
```bash
indexing/.env/bin/python backend/server.py
```
Then:
```bash
python query.py "your query" # all modalities
python query.py "your query" 10 # top_k = 10
python query.py "your query" -m text # semantic-text + bm25
python query.py "your query" -m semantic-text # dense text only
python query.py "your query" -m bm25 # lexical text only
python query.py "your query" -m image # images only
python query.py "your query" -m text,image # everything (same as default)
```
Set `RAGSTUDIO_URL` to point at a non-default backend
(default `http://127.0.0.1:8000`).
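As a rough illustration of the override, a client could resolve the backend URL along these lines. This is a sketch, not the code in `query.py`: the helper name `backend_url` and the trailing-slash handling are assumptions; only the variable name and the default come from this README.

```python
import os

# Default taken from this README; everything else here is illustrative.
DEFAULT_URL = "http://127.0.0.1:8000"


def backend_url(environ=os.environ) -> str:
    """Return the backend base URL, preferring RAGSTUDIO_URL when it is set."""
    url = environ.get("RAGSTUDIO_URL", "").strip()
    # Fall back to the default when the variable is unset or empty, and drop
    # any trailing slash so request paths can be appended consistently.
    return (url or DEFAULT_URL).rstrip("/")
```

Passing the environment as a parameter keeps the helper easy to check without touching the real process environment.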
## Query (Python, in-process)
The searchers can still be imported directly if you don't want to run the
backend, but each fresh Python process reloads the model weights:
```python
from searchers import SEARCHERS
SEARCHERS["semantic-text"]("your query", top_k=5) # -> [(score, path), ...]
SEARCHERS["bm25"]("your query", top_k=5)
SEARCHERS["image"]("your query", top_k=5)
```
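The cost described above is per process, not per query. The sketch below illustrates that idea with a stand-in loader cached via `functools.lru_cache`; `get_model`, `search`, and the load counter are hypothetical, and whether the real searchers cache their models this way is an assumption.

```python
from functools import lru_cache

LOAD_COUNT = 0  # how many times the stand-in "weights" were loaded


@lru_cache(maxsize=None)
def get_model(name: str) -> dict:
    """Hypothetical loader standing in for an expensive model load."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return {"name": name}  # placeholder object, not a real model


def search(query: str, top_k: int = 5) -> list[tuple[float, str]]:
    """Toy search that only demonstrates reuse of the cached model."""
    model = get_model("text-encoder")  # loaded on the first call only
    return [(0.0, f"{model['name']}:{query}")][:top_k]


search("first query")
search("second query")
# LOAD_COUNT is 1 here: the second call reused the cached object.
```

Running several queries from one long-lived interpreter or script therefore pays the load once, which is the same reason the CLI delegates to a persistent backend.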
## Add a modality
1. Create `searchers/<name>.py` exposing `search_<name>(query: str, top_k: int) -> list[tuple[float, str]]`.
2. Register it in `searchers/__init__.py`:
```python
from .audio import search_audio
SEARCHERS["audio"] = search_audio
```
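For step 1, a searcher module only needs to honour the signature above. The following is a minimal sketch of what `searchers/audio.py` might contain, assuming a placeholder in-memory list of paths and a toy filename-overlap score; a real searcher would load its own index and use a proper audio model.

```python
from pathlib import Path

# Placeholder corpus (hypothetical paths); a real searcher would read an index.
_AUDIO_PATHS = [
    "clips/rain_on_roof.wav",
    "clips/city_traffic.wav",
    "clips/rain_forest.wav",
]


def search_audio(query: str, top_k: int) -> list[tuple[float, str]]:
    """Return up to top_k (score, path) pairs, best score first."""
    terms = set(query.lower().split())
    if not terms:
        return []
    scored = []
    for path in _AUDIO_PATHS:
        words = set(Path(path).stem.lower().split("_"))
        # Toy score: fraction of query terms that appear in the file name.
        score = len(terms & words) / len(terms)
        if score > 0:
            scored.append((score, path))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]
```

The return shape matches the `[(score, path), ...]` form used by the existing searchers, so the registration step can stay a one-line assignment.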
To group several modalities under one alias, add to `GROUPS` in the same file:
```python
GROUPS["av"] = ("audio", "image")
```
Both `SEARCHERS` keys and `GROUPS` keys work as CLI `-m` values.
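One way such `-m` values could be expanded into concrete searcher names is sketched below. The function `resolve_modalities` is hypothetical and the dictionaries are stand-ins for the real `SEARCHERS` and `GROUPS`; the actual CLI may resolve values differently.

```python
# Stand-ins for the registries in searchers/__init__.py (values omitted).
SEARCHERS = {"semantic-text": None, "bm25": None, "image": None}
GROUPS = {"text": ("semantic-text", "bm25")}


def resolve_modalities(spec: str) -> list[str]:
    """Expand a comma-separated -m value into unique searcher names."""
    names: list[str] = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        if token in GROUPS:
            members = GROUPS[token]  # alias: expand to its members
        elif token in SEARCHERS:
            members = (token,)  # plain modality name
        else:
            raise ValueError(f"unknown modality: {token!r}")
        for member in members:
            if member not in names:  # keep first occurrence, drop repeats
                names.append(member)
    return names


print(resolve_modalities("text,image"))
```

De-duplicating while preserving order means a value such as `bm25,text` runs each searcher once, in the order the user first named it.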