---
title: Agentic RAG (Agentic DP + AIMA + MCP)
author: O.O
sdk: streamlit
app_file: app.py
---
# Agentic RAG (FAISS + SentenceTransformers + Hugging Face LLM)
A Streamlit UI that answers questions over a local RAG corpus and includes a retrieval-only baseline for comparison.
It indexes chunk files with FAISS and retrieves across:
- Agentic Design Patterns (doc_id: `agentic_design_patterns`)
- AIMA (doc_id: `aima`)
- MCP markdowns (doc_id prefix: `mcp::`)
- Articles (doc_id prefix: `article::`)
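The `doc_id` conventions above determine which corpus a chunk belongs to. A minimal sketch of that routing, where the function name and return labels are illustrative rather than the app's actual API:

```python
# Route a chunk to its corpus by the doc_id conventions listed above.
# The ids and prefixes come from this README; everything else is a sketch.
def corpus_for(doc_id: str) -> str:
    if doc_id.startswith("mcp::"):
        return "mcp"
    if doc_id.startswith("article::"):
        return "articles"
    if doc_id in ("agentic_design_patterns", "aima"):
        return "books"
    return "unknown"
```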
## Quick start (local)
### 0) Prerequisites
- Python 3.11+
- A Hugging Face access token (for the hosted LLM)
- Network access for article ingestion and MCP refresh scripts
### 1) Create venv and install deps
macOS:
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
Windows (PowerShell):
```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
```
Windows (CMD):
```bat
python -m venv .venv
.\.venv\Scripts\activate.bat
pip install -r requirements.txt
```
### 2) Configure Hugging Face model access
Set these environment variables (local dev or Hugging Face Spaces secrets):
```bash
export HF_TOKEN=hf_your_token_here
export RAG_HF_MODEL=Qwen/Qwen2.5-7B-Instruct-1M:featherless-ai
export RAG_HF_PROVIDER_SUFFIX=featherless-ai
export RAG_LLM_BACKEND=hf-router
```
`RAG_HF_PROVIDER_SUFFIX` is optional: set it only when your model id does not already include the provider suffix (the example model id above already ends in `:featherless-ai`).
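One way the suffix resolution described above could work, sketched here as an assumption (the variable names match this README, but the actual logic in `app.py` may differ):

```python
import os

# Append the provider suffix to the model id unless it is already present.
# Illustrative sketch only; not the app's actual resolution code.
def resolve_model_id() -> str:
    model = os.environ.get("RAG_HF_MODEL", "")
    suffix = os.environ.get("RAG_HF_PROVIDER_SUFFIX", "")
    if suffix and not model.endswith(f":{suffix}"):
        model = f"{model}:{suffix}"
    return model
```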
### 3) Prepare sources
- Books: drop PDFs into `data/raw_pdfs/` and add entries to `sources.json`
- Articles: edit `sources_articles.json` (list of `{id,type,url,publisher}`)
- MCP docs (optional): `bash scripts/refresh_mcp.sh` (downloads the latest snapshot)
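For reference, a `sources_articles.json` entry using the `{id,type,url,publisher}` fields named above. The field names come from this README; the values (and whether `type` is e.g. `"html"`) are placeholders, not a documented schema:

```json
[
  {
    "id": "example-post",
    "type": "html",
    "url": "https://example.com/post",
    "publisher": "Example Blog"
  }
]
```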
### 4) Build datasets
Recommended one-command rebuild:
```bash
make rebuild
```
Outputs to `data/normalized/`:
- `chunks_books.jsonl` + `manifest_books.json`
- `chunks_articles.jsonl` + `manifest_articles.json`
- `chunks.jsonl` + `manifest.json` (merged)
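A quick sanity check on the outputs is counting chunks in the merged file. This assumes only the JSONL convention (one JSON object per non-blank line); the per-chunk schema is not specified here:

```python
# Count non-blank lines in a JSONL chunk file, e.g. after `make rebuild`.
def count_chunks(path: str = "data/normalized/chunks.jsonl") -> int:
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())
```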
### 5) Run the app
```bash
streamlit run app.py
```
Open `http://localhost:8501`. On first run, the app builds FAISS indexes:
- `data/cache/index_books.faiss` (local)
- `data/cache/index_articles.faiss` (local)
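Under the hood, a FAISS inner-product index over L2-normalized SentenceTransformers embeddings is cosine-similarity search. This NumPy sketch shows the equivalent math (the app itself uses FAISS; exact parameters in `app.py` may differ):

```python
import numpy as np

# Cosine-similarity top-k: normalize query and documents, then rank by
# inner product. FAISS IndexFlatIP over unit vectors computes the same thing.
def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores)[:k]
```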
## Configuration
You can override defaults via environment variables:
```bash
export RAG_BOOK_CHUNKS_PATH=data/normalized/chunks_books.jsonl
export RAG_ARTICLE_CHUNKS_PATH=data/normalized/chunks_articles.jsonl
export RAG_BOOK_INDEX_PATH=data/cache/index_books.faiss
export RAG_ARTICLE_INDEX_PATH=data/cache/index_articles.faiss
export RAG_BOOK_MANIFEST_PATH=data/normalized/manifest_books.json
export RAG_ARTICLE_MANIFEST_PATH=data/normalized/manifest_articles.json
export RAG_EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2
export HF_TOKEN=hf_your_token_here
export RAG_HF_MODEL=Qwen/Qwen2.5-7B-Instruct-1M:featherless-ai
export RAG_HF_PROVIDER_SUFFIX=featherless-ai
export RAG_LLM_BACKEND=hf-router
export RAG_MAX_CONTEXT_TOKENS=6000
export RAG_INJECT_MAX_CHUNKS=6
export RAG_MAX_GENERATION_TOKENS=512
export RAG_RETRIEVE_TOPK_MULT=2
export RAG_OUT_DIR=data/normalized
export RAG_ARTICLE_SOURCES=sources_articles.json
```
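The numeric variables above (context budget, chunk count, generation tokens) arrive as strings from the environment. A common pattern for reading them, shown as an illustration rather than the app's actual code:

```python
import os

# Read an integer setting from the environment, falling back to a default.
# Illustrative helper; how app.py actually parses its settings is an assumption.
def env_int(name: str, default: int) -> int:
    return int(os.getenv(name, str(default)))
```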
## Deploy to Hugging Face Spaces
1. Create a new Space (Streamlit SDK) and push this repo.
2. Enable Persistent Storage and set caches:
- `HF_HOME=/data/.huggingface`
- `SENTENCE_TRANSFORMERS_HOME=/data/.sentence-transformers`
3. In Space Settings → Secrets, set `HF_TOKEN` (required) and optionally `GITHUB_TOKEN`.
4. In Space Settings → Variables, set `RAG_HF_MODEL` and `RAG_LLM_BACKEND=hf-router`.
5. Optional: `RAG_HF_PROVIDER_SUFFIX`, `RAG_INJECT_MAX_CHUNKS`, and `RAG_RETRIEVE_TOPK_MULT`.
With persistent storage enabled, FAISS indexes are stored in `/data/rag_cache` and reused across restarts. They rebuild only when the normalized chunk/manifest files change.
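The "rebuild only when inputs change" behavior described above can be sketched as a staleness check. The `/data/rag_cache` path comes from this section; comparing file modification times is an assumption about how the check works (the app might instead hash the manifest):

```python
import os

# An index is stale if it does not exist or its inputs are newer than it.
# Sketch only; the app's actual change-detection logic may differ.
def index_is_stale(index_path: str, chunks_path: str) -> bool:
    if not os.path.exists(index_path):
        return True
    return os.path.getmtime(chunks_path) > os.path.getmtime(index_path)
```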
## Common maintenance tasks
### Add new books (PDFs)
1. Add PDFs to `data/raw_pdfs/`
2. Update `sources.json`
3. Run `make rebuild`
4. (Optional) `make clean-index`
5. `streamlit run app.py`
### Add new articles
1. Update `sources_articles.json`
2. Run `make rebuild`
3. (Optional) `make clean-index`
4. `streamlit run app.py`
### Rebuild indexes only
```bash
make clean-index
```
This deletes the cached FAISS index files; they are rebuilt automatically the next time the app starts.
## Scripts and commands reference
- `app.py` - Streamlit UI; loads chunk files and builds/loads FAISS indexes.
- `scripts/normalize_all.py` - Parse PDFs and MCP markdowns into `chunks_books.jsonl` and `manifest_books.json`.
- `scripts/ingest_articles.py` - Fetch URLs from `sources_articles.json` and write `chunks_articles.jsonl` and `manifest_articles.json` plus `articles_ingest_report.json`.
- `scripts/merge_chunks.py` - Merge multiple chunk files and manifests; emits `chunks.jsonl`, `manifest.json`, and `merge_report.json`.
- `scripts/rebuild_all.sh` - Run normalize, ingest, and merge in order (same as `make rebuild`).
- `scripts/refresh_mcp.sh` - Download `llms-full.txt` and regenerate MCP markdowns in `mcp/`.
- `scripts/split_mcp.py` - Split a single MCP snapshot text file into topic markdown files.
- `refresh_mcp.sh` - Convenience wrapper for `scripts/refresh_mcp.sh`.
- `normalize_all.py`, `ingest_articles.py`, `merge_chunks.py`, `split_mcp.py` - Convenience wrappers for the `scripts/` versions.
- `Makefile` - `make install`, `make rebuild`, `make clean-index`, `make run`.
- `build_kb.py` - Legacy entry point referencing a removed `src/` package; not used by the current app.
## License
Apache License 2.0. See `LICENSE`.
## Troubleshooting
- If you see `No chunks loaded`, ensure `data/normalized/*.jsonl` exists and has content.
- If the Hugging Face request fails, verify `HF_TOKEN` is set and the model name or endpoint is correct.
- If article ingestion skips sources, check `data/normalized/articles_ingest_report.json`.
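A quick way to inspect that ingest report is to load it as JSON. Its internal structure is not documented here, so this just loads whatever is in the file:

```python
import json

# Load the ingest report written by scripts/ingest_articles.py.
def load_report(path: str = "data/normalized/articles_ingest_report.json") -> dict:
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```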