---
title: Agentic RAG (Agentic DP + AIMA + MCP)
author: O.O
sdk: streamlit
app_file: app.py
---
# Agentic RAG (FAISS + SentenceTransformers + Hugging Face LLM)
A Streamlit UI that answers questions over a local RAG corpus and includes a retrieval-only baseline for comparison.
It indexes chunk files with FAISS and retrieves across:
- Agentic Design Patterns (doc_id: `agentic_design_patterns`)
- AIMA (doc_id: `aima`)
- MCP markdowns (doc_id prefix: `mcp::`)
- Articles (doc_id prefix: `article::`)
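The `doc_id` conventions above determine which corpus a chunk belongs to. A minimal sketch of that routing, where the function name and return labels are illustrative rather than the app's actual API:

```python
# Route a chunk to its corpus by the doc_id conventions listed above.
# The ids and prefixes come from this README; everything else is a sketch.
def corpus_for(doc_id: str) -> str:
    if doc_id.startswith("mcp::"):
        return "mcp"
    if doc_id.startswith("article::"):
        return "articles"
    if doc_id in ("agentic_design_patterns", "aima"):
        return "books"
    return "unknown"
```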
## Quick start (local)
### 0) Prerequisites
- Python 3.11+
- A Hugging Face access token (for the hosted LLM)
- Network access for article ingestion and MCP refresh scripts
### 1) Create venv and install deps
macOS:
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
Windows (PowerShell):
```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
```
Windows (CMD):
```bat
python -m venv .venv
.\.venv\Scripts\activate.bat
pip install -r requirements.txt
```
### 2) Configure Hugging Face model access
Set these environment variables (local dev or Hugging Face Spaces secrets):
```bash
export HF_TOKEN=hf_your_token_here
export RAG_HF_MODEL=Qwen/Qwen2.5-7B-Instruct-1M:featherless-ai
export RAG_HF_PROVIDER_SUFFIX=featherless-ai
export RAG_LLM_BACKEND=hf-router
```
`RAG_HF_PROVIDER_SUFFIX` is optional: set it only when your model id does not already include the provider suffix (the example model id above already ends in `:featherless-ai`).
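One way the suffix resolution described above could work, sketched here as an assumption (the variable names match this README, but the actual logic in `app.py` may differ):

```python
import os

# Append the provider suffix to the model id unless it is already present.
# Illustrative sketch only; not the app's actual resolution code.
def resolve_model_id() -> str:
    model = os.environ.get("RAG_HF_MODEL", "")
    suffix = os.environ.get("RAG_HF_PROVIDER_SUFFIX", "")
    if suffix and not model.endswith(f":{suffix}"):
        model = f"{model}:{suffix}"
    return model
```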
### 3) Prepare sources
- Books: drop PDFs into `data/raw_pdfs/` and add entries to `sources.json`
- Articles: edit `sources_articles.json` (list of `{id,type,url,publisher}`)
- MCP docs (optional): `bash scripts/refresh_mcp.sh` (downloads the latest snapshot)
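For reference, a `sources_articles.json` entry using the `{id,type,url,publisher}` fields named above. The field names come from this README; the values (and whether `type` is e.g. `"html"`) are placeholders, not a documented schema:

```json
[
  {
    "id": "example-post",
    "type": "html",
    "url": "https://example.com/post",
    "publisher": "Example Blog"
  }
]
```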
### 4) Build datasets
Recommended one-command rebuild:
```bash
make rebuild
```
Outputs to `data/normalized/`:
- `chunks_books.jsonl` + `manifest_books.json`
- `chunks_articles.jsonl` + `manifest_articles.json`
- `chunks.jsonl` + `manifest.json` (merged)
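A quick sanity check on the outputs is counting chunks in the merged file. This assumes only the JSONL convention (one JSON object per non-blank line); the per-chunk schema is not specified here:

```python
# Count non-blank lines in a JSONL chunk file, e.g. after `make rebuild`.
def count_chunks(path: str = "data/normalized/chunks.jsonl") -> int:
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())
```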
### 5) Run the app
```bash
streamlit run app.py
```
Open `http://localhost:8501`. On first run, the app builds FAISS indexes:
- `data/cache/index_books.faiss` (local)
- `data/cache/index_articles.faiss` (local)
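Under the hood, a FAISS inner-product index over L2-normalized SentenceTransformers embeddings is cosine-similarity search. This NumPy sketch shows the equivalent math (the app itself uses FAISS; exact parameters in `app.py` may differ):

```python
import numpy as np

# Cosine-similarity top-k: normalize query and documents, then rank by
# inner product. FAISS IndexFlatIP over unit vectors computes the same thing.
def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores)[:k]
```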
## Configuration
You can override defaults via environment variables:
```bash
export RAG_BOOK_CHUNKS_PATH=data/normalized/chunks_books.jsonl
export RAG_ARTICLE_CHUNKS_PATH=data/normalized/chunks_articles.jsonl
export RAG_BOOK_INDEX_PATH=data/cache/index_books.faiss
export RAG_ARTICLE_INDEX_PATH=data/cache/index_articles.faiss
export RAG_BOOK_MANIFEST_PATH=data/normalized/manifest_books.json
export RAG_ARTICLE_MANIFEST_PATH=data/normalized/manifest_articles.json
export RAG_EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2
export HF_TOKEN=hf_your_token_here
export RAG_HF_MODEL=Qwen/Qwen2.5-7B-Instruct-1M:featherless-ai
export RAG_HF_PROVIDER_SUFFIX=featherless-ai
export RAG_LLM_BACKEND=hf-router
export RAG_MAX_CONTEXT_TOKENS=6000
export RAG_INJECT_MAX_CHUNKS=6
export RAG_MAX_GENERATION_TOKENS=512
export RAG_RETRIEVE_TOPK_MULT=2
export RAG_OUT_DIR=data/normalized
export RAG_ARTICLE_SOURCES=sources_articles.json
```
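The numeric variables above (context budget, chunk count, generation tokens) arrive as strings from the environment. A common pattern for reading them, shown as an illustration rather than the app's actual code:

```python
import os

# Read an integer setting from the environment, falling back to a default.
# Illustrative helper; how app.py actually parses its settings is an assumption.
def env_int(name: str, default: int) -> int:
    return int(os.getenv(name, str(default)))
```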
## Deploy to Hugging Face Spaces
1. Create a new Space (Streamlit SDK) and push this repo.
2. Enable Persistent Storage and set caches:
- `HF_HOME=/data/.huggingface`
- `SENTENCE_TRANSFORMERS_HOME=/data/.sentence-transformers`
3. In Space Settings → Secrets, set `HF_TOKEN` (required) and optionally `GITHUB_TOKEN`.
4. In Space Settings → Variables, set `RAG_HF_MODEL` and `RAG_LLM_BACKEND=hf-router`.
5. Optional: `RAG_HF_PROVIDER_SUFFIX`, `RAG_INJECT_MAX_CHUNKS`, and `RAG_RETRIEVE_TOPK_MULT`.
With persistent storage enabled, FAISS indexes are stored in `/data/rag_cache` and reused across restarts. They rebuild only when the normalized chunk/manifest files change.
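The "rebuild only when inputs change" behavior described above can be sketched as a staleness check. The `/data/rag_cache` path comes from this section; comparing file modification times is an assumption about how the check works (the app might instead hash the manifest):

```python
import os

# An index is stale if it does not exist or its inputs are newer than it.
# Sketch only; the app's actual change-detection logic may differ.
def index_is_stale(index_path: str, chunks_path: str) -> bool:
    if not os.path.exists(index_path):
        return True
    return os.path.getmtime(chunks_path) > os.path.getmtime(index_path)
```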
## Common maintenance tasks
### Add new books (PDFs)
1. Add PDFs to `data/raw_pdfs/`
2. Update `sources.json`
3. Run `make rebuild`
4. (Optional) `make clean-index`
5. `streamlit run app.py`
### Add new articles
1. Update `sources_articles.json`
2. Run `make rebuild`
3. (Optional) `make clean-index`
4. `streamlit run app.py`
### Rebuild indexes only
```bash
make clean-index
```
This deletes the cached FAISS index files; they are rebuilt automatically the next time the app starts.
## Scripts and commands reference
- `app.py` - Streamlit UI; loads chunk files and builds/loads FAISS indexes.
- `scripts/normalize_all.py` - Parse PDFs and MCP markdowns into `chunks_books.jsonl` and `manifest_books.json`.
- `scripts/ingest_articles.py` - Fetch URLs from `sources_articles.json` and write `chunks_articles.jsonl` and `manifest_articles.json` plus `articles_ingest_report.json`.
- `scripts/merge_chunks.py` - Merge multiple chunk files and manifests; emits `chunks.jsonl`, `manifest.json`, and `merge_report.json`.
- `scripts/rebuild_all.sh` - Run normalize, ingest, and merge in order (same as `make rebuild`).
- `scripts/refresh_mcp.sh` - Download `llms-full.txt` and regenerate MCP markdowns in `mcp/`.
- `scripts/split_mcp.py` - Split a single MCP snapshot text file into topic markdown files.
- `refresh_mcp.sh` - Convenience wrapper for `scripts/refresh_mcp.sh`.
- `normalize_all.py`, `ingest_articles.py`, `merge_chunks.py`, `split_mcp.py` - Convenience wrappers for the `scripts/` versions.
- `Makefile` - `make install`, `make rebuild`, `make clean-index`, `make run`.
- `build_kb.py` - Legacy entry point referencing a removed `src/` package; not used by the current app.
## License
Apache License 2.0. See `LICENSE`.
## Troubleshooting
- If you see `No chunks loaded`, ensure `data/normalized/*.jsonl` exists and has content.
- If the Hugging Face request fails, verify `HF_TOKEN` is set and the model name or endpoint is correct.
- If article ingestion skips sources, check `data/normalized/articles_ingest_report.json`.
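A quick way to inspect that ingest report is to load it as JSON. Its internal structure is not documented here, so this just loads whatever is in the file:

```python
import json

# Load the ingest report written by scripts/ingest_articles.py.
def load_report(path: str = "data/normalized/articles_ingest_report.json") -> dict:
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```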