---
title: Agentic RAG (Agentic DP + AIMA + MCP)
author: O.O
sdk: streamlit
app_file: app.py
---

Agentic RAG (FAISS + SentenceTransformers + Hugging Face LLM)

A Streamlit UI that answers questions over a local RAG corpus, with a retrieval-only baseline for comparison. It indexes chunk files with FAISS and retrieves across:

  • Agentic Design Patterns (doc_id: agentic_design_patterns)
  • AIMA (doc_id: aima)
  • MCP markdowns (doc_id prefix: mcp::)
  • Articles (doc_id prefix: article::)
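The doc_id prefixes above are how chunks are routed to their corpus. A minimal sketch of that convention (the record fields shown here are assumptions for illustration, not the app's actual chunk schema):

```python
# Hypothetical chunk records in the style of data/normalized/chunks.jsonl;
# the exact field names are assumptions, not the app's real schema.
chunks = [
    {"doc_id": "agentic_design_patterns", "text": "..."},
    {"doc_id": "aima", "text": "..."},
    {"doc_id": "mcp::tools", "text": "..."},
    {"doc_id": "article::rag-survey", "text": "..."},
]

def source_of(doc_id: str) -> str:
    """Map a doc_id to its corpus bucket via the prefix convention above."""
    if doc_id.startswith("mcp::"):
        return "mcp"
    if doc_id.startswith("article::"):
        return "article"
    return "book"

print([source_of(c["doc_id"]) for c in chunks])
# ['book', 'book', 'mcp', 'article']
```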

Quick start (local)

0) Prerequisites

  • Python 3.11+
  • A Hugging Face access token (for the hosted LLM)
  • Network access for article ingestion and MCP refresh scripts

1) Create venv and install deps

macOS:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Windows (PowerShell):

python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt

Windows (CMD):

python -m venv .venv
.\.venv\Scripts\activate.bat
pip install -r requirements.txt

2) Configure Hugging Face model access

Set these environment variables (local dev or Hugging Face Spaces secrets):

export HF_TOKEN=hf_your_token_here
export RAG_HF_MODEL=Qwen/Qwen2.5-7B-Instruct-1M:featherless-ai
export RAG_HF_PROVIDER_SUFFIX=featherless-ai
export RAG_LLM_BACKEND=hf-router

RAG_HF_PROVIDER_SUFFIX is optional: set it only if your model id does not already include the provider suffix (the example above already carries :featherless-ai).
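The intent of the suffix variable can be sketched as follows; this is an illustrative helper, not app.py's actual code:

```python
def resolve_model_id(model: str, suffix: str) -> str:
    """Append a provider suffix (e.g. 'featherless-ai') only when the
    model id does not already carry one. Illustrative sketch of what
    RAG_HF_PROVIDER_SUFFIX is for; app.py's exact logic may differ."""
    if suffix and ":" not in model:
        return f"{model}:{suffix}"
    return model

print(resolve_model_id("Qwen/Qwen2.5-7B-Instruct-1M", "featherless-ai"))
# Qwen/Qwen2.5-7B-Instruct-1M:featherless-ai
```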

3) Prepare sources

  • Books: drop PDFs into data/raw_pdfs/ and add entries to sources.json
  • Articles: edit sources_articles.json (list of {id,type,url,publisher})
  • MCP docs (optional): bash scripts/refresh_mcp.sh (downloads the latest snapshot)
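An entry in sources_articles.json has the keys {id, type, url, publisher} listed above. A quick sanity check like this can catch malformed entries before running a rebuild (the entry values are illustrative, not real sources):

```python
# Hypothetical sources_articles.json entry; the keys come from this README,
# the values are placeholders.
sources = [
    {
        "id": "rag-survey",
        "type": "article",
        "url": "https://example.com/rag-survey",
        "publisher": "Example Blog",
    }
]

REQUIRED = {"id", "type", "url", "publisher"}

def validate(entries):
    """Fail fast on entries missing required keys before `make rebuild`."""
    for e in entries:
        missing = REQUIRED - e.keys()
        if missing:
            raise ValueError(f"{e.get('id', '?')}: missing keys {sorted(missing)}")
    return True

print(validate(sources))
```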

4) Build datasets

Recommended one-command rebuild:

make rebuild

Outputs to data/normalized/:

  • chunks_books.jsonl + manifest_books.json
  • chunks_articles.jsonl + manifest_articles.json
  • chunks.jsonl + manifest.json (merged)
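After a rebuild, it can be useful to spot-check the merged output. A small sketch counting chunks per doc_id, shown on an in-memory sample (point `open()` at data/normalized/chunks.jsonl for the real file; the `doc_id` field name is assumed):

```python
import json
from collections import Counter
from io import StringIO

# Stand-in for data/normalized/chunks.jsonl; one JSON object per line.
sample = StringIO(
    '{"doc_id": "aima", "text": "a"}\n'
    '{"doc_id": "aima", "text": "b"}\n'
    '{"doc_id": "mcp::tools", "text": "c"}\n'
)

counts = Counter(json.loads(line)["doc_id"] for line in sample)
print(dict(counts))  # {'aima': 2, 'mcp::tools': 1}
```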

5) Run the app

streamlit run app.py

Open http://localhost:8501. On first run, the app builds FAISS indexes:

  • data/cache/index_books.faiss (local)
  • data/cache/index_articles.faiss (local)
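The index build pairs SentenceTransformers embeddings with FAISS inner-product search. A NumPy stand-in for what the index does at query time (random vectors in place of real embeddings; the real app persists a FAISS index to the paths above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for SentenceTransformer embeddings: 5 chunks, 8-dim vectors,
# L2-normalized so inner product equals cosine similarity (what a
# faiss.IndexFlatIP over normalized vectors computes).
emb = rng.normal(size=(5, 8)).astype("float32")
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# A query close to chunk 2 should retrieve chunk 2 first.
query = emb[2] + 0.01 * rng.normal(size=8).astype("float32")
query /= np.linalg.norm(query)

scores = emb @ query
top_k = np.argsort(-scores)[:3]
print(int(top_k[0]))  # 2
```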

Configuration

You can override defaults via environment variables:

export RAG_BOOK_CHUNKS_PATH=data/normalized/chunks_books.jsonl
export RAG_ARTICLE_CHUNKS_PATH=data/normalized/chunks_articles.jsonl
export RAG_BOOK_INDEX_PATH=data/cache/index_books.faiss
export RAG_ARTICLE_INDEX_PATH=data/cache/index_articles.faiss
export RAG_BOOK_MANIFEST_PATH=data/normalized/manifest_books.json
export RAG_ARTICLE_MANIFEST_PATH=data/normalized/manifest_articles.json
export RAG_EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2
export HF_TOKEN=hf_your_token_here
export RAG_HF_MODEL=Qwen/Qwen2.5-7B-Instruct-1M:featherless-ai
export RAG_HF_PROVIDER_SUFFIX=featherless-ai
export RAG_LLM_BACKEND=hf-router
export RAG_MAX_CONTEXT_TOKENS=6000
export RAG_INJECT_MAX_CHUNKS=6
export RAG_MAX_GENERATION_TOKENS=512
export RAG_RETRIEVE_TOPK_MULT=2
export RAG_OUT_DIR=data/normalized
export RAG_ARTICLE_SOURCES=sources_articles.json
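Such overrides are typically consumed via os.environ lookups with the defaults above. The helper below is illustrative, not app.py's actual code; the variable names and defaults are taken from the list above:

```python
import os

def cfg(name: str, default: str) -> str:
    """Environment override with a README default as fallback."""
    return os.environ.get(name, default)

embed_model = cfg("RAG_EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
max_ctx = int(cfg("RAG_MAX_CONTEXT_TOKENS", "6000"))
print(embed_model, max_ctx)
```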

Deploy to Hugging Face Spaces

  1. Create a new Space (Streamlit SDK) and push this repo.
  2. Enable Persistent Storage and set caches:
    • HF_HOME=/data/.huggingface
    • SENTENCE_TRANSFORMERS_HOME=/data/.sentence-transformers
  3. In Space Settings → Secrets, set HF_TOKEN (required) and optionally GITHUB_TOKEN.
  4. In Space Settings → Variables, set RAG_HF_MODEL and RAG_LLM_BACKEND=hf-router.
  5. Optional: RAG_HF_PROVIDER_SUFFIX, RAG_INJECT_MAX_CHUNKS, and RAG_RETRIEVE_TOPK_MULT.

With persistent storage enabled, FAISS indexes are stored in /data/rag_cache and reused across restarts. They rebuild only when the normalized chunk/manifest files change.
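One common way to implement "rebuild only when the inputs change" is to fingerprint the chunk/manifest files and compare against a digest saved next to the index. A stdlib sketch of that idea (not app.py's exact mechanism):

```python
import hashlib
from pathlib import Path

def fingerprint(*paths: Path) -> str:
    """Hash the normalized chunk/manifest files; if the digest matches the
    one stored beside the cached index (e.g. under /data/rag_cache), the
    FAISS index can be reused instead of rebuilt."""
    h = hashlib.sha256()
    for p in paths:
        h.update(p.read_bytes())
    return h.hexdigest()

# Usage sketch, with paths from the Configuration section:
# fp = fingerprint(Path("data/normalized/chunks_books.jsonl"),
#                  Path("data/normalized/manifest_books.json"))
# rebuild_needed = fp != stored_fingerprint
```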

Common maintenance tasks

Add new books (PDFs)

  1. Add PDFs to data/raw_pdfs/
  2. Update sources.json
  3. Run make rebuild
  4. (Optional) make clean-index to force the indexes to rebuild
  5. streamlit run app.py

Add new articles

  1. Update sources_articles.json
  2. Run make rebuild
  3. (Optional) make clean-index to force the indexes to rebuild
  4. streamlit run app.py

Rebuild indexes only

make clean-index

Scripts and commands reference

  • app.py - Streamlit UI; loads chunk files and builds/loads FAISS indexes.
  • scripts/normalize_all.py - Parse PDFs and MCP markdowns into chunks_books.jsonl and manifest_books.json.
  • scripts/ingest_articles.py - Fetch URLs from sources_articles.json and write chunks_articles.jsonl and manifest_articles.json plus articles_ingest_report.json.
  • scripts/merge_chunks.py - Merge multiple chunk files and manifests; emits chunks.jsonl, manifest.json, and merge_report.json.
  • scripts/rebuild_all.sh - Run normalize, ingest, and merge in order (same as make rebuild).
  • scripts/refresh_mcp.sh - Download llms-full.txt and regenerate MCP markdowns in mcp/.
  • scripts/split_mcp.py - Split a single MCP snapshot text file into topic markdown files.
  • refresh_mcp.sh - Convenience wrapper for scripts/refresh_mcp.sh.
  • normalize_all.py, ingest_articles.py, merge_chunks.py, split_mcp.py - Convenience wrappers for the scripts/ versions.
  • Makefile - make install, make rebuild, make clean-index, make run.
  • build_kb.py - Legacy entry point referencing a removed src/ package; not used by the current app.

License

Apache License 2.0. See LICENSE.

Troubleshooting

  • If you see "No chunks loaded", ensure data/normalized/*.jsonl exists and is non-empty.
  • If the Hugging Face request fails, verify HF_TOKEN is set and the model name or endpoint is correct.
  • If article ingestion skips sources, check data/normalized/articles_ingest_report.json.