---
title: Agentic RAG (Agentic DP + AIMA + MCP)
author: O.O
sdk: streamlit
app_file: app.py
---

Agentic RAG (FAISS + SentenceTransformers + Hugging Face LLM)

A Streamlit UI that answers questions over a local RAG corpus, with a retrieval-only baseline for comparison. It indexes chunk files with FAISS and retrieves across:

  • Agentic Design Patterns (doc_id: agentic_design_patterns)
  • AIMA (doc_id: aima)
  • MCP markdowns (doc_id prefix: mcp::)
  • Articles (doc_id prefix: article::)
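The doc_id prefixes above are how chunks are routed to their corpus. A minimal sketch of that convention (the record fields shown here are assumptions for illustration, not the app's actual chunk schema):

```python
# Hypothetical chunk records in the style of data/normalized/chunks.jsonl;
# the exact field names are assumptions, not the app's real schema.
chunks = [
    {"doc_id": "agentic_design_patterns", "text": "..."},
    {"doc_id": "aima", "text": "..."},
    {"doc_id": "mcp::tools", "text": "..."},
    {"doc_id": "article::rag-survey", "text": "..."},
]

def source_of(doc_id: str) -> str:
    """Map a doc_id to its corpus bucket via the prefix convention above."""
    if doc_id.startswith("mcp::"):
        return "mcp"
    if doc_id.startswith("article::"):
        return "article"
    return "book"

print([source_of(c["doc_id"]) for c in chunks])
# ['book', 'book', 'mcp', 'article']
```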

Quick start (local)

0) Prerequisites

  • Python 3.11+
  • A Hugging Face access token (for the hosted LLM)
  • Network access for article ingestion and MCP refresh scripts

1) Create venv and install deps

macOS:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Windows (PowerShell):

python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt

Windows (CMD):

python -m venv .venv
.\.venv\Scripts\activate.bat
pip install -r requirements.txt

2) Configure Hugging Face model access

Set these environment variables (local dev or Hugging Face Spaces secrets):

export HF_TOKEN=hf_your_token_here
export RAG_HF_MODEL=Qwen/Qwen2.5-7B-Instruct-1M:featherless-ai
export RAG_HF_PROVIDER_SUFFIX=featherless-ai
export RAG_LLM_BACKEND=hf-router

RAG_HF_PROVIDER_SUFFIX is optional: set it only if your model id does not already include the provider suffix (the example above already carries :featherless-ai).
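The intent of the suffix variable can be sketched as follows; this is an illustrative helper, not app.py's actual code:

```python
def resolve_model_id(model: str, suffix: str) -> str:
    """Append a provider suffix (e.g. 'featherless-ai') only when the
    model id does not already carry one. Illustrative sketch of what
    RAG_HF_PROVIDER_SUFFIX is for; app.py's exact logic may differ."""
    if suffix and ":" not in model:
        return f"{model}:{suffix}"
    return model

print(resolve_model_id("Qwen/Qwen2.5-7B-Instruct-1M", "featherless-ai"))
# Qwen/Qwen2.5-7B-Instruct-1M:featherless-ai
```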

3) Prepare sources

  • Books: drop PDFs into data/raw_pdfs/ and add entries to sources.json
  • Articles: edit sources_articles.json (list of {id,type,url,publisher})
  • MCP docs (optional): bash scripts/refresh_mcp.sh (downloads the latest snapshot)
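An entry in sources_articles.json has the keys {id, type, url, publisher} listed above. A quick sanity check like this can catch malformed entries before running a rebuild (the entry values are illustrative, not real sources):

```python
# Hypothetical sources_articles.json entry; the keys come from this README,
# the values are placeholders.
sources = [
    {
        "id": "rag-survey",
        "type": "article",
        "url": "https://example.com/rag-survey",
        "publisher": "Example Blog",
    }
]

REQUIRED = {"id", "type", "url", "publisher"}

def validate(entries):
    """Fail fast on entries missing required keys before `make rebuild`."""
    for e in entries:
        missing = REQUIRED - e.keys()
        if missing:
            raise ValueError(f"{e.get('id', '?')}: missing keys {sorted(missing)}")
    return True

print(validate(sources))
```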

4) Build datasets

Recommended one-command rebuild:

make rebuild

Outputs to data/normalized/:

  • chunks_books.jsonl + manifest_books.json
  • chunks_articles.jsonl + manifest_articles.json
  • chunks.jsonl + manifest.json (merged)
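After a rebuild, it can be useful to spot-check the merged output. A small sketch counting chunks per doc_id, shown on an in-memory sample (point `open()` at data/normalized/chunks.jsonl for the real file; the `doc_id` field name is assumed):

```python
import json
from collections import Counter
from io import StringIO

# Stand-in for data/normalized/chunks.jsonl; one JSON object per line.
sample = StringIO(
    '{"doc_id": "aima", "text": "a"}\n'
    '{"doc_id": "aima", "text": "b"}\n'
    '{"doc_id": "mcp::tools", "text": "c"}\n'
)

counts = Counter(json.loads(line)["doc_id"] for line in sample)
print(dict(counts))  # {'aima': 2, 'mcp::tools': 1}
```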

5) Run the app

streamlit run app.py

Open http://localhost:8501. On first run, the app builds FAISS indexes:

  • data/cache/index_books.faiss (local)
  • data/cache/index_articles.faiss (local)
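The index build pairs SentenceTransformers embeddings with FAISS inner-product search. A NumPy stand-in for what the index does at query time (random vectors in place of real embeddings; the real app persists a FAISS index to the paths above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for SentenceTransformer embeddings: 5 chunks, 8-dim vectors,
# L2-normalized so inner product equals cosine similarity (what a
# faiss.IndexFlatIP over normalized vectors computes).
emb = rng.normal(size=(5, 8)).astype("float32")
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# A query close to chunk 2 should retrieve chunk 2 first.
query = emb[2] + 0.01 * rng.normal(size=8).astype("float32")
query /= np.linalg.norm(query)

scores = emb @ query
top_k = np.argsort(-scores)[:3]
print(int(top_k[0]))  # 2
```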

Configuration

You can override defaults via environment variables:

export RAG_BOOK_CHUNKS_PATH=data/normalized/chunks_books.jsonl
export RAG_ARTICLE_CHUNKS_PATH=data/normalized/chunks_articles.jsonl
export RAG_BOOK_INDEX_PATH=data/cache/index_books.faiss
export RAG_ARTICLE_INDEX_PATH=data/cache/index_articles.faiss
export RAG_BOOK_MANIFEST_PATH=data/normalized/manifest_books.json
export RAG_ARTICLE_MANIFEST_PATH=data/normalized/manifest_articles.json
export RAG_EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2
export HF_TOKEN=hf_your_token_here
export RAG_HF_MODEL=Qwen/Qwen2.5-7B-Instruct-1M:featherless-ai
export RAG_HF_PROVIDER_SUFFIX=featherless-ai
export RAG_LLM_BACKEND=hf-router
export RAG_MAX_CONTEXT_TOKENS=6000
export RAG_INJECT_MAX_CHUNKS=6
export RAG_MAX_GENERATION_TOKENS=512
export RAG_RETRIEVE_TOPK_MULT=2
export RAG_OUT_DIR=data/normalized
export RAG_ARTICLE_SOURCES=sources_articles.json
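Such overrides are typically consumed via os.environ lookups with the defaults above. The helper below is illustrative, not app.py's actual code; the variable names and defaults are taken from the list above:

```python
import os

def cfg(name: str, default: str) -> str:
    """Environment override with a README default as fallback."""
    return os.environ.get(name, default)

embed_model = cfg("RAG_EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
max_ctx = int(cfg("RAG_MAX_CONTEXT_TOKENS", "6000"))
print(embed_model, max_ctx)
```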

Deploy to Hugging Face Spaces

  1. Create a new Space (Streamlit SDK) and push this repo.
  2. Enable Persistent Storage and set caches:
    • HF_HOME=/data/.huggingface
    • SENTENCE_TRANSFORMERS_HOME=/data/.sentence-transformers
  3. In Space Settings → Secrets, set HF_TOKEN (required) and optionally GITHUB_TOKEN.
  4. In Space Settings → Variables, set RAG_HF_MODEL and RAG_LLM_BACKEND=hf-router.
  5. Optional: RAG_HF_PROVIDER_SUFFIX, RAG_INJECT_MAX_CHUNKS, and RAG_RETRIEVE_TOPK_MULT.

With persistent storage enabled, FAISS indexes are stored in /data/rag_cache and reused across restarts. They rebuild only when the normalized chunk/manifest files change.
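One common way to implement "rebuild only when the inputs change" is to fingerprint the chunk/manifest files and compare against a digest saved next to the index. A stdlib sketch of that idea (not app.py's exact mechanism):

```python
import hashlib
from pathlib import Path

def fingerprint(*paths: Path) -> str:
    """Hash the normalized chunk/manifest files; if the digest matches the
    one stored beside the cached index (e.g. under /data/rag_cache), the
    FAISS index can be reused instead of rebuilt."""
    h = hashlib.sha256()
    for p in paths:
        h.update(p.read_bytes())
    return h.hexdigest()

# Usage sketch, with paths from the Configuration section:
# fp = fingerprint(Path("data/normalized/chunks_books.jsonl"),
#                  Path("data/normalized/manifest_books.json"))
# rebuild_needed = fp != stored_fingerprint
```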

Common maintenance tasks

Add new books (PDFs)

  1. Add PDFs to data/raw_pdfs/
  2. Update sources.json
  3. Run make rebuild
  4. (Optional) make clean-index to force the indexes to rebuild
  5. streamlit run app.py

Add new articles

  1. Update sources_articles.json
  2. Run make rebuild
  3. (Optional) make clean-index to force the indexes to rebuild
  4. streamlit run app.py

Rebuild indexes only

make clean-index

Scripts and commands reference

  • app.py - Streamlit UI; loads chunk files and builds/loads FAISS indexes.
  • scripts/normalize_all.py - Parse PDFs and MCP markdowns into chunks_books.jsonl and manifest_books.json.
  • scripts/ingest_articles.py - Fetch URLs from sources_articles.json and write chunks_articles.jsonl and manifest_articles.json plus articles_ingest_report.json.
  • scripts/merge_chunks.py - Merge multiple chunk files and manifests; emits chunks.jsonl, manifest.json, and merge_report.json.
  • scripts/rebuild_all.sh - Run normalize, ingest, and merge in order (same as make rebuild).
  • scripts/refresh_mcp.sh - Download llms-full.txt and regenerate MCP markdowns in mcp/.
  • scripts/split_mcp.py - Split a single MCP snapshot text file into topic markdown files.
  • refresh_mcp.sh - Convenience wrapper for scripts/refresh_mcp.sh.
  • normalize_all.py, ingest_articles.py, merge_chunks.py, split_mcp.py - Convenience wrappers for the scripts/ versions.
  • Makefile - make install, make rebuild, make clean-index, make run.
  • build_kb.py - Legacy entry point referencing a removed src/ package; not used by the current app.

License

Apache License 2.0. See LICENSE.

Troubleshooting

  • If you see "No chunks loaded", ensure data/normalized/*.jsonl exists and is non-empty.
  • If the Hugging Face request fails, verify HF_TOKEN is set and the model name or endpoint is correct.
  • If article ingestion skips sources, check data/normalized/articles_ingest_report.json.