---
title: Agentic RAG (Agentic DP + AIMA + MCP)
author: O.O
sdk: streamlit
app_file: app.py
---

# Agentic RAG (FAISS + SentenceTransformers + Hugging Face LLM)

A Streamlit UI that answers questions over a local RAG corpus with a retrieval-only baseline. It indexes chunk files with FAISS and retrieves across:

- Agentic Design Patterns (doc_id: `agentic_design_patterns`)
- AIMA (doc_id: `aima`)
- MCP markdowns (doc_id prefix: `mcp::`)
- Articles (doc_id prefix: `article::`)

## Quick start (local)

### 0) Prerequisites

- Python 3.11+
- A Hugging Face access token (for the hosted LLM)
- Network access for article ingestion and MCP refresh scripts

### 1) Create a venv and install deps

macOS:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Windows (PowerShell):

```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
```

Windows (CMD):

```bat
python -m venv .venv
.\.venv\Scripts\activate.bat
pip install -r requirements.txt
```

### 2) Configure Hugging Face model access

Set these environment variables (local dev or Hugging Face Spaces secrets):

```bash
export HF_TOKEN=hf_your_token_here
export RAG_HF_MODEL=Qwen/Qwen2.5-7B-Instruct-1M:featherless-ai
export RAG_HF_PROVIDER_SUFFIX=featherless-ai
export RAG_LLM_BACKEND=hf-router
```

Optional: set `RAG_HF_PROVIDER_SUFFIX` if your model id is missing the provider suffix.
### 3) Prepare sources

- Books: drop PDFs into `data/raw_pdfs/` and add entries to `sources.json`
- Articles: edit `sources_articles.json` (list of `{id,type,url,publisher}`)
- MCP docs (optional): `bash scripts/refresh_mcp.sh` (downloads the latest snapshot)

### 4) Build datasets

Recommended one-command rebuild:

```bash
make rebuild
```

Outputs to `data/normalized/`:

- `chunks_books.jsonl` + `manifest_books.json`
- `chunks_articles.jsonl` + `manifest_articles.json`
- `chunks.jsonl` + `manifest.json` (merged)

### 5) Run the app

```bash
streamlit run app.py
```

Open `http://localhost:8501`. On first run, the app builds FAISS indexes:

- `data/cache/index_books.faiss` (local)
- `data/cache/index_articles.faiss` (local)

## Configuration

You can override defaults via environment variables:

```bash
export RAG_BOOK_CHUNKS_PATH=data/normalized/chunks_books.jsonl
export RAG_ARTICLE_CHUNKS_PATH=data/normalized/chunks_articles.jsonl
export RAG_BOOK_INDEX_PATH=data/cache/index_books.faiss
export RAG_ARTICLE_INDEX_PATH=data/cache/index_articles.faiss
export RAG_BOOK_MANIFEST_PATH=data/normalized/manifest_books.json
export RAG_ARTICLE_MANIFEST_PATH=data/normalized/manifest_articles.json
export RAG_EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2
export HF_TOKEN=hf_your_token_here
export RAG_HF_MODEL=Qwen/Qwen2.5-7B-Instruct-1M:featherless-ai
export RAG_HF_PROVIDER_SUFFIX=featherless-ai
export RAG_LLM_BACKEND=hf-router
export RAG_MAX_CONTEXT_TOKENS=6000
export RAG_INJECT_MAX_CHUNKS=6
export RAG_MAX_GENERATION_TOKENS=512
export RAG_RETRIEVE_TOPK_MULT=2
export RAG_OUT_DIR=data/normalized
export RAG_ARTICLE_SOURCES=sources_articles.json
```

## Deploy to Hugging Face Spaces

1. Create a new Space (Streamlit SDK) and push this repo.
2. Enable Persistent Storage and set caches:
   - `HF_HOME=/data/.huggingface`
   - `SENTENCE_TRANSFORMERS_HOME=/data/.sentence-transformers`
3. In Space Settings → Secrets, set `HF_TOKEN` (required) and optionally `GITHUB_TOKEN`.
4.
   In Space Settings → Variables, set `RAG_HF_MODEL` and `RAG_LLM_BACKEND=hf-router`.
5. Optional: `RAG_HF_PROVIDER_SUFFIX`, `RAG_INJECT_MAX_CHUNKS`, and `RAG_RETRIEVE_TOPK_MULT`.

With persistent storage enabled, FAISS indexes are stored in `/data/rag_cache` and reused across restarts. They rebuild only when the normalized chunk/manifest files change.

## Common maintenance tasks

### Add new books (PDFs)

1. Add PDFs to `data/raw_pdfs/`
2. Update `sources.json`
3. Run `make rebuild`
4. (Optional) `make clean-index`
5. `streamlit run app.py`

### Add new articles

1. Update `sources_articles.json`
2. Run `make rebuild`
3. (Optional) `make clean-index`
4. `streamlit run app.py`

### Rebuild indexes only

```bash
make clean-index
```

## Scripts and commands reference

- `app.py` - Streamlit UI; loads chunk files and builds/loads FAISS indexes.
- `scripts/normalize_all.py` - Parses PDFs and MCP markdowns into `chunks_books.jsonl` and `manifest_books.json`.
- `scripts/ingest_articles.py` - Fetches URLs from `sources_articles.json` and writes `chunks_articles.jsonl` and `manifest_articles.json`, plus `articles_ingest_report.json`.
- `scripts/merge_chunks.py` - Merges multiple chunk files and manifests; emits `chunks.jsonl`, `manifest.json`, and `merge_report.json`.
- `scripts/rebuild_all.sh` - Runs normalize, ingest, and merge in order (same as `make rebuild`).
- `scripts/refresh_mcp.sh` - Downloads `llms-full.txt` and regenerates the MCP markdowns in `mcp/`.
- `scripts/split_mcp.py` - Splits a single MCP snapshot text file into topic markdown files.
- `refresh_mcp.sh` - Convenience wrapper for `scripts/refresh_mcp.sh`.
- `normalize_all.py`, `ingest_articles.py`, `merge_chunks.py`, `split_mcp.py` - Convenience wrappers for the `scripts/` versions.
- `Makefile` - `make install`, `make rebuild`, `make clean-index`, `make run`.
- `build_kb.py` - Legacy entry point referencing a removed `src/` package; not used by the current app.

## License

Apache License 2.0. See `LICENSE`.
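The `doc_id` prefixes listed at the top (`mcp::`, `article::`) make it easy to slice the merged chunk file by corpus. A minimal sketch, assuming each JSONL record carries a `doc_id` field (the `text` field here is illustrative; the real chunk schema may have additional fields):

```python
import json


def filter_chunks(jsonl_lines, prefix):
    """Yield chunk records whose doc_id starts with a corpus prefix.

    Assumes each JSONL record has a `doc_id` field; other fields
    (e.g. `text`) are illustrative and may differ from the real schema.
    """
    for line in jsonl_lines:
        record = json.loads(line)
        if record["doc_id"].startswith(prefix):
            yield record


# Example records mimicking entries in data/normalized/chunks.jsonl.
sample = [
    '{"doc_id": "mcp::tools", "text": "MCP tool docs"}',
    '{"doc_id": "aima", "text": "Search chapter"}',
    '{"doc_id": "article::rag-eval", "text": "Eval article"}',
]
print([r["doc_id"] for r in filter_chunks(sample, "mcp::")])  # ['mcp::tools']
```

The same call with `prefix="article::"` selects only article chunks; an exact `doc_id` such as `aima` can be matched with `record["doc_id"] == "aima"` instead.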
## Troubleshooting

- If you see `No chunks loaded`, ensure the `data/normalized/*.jsonl` files exist and are non-empty.
- If the Hugging Face request fails, verify that `HF_TOKEN` is set and the model name or endpoint is correct.
- If article ingestion skips sources, check `data/normalized/articles_ingest_report.json`.
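For the `No chunks loaded` case, a quick standalone check like the following (not part of the app; the helper names are hypothetical) confirms whether the chunk files exist and actually contain records:

```python
from pathlib import Path


def count_records(text: str) -> int:
    """Count non-blank JSONL lines (one record per line)."""
    return sum(1 for line in text.splitlines() if line.strip())


def chunk_file_status(path: str) -> str:
    """Report whether a chunk file exists and how many records it holds."""
    p = Path(path)
    if not p.exists():
        return f"{path}: missing"
    return f"{path}: {count_records(p.read_text(encoding='utf-8'))} chunk(s)"


for name in ("chunks_books.jsonl", "chunks_articles.jsonl", "chunks.jsonl"):
    print(chunk_file_status(f"data/normalized/{name}"))
```

A `missing` or `0 chunk(s)` result for any of the three files means `make rebuild` has not been run (or failed) for that corpus.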