---
title: Agentic RAG (Agentic DP + AIMA + MCP)
author: O.O
sdk: streamlit
app_file: app.py
---
# Agentic RAG (FAISS + SentenceTransformers + Hugging Face LLM)

A Streamlit UI that answers questions over a local RAG corpus, with a retrieval-only baseline. It indexes chunk files with FAISS and retrieves across:

- Agentic Design Patterns (doc_id: `agentic_design_patterns`)
- AIMA (doc_id: `aima`)
- MCP markdowns (doc_id prefix: `mcp::`)
- Articles (doc_id prefix: `article::`)
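For reference, the corpora can be told apart by their `doc_id` conventions alone. A minimal sketch, assuming each JSONL line carries a `doc_id` field as the prefixes above suggest (the exact chunk schema may differ):

```python
import json
from collections import Counter

def count_chunks_by_source(path: str) -> Counter:
    """Tally chunks per corpus using the doc_id conventions above."""
    counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            doc_id = json.loads(line)["doc_id"]
            if doc_id.startswith("mcp::"):
                counts["mcp"] += 1
            elif doc_id.startswith("article::"):
                counts["articles"] += 1
            else:
                counts[doc_id] += 1
    return counts
```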
## Quick start (local)

### 0) Prerequisites

- Python 3.11+
- A Hugging Face access token (for the hosted LLM)
- Network access for article ingestion and MCP refresh scripts
### 1) Create a venv and install dependencies

macOS:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Windows (PowerShell):

```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
```

Windows (CMD):

```bat
python -m venv .venv
.\.venv\Scripts\activate.bat
pip install -r requirements.txt
```
### 2) Configure Hugging Face model access

Set these environment variables (local dev, or Hugging Face Spaces secrets):

```bash
export HF_TOKEN=hf_your_token_here
export RAG_HF_MODEL=Qwen/Qwen2.5-7B-Instruct-1M:featherless-ai
export RAG_HF_PROVIDER_SUFFIX=featherless-ai
export RAG_LLM_BACKEND=hf-router
```

`RAG_HF_PROVIDER_SUFFIX` is only needed if your model id does not already include the provider suffix (the part after `:`).
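Before launching, it can help to fail fast on missing credentials. A small sketch using the variable names above (`check_llm_config` is a hypothetical helper, not part of the app):

```python
import os

REQUIRED = ["HF_TOKEN", "RAG_HF_MODEL"]
OPTIONAL_DEFAULTS = {"RAG_LLM_BACKEND": "hf-router", "RAG_HF_PROVIDER_SUFFIX": ""}

def check_llm_config(env=os.environ) -> dict:
    """Raise early on missing credentials; fall back to defaults otherwise."""
    missing = [k for k in REQUIRED if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {missing}")
    cfg = {k: env[k] for k in REQUIRED}
    for k, default in OPTIONAL_DEFAULTS.items():
        cfg[k] = env.get(k, default)
    return cfg
```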
### 3) Prepare sources

- Books: drop PDFs into `data/raw_pdfs/` and add entries to `sources.json`
- Articles: edit `sources_articles.json` (a list of `{id, type, url, publisher}` objects)
- MCP docs (optional): `bash scripts/refresh_mcp.sh` (downloads the latest snapshot)
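The `{id, type, url, publisher}` shape can be sanity-checked before a rebuild. A sketch with the field names from this README (the validation rules themselves are assumptions, not what `scripts/ingest_articles.py` enforces):

```python
import json

REQUIRED_KEYS = {"id", "type", "url", "publisher"}

def validate_article_sources(path: str) -> list[str]:
    """Return a list of problems found in sources_articles.json (empty list = OK)."""
    problems = []
    with open(path, encoding="utf-8") as fh:
        entries = json.load(fh)
    seen_ids = set()
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
        eid = entry.get("id")
        if eid in seen_ids:
            problems.append(f"entry {i}: duplicate id {eid!r}")
        seen_ids.add(eid)
        if not str(entry.get("url", "")).startswith(("http://", "https://")):
            problems.append(f"entry {i}: url does not look like an HTTP(S) URL")
    return problems
```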
### 4) Build datasets

Recommended one-command rebuild:

```bash
make rebuild
```

This writes to `data/normalized/`:

- `chunks_books.jsonl` + `manifest_books.json`
- `chunks_articles.jsonl` + `manifest_articles.json`
- `chunks.jsonl` + `manifest.json` (merged)
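The merge step can be pictured as a concatenation of JSONL chunk files with per-file counts, which is roughly what `merge_report.json` summarizes. A simplified stand-in for `scripts/merge_chunks.py` (the real script may dedupe or reorder records; this one does not):

```python
import json

def merge_chunk_files(out_path, *in_paths):
    """Concatenate chunk JSONL files into one; return per-file record counts."""
    report = {}
    with open(out_path, "w", encoding="utf-8") as out:
        for path in in_paths:
            count = 0
            with open(path, encoding="utf-8") as fh:
                for line in fh:
                    if not line.strip():
                        continue
                    json.loads(line)  # verify each record is well-formed JSON
                    out.write(line if line.endswith("\n") else line + "\n")
                    count += 1
            report[path] = count
    return report
```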
### 5) Run the app

```bash
streamlit run app.py
```

Open `http://localhost:8501`. On first run, the app builds the FAISS indexes:

- `data/cache/index_books.faiss` (local)
- `data/cache/index_articles.faiss` (local)
## Configuration

You can override the defaults via environment variables:

```bash
export RAG_BOOK_CHUNKS_PATH=data/normalized/chunks_books.jsonl
export RAG_ARTICLE_CHUNKS_PATH=data/normalized/chunks_articles.jsonl
export RAG_BOOK_INDEX_PATH=data/cache/index_books.faiss
export RAG_ARTICLE_INDEX_PATH=data/cache/index_articles.faiss
export RAG_BOOK_MANIFEST_PATH=data/normalized/manifest_books.json
export RAG_ARTICLE_MANIFEST_PATH=data/normalized/manifest_articles.json
export RAG_EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2
export HF_TOKEN=hf_your_token_here
export RAG_HF_MODEL=Qwen/Qwen2.5-7B-Instruct-1M:featherless-ai
export RAG_HF_PROVIDER_SUFFIX=featherless-ai
export RAG_LLM_BACKEND=hf-router
export RAG_MAX_CONTEXT_TOKENS=6000
export RAG_INJECT_MAX_CHUNKS=6
export RAG_MAX_GENERATION_TOKENS=512
export RAG_RETRIEVE_TOPK_MULT=2
export RAG_OUT_DIR=data/normalized
export RAG_ARTICLE_SOURCES=sources_articles.json
```
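Numeric knobs like these are typically read with fallbacks so an unset variable never crashes the app. A sketch using the names above, with the README's example values as illustrative defaults (the app's real defaults may differ):

```python
import os

def load_rag_settings(env=os.environ) -> dict:
    """Read tunables with fallbacks; values shown in this README serve as defaults here."""
    def intval(name, default):
        return int(env.get(name, default))
    return {
        "embed_model": env.get("RAG_EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
        "max_context_tokens": intval("RAG_MAX_CONTEXT_TOKENS", 6000),
        "inject_max_chunks": intval("RAG_INJECT_MAX_CHUNKS", 6),
        "max_generation_tokens": intval("RAG_MAX_GENERATION_TOKENS", 512),
        "retrieve_topk_mult": intval("RAG_RETRIEVE_TOPK_MULT", 2),
    }
```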
## Deploy to Hugging Face Spaces

1. Create a new Space (Streamlit SDK) and push this repo.
2. Enable Persistent Storage and set the cache locations:
   - `HF_HOME=/data/.huggingface`
   - `SENTENCE_TRANSFORMERS_HOME=/data/.sentence-transformers`
3. In Space Settings → Secrets, set `HF_TOKEN` (required) and optionally `GITHUB_TOKEN`.
4. In Space Settings → Variables, set `RAG_HF_MODEL` and `RAG_LLM_BACKEND=hf-router`.
5. Optional: `RAG_HF_PROVIDER_SUFFIX`, `RAG_INJECT_MAX_CHUNKS`, and `RAG_RETRIEVE_TOPK_MULT`.

With persistent storage enabled, the FAISS indexes are stored in `/data/rag_cache` and reused across restarts; they are rebuilt only when the normalized chunk/manifest files change.
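One way to implement that change detection is to fingerprint the chunk/manifest files an index depends on and compare against a stored digest. A sketch; the sidecar stamp file is hypothetical, not necessarily how the app tracks staleness:

```python
import hashlib
from pathlib import Path

def fingerprint(*paths) -> str:
    """Hash the files an index depends on; a changed digest means 'rebuild'."""
    h = hashlib.sha256()
    for p in paths:
        h.update(Path(p).read_bytes())
    return h.hexdigest()

def index_is_stale(index_path, stamp_path, *source_paths) -> bool:
    """True when the index or its stamp is missing, or the sources changed since the build."""
    if not (Path(index_path).exists() and Path(stamp_path).exists()):
        return True
    return Path(stamp_path).read_text() != fingerprint(*source_paths)
```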
## Common maintenance tasks

### Add new books (PDFs)

1. Add PDFs to `data/raw_pdfs/`
2. Update `sources.json`
3. Run `make rebuild`
4. (Optional) `make clean-index`
5. `streamlit run app.py`

### Add new articles

1. Update `sources_articles.json`
2. Run `make rebuild`
3. (Optional) `make clean-index`
4. `streamlit run app.py`

### Rebuild indexes only

```bash
make clean-index
```
## Scripts and commands reference

- `app.py` - Streamlit UI; loads chunk files and builds/loads the FAISS indexes.
- `scripts/normalize_all.py` - Parse PDFs and MCP markdowns into `chunks_books.jsonl` and `manifest_books.json`.
- `scripts/ingest_articles.py` - Fetch URLs from `sources_articles.json` and write `chunks_articles.jsonl` and `manifest_articles.json`, plus `articles_ingest_report.json`.
- `scripts/merge_chunks.py` - Merge multiple chunk files and manifests; emits `chunks.jsonl`, `manifest.json`, and `merge_report.json`.
- `scripts/rebuild_all.sh` - Run normalize, ingest, and merge in order (same as `make rebuild`).
- `scripts/refresh_mcp.sh` - Download `llms-full.txt` and regenerate the MCP markdowns in `mcp/`.
- `scripts/split_mcp.py` - Split a single MCP snapshot text file into topic markdown files.
- `refresh_mcp.sh` - Convenience wrapper for `scripts/refresh_mcp.sh`.
- `normalize_all.py`, `ingest_articles.py`, `merge_chunks.py`, `split_mcp.py` - Convenience wrappers for the `scripts/` versions.
- `Makefile` - `make install`, `make rebuild`, `make clean-index`, `make run`.
- `build_kb.py` - Legacy entry point referencing a removed `src/` package; not used by the current app.
## License

Apache License 2.0. See `LICENSE`.

## Troubleshooting

- If you see `No chunks loaded`, ensure `data/normalized/*.jsonl` exists and has content.
- If the Hugging Face request fails, verify that `HF_TOKEN` is set and the model name or endpoint is correct.
- If article ingestion skips sources, check `data/normalized/articles_ingest_report.json`.
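The checks above can be scripted. A hypothetical diagnostic helper (not part of the repo):

```python
import os
from pathlib import Path

def diagnose(chunks_dir: str = "data/normalized") -> list[str]:
    """Return human-readable findings for the common failure modes above."""
    findings = []
    jsonl = [p for p in Path(chunks_dir).glob("*.jsonl") if p.stat().st_size > 0]
    if not jsonl:
        findings.append(f"No non-empty *.jsonl files under {chunks_dir}; run `make rebuild`.")
    if not os.environ.get("HF_TOKEN"):
        findings.append("HF_TOKEN is not set; Hugging Face LLM requests will fail.")
    report = Path(chunks_dir) / "articles_ingest_report.json"
    if not report.exists():
        findings.append("No articles_ingest_report.json; article ingestion may not have run.")
    return findings
```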