---
title: Simple RAG Pipeline
emoji: 📚
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app/app.py
pinned: false
license: mit
---
# RAG Pipeline: Jupyter Notebook (FAISS + Multi-Backend Embeddings)

A production-ready Retrieval-Augmented Generation system built as an interactive Jupyter notebook (`RAG_Attempt.ipynb`).

## Key Features
- **Fast Embeddings** (50x faster than local):
  - HuggingFace Inference API (768-dim, free with rate limits; recommended)
  - Voyage AI (1024-dim, $0.12/1M tokens)
  - OpenAI (1536-dim, $0.13/1M tokens)
  - FastEmbed (384-dim, free, local, CPU-only; fallback)
- **Semantic Chunking**: respects sentence boundaries, preserves page numbers
- **PyMuPDF Integration**: 3-5x faster PDF parsing with accurate page tracking
- **Smart Embedding Caching**: detects precomputed embeddings, skips redundant API calls
- **Parallel Processing**: multi-threaded API calls for a 5-10x speedup
- **Vector Search**: cosine similarity (FAISS IndexFlatIP) with L2 normalization
- **Multi-Backend LLM Support**:
  - HuggingFace Inference API (Mistral, Llama, etc.; recommended for HF Spaces)
  - Ollama (local models: gemma3, smollm2, gpt-oss)
  - OpenAI (GPT-4, GPT-3.5)
  - Auto-detection: automatically uses whichever backend is available
- **Interactive UI**: Gradio demo with automatic port detection
- **Source Citations**: formatted answers with [1], [2] citations plus source details
- **Persistence**: save/load the vector store and metadata (no reprocessing!)
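The semantic-chunking idea above can be sketched roughly as follows. This is a simplified, hypothetical stand-in for the notebook's `chunk_document_semantic()` (the real one also carries page numbers); splitting sentences with a regex is an assumption:

```python
import re

def chunk_semantic(text, chunk_size=1000, overlap=200):
    """Split text into chunks of up to `chunk_size` characters, breaking
    only at sentence boundaries and carrying an `overlap`-character tail
    into the next chunk for context continuity."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        # If adding this sentence would overflow the chunk, flush it first.
        if current and len(current) + len(sent) + 1 > chunk_size:
            chunks.append(current)
            current = current[-overlap:]  # overlap tail seeds the next chunk
        current = (current + " " + sent).strip()
    if current:
        chunks.append(current)
    return chunks
```

Note the overlap tail is a plain character slice, so it may start mid-word; a production chunker would overlap at sentence granularity.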
## Workflows

- **First run**: process document → generate embeddings → save to disk (~5 minutes)
- **Subsequent runs**: load cached embeddings → query (~15 seconds, zero API calls!)
- **Interactive mode**: Gradio web UI for testing different queries
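The zero-API-call second run hinges on a file-based embedding cache keyed by a content hash. A minimal sketch of such a cache, assuming MD5 keys as the pipeline describes (the JSON-file-per-vector layout here is an illustrative assumption, not the notebook's exact format):

```python
import hashlib
import json
import os

class EmbeddingCache:
    """File-based embedding cache: one JSON file per (model, text) pair,
    named by the MD5 of the pair. A cache hit skips the API call."""

    def __init__(self, cache_dir=".embedding_cache"):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, model, text):
        key = hashlib.md5(f"{model}::{text}".encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, f"{key}.json")

    def get(self, model, text):
        path = self._path(model, text)
        if os.path.exists(path):
            with open(path) as f:
                return json.load(f)  # cached vector, no API call needed
        return None

    def put(self, model, text, vector):
        with open(self._path(model, text), "w") as f:
            json.dump(vector, f)
```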
## Setup

Works on CPU-only boxes; no CUDA needed. All dependencies are listed in `requirements.txt`.
```bash
# 1) Create virtual environment
python -m venv .venv && source .venv/bin/activate

# 2) Ensure pip is installed (fix for missing pip)
python -m ensurepip --upgrade
python -m pip install --upgrade pip

# 3) Install dependencies
pip install -r requirements.txt

# 4) Configure environment (optional, for API embeddings)
cp .env.example .env
# Edit .env and add your API keys:
#   HF_TOKEN        (for HuggingFace; free, recommended)
#   VOYAGE_API_KEY  (optional, fastest embeddings)
#   OPENAI_API_KEY  (optional, for OpenAI embeddings or LLM)
#   OLLAMA_MODEL    (local Ollama model name, default: smollm2:360m)
#   OLLAMA_BASE_URL (default: http://127.0.0.1:11434)
```
**Environment Variables (`.env`):**

```bash
# ===== LLM Backend (Answer Generation) =====
LLM_BACKEND=auto   # Options: "auto", "huggingface", "ollama", "openai", "none"

# HuggingFace Inference API (recommended for HF Spaces; free)
HF_TOKEN=hf_your_token_here   # Get a free token: https://huggingface.co/settings/tokens
HF_LLM_MODEL=HuggingFaceTB/SmolLM2-360M-Instruct   # CPU-friendly! Or: mistralai/Mistral-7B-Instruct-v0.2

# OpenAI (alternative)
OPENAI_API_KEY=sk_your_key
OPENAI_MODEL=gpt-4o-mini

# Local Ollama (alternative; requires a local server)
OLLAMA_BASE_URL=http://127.0.0.1:11434
OLLAMA_MODEL=smollm2:360m   # Or: gemma3:270m

# ===== Embedding Backend =====
EMBEDDING_BACKEND=fastembed   # Options: "fastembed" (local), "huggingface", "openai"
FASTEMBED_MODEL=BAAI/bge-small-en-v1.5

# ===== Document Processing =====
CHUNK_SIZE=1000      # Characters per chunk
CHUNK_OVERLAP=200    # Character overlap between chunks
TOP_K=5              # Number of sources to retrieve
```
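At startup these variables are typically read into Python with `python-dotenv`. The stdlib-only `load_env` helper below is a hypothetical stand-in that mimics `load_dotenv` so the example is self-contained; the variable names match the `.env` template above:

```python
import os

def load_env(path=".env"):
    """Minimal .env loader (stand-in for python-dotenv's load_dotenv):
    KEY=VALUE lines, '#' starts a comment, existing environment wins."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()
            if "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

load_env()
LLM_BACKEND = os.getenv("LLM_BACKEND", "auto")
EMBEDDING_BACKEND = os.getenv("EMBEDDING_BACKEND", "fastembed")
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "200"))
TOP_K = int(os.getenv("TOP_K", "5"))
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://127.0.0.1:11434")
```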
**Recommended Configurations:**

For HuggingFace Spaces (cloud):

```bash
LLM_BACKEND=huggingface
HF_TOKEN=your_token_here
EMBEDDING_BACKEND=fastembed   # Fast, no API calls
```

For local development:

```bash
LLM_BACKEND=ollama
OLLAMA_BASE_URL=http://127.0.0.1:11434
EMBEDDING_BACKEND=fastembed
```
## Quickstart

### Option 1: Interactive Jupyter Notebook (Recommended)

```bash
# 1) Open the notebook
jupyter notebook RAG_Attempt.ipynb
```

Run the cells in order:

- Cells 1-2: load environment and configure embeddings
- Cells 3-7: import libraries and initialize embedding backends
- Cells 8-10: load the document (PDF/txt/docx), chunk it, and generate embeddings
- Cells 11-12: build the FAISS vector store and metadata
- Cells 13-14: retrieve documents and generate answers
- Cells 15-16: launch the Gradio web UI (or query programmatically)

Then query the document via the Gradio UI or from Python:

```python
query = "What are the Four Laws of Behavior Change?"
result = execute_query(query, vector_store, metadata_store)
display_result(result)
```
### Option 2: Gradio Web Demo

The notebook includes a Gradio interface that launches automatically after the pipeline cells run:

- Opens an interactive web UI at http://127.0.0.1:7860
- Adjust "Number of Sources" (1-10) to control answer length
- Try example questions with one click
- Automatically detects available ports (handles conflicts)
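Port auto-detection can be done by probing for a free TCP port before calling `demo.launch()`. A sketch under that assumption; the `find_free_port` helper is hypothetical, not the notebook's actual implementation:

```python
import socket

def find_free_port(start=7860, max_tries=20):
    """Return the first free TCP port at or above `start` (Gradio's default
    is 7860). Raises RuntimeError if no port in the range is available."""
    for port in range(start, start + max_tries):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("127.0.0.1", port))  # bind succeeds only if free
                return port
            except OSError:
                continue  # port in use, try the next one
    raise RuntimeError(f"no free port in {start}-{start + max_tries - 1}")

# Hand the detected port to Gradio, e.g.:
# demo.launch(server_port=find_free_port())
```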
## Architecture

```text
┌──────────────────────────────────────────────────────────────┐
│ DOCUMENT INGESTION                                           │
│ load_document() → PDF/txt/docx parsed with metadata          │
└────────────────┬─────────────────────────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────────────────────────┐
│ SEMANTIC CHUNKING                                            │
│ chunk_document_semantic() respects sentence boundaries       │
│ Config: CHUNK_SIZE=1000 chars, OVERLAP=200 chars             │
│ Preserves page numbers for accurate citations                │
└────────────────┬─────────────────────────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────────────────────────┐
│ EMBEDDING GENERATION (OPTIMIZED)                             │
│ embed_texts_optimized() with:                                │
│  • EmbeddingCache (file-based MD5 hashing, 10x+ speedup)     │
│  • ParallelHFEmbedder (4-worker thread pool)                 │
│  • LocalFastEmbedder (batch processing)                      │
│  • Smart skip: detects and loads precomputed embeddings      │
│ APIs: HuggingFace, Voyage, OpenAI, or local FastEmbed        │
└────────────────┬─────────────────────────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────────────────────────┐
│ VECTOR STORAGE & METADATA                                    │
│ FAISS IndexFlatIP (cosine similarity, L2 norm)               │
│ Storage: ./rag_data/{faiss_index.bin, metadata.pkl}          │
│ Fast loading: 1-2 seconds from disk                          │
└────────────────┬─────────────────────────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────────────────────────┐
│ RETRIEVAL & ANSWER GENERATION                                │
│ retrieve_documents(query, k=TOP_K)                           │
│ generate_answer() with local Ollama                          │
│ format_answer_with_sources() → citations [1], [2]            │
└──────────────────────────────────────────────────────────────┘
```
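The retrieval stage relies on the fact that an inner product over L2-normalized vectors equals cosine similarity, which is what FAISS `IndexFlatIP` computes after `faiss.normalize_L2`. A NumPy-only sketch of the same math (FAISS swapped out so the example is self-contained):

```python
import numpy as np

def l2_normalize(vectors):
    """Scale each row to unit length, as faiss.normalize_L2 does in place."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def search_cosine(index_vectors, query_vector, k=5):
    """Exact top-k search: inner product over L2-normalized rows is
    cosine similarity, i.e. what IndexFlatIP returns after normalization."""
    scores = (l2_normalize(index_vectors)
              @ l2_normalize(query_vector[None, :]).T).ravel()
    top = np.argsort(-scores)[:k]  # indices of the k highest similarities
    return top, scores[top]
```

The returned indices map back into the metadata store to recover chunk text, page numbers, and source filenames for the [1], [2] citations.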
**Data Flow Example:**

```text
User Query: "What are the Four Laws of Behavior Change?"
        ↓
embed_texts([query]) → 768-dimensional vector via HuggingFace
        ↓
vector_store.search(query_vector, k=5) → top 5 similar chunks
        ↓
retrieved_chunks = [
  {"text": "...cue, craving, response, reward...", "page": 42, "source": "Atomic_Habits.pdf"},
  {"text": "...the habit loop binds them together...", "page": 45, "source": "Atomic_Habits.pdf"},
  ...
]
        ↓
generate_answer(query, retrieved_chunks) via local Ollama
        ↓
"The Four Laws are cue, craving, response, and reward [1].
 Each law represents a stage in the habit loop [2]."

SOURCES:
[1] Atomic_Habits.pdf (Page 42): "...cue, craving, response..."
[2] Atomic_Habits.pdf (Page 45): "...habit loop..."
```
## File Structure

```text
RAG_Attempt.ipynb        # Main interactive notebook
├─ Cells 1-7:   Setup & environment loading
├─ Cells 8-10:  Document loading (PDF/txt/docx) & chunking
├─ Cells 11-12: Embedding generation with optimizations
├─ Cells 13-14: Vector store & retrieval
├─ Cells 15-16: Answer generation & formatting
└─ Cells 17-19: Gradio UI demo & testing

pipeline.ipynb           # Experimental/development notebook
.env.example             # Configuration template
requirements.txt         # All dependencies
README.md                # This file

rag_data/                # Generated data (created after first run)
├─ faiss_index.bin       # Vector index (~2.6 MB per 660 chunks)
├─ chunk_ids.pkl         # Chunk ID mapping
├─ metadata.pkl          # Document metadata & source info
└─ .embedding_cache/     # File-based embedding cache
```
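Persisting the metadata side of `rag_data/` can be as simple as pickling; the FAISS index itself would be saved alongside with `faiss.write_index` (omitted here to keep the sketch stdlib-only). The helper names below are hypothetical:

```python
import os
import pickle

def save_metadata(metadata, chunk_ids, data_dir="rag_data"):
    """Persist chunk metadata and ID mapping so later runs skip
    reprocessing. (The FAISS index is written separately, e.g. with
    faiss.write_index(index, os.path.join(data_dir, "faiss_index.bin")).)"""
    os.makedirs(data_dir, exist_ok=True)
    with open(os.path.join(data_dir, "metadata.pkl"), "wb") as f:
        pickle.dump(metadata, f)
    with open(os.path.join(data_dir, "chunk_ids.pkl"), "wb") as f:
        pickle.dump(chunk_ids, f)

def load_metadata(data_dir="rag_data"):
    """Reload metadata and chunk IDs from disk in a second or two."""
    with open(os.path.join(data_dir, "metadata.pkl"), "rb") as f:
        metadata = pickle.load(f)
    with open(os.path.join(data_dir, "chunk_ids.pkl"), "rb") as f:
        chunk_ids = pickle.load(f)
    return metadata, chunk_ids
```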
## Next Steps

- **Query Caching**: cache popular Q&A pairs
- **Re-ranking**: add cross-encoders for better retrieval
- **Evaluation**: implement BLEU/ROUGE metrics
- **Multi-document**: support directory ingestion
- **FastAPI Wrapper**: deploy as a REST API
- **Monitoring**: add Langfuse/LLM observability
## Support

For issues or questions:

- Check that the `.env` file is correctly configured
- Verify API keys aren't expired
- Review cell outputs in the notebook for error messages
- See the Troubleshooting section above