---
title: Simple RAG Pipeline
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app/app.py
pinned: false
license: mit
---

RAG Pipeline - Jupyter Notebook (FAISS + Multi-Backend Embeddings)

A production-ready Retrieval-Augmented Generation (RAG) system built as an interactive Jupyter notebook (RAG_Attempt.ipynb).

Key Features:

  • Fast Embeddings (API backends up to 50x faster than local inference):
    • 🚀 HuggingFace Inference API (768-dim, FREE with rate limits) – Recommended
    • 🚀 Voyage AI (1024-dim, $0.12/1M tokens)
    • 🚀 OpenAI (1536-dim, $0.13/1M tokens)
    • ⚙️ FastEmbed (384-dim, free, local, CPU-only) – Fallback
  • Semantic Chunking: Respects sentence boundaries, preserves page numbers
  • PyMuPDF Integration: 3-5x faster PDF parsing with accurate page tracking
  • Smart Embedding Caching: Detects precomputed embeddings, skips redundant API calls
  • Parallel Processing: Multi-threaded API calls for 5-10x speedup
  • Vector Search: Cosine similarity (FAISS IndexFlatIP) with L2 normalization (see the sketch after this list)
  • Multi-Backend LLM Support:
    • 🤗 HuggingFace Inference API (Mistral, Llama, etc.) – Recommended for HF Spaces
    • 🦙 Ollama (local models: gemma3, smollm2, gpt-oss)
    • 🌐 OpenAI (GPT-4, GPT-3.5)
    • 🔄 Auto-detection: Automatically uses available backend
  • Interactive UI: Gradio demo with auto port detection
  • Source Citations: Formatted answers with [1], [2] citations + source details
  • Persistence: Save/load vector store and metadata (no reprocessing!)
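
The Vector Search bullet above boils down to a handful of FAISS calls. A minimal sketch, assuming float32 embedding vectors; the helper names are illustrative, not the notebook's own:

import faiss
import numpy as np

def build_index(embeddings):
    # With L2-normalized vectors, inner product equals cosine similarity
    vecs = np.asarray(embeddings, dtype="float32")
    faiss.normalize_L2(vecs)                   # in-place normalization
    index = faiss.IndexFlatIP(vecs.shape[1])   # exact inner-product index
    index.add(vecs)
    return index

def search(index, query_vec, k=5):
    q = np.asarray([query_vec], dtype="float32")
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)           # cosine scores + row ids
    return scores[0], ids[0]

IndexFlatIP does exact search, which is the right trade-off at this scale (hundreds of chunks); approximate indexes only pay off on much larger corpora.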

Workflows:

  1. First run: Process document → Generate embeddings → Save to disk (~5 minutes)
  2. Subsequent runs: Load cached embeddings → Query (~15 seconds, zero API calls!), as sketched below
  3. Interactive mode: Gradio web UI for testing different queries
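
A rough sketch of the save/load step behind workflows 1 and 2, using the rag_data/ layout described under File Structure (function names are illustrative):

import os
import pickle
import faiss

def save_store(index, metadata, out_dir="rag_data"):
    os.makedirs(out_dir, exist_ok=True)
    faiss.write_index(index, os.path.join(out_dir, "faiss_index.bin"))
    with open(os.path.join(out_dir, "metadata.pkl"), "wb") as f:
        pickle.dump(metadata, f)

def load_store(out_dir="rag_data"):
    index = faiss.read_index(os.path.join(out_dir, "faiss_index.bin"))
    with open(os.path.join(out_dir, "metadata.pkl"), "rb") as f:
        metadata = pickle.load(f)
    return index, metadata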

Setup

Works on CPU-only boxes. No CUDA needed. All dependencies in requirements.txt.

# 1) Create virtual environment
python -m venv .venv && source .venv/bin/activate

# 2) Ensure pip is installed (fix for missing pip)
python -m ensurepip --upgrade
python -m pip install --upgrade pip

# 3) Install dependencies
pip install -r requirements.txt

# 4) Configure environment (optional, for API embeddings)
cp .env.example .env
# Edit .env and add your API keys:
#   HF_TOKEN (for HuggingFace – FREE, recommended)
#   VOYAGE_API_KEY (optional, fastest embeddings)
#   OPENAI_API_KEY (optional, for OpenAI embeddings or LLM)
#   OLLAMA_MODEL (local Ollama model name, default: smollm2:360m)
#   OLLAMA_BASE_URL (default: http://127.0.0.1:11434)

Environment Variables (.env):

# ===== LLM Backend (Answer Generation) =====
LLM_BACKEND=auto                    # Options: "auto", "huggingface", "ollama", "openai", "none"

# HuggingFace Inference API (recommended for HF Spaces - FREE!)
HF_TOKEN=hf_your_token_here         # Get free token: https://huggingface.co/settings/tokens
HF_LLM_MODEL=HuggingFaceTB/SmolLM2-360M-Instruct  # CPU-friendly! Or: mistralai/Mistral-7B-Instruct-v0.2

# OpenAI (alternative)
OPENAI_API_KEY=sk_your_key
OPENAI_MODEL=gpt-4o-mini

# Local Ollama (alternative - requires local server)
OLLAMA_BASE_URL=http://127.0.0.1:11434
OLLAMA_MODEL=smollm2:360m           # Or: gemma3:270m

# ===== Embedding Backend =====
EMBEDDING_BACKEND=fastembed         # Options: "fastembed" (local), "huggingface", "openai"
FASTEMBED_MODEL=BAAI/bge-small-en-v1.5

# ===== Document Processing =====
CHUNK_SIZE=1000                     # Characters per chunk
CHUNK_OVERLAP=200                   # Character overlap
TOP_K=5                             # Number of sources to retrieve
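
With LLM_BACKEND=auto, the notebook picks whichever backend is available. A minimal sketch of that resolution, assuming python-dotenv; the priority order here is an assumption, not necessarily the notebook's exact logic:

import os
import requests
from dotenv import load_dotenv

load_dotenv()  # read .env into the process environment

def pick_llm_backend():
    choice = os.getenv("LLM_BACKEND", "auto")
    if choice != "auto":
        return choice
    if os.getenv("HF_TOKEN"):                # HF Inference API first
        return "huggingface"
    if os.getenv("OPENAI_API_KEY"):
        return "openai"
    try:                                     # finally, probe a local Ollama server
        base = os.getenv("OLLAMA_BASE_URL", "http://127.0.0.1:11434")
        requests.get(base, timeout=1).raise_for_status()
        return "ollama"
    except requests.RequestException:
        return "none"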

Recommended Configurations:

For HuggingFace Spaces (Cloud):

LLM_BACKEND=huggingface
HF_TOKEN=your_token_here
EMBEDDING_BACKEND=fastembed  # Fast, no API calls

For Local Development:

LLM_BACKEND=ollama
OLLAMA_BASE_URL=http://127.0.0.1:11434
EMBEDDING_BACKEND=fastembed

Quickstart

Option 1: Interactive Jupyter Notebook (Recommended)

# 1) Open the notebook
jupyter notebook RAG_Attempt.ipynb

# 2) Run cells in order:
#    - Cell 1-2: Load environment & configure embeddings
#    - Cell 3-7: Import libraries & initialize embedding backends
#    - Cell 8-10: Load document (PDF/txt/docx), chunk it, and generate embeddings
#    - Cell 11-12: Build FAISS vector store and metadata
#    - Cell 13-14: Retrieve documents and generate answers
#    - Cell 15-16: Launch Gradio web UI (or query programmatically)

# 3) Query the document via Gradio UI or Python:
# query = "What are the Four Laws of Behavior Change?"
# result = execute_query(query, vector_store, metadata_store)
# display_result(result)

Option 2: Gradio Web Demo

The notebook includes a Gradio interface that automatically launches:

# After running the pipeline cells:
# - Opens interactive web UI at http://127.0.0.1:7860
# - Adjust "Number of Sources" (1-10) to control answer length
# - Try example questions with one click
# - Automatically detects available ports (handles conflicts)
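
Port auto-detection can be as simple as probing for a free port before launch. A sketch, assuming the notebook's execute_query(), vector_store, and metadata_store are in scope; the helper and the exact gr.Interface wiring are illustrative:

import socket
import gradio as gr

def find_free_port(start=7860, end=7870):
    # A port is free if nothing accepts a connection on it
    for port in range(start, end):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            if s.connect_ex(("127.0.0.1", port)) != 0:
                return port
    raise RuntimeError("no free port in range")

def answer(query, k):
    # The notebook threads k into retrieval (TOP_K); the real
    # execute_query() signature may differ from this sketch
    return execute_query(query, vector_store, metadata_store)

demo = gr.Interface(
    fn=answer,
    inputs=[gr.Textbox(label="Question"),
            gr.Slider(1, 10, value=5, step=1, label="Number of Sources")],
    outputs=gr.Textbox(label="Answer with sources"),
)
demo.launch(server_port=find_free_port())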

Architecture

┌─────────────────────────────────────────────────────────────┐
│                 DOCUMENT INGESTION                          │
│  load_document() → PDF/txt/docx parsed with metadata        │
└────────────────┬────────────────────────────────────────────┘
                 │
                 ↓
┌─────────────────────────────────────────────────────────────┐
│              SEMANTIC CHUNKING (Smart!)                     │
│  chunk_document_semantic() respects sentence boundaries     │
│  Config: CHUNK_SIZE=1000 chars, OVERLAP=200 chars           │
│  Preserves page numbers for accurate citations              │
└────────────────┬────────────────────────────────────────────┘
                 │
                 ↓
┌─────────────────────────────────────────────────────────────┐
│           EMBEDDING GENERATION (OPTIMIZED)                  │
│  embed_texts_optimized() with:                              │
│  • EmbeddingCache (file-based MD5 hashing, 10x+ speedup)    │
│  • ParallelHFEmbedder (4-worker thread pool)                │
│  • LocalFastEmbedder (batch processing)                     │
│  • Smart skip: detects & loads precomputed embeddings       │
│  APIs: HuggingFace, Voyage, OpenAI, or local FastEmbed      │
└────────────────┬────────────────────────────────────────────┘
                 │
                 ↓
┌─────────────────────────────────────────────────────────────┐
│            VECTOR STORAGE & METADATA                        │
│  FAISS IndexFlatIP (cosine similarity, L2 norm)             │
│  Storage: ./rag_data/{faiss_index.bin, metadata.pkl}        │
│  Fast loading: 1-2 seconds from disk                        │
└────────────────┬────────────────────────────────────────────┘
                 │
                 ↓
┌─────────────────────────────────────────────────────────────┐
│              RETRIEVAL & ANSWER GENERATION                  │
│  retrieve_documents(query, k=TOP_K)                         │
│  generate_answer() with local Ollama                        │
│  format_answer_with_sources() → citations [1], [2]          │
└─────────────────────────────────────────────────────────────┘
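
The semantic chunking stage packs whole sentences up to CHUNK_SIZE characters and carries CHUNK_OVERLAP characters into the next chunk, keeping each chunk's page number. A simplified sketch; the notebook's chunk_document_semantic() likely handles more edge cases:

import re

def chunk_document_semantic(pages, chunk_size=1000, overlap=200):
    # pages: list of (page_number, text) tuples from the PDF parser
    chunks = []
    for page_num, text in pages:
        sentences = re.split(r"(?<=[.!?])\s+", text)   # split at sentence ends
        current = ""
        for sent in sentences:
            if current and len(current) + len(sent) + 1 > chunk_size:
                chunks.append({"text": current, "page": page_num})
                current = current[-overlap:]           # character overlap
            current = f"{current} {sent}".strip() if current else sent
        if current:
            chunks.append({"text": current, "page": page_num})
    return chunks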

Data Flow Example:

User Query: "What are the Four Laws of Behavior Change?"
    ↓
embed_texts([query]) → 768-dimensional vector via HuggingFace
    ↓
vector_store.search(query_vector, k=5) → top 5 similar chunks
    ↓
retrieved_chunks = [
    {"text": "...cue, craving, response, reward...", "page": 42, "source": "Atomic_Habits.pdf"},
    {"text": "...the habit loop binds them together...", "page": 45, "source": "Atomic_Habits.pdf"},
    ...
]
    ↓
generate_answer(query, retrieved_chunks) via local Ollama
    ↓
"The Four Laws are cue, craving, response, and reward [1].
  Each law represents a stage in the habit loop [2].

📚 SOURCES:
[1] Atomic_Habits.pdf (Page 42): "...cue, craving, response..."
[2] Atomic_Habits.pdf (Page 45): "...habit loop..."
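
The generate_answer() step reduces to building a numbered-source prompt and making one call to Ollama's /api/generate endpoint. A minimal sketch (the prompt wording is illustrative):

import requests

def generate_answer(query, chunks,
                    model="smollm2:360m",
                    base_url="http://127.0.0.1:11434"):
    # Number each chunk so the model can cite [1], [2], ...
    context = "\n\n".join(
        f"[{i + 1}] {c['text']} (Page {c['page']})"
        for i, c in enumerate(chunks)
    )
    prompt = (
        "Answer the question using only the sources below, "
        "citing them as [1], [2].\n\n"
        f"SOURCES:\n{context}\n\nQUESTION: {query}\nANSWER:"
    )
    r = requests.post(f"{base_url}/api/generate",
                      json={"model": model, "prompt": prompt, "stream": False},
                      timeout=120)
    r.raise_for_status()
    return r.json()["response"]  # Ollama returns the completion here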

File Structure

RAG_Attempt.ipynb          # Main interactive notebook
├─ Cell 1-7: Setup & environment loading
├─ Cell 8-10: Document loading (PDF/txt/docx) & chunking
├─ Cell 11-12: Embedding generation with optimizations
├─ Cell 13-14: Vector store & retrieval
├─ Cell 15-16: Answer generation & formatting
└─ Cell 17-19: Gradio UI demo & testing

pipeline.ipynb             # Experimental/development notebook

.env.example               # Configuration template
requirements.txt           # All dependencies
README.md                  # This file

rag_data/                  # Generated data (created after first run)
  ├─ faiss_index.bin       # Vector index (~2.6 MB for ~660 chunks)
  ├─ chunk_ids.pkl         # Chunk ID mapping
  ├─ metadata.pkl          # Document metadata & source info
  └─ .embedding_cache/     # File-based embedding cache
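
The .embedding_cache/ directory above holds one small file per chunk, keyed by an MD5 hash of the chunk text, which is what lets re-runs skip the API entirely. A minimal sketch of that idea; the notebook's EmbeddingCache may differ:

import hashlib
import pickle
from pathlib import Path

class EmbeddingCache:
    def __init__(self, cache_dir="rag_data/.embedding_cache"):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)

    def _path(self, text):
        # One pickle per text, named by the MD5 of the text itself
        key = hashlib.md5(text.encode("utf-8")).hexdigest()
        return self.dir / f"{key}.pkl"

    def get(self, text):
        p = self._path(text)
        return pickle.loads(p.read_bytes()) if p.exists() else None

    def put(self, text, embedding):
        self._path(text).write_bytes(pickle.dumps(embedding))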

Next Steps

  1. Query Caching: Cache popular Q&A pairs
  2. Re-ranking: Add cross-encoders for better retrieval (see the sketch after this list)
  3. Evaluation: Implement BLEU/ROUGE metrics
  4. Multi-document: Support directory ingestion
  5. FastAPI Wrapper: Deploy as REST API
  6. Monitoring: Add Langfuse/LLM observability
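
For item 2, a re-ranking pass would score each retrieved chunk against the query with a cross-encoder and keep only the best few. One possible sketch using sentence-transformers; the model choice is an assumption:

from sentence_transformers import CrossEncoder

def rerank(query, chunks, top_n=5):
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    # Score (query, chunk) pairs; higher means more relevant
    scores = model.predict([(query, c["text"]) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]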

Support

For issues or questions:

  1. Check that the .env file is correctly configured
  2. Verify API keys aren't expired
  3. Review cell outputs in the notebook for error messages
  4. Confirm the selected backend is reachable (e.g., the local Ollama server at OLLAMA_BASE_URL)