---
title: QuerySphere
emoji: 🧠
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: false
license: mit
app_port: 7860
---
# QuerySphere: RAG Platform for Document Q&A with Knowledge Ingestion

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/) [![FastAPI](https://img.shields.io/badge/FastAPI-0.104+-green.svg)](https://fastapi.tiangolo.com/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

> **MVP-Grade RAG Platform with Multi-Format Document Ingestion, Hybrid Retrieval, and Zero API Costs**
---

An MVP-grade Retrieval-Augmented Generation (RAG) system that enables organizations to unlock knowledge trapped across documents and archives while maintaining complete data privacy and eliminating costly API dependencies.

---

## 📋 Table of Contents

- [Overview](#-overview)
- [Key Features](#-key-features)
- [System Architecture](#-system-architecture)
- [Technology Stack](#-technology-stack)
- [Installation](#-installation)
- [Quick Start](#-quick-start)
- [Core Components](#-core-components)
- [API Documentation](#-api-documentation)
- [Configuration](#-configuration)
- [RAGAS Evaluation](#-ragas-evaluation)
- [Troubleshooting](#-troubleshooting)
- [License](#-license)

---

## 🎯 Overview

QuerySphere addresses a critical enterprise pain point: **information silos that cost organizations 20% of employee productivity**. Unlike existing solutions (Humata AI, ChatPDF, NotebookLM) that charge $49/user/month and rely on expensive cloud LLM APIs, this system offers:

### **Core Value Propositions**

| Feature | Traditional Solutions | Our System |
|---------|----------------------|------------|
| **Privacy** | Cloud-based (data leaves premises) | 100% on-premise processing |
| **Cost** | $49-99/user/month + API fees | Zero API costs (local inference) |
| **Input Types** | PDF only | PDF, DOCX, TXT, ZIP archives |
| **Quality Metrics** | Black box (no visibility) | RAGAS evaluation with detailed metrics |
| **Retrieval** | Vector-only | Hybrid (Vector + BM25 + Reranking) |
| **Chunking** | Fixed size | Adaptive (3 strategies) |

### **Market Context**

- **$8.5B** projected enterprise AI search market by 2027
- **85%** of enterprises actively adopting AI-powered knowledge management
- **Growing regulatory demands** for on-premise, privacy-compliant solutions

---

## ✨ Key Features

### **1. Multi-Format Document Ingestion**

- **Supported Formats**: PDF, DOCX, TXT
- **Archive Processing**: ZIP files up to 2GB with recursive extraction
- **Batch Upload**: Process multiple documents simultaneously
- **OCR Support**: Extract text from scanned documents and images (PaddleOCR or EasyOCR)

### **2. Intelligent Document Processing**

- **Adaptive Chunking**: Automatically selects the optimal strategy based on document size
  - Fixed-size chunks (< 50K tokens): 512 tokens with 50-token overlap
  - Semantic chunks (50K-500K tokens): Section-aware splitting
  - Hierarchical chunks (> 500K tokens): Parent-child structure
- **Metadata Extraction**: Title, author, date, page numbers, section headers

### **3. Hybrid Retrieval System**

```mermaid
graph LR
    A[User Query] --> B[Query Embedding]
    A --> C[Keyword Analysis]
    B --> D[Vector Search<br/>FAISS]
    C --> E[BM25 Search]
    D --> F[Reciprocal Rank Fusion<br/>60% Vector + 40% BM25]
    E --> F
    F --> G[Cross-Encoder Reranking]
    G --> H[Top-K Results]
```

- **Vector Search**: FAISS with BGE embeddings (384-dim)
- **Keyword Search**: BM25 with optimized parameters (k1=1.5, b=0.75)
- **Fusion Methods**: Weighted, Reciprocal Rank Fusion (RRF), CombSUM
- **Reranking**: Cross-encoder for a precision boost

### **4. Local LLM Generation**

- **Ollama Integration**: Zero-cost inference with Mistral-7B or LLaMA-2
- **Adaptive Temperature**: Context-aware generation parameters
- **Citation Tracking**: Automatic source attribution with validation
- **Streaming Support**: Token-by-token response generation

### **5. RAGAS Quality Assurance**

- **Real-Time Evaluation**: Answer relevancy, faithfulness, context precision/recall
- **Automatic Metrics**: Computed for every query-response pair
- **Analytics Dashboard**: Track quality trends over time
- **Export Capability**: Download evaluation data for analysis
- **Session Statistics**: Aggregate metrics across conversation sessions

---

## 🗂️ System Architecture

### **High-Level Architecture**

```mermaid
graph TB
    subgraph "Frontend Layer"
        A[Web UI<br/>HTML/CSS/JS]
    end
    subgraph "API Layer"
        B[FastAPI Gateway<br/>REST Endpoints]
    end
    subgraph "Ingestion Pipeline"
        C[Document Parser<br/>PDF/DOCX/TXT]
        D[Adaptive Chunker<br/>3 Strategies]
        E[Embedding Generator<br/>BGE-small-en-v1.5]
    end
    subgraph "Storage Layer"
        F[FAISS Vector DB<br/>~10M vectors]
        G[BM25 Keyword Index]
        H[SQLite Metadata<br/>Documents & Chunks]
        I[LRU Cache<br/>Embeddings]
    end
    subgraph "Retrieval Engine"
        J[Hybrid Retriever<br/>Vector + BM25]
        K[Cross-Encoder<br/>Reranker]
        L[Context Assembler]
    end
    subgraph "Generation Engine"
        M[Ollama LLM<br/>Mistral-7B]
        N[Prompt Builder]
        O[Citation Formatter]
    end
    subgraph "Evaluation Engine"
        P[RAGAS Evaluator<br/>Quality Metrics]
    end
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    E --> G
    D --> H
    E --> I
    B --> J
    J --> F
    J --> G
    J --> K
    K --> L
    L --> N
    N --> M
    M --> O
    O --> A
    M --> P
    P --> H
    P --> A
```

### **Why This Architecture?**

#### **Modular Design**

Each component is independent and replaceable:

- **Parser**: Swap PDF libraries without affecting chunking
- **Embedder**: Change from BGE to OpenAI embeddings with a config update
- **LLM**: Switch from Ollama to the OpenAI API seamlessly

#### **Separation of Concerns**

```
Ingestion → Storage → Retrieval → Generation → Evaluation
```

Each stage has clear inputs/outputs and a single responsibility.

#### **Performance Optimization**

- **Async Processing**: Non-blocking I/O for uploads and LLM calls
- **Batch Operations**: Embed 32 chunks simultaneously
- **Local Caching**: LRU cache for query embeddings and frequent retrievals
- **Indexing**: FAISS ANN for O(log n) search vs O(n) brute force

---

## 🔧 Technology Stack

### **Core Technologies**

| Component | Technology | Version | Why This Choice |
|-----------|-----------|---------|-----------------|
| **Backend** | FastAPI | 0.104+ | Async support, auto-docs, production-grade |
| **LLM** | Ollama (Mistral-7B) | Latest | Zero API costs, on-premise, 20-30 tokens/sec |
| **Embeddings** | BGE-small-en-v1.5 | 384-dim | SOTA quality, 10x faster than alternatives |
| **Vector DB** | FAISS | Latest | Battle-tested, 10x faster than ChromaDB |
| **Keyword Search** | BM25 (rank_bm25) | Latest | Fast probabilistic ranking |
| **Document Parsing** | PyPDF2, python-docx | Latest | Industry standard, reliable |
| **Chunking** | LlamaIndex | 0.9+ | Advanced semantic splitting |
| **Reranking** | Cross-Encoder | Latest | +15% accuracy, minimal latency |
| **Evaluation** | RAGAS | 0.1.9 | Automated RAG quality metrics |
| **Frontend** | Alpine.js | 3.x | Lightweight reactivity, no build step |
| **Database** | SQLite | 3.x | Zero-config, sufficient for metadata |
| **Caching** | In-Memory LRU | Python `functools` | Fast, no external dependencies |

### **Python Dependencies**

```
fastapi>=0.104.0
uvicorn>=0.24.0
ollama>=0.1.0
sentence-transformers>=2.2.2
faiss-cpu>=1.7.4
llama-index>=0.9.0
rank-bm25>=0.2.2
PyPDF2>=3.0.0
python-docx>=0.8.11
pydantic>=2.0.0
aiohttp>=3.9.0
tiktoken>=0.5.0
ragas==0.1.9
datasets==2.14.6
```

---

## 📦 Installation

### **Prerequisites**

- Python 3.10 or higher
- 8GB RAM minimum (16GB recommended)
- 10GB disk space for models and indexes
- Ollama installed ([https://ollama.ai](https://ollama.ai))

### **Step 1: Clone Repository**

```bash
git clone https://github.com/satyaki-mitra/docu-vault-ai.git
cd docu-vault-ai
```

### **Step 2: Create Virtual Environment**

```bash
# Using conda (recommended)
conda create -n rag_env python=3.10
conda activate rag_env

# Or using venv
python -m venv rag_env
source rag_env/bin/activate  # On Windows: rag_env\Scripts\activate
```

### **Step 3: Install Dependencies**

```bash
pip install -r requirements.txt
```

### **Step 4: Install Ollama and Model**

```bash
# Install Ollama (macOS)
brew install ollama

# Install Ollama (Linux)
curl https://ollama.ai/install.sh | sh

# Pull Mistral model
ollama pull mistral:7b

# Verify installation
ollama list
```

### **Step 5: Configure Environment**

```bash
# Copy example config
cp .env.example .env

# Edit configuration (optional)
nano .env
```

**Key Configuration Options:**

```bash
# LLM Settings
OLLAMA_MODEL=mistral:7b
DEFAULT_TEMPERATURE=0.1
CONTEXT_WINDOW=8192

# Retrieval Settings
VECTOR_WEIGHT=0.6
BM25_WEIGHT=0.4
ENABLE_RERANKING=True
TOP_K_RETRIEVE=10

# RAGAS Evaluation
ENABLE_RAGAS=True
RAGAS_ENABLE_GROUND_TRUTH=False
OPENAI_API_KEY=your_openai_api_key_here  # Required for RAGAS

# Performance
EMBEDDING_BATCH_SIZE=32
MAX_WORKERS=4
```

---

## 🚀 Quick Start

### **1. Start Ollama Server**

```bash
# Terminal 1: Start Ollama
ollama serve
```

### **2. Launch Application**

```bash
# Terminal 2: Start RAG system
python app.py
```

Output:

```
INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000
```

### **3. Access Web Interface**

Open your browser to: **http://localhost:8000**

### **4. Upload Documents**

1. Click **"Upload Documents"**
2. Select PDF/DOCX/TXT files (or ZIP archives)
3. Click **"Start Building"**
4. Wait for indexing to complete (progress bar shows status)

### **5. Query Your Documents**

```
Query: "What are the key findings in the Q3 report?"

Response: The Q3 report highlights three key findings: [1] Revenue increased
23% year-over-year to $45.2M, [2] Customer acquisition costs decreased 15%,
and [3] Net retention rate reached 118% [1].

Sources:
[1] Q3_Financial_Report.pdf (Page 3, Executive Summary)

RAGAS Metrics:
- Answer Relevancy: 0.89
- Faithfulness: 0.94
- Context Utilization: 0.87
- Overall Score: 0.90
```

---

## 🧩 Core Components

### **1. Document Ingestion Pipeline**

**High-level flow:**

```
Document Upload → Parse → Clean → Chunk → Embed → Index
```

**Adaptive Chunking Logic:**

```mermaid
graph TD
    A[Calculate Token Count] --> B{Tokens < 50K?}
    B -->|Yes| C[Fixed Chunking<br/>512 tokens, 50 overlap]
    B -->|No| D{Tokens < 500K?}
    D -->|Yes| E[Semantic Chunking<br/>Section-aware]
    D -->|No| F[Hierarchical Chunking<br/>Parent 2048, Child 512]
```

### **2. Hybrid Retrieval Engine**

**Retrieval Flow:**

```python
# Pseudocode
def hybrid_retrieve(query: str, top_k: int = 10):
    # Dual retrieval: dense vectors + sparse keywords
    query_embedding = embedder.embed(query)
    vector_results  = faiss_index.search(query_embedding, top_k * 2)
    bm25_results    = bm25_index.search(query, top_k * 2)

    # Fusion (weighted RRF: 60% vector, 40% BM25)
    fused_results = reciprocal_rank_fusion(vector_results,
                                           bm25_results,
                                           weights = (0.6, 0.4))

    # Cross-encoder reranking for the final top-k
    reranked = cross_encoder.rerank(query, fused_results, top_k)
    return reranked
```

### **3. Response Generation**

**Temperature Control:**

```mermaid
graph LR
    A[Query Type] --> B{Factual?}
    B -->|Yes| C[Low Temp<br/>0.1-0.2]
    B -->|No| D[Context Quality]
    D -->|High| E[Medium Temp<br/>0.3-0.5]
    D -->|Low| F[High Temp<br/>0.6-0.8]
```

### **4. RAGAS Evaluation Module**

**Automatic Quality Assessment:**

```python
# After each query-response pair
ragas_result = ragas_evaluator.evaluate_single(query              = user_query,
                                               answer             = generated_answer,
                                               contexts           = retrieved_chunks,
                                               retrieval_time_ms  = retrieval_time,
                                               generation_time_ms = generation_time,
                                               )

# Metrics computed:
# - Answer Relevancy    (0-1)
# - Faithfulness        (0-1)
# - Context Utilization (0-1)
# - Context Relevancy   (0-1)
# - Overall Score       (weighted average)
```

---

## 📚 API Documentation

### **Core Endpoints**

#### **1. Health Check**

```bash
GET /api/health
```

**Response:**

```json
{
  "status": "healthy",
  "timestamp": "2024-11-27T03:00:00",
  "components": {
    "vector_store": true,
    "llm": true,
    "embeddings": true,
    "retrieval": true
  }
}
```

#### **2. Upload Documents**

```bash
POST /api/upload
Content-Type: multipart/form-data

files: [file1.pdf, file2.docx]
```

#### **3. Start Processing**

```bash
POST /api/start-processing
```

#### **4. Query (Chat)**

```bash
POST /api/chat
Content-Type: application/json

{
  "message": "What are the revenue figures?",
  "session_id": "session_123"
}
```

**Response includes RAGAS metrics:**

```json
{
  "session_id": "session_123",
  "response": "Revenue for Q3 was $45.2M [1]...",
  "sources": [...],
  "metrics": {
    "retrieval_time": 245,
    "generation_time": 3100,
    "total_time": 3350
  },
  "ragas_metrics": {
    "answer_relevancy": 0.89,
    "faithfulness": 0.94,
    "context_utilization": 0.87,
    "context_relevancy": 0.91,
    "overall_score": 0.90
  }
}
```

#### **5. RAGAS Endpoints**

```bash
# Get evaluation history
GET /api/ragas/history

# Get session statistics
GET /api/ragas/statistics

# Clear evaluation history
POST /api/ragas/clear

# Export evaluation data
GET /api/ragas/export

# Get RAGAS configuration
GET /api/ragas/config
```

---

## ⚙️ Configuration

### **config/settings.py**

**Key Configuration Sections:**

#### **LLM Settings**

```python
OLLAMA_MODEL        = "mistral:7b"
DEFAULT_TEMPERATURE = 0.1
MAX_TOKENS          = 1000
CONTEXT_WINDOW      = 8192
```

#### **RAGAS Settings**

```python
ENABLE_RAGAS              = True
RAGAS_ENABLE_GROUND_TRUTH = False
RAGAS_METRICS             = ["answer_relevancy",
                             "faithfulness",
                             "context_utilization",
                             "context_relevancy",
                             ]
RAGAS_EVALUATION_TIMEOUT  = 60
RAGAS_BATCH_SIZE          = 10
```

#### **Caching Settings**

```python
ENABLE_EMBEDDING_CACHE = True
CACHE_MAX_SIZE         = 1000  # LRU cache size
CACHE_TTL              = 3600  # Time to live in seconds
```

---

## 📊 RAGAS Evaluation

### **What is RAGAS?**

RAGAS (Retrieval-Augmented Generation Assessment) is a framework for evaluating RAG systems using automated metrics. Our implementation provides real-time quality assessment for every query-response pair.

### **Metrics Explained**

| Metric | Definition | Target | Interpretation |
|--------|-----------|--------|----------------|
| **Answer Relevancy** | How well the answer addresses the question | > 0.85 | Measures usefulness to the user |
| **Faithfulness** | Is the answer grounded in retrieved context? | > 0.90 | Prevents hallucinations |
| **Context Utilization** | How well the context is used in the answer | > 0.80 | Retrieval effectiveness |
| **Context Relevancy** | Are retrieved chunks relevant to the query? | > 0.85 | Search quality |
| **Overall Score** | Weighted average of all metrics | > 0.85 | System performance |

### **Using the Analytics Dashboard**

1. Navigate to the **Analytics & Quality** section
2. View the real-time RAGAS metrics table
3. Monitor session statistics (averages, trends)
4. Export evaluation data for offline analysis

### **Example Evaluation Output**

```
Query: "What were the Q3 revenue trends?"
Answer: "Q3 revenue increased 23% YoY to $45.2M..."

RAGAS Evaluation:
├─ Answer Relevancy: 0.89 ✓ (Good)
├─ Faithfulness: 0.94 ✓ (Excellent)
├─ Context Utilization: 0.87 ✓ (Good)
├─ Context Relevancy: 0.91 ✓ (Excellent)
└─ Overall Score: 0.90 ✓ (Excellent)

Performance:
├─ Retrieval Time: 245ms
├─ Generation Time: 3100ms
└─ Total Time: 3345ms
```

---

## 🔧 Troubleshooting

### **Common Issues**

#### **1. "RAGAS evaluation failed"**

**Cause:** OpenAI API key not configured

**Solution:**

```bash
# Add to .env file
OPENAI_API_KEY=your_openai_api_key_here

# Or disable RAGAS if not needed
ENABLE_RAGAS=False
```

#### **2. "Context assembly returning 0 chunks"**

**Cause:** Missing token counts in chunks

**Solution:** Already handled in `context_assembler.py`: token counts are calculated on the fly when missing.

#### **3. "Slow query responses"**

**Solutions:**

- Enable the embedding cache: `ENABLE_EMBEDDING_CACHE=True`
- Reduce the retrieval count: `TOP_K_RETRIEVE=5`
- Disable reranking: `ENABLE_RERANKING=False`
- Use a quantized model for faster inference

#### **4. "RAGAS metrics not appearing"**

**Symptoms:** Chat responses lack quality metrics

**Solution:**

```bash
# Verify RAGAS is enabled in config/settings.py (ENABLE_RAGAS = True)
# Check that the OpenAI API key is valid
# View logs for RAGAS evaluation errors
tail -f logs/app.log | grep "RAGAS"
```

---

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
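For reference, the weighted Reciprocal Rank Fusion step described in the Hybrid Retrieval section can be sketched in a few lines of pure Python. This is a simplified illustration of the technique (with the documented 60/40 vector/BM25 weights and a conventional RRF constant `k=60`), not the project's actual implementation:

```python
from collections import defaultdict

def reciprocal_rank_fusion(vector_ids, bm25_ids, weights=(0.6, 0.4), k=60):
    """Fuse two ranked lists of chunk IDs into a single ranking.

    Standard RRF scores each item 1 / (k + rank); here each source's
    contribution is scaled by its weight (0.6 vector, 0.4 BM25).
    """
    scores = defaultdict(float)
    for weight, ranking in zip(weights, (vector_ids, bm25_ids)):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += weight / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# "c2" ranks highly in both lists, so it fuses to the top
fused = reciprocal_rank_fusion(["c2", "c1", "c5"], ["c3", "c2", "c1"])
print(fused)  # ['c2', 'c1', 'c5', 'c3']
```

Rank-based fusion like this avoids having to normalize FAISS similarity scores against BM25 scores, which live on incompatible scales.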
---

## 🙏 Acknowledgments

**Open Source Technologies:**

- [FastAPI](https://fastapi.tiangolo.com/) - Modern web framework
- [Ollama](https://ollama.ai/) - Local LLM inference
- [FAISS](https://github.com/facebookresearch/faiss) - Vector similarity search
- [LlamaIndex](https://www.llamaindex.ai/) - Document chunking
- [Sentence Transformers](https://www.sbert.net/) - Embedding models
- [RAGAS](https://github.com/explodinggradients/ragas) - RAG evaluation

**Research Papers:**

- Karpukhin et al. (2020) - Dense Passage Retrieval
- Robertson & Zaragoza (2009) - The Probabilistic Relevance Framework: BM25
- Lewis et al. (2020) - Retrieval-Augmented Generation
- Es et al. (2023) - RAGAS: Automated Evaluation of RAG

---
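As a final worked example, the adaptive chunking thresholds from the Core Components section (50K and 500K tokens) reduce to a simple dispatch. The function name and returned dictionary keys are illustrative only, not the project's actual API:

```python
def select_chunking_strategy(token_count: int) -> dict:
    """Map a document's token count to the chunking strategy
    described in the README's Adaptive Chunking Logic diagram."""
    if token_count < 50_000:
        # Small documents: fixed-size chunks
        return {"strategy": "fixed", "chunk_size": 512, "overlap": 50}
    if token_count < 500_000:
        # Medium documents: section-aware semantic splitting
        return {"strategy": "semantic", "section_aware": True}
    # Very large documents: parent-child hierarchy
    return {"strategy": "hierarchical", "parent_size": 2048, "child_size": 512}

print(select_chunking_strategy(30_000)["strategy"])   # fixed
print(select_chunking_strategy(120_000)["strategy"])  # semantic
print(select_chunking_strategy(900_000)["strategy"])  # hierarchical
```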
**Built with ❤️ for the open-source community**