---
title: QuerySphere
emoji: 🧠
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: false
license: mit
app_port: 7860
---

QuerySphere: RAG platform for document Q&A with Knowledge Ingestion

Python 3.10+ · FastAPI · License: MIT

MVP-Grade RAG Platform with Multi-Format Document Ingestion, Hybrid Retrieval, and Zero API Costs


An MVP-grade Retrieval-Augmented Generation (RAG) system that enables organizations to unlock knowledge trapped across documents and archives while maintaining complete data privacy and eliminating costly API dependencies.



🎯 Overview

QuerySphere addresses a critical enterprise pain point: information silos that are estimated to cost organizations 20% of employee productivity. Unlike existing solutions (Humata AI, ChatPDF, NotebookLM) that charge $49-99/user/month and rely on costly cloud LLM APIs, this system offers:

Core Value Propositions

| Feature | Traditional Solutions | Our System |
|---|---|---|
| Privacy | Cloud-based (data leaves premises) | 100% on-premise processing |
| Cost | $49-99/user/month + API fees | Zero API costs (local inference) |
| Input Types | PDF only | PDF, DOCX, TXT, ZIP archives |
| Quality Metrics | Black box (no visibility) | RAGAS evaluation with detailed metrics |
| Retrieval | Vector-only | Hybrid (Vector + BM25 + Reranking) |
| Chunking | Fixed size | Adaptive (3 strategies) |

Market Context

  • $8.5B projected enterprise AI search market by 2027
  • 85% of enterprises actively adopting AI-powered knowledge management
  • Growing regulatory demands for on-premise, privacy-compliant solutions

✨ Key Features

1. Multi-Format Document Ingestion

  • Supported Formats: PDF, DOCX, TXT
  • Archive Processing: ZIP files up to 2GB with recursive extraction
  • Batch Upload: Process multiple documents simultaneously
  • OCR Support: Extract text from scanned documents and images (PaddleOCR or EasyOCR)
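The recursive archive extraction described above can be sketched with the standard library; a minimal illustration (the `SUPPORTED` set and helper name are ours, not the project's actual API):

```python
import zipfile
from pathlib import Path

SUPPORTED = {".pdf", ".docx", ".txt"}

def extract_archive(zip_path: Path, dest: Path) -> list[Path]:
    """Recursively extract a ZIP archive and collect supported documents."""
    collected = []
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    for path in sorted(dest.rglob("*")):        # snapshot before recursing
        if path.suffix.lower() == ".zip":       # nested archive: recurse
            collected += extract_archive(path, path.with_suffix(""))
        elif path.suffix.lower() in SUPPORTED:
            collected.append(path)
    return collected
```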

2. Intelligent Document Processing

  • Adaptive Chunking: Automatically selects optimal strategy based on document size
    • Fixed-size chunks (< 50K tokens): 512 tokens with 50-token overlap
    • Semantic chunks (50K-500K tokens): Section-aware splitting
    • Hierarchical chunks (> 500K tokens): Parent-child structure
  • Metadata Extraction: Title, author, date, page numbers, section headers

3. Hybrid Retrieval System

```mermaid
graph LR
    A[User Query] --> B[Query Embedding]
    A --> C[Keyword Analysis]
    B --> D[Vector Search<br/>FAISS]
    C --> E[BM25 Search]
    D --> F[Reciprocal Rank Fusion<br/>60% Vector + 40% BM25]
    E --> F
    F --> G[Cross-Encoder Reranking]
    G --> H[Top-K Results]
```
  • Vector Search: FAISS with BGE embeddings (384-dim)
  • Keyword Search: BM25 with optimized parameters (k1=1.5, b=0.75)
  • Fusion Methods: Weighted, Reciprocal Rank Fusion (RRF), CombSum
  • Reranking: Cross-encoder for precision boost
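As a quick illustration of the keyword side above, here is typical usage of the rank_bm25 package with the parameters listed (the toy corpus is ours):

```python
from rank_bm25 import BM25Okapi

corpus = [
    "revenue increased 23% year over year",
    "customer acquisition costs decreased 15%",
]
tokenized_corpus = [doc.split() for doc in corpus]

# k1 and b match the optimized parameters quoted above
bm25 = BM25Okapi(tokenized_corpus, k1=1.5, b=0.75)
scores = bm25.get_scores("revenue trends".split())   # one score per document
```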

4. Local LLM Generation

  • Ollama Integration: Zero-cost inference with Mistral-7B or LLaMA-2
  • Adaptive Temperature: Context-aware generation parameters
  • Citation Tracking: Automatic source attribution with validation
  • Streaming Support: Token-by-token response generation
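A minimal sketch of token-by-token streaming through the ollama Python client; the prompt is illustrative, and QuerySphere's own prompt builder additionally injects retrieved context and citation instructions:

```python
import ollama

# Stream tokens from a locally served Mistral-7B
stream = ollama.chat(
    model="mistral:7b",
    messages=[{"role": "user", "content": "Summarize the Q3 report."}],
    options={"temperature": 0.1},   # low temperature for factual answers
    stream=True,
)
for part in stream:
    print(part["message"]["content"], end="", flush=True)
```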

5. RAGAS Quality Assurance

  • Real-Time Evaluation: Answer relevancy, faithfulness, context precision/recall
  • Automatic Metrics: Computed for every query-response pair
  • Analytics Dashboard: Track quality trends over time
  • Export Capability: Download evaluation data for analysis
  • Session Statistics: Aggregate metrics across conversation sessions

πŸ—‚οΈ System Architecture

High-Level Architecture

```mermaid
graph TB
    subgraph "Frontend Layer"
        A[Web UI<br/>HTML/CSS/JS]
    end
    
    subgraph "API Layer"
        B[FastAPI Gateway<br/>REST Endpoints]
    end
    
    subgraph "Ingestion Pipeline"
        C[Document Parser<br/>PDF/DOCX/TXT]
        D[Adaptive Chunker<br/>3 Strategies]
        E[Embedding Generator<br/>BGE-small-en-v1.5]
    end
    
    subgraph "Storage Layer"
        F[FAISS Vector DB<br/>~10M vectors]
        G[BM25 Keyword Index]
        H[SQLite Metadata<br/>Documents & Chunks]
        I[LRU Cache<br/>Embeddings]
    end
    
    subgraph "Retrieval Engine"
        J[Hybrid Retriever<br/>Vector + BM25]
        K[Cross-Encoder<br/>Reranker]
        L[Context Assembler]
    end
    
    subgraph "Generation Engine"
        M[Ollama LLM<br/>Mistral-7B]
        N[Prompt Builder]
        O[Citation Formatter]
    end
    
    subgraph "Evaluation Engine"
        P[RAGAS Evaluator<br/>Quality Metrics]
    end
    
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    E --> G
    D --> H
    E --> I
    
    B --> J
    J --> F
    J --> G
    J --> K
    K --> L
    
    L --> N
    N --> M
    M --> O
    O --> A
    
    M --> P
    P --> H
    P --> A
```

Why This Architecture?

Modular Design

Each component is independent and replaceable:

  • Parser: Swap PDF libraries without affecting chunking
  • Embedder: Change from BGE to OpenAI embeddings with config update
  • LLM: Switch from Ollama to OpenAI API seamlessly

Separation of Concerns

Ingestion → Storage → Retrieval → Generation → Evaluation

Each stage has clear inputs/outputs and single responsibility.

Performance Optimization

  • Async Processing: Non-blocking I/O for uploads and LLM calls
  • Batch Operations: Embed 32 chunks simultaneously
  • Local Caching: LRU cache for query embeddings and frequent retrievals
  • Indexing: FAISS ANN for O(log n) search vs O(n) brute force
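As a brief sketch of the batching and indexing points above: we use a flat (exact) FAISS index for clarity, while the ANN variants (e.g. IndexHNSWFlat) expose the same add/search API; the chunk texts are placeholders:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")    # 384-dim BGE embeddings
chunks = ["chunk one ...", "chunk two ..."]              # placeholder chunk texts

# Batch-encode chunks 32 at a time; normalized vectors make
# inner product equivalent to cosine similarity
vectors = model.encode(chunks, batch_size=32, normalize_embeddings=True)

index = faiss.IndexFlatIP(vectors.shape[1])
index.add(np.asarray(vectors, dtype="float32"))

query = model.encode(["revenue trends"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), k=2)
```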

🔧 Technology Stack

Core Technologies

| Component | Technology | Version | Why This Choice |
|---|---|---|---|
| Backend | FastAPI | 0.104+ | Async support, auto-docs, production-grade |
| LLM | Ollama (Mistral-7B) | Latest | Zero API costs, on-premise, 20-30 tokens/sec |
| Embeddings | BGE-small-en-v1.5 | 384-dim | SOTA quality, 10x faster than alternatives |
| Vector DB | FAISS | Latest | Battle-tested, 10x faster than ChromaDB |
| Keyword Search | BM25 (rank_bm25) | Latest | Fast probabilistic ranking |
| Document Parsing | PyPDF2, python-docx | Latest | Industry standard, reliable |
| Chunking | LlamaIndex | 0.9+ | Advanced semantic splitting |
| Reranking | Cross-Encoder | Latest | +15% accuracy, minimal latency |
| Evaluation | RAGAS | 0.1.9 | Automated RAG quality metrics |
| Frontend | Alpine.js | 3.x | Lightweight reactivity, no build step |
| Database | SQLite | 3.x | Zero-config, sufficient for metadata |
| Caching | In-memory LRU | Python functools | Fast, no external dependencies |

Python Dependencies

```text
fastapi>=0.104.0
uvicorn>=0.24.0
ollama>=0.1.0
sentence-transformers>=2.2.2
faiss-cpu>=1.7.4
llama-index>=0.9.0
rank-bm25>=0.2.2
PyPDF2>=3.0.0
python-docx>=0.8.11
pydantic>=2.0.0
aiohttp>=3.9.0
tiktoken>=0.5.0
ragas==0.1.9
datasets==2.14.6
```

📦 Installation

Prerequisites

  • Python 3.10 or higher
  • 8GB RAM minimum (16GB recommended)
  • 10GB disk space for models and indexes
  • Ollama installed (https://ollama.ai)

Step 1: Clone Repository

```bash
git clone https://github.com/satyaki-mitra/docu-vault-ai.git
cd docu-vault-ai
```

Step 2: Create Virtual Environment

```bash
# Using conda (recommended)
conda create -n rag_env python=3.10
conda activate rag_env

# Or using venv
python -m venv rag_env
source rag_env/bin/activate  # On Windows: rag_env\Scripts\activate
```

Step 3: Install Dependencies

```bash
pip install -r requirements.txt
```

Step 4: Install Ollama and Model

```bash
# Install Ollama (macOS)
brew install ollama

# Install Ollama (Linux)
curl https://ollama.ai/install.sh | sh

# Pull Mistral model
ollama pull mistral:7b

# Verify installation
ollama list
```

Step 5: Configure Environment

```bash
# Copy example config
cp .env.example .env

# Edit configuration (optional)
nano .env
```

Key Configuration Options:

```bash
# LLM Settings
OLLAMA_MODEL=mistral:7b
DEFAULT_TEMPERATURE=0.1
CONTEXT_WINDOW=8192

# Retrieval Settings
VECTOR_WEIGHT=0.6
BM25_WEIGHT=0.4
ENABLE_RERANKING=True
TOP_K_RETRIEVE=10

# RAGAS Evaluation
ENABLE_RAGAS=True
RAGAS_ENABLE_GROUND_TRUTH=False
OPENAI_API_KEY=your_openai_api_key_here  # Required for RAGAS

# Performance
EMBEDDING_BATCH_SIZE=32
MAX_WORKERS=4
```

🚀 Quick Start

1. Start Ollama Server

```bash
# Terminal 1: Start Ollama
ollama serve
```

2. Launch Application

```bash
# Terminal 2: Start RAG system
python app.py
```

Output:

```text
INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000
```

3. Access Web Interface

Open browser to: http://localhost:8000

4. Upload Documents

  1. Click "Upload Documents"
  2. Select PDF/DOCX/TXT files (or ZIP archives)
  3. Click "Start Building"
  4. Wait for indexing to complete (progress bar shows status)

5. Query Your Documents

Query: "What are the key findings in the Q3 report?"

Response: The Q3 report highlights three key findings: 
[1] Revenue increased 23% year-over-year to $45.2M, 
[2] Customer acquisition costs decreased 15%, and 
[3] Net retention rate reached 118% [1].

Sources:
[1] Q3_Financial_Report.pdf (Page 3, Executive Summary)

RAGAS Metrics:
- Answer Relevancy: 0.89
- Faithfulness: 0.94
- Context Utilization: 0.87
- Overall Score: 0.90

🧩 Core Components

1. Document Ingestion Pipeline

```text
# High-level flow
Document Upload → Parse → Clean → Chunk → Embed → Index
```

Adaptive Chunking Logic:

```mermaid
graph TD
    A[Calculate Token Count] --> B{Tokens < 50K?}
    B -->|Yes| C[Fixed Chunking<br/>512 tokens, 50 overlap]
    B -->|No| D{Tokens < 500K?}
    D -->|Yes| E[Semantic Chunking<br/>Section-aware]
    D -->|No| F[Hierarchical Chunking<br/>Parent 2048, Child 512]
```
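A minimal sketch of this strategy selection, assuming tiktoken for token counting (the helper name and encoding choice are ours):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def choose_chunking_strategy(text: str) -> str:
    """Pick a chunking strategy from the document's total token count."""
    tokens = len(enc.encode(text))
    if tokens < 50_000:
        return "fixed"          # 512-token chunks with 50-token overlap
    if tokens < 500_000:
        return "semantic"       # section-aware splitting
    return "hierarchical"       # 2048-token parents, 512-token children
```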

2. Hybrid Retrieval Engine

Retrieval Flow:

```python
# Pseudocode
def hybrid_retrieve(query: str, top_k: int = 10):
    # Dual retrieval
    query_embedding = embedder.embed(query)
    vector_results  = faiss_index.search(query_embedding, top_k * 2)
    bm25_results    = bm25_index.search(query, top_k * 2)

    # Fusion (RRF)
    fused_results   = reciprocal_rank_fusion(vector_results,
                                             bm25_results,
                                             weights = (0.6, 0.4))

    # Reranking
    reranked        = cross_encoder.rerank(query, fused_results, top_k)

    return reranked
```
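For reference, a compact weighted Reciprocal Rank Fusion matching the 60/40 split above; the smoothing constant k=60 is the common default from the RRF literature, an assumption here:

```python
from collections import defaultdict

def reciprocal_rank_fusion(vector_results, bm25_results,
                           weights=(0.6, 0.4), k=60):
    """Fuse two ranked lists of chunk IDs into one weighted RRF ranking."""
    scores = defaultdict(float)
    for weight, ranking in zip(weights, (vector_results, bm25_results)):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```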

3. Response Generation

Temperature Control:

```mermaid
graph LR
    A[Query Type] --> B{Factual?}
    B -->|Yes| C[Low Temp<br/>0.1-0.2]
    B -->|No| D[Context Quality]
    D -->|High| E[Medium Temp<br/>0.3-0.5]
    D -->|Low| F[High Temp<br/>0.6-0.8]
```
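The same policy as straight-line code; the 0.7 quality threshold is illustrative, not the project's exact cutoff:

```python
def pick_temperature(is_factual: bool, context_quality: float) -> float:
    """Map query type and retrieval quality to a sampling temperature."""
    if is_factual:
        return 0.1                                  # deterministic, citation-friendly
    return 0.4 if context_quality >= 0.7 else 0.7   # looser when context is thin
```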

4. RAGAS Evaluation Module

Automatic Quality Assessment:

```python
# After each query-response
ragas_result = ragas_evaluator.evaluate_single(query              = user_query,
                                               answer             = generated_answer,
                                               contexts           = retrieved_chunks,
                                               retrieval_time_ms  = retrieval_time,
                                               generation_time_ms = generation_time,
                                              )

# Metrics computed:
# - Answer Relevancy (0-1)
# - Faithfulness (0-1)
# - Context Utilization (0-1)
# - Context Relevancy (0-1)
# - Overall Score (weighted average)
```
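For comparison, a one-off evaluation through the ragas 0.1.x library directly (this bypasses QuerySphere's wrapper; the sample rows are ours, and RAGAS needs an OpenAI key, as covered in Troubleshooting):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

data = Dataset.from_dict({
    "question": ["What were the Q3 revenue trends?"],
    "answer":   ["Q3 revenue increased 23% YoY to $45.2M."],
    "contexts": [["Revenue increased 23% year-over-year to $45.2M."]],
})

result = evaluate(data, metrics=[answer_relevancy, faithfulness])
print(result)   # e.g. {'answer_relevancy': 0.89, 'faithfulness': 0.94}
```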

📚 API Documentation

Core Endpoints

1. Health Check

```http
GET /api/health
```

Response:

```json
{
  "status": "healthy",
  "timestamp": "2024-11-27T03:00:00",
  "components": {
    "vector_store": true,
    "llm": true,
    "embeddings": true,
    "retrieval": true
  }
}
```

2. Upload Documents

```http
POST /api/upload
Content-Type: multipart/form-data

files: [file1.pdf, file2.docx]
```

3. Start Processing

```http
POST /api/start-processing
```

4. Query (Chat)

```http
POST /api/chat
Content-Type: application/json

{
  "message": "What are the revenue figures?",
  "session_id": "session_123"
}
```

Response includes RAGAS metrics:

```json
{
  "session_id": "session_123",
  "response": "Revenue for Q3 was $45.2M [1]...",
  "sources": [...],
  "metrics": {
    "retrieval_time": 245,
    "generation_time": 3100,
    "total_time": 3350
  },
  "ragas_metrics": {
    "answer_relevancy": 0.89,
    "faithfulness": 0.94,
    "context_utilization": 0.87,
    "context_relevancy": 0.91,
    "overall_score": 0.90
  }
}
```
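A small client-side sketch of the chat endpoint using aiohttp (already in the dependency list); the URL assumes the default local port:

```python
import asyncio
import aiohttp

async def ask(question: str, session_id: str = "session_123") -> dict:
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "http://localhost:8000/api/chat",
            json={"message": question, "session_id": session_id},
        ) as resp:
            resp.raise_for_status()
            return await resp.json()

reply = asyncio.run(ask("What are the revenue figures?"))
print(reply["response"])
print("Overall RAGAS score:", reply["ragas_metrics"]["overall_score"])
```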

5. RAGAS Endpoints

```text
# Get evaluation history
GET /api/ragas/history

# Get session statistics
GET /api/ragas/statistics

# Clear evaluation history
POST /api/ragas/clear

# Export evaluation data
GET /api/ragas/export

# Get RAGAS configuration
GET /api/ragas/config
```

βš™οΈ Configuration

config/settings.py

Key Configuration Sections:

LLM Settings

```python
OLLAMA_MODEL        = "mistral:7b"
DEFAULT_TEMPERATURE = 0.1
MAX_TOKENS          = 1000
CONTEXT_WINDOW      = 8192
```

RAGAS Settings

```python
ENABLE_RAGAS              = True
RAGAS_ENABLE_GROUND_TRUTH = False
RAGAS_METRICS             = ["answer_relevancy",
                             "faithfulness",
                             "context_utilization",
                             "context_relevancy"
                            ]
RAGAS_EVALUATION_TIMEOUT  = 60
RAGAS_BATCH_SIZE          = 10
```

Caching Settings

```python
ENABLE_EMBEDDING_CACHE = True
CACHE_MAX_SIZE         = 1000  # LRU cache size
CACHE_TTL              = 3600  # Time to live in seconds
```
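A minimal sketch of the embedding cache with functools; the embedder handle is a hypothetical placeholder, and note that functools.lru_cache has no built-in TTL, so honoring CACHE_TTL would need a small custom wrapper:

```python
from functools import lru_cache

@lru_cache(maxsize=1000)                    # mirrors CACHE_MAX_SIZE
def cached_query_embedding(query: str) -> tuple[float, ...]:
    # lru_cache stores return values, so return an immutable tuple
    return tuple(embedder.embed(query))     # embedder: hypothetical handle
```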

📊 RAGAS Evaluation

What is RAGAS?

RAGAS (Retrieval-Augmented Generation Assessment) is a framework for evaluating RAG systems using automated metrics. Our implementation provides real-time quality assessment for every query-response pair.

Metrics Explained

| Metric | Definition | Target | Interpretation |
|---|---|---|---|
| Answer Relevancy | How well the answer addresses the question | > 0.85 | Measures usefulness to the user |
| Faithfulness | Is the answer grounded in retrieved context? | > 0.90 | Prevents hallucinations |
| Context Utilization | How well the context is used in the answer | > 0.80 | Retrieval effectiveness |
| Context Relevancy | Are retrieved chunks relevant to the query? | > 0.85 | Search quality |
| Overall Score | Weighted average of all metrics | > 0.85 | System performance |

Using the Analytics Dashboard

  1. Navigate to Analytics & Quality section
  2. View real-time RAGAS metrics table
  3. Monitor session statistics (averages, trends)
  4. Export evaluation data for offline analysis

Example Evaluation Output

Query: "What were the Q3 revenue trends?"
Answer: "Q3 revenue increased 23% YoY to $45.2M..."

RAGAS Evaluation:
β”œβ”€ Answer Relevancy: 0.89 βœ“ (Good)
β”œβ”€ Faithfulness: 0.94 βœ“ (Excellent)
β”œβ”€ Context Utilization: 0.87 βœ“ (Good)
β”œβ”€ Context Relevancy: 0.91 βœ“ (Excellent)
└─ Overall Score: 0.90 βœ“ (Excellent)

Performance:
β”œβ”€ Retrieval Time: 245ms
β”œβ”€ Generation Time: 3100ms
└─ Total Time: 3345ms

🔧 Troubleshooting

Common Issues

1. "RAGAS evaluation failed"

Cause: OpenAI API key not configured

Solution:

```bash
# Add to .env file
OPENAI_API_KEY=your_openai_api_key_here

# Or disable RAGAS if not needed
ENABLE_RAGAS=False
```

2. "Context assembly returning 0 chunks"

Cause: Missing token counts in chunks

Solution: This is handled in context_assembler.py; token counts are calculated on the fly when missing.

3. "Slow query responses"

Solutions:

  • Enable embedding cache: ENABLE_EMBEDDING_CACHE=True
  • Reduce retrieval count: TOP_K_RETRIEVE=5
  • Disable reranking: ENABLE_RERANKING=False
  • Use a quantized model for faster inference

4. "RAGAS metrics not appearing"

Symptoms: Chat responses lack quality metrics

Solution:

```bash
# Verify RAGAS is enabled in settings
ENABLE_RAGAS=True

# Check the OpenAI API key is valid
# View logs for RAGAS evaluation errors
tail -f logs/app.log | grep "RAGAS"
```

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments

Open Source Technologies:

  • FastAPI, Ollama (Mistral-7B), FAISS, BGE embeddings (sentence-transformers), LlamaIndex, rank_bm25, RAGAS, Alpine.js, SQLite

Research Papers:

  • Karpukhin et al. (2020) - Dense Passage Retrieval for Open-Domain Question Answering
  • Robertson & Zaragoza (2009) - The Probabilistic Relevance Framework: BM25 and Beyond
  • Lewis et al. (2020) - Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
  • Es et al. (2023) - RAGAS: Automated Evaluation of Retrieval Augmented Generation

Built with ❤️ for the open-source community