---
title: QuerySphere
emoji: 🧠
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: false
license: mit
app_port: 7860
---

QuerySphere: RAG platform for document Q&A with Knowledge Ingestion

Python 3.10+ · FastAPI · License: MIT

MVP-Grade RAG Platform with Multi-Format Document Ingestion, Hybrid Retrieval, and Zero API Costs


An MVP-grade Retrieval-Augmented Generation (RAG) system that enables organizations to unlock knowledge trapped across documents and archives while maintaining complete data privacy and eliminating costly API dependencies.



🎯 Overview

QuerySphere addresses a critical enterprise pain point: information silos that are estimated to cost organizations 20% of employee productivity. Unlike existing solutions (Humata AI, ChatPDF, NotebookLM) that charge $49-99/user/month and rely on costly cloud LLM APIs, this system offers:

Core Value Propositions

| Feature | Traditional Solutions | Our System |
|---|---|---|
| Privacy | Cloud-based (data leaves premises) | 100% on-premise processing |
| Cost | $49-99/user/month + API fees | Zero API costs (local inference) |
| Input Types | PDF only | PDF, DOCX, TXT, ZIP archives |
| Quality Metrics | Black box (no visibility) | RAGAS evaluation with detailed metrics |
| Retrieval | Vector-only | Hybrid (Vector + BM25 + Reranking) |
| Chunking | Fixed size | Adaptive (3 strategies) |

Market Context

  • $8.5B projected enterprise AI search market by 2027
  • 85% of enterprises actively adopting AI-powered knowledge management
  • Growing regulatory demands for on-premise, privacy-compliant solutions

✨ Key Features

1. Multi-Format Document Ingestion

  • Supported Formats: PDF, DOCX, TXT
  • Archive Processing: ZIP files up to 2GB with recursive extraction
  • Batch Upload: Process multiple documents simultaneously
  • OCR Support: Extract text from scanned documents and images (PaddleOCR or EasyOCR)
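The recursive archive extraction described above can be sketched with the standard library; a minimal illustration (the `SUPPORTED` set and helper name are ours, not the project's actual API):

```python
import zipfile
from pathlib import Path

SUPPORTED = {".pdf", ".docx", ".txt"}

def extract_archive(zip_path: Path, dest: Path) -> list[Path]:
    """Recursively extract a ZIP archive and collect supported documents."""
    collected = []
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    for path in sorted(dest.rglob("*")):        # snapshot before recursing
        if path.suffix.lower() == ".zip":       # nested archive: recurse
            collected += extract_archive(path, path.with_suffix(""))
        elif path.suffix.lower() in SUPPORTED:
            collected.append(path)
    return collected
```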

2. Intelligent Document Processing

  • Adaptive Chunking: Automatically selects optimal strategy based on document size
    • Fixed-size chunks (< 50K tokens): 512 tokens with 50-token overlap
    • Semantic chunks (50K-500K tokens): Section-aware splitting
    • Hierarchical chunks (> 500K tokens): Parent-child structure
  • Metadata Extraction: Title, author, date, page numbers, section headers

3. Hybrid Retrieval System

```mermaid
graph LR
    A[User Query] --> B[Query Embedding]
    A --> C[Keyword Analysis]
    B --> D[Vector Search<br/>FAISS]
    C --> E[BM25 Search]
    D --> F[Reciprocal Rank Fusion<br/>60% Vector + 40% BM25]
    E --> F
    F --> G[Cross-Encoder Reranking]
    G --> H[Top-K Results]
```
  • Vector Search: FAISS with BGE embeddings (384-dim)
  • Keyword Search: BM25 with optimized parameters (k1=1.5, b=0.75)
  • Fusion Methods: Weighted, Reciprocal Rank Fusion (RRF), CombSum
  • Reranking: Cross-encoder for precision boost
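As a quick illustration of the keyword side above, here is typical usage of the rank_bm25 package with the parameters listed (the toy corpus is ours):

```python
from rank_bm25 import BM25Okapi

corpus = [
    "revenue increased 23% year over year",
    "customer acquisition costs decreased 15%",
]
tokenized_corpus = [doc.split() for doc in corpus]

# k1 and b match the optimized parameters quoted above
bm25 = BM25Okapi(tokenized_corpus, k1=1.5, b=0.75)
scores = bm25.get_scores("revenue trends".split())   # one score per document
```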

4. Local LLM Generation

  • Ollama Integration: Zero-cost inference with Mistral-7B or LLaMA-2
  • Adaptive Temperature: Context-aware generation parameters
  • Citation Tracking: Automatic source attribution with validation
  • Streaming Support: Token-by-token response generation
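A minimal sketch of token-by-token streaming through the ollama Python client; the prompt is illustrative, and QuerySphere's own prompt builder additionally injects retrieved context and citation instructions:

```python
import ollama

# Stream tokens from a locally served Mistral-7B
stream = ollama.chat(
    model="mistral:7b",
    messages=[{"role": "user", "content": "Summarize the Q3 report."}],
    options={"temperature": 0.1},   # low temperature for factual answers
    stream=True,
)
for part in stream:
    print(part["message"]["content"], end="", flush=True)
```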

5. RAGAS Quality Assurance

  • Real-Time Evaluation: Answer relevancy, faithfulness, context precision/recall
  • Automatic Metrics: Computed for every query-response pair
  • Analytics Dashboard: Track quality trends over time
  • Export Capability: Download evaluation data for analysis
  • Session Statistics: Aggregate metrics across conversation sessions

πŸ—‚οΈ System Architecture

High-Level Architecture

```mermaid
graph TB
    subgraph "Frontend Layer"
        A[Web UI<br/>HTML/CSS/JS]
    end
    
    subgraph "API Layer"
        B[FastAPI Gateway<br/>REST Endpoints]
    end
    
    subgraph "Ingestion Pipeline"
        C[Document Parser<br/>PDF/DOCX/TXT]
        D[Adaptive Chunker<br/>3 Strategies]
        E[Embedding Generator<br/>BGE-small-en-v1.5]
    end
    
    subgraph "Storage Layer"
        F[FAISS Vector DB<br/>~10M vectors]
        G[BM25 Keyword Index]
        H[SQLite Metadata<br/>Documents & Chunks]
        I[LRU Cache<br/>Embeddings]
    end
    
    subgraph "Retrieval Engine"
        J[Hybrid Retriever<br/>Vector + BM25]
        K[Cross-Encoder<br/>Reranker]
        L[Context Assembler]
    end
    
    subgraph "Generation Engine"
        M[Ollama LLM<br/>Mistral-7B]
        N[Prompt Builder]
        O[Citation Formatter]
    end
    
    subgraph "Evaluation Engine"
        P[RAGAS Evaluator<br/>Quality Metrics]
    end
    
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    E --> G
    D --> H
    E --> I
    
    B --> J
    J --> F
    J --> G
    J --> K
    K --> L
    
    L --> N
    N --> M
    M --> O
    O --> A
    
    M --> P
    P --> H
    P --> A
```

Why This Architecture?

Modular Design

Each component is independent and replaceable:

  • Parser: Swap PDF libraries without affecting chunking
  • Embedder: Change from BGE to OpenAI embeddings with config update
  • LLM: Switch from Ollama to OpenAI API seamlessly

Separation of Concerns

Ingestion → Storage → Retrieval → Generation → Evaluation

Each stage has clear inputs/outputs and single responsibility.

Performance Optimization

  • Async Processing: Non-blocking I/O for uploads and LLM calls
  • Batch Operations: Embed 32 chunks simultaneously
  • Local Caching: LRU cache for query embeddings and frequent retrievals
  • Indexing: FAISS ANN for O(log n) search vs O(n) brute force
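As a brief sketch of the batching and indexing points above: we use a flat (exact) FAISS index for clarity, while the ANN variants (e.g. IndexHNSWFlat) expose the same add/search API; the chunk texts are placeholders:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")    # 384-dim BGE embeddings
chunks = ["chunk one ...", "chunk two ..."]              # placeholder chunk texts

# Batch-encode chunks 32 at a time; normalized vectors make
# inner product equivalent to cosine similarity
vectors = model.encode(chunks, batch_size=32, normalize_embeddings=True)

index = faiss.IndexFlatIP(vectors.shape[1])
index.add(np.asarray(vectors, dtype="float32"))

query = model.encode(["revenue trends"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), k=2)
```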

🔧 Technology Stack

Core Technologies

| Component | Technology | Version | Why This Choice |
|---|---|---|---|
| Backend | FastAPI | 0.104+ | Async support, auto-docs, production-grade |
| LLM | Ollama (Mistral-7B) | Latest | Zero API costs, on-premise, 20-30 tokens/sec |
| Embeddings | BGE-small-en-v1.5 | 384-dim | SOTA quality, 10x faster than alternatives |
| Vector DB | FAISS | Latest | Battle-tested, 10x faster than ChromaDB |
| Keyword Search | BM25 (rank_bm25) | Latest | Fast probabilistic ranking |
| Document Parsing | PyPDF2, python-docx | Latest | Industry standard, reliable |
| Chunking | LlamaIndex | 0.9+ | Advanced semantic splitting |
| Reranking | Cross-Encoder | Latest | +15% accuracy, minimal latency |
| Evaluation | RAGAS | 0.1.9 | Automated RAG quality metrics |
| Frontend | Alpine.js | 3.x | Lightweight reactivity, no build step |
| Database | SQLite | 3.x | Zero-config, sufficient for metadata |
| Caching | In-memory LRU | Python functools | Fast, no external dependencies |

Python Dependencies

```text
fastapi>=0.104.0
uvicorn>=0.24.0
ollama>=0.1.0
sentence-transformers>=2.2.2
faiss-cpu>=1.7.4
llama-index>=0.9.0
rank-bm25>=0.2.2
PyPDF2>=3.0.0
python-docx>=0.8.11
pydantic>=2.0.0
aiohttp>=3.9.0
tiktoken>=0.5.0
ragas==0.1.9
datasets==2.14.6
```

📦 Installation

Prerequisites

  • Python 3.10 or higher
  • 8GB RAM minimum (16GB recommended)
  • 10GB disk space for models and indexes
  • Ollama installed (https://ollama.ai)

Step 1: Clone Repository

```bash
git clone https://github.com/satyaki-mitra/docu-vault-ai.git
cd docu-vault-ai
```

Step 2: Create Virtual Environment

```bash
# Using conda (recommended)
conda create -n rag_env python=3.10
conda activate rag_env

# Or using venv
python -m venv rag_env
source rag_env/bin/activate  # On Windows: rag_env\Scripts\activate
```

Step 3: Install Dependencies

```bash
pip install -r requirements.txt
```

Step 4: Install Ollama and Model

```bash
# Install Ollama (macOS)
brew install ollama

# Install Ollama (Linux)
curl https://ollama.ai/install.sh | sh

# Pull Mistral model
ollama pull mistral:7b

# Verify installation
ollama list
```

Step 5: Configure Environment

```bash
# Copy example config
cp .env.example .env

# Edit configuration (optional)
nano .env
```

Key Configuration Options:

```bash
# LLM Settings
OLLAMA_MODEL=mistral:7b
DEFAULT_TEMPERATURE=0.1
CONTEXT_WINDOW=8192

# Retrieval Settings
VECTOR_WEIGHT=0.6
BM25_WEIGHT=0.4
ENABLE_RERANKING=True
TOP_K_RETRIEVE=10

# RAGAS Evaluation
ENABLE_RAGAS=True
RAGAS_ENABLE_GROUND_TRUTH=False
OPENAI_API_KEY=your_openai_api_key_here  # Required for RAGAS

# Performance
EMBEDDING_BATCH_SIZE=32
MAX_WORKERS=4
```

🚀 Quick Start

1. Start Ollama Server

```bash
# Terminal 1: Start Ollama
ollama serve
```

2. Launch Application

```bash
# Terminal 2: Start RAG system
python app.py
```

Output:

```text
INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000
```

3. Access Web Interface

Open browser to: http://localhost:8000

4. Upload Documents

  1. Click "Upload Documents"
  2. Select PDF/DOCX/TXT files (or ZIP archives)
  3. Click "Start Building"
  4. Wait for indexing to complete (progress bar shows status)

5. Query Your Documents

Query: "What are the key findings in the Q3 report?"

Response: The Q3 report highlights three key findings: 
[1] Revenue increased 23% year-over-year to $45.2M, 
[2] Customer acquisition costs decreased 15%, and 
[3] Net retention rate reached 118% [1].

Sources:
[1] Q3_Financial_Report.pdf (Page 3, Executive Summary)

RAGAS Metrics:
- Answer Relevancy: 0.89
- Faithfulness: 0.94
- Context Utilization: 0.87
- Overall Score: 0.90

🧩 Core Components

1. Document Ingestion Pipeline

```text
# High-level flow
Document Upload → Parse → Clean → Chunk → Embed → Index
```

Adaptive Chunking Logic:

```mermaid
graph TD
    A[Calculate Token Count] --> B{Tokens < 50K?}
    B -->|Yes| C[Fixed Chunking<br/>512 tokens, 50 overlap]
    B -->|No| D{Tokens < 500K?}
    D -->|Yes| E[Semantic Chunking<br/>Section-aware]
    D -->|No| F[Hierarchical Chunking<br/>Parent 2048, Child 512]
```
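A minimal sketch of this strategy selection, assuming tiktoken for token counting (the helper name and encoding choice are ours):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def choose_chunking_strategy(text: str) -> str:
    """Pick a chunking strategy from the document's total token count."""
    tokens = len(enc.encode(text))
    if tokens < 50_000:
        return "fixed"          # 512-token chunks with 50-token overlap
    if tokens < 500_000:
        return "semantic"       # section-aware splitting
    return "hierarchical"       # 2048-token parents, 512-token children
```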

2. Hybrid Retrieval Engine

Retrieval Flow:

```python
# Pseudocode
def hybrid_retrieve(query: str, top_k: int = 10):
    # Dual retrieval
    query_embedding = embedder.embed(query)
    vector_results  = faiss_index.search(query_embedding, top_k * 2)
    bm25_results    = bm25_index.search(query, top_k * 2)

    # Fusion (RRF)
    fused_results   = reciprocal_rank_fusion(vector_results,
                                             bm25_results,
                                             weights = (0.6, 0.4))

    # Reranking
    reranked        = cross_encoder.rerank(query, fused_results, top_k)

    return reranked
```
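For reference, a compact weighted Reciprocal Rank Fusion matching the 60/40 split above; the smoothing constant k=60 is the common default from the RRF literature, an assumption here:

```python
from collections import defaultdict

def reciprocal_rank_fusion(vector_results, bm25_results,
                           weights=(0.6, 0.4), k=60):
    """Fuse two ranked lists of chunk IDs into one weighted RRF ranking."""
    scores = defaultdict(float)
    for weight, ranking in zip(weights, (vector_results, bm25_results)):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```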

3. Response Generation

Temperature Control:

```mermaid
graph LR
    A[Query Type] --> B{Factual?}
    B -->|Yes| C[Low Temp<br/>0.1-0.2]
    B -->|No| D[Context Quality]
    D -->|High| E[Medium Temp<br/>0.3-0.5]
    D -->|Low| F[High Temp<br/>0.6-0.8]
```
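The same policy as straight-line code; the 0.7 quality threshold is illustrative, not the project's exact cutoff:

```python
def pick_temperature(is_factual: bool, context_quality: float) -> float:
    """Map query type and retrieval quality to a sampling temperature."""
    if is_factual:
        return 0.1                                  # deterministic, citation-friendly
    return 0.4 if context_quality >= 0.7 else 0.7   # looser when context is thin
```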

4. RAGAS Evaluation Module

Automatic Quality Assessment:

```python
# After each query-response
ragas_result = ragas_evaluator.evaluate_single(query              = user_query,
                                               answer             = generated_answer,
                                               contexts           = retrieved_chunks,
                                               retrieval_time_ms  = retrieval_time,
                                               generation_time_ms = generation_time,
                                              )

# Metrics computed:
# - Answer Relevancy (0-1)
# - Faithfulness (0-1)
# - Context Utilization (0-1)
# - Context Relevancy (0-1)
# - Overall Score (weighted average)
```
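For comparison, a one-off evaluation through the ragas 0.1.x library directly (this bypasses QuerySphere's wrapper; the sample rows are ours, and RAGAS needs an OpenAI key, as covered in Troubleshooting):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

data = Dataset.from_dict({
    "question": ["What were the Q3 revenue trends?"],
    "answer":   ["Q3 revenue increased 23% YoY to $45.2M."],
    "contexts": [["Revenue increased 23% year-over-year to $45.2M."]],
})

result = evaluate(data, metrics=[answer_relevancy, faithfulness])
print(result)   # e.g. {'answer_relevancy': 0.89, 'faithfulness': 0.94}
```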

📚 API Documentation

Core Endpoints

1. Health Check

```http
GET /api/health
```

Response:

```json
{
  "status": "healthy",
  "timestamp": "2024-11-27T03:00:00",
  "components": {
    "vector_store": true,
    "llm": true,
    "embeddings": true,
    "retrieval": true
  }
}
```

2. Upload Documents

```http
POST /api/upload
Content-Type: multipart/form-data

files: [file1.pdf, file2.docx]
```

3. Start Processing

```http
POST /api/start-processing
```

4. Query (Chat)

```http
POST /api/chat
Content-Type: application/json

{
  "message": "What are the revenue figures?",
  "session_id": "session_123"
}
```

Response includes RAGAS metrics:

```json
{
  "session_id": "session_123",
  "response": "Revenue for Q3 was $45.2M [1]...",
  "sources": [...],
  "metrics": {
    "retrieval_time": 245,
    "generation_time": 3100,
    "total_time": 3350
  },
  "ragas_metrics": {
    "answer_relevancy": 0.89,
    "faithfulness": 0.94,
    "context_utilization": 0.87,
    "context_relevancy": 0.91,
    "overall_score": 0.90
  }
}
```
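A small client-side sketch of the chat endpoint using aiohttp (already in the dependency list); the URL assumes the default local port:

```python
import asyncio
import aiohttp

async def ask(question: str, session_id: str = "session_123") -> dict:
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "http://localhost:8000/api/chat",
            json={"message": question, "session_id": session_id},
        ) as resp:
            resp.raise_for_status()
            return await resp.json()

reply = asyncio.run(ask("What are the revenue figures?"))
print(reply["response"])
print("Overall RAGAS score:", reply["ragas_metrics"]["overall_score"])
```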

5. RAGAS Endpoints

```text
# Get evaluation history
GET /api/ragas/history

# Get session statistics
GET /api/ragas/statistics

# Clear evaluation history
POST /api/ragas/clear

# Export evaluation data
GET /api/ragas/export

# Get RAGAS configuration
GET /api/ragas/config
```

βš™οΈ Configuration

config/settings.py

Key Configuration Sections:

LLM Settings

```python
OLLAMA_MODEL        = "mistral:7b"
DEFAULT_TEMPERATURE = 0.1
MAX_TOKENS          = 1000
CONTEXT_WINDOW      = 8192
```

RAGAS Settings

```python
ENABLE_RAGAS              = True
RAGAS_ENABLE_GROUND_TRUTH = False
RAGAS_METRICS             = ["answer_relevancy",
                             "faithfulness",
                             "context_utilization",
                             "context_relevancy"
                            ]
RAGAS_EVALUATION_TIMEOUT  = 60
RAGAS_BATCH_SIZE          = 10
```

Caching Settings

```python
ENABLE_EMBEDDING_CACHE = True
CACHE_MAX_SIZE         = 1000  # LRU cache size
CACHE_TTL              = 3600  # Time to live in seconds
```
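A minimal sketch of the embedding cache with functools; the embedder handle is a hypothetical placeholder, and note that functools.lru_cache has no built-in TTL, so honoring CACHE_TTL would need a small custom wrapper:

```python
from functools import lru_cache

@lru_cache(maxsize=1000)                    # mirrors CACHE_MAX_SIZE
def cached_query_embedding(query: str) -> tuple[float, ...]:
    # lru_cache stores return values, so return an immutable tuple
    return tuple(embedder.embed(query))     # embedder: hypothetical handle
```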

📊 RAGAS Evaluation

What is RAGAS?

RAGAS (Retrieval-Augmented Generation Assessment) is a framework for evaluating RAG systems using automated metrics. Our implementation provides real-time quality assessment for every query-response pair.

Metrics Explained

| Metric | Definition | Target | Interpretation |
|---|---|---|---|
| Answer Relevancy | How well the answer addresses the question | > 0.85 | Measures usefulness to the user |
| Faithfulness | Is the answer grounded in retrieved context? | > 0.90 | Prevents hallucinations |
| Context Utilization | How well the context is used in the answer | > 0.80 | Retrieval effectiveness |
| Context Relevancy | Are retrieved chunks relevant to the query? | > 0.85 | Search quality |
| Overall Score | Weighted average of all metrics | > 0.85 | System performance |

Using the Analytics Dashboard

  1. Navigate to Analytics & Quality section
  2. View real-time RAGAS metrics table
  3. Monitor session statistics (averages, trends)
  4. Export evaluation data for offline analysis

Example Evaluation Output

Query: "What were the Q3 revenue trends?"
Answer: "Q3 revenue increased 23% YoY to $45.2M..."

RAGAS Evaluation:
β”œβ”€ Answer Relevancy: 0.89 βœ“ (Good)
β”œβ”€ Faithfulness: 0.94 βœ“ (Excellent)
β”œβ”€ Context Utilization: 0.87 βœ“ (Good)
β”œβ”€ Context Relevancy: 0.91 βœ“ (Excellent)
└─ Overall Score: 0.90 βœ“ (Excellent)

Performance:
β”œβ”€ Retrieval Time: 245ms
β”œβ”€ Generation Time: 3100ms
└─ Total Time: 3345ms

🔧 Troubleshooting

Common Issues

1. "RAGAS evaluation failed"

Cause: OpenAI API key not configured

Solution:

```bash
# Add to .env file
OPENAI_API_KEY=your_openai_api_key_here

# Or disable RAGAS if not needed
ENABLE_RAGAS=False
```

2. "Context assembly returning 0 chunks"

Cause: Missing token counts in chunks

Solution: This is handled in context_assembler.py; token counts are calculated on the fly when missing.

3. "Slow query responses"

Solutions:

  • Enable embedding cache: ENABLE_EMBEDDING_CACHE=True
  • Reduce retrieval count: TOP_K_RETRIEVE=5
  • Disable reranking: ENABLE_RERANKING=False
  • Use a quantized model for faster inference

4. "RAGAS metrics not appearing"

Symptoms: Chat responses lack quality metrics

Solution:

```bash
# Verify RAGAS is enabled in settings
ENABLE_RAGAS=True

# Check the OpenAI API key is valid
# View logs for RAGAS evaluation errors
tail -f logs/app.log | grep "RAGAS"
```

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments

Open Source Technologies:

  • FastAPI, Ollama (Mistral-7B), FAISS, BGE embeddings (sentence-transformers), LlamaIndex, rank_bm25, RAGAS, Alpine.js, SQLite

Research Papers:

  • Karpukhin et al. (2020) - Dense Passage Retrieval for Open-Domain Question Answering
  • Robertson & Zaragoza (2009) - The Probabilistic Relevance Framework: BM25 and Beyond
  • Lewis et al. (2020) - Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
  • Es et al. (2023) - RAGAS: Automated Evaluation of Retrieval Augmented Generation

Built with ❤️ for the open-source community