Spaces:

satyakimitra
/

QuerySphere

Running

App Files Files Community

QuerySphere / README.md

satyakimitra

Fix: README Update

bbfcdfc about 2 months ago

preview code

raw

history blame contribute delete

18.4 kB

	---
	title: QuerySphere
	emoji: 🧠
	colorFrom: indigo
	colorTo: purple
	sdk: docker
	pinned: false
	license: mit
	app_port: 7860
	---

	<div align="center">

	# QuerySphere: RAG platform for document Q&A with Knowledge Ingestion

	[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
	[![FastAPI](https://img.shields.io/badge/FastAPI-0.104+-green.svg)](https://fastapi.tiangolo.com/)
	[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

	> MVP-Grade RAG Platform with Multi-Format Document Ingestion, Hybrid Retrieval, and Zero API Costs

	</div>

	---

	A MVP grade Retrieval-Augmented Generation (RAG) system that enables organizations to unlock knowledge trapped across documents and archives while maintaining complete data privacy and eliminating costly API dependencies.

	---

	## 📋 Table of Contents

	- [Overview](#-overview)
	- [Key Features](#-key-features)
	- [System Architecture](#-system-architecture)
	- [Technology Stack](#-technology-stack)
	- [Installation](#-installation)
	- [Quick Start](#-quick-start)
	- [Core Components](#-core-components)
	- [API Documentation](#-api-documentation)
	- [Configuration](#-configuration)
	- [RAGAS Evaluation](#-ragas-evaluation)
	- [Troubleshooting](#-troubleshooting)
	- [License](#-license)

	---

	## 🎯 Overview

	The QuerySphere addresses a critical enterprise pain point: information silos that cost organizations 20% of employee productivity. Unlike existing solutions (Humata AI, ChatPDF, NotebookLM) that charge $49/user/month and rely on expensive cloud LLM APIs, this system offers:

	### Core Value Propositions

	\| Feature \| Traditional Solutions \| Our System \|
	\|---------\|----------------------\|------------\|
	\| Privacy \| Cloud-based (data leaves premises) \| 100% on-premise processing \|
	\| Cost \| $49-99/user/month + API fees \| Zero API costs (local inference) \|
	\| Input Types \| PDF only \| PDF, DOCX, TXT, ZIP archives \|
	\| Quality Metrics \| Black box (no visibility) \| RAGAS evaluation with detailed metrics \|
	\| Retrieval \| Vector-only \| Hybrid (Vector + BM25 + Reranking) \|
	\| Chunking \| Fixed size \| Adaptive (3 strategies) \|

	### Market Context

	- $8.5B projected enterprise AI search market by 2027
	- 85% of enterprises actively adopting AI-powered knowledge management
	- Growing regulatory demands for on-premise, privacy-compliant solutions

	---

	## ✨ Key Features

	### 1. Multi-Format Document Ingestion
	- Supported Formats: PDF, DOCX, TXT
	- Archive Processing: ZIP files up to 2GB with recursive extraction
	- Batch Upload: Process multiple documents simultaneously
	- OCR Support: Extract text from scanned documents and images (PaddleOCR or EasyOCR)

	### 2. Intelligent Document Processing
	- Adaptive Chunking: Automatically selects optimal strategy based on document size
	- Fixed-size chunks (< 50K tokens): 512 tokens with 50 overlap
	- Semantic chunks (50K-500K tokens): Section-aware splitting
	- Hierarchical chunks (> 500K tokens): Parent-child structure
	- Metadata Extraction: Title, author, date, page numbers, section headers

	### 3. Hybrid Retrieval System
	```mermaid
	graph LR
	A[User Query] --> B[Query Embedding]
	A --> C[Keyword Analysis]
	B --> D[Vector Search<br/>FAISS]
	C --> E[BM25 Search]
	D --> F[Reciprocal Rank Fusion<br/>60% Vector + 40% BM25]
	E --> F
	F --> G[Cross-Encoder Reranking]
	G --> H[Top-K Results]
	```

	- Vector Search: FAISS with BGE embeddings (384-dim)
	- Keyword Search: BM25 with optimized parameters (k1=1.5, b=0.75)
	- Fusion Methods: Weighted, Reciprocal Rank Fusion (RRF), CombSum
	- Reranking: Cross-encoder for precision boost

	### 4. Local LLM Generation
	- Ollama Integration: Zero-cost inference with Mistral-7B or LLaMA-2
	- Adaptive Temperature: Context-aware generation parameters
	- Citation Tracking: Automatic source attribution with validation
	- Streaming Support: Token-by-token response generation

	### 5. RAGAS Quality Assurance
	- Real-Time Evaluation: Answer relevancy, faithfulness, context precision/recall
	- Automatic Metrics: Computed for every query-response pair
	- Analytics Dashboard: Track quality trends over time
	- Export Capability: Download evaluation data for analysis
	- Session Statistics: Aggregate metrics across conversation sessions

	---

	## 🗂️ System Architecture

	### High-Level Architecture

	```mermaid
	graph TB
	subgraph "Frontend Layer"
	A[Web UI<br/>HTML/CSS/JS]
	end

	subgraph "API Layer"
	B[FastAPI Gateway<br/>REST Endpoints]
	end

	subgraph "Ingestion Pipeline"
	C[Document Parser<br/>PDF/DOCX/TXT]
	D[Adaptive Chunker<br/>3 Strategies]
	E[Embedding Generator<br/>BGE-small-en-v1.5]
	end

	subgraph "Storage Layer"
	F[FAISS Vector DB<br/>~10M vectors]
	G[BM25 Keyword Index]
	H[SQLite Metadata<br/>Documents & Chunks]
	I[LRU Cache<br/>Embeddings]
	end

	subgraph "Retrieval Engine"
	J[Hybrid Retriever<br/>Vector + BM25]
	K[Cross-Encoder<br/>Reranker]
	L[Context Assembler]
	end

	subgraph "Generation Engine"
	M[Ollama LLM<br/>Mistral-7B]
	N[Prompt Builder]
	O[Citation Formatter]
	end

	subgraph "Evaluation Engine"
	P[RAGAS Evaluator<br/>Quality Metrics]
	end

	A --> B
	B --> C
	C --> D
	D --> E
	E --> F
	E --> G
	D --> H
	E --> I

	B --> J
	J --> F
	J --> G
	J --> K
	K --> L

	L --> N
	N --> M
	M --> O
	O --> A

	M --> P
	P --> H
	P --> A
	```

	### Why This Architecture?

	#### Modular Design
	Each component is independent and replaceable:
	- Parser: Swap PDF libraries without affecting chunking
	- Embedder: Change from BGE to OpenAI embeddings with config update
	- LLM: Switch from Ollama to OpenAI API seamlessly

	#### Separation of Concerns
	```
	Ingestion → Storage → Retrieval → Generation → Evaluation
	```
	Each stage has clear inputs/outputs and single responsibility.

	#### Performance Optimization
	- Async Processing: Non-blocking I/O for uploads and LLM calls
	- Batch Operations: Embed 32 chunks simultaneously
	- Local Caching: LRU cache for query embeddings and frequent retrievals
	- Indexing: FAISS ANN for O(log n) search vs O(n) brute force

	---

	## 🔧 Technology Stack

	### Core Technologies

	\| Component \| Technology \| Version \| Why This Choice \|
	\|-----------\|-----------\|---------\|-----------------\|
	\| Backend \| FastAPI \| 0.104+ \| Async support, auto-docs, production-grade \|
	\| LLM \| Ollama (Mistral-7B) \| Latest \| Zero API costs, on-premise, 20-30 tokens/sec \|
	\| Embeddings \| BGE-small-en-v1.5 \| 384-dim \| SOTA quality, 10x faster than alternatives \|
	\| Vector DB \| FAISS \| Latest \| Battle-tested, 10x faster than ChromaDB \|
	\| Keyword Search \| BM25 (rank_bm25) \| Latest \| Fast probabilistic ranking \|
	\| Document Parsing \| PyPDF2, python-docx \| Latest \| Industry standard, reliable \|
	\| Chunking \| LlamaIndex \| 0.9+ \| Advanced semantic splitting \|
	\| Reranking \| Cross-Encoder \| Latest \| +15% accuracy, minimal latency \|
	\| Evaluation \| RAGAS \| 0.1.9 \| Automated RAG quality metrics \|
	\| Frontend \| Alpine.js \| 3.x \| Lightweight reactivity, no build step \|
	\| Database \| SQLite \| 3.x \| Zero-config, sufficient for metadata \|
	\| Caching \| In-Memory LRU \| Python functools \| Fast, no external dependencies \|

	### Python Dependencies

	```
	fastapi>=0.104.0
	uvicorn>=0.24.0
	ollama>=0.1.0
	sentence-transformers>=2.2.2
	faiss-cpu>=1.7.4
	llama-index>=0.9.0
	rank-bm25>=0.2.2
	PyPDF2>=3.0.0
	python-docx>=0.8.11
	pydantic>=2.0.0
	aiohttp>=3.9.0
	tiktoken>=0.5.0
	ragas==0.1.9
	datasets==2.14.6
	```

	---

	## 📦 Installation

	### Prerequisites

	- Python 3.10 or higher
	- 8GB RAM minimum (16GB recommended)
	- 10GB disk space for models and indexes
	- Ollama installed ([https://ollama.ai](https://ollama.ai))

	### Step 1: Clone Repository

	```bash
	git clone https://github.com/satyaki-mitra/docu-vault-ai.git
	cd docu-vault-ai
	```

	### Step 2: Create Virtual Environment

	```bash
	# Using conda (recommended)
	conda create -n rag_env python=3.10
	conda activate rag_env

	# Or using venv
	python -m venv rag_env
	source rag_env/bin/activate # On Windows: rag_env\Scripts\activate
	```

	### Step 3: Install Dependencies

	```bash
	pip install -r requirements.txt
	```

	### Step 4: Install Ollama and Model

	```bash
	# Install Ollama (macOS)
	brew install ollama

	# Install Ollama (Linux)
	curl https://ollama.ai/install.sh \| sh

	# Pull Mistral model
	ollama pull mistral:7b

	# Verify installation
	ollama list
	```

	### Step 5: Configure Environment

	```bash
	# Copy example config
	cp .env.example .env

	# Edit configuration (optional)
	nano .env
	```

	Key Configuration Options:

	```bash
	# LLM Settings
	OLLAMA_MODEL=mistral:7b
	DEFAULT_TEMPERATURE=0.1
	CONTEXT_WINDOW=8192

	# Retrieval Settings
	VECTOR_WEIGHT=0.6
	BM25_WEIGHT=0.4
	ENABLE_RERANKING=True
	TOP_K_RETRIEVE=10

	# RAGAS Evaluation
	ENABLE_RAGAS=True
	RAGAS_ENABLE_GROUND_TRUTH=False
	OPENAI_API_KEY=your_openai_api_key_here # Required for RAGAS

	# Performance
	EMBEDDING_BATCH_SIZE=32
	MAX_WORKERS=4
	```

	---

	## 🚀 Quick Start

	### 1. Start Ollama Server

	```bash
	# Terminal 1: Start Ollama
	ollama serve
	```

	### 2. Launch Application

	```bash
	# Terminal 2: Start RAG system
	python app.py
	```

	Output:
	```
	INFO: Started server process [12345]
	INFO: Waiting for application startup.
	INFO: Application startup complete.
	INFO: Uvicorn running on http://0.0.0.0:8000
	```

	### 3. Access Web Interface

	Open browser to: http://localhost:8000

	### 4. Upload Documents

	1. Click "Upload Documents"
	2. Select PDF/DOCX/TXT files (or ZIP archives)
	3. Click "Start Building"
	4. Wait for indexing to complete (progress bar shows status)

	### 5. Query Your Documents

	```
	Query: "What are the key findings in the Q3 report?"

	Response: The Q3 report highlights three key findings:
	[1] Revenue increased 23% year-over-year to $45.2M,
	[2] Customer acquisition costs decreased 15%, and
	[3] Net retention rate reached 118% [1].

	Sources:
	[1] Q3_Financial_Report.pdf (Page 3, Executive Summary)

	RAGAS Metrics:
	- Answer Relevancy: 0.89
	- Faithfulness: 0.94
	- Context Utilization: 0.87
	- Overall Score: 0.90
	```

	---

	## 🧩 Core Components

	### 1. Document Ingestion Pipeline

	```python
	# High-level flow
	Document Upload → Parse → Clean → Chunk → Embed → Index
	```

	Adaptive Chunking Logic:

	```mermaid
	graph TD
	A[Calculate Token Count] --> B{Tokens < 50K?}
	B -->\|Yes\| C[Fixed Chunking<br/>512 tokens, 50 overlap]
	B -->\|No\| D{Tokens < 500K?}
	D -->\|Yes\| E[Semantic Chunking<br/>Section-aware]
	D -->\|No\| F[Hierarchical Chunking<br/>Parent 2048, Child 512]
	```

	### 2. Hybrid Retrieval Engine

	Retrieval Flow:

	```python
	# Pseudocode
	def hybrid_retrieve(query: str, top_k: int = 10):
	# Dual retrieval
	query_embedding = embedder.embed(query)
	vector_results = faiss_index.search(query_embedding, top_k * 2)
	bm25_results = bm25_index.search(query, top_k * 2)

	# Fusion (RRF)
	fused_results = reciprocal_rank_fusion(vector_results,
	bm25_results,
	weights = (0.6, 0.4))

	# Reranking
	reranked = cross_encoder.rerank(query, fused_results, top_k)

	return reranked
	```

	### 3. Response Generation

	Temperature Control:

	```mermaid
	graph LR
	A[Query Type] --> B{Factual?}
	B -->\|Yes\| C[Low Temp<br/>0.1-0.2]
	B -->\|No\| D[Context Quality]
	D -->\|High\| E[Medium Temp<br/>0.3-0.5]
	D -->\|Low\| F[High Temp<br/>0.6-0.8]
	```

	### 4. RAGAS Evaluation Module

	Automatic Quality Assessment:

	```python
	# After each query-response
	ragas_result = ragas_evaluator.evaluate_single(query = user_query,
	answer = generated_answer,
	contexts = retrieved_chunks,
	retrieval_time_ms = retrieval_time,
	generation_time_ms = generation_time,
	)

	# Metrics computed:
	- Answer Relevancy (0-1)
	- Faithfulness (0-1)
	- Context Utilization (0-1)
	- Context Relevancy (0-1)
	- Overall Score (weighted average)
	```

	---

	## 📚 API Documentation

	### Core Endpoints

	#### 1. Health Check

	```bash
	GET /api/health
	```

	Response:
	```json
	{
	"status": "healthy",
	"timestamp": "2024-11-27T03:00:00",
	"components": {
	"vector_store": true,
	"llm": true,
	"embeddings": true,
	"retrieval": true
	}
	}
	```

	#### 2. Upload Documents

	```bash
	POST /api/upload
	Content-Type: multipart/form-data

	files: [file1.pdf, file2.docx]
	```

	#### 3. Start Processing

	```bash
	POST /api/start-processing
	```

	#### 4. Query (Chat)

	```bash
	POST /api/chat
	Content-Type: application/json

	{
	"message": "What are the revenue figures?",
	"session_id": "session_123"
	}
	```

	Response includes RAGAS metrics:
	```json
	{
	"session_id": "session_123",
	"response": "Revenue for Q3 was $45.2M [1]...",
	"sources": [...],
	"metrics": {
	"retrieval_time": 245,
	"generation_time": 3100,
	"total_time": 3350
	},
	"ragas_metrics": {
	"answer_relevancy": 0.89,
	"faithfulness": 0.94,
	"context_utilization": 0.87,
	"context_relevancy": 0.91,
	"overall_score": 0.90
	}
	}
	```

	#### 5. RAGAS Endpoints

	```bash
	# Get evaluation history
	GET /api/ragas/history

	# Get session statistics
	GET /api/ragas/statistics

	# Clear evaluation history
	POST /api/ragas/clear

	# Export evaluation data
	GET /api/ragas/export

	# Get RAGAS configuration
	GET /api/ragas/config
	```

	---

	## ⚙️ Configuration

	### config/settings.py

	Key Configuration Sections:

	#### LLM Settings
	```python
	OLLAMA_MODEL = "mistral:7b"
	DEFAULT_TEMPERATURE = 0.1
	MAX_TOKENS = 1000
	CONTEXT_WINDOW = 8192
	```

	#### RAGAS Settings
	```python
	ENABLE_RAGAS = True
	RAGAS_ENABLE_GROUND_TRUTH = False
	RAGAS_METRICS = ["answer_relevancy",
	"faithfulness",
	"context_utilization",
	"context_relevancy"
	]
	RAGAS_EVALUATION_TIMEOUT = 60
	RAGAS_BATCH_SIZE = 10
	```

	#### Caching Settings
	```python
	ENABLE_EMBEDDING_CACHE = True
	CACHE_MAX_SIZE = 1000 # LRU cache size
	CACHE_TTL = 3600 # Time to live in seconds
	```

	---

	## 📊 RAGAS Evaluation

	### What is RAGAS?

	RAGAS (Retrieval-Augmented Generation Assessment) is a framework for evaluating RAG systems using automated metrics. Our implementation provides real-time quality assessment for every query-response pair.

	### Metrics Explained

	\| Metric \| Definition \| Target \| Interpretation \|
	\|--------\|-----------\|--------\|----------------\|
	\| Answer Relevancy \| How well the answer addresses the question \| > 0.85 \| Measures usefulness to user \|
	\| Faithfulness \| Is the answer grounded in retrieved context? \| > 0.90 \| Prevents hallucinations \|
	\| Context Utilization \| How well the context is used in the answer \| > 0.80 \| Retrieval effectiveness \|
	\| Context Relevancy \| Are retrieved chunks relevant to the query? \| > 0.85 \| Search quality \|
	\| Overall Score \| Weighted average of all metrics \| > 0.85 \| System performance \|

	### Using the Analytics Dashboard

	1. Navigate to Analytics & Quality section
	2. View real-time RAGAS metrics table
	3. Monitor session statistics (averages, trends)
	4. Export evaluation data for offline analysis

	### Example Evaluation Output

	```
	Query: "What were the Q3 revenue trends?"
	Answer: "Q3 revenue increased 23% YoY to $45.2M..."

	RAGAS Evaluation:
	├─ Answer Relevancy: 0.89 ✓ (Good)
	├─ Faithfulness: 0.94 ✓ (Excellent)
	├─ Context Utilization: 0.87 ✓ (Good)
	├─ Context Relevancy: 0.91 ✓ (Excellent)
	└─ Overall Score: 0.90 ✓ (Excellent)

	Performance:
	├─ Retrieval Time: 245ms
	├─ Generation Time: 3100ms
	└─ Total Time: 3345ms
	```

	---

	## 🔧 Troubleshooting

	### Common Issues

	#### 1. "RAGAS evaluation failed"

	Cause: OpenAI API key not configured

	Solution:
	```bash
	# Add to .env file
	OPENAI_API_KEY=your_openai_api_key_here

	# Or disable RAGAS if not needed
	ENABLE_RAGAS=False
	```

	#### 2. "Context assembly returning 0 chunks"

	Cause: Missing token counts in chunks

	Solution: Already fixed in `context_assembler.py`. Tokens calculated on-the-fly if missing.

	#### 3. "Slow query responses"

	Solutions:
	- Enable embedding cache : `ENABLE_EMBEDDING_CACHE=True`
	- Reduce retrieval count : `TOP_K_RETRIEVE=5`
	- Disable reranking : `ENABLE_RERANKING=False`

	- Use quantized model for faster inference

	#### 4. "RAGAS metrics not appearing"

	Symptoms: Chat responses lack quality metrics

	Solution:
	```python
	# Verify RAGAS is enabled in settings
	ENABLE_RAGAS = True

	# Check OpenAI API key is valid
	# View logs for RAGAS evaluation errors
	tail -f logs/app.log \| grep "RAGAS"
	```

	---

	## 📄 License

	This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

	---

	## 🙏 Acknowledgments

	Open Source Technologies:
	- [FastAPI](https://fastapi.tiangolo.com/) - Modern web framework
	- [Ollama](https://ollama.ai/) - Local LLM inference
	- [FAISS](https://github.com/facebookresearch/faiss) - Vector similarity search
	- [LlamaIndex](https://www.llamaindex.ai/) - Document chunking
	- [Sentence Transformers](https://www.sbert.net/) - Embedding models
	- [RAGAS](https://github.com/explodinggradients/ragas) - RAG evaluation

	Research Papers:
	- Karpukhin et al. (2020) - Dense Passage Retrieval
	- Robertson & Zaragoza (2009) - The Probabilistic Relevance Framework: BM25
	- Lewis et al. (2020) - Retrieval-Augmented Generation
	- Es et al. (2023) - RAGAS: Automated Evaluation of RAG

	---

	<div align="center">

	Built with ❤️ for the open-source community

	</div>