---
title: QuerySphere
emoji: 🧠
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: false
license: mit
app_port: 7860
---
# QuerySphere: RAG Platform for Document Q&A with Knowledge Ingestion

**MVP-grade RAG platform with multi-format document ingestion, hybrid retrieval, and zero API costs.**

An MVP-grade Retrieval-Augmented Generation (RAG) system that enables organizations to unlock knowledge trapped across documents and archives while maintaining complete data privacy and eliminating costly API dependencies.
## 📚 Table of Contents
- Overview
- Key Features
- System Architecture
- Technology Stack
- Installation
- Quick Start
- Core Components
- API Documentation
- Configuration
- RAGAS Evaluation
- Troubleshooting
- License
## 🎯 Overview

QuerySphere addresses a critical enterprise pain point: information silos that cost organizations an estimated 20% of employee productivity. Unlike existing solutions (Humata AI, ChatPDF, NotebookLM) that charge $49/user/month and rely on expensive cloud LLM APIs, this system offers:
### Core Value Propositions
| Feature | Traditional Solutions | Our System |
|---|---|---|
| Privacy | Cloud-based (data leaves premises) | 100% on-premise processing |
| Cost | $49-99/user/month + API fees | Zero API costs (local inference) |
| Input Types | PDF only | PDF, DOCX, TXT, ZIP archives |
| Quality Metrics | Black box (no visibility) | RAGAS evaluation with detailed metrics |
| Retrieval | Vector-only | Hybrid (Vector + BM25 + Reranking) |
| Chunking | Fixed size | Adaptive (3 strategies) |
### Market Context
- $8.5B projected enterprise AI search market by 2027
- 85% of enterprises actively adopting AI-powered knowledge management
- Growing regulatory demands for on-premise, privacy-compliant solutions
## ✨ Key Features
### 1. Multi-Format Document Ingestion
- Supported Formats: PDF, DOCX, TXT
- Archive Processing: ZIP files up to 2GB with recursive extraction
- Batch Upload: Process multiple documents simultaneously
- OCR Support: Extract text from scanned documents and images (PaddleOCR or EasyOCR)
### 2. Intelligent Document Processing

- Adaptive Chunking: Automatically selects the optimal strategy based on document size
  - Fixed-size chunks (< 50K tokens): 512 tokens with 50-token overlap
  - Semantic chunks (50K-500K tokens): Section-aware splitting
  - Hierarchical chunks (> 500K tokens): Parent-child structure
- Metadata Extraction: Title, author, date, page numbers, section headers
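The size-based selection above can be sketched as a simple dispatch. This is an illustrative sketch, not the project's actual code: the function name `choose_chunking_strategy` and the returned dict shape are hypothetical, while the thresholds and chunk sizes come from this README.

```python
# Hypothetical sketch of the adaptive strategy selection; thresholds and
# chunk sizes are taken from this README, names are illustrative.
def choose_chunking_strategy(token_count: int) -> dict:
    """Pick a chunking strategy based on document size in tokens."""
    if token_count < 50_000:
        # Small documents: simple fixed-size windows
        return {"strategy": "fixed", "chunk_size": 512, "overlap": 50}
    if token_count < 500_000:
        # Medium documents: split on section boundaries
        return {"strategy": "semantic", "chunk_size": 512, "overlap": 50}
    # Very large documents: coarse parent chunks with fine-grained children
    return {"strategy": "hierarchical", "parent_size": 2048, "child_size": 512}
```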
### 3. Hybrid Retrieval System

```mermaid
graph LR
    A[User Query] --> B[Query Embedding]
    A --> C[Keyword Analysis]
    B --> D[Vector Search<br/>FAISS]
    C --> E[BM25 Search]
    D --> F[Reciprocal Rank Fusion<br/>60% Vector + 40% BM25]
    E --> F
    F --> G[Cross-Encoder Reranking]
    G --> H[Top-K Results]
```
- Vector Search: FAISS with BGE embeddings (384-dim)
- Keyword Search: BM25 with optimized parameters (k1=1.5, b=0.75)
- Fusion Methods: Weighted, Reciprocal Rank Fusion (RRF), CombSum
- Reranking: Cross-encoder for precision boost
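The weighted Reciprocal Rank Fusion step can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the `k=60` smoothing constant is the value commonly used in the RRF literature (an assumption here), while the 0.6/0.4 weights come from this README.

```python
# Minimal weighted RRF sketch; the k=60 constant is an assumption,
# the 0.6/0.4 weights are from this README's configuration.
def reciprocal_rank_fusion(vector_ids, bm25_ids, weights=(0.6, 0.4), k=60):
    """Fuse two ranked lists of document IDs into one weighted RRF ranking."""
    scores = {}
    for weight, ranking in zip(weights, (vector_ids, bm25_ids)):
        for rank, doc_id in enumerate(ranking):
            # Each list contributes weight / (k + rank) for every hit
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# A document ranked well by both retrievers rises to the top
fused = reciprocal_rank_fusion(["d1", "d2", "d3"], ["d2", "d4", "d1"])
```

Documents found by only one retriever still appear in the fused list, just with lower scores, which is why RRF degrades gracefully when one retriever misses.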
### 4. Local LLM Generation
- Ollama Integration: Zero-cost inference with Mistral-7B or LLaMA-2
- Adaptive Temperature: Context-aware generation parameters
- Citation Tracking: Automatic source attribution with validation
- Streaming Support: Token-by-token response generation
### 5. RAGAS Quality Assurance
- Real-Time Evaluation: Answer relevancy, faithfulness, context precision/recall
- Automatic Metrics: Computed for every query-response pair
- Analytics Dashboard: Track quality trends over time
- Export Capability: Download evaluation data for analysis
- Session Statistics: Aggregate metrics across conversation sessions
## 🏗️ System Architecture

### High-Level Architecture

```mermaid
graph TB
    subgraph "Frontend Layer"
        A[Web UI<br/>HTML/CSS/JS]
    end
    subgraph "API Layer"
        B[FastAPI Gateway<br/>REST Endpoints]
    end
    subgraph "Ingestion Pipeline"
        C[Document Parser<br/>PDF/DOCX/TXT]
        D[Adaptive Chunker<br/>3 Strategies]
        E[Embedding Generator<br/>BGE-small-en-v1.5]
    end
    subgraph "Storage Layer"
        F[FAISS Vector DB<br/>~10M vectors]
        G[BM25 Keyword Index]
        H[SQLite Metadata<br/>Documents & Chunks]
        I[LRU Cache<br/>Embeddings]
    end
    subgraph "Retrieval Engine"
        J[Hybrid Retriever<br/>Vector + BM25]
        K[Cross-Encoder<br/>Reranker]
        L[Context Assembler]
    end
    subgraph "Generation Engine"
        M[Ollama LLM<br/>Mistral-7B]
        N[Prompt Builder]
        O[Citation Formatter]
    end
    subgraph "Evaluation Engine"
        P[RAGAS Evaluator<br/>Quality Metrics]
    end
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    E --> G
    D --> H
    E --> I
    B --> J
    J --> F
    J --> G
    J --> K
    K --> L
    L --> N
    N --> M
    M --> O
    O --> A
    M --> P
    P --> H
    P --> A
```
### Why This Architecture?

#### Modular Design
Each component is independent and replaceable:
- Parser: Swap PDF libraries without affecting chunking
- Embedder: Change from BGE to OpenAI embeddings with config update
- LLM: Switch from Ollama to OpenAI API seamlessly
#### Separation of Concerns

Ingestion → Storage → Retrieval → Generation → Evaluation
Each stage has clear inputs/outputs and single responsibility.
#### Performance Optimization
- Async Processing: Non-blocking I/O for uploads and LLM calls
- Batch Operations: Embed 32 chunks simultaneously
- Local Caching: LRU cache for query embeddings and frequent retrievals
- Indexing: FAISS ANN for O(log n) search vs O(n) brute force
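The embedding-cache idea above can be sketched with the standard library's `functools.lru_cache`, which the Technology Stack table names as the caching mechanism. The `embed_query` function here is a stand-in for the real BGE model call; the placeholder "embedding" it returns is purely illustrative.

```python
from functools import lru_cache

# Tracks how many times the (expensive) embedding step actually runs
CALLS = {"count": 0}

@lru_cache(maxsize=1000)
def embed_query(query: str) -> tuple:
    """Stand-in for the BGE embedding call; the cache skips repeat queries."""
    CALLS["count"] += 1
    # Placeholder vector: the real system would run the embedding model here
    return tuple(float(ord(c)) for c in query[:4])

embed_query("revenue trends")
embed_query("revenue trends")  # served from the LRU cache, no second model call
```

Because `lru_cache` keys on the exact query string, repeated or paraphrase-identical queries skip the model entirely, which is where most of the latency win comes from.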
## 🔧 Technology Stack

### Core Technologies
| Component | Technology | Version | Why This Choice |
|---|---|---|---|
| Backend | FastAPI | 0.104+ | Async support, auto-docs, production-grade |
| LLM | Ollama (Mistral-7B) | Latest | Zero API costs, on-premise, 20-30 tokens/sec |
| Embeddings | BGE-small-en-v1.5 | 384-dim | SOTA quality, 10x faster than alternatives |
| Vector DB | FAISS | Latest | Battle-tested, 10x faster than ChromaDB |
| Keyword Search | BM25 (rank_bm25) | Latest | Fast probabilistic ranking |
| Document Parsing | PyPDF2, python-docx | Latest | Industry standard, reliable |
| Chunking | LlamaIndex | 0.9+ | Advanced semantic splitting |
| Reranking | Cross-Encoder | Latest | +15% accuracy, minimal latency |
| Evaluation | RAGAS | 0.1.9 | Automated RAG quality metrics |
| Frontend | Alpine.js | 3.x | Lightweight reactivity, no build step |
| Database | SQLite | 3.x | Zero-config, sufficient for metadata |
| Caching | In-Memory LRU | Python functools | Fast, no external dependencies |
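For the SQLite metadata layer, a schema along these lines would hold the document and chunk metadata the README describes (title, author, page numbers, section headers, token counts). The table and column names below are hypothetical, not taken from the actual codebase.

```python
import sqlite3

# Hypothetical metadata schema; table/column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    filename TEXT NOT NULL,
    title TEXT,
    author TEXT,
    uploaded_at TEXT
);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    document_id INTEGER REFERENCES documents(id),
    page INTEGER,
    section TEXT,
    token_count INTEGER,
    text TEXT NOT NULL
);
""")
conn.execute("INSERT INTO documents (filename, title) VALUES (?, ?)",
             ("Q3_Financial_Report.pdf", "Q3 Report"))
conn.execute("INSERT INTO chunks (document_id, page, token_count, text) "
             "VALUES (1, 3, 480, 'Revenue increased 23% ...')")
# Join chunks back to their source document for citation display
rows = conn.execute(
    "SELECT filename, page FROM chunks "
    "JOIN documents ON documents.id = chunks.document_id").fetchall()
```

SQLite fits here because the metadata workload is small, read-heavy, and single-node, so a zero-config embedded database avoids an extra service dependency.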
### Python Dependencies

```text
fastapi>=0.104.0
uvicorn>=0.24.0
ollama>=0.1.0
sentence-transformers>=2.2.2
faiss-cpu>=1.7.4
llama-index>=0.9.0
rank-bm25>=0.2.2
PyPDF2>=3.0.0
python-docx>=0.8.11
pydantic>=2.0.0
aiohttp>=3.9.0
tiktoken>=0.5.0
ragas==0.1.9
datasets==2.14.6
```
## 📦 Installation

### Prerequisites
- Python 3.10 or higher
- 8GB RAM minimum (16GB recommended)
- 10GB disk space for models and indexes
- Ollama installed (https://ollama.ai)
### Step 1: Clone the Repository

```bash
git clone https://github.com/satyaki-mitra/docu-vault-ai.git
cd docu-vault-ai
```
### Step 2: Create a Virtual Environment

```bash
# Using conda (recommended)
conda create -n rag_env python=3.10
conda activate rag_env

# Or using venv
python -m venv rag_env
source rag_env/bin/activate  # On Windows: rag_env\Scripts\activate
```
### Step 3: Install Dependencies

```bash
pip install -r requirements.txt
```
### Step 4: Install Ollama and Pull the Model

```bash
# Install Ollama (macOS)
brew install ollama

# Install Ollama (Linux)
curl https://ollama.ai/install.sh | sh

# Pull the Mistral model
ollama pull mistral:7b

# Verify the installation
ollama list
```
### Step 5: Configure the Environment

```bash
# Copy the example config
cp .env.example .env

# Edit the configuration (optional)
nano .env
```
Key configuration options:

```env
# LLM Settings
OLLAMA_MODEL=mistral:7b
DEFAULT_TEMPERATURE=0.1
CONTEXT_WINDOW=8192

# Retrieval Settings
VECTOR_WEIGHT=0.6
BM25_WEIGHT=0.4
ENABLE_RERANKING=True
TOP_K_RETRIEVE=10

# RAGAS Evaluation
ENABLE_RAGAS=True
RAGAS_ENABLE_GROUND_TRUTH=False
OPENAI_API_KEY=your_openai_api_key_here  # Required for RAGAS

# Performance
EMBEDDING_BATCH_SIZE=32
MAX_WORKERS=4
```
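One way such `.env` values could be read into typed Python settings is sketched below. The function name `load_retrieval_settings` is hypothetical; the variable names and defaults match the retrieval options listed above.

```python
import os

# Illustrative loader for the retrieval-related .env keys listed above;
# the function name is hypothetical, key names/defaults are from this README.
def load_retrieval_settings(env=os.environ):
    return {
        "vector_weight": float(env.get("VECTOR_WEIGHT", "0.6")),
        "bm25_weight": float(env.get("BM25_WEIGHT", "0.4")),
        "enable_reranking": env.get("ENABLE_RERANKING", "True").lower() == "true",
        "top_k": int(env.get("TOP_K_RETRIEVE", "10")),
    }

# Any dict works as the env source, which keeps the loader easy to test
settings = load_retrieval_settings({"TOP_K_RETRIEVE": "5",
                                    "ENABLE_RERANKING": "False"})
```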
## 🚀 Quick Start

### 1. Start the Ollama Server

```bash
# Terminal 1: Start Ollama
ollama serve
```

### 2. Launch the Application

```bash
# Terminal 2: Start the RAG system
python app.py
```
Expected output:

```text
INFO: Started server process [12345]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000
```
### 3. Access the Web Interface

Open a browser to http://localhost:8000
### 4. Upload Documents
- Click "Upload Documents"
- Select PDF/DOCX/TXT files (or ZIP archives)
- Click "Start Building"
- Wait for indexing to complete (progress bar shows status)
### 5. Query Your Documents

```text
Query: "What are the key findings in the Q3 report?"

Response: The Q3 report highlights three key findings:
[1] Revenue increased 23% year-over-year to $45.2M,
[2] Customer acquisition costs decreased 15%, and
[3] Net retention rate reached 118% [1].

Sources:
[1] Q3_Financial_Report.pdf (Page 3, Executive Summary)

RAGAS Metrics:
- Answer Relevancy: 0.89
- Faithfulness: 0.94
- Context Utilization: 0.87
- Overall Score: 0.90
```
## 🧩 Core Components

### 1. Document Ingestion Pipeline

```text
Document Upload → Parse → Clean → Chunk → Embed → Index
```
Adaptive chunking logic:

```mermaid
graph TD
    A[Calculate Token Count] --> B{Tokens < 50K?}
    B -->|Yes| C[Fixed Chunking<br/>512 tokens, 50 overlap]
    B -->|No| D{Tokens < 500K?}
    D -->|Yes| E[Semantic Chunking<br/>Section-aware]
    D -->|No| F[Hierarchical Chunking<br/>Parent 2048, Child 512]
```
### 2. Hybrid Retrieval Engine

Retrieval flow:

```python
# Pseudocode
def hybrid_retrieve(query: str, top_k: int = 10):
    # Dual retrieval
    query_embedding = embedder.embed(query)
    vector_results = faiss_index.search(query_embedding, top_k * 2)
    bm25_results = bm25_index.search(query, top_k * 2)

    # Fusion (RRF)
    fused_results = reciprocal_rank_fusion(vector_results,
                                           bm25_results,
                                           weights=(0.6, 0.4))

    # Reranking
    return cross_encoder.rerank(query, fused_results, top_k)
```
### 3. Response Generation

Temperature control:

```mermaid
graph LR
    A[Query Type] --> B{Factual?}
    B -->|Yes| C[Low Temp<br/>0.1-0.2]
    B -->|No| D[Context Quality]
    D -->|High| E[Medium Temp<br/>0.3-0.5]
    D -->|Low| F[High Temp<br/>0.6-0.8]
```
### 4. RAGAS Evaluation Module

Automatic quality assessment:

```python
# After each query-response pair
ragas_result = ragas_evaluator.evaluate_single(query = user_query,
                                               answer = generated_answer,
                                               contexts = retrieved_chunks,
                                               retrieval_time_ms = retrieval_time,
                                               generation_time_ms = generation_time,
                                               )
```

Metrics computed:

- Answer Relevancy (0-1)
- Faithfulness (0-1)
- Context Utilization (0-1)
- Context Relevancy (0-1)
- Overall Score (weighted average)
## 📡 API Documentation

### Core Endpoints

#### 1. Health Check

```http
GET /api/health
```

Response:

```json
{
  "status": "healthy",
  "timestamp": "2024-11-27T03:00:00",
  "components": {
    "vector_store": true,
    "llm": true,
    "embeddings": true,
    "retrieval": true
  }
}
```
#### 2. Upload Documents

```http
POST /api/upload
Content-Type: multipart/form-data

files: [file1.pdf, file2.docx]
```

#### 3. Start Processing

```http
POST /api/start-processing
```
#### 4. Query (Chat)

```http
POST /api/chat
Content-Type: application/json
```

```json
{
  "message": "What are the revenue figures?",
  "session_id": "session_123"
}
```
The response includes RAGAS metrics:

```json
{
  "session_id": "session_123",
  "response": "Revenue for Q3 was $45.2M [1]...",
  "sources": [...],
  "metrics": {
    "retrieval_time": 245,
    "generation_time": 3100,
    "total_time": 3350
  },
  "ragas_metrics": {
    "answer_relevancy": 0.89,
    "faithfulness": 0.94,
    "context_utilization": 0.87,
    "context_relevancy": 0.91,
    "overall_score": 0.90
  }
}
```
#### 5. RAGAS Endpoints

```http
# Get evaluation history
GET /api/ragas/history

# Get session statistics
GET /api/ragas/statistics

# Clear evaluation history
POST /api/ragas/clear

# Export evaluation data
GET /api/ragas/export

# Get RAGAS configuration
GET /api/ragas/config
```
## ⚙️ Configuration

Key configuration sections in `config/settings.py`:
### LLM Settings

```python
OLLAMA_MODEL = "mistral:7b"
DEFAULT_TEMPERATURE = 0.1
MAX_TOKENS = 1000
CONTEXT_WINDOW = 8192
```
### RAGAS Settings

```python
ENABLE_RAGAS = True
RAGAS_ENABLE_GROUND_TRUTH = False
RAGAS_METRICS = ["answer_relevancy",
                 "faithfulness",
                 "context_utilization",
                 "context_relevancy",
                 ]
RAGAS_EVALUATION_TIMEOUT = 60
RAGAS_BATCH_SIZE = 10
```
### Caching Settings

```python
ENABLE_EMBEDDING_CACHE = True
CACHE_MAX_SIZE = 1000  # LRU cache size
CACHE_TTL = 3600       # Time to live in seconds
```
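A cache honoring both `CACHE_MAX_SIZE` (LRU eviction) and `CACHE_TTL` (expiry) can be sketched with an `OrderedDict`. This is a stand-in illustration of the semantics those two settings imply, not the project's actual cache class.

```python
import time
from collections import OrderedDict

# Minimal LRU-with-TTL sketch matching CACHE_MAX_SIZE / CACHE_TTL semantics;
# the class is illustrative, not the project's implementation.
class TTLCache:
    def __init__(self, max_size=1000, ttl=3600):
        self.max_size, self.ttl = max_size, ttl
        self._data = OrderedDict()  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] < time.monotonic():
            self._data.pop(key, None)  # drop expired entries lazily
            return None
        self._data.move_to_end(key)    # mark as most recently used
        return entry[0]

    def put(self, key, value):
        self._data[key] = (value, time.monotonic() + self.ttl)
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used

cache = TTLCache(max_size=2, ttl=3600)
cache.put("q1", [0.1, 0.2])
cache.put("q2", [0.3, 0.4])
cache.put("q3", [0.5, 0.6])  # exceeds max_size: evicts "q1"
```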
## 📊 RAGAS Evaluation

### What is RAGAS?

RAGAS (Retrieval-Augmented Generation Assessment) is a framework for evaluating RAG systems using automated metrics. Our implementation provides real-time quality assessment for every query-response pair.

### Metrics Explained
| Metric | Definition | Target | Interpretation |
|---|---|---|---|
| Answer Relevancy | How well the answer addresses the question | > 0.85 | Measures usefulness to user |
| Faithfulness | Is the answer grounded in retrieved context? | > 0.90 | Prevents hallucinations |
| Context Utilization | How well the context is used in the answer | > 0.80 | Retrieval effectiveness |
| Context Relevancy | Are retrieved chunks relevant to the query? | > 0.85 | Search quality |
| Overall Score | Weighted average of all metrics | > 0.85 | System performance |
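The overall score in the table above can be sketched as a weighted average of the per-metric scores. The README does not specify the actual weights, so the default below (equal weights) is an assumption; it happens to reproduce the 0.90 overall score in this document's examples.

```python
# Sketch of the weighted overall score; equal default weights are an
# assumption, since the README only says "weighted average".
def overall_score(metrics, weights=None):
    weights = weights or {m: 1.0 for m in metrics}  # default: plain average
    total = sum(weights[m] for m in metrics)
    return sum(metrics[m] * weights[m] for m in metrics) / total

score = overall_score({
    "answer_relevancy": 0.89,
    "faithfulness": 0.94,
    "context_utilization": 0.87,
    "context_relevancy": 0.91,
})
```

Passing a custom `weights` dict lets you emphasize faithfulness over relevancy, for example, if hallucination is the primary concern.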
### Using the Analytics Dashboard
- Navigate to Analytics & Quality section
- View real-time RAGAS metrics table
- Monitor session statistics (averages, trends)
- Export evaluation data for offline analysis
### Example Evaluation Output

```text
Query: "What were the Q3 revenue trends?"
Answer: "Q3 revenue increased 23% YoY to $45.2M..."

RAGAS Evaluation:
├─ Answer Relevancy: 0.89 ✓ (Good)
├─ Faithfulness: 0.94 ✓ (Excellent)
├─ Context Utilization: 0.87 ✓ (Good)
├─ Context Relevancy: 0.91 ✓ (Excellent)
└─ Overall Score: 0.90 ✓ (Excellent)

Performance:
├─ Retrieval Time: 245ms
├─ Generation Time: 3100ms
└─ Total Time: 3345ms
```
## 🔧 Troubleshooting

### Common Issues
#### 1. "RAGAS evaluation failed"

Cause: OpenAI API key not configured.

Solution:

```env
# Add to the .env file
OPENAI_API_KEY=your_openai_api_key_here

# Or disable RAGAS if not needed
ENABLE_RAGAS=False
```
#### 2. "Context assembly returning 0 chunks"

Cause: Missing token counts in chunks.

Solution: Already fixed in `context_assembler.py`; token counts are now calculated on the fly when missing.
#### 3. "Slow query responses"

Solutions:

- Enable the embedding cache: `ENABLE_EMBEDDING_CACHE=True`
- Reduce the retrieval count: `TOP_K_RETRIEVE=5`
- Disable reranking: `ENABLE_RERANKING=False`
- Use a quantized model for faster inference
#### 4. "RAGAS metrics not appearing"

Symptoms: Chat responses lack quality metrics.

Solution:

```bash
# Verify RAGAS is enabled in settings: ENABLE_RAGAS = True
# Check that the OpenAI API key is valid
# View logs for RAGAS evaluation errors
tail -f logs/app.log | grep "RAGAS"
```
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🙏 Acknowledgments

**Open Source Technologies:**
- FastAPI - Modern web framework
- Ollama - Local LLM inference
- FAISS - Vector similarity search
- LlamaIndex - Document chunking
- Sentence Transformers - Embedding models
- RAGAS - RAG evaluation
**Research Papers:**
- Karpukhin et al. (2020) - Dense Passage Retrieval for Open-Domain Question Answering
- Robertson & Zaragoza (2009) - The Probabilistic Relevance Framework: BM25 and Beyond
- Lewis et al. (2020) - Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- Es et al. (2023) - RAGAS: Automated Evaluation of Retrieval Augmented Generation
Built with ❤️ for the open-source community