	# QuerySphere - Technical Architecture Document

	## 1. System Overview

	### 1.1 High-Level Architecture

	```mermaid
	graph TB
	subgraph "Frontend Layer"
	A[Web UI<br/>HTML/CSS/JS]
	B[File Upload<br/>Drag & Drop]
	C[Chat Interface<br/>Real-time]
	D[Analytics Dashboard<br/>RAGAS Metrics]
	end

	subgraph "API Gateway"
	E[FastAPI Server<br/>Python 3.11+]
	end

	subgraph "Core Processing Engine"
	F[Ingestion Module]
	G[Processing Module]
	H[Retrieval Module]
	I[Generation Module]
	J[Evaluation Module]
	end

	subgraph "AI/ML Layer"
	K[Ollama LLM<br/>Mistral-7B]
	L[Embedding Model<br/>BGE-small-en]
	M[FAISS Vector DB]
	end

	subgraph "Quality Assurance"
	N[RAGAS Evaluator<br/>Real-time Metrics]
	end

	A --> E
	E --> F
	F --> G
	G --> H
	H --> I
	I --> K
	G --> L
	L --> M
	H --> M
	I --> N
	N --> E
	```

	### 1.2 System Characteristics

	| Aspect | Specification |
	|--------|---------------|
	| Architecture Style | Modular Microservices-inspired |
	| Deployment | Docker Containerized |
	| Processing Model | Async/Event-driven |
	| Data Flow | Pipeline-based with Checkpoints |
	| Scalability | Horizontal (Stateless API) + Vertical (GPU) |
	| Caching | In-Memory LRU Cache |
	| Evaluation | Real-time RAGAS Metrics |

	---

	## 2. Component Architecture

	### 2.1 Ingestion Module

	```mermaid
	flowchart TD
	A[User Input] --> B{Input Type Detection}

	B -->|PDF/DOCX| D[Document Parser]
	B -->|ZIP| E[Archive Extractor]

	subgraph D [Document Processing]
	D1[PyPDF2<br/>PDF Text]
	D2[python-docx<br/>Word Docs]
	D3[EasyOCR<br/>Scanned PDFs]
	end

	subgraph E [Archive Handling]
	E1[zipfile<br/>Extraction]
	E2[Recursive Processing]
	E3[Size Validation<br/>2GB Max]
	end

	D --> F[Text Cleaning]
	E --> F

	F --> G[Encoding Normalization]
	G --> H[Structure Preservation]
	H --> I[Output: Cleaned Text<br/>+ Metadata]
	```

	#### Ingestion Specifications:

	| Component | Technology | Configuration | Limits |
	|-----------|------------|---------------|---------|
	| PDF Parser | PyPDF2 + EasyOCR | OCR: English+Multilingual | 1000 pages max |
	| Document Parser | python-docx | Preserve formatting | 50MB per file |
	| Archive Handler | zipfile | Recursion depth: 5 | 2GB total, 10k files |
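
The archive limits in the table can be checked before extraction by reading the ZIP directory. The sketch below is illustrative only: `check_archive_limits` and the constant names are hypothetical, and the recursion-depth limit for nested archives is not handled.

```python
import io
import zipfile

MAX_TOTAL_BYTES = 2 * 1024**3   # 2GB total, from the table above
MAX_FILES = 10_000              # 10k files, from the table above

def check_archive_limits(archive, max_total=MAX_TOTAL_BYTES, max_files=MAX_FILES):
    """Raise ValueError if the archive exceeds the file-count or size limit."""
    with zipfile.ZipFile(archive) as zf:
        members = [info for info in zf.infolist() if not info.is_dir()]
        if len(members) > max_files:
            raise ValueError(f"Archive has {len(members)} files (limit {max_files})")
        # file_size is the declared uncompressed size of each member.
        total = sum(info.file_size for info in members)
        if total > max_total:
            raise ValueError(f"Archive expands to {total} bytes (limit {max_total})")
    return len(members), total

# Build a tiny in-memory archive to exercise the check.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("a.txt", "hello")
    zf.writestr("docs/b.txt", "world!")
buffer.seek(0)
file_count, total_bytes = check_archive_limits(buffer)
```

Declared sizes come from the archive's own headers, so a stricter implementation would also count bytes while extracting.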

	### 2.2 Processing Module

	#### 2.2.1 Adaptive Chunking Strategy

	```mermaid
	flowchart TD
	A[Input Text] --> B[Token Count Analysis]
	B --> C{Document Size}

	C -->|<50K tokens| D[Fixed-Size Chunking]
	C -->|50K-500K tokens| E[Semantic Chunking]
	C -->|>500K tokens| F[Hierarchical Chunking]

	subgraph D [Strategy 1: Fixed]
	D1[Chunk Size: 512 tokens]
	D2[Overlap: 50 tokens]
	D3[Method: Simple sliding window]
	end

	subgraph E [Strategy 2: Semantic]
	E1[Breakpoint: 95th percentile dissimilarity]
	E2[Method: LlamaIndex SemanticSplitterNodeParser]
	E3[Preserve: Section boundaries]
	end

	subgraph F [Strategy 3: Hierarchical]
	F1[Parent: 2048 tokens]
	F2[Child: 512 tokens]
	F3[Retrieval: Child → Parent expansion]
	end

	D --> G[Chunk Metadata]
	E --> G
	F --> G

	G --> H[Embedding Generation]
	```
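
The size-based routing and the fixed-size strategy from the diagram can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names are hypothetical, and tokens are represented as a plain list rather than real tokenizer output.

```python
def select_chunking_strategy(token_count: int) -> str:
    """Map a document's token count to one of the three strategies."""
    if token_count < 50_000:
        return "fixed"
    if token_count <= 500_000:
        return "semantic"
    return "hierarchical"

def fixed_size_chunks(tokens, chunk_size=512, overlap=50):
    """Sliding window: each chunk repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the document
    return chunks

tokens = list(range(1200))          # stand-in for real token IDs
chunks = fixed_size_chunks(tokens)
```

With 1200 tokens, a 512-token window, and a 50-token overlap, the window advances 462 tokens at a time and produces three chunks, the last one shorter than the others.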

	#### 2.2.2 Embedding Pipeline

	```python
	# Embedding Configuration
	import torch

	EMBEDDING_CONFIG = {
	    "model": "BAAI/bge-small-en-v1.5",
	    "dimensions": 384,
	    "batch_size": 32,
	    "normalize": True,  # L2-normalize embeddings
	    "device": "cuda" if torch.cuda.is_available() else "cpu",
	    "max_sequence_length": 512
	}
	```

	| Parameter | Value | Rationale |
	|-----------|-------|-----------|
	| Model | BAAI/bge-small-en-v1.5 | SOTA quality, 62.17 MTEB score |
	| Dimensions | 384 | Optimal speed/accuracy balance |
	| Batch Size | 32 | Memory efficiency on GPU/CPU |
	| Normalization | L2 | Required for cosine similarity |
	| Speed | 1000 docs/sec (CPU) | 10x faster than alternatives |
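
To illustrate why L2 normalization matters, the sketch below shows that after normalization the dot product equals cosine similarity, and that squared L2 distance becomes `2 - 2 * cosine`, so an L2 index ranks unit vectors in the same order as cosine similarity would. The helper names are hypothetical and the vectors are toy two-dimensional values rather than real 384-dimensional embeddings.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return list(vector) if norm == 0 else [x / norm for x in vector]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

cosine = cosine_similarity(a, b)
dot_of_units = dot(a_unit, b_unit)
squared_l2 = sum((x - y) ** 2 for x, y in zip(a_unit, b_unit))
```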

	---

	### 2.3 Storage Module Architecture

	```mermaid
	graph TB
	subgraph "Storage Layer"
	A[FAISS Vector Store]
	B[BM25 Keyword Index]
	C[SQLite Metadata]
	D[LRU Cache<br/>In-Memory]
	end

	subgraph A [Vector Storage Architecture]
	A1[IndexHNSW<br/>Large datasets]
	A2[IndexIVFFlat<br/>Medium datasets]
	A3[IndexFlatL2<br/>Small datasets]
	end

	subgraph B [Keyword Index]
	B1[rank_bm25 Library]
	B2[TF-IDF Weights]
	B3[In-memory Index]
	end

	subgraph C [Metadata Management]
	C1[Document Metadata]
	C2[Chunk Relationships]
	C3[User Sessions]
	C4[RAGAS Evaluations]
	end

	subgraph D [Cache Layer]
	D1[Query Embeddings]
	D2[Frequent Results]
	D3[LRU Eviction]
	end

	A --> E[Hybrid Retrieval]
	B --> E
	C --> E
	D --> E
	```

	#### Vector Store Configuration

	| Index Type | Use Case | Parameters | Performance |
	|------------|----------|------------|-------------|
	| IndexFlatL2 | < 100K vectors | Exact search | O(n) brute force, exact results |
	| IndexIVFFlat | 100K-1M vectors | nprobe: 10-20 | Sub-linear (scans only nprobe of the nlist clusters), approximate, balanced |
	| IndexHNSW | > 1M vectors | M: 16, efConstruction: 40 | Approximately O(log n), approximate, fastest |
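
The selection rule in the table could be expressed as a small lookup. This is a sketch under assumptions: `choose_index` is a hypothetical helper, it returns a plain description rather than constructing a FAISS index, and the `nprobe` value of 10 is simply the low end of the documented range.

```python
def choose_index(num_vectors: int) -> dict:
    """Pick an index type and parameters from the expected number of vectors."""
    if num_vectors < 100_000:
        return {"index": "IndexFlatL2", "params": {}}
    if num_vectors <= 1_000_000:
        return {"index": "IndexIVFFlat", "params": {"nprobe": 10}}
    return {"index": "IndexHNSW", "params": {"M": 16, "efConstruction": 40}}

small = choose_index(20_000)
medium = choose_index(400_000)
large = choose_index(5_000_000)
```

Keeping the thresholds in one function makes it easier to rebuild the index with a different type when a collection grows past a boundary.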

	#### Caching Strategy

	```python
	# LRU Cache Configuration
	CACHE_CONFIG = {
	    "max_size": 1000,       # Maximum cached items
	    "ttl": 3600,            # Time to live (seconds); functools.lru_cache has no TTL, so expiry needs custom logic
	    "eviction": "LRU",      # Least Recently Used
	    "cache_embeddings": True,
	    "cache_results": True
	}
	```

	Benefits:
	- Reduced latency: 80% reduction for repeat queries
	- Resource efficiency: Avoid re-computing embeddings
	- No external dependencies: Pure Python implementation
	- Memory efficient: LRU eviction prevents unbounded growth
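
A cache that honors both `max_size` and `ttl` can be built on `collections.OrderedDict`. The class below is a minimal sketch, not the project's implementation: `TTLLRUCache` and its `clock` parameter are hypothetical names, and the code is not thread-safe.

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """LRU cache whose entries also expire after `ttl` seconds."""

    def __init__(self, max_size=1000, ttl=3600, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl
        self.clock = clock              # injectable so tests can control time
        self._items = OrderedDict()     # key -> (expires_at, value)

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._items[key]        # expired: drop it and report a miss
            return default
        self._items.move_to_end(key)    # mark as most recently used
        return value

    def put(self, key, value):
        self._items[key] = (self.clock() + self.ttl, value)
        self._items.move_to_end(key)
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)   # evict the least recently used entry

    def __len__(self):
        return len(self._items)

cache = TTLLRUCache(max_size=2, ttl=3600)
cache.put("q1", [0.1, 0.2])
cache.put("q2", [0.3, 0.4])
cache.get("q1")                 # touch q1 so q2 becomes the eviction candidate
cache.put("q3", [0.5, 0.6])     # exceeds max_size, so q2 is evicted
```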

	---

	### 2.4 Retrieval Module

	#### 2.4.1 Hybrid Retrieval Pipeline

	```mermaid
	flowchart TD
	A[User Query] --> B[Query Processing]

	B --> C[Vector Embedding]
	B --> D[Keyword Extraction]

	C --> E[FAISS Search<br/>Top-K: 10]
	D --> F[BM25 Search<br/>Top-K: 10]

	E --> G[Reciprocal Rank Fusion]
	F --> G

	G --> H{Reranking Enabled?}

	H -->|Yes| I[Cross-Encoder Reranking]
	H -->|No| J[Final Top-5 Selection]

	I --> J

	J --> K[Context Assembly]
	K --> L[Citation Formatting]
	L --> M[Output: Context + Sources]
	```

	#### 2.4.2 Retrieval Algorithms

	Hybrid Fusion Formula:

	```text
	RRF_score(doc) = vector_weight * (1 / (60 + vector_rank)) + bm25_weight * (1 / (60 + bm25_rank))
	```

	Default Weights:
	- Vector Similarity: 60%
	- BM25 Keyword: 40%
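
The fusion formula with the default weights can be sketched in a few lines. The function name `weighted_rrf` is hypothetical, and two details are assumptions not stated above: ranks are 1-based, and a document missing from one list receives no contribution from that list.

```python
def weighted_rrf(vector_ranking, bm25_ranking,
                 vector_weight=0.6, bm25_weight=0.4, k=60):
    """Fuse two ranked lists of document IDs (best first) into one ranking."""
    scores = {}
    for weight, ranking in ((vector_weight, vector_ranking),
                            (bm25_weight, bm25_ranking)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

fused = weighted_rrf(["a", "b", "c"], ["b", "d", "a"])
fused_order = [doc_id for doc_id, _ in fused]
```

In this example `b` edges out `a` because it ranks second and first in the two lists, while `a` ranks first and third; documents found by only one retriever (`c`, `d`) fall to the bottom.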

	BM25 Parameters:

	```python
	BM25_CONFIG = {
	    "k1": 1.5,       # Term frequency saturation
	    "b": 0.75,       # Length normalization
	    "epsilon": 0.25  # Floor for negative IDF values (scaled by the average IDF)
	}
	```
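
To show what `k1` and `b` control, here is a compact BM25 scorer. It is a sketch for illustration, not the `rank_bm25` implementation: `bm25_scores` is a hypothetical function, and it uses an always-positive IDF variant, so the `epsilon` floor is not needed and not reproduced.

```python
import math
from collections import Counter

def bm25_scores(query_terms, corpus, k1=1.5, b=0.75):
    """Score each tokenized document in `corpus` against `query_terms`."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        # b controls how strongly longer-than-average documents are penalized.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            # k1 controls how quickly repeated occurrences stop adding score.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

corpus = [
    ["faiss", "vector", "search"],
    ["bm25", "keyword", "search", "search"],
    ["ollama", "local", "llm"],
]
scores = bm25_scores(["keyword", "search"], corpus)
```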

	---

	### 2.5 Generation Module

	#### 2.5.1 LLM Integration Architecture

	```mermaid
	graph TB
	subgraph "Ollama Integration"
	A[Ollama Server]
	B[Mistral-7B-Instruct]
	C[LLaMA-2-13B-Chat]
	end

	subgraph "Prompt Engineering"
	D[System Prompt Template]
	E[Context Formatting]
	F[Citation Injection]
	end

	subgraph "Generation Control"
	G[Temperature Controller]
	H[Token Manager]
	I[Streaming Handler]
	end

	A --> J[API Client]
	B --> A
	C --> A

	D --> K[Prompt Assembly]
	E --> K
	F --> K

	G --> L[Generation Parameters]
	H --> L
	I --> L

	K --> M[LLM Request]
	L --> M
	M --> J
	J --> N[Response Processing]
	```

	#### 2.5.2 LLM Configuration

	| Parameter | Default Value | Range | Description |
	|-----------|---------------|-------|-------------|
	| Model | Mistral-7B-Instruct | - | Primary inference model |
	| Temperature | 0.1 | 0.0-1.0 | Response creativity |
	| Max Tokens | 1000 | 100-4000 | Response length limit |
	| Top-P | 0.9 | 0.1-1.0 | Nucleus sampling |
	| Context Window | 32K | - | Mistral model capacity |
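
The defaults and ranges above might be turned into a request body for Ollama's `/api/generate` endpoint as sketched below. The helper names are hypothetical and the model tag `mistral:7b-instruct` is an assumption about how the model is named locally; `temperature`, `top_p`, and `num_predict` are Ollama generation options. The sketch only builds the payload and does not send it.

```python
def clamp(value, low, high):
    """Restrict a value to the closed interval [low, high]."""
    return max(low, min(high, value))

def build_generate_payload(prompt, temperature=0.1, max_tokens=1000,
                           top_p=0.9, stream=True):
    """Build a JSON-serializable body for POST /api/generate."""
    return {
        "model": "mistral:7b-instruct",   # assumed local model tag
        "prompt": prompt,
        "stream": stream,
        "options": {
            "temperature": clamp(temperature, 0.0, 1.0),
            "num_predict": int(clamp(max_tokens, 100, 4000)),  # response length limit
            "top_p": clamp(top_p, 0.1, 1.0),
        },
    }

# Out-of-range values are pulled back into the documented ranges.
payload = build_generate_payload("Summarize the uploaded report.",
                                 temperature=1.7, max_tokens=50)
```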

	---

	### 2.6 RAGAS Evaluation Module

	#### 2.6.1 RAGAS Evaluation Pipeline

	```mermaid
	flowchart LR
	A[Query] --> B[Generated Answer]
	C[Retrieved Context] --> B

	B --> D[RAGAS Evaluator]
	C --> D

	D --> E[Answer Relevancy]
	D --> F[Faithfulness]
	D --> G[Context Utilization]
	D --> H[Context Relevancy]

	E --> I[Metrics Aggregation]
	F --> I
	G --> I
	H --> I

	I --> J[Analytics Dashboard]
	I --> K[SQLite Storage]
	I --> L[Session Statistics]
	```

	#### 2.6.2 Evaluation Metrics

	| Metric | Target | Measurement Method | Importance |
	|--------|--------|-------------------|------------|
	| Answer Relevancy | > 0.85 | LLM-based evaluation | Core user satisfaction |
	| Faithfulness | > 0.90 | Grounded in context check | Prevents hallucinations |
	| Context Utilization | > 0.80 | How well context is used | Generation effectiveness |
	| Context Relevancy | > 0.85 | Retrieved chunks relevance | Retrieval quality |

	Implementation Details:

	```python
	# RAGAS Configuration
	RAGAS_CONFIG = {
	    "enable_ragas": True,
	    "enable_ground_truth": False,
	    "base_metrics": [
	        "answer_relevancy",
	        "faithfulness",
	        "context_utilization",
	        "context_relevancy"
	    ],
	    "ground_truth_metrics": [
	        "context_precision",
	        "context_recall",
	        "answer_similarity",
	        "answer_correctness"
	    ],
	    "evaluation_timeout": 60,
	    "batch_size": 10
	}
	```

	Evaluation Flow:

	1. Automatic Trigger: Every query-response pair is evaluated
	2. Async Processing: Evaluation runs in background (non-blocking)
	3. Storage: Results stored in SQLite for analytics
	4. Aggregation: Session-level statistics computed on-demand
	5. Export: Full evaluation data available for download

	---

	## 3. Data Flow & Workflows

	### 3.1 End-to-End Processing Pipeline

	```mermaid
	sequenceDiagram
	participant U as User
	participant F as Frontend
	participant A as API Gateway
	participant I as Ingestion
	participant P as Processing
	participant S as Storage
	participant R as Retrieval
	participant G as Generation
	participant E as RAGAS Evaluator

	U->>F: Upload Documents
	F->>A: POST /api/upload
	A->>I: Process Input Sources

	Note over I: Parallel Processing
	I->>I: Document Parsing
	I->>P: Extracted Text + Metadata

	P->>P: Adaptive Chunking
	P->>P: Embedding Generation
	P->>S: Store Vectors + Indexes

	S->>F: Processing Complete

	U->>F: Send Query
	F->>A: POST /api/chat

	A->>R: Hybrid Retrieval
	R->>S: Vector + BM25 Search
	S->>R: Top-K Chunks

	R->>G: Context + Query
	G->>G: LLM Generation
	G->>F: Response + Citations

	G->>E: Auto-evaluation (async)
	E->>E: Compute RAGAS Metrics
	E->>S: Store Evaluation Results
	E->>F: Return Metrics
	```

	### 3.2 Real-time Query Processing

	```mermaid
	flowchart TD
	A[User Query] --> B[Query Understanding]
	B --> C[Check Cache]

	C --> D{Cache Hit?}
	D -->|Yes| E[Return Cached Embedding]
	D -->|No| F[Generate Embedding]

	F --> G[Store in Cache]
	E --> H[FAISS Vector Search]
	G --> H

	B --> I[Keyword Extraction]
	I --> J[BM25 Keyword Search]

	H --> K[Reciprocal Rank Fusion]
	J --> K

	K --> L[Top-20 Candidates]
	L --> M{Reranking Enabled?}

	M -->|Yes| N[Cross-Encoder Reranking]
	M -->|No| O[Select Top-5]

	N --> O
	O --> P[Context Assembly]
	P --> Q[LLM Prompt Construction]
	Q --> R[Ollama Generation]
	R --> S[Citation Formatting]
	S --> T[Response Streaming]
	T --> U[User Display]

	R --> V[Async RAGAS Evaluation]
	V --> W[Compute Metrics]
	W --> X[Store Results]
	X --> Y[Update Dashboard]
	```

	---

	## 4. Infrastructure & Deployment

	### 4.1 Container Architecture

	```mermaid
	graph TB
	subgraph "Docker Compose Stack"
	A[Frontend Container<br/>nginx:alpine]
	B[Backend Container<br/>python:3.11]
	C[Ollama Container<br/>ollama/ollama]
	end

	subgraph "External Services"
	D[FAISS Indices<br/>Persistent Volume]
	E[SQLite Database<br/>Persistent Volume]
	F[Log Files<br/>Persistent Volume]
	end

	A --> B
	B --> C
	B --> D
	B --> E
	B --> F
	```

	### 4.2 Resource Requirements

	#### 4.2.1 Minimum Deployment

	| Resource | Specification | Purpose |
	|----------|---------------|---------|
	| CPU | 4 cores | Document processing, embeddings |
	| RAM | 8GB | Model loading, FAISS indices, cache |
	| Storage | 20GB | Models, indices, documents |
	| GPU | Optional | 2-3x speedup for inference |

	#### 4.2.2 Production Deployment

	| Resource | Specification | Purpose |
	|----------|---------------|---------|
	| CPU | 8+ cores | Concurrent processing |
	| RAM | 16GB+ | Larger datasets, caching |
	| GPU | RTX 3090/4090 | 20-30 tokens/sec inference |
	| Storage | 100GB+ SSD | Fast vector search |

	---

	## 5. API Architecture

	### 5.1 REST API Endpoints

	```mermaid
	graph TB
	subgraph "System Management"
	A[GET /api/health]
	B[GET /api/system-info]
	C[GET /api/configuration]
	D[POST /api/configuration]
	end

	subgraph "Document Management"
	E[POST /api/upload]
	F[POST /api/start-processing]
	G[GET /api/processing-status]
	end

	subgraph "Query & Chat"
	H[POST /api/chat]
	I[GET /api/export-chat/:session_id]
	end

	subgraph "RAGAS Evaluation"
	J[GET /api/ragas/history]
	K[GET /api/ragas/statistics]
	L[POST /api/ragas/clear]
	M[GET /api/ragas/export]
	N[GET /api/ragas/config]
	end

	subgraph "Analytics"
	O[GET /api/analytics]
	P[GET /api/analytics/refresh]
	Q[GET /api/analytics/detailed]
	end
	```

	### 5.2 Request/Response Flow

	```python
	# Typical Chat Request Flow with RAGAS
	REQUEST_FLOW = {
	    "authentication": "None (local deployment)",
	    "rate_limiting": "100 requests/minute per IP",
	    "validation": "Query length, session ID format",
	    "processing": "Async with progress tracking",
	    "response": "JSON with citations + metrics + RAGAS scores",
	    "caching": "LRU cache for embeddings",
	    "evaluation": "Automatic RAGAS metrics (async)"
	}
	```

	---

	## 6. Monitoring & Quality Assurance

	### 6.1 RAGAS Integration

	```mermaid
	graph LR
	A[API Gateway] --> B[Query Processing]
	C[Retrieval Module] --> B
	D[Generation Module] --> B

	B --> E[RAGAS Evaluator]

	E --> F[Analytics Dashboard]

	F --> G[Answer Relevancy]
	F --> H[Faithfulness]
	F --> I[Context Utilization]
	F --> J[Context Relevancy]
	F --> K[Session Statistics]
	```

	### 6.2 Key Performance Indicators

	| Category | Metric | Target | Alert Threshold |
	|----------|--------|--------|-----------------|
	| Performance | Query Latency (p95) | < 5s | > 10s |
	| Quality | Answer Relevancy | > 0.85 | < 0.70 |
	| Quality | Faithfulness | > 0.90 | < 0.80 |
	| Quality | Context Utilization | > 0.80 | < 0.65 |
	| Quality | Overall Score | > 0.85 | < 0.70 |
	| Reliability | Uptime | > 99.5% | < 95% |
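
One way to apply these thresholds is a three-state check per metric. The sketch below is an assumption about how the table could be used: the metric keys and `classify` are hypothetical names, and the intermediate "warning" state for values between target and alert threshold is not defined by the table itself.

```python
KPI_THRESHOLDS = {
    # metric: (target, alert_threshold, higher_is_better)
    "query_latency_p95_s": (5.0, 10.0, False),
    "answer_relevancy": (0.85, 0.70, True),
    "faithfulness": (0.90, 0.80, True),
    "context_utilization": (0.80, 0.65, True),
    "overall_score": (0.85, 0.70, True),
    "uptime_pct": (99.5, 95.0, True),
}

def classify(metric, value):
    """Return 'ok', 'warning' (missed target), or 'alert' (crossed alert threshold)."""
    target, alert, higher_is_better = KPI_THRESHOLDS[metric]
    if not higher_is_better:
        # Negate so that "larger is better" holds for every metric.
        value, target, alert = -value, -target, -alert
    if value < alert:
        return "alert"
    if value > target:
        return "ok"
    return "warning"

statuses = {
    "faithfulness": classify("faithfulness", 0.85),
    "query_latency_p95_s": classify("query_latency_p95_s", 12.0),
    "uptime_pct": classify("uptime_pct", 99.9),
}
```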

	### 6.3 Analytics Dashboard Features

	Real-Time Metrics:
	- RAGAS evaluation table with all query-response pairs
	- Session-level aggregate statistics
	- Performance metrics (latency, throughput)
	- Component health status

	Historical Analysis:
	- Quality trend over time
	- Performance degradation detection
	- Cache hit rate monitoring
	- Resource utilization tracking

	Export Capabilities:
	- JSON export of all evaluation data
	- CSV export for external analysis
	- Session-based filtering
	- Time-range queries

	---

	## 7. Technology Stack Details

	### Complete Technology Matrix

	| Layer | Component | Technology | Version | Purpose |
	|-------|-----------|------------|---------|----------|
	| Frontend | UI Framework | HTML5/CSS3/JS | - | Responsive interface |
	| Frontend | Styling | Tailwind CSS | 3.3+ | Utility-first CSS |
	| Frontend | Icons | Font Awesome | 6.0+ | Icon library |
	| Backend | API Framework | FastAPI | 0.104+ | Async REST API |
	| Backend | Python Version | Python | 3.11+ | Runtime |
	| AI/ML | LLM Engine | Ollama | 0.1.20+ | Local LLM inference |
	| AI/ML | Primary Model | Mistral-7B-Instruct | v0.2 | Text generation |
	| AI/ML | Embeddings | sentence-transformers | 2.2.2+ | Vector embeddings |
	| AI/ML | Embedding Model | BAAI/bge-small-en | v1.5 | Semantic search |
	| Vector DB | Storage | FAISS | 1.7.4+ | Vector similarity |
	| Search | Keyword | rank-bm25 | 0.2.1 | BM25 implementation |
	| Evaluation | Quality | Ragas | 0.1.9 | RAG evaluation |
	| Document | PDF | PyPDF2 | 3.0+ | PDF text extraction |
	| Document | Word | python-docx | 1.1+ | DOCX processing |
	| OCR | Text Recognition | EasyOCR | 1.7+ | Scanned documents |
	| Database | Metadata | SQLite | 3.35+ | Local storage |
	| Cache | In-memory | Python functools | - | LRU caching |
	| Deployment | Container | Docker | 24.0+ | Containerization |
	| Deployment | Orchestration | Docker Compose | 2.20+ | Multi-container |

	---

	## 8. Key Architectural Decisions

	### 8.1 Why Local Caching Instead of Redis?

	Decision: Use in-memory LRU cache with Python's `functools.lru_cache`

	Rationale:
	- Simplicity: No external service to manage
	- Performance: Faster access (no network overhead)
	- MVP Focus: Adequate for initial deployment
	- Resource Efficient: No separate cache service to run; cached items live in the application's own memory
	- Easy Migration: Can upgrade to Redis later if needed

	Trade-offs:
	- Cache doesn't persist across restarts
	- Can't share cache across multiple instances
	- Limited by single-process memory

	### 8.2 Why RAGAS for Evaluation?

	Decision: Integrate RAGAS for real-time quality assessment

	Rationale:
	- Automated Metrics: No manual annotation required
	- Production-Ready: Quantifiable quality scores
	- Real-Time: Evaluate every query-response pair
	- Comprehensive: Multiple dimensions of quality
	- Research-Backed: Based on academic research

	Implementation Details:
	- OpenAI API key required for LLM-based metrics
	- Async evaluation to avoid blocking responses
	- SQLite storage for historical analysis
	- Export capability for offline processing

	### 8.3 Why No Web Scraping?

	Decision: Removed web scraping from MVP

	Rationale:
	- Complexity: Anti-scraping mechanisms require maintenance
	- Reliability: Website changes break scrapers
	- Legal: Potential legal/ethical issues
	- Scope: Focus on core RAG functionality first

	Alternative:
	- Users can save web pages as PDFs
	- Future enhancement if market demands it

	---

	## 9. Performance Optimization Strategies

	### 9.1 Embedding Cache Strategy

	```python
	# Cache Implementation
	from functools import lru_cache

	import numpy as np

	@lru_cache(maxsize=1000)
	def get_query_embedding(query: str) -> np.ndarray:
	    """Cache query embeddings for repeat queries."""
	    # `embedder` is the application's embedding client, defined elsewhere.
	    # The cached array is shared between callers, so treat it as read-only.
	    return embedder.embed(query)

	# Benefits:
	# - 80% reduction in latency for repeat queries
	# - No re-computation of identical queries
	# - Automatic LRU eviction
	```

	### 9.2 Batch Processing

	```python
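	# Batch Embedding Generation
	from typing import List

	import numpy as np

	BATCH_SIZE = 32

	def embed_chunks_batch(chunks: List[str]) -> List[np.ndarray]:
	    """Embed chunks in fixed-size batches to bound peak memory."""
	    embeddings = []
	    for i in range(0, len(chunks), BATCH_SIZE):
	        batch = chunks[i:i + BATCH_SIZE]
	        # `embedder` is the application's embedding client, defined elsewhere.
	        batch_embeddings = embedder.embed_batch(batch)
	        embeddings.extend(batch_embeddings)
	    return embeddings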
	```

	### 9.3 Async Processing

	```python
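	# Async Document Processing
	import asyncio
	from pathlib import Path
	from typing import List

	async def process_documents_async(documents: List[Path]):
	    """Process all documents concurrently; results keep the input order."""
	    # `process_single_document` is a coroutine function defined elsewhere.
	    tasks = [process_single_document(doc) for doc in documents]
	    results = await asyncio.gather(*tasks)
	    return results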
	```
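
Parsing and OCR are blocking work, so launching every document at once can stall the event loop or exhaust memory. A possible refinement, sketched under assumptions, bounds concurrency with a semaphore and moves the blocking call to a worker thread; `parse_document`, `process_documents_bounded`, and the limit of 4 are hypothetical.

```python
import asyncio
from pathlib import Path

def parse_document(path: Path) -> str:
    """Stand-in for blocking parsing or OCR work."""
    return f"parsed:{path.name}"

async def process_documents_bounded(documents, max_concurrent=4):
    """Process documents concurrently, at most `max_concurrent` at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(path):
        async with semaphore:
            # Run blocking work in a thread so the event loop stays responsive.
            return await asyncio.to_thread(parse_document, path)

    # return_exceptions=True keeps one bad file from discarding the other results.
    return await asyncio.gather(*(worker(p) for p in documents),
                                return_exceptions=True)

results = asyncio.run(
    process_documents_bounded([Path("a.pdf"), Path("b.docx"), Path("c.txt")])
)
```

Results come back in input order, and a failed document appears as an exception object in its slot instead of aborting the whole batch.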

	---

	## 10. Security Considerations

	### 10.1 Data Privacy

	- On-Premise Processing: All data stays local
	- No External APIs: Except OpenAI for RAGAS (configurable)
	- Local LLM: Ollama runs entirely on-premise
	- Encrypted Storage: Optional SQLite encryption

	### 10.2 Input Validation

	```python
	# File Upload Validation
	from pathlib import Path

	from fastapi import UploadFile

	MAX_FILE_SIZE = 100 * 1024 * 1024  # 100MB
	ALLOWED_EXTENSIONS = {'.pdf', '.docx', '.txt', '.zip'}

	def validate_upload(file: UploadFile):
	    # Check extension (case-insensitive, so "REPORT.PDF" is accepted)
	    if Path(file.filename or "").suffix.lower() not in ALLOWED_EXTENSIONS:
	        raise ValueError("Unsupported file type")

	    # Check size (UploadFile.size may be None when the size is unknown)
	    if file.size is not None and file.size > MAX_FILE_SIZE:
	        raise ValueError("File too large")

	    # Scan for malicious content (optional)
	    # scan_for_malware(file)
	```

	### 10.3 Rate Limiting

	```python
	# Simple rate limiting
	from collections import defaultdict
	from datetime import datetime, timedelta

	from fastapi import HTTPException, Request

	# Maps client IP -> timestamps of its recent requests (in-memory, per process)
	rate_limits = defaultdict(list)

	def check_rate_limit(request: Request, limit: int = 100):
	    # request.client can be None, so fall back to a shared bucket
	    ip = request.client.host if request.client else "unknown"
	    now = datetime.now()

	    # Clean old requests
	    rate_limits[ip] = [
	        ts for ts in rate_limits[ip]
	        if now - ts < timedelta(minutes=1)
	    ]

	    # Check limit
	    if len(rate_limits[ip]) >= limit:
	        raise HTTPException(429, "Rate limit exceeded")

	    rate_limits[ip].append(now)
	```

	---

	## Conclusion

	This document is a technical blueprint for the QuerySphere system. Its modular design and clear separation of concerns keep each component replaceable, and the deployment, monitoring, and security sections cover the main production concerns while leaving room for future enhancements.

	### Key Architectural Strengths

	1. Modularity: Each component is independent and replaceable
	2. Scalability: Horizontal scaling through stateless API design
	3. Performance: Intelligent caching and batch processing
	4. Quality: Real-time RAGAS evaluation for continuous monitoring
	5. Privacy: On-premise processing with a local LLM (the optional RAGAS evaluation calls the OpenAI API)
	6. Simplicity: Minimal external dependencies (no Redis, no web scraping)

	### Future Enhancements

	Short-term:
	- Redis cache for multi-instance deployments
	- Advanced monitoring dashboard
	- User authentication and authorization
	- API rate limiting enhancements

	Long-term:
	- Distributed processing with Celery
	- Web scraping module (optional)
	- Fine-tuned domain-specific embeddings
	- Multi-tenant support
	- Advanced analytics and reporting

	---

	Document Version: 1.0
	Last Updated: November 2025
	Author: Satyaki Mitra

	---

	> This document is part of the QuerySphere technical documentation suite.