QuerySphere - Technical Architecture Document

1. System Overview

1.1 High-Level Architecture

graph TB
    subgraph "Frontend Layer"
        A[Web UI<br/>HTML/CSS/JS]
        B[File Upload<br/>Drag & Drop]
        C[Chat Interface<br/>Real-time]
        D[Analytics Dashboard<br/>RAGAS Metrics]
    end
    
    subgraph "API Gateway"
        E[FastAPI Server<br/>Python 3.11+]
    end
    
    subgraph "Core Processing Engine"
        F[Ingestion Module]
        G[Processing Module]
        H[Retrieval Module]
        I[Generation Module]
        J[Evaluation Module]
    end
    
    subgraph "AI/ML Layer"
        K[Ollama LLM<br/>Mistral-7B]
        L[Embedding Model<br/>BGE-small-en]
        M[FAISS Vector DB]
    end
    
    subgraph "Quality Assurance"
        N[RAGAS Evaluator<br/>Real-time Metrics]
    end
    
    A --> E
    E --> F
    F --> G
    G --> H
    H --> I
    I --> K
    G --> L
    L --> M
    H --> M
    I --> N
    N --> E

1.2 System Characteristics

Aspect	Specification
Architecture Style	Modular Microservices-inspired
Deployment	Docker Containerized
Processing Model	Async/Event-driven
Data Flow	Pipeline-based with Checkpoints
Scalability	Horizontal (Stateless API) + Vertical (GPU)
Caching	In-Memory LRU Cache
Evaluation	Real-time RAGAS Metrics

2. Component Architecture

2.1 Ingestion Module

flowchart TD
    A[User Input] --> B{Input Type Detection}
    
    B -->|PDF/DOCX| D[Document Parser]
    B -->|ZIP| E[Archive Extractor]
    
    subgraph D [Document Processing]
        D1[PyPDF2<br/>PDF Text]
        D2[python-docx<br/>Word Docs]
        D3[EasyOCR<br/>Scanned PDFs]
    end
    
    subgraph E [Archive Handling]
        E1[zipfile<br/>Extraction]
        E2[Recursive Processing]
        E3[Size Validation<br/>2GB Max]
    end
    
    D --> F[Text Cleaning]
    E --> F
    
    F --> G[Encoding Normalization]
    G --> H[Structure Preservation]
    H --> I[Output: Cleaned Text<br/>+ Metadata]

Ingestion Specifications:

Component	Technology	Configuration	Limits
PDF Parser	PyPDF2 + EasyOCR	OCR: English+Multilingual	1000 pages max
Document Parser	python-docx	Preserve formatting	50MB per file
Archive Handler	zipfile	Recursion depth: 5	2GB total, 10k files

2.2 Processing Module

2.2.1 Adaptive Chunking Strategy

flowchart TD
    A[Input Text] --> B[Token Count Analysis]
    B --> C{Document Size}
    
    C -->|<50K tokens| D[Fixed-Size Chunking]
    C -->|50K-500K tokens| E[Semantic Chunking]
    C -->|>500K tokens| F[Hierarchical Chunking]
    
    subgraph D [Strategy 1: Fixed]
        D1[Chunk Size: 512 tokens]
        D2[Overlap: 50 tokens]
        D3[Method: Simple sliding window]
    end
    
    subgraph E [Strategy 2: Semantic]
        E1[Breakpoint: 95th percentile similarity]
        E2[Method: LlamaIndex SemanticSplitter]
        E3[Preserve: Section boundaries]
    end
    
    subgraph F [Strategy 3: Hierarchical]
        F1[Parent: 2048 tokens]
        F2[Child: 512 tokens]
        F3[Retrieval: Child → Parent expansion]
    end
    
    D --> G[Chunk Metadata]
    E --> G
    F --> G
    
    G --> H[Embedding Generation]

2.2.2 Embedding Pipeline

# Embedding Configuration
EMBEDDING_CONFIG = {
    "model": "BAAI/bge-small-en-v1.5",
    "dimensions": 384,
    "batch_size": 32,
    "normalize": True,
    "device": "cuda" if torch.cuda.is_available() else "cpu",
    "max_sequence_length": 512
}

Parameter	Value	Rationale
Model	BAAI/bge-small-en-v1.5	SOTA quality, 62.17 MTEB score
Dimensions	384	Optimal speed/accuracy balance
Batch Size	32	Memory efficiency on GPU/CPU
Normalization	L2	Required for cosine similarity
Speed	1000 docs/sec (CPU)	10x faster than alternatives
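The normalization row deserves a short illustration: once vectors are scaled to unit length, their inner product equals their cosine similarity, which is why normalized embeddings can be compared with a plain dot product. The sketch below uses pure Python for clarity; the helper names are hypothetical, and a real pipeline would typically rely on the embedding library's own normalization option.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
```

For example, `dot(l2_normalize([3, 4]), l2_normalize([4, 3]))` and `cosine([3, 4], [4, 3])` give the same value up to floating-point rounding.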

2.3 Storage Module Architecture

graph TB
    subgraph "Storage Layer"
        A[FAISS Vector Store]
        B[BM25 Keyword Index]
        C[SQLite Metadata]
        D[LRU Cache<br/>In-Memory]
    end
    
    subgraph A [Vector Storage Architecture]
        A1[IndexHNSW<br/>Large datasets]
        A2[IndexIVFFlat<br/>Medium datasets]
        A3[IndexFlatL2<br/>Small datasets]
    end
    
    subgraph B [Keyword Index]
        B1[rank_bm25 Library]
        B2[TF-IDF Weights]
        B3[In-memory Index]
    end
    
    subgraph C [Metadata Management]
        C1[Document Metadata]
        C2[Chunk Relationships]
        C3[User Sessions]
        C4[RAGAS Evaluations]
    end
    
    subgraph D [Cache Layer]
        D1[Query Embeddings]
        D2[Frequent Results]
        D3[LRU Eviction]
    end
    
    A --> E[Hybrid Retrieval]
    B --> E
    C --> E
    D --> E

Vector Store Configuration

Index Type	Use Case	Parameters	Performance
IndexFlatL2	< 100K vectors	Exact search	O(n), High accuracy
IndexIVFFlat	100K-1M vectors	nprobe: 10-20	Sublinear (scans only nprobe clusters), Balanced
IndexHNSW	> 1M vectors	M: 16, efConstruction: 40	O(log n), Fastest
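The size thresholds in the table translate into a simple selection rule. The sketch below returns a plain configuration dictionary rather than building a FAISS index, so it runs without FAISS installed; the function name `choose_index` is hypothetical, the index labels are the ones used in the table, and `nprobe` is set to the lower end of the listed range as an assumption.

```python
from typing import Any, Dict


def choose_index(num_vectors: int) -> Dict[str, Any]:
    """Map a corpus size to the index type and parameters from the table."""
    if num_vectors < 100_000:
        # Exact search: no tuning parameters needed.
        return {"type": "IndexFlatL2", "params": {}}
    if num_vectors <= 1_000_000:
        # Assumption: start at the low end of the 10-20 nprobe range.
        return {"type": "IndexIVFFlat", "params": {"nprobe": 10}}
    return {"type": "IndexHNSW", "params": {"M": 16, "efConstruction": 40}}
```

Keeping the decision in one function makes it straightforward to re-evaluate the index type when a corpus grows past a threshold.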

Caching Strategy

# LRU Cache Configuration
CACHE_CONFIG = {
    "max_size": 1000,        # Maximum cached items
    "ttl": 3600,             # Time to live (seconds)
    "eviction": "LRU",       # Least Recently Used
    "cache_embeddings": True,
    "cache_results": True
}

Benefits:

Reduced latency: 80% reduction for repeat queries
Resource efficiency: Avoid re-computing embeddings
No external dependencies: Pure Python implementation
Memory efficient: LRU eviction prevents unbounded growth
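The configuration above combines a size bound with a time-to-live, which `functools.lru_cache` does not provide on its own. One way to get both in pure Python is sketched below; the class `TTLLRUCache` and its method names are hypothetical, and the injectable `clock` parameter exists only to make the expiry behaviour easy to test.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class TTLLRUCache:
    """LRU cache whose entries also expire after `ttl` seconds."""

    def __init__(self, max_size: int = 1000, ttl: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_size = max_size
        self._ttl = ttl
        self._clock = clock
        # Maps key -> (stored_at, value); order tracks recency of use.
        self._items = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            del self._items[key]          # expired: drop and report a miss
            return None
        self._items.move_to_end(key)      # mark as most recently used
        return value

    def put(self, key: Hashable, value: Any) -> None:
        self._items[key] = (self._clock(), value)
        self._items.move_to_end(key)
        while len(self._items) > self._max_size:
            self._items.popitem(last=False)  # evict least recently used
```

A miss is reported as `None`, so this sketch is not suitable for caching `None` values; it is also not thread-safe without an external lock.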

2.4 Retrieval Module

2.4.1 Hybrid Retrieval Pipeline

flowchart TD
    A[User Query] --> B[Query Processing]
    
    B --> C[Vector Embedding]
    B --> D[Keyword Extraction]
    
    C --> E[FAISS Search<br/>Top-K: 10]
    D --> F[BM25 Search<br/>Top-K: 10]
    
    E --> G[Reciprocal Rank Fusion]
    F --> G
    
    G --> H{Reranking Enabled?}
    
    H -->|Yes| I[Cross-Encoder Reranking]
    H -->|No| J[Final Top-5 Selection]
    
    I --> J
    
    J --> K[Context Assembly]
    K --> L[Citation Formatting]
    L --> M[Output: Context + Sources]

2.4.2 Retrieval Algorithms

Hybrid Fusion Formula:

RRF_score(doc) = vector_weight * (1 / (60 + vector_rank)) + bm25_weight * (1 / (60 + bm25_rank))

Default Weights:

Vector Similarity: 60%
BM25 Keyword: 40%
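The fusion formula and default weights can be sketched in a few lines. This example assumes ranks are 1-based and that a document missing from one result list contributes nothing for that list; the function name `weighted_rrf` is hypothetical.

```python
from typing import Dict, List, Tuple

RRF_K = 60  # constant from the fusion formula


def weighted_rrf(vector_ranking: List[str],
                 bm25_ranking: List[str],
                 vector_weight: float = 0.6,
                 bm25_weight: float = 0.4,
                 k: int = RRF_K) -> List[Tuple[str, float]]:
    """Fuse two ranked lists of document IDs into (doc_id, score) pairs."""
    scores: Dict[str, float] = {}
    for weight, ranking in ((vector_weight, vector_ranking),
                            (bm25_weight, bm25_ranking)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because the score depends only on rank positions, the vector and BM25 scores never need to be put on a common scale, which is the main reason to fuse by rank rather than by raw score.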

BM25 Parameters:

BM25_CONFIG = {
    "k1": 1.5,      # Term frequency saturation
    "b": 0.75,      # Length normalization
    "epsilon": 0.25  # Floor factor for terms with negative IDF
}
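To show how these three parameters interact, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not the rank_bm25 implementation; in particular, the handling of `epsilon` (flooring negative IDF values at epsilon times the average IDF) is an assumption about how that parameter is used, and the function name `bm25_scores` is hypothetical. The corpus must be non-empty.

```python
import math
from collections import Counter
from typing import Dict, List

K1, B, EPSILON = 1.5, 0.75, 0.25  # values from BM25_CONFIG


def bm25_scores(query: List[str], corpus: List[List[str]]) -> List[float]:
    """Score every tokenized document in `corpus` against `query`."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    idf: Dict[str, float] = {
        term: math.log(n_docs - df + 0.5) - math.log(df + 0.5)
        for term, df in doc_freq.items()
    }
    # Assumption: negative IDF values are floored at epsilon * average IDF.
    floor = EPSILON * (sum(idf.values()) / len(idf))
    idf = {term: (value if value >= 0 else floor) for term, value in idf.items()}

    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        # b controls how strongly long documents are penalized.
        length_norm = K1 * (1 - B + B * len(doc) / avg_len)
        score = 0.0
        for term in query:
            tf = term_freq.get(term, 0)
            # k1 controls how quickly repeated terms stop adding score.
            score += idf.get(term, 0.0) * tf * (K1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores
```

Documents that contain none of the query terms score exactly zero, so they can be dropped before fusion with the vector results.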

2.5 Generation Module

2.5.1 LLM Integration Architecture

graph TB
    subgraph "Ollama Integration"
        A[Ollama Server]
        B[Mistral-7B-Instruct]
        C[LLaMA-2-13B-Chat]
    end
    
    subgraph "Prompt Engineering"
        D[System Prompt Template]
        E[Context Formatting]
        F[Citation Injection]
    end
    
    subgraph "Generation Control"
        G[Temperature Controller]
        H[Token Manager]
        I[Streaming Handler]
    end
    
    A --> J[API Client]
    B --> A
    C --> A
    
    D --> K[Prompt Assembly]
    E --> K
    F --> K
    
    G --> L[Generation Parameters]
    H --> L
    I --> L
    
    K --> M[LLM Request]
    L --> M
    M --> J
    J --> N[Response Processing]

2.5.2 LLM Configuration

Parameter	Default Value	Range	Description
Model	Mistral-7B-Instruct	-	Primary inference model
Temperature	0.1	0.0-1.0	Response creativity
Max Tokens	1000	100-4000	Response length limit
Top-P	0.9	0.1-1.0	Nucleus sampling
Context Window	32K	-	Mistral model capacity

2.6 RAGAS Evaluation Module

2.6.1 RAGAS Evaluation Pipeline

flowchart LR
    A[Query] --> B[Generated Answer]
    C[Retrieved Context] --> B
    
    B --> D[RAGAS Evaluator]
    C --> D
    
    D --> E[Answer Relevancy]
    D --> F[Faithfulness]
    D --> G[Context Utilization]
    D --> H[Context Relevancy]
    
    E --> I[Metrics Aggregation]
    F --> I
    G --> I
    H --> I
    
    I --> J[Analytics Dashboard]
    I --> K[SQLite Storage]
    I --> L[Session Statistics]

2.6.2 Evaluation Metrics

Metric	Target	Measurement Method	Importance
Answer Relevancy	> 0.85	LLM-based evaluation	Core user satisfaction
Faithfulness	> 0.90	Grounded in context check	Prevents hallucinations
Context Utilization	> 0.80	How well context is used	Generation effectiveness
Context Relevancy	> 0.85	Retrieved chunks relevance	Retrieval quality

Implementation Details:

# RAGAS Configuration
RAGAS_CONFIG = {
    "enable_ragas": True,
    "enable_ground_truth": False,
    "base_metrics": [
        "answer_relevancy",
        "faithfulness",
        "context_utilization",
        "context_relevancy"
    ],
    "ground_truth_metrics": [
        "context_precision",
        "context_recall",
        "answer_similarity",
        "answer_correctness"
    ],
    "evaluation_timeout": 60,
    "batch_size": 10
}

Evaluation Flow:

Automatic Trigger: Every query-response pair is evaluated
Async Processing: Evaluation runs in background (non-blocking)
Storage: Results stored in SQLite for analytics
Aggregation: Session-level statistics computed on-demand
Export: Full evaluation data available for download
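The storage and aggregation steps of this flow can be sketched with the standard `sqlite3` module. The table name, column layout, and function names below are hypothetical and chosen only for illustration; the actual QuerySphere schema may differ.

```python
import sqlite3
from typing import Dict, Optional

METRICS = ("answer_relevancy", "faithfulness",
           "context_utilization", "context_relevancy")


def init_db(conn: sqlite3.Connection) -> None:
    """Create the evaluations table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ragas_evaluations ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, session_id TEXT NOT NULL, "
        "answer_relevancy REAL, faithfulness REAL, "
        "context_utilization REAL, context_relevancy REAL)"
    )


def store_evaluation(conn: sqlite3.Connection, session_id: str,
                     scores: Dict[str, float]) -> None:
    """Persist the four base metrics for one query-response pair."""
    conn.execute(
        "INSERT INTO ragas_evaluations (session_id, answer_relevancy, "
        "faithfulness, context_utilization, context_relevancy) "
        "VALUES (?, ?, ?, ?, ?)",
        (session_id, *(scores[m] for m in METRICS)),
    )
    conn.commit()


def session_statistics(conn: sqlite3.Connection,
                       session_id: str) -> Dict[str, Optional[float]]:
    """Compute per-metric averages on demand (None when the session is empty)."""
    row = conn.execute(
        "SELECT AVG(answer_relevancy), AVG(faithfulness), "
        "AVG(context_utilization), AVG(context_relevancy) "
        "FROM ragas_evaluations WHERE session_id = ?",
        (session_id,),
    ).fetchone()
    return dict(zip(METRICS, row))
```

Computing the averages in SQL at read time matches the on-demand aggregation described above and avoids keeping running totals in application memory.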

3. Data Flow & Workflows

3.1 End-to-End Processing Pipeline

sequenceDiagram
    participant U as User
    participant F as Frontend
    participant A as API Gateway
    participant I as Ingestion
    participant P as Processing
    participant S as Storage
    participant R as Retrieval
    participant G as Generation
    participant E as RAGAS Evaluator
    
    U->>F: Upload Documents
    F->>A: POST /api/upload
    A->>I: Process Input Sources
    
    Note over I: Parallel Processing
    I->>I: Document Parsing
    I->>P: Extracted Text + Metadata
    
    P->>P: Adaptive Chunking
    P->>P: Embedding Generation
    P->>S: Store Vectors + Indexes
    
    S->>F: Processing Complete
    
    U->>F: Send Query
    F->>A: POST /api/chat
    
    A->>R: Hybrid Retrieval
    R->>S: Vector + BM25 Search
    S->>R: Top-K Chunks
    
    R->>G: Context + Query
    G->>G: LLM Generation
    G->>F: Response + Citations
    
    G->>E: Auto-evaluation (async)
    E->>E: Compute RAGAS Metrics
    E->>S: Store Evaluation Results
    E->>F: Return Metrics

3.2 Real-time Query Processing

flowchart TD
    A[User Query] --> B[Query Understanding]
    B --> C[Check Cache]
    
    C --> D{Cache Hit?}
    D -->|Yes| E[Return Cached Embedding]
    D -->|No| F[Generate Embedding]
    
    F --> G[Store in Cache]
    E --> H[FAISS Vector Search]
    G --> H
    
    B --> I[Keyword Extraction]
    I --> J[BM25 Keyword Search]
    
    H --> K[Reciprocal Rank Fusion]
    J --> K
    
    K --> L[Top-20 Candidates]
    L --> M{Reranking Enabled?}
    
    M -->|Yes| N[Cross-Encoder Reranking]
    M -->|No| O[Select Top-5]
    
    N --> O
    O --> P[Context Assembly]
    P --> Q[LLM Prompt Construction]
    Q --> R[Ollama Generation]
    R --> S[Citation Formatting]
    S --> T[Response Streaming]
    T --> U[User Display]
    
    R --> V[Async RAGAS Evaluation]
    V --> W[Compute Metrics]
    W --> X[Store Results]
    X --> Y[Update Dashboard]

4. Infrastructure & Deployment

4.1 Container Architecture

graph TB
    subgraph "Docker Compose Stack"
        A[Frontend Container<br/>nginx:alpine]
        B[Backend Container<br/>python:3.11]
        C[Ollama Container<br/>ollama/ollama]
    end
    
    subgraph "External Services"
        D[FAISS Indices<br/>Persistent Volume]
        E[SQLite Database<br/>Persistent Volume]
        F[Log Files<br/>Persistent Volume]
    end
    
    A --> B
    B --> C
    B --> D
    B --> E
    B --> F

4.2 Resource Requirements

4.2.1 Minimum Deployment

Resource	Specification	Purpose
CPU	4 cores	Document processing, embeddings
RAM	8GB	Model loading, FAISS indices, cache
Storage	20GB	Models, indices, documents
GPU	Optional	2-3x speedup for inference

4.2.2 Production Deployment

Resource	Specification	Purpose
CPU	8+ cores	Concurrent processing
RAM	16GB+	Larger datasets, caching
GPU	RTX 3090/4090	20-30 tokens/sec inference
Storage	100GB+ SSD	Fast vector search

5. API Architecture

5.1 REST API Endpoints

graph TB
    subgraph "System Management"
        A[GET /api/health]
        B[GET /api/system-info]
        C[GET /api/configuration]
        D[POST /api/configuration]
    end
    
    subgraph "Document Management"
        E[POST /api/upload]
        F[POST /api/start-processing]
        G[GET /api/processing-status]
    end
    
    subgraph "Query & Chat"
        H[POST /api/chat]
        I[GET /api/export-chat/:session_id]
    end
    
    subgraph "RAGAS Evaluation"
        J[GET /api/ragas/history]
        K[GET /api/ragas/statistics]
        L[POST /api/ragas/clear]
        M[GET /api/ragas/export]
        N[GET /api/ragas/config]
    end
    
    subgraph "Analytics"
        O[GET /api/analytics]
        P[GET /api/analytics/refresh]
        Q[GET /api/analytics/detailed]
    end

5.2 Request/Response Flow

# Typical Chat Request Flow with RAGAS
REQUEST_FLOW = {
    "authentication": "None (local deployment)",
    "rate_limiting": "100 requests/minute per IP",
    "validation": "Query length, session ID format",
    "processing": "Async with progress tracking",
    "response": "JSON with citations + metrics + RAGAS scores",
    "caching": "LRU cache for embeddings",
    "evaluation": "Automatic RAGAS metrics (async)"
}

6. Monitoring & Quality Assurance

6.1 RAGAS Integration

graph LR
    A[API Gateway] --> B[Query Processing]
    C[Retrieval Module] --> B
    D[Generation Module] --> B
    
    B --> E[RAGAS Evaluator]
    
    E --> F[Analytics Dashboard]
    
    F --> G[Answer Relevancy]
    F --> H[Faithfulness]
    F --> I[Context Utilization]
    F --> J[Context Relevancy]
    F --> K[Session Statistics]

6.2 Key Performance Indicators

Category	Metric	Target	Alert Threshold
Performance	Query Latency (p95)	< 5s	> 10s
Quality	Answer Relevancy	> 0.85	< 0.70
Quality	Faithfulness	> 0.90	< 0.80
Quality	Context Utilization	> 0.80	< 0.65
Quality	Overall Score	> 0.85	< 0.70
Reliability	Uptime	> 99.5%	< 95%

6.3 Analytics Dashboard Features

Real-Time Metrics:

RAGAS evaluation table with all query-response pairs
Session-level aggregate statistics
Performance metrics (latency, throughput)
Component health status

Historical Analysis:

Quality trend over time
Performance degradation detection
Cache hit rate monitoring
Resource utilization tracking

Export Capabilities:

JSON export of all evaluation data
CSV export for external analysis
Session-based filtering
Time-range queries

7. Technology Stack Details

Complete Technology Matrix

Layer	Component	Technology	Version	Purpose
Frontend	UI Framework	HTML5/CSS3/JS	-	Responsive interface
Frontend	Styling	Tailwind CSS	3.3+	Utility-first CSS
Frontend	Icons	Font Awesome	6.0+	Icon library
Backend	API Framework	FastAPI	0.104+	Async REST API
Backend	Python Version	Python	3.11+	Runtime
AI/ML	LLM Engine	Ollama	0.1.20+	Local LLM inference
AI/ML	Primary Model	Mistral-7B-Instruct	v0.2	Text generation
AI/ML	Embeddings	sentence-transformers	2.2.2+	Vector embeddings
AI/ML	Embedding Model	BAAI/bge-small-en	v1.5	Semantic search
Vector DB	Storage	FAISS	1.7.4+	Vector similarity
Search	Keyword	rank-bm25	0.2.1	BM25 implementation
Evaluation	Quality	Ragas	0.1.9	RAG evaluation
Document	PDF	PyPDF2	3.0+	PDF text extraction
Document	Word	python-docx	1.1+	DOCX processing
OCR	Text Recognition	EasyOCR	1.7+	Scanned documents
Database	Metadata	SQLite	3.35+	Local storage
Cache	In-memory	Python functools	-	LRU caching
Deployment	Container	Docker	24.0+	Containerization
Deployment	Orchestration	Docker Compose	2.20+	Multi-container

8. Key Architectural Decisions

8.1 Why Local Caching Instead of Redis?

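Decision: Use in-memory LRU cache with Python's functools.lru_cache. Note that lru_cache has no built-in expiry, so the ttl setting in CACHE_CONFIG requires an additional mechanism.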

Rationale:

Simplicity: No external service to manage
Performance: Faster access (no network overhead)
MVP Focus: Adequate for initial deployment
Resource Efficient: No separate cache service to provision or run
Easy Migration: Can upgrade to Redis later if needed

Trade-offs:

Cache doesn't persist across restarts
Can't share cache across multiple instances
Limited by single-process memory

8.2 Why RAGAS for Evaluation?

Decision: Integrate RAGAS for real-time quality assessment

Rationale:

Automated Metrics: No manual annotation required
Production-Ready: Quantifiable quality scores
Real-Time: Evaluate every query-response pair
Comprehensive: Multiple dimensions of quality
Research-Backed: Based on academic research

Implementation Details:

OpenAI API key required for LLM-based metrics
Async evaluation to avoid blocking responses
SQLite storage for historical analysis
Export capability for offline processing

8.3 Why No Web Scraping?

Decision: Removed web scraping from MVP

Rationale:

Complexity: Anti-scraping mechanisms require maintenance
Reliability: Website changes break scrapers
Legal: Potential legal/ethical issues
Scope: Focus on core RAG functionality first

Alternative:

Users can save web pages as PDFs
Future enhancement if market demands it

9. Performance Optimization Strategies

9.1 Embedding Cache Strategy

# Cache Implementation
from functools import lru_cache

import numpy as np

# `embedder` is the application's embedding client, initialized elsewhere.
@lru_cache(maxsize=1000)
def get_query_embedding(query: str) -> np.ndarray:
    """Cache query embeddings for repeat queries.

    Callers share the cached array, so treat it as read-only.
    """
    return embedder.embed(query)

# Benefits:
# - 80% reduction in latency for repeat queries
# - No re-computation of identical queries
# - Automatic LRU eviction

9.2 Batch Processing

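# Batch Embedding Generation
from typing import List

import numpy as np

BATCH_SIZE = 32

# `embedder` is the application's embedding client, initialized elsewhere.
def embed_chunks_batch(chunks: List[str]) -> List[np.ndarray]:
    embeddings = []
    # Embed BATCH_SIZE chunks at a time to bound peak memory use
    for i in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[i:i+BATCH_SIZE]
        batch_embeddings = embedder.embed_batch(batch)
        embeddings.extend(batch_embeddings)
    return embeddings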

9.3 Async Processing

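# Async Document Processing
import asyncio
from pathlib import Path
from typing import List

# `process_single_document` must be a coroutine function defined elsewhere.
async def process_documents_async(documents: List[Path]):
    tasks = [process_single_document(doc) for doc in documents]
    # gather preserves input order in the returned results
    results = await asyncio.gather(*tasks)
    return results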
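Gathering every document at once starts all of them concurrently, which can exhaust memory on large uploads. A bounded variant is sketched below as one possible refinement; the function name, the `process` callable, and the default limit of 4 are assumptions, and `asyncio.to_thread` is used so that blocking parsing work does not stall the event loop.

```python
import asyncio
from pathlib import Path
from typing import Callable, List


async def process_documents_bounded(documents: List[Path],
                                    process: Callable[[Path], str],
                                    max_concurrency: int = 4) -> List[str]:
    """Run a blocking `process` function over documents, a few at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(doc: Path) -> str:
        async with semaphore:
            # Run the blocking call in a worker thread.
            return await asyncio.to_thread(process, doc)

    # gather returns results in the same order as the input documents.
    return await asyncio.gather(*(run_one(doc) for doc in documents))
```

A call such as `asyncio.run(process_documents_bounded(paths, parse_document))` would then process at most four documents at a time, where `parse_document` stands in for whatever parsing function the ingestion module provides.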

10. Security Considerations

10.1 Data Privacy

On-Premise Processing: Documents, indices, and generation stay local
No External APIs: Except OpenAI for RAGAS (configurable); when enabled, evaluation inputs are sent to OpenAI
Local LLM: Ollama runs entirely on-premise
Encrypted Storage: Optional SQLite encryption

10.2 Input Validation

# File Upload Validation
from pathlib import Path

from fastapi import UploadFile

MAX_FILE_SIZE = 100 * 1024 * 1024  # 100MB
ALLOWED_EXTENSIONS = {'.pdf', '.docx', '.txt', '.zip'}

def validate_upload(file: UploadFile):
    # Check extension (case-insensitive, so "REPORT.PDF" is accepted)
    if Path(file.filename or "").suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError("Unsupported file type")
    
    # Check size (file.size can be None when the length is unknown)
    if file.size is None or file.size > MAX_FILE_SIZE:
        raise ValueError("File too large or size unknown")
    
    # Scan for malicious content (optional)
    # scan_for_malware(file)

10.3 Rate Limiting

# Simple rate limiting (in-memory, per process)
from collections import defaultdict
from datetime import datetime, timedelta

from fastapi import HTTPException, Request

rate_limits = defaultdict(list)

def check_rate_limit(request: Request, limit: int = 100):
    # request.client can be None, so fall back to a shared bucket
    ip = request.client.host if request.client else "unknown"
    now = datetime.now()
    
    # Clean old requests
    rate_limits[ip] = [
        ts for ts in rate_limits[ip] 
        if now - ts < timedelta(minutes=1)
    ]
    
    # Check limit
    if len(rate_limits[ip]) >= limit:
        raise HTTPException(429, "Rate limit exceeded")
    
    rate_limits[ip].append(now)

Conclusion

This document describes the technical architecture of the QuerySphere system. Its modular design and clear separation of concerns keep each stage of the pipeline independently replaceable, and the deployment, monitoring, and security considerations above are intended to support production use while leaving room for future enhancements.

Key Architectural Strengths

Modularity: Each component is independent and replaceable
Scalability: Horizontal scaling through stateless API design
Performance: Intelligent caching and batch processing
Quality: Real-time RAGAS evaluation for continuous monitoring
Privacy: On-premise processing with a local LLM (RAGAS evaluation via OpenAI is the one configurable exception)
Simplicity: Minimal external dependencies (no Redis, no web scraping)

Future Enhancements

Short-term:

Redis cache for multi-instance deployments
Advanced monitoring dashboard
User authentication and authorization
API rate limiting enhancements

Long-term:

Distributed processing with Celery
Web scraping module (optional)
Fine-tuned domain-specific embeddings
Multi-tenant support
Advanced analytics and reporting

Document Version: 1.0
Last Updated: November 2025
Author: Satyaki Mitra

This document is part of the QuerySphere technical documentation suite.