# QuerySphere - Technical Architecture Document

## 1. System Overview

### 1.1 High-Level Architecture

```mermaid
graph TB
    subgraph "Frontend Layer"
        A[Web UI<br/>HTML/CSS/JS]
        B[File Upload<br/>Drag & Drop]
        C[Chat Interface<br/>Real-time]
        D[Analytics Dashboard<br/>RAGAS Metrics]
    end
    subgraph "API Gateway"
        E[FastAPI Server<br/>Python 3.11+]
    end
    subgraph "Core Processing Engine"
        F[Ingestion Module]
        G[Processing Module]
        H[Retrieval Module]
        I[Generation Module]
        J[Evaluation Module]
    end
    subgraph "AI/ML Layer"
        K[Ollama LLM<br/>Mistral-7B]
        L[Embedding Model<br/>BGE-small-en]
        M[FAISS Vector DB]
    end
    subgraph "Quality Assurance"
        N[RAGAS Evaluator<br/>Real-time Metrics]
    end
    A --> E
    E --> F
    F --> G
    G --> H
    H --> I
    I --> K
    G --> L
    L --> M
    H --> M
    I --> N
    N --> E
```
### 1.2 System Characteristics

| Aspect | Specification |
|--------|---------------|
| **Architecture Style** | Modular, microservices-inspired |
| **Deployment** | Docker containerized |
| **Processing Model** | Async/event-driven |
| **Data Flow** | Pipeline-based with checkpoints |
| **Scalability** | Horizontal (stateless API) + vertical (GPU) |
| **Caching** | In-memory LRU cache |
| **Evaluation** | Real-time RAGAS metrics |

---
## 2. Component Architecture

### 2.1 Ingestion Module

```mermaid
flowchart TD
    A[User Input] --> B{Input Type Detection}
    B -->|PDF/DOCX| D[Document Parser]
    B -->|ZIP| E[Archive Extractor]
    subgraph D [Document Processing]
        D1[PyPDF2<br/>PDF Text]
        D2[python-docx<br/>Word Docs]
        D3[EasyOCR<br/>Scanned PDFs]
    end
    subgraph E [Archive Handling]
        E1[zipfile<br/>Extraction]
        E2[Recursive Processing]
        E3[Size Validation<br/>2GB Max]
    end
    D --> F[Text Cleaning]
    E --> F
    F --> G[Encoding Normalization]
    G --> H[Structure Preservation]
    H --> I[Output: Cleaned Text<br/>+ Metadata]
```
#### Ingestion Specifications

| Component | Technology | Configuration | Limits |
|-----------|------------|---------------|--------|
| **PDF Parser** | PyPDF2 + EasyOCR | OCR: English + multilingual | 1000 pages max |
| **Document Parser** | python-docx | Preserve formatting | 50MB per file |
| **Archive Handler** | zipfile | Recursion depth: 5 | 2GB total, 10k files |
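The archive limits above can be enforced before extraction with a small guard over the ZIP's table of contents. The following is a sketch using the standard `zipfile` module; the constant names and exact checks are illustrative, not the project's actual code:

```python
import zipfile

# Illustrative limits mirroring the table above
MAX_TOTAL_SIZE = 2 * 1024**3   # 2GB total uncompressed
MAX_FILE_COUNT = 10_000

def validate_archive(path: str) -> None:
    """Reject archives that exceed the size or file-count limits before extraction."""
    with zipfile.ZipFile(path) as zf:
        infos = zf.infolist()
        if len(infos) > MAX_FILE_COUNT:
            raise ValueError(f"Archive has {len(infos)} entries (max {MAX_FILE_COUNT})")
        total = sum(info.file_size for info in infos)
        if total > MAX_TOTAL_SIZE:
            raise ValueError(f"Uncompressed size {total} exceeds {MAX_TOTAL_SIZE} bytes")
```

Checking `file_size` from the central directory (rather than extracting first) also guards against zip-bomb style inputs.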
### 2.2 Processing Module

#### 2.2.1 Adaptive Chunking Strategy

```mermaid
flowchart TD
    A[Input Text] --> B[Token Count Analysis]
    B --> C{Document Size}
    C -->|<50K tokens| D[Fixed-Size Chunking]
    C -->|50K-500K tokens| E[Semantic Chunking]
    C -->|>500K tokens| F[Hierarchical Chunking]
    subgraph D [Strategy 1: Fixed]
        D1[Chunk Size: 512 tokens]
        D2[Overlap: 50 tokens]
        D3[Method: Simple sliding window]
    end
    subgraph E [Strategy 2: Semantic]
        E1[Breakpoint: 95th percentile similarity]
        E2[Method: LlamaIndex SemanticSplitter]
        E3[Preserve: Section boundaries]
    end
    subgraph F [Strategy 3: Hierarchical]
        F1[Parent: 2048 tokens]
        F2[Child: 512 tokens]
        F3[Retrieval: Child → Parent expansion]
    end
    D --> G[Chunk Metadata]
    E --> G
    F --> G
    G --> H[Embedding Generation]
```
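The size-based dispatch above can be sketched as a simple selector; the thresholds come from the flowchart, while the returned parameter dicts are illustrative:

```python
def select_chunking_strategy(token_count: int) -> dict:
    """Pick chunking parameters from document size, per the flowchart above."""
    if token_count < 50_000:
        return {"strategy": "fixed", "chunk_size": 512, "overlap": 50}
    if token_count <= 500_000:
        return {"strategy": "semantic", "breakpoint_percentile": 95}
    return {"strategy": "hierarchical", "parent_size": 2048, "child_size": 512}
```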
#### 2.2.2 Embedding Pipeline

```python
import torch

# Embedding Configuration
EMBEDDING_CONFIG = {
    "model": "BAAI/bge-small-en-v1.5",
    "dimensions": 384,
    "batch_size": 32,
    "normalize": True,
    "device": "cuda" if torch.cuda.is_available() else "cpu",
    "max_sequence_length": 512
}
```
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| **Model** | BAAI/bge-small-en-v1.5 | SOTA quality, 62.17 MTEB score |
| **Dimensions** | 384 | Optimal speed/accuracy balance |
| **Batch Size** | 32 | Memory efficiency on GPU/CPU |
| **Normalization** | L2 | Required for cosine similarity |
| **Speed** | ~1000 docs/sec (CPU) | ~10x faster than alternatives |
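The L2-normalization requirement in the table follows from a simple identity: for unit-length vectors, the dot product equals cosine similarity, so an inner-product index can stand in for cosine search. A quick numpy check:

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    return v / np.linalg.norm(v)

a_raw, b_raw = np.array([3.0, 4.0]), np.array([4.0, 3.0])
cosine = float(a_raw @ b_raw) / float(np.linalg.norm(a_raw) * np.linalg.norm(b_raw))
dot_normalized = float(l2_normalize(a_raw) @ l2_normalize(b_raw))
# Both quantities agree (0.96 for this pair)
```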
---
### 2.3 Storage Module Architecture

```mermaid
graph TB
    subgraph "Storage Layer"
        A[FAISS Vector Store]
        B[BM25 Keyword Index]
        C[SQLite Metadata]
        D[LRU Cache<br/>In-Memory]
    end
    subgraph A [Vector Storage Architecture]
        A1[IndexHNSW<br/>Large datasets]
        A2[IndexIVFFlat<br/>Medium datasets]
        A3[IndexFlatL2<br/>Small datasets]
    end
    subgraph B [Keyword Index]
        B1[rank_bm25 Library]
        B2[TF-IDF Weights]
        B3[In-memory Index]
    end
    subgraph C [Metadata Management]
        C1[Document Metadata]
        C2[Chunk Relationships]
        C3[User Sessions]
        C4[RAGAS Evaluations]
    end
    subgraph D [Cache Layer]
        D1[Query Embeddings]
        D2[Frequent Results]
        D3[LRU Eviction]
    end
    A --> E[Hybrid Retrieval]
    B --> E
    C --> E
    D --> E
```
#### Vector Store Configuration

| Index Type | Use Case | Parameters | Performance |
|------------|----------|------------|-------------|
| **IndexFlatL2** | < 100K vectors | Exact search | O(n), highest accuracy |
| **IndexIVFFlat** | 100K-1M vectors | nprobe: 10-20 | Sub-linear (probes a subset of clusters), balanced |
| **IndexHNSW** | > 1M vectors | M: 16, efConstruction: 40 | ~O(log n), fastest |
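The table's size thresholds translate into a small dispatcher; this sketch only returns the index type name (the real system presumably constructs the corresponding `faiss` index object with the listed parameters):

```python
def choose_faiss_index(num_vectors: int) -> str:
    """Map corpus size to a FAISS index type, following the table above."""
    if num_vectors < 100_000:
        return "IndexFlatL2"    # exact search is affordable at small scale
    if num_vectors <= 1_000_000:
        return "IndexIVFFlat"   # clustered search, probe 10-20 lists
    return "IndexHNSW"          # graph-based ANN, best at large scale
```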
#### Caching Strategy

```python
# LRU Cache Configuration
CACHE_CONFIG = {
    "max_size": 1000,         # Maximum cached items
    "ttl": 3600,              # Time to live (seconds)
    "eviction": "LRU",        # Least Recently Used
    "cache_embeddings": True,
    "cache_results": True
}
```
**Benefits:**
- **Reduced latency**: ~80% reduction for repeat queries
- **Resource efficiency**: Avoids re-computing embeddings
- **No external dependencies**: Pure Python implementation
- **Memory efficient**: LRU eviction prevents unbounded growth

---
### 2.4 Retrieval Module

#### 2.4.1 Hybrid Retrieval Pipeline

```mermaid
flowchart TD
    A[User Query] --> B[Query Processing]
    B --> C[Vector Embedding]
    B --> D[Keyword Extraction]
    C --> E[FAISS Search<br/>Top-K: 10]
    D --> F[BM25 Search<br/>Top-K: 10]
    E --> G[Reciprocal Rank Fusion]
    F --> G
    G --> H{Reranking Enabled?}
    H -->|Yes| I[Cross-Encoder Reranking]
    H -->|No| J[Final Top-5 Selection]
    I --> J
    J --> K[Context Assembly]
    K --> L[Citation Formatting]
    L --> M[Output: Context + Sources]
```
#### 2.4.2 Retrieval Algorithms

**Hybrid Fusion Formula:**

```text
RRF_score(doc) = vector_weight * (1 / (60 + vector_rank)) + bm25_weight * (1 / (60 + bm25_rank))
```

**Default Weights:**
- Vector similarity: 60%
- BM25 keyword: 40%

**BM25 Parameters:**

```python
BM25_CONFIG = {
    "k1": 1.5,       # Term frequency saturation
    "b": 0.75,       # Length normalization
    "epsilon": 0.25  # Smoothing factor
}
```
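Putting the fusion formula and the default weights together, a plain-Python RRF implementation might look like this (the `k=60` constant comes from the formula above; the function and argument names are illustrative):

```python
def rrf_fuse(vector_ranking, bm25_ranking,
             vector_weight=0.6, bm25_weight=0.4, k=60):
    """Reciprocal Rank Fusion over two ranked lists of doc IDs (rank 1 = best)."""
    scores = {}
    for rank, doc in enumerate(vector_ranking, start=1):
        scores[doc] = scores.get(doc, 0.0) + vector_weight / (k + rank)
    for rank, doc in enumerate(bm25_ranking, start=1):
        scores[doc] = scores.get(doc, 0.0) + bm25_weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked well by both searches accumulates score from both terms, which is what pushes it above documents that appear in only one list.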
---

### 2.5 Generation Module

#### 2.5.1 LLM Integration Architecture

```mermaid
graph TB
    subgraph "Ollama Integration"
        A[Ollama Server]
        B[Mistral-7B-Instruct]
        C[LLaMA-2-13B-Chat]
    end
    subgraph "Prompt Engineering"
        D[System Prompt Template]
        E[Context Formatting]
        F[Citation Injection]
    end
    subgraph "Generation Control"
        G[Temperature Controller]
        H[Token Manager]
        I[Streaming Handler]
    end
    A --> J[API Client]
    B --> A
    C --> A
    D --> K[Prompt Assembly]
    E --> K
    F --> K
    G --> L[Generation Parameters]
    H --> L
    I --> L
    K --> M[LLM Request]
    L --> M
    M --> J
    J --> N[Response Processing]
```
#### 2.5.2 LLM Configuration

| Parameter | Default Value | Range | Description |
|-----------|---------------|-------|-------------|
| **Model** | Mistral-7B-Instruct | - | Primary inference model |
| **Temperature** | 0.1 | 0.0-1.0 | Response creativity |
| **Max Tokens** | 1000 | 100-4000 | Response length limit |
| **Top-P** | 0.9 | 0.1-1.0 | Nucleus sampling |
| **Context Window** | 32K | - | Mistral model capacity |
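A request carrying these defaults to a local Ollama server could be sketched as below. The endpoint and option names follow Ollama's `/api/generate` HTTP API (`num_predict` is Ollama's name for the max-token limit); the default host and the `mistral` model tag are assumptions:

```python
import json
import urllib.request

def build_generate_payload(prompt: str, model: str = "mistral") -> dict:
    """Assemble an Ollama /api/generate request body using the defaults above."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.1, "top_p": 0.9, "num_predict": 1000},
    }

def generate(prompt: str, host: str = "http://localhost:11434") -> str:
    """POST to a locally running Ollama server (requires `ollama serve`)."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_generate_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```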
---

### 2.6 RAGAS Evaluation Module

#### 2.6.1 RAGAS Evaluation Pipeline

```mermaid
flowchart LR
    A[Query] --> B[Generated Answer]
    C[Retrieved Context] --> B
    B --> D[RAGAS Evaluator]
    C --> D
    D --> E[Answer Relevancy]
    D --> F[Faithfulness]
    D --> G[Context Utilization]
    D --> H[Context Relevancy]
    E --> I[Metrics Aggregation]
    F --> I
    G --> I
    H --> I
    I --> J[Analytics Dashboard]
    I --> K[SQLite Storage]
    I --> L[Session Statistics]
```
#### 2.6.2 Evaluation Metrics

| Metric | Target | Measurement Method | Importance |
|--------|--------|--------------------|------------|
| **Answer Relevancy** | > 0.85 | LLM-based evaluation | Core user satisfaction |
| **Faithfulness** | > 0.90 | Groundedness-in-context check | Prevents hallucinations |
| **Context Utilization** | > 0.80 | How well the context is used | Generation effectiveness |
| **Context Relevancy** | > 0.85 | Relevance of retrieved chunks | Retrieval quality |
**Implementation Details:**

```python
# RAGAS Configuration
RAGAS_CONFIG = {
    "enable_ragas": True,
    "enable_ground_truth": False,
    "base_metrics": [
        "answer_relevancy",
        "faithfulness",
        "context_utilization",
        "context_relevancy"
    ],
    "ground_truth_metrics": [
        "context_precision",
        "context_recall",
        "answer_similarity",
        "answer_correctness"
    ],
    "evaluation_timeout": 60,
    "batch_size": 10
}
```
**Evaluation Flow:**
1. **Automatic Trigger**: Every query-response pair is evaluated
2. **Async Processing**: Evaluation runs in the background (non-blocking)
3. **Storage**: Results stored in SQLite for analytics
4. **Aggregation**: Session-level statistics computed on demand
5. **Export**: Full evaluation data available for download
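Steps 1-3 above can be sketched with `asyncio`: the chat handler schedules the evaluation as a background task and returns the answer immediately. The `evaluator` and `store` objects here are hypothetical stand-ins for the RAGAS wrapper and the SQLite layer, and the generation step is a placeholder:

```python
import asyncio

async def evaluate_in_background(query: str, answer: str, contexts: list,
                                 evaluator, store) -> None:
    """Run scoring off the event loop, then persist the result (steps 2-3)."""
    scores = await asyncio.to_thread(evaluator, query, answer, contexts)
    store.append(scores)

async def answer_query(query: str, evaluator, store) -> str:
    """Return the answer without waiting for evaluation to finish (step 1)."""
    answer, contexts = "some answer", ["ctx"]   # placeholder generation step
    asyncio.create_task(
        evaluate_in_background(query, answer, contexts, evaluator, store))
    return answer   # response leaves before scoring completes
```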
---

## 3. Data Flow & Workflows

### 3.1 End-to-End Processing Pipeline

```mermaid
sequenceDiagram
    participant U as User
    participant F as Frontend
    participant A as API Gateway
    participant I as Ingestion
    participant P as Processing
    participant S as Storage
    participant R as Retrieval
    participant G as Generation
    participant E as RAGAS Evaluator
    U->>F: Upload Documents
    F->>A: POST /api/upload
    A->>I: Process Input Sources
    Note over I: Parallel Processing
    I->>I: Document Parsing
    I->>P: Extracted Text + Metadata
    P->>P: Adaptive Chunking
    P->>P: Embedding Generation
    P->>S: Store Vectors + Indexes
    S->>F: Processing Complete
    U->>F: Send Query
    F->>A: POST /api/chat
    A->>R: Hybrid Retrieval
    R->>S: Vector + BM25 Search
    S->>R: Top-K Chunks
    R->>G: Context + Query
    G->>G: LLM Generation
    G->>F: Response + Citations
    G->>E: Auto-evaluation (async)
    E->>E: Compute RAGAS Metrics
    E->>S: Store Evaluation Results
    E->>F: Return Metrics
```
### 3.2 Real-time Query Processing

```mermaid
flowchart TD
    A[User Query] --> B[Query Understanding]
    B --> C[Check Cache]
    C --> D{Cache Hit?}
    D -->|Yes| E[Return Cached Embedding]
    D -->|No| F[Generate Embedding]
    F --> G[Store in Cache]
    E --> H[FAISS Vector Search]
    G --> H
    B --> I[Keyword Extraction]
    I --> J[BM25 Keyword Search]
    H --> K[Reciprocal Rank Fusion]
    J --> K
    K --> L[Top-20 Candidates]
    L --> M{Reranking Enabled?}
    M -->|Yes| N[Cross-Encoder Reranking]
    M -->|No| O[Select Top-5]
    N --> O
    O --> P[Context Assembly]
    P --> Q[LLM Prompt Construction]
    Q --> R[Ollama Generation]
    R --> S[Citation Formatting]
    S --> T[Response Streaming]
    T --> U[User Display]
    R --> V[Async RAGAS Evaluation]
    V --> W[Compute Metrics]
    W --> X[Store Results]
    X --> Y[Update Dashboard]
```
---

## 4. Infrastructure & Deployment

### 4.1 Container Architecture

```mermaid
graph TB
    subgraph "Docker Compose Stack"
        A[Frontend Container<br/>nginx:alpine]
        B[Backend Container<br/>python:3.11]
        C[Ollama Container<br/>ollama/ollama]
    end
    subgraph "Persistent Volumes"
        D[FAISS Indices<br/>Persistent Volume]
        E[SQLite Database<br/>Persistent Volume]
        F[Log Files<br/>Persistent Volume]
    end
    A --> B
    B --> C
    B --> D
    B --> E
    B --> F
```
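A compose file matching this layout might look roughly like the sketch below; service names, ports, and volume paths are illustrative, not the project's actual configuration:

```yaml
services:
  backend:
    image: python:3.11           # FastAPI app is layered on top in practice
    ports:
      - "8000:8000"
    volumes:
      - faiss_data:/app/indices   # persistent FAISS indices
      - sqlite_data:/app/db       # SQLite metadata + RAGAS results
      - logs:/app/logs
    depends_on:
      - ollama

  ollama:
    image: ollama/ollama
    volumes:
      - ollama_models:/root/.ollama

  frontend:
    image: nginx:alpine
    ports:
      - "80:80"

volumes:
  faiss_data:
  sqlite_data:
  logs:
  ollama_models:
```

Named volumes keep the indices, database, and pulled models across container restarts, matching the "Persistent Volume" boxes in the diagram.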
### 4.2 Resource Requirements

#### 4.2.1 Minimum Deployment

| Resource | Specification | Purpose |
|----------|---------------|---------|
| **CPU** | 4 cores | Document processing, embeddings |
| **RAM** | 8GB | Model loading, FAISS indices, cache |
| **Storage** | 20GB | Models, indices, documents |
| **GPU** | Optional | 2-3x speedup for inference |

#### 4.2.2 Production Deployment

| Resource | Specification | Purpose |
|----------|---------------|---------|
| **CPU** | 8+ cores | Concurrent processing |
| **RAM** | 16GB+ | Larger datasets, caching |
| **GPU** | RTX 3090/4090 | 20-30 tokens/sec inference |
| **Storage** | 100GB+ SSD | Fast vector search |

---
## 5. API Architecture

### 5.1 REST API Endpoints

```mermaid
graph TB
    subgraph "System Management"
        A[GET /api/health]
        B[GET /api/system-info]
        C[GET /api/configuration]
        D[POST /api/configuration]
    end
    subgraph "Document Management"
        E[POST /api/upload]
        F[POST /api/start-processing]
        G[GET /api/processing-status]
    end
    subgraph "Query & Chat"
        H[POST /api/chat]
        I[GET /api/export-chat/:session_id]
    end
    subgraph "RAGAS Evaluation"
        J[GET /api/ragas/history]
        K[GET /api/ragas/statistics]
        L[POST /api/ragas/clear]
        M[GET /api/ragas/export]
        N[GET /api/ragas/config]
    end
    subgraph "Analytics"
        O[GET /api/analytics]
        P[GET /api/analytics/refresh]
        Q[GET /api/analytics/detailed]
    end
```
### 5.2 Request/Response Flow

```python
# Typical Chat Request Flow with RAGAS
REQUEST_FLOW = {
    "authentication": "None (local deployment)",
    "rate_limiting": "100 requests/minute per IP",
    "validation": "Query length, session ID format",
    "processing": "Async with progress tracking",
    "response": "JSON with citations + metrics + RAGAS scores",
    "caching": "LRU cache for embeddings",
    "evaluation": "Automatic RAGAS metrics (async)"
}
```
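The validation step ("query length, session ID format") can be sketched with plain checks; the length limit and ID pattern below are assumptions for illustration:

```python
import re

SESSION_ID_RE = re.compile(r"^[A-Za-z0-9_-]{8,64}$")  # assumed ID format
MAX_QUERY_LEN = 4000                                   # assumed length limit

def validate_chat_request(query: str, session_id: str) -> None:
    """Reject malformed chat requests before they reach the pipeline."""
    if not 1 <= len(query) <= MAX_QUERY_LEN:
        raise ValueError("query length out of range")
    if not SESSION_ID_RE.fullmatch(session_id):
        raise ValueError("malformed session ID")
```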
---

## 6. Monitoring & Quality Assurance

### 6.1 RAGAS Integration

```mermaid
graph LR
    A[API Gateway] --> B[Query Processing]
    C[Retrieval Module] --> B
    D[Generation Module] --> B
    B --> E[RAGAS Evaluator]
    E --> F[Analytics Dashboard]
    F --> G[Answer Relevancy]
    F --> H[Faithfulness]
    F --> I[Context Utilization]
    F --> J[Context Relevancy]
    F --> K[Session Statistics]
```
### 6.2 Key Performance Indicators

| Category | Metric | Target | Alert Threshold |
|----------|--------|--------|-----------------|
| **Performance** | Query Latency (p95) | < 5s | > 10s |
| **Quality** | Answer Relevancy | > 0.85 | < 0.70 |
| **Quality** | Faithfulness | > 0.90 | < 0.80 |
| **Quality** | Context Utilization | > 0.80 | < 0.65 |
| **Quality** | Overall Score | > 0.85 | < 0.70 |
| **Reliability** | Uptime | > 99.5% | < 95% |
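The quality rows of the table translate directly into a watchdog check; a sketch (the threshold values mirror the table, the function name is illustrative):

```python
ALERT_THRESHOLDS = {            # alert floors from the KPI table above
    "answer_relevancy": 0.70,
    "faithfulness": 0.80,
    "context_utilization": 0.65,
    "overall_score": 0.70,
}

def check_alerts(metrics: dict) -> list:
    """Return the names of quality metrics that fell below their alert floor."""
    return [name for name, floor in ALERT_THRESHOLDS.items()
            if metrics.get(name, 1.0) < floor]
```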
### 6.3 Analytics Dashboard Features

**Real-Time Metrics:**
- RAGAS evaluation table with all query-response pairs
- Session-level aggregate statistics
- Performance metrics (latency, throughput)
- Component health status

**Historical Analysis:**
- Quality trends over time
- Performance degradation detection
- Cache hit rate monitoring
- Resource utilization tracking

**Export Capabilities:**
- JSON export of all evaluation data
- CSV export for external analysis
- Session-based filtering
- Time-range queries

---
## 7. Technology Stack Details

### Complete Technology Matrix

| Layer | Component | Technology | Version | Purpose |
|-------|-----------|------------|---------|---------|
| **Frontend** | UI Framework | HTML5/CSS3/JS | - | Responsive interface |
| **Frontend** | Styling | Tailwind CSS | 3.3+ | Utility-first CSS |
| **Frontend** | Icons | Font Awesome | 6.0+ | Icon library |
| **Backend** | API Framework | FastAPI | 0.104+ | Async REST API |
| **Backend** | Python Version | Python | 3.11+ | Runtime |
| **AI/ML** | LLM Engine | Ollama | 0.1.20+ | Local LLM inference |
| **AI/ML** | Primary Model | Mistral-7B-Instruct | v0.2 | Text generation |
| **AI/ML** | Embeddings | sentence-transformers | 2.2.2+ | Vector embeddings |
| **AI/ML** | Embedding Model | BAAI/bge-small-en | v1.5 | Semantic search |
| **Vector DB** | Storage | FAISS | 1.7.4+ | Vector similarity |
| **Search** | Keyword | rank-bm25 | 0.2.1 | BM25 implementation |
| **Evaluation** | Quality | Ragas | 0.1.9 | RAG evaluation |
| **Document** | PDF | PyPDF2 | 3.0+ | PDF text extraction |
| **Document** | Word | python-docx | 1.1+ | DOCX processing |
| **OCR** | Text Recognition | EasyOCR | 1.7+ | Scanned documents |
| **Database** | Metadata | SQLite | 3.35+ | Local storage |
| **Cache** | In-memory | Python functools | - | LRU caching |
| **Deployment** | Container | Docker | 24.0+ | Containerization |
| **Deployment** | Orchestration | Docker Compose | 2.20+ | Multi-container |

---
## 8. Key Architectural Decisions

### 8.1 Why Local Caching Instead of Redis?

**Decision:** Use an in-memory LRU cache built on Python's `functools.lru_cache`

**Rationale:**
- **Simplicity**: No external service to manage
- **Performance**: Faster access (no network overhead)
- **MVP Focus**: Adequate for initial deployment
- **Resource Efficient**: No separate cache-server memory footprint
- **Easy Migration**: Can upgrade to Redis later if needed

**Trade-offs:**
- Cache doesn't persist across restarts
- Can't be shared across multiple instances
- Limited by single-process memory
### 8.2 Why RAGAS for Evaluation?

**Decision:** Integrate RAGAS for real-time quality assessment

**Rationale:**
- **Automated Metrics**: No manual annotation required
- **Production-Ready**: Quantifiable quality scores
- **Real-Time**: Evaluates every query-response pair
- **Comprehensive**: Covers multiple dimensions of quality
- **Research-Backed**: Based on published academic work

**Implementation Details:**
- OpenAI API key required for LLM-based metrics
- Async evaluation to avoid blocking responses
- SQLite storage for historical analysis
- Export capability for offline processing
### 8.3 Why No Web Scraping?

**Decision:** Web scraping was removed from the MVP

**Rationale:**
- **Complexity**: Anti-scraping mechanisms require ongoing maintenance
- **Reliability**: Website changes break scrapers
- **Legal**: Potential legal/ethical issues
- **Scope**: Focus on core RAG functionality first

**Alternative:**
- Users can save web pages as PDFs
- Future enhancement if market demand warrants it
---

## 9. Performance Optimization Strategies

### 9.1 Embedding Cache Strategy

```python
# Cache Implementation
from functools import lru_cache

import numpy as np

@lru_cache(maxsize=1000)
def get_query_embedding(query: str) -> np.ndarray:
    """Cache query embeddings for repeat queries (embedder is the loaded model)."""
    return embedder.embed(query)

# Benefits:
# - ~80% reduction in latency for repeat queries
# - No re-computation of identical queries
# - Automatic LRU eviction
```
### 9.2 Batch Processing

```python
# Batch Embedding Generation
from typing import List

import numpy as np

BATCH_SIZE = 32

def embed_chunks_batch(chunks: List[str]) -> List[np.ndarray]:
    """Embed chunks in fixed-size batches to bound memory use."""
    embeddings = []
    for i in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[i:i + BATCH_SIZE]
        batch_embeddings = embedder.embed_batch(batch)
        embeddings.extend(batch_embeddings)
    return embeddings
```
### 9.3 Async Processing

```python
# Async Document Processing
import asyncio
from pathlib import Path
from typing import List

async def process_documents_async(documents: List[Path]):
    """Process documents concurrently (assumes an async process_single_document)."""
    tasks = [process_single_document(doc) for doc in documents]
    results = await asyncio.gather(*tasks)
    return results
```
---

## 10. Security Considerations

### 10.1 Data Privacy
- **On-Premise Processing**: All data stays local
- **No External APIs**: Except OpenAI for RAGAS (configurable)
- **Local LLM**: Ollama runs entirely on-premise
- **Encrypted Storage**: Optional SQLite encryption
### 10.2 Input Validation

```python
# File Upload Validation
from pathlib import Path

from fastapi import UploadFile

MAX_FILE_SIZE = 100 * 1024 * 1024  # 100MB
ALLOWED_EXTENSIONS = {'.pdf', '.docx', '.txt', '.zip'}

def validate_upload(file: UploadFile):
    # Check extension
    if Path(file.filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError("Unsupported file type")
    # Check size
    if file.size > MAX_FILE_SIZE:
        raise ValueError("File too large")
    # Scan for malicious content (optional)
    # scan_for_malware(file)
```
### 10.3 Rate Limiting

```python
# Simple in-memory rate limiting
from collections import defaultdict
from datetime import datetime, timedelta

from fastapi import HTTPException, Request

rate_limits = defaultdict(list)

def check_rate_limit(request: Request, limit: int = 100):
    ip = request.client.host
    now = datetime.now()
    # Drop requests older than the one-minute window
    rate_limits[ip] = [
        ts for ts in rate_limits[ip]
        if now - ts < timedelta(minutes=1)
    ]
    # Reject if the window is already full
    if len(rate_limits[ip]) >= limit:
        raise HTTPException(429, "Rate limit exceeded")
    rate_limits[ip].append(now)
```
---

## Conclusion

This architecture document provides a comprehensive technical blueprint for the QuerySphere system. The modular design, clear separation of concerns, and production-ready considerations make the system suitable for enterprise deployment while leaving room for future enhancements.

### Key Architectural Strengths

1. **Modularity**: Each component is independent and replaceable
2. **Scalability**: Horizontal scaling through stateless API design
3. **Performance**: Intelligent caching and batch processing
4. **Quality**: Real-time RAGAS evaluation for continuous monitoring
5. **Privacy**: Complete on-premise processing with a local LLM
6. **Simplicity**: Minimal external dependencies (no Redis, no web scraping)

### Future Enhancements

**Short-term:**
- Redis cache for multi-instance deployments
- Advanced monitoring dashboard
- User authentication and authorization
- API rate limiting enhancements

**Long-term:**
- Distributed processing with Celery
- Web scraping module (optional)
- Fine-tuned domain-specific embeddings
- Multi-tenant support
- Advanced analytics and reporting

---

Document Version: 1.0
Last Updated: November 2025
Author: Satyaki Mitra

---

> This document is part of the QuerySphere technical documentation suite.