# QuerySphere - Technical Architecture Document
## 1. System Overview
### 1.1 High-Level Architecture
```mermaid
graph TB
subgraph "Frontend Layer"
A[Web UI<br/>HTML/CSS/JS]
B[File Upload<br/>Drag & Drop]
C[Chat Interface<br/>Real-time]
D[Analytics Dashboard<br/>RAGAS Metrics]
end
subgraph "API Gateway"
E[FastAPI Server<br/>Python 3.11+]
end
subgraph "Core Processing Engine"
F[Ingestion Module]
G[Processing Module]
H[Retrieval Module]
I[Generation Module]
J[Evaluation Module]
end
subgraph "AI/ML Layer"
K[Ollama LLM<br/>Mistral-7B]
L[Embedding Model<br/>BGE-small-en]
M[FAISS Vector DB]
end
subgraph "Quality Assurance"
N[RAGAS Evaluator<br/>Real-time Metrics]
end
A --> E
E --> F
F --> G
G --> H
H --> I
I --> K
G --> L
L --> M
H --> M
I --> N
N --> E
```
### 1.2 System Characteristics
| Aspect | Specification |
|--------|---------------|
| **Architecture Style** | Modular Microservices-inspired |
| **Deployment** | Docker Containerized |
| **Processing Model** | Async/Event-driven |
| **Data Flow** | Pipeline-based with Checkpoints |
| **Scalability** | Horizontal (Stateless API) + Vertical (GPU) |
| **Caching** | In-Memory LRU Cache |
| **Evaluation** | Real-time RAGAS Metrics |
---
## 2. Component Architecture
### 2.1 Ingestion Module
```mermaid
flowchart TD
A[User Input] --> B{Input Type Detection}
B -->|PDF/DOCX| D[Document Parser]
B -->|ZIP| E[Archive Extractor]
subgraph D [Document Processing]
D1[PyPDF2<br/>PDF Text]
D2[python-docx<br/>Word Docs]
D3[EasyOCR<br/>Scanned PDFs]
end
subgraph E [Archive Handling]
E1[zipfile<br/>Extraction]
E2[Recursive Processing]
E3[Size Validation<br/>2GB Max]
end
D --> F[Text Cleaning]
E --> F
F --> G[Encoding Normalization]
G --> H[Structure Preservation]
H --> I[Output: Cleaned Text<br/>+ Metadata]
```
#### Ingestion Specifications:
| Component | Technology | Configuration | Limits |
|-----------|------------|---------------|---------|
| **PDF Parser** | PyPDF2 + EasyOCR | OCR: English+Multilingual | 1000 pages max |
| **Document Parser** | python-docx | Preserve formatting | 50MB per file |
| **Archive Handler** | zipfile | Recursion depth: 5 | 2GB total, 10k files |
### 2.2 Processing Module
#### 2.2.1 Adaptive Chunking Strategy
```mermaid
flowchart TD
A[Input Text] --> B[Token Count Analysis]
B --> C{Document Size}
C -->|<50K tokens| D[Fixed-Size Chunking]
C -->|50K-500K tokens| E[Semantic Chunking]
C -->|>500K tokens| F[Hierarchical Chunking]
subgraph D [Strategy 1: Fixed]
D1[Chunk Size: 512 tokens]
D2[Overlap: 50 tokens]
D3[Method: Simple sliding window]
end
subgraph E [Strategy 2: Semantic]
E1[Breakpoint: 95th percentile similarity]
E2[Method: LlamaIndex SemanticSplitter]
E3[Preserve: Section boundaries]
end
subgraph F [Strategy 3: Hierarchical]
F1[Parent: 2048 tokens]
F2[Child: 512 tokens]
F3[Retrieval: Child → Parent expansion]
end
D --> G[Chunk Metadata]
E --> G
F --> G
G --> H[Embedding Generation]
```
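The size-based dispatch above can be sketched as a small selector. This is an illustrative helper, not the actual QuerySphere API; the function name and return shape are assumptions, and `token_count` would come from a real tokenizer in practice:

```python
def select_chunking_strategy(token_count: int) -> dict:
    """Map document size to a chunking strategy, per the thresholds above."""
    if token_count < 50_000:
        # Strategy 1: simple sliding window
        return {"strategy": "fixed", "chunk_size": 512, "overlap": 50}
    if token_count <= 500_000:
        # Strategy 2: split at semantic breakpoints
        return {"strategy": "semantic", "breakpoint_percentile": 95}
    # Strategy 3: parent/child hierarchy for very large documents
    return {"strategy": "hierarchical", "parent_size": 2048, "child_size": 512}
```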
#### 2.2.2 Embedding Pipeline
```python
# Embedding Configuration
import torch

EMBEDDING_CONFIG = {
    "model": "BAAI/bge-small-en-v1.5",
    "dimensions": 384,
    "batch_size": 32,
    "normalize": True,
    "device": "cuda" if torch.cuda.is_available() else "cpu",
    "max_sequence_length": 512,
}
```
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| **Model** | BAAI/bge-small-en-v1.5 | SOTA quality, 62.17 MTEB score |
| **Dimensions** | 384 | Optimal speed/accuracy balance |
| **Batch Size** | 32 | Memory efficiency on GPU/CPU |
| **Normalization** | L2 | Required for cosine similarity |
| **Speed** | 1000 docs/sec (CPU) | 10x faster than alternatives |
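A quick illustration of why L2 normalization is required: once vectors are unit length, a plain dot product equals cosine similarity, which is what the vector search relies on. This is a pure-Python sketch, not the production embedding path:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Parallel vectors: cosine similarity of the normalized pair is 1.0
similarity = dot(l2_normalize([3.0, 4.0]), l2_normalize([6.0, 8.0]))
```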
---
### 2.3 Storage Module Architecture
```mermaid
graph TB
subgraph "Storage Layer"
A[FAISS Vector Store]
B[BM25 Keyword Index]
C[SQLite Metadata]
D[LRU Cache<br/>In-Memory]
end
subgraph A [Vector Storage Architecture]
A1[IndexHNSW<br/>Large datasets]
A2[IndexIVFFlat<br/>Medium datasets]
A3[IndexFlatL2<br/>Small datasets]
end
subgraph B [Keyword Index]
B1[rank_bm25 Library]
B2[TF-IDF Weights]
B3[In-memory Index]
end
subgraph C [Metadata Management]
C1[Document Metadata]
C2[Chunk Relationships]
C3[User Sessions]
C4[RAGAS Evaluations]
end
subgraph D [Cache Layer]
D1[Query Embeddings]
D2[Frequent Results]
D3[LRU Eviction]
end
A --> E[Hybrid Retrieval]
B --> E
C --> E
D --> E
```
#### Vector Store Configuration
| Index Type | Use Case | Parameters | Performance |
|------------|----------|------------|-------------|
| **IndexFlatL2** | < 100K vectors | Exact search | O(n), High accuracy |
| **IndexIVFFlat** | 100K-1M vectors | nprobe: 10-20 | O(log n), Balanced |
| **IndexHNSW** | > 1M vectors | M: 16, efConstruction: 40 | O(log n), Fastest |
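Index selection by corpus size reduces to a simple rule. This helper mirrors the thresholds in the table and is illustrative only; the actual storage module may choose differently:

```python
def select_faiss_index(num_vectors: int) -> str:
    """Pick a FAISS index type by corpus size, per the table above."""
    if num_vectors < 100_000:
        return "IndexFlatL2"   # exact search, highest accuracy
    if num_vectors <= 1_000_000:
        return "IndexIVFFlat"  # inverted lists, nprobe 10-20
    return "IndexHNSW"         # graph-based, M=16, efConstruction=40
```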
#### Caching Strategy
```python
# LRU Cache Configuration
CACHE_CONFIG = {
    "max_size": 1000,        # Maximum cached items
    "ttl": 3600,             # Time to live (seconds)
    "eviction": "LRU",       # Least Recently Used
    "cache_embeddings": True,
    "cache_results": True,
}
```
**Benefits:**
- **Reduced latency**: 80% reduction for repeat queries
- **Resource efficiency**: Avoid re-computing embeddings
- **No external dependencies**: Pure Python implementation
- **Memory efficient**: LRU eviction prevents unbounded growth
---
### 2.4 Retrieval Module
#### 2.4.1 Hybrid Retrieval Pipeline
```mermaid
flowchart TD
A[User Query] --> B[Query Processing]
B --> C[Vector Embedding]
B --> D[Keyword Extraction]
C --> E[FAISS Search<br/>Top-K: 10]
D --> F[BM25 Search<br/>Top-K: 10]
E --> G[Reciprocal Rank Fusion]
F --> G
G --> H{Reranking Enabled?}
H -->|Yes| I[Cross-Encoder Reranking]
H -->|No| J[Final Top-5 Selection]
I --> J
J --> K[Context Assembly]
K --> L[Citation Formatting]
L --> M[Output: Context + Sources]
```
#### 2.4.2 Retrieval Algorithms
**Hybrid Fusion Formula:**
```text
RRF_score(doc) = vector_weight * (1 / (60 + vector_rank)) + bm25_weight * (1 / (60 + bm25_rank))
```
**Default Weights:**
- Vector Similarity: 60%
- BM25 Keyword: 40%
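A minimal sketch of weighted Reciprocal Rank Fusion under these defaults. The function name and the list-of-doc-ids interface are assumptions for illustration, not the actual retrieval API:

```python
def rrf_scores(vector_ranking, bm25_ranking,
               vector_weight=0.6, bm25_weight=0.4, k=60):
    """Fuse two ranked doc-id lists with weighted Reciprocal Rank Fusion."""
    scores = {}
    for rank, doc in enumerate(vector_ranking, start=1):
        scores[doc] = scores.get(doc, 0.0) + vector_weight / (k + rank)
    for rank, doc in enumerate(bm25_ranking, start=1):
        scores[doc] = scores.get(doc, 0.0) + bm25_weight / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Documents ranked highly by both retrievers accumulate contributions from both terms, which is why hybrid fusion outperforms either retriever alone on mixed keyword/semantic queries.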
**BM25 Parameters:**
```python
BM25_CONFIG = {
    "k1": 1.5,       # Term frequency saturation
    "b": 0.75,       # Length normalization
    "epsilon": 0.25, # Smoothing factor
}
```
---
### 2.5 Generation Module
#### 2.5.1 LLM Integration Architecture
```mermaid
graph TB
subgraph "Ollama Integration"
A[Ollama Server]
B[Mistral-7B-Instruct]
C[LLaMA-2-13B-Chat]
end
subgraph "Prompt Engineering"
D[System Prompt Template]
E[Context Formatting]
F[Citation Injection]
end
subgraph "Generation Control"
G[Temperature Controller]
H[Token Manager]
I[Streaming Handler]
end
A --> J[API Client]
B --> A
C --> A
D --> K[Prompt Assembly]
E --> K
F --> K
G --> L[Generation Parameters]
H --> L
I --> L
K --> M[LLM Request]
L --> M
M --> J
J --> N[Response Processing]
```
#### 2.5.2 LLM Configuration
| Parameter | Default Value | Range | Description |
|-----------|---------------|-------|-------------|
| **Model** | Mistral-7B-Instruct | - | Primary inference model |
| **Temperature** | 0.1 | 0.0-1.0 | Response creativity |
| **Max Tokens** | 1000 | 100-4000 | Response length limit |
| **Top-P** | 0.9 | 0.1-1.0 | Nucleus sampling |
| **Context Window** | 32K | - | Mistral model capacity |
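These defaults map onto an Ollama `/api/generate` request roughly as follows. The builder function is a hypothetical sketch; `num_predict` is Ollama's option name for the maximum number of generated tokens, and the model tag would need to match what is pulled locally:

```python
def build_ollama_request(prompt: str, model: str = "mistral:instruct",
                         temperature: float = 0.1, max_tokens: int = 1000,
                         top_p: float = 0.9) -> dict:
    """Assemble a request body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": True,  # stream tokens back to the chat UI
        "options": {
            "temperature": temperature,
            "num_predict": max_tokens,
            "top_p": top_p,
        },
    }
```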
---
### 2.6 RAGAS Evaluation Module
#### 2.6.1 RAGAS Evaluation Pipeline
```mermaid
flowchart LR
A[Query] --> B[Generated Answer]
C[Retrieved Context] --> B
B --> D[RAGAS Evaluator]
C --> D
D --> E[Answer Relevancy]
D --> F[Faithfulness]
D --> G[Context Utilization]
D --> H[Context Relevancy]
E --> I[Metrics Aggregation]
F --> I
G --> I
H --> I
I --> J[Analytics Dashboard]
I --> K[SQLite Storage]
I --> L[Session Statistics]
```
#### 2.6.2 Evaluation Metrics
| Metric | Target | Measurement Method | Importance |
|--------|--------|-------------------|------------|
| **Answer Relevancy** | > 0.85 | LLM-based evaluation | Core user satisfaction |
| **Faithfulness** | > 0.90 | Grounded in context check | Prevents hallucinations |
| **Context Utilization** | > 0.80 | How well context is used | Generation effectiveness |
| **Context Relevancy** | > 0.85 | Retrieved chunks relevance | Retrieval quality |
**Implementation Details:**
```python
# RAGAS Configuration
RAGAS_CONFIG = {
    "enable_ragas": True,
    "enable_ground_truth": False,
    "base_metrics": [
        "answer_relevancy",
        "faithfulness",
        "context_utilization",
        "context_relevancy",
    ],
    "ground_truth_metrics": [
        "context_precision",
        "context_recall",
        "answer_similarity",
        "answer_correctness",
    ],
    "evaluation_timeout": 60,
    "batch_size": 10,
}
```
**Evaluation Flow:**
1. **Automatic Trigger**: Every query-response pair is evaluated
2. **Async Processing**: Evaluation runs in background (non-blocking)
3. **Storage**: Results stored in SQLite for analytics
4. **Aggregation**: Session-level statistics computed on-demand
5. **Export**: Full evaluation data available for download
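Step 4, the on-demand aggregation, could look like the following minimal sketch over evaluation rows read back from SQLite. The function name and flat row shape are illustrative assumptions:

```python
from statistics import mean

def aggregate_session_metrics(evaluations: list) -> dict:
    """Compute per-metric session averages from stored RAGAS rows."""
    if not evaluations:
        return {}
    metrics = ["answer_relevancy", "faithfulness",
               "context_utilization", "context_relevancy"]
    # Round for display in the analytics dashboard
    return {m: round(mean(row[m] for row in evaluations), 3)
            for m in metrics}
```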
---
## 3. Data Flow & Workflows
### 3.1 End-to-End Processing Pipeline
```mermaid
sequenceDiagram
participant U as User
participant F as Frontend
participant A as API Gateway
participant I as Ingestion
participant P as Processing
participant S as Storage
participant R as Retrieval
participant G as Generation
participant E as RAGAS Evaluator
U->>F: Upload Documents
F->>A: POST /api/upload
A->>I: Process Input Sources
Note over I: Parallel Processing
I->>I: Document Parsing
I->>P: Extracted Text + Metadata
P->>P: Adaptive Chunking
P->>P: Embedding Generation
P->>S: Store Vectors + Indexes
S->>F: Processing Complete
U->>F: Send Query
F->>A: POST /api/chat
A->>R: Hybrid Retrieval
R->>S: Vector + BM25 Search
S->>R: Top-K Chunks
R->>G: Context + Query
G->>G: LLM Generation
G->>F: Response + Citations
G->>E: Auto-evaluation (async)
E->>E: Compute RAGAS Metrics
E->>S: Store Evaluation Results
E->>F: Return Metrics
```
### 3.2 Real-time Query Processing
```mermaid
flowchart TD
A[User Query] --> B[Query Understanding]
B --> C[Check Cache]
C --> D{Cache Hit?}
D -->|Yes| E[Return Cached Embedding]
D -->|No| F[Generate Embedding]
F --> G[Store in Cache]
E --> H[FAISS Vector Search]
G --> H
B --> I[Keyword Extraction]
I --> J[BM25 Keyword Search]
H --> K[Reciprocal Rank Fusion]
J --> K
K --> L[Top-20 Candidates]
L --> M{Reranking Enabled?}
M -->|Yes| N[Cross-Encoder Reranking]
M -->|No| O[Select Top-5]
N --> O
O --> P[Context Assembly]
P --> Q[LLM Prompt Construction]
Q --> R[Ollama Generation]
R --> S[Citation Formatting]
S --> T[Response Streaming]
T --> U[User Display]
R --> V[Async RAGAS Evaluation]
V --> W[Compute Metrics]
W --> X[Store Results]
X --> Y[Update Dashboard]
```
---
## 4. Infrastructure & Deployment
### 4.1 Container Architecture
```mermaid
graph TB
subgraph "Docker Compose Stack"
A[Frontend Container<br/>nginx:alpine]
B[Backend Container<br/>python:3.11]
C[Ollama Container<br/>ollama/ollama]
end
subgraph "External Services"
D[FAISS Indices<br/>Persistent Volume]
E[SQLite Database<br/>Persistent Volume]
F[Log Files<br/>Persistent Volume]
end
A --> B
B --> C
B --> D
B --> E
B --> F
```
### 4.2 Resource Requirements
#### 4.2.1 Minimum Deployment
| Resource | Specification | Purpose |
|----------|---------------|---------|
| **CPU** | 4 cores | Document processing, embeddings |
| **RAM** | 8GB | Model loading, FAISS indices, cache |
| **Storage** | 20GB | Models, indices, documents |
| **GPU** | Optional | 2-3x speedup for inference |
#### 4.2.2 Production Deployment
| Resource | Specification | Purpose |
|----------|---------------|---------|
| **CPU** | 8+ cores | Concurrent processing |
| **RAM** | 16GB+ | Larger datasets, caching |
| **GPU** | RTX 3090/4090 | 20-30 tokens/sec inference |
| **Storage** | 100GB+ SSD | Fast vector search |
---
## 5. API Architecture
### 5.1 REST API Endpoints
```mermaid
graph TB
subgraph "System Management"
A[GET /api/health]
B[GET /api/system-info]
C[GET /api/configuration]
D[POST /api/configuration]
end
subgraph "Document Management"
E[POST /api/upload]
F[POST /api/start-processing]
G[GET /api/processing-status]
end
subgraph "Query & Chat"
H[POST /api/chat]
I[GET /api/export-chat/:session_id]
end
subgraph "RAGAS Evaluation"
J[GET /api/ragas/history]
K[GET /api/ragas/statistics]
L[POST /api/ragas/clear]
M[GET /api/ragas/export]
N[GET /api/ragas/config]
end
subgraph "Analytics"
O[GET /api/analytics]
P[GET /api/analytics/refresh]
Q[GET /api/analytics/detailed]
end
```
### 5.2 Request/Response Flow
```python
# Typical Chat Request Flow with RAGAS
REQUEST_FLOW = {
    "authentication": "None (local deployment)",
    "rate_limiting": "100 requests/minute per IP",
    "validation": "Query length, session ID format",
    "processing": "Async with progress tracking",
    "response": "JSON with citations + metrics + RAGAS scores",
    "caching": "LRU cache for embeddings",
    "evaluation": "Automatic RAGAS metrics (async)",
}
```
---
## 6. Monitoring & Quality Assurance
### 6.1 RAGAS Integration
```mermaid
graph LR
A[API Gateway] --> B[Query Processing]
C[Retrieval Module] --> B
D[Generation Module] --> B
B --> E[RAGAS Evaluator]
E --> F[Analytics Dashboard]
F --> G[Answer Relevancy]
F --> H[Faithfulness]
F --> I[Context Utilization]
F --> J[Context Relevancy]
F --> K[Session Statistics]
```
### 6.2 Key Performance Indicators
| Category | Metric | Target | Alert Threshold |
|----------|--------|--------|-----------------|
| **Performance** | Query Latency (p95) | < 5s | > 10s |
| **Quality** | Answer Relevancy | > 0.85 | < 0.70 |
| **Quality** | Faithfulness | > 0.90 | < 0.80 |
| **Quality** | Context Utilization | > 0.80 | < 0.65 |
| **Quality** | Overall Score | > 0.85 | < 0.70 |
| **Reliability** | Uptime | > 99.5% | < 95% |
### 6.3 Analytics Dashboard Features
**Real-Time Metrics:**
- RAGAS evaluation table with all query-response pairs
- Session-level aggregate statistics
- Performance metrics (latency, throughput)
- Component health status
**Historical Analysis:**
- Quality trend over time
- Performance degradation detection
- Cache hit rate monitoring
- Resource utilization tracking
**Export Capabilities:**
- JSON export of all evaluation data
- CSV export for external analysis
- Session-based filtering
- Time-range queries
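The CSV export can be built from the standard library's `csv` module. This sketch assumes flat evaluation records and is not the actual export endpoint:

```python
import csv
import io

def export_evaluations_csv(rows: list) -> str:
    """Serialize evaluation records (list of flat dicts) to a CSV string."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```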
---
## 7. Technology Stack Details
### Complete Technology Matrix
| Layer | Component | Technology | Version | Purpose |
|-------|-----------|------------|---------|----------|
| **Frontend** | UI Framework | HTML5/CSS3/JS | - | Responsive interface |
| **Frontend** | Styling | Tailwind CSS | 3.3+ | Utility-first CSS |
| **Frontend** | Icons | Font Awesome | 6.0+ | Icon library |
| **Backend** | API Framework | FastAPI | 0.104+ | Async REST API |
| **Backend** | Python Version | Python | 3.11+ | Runtime |
| **AI/ML** | LLM Engine | Ollama | 0.1.20+ | Local LLM inference |
| **AI/ML** | Primary Model | Mistral-7B-Instruct | v0.2 | Text generation |
| **AI/ML** | Embeddings | sentence-transformers | 2.2.2+ | Vector embeddings |
| **AI/ML** | Embedding Model | BAAI/bge-small-en | v1.5 | Semantic search |
| **Vector DB** | Storage | FAISS | 1.7.4+ | Vector similarity |
| **Search** | Keyword | rank-bm25 | 0.2.1 | BM25 implementation |
| **Evaluation** | Quality | Ragas | 0.1.9 | RAG evaluation |
| **Document** | PDF | PyPDF2 | 3.0+ | PDF text extraction |
| **Document** | Word | python-docx | 1.1+ | DOCX processing |
| **OCR** | Text Recognition | EasyOCR | 1.7+ | Scanned documents |
| **Database** | Metadata | SQLite | 3.35+ | Local storage |
| **Cache** | In-memory | Python functools | - | LRU caching |
| **Deployment** | Container | Docker | 24.0+ | Containerization |
| **Deployment** | Orchestration | Docker Compose | 2.20+ | Multi-container |
---
## 8. Key Architectural Decisions
### 8.1 Why Local Caching Instead of Redis?
**Decision:** Use in-memory LRU cache with Python's `functools.lru_cache`
**Rationale:**
- **Simplicity**: No external service to manage
- **Performance**: Faster access (no network overhead)
- **MVP Focus**: Adequate for initial deployment
- **Resource Efficient**: No additional memory footprint
- **Easy Migration**: Can upgrade to Redis later if needed
**Trade-offs:**
- Cache doesn't persist across restarts
- Can't share cache across multiple instances
- Limited by single-process memory
### 8.2 Why RAGAS for Evaluation?
**Decision:** Integrate RAGAS for real-time quality assessment
**Rationale:**
- **Automated Metrics**: No manual annotation required
- **Production-Ready**: Quantifiable quality scores
- **Real-Time**: Evaluate every query-response pair
- **Comprehensive**: Multiple dimensions of quality
- **Research-Backed**: Based on academic research
**Implementation Details:**
- OpenAI API key required for LLM-based metrics
- Async evaluation to avoid blocking responses
- SQLite storage for historical analysis
- Export capability for offline processing
### 8.3 Why No Web Scraping?
**Decision:** Removed web scraping from MVP
**Rationale:**
- **Complexity**: Anti-scraping mechanisms require maintenance
- **Reliability**: Website changes break scrapers
- **Legal**: Potential legal/ethical issues
- **Scope**: Focus on core RAG functionality first
**Alternative:**
- Users can save web pages as PDFs
- Future enhancement if market demands it
---
## 9. Performance Optimization Strategies
### 9.1 Embedding Cache Strategy
```python
# Cache Implementation
from functools import lru_cache

import numpy as np

@lru_cache(maxsize=1000)
def get_query_embedding(query: str) -> np.ndarray:
    """Cache query embeddings for repeat queries."""
    return embedder.embed(query)

# Benefits:
# - 80% reduction in latency for repeat queries
# - No re-computation of identical queries
# - Automatic LRU eviction
```
### 9.2 Batch Processing
```python
# Batch Embedding Generation
from typing import List

import numpy as np

BATCH_SIZE = 32

def embed_chunks_batch(chunks: List[str]) -> List[np.ndarray]:
    embeddings = []
    for i in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[i:i + BATCH_SIZE]
        batch_embeddings = embedder.embed_batch(batch)
        embeddings.extend(batch_embeddings)
    return embeddings
```
### 9.3 Async Processing
```python
# Async Document Processing
import asyncio
from pathlib import Path
from typing import List

async def process_documents_async(documents: List[Path]):
    tasks = [process_single_document(doc) for doc in documents]
    results = await asyncio.gather(*tasks)
    return results
```
---
## 10. Security Considerations
### 10.1 Data Privacy
- **On-Premise Processing**: All data stays local
- **No External APIs**: Except OpenAI for RAGAS (configurable)
- **Local LLM**: Ollama runs entirely on-premise
- **Encrypted Storage**: Optional SQLite encryption
### 10.2 Input Validation
```python
# File Upload Validation
from pathlib import Path

from fastapi import UploadFile

MAX_FILE_SIZE = 100 * 1024 * 1024  # 100MB
ALLOWED_EXTENSIONS = {'.pdf', '.docx', '.txt', '.zip'}

def validate_upload(file: UploadFile):
    # Check extension
    if Path(file.filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError("Unsupported file type")
    # Check size
    if file.size > MAX_FILE_SIZE:
        raise ValueError("File too large")
    # Scan for malicious content (optional)
    # scan_for_malware(file)
```
### 10.3 Rate Limiting
```python
# Simple rate limiting
from collections import defaultdict
from datetime import datetime, timedelta

from fastapi import HTTPException, Request

rate_limits = defaultdict(list)

def check_rate_limit(request: Request, limit: int = 100):
    ip = request.client.host
    now = datetime.now()
    # Drop requests older than the one-minute window
    rate_limits[ip] = [
        ts for ts in rate_limits[ip]
        if now - ts < timedelta(minutes=1)
    ]
    # Check limit
    if len(rate_limits[ip]) >= limit:
        raise HTTPException(429, "Rate limit exceeded")
    rate_limits[ip].append(now)
```
---
## Conclusion
This architecture document provides a comprehensive technical blueprint for the QuerySphere system. The modular design, clear separation of concerns, and production-ready considerations make this system suitable for enterprise deployment while maintaining flexibility for future enhancements.
### Key Architectural Strengths
1. **Modularity**: Each component is independent and replaceable
2. **Scalability**: Horizontal scaling through stateless API design
3. **Performance**: Intelligent caching and batch processing
4. **Quality**: Real-time RAGAS evaluation for continuous monitoring
5. **Privacy**: Complete on-premise processing with local LLM
6. **Simplicity**: Minimal external dependencies (no Redis, no web scraping)
### Future Enhancements
**Short-term:**
- Redis cache for multi-instance deployments
- Advanced monitoring dashboard
- User authentication and authorization
- API rate limiting enhancements
**Long-term:**
- Distributed processing with Celery
- Web scraping module (optional)
- Fine-tuned domain-specific embeddings
- Multi-tenant support
- Advanced analytics and reporting
---
Document Version: 1.0
Last Updated: November 2025
Author: Satyaki Mitra
---
> This document is part of the QuerySphere technical documentation suite.