# QuerySphere - Technical Architecture Document

## 1. System Overview

### 1.1 High-Level Architecture

```mermaid
graph TB
    subgraph "Frontend Layer"
        A[Web UI<br/>HTML/CSS/JS]
        B[File Upload<br/>Drag & Drop]
        C[Chat Interface<br/>Real-time]
        D[Analytics Dashboard<br/>RAGAS Metrics]
    end
    
    subgraph "API Gateway"
        E[FastAPI Server<br/>Python 3.11+]
    end
    
    subgraph "Core Processing Engine"
        F[Ingestion Module]
        G[Processing Module]
        H[Retrieval Module]
        I[Generation Module]
        J[Evaluation Module]
    end
    
    subgraph "AI/ML Layer"
        K[Ollama LLM<br/>Mistral-7B]
        L[Embedding Model<br/>BGE-small-en]
        M[FAISS Vector DB]
    end
    
    subgraph "Quality Assurance"
        N[RAGAS Evaluator<br/>Real-time Metrics]
    end
    
    A --> E
    E --> F
    F --> G
    G --> H
    H --> I
    I --> K
    G --> L
    L --> M
    H --> M
    I --> N
    N --> E
```

### 1.2 System Characteristics

| Aspect | Specification |
|--------|---------------|
| **Architecture Style** | Modular Microservices-inspired |
| **Deployment** | Docker Containerized |
| **Processing Model** | Async/Event-driven |
| **Data Flow** | Pipeline-based with Checkpoints |
| **Scalability** | Horizontal (Stateless API) + Vertical (GPU) |
| **Caching** | In-Memory LRU Cache |
| **Evaluation** | Real-time RAGAS Metrics |

---

## 2. Component Architecture

### 2.1 Ingestion Module

```mermaid
flowchart TD
    A[User Input] --> B{Input Type Detection}
    
    B -->|PDF/DOCX| D[Document Parser]
    B -->|ZIP| E[Archive Extractor]
    
    subgraph D [Document Processing]
        D1[PyPDF2<br/>PDF Text]
        D2[python-docx<br/>Word Docs]
        D3[EasyOCR<br/>Scanned PDFs]
    end
    
    subgraph E [Archive Handling]
        E1[zipfile<br/>Extraction]
        E2[Recursive Processing]
        E3[Size Validation<br/>2GB Max]
    end
    
    D --> F[Text Cleaning]
    E --> F
    
    F --> G[Encoding Normalization]
    G --> H[Structure Preservation]
    H --> I[Output: Cleaned Text<br/>+ Metadata]
```

#### Ingestion Specifications

| Component | Technology | Configuration | Limits |
|-----------|------------|---------------|---------|
| **PDF Parser** | PyPDF2 + EasyOCR | OCR: English+Multilingual | 1000 pages max |
| **Document Parser** | python-docx | Preserve formatting | 50MB per file |
| **Archive Handler** | zipfile | Recursion depth: 5 | 2GB total, 10k files |
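
The routing between these parsers can be illustrated with a minimal sketch. This is not the actual ingestion code: the dispatch function and the OCR fallback comment are illustrative, and ZIP extraction (zipfile recursion, size checks) is elided.

```python
# Minimal sketch of input-type dispatch for the ingestion module.
# Illustrative only; the real pipeline also handles ZIP archives and OCR fallback.
from pathlib import Path

from PyPDF2 import PdfReader
from docx import Document


def extract_text(path: Path) -> str:
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        reader = PdfReader(str(path))
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        # A scanned PDF with no text layer would fall back to EasyOCR here.
        return text
    if suffix == ".docx":
        doc = Document(str(path))
        return "\n".join(p.text for p in doc.paragraphs)
    if suffix == ".txt":
        return path.read_text(errors="ignore")
    raise ValueError(f"Unsupported file type: {suffix}")
```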

### 2.2 Processing Module

#### 2.2.1 Adaptive Chunking Strategy

```mermaid
flowchart TD
    A[Input Text] --> B[Token Count Analysis]
    B --> C{Document Size}
    
    C -->|<50K tokens| D[Fixed-Size Chunking]
    C -->|50K-500K tokens| E[Semantic Chunking]
    C -->|>500K tokens| F[Hierarchical Chunking]
    
    subgraph D [Strategy 1: Fixed]
        D1[Chunk Size: 512 tokens]
        D2[Overlap: 50 tokens]
        D3[Method: Simple sliding window]
    end
    
    subgraph E [Strategy 2: Semantic]
        E1[Breakpoint: 95th percentile similarity]
        E2[Method: LlamaIndex SemanticSplitter]
        E3[Preserve: Section boundaries]
    end
    
    subgraph F [Strategy 3: Hierarchical]
        F1[Parent: 2048 tokens]
        F2[Child: 512 tokens]
        F3[Retrieval: Child → Parent expansion]
    end
    
    D --> G[Chunk Metadata]
    E --> G
    F --> G
    
    G --> H[Embedding Generation]
```
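
The size thresholds in the flowchart reduce to a simple strategy selector. A minimal sketch follows; the returned labels are descriptive names, not actual class identifiers.

```python
# Minimal sketch of adaptive strategy selection; thresholds mirror the flowchart above.
def select_chunking_strategy(token_count: int) -> str:
    if token_count < 50_000:
        return "fixed"          # 512-token windows, 50-token overlap
    if token_count <= 500_000:
        return "semantic"       # LlamaIndex SemanticSplitter, 95th percentile breakpoint
    return "hierarchical"       # 2048-token parents, 512-token children
```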

#### 2.2.2 Embedding Pipeline

```python
# Embedding Configuration
import torch

EMBEDDING_CONFIG = {
    "model": "BAAI/bge-small-en-v1.5",
    "dimensions": 384,
    "batch_size": 32,
    "normalize": True,
    "device": "cuda" if torch.cuda.is_available() else "cpu",
    "max_sequence_length": 512
}
```

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| **Model** | BAAI/bge-small-en-v1.5 | SOTA quality, 62.17 MTEB score |
| **Dimensions** | 384 | Optimal speed/accuracy balance |
| **Batch Size** | 32 | Memory efficiency on GPU/CPU |
| **Normalization** | L2 | Required for cosine similarity |
| **Speed** | 1000 docs/sec (CPU) | 10x faster than alternatives |
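
Assuming the model is loaded through sentence-transformers, embedding generation with this configuration reduces to a short sketch:

```python
# Minimal sketch of embedding generation under EMBEDDING_CONFIG.
# Assumes sentence-transformers is installed; values mirror the table above.
import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("BAAI/bge-small-en-v1.5", device=device)

def embed_texts(texts: list[str]):
    # L2-normalized vectors are required for cosine similarity in FAISS.
    return model.encode(texts, batch_size=32, normalize_embeddings=True)
```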

---

### 2.3 Storage Module Architecture

```mermaid
graph TB
    subgraph "Storage Layer"
        A[FAISS Vector Store]
        B[BM25 Keyword Index]
        C[SQLite Metadata]
        D[LRU Cache<br/>In-Memory]
    end
    
    subgraph A [Vector Storage Architecture]
        A1[IndexHNSW<br/>Large datasets]
        A2[IndexIVFFlat<br/>Medium datasets]
        A3[IndexFlatL2<br/>Small datasets]
    end
    
    subgraph B [Keyword Index]
        B1[rank_bm25 Library]
        B2[TF-IDF Weights]
        B3[In-memory Index]
    end
    
    subgraph C [Metadata Management]
        C1[Document Metadata]
        C2[Chunk Relationships]
        C3[User Sessions]
        C4[RAGAS Evaluations]
    end
    
    subgraph D [Cache Layer]
        D1[Query Embeddings]
        D2[Frequent Results]
        D3[LRU Eviction]
    end
    
    A --> E[Hybrid Retrieval]
    B --> E
    C --> E
    D --> E
```

#### Vector Store Configuration

| Index Type | Use Case | Parameters | Performance |
|------------|----------|------------|-------------|
| **IndexFlatL2** | < 100K vectors | Exact search | O(n), High accuracy |
| **IndexIVFFlat** | 100K-1M vectors | nprobe: 10-20 | O(log n), Balanced |
| **IndexHNSW** | > 1M vectors | M: 16, efConstruction: 40 | O(log n), Fastest |
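
A hedged sketch of how the index type could be chosen from corpus size. The IVF `nlist` value is an illustrative default (not taken from the table), and an IVF index still needs `index.train(...)` on sample vectors before adding data.

```python
# Minimal sketch of FAISS index selection based on corpus size.
# M and efConstruction follow the table; nlist is an illustrative default.
import faiss

def build_index(dim: int, n_vectors: int) -> faiss.Index:
    if n_vectors < 100_000:
        return faiss.IndexFlatL2(dim)                      # exact search
    if n_vectors < 1_000_000:
        quantizer = faiss.IndexFlatL2(dim)
        index = faiss.IndexIVFFlat(quantizer, dim, 1024)   # requires index.train(...)
        index.nprobe = 10
        return index
    index = faiss.IndexHNSWFlat(dim, 16)                   # M = 16
    index.hnsw.efConstruction = 40
    return index
```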

#### Caching Strategy

```python
# LRU Cache Configuration
CACHE_CONFIG = {
    "max_size": 1000,        # Maximum cached items
    "ttl": 3600,             # Time to live (seconds)
    "eviction": "LRU",       # Least Recently Used
    "cache_embeddings": True,
    "cache_results": True
}
```

**Benefits:**
- **Reduced latency**: 80% reduction for repeat queries
- **Resource efficiency**: Avoid re-computing embeddings
- **No external dependencies**: Pure Python implementation
- **Memory efficient**: LRU eviction prevents unbounded growth

---

### 2.4 Retrieval Module

#### 2.4.1 Hybrid Retrieval Pipeline

```mermaid
flowchart TD
    A[User Query] --> B[Query Processing]
    
    B --> C[Vector Embedding]
    B --> D[Keyword Extraction]
    
    C --> E[FAISS Search<br/>Top-K: 10]
    D --> F[BM25 Search<br/>Top-K: 10]
    
    E --> G[Reciprocal Rank Fusion]
    F --> G
    
    G --> H{Reranking Enabled?}
    
    H -->|Yes| I[Cross-Encoder Reranking]
    H -->|No| J[Final Top-5 Selection]
    
    I --> J
    
    J --> K[Context Assembly]
    K --> L[Citation Formatting]
    L --> M[Output: Context + Sources]
```

#### 2.4.2 Retrieval Algorithms

**Hybrid Fusion Formula:**

```text
RRF_score(doc) = vector_weight * (1 / (60 + vector_rank)) + bm25_weight * (1 / (60 + bm25_rank))
```

**Default Weights:**
- Vector Similarity: 60%
- BM25 Keyword: 40%
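
A minimal sketch of this fusion, assuming each retriever returns a ranked list of document IDs with rank positions starting at 1:

```python
# Minimal sketch of weighted Reciprocal Rank Fusion over two ranked lists.
# Weights follow the defaults above (0.6 vector, 0.4 BM25); k = 60 as in the formula.
def rrf_fuse(vector_ranked: list[str], bm25_ranked: list[str],
             vector_weight: float = 0.6, bm25_weight: float = 0.4,
             k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for rank, doc_id in enumerate(vector_ranked, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + vector_weight * (1.0 / (k + rank))
    for rank, doc_id in enumerate(bm25_ranked, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + bm25_weight * (1.0 / (k + rank))
    return sorted(scores, key=scores.get, reverse=True)
```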

**BM25 Parameters:**

```python
BM25_CONFIG = {
    "k1": 1.5,      # Term frequency saturation
    "b": 0.75,      # Length normalization
    "epsilon": 0.25  # Smoothing factor
}
```
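
These parameters map directly onto the `rank_bm25` constructor. A minimal sketch, assuming simple whitespace tokenization (the real pipeline may tokenize differently):

```python
# Minimal sketch of keyword scoring with rank-bm25 using the parameters above.
from rank_bm25 import BM25Okapi

corpus = ["hybrid retrieval combines vectors and keywords",
          "faiss performs approximate nearest neighbor search"]
tokenized_corpus = [doc.split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus, k1=1.5, b=0.75, epsilon=0.25)
scores = bm25.get_scores("vector search".split())   # one score per corpus document
```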

---

### 2.5 Generation Module

#### 2.5.1 LLM Integration Architecture

```mermaid
graph TB
    subgraph "Ollama Integration"
        A[Ollama Server]
        B[Mistral-7B-Instruct]
        C[LLaMA-2-13B-Chat]
    end
    
    subgraph "Prompt Engineering"
        D[System Prompt Template]
        E[Context Formatting]
        F[Citation Injection]
    end
    
    subgraph "Generation Control"
        G[Temperature Controller]
        H[Token Manager]
        I[Streaming Handler]
    end
    
    A --> J[API Client]
    B --> A
    C --> A
    
    D --> K[Prompt Assembly]
    E --> K
    F --> K
    
    G --> L[Generation Parameters]
    H --> L
    I --> L
    
    K --> M[LLM Request]
    L --> M
    M --> J
    J --> N[Response Processing]
```

#### 2.5.2 LLM Configuration

| Parameter | Default Value | Range | Description |
|-----------|---------------|-------|-------------|
| **Model** | Mistral-7B-Instruct | - | Primary inference model |
| **Temperature** | 0.1 | 0.0-1.0 | Response creativity |
| **Max Tokens** | 1000 | 100-4000 | Response length limit |
| **Top-P** | 0.9 | 0.1-1.0 | Nucleus sampling |
| **Context Window** | 32K | - | Mistral model capacity |
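
A minimal sketch of a non-streaming generation request against a local Ollama server on the default port. The option names follow Ollama's REST API (`num_predict` caps response length); the model tag depends on which model was pulled locally.

```python
# Minimal sketch of a non-streaming call to a local Ollama server.
# Parameter values mirror the defaults in the table above.
import requests

payload = {
    "model": "mistral",   # tag of the locally pulled Mistral-7B-Instruct model
    "prompt": "Answer using only the provided context:\n<context>\n<question>",
    "stream": False,
    "options": {
        "temperature": 0.1,
        "top_p": 0.9,
        "num_predict": 1000,   # max tokens
    },
}
response = requests.post("http://localhost:11434/api/generate", json=payload, timeout=120)
answer = response.json()["response"]
```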

---

### 2.6 RAGAS Evaluation Module

#### 2.6.1 RAGAS Evaluation Pipeline

```mermaid
flowchart LR
    A[Query] --> B[Generated Answer]
    C[Retrieved Context] --> B
    
    B --> D[RAGAS Evaluator]
    C --> D
    
    D --> E[Answer Relevancy]
    D --> F[Faithfulness]
    D --> G[Context Utilization]
    D --> H[Context Relevancy]
    
    E --> I[Metrics Aggregation]
    F --> I
    G --> I
    H --> I
    
    I --> J[Analytics Dashboard]
    I --> K[SQLite Storage]
    I --> L[Session Statistics]
```

#### 2.6.2 Evaluation Metrics

| Metric | Target | Measurement Method | Importance |
|--------|--------|-------------------|------------|
| **Answer Relevancy** | > 0.85 | LLM-based evaluation | Core user satisfaction |
| **Faithfulness** | > 0.90 | Grounded in context check | Prevents hallucinations |
| **Context Utilization** | > 0.80 | How well context is used | Generation effectiveness |
| **Context Relevancy** | > 0.85 | Retrieved chunks relevance | Retrieval quality |

**Implementation Details:**

```python
# RAGAS Configuration
RAGAS_CONFIG = {
    "enable_ragas": True,
    "enable_ground_truth": False,
    "base_metrics": [
        "answer_relevancy",
        "faithfulness",
        "context_utilization",
        "context_relevancy"
    ],
    "ground_truth_metrics": [
        "context_precision",
        "context_recall",
        "answer_similarity",
        "answer_correctness"
    ],
    "evaluation_timeout": 60,
    "batch_size": 10
}
```

**Evaluation Flow:**

1. **Automatic Trigger**: Every query-response pair is evaluated
2. **Async Processing**: Evaluation runs in background (non-blocking)
3. **Storage**: Results stored in SQLite for analytics
4. **Aggregation**: Session-level statistics computed on-demand
5. **Export**: Full evaluation data available for download
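
For illustration, a minimal sketch of evaluating one query-response pair, assuming the Ragas 0.1.x API; the metric names follow `RAGAS_CONFIG` above and may differ in other library versions.

```python
# Minimal sketch of a RAGAS evaluation for a single query-response pair.
# Assumes ragas 0.1.x; metric names follow RAGAS_CONFIG and may vary by version.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_relevancy, faithfulness,
                           context_utilization, context_relevancy)

sample = Dataset.from_dict({
    "question": ["What is hybrid retrieval?"],
    "answer":   ["It fuses vector similarity search with BM25 keyword search."],
    "contexts": [["Hybrid retrieval combines FAISS vector search with BM25."]],
})

result = evaluate(sample, metrics=[answer_relevancy, faithfulness,
                                   context_utilization, context_relevancy])
print(result)   # per-metric scores for the pair
```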

---

## 3. Data Flow & Workflows

### 3.1 End-to-End Processing Pipeline

```mermaid
sequenceDiagram
    participant U as User
    participant F as Frontend
    participant A as API Gateway
    participant I as Ingestion
    participant P as Processing
    participant S as Storage
    participant R as Retrieval
    participant G as Generation
    participant E as RAGAS Evaluator
    
    U->>F: Upload Documents
    F->>A: POST /api/upload
    A->>I: Process Input Sources
    
    Note over I: Parallel Processing
    I->>I: Document Parsing
    I->>P: Extracted Text + Metadata
    
    P->>P: Adaptive Chunking
    P->>P: Embedding Generation
    P->>S: Store Vectors + Indexes
    
    S->>F: Processing Complete
    
    U->>F: Send Query
    F->>A: POST /api/chat
    
    A->>R: Hybrid Retrieval
    R->>S: Vector + BM25 Search
    S->>R: Top-K Chunks
    
    R->>G: Context + Query
    G->>G: LLM Generation
    G->>F: Response + Citations
    
    G->>E: Auto-evaluation (async)
    E->>E: Compute RAGAS Metrics
    E->>S: Store Evaluation Results
    E->>F: Return Metrics
```

### 3.2 Real-time Query Processing

```mermaid
flowchart TD
    A[User Query] --> B[Query Understanding]
    B --> C[Check Cache]
    
    C --> D{Cache Hit?}
    D -->|Yes| E[Return Cached Embedding]
    D -->|No| F[Generate Embedding]
    
    F --> G[Store in Cache]
    E --> H[FAISS Vector Search]
    G --> H
    
    B --> I[Keyword Extraction]
    I --> J[BM25 Keyword Search]
    
    H --> K[Reciprocal Rank Fusion]
    J --> K
    
    K --> L[Top-20 Candidates]
    L --> M{Reranking Enabled?}
    
    M -->|Yes| N[Cross-Encoder Reranking]
    M -->|No| O[Select Top-5]
    
    N --> O
    O --> P[Context Assembly]
    P --> Q[LLM Prompt Construction]
    Q --> R[Ollama Generation]
    R --> S[Citation Formatting]
    S --> T[Response Streaming]
    T --> U[User Display]
    
    R --> V[Async RAGAS Evaluation]
    V --> W[Compute Metrics]
    W --> X[Store Results]
    X --> Y[Update Dashboard]
```

---

## 4. Infrastructure & Deployment

### 4.1 Container Architecture

```mermaid
graph TB
    subgraph "Docker Compose Stack"
        A[Frontend Container<br/>nginx:alpine]
        B[Backend Container<br/>python:3.11]
        C[Ollama Container<br/>ollama/ollama]
    end
    
    subgraph "External Services"
        D[FAISS Indices<br/>Persistent Volume]
        E[SQLite Database<br/>Persistent Volume]
        F[Log Files<br/>Persistent Volume]
    end
    
    A --> B
    B --> C
    B --> D
    B --> E
    B --> F
```

### 4.2 Resource Requirements

#### 4.2.1 Minimum Deployment

| Resource | Specification | Purpose |
|----------|---------------|---------|
| **CPU** | 4 cores | Document processing, embeddings |
| **RAM** | 8GB | Model loading, FAISS indices, cache |
| **Storage** | 20GB | Models, indices, documents |
| **GPU** | Optional | 2-3x speedup for inference |

#### 4.2.2 Production Deployment

| Resource | Specification | Purpose |
|----------|---------------|---------|
| **CPU** | 8+ cores | Concurrent processing |
| **RAM** | 16GB+ | Larger datasets, caching |
| **GPU** | RTX 3090/4090 | 20-30 tokens/sec inference |
| **Storage** | 100GB+ SSD | Fast vector search |

---

## 5. API Architecture

### 5.1 REST API Endpoints

```mermaid
graph TB
    subgraph "System Management"
        A[GET /api/health]
        B[GET /api/system-info]
        C[GET /api/configuration]
        D[POST /api/configuration]
    end
    
    subgraph "Document Management"
        E[POST /api/upload]
        F[POST /api/start-processing]
        G[GET /api/processing-status]
    end
    
    subgraph "Query & Chat"
        H[POST /api/chat]
        I[GET /api/export-chat/:session_id]
    end
    
    subgraph "RAGAS Evaluation"
        J[GET /api/ragas/history]
        K[GET /api/ragas/statistics]
        L[POST /api/ragas/clear]
        M[GET /api/ragas/export]
        N[GET /api/ragas/config]
    end
    
    subgraph "Analytics"
        O[GET /api/analytics]
        P[GET /api/analytics/refresh]
        Q[GET /api/analytics/detailed]
    end
```

### 5.2 Request/Response Flow

```python
# Typical Chat Request Flow with RAGAS
REQUEST_FLOW = {
    "authentication": "None (local deployment)",
    "rate_limiting": "100 requests/minute per IP",
    "validation": "Query length, session ID format",
    "processing": "Async with progress tracking",
    "response": "JSON with citations + metrics + RAGAS scores",
    "caching": "LRU cache for embeddings",
    "evaluation": "Automatic RAGAS metrics (async)"
}
```
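
A minimal sketch of the chat endpoint's shape implied by this flow. The request/response models and the `retrieve_context` / `generate_answer` helpers are hypothetical stand-ins for the retrieval and generation modules, not the actual handler.

```python
# Minimal sketch of the /api/chat handler shape; helpers below are placeholders.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    session_id: str
    query: str

class ChatResponse(BaseModel):
    answer: str
    citations: list[str]

def retrieve_context(query: str) -> tuple[str, list[str]]:
    # Placeholder for the hybrid retrieval module.
    return "retrieved context", ["doc1.pdf#chunk-3"]

def generate_answer(query: str, context: str) -> str:
    # Placeholder for the Ollama generation module.
    return f"Answer to '{query}' grounded in the retrieved context."

@app.post("/api/chat", response_model=ChatResponse)
async def chat(req: ChatRequest) -> ChatResponse:
    context, citations = retrieve_context(req.query)
    answer = generate_answer(req.query, context)
    # RAGAS evaluation would be scheduled here as a background task.
    return ChatResponse(answer=answer, citations=citations)
```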

---

## 6. Monitoring & Quality Assurance

### 6.1 RAGAS Integration

```mermaid
graph LR
    A[API Gateway] --> B[Query Processing]
    C[Retrieval Module] --> B
    D[Generation Module] --> B
    
    B --> E[RAGAS Evaluator]
    
    E --> F[Analytics Dashboard]
    
    F --> G[Answer Relevancy]
    F --> H[Faithfulness]
    F --> I[Context Utilization]
    F --> J[Context Relevancy]
    F --> K[Session Statistics]
```

### 6.2 Key Performance Indicators

| Category | Metric | Target | Alert Threshold |
|----------|--------|--------|-----------------|
| **Performance** | Query Latency (p95) | < 5s | > 10s |
| **Quality** | Answer Relevancy | > 0.85 | < 0.70 |
| **Quality** | Faithfulness | > 0.90 | < 0.80 |
| **Quality** | Context Utilization | > 0.80 | < 0.65 |
| **Quality** | Overall Score | > 0.85 | < 0.70 |
| **Reliability** | Uptime | > 99.5% | < 95% |

### 6.3 Analytics Dashboard Features

**Real-Time Metrics:**
- RAGAS evaluation table with all query-response pairs
- Session-level aggregate statistics
- Performance metrics (latency, throughput)
- Component health status

**Historical Analysis:**
- Quality trend over time
- Performance degradation detection
- Cache hit rate monitoring
- Resource utilization tracking

**Export Capabilities:**
- JSON export of all evaluation data
- CSV export for external analysis
- Session-based filtering
- Time-range queries

---

## 7. Technology Stack Details

### Complete Technology Matrix

| Layer | Component | Technology | Version | Purpose |
|-------|-----------|------------|---------|----------|
| **Frontend** | UI Framework | HTML5/CSS3/JS | - | Responsive interface |
| **Frontend** | Styling | Tailwind CSS | 3.3+ | Utility-first CSS |
| **Frontend** | Icons | Font Awesome | 6.0+ | Icon library |
| **Backend** | API Framework | FastAPI | 0.104+ | Async REST API |
| **Backend** | Python Version | Python | 3.11+ | Runtime |
| **AI/ML** | LLM Engine | Ollama | 0.1.20+ | Local LLM inference |
| **AI/ML** | Primary Model | Mistral-7B-Instruct | v0.2 | Text generation |
| **AI/ML** | Embeddings | sentence-transformers | 2.2.2+ | Vector embeddings |
| **AI/ML** | Embedding Model | BAAI/bge-small-en | v1.5 | Semantic search |
| **Vector DB** | Storage | FAISS | 1.7.4+ | Vector similarity |
| **Search** | Keyword | rank-bm25 | 0.2.1 | BM25 implementation |
| **Evaluation** | Quality | Ragas | 0.1.9 | RAG evaluation |
| **Document** | PDF | PyPDF2 | 3.0+ | PDF text extraction |
| **Document** | Word | python-docx | 1.1+ | DOCX processing |
| **OCR** | Text Recognition | EasyOCR | 1.7+ | Scanned documents |
| **Database** | Metadata | SQLite | 3.35+ | Local storage |
| **Cache** | In-memory | Python functools | - | LRU caching |
| **Deployment** | Container | Docker | 24.0+ | Containerization |
| **Deployment** | Orchestration | Docker Compose | 2.20+ | Multi-container |

---

## 8. Key Architectural Decisions

### 8.1 Why Local Caching Instead of Redis?

**Decision:** Use in-memory LRU cache with Python's `functools.lru_cache`

**Rationale:**
- **Simplicity**: No external service to manage
- **Performance**: Faster access (no network overhead)
- **MVP Focus**: Adequate for initial deployment
- **Resource Efficient**: No separate cache service to provision or monitor
- **Easy Migration**: Can upgrade to Redis later if needed

**Trade-offs:**
- Cache doesn't persist across restarts
- Can't share cache across multiple instances
- Limited by single-process memory

### 8.2 Why RAGAS for Evaluation?

**Decision:** Integrate RAGAS for real-time quality assessment

**Rationale:**
- **Automated Metrics**: No manual annotation required
- **Production-Ready**: Quantifiable quality scores
- **Real-Time**: Evaluate every query-response pair
- **Comprehensive**: Multiple dimensions of quality
- **Research-Backed**: Based on academic research

**Implementation Details:**
- OpenAI API key required for LLM-based metrics
- Async evaluation to avoid blocking responses
- SQLite storage for historical analysis
- Export capability for offline processing

### 8.3 Why No Web Scraping?

**Decision:** Removed web scraping from MVP

**Rationale:**
- **Complexity**: Anti-scraping mechanisms require maintenance
- **Reliability**: Website changes break scrapers
- **Legal**: Potential legal/ethical issues
- **Scope**: Focus on core RAG functionality first

**Alternative:**
- Users can save web pages as PDFs
- Future enhancement if market demands it

---

## 9. Performance Optimization Strategies

### 9.1 Embedding Cache Strategy

```python
# Cache Implementation
from functools import lru_cache

import numpy as np

@lru_cache(maxsize=1000)
def get_query_embedding(query: str) -> np.ndarray:
    """Cache query embeddings for repeat queries."""
    # `embedder` is the shared embedding model instance (see Section 2.2.2).
    return embedder.embed(query)

# Benefits:
# - 80% reduction in latency for repeat queries
# - No re-computation of identical queries
# - Automatic LRU eviction
```

### 9.2 Batch Processing

```python
# Batch Embedding Generation
from typing import List

import numpy as np

BATCH_SIZE = 32

def embed_chunks_batch(chunks: List[str]) -> List[np.ndarray]:
    # `embedder` is the shared embedding model instance (see Section 2.2.2).
    embeddings = []
    for i in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[i:i+BATCH_SIZE]
        batch_embeddings = embedder.embed_batch(batch)
        embeddings.extend(batch_embeddings)
    return embeddings
```

### 9.3 Async Processing

```python
# Async Document Processing
import asyncio
from pathlib import Path
from typing import List

async def process_documents_async(documents: List[Path]):
    # process_single_document() is the per-document ingestion coroutine.
    tasks = [process_single_document(doc) for doc in documents]
    results = await asyncio.gather(*tasks)
    return results
```

---

## 10. Security Considerations

### 10.1 Data Privacy

- **On-Premise Processing**: All data stays local
- **No External APIs**: Except OpenAI for RAGAS (configurable)
- **Local LLM**: Ollama runs entirely on-premise
- **Encrypted Storage**: Optional SQLite encryption

### 10.2 Input Validation

```python
# File Upload Validation
from pathlib import Path

from fastapi import UploadFile

MAX_FILE_SIZE = 100 * 1024 * 1024  # 100MB
ALLOWED_EXTENSIONS = {'.pdf', '.docx', '.txt', '.zip'}

def validate_upload(file: UploadFile):
    # Check extension
    if Path(file.filename).suffix not in ALLOWED_EXTENSIONS:
        raise ValueError("Unsupported file type")
    
    # Check size
    if file.size > MAX_FILE_SIZE:
        raise ValueError("File too large")
    
    # Scan for malicious content (optional)
    # scan_for_malware(file)
```

### 10.3 Rate Limiting

```python
# Simple rate limiting
from fastapi import HTTPException, Request
from collections import defaultdict
from datetime import datetime, timedelta

rate_limits = defaultdict(list)

def check_rate_limit(request: Request, limit: int = 100):
    ip = request.client.host
    now = datetime.now()
    
    # Clean old requests
    rate_limits[ip] = [
        ts for ts in rate_limits[ip] 
        if now - ts < timedelta(minutes=1)
    ]
    
    # Check limit
    if len(rate_limits[ip]) >= limit:
        raise HTTPException(429, "Rate limit exceeded")
    
    rate_limits[ip].append(now)
```

---

## Conclusion

This architecture document provides a comprehensive technical blueprint for the QuerySphere system. The modular design, clear separation of concerns, and production-ready considerations make this system suitable for enterprise deployment while maintaining flexibility for future enhancements.

### Key Architectural Strengths

1. **Modularity**: Each component is independent and replaceable
2. **Scalability**: Horizontal scaling through stateless API design
3. **Performance**: Intelligent caching and batch processing
4. **Quality**: Real-time RAGAS evaluation for continuous monitoring
5. **Privacy**: Complete on-premise processing with local LLM
6. **Simplicity**: Minimal external dependencies (no Redis, no web scraping)

### Future Enhancements

**Short-term:**
- Redis cache for multi-instance deployments
- Advanced monitoring dashboard
- User authentication and authorization
- API rate limiting enhancements

**Long-term:**
- Distributed processing with Celery
- Web scraping module (optional)
- Fine-tuned domain-specific embeddings
- Multi-tenant support
- Advanced analytics and reporting

---

Document Version: 1.0
Last Updated: November 2025
Author: Satyaki Mitra

---

> This document is part of the QuerySphere technical documentation suite.