# KnowledgeBridge - System Architecture & Flow

## 🎯 Overview

This document describes the KnowledgeBridge system architecture, its data flow, and its AI processing pipeline.

## 📊 Main Data Flow

```
User Query → AI Enhancement → Multi-Source Search → URL Validation → Results Display
```

## 🔄 Detailed Process Flow

### Stage 1: Input Processing & Enhancement
**Components:**
- Enhanced Search Interface (React/TypeScript)
- Input validation and sanitization
- Rate limiting middleware

**Technical Details:**
- React captures user input with real-time validation
- Optional AI query enhancement using Nebius DeepSeek models
- Express.js endpoint with comprehensive security middleware
- Request body size limits and input sanitization
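
The sanitization step above can be sketched as a small pure function. This is a minimal illustration, not the project's actual code: `sanitizeQuery`, `ValidationResult`, and the 500-character limit are hypothetical, since this document does not give the real validation rules.

```typescript
// Hypothetical limit; the real value is not given in this document.
const MAX_QUERY_LENGTH = 500;

type ValidationResult =
  | { ok: true; query: string }
  | { ok: false; error: string };

// Normalizes and validates a raw search query before it reaches the API route.
function sanitizeQuery(raw: unknown): ValidationResult {
  if (typeof raw !== "string") {
    return { ok: false, error: "Query must be a string" };
  }
  // Replace ASCII control characters with spaces, then collapse whitespace runs.
  const cleaned = raw
    .replace(/[\u0000-\u001f\u007f]/g, " ")
    .replace(/\s+/g, " ")
    .trim();
  if (cleaned.length === 0) {
    return { ok: false, error: "Query must not be empty" };
  }
  if (cleaned.length > MAX_QUERY_LENGTH) {
    return { ok: false, error: `Query exceeds ${MAX_QUERY_LENGTH} characters` };
  }
  return { ok: true, query: cleaned };
}
```

Returning a tagged result instead of throwing lets the route handler turn a failed check into a 400 response without a `try`/`catch`.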

### Stage 2: AI-Powered Query Enhancement
**Components:**
- Nebius AI client with DeepSeek-R1-0528 model
- Smart query analysis and improvement
- Intent recognition and keyword extraction

**Technical Details:**
- Calls the Nebius API with model ID `deepseek-ai/DeepSeek-R1-0528`
- Analyzes user intent and suggests improvements
- Provides enhanced query, keywords, and alternative suggestions
- Fallback to original query if enhancement fails
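
One way to implement the fallback is to parse the model reply defensively and return the original query whenever parsing fails. The sketch below assumes the model is prompted to answer with a JSON object; the field names (`enhancedQuery`, `keywords`, `suggestions`) and the `parseEnhancement` helper are hypothetical.

```typescript
interface EnhancedQuery {
  enhancedQuery: string;
  keywords: string[];
  suggestions: string[];
  usedFallback: boolean;
}

// Keeps only the string entries of an unknown value; anything else becomes [].
function toStringArray(value: unknown): string[] {
  return Array.isArray(value)
    ? value.filter((v): v is string => typeof v === "string")
    : [];
}

// Parses a model reply (assumed to contain one JSON object) and falls back
// to the original query if the reply is missing, malformed, or empty.
function parseEnhancement(modelReply: string, originalQuery: string): EnhancedQuery {
  const fallback: EnhancedQuery = {
    enhancedQuery: originalQuery,
    keywords: [],
    suggestions: [],
    usedFallback: true,
  };
  // Take the outermost {...} span so surrounding prose does not break parsing.
  const start = modelReply.indexOf("{");
  const end = modelReply.lastIndexOf("}");
  if (start === -1 || end <= start) return fallback;
  try {
    const parsed: unknown = JSON.parse(modelReply.slice(start, end + 1));
    if (typeof parsed !== "object" || parsed === null) return fallback;
    const obj = parsed as Record<string, unknown>;
    const candidate = obj["enhancedQuery"];
    if (typeof candidate !== "string" || candidate.trim() === "") return fallback;
    return {
      enhancedQuery: candidate.trim(),
      keywords: toStringArray(obj["keywords"]),
      suggestions: toStringArray(obj["suggestions"]),
      usedFallback: false,
    };
  } catch {
    return fallback;
  }
}
```

The `usedFallback` flag lets the UI show whether the search ran on the enhanced or the original query.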

### Stage 3: Embedding Generation
**Components:**
- Nebius embedding service
- BAAI/bge-en-icl model for vector generation
- Mock embedding fallback for reliability

**Technical Details:**
- Primary model: `BAAI/bge-en-icl`
- Generates high-dimensional vector representations
- Fallback to deterministic mock embeddings for demos
- Semantic meaning captured in numerical vectors
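
A deterministic mock embedding can be built by hashing tokens into a fixed-size vector. The sketch below is an assumption about how such a fallback might look, not the project's implementation: the 64-dimension size, the FNV-1a hash, and the function names are illustrative. Unlike a real model such as `BAAI/bge-en-icl`, it captures token overlap only, not semantic meaning.

```typescript
// Deterministic mock embedding: hashes each token into one bucket of a
// fixed-size vector, then L2-normalizes. Same text always yields same vector.
function mockEmbedding(text: string, dimensions: number = 64): number[] {
  const vector: number[] = new Array<number>(dimensions).fill(0);
  const tokens = text.toLowerCase().split(/\W+/).filter(Boolean);
  for (const token of tokens) {
    // FNV-1a 32-bit hash of the token.
    let hash = 0x811c9dc5;
    for (let i = 0; i < token.length; i++) {
      hash ^= token.charCodeAt(i);
      hash = Math.imul(hash, 0x01000193) >>> 0;
    }
    vector[hash % dimensions] += 1;
  }
  const norm = Math.sqrt(vector.reduce((sum, v) => sum + v * v, 0));
  return norm === 0 ? vector : vector.map((v) => v / norm);
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}
```

Because the output depends only on the input text, demos and tests stay reproducible when the embedding API is unavailable.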

### Stage 4: Multi-Source Search
**Components:**
- Local storage search (in-memory with sample data)
- GitHub repository search with advanced filtering
- Wikipedia API integration
- ArXiv academic paper search

**Technical Details:**
- **Local Search**: Keyword matching with relevance scoring
- **GitHub API**: Enhanced with author filtering and fallback strategies
- **Wikipedia API**: 3-second timeout with content validation
- **ArXiv API**: Format validation and paper existence verification
- **Parallel Processing**: Concurrent search across all sources
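
The concurrent search can be sketched with `Promise.all` plus a per-source timeout, so that one slow or failing source does not block the others. All names here (`SearchFn`, `withTimeout`, `searchAllSources`) are hypothetical, and applying the 3-second timeout to every source, not only Wikipedia, is an assumption.

```typescript
interface SearchResult {
  title: string;
  url: string;
  source: string;
}

type SearchFn = (query: string) => Promise<SearchResult[]>;

// Rejects if the promise does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Queries every source concurrently. A source that throws or times out
// contributes no results and is reported in `failed`.
async function searchAllSources(
  query: string,
  sources: Record<string, SearchFn>,
  timeoutMs: number = 3000,
): Promise<{ results: SearchResult[]; failed: string[] }> {
  const names = Object.keys(sources);
  const outcomes = await Promise.all(
    names.map((name) =>
      // Promise.resolve().then(...) also captures synchronous throws.
      withTimeout(Promise.resolve().then(() => sources[name](query)), timeoutMs).then(
        (found) => ({ name, ok: true, results: found }),
        () => ({ name, ok: false, results: [] as SearchResult[] }),
      ),
    ),
  );
  const results: SearchResult[] = [];
  const failed: string[] = [];
  for (const outcome of outcomes) {
    if (outcome.ok) results.push(...outcome.results);
    else failed.push(outcome.name);
  }
  return { results, failed };
}
```

Total latency is bounded by the slowest source or the timeout, whichever comes first, not by the sum of all sources.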

### Stage 5: URL Validation & Content Verification
**Components:**
- Smart URL validation system
- ArXiv paper ID format checking
- Content-based error detection
- Concurrent processing with rate limits

**Technical Details:**
- **ArXiv Validation**: Checks paper ID format (new style `2401.12345`, old style `cs.AI/0701001`)
- **Content Verification**: Detects error pages that return 200 status
- **Rate Limiting**: Configurable concurrency to prevent API abuse
- **Trusted Domains**: Fast-path for reliable sources (GitHub, Wikipedia)
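
The checks above can be sketched as three small predicates. The regular expressions follow the public arXiv identifier scheme (`YYMM.NNNNN` for new-style IDs, `archive/YYMMNNN` for old-style IDs) and validate the month field, so an ID whose month is outside 01-12 is rejected. The trusted-domain list and the error-page markers are hypothetical placeholders.

```typescript
// New-style IDs: YYMM.NNNN or YYMM.NNNNN with optional version, e.g. 2401.12345v2.
const ARXIV_NEW_ID = /^\d{2}(0[1-9]|1[0-2])\.\d{4,5}(v\d+)?$/;
// Old-style IDs: archive[.SC]/YYMMNNN with optional version, e.g. cs.AI/0701001.
const ARXIV_OLD_ID = /^[a-z-]+(\.[A-Z]{2})?\/\d{2}(0[1-9]|1[0-2])\d{3}(v\d+)?$/;

function isValidArxivId(id: string): boolean {
  return ARXIV_NEW_ID.test(id) || ARXIV_OLD_ID.test(id);
}

// Hypothetical allowlist; the real list is not given in this document.
const TRUSTED_DOMAINS = ["github.com", "wikipedia.org"];

// True if the URL's host is a trusted domain or one of its subdomains.
function isTrustedDomain(url: string): boolean {
  let hostname: string;
  try {
    hostname = new URL(url).hostname.toLowerCase();
  } catch {
    return false; // not a parseable absolute URL
  }
  return TRUSTED_DOMAINS.some(
    (domain) => hostname === domain || hostname.endsWith("." + domain),
  );
}

// Hypothetical markers for "soft 404" pages that return HTTP 200.
const ERROR_MARKERS = ["page not found", "no such paper", "does not exist"];

function looksLikeErrorPage(body: string): boolean {
  const text = body.toLowerCase();
  return ERROR_MARKERS.some((marker) => text.indexOf(marker) !== -1);
}
```

Matching on the parsed hostname, not on the raw URL string, keeps a host like `github.com.evil.example` from passing the fast path.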

### Stage 6: Document Analysis (Optional)
**Components:**
- Nebius AI with configurable output formatting
- DeepSeek-R1 model for comprehensive analysis
- Content cleanup and markdown processing

**Technical Details:**
- Analysis types: summary, classification, key_points, quality_score
- Configurable markdown vs plain text output
- Strips DeepSeek-R1 thinking tags (`<think>…</think>`) from model output before display
- Custom prompts optimized for each analysis type
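
The thinking-tag cleanup can be sketched as follows, assuming the model wraps its reasoning in `<think>…</think>` as DeepSeek-R1 does. The function name and the handling of an unclosed tag are assumptions.

```typescript
// Removes <think>...</think> reasoning blocks from model output.
function stripThinkingTags(output: string): string {
  return output
    // Remove every complete block (non-greedy, spans newlines).
    .replace(/<think>[\s\S]*?<\/think>/gi, "")
    // If a block was never closed (e.g. truncated output), drop the remainder.
    .replace(/<think>[\s\S]*$/i, "")
    .trim();
}
```

For example, `stripThinkingTags("<think>step 1</think>\n\nFinal summary.")` returns `"Final summary."`.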

### Stage 7: Results Processing & Display
**Components:**
- Result ranking and relevance scoring
- Citation management system
- Interactive UI with error boundaries

**Technical Details:**
- Relevance-based sorting with multiple factors
- Rich metadata display with type-safe rendering
- Error boundaries prevent UI crashes
- Real-time result updates and filtering
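
A multi-factor relevance sort might look like the sketch below. The factors and weights (title match 2, snippet match 1, a per-source multiplier, a 0.5 bonus for a validated URL) are illustrative assumptions; this document does not specify the actual scoring formula.

```typescript
interface RankableResult {
  title: string;
  snippet: string;
  sourceWeight: number;   // hypothetical per-source multiplier, e.g. 1.0
  urlValidated: boolean;  // outcome of the URL validation stage
}

// Scores one result against lower-cased query terms.
function scoreResult(result: RankableResult, terms: string[]): number {
  const title = result.title.toLowerCase();
  const snippet = result.snippet.toLowerCase();
  let score = 0;
  for (const term of terms) {
    if (title.indexOf(term) !== -1) score += 2;   // title matches count double
    if (snippet.indexOf(term) !== -1) score += 1;
  }
  score *= result.sourceWeight;
  if (result.urlValidated) score += 0.5;
  return score;
}

// Returns a new array sorted by descending score; the input is not mutated.
function rankResults<T extends RankableResult>(results: T[], query: string): T[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  return results
    .map((result) => ({ result, score: scoreResult(result, terms) }))
    .sort((a, b) => b.score - a.score)
    .map((entry) => entry.result);
}
```

Scoring each result once before sorting avoids recomputing the score inside the comparator.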

## 🏗️ System Architecture

### Frontend Stack
```
┌────────────────────────────────────────────────────────────────┐
│                     React 18 + TypeScript                      │
├────────────────────────────────────────────────────────────────┤
│  Enhanced Search Interface  │  Knowledge Graph         │  AI   │
│  - Unified search & AI      │  - D3.js visualization   │ Tools │
│  - Query enhancement        │  - Interactive nodes     │ Panel │
│  - Configurable analysis    │  - Relationship mapping  │       │
├────────────────────────────────────────────────────────────────┤
│            TanStack Query (Data Fetching & Caching)            │
├────────────────────────────────────────────────────────────────┤
│                    Radix UI + Tailwind CSS                     │
└────────────────────────────────────────────────────────────────┘
```

### Backend Stack
```
┌─────────────────────────────────────────────────────────┐
│                Express.js + Security Middleware         │
├─────────────────────────────────────────────────────────┤
│  Helmet.js  │  Rate Limiting  │  Input Validation  │CORS│
├─────────────────────────────────────────────────────────┤
│                    API Routes Layer                     │
│  /api/search  │  /api/analyze-document  │  /api/health │
├─────────────────────────────────────────────────────────┤
│            Service Layer                                │
│  Nebius Client  │  Modal Client  │  Storage Service    │
└─────────────────────────────────────────────────────────┘
```

### AI & Processing Layer
```
┌─────────────────────────────────────────────────────────┐
│                      Nebius AI Platform                 │
├─────────────────────────────────────────────────────────┤
│  DeepSeek-R1-0528        │  BAAI/bge-en-icl            │
│  - Chat completions      │  - Embedding generation      │
│  - Document analysis     │  - Semantic similarity       │
│  - Query enhancement     │  - Vector search             │
├─────────────────────────────────────────────────────────┤
│                      Modal Platform                     │
│  - Distributed processing   │  - Scalable compute       │
│  - Batch operations         │  - Resource management    │
└─────────────────────────────────────────────────────────┘
```

## 🔒 Security Architecture

### Input Protection
```
Request → Rate Limiter → Helmet.js → Input Validator → API Route
  ↓           ↓            ↓             ↓              ↓
100/15min   CSP Headers   Body Size    Zod Schema   Business Logic
10/min*     XSS Protection  10MB Limit   Type Safety    Error Handling

* Sensitive endpoints
```
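
The two limits in the diagram (100 requests per 15 minutes, and 10 per minute on sensitive endpoints) can be illustrated with a fixed-window counter. This is a simplified stand-in for whatever rate-limiting middleware the server actually uses; the class and variable names are hypothetical.

```typescript
// Fixed-window rate limiter keyed by client ID (for example, an IP address).
class FixedWindowRateLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly windows = new Map<string, { start: number; count: number }>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed. `now` is injectable for testing.
  allow(clientId: string, now: number = Date.now()): boolean {
    const current = this.windows.get(clientId);
    if (!current || now - current.start >= this.windowMs) {
      // First request, or the previous window has expired: start a new one.
      this.windows.set(clientId, { start: now, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}

// Limits taken from the diagram above.
const generalLimiter = new FixedWindowRateLimiter(100, 15 * 60 * 1000);
const sensitiveLimiter = new FixedWindowRateLimiter(10, 60 * 1000);
```

A fixed window is the simplest scheme, but it allows short bursts at window boundaries; a sliding window or token bucket smooths that out at the cost of more bookkeeping.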

### Error Handling Chain
```
React Error Boundary → API Error Handler → Service Error Handler
        ↓                      ↓                     ↓
   UI Graceful Fallback   HTTP Status Codes   Logging & Recovery
```
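
The middle link of this chain, turning service-layer failures into HTTP status codes, could be sketched like this. The error kinds and the status mapping are assumptions chosen for illustration; this document does not list the actual error taxonomy.

```typescript
type ErrorKind = "validation" | "rate_limited" | "upstream_timeout" | "upstream_failure";

interface ServiceError {
  kind: ErrorKind;
  message: string;
}

// Hypothetical mapping from service-layer error kinds to HTTP status codes.
const STATUS_BY_KIND: Record<ErrorKind, number> = {
  validation: 400,
  rate_limited: 429,
  upstream_timeout: 504,
  upstream_failure: 502,
};

// Type guard: true only for objects with a known `kind` and a string `message`.
function isServiceError(value: unknown): value is ServiceError {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as { kind?: unknown; message?: unknown };
  return (
    typeof candidate.kind === "string" &&
    Object.prototype.hasOwnProperty.call(STATUS_BY_KIND, candidate.kind) &&
    typeof candidate.message === "string"
  );
}

// Converts any thrown value into a response the client can render.
function toHttpResponse(error: unknown): { status: number; body: { error: string } } {
  if (isServiceError(error)) {
    return { status: STATUS_BY_KIND[error.kind], body: { error: error.message } };
  }
  // Unknown errors get a generic message so internal details do not leak.
  return { status: 500, body: { error: "Internal server error" } };
}
```

Details of an unexpected error belong in the server log, not in the response body.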

## 🚀 Performance Characteristics

### Response Times
| Operation | Average Time | Details |
|-----------|-------------|---------|
| Local Search | <100ms | In-memory keyword matching |
| URL Validation | 1-3s per URL | Concurrent processing |
| Document Analysis | 3-5s | AI model processing time |
| Embedding Generation | 500ms-1s | Nebius API call |
| Query Enhancement | 1-2s | DeepSeek model inference |

### Scalability Features
- **Horizontal Scaling**: Modal platform for distributed processing
- **Rate Limiting**: Prevents API abuse and ensures fair usage
- **Caching**: TanStack Query for client-side data caching
- **Error Recovery**: Graceful degradation when services are unavailable
- **Load Distribution**: Concurrent processing of multiple requests

## 🔧 Data Flow Patterns

### Search Request Flow
```mermaid
graph TD
    A[User Query] --> B[Rate Limiter]
    B --> C[Input Validation]
    C --> D{AI Enhancement?}
    D -->|Yes| E[Nebius Query Enhancement]
    D -->|No| F[Multi-Source Search]
    E --> F
    F --> G[Local Storage]
    F --> H[GitHub API]
    F --> I[Wikipedia API]
    F --> J[ArXiv API]
    G --> K[URL Validation]
    H --> K
    I --> K
    J --> K
    K --> L[Results Ranking]
    L --> M[Response Formatting]
    M --> N[Client Display]
```

### Document Analysis Flow
```mermaid
graph TD
    A[Document Content] --> B[Content Validation]
    B --> C[Analysis Type Selection]
    C --> D[Nebius DeepSeek API]
    D --> E[Response Processing]
    E --> F[Format Selection]
    F -->|Markdown| G[Rich Formatting]
    F -->|Plain Text| H[Clean Text Output]
    G --> I[Client Display]
    H --> I
```
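
The first steps of this flow (content validation, analysis type selection, and format selection) can be sketched as a function that builds the request for the model. The prompts, the content cap, and the OpenAI-style `messages` payload shape are assumptions; only the analysis types and the model ID come from this document.

```typescript
type AnalysisType = "summary" | "classification" | "key_points" | "quality_score";

// Hypothetical prompts; the real prompts are not given in this document.
const ANALYSIS_PROMPTS: Record<AnalysisType, string> = {
  summary: "Summarize the document in a few sentences.",
  classification: "Classify the document by topic and type.",
  key_points: "List the key points of the document.",
  quality_score: "Rate the document quality from 1 to 10 and justify the score.",
};

const MODEL_ID = "deepseek-ai/DeepSeek-R1-0528";
const MAX_CONTENT_CHARS = 8000; // hypothetical cap to bound prompt size

interface ChatRequest {
  model: string;
  messages: { role: "system" | "user"; content: string }[];
}

// Builds a chat request for one analysis type and output format.
function buildAnalysisRequest(
  content: string,
  analysisType: AnalysisType,
  useMarkdown: boolean,
): ChatRequest {
  const trimmed = content.trim();
  if (trimmed.length === 0) {
    throw new Error("Document content must not be empty");
  }
  const formatHint = useMarkdown
    ? "Format the answer as Markdown."
    : "Answer in plain text without Markdown syntax.";
  return {
    model: MODEL_ID,
    messages: [
      { role: "system", content: `${ANALYSIS_PROMPTS[analysisType]} ${formatHint}` },
      { role: "user", content: trimmed.slice(0, MAX_CONTENT_CHARS) },
    ],
  };
}
```

Keeping the request builder free of network calls makes the prompt logic easy to unit test without a Nebius API key.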

## 🛠️ Technology Integration Points

### External API Integration
- **Nebius AI**: Primary AI service for all language model tasks
- **Modal**: Distributed processing and compute scaling
- **GitHub API**: Repository search with authentication
- **Wikipedia API**: Authoritative content with caching
- **ArXiv API**: Academic paper search with validation

### Internal Service Communication
- **REST APIs**: Standard HTTP/JSON for client-server communication
- **Event-Driven**: React state management for UI updates
- **Error Propagation**: Structured error handling across all layers
- **Type Safety**: TypeScript contracts for all service boundaries

## 📊 Quality Assurance

### Code Quality
- **TypeScript**: 100% type coverage across frontend and backend
- **Input Validation**: Zod schemas for all API endpoints
- **Error Boundaries**: React error boundaries prevent UI crashes
- **Security Middleware**: Comprehensive protection against common attacks

### Testing Strategy
- **Type Checking**: Continuous TypeScript compilation validation
- **API Testing**: Health checks and endpoint validation
- **Error Testing**: Graceful handling of service failures
- **Performance Testing**: Response time monitoring and optimization

Together, these layers give KnowledgeBridge a secure and scalable foundation for AI-powered knowledge discovery, with error handling at every layer and fallbacks when an external service is unavailable.