Architecture Documentation
Overview
This RAG (Retrieval-Augmented Generation) application uses a hybrid architecture combining HuggingFace services with OpenRouter to provide reliable, cost-effective corporate policy assistance.
Service Architecture
Current Stack (October 2025)
┌───────────────────────────────────────────────────────────┐
│                  HYBRID RAG ARCHITECTURE                  │
├───────────────────────────────────────────────────────────┤
│                                                           │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  │
│  │  EMBEDDINGS   │  │  VECTOR STORE │  │  LLM SERVICE  │  │
│  │               │  │               │  │               │  │
│  │  HuggingFace  │  │  HuggingFace  │  │  OpenRouter   │  │
│  │ Inference API │  │    Dataset    │  │   WizardLM    │  │
│  │               │  │               │  │               │  │
│  │multilingual-e5│  │  Persistent   │  │   Free Tier   │  │
│  │1024 dimensions│  │Parquet Format │  │   Reliable    │  │
│  └───────────────┘  └───────────────┘  └───────────────┘  │
│                                                           │
└───────────────────────────────────────────────────────────┘
Service Details
1. Embedding Service
- Provider: HuggingFace Inference API
- Model: intfloat/multilingual-e5-large
- Dimensions: 1024
- Features:
  - Automatic batching for efficiency
  - Fallback to local ONNX models for development
  - Memory-optimized processing
  - Triple-layer configuration override
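The batching behaviour can be sketched as follows. This is a minimal, dependency-free illustration, assuming the standard HF Inference API feature-extraction endpoint; the batch size and error handling are illustrative, not the service's actual values:

```python
import json
import os
import urllib.request

# Assumed endpoint URL for the model named above.
HF_EMBED_URL = (
    "https://api-inference.huggingface.co/pipeline/feature-extraction/"
    "intfloat/multilingual-e5-large"
)

def batched(items, size):
    """Yield successive batches to respect API payload limits."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_texts(texts, token=None, batch_size=32):
    """Return one 1024-dim vector per input text via the HF Inference API."""
    token = token or os.environ.get("HF_TOKEN", "")
    vectors = []
    for batch in batched(texts, batch_size):
        req = urllib.request.Request(
            HF_EMBED_URL,
            data=json.dumps({"inputs": batch}).encode(),
            headers={"Authorization": f"Bearer {token}",
                     "Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            vectors.extend(json.load(resp))
    return vectors
```

In the real service, the local ONNX fallback would replace the HTTP call when no token is configured.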
2. Vector Store
- Provider: HuggingFace Dataset
- Storage Format: Parquet + JSON metadata
- Features:
  - Persistent storage across deployments
  - Cosine similarity search
  - Metadata preservation
  - Complete interface compatibility
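Conceptually, the similarity search reduces to cosine scoring over the stored vectors. A dependency-free sketch (the real store loads `index` from the Parquet files; names here are illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=4):
    """index: iterable of (chunk_id, vector, metadata) tuples.

    Returns the k best (score, chunk_id, metadata) triples, highest first.
    """
    scored = sorted(
        ((cosine_similarity(query_vec, vec), cid, meta) for cid, vec, meta in index),
        key=lambda t: t[0],
        reverse=True,
    )
    return scored[:k]
```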
3. LLM Service
- Provider: OpenRouter
- Model: microsoft/wizardlm-2-8x22b
- Features:
  - Free tier access
  - Reliable availability (no 404 errors)
  - Automatic prompt formatting
  - Built-in safety filtering
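OpenRouter exposes an OpenAI-compatible chat-completions endpoint, so the call can be sketched with the standard library alone. The system-prompt wording below is illustrative, not the application's actual prompt:

```python
import json
import os
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
MODEL = "microsoft/wizardlm-2-8x22b"

def build_payload(question, context, model=MODEL):
    """Assemble an OpenAI-compatible chat payload grounding the answer in context."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Answer using only the provided policy excerpts.\n\n" + context},
            {"role": "user", "content": question},
        ],
    }

def generate(question, context):
    """Call OpenRouter and return the generated answer text."""
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(build_payload(question, context)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```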
Data Flow

User Query
         ↓
┌───────────────────┐
│ Query Processing  │ ← Natural language understanding
└───────────────────┘
         ↓
┌───────────────────┐
│ Embedding         │ ← HuggingFace Inference API
│ Generation        │   (multilingual-e5-large)
└───────────────────┘
         ↓
┌───────────────────┐
│ Vector Search     │ ← HuggingFace Dataset
│                   │   Cosine similarity
└───────────────────┘
         ↓
┌───────────────────┐
│ Context Assembly  │ ← Retrieved documents + metadata
└───────────────────┘
         ↓
┌───────────────────┐
│ LLM Generation    │ ← OpenRouter WizardLM
│                   │   Prompt + context → response
└───────────────────┘
         ↓
┌───────────────────┐
│ Response          │ ← Formatted answer + citations
│ Formatting        │
└───────────────────┘
         ↓
Structured Response
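The stages above can be wired together in a few lines. In this sketch each stage is injected as a callable, so it stays independent of any concrete service client; the field names (`text`, `metadata`, `source`) are illustrative:

```python
def answer_query(query, embed, search, generate, k=4):
    """Run the RAG pipeline end to end with injected stage callables."""
    query_vec = embed(query)                       # HF Inference API
    hits = search(query_vec, k=k)                  # cosine search over the dataset
    context = "\n\n".join(hit["text"] for hit in hits)
    answer = generate(query, context)              # OpenRouter WizardLM
    citations = [hit["metadata"]["source"] for hit in hits]
    return {"answer": answer, "citations": citations}
```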
Document Processing Pipeline
Initialization Phase
Document Loading
- 22 synthetic policy files
- Markdown format with structured metadata
Chunking Strategy
- Semantic chunking preserving context
- Target chunk size: ~400 tokens
- Overlap: 50 tokens for continuity
- Total chunks: 170+
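The overlap strategy amounts to a sliding window over the token stream. A sketch using the sizes above (token counting itself is model-specific and omitted here):

```python
def chunk_tokens(tokens, size=400, overlap=50):
    """Split a token list into chunks of `size` with `overlap` tokens of continuity."""
    chunks, step = [], size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```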
Embedding Generation
- Batch processing for efficiency
- HuggingFace API rate limiting compliance
- Memory optimization for large datasets
Vector Storage
- Parquet format for efficient storage
- JSON metadata for complex structures
- Upload to HuggingFace Dataset
- Local caching for development
Configuration Management
Environment Variables
Required for Production
HF_TOKEN=hf_xxx... # HuggingFace API access
OPENROUTER_API_KEY=sk-or-v1-xxx... # OpenRouter API access
Optional Configuration
USE_OPENAI_EMBEDDING=false # Force HF embeddings (overridden when HF_TOKEN present)
ENABLE_HF_SERVICES=true # Enable HF services (auto-detected)
ENABLE_HF_PROCESSING=true # Enable document processing
REBUILD_EMBEDDINGS_ON_START=false # Force rebuild
Configuration Override System
The application implements a triple-layer override system to ensure the hybrid services are used:

1. Configuration Level (src/config.py)
   - Forces USE_OPENAI_EMBEDDING=false when HF_TOKEN is available
   - Ensures HF embeddings are used
2. Application Factory Level (src/app_factory.py)
   - Overrides service selection during RAG pipeline initialization
   - Uses LLMService.from_environment() for OpenRouter
3. Routes Level (src/routes/main_routes.py)
   - Ensures consistent service usage in API endpoints
   - Hybrid pipeline: HF embeddings + OpenRouter LLM
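The first layer can be sketched as a small resolver. The function name and fallback labels are illustrative; the actual logic lives in src/config.py:

```python
import os

def resolve_embedding_backend(env=None):
    """Pick the embedding backend: HF wins whenever HF_TOKEN is set,
    overriding USE_OPENAI_EMBEDDING as described above."""
    env = os.environ if env is None else env
    if env.get("HF_TOKEN"):
        return "huggingface"
    if env.get("USE_OPENAI_EMBEDDING", "false").lower() == "true":
        return "openai"
    return "local-onnx"  # development fallback
```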
Deployment Architecture
HuggingFace Spaces Deployment
┌───────────────────────────────────────────────────────────┐
│                    HUGGINGFACE SPACES                     │
├───────────────────────────────────────────────────────────┤
│                                                           │
│ ┌───────────────────────────────────────────────────────┐ │
│ │                   FLASK APPLICATION                   │ │
│ │                                                       │ │
│ │      ┌───────────────┐         ┌───────────────┐      │ │
│ │      │ RAG PIPELINE  │         │ WEB INTERFACE │      │ │
│ │      │               │         │               │      │ │
│ │      │ Search Service│         │ Chat Interface│      │ │
│ │      │ LLM Service   │         │ API Endpoints │      │ │
│ │      │Context Manager│         │ Health Checks │      │ │
│ │      └───────────────┘         └───────────────┘      │ │
│ └───────────────────────────────────────────────────────┘ │
│                                                           │
│  External Services:                                       │
│  ├─ HuggingFace Inference API (embeddings)                │
│  ├─ HuggingFace Dataset (vector storage)                  │
│  └─ OpenRouter API (LLM generation)                       │
│                                                           │
└───────────────────────────────────────────────────────────┘
Resource Requirements
- CPU: Basic tier (sufficient for I/O-bound operations)
- Memory: ~512MB (optimized for Spaces limits)
- Storage: Small tier (document cache + temporary files)
- Network: External API calls for all major services
Migration History
Evolution of Architecture
Phase 1: OpenAI-based (Expensive)
- OpenAI embeddings + GPT models
- High API costs
- Excellent reliability
Phase 2: Full HuggingFace (Problematic)
- HF embeddings + HF LLM models
- Cost-effective
- LLM reliability issues (404 errors)
Phase 3: Hybrid (Current - Optimal)
- HF embeddings + OpenRouter LLM
- Cost-effective
- Reliable LLM generation
- Best of both worlds
Why Hybrid Architecture?
- HuggingFace Embeddings: Stable, reliable, cost-effective
- HuggingFace Vector Store: Persistent, efficient, free
- OpenRouter LLM: Reliable, no 404 errors, free tier available
- Overall: Optimal balance of cost, reliability, and performance
Development Guidelines
Local Development
- Set both API tokens in environment
- Application auto-detects hybrid configuration
- Falls back to local ONNX embeddings if HF unavailable
- Uses file-based vector storage for development
Production Deployment
- Ensure both tokens are set in HuggingFace Spaces secrets
- Application automatically uses hybrid services
- Persistent vector storage via HuggingFace Dataset
- Automatic document processing on startup
Monitoring and Health Checks
- /health: Overall application health
- /debug/rag: RAG pipeline diagnostics
- Comprehensive logging for all service interactions
- Error tracking and graceful degradation
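A health endpoint typically aggregates per-service probes into one status. A minimal, framework-agnostic sketch of the payload builder (names are illustrative; the real handler is a Flask route):

```python
def health_payload(services):
    """Summarize component status for a /health endpoint.

    services: mapping of service name -> bool (probe result).
    """
    healthy = all(services.values())
    return {
        "status": "ok" if healthy else "degraded",
        "services": {name: ("up" if up else "down") for name, up in services.items()},
    }
```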
Performance Characteristics
Latency Breakdown (Typical Query)
- Embedding Generation: ~200-500ms (HF API)
- Vector Search: ~50-100ms (local computation)
- LLM Generation: ~1-3s (OpenRouter API)
- Total Response Time: ~2-4s
Throughput Considerations
- HuggingFace API: Rate limited by free tier
- OpenRouter API: Rate limited by free tier
- Vector Search: Limited by local CPU/memory
- Concurrent Users: ~5-10 (estimated)
Scalability
- Horizontal: Multiple Spaces instances
- Vertical: Upgrade to larger Spaces tier
- Caching: Implement response caching for common queries
- CDN: Static asset delivery optimization
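Response caching for common queries could start as a small in-process TTL cache before graduating to Redis. A sketch (class name and default TTL are illustrative):

```python
import time

class TTLCache:
    """Tiny in-process response cache keyed by query string."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.monotonic() > expires:
            del self._store[key]  # evict stale entry
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)
```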
Security Considerations
API Key Management
- Environment variables for sensitive tokens
- HuggingFace Spaces secrets for production
- No hardcoded credentials in codebase
Data Privacy
- No persistent user data storage
- Ephemeral query processing
- No logging of sensitive information
- GDPR-compliant by design
Content Safety
- Built-in guardrails for inappropriate content
- Bias detection and mitigation
- PII detection and filtering
- Response validation
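PII filtering can begin with simple pattern redaction. The patterns below are illustrative stand-ins; a production filter would use a dedicated PII-detection library:

```python
import re

# Hypothetical patterns for two common PII categories.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def redact_pii(text):
    """Replace matched PII spans with labeled placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()} REDACTED]", text)
    return text
```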
Future Enhancements
Potential Improvements
- Caching Layer: Redis for common queries
- Model Upgrades: Better LLM models as they become available
- Multi-modal: Support for document images and PDFs
- Advanced RAG: Re-ranking, query expansion, multi-hop reasoning
- Analytics: User interaction tracking and optimization
Migration Considerations
- Maintain backward compatibility
- Gradual service migration strategies
- A/B testing for service comparisons
- Performance monitoring during transitions