
๐Ÿ—๏ธ Architecture Documentation

Overview

This RAG (Retrieval-Augmented Generation) application uses a hybrid architecture combining HuggingFace services with OpenRouter to provide reliable, cost-effective corporate policy assistance.

🔧 Service Architecture

Current Stack (October 2025)

┌─────────────────────────────────────────────────────────────────┐
│                     HYBRID RAG ARCHITECTURE                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │   EMBEDDINGS    │  │  VECTOR STORE   │  │   LLM SERVICE   │  │
│  │                 │  │                 │  │                 │  │
│  │  HuggingFace    │  │  HuggingFace    │  │   OpenRouter    │  │
│  │  Inference API  │  │    Dataset      │  │   WizardLM      │  │
│  │                 │  │                 │  │                 │  │
│  │ multilingual-e5 │  │ Persistent      │  │ Free Tier       │  │
│  │ 1024 dimensions │  │ Parquet Format  │  │ Reliable        │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Service Details

1. Embedding Service

  • Provider: HuggingFace Inference API
  • Model: intfloat/multilingual-e5-large
  • Dimensions: 1024
  • Features:
    • Automatic batching for efficiency
    • Fallback to local ONNX models for development
    • Memory-optimized processing
    • Triple-layer configuration override
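The batched embedding calls described above can be sketched as follows. This is a minimal illustration, not the project's actual service class: the endpoint URL follows the standard HF Inference API pattern for the model named above, and the batch size is an assumption; the real service also handles the local ONNX fallback.

```python
import json
import urllib.request

# Standard HF Inference API endpoint for the embedding model named above.
HF_URL = "https://api-inference.huggingface.co/models/intfloat/multilingual-e5-large"

def batch(items, size=32):
    """Split inputs into batches so each API call stays within payload limits."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def embed(texts, token):
    """Return one 1024-dim vector per text, batching requests to the HF API."""
    vectors = []
    for group in batch(texts):
        req = urllib.request.Request(
            HF_URL,
            data=json.dumps({"inputs": group,
                             "options": {"wait_for_model": True}}).encode(),
            headers={"Authorization": f"Bearer {token}",
                     "Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            vectors.extend(json.load(resp))
    return vectors
```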

2. Vector Store

  • Provider: HuggingFace Dataset
  • Storage Format: Parquet + JSON metadata
  • Features:
    • Persistent storage across deployments
    • Cosine similarity search
    • Metadata preservation
    • Complete interface compatibility
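Once the stored vectors are loaded into memory (e.g. from the Parquet file into a NumPy array), cosine similarity search reduces to normalizing rows and ranking dot products. A minimal sketch, assuming the embeddings are already in a 2-D array:

```python
import numpy as np

def cosine_search(query_vec, embeddings, top_k=3):
    """Rank stored vectors by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    m = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = m @ q                       # cosine similarity per row
    top = np.argsort(scores)[::-1][:top_k]
    return top, scores[top]
```

In the store described above, the returned row indices would then be mapped back to chunk text and metadata via the JSON sidecar.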

3. LLM Service

  • Provider: OpenRouter
  • Model: microsoft/wizardlm-2-8x22b
  • Features:
    • Free tier access
    • Reliable availability (no 404 errors)
    • Automatic prompt formatting
    • Built-in safety filtering
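OpenRouter exposes an OpenAI-compatible chat completions endpoint, so a call to the model above can be sketched as below. The payload shape follows the OpenAI chat format; the system-prompt wording and timeout are illustrative assumptions, not the project's actual prompt.

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
MODEL = "microsoft/wizardlm-2-8x22b"

def build_payload(question, context):
    """OpenAI-compatible chat payload; retrieved context goes in the system message."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system",
             "content": f"Answer using only this policy context:\n{context}"},
            {"role": "user", "content": question},
        ],
    }

def generate(question, context, api_key):
    """Send the payload to OpenRouter and return the generated answer text."""
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(build_payload(question, context)).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```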

🔄 Data Flow

User Query
    ↓
┌───────────────────┐
│ Query Processing  │ ← Natural language understanding
└───────────────────┘
    ↓
┌───────────────────┐
│ Embedding         │ ← HuggingFace Inference API
│ Generation        │   (multilingual-e5-large)
└───────────────────┘
    ↓
┌───────────────────┐
│ Vector Search     │ ← HuggingFace Dataset
│                   │   Cosine similarity
└───────────────────┘
    ↓
┌───────────────────┐
│ Context Assembly  │ ← Retrieved documents + metadata
└───────────────────┘
    ↓
┌───────────────────┐
│ LLM Generation    │ ← OpenRouter WizardLM
│                   │   Prompt + context → response
└───────────────────┘
    ↓
┌───────────────────┐
│ Response          │ ← Formatted answer + citations
│ Formatting        │
└───────────────────┘
    ↓
Structured Response
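The flow above can be compressed into a short orchestration sketch with each stage injected as a callable (the function and parameter names here are illustrative, not the project's actual API):

```python
def answer(query, embed, search, generate, top_k=3):
    """Sketch of the pipeline above; each stage is an injected callable."""
    query_vec = embed(query)               # HF Inference API embedding
    hits = search(query_vec, top_k=top_k)  # cosine search over the vector store
    context = "\n\n".join(doc["text"] for doc in hits)
    return generate(query, context)        # OpenRouter LLM generation
```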

📊 Document Processing Pipeline

Initialization Phase

  1. Document Loading

    • 22 synthetic policy files
    • Markdown format with structured metadata
  2. Chunking Strategy

    • Semantic chunking preserving context
    • Target chunk size: ~400 tokens
    • Overlap: 50 tokens for continuity
    • Total chunks: 170+
  3. Embedding Generation

    • Batch processing for efficiency
    • HuggingFace API rate limiting compliance
    • Memory optimization for large datasets
  4. Vector Storage

    • Parquet format for efficient storage
    • JSON metadata for complex structures
    • Upload to HuggingFace Dataset
    • Local caching for development
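The chunking parameters above (400-token chunks, 50-token overlap) can be illustrated with a simplified fixed-size chunker; the real pipeline chunks semantically, so this sketch only demonstrates the size/overlap arithmetic:

```python
def chunk_tokens(tokens, size=400, overlap=50):
    """Fixed-size chunking with overlap: each chunk repeats the last
    `overlap` tokens of its predecessor for continuity."""
    step = size - overlap           # 350-token stride between chunk starts
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break                   # final chunk reached the end of the document
    return chunks
```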

🔧 Configuration Management

Environment Variables

Required for Production

HF_TOKEN=hf_xxx...          # HuggingFace API access
OPENROUTER_API_KEY=sk-or-v1-xxx...  # OpenRouter API access

Optional Configuration

USE_OPENAI_EMBEDDING=false  # Force HF embeddings (overridden when HF_TOKEN present)
ENABLE_HF_SERVICES=true     # Enable HF services (auto-detected)
ENABLE_HF_PROCESSING=true   # Enable document processing
REBUILD_EMBEDDINGS_ON_START=false  # Force rebuild

Configuration Override System

The application implements a triple-layer override system to ensure hybrid services are used:

  1. Configuration Level (src/config.py)

    • Forces USE_OPENAI_EMBEDDING=false when HF_TOKEN available
    • Ensures HF embeddings are used
  2. Application Factory Level (src/app_factory.py)

    • Overrides service selection in RAG pipeline initialization
    • Uses LLMService.from_environment() for OpenRouter
  3. Routes Level (src/routes/main_routes.py)

    • Ensures consistent service usage in API endpoints
    • Hybrid pipeline: HF embeddings + OpenRouter LLM
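The configuration-level override can be sketched as a pure function over the environment. This mirrors the logic described above (HF token forces HF embeddings; an OpenRouter key selects OpenRouter for generation), but the function and key names are illustrative rather than taken from src/config.py:

```python
import os

def resolve_services(env=None):
    """Sketch of the override: a present HF_TOKEN wins over USE_OPENAI_EMBEDDING."""
    env = dict(os.environ if env is None else env)
    has_hf = bool(env.get("HF_TOKEN"))
    return {
        "use_openai_embedding": (not has_hf)
            and env.get("USE_OPENAI_EMBEDDING", "false").lower() == "true",
        "enable_hf_services": has_hf,
        "llm_provider": "openrouter" if env.get("OPENROUTER_API_KEY") else "none",
    }
```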

🚀 Deployment Architecture

HuggingFace Spaces Deployment

┌─────────────────────────────────────────────────────────────────┐
│                        HUGGINGFACE SPACES                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                     FLASK APPLICATION                     │  │
│  │                                                           │  │
│  │  ┌─────────────────┐  ┌─────────────────┐                 │  │
│  │  │  RAG PIPELINE   │  │  WEB INTERFACE  │                 │  │
│  │  │                 │  │                 │                 │  │
│  │  │ Search Service  │  │ Chat Interface  │                 │  │
│  │  │ LLM Service     │  │ API Endpoints   │                 │  │
│  │  │ Context Manager │  │ Health Checks   │                 │  │
│  │  └─────────────────┘  └─────────────────┘                 │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  External Services:                                             │
│  ├─ HuggingFace Inference API (embeddings)                      │
│  ├─ HuggingFace Dataset (vector storage)                        │
│  └─ OpenRouter API (LLM generation)                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Resource Requirements

  • CPU: Basic tier (sufficient for I/O-bound operations)
  • Memory: ~512MB (optimized for Spaces limits)
  • Storage: Small tier (document cache + temporary files)
  • Network: External API calls for all major services

🔄 Migration History

Evolution of Architecture

  1. Phase 1: OpenAI-based (Expensive)

    • OpenAI embeddings + GPT models
    • High API costs
    • Excellent reliability
  2. Phase 2: Full HuggingFace (Problematic)

    • HF embeddings + HF LLM models
    • Cost-effective
    • LLM reliability issues (404 errors)
  3. Phase 3: Hybrid (Current - Optimal)

    • HF embeddings + OpenRouter LLM
    • Cost-effective
    • Reliable LLM generation
    • Best of both worlds

Why Hybrid Architecture?

  • HuggingFace Embeddings: Stable, reliable, cost-effective
  • HuggingFace Vector Store: Persistent, efficient, free
  • OpenRouter LLM: Reliable, no 404 errors, free tier available
  • Overall: Optimal balance of cost, reliability, and performance

🛠️ Development Guidelines

Local Development

  1. Set both API tokens in environment
  2. Application auto-detects hybrid configuration
  3. Falls back to local ONNX embeddings if HF unavailable
  4. Uses file-based vector storage for development

Production Deployment

  1. Ensure both tokens are set in HuggingFace Spaces secrets
  2. Application automatically uses hybrid services
  3. Persistent vector storage via HuggingFace Dataset
  4. Automatic document processing on startup

Monitoring and Health Checks

  • /health - Overall application health
  • /debug/rag - RAG pipeline diagnostics
  • Comprehensive logging for all service interactions
  • Error tracking and graceful degradation
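The "graceful degradation" behavior of the health check can be sketched as an aggregator over per-service probes. The payload shape is an assumption about what /health returns, not the actual response format:

```python
def health_status(probes):
    """Aggregate per-service probes into a /health-style payload (shape assumed).
    A probe that raises counts as a failed service instead of crashing the check."""
    services = {}
    for name, probe in probes.items():
        try:
            services[name] = bool(probe())
        except Exception:
            services[name] = False    # graceful degradation on probe errors
    status = "ok" if all(services.values()) else "degraded"
    return {"status": status, "services": services}
```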

📈 Performance Characteristics

Latency Breakdown (Typical Query)

  • Embedding Generation: ~200-500ms (HF API)
  • Vector Search: ~50-100ms (local computation)
  • LLM Generation: ~1-3s (OpenRouter API)
  • Total Response Time: ~2-4s

Throughput Considerations

  • HuggingFace API: Rate limited by free tier
  • OpenRouter API: Rate limited by free tier
  • Vector Search: Limited by local CPU/memory
  • Concurrent Users: ~5-10 concurrent (estimated)

Scalability

  • Horizontal: Multiple Spaces instances
  • Vertical: Upgrade to larger Spaces tier
  • Caching: Implement response caching for common queries
  • CDN: Static asset delivery optimization

🔒 Security Considerations

API Key Management

  • Environment variables for sensitive tokens
  • HuggingFace Spaces secrets for production
  • No hardcoded credentials in codebase

Data Privacy

  • No persistent user data storage
  • Ephemeral query processing
  • No logging of sensitive information
  • GDPR-compliant by design

Content Safety

  • Built-in guardrails for inappropriate content
  • Bias detection and mitigation
  • PII detection and filtering
  • Response validation

🔮 Future Enhancements

Potential Improvements

  1. Caching Layer: Redis for common queries
  2. Model Upgrades: Better LLM models as they become available
  3. Multi-modal: Support for document images and PDFs
  4. Advanced RAG: Re-ranking, query expansion, multi-hop reasoning
  5. Analytics: User interaction tracking and optimization

Migration Considerations

  • Maintain backward compatibility
  • Gradual service migration strategies
  • A/B testing for service comparisons
  • Performance monitoring during transitions