Architecture Documentation
Overview
This RAG (Retrieval-Augmented Generation) application uses a hybrid architecture combining HuggingFace services with OpenRouter to provide reliable, cost-effective corporate policy assistance.
Service Architecture
Current Stack (October 2025)
┌───────────────────────────────────────────────────────────┐
│                  HYBRID RAG ARCHITECTURE                  │
├───────────────────────────────────────────────────────────┤
│                                                           │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  │
│  │  EMBEDDINGS   │  │  VECTOR STORE │  │  LLM SERVICE  │  │
│  │               │  │               │  │               │  │
│  │  HuggingFace  │  │  HuggingFace  │  │  OpenRouter   │  │
│  │ Inference API │  │    Dataset    │  │   WizardLM    │  │
│  │               │  │               │  │               │  │
│  │multilingual-e5│  │  Persistent   │  │   Free Tier   │  │
│  │1024 dimensions│  │Parquet Format │  │   Reliable    │  │
│  └───────────────┘  └───────────────┘  └───────────────┘  │
│                                                           │
└───────────────────────────────────────────────────────────┘
Service Details
1. Embedding Service
- Provider: HuggingFace Inference API
- Model: intfloat/multilingual-e5-large
- Dimensions: 1024
- Features:
  - Automatic batching for efficiency
  - Fallback to local ONNX models for development
  - Memory-optimized processing
  - Triple-layer configuration override
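The batching behaviour can be sketched as follows. This is a minimal, dependency-free illustration, assuming the standard HF Inference API feature-extraction endpoint; the batch size and error handling are illustrative, not the service's actual values:

```python
import json
import os
import urllib.request

# Assumed endpoint URL for the model named above.
HF_EMBED_URL = (
    "https://api-inference.huggingface.co/pipeline/feature-extraction/"
    "intfloat/multilingual-e5-large"
)

def batched(items, size):
    """Yield successive batches to respect API payload limits."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_texts(texts, token=None, batch_size=32):
    """Return one 1024-dim vector per input text via the HF Inference API."""
    token = token or os.environ.get("HF_TOKEN", "")
    vectors = []
    for batch in batched(texts, batch_size):
        req = urllib.request.Request(
            HF_EMBED_URL,
            data=json.dumps({"inputs": batch}).encode(),
            headers={"Authorization": f"Bearer {token}",
                     "Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            vectors.extend(json.load(resp))
    return vectors
```

In the real service, the local ONNX fallback would replace the HTTP call when no token is configured.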
2. Vector Store
- Provider: HuggingFace Dataset
- Storage Format: Parquet + JSON metadata
- Features:
  - Persistent storage across deployments
  - Cosine similarity search
  - Metadata preservation
  - Complete interface compatibility
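Conceptually, the similarity search reduces to cosine scoring over the stored vectors. A dependency-free sketch (the real store loads `index` from the Parquet files; names here are illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=4):
    """index: iterable of (chunk_id, vector, metadata) tuples.

    Returns the k best (score, chunk_id, metadata) triples, highest first.
    """
    scored = sorted(
        ((cosine_similarity(query_vec, vec), cid, meta) for cid, vec, meta in index),
        key=lambda t: t[0],
        reverse=True,
    )
    return scored[:k]
```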
3. LLM Service
- Provider: OpenRouter
- Model: microsoft/wizardlm-2-8x22b
- Features:
  - Free tier access
  - Reliable availability (no 404 errors)
  - Automatic prompt formatting
  - Built-in safety filtering
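OpenRouter exposes an OpenAI-compatible chat-completions endpoint, so the call can be sketched with the standard library alone. The system-prompt wording below is illustrative, not the application's actual prompt:

```python
import json
import os
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
MODEL = "microsoft/wizardlm-2-8x22b"

def build_payload(question, context, model=MODEL):
    """Assemble an OpenAI-compatible chat payload grounding the answer in context."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Answer using only the provided policy excerpts.\n\n" + context},
            {"role": "user", "content": question},
        ],
    }

def generate(question, context):
    """Call OpenRouter and return the generated answer text."""
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(build_payload(question, context)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```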
Data Flow

User Query
         ↓
┌───────────────────┐
│ Query Processing  │ ← Natural language understanding
└───────────────────┘
         ↓
┌───────────────────┐
│ Embedding         │ ← HuggingFace Inference API
│ Generation        │   (multilingual-e5-large)
└───────────────────┘
         ↓
┌───────────────────┐
│ Vector Search     │ ← HuggingFace Dataset
│                   │   Cosine similarity
└───────────────────┘
         ↓
┌───────────────────┐
│ Context Assembly  │ ← Retrieved documents + metadata
└───────────────────┘
         ↓
┌───────────────────┐
│ LLM Generation    │ ← OpenRouter WizardLM
│                   │   Prompt + context → response
└───────────────────┘
         ↓
┌───────────────────┐
│ Response          │ ← Formatted answer + citations
│ Formatting        │
└───────────────────┘
         ↓
Structured Response
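The stages above can be wired together in a few lines. In this sketch each stage is injected as a callable, so it stays independent of any concrete service client; the field names (`text`, `metadata`, `source`) are illustrative:

```python
def answer_query(query, embed, search, generate, k=4):
    """Run the RAG pipeline end to end with injected stage callables."""
    query_vec = embed(query)                       # HF Inference API
    hits = search(query_vec, k=k)                  # cosine search over the dataset
    context = "\n\n".join(hit["text"] for hit in hits)
    answer = generate(query, context)              # OpenRouter WizardLM
    citations = [hit["metadata"]["source"] for hit in hits]
    return {"answer": answer, "citations": citations}
```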
Document Processing Pipeline
Initialization Phase
Document Loading
- 22 synthetic policy files
- Markdown format with structured metadata
Chunking Strategy
- Semantic chunking preserving context
- Target chunk size: ~400 tokens
- Overlap: 50 tokens for continuity
- Total chunks: 170+
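The overlap strategy amounts to a sliding window over the token stream. A sketch using the sizes above (token counting itself is model-specific and omitted here):

```python
def chunk_tokens(tokens, size=400, overlap=50):
    """Split a token list into chunks of `size` with `overlap` tokens of continuity."""
    chunks, step = [], size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```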
Embedding Generation
- Batch processing for efficiency
- HuggingFace API rate limiting compliance
- Memory optimization for large datasets
Vector Storage
- Parquet format for efficient storage
- JSON metadata for complex structures
- Upload to HuggingFace Dataset
- Local caching for development
Configuration Management
Environment Variables
Required for Production
HF_TOKEN=hf_xxx... # HuggingFace API access
OPENROUTER_API_KEY=sk-or-v1-xxx... # OpenRouter API access
Optional Configuration
USE_OPENAI_EMBEDDING=false # Force HF embeddings (overridden when HF_TOKEN present)
ENABLE_HF_SERVICES=true # Enable HF services (auto-detected)
ENABLE_HF_PROCESSING=true # Enable document processing
REBUILD_EMBEDDINGS_ON_START=false # Force rebuild
Configuration Override System
The application implements a triple-layer override system to ensure the hybrid services are used:

1. Configuration Level (src/config.py)
   - Forces USE_OPENAI_EMBEDDING=false when HF_TOKEN is available
   - Ensures HF embeddings are used
2. Application Factory Level (src/app_factory.py)
   - Overrides service selection during RAG pipeline initialization
   - Uses LLMService.from_environment() for OpenRouter
3. Routes Level (src/routes/main_routes.py)
   - Ensures consistent service usage in API endpoints
   - Hybrid pipeline: HF embeddings + OpenRouter LLM
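The first layer can be sketched as a small resolver. The function name and fallback labels are illustrative; the actual logic lives in src/config.py:

```python
import os

def resolve_embedding_backend(env=None):
    """Pick the embedding backend: HF wins whenever HF_TOKEN is set,
    overriding USE_OPENAI_EMBEDDING as described above."""
    env = os.environ if env is None else env
    if env.get("HF_TOKEN"):
        return "huggingface"
    if env.get("USE_OPENAI_EMBEDDING", "false").lower() == "true":
        return "openai"
    return "local-onnx"  # development fallback
```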
Deployment Architecture
HuggingFace Spaces Deployment
┌───────────────────────────────────────────────────────────┐
│                    HUGGINGFACE SPACES                     │
├───────────────────────────────────────────────────────────┤
│                                                           │
│ ┌───────────────────────────────────────────────────────┐ │
│ │                   FLASK APPLICATION                   │ │
│ │                                                       │ │
│ │      ┌───────────────┐         ┌───────────────┐      │ │
│ │      │ RAG PIPELINE  │         │ WEB INTERFACE │      │ │
│ │      │               │         │               │      │ │
│ │      │ Search Service│         │ Chat Interface│      │ │
│ │      │ LLM Service   │         │ API Endpoints │      │ │
│ │      │Context Manager│         │ Health Checks │      │ │
│ │      └───────────────┘         └───────────────┘      │ │
│ └───────────────────────────────────────────────────────┘ │
│                                                           │
│  External Services:                                       │
│  ├─ HuggingFace Inference API (embeddings)                │
│  ├─ HuggingFace Dataset (vector storage)                  │
│  └─ OpenRouter API (LLM generation)                       │
│                                                           │
└───────────────────────────────────────────────────────────┘
Resource Requirements
- CPU: Basic tier (sufficient for I/O-bound operations)
- Memory: ~512MB (optimized for Spaces limits)
- Storage: Small tier (document cache + temporary files)
- Network: External API calls for all major services
Migration History
Evolution of Architecture
Phase 1: OpenAI-based (Expensive)
- OpenAI embeddings + GPT models
- High API costs
- Excellent reliability
Phase 2: Full HuggingFace (Problematic)
- HF embeddings + HF LLM models
- Cost-effective
- LLM reliability issues (404 errors)
Phase 3: Hybrid (Current - Optimal)
- HF embeddings + OpenRouter LLM
- Cost-effective
- Reliable LLM generation
- Best of both worlds
Why Hybrid Architecture?
- HuggingFace Embeddings: Stable, reliable, cost-effective
- HuggingFace Vector Store: Persistent, efficient, free
- OpenRouter LLM: Reliable, no 404 errors, free tier available
- Overall: Optimal balance of cost, reliability, and performance
Development Guidelines
Local Development
- Set both API tokens in environment
- Application auto-detects hybrid configuration
- Falls back to local ONNX embeddings if HF unavailable
- Uses file-based vector storage for development
Production Deployment
- Ensure both tokens are set in HuggingFace Spaces secrets
- Application automatically uses hybrid services
- Persistent vector storage via HuggingFace Dataset
- Automatic document processing on startup
Monitoring and Health Checks
- /health: Overall application health
- /debug/rag: RAG pipeline diagnostics
- Comprehensive logging for all service interactions
- Error tracking and graceful degradation
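A health endpoint typically aggregates per-service probes into one status. A minimal, framework-agnostic sketch of the payload builder (names are illustrative; the real handler is a Flask route):

```python
def health_payload(services):
    """Summarize component status for a /health endpoint.

    services: mapping of service name -> bool (probe result).
    """
    healthy = all(services.values())
    return {
        "status": "ok" if healthy else "degraded",
        "services": {name: ("up" if up else "down") for name, up in services.items()},
    }
```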
Performance Characteristics
Latency Breakdown (Typical Query)
- Embedding Generation: ~200-500ms (HF API)
- Vector Search: ~50-100ms (local computation)
- LLM Generation: ~1-3s (OpenRouter API)
- Total Response Time: ~2-4s
Throughput Considerations
- HuggingFace API: Rate limited by free tier
- OpenRouter API: Rate limited by free tier
- Vector Search: Limited by local CPU/memory
- Concurrent Users: ~5-10 (estimated)
Scalability
- Horizontal: Multiple Spaces instances
- Vertical: Upgrade to larger Spaces tier
- Caching: Implement response caching for common queries
- CDN: Static asset delivery optimization
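Response caching for common queries could start as a small in-process TTL cache before graduating to Redis. A sketch (class name and default TTL are illustrative):

```python
import time

class TTLCache:
    """Tiny in-process response cache keyed by query string."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.monotonic() > expires:
            del self._store[key]  # evict stale entry
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)
```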
Security Considerations
API Key Management
- Environment variables for sensitive tokens
- HuggingFace Spaces secrets for production
- No hardcoded credentials in codebase
Data Privacy
- No persistent user data storage
- Ephemeral query processing
- No logging of sensitive information
- GDPR-compliant by design
Content Safety
- Built-in guardrails for inappropriate content
- Bias detection and mitigation
- PII detection and filtering
- Response validation
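PII filtering can begin with simple pattern redaction. The patterns below are illustrative stand-ins; a production filter would use a dedicated PII-detection library:

```python
import re

# Hypothetical patterns for two common PII categories.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def redact_pii(text):
    """Replace matched PII spans with labeled placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()} REDACTED]", text)
    return text
```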
Future Enhancements
Potential Improvements
- Caching Layer: Redis for common queries
- Model Upgrades: Better LLM models as they become available
- Multi-modal: Support for document images and PDFs
- Advanced RAG: Re-ranking, query expansion, multi-hop reasoning
- Analytics: User interaction tracking and optimization
Migration Considerations
- Maintain backward compatibility
- Gradual service migration strategies
- A/B testing for service comparisons
- Performance monitoring during transitions