# Architecture Documentation

## Overview

This RAG (Retrieval-Augmented Generation) application uses a hybrid architecture that combines HuggingFace services with OpenRouter to provide reliable, cost-effective corporate policy assistance.

## Service Architecture

### Current Stack (October 2025)
```
┌─────────────────────────────────────────────────────────────────┐
│                     HYBRID RAG ARCHITECTURE                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │   EMBEDDINGS    │  │  VECTOR STORE   │  │   LLM SERVICE   │  │
│  │                 │  │                 │  │                 │  │
│  │  HuggingFace    │  │  HuggingFace    │  │  OpenRouter     │  │
│  │  Inference API  │  │  Dataset        │  │  WizardLM       │  │
│  │                 │  │                 │  │                 │  │
│  │ multilingual-e5 │  │  Persistent     │  │  Free Tier      │  │
│  │ 1024 dimensions │  │  Parquet Format │  │  Reliable       │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
### Service Details

#### 1. Embedding Service

- **Provider**: HuggingFace Inference API
- **Model**: `intfloat/multilingual-e5-large`
- **Dimensions**: 1024
- **Features**:
  - Automatic batching for efficiency
  - Fallback to local ONNX models for development
  - Memory-optimized processing
  - Triple-layer configuration override
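A minimal sketch of what such an embedding call can look like against the public HF Inference API. The helper names (`format_e5_inputs`, `embed`) are illustrative, not the app's actual code; note that the e5 model family expects a `query: ` or `passage: ` prefix on each input.

```python
import requests

E5_MODEL = "intfloat/multilingual-e5-large"
API_URL = f"https://api-inference.huggingface.co/models/{E5_MODEL}"

def format_e5_inputs(texts, kind="query"):
    """e5 models expect a 'query: ' or 'passage: ' prefix on each input."""
    return [f"{kind}: {t}" for t in texts]

def embed(texts, token, kind="query"):
    """Request 1024-dim embeddings from the HF Inference API (sketch)."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {token}"},
        json={"inputs": format_e5_inputs(texts, kind)},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # one 1024-dim vector per input text
```

In practice the app batches inputs and falls back to local ONNX models, as listed above; this sketch omits both.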
#### 2. Vector Store

- **Provider**: HuggingFace Dataset
- **Storage Format**: Parquet + JSON metadata
- **Features**:
  - Persistent storage across deployments
  - Cosine similarity search
  - Metadata preservation
  - Complete interface compatibility
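The cosine similarity search can be sketched as a single NumPy operation over the stored vectors; `cosine_top_k` is an illustrative helper, not the app's actual interface.

```python
import numpy as np

def cosine_top_k(query_vec, doc_matrix, k=5):
    """Return indices of the k rows of doc_matrix most similar to query_vec."""
    q = np.asarray(query_vec, dtype=np.float32)
    d = np.asarray(doc_matrix, dtype=np.float32)
    # Cosine similarity = dot product over the product of norms.
    sims = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q) + 1e-9)
    return np.argsort(-sims)[:k].tolist()
```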
#### 3. LLM Service

- **Provider**: OpenRouter
- **Model**: `microsoft/wizardlm-2-8x22b`
- **Features**:
  - Free tier access
  - Reliable availability (no 404 errors)
  - Automatic prompt formatting
  - Built-in safety filtering
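OpenRouter exposes an OpenAI-compatible chat completions endpoint, so a generation call can be sketched as below. `build_payload` and `generate` are illustrative names, and the prompt wording is an assumption, not the app's actual prompt.

```python
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_payload(question, context, model="microsoft/wizardlm-2-8x22b"):
    """Assemble an OpenAI-style chat payload with the retrieved context."""
    system = "Answer using only the provided policy excerpts."
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }

def generate(question, context, api_key):
    """Send the payload to OpenRouter and return the answer text (sketch)."""
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json=build_payload(question, context),
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```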
## Data Flow

```
User Query
    ↓
┌───────────────────┐
│ Query Processing  │ ← Natural language understanding
└───────────────────┘
    ↓
┌───────────────────┐
│ Embedding         │ ← HuggingFace Inference API
│ Generation        │   (multilingual-e5-large)
└───────────────────┘
    ↓
┌───────────────────┐
│ Vector Search     │ ← HuggingFace Dataset
│                   │   Cosine similarity
└───────────────────┘
    ↓
┌───────────────────┐
│ Context Assembly  │ ← Retrieved documents + metadata
└───────────────────┘
    ↓
┌───────────────────┐
│ LLM Generation    │ ← OpenRouter WizardLM
│                   │   Prompt + context → response
└───────────────────┘
    ↓
┌───────────────────┐
│ Response          │ ← Formatted answer + citations
│ Formatting        │
└───────────────────┘
    ↓
Structured Response
```
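The flow above can be condensed into one orchestration function. Everything here is a sketch: the `*_fn` callables stand in for the real embedding, search, and LLM services, and the `text`/`source` field names are assumptions.

```python
def answer_query(query, embed_fn, search_fn, llm_fn):
    """Run the retrieve-then-generate flow shown in the diagram (sketch)."""
    vec = embed_fn(query)                            # embedding generation
    hits = search_fn(vec)                            # cosine vector search
    context = "\n\n".join(h["text"] for h in hits)   # context assembly
    answer = llm_fn(query, context)                  # LLM generation
    sources = [h["source"] for h in hits]            # citations
    return {"answer": answer, "sources": sources}    # structured response
```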
## Document Processing Pipeline

### Initialization Phase

1. **Document Loading**
   - 22 synthetic policy files
   - Markdown format with structured metadata
2. **Chunking Strategy**
   - Semantic chunking preserving context
   - Target chunk size: ~400 tokens
   - Overlap: 50 tokens for continuity
   - Total chunks: 170+
3. **Embedding Generation**
   - Batch processing for efficiency
   - Compliance with HuggingFace API rate limits
   - Memory optimization for large datasets
4. **Vector Storage**
   - Parquet format for efficient storage
   - JSON metadata for complex structures
   - Upload to HuggingFace Dataset
   - Local caching for development
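A fixed-size sliding window illustrates the chunk-size and overlap targets above; the app's actual semantic chunking would additionally respect section boundaries, which this sketch does not.

```python
def chunk_tokens(tokens, size=400, overlap=50):
    """Split a token sequence into overlapping chunks.

    Sizes mirror the documented targets (~400-token chunks, 50-token
    overlap). `chunk_tokens` is an illustrative helper, not the app's code.
    """
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```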
## Configuration Management

### Environment Variables

#### Required for Production

```bash
HF_TOKEN=hf_xxx...                 # HuggingFace API access
OPENROUTER_API_KEY=sk-or-v1-xxx... # OpenRouter API access
```

#### Optional Configuration

```bash
USE_OPENAI_EMBEDDING=false        # Force HF embeddings (overridden when HF_TOKEN present)
ENABLE_HF_SERVICES=true           # Enable HF services (auto-detected)
ENABLE_HF_PROCESSING=true         # Enable document processing
REBUILD_EMBEDDINGS_ON_START=false # Force rebuild
```
### Configuration Override System

The application implements a triple-layer override system to ensure the hybrid services are used:

1. **Configuration Level** (`src/config.py`)
   - Forces `USE_OPENAI_EMBEDDING=false` when `HF_TOKEN` is available
   - Ensures HF embeddings are used
2. **Application Factory Level** (`src/app_factory.py`)
   - Overrides service selection during RAG pipeline initialization
   - Uses `LLMService.from_environment()` for OpenRouter
3. **Routes Level** (`src/routes/main_routes.py`)
   - Ensures consistent service usage across API endpoints
   - Hybrid pipeline: HF embeddings + OpenRouter LLM
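The configuration-level precedence can be sketched as follows. `resolve_embedding_backend` is a hypothetical helper illustrating the documented rule (HF_TOKEN wins over `USE_OPENAI_EMBEDDING`), not the actual code in `src/config.py`.

```python
import os

def resolve_embedding_backend(env=None):
    """Pick the embedding backend; HF_TOKEN forces HuggingFace (sketch)."""
    env = os.environ if env is None else env
    if env.get("HF_TOKEN"):
        return "huggingface"          # override: ignore USE_OPENAI_EMBEDDING
    if env.get("USE_OPENAI_EMBEDDING", "false").lower() == "true":
        return "openai"
    return "local-onnx"               # development fallback
```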
## Deployment Architecture

### HuggingFace Spaces Deployment

```
┌─────────────────────────────────────────────────────────────────┐
│                       HUGGINGFACE SPACES                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                     FLASK APPLICATION                     │  │
│  │                                                           │  │
│  │  ┌─────────────────┐        ┌─────────────────┐           │  │
│  │  │  RAG PIPELINE   │        │  WEB INTERFACE  │           │  │
│  │  │                 │        │                 │           │  │
│  │  │ Search Service  │        │ Chat Interface  │           │  │
│  │  │ LLM Service     │        │ API Endpoints   │           │  │
│  │  │ Context Manager │        │ Health Checks   │           │  │
│  │  └─────────────────┘        └─────────────────┘           │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  External Services:                                             │
│  ├─ HuggingFace Inference API (embeddings)                      │
│  ├─ HuggingFace Dataset (vector storage)                        │
│  └─ OpenRouter API (LLM generation)                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
### Resource Requirements

- **CPU**: Basic tier (sufficient for I/O-bound operations)
- **Memory**: ~512 MB (optimized for Spaces limits)
- **Storage**: Small tier (document cache + temporary files)
- **Network**: External API calls for all major services
## Migration History

### Evolution of the Architecture

1. **Phase 1: OpenAI-based (expensive)**
   - OpenAI embeddings + GPT models
   - High API costs
   - Excellent reliability
2. **Phase 2: Full HuggingFace (problematic)**
   - HF embeddings + HF LLM models
   - Cost-effective
   - LLM reliability issues (404 errors)
3. **Phase 3: Hybrid (current, optimal)**
   - HF embeddings + OpenRouter LLM
   - Cost-effective
   - Reliable LLM generation
   - Best of both worlds

### Why a Hybrid Architecture?

- **HuggingFace Embeddings**: Stable, reliable, cost-effective
- **HuggingFace Vector Store**: Persistent, efficient, free
- **OpenRouter LLM**: Reliable, no 404 errors, free tier available
- **Overall**: Optimal balance of cost, reliability, and performance
## Development Guidelines

### Local Development

1. Set both API tokens in the environment
2. The application auto-detects the hybrid configuration
3. Falls back to local ONNX embeddings if HF is unavailable
4. Uses file-based vector storage for development

### Production Deployment

1. Ensure both tokens are set in HuggingFace Spaces secrets
2. The application automatically uses the hybrid services
3. Persistent vector storage via HuggingFace Dataset
4. Automatic document processing on startup

### Monitoring and Health Checks

- `/health` - Overall application health
- `/debug/rag` - RAG pipeline diagnostics
- Comprehensive logging for all service interactions
- Error tracking and graceful degradation
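A minimal sketch of what the `/health` endpoint might look like in Flask. The payload shape here is an assumption for illustration, not the app's actual response schema.

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/health")
def health():
    """Report overall status plus which backend each service uses (sketch)."""
    return jsonify({
        "status": "ok",
        "services": {
            "embeddings": "huggingface-inference-api",
            "vector_store": "huggingface-dataset",
            "llm": "openrouter",
        },
    })
```

A fuller implementation would actively probe each external API and degrade the status accordingly rather than returning static values.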
## Performance Characteristics

### Latency Breakdown (Typical Query)

- **Embedding Generation**: ~200-500 ms (HF API)
- **Vector Search**: ~50-100 ms (local computation)
- **LLM Generation**: ~1-3 s (OpenRouter API)
- **Total Response Time**: ~2-4 s

### Throughput Considerations

- **HuggingFace API**: Rate-limited on the free tier
- **OpenRouter API**: Rate-limited on the free tier
- **Vector Search**: Limited by local CPU/memory
- **Concurrent Users**: ~5-10 (estimated)

### Scalability

- **Horizontal**: Multiple Spaces instances
- **Vertical**: Upgrade to a larger Spaces tier
- **Caching**: Implement response caching for common queries
- **CDN**: Static asset delivery optimization
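The response-caching idea above can be sketched as a tiny in-process LRU cache; `ResponseCache` and its capacity are illustrative, and a production version would likely live in Redis (as suggested under Future Enhancements) with an expiry policy.

```python
from collections import OrderedDict

class ResponseCache:
    """Tiny LRU cache for repeated queries (sketch, not the app's code)."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, query):
        if query in self._data:
            self._data.move_to_end(query)   # mark as recently used
            return self._data[query]
        return None

    def put(self, query, response):
        self._data[query] = response
        self._data.move_to_end(query)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used
```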
## Security Considerations

### API Key Management

- Environment variables for sensitive tokens
- HuggingFace Spaces secrets for production
- No hardcoded credentials in the codebase

### Data Privacy

- No persistent user data storage
- Ephemeral query processing
- No logging of sensitive information
- GDPR-compliant by design

### Content Safety

- Built-in guardrails for inappropriate content
- Bias detection and mitigation
- PII detection and filtering
- Response validation
## Future Enhancements

### Potential Improvements

1. **Caching Layer**: Redis for common queries
2. **Model Upgrades**: Better LLM models as they become available
3. **Multi-modal**: Support for document images and PDFs
4. **Advanced RAG**: Re-ranking, query expansion, multi-hop reasoning
5. **Analytics**: User interaction tracking and optimization

### Migration Considerations

- Maintain backward compatibility
- Gradual service migration strategies
- A/B testing for service comparisons
- Performance monitoring during transitions