# 🏗️ Architecture Documentation
## Overview
This RAG (Retrieval-Augmented Generation) application uses a hybrid architecture combining HuggingFace services with OpenRouter to provide reliable, cost-effective corporate policy assistance.
## 🔧 Service Architecture
### Current Stack (October 2025)
```
┌─────────────────────────────────────────────────────────────────┐
│                     HYBRID RAG ARCHITECTURE                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │   EMBEDDINGS    │  │  VECTOR STORE   │  │   LLM SERVICE   │  │
│  │                 │  │                 │  │                 │  │
│  │  HuggingFace    │  │  HuggingFace    │  │  OpenRouter     │  │
│  │  Inference API  │  │  Dataset        │  │  WizardLM       │  │
│  │                 │  │                 │  │                 │  │
│  │ multilingual-e5 │  │  Persistent     │  │  Free Tier      │  │
│  │ 1024 dimensions │  │  Parquet Format │  │  Reliable       │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
### Service Details
#### 1. Embedding Service
- **Provider**: HuggingFace Inference API
- **Model**: `intfloat/multilingual-e5-large`
- **Dimensions**: 1024
- **Features**:
  - Automatic batching for efficiency
  - Fallback to local ONNX models for development
  - Memory-optimized processing
  - Triple-layer configuration override
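The batching behavior can be sketched as follows. This is an illustrative sketch, not the project's actual client: `embed_texts` and the injected `post` callable are hypothetical names, and in production `post` would wrap an authenticated HTTP call to the HuggingFace Inference API (or a local ONNX model as the development fallback).

```python
from typing import Callable, Iterator, List

HF_EMBED_MODEL = "intfloat/multilingual-e5-large"  # 1024-dimensional output

def batch_iter(texts: List[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield fixed-size batches so large document sets stay within API limits."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

def embed_texts(
    texts: List[str],
    post: Callable[[List[str]], List[List[float]]],
    batch_size: int = 32,
) -> List[List[float]]:
    """Embed texts batch-by-batch via an injected transport.

    `post` abstracts the backend: an HF Inference API call in production,
    a local ONNX model in development.
    """
    vectors: List[List[float]] = []
    for batch in batch_iter(texts, batch_size):
        vectors.extend(post(batch))
    return vectors

# Stand-in transport for demonstration (real code would call the HF API):
fake_post = lambda batch: [[float(len(t))] * 4 for t in batch]
vecs = embed_texts(["hello", "world!"], fake_post, batch_size=1)
```

Injecting the transport keeps the batching logic testable without network access, which is also how the local-fallback behavior can share one code path.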
#### 2. Vector Store
- **Provider**: HuggingFace Dataset
- **Storage Format**: Parquet + JSON metadata
- **Features**:
  - Persistent storage across deployments
  - Cosine similarity search
  - Metadata preservation
  - Complete interface compatibility
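The cosine-similarity search over stored vectors amounts to the following. A minimal pure-Python sketch under assumed names (`search`, a `rows` list of vector-plus-metadata records); the real store loads these records from the Parquet dataset.

```python
import math
from typing import Dict, List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query_vec: List[float],
           rows: List[Dict],
           top_k: int = 3) -> List[Tuple[float, Dict]]:
    """Rank stored rows by similarity to the query, preserving metadata."""
    scored = [(cosine(query_vec, r["vector"]), r["metadata"]) for r in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

rows = [
    {"vector": [1.0, 0.0], "metadata": {"doc": "vacation-policy.md"}},
    {"vector": [0.0, 1.0], "metadata": {"doc": "expense-policy.md"}},
]
hits = search([0.9, 0.1], rows, top_k=1)
# hits[0][1]["doc"] is "vacation-policy.md"
```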
#### 3. LLM Service
- **Provider**: OpenRouter
- **Model**: `microsoft/wizardlm-2-8x22b`
- **Features**:
  - Free tier access
  - Reliable availability (no 404 errors)
  - Automatic prompt formatting
  - Built-in safety filtering
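OpenRouter exposes an OpenAI-compatible chat-completions endpoint, so the request the LLM service sends can be sketched like this. `build_chat_request` is a hypothetical helper and the system-prompt wording is illustrative; only the endpoint URL, model slug, and message schema come from the stack described above.

```python
import json

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
MODEL = "microsoft/wizardlm-2-8x22b"

def build_chat_request(question: str, context: str) -> dict:
    """Assemble an OpenAI-compatible chat payload for OpenRouter."""
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "system",
                "content": "Answer strictly from the provided policy context.\n\n"
                           f"Context:\n{context}",
            },
            {"role": "user", "content": question},
        ],
        "temperature": 0.2,  # low temperature for policy Q&A consistency
    }

payload = build_chat_request("How many vacation days do I get?", "...")
body = json.dumps(payload)  # POSTed to OPENROUTER_URL with an Authorization header
```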
## 🔄 Data Flow
```
User Query
    ↓
┌─────────────────────┐
│  Query Processing   │ ← Natural language understanding
└─────────────────────┘
    ↓
┌─────────────────────┐
│  Embedding          │ ← HuggingFace Inference API
│  Generation         │   (multilingual-e5-large)
└─────────────────────┘
    ↓
┌─────────────────────┐
│  Vector Search      │ ← HuggingFace Dataset
│                     │   Cosine similarity
└─────────────────────┘
    ↓
┌─────────────────────┐
│  Context Assembly   │ ← Retrieved documents + metadata
└─────────────────────┘
    ↓
┌─────────────────────┐
│  LLM Generation     │ ← OpenRouter WizardLM
│                     │   Prompt + context → response
└─────────────────────┘
    ↓
┌─────────────────────┐
│  Response           │ ← Formatted answer + citations
│  Formatting         │
└─────────────────────┘
    ↓
Structured Response
```
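The context-assembly step in the flow above can be sketched as a size-budgeted concatenation of retrieved chunks with their source tags. `assemble_context` and the chunk field names (`source`, `text`) are hypothetical; the real pipeline carries richer metadata.

```python
from typing import Dict, List

def assemble_context(chunks: List[Dict], max_chars: int = 4000) -> str:
    """Join retrieved chunks under a character budget, tagging each with
    its source document so the LLM response can cite it."""
    parts: List[str] = []
    used = 0
    for chunk in chunks:
        block = f"[{chunk['source']}]\n{chunk['text']}"
        if used + len(block) > max_chars:
            break  # stop before overflowing the prompt budget
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)

ctx = assemble_context([
    {"source": "vacation-policy.md", "text": "Employees accrue 20 days per year."},
    {"source": "expense-policy.md", "text": "Receipts are required over $25."},
])
```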
## 📊 Document Processing Pipeline
### Initialization Phase
1. **Document Loading**
   - 22 synthetic policy files
   - Markdown format with structured metadata
2. **Chunking Strategy**
   - Semantic chunking preserving context
   - Target chunk size: ~400 tokens
   - Overlap: 50 tokens for continuity
   - Total chunks: 170+
3. **Embedding Generation**
   - Batch processing for efficiency
   - HuggingFace API rate limiting compliance
   - Memory optimization for large datasets
4. **Vector Storage**
   - Parquet format for efficient storage
   - JSON metadata for complex structures
   - Upload to HuggingFace Dataset
   - Local caching for development
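The overlap mechanics of step 2 can be illustrated with a simplified word-window chunker. This approximates tokens with whitespace-split words and ignores the semantic boundaries the real chunker respects; `chunk_words` is a hypothetical name.

```python
from typing import List

def chunk_words(text: str, size: int = 400, overlap: int = 50) -> List[str]:
    """Split text into overlapping windows of `size` words, with `overlap`
    words repeated at each boundary for continuity."""
    words = text.split()
    if not words:
        return []
    step = size - overlap  # advance by 350 words per chunk by default
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # last window already covers the tail
    return chunks

chunks = chunk_words("word " * 1000, size=400, overlap=50)
# 1000 words at a 350-word step -> 3 chunks
```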
## 🔧 Configuration Management
### Environment Variables
#### Required for Production
```bash
HF_TOKEN=hf_xxx... # HuggingFace API access
OPENROUTER_API_KEY=sk-or-v1-xxx... # OpenRouter API access
```
#### Optional Configuration
```bash
USE_OPENAI_EMBEDDING=false # Force HF embeddings (overridden when HF_TOKEN present)
ENABLE_HF_SERVICES=true # Enable HF services (auto-detected)
ENABLE_HF_PROCESSING=true # Enable document processing
REBUILD_EMBEDDINGS_ON_START=false # Force rebuild
```
### Configuration Override System
The application implements a triple-layer override system to ensure hybrid services are used:
1. **Configuration Level** (`src/config.py`)
   - Forces `USE_OPENAI_EMBEDDING=false` when `HF_TOKEN` is available
   - Ensures HF embeddings are used
2. **Application Factory Level** (`src/app_factory.py`)
   - Overrides service selection during RAG pipeline initialization
   - Uses `LLMService.from_environment()` for OpenRouter
3. **Routes Level** (`src/routes/main_routes.py`)
   - Ensures consistent service usage in API endpoints
   - Hybrid pipeline: HF embeddings + OpenRouter LLM
## 🚀 Deployment Architecture
### HuggingFace Spaces Deployment
```
┌───────────────────────────────────────────────────────────────┐
│                      HUGGINGFACE SPACES                       │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │                    FLASK APPLICATION                    │  │
│  │                                                         │  │
│  │  ┌───────────────────┐      ┌───────────────────┐       │  │
│  │  │   RAG PIPELINE    │      │   WEB INTERFACE   │       │  │
│  │  │                   │      │                   │       │  │
│  │  │  Search Service   │      │  Chat Interface   │       │  │
│  │  │  LLM Service      │      │  API Endpoints    │       │  │
│  │  │  Context Manager  │      │  Health Checks    │       │  │
│  │  └───────────────────┘      └───────────────────┘       │  │
│  └─────────────────────────────────────────────────────────┘  │
│                                                               │
│  External Services:                                           │
│  ├─ HuggingFace Inference API (embeddings)                    │
│  ├─ HuggingFace Dataset (vector storage)                      │
│  └─ OpenRouter API (LLM generation)                           │
│                                                               │
└───────────────────────────────────────────────────────────────┘
```
### Resource Requirements
- **CPU**: Basic tier (sufficient for I/O-bound operations)
- **Memory**: ~512MB (optimized for Spaces limits)
- **Storage**: Small tier (document cache + temporary files)
- **Network**: External API calls for all major services
## 🔄 Migration History
### Evolution of Architecture
1. **Phase 1**: OpenAI-based (expensive)
   - OpenAI embeddings + GPT models
   - High API costs
   - Excellent reliability
2. **Phase 2**: Full HuggingFace (problematic)
   - HF embeddings + HF LLM models
   - Cost-effective
   - LLM reliability issues (404 errors)
3. **Phase 3**: Hybrid (current, optimal)
   - HF embeddings + OpenRouter LLM
   - Cost-effective
   - Reliable LLM generation
   - Best of both worlds
### Why Hybrid Architecture?
- **HuggingFace Embeddings**: Stable, reliable, cost-effective
- **HuggingFace Vector Store**: Persistent, efficient, free
- **OpenRouter LLM**: Reliable, no 404 errors, free tier available
- **Overall**: Optimal balance of cost, reliability, and performance
## 🛠️ Development Guidelines
### Local Development
1. Set both API tokens in environment
2. Application auto-detects hybrid configuration
3. Falls back to local ONNX embeddings if HF unavailable
4. Uses file-based vector storage for development
### Production Deployment
1. Ensure both tokens are set in HuggingFace Spaces secrets
2. Application automatically uses hybrid services
3. Persistent vector storage via HuggingFace Dataset
4. Automatic document processing on startup
### Monitoring and Health Checks
- `/health` - Overall application health
- `/debug/rag` - RAG pipeline diagnostics
- Comprehensive logging for all service interactions
- Error tracking and graceful degradation
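The aggregation behind a `/health` response can be sketched without any web framework. `health_report` and the per-service probe names are hypothetical; in the app, each probe would ping the corresponding external service.

```python
from typing import Any, Callable, Dict

def health_report(checks: Dict[str, Callable[[], bool]]) -> Dict[str, Any]:
    """Run per-service probes and aggregate into a /health-style payload,
    degrading gracefully when a probe fails or raises."""
    results: Dict[str, str] = {}
    for name, probe in checks.items():
        try:
            results[name] = "ok" if probe() else "degraded"
        except Exception:
            results[name] = "error"
    status = "ok" if all(v == "ok" for v in results.values()) else "degraded"
    return {"status": status, "services": results}

report = health_report({
    "embeddings": lambda: True,
    "vector_store": lambda: True,
    "llm": lambda: False,  # simulate an unreachable LLM provider
})
```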
## 📈 Performance Characteristics
### Latency Breakdown (Typical Query)
- **Embedding Generation**: ~200-500ms (HF API)
- **Vector Search**: ~50-100ms (local computation)
- **LLM Generation**: ~1-3s (OpenRouter API)
- **Total Response Time**: ~2-4s
### Throughput Considerations
- **HuggingFace API**: Rate limited by free tier
- **OpenRouter API**: Rate limited by free tier
- **Vector Search**: Limited by local CPU/memory
- **Concurrent Users**: ~5-10 concurrent (estimated)
### Scalability
- **Horizontal**: Multiple Spaces instances
- **Vertical**: Upgrade to larger Spaces tier
- **Caching**: Implement response caching for common queries
- **CDN**: Static asset delivery optimization
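The response-caching idea above could start as simply as a time-bounded in-process cache, before reaching for Redis. A minimal sketch with a hypothetical `TTLCache` class; keys would likely be normalized query strings.

```python
import time
from typing import Any, Dict, Optional, Tuple

class TTLCache:
    """Minimal time-bounded cache for repeated identical queries."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # expired: evict and miss
            return None
        return value

    def put(self, key: str, value: Any) -> None:
        self._store[key] = (time.monotonic(), value)

cache = TTLCache(ttl_seconds=300)
cache.put("vacation days?", "Employees accrue 20 days per year.")
```

A cache hit skips the embedding call, vector search, and LLM generation entirely, so the ~2-4 s typical latency drops to microseconds for repeated questions.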
## 🔒 Security Considerations
### API Key Management
- Environment variables for sensitive tokens
- HuggingFace Spaces secrets for production
- No hardcoded credentials in codebase
### Data Privacy
- No persistent user data storage
- Ephemeral query processing
- No logging of sensitive information
- Designed for GDPR compliance (no retention of personal data)
### Content Safety
- Built-in guardrails for inappropriate content
- Bias detection and mitigation
- PII detection and filtering
- Response validation
## 🔮 Future Enhancements
### Potential Improvements
1. **Caching Layer**: Redis for common queries
2. **Model Upgrades**: Better LLM models as they become available
3. **Multi-modal**: Support for document images and PDFs
4. **Advanced RAG**: Re-ranking, query expansion, multi-hop reasoning
5. **Analytics**: User interaction tracking and optimization
### Migration Considerations
- Maintain backward compatibility
- Gradual service migration strategies
- A/B testing for service comparisons
- Performance monitoring during transitions