| # Deployment Notes |
|
|
| ## Hugging Face Spaces Deployment |
|
|
| ### NVIDIA T4 Medium Configuration |
| This MVP is optimized for **NVIDIA T4 Medium** GPU deployment on Hugging Face Spaces. |
|
|
| #### Hardware Specifications |
| - **GPU**: NVIDIA T4 (persistent, always available) |
| - **vCPU**: 8 cores |
| - **RAM**: 30GB |
- **vRAM**: 16GB
| - **Storage**: ~20GB |
| - **Network**: Shared infrastructure |
|
|
| #### Resource Capacity |
- **GPU Memory**: 16GB vRAM (sufficient for local model loading, though the 7B model in FP16 uses most of it)
| - **System Memory**: 30GB RAM (excellent for caching and processing) |
| - **CPU**: 8 vCPU (good for parallel operations) |
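
Before loading models, it can be useful to confirm what hardware the Space actually allocated. A minimal sketch (not part of the repo) using PyTorch:

```python
import torch

# Startup sanity check: confirm the allocated GPU and its memory before loading models.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")  # a T4 reports ~16 GB
else:
    print("No GPU detected; local model loading will not be available")
```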
|
|
| ### Environment Variables |
| Required environment variables for deployment: |
|
|
| ```bash |
| HF_TOKEN=your_huggingface_token_here |
| HF_HOME=/tmp/huggingface |
| MAX_WORKERS=4 |
| CACHE_TTL=3600 |
| DB_PATH=sessions.db |
| FAISS_INDEX_PATH=embeddings.faiss |
| SESSION_TIMEOUT=3600 |
| MAX_SESSION_SIZE_MB=10 |
| MOBILE_MAX_TOKENS=800 |
| MOBILE_TIMEOUT=15000 |
| GRADIO_PORT=7860 |
| GRADIO_HOST=0.0.0.0 |
| LOG_LEVEL=INFO |
| ``` |
|
|
| ### Space Configuration |
Create a `README.md` in the HF Space with the following front matter:
|
|
| ```yaml |
| --- |
| title: AI Research Assistant MVP |
| emoji: 🧠 |
| colorFrom: blue |
| colorTo: purple |
| sdk: docker |
| app_port: 7860 |
| pinned: false |
| license: apache-2.0 |
| --- |
| ``` |
| |
| ### Deployment Steps |
| |
| 1. **Clone/Setup Repository** |
| ```bash |
| git clone your-repo |
| cd Research_Assistant |
| ``` |
| |
| 2. **Install Dependencies** |
| ```bash |
| bash install.sh |
| # or |
| pip install -r requirements.txt |
| ``` |
| |
| 3. **Test Installation** |
| ```bash |
| python test_setup.py |
| # or |
| bash quick_test.sh |
| ``` |
| |
| 4. **Run Locally** |
| ```bash |
| python app.py |
| ``` |
| |
| 5. **Deploy to HF Spaces** |
| - Push to GitHub |
| - Connect to HF Spaces |
| - Select NVIDIA T4 Medium GPU hardware |
| - Deploy |
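
If you prefer not to push via git, the same upload can be done programmatically with `huggingface_hub` (a sketch; the `repo_id` is a placeholder, and the hardware tier is still selected in the Space settings):

```python
from huggingface_hub import HfApi

# Uploads the working directory to an existing Space; reads HF_TOKEN from the environment.
api = HfApi()
api.upload_folder(
    folder_path=".",
    repo_id="your-username/research-assistant-mvp",  # placeholder Space id
    repo_type="space",
)
```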
| |
| ### Resource Management |
| |
| #### Memory Limits |
| - **Base Python**: ~100MB |
| - **Gradio**: ~50MB |
| - **Models (loaded on GPU)**: ~14-16GB vRAM |
- Primary model (Qwen/Qwen2.5-7B-Instruct): ~14-15GB
| - Embedding model: ~500MB |
| - Classification models: ~500MB each |
| - **System RAM**: ~2-4GB for caching and processing |
| - **Cache**: ~500MB-1GB max |
| |
**GPU Memory Budget**: 16GB vRAM (the FP16 models use most of it, so leave headroom for activations and the KV cache)
| **System RAM Budget**: 30GB (plenty of headroom) |
| |
| #### Strategies |
| - **Local GPU Model Loading**: Models loaded on GPU for faster inference |
- **Lazy Loading**: Models loaded on demand to speed up startup (see the sketch below)
| - **GPU Memory Management**: Automatic device placement with FP16 precision |
| - **Caching**: Aggressive caching with 30GB RAM available |
- **Response Streaming**: Stream responses to reduce peak memory during generation
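
A minimal sketch of the lazy-loading and FP16 placement strategy, assuming `transformers` with `accelerate` installed; the `load_model()` helper and module-level cache are illustrative, not the repo's actual code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"
_model, _tokenizer = None, None  # nothing is loaded at startup

def load_model():
    """Load the primary model on first use (lazy loading) in FP16 on the GPU."""
    global _model, _tokenizer
    if _model is None:
        _tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        _model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID,
            torch_dtype=torch.float16,  # FP16 to fit within the T4's 16GB of vRAM
            device_map="auto",          # automatic device placement (requires accelerate)
        )
    return _model, _tokenizer
```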
| |
| ### Performance Optimization |
| |
| #### For NVIDIA T4 GPU |
| 1. **Local Model Loading**: Models run locally on GPU (faster than API) |
| - Primary model: Qwen/Qwen2.5-7B-Instruct (~14GB vRAM) |
| - Embedding model: sentence-transformers/all-MiniLM-L6-v2 (~500MB) |
| 2. **GPU Acceleration**: All inference runs on GPU |
| 3. **Parallel Processing**: 4 workers (MAX_WORKERS=4) for concurrent requests |
4. **Fallback to API**: Automatically falls back to the HF Inference API if local models fail (see the sketch below)
| 5. **Request Queuing**: Built-in async request handling |
| 6. **Response Streaming**: Implemented for efficient memory usage |
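
The local-first path with API fallback (item 4 above) might look like the following sketch; for brevity it reloads the model on every call rather than caching it as in the lazy-loading sketch earlier, and error handling is simplified:

```python
import torch
from huggingface_hub import InferenceClient
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"

def generate(prompt: str, max_new_tokens: int = 800) -> str:
    try:
        # Local GPU path: generate on the T4 with the FP16 model
        tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID, torch_dtype=torch.float16, device_map="auto"
        )
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
        return tokenizer.decode(output[0], skip_special_tokens=True)
    except Exception:
        # Fallback: hosted HF Inference API (authenticates with HF_TOKEN if set)
        client = InferenceClient(model=MODEL_ID)
        return client.text_generation(prompt, max_new_tokens=max_new_tokens)
```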
| |
| #### Mobile Optimizations |
- Cap generation at 800 tokens (`MOBILE_MAX_TOKENS=800`)
- Shorten the request timeout to 15s (`MOBILE_TIMEOUT=15000`, in milliseconds)
| - Implement progressive loading |
| - Use touch-optimized UI |
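
A rough sketch of how the token and timeout limits could be switched per client; the user-agent check and the desktop defaults (2048 tokens, 60s) are assumptions, not values from the repo:

```python
import os

MOBILE_MAX_TOKENS = int(os.getenv("MOBILE_MAX_TOKENS", "800"))
MOBILE_TIMEOUT_S = int(os.getenv("MOBILE_TIMEOUT", "15000")) / 1000  # ms -> s

def generation_limits(user_agent: str) -> dict:
    """Return tighter limits for mobile clients, based on a simple user-agent check."""
    is_mobile = "Mobile" in user_agent or "Android" in user_agent
    return {
        "max_new_tokens": MOBILE_MAX_TOKENS if is_mobile else 2048,  # desktop default assumed
        "timeout_s": MOBILE_TIMEOUT_S if is_mobile else 60,          # desktop default assumed
    }
```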
| |
| ### Monitoring |
| |
| #### Health Checks |
| - Application health endpoint: `/health` |
| - Database connectivity check |
| - Cache hit rate monitoring |
| - Response time tracking |
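
One way (not necessarily how `app.py` does it) to expose a `/health` endpoint next to the Gradio UI is to mount Gradio on a FastAPI app; the database check here is deliberately minimal:

```python
import sqlite3
import gradio as gr
from fastapi import FastAPI

api = FastAPI()

@api.get("/health")
def health():
    try:
        sqlite3.connect("sessions.db").execute("SELECT 1")  # database connectivity check
        return {"status": "ok"}
    except Exception as exc:
        return {"status": "error", "detail": str(exc)}

demo = gr.Blocks()  # placeholder for the real UI defined in app.py
app = gr.mount_gradio_app(api, demo, path="/")
# Serve with: uvicorn app:app --host 0.0.0.0 --port 7860
```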
| |
| #### Logging |
| - Use structured logging (structlog) |
| - Log levels: DEBUG (dev), INFO (prod) |
| - Monitor error rates |
| - Track performance metrics |
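
A minimal structlog configuration consistent with the notes above (JSON output, level taken from `LOG_LEVEL`); the exact processor chain is a common default, not necessarily what the repo uses:

```python
import logging
import os
import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.make_filtering_bound_logger(
        getattr(logging, os.getenv("LOG_LEVEL", "INFO"))
    ),
)

log = structlog.get_logger()
log.info("request_completed", latency_ms=142, cache_hit=True)  # example event
```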
| |
| ### Troubleshooting |
| |
| #### Common Issues |
| |
| **Issue**: Out of memory errors |
- **Solution**: Reduce `MAX_WORKERS` and rely on request queuing
| |
| **Issue**: Slow responses |
| - **Solution**: Enable aggressive caching, use streaming |
| |
| **Issue**: Model loading failures |
| - **Solution**: Use HF Inference API instead of local models |
| |
| **Issue**: Session data loss |
| - **Solution**: Implement proper persistence with SQLite backup |
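
For the session-persistence issue, the standard library's online backup API is one option; the paths below are placeholders:

```python
import sqlite3

def backup_sessions(src_path: str = "sessions.db",
                    dest_path: str = "sessions.backup.db") -> None:
    """Copy the live session database to a backup file without stopping the app."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    with dest:
        src.backup(dest)  # SQLite online backup; safe while the database is in use
    src.close()
    dest.close()
```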
| |
| ### Scaling Considerations |
| |
| #### For Production |
| 1. **Horizontal Scaling**: Deploy multiple instances |
| 2. **Caching Layer**: Add Redis for shared session data |
| 3. **Load Balancing**: Use HF Spaces built-in load balancer |
| 4. **CDN**: Static assets via CDN |
| 5. **Database**: Consider PostgreSQL for production |
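
A sketch of the shared session cache from item 2 above, using `redis-py`; the key scheme, `REDIS_HOST` variable, and reuse of `CACHE_TTL` are assumptions:

```python
import json
import os
import redis

r = redis.Redis(host=os.getenv("REDIS_HOST", "localhost"), port=6379, db=0)
CACHE_TTL = int(os.getenv("CACHE_TTL", "3600"))

def save_session(session_id: str, data: dict) -> None:
    # Store session data with a TTL so idle sessions expire automatically
    r.setex(f"session:{session_id}", CACHE_TTL, json.dumps(data))

def load_session(session_id: str):
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```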
| |
| #### Migration Path |
- **Phase 1**: MVP on NVIDIA T4 Medium with local models (current)
- **Phase 2**: Upgrade GPU hardware as model requirements grow
| - **Phase 3**: Scale to multiple workers |
| - **Phase 4**: Enterprise deployment with managed infrastructure |
| |
| |