| # Deployment Notes |
|
|
| ## Hugging Face Spaces Deployment |
|
|
| ### NVIDIA T4 Medium Configuration |
| This MVP is optimized for **NVIDIA T4 Medium** GPU deployment on Hugging Face Spaces. |
|
|
| #### Hardware Specifications |
| - **GPU**: NVIDIA T4 (persistent, always available) |
| - **vCPU**: 8 cores |
| - **RAM**: 30GB |
- **vRAM**: 16GB
| - **Storage**: ~20GB |
| - **Network**: Shared infrastructure |
|
|
| #### Resource Capacity |
- **GPU Memory**: 16GB vRAM (sufficient for local model loading, though the 7B model in FP16 uses most of it)
| - **System Memory**: 30GB RAM (excellent for caching and processing) |
| - **CPU**: 8 vCPU (good for parallel operations) |
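
Before loading models, it can be useful to confirm what hardware the Space actually allocated. A minimal sketch (not part of the repo) using PyTorch:

```python
import torch

# Startup sanity check: confirm the allocated GPU and its memory before loading models.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")  # a T4 reports ~16 GB
else:
    print("No GPU detected; local model loading will not be available")
```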
|
|
| ### Environment Variables |
| Required environment variables for deployment: |
|
|
| ```bash |
| HF_TOKEN=your_huggingface_token_here |
| HF_HOME=/tmp/huggingface |
| MAX_WORKERS=4 |
| CACHE_TTL=3600 |
| DB_PATH=sessions.db |
| FAISS_INDEX_PATH=embeddings.faiss |
| SESSION_TIMEOUT=3600 |
| MAX_SESSION_SIZE_MB=10 |
| MOBILE_MAX_TOKENS=800 |
| MOBILE_TIMEOUT=15000 |
| GRADIO_PORT=7860 |
| GRADIO_HOST=0.0.0.0 |
| LOG_LEVEL=INFO |
| ``` |
|
|
| ### Space Configuration |
Create a `README.md` in the HF Space with the following front matter:
|
|
| ```yaml |
| --- |
| title: AI Research Assistant MVP |
| emoji: 🧠 |
| colorFrom: blue |
| colorTo: purple |
| sdk: docker |
| app_port: 7860 |
| pinned: false |
| license: apache-2.0 |
| --- |
| ``` |
| |
| ### Deployment Steps |
| |
| 1. **Clone/Setup Repository** |
| ```bash |
| git clone your-repo |
| cd Research_Assistant |
| ``` |
| |
| 2. **Install Dependencies** |
| ```bash |
| bash install.sh |
| # or |
| pip install -r requirements.txt |
| ``` |
| |
| 3. **Test Installation** |
| ```bash |
| python test_setup.py |
| # or |
| bash quick_test.sh |
| ``` |
| |
| 4. **Run Locally** |
| ```bash |
| python app.py |
| ``` |
| |
| 5. **Deploy to HF Spaces** |
| - Push to GitHub |
| - Connect to HF Spaces |
| - Select NVIDIA T4 Medium GPU hardware |
| - Deploy |
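
If you prefer not to push via git, the same upload can be done programmatically with `huggingface_hub` (a sketch; the `repo_id` is a placeholder, and the hardware tier is still selected in the Space settings):

```python
from huggingface_hub import HfApi

# Uploads the working directory to an existing Space; reads HF_TOKEN from the environment.
api = HfApi()
api.upload_folder(
    folder_path=".",
    repo_id="your-username/research-assistant-mvp",  # placeholder Space id
    repo_type="space",
)
```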
| |
| ### Resource Management |
| |
| #### Memory Limits |
| - **Base Python**: ~100MB |
| - **Gradio**: ~50MB |
| - **Models (loaded on GPU)**: ~14-16GB vRAM |
- Primary model (Qwen/Qwen2.5-7B-Instruct): ~14-15GB
| - Embedding model: ~500MB |
| - Classification models: ~500MB each |
| - **System RAM**: ~2-4GB for caching and processing |
| - **Cache**: ~500MB-1GB max |
| |
**GPU Memory Budget**: 16GB vRAM (the FP16 models use most of it, so leave headroom for activations and the KV cache)
| **System RAM Budget**: 30GB (plenty of headroom) |
| |
| #### Strategies |
| - **Local GPU Model Loading**: Models loaded on GPU for faster inference |
- **Lazy Loading**: Models loaded on demand to speed up startup (see the sketch below)
| - **GPU Memory Management**: Automatic device placement with FP16 precision |
| - **Caching**: Aggressive caching with 30GB RAM available |
- **Response Streaming**: Stream responses to reduce peak memory during generation
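
A minimal sketch of the lazy-loading and FP16 placement strategy, assuming `transformers` with `accelerate` installed; the `load_model()` helper and module-level cache are illustrative, not the repo's actual code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"
_model, _tokenizer = None, None  # nothing is loaded at startup

def load_model():
    """Load the primary model on first use (lazy loading) in FP16 on the GPU."""
    global _model, _tokenizer
    if _model is None:
        _tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        _model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID,
            torch_dtype=torch.float16,  # FP16 to fit within the T4's 16GB of vRAM
            device_map="auto",          # automatic device placement (requires accelerate)
        )
    return _model, _tokenizer
```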
| |
| ### Performance Optimization |
| |
| #### For NVIDIA T4 GPU |
| 1. **Local Model Loading**: Models run locally on GPU (faster than API) |
| - Primary model: Qwen/Qwen2.5-7B-Instruct (~14GB vRAM) |
| - Embedding model: sentence-transformers/all-MiniLM-L6-v2 (~500MB) |
| 2. **GPU Acceleration**: All inference runs on GPU |
| 3. **Parallel Processing**: 4 workers (MAX_WORKERS=4) for concurrent requests |
4. **Fallback to API**: Automatically falls back to the HF Inference API if local models fail (see the sketch below)
| 5. **Request Queuing**: Built-in async request handling |
| 6. **Response Streaming**: Implemented for efficient memory usage |
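
The local-first path with API fallback (item 4 above) might look like the following sketch; for brevity it reloads the model on every call rather than caching it as in the lazy-loading sketch earlier, and error handling is simplified:

```python
import torch
from huggingface_hub import InferenceClient
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"

def generate(prompt: str, max_new_tokens: int = 800) -> str:
    try:
        # Local GPU path: generate on the T4 with the FP16 model
        tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID, torch_dtype=torch.float16, device_map="auto"
        )
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
        return tokenizer.decode(output[0], skip_special_tokens=True)
    except Exception:
        # Fallback: hosted HF Inference API (authenticates with HF_TOKEN if set)
        client = InferenceClient(model=MODEL_ID)
        return client.text_generation(prompt, max_new_tokens=max_new_tokens)
```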
| |
| #### Mobile Optimizations |
- Cap generation at 800 tokens (`MOBILE_MAX_TOKENS=800`)
- Shorten the request timeout to 15s (`MOBILE_TIMEOUT=15000`, in milliseconds)
| - Implement progressive loading |
| - Use touch-optimized UI |
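
A rough sketch of how the token and timeout limits could be switched per client; the user-agent check and the desktop defaults (2048 tokens, 60s) are assumptions, not values from the repo:

```python
import os

MOBILE_MAX_TOKENS = int(os.getenv("MOBILE_MAX_TOKENS", "800"))
MOBILE_TIMEOUT_S = int(os.getenv("MOBILE_TIMEOUT", "15000")) / 1000  # ms -> s

def generation_limits(user_agent: str) -> dict:
    """Return tighter limits for mobile clients, based on a simple user-agent check."""
    is_mobile = "Mobile" in user_agent or "Android" in user_agent
    return {
        "max_new_tokens": MOBILE_MAX_TOKENS if is_mobile else 2048,  # desktop default assumed
        "timeout_s": MOBILE_TIMEOUT_S if is_mobile else 60,          # desktop default assumed
    }
```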
| |
| ### Monitoring |
| |
| #### Health Checks |
| - Application health endpoint: `/health` |
| - Database connectivity check |
| - Cache hit rate monitoring |
| - Response time tracking |
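
One way (not necessarily how `app.py` does it) to expose a `/health` endpoint next to the Gradio UI is to mount Gradio on a FastAPI app; the database check here is deliberately minimal:

```python
import sqlite3
import gradio as gr
from fastapi import FastAPI

api = FastAPI()

@api.get("/health")
def health():
    try:
        sqlite3.connect("sessions.db").execute("SELECT 1")  # database connectivity check
        return {"status": "ok"}
    except Exception as exc:
        return {"status": "error", "detail": str(exc)}

demo = gr.Blocks()  # placeholder for the real UI defined in app.py
app = gr.mount_gradio_app(api, demo, path="/")
# Serve with: uvicorn app:app --host 0.0.0.0 --port 7860
```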
| |
| #### Logging |
| - Use structured logging (structlog) |
| - Log levels: DEBUG (dev), INFO (prod) |
| - Monitor error rates |
| - Track performance metrics |
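
A minimal structlog configuration consistent with the notes above (JSON output, level taken from `LOG_LEVEL`); the exact processor chain is a common default, not necessarily what the repo uses:

```python
import logging
import os
import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.make_filtering_bound_logger(
        getattr(logging, os.getenv("LOG_LEVEL", "INFO"))
    ),
)

log = structlog.get_logger()
log.info("request_completed", latency_ms=142, cache_hit=True)  # example event
```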
| |
| ### Troubleshooting |
| |
| #### Common Issues |
| |
| **Issue**: Out of memory errors |
- **Solution**: Reduce `MAX_WORKERS` and rely on request queuing
| |
| **Issue**: Slow responses |
| - **Solution**: Enable aggressive caching, use streaming |
| |
| **Issue**: Model loading failures |
| - **Solution**: Use HF Inference API instead of local models |
| |
| **Issue**: Session data loss |
| - **Solution**: Implement proper persistence with SQLite backup |
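
For the session-persistence issue, the standard library's online backup API is one option; the paths below are placeholders:

```python
import sqlite3

def backup_sessions(src_path: str = "sessions.db",
                    dest_path: str = "sessions.backup.db") -> None:
    """Copy the live session database to a backup file without stopping the app."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    with dest:
        src.backup(dest)  # SQLite online backup; safe while the database is in use
    src.close()
    dest.close()
```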
| |
| ### Scaling Considerations |
| |
| #### For Production |
| 1. **Horizontal Scaling**: Deploy multiple instances |
| 2. **Caching Layer**: Add Redis for shared session data |
| 3. **Load Balancing**: Use HF Spaces built-in load balancer |
| 4. **CDN**: Static assets via CDN |
| 5. **Database**: Consider PostgreSQL for production |
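
A sketch of the shared session cache from item 2 above, using `redis-py`; the key scheme, `REDIS_HOST` variable, and reuse of `CACHE_TTL` are assumptions:

```python
import json
import os
import redis

r = redis.Redis(host=os.getenv("REDIS_HOST", "localhost"), port=6379, db=0)
CACHE_TTL = int(os.getenv("CACHE_TTL", "3600"))

def save_session(session_id: str, data: dict) -> None:
    # Store session data with a TTL so idle sessions expire automatically
    r.setex(f"session:{session_id}", CACHE_TTL, json.dumps(data))

def load_session(session_id: str):
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```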
| |
| #### Migration Path |
- **Phase 1**: MVP on NVIDIA T4 Medium with local models (current)
- **Phase 2**: Upgrade GPU hardware as model requirements grow
| - **Phase 3**: Scale to multiple workers |
| - **Phase 4**: Enterprise deployment with managed infrastructure |
| |
| |