# Troubleshooting Guide

This guide helps diagnose and resolve common issues with MediGuard AI.

## Table of Contents

1. [Startup Issues](#startup-issues)
2. [Service Connectivity](#service-connectivity)
3. [Performance Issues](#performance-issues)
4. [API Errors](#api-errors)
5. [Database Issues](#database-issues)
6. [Memory and CPU Issues](#memory-and-cpu-issues)
7. [Logging and Monitoring](#logging-and-monitoring)
8. [Common Error Messages](#common-error-messages)

## Startup Issues

### Application Won't Start

**Symptoms:**

- Application exits immediately
- Port already in use errors
- Module import errors

**Solutions:**

1. **Check port availability:**

   ```bash
   # Check if port 8000 is in use
   netstat -tulpn | grep 8000

   # Or on Windows
   netstat -ano | findstr 8000
   ```

2. **Verify the Python environment:**

   ```bash
   # Activate the virtual environment
   source venv/bin/activate
   # On Windows
   venv\Scripts\activate

   # Check installed dependencies
   pip list
   ```

3. **Check environment variables:**

   ```bash
   # Verify required variables are set
   env | grep -E "(GROQ|REDIS|OPENSEARCH)"
   ```

4. **Common startup errors and fixes:**

   | Error | Cause | Solution |
   |-------|-------|----------|
   | `ModuleNotFoundError` | Missing dependencies | `pip install -r requirements.txt` |
   | `Permission denied` | Port requires privileges | Use a port above 1024 or run with sudo |
   | `Address already in use` | Another process is using the port | Kill the process or use a different port |

### Docker Container Issues

**Symptoms:**

- Container fails to start
- Health check failures
- Volume mount errors

**Solutions:**

1. **Check container logs:**

   ```bash
   docker logs mediguard-api
   docker-compose logs api
   ```

2. **Verify Docker resources:**

   ```bash
   # Check Docker resource usage
   docker stats

   # Check disk space
   docker system df
   ```
3. **Rebuild the container:**

   ```bash
   docker-compose down
   docker-compose build --no-cache
   docker-compose up -d
   ```

## Service Connectivity

### OpenSearch Connection Issues

**Symptoms:**

- Search requests failing
- Connection timeout errors
- Authentication failures

**Diagnosis:**

```bash
# Check OpenSearch health
curl -X GET "localhost:9200/_cluster/health?pretty"

# Test from the application
curl http://localhost:8000/health/service/opensearch
```

**Solutions:**

1. **Verify OpenSearch is running:**

   ```bash
   docker-compose ps opensearch
   docker-compose restart opensearch
   ```

2. **Check network connectivity:**

   ```bash
   # Test the connection
   telnet localhost 9200

   # Check the firewall
   sudo ufw status
   ```

3. **Fix authentication:**

   ```yaml
   # In docker-compose.yml
   environment:
     - DISABLE_SECURITY_PLUGIN=true  # For development only
   ```

### Redis Connection Issues

**Symptoms:**

- Cache misses
- Session data loss
- Rate limiting not working

**Diagnosis:**

```bash
# Test the Redis connection
redis-cli ping

# Check from the application
curl http://localhost:8000/health/service/redis
```

**Solutions:**

1. **Restart Redis:**

   ```bash
   docker-compose restart redis
   ```

2. **Clear corrupted data:**

   ```bash
   redis-cli FLUSHALL
   ```

3. **Check memory limits:**

   ```bash
   # In redis-cli
   INFO memory
   ```

### Ollama/LLM Connection Issues

**Symptoms:**

- LLM requests timing out
- Model not found errors
- Slow responses

**Diagnosis:**

```bash
# Check Ollama status
curl http://localhost:11434/api/tags

# Test the model
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3",
  "prompt": "Test"
}'
```

**Solutions:**

1. **Pull required models:**

   ```bash
   docker-compose exec ollama ollama pull llama3.3
   ```

2. **Check GPU availability:**

   ```bash
   nvidia-smi
   ```

3. **Adjust timeouts:**

   ```python
   # In settings
   OLLAMA_TIMEOUT = 120  # Increase the timeout (seconds)
   ```

## Performance Issues

### Slow API Responses

**Symptoms:**

- Requests taking more than 5 seconds
- Timeouts in client applications
- High CPU usage
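Averages can hide tail latency: a handful of very slow requests may be invisible in the mean. When confirming the symptoms above, look at percentiles over repeated requests rather than a single timing. A minimal nearest-rank percentile helper (the sample latencies are illustrative, not measured):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample >= p% of the distribution."""
    if not samples:
        raise ValueError("no samples")
    ranked = sorted(samples)
    k = math.ceil(p / 100 * len(ranked)) - 1
    return ranked[max(k, 0)]

# Illustrative latencies (seconds) from 10 requests to the same endpoint
latencies = [0.2, 0.3, 0.2, 0.4, 0.3, 0.2, 5.1, 0.3, 0.2, 0.4]
print("p50:", percentile(latencies, 50))  # median
print("p95:", percentile(latencies, 95))  # tail latency
```

Here the mean is roughly 0.76 s, well under the 5-second threshold, while p95 is 5.1 s, which is exactly the kind of tail the symptoms describe.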
**Diagnosis:**

1. **Check response times:**

   ```bash
   # Use curl with timing
   curl -w "@curl-format.txt" -o /dev/null -s http://localhost:8000/health

   # Monitor with metrics
   curl http://localhost:8000/metrics | grep http_request_duration
   ```

2. **Profile the application:**

   ```bash
   # Use py-spy
   pip install py-spy
   py-spy top --pid <PID>
   ```

**Solutions:**

1. **Enable caching:**

   ```python
   # Add caching to expensive operations
   from src.services.cache.advanced_cache import cached

   @cached(ttl=300)
   async def expensive_operation():
       ...
   ```

2. **Optimize database queries:**

   ```python
   # Use optimized queries
   from src.services.opensearch.client import make_opensearch_client

   client = make_opensearch_client()
   results = client.search_bm25_optimized(query, min_score=0.5)
   ```

3. **Scale horizontally:**

   ```bash
   # Run multiple instances
   docker-compose up -d --scale api=3
   ```

### Memory Leaks

**Symptoms:**

- Memory usage increasing over time
- Out of memory errors
- Container restarts

**Diagnosis:**

1. **Monitor memory usage:**

   ```bash
   # Check container memory
   docker stats

   # Check process memory
   ps aux | grep python
   ```

2. **Find memory leaks:**

   ```bash
   # Use memory-profiler
   pip install memory-profiler
   python -m memory_profiler script.py
   ```

**Solutions:**

1. **Fix circular references:**

   ```python
   # Use weak references
   import weakref

   class Parent:
       def __init__(self):
           self.children = weakref.WeakSet()
   ```

2. **Clear caches:**

   ```python
   # Periodically clear caches
   from src.services.cache.advanced_cache import CacheInvalidator

   await CacheInvalidator.invalidate_by_pattern("*")
   ```

3. **Increase memory limits:**

   ```yaml
   # In docker-compose.yml
   deploy:
     resources:
       limits:
         memory: 4G
   ```

## API Errors

### 422 Validation Errors

**Symptoms:**

- `{"detail": [...]}` responses with validation errors
- Requests rejected with status 422

**Common causes:**

1. **Missing required fields:**

   ```json
   // Wrong
   {"biomarkers": {}}

   // Right
   {"biomarkers": {"Glucose": 100}}
   ```
2. **Invalid data types:**

   ```json
   // Wrong
   {"biomarkers": {"Glucose": "high"}}

   // Right
   {"biomarkers": {"Glucose": 150}}
   ```

3. **Out-of-range values.** Check the API docs for valid ranges:

   ```bash
   curl http://localhost:8000/docs
   ```

### 500 Internal Server Errors

**Symptoms:**

- Generic error messages
- Stack traces in logs

**Diagnosis:**

1. **Check application logs:**

   ```bash
   docker-compose logs -f api | grep ERROR
   ```

2. **Enable debug mode:**

   ```bash
   export DEBUG=true
   uvicorn src.main:app --reload
   ```

**Common causes:**

| Error | Solution |
|-------|----------|
| Database connection lost | Restart database services |
| External service down | Check service health endpoints |
| Memory error | Increase memory or optimize code |
| Configuration error | Verify environment variables |

### 503 Service Unavailable

**Symptoms:**

- Service temporarily unavailable
- Health check failures

**Solutions:**

1. **Check service dependencies:**

   ```bash
   curl http://localhost:8000/health/detailed
   ```

2. **Restart affected services:**

   ```bash
   docker-compose restart
   ```

3. **Check rate limits:**

   ```bash
   # Check rate limit headers
   curl -I http://localhost:8000/analyze/structured
   ```

## Database Issues

### OpenSearch Index Problems

**Symptoms:**

- Search returning no results
- Index not found errors
- Mapping errors

**Diagnosis:**

1. **Check index status:**

   ```bash
   curl -X GET "localhost:9200/_cat/indices?v"
   ```

2. **Verify the mapping:**

   ```bash
   curl -X GET "localhost:9200/medical_chunks/_mapping?pretty"
   ```

**Solutions:**

1. **Recreate the index:**

   ```bash
   # Delete the index, then restart the application to recreate it
   curl -X DELETE "localhost:9200/medical_chunks"
   ```

2. **Fix the mapping:**

   ```python
   # Update the index config
   from src.services.opensearch.index_config import MEDICAL_CHUNKS_MAPPING

   client.ensure_index(MEDICAL_CHUNKS_MAPPING)
   ```

### Data Corruption

**Symptoms:**

- Inconsistent search results
- Missing documents
- Strange query behavior
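Before reindexing everything, it can help to pin down exactly which documents are affected. A sketch of an integrity diff, assuming you can list document IDs from both the source corpus and the index (`integrity_report` and the sample IDs are made up for illustration):

```python
from collections import Counter

def integrity_report(source_ids, indexed_ids):
    """Compare source-of-truth document IDs against what the index holds."""
    source, indexed = set(source_ids), set(indexed_ids)
    return {
        "missing": sorted(source - indexed),     # documents lost from the index
        "unexpected": sorted(indexed - source),  # stale documents to delete
        "duplicates": sorted(i for i, n in Counter(indexed_ids).items() if n > 1),
    }

report = integrity_report(
    source_ids=["doc1", "doc2", "doc3"],
    indexed_ids=["doc1", "doc1", "doc4"],
)
print(report)
```

If `missing` comes back non-empty, reindexing only those documents is usually much cheaper than a full rebuild.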
**Solutions:**

1. **Verify data integrity:**

   ```bash
   # Count documents
   curl -X GET "localhost:9200/medical_chunks/_count"
   ```

2. **Reindex the data:**

   ```python
   # Use the indexing service
   from src.services.indexing.service import IndexingService

   service = IndexingService()
   await service.reindex_all()
   ```

## Logging and Monitoring

### Enable Debug Logging

1. **Set the log level:**

   ```bash
   export LOG_LEVEL=DEBUG
   export LOG_TO_FILE=true
   ```

2. **View logs:**

   ```bash
   # Real-time logs
   tail -f data/logs/mediguard.log

   # Filter by level
   grep "ERROR" data/logs/mediguard.log
   ```

### Monitor Metrics

1. **Check Prometheus metrics:**

   ```bash
   curl http://localhost:8000/metrics | grep http_
   ```

2. **View the Grafana dashboard:**

   - Navigate to http://localhost:3000
   - Import `monitoring/grafana-dashboard.json`

### Performance Profiling

1. **Enable profiling:**

   ```python
   # Add to main.py
   from pyinstrument import Profiler

   @app.middleware("http")
   async def profile_requests(request: Request, call_next):
       profiler = Profiler()
       profiler.start()
       response = await call_next(request)
       profiler.stop()
       print(profiler.output_text(unicode=True, color=True))
       return response
   ```

## Common Error Messages

### "Service unavailable" in logs

**Meaning:** A required service (OpenSearch, Redis, etc.) is not responding.

**Fix:**

1. Check service status: `docker-compose ps`
2. Restart the service: `docker-compose restart <service>`
3. Check logs: `docker-compose logs <service>`

### "Rate limit exceeded"

**Meaning:** Too many requests from a client.

**Fix:**

1. Wait and retry
2. Check the `Retry-After` header
3. Implement client-side rate limiting

### "Invalid token" or "Authentication failed"

**Meaning:** Invalid API key or token.

**Fix:**

1. Verify the API key is correct
2. Check that the token hasn't expired
3. Ensure the proper header format: `Authorization: Bearer <token>`

### "Query too large" or "Request entity too large"

**Meaning:** The request exceeds size limits.

**Fix:**

1. Reduce the request size
2. Use pagination
3. Increase limits in configuration

### "Connection pool exhausted"

**Meaning:** Too many concurrent database connections.

**Fix:**

1. Increase the pool size
2. Add a connection timeout
3. Implement request queuing

## Emergency Procedures

### Full System Recovery

```bash
# 1. Stop all services
docker-compose down

# 2. Clear corrupted data (WARNING: This deletes data!)
docker volume rm agentic-ragbot_opensearch_data
docker volume rm agentic-ragbot_redis_data

# 3. Restart with fresh data
docker-compose up -d

# 4. Wait for services to be ready
sleep 30

# 5. Verify health
curl http://localhost:8000/health/detailed
```

### Backup and Restore

```bash
# Backup OpenSearch
curl -X POST "localhost:9200/_snapshot/backup/snapshot_1"

# Backup Redis
docker-compose exec redis redis-cli BGSAVE

# Restore from backup
# See DEPLOYMENT.md for detailed instructions
```

### Performance Emergency

```bash
# 1. Scale up services
docker-compose up -d --scale api=5

# 2. Clear all caches
curl -X DELETE http://localhost:8000/admin/cache/clear

# 3. Enable emergency mode
export EMERGENCY_MODE=true  # This disables non-essential features
```

## Getting Help

1. **Check logs first:** Always check application logs for error details
2. **Search issues:** Look for similar issues on GitHub
3. **Collect information:**
   - Error messages
   - Logs
   - System specs
   - Steps to reproduce
4. **Create an issue:** Include all relevant information in a GitHub issue

### Contact Information

- **Documentation:** Check the `/docs` directory
- **Issues:** GitHub Issues
- **Emergency:** Check DEPLOYMENT.md for emergency contacts
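The "collect information" step under Getting Help is easy to script. A minimal sketch that gathers system facts and redacts credential-looking environment variables before they end up in a GitHub issue (the variable-name heuristic and `bug_report_snippet` helper are illustrative assumptions, not part of MediGuard AI):

```python
import os
import platform

# Heuristic: treat any variable whose name contains one of these as a secret
SENSITIVE_MARKERS = ("KEY", "TOKEN", "SECRET", "PASSWORD")

def redact(env):
    """Mask values of environment variables whose names look like credentials."""
    return {
        name: "<redacted>" if any(m in name.upper() for m in SENSITIVE_MARKERS) else value
        for name, value in env.items()
    }

def bug_report_snippet(env=None):
    """Assemble a safe-to-share diagnostics dict for a GitHub issue."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "env": redact(env if env is not None else dict(os.environ)),
    }

print(bug_report_snippet({"GROQ_API_KEY": "gsk_abc123", "LOG_LEVEL": "DEBUG"}))
```

Paste the output into the issue alongside the error message, relevant log lines, and reproduction steps.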