Troubleshooting Guide

This guide helps diagnose and resolve common issues with MediGuard AI.

Table of Contents

  1. Startup Issues
  2. Service Connectivity
  3. Performance Issues
  4. API Errors
  5. Database Issues
  6. Logging and Monitoring
  7. Common Error Messages
  8. Emergency Procedures
  9. Getting Help

Startup Issues

Application Won't Start

Symptoms:

  • Application exits immediately
  • Port already in use errors
  • Module import errors

Solutions:

  1. Check port availability:

    # Check if port 8000 is in use
    netstat -tulpn | grep 8000
    # Or on Windows
    netstat -ano | findstr 8000
    
  2. Verify Python environment:

    # Activate virtual environment
    source venv/bin/activate
    # On Windows
    venv\Scripts\activate
    
    # Check dependencies
    pip list
    
  3. Check environment variables:

    # Verify required variables are set
    env | grep -E "(GROQ|REDIS|OPENSEARCH)"
    
  4. Common startup errors and fixes:

    | Error | Cause | Solution |
    |---|---|---|
    | ModuleNotFoundError | Missing dependencies | pip install -r requirements.txt |
    | Permission denied | Port requires privileges | Use port > 1024 or run with sudo |
    | Address already in use | Another process using port | Kill process or use different port |
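
The port and environment checks above can also be scripted. The following is a minimal sketch, not part of MediGuard AI: `port_in_use` and `matching_env_vars` are hypothetical helper names, and the keyword list simply mirrors the `grep` pattern shown earlier (matching variable names only, not values).

```python
import os
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


def matching_env_vars(keywords=("GROQ", "REDIS", "OPENSEARCH")) -> list:
    """Return the names of environment variables containing any keyword."""
    return sorted(
        name for name in os.environ
        if any(keyword in name for keyword in keywords)
    )


if __name__ == "__main__":
    print("port 8000 in use:", port_in_use(8000))
    print("matching variables:", matching_env_vars())
```

An empty list from `matching_env_vars()` suggests the variables were never exported in the current shell; it does not say which exact names the application expects.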

Docker Container Issues

Symptoms:

  • Container fails to start
  • Health check failures
  • Volume mount errors

Solutions:

  1. Check container logs:

    docker logs mediguard-api
    docker-compose logs api
    
  2. Verify Docker resources:

    # Check Docker resource usage
    docker stats
    
    # Check disk space
    docker system df
    
  3. Rebuild container:

    docker-compose down
    docker-compose build --no-cache
    docker-compose up -d
    

Service Connectivity

OpenSearch Connection Issues

Symptoms:

  • Search requests failing
  • Connection timeout errors
  • Authentication failures

Diagnosis:

# Check OpenSearch health
curl -X GET "localhost:9200/_cluster/health?pretty"

# Test from application
curl http://localhost:8000/health/service/opensearch
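
The JSON returned by `_cluster/health` includes a `status` field (`green`, `yellow`, or `red`) plus node and shard counts. As a rough sketch of how to read it, the hypothetical helper below turns a parsed response into a one-line summary; the function name and wording are illustrative only.

```python
def describe_cluster_health(health: dict) -> str:
    """Summarise a parsed _cluster/health response in one line."""
    status = health.get("status", "unknown")
    nodes = health.get("number_of_nodes", 0)
    unassigned = health.get("unassigned_shards", 0)
    if status == "green":
        verdict = "all primary and replica shards are allocated"
    elif status == "yellow":
        verdict = "primaries are allocated but some replicas are not"
    elif status == "red":
        verdict = "at least one primary shard is unallocated"
    else:
        verdict = "status not recognised"
    return f"{status}: {verdict} (nodes={nodes}, unassigned_shards={unassigned})"
```

A `yellow` status is commonly seen on a single-node development cluster, because replicas have no second node to be placed on; `red` is the state most likely to explain failing searches.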

Solutions:

  1. Verify OpenSearch is running:

    docker-compose ps opensearch
    docker-compose restart opensearch
    
  2. Check network connectivity:

    # Test connection
    telnet localhost 9200
    
    # Check firewall
    sudo ufw status
    
  3. Fix authentication:

    # In docker-compose.yml
    environment:
      - DISABLE_SECURITY_PLUGIN=true  # For development
    

Redis Connection Issues

Symptoms:

  • Cache misses
  • Session data loss
  • Rate limiting not working

Diagnosis:

# Test Redis connection
redis-cli ping

# Check from application
curl http://localhost:8000/health/service/redis

Solutions:

  1. Restart Redis:

    docker-compose restart redis
    
  2. Clear corrupted data (WARNING: this deletes every key in every Redis database):

    redis-cli FLUSHALL
    
  3. Check memory limits:

    # In redis-cli
    INFO memory
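
The `INFO memory` output is a list of `key:value` lines. The sketch below, with hypothetical helper names, parses that text and reports how close `used_memory` is to `maxmemory`; it assumes the output was captured as a string, for example from `redis-cli INFO memory`.

```python
def parse_redis_info(raw: str) -> dict:
    """Parse the key:value lines of Redis INFO output into a dict of strings."""
    info = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Memory"
        key, _, value = line.partition(":")
        info[key] = value
    return info


def memory_usage_ratio(info: dict):
    """Return used_memory / maxmemory, or None when maxmemory is 0 (no limit set)."""
    used = int(info.get("used_memory", 0))
    limit = int(info.get("maxmemory", 0))
    return used / limit if limit > 0 else None
```

A ratio near 1.0 combined with `maxmemory_policy:noeviction` would explain writes being rejected, while `None` means Redis is not enforcing a limit of its own.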
    

Ollama/LLM Connection Issues

Symptoms:

  • LLM requests timing out
  • Model not found errors
  • Slow responses

Diagnosis:

# Check Ollama status
curl http://localhost:11434/api/tags

# Test model ("stream": false returns one JSON object instead of a token stream)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3",
  "prompt": "Test",
  "stream": false
}'
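
The `/api/tags` response lists installed models under a `models` array, each with a `name` such as `llama3.3:latest`. As a small sketch, the hypothetical helper below checks a parsed response for a model, assuming Ollama's usual convention that a name without a tag refers to the `latest` tag.

```python
def model_available(tags: dict, wanted: str) -> bool:
    """Return True if `wanted` appears in a parsed /api/tags response."""
    if ":" not in wanted:
        wanted = f"{wanted}:latest"  # assumption: an untagged name means :latest
    names = {model.get("name", "") for model in tags.get("models", [])}
    return wanted in names
```

If this returns `False` for the model the application is configured to use, a "model not found" error is the expected outcome until the model is pulled.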

Solutions:

  1. Pull required models:

    docker-compose exec ollama ollama pull llama3.3
    
  2. Check GPU availability:

    nvidia-smi
    
  3. Adjust timeouts:

    # In settings
    OLLAMA_TIMEOUT = 120  # Increase timeout
    

Performance Issues

Slow API Responses

Symptoms:

  • Requests taking > 5 seconds
  • Timeouts in client applications
  • High CPU usage

Diagnosis:

  1. Check response times:

    # Use curl with timing (prints the total request time in seconds)
    curl -w "total: %{time_total}s\n" -o /dev/null -s http://localhost:8000/health
    
    # Monitor with metrics
    curl http://localhost:8000/metrics | grep http_request_duration
    
  2. Profile the application:

    # Use py-spy
    pip install py-spy
    py-spy top --pid <pid>
    

Solutions:

  1. Enable caching:

    # Add caching to expensive operations
    from src.services.cache.advanced_cache import cached
    
    @cached(ttl=300)
    async def expensive_operation():
        ...
    
  2. Optimize database queries:

    # Use optimized queries
    from src.services.opensearch.client import make_opensearch_client
    client = make_opensearch_client()
    results = client.search_bm25_optimized(query, min_score=0.5)
    
  3. Scale horizontally:

    # Run multiple instances
    docker-compose up -d --scale api=3
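
To make the caching step more concrete, here is a minimal sketch of a TTL cache decorator for async functions. It is not the project's `cached` decorator, whose implementation is not shown here; `ttl_cached` and its `clock` parameter are hypothetical, and the cache lives in process memory rather than Redis.

```python
import asyncio
import functools
import time


def ttl_cached(ttl: float, clock=time.monotonic):
    """Cache an async function's results per argument tuple for `ttl` seconds."""
    def decorator(func):
        store = {}  # key -> (expires_at, value)

        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            # Arguments must be hashable to serve as a cache key.
            key = (args, tuple(sorted(kwargs.items())))
            now = clock()
            hit = store.get(key)
            if hit is not None and hit[0] > now:
                return hit[1]  # still fresh: skip the expensive call
            value = await func(*args, **kwargs)
            store[key] = (now + ttl, value)
            return value

        return wrapper
    return decorator
```

Because each process keeps its own dictionary, this approach does not share entries across the scaled-out instances from step 3; a shared store such as Redis is needed for that.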
    

Memory Leaks

Symptoms:

  • Memory usage increasing over time
  • Out of memory errors
  • Container restarts

Diagnosis:

  1. Monitor memory usage:

    # Check container memory
    docker stats
    
    # Check process memory
    ps aux | grep python
    
  2. Find memory leaks:

    # Use memory-profiler
    pip install memory-profiler
    python -m memory_profiler script.py
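
As an alternative that needs no extra package, the standard library's `tracemalloc` can compare two snapshots and show which source lines allocated the most new memory. This is a sketch of that different technique, not of memory-profiler; `top_growth` is a hypothetical helper name.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return the source lines whose allocated size grew most between snapshots."""
    stats = after.compare_to(before, "lineno")
    return [stat for stat in stats if stat.size_diff > 0][:limit]


if __name__ == "__main__":
    tracemalloc.start()
    first = tracemalloc.take_snapshot()
    suspect = [bytes(1000) for _ in range(1000)]  # stand-in for leaking code
    second = tracemalloc.take_snapshot()
    for stat in top_growth(first, second):
        print(stat)
```

Taking the two snapshots several minutes apart in a running service tends to be more informative than a single short run, since genuine leaks keep growing while one-off allocations do not.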
    

Solutions:

  1. Fix circular references:

    # Use weak references
    import weakref
    
    class Parent:
        def __init__(self):
            self.children = weakref.WeakSet()
    
  2. Clear caches:

    # Periodically clear caches
    from src.services.cache.advanced_cache import CacheInvalidator
    await CacheInvalidator.invalidate_by_pattern("*")
    
  3. Increase memory limits:

    # In docker-compose.yml
    deploy:
      resources:
        limits:
          memory: 4G
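
The weak-reference fix in step 1 can be seen in a short, self-contained sketch. The `Child` class here is a hypothetical addition for illustration: the child holds a strong reference to its parent, while the parent only tracks children weakly, so no strong reference cycle is formed.

```python
import gc
import weakref


class Parent:
    def __init__(self):
        # Weak references: the parent does not keep its children alive.
        self.children = weakref.WeakSet()


class Child:
    def __init__(self, parent: Parent):
        self.parent = parent       # strong reference child -> parent
        parent.children.add(self)  # weak reference parent -> child


if __name__ == "__main__":
    parent = Parent()
    child = Child(parent)
    print(len(parent.children))  # the child is tracked while it is alive
    del child
    gc.collect()
    print(len(parent.children))  # the entry disappears with the child
```

The trade-off is that something else must hold a strong reference to each child for as long as it is needed, otherwise it vanishes from the set.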
    

API Errors

422 Validation Errors

Symptoms:

  • {"detail": [...]} with validation errors
  • Requests rejected with status 422

Common causes:

  1. Missing required fields:

    // Wrong
    {"biomarkers": {}}
    
    // Right
    {"biomarkers": {"Glucose": 100}}
    
  2. Invalid data types:

    // Wrong
    {"biomarkers": {"Glucose": "high"}}
    
    // Right
    {"biomarkers": {"Glucose": 150}}
    
  3. Out of range values:

    # Check API docs for valid ranges
    curl http://localhost:8000/docs
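
The first two causes can be caught on the client before a request is sent. The sketch below is a hypothetical pre-check based only on the wrong/right examples above; the server's actual schema is not shown here, and range checks are omitted because the valid ranges live in the API docs.

```python
from numbers import Real


def biomarker_errors(payload: dict) -> list:
    """Return a list of problems found in a request payload (empty if none)."""
    biomarkers = payload.get("biomarkers")
    if not isinstance(biomarkers, dict) or not biomarkers:
        return ["biomarkers must be a non-empty object"]
    errors = []
    for name, value in biomarkers.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            errors.append(f"{name}: expected a number, got {type(value).__name__}")
    return errors
```

An empty list only means the payload passes these two checks; the API may still reject it for reasons this sketch does not cover.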
    

500 Internal Server Errors

Symptoms:

  • Generic error messages
  • Stack traces in logs

Diagnosis:

  1. Check application logs:

    docker-compose logs -f api | grep ERROR
    
  2. Enable debug mode:

    export DEBUG=true
    uvicorn src.main:app --reload
    

Common causes:

| Error | Solution |
|---|---|
| Database connection lost | Restart database services |
| External service down | Check service health endpoints |
| Memory error | Increase memory or optimize code |
| Configuration error | Verify environment variables |

503 Service Unavailable

Symptoms:

  • Service temporarily unavailable
  • Health check failures

Solutions:

  1. Check service dependencies:

    curl http://localhost:8000/health/detailed
    
  2. Restart affected services:

    docker-compose restart
    
  3. Check rate limits:

    # Check rate limit headers
    curl -I http://localhost:8000/analyze/structured
    

Database Issues

OpenSearch Index Problems

Symptoms:

  • Search returning no results
  • Index not found errors
  • Mapping errors

Diagnosis:

  1. Check index status:

    curl -X GET "localhost:9200/_cat/indices?v"
    
  2. Verify mapping:

    curl -X GET "localhost:9200/medical_chunks/_mapping?pretty"
    

Solutions:

  1. Recreate index:

    # Delete and recreate (WARNING: this removes every document in the index)
    curl -X DELETE "localhost:9200/medical_chunks"
    # Restart the application to recreate the index
    
  2. Fix mapping:

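    # Update index config
    from src.services.opensearch.client import make_opensearch_client
    from src.services.opensearch.index_config import MEDICAL_CHUNKS_MAPPING

    client = make_opensearch_client()
    client.ensure_index(MEDICAL_CHUNKS_MAPPING)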
    

Data Corruption

Symptoms:

  • Inconsistent search results
  • Missing documents
  • Strange query behavior

Solutions:

  1. Verify data integrity:

    # Count documents
    curl -X GET "localhost:9200/medical_chunks/_count"
    
  2. Reindex data:

    # Use indexing service
    from src.services.indexing.service import IndexingService
    service = IndexingService()
    await service.reindex_all()
    

Logging and Monitoring

Enable Debug Logging

  1. Set log level:

    export LOG_LEVEL=DEBUG
    export LOG_TO_FILE=true
    
  2. View logs:

    # Real-time logs
    tail -f data/logs/mediguard.log
    
    # Filter by level
    grep "ERROR" data/logs/mediguard.log
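
For context on what `LOG_LEVEL` typically controls, here is a sketch of how a Python service might map that variable onto the standard `logging` module. How MediGuard AI actually reads the variable is not shown in this guide, so `resolve_log_level` is a hypothetical helper.

```python
import logging
import os


def resolve_log_level(name, default=logging.INFO) -> int:
    """Map a level name such as "debug" to its logging constant."""
    level = getattr(logging, str(name).upper(), None)
    # Fall back to the default for unknown or misspelled names.
    return level if isinstance(level, int) else default


if __name__ == "__main__":
    logging.basicConfig(level=resolve_log_level(os.environ.get("LOG_LEVEL", "INFO")))
    logging.getLogger(__name__).debug("debug logging is enabled")
```

With a fallback like this, a misspelled value quietly yields INFO-level output, which is worth ruling out if debug lines never appear in the log file.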
    

Monitor Metrics

  1. Check Prometheus metrics:

    curl http://localhost:8000/metrics | grep http_
    
  2. View Grafana dashboard:

Performance Profiling

  1. Enable profiling:

    # Add to main.py (assumes `app` is the FastAPI instance defined there)
    from fastapi import Request
    from pyinstrument import Profiler
    
    @app.middleware("http")
    async def profile_requests(request: Request, call_next):
        # Profile each request and print the call tree to stdout
        profiler = Profiler()
        profiler.start()
        response = await call_next(request)
        profiler.stop()
        print(profiler.output_text(unicode=True, color=True))
        return response
    

Common Error Messages

"Service unavailable" in logs

Meaning: A required service (OpenSearch, Redis, etc.) is not responding.

Fix:

  1. Check service status: docker-compose ps
  2. Restart service: docker-compose restart <service>
  3. Check logs: docker-compose logs <service>

"Rate limit exceeded"

Meaning: Too many requests from a client.

Fix:

  1. Wait and retry
  2. Check Retry-After header
  3. Implement client-side rate limiting
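
The three steps above can be combined in the client's retry logic. The sketch below computes how long to wait before the next attempt: it honours a `Retry-After` value given in seconds and otherwise falls back to capped exponential backoff with jitter. The function name and default values are assumptions, and the HTTP-date form of `Retry-After` is not handled.

```python
import random


def retry_delay(headers: dict, attempt: int, base: float = 1.0,
                cap: float = 60.0, jitter=random.random) -> float:
    """Return the number of seconds to wait before retry number `attempt`."""
    retry_after = headers.get("Retry-After")  # note: exact-case dict lookup
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # HTTP-date form: fall through to backoff in this sketch
    backoff = min(cap, base * (2 ** attempt))
    # Jitter spreads clients out so they do not all retry at the same moment.
    return backoff * (0.5 + 0.5 * jitter())
```

A caller would sleep for the returned delay after each 429 response and stop after a fixed number of attempts.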

"Invalid token" or "Authentication failed"

Meaning: Invalid API key or token.

Fix:

  1. Verify API key is correct
  2. Check token hasn't expired
  3. Ensure proper header format: Authorization: Bearer <token>

"Query too large" or "Request entity too large"

Meaning: Request exceeds size limits.

Fix:

  1. Reduce request size
  2. Use pagination
  3. Increase limits in configuration

"Connection pool exhausted"

Meaning: Too many concurrent database connections.

Fix:

  1. Increase pool size
  2. Add connection timeout
  3. Implement request queuing
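
One simple form of request queuing is to cap how many database calls are in flight at once, so excess callers wait instead of exhausting the pool. This is a generic `asyncio` sketch under that assumption; `run_bounded` is a hypothetical helper, not a MediGuard AI API.

```python
import asyncio


async def run_bounded(jobs, limit: int):
    """Run zero-argument coroutine factories with at most `limit` in flight."""
    gate = asyncio.Semaphore(limit)

    async def guarded(job):
        async with gate:  # waits here while `limit` jobs are already running
            return await job()

    # gather preserves the order of `jobs` in the returned list.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Setting `limit` at or below the connection pool size keeps waiting requests queued in the application rather than failing with a pool error.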

Emergency Procedures

Full System Recovery

# 1. Stop all services
docker-compose down

# 2. Clear corrupted data (WARNING: This deletes data!)
docker volume rm agentic-ragbot_opensearch_data
docker volume rm agentic-ragbot_redis_data

# 3. Restart with fresh data
docker-compose up -d

# 4. Wait for services to be ready
sleep 30

# 5. Verify health
curl http://localhost:8000/health/detailed
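
A fixed `sleep 30` can be too short on a slow host and wasteful on a fast one. As a sketch of an alternative, the helper below polls a check until it passes or a timeout expires. The names are hypothetical, and it is an assumption that `/health/detailed` answers with HTTP 200 only once the dependencies are ready.

```python
import time
import urllib.request


def wait_until_ready(check, timeout: float = 60.0, interval: float = 2.0,
                     clock=time.monotonic, sleep=time.sleep) -> bool:
    """Call `check` repeatedly until it returns True or `timeout` elapses."""
    deadline = clock() + timeout
    while True:
        try:
            if check():
                return True
        except OSError:
            pass  # connection refused or reset while the service starts up
        if clock() >= deadline:
            return False
        sleep(interval)


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if `url` answers with HTTP 200."""
    # urlopen raises URLError/HTTPError (both OSError subclasses) on failure.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status == 200
```

For example, `wait_until_ready(lambda: http_ok("http://localhost:8000/health/detailed"))` returns `True` as soon as the endpoint responds with 200, or `False` after roughly a minute.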

Backup and Restore

# Backup OpenSearch (requires a snapshot repository named "backup" to be registered already)
curl -X POST "localhost:9200/_snapshot/backup/snapshot_1"

# Backup Redis
docker-compose exec redis redis-cli BGSAVE

# Restore from backup
# See DEPLOYMENT.md for detailed instructions

Performance Emergency

# 1. Scale up services
docker-compose up -d --scale api=5

# 2. Clear all caches
curl -X DELETE http://localhost:8000/admin/cache/clear

# 3. Enable emergency mode
export EMERGENCY_MODE=true
# This disables non-essential features

Getting Help

  1. Check logs first: Always check application logs for error details
  2. Search issues: Look for similar issues on GitHub
  3. Collect information:
    • Error messages
    • Logs
    • System specs
    • Steps to reproduce
  4. Create issue: Include all relevant information in the GitHub issue

Contact Information

  • Documentation: Check /docs directory
  • Issues: GitHub Issues
  • Emergency: Check DEPLOYMENT.md for emergency contacts