# Troubleshooting Guide
This guide helps diagnose and resolve common issues with MediGuard AI.
## Table of Contents
1. [Startup Issues](#startup-issues)
2. [Service Connectivity](#service-connectivity)
3. [Performance Issues](#performance-issues)
4. [API Errors](#api-errors)
5. [Database Issues](#database-issues)
6. [Memory and CPU Issues](#memory-and-cpu-issues)
7. [Logging and Monitoring](#logging-and-monitoring)
8. [Common Error Messages](#common-error-messages)
## Startup Issues
### Application Won't Start
**Symptoms:**
- Application exits immediately
- Port already in use errors
- Module import errors
**Solutions:**
1. **Check port availability:**
```bash
# Check if port 8000 is in use
netstat -tulpn | grep 8000
# Or on Windows
netstat -ano | findstr 8000
```
2. **Verify Python environment:**
```bash
# Activate virtual environment
source venv/bin/activate
# On Windows
venv\Scripts\activate
# Check dependencies
pip list
```
3. **Check environment variables:**
```bash
# Verify required variables are set
env | grep -E "(GROQ|REDIS|OPENSEARCH)"
```
4. **Common startup errors and fixes:**
| Error | Cause | Solution |
|-------|-------|----------|
| `ModuleNotFoundError` | Missing dependencies | `pip install -r requirements.txt` |
| `Permission denied` | Port requires privileges | Use port > 1024 or run with sudo |
| `Address already in use` | Another process using port | Kill process or use different port |
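As a cross-platform alternative to the netstat commands above, a quick check from Python (a minimal sketch using only the standard library) can tell you whether a port is already bound:

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1)
        return s.connect_ex((host, port)) == 0

# Example: check the default API port before starting the app
# port_in_use(8000)
```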
### Docker Container Issues
**Symptoms:**
- Container fails to start
- Health check failures
- Volume mount errors
**Solutions:**
1. **Check container logs:**
```bash
docker logs mediguard-api
docker-compose logs api
```
2. **Verify Docker resources:**
```bash
# Check Docker resource usage
docker stats
# Check disk space
docker system df
```
3. **Rebuild container:**
```bash
docker-compose down
docker-compose build --no-cache
docker-compose up -d
```
## Service Connectivity
### OpenSearch Connection Issues
**Symptoms:**
- Search requests failing
- Connection timeout errors
- Authentication failures
**Diagnosis:**
```bash
# Check OpenSearch health
curl -X GET "localhost:9200/_cluster/health?pretty"
# Test from application
curl http://localhost:8000/health/service/opensearch
```
**Solutions:**
1. **Verify OpenSearch is running:**
```bash
docker-compose ps opensearch
docker-compose restart opensearch
```
2. **Check network connectivity:**
```bash
# Test connection
telnet localhost 9200
# Check firewall
sudo ufw status
```
3. **Fix authentication:**
```yaml
# In docker-compose.yml
environment:
  - DISABLE_SECURITY_PLUGIN=true  # For development
```
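The curl-based health check above can also be scripted. This is a minimal sketch using only the standard library, assuming the default local endpoint; it returns `None` when OpenSearch is unreachable rather than raising:

```python
import json
import urllib.request
from urllib.error import URLError

def opensearch_status(base_url="http://localhost:9200", timeout=5):
    """Return the cluster health status ('green'/'yellow'/'red'), or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
            return json.load(resp).get("status")
    except (URLError, OSError, ValueError):
        return None
```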
### Redis Connection Issues
**Symptoms:**
- Cache misses
- Session data loss
- Rate limiting not working
**Diagnosis:**
```bash
# Test Redis connection
redis-cli ping
# Check from application
curl http://localhost:8000/health/service/redis
```
**Solutions:**
1. **Restart Redis:**
```bash
docker-compose restart redis
```
2. **Clear corrupted data:**
```bash
redis-cli FLUSHALL
```
3. **Check memory limits:**
```bash
# In redis-cli
INFO memory
```
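If you need to act on the `INFO memory` output programmatically (for example, to alert when `maxmemory_policy` is set to `noeviction`), a small parser for the `key:value` format that `INFO` emits might look like this sketch:

```python
def parse_redis_info(text):
    """Parse `redis-cli INFO` output into a flat dict of strings.

    Section headers (lines starting with '#') and blank lines are skipped.
    """
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and ":" in line:
            key, _, value = line.partition(":")
            info[key] = value
    return info
```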
### Ollama/LLM Connection Issues
**Symptoms:**
- LLM requests timing out
- Model not found errors
- Slow responses
**Diagnosis:**
```bash
# Check Ollama status
curl http://localhost:11434/api/tags
# Test model
curl http://localhost:11434/api/generate -d '{
"model": "llama3.3",
"prompt": "Test"
}'
```
**Solutions:**
1. **Pull required models:**
```bash
docker-compose exec ollama ollama pull llama3.3
```
2. **Check GPU availability:**
```bash
nvidia-smi
```
3. **Adjust timeouts:**
```python
# In settings
OLLAMA_TIMEOUT = 120 # Increase timeout
```
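Raising the timeout helps, but slow LLM backends often also benefit from retrying transient failures. A generic retry helper with exponential backoff (a sketch, not part of the MediGuard codebase) could look like:

```python
import time

def with_backoff(call, attempts=3, base_delay=1.0):
    """Call `call()`, retrying with exponential backoff on any exception.

    Delays are base_delay, 2*base_delay, 4*base_delay, ...; the last
    failure is re-raised so the caller can handle it.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```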
## Performance Issues
### Slow API Responses
**Symptoms:**
- Requests taking > 5 seconds
- Timeouts in client applications
- High CPU usage
**Diagnosis:**
1. **Check response times:**
```bash
# Use curl with timing
curl -w "@curl-format.txt" -o /dev/null -s http://localhost:8000/health
# Monitor with metrics
curl http://localhost:8000/metrics | grep http_request_duration
```
2. **Profile the application:**
```bash
# Use py-spy
pip install py-spy
py-spy top --pid <pid>
```
**Solutions:**
1. **Enable caching:**
```python
# Add caching to expensive operations
from src.services.cache.advanced_cache import cached
@cached(ttl=300)
async def expensive_operation():
    ...
```
2. **Optimize database queries:**
```python
# Use optimized queries
from src.services.opensearch.client import make_opensearch_client
client = make_opensearch_client()
results = client.search_bm25_optimized(query, min_score=0.5)
```
3. **Scale horizontally:**
```bash
# Run multiple instances
docker-compose up -d --scale api=3
```
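The `@cached` decorator in step 1 comes from the project's own cache module; for illustration, a simplified synchronous TTL cache along the same lines (an assumed sketch, not the project's implementation) might look like:

```python
import functools
import time

def ttl_cached(ttl):
    """Cache positional-arg calls for `ttl` seconds (simplified sync sketch)."""
    def decorator(fn):
        cache = {}

        @functools.wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            entry = cache.get(args)
            if entry is not None and now - entry[1] < ttl:
                return entry[0]  # still fresh: return cached value
            value = fn(*args)
            cache[args] = (value, now)
            return value
        return wrapper
    return decorator
```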
### Memory Leaks
**Symptoms:**
- Memory usage increasing over time
- Out of memory errors
- Container restarts
**Diagnosis:**
1. **Monitor memory usage:**
```bash
# Check container memory
docker stats
# Check process memory
ps aux | grep python
```
2. **Find memory leaks:**
```bash
# Use memory-profiler
pip install memory-profiler
python -m memory_profiler script.py
```
**Solutions:**
1. **Fix circular references:**
```python
# Use weak references
import weakref
class Parent:
    def __init__(self):
        self.children = weakref.WeakSet()
```
2. **Clear caches:**
```python
# Periodically clear caches
from src.services.cache.advanced_cache import CacheInvalidator
await CacheInvalidator.invalidate_by_pattern("*")
```
3. **Increase memory limits:**
```yaml
# In docker-compose.yml
deploy:
  resources:
    limits:
      memory: 4G
```
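As a standard-library alternative to memory-profiler, `tracemalloc` can point at the source lines responsible for the most allocations; a minimal sketch:

```python
import tracemalloc

tracemalloc.start()

# Stand-in for the suspect workload you want to measure
leaky = [bytes(1000) for _ in range(1000)]

snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics("lineno")[:5]  # top 5 allocation sites
for stat in top:
    print(stat)
tracemalloc.stop()
```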
## API Errors
### 422 Validation Errors
**Symptoms:**
- `{"detail": [...]}` with validation errors
- Requests rejected with status 422
**Common causes:**
1. **Missing required fields:**
```json
// Wrong
{"biomarkers": {}}
// Right
{"biomarkers": {"Glucose": 100}}
```
2. **Invalid data types:**
```json
// Wrong
{"biomarkers": {"Glucose": "high"}}
// Right
{"biomarkers": {"Glucose": 150}}
```
3. **Out of range values:**
```bash
# Check API docs for valid ranges
curl http://localhost:8000/docs
```
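Clients can catch most of these 422 causes before sending a request. The following pre-check is a hypothetical sketch mirroring the rules above (the field names come from the examples, not from the API's actual schema):

```python
def precheck_biomarkers(payload):
    """Return a list of problems that would trigger a 422; empty if payload looks valid."""
    errors = []
    biomarkers = payload.get("biomarkers")
    if not isinstance(biomarkers, dict) or not biomarkers:
        errors.append("biomarkers: at least one name/value pair is required")
        return errors
    for name, value in biomarkers.items():
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"biomarkers.{name}: expected a number, got {type(value).__name__}")
    return errors
```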
### 500 Internal Server Errors
**Symptoms:**
- Generic error messages
- Stack traces in logs
**Diagnosis:**
1. **Check application logs:**
```bash
docker-compose logs -f api | grep ERROR
```
2. **Enable debug mode:**
```bash
export DEBUG=true
uvicorn src.main:app --reload
```
**Common causes:**
| Error | Solution |
|-------|----------|
| Database connection lost | Restart database services |
| External service down | Check service health endpoints |
| Memory error | Increase memory or optimize code |
| Configuration error | Verify environment variables |
### 503 Service Unavailable
**Symptoms:**
- Service temporarily unavailable
- Health check failures
**Solutions:**
1. **Check service dependencies:**
```bash
curl http://localhost:8000/health/detailed
```
2. **Restart affected services:**
```bash
docker-compose restart
```
3. **Check rate limits:**
```bash
# Check rate limit headers
curl -I http://localhost:8000/analyze/structured
```
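When you hit rate limits, note that the `Retry-After` header may carry either delta-seconds or an HTTP-date. A small parser handling both forms (a sketch using only the standard library):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(header_value, default=1.0):
    """Parse a Retry-After header (delta-seconds or HTTP-date) into seconds to wait."""
    if header_value is None:
        return default
    try:
        return max(0.0, float(header_value))  # delta-seconds form, e.g. "5"
    except ValueError:
        pass
    try:
        dt = parsedate_to_datetime(header_value)  # HTTP-date form
    except (TypeError, ValueError):
        return default
    return max(0.0, (dt - datetime.now(timezone.utc)).total_seconds())
```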
## Database Issues
### OpenSearch Index Problems
**Symptoms:**
- Search returning no results
- Index not found errors
- Mapping errors
**Diagnosis:**
1. **Check index status:**
```bash
curl -X GET "localhost:9200/_cat/indices?v"
```
2. **Verify mapping:**
```bash
curl -X GET "localhost:9200/medical_chunks/_mapping?pretty"
```
**Solutions:**
1. **Recreate index:**
```bash
# Delete and recreate
curl -X DELETE "localhost:9200/medical_chunks"
# Restart application to recreate
```
2. **Fix mapping:**
```python
# Update index config
from src.services.opensearch.index_config import MEDICAL_CHUNKS_MAPPING
client.ensure_index(MEDICAL_CHUNKS_MAPPING)
```
### Data Corruption
**Symptoms:**
- Inconsistent search results
- Missing documents
- Strange query behavior
**Solutions:**
1. **Verify data integrity:**
```bash
# Count documents
curl -X GET "localhost:9200/medical_chunks/_count"
```
2. **Reindex data:**
```python
# Use indexing service
from src.services.indexing.service import IndexingService
service = IndexingService()
await service.reindex_all()
```
## Logging and Monitoring
### Enable Debug Logging
1. **Set log level:**
```bash
export LOG_LEVEL=DEBUG
export LOG_TO_FILE=true
```
2. **View logs:**
```bash
# Real-time logs
tail -f data/logs/mediguard.log
# Filter by level
grep "ERROR" data/logs/mediguard.log
```
### Monitor Metrics
1. **Check Prometheus metrics:**
```bash
curl http://localhost:8000/metrics | grep http_
```
2. **View Grafana dashboard:**
- Navigate to http://localhost:3000
- Import `monitoring/grafana-dashboard.json`
### Performance Profiling
1. **Enable profiling:**
```python
# Add to main.py
from fastapi import Request
from pyinstrument import Profiler

@app.middleware("http")
async def profile_requests(request: Request, call_next):
    profiler = Profiler()
    profiler.start()
    response = await call_next(request)
    profiler.stop()
    print(profiler.output_text(unicode=True, color=True))
    return response
```
## Common Error Messages
### "Service unavailable" in logs
**Meaning:** A required service (OpenSearch, Redis, etc.) is not responding.
**Fix:**
1. Check service status: `docker-compose ps`
2. Restart service: `docker-compose restart <service>`
3. Check logs: `docker-compose logs <service>`
### "Rate limit exceeded"
**Meaning:** Too many requests from a client.
**Fix:**
1. Wait and retry
2. Check `Retry-After` header
3. Implement client-side rate limiting
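Client-side rate limiting (step 3) is commonly done with a token bucket; here is a minimal sketch (illustrative, not part of MediGuard):

```python
import time

class TokenBucket:
    """Client-side rate limiter: allow() returns False once the budget is spent."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```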
### "Invalid token" or "Authentication failed"
**Meaning:** Invalid API key or token.
**Fix:**
1. Verify API key is correct
2. Check token hasn't expired
3. Ensure proper header format: `Authorization: Bearer <token>`
### "Query too large" or "Request entity too large"
**Meaning:** Request exceeds size limits.
**Fix:**
1. Reduce request size
2. Use pagination
3. Increase limits in configuration
### "Connection pool exhausted"
**Meaning:** Too many concurrent database connections.
**Fix:**
1. Increase pool size
2. Add connection timeout
3. Implement request queuing
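Request queuing (step 3) can be sketched with an asyncio semaphore that caps how many calls are in flight at once (illustrative only; the names are assumptions):

```python
import asyncio

async def run_limited(factories, max_concurrent=10):
    """Run coroutine factories with at most max_concurrent awaiting at once.

    Excess calls queue on the semaphore instead of exhausting the pool.
    """
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(factory):
        async with sem:
            return await factory()

    return await asyncio.gather(*(guarded(f) for f in factories))
```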
## Emergency Procedures
### Full System Recovery
```bash
# 1. Stop all services
docker-compose down
# 2. Clear corrupted data (WARNING: This deletes data!)
docker volume rm agentic-ragbot_opensearch_data
docker volume rm agentic-ragbot_redis_data
# 3. Restart with fresh data
docker-compose up -d
# 4. Wait for services to be ready
sleep 30
# 5. Verify health
curl http://localhost:8000/health/detailed
```
### Backup and Restore
```bash
# Backup OpenSearch
curl -X POST "localhost:9200/_snapshot/backup/snapshot_1"
# Backup Redis
docker-compose exec redis redis-cli BGSAVE
# Restore from backup
# See DEPLOYMENT.md for detailed instructions
```
### Performance Emergency
```bash
# 1. Scale up services
docker-compose up -d --scale api=5
# 2. Clear all caches
curl -X DELETE http://localhost:8000/admin/cache/clear
# 3. Enable emergency mode
export EMERGENCY_MODE=true
# This disables non-essential features
```
## Getting Help
1. **Check logs first:** Always check application logs for error details
2. **Search issues:** Look for similar issues in GitHub
3. **Collect information:**
- Error messages
- Logs
- System specs
- Steps to reproduce
4. **Create issue:** Include all relevant information in GitHub issue
### Contact Information
- **Documentation:** Check `/docs` directory
- **Issues:** GitHub Issues
- **Emergency:** Check DEPLOYMENT.md for emergency contacts