# Troubleshooting Guide
This guide helps diagnose and resolve common issues with MediGuard AI.
## Table of Contents
1. [Startup Issues](#startup-issues)
2. [Service Connectivity](#service-connectivity)
3. [Performance Issues](#performance-issues)
4. [API Errors](#api-errors)
5. [Database Issues](#database-issues)
6. [Logging and Monitoring](#logging-and-monitoring)
7. [Common Error Messages](#common-error-messages)
8. [Emergency Procedures](#emergency-procedures)
9. [Getting Help](#getting-help)
## Startup Issues
### Application Won't Start
**Symptoms:**
- Application exits immediately
- Port already in use errors
- Module import errors
**Solutions:**
1. **Check port availability:**
```bash
# Check if port 8000 is in use
netstat -tulpn | grep 8000
# Or on Windows
netstat -ano | findstr 8000
```
2. **Verify Python environment:**
```bash
# Activate virtual environment
source venv/bin/activate
# On Windows
venv\Scripts\activate
# Check dependencies
pip list
```
3. **Check environment variables:**
```bash
# Verify required variables are set
env | grep -E "(GROQ|REDIS|OPENSEARCH)"
```
4. **Common startup errors and fixes:**
| Error | Cause | Solution |
|-------|-------|----------|
| `ModuleNotFoundError` | Missing dependencies | `pip install -r requirements.txt` |
| `Permission denied` | Ports below 1024 require elevated privileges | Use a port of 1024 or higher, or run with sudo |
| `Address already in use` | Another process using port | Kill process or use different port |
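
The port and environment checks above can also be run from Python before launching the server. This is a minimal sketch using only the standard library; the variable names in `REQUIRED_ENV_VARS` are placeholders (the guide only shows the `GROQ`, `REDIS`, and `OPENSEARCH` prefixes), so substitute the names your deployment actually uses.

```python
import os
import socket

# Hypothetical names for illustration; replace with your real variable names.
REQUIRED_ENV_VARS = ("GROQ_API_KEY", "REDIS_URL", "OPENSEARCH_HOST")


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


def missing_env_vars(names=REQUIRED_ENV_VARS, environ=None) -> list[str]:
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]
```

Calling `port_in_use(8000)` and `missing_env_vars()` before starting the application gives the same answers as the `netstat` and `env | grep` commands, in a form that is easy to drop into a startup script.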
### Docker Container Issues
**Symptoms:**
- Container fails to start
- Health check failures
- Volume mount errors
**Solutions:**
1. **Check container logs:**
```bash
docker logs mediguard-api
docker-compose logs api
```
2. **Verify Docker resources:**
```bash
# Check Docker resource usage
docker stats
# Check disk space
docker system df
```
3. **Rebuild container:**
```bash
docker-compose down
docker-compose build --no-cache
docker-compose up -d
```
## Service Connectivity
### OpenSearch Connection Issues
**Symptoms:**
- Search requests failing
- Connection timeout errors
- Authentication failures
**Diagnosis:**
```bash
# Check OpenSearch health
curl -X GET "localhost:9200/_cluster/health?pretty"
# Test from application
curl http://localhost:8000/health/service/opensearch
```
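
The `status` field of the `_cluster/health` response is the quickest signal to read. The following sketch interprets it, assuming the JSON has already been parsed into a dict; the function name is illustrative and not part of the MediGuard codebase.

```python
def describe_cluster_health(health: dict) -> str:
    """Turn a parsed _cluster/health response into a one-line summary."""
    status = health.get("status", "unknown")
    unassigned = health.get("unassigned_shards", 0)
    if status == "green":
        return "green: all primary and replica shards are assigned"
    if status == "yellow":
        # Yellow means every primary is assigned, so the unassigned shards are replicas
        return f"yellow: primaries are assigned, {unassigned} replica shard(s) are not"
    if status == "red":
        return f"red: at least one primary shard is unassigned ({unassigned} unassigned in total)"
    return f"unexpected status: {status!r}"
```

A single-node development cluster commonly reports `yellow`, because replica shards cannot be placed on the same node as their primaries. Searches still work in that state; `red` is the status that needs immediate attention.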
**Solutions:**
1. **Verify OpenSearch is running:**
```bash
docker-compose ps opensearch
docker-compose restart opensearch
```
2. **Check network connectivity:**
```bash
# Test connection
telnet localhost 9200
# Check firewall
sudo ufw status
```
3. **Fix authentication:**
```yaml
# In docker-compose.yml
environment:
  - DISABLE_SECURITY_PLUGIN=true # Development only: turns off authentication and TLS
```
### Redis Connection Issues
**Symptoms:**
- Cache misses
- Session data loss
- Rate limiting not working
**Diagnosis:**
```bash
# Test Redis connection
redis-cli ping
# Check from application
curl http://localhost:8000/health/service/redis
```
**Solutions:**
1. **Restart Redis:**
```bash
docker-compose restart redis
```
2. **Clear corrupted data:**
```bash
# WARNING: deletes every key in every Redis database, not only corrupted entries
redis-cli FLUSHALL
```
3. **Check memory limits:**
```bash
# Compare used_memory_human against maxmemory_human
redis-cli INFO memory
```
### Ollama/LLM Connection Issues
**Symptoms:**
- LLM requests timing out
- Model not found errors
- Slow responses
**Diagnosis:**
```bash
# Check Ollama status
curl http://localhost:11434/api/tags
# Test model
curl http://localhost:11434/api/generate -d '{
"model": "llama3.3",
"prompt": "Test"
}'
```
**Solutions:**
1. **Pull required models:**
```bash
docker-compose exec ollama ollama pull llama3.3
```
2. **Check GPU availability:**
```bash
nvidia-smi
```
3. **Adjust timeouts:**
```python
# In settings
OLLAMA_TIMEOUT = 120 # Increase timeout
```
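
To check whether a longer timeout actually helps, it can be useful to call Ollama directly with an explicit timeout, bypassing the application. This is a sketch using only the standard library; it targets the same `/api/generate` endpoint as the curl test above and sets `"stream": false` so the reply arrives as a single JSON object. The helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # same host and port as the curl checks
OLLAMA_TIMEOUT = 120  # seconds


def build_generate_request(model: str, prompt: str,
                           base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming POST to Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = OLLAMA_TIMEOUT) -> str:
    """Send the prompt and return the generated text; raises on timeout or HTTP error."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

If `generate("llama3.3", "Test")` succeeds here but the application still times out, the bottleneck is more likely in the application's own timeout setting than in Ollama.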
## Performance Issues
### Slow API Responses
**Symptoms:**
- Requests taking > 5 seconds
- Timeouts in client applications
- High CPU usage
**Diagnosis:**
1. **Check response times:**
```bash
# Use curl with timing
curl -w "total: %{time_total}s\n" -o /dev/null -s http://localhost:8000/health
# Monitor with metrics
curl http://localhost:8000/metrics | grep http_request_duration
```
2. **Profile the application:**
```bash
# Use py-spy
pip install py-spy
py-spy top --pid <pid>
```
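
A single timing can be misleading, so it helps to sample the endpoint several times and look at the spread. The sketch below is one way to do that with the standard library; the function names are illustrative, and the 95th percentile is an approximation when the sample is small.

```python
import statistics
import time
import urllib.request


def time_request(url: str, timeout: float = 10.0) -> float:
    """Return the wall-clock seconds taken to fetch url once."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return time.perf_counter() - start


def summarize(samples: list[float]) -> dict[str, float]:
    """Summarize latencies (seconds) as median, approximate p95, and max."""
    ordered = sorted(samples)
    if len(ordered) >= 2:
        # n=20 yields 19 cut points; the last is the 95th percentile.
        # method="inclusive" keeps the estimate within the observed range.
        p95 = statistics.quantiles(ordered, n=20, method="inclusive")[-1]
    else:
        p95 = ordered[0]
    return {"median": statistics.median(ordered), "p95": p95, "max": ordered[-1]}
```

For example, `summarize([time_request("http://localhost:8000/health") for _ in range(20)])` gives a quick picture of whether slowness is constant or limited to occasional outliers.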
**Solutions:**
1. **Enable caching:**
```python
# Add caching to expensive operations
from src.services.cache.advanced_cache import cached
@cached(ttl=300)
async def expensive_operation():
    ...
```
2. **Optimize database queries:**
```python
# Use optimized queries
from src.services.opensearch.client import make_opensearch_client
client = make_opensearch_client()
results = client.search_bm25_optimized(query, min_score=0.5)
```
3. **Scale horizontally:**
```bash
# Run multiple instances
docker-compose up -d --scale api=3
```
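
For readers unfamiliar with TTL caching (step 1 above), the idea can be shown with a small stand-in decorator. This is not the project's `cached` implementation, whose internals are not shown in this guide; `ttl_cached` is a hypothetical name, and the sketch keeps results in process memory, keyed by positional arguments only.

```python
import asyncio
import functools
import time


def ttl_cached(ttl: float):
    """Cache an async function's results in memory for ttl seconds per argument tuple."""
    def decorator(func):
        store = {}  # args tuple -> (expiry timestamp, value)

        @functools.wraps(func)
        async def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # still fresh: skip the expensive call
            value = await func(*args)
            store[args] = (now + ttl, value)
            return value

        return wrapper
    return decorator
```

Note that this sketch never evicts expired entries, so a cache like this grows with the number of distinct arguments. That is acceptable for a demonstration but is exactly the kind of unbounded growth to watch for when investigating memory usage.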
### Memory Leaks
**Symptoms:**
- Memory usage increasing over time
- Out of memory errors
- Container restarts
**Diagnosis:**
1. **Monitor memory usage:**
```bash
# Check container memory
docker stats
# Check process memory
ps aux | grep python
```
2. **Find memory leaks:**
```bash
# Use memory-profiler
pip install memory-profiler
python -m memory_profiler script.py
```
**Solutions:**
1. **Fix circular references:**
```python
# Use weak references
import weakref
class Parent:
    def __init__(self):
        self.children = weakref.WeakSet()
```
2. **Clear caches:**
```python
# Periodically clear caches (call from async code, e.g. a scheduled task)
from src.services.cache.advanced_cache import CacheInvalidator
async def clear_all_caches():
    await CacheInvalidator.invalidate_by_pattern("*")
```
3. **Increase memory limits:**
```yaml
# In docker-compose.yml
deploy:
  resources:
    limits:
      memory: 4G
```
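
The weak-reference fix in step 1 can be verified in isolation. The short demonstration below shows that a `WeakSet` stops tracking an object once the last strong reference to it is dropped; the `Child` class is added purely for illustration.

```python
import gc
import weakref


class Child:
    """Plain object; instances of ordinary classes support weak references."""


class Parent:
    def __init__(self):
        # A WeakSet does not keep its members alive
        self.children = weakref.WeakSet()


parent = Parent()
child = Child()
parent.children.add(child)
count_while_alive = len(parent.children)

del child      # drop the only strong reference
gc.collect()   # CPython frees the object immediately; this covers other runtimes
count_after_release = len(parent.children)
```

With a regular `set`, the parent would have kept the child alive for as long as the parent itself existed.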
## API Errors
### 422 Validation Errors
**Symptoms:**
- `{"detail": [...]}` with validation errors
- Requests rejected with status 422
**Common causes:**
1. **Missing required fields:**
```json
// Wrong
{"biomarkers": {}}
// Right
{"biomarkers": {"Glucose": 100}}
```
2. **Invalid data types:**
```json
// Wrong
{"biomarkers": {"Glucose": "high"}}
// Right
{"biomarkers": {"Glucose": 150}}
```
3. **Out of range values:**
```bash
# Check the API docs for valid ranges (or open this URL in a browser)
curl http://localhost:8000/docs
```
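
The first two causes above can be caught on the client before a request is sent. The sketch below checks only the payload shape shown in the examples (a non-empty `biomarkers` object with numeric values); it does not check value ranges, since those are defined by the API and not listed here. The function name is illustrative.

```python
from numbers import Real


def find_payload_problems(payload: dict) -> list[str]:
    """Return human-readable problems; an empty list means the basic shape is fine."""
    biomarkers = payload.get("biomarkers")
    if not isinstance(biomarkers, dict) or not biomarkers:
        return ["'biomarkers' must be a non-empty object"]
    problems = []
    for name, value in biomarkers.items():
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, Real):
            problems.append(f"{name}: expected a number, got {type(value).__name__}")
    return problems
```

Running this on `{"biomarkers": {"Glucose": "high"}}` reports the string value, while `{"biomarkers": {"Glucose": 150}}` returns an empty list.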
### 500 Internal Server Errors
**Symptoms:**
- Generic error messages
- Stack traces in logs
**Diagnosis:**
1. **Check application logs:**
```bash
docker-compose logs -f api | grep ERROR
```
2. **Enable debug mode:**
```bash
export DEBUG=true
uvicorn src.main:app --reload
```
**Common causes:**
| Error | Solution |
|-------|----------|
| Database connection lost | Restart database services |
| External service down | Check service health endpoints |
| Memory error | Increase memory or optimize code |
| Configuration error | Verify environment variables |
### 503 Service Unavailable
**Symptoms:**
- Service temporarily unavailable
- Health check failures
**Solutions:**
1. **Check service dependencies:**
```bash
curl http://localhost:8000/health/detailed
```
2. **Restart affected services:**
```bash
docker-compose restart
```
3. **Check rate limits:**
```bash
# Check rate limit headers
curl -I http://localhost:8000/analyze/structured
```
## Database Issues
### OpenSearch Index Problems
**Symptoms:**
- Search returning no results
- Index not found errors
- Mapping errors
**Diagnosis:**
1. **Check index status:**
```bash
curl -X GET "localhost:9200/_cat/indices?v"
```
2. **Verify mapping:**
```bash
curl -X GET "localhost:9200/medical_chunks/_mapping?pretty"
```
**Solutions:**
1. **Recreate index:**
```bash
# Delete and recreate (WARNING: removes all indexed documents)
curl -X DELETE "localhost:9200/medical_chunks"
# Restart application to recreate
```
2. **Fix mapping:**
```python
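# Update index config
from src.services.opensearch.client import make_opensearch_client
from src.services.opensearch.index_config import MEDICAL_CHUNKS_MAPPING
client = make_opensearch_client()
client.ensure_index(MEDICAL_CHUNKS_MAPPING)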
```
### Data Corruption
**Symptoms:**
- Inconsistent search results
- Missing documents
- Strange query behavior
**Solutions:**
1. **Verify data integrity:**
```bash
# Count documents
curl -X GET "localhost:9200/medical_chunks/_count"
```
2. **Reindex data:**
```python
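# Use indexing service (reindex_all() is awaited, so it needs an event loop)
import asyncio
from src.services.indexing.service import IndexingService
async def main():
    service = IndexingService()
    await service.reindex_all()
asyncio.run(main())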
```
## Logging and Monitoring
### Enable Debug Logging
1. **Set log level:**
```bash
export LOG_LEVEL=DEBUG
export LOG_TO_FILE=true
```
2. **View logs:**
```bash
# Real-time logs
tail -f data/logs/mediguard.log
# Filter by level
grep "ERROR" data/logs/mediguard.log
```
### Monitor Metrics
1. **Check Prometheus metrics:**
```bash
curl http://localhost:8000/metrics | grep http_
```
2. **View Grafana dashboard:**
- Navigate to http://localhost:3000
- Import `monitoring/grafana-dashboard.json`
### Performance Profiling
1. **Enable profiling:**
```python
# Add to main.py
from fastapi import Request  # assumes the app is built on FastAPI
from pyinstrument import Profiler
@app.middleware("http")
async def profile_requests(request: Request, call_next):
    profiler = Profiler()
    profiler.start()
    response = await call_next(request)
    profiler.stop()
    print(profiler.output_text(unicode=True, color=True))
    return response
```
## Common Error Messages
### "Service unavailable" in logs
**Meaning:** A required service (OpenSearch, Redis, etc.) is not responding.
**Fix:**
1. Check service status: `docker-compose ps`
2. Restart service: `docker-compose restart <service>`
3. Check logs: `docker-compose logs <service>`
### "Rate limit exceeded"
**Meaning:** Too many requests from a client.
**Fix:**
1. Wait and retry
2. Check `Retry-After` header
3. Implement client-side rate limiting
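
Steps 1 and 2 can be combined into a small retry helper. The sketch below assumes the API signals rate limiting with HTTP status 429 (the usual convention, though not confirmed by this guide) and accepts `Retry-After` either as a number of seconds or as an HTTP date. The function names and the `send` callable are illustrative.

```python
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, default=1.0, now=None):
    """Convert a Retry-After header (delta-seconds or HTTP-date) to a wait in seconds."""
    if not header_value:
        return default
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the default wait
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def call_with_retry(send, max_attempts=3, sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    send is any callable returning (status_code, headers_dict, body).
    """
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429 or attempt == max_attempts:
            return status, body
        sleep(retry_after_seconds(headers.get("Retry-After")))
```

Passing `sleep` as a parameter keeps the helper easy to test and lets callers swap in their own waiting strategy.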
### "Invalid token" or "Authentication failed"
**Meaning:** Invalid API key or token.
**Fix:**
1. Verify API key is correct
2. Check token hasn't expired
3. Ensure proper header format: `Authorization: Bearer <token>`
### "Query too large" or "Request entity too large"
**Meaning:** Request exceeds size limits.
**Fix:**
1. Reduce request size
2. Use pagination
3. Increase limits in configuration
### "Connection pool exhausted"
**Meaning:** Too many concurrent database connections.
**Fix:**
1. Increase pool size
2. Add connection timeout
3. Implement request queuing
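
Request queuing (step 3) can be as simple as capping how many operations run at once, so that excess callers wait instead of opening new connections. This is a minimal sketch built on `asyncio.Semaphore`; `BoundedRunner` is a hypothetical helper, not something the MediGuard codebase is known to provide.

```python
import asyncio


class BoundedRunner:
    """Run coroutines with at most `limit` in flight; the rest wait in line.

    Create the runner inside running async code so the semaphore is bound
    to the right event loop on older Python versions.
    """

    def __init__(self, limit: int):
        self._semaphore = asyncio.Semaphore(limit)
        self.in_flight = 0
        self.peak = 0  # highest concurrency observed, useful when tuning the limit

    async def run(self, coro_func, *args):
        async with self._semaphore:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await coro_func(*args)
            finally:
                self.in_flight -= 1
```

Setting `limit` slightly below the connection pool size leaves headroom for health checks and background tasks.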
## Emergency Procedures
### Full System Recovery
```bash
# 1. Stop all services
docker-compose down
# 2. Clear corrupted data (WARNING: This deletes data!)
docker volume rm agentic-ragbot_opensearch_data
docker volume rm agentic-ragbot_redis_data
# 3. Restart with fresh data
docker-compose up -d
# 4. Wait for services to be ready
sleep 30
# 5. Verify health
curl http://localhost:8000/health/detailed
```
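
A fixed `sleep 30` may be too short on a slow machine and needlessly long on a fast one. An alternative is to poll the health endpoint until it responds. The sketch below assumes `/health/detailed` returns a 2xx status only when the system is ready, which this guide does not confirm; the function names are illustrative.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and 4xx/5xx responses
        return False


def wait_until_ready(probe, deadline_s: float = 120.0, interval_s: float = 2.0,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll probe() until it returns True or deadline_s elapses."""
    give_up_at = clock() + deadline_s
    while True:
        if probe():
            return True
        if clock() >= give_up_at:
            return False
        sleep(interval_s)
```

For the recovery procedure above, this would be called as `wait_until_ready(lambda: http_ok("http://localhost:8000/health/detailed"))`.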
### Backup and Restore
```bash
# Backup OpenSearch (assumes a snapshot repository named "backup" is already registered)
curl -X POST "localhost:9200/_snapshot/backup/snapshot_1"
# Backup Redis
docker-compose exec redis redis-cli BGSAVE
# Restore from backup
# See DEPLOYMENT.md for detailed instructions
```
### Performance Emergency
```bash
# 1. Scale up services
docker-compose up -d --scale api=5
# 2. Clear all caches
curl -X DELETE http://localhost:8000/admin/cache/clear
# 3. Enable emergency mode
export EMERGENCY_MODE=true
# This disables non-essential features
```
## Getting Help
1. **Check logs first:** Always check application logs for error details
2. **Search issues:** Look for similar issues on GitHub
3. **Collect information:**
- Error messages
- Logs
- System specs
- Steps to reproduce
4. **Create issue:** Include all relevant information in the GitHub issue
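
The "System specs" item in step 3 can be gathered with a few lines of standard-library Python. This is a sketch only; the set of fields is a suggestion and the function name is illustrative.

```python
import json
import platform
import sys


def collect_system_info() -> dict:
    """Gather basic runtime details worth pasting into a bug report."""
    return {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
    }
```

Printing `json.dumps(collect_system_info(), indent=2)` produces a block that can be pasted directly into an issue.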
### Contact Information
- **Documentation:** Check `/docs` directory
- **Issues:** GitHub Issues
- **Emergency:** Check DEPLOYMENT.md for emergency contacts