Troubleshooting Guide

This guide helps diagnose and resolve common issues with MediGuard AI.

Startup Issues
Service Connectivity
Performance Issues
API Errors
Database Issues
Memory and CPU Issues
Logging and Monitoring
Common Error Messages

Startup Issues

Application Won't Start

Symptoms:

Application exits immediately
Port already in use errors
Module import errors

Solutions:

Check port availability:

# Check if port 8000 is in use (Linux; matching ":8000" avoids hits on PIDs)
netstat -tulpn | grep :8000
# Or on Windows
netstat -ano | findstr :8000

Verify Python environment:

# Activate virtual environment
source venv/bin/activate
# On Windows
venv\Scripts\activate

# Check dependencies
pip list

Check environment variables:

# Verify required variables are set
env | grep -E "(GROQ|REDIS|OPENSEARCH)"

Common startup errors and fixes:

| Error | Cause | Solution |
|-------|-------|----------|
| `ModuleNotFoundError` | Missing dependencies | `pip install -r requirements.txt` |
| `Permission denied` | Ports below 1024 require root privileges | Use a port of 1024 or higher (preferred), or run with sudo |
| `Address already in use` | Another process is using the port | Stop that process or use a different port |
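
The port and environment checks above can also be scripted as a pre-flight step. The following is a minimal sketch, not part of MediGuard AI: the function names `port_in_use` and `matching_env_vars` are hypothetical, and the variable-name fragments are taken from the `grep` pattern shown earlier rather than from the application's actual settings.

```python
import os
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the TCP connection succeeds
        return sock.connect_ex((host, port)) == 0


def matching_env_vars(fragments=("GROQ", "REDIS", "OPENSEARCH")) -> list[str]:
    """List environment variable names that contain one of the fragments.

    Unlike `env | grep`, this looks at names only, not values.
    """
    return sorted(name for name in os.environ if any(f in name for f in fragments))


if __name__ == "__main__":
    print("port 8000 in use:", port_in_use(8000))
    print("matching variables:", matching_env_vars())
```

An empty list from `matching_env_vars()` suggests the shell that launches the application never loaded its configuration.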

Docker Container Issues

Symptoms:

Container fails to start
Health check failures
Volume mount errors

Solutions:

Check container logs:

docker logs mediguard-api
docker-compose logs api

Verify Docker resources:

# Check Docker resource usage
docker stats

# Check disk space
docker system df

Rebuild container:

docker-compose down
docker-compose build --no-cache
docker-compose up -d

Service Connectivity

OpenSearch Connection Issues

Symptoms:

Search requests failing
Connection timeout errors
Authentication failures

Diagnosis:

# Check OpenSearch health
curl -X GET "localhost:9200/_cluster/health?pretty"

# Test from application
curl http://localhost:8000/health/service/opensearch

Solutions:

Verify OpenSearch is running:

docker-compose ps opensearch
docker-compose restart opensearch

Check network connectivity:

# Test connection
telnet localhost 9200

# Check firewall
sudo ufw status

Fix authentication:

# In docker-compose.yml (opensearch service)
environment:
  - DISABLE_SECURITY_PLUGIN=true  # Development only: disables authentication and TLS
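
To read the `_cluster/health` response from the diagnosis step programmatically, a sketch like the one below may help. It assumes an unauthenticated development cluster on `localhost:9200`; `fetch_cluster_health` and `describe_health` are hypothetical helper names, not part of the project.

```python
import json
import urllib.request


def fetch_cluster_health(base_url: str = "http://localhost:9200") -> dict:
    """Fetch the cluster health document (assumes no authentication)."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=5) as resp:
        return json.load(resp)


def describe_health(health: dict) -> str:
    """Translate the 'status' field into a short explanation."""
    status = health.get("status", "unknown")
    meanings = {
        "green": "all primary and replica shards are allocated",
        "yellow": "primaries are allocated but some replicas are not "
                  "(common on a single-node setup)",
        "red": "at least one primary shard is unallocated; searches may fail",
    }
    return f"{status}: {meanings.get(status, 'unexpected response')}"


# Usage (requires a running cluster):
# print(describe_health(fetch_cluster_health()))
```

A `yellow` status on a one-node development stack is usually not the cause of failing searches; `red` is.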

Redis Connection Issues

Symptoms:

Cache misses
Session data loss
Rate limiting not working

Diagnosis:

# Test Redis connection
redis-cli ping

# Check from application
curl http://localhost:8000/health/service/redis

Solutions:

Restart Redis:

```
docker-compose restart redis
```

Clear corrupted data:

```
# WARNING: FLUSHALL deletes every key in every Redis database
redis-cli FLUSHALL
```

Check memory limits:

```
# In redis-cli
INFO memory
```
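
The `INFO memory` output is a list of `key:value` lines, so the check above can be reduced to a single number. This sketch parses that text and compares `used_memory` to `maxmemory`. The helper names are hypothetical and the sample text is illustrative, not captured from a real instance.

```python
def parse_info(text: str) -> dict:
    """Parse the key:value lines of a Redis INFO section into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and section headers
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def memory_usage_ratio(info: dict):
    """Return used_memory / maxmemory, or None when no limit is configured."""
    used = int(info["used_memory"])
    limit = int(info.get("maxmemory", 0))
    # maxmemory of 0 means Redis has no configured memory limit
    return None if limit == 0 else used / limit


# Illustrative input only
sample = "# Memory\r\nused_memory:1048576\r\nmaxmemory:4194304\r\n"
ratio = memory_usage_ratio(parse_info(sample))
```

A ratio close to 1.0 means Redis is at its limit and may be evicting keys, which would explain cache misses and lost session data.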

Ollama/LLM Connection Issues

Symptoms:

LLM requests timing out
Model not found errors
Slow responses

Diagnosis:

# Check Ollama status
curl http://localhost:11434/api/tags

# Test model (streaming disabled so the reply is a single JSON object)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3",
  "prompt": "Test",
  "stream": false
}'

Solutions:

Pull required models:

docker-compose exec ollama ollama pull llama3.3

Check GPU availability:
```
nvidia-smi
```

Adjust timeouts:

# In settings
OLLAMA_TIMEOUT = 120  # Increase timeout
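
A "model not found" error can be confirmed before pulling anything by checking the `/api/tags` listing for the model name. The sketch below assumes Ollama is reachable on `localhost:11434`; `list_models` and `has_model` are hypothetical helper names.

```python
import json
import urllib.request


def list_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Return the names of locally available models from /api/tags."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        payload = json.load(resp)
    return [model["name"] for model in payload.get("models", [])]


def has_model(names: list[str], wanted: str) -> bool:
    """Match either the full name or the part before the tag suffix."""
    # Names usually carry a tag, for example "llama3.3:latest"
    return any(n == wanted or n.split(":", 1)[0] == wanted for n in names)


# Usage (requires a running Ollama server):
# print(has_model(list_models(), "llama3.3"))
```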

Performance Issues

Slow API Responses

Symptoms:

Requests taking > 5 seconds
Timeouts in client applications
High CPU usage

Diagnosis:

Check response times:

# Use curl with timing (curl-format.txt is a file you create containing
# --write-out variables such as %{time_total})
curl -w "@curl-format.txt" -o /dev/null -s http://localhost:8000/health

# Monitor with metrics
curl http://localhost:8000/metrics | grep http_request_duration

Profile the application:

# Use py-spy
pip install py-spy
py-spy top --pid <pid>
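
A single request can be an outlier, so it may help to time several requests and compare the spread. This is a minimal sketch using only the standard library; `time_calls` and `fetch_health` are hypothetical names, and the URL is the health endpoint used above.

```python
import statistics
import time
import urllib.request


def time_calls(call, runs: int = 5) -> dict:
    """Run `call` several times and summarize wall-clock latency in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def fetch_health(url: str = "http://localhost:8000/health") -> None:
    """Issue one GET request and read the full body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        resp.read()


# Usage (requires the API to be running):
# print(time_calls(fetch_health))
```

A large gap between the median and the maximum points to intermittent stalls, such as cold caches or a slow dependency, rather than uniformly slow handlers.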

Solutions:

Enable caching:

# Add caching to expensive operations
from src.services.cache.advanced_cache import cached

@cached(ttl=300)
async def expensive_operation():
    ...

Optimize database queries:

# Use optimized queries
from src.services.opensearch.client import make_opensearch_client
client = make_opensearch_client()
results = client.search_bm25_optimized(query, min_score=0.5)

Scale horizontally:

# Run multiple instances
docker-compose up -d --scale api=3

Memory Leaks

Symptoms:

Memory usage increasing over time
Out of memory errors
Container restarts

Diagnosis:

Monitor memory usage:

# Check container memory
docker stats

# Check process memory
ps aux | grep python

Find memory leaks:

# Use memory-profiler
pip install memory-profiler
python -m memory_profiler script.py

Solutions:

Fix circular references:

# Use weak references
import weakref

class Parent:
    def __init__(self):
        self.children = weakref.WeakSet()

Clear caches:

# Periodically clear caches (call from within async code)
from src.services.cache.advanced_cache import CacheInvalidator
await CacheInvalidator.invalidate_by_pattern("*")

Increase memory limits:

# In docker-compose.yml
deploy:
  resources:
    limits:
      memory: 4G
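
To see why the weak-reference fix above helps, the sketch below extends the `Parent` example with a hypothetical `Child` class. The parent's `WeakSet` does not keep a child alive, so dropping the last strong reference removes it from the set.

```python
import gc
import weakref


class Child:
    """Hypothetical child object; any ordinary class instance works."""

    def __init__(self, name: str):
        self.name = name


class Parent:
    def __init__(self):
        # Weak references: the set does not keep children alive
        self.children = weakref.WeakSet()

    def add(self, child: Child) -> None:
        self.children.add(child)


parent = Parent()
kept = Child("kept")
temp = Child("temp")
parent.add(kept)
parent.add(temp)
count_before = len(parent.children)

del temp       # drop the only strong reference
gc.collect()   # make collection explicit on non-refcounting interpreters
count_after = len(parent.children)
```

With a plain `set`, `temp` would stay reachable through the parent for as long as the parent lives, which is a common way for memory to grow over time.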

API Errors

422 Validation Errors

Symptoms:

{"detail": [...]} with validation errors
Requests rejected with status 422

Common causes:

Missing required fields:

// Wrong
{"biomarkers": {}}

// Right
{"biomarkers": {"Glucose": 100}}

Invalid data types:

// Wrong
{"biomarkers": {"Glucose": "high"}}

// Right
{"biomarkers": {"Glucose": 150}}

Out of range values:

# Check the API docs for valid ranges
curl http://localhost:8000/docs
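
The first two causes above can be caught on the client before a request is sent. This sketch checks only what the examples show (a non-empty `biomarkers` object with numeric values). It is not the server's validation logic, `find_payload_problems` is a hypothetical name, and it does not check value ranges, which are defined in the API docs.

```python
from numbers import Real


def find_payload_problems(payload: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the shape looks OK."""
    biomarkers = payload.get("biomarkers")
    if not isinstance(biomarkers, dict) or not biomarkers:
        return ["'biomarkers' must be a non-empty object"]

    problems = []
    for name, value in biomarkers.items():
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, Real):
            problems.append(f"{name}: expected a number, got {type(value).__name__}")
    return problems
```

For example, `find_payload_problems({"biomarkers": {"Glucose": "high"}})` reports the string value, while `{"biomarkers": {"Glucose": 150}}` yields an empty list.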

500 Internal Server Errors

Symptoms:

Generic error messages
Stack traces in logs

Diagnosis:

Check application logs:

docker-compose logs -f api | grep ERROR

Enable debug mode:

export DEBUG=true
uvicorn src.main:app --reload

Common causes:

| Error | Solution |
|-------|----------|
| Database connection lost | Restart database services |
| External service down | Check service health endpoints |
| Memory error | Increase memory or optimize code |
| Configuration error | Verify environment variables |

503 Service Unavailable

Symptoms:

Service temporarily unavailable
Health check failures

Solutions:

Check service dependencies:

curl http://localhost:8000/health/detailed

Restart affected services:
```
docker-compose restart
```

Check rate limits:

# Check rate limit headers
curl -I http://localhost:8000/analyze/structured

Database Issues

OpenSearch Index Problems

Symptoms:

Search returning no results
Index not found errors
Mapping errors

Diagnosis:

Check index status:

curl -X GET "localhost:9200/_cat/indices?v"

Verify mapping:

curl -X GET "localhost:9200/medical_chunks/_mapping?pretty"

Solutions:

Recreate index:

# Delete and recreate (WARNING: this removes all documents in the index)
curl -X DELETE "localhost:9200/medical_chunks"
# Restart the application to recreate the index

Fix mapping:

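# Update index config
from src.services.opensearch.client import make_opensearch_client
from src.services.opensearch.index_config import MEDICAL_CHUNKS_MAPPING

client = make_opensearch_client()
client.ensure_index(MEDICAL_CHUNKS_MAPPING)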

Data Corruption

Symptoms:

Inconsistent search results
Missing documents
Strange query behavior

Solutions:

Verify data integrity:

# Count documents
curl -X GET "localhost:9200/medical_chunks/_count"

Reindex data:

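# Use indexing service (from a standalone script; inside async code,
# use `await service.reindex_all()` instead of asyncio.run)
import asyncio
from src.services.indexing.service import IndexingService

service = IndexingService()
asyncio.run(service.reindex_all())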

Logging and Monitoring

Enable Debug Logging

Set log level:

export LOG_LEVEL=DEBUG
export LOG_TO_FILE=true

View logs:

# Real-time logs
tail -f data/logs/mediguard.log

# Filter by level
grep "ERROR" data/logs/mediguard.log

Monitor Metrics

Check Prometheus metrics:

curl http://localhost:8000/metrics | grep http_

View Grafana dashboard:
- Navigate to http://localhost:3000
- Import monitoring/grafana-dashboard.json

Performance Profiling

Enable profiling:

# Add to main.py (assumes `app` is the FastAPI instance defined there)
from fastapi import Request
from pyinstrument import Profiler

@app.middleware("http")
async def profile_requests(request: Request, call_next):
    # Profile each request and print its call tree to stdout
    profiler = Profiler()
    profiler.start()
    response = await call_next(request)
    profiler.stop()
    print(profiler.output_text(unicode=True, color=True))
    return response

Common Error Messages

"Service unavailable" in logs

Meaning: A required service (OpenSearch, Redis, etc.) is not responding.

Fix:

Check service status: docker-compose ps
Restart service: docker-compose restart <service>
Check logs: docker-compose logs <service>

"Rate limit exceeded"

Meaning: Too many requests from a client.

Fix:

Wait and retry
Check Retry-After header
Implement client-side rate limiting
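
The three fixes above combine naturally: honor `Retry-After` when the server sends it, and otherwise back off exponentially. The sketch below is one possible way to compute the wait. `retry_delay` is a hypothetical helper, and the base and cap values are arbitrary assumptions.

```python
def retry_delay(headers: dict, attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Uses the Retry-After header when it holds a number of seconds;
    otherwise falls back to capped exponential backoff.
    """
    value = headers.get("Retry-After")
    if value is not None:
        try:
            return max(0.0, min(float(value), cap))
        except ValueError:
            # Retry-After may also be an HTTP date; fall back to backoff
            pass
    return min(base * (2 ** attempt), cap)
```

A client loop would sleep for `retry_delay(response_headers, attempt)` after each 429 response and give up after a fixed number of attempts.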

"Invalid token" or "Authentication failed"

Meaning: Invalid API key or token.

Fix:

Verify API key is correct
Check token hasn't expired
Ensure proper header format: Authorization: Bearer <token>

"Query too large" or "Request entity too large"

Meaning: Request exceeds size limits.

Fix:

Reduce request size
Use pagination
Increase limits in configuration

"Connection pool exhausted"

Meaning: Too many concurrent database connections.

Fix:

Increase pool size
Add connection timeout
Implement request queuing
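
Request queuing can be as simple as a semaphore that caps how many database calls run at once, so excess requests wait instead of opening new connections. This is a generic `asyncio` sketch, not MediGuard AI's implementation. `run_limited` and `fake_query` are hypothetical names, and the limit of 3 is arbitrary.

```python
import asyncio


async def run_limited(jobs, limit: int):
    """Run coroutine factories with at most `limit` executing concurrently."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        async with semaphore:  # waits here when `limit` jobs are in flight
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def demo():
    active = 0
    peak = 0

    async def fake_query():
        # Stand-in for a database call; tracks peak concurrency
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return "ok"

    results = await run_limited([fake_query] * 10, limit=3)
    return results, peak


results, peak = asyncio.run(demo())
```

Setting the limit at or below the connection pool size keeps the pool from being exhausted, at the cost of added latency under load.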

Emergency Procedures

Full System Recovery

# 1. Stop all services
docker-compose down

# 2. Clear corrupted data (WARNING: This deletes data!)
docker volume rm agentic-ragbot_opensearch_data
docker volume rm agentic-ragbot_redis_data

# 3. Restart with fresh data
docker-compose up -d

# 4. Wait for services to be ready
sleep 30

# 5. Verify health
curl http://localhost:8000/health/detailed
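
A fixed `sleep 30` can be too short on a slow machine and needlessly long on a fast one. An alternative is to poll the health endpoint until it responds or a deadline passes. The sketch below assumes a healthy service returns HTTP 200 from `/health/detailed`; `health_ok` and `wait_until` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def health_ok(url: str = "http://localhost:8000/health/detailed") -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status
        return False


def wait_until(check, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Call `check` repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Usage (after `docker-compose up -d`):
# ready = wait_until(health_ok)
```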

Backup and Restore

# Backup OpenSearch (requires a snapshot repository named "backup" to be registered first)
curl -X POST "localhost:9200/_snapshot/backup/snapshot_1"

# Backup Redis
docker-compose exec redis redis-cli BGSAVE

# Restore from backup
# See DEPLOYMENT.md for detailed instructions

Performance Emergency

# 1. Scale up services
docker-compose up -d --scale api=5

# 2. Clear all caches
curl -X DELETE http://localhost:8000/admin/cache/clear

# 3. Enable emergency mode
export EMERGENCY_MODE=true
# This disables non-essential features

Getting Help

Check logs first: Always check the application logs for error details
Search issues: Look for similar issues on GitHub
Collect information:
- Error messages
- Logs
- System specs
- Steps to reproduce
Create issue: Include all relevant information in the GitHub issue

Contact Information

Documentation: Check /docs directory
Issues: GitHub Issues
Emergency: Check DEPLOYMENT.md for emergency contacts
