# Troubleshooting Guide
This guide helps diagnose and resolve common issues with MediGuard AI.
## Table of Contents
1. [Startup Issues](#startup-issues)
2. [Service Connectivity](#service-connectivity)
3. [Performance Issues](#performance-issues)
4. [API Errors](#api-errors)
5. [Database Issues](#database-issues)
6. [Memory and CPU Issues](#memory-and-cpu-issues)
7. [Logging and Monitoring](#logging-and-monitoring)
8. [Common Error Messages](#common-error-messages)
## Startup Issues
### Application Won't Start
**Symptoms:**
- Application exits immediately
- Port already in use errors
- Module import errors
**Solutions:**
1. **Check port availability:**
```bash
# Check if port 8000 is in use
netstat -tulpn | grep 8000
# Or on Windows
netstat -ano | findstr 8000
```
2. **Verify Python environment:**
```bash
# Activate virtual environment
source venv/bin/activate
# On Windows
venv\Scripts\activate
# Check dependencies
pip list
```
3. **Check environment variables:**
```bash
# Verify required variables are set
env | grep -E "(GROQ|REDIS|OPENSEARCH)"
```
4. **Common startup errors and fixes:**
| Error | Cause | Solution |
|-------|-------|----------|
| `ModuleNotFoundError` | Missing dependencies | `pip install -r requirements.txt` |
| `Permission denied` | Port requires privileges | Use port > 1024 or run with sudo |
| `Address already in use` | Another process using port | Kill process or use different port |
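As a cross-platform alternative to the netstat commands above, a quick check from Python (a minimal sketch using only the standard library) can tell you whether a port is already bound:

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1)
        return s.connect_ex((host, port)) == 0

# Example: check the default API port before starting the app
# port_in_use(8000)
```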
### Docker Container Issues
**Symptoms:**
- Container fails to start
- Health check failures
- Volume mount errors
**Solutions:**
1. **Check container logs:**
```bash
docker logs mediguard-api
docker-compose logs api
```
2. **Verify Docker resources:**
```bash
# Check Docker resource usage
docker stats
# Check disk space
docker system df
```
3. **Rebuild container:**
```bash
docker-compose down
docker-compose build --no-cache
docker-compose up -d
```
## Service Connectivity
### OpenSearch Connection Issues
**Symptoms:**
- Search requests failing
- Connection timeout errors
- Authentication failures
**Diagnosis:**
```bash
# Check OpenSearch health
curl -X GET "localhost:9200/_cluster/health?pretty"
# Test from application
curl http://localhost:8000/health/service/opensearch
```
**Solutions:**
1. **Verify OpenSearch is running:**
```bash
docker-compose ps opensearch
docker-compose restart opensearch
```
2. **Check network connectivity:**
```bash
# Test connection
telnet localhost 9200
# Check firewall
sudo ufw status
```
3. **Fix authentication:**
```yaml
# In docker-compose.yml
environment:
  - DISABLE_SECURITY_PLUGIN=true  # For development
```
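The curl-based health check above can also be scripted. This is a minimal sketch using only the standard library, assuming the default local endpoint; it returns `None` when OpenSearch is unreachable rather than raising:

```python
import json
import urllib.request
from urllib.error import URLError

def opensearch_status(base_url="http://localhost:9200", timeout=5):
    """Return the cluster health status ('green'/'yellow'/'red'), or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
            return json.load(resp).get("status")
    except (URLError, OSError, ValueError):
        return None
```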
### Redis Connection Issues
**Symptoms:**
- Cache misses
- Session data loss
- Rate limiting not working
**Diagnosis:**
```bash
# Test Redis connection
redis-cli ping
# Check from application
curl http://localhost:8000/health/service/redis
```
**Solutions:**
1. **Restart Redis:**
```bash
docker-compose restart redis
```
2. **Clear corrupted data:**
```bash
redis-cli FLUSHALL
```
3. **Check memory limits:**
```bash
# In redis-cli
INFO memory
```
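If you need to act on the `INFO memory` output programmatically (for example, to alert when `maxmemory_policy` is set to `noeviction`), a small parser for the `key:value` format that `INFO` emits might look like this sketch:

```python
def parse_redis_info(text):
    """Parse `redis-cli INFO` output into a flat dict of strings.

    Section headers (lines starting with '#') and blank lines are skipped.
    """
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and ":" in line:
            key, _, value = line.partition(":")
            info[key] = value
    return info
```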
### Ollama/LLM Connection Issues
**Symptoms:**
- LLM requests timing out
- Model not found errors
- Slow responses
**Diagnosis:**
```bash
# Check Ollama status
curl http://localhost:11434/api/tags
# Test model
curl http://localhost:11434/api/generate -d '{
"model": "llama3.3",
"prompt": "Test"
}'
```
**Solutions:**
1. **Pull required models:**
```bash
docker-compose exec ollama ollama pull llama3.3
```
2. **Check GPU availability:**
```bash
nvidia-smi
```
3. **Adjust timeouts:**
```python
# In settings
OLLAMA_TIMEOUT = 120 # Increase timeout
```
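Raising the timeout helps, but slow LLM backends often also benefit from retrying transient failures. A generic retry helper with exponential backoff (a sketch, not part of the MediGuard codebase) could look like:

```python
import time

def with_backoff(call, attempts=3, base_delay=1.0):
    """Call `call()`, retrying with exponential backoff on any exception.

    Delays are base_delay, 2*base_delay, 4*base_delay, ...; the last
    failure is re-raised so the caller can handle it.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```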
## Performance Issues
### Slow API Responses
**Symptoms:**
- Requests taking > 5 seconds
- Timeouts in client applications
- High CPU usage
**Diagnosis:**
1. **Check response times:**
```bash
# Use curl with timing
curl -w "@curl-format.txt" -o /dev/null -s http://localhost:8000/health
# Monitor with metrics
curl http://localhost:8000/metrics | grep http_request_duration
```
2. **Profile the application:**
```bash
# Use py-spy
pip install py-spy
py-spy top --pid <pid>
```
**Solutions:**
1. **Enable caching:**
```python
# Add caching to expensive operations
from src.services.cache.advanced_cache import cached
@cached(ttl=300)
async def expensive_operation():
    ...
```
2. **Optimize database queries:**
```python
# Use optimized queries
from src.services.opensearch.client import make_opensearch_client
client = make_opensearch_client()
results = client.search_bm25_optimized(query, min_score=0.5)
```
3. **Scale horizontally:**
```bash
# Run multiple instances
docker-compose up -d --scale api=3
```
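The `@cached` decorator in step 1 comes from the project's own cache module; for illustration, a simplified synchronous TTL cache along the same lines (an assumed sketch, not the project's implementation) might look like:

```python
import functools
import time

def ttl_cached(ttl):
    """Cache positional-arg calls for `ttl` seconds (simplified sync sketch)."""
    def decorator(fn):
        cache = {}

        @functools.wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            entry = cache.get(args)
            if entry is not None and now - entry[1] < ttl:
                return entry[0]  # still fresh: return cached value
            value = fn(*args)
            cache[args] = (value, now)
            return value
        return wrapper
    return decorator
```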
### Memory Leaks
**Symptoms:**
- Memory usage increasing over time
- Out of memory errors
- Container restarts
**Diagnosis:**
1. **Monitor memory usage:**
```bash
# Check container memory
docker stats
# Check process memory
ps aux | grep python
```
2. **Find memory leaks:**
```bash
# Use memory-profiler
pip install memory-profiler
python -m memory_profiler script.py
```
**Solutions:**
1. **Fix circular references:**
```python
# Use weak references
import weakref
class Parent:
    def __init__(self):
        self.children = weakref.WeakSet()
```
2. **Clear caches:**
```python
# Periodically clear caches
from src.services.cache.advanced_cache import CacheInvalidator
await CacheInvalidator.invalidate_by_pattern("*")
```
3. **Increase memory limits:**
```yaml
# In docker-compose.yml
deploy:
  resources:
    limits:
      memory: 4G
```
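As a standard-library alternative to memory-profiler, `tracemalloc` can point at the source lines responsible for the most allocations; a minimal sketch:

```python
import tracemalloc

tracemalloc.start()

# Stand-in for the suspect workload you want to measure
leaky = [bytes(1000) for _ in range(1000)]

snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics("lineno")[:5]  # top 5 allocation sites
for stat in top:
    print(stat)
tracemalloc.stop()
```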
## API Errors
### 422 Validation Errors
**Symptoms:**
- `{"detail": [...]}` with validation errors
- Requests rejected with status 422
**Common causes:**
1. **Missing required fields:**
```json
// Wrong
{"biomarkers": {}}
// Right
{"biomarkers": {"Glucose": 100}}
```
2. **Invalid data types:**
```json
// Wrong
{"biomarkers": {"Glucose": "high"}}
// Right
{"biomarkers": {"Glucose": 150}}
```
3. **Out of range values:**
```bash
# Check API docs for valid ranges
curl http://localhost:8000/docs
```
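Clients can catch most of these 422 causes before sending a request. The following pre-check is a hypothetical sketch mirroring the rules above (the field names come from the examples, not from the API's actual schema):

```python
def precheck_biomarkers(payload):
    """Return a list of problems that would trigger a 422; empty if payload looks valid."""
    errors = []
    biomarkers = payload.get("biomarkers")
    if not isinstance(biomarkers, dict) or not biomarkers:
        errors.append("biomarkers: at least one name/value pair is required")
        return errors
    for name, value in biomarkers.items():
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"biomarkers.{name}: expected a number, got {type(value).__name__}")
    return errors
```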
### 500 Internal Server Errors
**Symptoms:**
- Generic error messages
- Stack traces in logs
**Diagnosis:**
1. **Check application logs:**
```bash
docker-compose logs -f api | grep ERROR
```
2. **Enable debug mode:**
```bash
export DEBUG=true
uvicorn src.main:app --reload
```
**Common causes:**
| Error | Solution |
|-------|----------|
| Database connection lost | Restart database services |
| External service down | Check service health endpoints |
| Memory error | Increase memory or optimize code |
| Configuration error | Verify environment variables |
### 503 Service Unavailable
**Symptoms:**
- Service temporarily unavailable
- Health check failures
**Solutions:**
1. **Check service dependencies:**
```bash
curl http://localhost:8000/health/detailed
```
2. **Restart affected services:**
```bash
docker-compose restart
```
3. **Check rate limits:**
```bash
# Check rate limit headers
curl -I http://localhost:8000/analyze/structured
```
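When you hit rate limits, note that the `Retry-After` header may carry either delta-seconds or an HTTP-date. A small parser handling both forms (a sketch using only the standard library):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(header_value, default=1.0):
    """Parse a Retry-After header (delta-seconds or HTTP-date) into seconds to wait."""
    if header_value is None:
        return default
    try:
        return max(0.0, float(header_value))  # delta-seconds form, e.g. "5"
    except ValueError:
        pass
    try:
        dt = parsedate_to_datetime(header_value)  # HTTP-date form
    except (TypeError, ValueError):
        return default
    return max(0.0, (dt - datetime.now(timezone.utc)).total_seconds())
```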
## Database Issues
### OpenSearch Index Problems
**Symptoms:**
- Search returning no results
- Index not found errors
- Mapping errors
**Diagnosis:**
1. **Check index status:**
```bash
curl -X GET "localhost:9200/_cat/indices?v"
```
2. **Verify mapping:**
```bash
curl -X GET "localhost:9200/medical_chunks/_mapping?pretty"
```
**Solutions:**
1. **Recreate index:**
```bash
# Delete and recreate
curl -X DELETE "localhost:9200/medical_chunks"
# Restart application to recreate
```
2. **Fix mapping:**
```python
# Update index config
from src.services.opensearch.index_config import MEDICAL_CHUNKS_MAPPING
client.ensure_index(MEDICAL_CHUNKS_MAPPING)
```
### Data Corruption
**Symptoms:**
- Inconsistent search results
- Missing documents
- Strange query behavior
**Solutions:**
1. **Verify data integrity:**
```bash
# Count documents
curl -X GET "localhost:9200/medical_chunks/_count"
```
2. **Reindex data:**
```python
# Use indexing service
from src.services.indexing.service import IndexingService
service = IndexingService()
await service.reindex_all()
```
## Logging and Monitoring
### Enable Debug Logging
1. **Set log level:**
```bash
export LOG_LEVEL=DEBUG
export LOG_TO_FILE=true
```
2. **View logs:**
```bash
# Real-time logs
tail -f data/logs/mediguard.log
# Filter by level
grep "ERROR" data/logs/mediguard.log
```
### Monitor Metrics
1. **Check Prometheus metrics:**
```bash
curl http://localhost:8000/metrics | grep http_
```
2. **View Grafana dashboard:**
- Navigate to http://localhost:3000
- Import `monitoring/grafana-dashboard.json`
### Performance Profiling
1. **Enable profiling:**
```python
# Add to main.py
from fastapi import Request
from pyinstrument import Profiler

@app.middleware("http")
async def profile_requests(request: Request, call_next):
    profiler = Profiler()
    profiler.start()
    response = await call_next(request)
    profiler.stop()
    print(profiler.output_text(unicode=True, color=True))
    return response
```
## Common Error Messages
### "Service unavailable" in logs
**Meaning:** A required service (OpenSearch, Redis, etc.) is not responding.
**Fix:**
1. Check service status: `docker-compose ps`
2. Restart service: `docker-compose restart <service>`
3. Check logs: `docker-compose logs <service>`
### "Rate limit exceeded"
**Meaning:** Too many requests from a client.
**Fix:**
1. Wait and retry
2. Check `Retry-After` header
3. Implement client-side rate limiting
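Client-side rate limiting (step 3) is commonly done with a token bucket; here is a minimal sketch (illustrative, not part of MediGuard):

```python
import time

class TokenBucket:
    """Client-side rate limiter: allow() returns False once the budget is spent."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```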
### "Invalid token" or "Authentication failed"
**Meaning:** Invalid API key or token.
**Fix:**
1. Verify API key is correct
2. Check token hasn't expired
3. Ensure proper header format: `Authorization: Bearer <token>`
### "Query too large" or "Request entity too large"
**Meaning:** Request exceeds size limits.
**Fix:**
1. Reduce request size
2. Use pagination
3. Increase limits in configuration
### "Connection pool exhausted"
**Meaning:** Too many concurrent database connections.
**Fix:**
1. Increase pool size
2. Add connection timeout
3. Implement request queuing
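Request queuing (step 3) can be sketched with an asyncio semaphore that caps how many calls are in flight at once (illustrative only; the names are assumptions):

```python
import asyncio

async def run_limited(factories, max_concurrent=10):
    """Run coroutine factories with at most max_concurrent awaiting at once.

    Excess calls queue on the semaphore instead of exhausting the pool.
    """
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(factory):
        async with sem:
            return await factory()

    return await asyncio.gather(*(guarded(f) for f in factories))
```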
## Emergency Procedures
### Full System Recovery
```bash
# 1. Stop all services
docker-compose down
# 2. Clear corrupted data (WARNING: This deletes data!)
docker volume rm agentic-ragbot_opensearch_data
docker volume rm agentic-ragbot_redis_data
# 3. Restart with fresh data
docker-compose up -d
# 4. Wait for services to be ready
sleep 30
# 5. Verify health
curl http://localhost:8000/health/detailed
```
### Backup and Restore
```bash
# Backup OpenSearch
curl -X POST "localhost:9200/_snapshot/backup/snapshot_1"
# Backup Redis
docker-compose exec redis redis-cli BGSAVE
# Restore from backup
# See DEPLOYMENT.md for detailed instructions
```
### Performance Emergency
```bash
# 1. Scale up services
docker-compose up -d --scale api=5
# 2. Clear all caches
curl -X DELETE http://localhost:8000/admin/cache/clear
# 3. Enable emergency mode
export EMERGENCY_MODE=true
# This disables non-essential features
```
## Getting Help
1. **Check logs first:** Always check application logs for error details
2. **Search issues:** Look for similar issues in GitHub
3. **Collect information:**
- Error messages
- Logs
- System specs
- Steps to reproduce
4. **Create issue:** Include all relevant information in GitHub issue
### Contact Information
- **Documentation:** Check `/docs` directory
- **Issues:** GitHub Issues
- **Emergency:** Check DEPLOYMENT.md for emergency contacts