	# Troubleshooting Guide

	This guide helps diagnose and resolve common issues with MediGuard AI.

	## Table of Contents
	1. [Startup Issues](#startup-issues)
	2. [Service Connectivity](#service-connectivity)
	3. [Performance Issues](#performance-issues)
	4. [API Errors](#api-errors)
	5. [Database Issues](#database-issues)
	6. [Logging and Monitoring](#logging-and-monitoring)
	7. [Common Error Messages](#common-error-messages)
	8. [Emergency Procedures](#emergency-procedures)
	9. [Getting Help](#getting-help)

	## Startup Issues

	### Application Won't Start

	Symptoms:
	- Application exits immediately
	- Port already in use errors
	- Module import errors

	Solutions:

	1. Check port availability:
	```bash
	# Check if port 8000 is in use
	netstat -tulpn | grep 8000
	# Or on Windows
	netstat -ano | findstr 8000
	```

	2. Verify Python environment:
	```bash
	# Activate virtual environment
	source venv/bin/activate
	# On Windows
	venv\Scripts\activate

	# Check dependencies
	pip list
	```

	3. Check environment variables:
	```bash
	# Verify required variables are set
	env | grep -E "(GROQ|REDIS|OPENSEARCH)"
	```

	4. Common startup errors and fixes:

	| Error | Cause | Solution |
	|-------|-------|----------|
	| `ModuleNotFoundError` | Missing dependencies | `pip install -r requirements.txt` |
	| `Permission denied` | Ports below 1024 require elevated privileges | Use a port ≥ 1024 or run with sudo |
	| `Address already in use` | Another process is using the port | Kill that process or use a different port |
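
The port and environment checks above can also be run from Python, which behaves the same on Linux and Windows. This is a minimal sketch, not part of the project: `port_in_use` and `matching_env_vars` are hypothetical helper names, and the prefixes are copied from the `grep` pattern above because the exact variable names are not listed in this guide. Unlike `grep`, it matches variable names only, not values.

```python
import os
import socket

# Prefixes taken from the grep pattern above; the real variable names may differ.
PREFIXES = ("GROQ", "REDIS", "OPENSEARCH")


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


def matching_env_vars(environ=os.environ, prefixes=PREFIXES) -> list[str]:
    """Return the sorted names of variables whose name contains a prefix."""
    return sorted(name for name in environ if any(p in name for p in prefixes))


if __name__ == "__main__":
    print("port 8000 in use:", port_in_use(8000))
    print("matching variables:", matching_env_vars() or "none found")
```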

	### Docker Container Issues

	Symptoms:
	- Container fails to start
	- Health check failures
	- Volume mount errors

	Solutions:

	1. Check container logs:
	```bash
	docker logs mediguard-api
	docker-compose logs api
	```

	2. Verify Docker resources:
	```bash
	# Check Docker resource usage
	docker stats

	# Check disk space
	docker system df
	```

	3. Rebuild container:
	```bash
	docker-compose down
	docker-compose build --no-cache
	docker-compose up -d
	```

	## Service Connectivity

	### OpenSearch Connection Issues

	Symptoms:
	- Search requests failing
	- Connection timeout errors
	- Authentication failures

	Diagnosis:
	```bash
	# Check OpenSearch health
	curl -X GET "localhost:9200/_cluster/health?pretty"

	# Test from application
	curl http://localhost:8000/health/service/opensearch
	```
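
The `_cluster/health` response includes a `status` field that is `green`, `yellow`, or `red`. The sketch below shows one way to turn that response into a short verdict. The function names are hypothetical, and the default URL assumes the same `localhost:9200` address used above.

```python
import json
import urllib.request


def summarize_cluster_health(health: dict) -> str:
    """Turn a _cluster/health response body into a one-line verdict."""
    status = health.get("status", "unknown")
    unassigned = health.get("unassigned_shards", 0)
    if status == "green":
        return "ok: all shards allocated"
    if status == "yellow":
        # Primaries are allocated, but at least one replica is not.
        return f"degraded: {unassigned} unassigned replica shard(s)"
    if status == "red":
        # At least one primary shard is unallocated, so some data is unavailable.
        return f"failing: primary shards missing ({unassigned} unassigned)"
    return f"unknown status: {status!r}"


def fetch_cluster_health(base_url: str = "http://localhost:9200", timeout: float = 5.0) -> dict:
    """Fetch and decode the cluster health document."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)
```

Against a running cluster, call `print(summarize_cluster_health(fetch_cluster_health()))`. A `yellow` status is common on single-node development clusters, because replicas cannot be placed on the same node as their primary.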

	Solutions:

	1. Verify OpenSearch is running:
	```bash
	docker-compose ps opensearch
	docker-compose restart opensearch
	```

	2. Check network connectivity:
	```bash
	# Test connection
	telnet localhost 9200

	# Check firewall
	sudo ufw status
	```

	3. Fix authentication:
	```yaml
	# In docker-compose.yml
	environment:
	  - DISABLE_SECURITY_PLUGIN=true # For development
	```

	### Redis Connection Issues

	Symptoms:
	- Cache misses
	- Session data loss
	- Rate limiting not working

	Diagnosis:
	```bash
	# Test Redis connection
	redis-cli ping

	# Check from application
	curl http://localhost:8000/health/service/redis
	```

	Solutions:

	1. Restart Redis:
	```bash
	docker-compose restart redis
	```

	2. Clear corrupted data:
	```bash
	# WARNING: FLUSHALL deletes every key in every Redis database
	redis-cli FLUSHALL
	```

	3. Check memory limits:
	```bash
	# Show memory usage and configured limits
	redis-cli INFO memory
	```

	### Ollama/LLM Connection Issues

	Symptoms:
	- LLM requests timing out
	- Model not found errors
	- Slow responses

	Diagnosis:
	```bash
	# Check Ollama status
	curl http://localhost:11434/api/tags

	# Test model
	curl http://localhost:11434/api/generate -d '{
	  "model": "llama3.3",
	  "prompt": "Test"
	}'
	```

	Solutions:

	1. Pull required models:
	```bash
	docker-compose exec ollama ollama pull llama3.3
	```

	2. Check GPU availability:
	```bash
	nvidia-smi
	```

	3. Adjust timeouts:
	```python
	# In settings
	OLLAMA_TIMEOUT = 120 # Increase timeout
	```
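
To see how a client-side timeout interacts with a slow model, the same `/api/generate` request can be sent from Python. This is a sketch under assumptions: `build_generate_request` and `generate` are hypothetical names, not project code, and the 120-second default mirrors the `OLLAMA_TIMEOUT` value above. Setting `"stream": false` asks Ollama for a single JSON object instead of a stream of chunks.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Return the model's reply; a slow read raises TimeoutError."""
    request = build_generate_request(model, prompt)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return json.load(resp)["response"]
    except urllib.error.HTTPError as exc:
        # Raised for non-2xx answers, for example when the model is missing.
        raise RuntimeError(f"Ollama returned HTTP {exc.code}") from exc
    except urllib.error.URLError as exc:
        # Connection refused, DNS failure, or a timeout while connecting.
        raise ConnectionError(f"Ollama not reachable: {exc.reason}") from exc
```

Calling `generate("llama3.3", "Test", timeout=5)` against a cold model is a quick way to check whether timeouts come from model loading or from the network.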

	## Performance Issues

	### Slow API Responses

	Symptoms:
	- Requests taking > 5 seconds
	- Timeouts in client applications
	- High CPU usage

	Diagnosis:

	1. Check response times:
	```bash
	# Use curl with an inline timing format (no separate format file needed)
	curl -w "total: %{time_total}s\n" -o /dev/null -s http://localhost:8000/health

	# Monitor with metrics
	curl http://localhost:8000/metrics | grep http_request_duration
	```

	2. Profile the application:
	```bash
	# Use py-spy
	pip install py-spy
	py-spy top --pid <pid>
	```
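
A single timed request can be misleading, so it may help to sample the endpoint several times and look at the median and tail latency. The following is a minimal sketch using only the standard library; `time_call`, `summarize`, and `fetch` are hypothetical helpers, and the 95th percentile uses the simple nearest-rank method.

```python
import statistics
import time
import urllib.request


def time_call(fn, repeats: int = 20) -> list[float]:
    """Call fn() repeatedly and return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: list[float]) -> dict:
    """Return the median, nearest-rank 95th percentile, and maximum."""
    ordered = sorted(durations)
    # Nearest rank: ceil(0.95 * n) - 1, computed with integer arithmetic.
    p95_index = (95 * len(ordered) + 99) // 100 - 1
    return {
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }


def fetch(url: str, timeout: float = 10.0) -> None:
    """Perform one GET request and read the full body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
```

With the API running, `summarize(time_call(lambda: fetch("http://localhost:8000/health")))` gives a baseline to compare against the 5-second symptom listed above.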

	Solutions:

	1. Enable caching:
	```python
	# Add caching to expensive operations
	from src.services.cache.advanced_cache import cached

	@cached(ttl=300)
	async def expensive_operation():
	    ...
	```

	2. Optimize database queries:
	```python
	# Use optimized queries
	from src.services.opensearch.client import make_opensearch_client
	client = make_opensearch_client()
	results = client.search_bm25_optimized(query, min_score=0.5)
	```

	3. Scale horizontally:
	```bash
	# Run multiple instances
	docker-compose up -d --scale api=3
	```

	### Memory Leaks

	Symptoms:
	- Memory usage increasing over time
	- Out of memory errors
	- Container restarts

	Diagnosis:

	1. Monitor memory usage:
	```bash
	# Check container memory
	docker stats

	# Check process memory
	ps aux | grep python
	```

	2. Find memory leaks:
	```bash
	# Use memory-profiler
	pip install memory-profiler
	python -m memory_profiler script.py
	```
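
The standard library's `tracemalloc` module is another option that needs no extra install. The sketch below compares two snapshots taken around a workload and reports the source lines whose allocations changed the most. `top_allocation_growth` and the leaky workload are illustrative only. The helper starts and stops tracing itself, so it should not be used while another trace is in progress.

```python
import tracemalloc


def top_allocation_growth(workload, limit: int = 5):
    """Run workload() and return the source lines with the largest allocation change."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    workload()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # compare_to sorts by the absolute size difference, largest first.
    return after.compare_to(before, "lineno")[:limit]


if __name__ == "__main__":
    leaked = []  # stands in for a cache or list that is never cleared

    def leaky_workload():
        leaked.extend(bytearray(1024) for _ in range(1000))

    for stat in top_allocation_growth(leaky_workload):
        print(stat)
```

Each printed entry names a file and line number, which usually points at the container that keeps growing.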

	Solutions:

	1. Fix circular references:
	```python
	# Use weak references
	import weakref

	class Parent:
	    def __init__(self):
	        self.children = weakref.WeakSet()
	```

	2. Clear caches:
	```python
	# Periodically clear caches
	from src.services.cache.advanced_cache import CacheInvalidator
	await CacheInvalidator.invalidate_by_pattern("*")
	```

	3. Increase memory limits:
	```yaml
	# In docker-compose.yml
	deploy:
	  resources:
	    limits:
	      memory: 4G
	```
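
To make the weak-reference fix from step 1 concrete, here is a small self-contained demonstration. The `Child` class and the `add_child` method are hypothetical additions for illustration. The parent tracks its children through a `WeakSet`, so a child disappears from the set once nothing else holds a reference to it.

```python
import gc
import weakref


class Child:
    def __init__(self, parent):
        self.parent = parent  # strong reference from child to parent


class Parent:
    def __init__(self):
        # Weak references from parent to child: the set does not keep children alive.
        self.children = weakref.WeakSet()

    def add_child(self) -> Child:
        child = Child(self)
        self.children.add(child)
        return child


if __name__ == "__main__":
    parent = Parent()
    child = parent.add_child()
    print(len(parent.children))  # one live child
    del child                    # drop the only strong reference
    gc.collect()                 # not required on CPython, harmless elsewhere
    print(len(parent.children))  # the WeakSet entry is gone
```

Because only one direction of the parent/child link is strong, no reference cycle forms and the objects can be freed as soon as they are unused.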

	## API Errors

	### 422 Validation Errors

	Symptoms:
	- `{"detail": [...]}` with validation errors
	- Requests rejected with status 422

	Common causes:

	1. Missing required fields:
	```json
	// Wrong
	{"biomarkers": {}}

	// Right
	{"biomarkers": {"Glucose": 100}}
	```

	2. Invalid data types:
	```json
	// Wrong
	{"biomarkers": {"Glucose": "high"}}

	// Right
	{"biomarkers": {"Glucose": 150}}
	```

	3. Out-of-range values:
	```bash
	# Check the API docs for valid ranges
	curl http://localhost:8000/docs
	```
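
The first two causes above can be caught on the client before a request is sent. The sketch below is a hypothetical pre-flight check, not the server's actual validation logic: it only verifies that `biomarkers` is a non-empty object with numeric values. Range checks are left out because the valid ranges are not listed in this guide.

```python
from numbers import Real


def validate_biomarkers(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks valid."""
    biomarkers = payload.get("biomarkers")
    if not isinstance(biomarkers, dict):
        return ["'biomarkers' must be an object"]
    if not biomarkers:
        return ["'biomarkers' must contain at least one entry"]
    problems = []
    for name, value in biomarkers.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            problems.append(f"{name}: expected a number, got {type(value).__name__}")
    return problems


if __name__ == "__main__":
    print(validate_biomarkers({"biomarkers": {"Glucose": "high"}}))
```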

	### 500 Internal Server Errors

	Symptoms:
	- Generic error messages
	- Stack traces in logs

	Diagnosis:

	1. Check application logs:
	```bash
	docker-compose logs -f api | grep ERROR
	```

	2. Enable debug mode:
	```bash
	export DEBUG=true
	uvicorn src.main:app --reload
	```

	Common causes:

	| Error | Solution |
	|-------|----------|
	| Database connection lost | Restart database services |
	| External service down | Check service health endpoints |
	| Memory error | Increase memory or optimize code |
	| Configuration error | Verify environment variables |

	### 503 Service Unavailable

	Symptoms:
	- Service temporarily unavailable
	- Health check failures

	Solutions:

	1. Check service dependencies:
	```bash
	curl http://localhost:8000/health/detailed
	```

	2. Restart affected services:
	```bash
	docker-compose restart
	```

	3. Check rate limits:
	```bash
	# Check rate limit headers
	curl -I http://localhost:8000/analyze/structured
	```

	## Database Issues

	### OpenSearch Index Problems

	Symptoms:
	- Search returning no results
	- Index not found errors
	- Mapping errors

	Diagnosis:

	1. Check index status:
	```bash
	curl -X GET "localhost:9200/_cat/indices?v"
	```

	2. Verify mapping:
	```bash
	curl -X GET "localhost:9200/medical_chunks/_mapping?pretty"
	```

	Solutions:

	1. Recreate index:
	```bash
	# Delete and recreate
	curl -X DELETE "localhost:9200/medical_chunks"
	# Restart application to recreate
	```

	2. Fix mapping:
	```python
	# Update index config
	from src.services.opensearch.index_config import MEDICAL_CHUNKS_MAPPING
	client.ensure_index(MEDICAL_CHUNKS_MAPPING)
	```

	### Data Corruption

	Symptoms:
	- Inconsistent search results
	- Missing documents
	- Strange query behavior

	Solutions:

	1. Verify data integrity:
	```bash
	# Count documents
	curl -X GET "localhost:9200/medical_chunks/_count"
	```

	2. Reindex data:
	```python
	# Use indexing service
	from src.services.indexing.service import IndexingService
	service = IndexingService()
	await service.reindex_all()
	```

	## Logging and Monitoring

	### Enable Debug Logging

	1. Set log level:
	```bash
	export LOG_LEVEL=DEBUG
	export LOG_TO_FILE=true
	```

	2. View logs:
	```bash
	# Real-time logs
	tail -f data/logs/mediguard.log

	# Filter by level
	grep "ERROR" data/logs/mediguard.log
	```

	### Monitor Metrics

	1. Check Prometheus metrics:
	```bash
	curl http://localhost:8000/metrics | grep http_
	```

	2. View Grafana dashboard:
	- Navigate to http://localhost:3000
	- Import `monitoring/grafana-dashboard.json`

	### Performance Profiling

	1. Enable profiling:
	```python
	# Add to main.py (assumes `app` is the existing FastAPI instance)
	from fastapi import Request
	from pyinstrument import Profiler

	@app.middleware("http")
	async def profile_requests(request: Request, call_next):
	    # Profile each request and print the report to stdout
	    profiler = Profiler()
	    profiler.start()
	    response = await call_next(request)
	    profiler.stop()
	    print(profiler.output_text(unicode=True, color=True))
	    return response
	```

	## Common Error Messages

	### "Service unavailable" in logs

	Meaning: A required service (OpenSearch, Redis, etc.) is not responding.

	Fix:
	1. Check service status: `docker-compose ps`
	2. Restart service: `docker-compose restart <service>`
	3. Check logs: `docker-compose logs <service>`

	### "Rate limit exceeded"

	Meaning: Too many requests from a client.

	Fix:
	1. Wait and retry
	2. Check `Retry-After` header
	3. Implement client-side rate limiting
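
The three steps above can be combined into a small retry helper. This is a sketch under assumptions: it treats HTTP 429 as the rate-limit status, which is the usual convention but is not confirmed by this guide, and `send` is a hypothetical callable that returns `(status_code, headers, body)`. `Retry-After` may also carry an HTTP date, and this sketch only handles the seconds form.

```python
import time


def call_with_retry(send, max_attempts: int = 5, default_wait: float = 1.0, sleep=time.sleep):
    """Call send() until it stops returning HTTP 429 or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429 or attempt == max_attempts:
            return status, headers, body
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            # The server said how many seconds to wait.
            wait = float(retry_after)
        else:
            # No usable header: fall back to exponential backoff.
            wait = default_wait * 2 ** (attempt - 1)
        sleep(wait)


if __name__ == "__main__":
    replies = iter([(429, {"Retry-After": "1"}, b""), (200, {}, b"ok")])
    print(call_with_retry(lambda: next(replies)))
```

The `sleep` parameter exists so the waiting behavior can be replaced in tests without slowing them down.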

	### "Invalid token" or "Authentication failed"

	Meaning: Invalid API key or token.

	Fix:
	1. Verify API key is correct
	2. Check token hasn't expired
	3. Ensure proper header format: `Authorization: Bearer <token>`

	### "Query too large" or "Request entity too large"

	Meaning: Request exceeds size limits.

	Fix:
	1. Reduce request size
	2. Use pagination
	3. Increase limits in configuration

	### "Connection pool exhausted"

	Meaning: Too many concurrent database connections.

	Fix:
	1. Increase pool size
	2. Add connection timeout
	3. Implement request queuing

	## Emergency Procedures

	### Full System Recovery

	```bash
	# 1. Stop all services
	docker-compose down

	# 2. Clear corrupted data (WARNING: This deletes data!)
	docker volume rm agentic-ragbot_opensearch_data
	docker volume rm agentic-ragbot_redis_data

	# 3. Restart with fresh data
	docker-compose up -d

	# 4. Wait for services to be ready
	sleep 30

	# 5. Verify health
	curl http://localhost:8000/health/detailed
	```
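
A fixed `sleep 30` can be too short on a slow machine and needlessly long on a fast one. As an alternative, the sketch below polls the health endpoint until it answers or a deadline passes. `http_ok` and `wait_until_healthy` are hypothetical helpers, and the sketch assumes the endpoint returns a 2xx status once the system is healthy.

```python
import time
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers URLError, HTTPError, refused connections, and timeouts.
        return False


def wait_until_healthy(probe, deadline_s: float = 120.0, interval_s: float = 2.0,
                       clock=time.monotonic, sleep=time.sleep) -> bool:
    """Poll probe() until it returns True or deadline_s seconds elapse."""
    give_up_at = clock() + deadline_s
    while True:
        if probe():
            return True
        if clock() >= give_up_at:
            return False
        sleep(interval_s)
```

After `docker-compose up -d`, call `wait_until_healthy(lambda: http_ok("http://localhost:8000/health/detailed"))` and continue only if it returns `True`.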

	### Backup and Restore

	```bash
	# Backup OpenSearch
	curl -X POST "localhost:9200/_snapshot/backup/snapshot_1"

	# Backup Redis
	docker-compose exec redis redis-cli BGSAVE

	# Restore from backup
	# See DEPLOYMENT.md for detailed instructions
	```

	### Performance Emergency

	```bash
	# 1. Scale up services
	docker-compose up -d --scale api=5

	# 2. Clear all caches
	curl -X DELETE http://localhost:8000/admin/cache/clear

	# 3. Enable emergency mode
	export EMERGENCY_MODE=true
	# This disables non-essential features
	```

	## Getting Help

	1. Check logs first: always check the application logs for error details.
	2. Search issues: look for similar issues on GitHub.
	3. Collect information:
	   - Error messages
	   - Logs
	   - System specs
	   - Steps to reproduce
	4. Create an issue: include all relevant information in the GitHub issue.

	### Contact Information

	- Documentation: Check `/docs` directory
	- Issues: GitHub Issues
	- Emergency: Check DEPLOYMENT.md for emergency contacts