File size: 8,235 Bytes
7a87926 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 |
# API Enhancements - Logging, Profiling & Error Handling
This document describes the comprehensive enhancements made to the YLFF API endpoints for robust logging, profiling, and error handling.
## Overview
All API endpoints have been enhanced with:
- **Comprehensive logging** with structured data
- **Request/response tracking** with unique request IDs
- **Error handling** with detailed error information
- **Profiling integration** for performance monitoring
- **Timing information** for all operations
- **Structured error responses** with error types and details
## Components
### 1. Request Logging Middleware
A custom middleware (`RequestLoggingMiddleware`) logs all HTTP requests and responses:
- Generates unique request IDs for tracking
- Logs request start (method, path, client IP, query params)
- Logs response completion (status code, duration)
- Adds request ID to response headers
- Handles exceptions and logs errors
### 2. Enhanced Error Handling
#### Exception Handlers
1. **ValidationError Handler**: Catches Pydantic validation errors
- Returns 422 status code
- Includes detailed validation error messages
- Logs validation failures
2. **General Exception Handler**: Catches all unhandled exceptions
- Returns 500 status code
- Logs full exception traceback
- Returns structured error response with request ID
#### Error Types Handled
- `FileNotFoundError` β 404 with descriptive message
- `PermissionError` β 403 with descriptive message
- `ValueError` β 400 with validation details
- `HTTPException` β Respects FastAPI HTTP exceptions
- `Exception` β 500 with structured error response
### 3. Enhanced CLI Command Execution
The `run_cli_command` function now includes:
- **Comprehensive logging**: Logs command start, completion, and failures
- **Execution timing**: Tracks duration of all commands
- **Error classification**: Identifies error types (Exit codes, KeyboardInterrupt, Exceptions)
- **Traceback capture**: Captures full stack traces for debugging
- **Output capture**: Captures stdout/stderr with length tracking
### 4. Background Task Enhancement
All background tasks (validation, training, etc.) now include:
- **Pre-execution validation**: Validates input paths and parameters
- **Structured logging**: Logs job start, progress, and completion
- **Error context**: Captures error type, message, and traceback
- **Job metadata**: Tracks duration, timestamps, and request parameters
- **Profiling integration**: Automatic profiling context for long-running tasks
### 5. Request ID Tracking
Every request gets a unique request ID:
- Generated automatically if not provided in `X-Request-ID` header
- Included in all log entries
- Added to response headers
- Used for correlating logs across distributed systems
## Logging Structure
### Log Levels
- **INFO**: Normal operations, request/response logging, job status
- **WARNING**: Validation errors, HTTP errors, non-fatal issues
- **ERROR**: Exceptions, failures, critical errors
- **DEBUG**: Detailed debugging information
### Structured Logging
All logs use structured data with `extra` parameter:
```python
logger.info(
"Message",
extra={
"request_id": "req_123",
"job_id": "job_456",
"duration_ms": 1234.5,
"status_code": 200,
# ... more context
}
)
```
## Example Enhanced Endpoint
### Before
```python
@app.post("/api/v1/validate/sequence")
async def validate_sequence(request: ValidateSequenceRequest):
job_id = str(uuid.uuid4())
jobs[job_id] = {"status": "queued"}
executor.submit(run_validation)
return {"job_id": job_id}
```
### After
```python
@app.post("/api/v1/validate/sequence", response_model=JobResponse)
async def validate_sequence(
request: ValidateSequenceRequest,
background_tasks: BackgroundTasks,
fastapi_request: Request
):
request_id = fastapi_request.headers.get('X-Request-ID', 'unknown')
job_id = str(uuid.uuid4())
logger.info(
f"Received sequence validation request",
extra={"request_id": request_id, "job_id": job_id, ...}
)
# Validate input
seq_path = Path(request.sequence_dir)
if not seq_path.exists():
logger.warning(...)
raise HTTPException(status_code=400, detail=...)
jobs[job_id] = {
"status": "queued",
"request_id": request_id,
"created_at": time.time(),
"request_params": {...}
}
try:
executor.submit(run_validation)
logger.info("Job queued successfully", ...)
return JobResponse(job_id=job_id, status="queued", ...)
except Exception as e:
logger.error("Failed to queue job", ...)
raise HTTPException(status_code=500, detail=...)
```
## Background Task Function Enhancement
### Before
```python
def run_validation():
try:
result = run_cli_command(...)
jobs[job_id]["status"] = "completed" if result["success"] else "failed"
except Exception as e:
jobs[job_id]["status"] = "failed"
```
### After
```python
def run_validation():
logger.info(f"Starting validation job: {job_id}", ...)
try:
# Pre-validation
if not seq_path.exists():
raise FileNotFoundError(...)
# Execute with profiling
with profile_context(...):
result = run_cli_command(...)
# Update job with metadata
jobs[job_id]["duration"] = result.get("duration")
jobs[job_id]["completed_at"] = time.time()
if result["success"]:
logger.info("Job completed successfully", ...)
jobs[job_id]["status"] = "completed"
else:
logger.error("Job failed", ...)
jobs[job_id]["status"] = "failed"
except FileNotFoundError as e:
logger.error("File not found", exc_info=True)
jobs[job_id]["status"] = "failed"
jobs[job_id]["result"] = {"error": str(e), "error_type": "FileNotFoundError"}
except Exception as e:
logger.error("Unexpected error", exc_info=True)
jobs[job_id]["status"] = "failed"
jobs[job_id]["result"] = {
"error": str(e),
"error_type": type(e).__name__,
"traceback": traceback.format_exc()
}
```
## Error Response Format
All errors return structured JSON:
```json
{
"error": "ErrorType",
"message": "Human-readable message",
"request_id": "req_123",
"details": {...} // Optional additional details
}
```
### Error Types
- `ValidationError`: Pydantic validation failures (422)
- `FileNotFoundError`: Missing files/directories (404)
- `PermissionError`: Access denied (403)
- `InternalServerError`: Unexpected errors (500)
## Profiling Integration
### Automatic Profiling
Endpoints automatically profile when profiler is enabled:
- API endpoint execution
- Background task execution
- CLI command execution
### Manual Profiling
Use `profile_context` for custom profiling:
```python
with profile_context(stage="validation", job_id=job_id):
result = run_validation()
```
## Benefits
1. **Debugging**: Full tracebacks and context in logs
2. **Monitoring**: Request IDs enable log correlation
3. **Performance**: Timing information for all operations
4. **Reliability**: Comprehensive error handling prevents crashes
5. **Observability**: Structured logs enable better analysis
6. **User Experience**: Clear, actionable error messages
## Usage
### Viewing Logs
Logs are output to stdout/stderr and can be:
- Viewed in RunPod logs
- Collected by log aggregation services
- Filtered by request_id for debugging
### Request ID
Include `X-Request-ID` header for custom request tracking:
```bash
curl -H "X-Request-ID: my-custom-id" https://api.example.com/health
```
### Error Handling
All errors are logged with full context, so you can:
1. Find the request_id from the error response
2. Search logs for that request_id
3. See the full execution trace and error details
## Future Enhancements
- [ ] Add rate limiting with logging
- [ ] Add request/response size limits
- [ ] Add metrics export (Prometheus)
- [ ] Add distributed tracing support
- [ ] Add structured error codes
- [ ] Add retry logic with exponential backoff
|