# API Enhancements - Logging, Profiling & Error Handling

This document describes the enhancements made to the YLFF API endpoints for robust logging, profiling, and error handling.

## Overview

All API endpoints have been enhanced with:

- **Comprehensive logging** with structured data
- **Request/response tracking** with unique request IDs
- **Error handling** with detailed error information
- **Profiling integration** for performance monitoring
- **Timing information** for all operations
- **Structured error responses** with error types and details

## Components

### 1. Request Logging Middleware

A custom middleware (`RequestLoggingMiddleware`) logs all HTTP requests and responses:

- Generates unique request IDs for tracking
- Logs request start (method, path, client IP, query params)
- Logs response completion (status code, duration)
- Adds the request ID to response headers
- Handles exceptions and logs errors

### 2. Enhanced Error Handling

#### Exception Handlers

1. **ValidationError handler**: catches Pydantic validation errors
   - Returns a 422 status code
   - Includes detailed validation error messages
   - Logs validation failures
2. **General exception handler**: catches all unhandled exceptions
   - Returns a 500 status code
   - Logs the full exception traceback
   - Returns a structured error response with the request ID

#### Error Types Handled

- `FileNotFoundError` → 404 with a descriptive message
- `PermissionError` → 403 with a descriptive message
- `ValueError` → 400 with validation details
- `HTTPException` → respects FastAPI HTTP exceptions
- `Exception` → 500 with a structured error response
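As an illustration, the exception-to-status mapping above could be centralized in a small helper. This is a minimal sketch — the names `ERROR_STATUS_MAP` and `status_for` are hypothetical, not the actual implementation:

```python
# Illustrative mapping from exception types to HTTP status codes,
# mirroring the table above. Names here are hypothetical.
ERROR_STATUS_MAP = {
    FileNotFoundError: 404,
    PermissionError: 403,
    ValueError: 400,
}

def status_for(exc: BaseException) -> int:
    """Return the HTTP status code for an exception, defaulting to 500."""
    for exc_type, status in ERROR_STATUS_MAP.items():
        if isinstance(exc, exc_type):
            return status
    return 500
```

Using `isinstance` rather than exact type comparison lets subclasses of a mapped exception inherit its status code.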
### 3. Enhanced CLI Command Execution

The `run_cli_command` function now includes:

- **Comprehensive logging**: logs command start, completion, and failures
- **Execution timing**: tracks the duration of all commands
- **Error classification**: identifies error types (exit codes, `KeyboardInterrupt`, exceptions)
- **Traceback capture**: captures full stack traces for debugging
- **Output capture**: captures stdout/stderr with length tracking

### 4. Background Task Enhancement

All background tasks (validation, training, etc.) now include:

- **Pre-execution validation**: validates input paths and parameters
- **Structured logging**: logs job start, progress, and completion
- **Error context**: captures the error type, message, and traceback
- **Job metadata**: tracks duration, timestamps, and request parameters
- **Profiling integration**: automatic profiling context for long-running tasks

### 5. Request ID Tracking

Every request gets a unique request ID:

- Generated automatically if not provided in the `X-Request-ID` header
- Included in all log entries
- Added to response headers
- Used for correlating logs across distributed systems

## Logging Structure

### Log Levels

- **INFO**: normal operations, request/response logging, job status
- **WARNING**: validation errors, HTTP errors, non-fatal issues
- **ERROR**: exceptions, failures, critical errors
- **DEBUG**: detailed debugging information

### Structured Logging

All logs use structured data via the `extra` parameter:

```python
logger.info(
    "Message",
    extra={
        "request_id": "req_123",
        "job_id": "job_456",
        "duration_ms": 1234.5,
        "status_code": 200,
        # ... more context
    }
)
```

## Example Enhanced Endpoint

### Before

```python
@app.post("/api/v1/validate/sequence")
async def validate_sequence(request: ValidateSequenceRequest):
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued"}
    executor.submit(run_validation)
    return {"job_id": job_id}
```

### After

```python
@app.post("/api/v1/validate/sequence", response_model=JobResponse)
async def validate_sequence(
    request: ValidateSequenceRequest,
    background_tasks: BackgroundTasks,
    fastapi_request: Request
):
    request_id = fastapi_request.headers.get('X-Request-ID', 'unknown')
    job_id = str(uuid.uuid4())

    logger.info(
        "Received sequence validation request",
        extra={"request_id": request_id, "job_id": job_id, ...}
    )

    # Validate input
    seq_path = Path(request.sequence_dir)
    if not seq_path.exists():
        logger.warning(...)
        raise HTTPException(status_code=400, detail=...)

    jobs[job_id] = {
        "status": "queued",
        "request_id": request_id,
        "created_at": time.time(),
        "request_params": {...}
    }

    try:
        executor.submit(run_validation)
        logger.info("Job queued successfully", ...)
        return JobResponse(job_id=job_id, status="queued", ...)
    except Exception as e:
        logger.error("Failed to queue job", ...)
        raise HTTPException(status_code=500, detail=...)
```

## Background Task Function Enhancement

### Before

```python
def run_validation():
    try:
        result = run_cli_command(...)
        jobs[job_id]["status"] = "completed" if result["success"] else "failed"
    except Exception as e:
        jobs[job_id]["status"] = "failed"
```

### After

```python
def run_validation():
    logger.info(f"Starting validation job: {job_id}", ...)
    try:
        # Pre-validation
        if not seq_path.exists():
            raise FileNotFoundError(...)

        # Execute with profiling
        with profile_context(...):
            result = run_cli_command(...)

        # Update job with metadata
        jobs[job_id]["duration"] = result.get("duration")
        jobs[job_id]["completed_at"] = time.time()

        if result["success"]:
            logger.info("Job completed successfully", ...)
            jobs[job_id]["status"] = "completed"
        else:
            logger.error("Job failed", ...)
            jobs[job_id]["status"] = "failed"
    except FileNotFoundError as e:
        logger.error("File not found", exc_info=True)
        jobs[job_id]["status"] = "failed"
        jobs[job_id]["result"] = {"error": str(e), "error_type": "FileNotFoundError"}
    except Exception as e:
        logger.error("Unexpected error", exc_info=True)
        jobs[job_id]["status"] = "failed"
        jobs[job_id]["result"] = {
            "error": str(e),
            "error_type": type(e).__name__,
            "traceback": traceback.format_exc()
        }
```

## Error Response Format

All errors return structured JSON:

```json
{
  "error": "ErrorType",
  "message": "Human-readable message",
  "request_id": "req_123",
  "details": {...}
}
```

The `details` field is optional and carries additional context when available.

### Error Types

- `ValidationError`: Pydantic validation failures (422)
- `FileNotFoundError`: missing files/directories (404)
- `PermissionError`: access denied (403)
- `InternalServerError`: unexpected errors (500)

## Profiling Integration

### Automatic Profiling

When the profiler is enabled, the following are profiled automatically:

- API endpoint execution
- Background task execution
- CLI command execution

### Manual Profiling

Use `profile_context` for custom profiling:

```python
with profile_context(stage="validation", job_id=job_id):
    result = run_validation()
```

## Benefits

1. **Debugging**: full tracebacks and context in logs
2. **Monitoring**: request IDs enable log correlation
3. **Performance**: timing information for all operations
4. **Reliability**: comprehensive error handling prevents crashes
5. **Observability**: structured logs enable better analysis
6. **User Experience**: clear, actionable error messages

## Usage

### Viewing Logs

Logs are written to stdout/stderr and can be:

- Viewed in RunPod logs
- Collected by log aggregation services
- Filtered by `request_id` for debugging

### Request ID

Include an `X-Request-ID` header for custom request tracking:

```bash
curl -H "X-Request-ID: my-custom-id" https://api.example.com/health
```

### Error Handling

All errors are logged with full context, so you can:

1. Find the `request_id` in the error response
2. Search the logs for that `request_id`
3. See the full execution trace and error details

## Future Enhancements

- [ ] Add rate limiting with logging
- [ ] Add request/response size limits
- [ ] Add metrics export (Prometheus)
- [ ] Add distributed tracing support
- [ ] Add structured error codes
- [ ] Add retry logic with exponential backoff
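The retry item in the roadmap above could build on a standard exponential-backoff pattern. The following is a minimal, self-contained sketch — the function name, defaults, and jitter factor are illustrative and not part of the current API:

```python
import random
import time

def retry_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Call func(), retrying on failure with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: propagate the last error
            # Double the delay each attempt, capped at max_delay, plus jitter
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

In this API, such a helper would wrap transient operations (for example, queueing a job), with each failure still logged under the request ID as described above.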