# API Enhancements - Logging, Profiling & Error Handling
This document describes the comprehensive enhancements made to the YLFF API endpoints for robust logging, profiling, and error handling.
## Overview
All API endpoints have been enhanced with:
- Comprehensive logging with structured data
- Request/response tracking with unique request IDs
- Error handling with detailed error information
- Profiling integration for performance monitoring
- Timing information for all operations
- Structured error responses with error types and details
## Components
### 1. Request Logging Middleware

A custom middleware (`RequestLoggingMiddleware`) logs all HTTP requests and responses:
- Generates unique request IDs for tracking
- Logs request start (method, path, client IP, query params)
- Logs response completion (status code, duration)
- Adds request ID to response headers
- Handles exceptions and logs errors
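In outline, such middleware can be written as a plain ASGI wrapper. The sketch below is illustrative, not the project's actual implementation; the logger name and request-ID format are assumptions:

```python
import logging
import time
import uuid

logger = logging.getLogger("api")  # assumed logger name


class RequestLoggingMiddleware:
    """ASGI middleware: log each request/response pair and tag it with a request ID."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        # Reuse the caller's X-Request-ID if present, otherwise generate one
        headers = {k.decode(): v.decode() for k, v in scope.get("headers", [])}
        request_id = headers.get("x-request-id", f"req_{uuid.uuid4().hex[:8]}")
        start = time.perf_counter()

        logger.info(
            "Request started",
            extra={"request_id": request_id, "method": scope["method"],
                   "path": scope["path"]},
        )

        async def send_wrapper(message):
            if message["type"] == "http.response.start":
                # Echo the request ID back to the client and log the outcome
                message.setdefault("headers", []).append(
                    (b"x-request-id", request_id.encode())
                )
                duration_ms = (time.perf_counter() - start) * 1000
                logger.info(
                    "Request completed",
                    extra={"request_id": request_id,
                           "status_code": message["status"],
                           "duration_ms": round(duration_ms, 2)},
                )
            await send(message)

        try:
            await self.app(scope, receive, send_wrapper)
        except Exception:
            logger.error("Request failed", extra={"request_id": request_id},
                         exc_info=True)
            raise
```

Because it is plain ASGI middleware, it can be registered with `app.add_middleware(RequestLoggingMiddleware)`.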
### 2. Enhanced Error Handling
#### Exception Handlers

**ValidationError Handler**: Catches Pydantic validation errors
- Returns 422 status code
- Includes detailed validation error messages
- Logs validation failures
**General Exception Handler**: Catches all unhandled exceptions
- Returns 500 status code
- Logs full exception traceback
- Returns structured error response with request ID
#### Error Types Handled

- `FileNotFoundError` → 404 with descriptive message
- `PermissionError` → 403 with descriptive message
- `ValueError` → 400 with validation details
- `HTTPException` → respects FastAPI HTTP exceptions
- `Exception` → 500 with structured error response
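This mapping can be sketched as a small helper that the exception handlers share. `classify_error` and `STATUS_BY_EXCEPTION` are hypothetical names; `HTTPException` is omitted here because FastAPI handles it natively:

```python
# Illustrative mapping from exception type to HTTP status code
STATUS_BY_EXCEPTION = {
    FileNotFoundError: 404,
    PermissionError: 403,
    ValueError: 400,
}


def classify_error(exc: Exception, request_id: str) -> tuple[int, dict]:
    """Map an exception to (status_code, structured error body)."""
    status = 500
    for exc_type, code in STATUS_BY_EXCEPTION.items():
        if isinstance(exc, exc_type):
            status = code
            break
    body = {
        # Hide internal exception names behind a generic label for 500s
        "error": type(exc).__name__ if status != 500 else "InternalServerError",
        "message": str(exc),
        "request_id": request_id,
    }
    return status, body
```

The handlers then only need to wrap the returned body in a `JSONResponse` with the returned status code.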
### 3. Enhanced CLI Command Execution

The `run_cli_command` function now includes:
- Comprehensive logging: Logs command start, completion, and failures
- Execution timing: Tracks duration of all commands
- Error classification: Identifies error types (Exit codes, KeyboardInterrupt, Exceptions)
- Traceback capture: Captures full stack traces for debugging
- Output capture: Captures stdout/stderr with length tracking
### 4. Background Task Enhancement
All background tasks (validation, training, etc.) now include:
- Pre-execution validation: Validates input paths and parameters
- Structured logging: Logs job start, progress, and completion
- Error context: Captures error type, message, and traceback
- Job metadata: Tracks duration, timestamps, and request parameters
- Profiling integration: Automatic profiling context for long-running tasks
### 5. Request ID Tracking

Every request gets a unique request ID:

- Generated automatically if not provided in the `X-Request-ID` header
- Included in all log entries
- Added to response headers
- Used for correlating logs across distributed systems
## Logging Structure
### Log Levels
- INFO: Normal operations, request/response logging, job status
- WARNING: Validation errors, HTTP errors, non-fatal issues
- ERROR: Exceptions, failures, critical errors
- DEBUG: Detailed debugging information
### Structured Logging
All logs pass structured data through the `extra` parameter:

```python
logger.info(
    "Message",
    extra={
        "request_id": "req_123",
        "job_id": "job_456",
        "duration_ms": 1234.5,
        "status_code": 200,
        # ... more context
    }
)
```
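Note that fields passed via `extra` only appear in the output if the formatter emits them. A minimal JSON formatter that copies `extra` fields through (illustrative, not the project's actual logging configuration):

```python
import json
import logging

# Attribute names every LogRecord carries by default; anything else came from `extra`
STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None)))


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, including any `extra` fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        for key, value in record.__dict__.items():
            if key not in STANDARD_ATTRS and key != "message":
                payload[key] = value
        return json.dumps(payload)
```

Attached to a handler, this makes `request_id`, `duration_ms`, etc. searchable in the log stream.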
## Example Enhanced Endpoint
### Before

```python
@app.post("/api/v1/validate/sequence")
async def validate_sequence(request: ValidateSequenceRequest):
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued"}
    executor.submit(run_validation)
    return {"job_id": job_id}
```
### After

```python
@app.post("/api/v1/validate/sequence", response_model=JobResponse)
async def validate_sequence(
    request: ValidateSequenceRequest,
    background_tasks: BackgroundTasks,
    fastapi_request: Request
):
    request_id = fastapi_request.headers.get('X-Request-ID', 'unknown')
    job_id = str(uuid.uuid4())

    logger.info(
        "Received sequence validation request",
        extra={"request_id": request_id, "job_id": job_id, ...}
    )

    # Validate input
    seq_path = Path(request.sequence_dir)
    if not seq_path.exists():
        logger.warning(...)
        raise HTTPException(status_code=400, detail=...)

    jobs[job_id] = {
        "status": "queued",
        "request_id": request_id,
        "created_at": time.time(),
        "request_params": {...}
    }

    try:
        executor.submit(run_validation)
        logger.info("Job queued successfully", ...)
        return JobResponse(job_id=job_id, status="queued", ...)
    except Exception as e:
        logger.error("Failed to queue job", ...)
        raise HTTPException(status_code=500, detail=...)
```
## Background Task Function Enhancement

### Before

```python
def run_validation():
    try:
        result = run_cli_command(...)
        jobs[job_id]["status"] = "completed" if result["success"] else "failed"
    except Exception as e:
        jobs[job_id]["status"] = "failed"
```
### After

```python
def run_validation():
    logger.info(f"Starting validation job: {job_id}", ...)
    try:
        # Pre-validation
        if not seq_path.exists():
            raise FileNotFoundError(...)

        # Execute with profiling
        with profile_context(...):
            result = run_cli_command(...)

        # Update job with metadata
        jobs[job_id]["duration"] = result.get("duration")
        jobs[job_id]["completed_at"] = time.time()

        if result["success"]:
            logger.info("Job completed successfully", ...)
            jobs[job_id]["status"] = "completed"
        else:
            logger.error("Job failed", ...)
            jobs[job_id]["status"] = "failed"
    except FileNotFoundError as e:
        logger.error("File not found", exc_info=True)
        jobs[job_id]["status"] = "failed"
        jobs[job_id]["result"] = {"error": str(e), "error_type": "FileNotFoundError"}
    except Exception as e:
        logger.error("Unexpected error", exc_info=True)
        jobs[job_id]["status"] = "failed"
        jobs[job_id]["result"] = {
            "error": str(e),
            "error_type": type(e).__name__,
            "traceback": traceback.format_exc()
        }
```
## Error Response Format

All errors return structured JSON:

```json
{
  "error": "ErrorType",
  "message": "Human-readable message",
  "request_id": "req_123",
  "details": {...}  // Optional additional details
}
```
### Error Types

- `ValidationError`: Pydantic validation failures (422)
- `FileNotFoundError`: Missing files/directories (404)
- `PermissionError`: Access denied (403)
- `InternalServerError`: Unexpected errors (500)
## Profiling Integration

### Automatic Profiling

Endpoints are profiled automatically when the profiler is enabled:
- API endpoint execution
- Background task execution
- CLI command execution
### Manual Profiling

Use `profile_context` for custom profiling:

```python
with profile_context(stage="validation", job_id=job_id):
    result = run_validation()
```
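The project's `profile_context` presumably hooks into the real profiler; a minimal stand-in that only records wall-clock timing and logs it with context looks like:

```python
import contextlib
import logging
import time

logger = logging.getLogger("api")  # assumed logger name


@contextlib.contextmanager
def profile_context(stage: str, **context):
    """Time a named stage and log its duration with any extra context."""
    start = time.perf_counter()
    logger.info("Stage started", extra={"stage": stage, **context})
    try:
        yield
    finally:
        # Log even when the stage raises, so failed runs still get timing data
        duration_ms = (time.perf_counter() - start) * 1000
        logger.info(
            "Stage finished",
            extra={"stage": stage, "duration_ms": round(duration_ms, 2), **context},
        )
```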
## Benefits
- Debugging: Full tracebacks and context in logs
- Monitoring: Request IDs enable log correlation
- Performance: Timing information for all operations
- Reliability: Comprehensive error handling prevents crashes
- Observability: Structured logs enable better analysis
- User Experience: Clear, actionable error messages
## Usage

### Viewing Logs
Logs are output to stdout/stderr and can be:
- Viewed in RunPod logs
- Collected by log aggregation services
- Filtered by `request_id` for debugging
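For example, assuming JSON-structured log lines, a plain `grep` is enough to pull out one request (sample lines written to a temporary file for illustration):

```shell
# Write two sample structured log lines, then filter by request_id
printf '%s\n' \
  '{"level": "INFO", "request_id": "req_123", "message": "Request started"}' \
  '{"level": "INFO", "request_id": "req_456", "message": "Request started"}' \
  > /tmp/api.log
grep '"request_id": "req_123"' /tmp/api.log
```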
### Request ID

Include an `X-Request-ID` header for custom request tracking:

```shell
curl -H "X-Request-ID: my-custom-id" https://api.example.com/health
```
### Error Handling
All errors are logged with full context, so you can:

- Find the `request_id` in the error response
- Search the logs for that `request_id`
- See the full execution trace and error details
## Future Enhancements
- Add rate limiting with logging
- Add request/response size limits
- Add metrics export (Prometheus)
- Add distributed tracing support
- Add structured error codes
- Add retry logic with exponential backoff