# API Enhancements - Logging, Profiling & Error Handling
This document describes the comprehensive enhancements made to the YLFF API endpoints for robust logging, profiling, and error handling.
## Overview
All API endpoints have been enhanced with:
- **Comprehensive logging** with structured data
- **Request/response tracking** with unique request IDs
- **Error handling** with detailed error information
- **Profiling integration** for performance monitoring
- **Timing information** for all operations
- **Structured error responses** with error types and details
## Components
### 1. Request Logging Middleware
A custom middleware (`RequestLoggingMiddleware`) logs all HTTP requests and responses:
- Generates unique request IDs for tracking
- Logs request start (method, path, client IP, query params)
- Logs response completion (status code, duration)
- Adds request ID to response headers
- Handles exceptions and logs errors
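The middleware itself is project code; as a rough illustration of the behavior listed above, a minimal pure-ASGI sketch (the class name matches the text, everything else is assumed) might look like:

```python
import logging
import time
import uuid

logger = logging.getLogger("api")

class RequestLoggingMiddleware:
    """Pure-ASGI sketch: log request start/end and tag responses with X-Request-ID."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        headers = {k.decode(): v.decode() for k, v in scope.get("headers", [])}
        # Reuse the caller's ID if provided, otherwise generate one
        request_id = headers.get("x-request-id", f"req_{uuid.uuid4().hex[:8]}")
        start = time.perf_counter()
        logger.info(
            "Request started",
            extra={"request_id": request_id, "method": scope["method"],
                   "path": scope["path"], "client": scope.get("client")},
        )

        async def send_with_id(message):
            if message["type"] == "http.response.start":
                # Add the request ID to the outgoing response headers
                message.setdefault("headers", []).append(
                    (b"x-request-id", request_id.encode()))
                logger.info(
                    "Request completed",
                    extra={"request_id": request_id,
                           "status_code": message["status"],
                           "duration_ms": (time.perf_counter() - start) * 1000},
                )
            await send(message)

        try:
            await self.app(scope, receive, send_with_id)
        except Exception:
            logger.error("Unhandled exception",
                         extra={"request_id": request_id}, exc_info=True)
            raise
```

In FastAPI this kind of middleware is registered with `app.add_middleware(RequestLoggingMiddleware)`.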
### 2. Enhanced Error Handling
#### Exception Handlers
1. **ValidationError Handler**: Catches Pydantic validation errors
- Returns 422 status code
- Includes detailed validation error messages
- Logs validation failures
2. **General Exception Handler**: Catches all unhandled exceptions
- Returns 500 status code
- Logs full exception traceback
- Returns structured error response with request ID
#### Error Types Handled
- `FileNotFoundError` β†’ 404 with descriptive message
- `PermissionError` β†’ 403 with descriptive message
- `ValueError` β†’ 400 with validation details
- `HTTPException` β†’ Respects FastAPI HTTP exceptions
- `Exception` β†’ 500 with structured error response
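As a sketch, the exception-to-status mapping above can be captured in one helper. The function name and payload shape here are illustrative (the payload mirrors the error response format shown later in this document), not the project's actual code; `HTTPException` is not modeled because FastAPI already handles it.

```python
# Illustrative mapping from exception type to HTTP status code
STATUS_BY_ERROR = {
    "ValidationError": 422,
    "FileNotFoundError": 404,
    "PermissionError": 403,
    "ValueError": 400,
}

def error_response(exc, request_id):
    """Map an exception to (status_code, structured error body)."""
    name = type(exc).__name__
    status = STATUS_BY_ERROR.get(name, 500)
    return status, {
        # Unexpected errors are reported generically to avoid leaking internals
        "error": name if status != 500 else "InternalServerError",
        "message": str(exc),
        "request_id": request_id,
    }
```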
### 3. Enhanced CLI Command Execution
The `run_cli_command` function now includes:
- **Comprehensive logging**: Logs command start, completion, and failures
- **Execution timing**: Tracks duration of all commands
- **Error classification**: Distinguishes non-zero exit codes, `KeyboardInterrupt`, and unexpected exceptions
- **Traceback capture**: Captures full stack traces for debugging
- **Output capture**: Captures stdout/stderr with length tracking
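A condensed sketch of that shape, assuming `subprocess` under the hood (the real `run_cli_command` wraps the project CLI; the result-dict keys mirror the ones referenced elsewhere in this document, e.g. `success` and `duration`):

```python
import logging
import subprocess
import time
import traceback

logger = logging.getLogger("api")

def run_cli_command(cmd, job_id=None):
    """Run a command, capturing output, timing, and error context (sketch)."""
    start = time.perf_counter()
    logger.info("Command started", extra={"job_id": job_id, "cmd": cmd})
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        result = {
            "success": proc.returncode == 0,
            "exit_code": proc.returncode,
            "duration": time.perf_counter() - start,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
            # Length tracking keeps log entries small even for huge outputs
            "stdout_len": len(proc.stdout),
            "stderr_len": len(proc.stderr),
        }
        log = logger.info if result["success"] else logger.error
        log("Command finished",
            extra={"job_id": job_id, "exit_code": proc.returncode,
                   "duration": result["duration"]})
        return result
    except KeyboardInterrupt:
        # Not a subclass of Exception, so it needs its own branch
        logger.error("Command interrupted", extra={"job_id": job_id})
        raise
    except Exception as e:
        logger.error("Command crashed", extra={"job_id": job_id}, exc_info=True)
        return {
            "success": False,
            "error": str(e),
            "error_type": type(e).__name__,
            "traceback": traceback.format_exc(),
            "duration": time.perf_counter() - start,
        }
```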
### 4. Background Task Enhancement
All background tasks (validation, training, etc.) now include:
- **Pre-execution validation**: Validates input paths and parameters
- **Structured logging**: Logs job start, progress, and completion
- **Error context**: Captures error type, message, and traceback
- **Job metadata**: Tracks duration, timestamps, and request parameters
- **Profiling integration**: Automatic profiling context for long-running tasks
### 5. Request ID Tracking
Every request gets a unique request ID:
- Generated automatically if not provided in `X-Request-ID` header
- Included in all log entries
- Added to response headers
- Used for correlating logs across distributed systems
## Logging Structure
### Log Levels
- **INFO**: Normal operations, request/response logging, job status
- **WARNING**: Validation errors, HTTP errors, non-fatal issues
- **ERROR**: Exceptions, failures, critical errors
- **DEBUG**: Detailed debugging information
### Structured Logging
All logs use structured data with `extra` parameter:
```python
logger.info(
    "Message",
    extra={
        "request_id": "req_123",
        "job_id": "job_456",
        "duration_ms": 1234.5,
        "status_code": 200,
        # ... more context
    },
)
```
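Note that fields passed via `extra` only appear in output if the formatter emits them. A small example of a JSON formatter that does this (an assumption for illustration, not necessarily what the project configures):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each record, including fields passed via `extra`, as one JSON line."""

    # Attribute names every LogRecord has; anything else came from `extra`
    _STANDARD = set(
        logging.LogRecord("", 0, "", 0, "", (), None).__dict__
    ) | {"message", "asctime"}

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        for key, value in record.__dict__.items():
            if key not in self._STANDARD:
                payload[key] = value
        return json.dumps(payload)
```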
## Example Enhanced Endpoint
### Before
```python
@app.post("/api/v1/validate/sequence")
async def validate_sequence(request: ValidateSequenceRequest):
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued"}
    executor.submit(run_validation)
    return {"job_id": job_id}
```
### After
```python
@app.post("/api/v1/validate/sequence", response_model=JobResponse)
async def validate_sequence(
    request: ValidateSequenceRequest,
    background_tasks: BackgroundTasks,
    fastapi_request: Request,
):
    request_id = fastapi_request.headers.get('X-Request-ID', 'unknown')
    job_id = str(uuid.uuid4())
    logger.info(
        "Received sequence validation request",
        extra={"request_id": request_id, "job_id": job_id, ...}
    )

    # Validate input
    seq_path = Path(request.sequence_dir)
    if not seq_path.exists():
        logger.warning(...)
        raise HTTPException(status_code=400, detail=...)

    jobs[job_id] = {
        "status": "queued",
        "request_id": request_id,
        "created_at": time.time(),
        "request_params": {...}
    }

    try:
        executor.submit(run_validation)
        logger.info("Job queued successfully", ...)
        return JobResponse(job_id=job_id, status="queued", ...)
    except Exception as e:
        logger.error("Failed to queue job", ...)
        raise HTTPException(status_code=500, detail=...)
```
## Background Task Function Enhancement
### Before
```python
def run_validation():
    try:
        result = run_cli_command(...)
        jobs[job_id]["status"] = "completed" if result["success"] else "failed"
    except Exception as e:
        jobs[job_id]["status"] = "failed"
```
### After
```python
def run_validation():
    logger.info(f"Starting validation job: {job_id}", ...)
    try:
        # Pre-validation
        if not seq_path.exists():
            raise FileNotFoundError(...)

        # Execute with profiling
        with profile_context(...):
            result = run_cli_command(...)

        # Update job with metadata
        jobs[job_id]["duration"] = result.get("duration")
        jobs[job_id]["completed_at"] = time.time()

        if result["success"]:
            logger.info("Job completed successfully", ...)
            jobs[job_id]["status"] = "completed"
        else:
            logger.error("Job failed", ...)
            jobs[job_id]["status"] = "failed"
    except FileNotFoundError as e:
        logger.error("File not found", exc_info=True)
        jobs[job_id]["status"] = "failed"
        jobs[job_id]["result"] = {"error": str(e), "error_type": "FileNotFoundError"}
    except Exception as e:
        logger.error("Unexpected error", exc_info=True)
        jobs[job_id]["status"] = "failed"
        jobs[job_id]["result"] = {
            "error": str(e),
            "error_type": type(e).__name__,
            "traceback": traceback.format_exc()
        }
```
## Error Response Format
All errors return structured JSON (`details` is optional and carries additional context when present):
```json
{
  "error": "ErrorType",
  "message": "Human-readable message",
  "request_id": "req_123",
  "details": {...}
}
```
### Error Types
- `ValidationError`: Pydantic validation failures (422)
- `FileNotFoundError`: Missing files/directories (404)
- `PermissionError`: Access denied (403)
- `InternalServerError`: Unexpected errors (500)
## Profiling Integration
### Automatic Profiling
Endpoints automatically profile when profiler is enabled:
- API endpoint execution
- Background task execution
- CLI command execution
### Manual Profiling
Use `profile_context` for custom profiling:
```python
with profile_context(stage="validation", job_id=job_id):
    result = run_validation()
```
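`profile_context` is project-specific; as an assumption about its behavior, a minimal stand-in built on `contextlib` that times a block and logs the duration could look like:

```python
import contextlib
import logging
import time

logger = logging.getLogger("api")

@contextlib.contextmanager
def profile_context(stage, **labels):
    """Time a block and log its duration with the given labels (stand-in)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Log even if the block raised, so failed stages are still timed
        logger.info(
            "Stage finished",
            extra={"stage": stage,
                   "duration_ms": (time.perf_counter() - start) * 1000,
                   **labels},
        )
```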
## Benefits
1. **Debugging**: Full tracebacks and context in logs
2. **Monitoring**: Request IDs enable log correlation
3. **Performance**: Timing information for all operations
4. **Reliability**: Comprehensive error handling prevents crashes
5. **Observability**: Structured logs enable better analysis
6. **User Experience**: Clear, actionable error messages
## Usage
### Viewing Logs
Logs are output to stdout/stderr and can be:
- Viewed in RunPod logs
- Collected by log aggregation services
- Filtered by request_id for debugging
### Request ID
Include `X-Request-ID` header for custom request tracking:
```bash
curl -H "X-Request-ID: my-custom-id" https://api.example.com/health
```
### Error Handling
All errors are logged with full context, so you can:
1. Find the request_id from the error response
2. Search logs for that request_id
3. See the full execution trace and error details
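For example, if your structured logs land in a file (the `api.log` path here is hypothetical; adapt it to your deployment or log aggregator):

```bash
# Pull every log line for one request
grep '"request_id": "req_123"' api.log
```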
## Future Enhancements
- [ ] Add rate limiting with logging
- [ ] Add request/response size limits
- [ ] Add metrics export (Prometheus)
- [ ] Add distributed tracing support
- [ ] Add structured error codes
- [ ] Add retry logic with exponential backoff