Spaces:

azan888
/

3d_model

Running

File size: 8,235 Bytes

7a87926

# API Enhancements - Logging, Profiling & Error Handling

This document describes the comprehensive enhancements made to the YLFF API endpoints for robust logging, profiling, and error handling.

## Overview

All API endpoints have been enhanced with:

- **Comprehensive logging** with structured data
- **Request/response tracking** with unique request IDs
- **Error handling** with detailed error information
- **Profiling integration** for performance monitoring
- **Timing information** for all operations
- **Structured error responses** with error types and details

## Components

### 1. Request Logging Middleware

A custom middleware (`RequestLoggingMiddleware`) logs all HTTP requests and responses:

- Generates unique request IDs for tracking
- Logs request start (method, path, client IP, query params)
- Logs response completion (status code, duration)
- Adds request ID to response headers
- Handles exceptions and logs errors

### 2. Enhanced Error Handling

#### Exception Handlers

1. **ValidationError Handler**: Catches Pydantic validation errors

   - Returns 422 status code
   - Includes detailed validation error messages
   - Logs validation failures

2. **General Exception Handler**: Catches all unhandled exceptions
   - Returns 500 status code
   - Logs full exception traceback
   - Returns structured error response with request ID

#### Error Types Handled

- `FileNotFoundError` → 404 with descriptive message
- `PermissionError` → 403 with descriptive message
- `ValueError` → 400 with validation details
- `HTTPException` → Respects FastAPI HTTP exceptions
- `Exception` → 500 with structured error response

### 3. Enhanced CLI Command Execution

The `run_cli_command` function now includes:

- **Comprehensive logging**: Logs command start, completion, and failures
- **Execution timing**: Tracks duration of all commands
- **Error classification**: Identifies error types (Exit codes, KeyboardInterrupt, Exceptions)
- **Traceback capture**: Captures full stack traces for debugging
- **Output capture**: Captures stdout/stderr with length tracking

### 4. Background Task Enhancement

All background tasks (validation, training, etc.) now include:

- **Pre-execution validation**: Validates input paths and parameters
- **Structured logging**: Logs job start, progress, and completion
- **Error context**: Captures error type, message, and traceback
- **Job metadata**: Tracks duration, timestamps, and request parameters
- **Profiling integration**: Automatic profiling context for long-running tasks

### 5. Request ID Tracking

Every request gets a unique request ID:

- Generated automatically if not provided in `X-Request-ID` header
- Included in all log entries
- Added to response headers
- Used for correlating logs across distributed systems

## Logging Structure

### Log Levels

- **INFO**: Normal operations, request/response logging, job status
- **WARNING**: Validation errors, HTTP errors, non-fatal issues
- **ERROR**: Exceptions, failures, critical errors
- **DEBUG**: Detailed debugging information

### Structured Logging

All logs use structured data with `extra` parameter:

```python
logger.info(
    "Message",
    extra={
        "request_id": "req_123",
        "job_id": "job_456",
        "duration_ms": 1234.5,
        "status_code": 200,
        # ... more context
    }
)
```

## Example Enhanced Endpoint

### Before

```python
@app.post("/api/v1/validate/sequence")
async def validate_sequence(request: ValidateSequenceRequest):
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued"}
    executor.submit(run_validation)
    return {"job_id": job_id}
```

### After

```python
@app.post("/api/v1/validate/sequence", response_model=JobResponse)
async def validate_sequence(
    request: ValidateSequenceRequest,
    background_tasks: BackgroundTasks,
    fastapi_request: Request
):
    request_id = fastapi_request.headers.get('X-Request-ID', 'unknown')
    job_id = str(uuid.uuid4())

    logger.info(
        f"Received sequence validation request",
        extra={"request_id": request_id, "job_id": job_id, ...}
    )

    # Validate input
    seq_path = Path(request.sequence_dir)
    if not seq_path.exists():
        logger.warning(...)
        raise HTTPException(status_code=400, detail=...)

    jobs[job_id] = {
        "status": "queued",
        "request_id": request_id,
        "created_at": time.time(),
        "request_params": {...}
    }

    try:
        executor.submit(run_validation)
        logger.info("Job queued successfully", ...)
        return JobResponse(job_id=job_id, status="queued", ...)
    except Exception as e:
        logger.error("Failed to queue job", ...)
        raise HTTPException(status_code=500, detail=...)
```

## Background Task Function Enhancement

### Before

```python
def run_validation():
    try:
        result = run_cli_command(...)
        jobs[job_id]["status"] = "completed" if result["success"] else "failed"
    except Exception as e:
        jobs[job_id]["status"] = "failed"
```

### After

```python
def run_validation():
    logger.info(f"Starting validation job: {job_id}", ...)

    try:
        # Pre-validation
        if not seq_path.exists():
            raise FileNotFoundError(...)

        # Execute with profiling
        with profile_context(...):
            result = run_cli_command(...)

        # Update job with metadata
        jobs[job_id]["duration"] = result.get("duration")
        jobs[job_id]["completed_at"] = time.time()

        if result["success"]:
            logger.info("Job completed successfully", ...)
            jobs[job_id]["status"] = "completed"
        else:
            logger.error("Job failed", ...)
            jobs[job_id]["status"] = "failed"

    except FileNotFoundError as e:
        logger.error("File not found", exc_info=True)
        jobs[job_id]["status"] = "failed"
        jobs[job_id]["result"] = {"error": str(e), "error_type": "FileNotFoundError"}
    except Exception as e:
        logger.error("Unexpected error", exc_info=True)
        jobs[job_id]["status"] = "failed"
        jobs[job_id]["result"] = {
            "error": str(e),
            "error_type": type(e).__name__,
            "traceback": traceback.format_exc()
        }
```

## Error Response Format

All errors return structured JSON:

```json
{
    "error": "ErrorType",
    "message": "Human-readable message",
    "request_id": "req_123",
    "details": {...}  // Optional additional details
}
```

### Error Types

- `ValidationError`: Pydantic validation failures (422)
- `FileNotFoundError`: Missing files/directories (404)
- `PermissionError`: Access denied (403)
- `InternalServerError`: Unexpected errors (500)

## Profiling Integration

### Automatic Profiling

Endpoints automatically profile when profiler is enabled:

- API endpoint execution
- Background task execution
- CLI command execution

### Manual Profiling

Use `profile_context` for custom profiling:

```python
with profile_context(stage="validation", job_id=job_id):
    result = run_validation()
```

## Benefits

1. **Debugging**: Full tracebacks and context in logs
2. **Monitoring**: Request IDs enable log correlation
3. **Performance**: Timing information for all operations
4. **Reliability**: Comprehensive error handling prevents crashes
5. **Observability**: Structured logs enable better analysis
6. **User Experience**: Clear, actionable error messages

## Usage

### Viewing Logs

Logs are output to stdout/stderr and can be:

- Viewed in RunPod logs
- Collected by log aggregation services
- Filtered by request_id for debugging

### Request ID

Include `X-Request-ID` header for custom request tracking:

```bash
curl -H "X-Request-ID: my-custom-id" https://api.example.com/health
```

### Error Handling

All errors are logged with full context, so you can:

1. Find the request_id from the error response
2. Search logs for that request_id
3. See the full execution trace and error details

## Future Enhancements

- [ ] Add rate limiting with logging
- [ ] Add request/response size limits
- [ ] Add metrics export (Prometheus)
- [ ] Add distributed tracing support
- [ ] Add structured error codes
- [ ] Add retry logic with exponential backoff