# API Enhancements - Logging, Profiling & Error Handling

This document describes the comprehensive enhancements made to the YLFF API endpoints for robust logging, profiling, and error handling.

## Overview

All API endpoints have been enhanced with:

- **Comprehensive logging** with structured data
- **Request/response tracking** with unique request IDs
- **Error handling** with detailed error information
- **Profiling integration** for performance monitoring
- **Timing information** for all operations
- **Structured error responses** with error types and details

## Components

### 1. Request Logging Middleware

A custom middleware (`RequestLoggingMiddleware`) logs all HTTP requests and responses:

- Generates unique request IDs for tracking
- Logs request start (method, path, client IP, query params)
- Logs response completion (status code, duration)
- Adds request ID to response headers
- Handles exceptions and logs errors
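
The middleware described above can be sketched as a plain ASGI class with no dependencies beyond the standard library (the real implementation may instead subclass Starlette's `BaseHTTPMiddleware`); the class name matches the text, but the body is illustrative:

```python
import logging
import time
import uuid

logger = logging.getLogger("api")


class RequestLoggingMiddleware:
    """ASGI middleware: log each request/response pair and tag both with a request ID."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":  # pass websockets/lifespan straight through
            await self.app(scope, receive, send)
            return

        headers = dict(scope.get("headers") or [])
        # Reuse the caller's X-Request-ID if present, otherwise generate one
        request_id = headers.get(b"x-request-id", b"").decode() or f"req_{uuid.uuid4().hex[:12]}"
        start = time.perf_counter()
        logger.info(
            "Request started",
            extra={"request_id": request_id,
                   "method": scope["method"],
                   "path": scope["path"]},
        )

        async def send_wrapper(message):
            if message["type"] == "http.response.start":
                duration_ms = (time.perf_counter() - start) * 1000
                # Add the request ID to the response headers
                message.setdefault("headers", []).append(
                    (b"x-request-id", request_id.encode())
                )
                logger.info(
                    "Request finished",
                    extra={"request_id": request_id,
                           "status_code": message["status"],
                           "duration_ms": round(duration_ms, 1)},
                )
            await send(message)

        try:
            await self.app(scope, receive, send_wrapper)
        except Exception:
            logger.exception("Unhandled error", extra={"request_id": request_id})
            raise
```

In a FastAPI app this would be registered with `app.add_middleware(RequestLoggingMiddleware)`.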

### 2. Enhanced Error Handling

#### Exception Handlers

1. **ValidationError Handler**: Catches Pydantic validation errors

   - Returns 422 status code
   - Includes detailed validation error messages
   - Logs validation failures

2. **General Exception Handler**: Catches all unhandled exceptions
   - Returns 500 status code
   - Logs full exception traceback
   - Returns structured error response with request ID

#### Error Types Handled

- `FileNotFoundError` → 404 with descriptive message
- `PermissionError` → 403 with descriptive message
- `ValueError` → 400 with validation details
- `HTTPException` → passed through with its original status code and detail
- `Exception` → 500 with structured error response
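
A framework-agnostic sketch of that mapping (names such as `error_response` are illustrative; in the actual app each case would live in a FastAPI exception handler, and `HTTPException` is handled by FastAPI itself):

```python
# Map exception classes to (status code, public error label).
# Order matters: isinstance checks walk this list top to bottom.
_ERROR_MAP = [
    (FileNotFoundError, 404, "FileNotFoundError"),
    (PermissionError, 403, "PermissionError"),
    (ValueError, 400, "ValueError"),
]


def error_response(exc, request_id="unknown"):
    """Translate an exception into the (status_code, body) pair the API returns."""
    for exc_type, status, label in _ERROR_MAP:
        if isinstance(exc, exc_type):
            return status, {"error": label, "message": str(exc), "request_id": request_id}
    # Anything unrecognised becomes a structured 500
    return 500, {"error": "InternalServerError", "message": str(exc), "request_id": request_id}
```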

### 3. Enhanced CLI Command Execution

The `run_cli_command` function now includes:

- **Comprehensive logging**: Logs command start, completion, and failures
- **Execution timing**: Tracks duration of all commands
- **Error classification**: Identifies error types (non-zero exit codes, `KeyboardInterrupt`, unhandled exceptions)
- **Traceback capture**: Captures full stack traces for debugging
- **Output capture**: Captures stdout/stderr with length tracking
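
The document doesn't show the function body, so the sketch below is only an assumption about its shape, built on `subprocess`; the behaviours it demonstrates (timing, output capture, error classification) are the ones listed above. The real function also classifies `KeyboardInterrupt`, omitted here for brevity:

```python
import logging
import subprocess
import time
import traceback

logger = logging.getLogger("api")


def run_cli_command(args, timeout=None):
    """Run a CLI command and return a structured result dict (illustrative sketch)."""
    logger.info("Command started", extra={"command": args})
    start = time.perf_counter()
    try:
        proc = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
        duration = time.perf_counter() - start
        result = {
            "success": proc.returncode == 0,
            "exit_code": proc.returncode,
            "duration": duration,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
            # Track output lengths alongside the output itself
            "stdout_len": len(proc.stdout),
            "stderr_len": len(proc.stderr),
        }
        log = logger.info if result["success"] else logger.error
        log("Command finished", extra={"exit_code": proc.returncode, "duration": duration})
        return result
    except Exception as e:
        duration = time.perf_counter() - start
        logger.error("Command raised", extra={"error_type": type(e).__name__}, exc_info=True)
        return {
            "success": False,
            "error": str(e),
            "error_type": type(e).__name__,
            "traceback": traceback.format_exc(),
            "duration": duration,
        }
```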

### 4. Background Task Enhancement

All background tasks (validation, training, etc.) now include:

- **Pre-execution validation**: Validates input paths and parameters
- **Structured logging**: Logs job start, progress, and completion
- **Error context**: Captures error type, message, and traceback
- **Job metadata**: Tracks duration, timestamps, and request parameters
- **Profiling integration**: Automatic profiling context for long-running tasks

### 5. Request ID Tracking

Every request gets a unique request ID:

- Generated automatically if not provided in `X-Request-ID` header
- Included in all log entries
- Added to response headers
- Used for correlating logs across distributed systems

## Logging Structure

### Log Levels

- **INFO**: Normal operations, request/response logging, job status
- **WARNING**: Validation errors, HTTP errors, non-fatal issues
- **ERROR**: Exceptions, failures, critical errors
- **DEBUG**: Detailed debugging information

### Structured Logging

All logs use structured data with `extra` parameter:

```python
logger.info(
    "Message",
    extra={
        "request_id": "req_123",
        "job_id": "job_456",
        "duration_ms": 1234.5,
        "status_code": 200,
        # ... more context
    }
)
```
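
Note that fields passed via `extra` only reach the output if the formatter emits them. One common approach (an assumption about the deployment, not something this document specifies) is a JSON formatter that serialises every non-standard `LogRecord` attribute:

```python
import json
import logging

# Attributes every LogRecord has; anything else arrived via `extra`
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None)))
_STANDARD_ATTRS.update({"message", "asctime"})


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including `extra` fields."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        return json.dumps(payload)
```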

## Example Enhanced Endpoint

### Before

```python
@app.post("/api/v1/validate/sequence")
async def validate_sequence(request: ValidateSequenceRequest):
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued"}
    executor.submit(run_validation)
    return {"job_id": job_id}
```

### After

```python
@app.post("/api/v1/validate/sequence", response_model=JobResponse)
async def validate_sequence(
    request: ValidateSequenceRequest,
    background_tasks: BackgroundTasks,
    fastapi_request: Request
):
    request_id = fastapi_request.headers.get('X-Request-ID', 'unknown')
    job_id = str(uuid.uuid4())

    logger.info(
        "Received sequence validation request",
        extra={"request_id": request_id, "job_id": job_id, ...}
    )

    # Validate input
    seq_path = Path(request.sequence_dir)
    if not seq_path.exists():
        logger.warning(...)
        raise HTTPException(status_code=400, detail=...)

    jobs[job_id] = {
        "status": "queued",
        "request_id": request_id,
        "created_at": time.time(),
        "request_params": {...}
    }

    try:
        executor.submit(run_validation)
        logger.info("Job queued successfully", ...)
        return JobResponse(job_id=job_id, status="queued", ...)
    except Exception as e:
        logger.error("Failed to queue job", ...)
        raise HTTPException(status_code=500, detail=...)
```

## Background Task Function Enhancement

### Before

```python
def run_validation():
    try:
        result = run_cli_command(...)
        jobs[job_id]["status"] = "completed" if result["success"] else "failed"
    except Exception as e:
        jobs[job_id]["status"] = "failed"
```

### After

```python
def run_validation():
    logger.info(f"Starting validation job: {job_id}", ...)

    try:
        # Pre-validation
        if not seq_path.exists():
            raise FileNotFoundError(...)

        # Execute with profiling
        with profile_context(...):
            result = run_cli_command(...)

        # Update job with metadata
        jobs[job_id]["duration"] = result.get("duration")
        jobs[job_id]["completed_at"] = time.time()

        if result["success"]:
            logger.info("Job completed successfully", ...)
            jobs[job_id]["status"] = "completed"
        else:
            logger.error("Job failed", ...)
            jobs[job_id]["status"] = "failed"

    except FileNotFoundError as e:
        logger.error("File not found", exc_info=True)
        jobs[job_id]["status"] = "failed"
        jobs[job_id]["result"] = {"error": str(e), "error_type": "FileNotFoundError"}
    except Exception as e:
        logger.error("Unexpected error", exc_info=True)
        jobs[job_id]["status"] = "failed"
        jobs[job_id]["result"] = {
            "error": str(e),
            "error_type": type(e).__name__,
            "traceback": traceback.format_exc()
        }
```

## Error Response Format

All errors return structured JSON:

```json
{
    "error": "ErrorType",
    "message": "Human-readable message",
    "request_id": "req_123",
    "details": {...}  // Optional additional details
}
```

### Error Types

- `ValidationError`: Pydantic validation failures (422)
- `FileNotFoundError`: Missing files/directories (404)
- `PermissionError`: Access denied (403)
- `InternalServerError`: Unexpected errors (500)

## Profiling Integration

### Automatic Profiling

Endpoints are profiled automatically when the profiler is enabled:

- API endpoint execution
- Background task execution
- CLI command execution

### Manual Profiling

Use `profile_context` for custom profiling:

```python
with profile_context(stage="validation", job_id=job_id):
    result = run_validation()
```
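
`profile_context`'s implementation isn't shown in this document; a minimal stand-in built on `contextlib.contextmanager` (the field names are assumptions) might look like:

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("api")


@contextmanager
def profile_context(stage, **labels):
    """Time the wrapped block and log its duration with the given labels."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Logged even if the block raises, so failed stages still get timed
        duration_ms = (time.perf_counter() - start) * 1000
        logger.info(
            "Profiled stage finished",
            extra={"stage": stage, "duration_ms": round(duration_ms, 1), **labels},
        )
```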

## Benefits

1. **Debugging**: Full tracebacks and context in logs
2. **Monitoring**: Request IDs enable log correlation
3. **Performance**: Timing information for all operations
4. **Reliability**: Comprehensive error handling prevents crashes
5. **Observability**: Structured logs enable better analysis
6. **User Experience**: Clear, actionable error messages

## Usage

### Viewing Logs

Logs are output to stdout/stderr and can be:

- Viewed in RunPod logs
- Collected by log aggregation services
- Filtered by request_id for debugging

### Request ID

Include `X-Request-ID` header for custom request tracking:

```bash
curl -H "X-Request-ID: my-custom-id" https://api.example.com/health
```

### Error Handling

All errors are logged with full context, so you can:

1. Find the request_id from the error response
2. Search logs for that request_id
3. See the full execution trace and error details

## Future Enhancements

- [ ] Add rate limiting with logging
- [ ] Add request/response size limits
- [ ] Add metrics export (Prometheus)
- [ ] Add distributed tracing support
- [ ] Add structured error codes
- [ ] Add retry logic with exponential backoff