# Technical Documentation: `rotating-api-key-client`
This document explains the `rotating-api-key-client` library, its components, and its internal workings. The library is an asynchronous client for managing LLM API keys, with a focus on concurrency, resilience, and state management.
## 1. `client.py` - The `RotatingClient`
The `RotatingClient` is the central component, orchestrating API calls, key management, and error handling. It is designed as a long-lived, async-native object.
### Core Responsibilities
- Managing an `httpx.AsyncClient` for non-blocking HTTP requests.
- Interfacing with the `UsageManager` to acquire and release API keys.
- Handling provider-specific request modifications.
- Executing API calls via `litellm` with a robust retry and rotation strategy.
- Providing a safe wrapper for streaming responses.
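As a rough illustration of the last responsibility, a streaming wrapper can be written as an async generator whose `finally` block performs cleanup. This is a minimal sketch, not the library's `_safe_streaming_wrapper`: the names `safe_streaming_wrapper` and `on_done` are hypothetical, and the real wrapper may format chunks differently.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def safe_streaming_wrapper(
    stream: AsyncIterator[str],
    on_done: Callable[[], Awaitable[None]],
) -> AsyncIterator[str]:
    """Yield SSE-formatted chunks and always run the cleanup callback.

    `on_done` is a hypothetical stand-in for recording success and
    releasing the key lock.
    """
    try:
        async for chunk in stream:
            # Server-Sent Events frame: "data: <payload>" followed by a blank line.
            yield f"data: {chunk}\n\n"
    finally:
        # Runs on normal completion, on error, and when the generator
        # is closed early with aclose().
        await on_done()
```

Because the cleanup lives in `finally`, it also runs when the consumer calls `aclose()` on the generator before reaching the end of the stream.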
### Request Lifecycle (`acompletion`)
When `acompletion` is called, it follows these steps:
1. **Provider and Key Validation**: It extracts the provider from the `model` name and ensures keys are configured for it.
2. **Key Acquisition Loop**: The client enters a loop to find a valid key and complete the request. It iterates through all keys for the provider until one succeeds or all have been tried.
    a. **Acquire Best Key**: It calls `self.usage_manager.acquire_key()`. This call suspends the calling coroutine, without blocking the event loop, until a suitable key is available, based on the manager's tiered locking strategy (see the `UsageManager` section).
    b. **Prepare Request**: It prepares the `litellm` keyword arguments. This includes:
        - **Request Sanitization**: Calling `sanitize_request_payload()` to remove parameters that the target model may not support, preventing errors.
        - **Provider-Specific Logic**: Applying special handling for providers like Gemini (safety settings), Gemma (system prompts), and Chutes.ai (`api_base` and model name remapping).
3. **Retry Loop**: Once a key is acquired, it enters an inner retry loop (`for attempt in range(self.max_retries)`):
    a. **API Call**: It calls `litellm.acompletion` with the acquired key.
    b. **Success (Non-Streaming)**:
        - It calls `self.usage_manager.record_success()` to update usage stats and clear any cooldowns for the key-model pair.
        - It calls `self.usage_manager.release_key()` to release the lock on the key for this model.
        - It returns the response, and the process ends.
    c. **Success (Streaming)**:
        - It returns a `_safe_streaming_wrapper` async generator. This wrapper is critical:
            - It yields SSE-formatted chunks to the consumer.
            - When the stream ends, its `finally` block ensures that `record_success()` and `release_key()` are called. This guarantees that the key lock is held for the entire duration of the stream and released correctly, even if the consumer abandons the stream before it is fully consumed.
    d. **Failure**: If an exception occurs:
        - The failure is logged in detail by `log_failure()`.
        - The exception is passed to `classify_error()` to get a structured `ClassifiedError` object.
        - **Server Error**: If the error type is `server_error`, it waits with exponential backoff and retries the request with the *same key*.
        - **Rotation Error (Rate Limit, Auth, etc.)**: Any other error is treated as a rotation trigger. `self.usage_manager.record_failure()` is called to apply an escalating cooldown, and `self.usage_manager.release_key()` releases the lock. The inner `attempt` loop is broken, and the outer `while` loop continues, acquiring a new key.
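The control flow above can be sketched as two nested loops. This is an illustrative outline under several assumptions, not the library's code: `FakeUsageManager`, `complete_with_rotation`, the `tried` set, and `base_delay` are hypothetical, `classify` returns a plain string rather than a `ClassifiedError`, and releasing the key once retries are exhausted is a guess about behavior the text does not specify.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FakeUsageManager:
    """Hypothetical stand-in for UsageManager that only records call order."""
    keys: list
    events: list = field(default_factory=list)

    async def acquire_key(self, tried):
        key = next(k for k in self.keys if k not in tried)
        self.events.append(("acquire", key))
        return key

    async def record_success(self, key):
        self.events.append(("success", key))

    async def record_failure(self, key, error_type):
        self.events.append(("failure", key, error_type))

    async def release_key(self, key):
        self.events.append(("release", key))


async def complete_with_rotation(manager, call, classify, max_retries=2, base_delay=0.0):
    tried = set()
    while len(tried) < len(manager.keys):       # outer loop: one pass per key
        key = await manager.acquire_key(tried)
        tried.add(key)
        for attempt in range(max_retries):       # inner loop: retries on the same key
            try:
                response = await call(key)
            except Exception as exc:
                error_type = classify(exc)
                if error_type == "server_error":
                    # Exponential backoff, then retry with the same key.
                    await asyncio.sleep(base_delay * 2 ** attempt)
                    continue
                # Any other error triggers rotation to another key.
                await manager.record_failure(key, error_type)
                await manager.release_key(key)
                break
            else:
                await manager.record_success(key)
                await manager.release_key(key)
                return response
        else:
            # Assumption: retries exhausted on server errors, so release and move on.
            await manager.release_key(key)
    raise RuntimeError("all keys failed")
```

The `for ... else` clause runs only when the inner loop finishes without `break`, which is what distinguishes "retries exhausted" from "rotation requested" here.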
## 2. `usage_manager.py` - Stateful Concurrency & Usage Management
This class is the heart of the library's state management and concurrency control. It is a stateful, async-native service that ensures keys are used efficiently and safely across multiple concurrent requests.
### Key Concepts
- **Asynchronous Design & Lazy Loading**: The entire class is asynchronous, using `aiofiles` for non-blocking file I/O and a `_lazy_init` pattern. The usage data from the JSON file is loaded only when the first request is made.
- **Concurrency Primitives**:
    - **`filelock`**: A file-level lock (`.json.lock`) prevents race conditions if multiple *processes* are running and sharing the same usage file.
    - **`asyncio.Lock` & `asyncio.Condition`**: Each key has its own `asyncio.Lock` and `asyncio.Condition` object. This enables the fine-grained, model-aware locking strategy.
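The lazy-loading idea can be sketched with a guarded initializer. This is a simplified illustration, assuming a hypothetical `LazyUsageStore` class: it uses `asyncio.to_thread` from the standard library in place of `aiofiles`, and it omits the cross-process `filelock` entirely.

```python
import asyncio
import json
from pathlib import Path


class LazyUsageStore:
    """Hypothetical store that loads its JSON file on first use only."""

    def __init__(self, path):
        self._path = Path(path)
        self._init_lock = asyncio.Lock()
        self._initialized = False
        self.data = {}
        self.key_locks = {}

    async def _lazy_init(self):
        if self._initialized:
            return
        async with self._init_lock:
            if self._initialized:
                # Another task finished loading while this one waited.
                return
            if self._path.exists():
                # Read off the event loop thread so other tasks keep running.
                text = await asyncio.to_thread(self._path.read_text)
                self.data = json.loads(text)
            # One lock per key, mirroring the per-key primitives described above.
            self.key_locks = {key: asyncio.Lock() for key in self.data}
            self._initialized = True
```

The double check around `_init_lock` keeps concurrent first requests from loading the file twice.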
### Tiered Key Acquisition (`acquire_key`)
This method implements the core logic for selecting a key. It waits asynchronously until a suitable key can be locked.
1. **Filtering**: It first filters out any keys that are on a global or model-specific cooldown.
2. **Tiering**: It categorizes the remaining, valid keys into two tiers:
    - **Tier 1 (Ideal)**: Keys that are completely free (not being used by any model).
    - **Tier 2 (Acceptable)**: Keys that are currently in use, but for *different models* than the one being requested.
3. **Selection**: It attempts to acquire a lock on a key, prioritizing Tier 1 over Tier 2. Within each tier, it prioritizes the least-used key.
4. **Waiting**: If no keys in Tier 1 or Tier 2 can be locked, it means all eligible keys are currently handling requests for the *same model*. The method then `await`s on the `asyncio.Condition` of the best available key, waiting until it is notified that the key has been released.
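The filtering, tiering, and ordering steps can be expressed as a small pure function. This sketch assumes a hypothetical `KeyState` record and `rank_candidates` helper; it does not model the locking or the `asyncio.Condition` wait in step 4.

```python
from dataclasses import dataclass, field


@dataclass
class KeyState:
    """Hypothetical in-memory view of one key."""
    name: str
    usage_count: int = 0
    models_in_use: set = field(default_factory=set)
    model_cooldowns: dict = field(default_factory=dict)  # model -> timestamp
    key_cooldown_until: float = 0.0


def rank_candidates(keys, model, now):
    """Return (tier1, tier2), each sorted so the least-used key comes first."""
    tier1, tier2 = [], []
    for key in keys:
        # Step 1: skip keys on a global or model-specific cooldown.
        if key.key_cooldown_until > now:
            continue
        if key.model_cooldowns.get(model, 0.0) > now:
            continue
        # Step 2: split the remaining keys into tiers.
        if not key.models_in_use:
            tier1.append(key)        # completely free
        elif model not in key.models_in_use:
            tier2.append(key)        # busy, but with other models
        # Keys already serving this model fall into neither tier.
    by_usage = lambda k: k.usage_count
    return sorted(tier1, key=by_usage), sorted(tier2, key=by_usage)
```

A caller would try to lock the keys in `tier1` first, then those in `tier2`, and only wait when both lists are exhausted.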
### Failure Handling & Cooldowns (`record_failure`)
- **Escalating Backoff**: When a failure is recorded, it applies a cooldown that increases with the number of consecutive failures for a specific key-model pair (e.g., 10s, 30s, 60s, up to 2 hours).
- **Authentication Errors**: These are treated more severely, applying an immediate 5-minute key-level lockout.
- **Key-Level Lockouts**: If a single key accumulates 3 or more long-term (2-hour) cooldowns across different models, the manager assumes the key is compromised or disabled and applies a 5-minute global lockout on the key.
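A sketch of these rules, under stated assumptions: the text gives only example cooldown steps, so the `COOLDOWN_SCHEDULE` below (10s, 30s, 60s, then 2 hours) is a guess, and a "long-term cooldown" is approximated as a model whose consecutive failure count has reached the last step. The function name mirrors `record_failure`, but its signature here is hypothetical.

```python
COOLDOWN_SCHEDULE = [10, 30, 60, 7200]  # seconds; assumed steps, capped at 2 hours
AUTH_LOCKOUT = 300                      # 5-minute lockout for authentication errors
KEY_LOCKOUT = 300                       # 5-minute lockout after repeated long cooldowns
LONG_TERM_THRESHOLD = 3                 # models with long-term cooldowns before lockout


def record_failure(state, model, error_type, now):
    """Update one key's state dict in place after a failed request."""
    if error_type == "authentication":
        # Authentication errors lock the whole key immediately.
        state["key_cooldown_until"] = now + AUTH_LOCKOUT
        return

    entry = state["failures"].setdefault(model, {"consecutive_failures": 0})
    entry["consecutive_failures"] += 1

    # Escalate with consecutive failures, staying on the last step once reached.
    step = min(entry["consecutive_failures"], len(COOLDOWN_SCHEDULE)) - 1
    state["model_cooldowns"][model] = now + COOLDOWN_SCHEDULE[step]

    # Count models that have escalated to the longest cooldown.
    long_term = [
        m for m, f in state["failures"].items()
        if f["consecutive_failures"] >= len(COOLDOWN_SCHEDULE)
    ]
    if len(long_term) >= LONG_TERM_THRESHOLD:
        state["key_cooldown_until"] = now + KEY_LOCKOUT
```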
### Data Structure
The `key_usage.json` file stores this state per key, indexed by the key's hash:
```json
{
  "api_key_hash": {
    "daily": {
      "date": "YYYY-MM-DD",
      "models": {
        "gemini/gemini-1.5-pro": {
          "success_count": 10,
          "prompt_tokens": 5000,
          "completion_tokens": 10000,
          "approx_cost": 0.075
        }
      }
    },
    "global": { /* ... similar to daily, but accumulates over time ... */ },
    "model_cooldowns": {
      "gemini/gemini-1.5-flash": 1719987600.0
    },
    "failures": {
      "gemini/gemini-1.5-flash": {
        "consecutive_failures": 2
      }
    },
    "key_cooldown_until": null,
    "last_daily_reset": "YYYY-MM-DD"
  }
}
```
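To show how this structure might be read, the following sketch checks whether a key is usable for a given model. It assumes the cooldown values are Unix timestamps (the example value looks like one); `is_key_available` is a hypothetical helper, not part of the library.

```python
import time


def is_key_available(entry, model, now=None):
    """Return True if neither the key nor the key-model pair is cooling down.

    `entry` is the dict stored under one key hash in key_usage.json.
    """
    now = time.time() if now is None else now
    key_until = entry.get("key_cooldown_until")
    if key_until is not None and key_until > now:
        return False  # key-level lockout still active
    model_until = entry.get("model_cooldowns", {}).get(model, 0.0)
    return model_until <= now


entry = {
    "model_cooldowns": {"gemini/gemini-1.5-flash": 1719987600.0},
    "key_cooldown_until": None,
}
flash_ok = is_key_available(entry, "gemini/gemini-1.5-flash", now=1719987000.0)
pro_ok = is_key_available(entry, "gemini/gemini-1.5-pro", now=1719987000.0)
```

With these inputs the flash model is still cooling down while the pro model is usable, since only the flash entry has a cooldown timestamp in the future.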
## 3. `error_handler.py`
This module provides a centralized function, `classify_error`, which replaces the previous boolean checks with a structured classification.
- It takes a raw exception from `litellm` and returns a `ClassifiedError` data object.
- This object contains the `error_type` (e.g., `'rate_limit'`, `'authentication'`, `'server_error'`), the original exception, the status code, and any `retry_after` information extracted from the error message.
- This structured classification allows the `RotatingClient` to make more intelligent decisions about whether to retry with the same key or rotate to a new one.
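One way such a classifier could look is sketched below. The field names other than `error_type`, the status-code mapping, the `"unknown"` fallback, and the regular expression are all assumptions made for illustration; the library's actual `classify_error` may inspect `litellm` exception types instead.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClassifiedError:
    error_type: str
    original_exception: Exception
    status_code: Optional[int] = None
    retry_after: Optional[int] = None  # seconds, if found in the message


def classify_error(exc):
    """Map an exception to a ClassifiedError using its status code and message."""
    status = getattr(exc, "status_code", None)

    # Look for hints such as "Retry-After: 12" or "retry after 30 seconds".
    match = re.search(r"retry[- ]after[:\s]+(\d+)", str(exc), re.IGNORECASE)
    retry_after = int(match.group(1)) if match else None

    if status == 429:
        error_type = "rate_limit"
    elif status in (401, 403):
        error_type = "authentication"
    elif status is not None and status >= 500:
        error_type = "server_error"
    else:
        error_type = "unknown"  # hypothetical fallback category
    return ClassifiedError(error_type, exc, status, retry_after)
```

A caller can then branch on `error_type` alone: retry with the same key for `server_error`, rotate for everything else.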
## 4. `request_sanitizer.py` (New Module)
- This module's purpose is to prevent `InvalidRequestError` exceptions from `litellm` that occur when a payload contains parameters not supported by the target model (e.g., sending a `thinking` parameter to a model that doesn't support it).
- The `sanitize_request_payload` function is called just before `litellm.acompletion` to strip out any such unsupported parameters, making the system more robust.
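The stripping step might look like the sketch below. The rule table, the model name in it, and the function signature are hypothetical; the document does not say how the library decides which parameters a model supports.

```python
# Hypothetical rule table: substring of the model name -> parameters to drop.
UNSUPPORTED_PARAMS = {
    "example/no-thinking-model": {"thinking"},
}


def sanitize_request_payload(payload, model):
    """Return a copy of `payload` without parameters the model is assumed to reject."""
    blocked = set()
    for marker, params in UNSUPPORTED_PARAMS.items():
        if marker in model:
            blocked |= params
    # Build a new dict so the caller's payload is left untouched.
    return {name: value for name, value in payload.items() if name not in blocked}
```

Returning a copy rather than mutating the input keeps the original request available if the call is later retried against a different model.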
## 5. `providers/` - Provider Plugins
The provider plugin system remains for fetching model lists. The interface now correctly specifies that the `get_models` method receives an `httpx.AsyncClient` instance, which it should use to make its API calls. This ensures all HTTP traffic goes through the client's managed session.
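A provider plugin following this interface could be sketched as below. The `api_key` parameter, the `ExampleProvider` class, the placeholder URL, and the response shape are assumptions; only the fact that `get_models` receives the shared `httpx.AsyncClient` comes from the text. The `client` argument is typed loosely so the sketch runs without `httpx` installed, and it relies only on `client.get(...)`, `response.raise_for_status()`, and `response.json()`.

```python
import asyncio
from typing import Any, List, Protocol


class ProviderPlugin(Protocol):
    """Hypothetical shape of a provider plugin."""

    async def get_models(self, api_key: str, client: Any) -> List[str]:
        ...


class ExampleProvider:
    MODELS_URL = "https://api.example.com/v1/models"  # placeholder endpoint

    async def get_models(self, api_key: str, client: Any) -> List[str]:
        # Use the client passed in rather than creating a new session.
        response = await client.get(
            self.MODELS_URL,
            headers={"Authorization": f"Bearer {api_key}"},
        )
        response.raise_for_status()
        # Assumed response shape: {"data": [{"id": "..."}, ...]}
        return [f"example/{item['id']}" for item in response.json()["data"]]
```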
## 6. `proxy_app/` - The Proxy Application
The `proxy_app` directory contains the FastAPI application that serves the rotating client.
### `main.py` - The FastAPI App
This file contains the FastAPI application that exposes the `RotatingClient` through an OpenAI-compatible API.
#### Command-Line Arguments
- `--enable-request-logging`: Logs all incoming requests and outgoing responses to the `logs/` directory, which is useful for debugging and monitoring the proxy's activity. Disabled by default.
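A flag like this is typically declared with `argparse`. The sketch below is an assumption about how `main.py` might parse it; the `parse_args` helper and the description string are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the proxy's command-line flags (argv=None reads sys.argv)."""
    parser = argparse.ArgumentParser(description="Rotating API key proxy")
    parser.add_argument(
        "--enable-request-logging",
        action="store_true",  # absent -> False, present -> True
        help="Log incoming requests and outgoing responses to logs/",
    )
    return parser.parse_args(argv)
```

`argparse` converts the dashes to underscores, so the value is read as `args.enable_request_logging`.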