feat(core): implement deadline-driven request execution and resilient error handling
Introduce a `global_timeout` parameter to `RotatingClient` and refactor the request lifecycle to operate within a strict time budget.
- All key acquisition, rotation, and retry mechanisms now adhere to this global deadline, preventing indefinite hangs.
- Redesign error propagation to handle transient provider failures (e.g., rate limits, 5xx errors) internally. These errors now trigger key rotation or deadline-aware retries without immediately raising exceptions to the caller.
- Provide a more stable client experience by shielding consumers from intermittent backend issues, returning a failure only when the global timeout is exceeded or all keys are exhausted.
- Update documentation to reflect the new `global_timeout` and improved error handling.
BREAKING CHANGE: The client's error propagation for transient failures has changed. `acompletion` and `aembedding` methods in `RotatingClient` no longer raise exceptions for intermittent issues like rate limits or server errors; instead, they handle them internally. Non-streaming requests will now return `None` upon final failure (after exhausting all keys or exceeding the global timeout), and streaming requests will yield a final `[DONE]` message with an error payload. Additionally, the `UsageManager.__init__` method no longer accepts the `wait_timeout` parameter, and `UsageManager.acquire_key` now requires a `deadline` argument.
- DOCUMENTATION.md +35 -27
- README.md +3 -2
- src/rotator_library/README.md +13 -2
- src/rotator_library/client.py +64 -21
- src/rotator_library/usage_manager.py +24 -12

@@ -21,39 +21,46 @@ This library is the heart of the project, containing all the logic for key rotat

The `RotatingClient` is the central class that orchestrates all operations. It is designed as a long-lived, async-native object.

#### Core Responsibilities

* Managing a shared `httpx.AsyncClient` for all non-blocking HTTP requests.
* Interfacing with the `UsageManager` to acquire and release API keys.
* Dynamically loading and using provider-specific plugins from the `providers/` directory.
-* Executing API calls via `litellm` with a robust retry and rotation strategy.
* Providing a safe, stateful wrapper for handling streaming responses.

-#### Request Lifecycle
-   - It calls `self.usage_manager.release_key()` to release the lock.
-   - It returns the response, and the process ends.
-  c. **Success (Streaming)**:
-   - It returns the `_safe_streaming_wrapper` async generator. This wrapper is critical:
-     - It yields SSE-formatted chunks to the consumer.
-     - It can reassemble fragmented JSON chunks and detect errors mid-stream.
-     - Its `finally` block ensures that `record_success()` and `release_key()` are called *only after the stream is fully consumed or closed*. This guarantees the key lock is held for the entire duration of the stream.
-  d. **Failure**: If an exception occurs:
-   - The exception is passed to `classify_error()` to get a structured `ClassifiedError` object.
-   - **Server Error**: If the error is temporary (e.g., 5xx), it waits with exponential backoff and retries the request with the *same key*.
-   - **Rotation Error (Rate Limit, Auth, etc.)**: For any other error, it's a trigger to rotate. `self.usage_manager.record_failure()` is called to apply a cooldown, and the lock is released. The inner `attempt` loop is broken, and the outer `while` loop continues, acquiring a new key.
### 2.2. `usage_manager.py` - Stateful Concurrency & Usage Management

@@ -66,14 +73,15 @@ This class is the stateful core of the library, managing concurrency, usage, and

#### Tiered Key Acquisition (`acquire_key`)

-This method implements the intelligent logic for selecting the best key for a job.
  - **Tier 1 (Ideal)**: Keys that are completely free (not being used by any model).
  - **Tier 2 (Acceptable)**: Keys that are currently in use, but for *different models* than the one being requested. This allows a single key to be used for concurrent calls to, for example, `gemini-1.5-pro` and `gemini-1.5-flash`.
#### Failure Handling & Cooldowns (`record_failure`)

The `RotatingClient` is the central class that orchestrates all operations. It is designed as a long-lived, async-native object.

+#### Initialization
+
+The client is initialized with your provider API keys, retry settings, and a new `global_timeout`.
+
+```python
+client = RotatingClient(
+    api_keys=api_keys,
+    max_retries=2,
+    global_timeout=30  # in seconds
+)
+```
+
+- `global_timeout`: A crucial new parameter that sets a hard time limit for the entire request lifecycle, from the moment `acompletion` is called until a response is returned or the timeout is exceeded.
+
#### Core Responsibilities

* Managing a shared `httpx.AsyncClient` for all non-blocking HTTP requests.
* Interfacing with the `UsageManager` to acquire and release API keys.
* Dynamically loading and using provider-specific plugins from the `providers/` directory.
+* Executing API calls via `litellm` with a robust, **deadline-driven** retry and rotation strategy.
* Providing a safe, stateful wrapper for handling streaming responses.

+#### Request Lifecycle: A Deadline-Driven Approach
+
+The request lifecycle has been redesigned around a single, authoritative time budget to ensure predictable performance and prevent requests from hanging indefinitely.
+
+1. **Deadline Establishment**: The moment `acompletion` or `aembedding` is called, a `deadline` is calculated: `time.time() + self.global_timeout`. This `deadline` is the absolute point in time by which the entire operation must complete.
+2. **Deadline-Aware Key Rotation Loop**: The main `while` loop now has a critical secondary condition: `while len(tried_keys) < len(keys_for_provider) and time.time() < deadline:`. The loop will exit immediately if the `deadline` is reached, regardless of how many keys are left to try.
+3. **Deadline-Aware Key Acquisition**: The `self.usage_manager.acquire_key()` method now accepts the `deadline`. The `UsageManager` will not wait indefinitely for a key; if it cannot acquire one before the `deadline` is met, it will raise a `NoAvailableKeysError`, causing the request to fail fast with a "busy" error.
+4. **Deadline-Aware Retries**: When a transient error occurs, the client calculates the necessary `wait_time` for an exponential backoff. It then checks whether this wait time fits within the remaining budget (`deadline - time.time()`).
+   - **If it fits**: It waits (`asyncio.sleep`) and retries with the same key.
+   - **If it exceeds the budget**: It skips the wait entirely, logs a warning, and immediately rotates to the next key to avoid wasting time.
+5. **Refined Error Propagation**:
+   - **Fatal Errors**: Invalid requests or authentication errors are raised immediately to the client.
+   - **Intermittent Errors**: Rate limits, server errors, and other temporary issues are now handled internally. The error is logged and the key is rotated, but the exception is **not** propagated to the end client. This prevents the client from seeing disruptive, intermittent failures.
+   - **Final Failure**: A non-streaming request will only return `None` (indicating failure) if either a) the global `deadline` is exceeded, or b) all keys for the provider have been tried and have failed. A streaming request will yield a final `[DONE]` with an error message in the same scenarios.
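The lifecycle in steps 1-5 can be condensed into a minimal sketch. This is illustrative only: `attempt_call` is a hypothetical stand-in for the `litellm` call, and real key selection goes through `UsageManager.acquire_key`.

```python
import asyncio
import time

async def acompletion_sketch(keys, attempt_call, global_timeout=30):
    """Sketch of the deadline-driven request lifecycle (not the real client)."""
    # Step 1: establish the absolute deadline for the whole request.
    deadline = time.time() + global_timeout

    # Step 2: rotate through keys only while the time budget remains.
    for key in keys:
        if time.time() >= deadline:
            break
        try:
            return await attempt_call(key)  # success: return the response
        except Exception:
            # Step 4: deadline-aware backoff; if the wait would exceed the
            # remaining budget, skip it and rotate to the next key instead.
            wait_time = 1.0  # stand-in for exponential backoff
            if wait_time > deadline - time.time():
                continue
            await asyncio.sleep(wait_time)

    # Step 5: final failure returns None instead of raising.
    return None
```

With a small `global_timeout`, a call that always fails never sleeps through the full backoff; the sketch rotates until keys or time run out and then returns `None`.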
### 2.2. `usage_manager.py` - Stateful Concurrency & Usage Management

#### Tiered Key Acquisition (`acquire_key`)

+This method implements the intelligent logic for selecting the best key for a job, now with deadline awareness.
+
+1. **Deadline Enforcement**: The entire acquisition process runs in a `while time.time() < deadline:` loop. If a key cannot be found before the deadline, the method raises `NoAvailableKeysError`.
+2. **Filtering**: It first filters out any keys that are on a global or model-specific cooldown.
+3. **Tiering**: It categorizes the remaining, valid keys into two tiers:
   - **Tier 1 (Ideal)**: Keys that are completely free (not being used by any model).
   - **Tier 2 (Acceptable)**: Keys that are currently in use, but for *different models* than the one being requested. This allows a single key to be used for concurrent calls to, for example, `gemini-1.5-pro` and `gemini-1.5-flash`.
+4. **Selection**: It attempts to acquire a lock on a key, prioritizing Tier 1 over Tier 2. Within each tier, it prioritizes the key with the lowest usage count.
+5. **Waiting**: If no keys in Tier 1 or Tier 2 can be locked, it means all eligible keys are currently handling requests for the *same model*. The method then `await`s on the `asyncio.Condition` of the best available key. Crucially, this wait is itself timed out by the remaining request budget, preventing indefinite waits.
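The bounded wait described in step 5 can be sketched with `asyncio.wait_for` wrapped around `Condition.wait()`. This is an illustrative pattern under assumed names, not the `UsageManager`'s exact code:

```python
import asyncio
import time

async def wait_for_key_release(condition: asyncio.Condition, deadline: float) -> bool:
    """Wait for a key holder to notify, but never past the request deadline.

    Returns True if notified in time, False if the budget ran out.
    """
    remaining = deadline - time.time()
    if remaining <= 0:
        return False
    async with condition:
        try:
            # Bound the condition wait by the remaining request budget.
            await asyncio.wait_for(condition.wait(), timeout=remaining)
            return True
        except asyncio.TimeoutError:
            return False

async def demo() -> bool:
    condition = asyncio.Condition()

    async def holder():
        # Simulate another request releasing its key shortly afterwards.
        await asyncio.sleep(0.05)
        async with condition:
            condition.notify_all()

    asyncio.create_task(holder())
    return await wait_for_key_release(condition, time.time() + 1.0)

print(asyncio.run(demo()))  # prints True
```

If the budget expires first, the timed-out waiter simply reports failure, which maps onto `acquire_key` raising `NoAvailableKeysError`.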
#### Failure Handling & Cooldowns (`record_failure`)

@@ -22,10 +22,11 @@ This project provides a robust, self-hosted solution for managing and rotating A

## Features

- **Advanced Concurrency Control**: A single API key can handle multiple concurrent requests to different models, maximizing throughput.
- **Smart Key Rotation**: Intelligently selects the least-used, available API key to distribute request loads evenly.
-- **Escalating Per-Model Cooldowns**: If a key fails for a specific model
-- **Automatic Retries**: Automatically retries requests on transient server errors (e.g., 5xx status codes) with exponential backoff.
- **Automatic Daily Resets**: Cooldowns and usage statistics are automatically reset daily, making the system self-maintaining.
- **Request Logging**: Optional logging of full request and response payloads for easy debugging.
- **Provider Agnostic**: Compatible with any provider supported by `litellm`.

## Features

+- **Predictable Performance**: A new **global timeout** ensures that requests complete within a set time, preventing your application from hanging on slow or failing provider responses.
+- **Resilient Error Handling**: The proxy now shields your application from transient backend errors. It handles rate limits and temporary provider issues internally by rotating keys, so your client only sees a failure if all options are exhausted or the timeout is hit.
- **Advanced Concurrency Control**: A single API key can handle multiple concurrent requests to different models, maximizing throughput.
- **Smart Key Rotation**: Intelligently selects the least-used, available API key to distribute request loads evenly.
+- **Escalating Per-Model Cooldowns**: If a key fails for a specific model, it's placed on a temporary, escalating cooldown for that model, allowing it to be used with others.
- **Automatic Daily Resets**: Cooldowns and usage statistics are automatically reset daily, making the system self-maintaining.
- **Request Logging**: Optional logging of full request and response payloads for easy debugging.
- **Provider Agnostic**: Compatible with any provider supported by `litellm`.

@@ -7,9 +7,10 @@ A robust, asynchronous, and thread-safe client that intelligently rotates and re

- **Asynchronous by Design**: Built with `asyncio` and `httpx` for high-performance, non-blocking I/O.
- **Advanced Concurrency Control**: A single API key can be used for multiple concurrent requests to *different* models, maximizing throughput while ensuring thread safety. Requests for the *same model* using the same key are queued, preventing conflicts.
- **Smart Key Rotation**: Acquires the least-used, available key using a tiered, model-aware locking strategy to distribute load evenly.
- **Intelligent Error Handling**:
  - **Escalating Per-Model Cooldowns**: If a key fails, it's placed on a temporary, escalating cooldown for that specific model, allowing it to continue being used for others.
  - **Key-Level Lockouts**: If a key fails across multiple models, it's temporarily taken out of rotation entirely.
- **Robust Streaming Support**: The client includes a wrapper for streaming responses that can reassemble fragmented JSON chunks and intelligently detect and handle errors that occur mid-stream.
- **Detailed Usage Tracking**: Tracks daily and global usage for each key, including token counts and approximate cost, persisted to a JSON file.

@@ -56,13 +57,15 @@ if not api_keys:

```python
client = RotatingClient(
    api_keys=api_keys,
    max_retries=2,
-    usage_file_path="key_usage.json"
)
```

- `api_keys`: A dictionary where keys are provider names (e.g., `"openai"`, `"gemini"`) and values are lists of API keys for that provider.
- `max_retries`: The number of times to retry a request with the *same key* if a transient server error occurs.
- `usage_file_path`: The path to the JSON file where key usage data will be stored.

### Concurrency and Resource Management

@@ -135,6 +138,14 @@ The client uses a sophisticated error handling mechanism:

- **Rotation Errors (Rate Limit, Auth, etc.)**: The client records the failure in the `UsageManager`, which applies an escalating cooldown to the key for that specific model. The client then immediately acquires a new key and continues its attempt to complete the request.
- **Key-Level Lockouts**: If a key fails on multiple different models, the `UsageManager` can apply a key-level lockout, taking it out of rotation entirely for a short period.
## Extending with Provider Plugins

The library uses a dynamic plugin system. To add support for a new provider's model list, you only need to:

- **Asynchronous by Design**: Built with `asyncio` and `httpx` for high-performance, non-blocking I/O.
- **Advanced Concurrency Control**: A single API key can be used for multiple concurrent requests to *different* models, maximizing throughput while ensuring thread safety. Requests for the *same model* using the same key are queued, preventing conflicts.
- **Smart Key Rotation**: Acquires the least-used, available key using a tiered, model-aware locking strategy to distribute load evenly.
+- **Deadline-Driven Requests**: A global timeout ensures that no request, including all retries and key rotations, exceeds a specified time limit, preventing indefinite hangs.
- **Intelligent Error Handling**:
  - **Escalating Per-Model Cooldowns**: If a key fails, it's placed on a temporary, escalating cooldown for that specific model, allowing it to continue being used for others.
+  - **Deadline-Aware Retries**: Retries requests on transient server errors with exponential backoff, but only if the wait time fits within the global request budget.
  - **Key-Level Lockouts**: If a key fails across multiple models, it's temporarily taken out of rotation entirely.
- **Robust Streaming Support**: The client includes a wrapper for streaming responses that can reassemble fragmented JSON chunks and intelligently detect and handle errors that occur mid-stream.
- **Detailed Usage Tracking**: Tracks daily and global usage for each key, including token counts and approximate cost, persisted to a JSON file.

```python
client = RotatingClient(
    api_keys=api_keys,
    max_retries=2,
+    usage_file_path="key_usage.json",
+    global_timeout=30  # Default is 30 seconds
)
```

- `api_keys`: A dictionary where keys are provider names (e.g., `"openai"`, `"gemini"`) and values are lists of API keys for that provider.
- `max_retries`: The number of times to retry a request with the *same key* if a transient server error occurs.
- `usage_file_path`: The path to the JSON file where key usage data will be stored.
+- `global_timeout`: A hard time limit (in seconds) for the entire request lifecycle. If the total time exceeds this, the request will fail.
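Because final failure now returns `None` instead of raising, callers should check the result explicitly. A minimal sketch using a stub object in place of a real `RotatingClient` (constructing the real client requires API keys):

```python
import asyncio

class StubClient:
    """Stand-in for RotatingClient, illustrating only the failure contract."""
    async def acompletion(self, **kwargs):
        # The real client returns None when all keys are exhausted
        # or the global_timeout is exceeded.
        return None

async def main() -> str:
    client = StubClient()
    response = await client.acompletion(
        model="gemini/gemini-1.5-flash",
        messages=[{"role": "user", "content": "hello"}],
    )
    if response is None:
        # Degrade gracefully instead of catching an exception.
        return "fallback"
    return response

print(asyncio.run(main()))  # → fallback
```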

### Concurrency and Resource Management

- **Rotation Errors (Rate Limit, Auth, etc.)**: The client records the failure in the `UsageManager`, which applies an escalating cooldown to the key for that specific model. The client then immediately acquires a new key and continues its attempt to complete the request.
- **Key-Level Lockouts**: If a key fails on multiple different models, the `UsageManager` can apply a key-level lockout, taking it out of rotation entirely for a short period.

+### Global Timeout and Deadline-Driven Logic
+
+To ensure predictable performance, the client now operates on a strict time budget defined by the `global_timeout` parameter.
+
+- **Deadline Enforcement**: When a request starts, a `deadline` is set. The entire process, including all key rotations and retries, must complete before this deadline.
+- **Deadline-Aware Retries**: If a retry requires a wait time that would exceed the remaining budget, the wait is skipped, and the client immediately rotates to the next key.
+- **Silent Internal Errors**: Intermittent failures like rate limits or temporary server errors are logged internally but are **not raised** to the caller. The client will simply rotate to the next key. A non-streaming request will only return `None` (or a streaming request will end) if the global timeout is exceeded or all keys have been exhausted. This creates a more stable experience for the end-user, as they are shielded from transient backend issues.
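Under this contract, a streaming consumer should watch for an error payload arriving before the final `[DONE]` marker. A sketch with a hand-built chunk list standing in for the client's real stream:

```python
import json

def extract_stream_error(chunks):
    """Return the error payload from an SSE chunk sequence, or None on success."""
    error = None
    for raw in chunks:
        data = raw.removeprefix("data: ").strip()
        if data == "[DONE]":
            break  # the stream always ends with [DONE], even on failure
        payload = json.loads(data)
        if "error" in payload:
            error = payload["error"]
    return error

failed_stream = [
    'data: {"error": {"message": "No available API keys", "type": "proxy_error"}}\n\n',
    "data: [DONE]\n\n",
]
print(extract_stream_error(failed_stream)["type"])  # → proxy_error
```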

## Extending with Provider Plugins

The library uses a dynamic plugin system. To add support for a new provider's model list, you only need to:

@@ -1,5 +1,6 @@

import asyncio
import json
import os
import random
import httpx

@@ -33,7 +34,7 @@ class RotatingClient:

    A client that intelligently rotates and retries API keys using LiteLLM,
    with support for both streaming and non-streaming responses.
    """
-    def __init__(self, api_keys: Dict[str, List[str]], max_retries: int = 2, usage_file_path: str = "key_usage.json", configure_logging: bool = True):
        os.environ["LITELLM_LOG"] = "ERROR"
        litellm.set_verbose = False
        litellm.drop_params = True

@@ -52,6 +53,7 @@ class RotatingClient:

            raise ValueError("API keys dictionary cannot be empty.")
        self.api_keys = api_keys
        self.max_retries = max_retries
        self.usage_manager = UsageManager(file_path=usage_file_path)
        self._model_list_cache = {}
        self._provider_plugins = PROVIDER_PLUGINS

@@ -227,25 +229,40 @@ class RotatingClient:

        if provider not in self.api_keys:
            raise ValueError(f"No API keys configured for provider: {provider}")

        keys_for_provider = self.api_keys[provider]
        tried_keys = set()
        last_exception = None
        kwargs = self._convert_model_params(**kwargs)

            current_key = None
            key_acquired = False
            try:
                if await self.cooldown_manager.is_cooling_down(provider):

                keys_to_try = [k for k in keys_for_provider if k not in tried_keys]
                if not keys_to_try:
                    break

-                current_key = await self.usage_manager.acquire_key(
                key_acquired = True
                tried_keys.add(current_key)

@@ -308,7 +325,15 @@ class RotatingClient:

                        lib_logger.warning(f"Key ...{current_key[-4:]} failed after {self.max_retries} retries with {classified_error.error_type} (Status: {classified_error.status_code}). Error: {error_message}. Rotating key.")
                        break  # Move to the next key

                    wait_time = classified_error.retry_after or (1 * (2 ** attempt)) + random.uniform(0, 1)
                    error_message = str(e).split('\n')[0]
                    lib_logger.warning(f"Key ...{current_key[-4:]} failed with {classified_error.error_type} (Status: {classified_error.status_code}). Error: {error_message}. Retrying in {wait_time:.2f} seconds.")
                    await asyncio.sleep(wait_time)

@@ -341,27 +366,35 @@ class RotatingClient:

            await self.usage_manager.release_key(current_key, model)

        if last_exception:
-            raise

    async def _streaming_acompletion_with_retry(self, request: Optional[Any], **kwargs) -> AsyncGenerator[str, None]:
        """A dedicated generator for retrying streaming completions with full request preparation and per-key retries."""
        model = kwargs.get("model")
        provider = model.split('/')[0]
        keys_for_provider = self.api_keys[provider]
        tried_keys = set()
        last_exception = None
        kwargs = self._convert_model_params(**kwargs)
        try:
-            while len(tried_keys) < len(keys_for_provider):
                current_key = None
                key_acquired = False
                try:
                    if await self.cooldown_manager.is_cooling_down(provider):

                    keys_to_try = [k for k in keys_for_provider if k not in tried_keys]
                    if not keys_to_try:

@@ -369,7 +402,11 @@ class RotatingClient:

                        break

                    lib_logger.info(f"Acquiring key for model {model}. Tried keys: {len(tried_keys)}/{len(keys_for_provider)}")
-                    current_key = await self.usage_manager.acquire_key(
                    key_acquired = True
                    tried_keys.add(current_key)

@@ -455,6 +492,11 @@ class RotatingClient:

                        break

                    wait_time = classified_error.retry_after or (1 * (2 ** attempt)) + random.uniform(0, 1)
                    lib_logger.warning(f"Key ...{current_key[-4:]} failed with {classified_error.error_type}. Retrying in {wait_time:.2f} seconds.")
                    await asyncio.sleep(wait_time)
                    continue

@@ -493,22 +535,23 @@ class RotatingClient:

            if key_acquired and current_key:
                await self.usage_manager.release_key(current_key, model)

            if last_exception:
                yield "data: [DONE]\n\n"

        except NoAvailableKeysError as e:
-            lib_logger.error(f"A streaming request failed because no keys were available: {e}")
            error_data = {"error": {"message": str(e), "type": "proxy_busy"}}
            yield f"data: {json.dumps(error_data)}\n\n"
            yield "data: [DONE]\n\n"
        except Exception as e:
            error_data = {"error": {"message": f"An unexpected error occurred: {str(e)}", "type": "proxy_internal_error"}}
            yield f"data: {json.dumps(error_data)}\n\n"
            yield "data: [DONE]\n\n"
| 1 |
import asyncio
|
| 2 |
import json
|
| 3 |
+
import time
|
| 4 |
import os
|
| 5 |
import random
|
| 6 |
import httpx
|
|
|
|
| 34 |
A client that intelligently rotates and retries API keys using LiteLLM,
|
| 35 |
with support for both streaming and non-streaming responses.
|
| 36 |
"""
|
| 37 |
+
def __init__(self, api_keys: Dict[str, List[str]], max_retries: int = 2, usage_file_path: str = "key_usage.json", configure_logging: bool = True, global_timeout: int = 30):
|
| 38 |
os.environ["LITELLM_LOG"] = "ERROR"
|
| 39 |
litellm.set_verbose = False
|
| 40 |
litellm.drop_params = True
|
|
|
|
| 53 |
raise ValueError("API keys dictionary cannot be empty.")
|
| 54 |
self.api_keys = api_keys
|
| 55 |
self.max_retries = max_retries
|
| 56 |
+
self.global_timeout = global_timeout
|
| 57 |
self.usage_manager = UsageManager(file_path=usage_file_path)
|
| 58 |
self._model_list_cache = {}
|
| 59 |
self._provider_plugins = PROVIDER_PLUGINS
|
|
|
|
| 229 |
if provider not in self.api_keys:
|
| 230 |
raise ValueError(f"No API keys configured for provider: {provider}")
|
| 231 |
|
| 232 |
+
# Establish a global deadline for the entire request lifecycle.
|
| 233 |
+
deadline = time.time() + self.global_timeout
|
| 234 |
keys_for_provider = self.api_keys[provider]
|
| 235 |
tried_keys = set()
|
| 236 |
last_exception = None
|
| 237 |
kwargs = self._convert_model_params(**kwargs)
|
| 238 |
|
| 239 |
+
# The main rotation loop. It continues as long as there are untried keys and the global deadline has not been exceeded.
|
| 240 |
+
while len(tried_keys) < len(keys_for_provider) and time.time() < deadline:
|
| 241 |
current_key = None
|
| 242 |
key_acquired = False
|
| 243 |
try:
|
| 244 |
+
# Check for a provider-wide cooldown first.
|
| 245 |
if await self.cooldown_manager.is_cooling_down(provider):
|
| 246 |
+
remaining_cooldown = await self.cooldown_manager.get_cooldown_remaining(provider)
|
| 247 |
+
remaining_budget = deadline - time.time()
|
| 248 |
+
|
| 249 |
+
# If the cooldown is longer than the remaining time budget, fail fast.
|
| 250 |
+
if remaining_cooldown > remaining_budget:
|
| 251 |
+
lib_logger.warning(f"Provider {provider} cooldown ({remaining_cooldown:.2f}s) exceeds remaining request budget ({remaining_budget:.2f}s). Failing early.")
|
| 252 |
+
break
|
| 253 |
+
|
| 254 |
+
lib_logger.warning(f"Provider {provider} is in cooldown. Waiting for {remaining_cooldown:.2f} seconds.")
|
| 255 |
+
await asyncio.sleep(remaining_cooldown)
|
| 256 |
|
| 257 |
keys_to_try = [k for k in keys_for_provider if k not in tried_keys]
|
| 258 |
if not keys_to_try:
|
| 259 |
break
|
| 260 |
|
| 261 |
+
current_key = await self.usage_manager.acquire_key(
|
| 262 |
+
available_keys=keys_to_try,
|
| 263 |
+
model=model,
|
| 264 |
+
deadline=deadline
|
| 265 |
+
)
|
| 266 |
key_acquired = True
|
| 267 |
tried_keys.add(current_key)
|
| 268 |
|
|
|
|
| 325 |
lib_logger.warning(f"Key ...{current_key[-4:]} failed after {self.max_retries} retries with {classified_error.error_type} (Status: {classified_error.status_code}). Error: {error_message}. Rotating key.")
|
| 326 |
break # Move to the next key
|
| 327 |
|
| 328 |
+
# For temporary errors, wait before retrying with the same key.
|
| 329 |
wait_time = classified_error.retry_after or (1 * (2 ** attempt)) + random.uniform(0, 1)
|
| 330 |
+
remaining_budget = deadline - time.time()
|
| 331 |
+
|
| 332 |
+
# If the required wait time exceeds the budget, don't wait; rotate to the next key immediately.
|
| 333 |
+
if wait_time > remaining_budget:
|
| 334 |
+
lib_logger.warning(f"Required retry wait time ({wait_time:.2f}s) exceeds remaining budget ({remaining_budget:.2f}s). Rotating key early.")
|
| 335 |
+
break
|
| 336 |
+
|
| 337 |
error_message = str(e).split('\n')[0]
|
| 338 |
lib_logger.warning(f"Key ...{current_key[-4:]} failed with {classified_error.error_type} (Status: {classified_error.status_code}). Error: {error_message}. Retrying in {wait_time:.2f} seconds.")
|
| 339 |
await asyncio.sleep(wait_time)
|
|
|
|
| 366 |
await self.usage_manager.release_key(current_key, model)
|
| 367 |
|
| 368 |
if last_exception:
|
| 369 |
+
# Log the final error but do not raise it, as per the new requirement.
|
| 370 |
+
# The client should not see intermittent failures.
|
| 371 |
+
lib_logger.error(f"Request failed after trying all keys or exceeding global timeout. Last error: {last_exception}")
|
| 372 |
|
| 373 |
+
# Return None to indicate failure without propagating a disruptive exception.
|
| 374 |
+
return None
|
| 375 |
|
| 376 |
async def _streaming_acompletion_with_retry(self, request: Optional[Any], **kwargs) -> AsyncGenerator[str, None]:
|
| 377 |
"""A dedicated generator for retrying streaming completions with full request preparation and per-key retries."""
|
| 378 |
model = kwargs.get("model")
|
| 379 |
provider = model.split('/')[0]
|
| 380 |
keys_for_provider = self.api_keys[provider]
|
| 381 |
+
deadline = time.time() + self.global_timeout
|
| 382 |
tried_keys = set()
|
| 383 |
last_exception = None
|
| 384 |
kwargs = self._convert_model_params(**kwargs)
|
| 385 |
try:
|
| 386 |
+
while len(tried_keys) < len(keys_for_provider) and time.time() < deadline:
|
| 387 |
current_key = None
|
| 388 |
key_acquired = False
|
| 389 |
try:
|
| 390 |
if await self.cooldown_manager.is_cooling_down(provider):
|
| 391 |
+
remaining_cooldown = await self.cooldown_manager.get_cooldown_remaining(provider)
|
| 392 |
+
remaining_budget = deadline - time.time()
|
| 393 |
+
if remaining_cooldown > remaining_budget:
|
| 394 |
+
lib_logger.warning(f"Provider {provider} cooldown ({remaining_cooldown:.2f}s) exceeds remaining request budget ({remaining_budget:.2f}s). Failing early.")
|
| 395 |
+
break
|
| 396 |
+
lib_logger.warning(f"Provider {provider} is in a global cooldown. All requests to this provider will be paused for {remaining_cooldown:.2f} seconds.")
|
| 397 |
+
await asyncio.sleep(remaining_cooldown)
|
| 398 |
|
| 399 |
keys_to_try = [k for k in keys_for_provider if k not in tried_keys]
|
| 400 |
if not keys_to_try:
|
|
|
|
| 402 |
break
|
| 403 |
|
| 404 |
lib_logger.info(f"Acquiring key for model {model}. Tried keys: {len(tried_keys)}/{len(keys_for_provider)}")
|
| 405 |
+
current_key = await self.usage_manager.acquire_key(
|
| 406 |
+
available_keys=keys_to_try,
|
| 407 |
+
model=model,
|
| 408 |
+
deadline=deadline
|
| 409 |
+
)
|
| 410 |
key_acquired = True
|
| 411 |
tried_keys.add(current_key)
|
| 412 |
|
|
|
|
| 492 |
break
|
| 493 |
|
| 494 |
wait_time = classified_error.retry_after or (1 * (2 ** attempt)) + random.uniform(0, 1)
|
| 495 |
+
remaining_budget = deadline - time.time()
|
| 496 |
+
if wait_time > remaining_budget:
|
| 497 |
+
lib_logger.warning(f"Required retry wait time ({wait_time:.2f}s) exceeds remaining budget ({remaining_budget:.2f}s). Rotating key early.")
|
| 498 |
+
break
|
| 499 |
+
|
| 500 |
lib_logger.warning(f"Key ...{current_key[-4:]} failed with {classified_error.error_type}. Retrying in {wait_time:.2f} seconds.")
|
| 501 |
await asyncio.sleep(wait_time)
|
| 502 |
continue
|
|
|
|
```diff
@@ ... @@
             if key_acquired and current_key:
                 await self.usage_manager.release_key(current_key, model)

+        final_error_message = "Failed to complete the streaming request: No available API keys after rotation or global timeout exceeded."
         if last_exception:
+            final_error_message = f"Failed to complete the streaming request. Last error: {str(last_exception)}"
+            lib_logger.error(f"Streaming request failed after trying all keys. Last error: {last_exception}")
+
+        error_data = {"error": {"message": final_error_message, "type": "proxy_error"}}
+        yield f"data: {json.dumps(error_data)}\n\n"
         yield "data: [DONE]\n\n"

     except NoAvailableKeysError as e:
+        lib_logger.error(f"A streaming request failed because no keys were available within the time budget: {e}")
         error_data = {"error": {"message": str(e), "type": "proxy_busy"}}
         yield f"data: {json.dumps(error_data)}\n\n"
         yield "data: [DONE]\n\n"
     except Exception as e:
+        # This will now only catch fatal errors that should be raised, like invalid requests.
+        lib_logger.error(f"An unhandled exception occurred in streaming retry logic: {e}", exc_info=True)
         error_data = {"error": {"message": f"An unexpected error occurred: {str(e)}", "type": "proxy_internal_error"}}
         yield f"data: {json.dumps(error_data)}\n\n"
         yield "data: [DONE]\n\n"
```
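Under the new contract, stream consumers should no longer expect exceptions for transient failures; instead, a failed stream ends with an error payload followed by `data: [DONE]`. A minimal client-side sketch of detecting this (the parsing helper is illustrative, not part of the library):

```python
import json
from typing import Iterable, Optional

def final_stream_error(chunks: Iterable[str]) -> Optional[dict]:
    # Scan SSE lines shaped like the yields above and return the error
    # payload, if any, that precedes the terminating [DONE] sentinel.
    error = None
    for chunk in chunks:
        payload = chunk.removeprefix("data: ").strip()
        if payload == "[DONE]":
            break
        data = json.loads(payload)
        if "error" in data:
            error = data["error"]
    return error
```

A successful stream yields `None`; a failed one yields a dict whose `type` is `proxy_error`, `proxy_busy`, or `proxy_internal_error`, matching the payloads in the diff.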
`src/rotator_library/usage_manager.py` — `wait_timeout` is removed in favor of a caller-supplied `deadline`, and `acquire_key` now raises `NoAvailableKeysError` once the budget is spent (lines truncated in the rendered diff are left truncated):

```diff
@@ -20,10 +20,9 @@ class UsageManager:
     Manages usage statistics and cooldowns for API keys with asyncio-safe locking,
     asynchronous file I/O, and a lazy-loading mechanism for usage data.
     """
-    def __init__(self, file_path: str = "key_usage.json",
+    def __init__(self, file_path: str = "key_usage.json", daily_reset_time_utc: Optional[str] = "03:00"):
         self.file_path = file_path
         self.key_states: Dict[str, Dict[str, Any]] = {}
-        self.wait_timeout = wait_timeout

         self._data_lock = asyncio.Lock()
         self._usage_data: Optional[Dict] = None
@@ -129,18 +128,21 @@ class UsageManager:
             "models_in_use": set()
         }

-    async def acquire_key(self, available_keys: List[str], model: str) -> str:
+    async def acquire_key(self, available_keys: List[str], model: str, deadline: float) -> str:
         """
-        Acquires the best available key using a tiered, model-aware locking strategy
+        Acquires the best available key using a tiered, model-aware locking strategy,
+        respecting a global deadline.
         """
         await self._lazy_init()
         self._initialize_key_states(available_keys)

-
-        while time.time()
+        # This loop continues as long as the global deadline has not been met.
+        while time.time() < deadline:
             tier1_keys, tier2_keys = [], []
+            now = time.time()
+
+            # First, filter the list of available keys to exclude any on cooldown.
             async with self._data_lock:
-                now = time.time()
                 for key in available_keys:
                     key_data = self._usage_data.get(key, {})
@@ -148,17 +150,21 @@ class UsageManager:
                        (key_data.get("model_cooldowns", {}).get(model) or 0) > now:
                         continue

+                    # Prioritize keys based on their current usage to ensure load balancing.
                     usage_count = key_data.get("daily", {}).get("models", {}).get(model, {}).get("success_count", 0)
                     key_state = self.key_states[key]

+                    # Tier 1: Completely idle keys (preferred).
                     if not key_state["models_in_use"]:
                         tier1_keys.append((key, usage_count))
+                    # Tier 2: Keys busy with other models, but free for this one.
                     elif model not in key_state["models_in_use"]:
                         tier2_keys.append((key, usage_count))

             tier1_keys.sort(key=lambda x: x[1])
             tier2_keys.sort(key=lambda x: x[1])

+            # Attempt to acquire a key from Tier 1 first.
             for key, _ in tier1_keys:
                 state = self.key_states[key]
                 async with state["lock"]:
@@ -167,6 +173,7 @@ class UsageManager:
                     lib_logger.info(f"Acquired Tier 1 key ...{key[-4:]} for model {model}")
                     return key

+            # If no Tier 1 keys are available, try Tier 2.
             for key, _ in tier2_keys:
                 state = self.key_states[key]
                 async with state["lock"]:
@@ -175,6 +182,7 @@ class UsageManager:
                     lib_logger.info(f"Acquired Tier 2 key ...{key[-4:]} for model {model}")
                     return key

+            # If all eligible keys are locked, wait for a key to be released.
             lib_logger.info("All eligible keys are currently locked for this model. Waiting...")

             all_potential_keys = tier1_keys + tier2_keys
@@ -183,20 +191,24 @@ class UsageManager:
                 await asyncio.sleep(1)
                 continue

+            # Wait on the condition of the key with the lowest current usage.
             best_wait_key = min(all_potential_keys, key=lambda x: x[1])[0]
             wait_condition = self.key_states[best_wait_key]["condition"]

             try:
                 async with wait_condition:
-
-                    if
-                        break
-
+                    remaining_budget = deadline - time.time()
+                    if remaining_budget <= 0:
+                        break  # Exit if the budget has already been exceeded.
+                    # Wait for a notification, but no longer than the remaining budget or 1 second.
+                    await asyncio.wait_for(wait_condition.wait(), timeout=min(1, remaining_budget))
                     lib_logger.info("Notified that a key was released. Re-evaluating...")
             except asyncio.TimeoutError:
+                # This is not an error, just a timeout for the wait. The main loop will re-evaluate.
                 lib_logger.info("Wait timed out. Re-evaluating for any available key.")

-
+        # If the loop exits, it means the deadline was exceeded.
+        raise NoAvailableKeysError(f"Could not acquire a key for model {model} within the global time budget.")


     async def release_key(self, key: str, model: str):
```
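The wait step in `acquire_key` — blocking on a key's condition for at most `min(1, remaining_budget)` seconds — can be exercised on its own. This sketch wraps the same `asyncio.wait_for(condition.wait(), ...)` pattern the diff uses; the function name and return convention are illustrative:

```python
import asyncio
import time

async def bounded_wait(condition: asyncio.Condition, deadline: float) -> bool:
    """Wait for a release notification, bounded by the global deadline.

    Returns True if notified, False if the budget ran out first.
    """
    async with condition:
        remaining_budget = deadline - time.time()
        if remaining_budget <= 0:
            return False  # Budget already exhausted; do not wait at all.
        try:
            # Cap the wait at 1 second so the caller re-evaluates key
            # availability regularly, exactly as the acquire loop does.
            await asyncio.wait_for(condition.wait(), timeout=min(1, remaining_budget))
            return True
        except asyncio.TimeoutError:
            return False
```

Either outcome returns control to the acquire loop, which re-checks `time.time() < deadline` and raises `NoAvailableKeysError` once the budget is spent.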