Mirrowel committed · Commit 40be1c9 · Parent(s): 30f6fec

docs: update project descriptions and component names
Refine the project's title and descriptions across documentation to better emphasize its role as a universal LLM API proxy backed by a resilience library, improving clarity for developers building agentic systems. This includes renaming the library to "Resilience & API Key Management Library" and updating feature lists to highlight high availability and efficient key management.
- DOCUMENTATION.md +10 -10
- README.md +17 -16
- src/rotator_library/README.md +9 -9
DOCUMENTATION.md CHANGED

@@ -1,21 +1,21 @@
-# Technical Documentation: …
+# Technical Documentation: Universal LLM API Proxy & Resilience Library

-This document provides a detailed technical explanation of the …
+This document provides a detailed technical explanation of the project's two main components: the Universal LLM API Proxy and the Resilience Library that powers it.

 ## 1. Architecture Overview

 The project is a monorepo containing two primary components:

-1. **`…
+1. **The Proxy Application (`proxy_app`)**: This is the user-facing component. It's a FastAPI application that uses `litellm` to create a universal, OpenAI-compatible API. Its primary role is to abstract away the complexity of dealing with multiple LLM providers, offering a single point of entry for applications like agentic coders.
-2. **`…
+2. **The Resilience Library (`rotator_library`)**: This is the core engine that provides high availability. It is consumed by the proxy app to manage a pool of API keys, handle errors gracefully, and ensure requests are completed successfully even when individual keys or provider endpoints face issues.

-This architecture separates the …
+This architecture cleanly separates the API interface from the resilience logic, making the library a portable and powerful tool for any application needing robust API key management.

 ---

-## 2. `rotator_library` - The …
+## 2. `rotator_library` - The Resilience Engine

-This library is the heart of the project, containing all the logic for …
+This library is the heart of the project, containing all the logic for managing a pool of API keys, tracking their usage, and handling provider interactions to ensure application resilience.

 ### 2.1. `client.py` - The `RotatingClient`

@@ -40,7 +40,7 @@ client = RotatingClient(
 * Managing a shared `httpx.AsyncClient` for all non-blocking HTTP requests.
 * Interfacing with the `UsageManager` to acquire and release API keys.
 * Dynamically loading and using provider-specific plugins from the `providers/` directory.
-* Executing API calls via `litellm` with a robust, **deadline-driven** retry and …
+* Executing API calls via `litellm` with a robust, **deadline-driven** retry and key selection strategy.
 * Providing a safe, stateful wrapper for handling streaming responses.

 #### Request Lifecycle: A Deadline-Driven Approach

@@ -49,7 +49,7 @@ The request lifecycle has been redesigned around a single, authoritative time bu

 1. **Deadline Establishment**: The moment `acompletion` or `aembedding` is called, a `deadline` is calculated: `time.time() + self.global_timeout`. This `deadline` is the absolute point in time by which the entire operation must complete.

-2. **Deadline-Aware Key …
+2. **Deadline-Aware Key Selection Loop**: The main `while` loop now has a critical secondary condition: `while len(tried_keys) < len(keys_for_provider) and time.time() < deadline:`. The loop will exit immediately if the `deadline` is reached, regardless of how many keys are left to try.

 3. **Deadline-Aware Key Acquisition**: The `self.usage_manager.acquire_key()` method now accepts the `deadline`. The `UsageManager` will not wait indefinitely for a key; if it cannot acquire one before the `deadline` is met, it will raise a `NoAvailableKeysError`, causing the request to fail fast with a "busy" error.

@@ -59,7 +59,7 @@ The request lifecycle has been redesigned around a single, authoritative time bu

 5. **Refined Error Propagation**:
    - **Fatal Errors**: Invalid requests or authentication errors are raised immediately to the client.
-   - **Intermittent Errors**: …
+   - **Intermittent Errors**: Temporary issues like server errors and provider-side capacity limits are now handled internally. The error is logged and the key is rotated, but the exception is **not** propagated to the end client. This prevents the client from seeing disruptive, intermittent failures.
    - **Final Failure**: A non-streaming request will only return `None` (indicating failure) if either a) the global `deadline` is exceeded, or b) all keys for the provider have been tried and have failed. A streaming request will yield a final `[DONE]` with an error message in the same scenarios.

 ### 2.2. `usage_manager.py` - Stateful Concurrency & Usage Management
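The deadline-driven lifecycle described in the DOCUMENTATION.md changes above can be sketched in a few lines of Python. This is an illustrative sketch, not the library's actual code: `try_request` and `acquire_key` are hypothetical stand-ins for the client's internal request machinery and `UsageManager.acquire_key()`, and the 30-second default is an assumed value.

```python
import time

GLOBAL_TIMEOUT = 30.0  # assumed default; the real value comes from RotatingClient config


class NoAvailableKeysError(Exception):
    """Raised when no key can be acquired before the deadline."""


def complete_with_deadline(keys_for_provider, try_request, acquire_key,
                           global_timeout=GLOBAL_TIMEOUT):
    # 1. Deadline establishment: one absolute time budget for the whole operation.
    deadline = time.time() + global_timeout
    tried_keys = set()
    # 2. Deadline-aware key selection loop: exit when out of keys *or* out of time.
    while len(tried_keys) < len(keys_for_provider) and time.time() < deadline:
        try:
            # 3. Deadline-aware acquisition: may raise NoAvailableKeysError
            # instead of waiting past the deadline.
            key = acquire_key(deadline)
        except NoAvailableKeysError:
            break  # fail fast with a "busy" result
        tried_keys.add(key)
        result = try_request(key)
        if result is not None:
            return result  # success
        # Intermittent failure: log, rotate to the next key, do not surface the error.
    return None  # final failure: deadline exceeded or all keys exhausted
```

Note how `None` is only returned on the two terminal conditions the documentation names: the global deadline being exceeded or every key having been tried.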
README.md CHANGED

@@ -1,4 +1,4 @@
-# …
+# Universal LLM API Proxy & Resilience Library [](https://ko-fi.com/C0C0UZS4P)

 ## Easy Setup for Beginners (Windows)

@@ -15,17 +15,18 @@ Your proxy is now running! You can now use it in your applications.
 ## Detailed Setup and Features

-This project provides a …
+This project provides a powerful solution for developers building complex applications, such as agentic systems, that interact with multiple Large Language Model (LLM) providers. It consists of two distinct but complementary components:

-1. A …
+1. **A Universal API Proxy**: A self-hosted FastAPI application that provides a single, OpenAI-compatible endpoint for all your LLM requests. Powered by `litellm`, it allows you to seamlessly switch between different providers and models without altering your application's code.
-2. A …
+2. **A Resilience & Key Management Library**: The core engine that powers the proxy. This reusable Python library intelligently manages a pool of API keys to ensure your application is highly available and resilient to transient provider errors or performance issues.

 ## Features

-- **…
-- **…
-- **…
-- **…
+- **Universal API Endpoint**: Simplifies development by providing a single, OpenAI-compatible interface for diverse LLM providers.
+- **High Availability**: The underlying library ensures your application remains operational by gracefully handling transient provider errors and API key-specific issues.
+- **Resilient Performance**: A global timeout on all requests prevents your application from hanging on unresponsive provider APIs.
+- **Efficient Concurrency**: Maximizes throughput by allowing a single API key to handle multiple concurrent requests to different models.
+- **Intelligent Key Management**: Optimizes request distribution across your pool of keys by selecting the best available one for each call.
 - **Escalating Per-Model Cooldowns**: If a key fails for a specific model, it's placed on a temporary, escalating cooldown for that model, allowing it to be used with others.
 - **Automatic Daily Resets**: Cooldowns and usage statistics are automatically reset daily, making the system self-maintaining.
 - **Request Logging**: Optional logging of full request and response payloads for easy debugging.

@@ -220,15 +221,15 @@ curl -X POST http://127.0.0.1:8000/v1/chat/completions \
 ### How It Works

-…
+When a request is made to the proxy, the application uses its core resilience library to ensure the request is handled reliably:

-1. **…
+1. **Selects an Optimal Key**: The `UsageManager` selects the best available key from your pool. It uses a tiered locking strategy to find a healthy, available key, prioritizing those with the least recent usage. This allows for concurrent requests to different models using the same key, maximizing efficiency.
-2. **Makes the Request**: …
+2. **Makes the Request**: The proxy uses the acquired key to make the API call to the target provider via `litellm`.
-3. **…
+3. **Manages Errors Gracefully**:
    - It uses a `classify_error` function to determine the failure type.
-   - For **server errors**, it retries the request with the same key using exponential backoff.
+   - For **transient server errors**, it retries the request with the same key using exponential backoff.
-   - For **…
+   - For **key-specific issues (e.g., authentication or provider-side limits)**, it temporarily places that key on a cooldown for the specific model and seamlessly retries the request with the next available key from the pool.
-4. **Tracks Usage & Releases Key**: On a successful request, it records usage stats. The key …
+4. **Tracks Usage & Releases Key**: On a successful request, it records usage stats. The key is then released back into the available pool, ready for the next request.

 ### Command-Line Arguments and Scripts

@@ -260,5 +261,5 @@ For convenience on Windows, you can use the provided `.bat` scripts in the root
 ## Library and Technical Docs

-- **Using the Library**: For documentation on how to use the `…
+- **Using the Library**: For documentation on how to use the `api-key-manager` library directly in your own Python projects, please refer to its [README.md](src/rotator_library/README.md).
 - **Technical Details**: For a more in-depth technical explanation of the library's architecture, components, and internal workings, please refer to the [Technical Documentation](DOCUMENTATION.md).
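The "How It Works" flow in the README changes above — classify the error, retry transient server errors on the same key with exponential backoff, and let key-specific failures bubble up so the caller can rotate keys — can be sketched as follows. This is a minimal illustration, not the library's implementation: the real `classify_error` inspects `litellm` exception types, whereas this stand-in matches on message text only.

```python
import asyncio


def classify_error(exc: Exception) -> str:
    """Hypothetical stand-in for the library's classifier: buckets an
    exception by its message text."""
    text = str(exc).lower()
    if "401" in text or "invalid api key" in text:
        return "authentication"
    if "429" in text or "rate limit" in text:
        return "rate_limit"
    if "500" in text or "503" in text:
        return "server_error"
    return "unknown"


async def request_with_backoff(call, key, max_retries=3, base_delay=1.0):
    """Retry transient server errors on the *same* key with exponential
    backoff; re-raise anything key-specific so the caller can rotate keys."""
    for attempt in range(max_retries):
        try:
            return await call(key)
        except Exception as exc:
            if classify_error(exc) != "server_error" or attempt == max_retries - 1:
                raise  # key-specific or persistent failure: rotate keys upstream
            await asyncio.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

A caller would wrap this in the key-selection loop, moving to the next key whenever the raised error is key-specific rather than transient.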
src/rotator_library/README.md CHANGED

@@ -1,13 +1,13 @@
-# …
+# Resilience & API Key Management Library

-A robust, asynchronous, and thread-safe …
+A robust, asynchronous, and thread-safe Python library for managing a pool of API keys. It is designed to be integrated into applications (such as the Universal LLM API Proxy included in this project) to provide a powerful layer of resilience and high availability when interacting with multiple LLM providers.

 ## Key Features

 - **Asynchronous by Design**: Built with `asyncio` and `httpx` for high-performance, non-blocking I/O.
 - **Advanced Concurrency Control**: A single API key can be used for multiple concurrent requests to *different* models, maximizing throughput while ensuring thread safety. Requests for the *same model* using the same key are queued, preventing conflicts.
-- **Smart Key …
+- **Smart Key Management**: Selects the optimal key for each request using a tiered, model-aware locking strategy to distribute load evenly and maximize availability.
-- **Deadline-Driven Requests**: A global timeout ensures that no request, including all retries and key …
+- **Deadline-Driven Requests**: A global timeout ensures that no request, including all retries and key selections, exceeds a specified time limit, preventing indefinite hangs.
 - **Intelligent Error Handling**:
   - **Escalating Per-Model Cooldowns**: If a key fails, it's placed on a temporary, escalating cooldown for that specific model, allowing it to continue being used for others.
   - **Deadline-Aware Retries**: Retries requests on transient server errors with exponential backoff, but only if the wait time fits within the global request budget.

@@ -28,7 +28,7 @@ pip install -e .
 ## `RotatingClient` Class

-This is the main class for interacting with the library. It is designed to be a long-lived object that manages …
+This is the main class for interacting with the library. It is designed to be a long-lived object that manages the state of your API key pool.

 ### Initialization

@@ -90,7 +90,7 @@ asyncio.run(main())
 #### `async def acompletion(self, **kwargs) -> Any:`

-This is the primary method for making API calls. It's a wrapper around `litellm.acompletion` that adds the core logic for key acquisition, …
+This is the primary method for making API calls. It's a wrapper around `litellm.acompletion` that adds the core logic for key acquisition, selection, and retries.

 - **Parameters**: Accepts the same keyword arguments as `litellm.acompletion`. The `model` parameter is required and must be a string in the format `provider/model_name`.
 - **Returns**:

@@ -115,7 +115,7 @@ asyncio.run(stream_example())
 #### `async def aembedding(self, **kwargs) -> Any:`

-A wrapper around `litellm.aembedding` that provides the same key …
+A wrapper around `litellm.aembedding` that provides the same key management and retry logic for embedding requests.

 #### `def token_count(self, model: str, text: str = None, messages: List[Dict[str, str]] = None) -> int:`

@@ -135,7 +135,7 @@ The client uses a sophisticated error handling mechanism:
 - **Error Classification**: All exceptions from `litellm` are passed through a `classify_error` function to determine their type (`rate_limit`, `authentication`, `server_error`, etc.).
 - **Server Errors**: The client will retry the request with the *same key* up to `max_retries` times, using an exponential backoff strategy.
-- **…
+- **Key-Specific Errors (Authentication, Quota, etc.)**: The client records the failure in the `UsageManager`, which applies an escalating cooldown to the key for that specific model. The client then immediately acquires a new key and continues its attempt to complete the request.
 - **Key-Level Lockouts**: If a key fails on multiple different models, the `UsageManager` can apply a key-level lockout, taking it out of rotation entirely for a short period.

@@ -144,7 +144,7 @@ To ensure predictable performance, the client now operates on a strict time budg
 - **Deadline Enforcement**: When a request starts, a `deadline` is set. The entire process, including all key rotations and retries, must complete before this deadline.
 - **Deadline-Aware Retries**: If a retry requires a wait time that would exceed the remaining budget, the wait is skipped, and the client immediately rotates to the next key.
-- **Silent Internal Errors**: Intermittent failures like …
+- **Silent Internal Errors**: Intermittent failures like provider capacity limits or temporary server errors are logged internally but are **not raised** to the caller. The client will simply rotate to the next key. A non-streaming request will only return `None` (or a streaming request will end) if the global timeout is exceeded or all keys have been exhausted. This creates a more stable experience for the end user, as they are shielded from transient backend issues.

 ## Extending with Provider Plugins
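The escalating per-model cooldowns described in the library README changes above can be sketched with a small tracker. This is an illustrative sketch only: the real logic lives inside the library's `UsageManager`, and the escalation schedule below is an assumed example, not the library's actual timings.

```python
import time


class CooldownTracker:
    """Sketch of escalating, per-(key, model) cooldowns. Repeated failures of a
    key on one model escalate its cooldown for that model only, leaving the key
    usable with other models."""

    ESCALATION = [5, 30, 120, 600]  # assumed schedule, in seconds

    def __init__(self):
        self._failures = {}  # (key, model) -> consecutive failure count
        self._until = {}     # (key, model) -> timestamp when the cooldown ends

    def record_failure(self, key: str, model: str) -> float:
        k = (key, model)
        count = self._failures.get(k, 0)
        delay = self.ESCALATION[min(count, len(self.ESCALATION) - 1)]
        self._failures[k] = count + 1
        self._until[k] = time.time() + delay
        return delay

    def record_success(self, key: str, model: str) -> None:
        # A success resets both the escalation level and any active cooldown.
        self._failures.pop((key, model), None)
        self._until.pop((key, model), None)

    def is_available(self, key: str, model: str) -> bool:
        return time.time() >= self._until.get((key, model), 0.0)
```

Because cooldowns are keyed by `(key, model)`, a key cooling down for one model remains immediately available for every other model, which is exactly the behavior the per-model cooldown feature promises. A daily reset (as the README describes) would simply clear both dictionaries.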