Mirrowel committed
Commit c9419cb · 1 Parent(s): b0569d9

feat: Update documentation and example configurations for improved clarity and usability

Files changed (4)
  1. .env.example +4 -2
  2. DOCUMENTATION.md +87 -58
  3. README.md +133 -105
  4. src/rotator_library/README.md +49 -44
.env.example CHANGED
@@ -1,4 +1,4 @@
-# LiteLLM will automatically pick up these keys.
+# Library will automatically pick up these keys.
 # Add more keys by creating GEMINI_API_KEY_2, GEMINI_API_KEY_3, etc.
 GEMINI_API_KEY_1="YOUR_GEMINI_API_KEY_1"
 GEMINI_API_KEY_2="YOUR_GEMINI_API_KEY_2"
@@ -6,6 +6,8 @@ OPENROUTER_API_KEY_1="YOUR_OPENROUTER_API_KEY_1"
 OPENROUTER_API_KEY_2="YOUR_OPENROUTER_API_KEY_2"
 CHUTES_API_KEY_1="YOUR_CHUTES_API_KEY_1"
 CHUTES_API_KEY_2="YOUR_CHUTES_API_KEY_2"
+NVIDIA_NIM_API_KEY_1="YOUR_NVIDIA_NIM_API_KEY_1"
+NVIDIA_NIM_API_KEY_2="YOUR_NVIDIA_NIM_API_KEY_2"
 
-# A secret key for your proxy server to authenticate requests
+# A secret key for your proxy server to authenticate requests (can be anything; used for compatibility)
 PROXY_API_KEY="YOUR_PROXY_API_KEY"
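The numbered-key convention in this file (`GEMINI_API_KEY_1`, `GEMINI_API_KEY_2`, ...) can be sketched as a small grouping helper. This is purely illustrative: `discover_keys` is a hypothetical name, not the proxy's actual code.

```python
import re

# Illustrative sketch (not the proxy's actual code) of how numbered
# PROVIDER_API_KEY_N variables, as shown above, can be grouped by provider.
def discover_keys(environ: dict) -> dict:
    pattern = re.compile(r"^(.+)_API_KEY_(\d+)$")
    found: dict = {}
    for name, value in environ.items():
        match = pattern.match(name)
        if match and value:
            provider = match.group(1).lower()
            found.setdefault(provider, []).append((int(match.group(2)), value))
    # Sort each provider's keys by their numeric suffix.
    return {p: [v for _, v in sorted(pairs)] for p, pairs in found.items()}
```

Note that a bare `PROXY_API_KEY` (no numeric suffix) deliberately does not match the pattern, since it is the proxy's own secret rather than a provider key.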
DOCUMENTATION.md CHANGED
@@ -1,90 +1,119 @@
 # Technical Documentation: `rotating-api-key-client`
 
-This document provides a detailed technical explanation of the `rotating-api-key-client` library, its components, and its internal workings.
+This document provides a detailed technical explanation of the `rotating-api-key-client` library, its components, and its internal workings. The library has evolved into a sophisticated, asynchronous client for managing LLM API keys with a strong focus on concurrency, resilience, and state management.
 
 ## 1. `client.py` - The `RotatingClient`
 
-The `RotatingClient` is the central component of the library, orchestrating API calls, key rotation, and error handling.
+The `RotatingClient` is the central component, orchestrating API calls, key management, and error handling. It is designed as a long-lived, async-native object.
+
+### Core Responsibilities
+- Managing an `httpx.AsyncClient` for non-blocking HTTP requests.
+- Interfacing with the `UsageManager` to acquire and release API keys.
+- Handling provider-specific request modifications.
+- Executing API calls via `litellm` with a robust retry and rotation strategy.
+- Providing a safe wrapper for streaming responses.
 
 ### Request Lifecycle (`acompletion`)
 
 When `acompletion` is called, it follows these steps:
 
-1. **Model and Provider Validation**: It first checks that a `model` is specified and extracts the provider name from it (e.g., `"gemini"` from `"gemini/gemini-2.5-flash-preview-05-20"`). It ensures that API keys for this provider are available.
-
-2. **Key Selection Loop**: The client enters a loop to find a valid key and complete the request.
-   a. **Get Next Smart Key**: It calls `self.usage_manager.get_next_smart_key()` to get the least-used key for the given model that is not currently on cooldown.
-   b. **No Key Available**: If all keys for the provider are on cooldown, it waits for 5 seconds before restarting the loop.
-
-3. **Attempt Loop**: Once a key is selected, it enters a retry loop (`for attempt in range(self.max_retries)`):
-   a. **API Call**: It calls `litellm.acompletion` with the selected key and the user-provided arguments.
-   b. **Success**:
-      - If the call is successful and **non-streaming**, it calls `self.usage_manager.record_success()`, returns the response, and the process ends.
-      - If the call is successful and **streaming**, it returns a `_streaming_wrapper` async generator. This wrapper formats the response chunks as Server-Sent Events (SSE) and calls `self.usage_manager.record_success()` only when the stream is fully consumed.
-   c. **Failure**: If an exception occurs:
-      - The failure is logged using `log_failure()`.
-      - **Server Error**: If `is_server_error()` returns `True` and there are retries left, it waits for a moment and continues to the next attempt with the *same key*.
-      - **Unrecoverable Error**: If `is_unrecoverable_error()` returns `True`, the exception is immediately raised, terminating the process.
-      - **Other Errors (Rate Limit, Auth, etc.)**: For any other error, it's considered a "rotation" error. `self.usage_manager.record_rotation_error()` is called to put the key on cooldown, and the inner `attempt` loop is broken. The outer `while` loop then continues, fetching a new key.
+1. **Provider and Key Validation**: It extracts the provider from the `model` name and ensures keys are configured for it.
+
+2. **Key Acquisition Loop**: The client enters a loop to find a valid key and complete the request. It iterates through all keys for the provider until one succeeds or all have been tried.
+   a. **Acquire Best Key**: It calls `self.usage_manager.acquire_key()`. This is a blocking call that waits until a suitable key is available, based on the manager's tiered locking strategy (see `UsageManager` section).
+   b. **Prepare Request**: It prepares the `litellm` keyword arguments. This includes:
+      - **Request Sanitization**: Calling `sanitize_request_payload()` to remove parameters that might be unsupported by the target model, preventing errors.
+      - **Provider-Specific Logic**: Applying special handling for providers like Gemini (safety settings), Gemma (system prompts), and Chutes.ai (`api_base` and model name remapping).
+
+3. **Retry Loop**: Once a key is acquired, it enters an inner retry loop (`for attempt in range(self.max_retries)`):
+   a. **API Call**: It calls `litellm.acompletion` with the acquired key.
+   b. **Success (Non-Streaming)**:
+      - It calls `self.usage_manager.record_success()` to update usage stats and clear any cooldowns for the key-model pair.
+      - It calls `self.usage_manager.release_key()` to release the lock on the key for this model.
+      - It returns the response, and the process ends.
+   c. **Success (Streaming)**:
+      - It returns a `_safe_streaming_wrapper` async generator. This wrapper is critical:
+        - It yields SSE-formatted chunks to the consumer.
+        - After the stream is fully consumed, its `finally` block ensures that `record_success()` and `release_key()` are called. This guarantees that the key lock is held for the entire duration of the stream and released correctly, even if the consumer abandons the stream.
+   d. **Failure**: If an exception occurs:
+      - The failure is logged in detail by `log_failure()`.
+      - The exception is passed to `classify_error()` to get a structured `ClassifiedError` object.
+      - **Server Error**: If the error type is `server_error`, it waits with exponential backoff and retries the request with the *same key*.
+      - **Rotation Error (Rate Limit, Auth, etc.)**: For any other error, it's considered a rotation trigger. `self.usage_manager.record_failure()` is called to apply an escalating cooldown, and `self.usage_manager.release_key()` releases the lock. The inner `attempt` loop is broken, and the outer `while` loop continues, acquiring a new key.
 
-## 2. `usage_manager.py` - The `UsageManager`
+## 2. `usage_manager.py` - Stateful Concurrency & Usage Management
 
-This class is responsible for all logic related to tracking and selecting API keys.
+This class is the heart of the library's state management and concurrency control. It is a stateful, async-native service that ensures keys are used efficiently and safely across multiple concurrent requests.
 
-### Key Data Structure
+### Key Concepts
 
-Usage data is stored in a JSON file (e.g., `key_usage.json`). Here's a conceptual view of its structure:
+- **Asynchronous Design & Lazy Loading**: The entire class is asynchronous, using `aiofiles` for non-blocking file I/O and a `_lazy_init` pattern. The usage data from the JSON file is loaded only when the first request is made.
+- **Concurrency Primitives**:
+  - **`filelock`**: A file-level lock (`.json.lock`) prevents race conditions if multiple *processes* are running and sharing the same usage file.
+  - **`asyncio.Lock` & `asyncio.Condition`**: Each key has its own `asyncio.Lock` and `asyncio.Condition` object. This enables the fine-grained, model-aware locking strategy.
+
+### Tiered Key Acquisition (`acquire_key`)
+
+This method implements the core logic for selecting a key. It is a "smart" blocking call.
+
+1. **Filtering**: It first filters out any keys that are on a global or model-specific cooldown.
+2. **Tiering**: It categorizes the remaining, valid keys into two tiers:
+   - **Tier 1 (Ideal)**: Keys that are completely free (not being used by any model).
+   - **Tier 2 (Acceptable)**: Keys that are currently in use, but for *different models* than the one being requested.
+3. **Selection**: It attempts to acquire a lock on a key, prioritizing Tier 1 over Tier 2. Within each tier, it prioritizes the least-used key.
+4. **Waiting**: If no keys in Tier 1 or Tier 2 can be locked, it means all eligible keys are currently handling requests for the *same model*. The method then `await`s on the `asyncio.Condition` of the best available key, waiting until it is notified that the key has been released.
+
+### Failure Handling & Cooldowns (`record_failure`)
+
+- **Escalating Backoff**: When a failure is recorded, it applies a cooldown that increases with the number of consecutive failures for a specific key-model pair (e.g., 10s, 30s, 60s, up to 2 hours).
+- **Authentication Errors**: These are treated more severely, applying an immediate 5-minute key-level lockout.
+- **Key-Level Lockouts**: If a single key accumulates 3 or more long-term (2-hour) cooldowns across different models, the manager assumes the key is compromised or disabled and applies a 5-minute global lockout on the key.
+
+### Data Structure
+
+The `key_usage.json` file has a more complex structure to store this detailed state:
 ```json
 {
-  "api_key_1_hash": {
-    "last_used": "timestamp",
-    "cooldown_until": "timestamp",
-    "global_usage": 150,
-    "daily_usage": {
-      "YYYY-MM-DD": 100
+  "api_key_hash": {
+    "daily": {
+      "date": "YYYY-MM-DD",
+      "models": {
+        "gemini/gemini-1.5-pro": {
+          "success_count": 10,
+          "prompt_tokens": 5000,
+          "completion_tokens": 10000,
+          "approx_cost": 0.075
+        }
+      }
+    },
+    "global": { /* ... similar to daily, but accumulates over time ... */ },
+    "model_cooldowns": {
+      "gemini/gemini-1.5-flash": 1719987600.0
+    },
+    "failures": {
+      "gemini/gemini-1.5-flash": {
+        "consecutive_failures": 2
+      }
     },
-    "model_usage": {
-      "gemini/gemini-2.5-flash-preview-05-20": 50
-    }
+    "key_cooldown_until": null,
+    "last_daily_reset": "YYYY-MM-DD"
   }
 }
 ```
 
-- **Key Hashing**: Keys are stored by their SHA256 hash to avoid exposing sensitive keys in logs or files.
-- `cooldown_until`: If a key fails, this timestamp is set. The key will not be selected until the current time is past this timestamp.
-- `model_usage`: Tracks the usage count for each specific model, which is the primary metric for the "smart" key selection.
-
-### Core Methods
-
-- `get_next_smart_key()`: This is the key selection logic. It filters out any keys that are on cooldown and then finds the key with the lowest usage count for the requested `model`.
-- `record_success()`: Increments the usage counters (`global_usage`, `daily_usage`, `model_usage`) for the given key.
-- `record_rotation_error()`: Sets the `cooldown_until` timestamp for the given key, effectively taking it out of rotation for a short period.
-
 ## 3. `error_handler.py`
 
-This module contains functions to classify exceptions returned by `litellm`.
+This module provides a centralized function, `classify_error`, which is a significant improvement over the previous boolean checks.
 
-- `is_server_error(e)`: Checks if the exception is a transient server-side error (typically a `5xx` status code) that is worth retrying with the same key.
-- `is_unrecoverable_error(e)`: Checks for critical errors (e.g., invalid request parameters) that should immediately stop the process. Any error that is not a server error or an unrecoverable error is treated as a "rotation" error by the client.
+- It takes a raw exception from `litellm` and returns a `ClassifiedError` data object.
+- This object contains the `error_type` (e.g., `'rate_limit'`, `'authentication'`, `'server_error'`), the original exception, the status code, and any `retry_after` information extracted from the error message.
+- This structured classification allows the `RotatingClient` to make more intelligent decisions about whether to retry with the same key or rotate to a new one.
 
-## 4. `failure_logger.py`
+## 4. `request_sanitizer.py` (New Module)
 
-- `log_failure()`: This function logs detailed information about a failed API request to a file in the `logs/` directory. This is crucial for debugging issues with specific keys or providers. The log includes the hashed API key, the model, the error message, and the request data.
+- This module's purpose is to prevent `InvalidRequestError` exceptions from `litellm` that occur when a payload contains parameters not supported by the target model (e.g., sending a `thinking` parameter to a model that doesn't support it).
+- The `sanitize_request_payload` function is called just before `litellm.acompletion` to strip out any such unsupported parameters, making the system more robust.
 
 ## 5. `providers/` - Provider Plugins
 
-The provider plugin system allows for easy extension to support model list fetching from new LLM providers.
-
-- **`provider_interface.py`**: Defines the abstract base class `ProviderPlugin` with a single abstract method, `get_models`. Any new provider plugin must inherit from this class and implement this method.
-- **Implementations**: Each provider (e.g., `openai_provider.py`, `gemini_provider.py`) has its own file containing a class that implements the `ProviderPlugin` interface. The `get_models` method contains the specific logic to call the provider's API and return a list of their available models.
-- **`__init__.py`**: This file contains a dynamic plugin system that automatically discovers and registers any provider implementation placed in the `providers/` directory.
-
-### Special Provider: `chutes.ai`
-
-The `chutes` provider is handled as a special case within the `RotatingClient`. Since `litellm` does not have native support for `chutes.ai`, the client performs the following modifications at runtime:
-
-1. **Sets `api_base`**: It sets the `api_base` to `https://llm.chutes.ai/v1`.
-2. **Remaps the Model**: It changes the model name from `chutes/some-model` to `openai/some-model` before passing the request to `litellm`.
-
-This allows the system to use `chutes.ai` as if it were a custom OpenAI endpoint, while still leveraging the library's key rotation and management features.
+The provider plugin system remains for fetching model lists. The interface now correctly specifies that the `get_models` method receives an `httpx.AsyncClient` instance, which it should use to make its API calls. This ensures all HTTP traffic goes through the client's managed session.
README.md CHANGED
@@ -7,148 +7,179 @@ This project provides a robust solution for managing and rotating API keys for v
 
 ## Features
 
-- **Smart Key Rotation**: Intelligently selects the least-used API key to distribute request loads evenly.
-- **Automatic Retries**: Automatically retries requests on transient server errors (e.g., 5xx status codes).
-- **Per-Model Cooldowns**: If a key fails for a specific model (e.g., due to rate limits), it is only put on cooldown for that model, allowing it to be used with other models.
-- **Usage Tracking**: Monitors daily and global usage for each API key.
+- **Advanced Concurrency Control**: A single API key can handle multiple concurrent requests to different models, maximizing throughput.
+- **Smart Key Rotation**: Intelligently selects the least-used, available API key to distribute request loads evenly.
+- **Escalating Per-Model Cooldowns**: If a key fails for a specific model (e.g., due to rate limits), it's placed on a temporary, escalating cooldown for that model, allowing it to be used with others.
+- **Automatic Retries**: Automatically retries requests on transient server errors (e.g., 5xx status codes) with exponential backoff.
+- **Automatic Daily Resets**: Cooldowns and usage statistics are automatically reset daily, making the system self-maintaining.
+- **Request Logging**: Optional logging of full request and response payloads for easy debugging.
 - **Provider Agnostic**: Compatible with any provider supported by `litellm`.
-- **OpenAI-Compatible Proxy**: Offers a familiar API interface for seamless interaction with different models.
+- **OpenAI-Compatible Proxy**: Offers a familiar API interface with additional endpoints for model and provider discovery.
 
-## How It Works
-
-The core of this project is the `RotatingClient` library, which manages a pool of API keys. When a request is made, the client:
-
-1. **Selects the Best Key**: It identifies the key with the lowest usage count that is not currently in a cooldown period.
-2. **Makes the Request**: It uses the selected key to make the API call via `litellm`.
-3. **Handles Errors**:
-   - If a **retriable error** (like a 500 server error) occurs, it waits and retries the request.
-   - If a **non-retriable error** (like a rate limit or invalid key error) occurs, it places the key on a temporary cooldown and selects a new key for the next attempt.
-4. **Tracks Usage**: On a successful request, it records the usage for the key.
-
-The FastAPI proxy application exposes this functionality through an API endpoint that mimics the OpenAI API, making it easy to integrate with existing tools and applications.
-
-## Project Structure
-
-```
-.
-├── logs/                # Logs for failed requests
-├── src/
-│   ├── proxy_app/       # The FastAPI proxy application
-│   │   └── main.py
-│   └── rotator_library/ # The rotating-api-key-client library
-│       ├── __init__.py
-│       ├── client.py
-│       ├── error_handler.py
-│       ├── failure_logger.py
-│       ├── usage_manager.py
-│       ├── providers/
-│       └── ...
-├── .env.example
-├── README.md
-└── requirements.txt
-```
-
-## Setup and Installation
-
-1. **Clone the repository:**
-   ```bash
-   git clone <repository-url>
-   cd <repository-name>
-   ```
-
-2. **Create a virtual environment:**
-   ```bash
-   python -m venv venv
-   source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
-   ```
-
-3. **Install dependencies:**
-   The `requirements.txt` file includes all necessary packages and installs the `rotator_library` in editable mode (`-e`), allowing for simultaneous development of the library and the proxy.
-   ```bash
-   pip install -r requirements.txt
-   ```
-
-4. **Configure environment variables:**
-   Create a `.env` file by copying the example file:
-   ```bash
-   cp .env.example .env
-   ```
-   Edit the `.env` file to add your API keys. The proxy automatically detects keys for different providers based on the naming convention `PROVIDER_API_KEY_N`.
-
-   ```env
-   # A secret key to protect your proxy from unauthorized access
-   PROXY_API_KEY="your-secret-proxy-key"
-
-   # Add API keys for each provider. They will be rotated automatically.
-   GEMINI_API_KEY_1="your-gemini-api-key-1"
-   GEMINI_API_KEY_2="your-gemini-api-key-2"
-
-   OPENAI_API_KEY_1="your-openai-api-key-1"
-
-   OPENROUTER_API_KEY_1="your-openrouter-api-key-1"
-
-   # chutes.ai is used as a custom OpenAI endpoint
-   CHUTES_API_KEY_1="your-chutes-api-key-1"
-   ```
-
-## Running the Proxy
-
-To start the proxy application, run the following command:
+## Quick Start Guide
+
+This guide will get you up and running in just a few minutes.
+
+### 1. Setup
+
+First, clone the repository and install the required dependencies.
+
+**For Linux/macOS:**
+```bash
+# Clone the repository
+git clone https://github.com/Mirrowel/LLM-API-Key-Proxy.git
+cd LLM-API-Key-Proxy
+
+# Create and activate a virtual environment
+python3 -m venv venv
+source venv/bin/activate
+
+# Install dependencies
+pip install -r requirements.txt
+```
+
+**For Windows:**
+```powershell
+# Clone the repository
+git clone https://github.com/Mirrowel/LLM-API-Key-Proxy.git
+cd LLM-API-Key-Proxy
+
+# Create and activate a virtual environment
+python -m venv venv
+.\venv\Scripts\Activate.ps1
+
+# Install dependencies
+pip install -r requirements.txt
+```
+
+### 2. Configure API Keys
+
+Next, create your `.env` file by copying the provided example. This file is where you will store all your secret keys.
+
+**For Linux/macOS:**
+```bash
+cp .env.example .env
+```
+
+**For Windows:**
+```powershell
+copy .env.example .env
+```
+
+Now, open the new `.env` file and replace the placeholder values with your actual API keys.
+
+**Refer to the `.env.example` file for the correct format and a full list of supported providers.**
+
+The two main types of keys are:
+
+1. **`PROXY_API_KEY`**: This is a secret key *you create*. It is used to authorize requests to *your* proxy, preventing unauthorized use.
+2. **Provider Keys**: These are the API keys you get from LLM providers (like Gemini, OpenAI, etc.). The proxy automatically finds them based on their name (e.g., `GEMINI_API_KEY_1`).
+
+**Example `.env` configuration:**
+```env
+# A secret key for your proxy server to authenticate requests.
+# This can be any secret string you choose.
+PROXY_API_KEY="YOUR_PROXY_API_KEY"
+
+# --- Provider API Keys ---
+# Add your keys from various providers below.
+# You can add multiple keys for one provider by numbering them (e.g., _1, _2).
+
+GEMINI_API_KEY_1="YOUR_GEMINI_API_KEY_1"
+GEMINI_API_KEY_2="YOUR_GEMINI_API_KEY_2"
+
+OPENROUTER_API_KEY_1="YOUR_OPENROUTER_API_KEY_1"
+
+NVIDIA_NIM_API_KEY_1="YOUR_NVIDIA_NIM_API_KEY_1"
+
+CHUTES_API_KEY_1="YOUR_CHUTES_API_KEY_1"
+```
+
+### 3. Run the Proxy
+
+Start the FastAPI server with `uvicorn`. The `--reload` flag will automatically restart the server when you make code changes.
+
 ```bash
 uvicorn src.proxy_app.main:app --reload
 ```
-The proxy will be available at `http://127.0.0.1:8000`.
 
-## Using the Proxy
+The proxy is now running and available at `http://127.0.0.1:8000`.
 
-You can make requests to the proxy as if it were the OpenAI API. Remember to include your `PROXY_API_KEY` in the `Authorization` header.
+### 4. Make a Request
 
-The `model` parameter must be specified in the format `provider/model_name` (e.g., `gemini/gemini-2.5-flash-preview-05-20`, `openai/gpt-4`, `openrouter/google/gemini-flash-1.5`, `chutes/deepseek-ai/DeepSeek-R1-0528`).
+You can now send requests to the proxy. The endpoint is `http://127.0.0.1:8000/v1/chat/completions`.
 
-### Example with `curl` (Non-Streaming):
+Remember to:
+1. Set the `Authorization` header to `Bearer your-super-secret-proxy-key`.
+2. Specify the `model` in the format `provider/model_name`.
+
+Here is an example using `curl`:
 ```bash
 curl -X POST http://127.0.0.1:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
-  -H "Authorization: Bearer your-secret-proxy-key" \
+  -H "Authorization: Bearer your-super-secret-proxy-key" \
   -d '{
     "model": "gemini/gemini-2.5-flash-preview-05-20",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'
 ```
 
-### Example with `curl` (Streaming):
-```bash
-curl -X POST http://127.0.0.1:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -H "Authorization: Bearer your-secret-proxy-key" \
-  -d '{
-    "model": "gemini/gemini-2.5-flash-preview-05-20",
-    "messages": [{"role": "user", "content": "Write a short story about a robot."}],
-    "stream": true
-  }'
-```
-
-### Example with Python `requests`:
+---
+
+## Advanced Usage
+
+### Using with the OpenAI Python Library
+
+The proxy is OpenAI-compatible, so you can use it directly with the `openai` Python client. This is the recommended way to integrate the proxy into your applications.
+
 ```python
-import requests
-import json
-
-proxy_url = "http://127.0.0.1:8000/v1/chat/completions"
-proxy_key = "your-secret-proxy-key"
-
-headers = {
-    "Content-Type": "application/json",
-    "Authorization": f"Bearer {proxy_key}"
-}
-
-data = {
-    "model": "gemini/gemini-2.5-flash-preview-05-20",
-    "messages": [{"role": "user", "content": "What is the capital of France?"}]
-}
-
-response = requests.post(proxy_url, headers=headers, data=json.dumps(data))
-print(response.json())
+import openai
+
+# Point the client to your local proxy
+client = openai.OpenAI(
+    base_url="http://127.0.0.1:8000/v1",
+    api_key="your-super-secret-proxy-key"  # Use your proxy key here
+)
+
+# Make a request
+response = client.chat.completions.create(
+    model="gemini/gemini-2.5-flash-preview-05-20",  # Specify provider and model
+    messages=[
+        {"role": "user", "content": "Write a short poem about space."}
+    ]
+)
+
+print(response.choices[0].message.content)
 ```
+
+### Available API Endpoints
+
+- `POST /v1/chat/completions`: The main endpoint for making chat requests.
+- `GET /v1/models`: Returns a list of all available models from your configured providers.
+- `GET /v1/providers`: Returns a list of all configured providers.
+- `POST /v1/token-count`: Calculates the token count for a given message payload.
+
+### Enabling Request Logging
+
+For debugging purposes, you can log the full request and response for every API call. To enable this, open `src/proxy_app/main.py` and change the following line:
+
+```python
+# Set to True to enable request/response logging
+ENABLE_REQUEST_LOGGING = True
+```
+Logs will be saved in the `logs/` directory.
+
+## How It Works
+
+The core of this project is the `RotatingClient` library, which manages a pool of API keys with a sophisticated concurrency model. When a request is made, the client:
+
+1. **Acquires the Best Key**: It requests the best available key from the `UsageManager`. The manager uses a tiered locking strategy to find a key that is not on cooldown and preferably not in use. If a key is busy with another request for the *same model*, it waits. Otherwise, it allows concurrent use for *different models*.
+2. **Makes the Request**: It uses the acquired key to make the API call via `litellm`.
+3. **Handles Errors**:
+   - It uses a `classify_error` function to determine the failure type.
+   - For **server errors**, it retries the request with the same key using exponential backoff.
+   - For **rate-limit or auth errors**, it records the failure, applies an escalating cooldown for that specific key-model pair, and the client immediately tries the next available key.
+4. **Tracks Usage & Releases Key**: On a successful request, it records usage stats. The key's lock is then released, notifying any waiting requests that it is available.
 
 ## Troubleshooting
 
@@ -156,10 +187,7 @@ print(response.json())
 - **`500 Internal Server Error`**: Check the console logs of the `uvicorn` server for detailed error messages. This could indicate an issue with one of your provider API keys or a problem with the provider's service.
 - **All keys on cooldown**: If you see a message that all keys are on cooldown, it means all your keys for a specific provider have recently failed. Check the `logs/` directory for details on why the failures occurred.
 
-## Using the Library in Other Projects
-
-The `rotating-api-key-client` is a standalone library that can be integrated into any Python project. For detailed documentation on how to use it, please refer to its `README.md` file located at `src/rotator_library/README.md`.
-
-## Detailed Documentation
-
-For a more in-depth technical explanation of the `rotating-api-key-client` library's architecture, components, and internal workings, please refer to the [Technical Documentation](DOCUMENTATION.md).
+## Library and Technical Docs
+
+- **Using the Library**: For documentation on how to use the `rotating-api-key-client` library directly in your own Python projects, please refer to its [README.md](src/rotator_library/README.md).
+- **Technical Details**: For a more in-depth technical explanation of the library's architecture, components, and internal workings, please refer to the [Technical Documentation](DOCUMENTATION.md).
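The retry-versus-rotate decision described in the README's "How It Works" section can be sketched as follows. This is a deliberate simplification: the function names and the backoff base are assumptions for illustration, while the real client uses `classify_error` and the `UsageManager`.

```python
# Illustrative sketch of the retry-vs-rotate routing and exponential backoff
# described above. Names and the backoff base are assumptions, not the real API.
def next_action(status_code: int, attempt: int, max_retries: int = 3) -> str:
    if 500 <= status_code < 600:
        # Transient server error: retry with the SAME key while retries remain.
        return "retry_same_key" if attempt + 1 < max_retries else "rotate_key"
    # Rate limits (429), auth failures (401/403), and anything else rotate.
    return "rotate_key"

def backoff_delay(attempt: int, base: float = 1.0) -> float:
    # Exponential backoff between retries: 1s, 2s, 4s, ...
    return base * (2 ** attempt)
```

The key point is asymmetry: only 5xx responses are worth spending retries on the same key; everything else is cheaper to route to a different key immediately.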
src/rotator_library/README.md CHANGED
@@ -1,13 +1,16 @@
  # Rotating API Key Client
 
- A simple, thread-safe client that intelligently rotates and retries API keys for use with `litellm`. This library is designed to make your interactions with LLM providers more resilient and efficient.
+ A robust, asynchronous, and thread-safe client that intelligently rotates and retries API keys for use with `litellm`. This library is designed to make your interactions with LLM providers more resilient, concurrent, and efficient.
 
  ## Features
 
- - **Smart Key Rotation**: Automatically uses the least-used key to distribute load.
- - **Automatic Retries**: Retries requests on transient server errors.
- - **Per-Model Cooldowns**: If a key fails for a specific model (e.g., due to rate limits), it is only put on cooldown for that model, allowing it to be used with other models.
- - **Usage Tracking**: Tracks daily and global usage for each key.
+ - **Asynchronous by Design**: Built with `asyncio` and `httpx` for high-performance, non-blocking I/O.
+ - **Advanced Concurrency Control**: A single key can be used for multiple concurrent requests to *different* models, maximizing throughput while ensuring thread safety.
+ - **Smart Key Rotation**: Acquires the least-used, available key using a tiered, model-aware locking strategy.
+ - **Escalating Per-Model Cooldowns**: If a key fails, it's placed on a temporary, escalating cooldown for that specific model.
+ - **Automatic Retries**: Retries requests on transient server errors with exponential backoff.
+ - **Detailed Usage Tracking**: Tracks daily and global usage for each key, including token counts and approximate cost.
+ - **Automatic Daily Resets**: Automatically resets cooldowns and archives stats daily.
  - **Provider Agnostic**: Works with any provider supported by `litellm`.
  - **Extensible**: Easily add support for new providers through a plugin-based architecture.
 
@@ -22,7 +25,7 @@ pip install -e .
 
  ## `RotatingClient` Class
 
- This is the main class for interacting with the library.
+ This is the main class for interacting with the library. It is designed to be a long-lived object that manages its own HTTP client and key usage state.
 
  ### Initialization
 
@@ -40,16 +43,33 @@ client = RotatingClient(
  - `max_retries`: The number of times to retry a request with the *same key* if a transient server error occurs.
  - `usage_file_path`: The path to the JSON file where key usage data will be stored.
 
+ ### Concurrency and Resource Management
+
+ The `RotatingClient` is asynchronous and manages an `httpx.AsyncClient` internally. It's crucial to close the client properly to release resources. This can be done manually or by using an `async with` block.
+
+ **Manual Management:**
+ ```python
+ client = RotatingClient(api_keys=api_keys)
+ # ... use the client ...
+ await client.close()
+ ```
+
+ **Recommended (`async with`):**
+ ```python
+ async with RotatingClient(api_keys=api_keys) as client:
+     # ... use the client ...
+ ```
+
  ### Methods
 
  #### `async def acompletion(self, **kwargs) -> Any:`
 
- This is the primary method for making API calls. It's a wrapper around `litellm.acompletion` that adds key rotation and retry logic.
+ This is the primary method for making API calls. It's a wrapper around `litellm.acompletion` that adds the core logic for key acquisition, rotation, and retries.
 
- - **Parameters**: Accepts the same keyword arguments as `litellm.acompletion` (e.g., `messages`, `stream`). The `model` parameter is required and must be a string in the format `provider/model_name` (e.g., `"gemini/gemini-2.5-flash-preview-05-20"`, `"openrouter/google/gemini-flash-1.5"`, `"chutes/deepseek-ai/DeepSeek-R1-0528"`).
+ - **Parameters**: Accepts the same keyword arguments as `litellm.acompletion`. The `model` parameter is required and must be a string in the format `provider/model_name`.
  - **Returns**:
    - For non-streaming requests, it returns the `litellm` response object.
-   - For streaming requests, it returns an async generator that yields OpenAI-compatible Server-Sent Events (SSE).
+   - For streaming requests, it returns an async generator that yields OpenAI-compatible Server-Sent Events (SSE). The wrapper ensures that key locks are released and usage is recorded only after the stream is fully consumed.
 
  **Example:**
 
@@ -59,13 +79,12 @@ from rotating_api_key_client import RotatingClient
 
  async def main():
      api_keys = {"gemini": ["key1", "key2"]}
-     client = RotatingClient(api_keys=api_keys)
-
-     response = await client.acompletion(
-         model="gemini/gemini-2.5-flash-preview-05-20",
-         messages=[{"role": "user", "content": "Hello!"}]
-     )
-     print(response)
+     async with RotatingClient(api_keys=api_keys) as client:
+         response = await client.acompletion(
+             model="gemini/gemini-2.5-flash-preview-05-20",
+             messages=[{"role": "user", "content": "Hello!"}]
+         )
+         print(response)
 
  asyncio.run(main())
  ```
@@ -73,61 +92,47 @@ asyncio.run(main())
  #### `def token_count(self, model: str, text: str = None, messages: List[Dict[str, str]] = None) -> int:`
 
  Calculates the token count for a given text or list of messages using `litellm.token_counter`.
- The `model` parameter is required and must be a string in the format `provider/model_name` (e.g., `"gemini/gemini-2.5-flash-preview-05-20"`).
- **Example:**
-
- ```python
- count = client.token_count(
-     model="gemini/gemini-2.5-flash-preview-05-20",
-     messages=[{"role": "user", "content": "Count these tokens."}]
- )
- print(f"Token count: {count}")
- ```
 
  #### `async def get_available_models(self, provider: str) -> List[str]:`
 
- Fetches a list of available models for a specific provider. Results are cached.
+ Fetches a list of available models for a specific provider. Results are cached in memory.
 
- #### `async def get_all_available_models(self) -> Dict[str, List[str]]:`
+ #### `async def get_all_available_models(self, grouped: bool = True) -> Union[Dict[str, List[str]], List[str]]:`
 
- Fetches a dictionary of all available models, grouped by provider.
+ Fetches a dictionary of all available models, grouped by provider, or as a single flat list if `grouped=False`.
 
  ## Error Handling and Cooldowns
 
- The client is designed to handle errors gracefully:
-
- - **Server Errors (`5xx`)**: The client will retry the request with the *same key* up to `max_retries` times.
- - **Rate Limit / Auth Errors**: These are considered "rotation" errors. The client will immediately place the failing key on a temporary cooldown for that specific model and retry the request with a different key. This ensures that a single model failure does not sideline a key for all other models.
- - **Unrecoverable Errors**: For critical errors, the client will fail fast and raise the exception.
-
- Cooldowns are managed by the `UsageManager` on a per-model basis, preventing failing keys from being used repeatedly for models they have recently failed with. Upon a successful call, any existing cooldown for that key-model pair is cleared.
+ The client uses a sophisticated error handling mechanism:
+
+ - **Error Classification**: All exceptions from `litellm` are passed through a `classify_error` function to determine their type (`rate_limit`, `authentication`, `server_error`, etc.).
+ - **Server Errors**: The client will retry the request with the *same key* up to `max_retries` times, using an exponential backoff strategy.
+ - **Rotation Errors (Rate Limit, Auth, etc.)**: The client records the failure in the `UsageManager`, which applies an escalating cooldown to the key for that specific model. The client then immediately acquires a new key and continues its attempt to complete the request.
+ - **Key-Level Lockouts**: If a key fails on multiple different models, the `UsageManager` can apply a key-level lockout, taking it out of rotation entirely for a short period.
 
  ## Extending with Provider Plugins
 
- The library uses a dynamic plugin system. To add support for a new provider, you only need to do two things:
+ The library uses a dynamic plugin system. To add support for a new provider's model list, you only need to:
 
- 1. **Create a new provider file** in `src/rotator_library/providers/` (e.g., `my_provider.py`). The name of the file (without `_provider.py`) will be used as the provider name (e.g., `my_provider`).
+ 1. **Create a new provider file** in `src/rotator_library/providers/` (e.g., `my_provider.py`).
  2. **Implement the `ProviderInterface`**: Inside your new file, create a class that inherits from `ProviderInterface` and implements the `get_models` method.
 
  ```python
  # src/rotator_library/providers/my_provider.py
  from .provider_interface import ProviderInterface
  from typing import List
+ import httpx
 
  class MyProvider(ProviderInterface):
-     async def get_models(self, api_key: str) -> List[str]:
+     async def get_models(self, api_key: str, http_client: httpx.AsyncClient) -> List[str]:
          # Logic to fetch and return a list of model names
          # The model names should be prefixed with the provider name.
          # e.g., ["my-provider/model-1", "my-provider/model-2"]
          pass
  ```
 
- The system will automatically discover and register your new provider when the library is imported.
-
- ### Special Case: `chutes.ai`
-
- The `chutes` provider is handled as a special case. Since `litellm` does not support it directly, the `RotatingClient` modifies the request by setting the `api_base` to `https://llm.chutes.ai/v1` and remapping the model from `chutes/model-name` to `openai/model-name`. This allows `chutes.ai` to be used as a custom OpenAI-compatible endpoint.
+ The system will automatically discover and register your new provider.
 
  ## Detailed Documentation
 
- For a more in-depth technical explanation of the `rotating-api-key-client` library's architecture, components, and internal workings, please refer to the [Technical Documentation](../../DOCUMENTATION.md).
+ For a more in-depth technical explanation of the library's architecture, including the `UsageManager`'s concurrency model and the error classification system, please refer to the [Technical Documentation](../../DOCUMENTATION.md).
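The retry policy described in the error-handling section above (same-key exponential backoff for server errors, immediate rotation for rate-limit/auth errors) can be sketched as follows. This is a hypothetical, simplified illustration: `classify_error` and `call_with_retries` here are stand-ins, not the library's actual implementations.

```python
import asyncio

# Sketch of the retry policy: transient server errors are retried with the
# *same* key using exponential backoff, while rotation errors (rate limit,
# auth) are re-raised so the caller can switch to another key.

def classify_error(exc: Exception) -> str:
    # Simplified stand-in for the library's error classifier.
    text = str(exc).lower()
    if "rate" in text or "auth" in text:
        return "rotation"
    return "server_error"

async def call_with_retries(call, max_retries=3, base_delay=0.01):
    for attempt in range(max_retries):
        try:
            return await call()
        except Exception as exc:
            if classify_error(exc) == "rotation" or attempt == max_retries - 1:
                raise  # caller rotates to the next key, or gives up
            await asyncio.sleep(base_delay * (2 ** attempt))  # backoff doubles

attempts = []

async def flaky():
    # Fails twice with a transient server error, then succeeds.
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("503 server error")
    return "ok"

result = asyncio.run(call_with_retries(flaky))
print(result, len(attempts))
```

The key design point mirrored here is that only transient server errors burn retries on the same key; anything classified as a rotation error escapes the loop immediately so the key pool, not the backoff timer, absorbs the failure.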