Mirrowel committed
Commit 7ba1fcd · 1 Parent(s): d195a5f

docs: Big documentation update part

Files changed (3):
  1. DOCUMENTATION.md +93 -58
  2. README.md +72 -38
  3. src/rotator_library/README.md +51 -35
DOCUMENTATION.md CHANGED
@@ -1,73 +1,85 @@
- # Technical Documentation: `rotating-api-key-client`

- This document provides a detailed technical explanation of the `rotating-api-key-client` library, its components, and its internal workings. The library has evolved into a sophisticated, asynchronous client for managing LLM API keys with a strong focus on concurrency, resilience, and state management.

- ## 1. `client.py` - The `RotatingClient`

- The `RotatingClient` is the central component, orchestrating API calls, key management, and error handling. It is designed as a long-lived, async-native object.

- ### Core Responsibilities
- - Managing an `httpx.AsyncClient` for non-blocking HTTP requests.
- - Interfacing with the `UsageManager` to acquire and release API keys.
- - Handling provider-specific request modifications.
- - Executing API calls via `litellm` with a robust retry and rotation strategy.
- - Providing a safe wrapper for streaming responses.

- ### Request Lifecycle (`acompletion`)

- When `acompletion` is called, it follows these steps:

- 1. **Provider and Key Validation**: It extracts the provider from the `model` name and ensures keys are configured for it.

- 2. **Key Acquisition Loop**: The client enters a loop to find a valid key and complete the request. It iterates through all keys for the provider until one succeeds or all have been tried.
- a. **Acquire Best Key**: It calls `self.usage_manager.acquire_key()`. This is a blocking call that waits until a suitable key is available, based on the manager's tiered locking strategy (see `UsageManager` section).
- b. **Prepare Request**: It prepares the `litellm` keyword arguments. This includes:
- - **Request Sanitization**: Calling `sanitize_request_payload()` to remove parameters that might be unsupported by the target model, preventing errors.
- - **Provider-Specific Logic**: Applying special handling for providers like Gemini (safety settings), Gemma (system prompts), and Chutes.ai (`api_base` and model name remapping).

- 3. **Retry Loop**: Once a key is acquired, it enters an inner retry loop (`for attempt in range(self.max_retries)`):
- a. **API Call**: It calls `litellm.acompletion` with the acquired key.
  b. **Success (Non-Streaming)**:
- - It calls `self.usage_manager.record_success()` to update usage stats and clear any cooldowns for the key-model pair.
- - It calls `self.usage_manager.release_key()` to release the lock on the key for this model.
  - It returns the response, and the process ends.
  c. **Success (Streaming)**:
- - It returns a `_safe_streaming_wrapper` async generator. This wrapper is critical:
  - It yields SSE-formatted chunks to the consumer.
- - After the stream is fully consumed, its `finally` block ensures that `record_success()` and `release_key()` are called. This guarantees that the key lock is held for the entire duration of the stream and released correctly, even if the consumer abandons the stream.
  d. **Failure**: If an exception occurs:
- - The failure is logged in detail by `log_failure()`.
  - The exception is passed to `classify_error()` to get a structured `ClassifiedError` object.
- - **Server Error**: If the error type is `server_error`, it waits with exponential backoff and retries the request with the *same key*.
- - **Rotation Error (Rate Limit, Auth, etc.)**: For any other error, it's considered a rotation trigger. `self.usage_manager.record_failure()` is called to apply an escalating cooldown, and `self.usage_manager.release_key()` releases the lock. The inner `attempt` loop is broken, and the outer `while` loop continues, acquiring a new key.

- ## 2. `usage_manager.py` - Stateful Concurrency & Usage Management

- This class is the heart of the library's state management and concurrency control. It is a stateful, async-native service that ensures keys are used efficiently and safely across multiple concurrent requests.

- ### Key Concepts

- - **Asynchronous Design & Lazy Loading**: The entire class is asynchronous, using `aiofiles` for non-blocking file I/O and a `_lazy_init` pattern. The usage data from the JSON file is loaded only when the first request is made.
- - **Concurrency Primitives**:
- - **`filelock`**: A file-level lock (`.json.lock`) prevents race conditions if multiple *processes* are running and sharing the same usage file.
- - **`asyncio.Lock` & `asyncio.Condition`**: Each key has its own `asyncio.Lock` and `asyncio.Condition` object. This enables the fine-grained, model-aware locking strategy.

- ### Tiered Key Acquisition (`acquire_key`)

- This method implements the core logic for selecting a key. It is a "smart" blocking call.

  1. **Filtering**: It first filters out any keys that are on a global or model-specific cooldown.
  2. **Tiering**: It categorizes the remaining, valid keys into two tiers:
  - **Tier 1 (Ideal)**: Keys that are completely free (not being used by any model).
- - **Tier 2 (Acceptable)**: Keys that are currently in use, but for *different models* than the one being requested.
- 3. **Selection**: It attempts to acquire a lock on a key, prioritizing Tier 1 over Tier 2. Within each tier, it prioritizes the least-used key.
- 4. **Waiting**: If no keys in Tier 1 or Tier 2 can be locked, it means all eligible keys are currently handling requests for the *same model*. The method then `await`s on the `asyncio.Condition` of the best available key, waiting until it is notified that the key has been released.

- ### Failure Handling & Cooldowns (`record_failure`)

- - **Escalating Backoff**: When a failure is recorded, it applies a cooldown that increases with the number of consecutive failures for a specific key-model pair (e.g., 10s, 30s, 60s, up to 2 hours).
- - **Authentication Errors**: These are treated more severely, applying an immediate 5-minute key-level lockout.
- - **Key-Level Lockouts**: If a single key accumulates 3 or more long-term (2-hour) cooldowns across different models, the manager assumes the key is compromised or disabled and applies a 5-minute global lockout on the key.

  ### Data Structure
@@ -103,29 +115,52 @@ The `key_usage.json` file has a more complex structure to store this detailed st

  ## 3. `error_handler.py`

- This module provides a centralized function, `classify_error`, which is a significant improvement over the previous boolean checks.

- - It takes a raw exception from `litellm` and returns a `ClassifiedError` data object.
- - This object contains the `error_type` (e.g., `'rate_limit'`, `'authentication'`, `'server_error'`), the original exception, the status code, and any `retry_after` information extracted from the error message.
- - This structured classification allows the `RotatingClient` to make more intelligent decisions about whether to retry with the same key or rotate to a new one.

- ## 4. `request_sanitizer.py` (New Module)

- - This module's purpose is to prevent `InvalidRequestError` exceptions from `litellm` that occur when a payload contains parameters not supported by the target model (e.g., sending a `thinking` parameter to a model that doesn't support it).
- - The `sanitize_request_payload` function is called just before `litellm.acompletion` to strip out any such unsupported parameters, making the system more robust.

- ## 5. `providers/` - Provider Plugins

- The provider plugin system remains for fetching model lists. The interface now correctly specifies that the `get_models` method receives an `httpx.AsyncClient` instance, which it should use to make its API calls. This ensures all HTTP traffic goes through the client's managed session.

- ## 6. `proxy_app/` - The Proxy Application

- The `proxy_app` directory contains the FastAPI application that serves the rotating client.

- ### `main.py` - The FastAPI App

- This file contains the FastAPI application that exposes the `RotatingClient` through an OpenAI-compatible API.

- #### Command-Line Arguments

- - `--enable-request-logging`: This flag enables logging of all incoming requests and outgoing responses to the `logs/` directory. This is useful for debugging and monitoring the proxy's activity. By default, this is disabled.

+ # Technical Documentation: API Key Proxy & Rotator Library

+ This document provides a detailed technical explanation of the API Key Proxy and the `rotating-api-key-client` library, covering their architecture, components, and internal workings.

+ ## 1. Architecture Overview

+ The project is a monorepo containing two primary components:

+ 1. **`rotator_library`**: A standalone, reusable Python library for intelligent API key rotation and management.
+ 2. **`proxy_app`**: A FastAPI application that consumes the `rotator_library` and exposes its functionality through an OpenAI-compatible web API.

+ This architecture separates the core rotation logic from the web-serving layer, making the library portable and the proxy a clean implementation of its features.

+ ---

+ ## 2. `rotator_library` - The Core Engine

+ This library is the heart of the project, containing all the logic for key rotation, usage tracking, and provider management.

+ ### 2.1. `client.py` - The `RotatingClient`
+
+ The `RotatingClient` is the central class that orchestrates all operations. It is designed as a long-lived, async-native object.
+
+ #### Core Responsibilities
+
+ * Managing a shared `httpx.AsyncClient` for all non-blocking HTTP requests.
+ * Interfacing with the `UsageManager` to acquire and release API keys.
+ * Dynamically loading and using provider-specific plugins from the `providers/` directory.
+ * Executing API calls via `litellm` with a robust retry and rotation strategy.
+ * Providing a safe, stateful wrapper for handling streaming responses.
+
+ #### Request Lifecycle (`acompletion` & `aembedding`)
+
+ When `acompletion` or `aembedding` is called, it follows a sophisticated, multi-layered process:
+
+ 1. **Provider & Key Validation**: It extracts the provider from the `model` name (e.g., `"gemini/gemini-1.5-pro"` -> `"gemini"`) and ensures keys are configured for it.
+
+ 2. **Key Acquisition Loop**: The client enters a `while` loop that attempts to find a valid key and complete the request. It iterates until one key succeeds or all have been tried.
+ a. **Acquire Best Key**: It calls `self.usage_manager.acquire_key()`. This is a crucial, potentially blocking call that waits until a suitable key is available, based on the manager's tiered locking strategy (see `UsageManager` section).
+ b. **Prepare Request**: It prepares the `litellm` keyword arguments. This includes applying provider-specific logic (e.g., remapping safety settings for Gemini, handling `api_base` for Chutes.ai) and sanitizing the payload to remove unsupported parameters.
+
+ 3. **Retry Loop**: Once a key is acquired, it enters an inner `for` loop (`for attempt in range(self.max_retries)`):
+ a. **API Call**: It calls `litellm.acompletion` or `litellm.aembedding`.
  b. **Success (Non-Streaming)**:
+ - It calls `self.usage_manager.record_success()` to update usage stats and clear any cooldowns.
+ - It calls `self.usage_manager.release_key()` to release the lock.
  - It returns the response, and the process ends.
  c. **Success (Streaming)**:
+ - It returns the `_safe_streaming_wrapper` async generator. This wrapper is critical:
  - It yields SSE-formatted chunks to the consumer.
+ - It can reassemble fragmented JSON chunks and detect errors mid-stream.
+ - Its `finally` block ensures that `record_success()` and `release_key()` are called *only after the stream is fully consumed or closed*. This guarantees the key lock is held for the entire duration of the stream.
  d. **Failure**: If an exception occurs:
  - The exception is passed to `classify_error()` to get a structured `ClassifiedError` object.
+ - **Server Error**: If the error is temporary (e.g., 5xx), it waits with exponential backoff and retries the request with the *same key*.
+ - **Rotation Error (Rate Limit, Auth, etc.)**: For any other error, it's a trigger to rotate. `self.usage_manager.record_failure()` is called to apply a cooldown, and the lock is released. The inner `attempt` loop is broken, and the outer `while` loop continues, acquiring a new key.
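The two nested loops described above can be sketched as follows. This is an illustrative, self-contained sketch, not the library's actual code: `FakeUsageManager`, `flaky_api`, and the exception classes are stand-ins invented for the demo, while the method names mirror those in the documentation.

```python
import asyncio

class ServerError(Exception): pass
class RateLimitError(Exception): pass

class FakeUsageManager:
    """Stand-in for the real UsageManager, just enough to drive the sketch."""
    def __init__(self, keys):
        self.keys, self.cooldown, self.success = list(keys), set(), []
    async def acquire_key(self, provider, model):
        free = [k for k in self.keys if k not in self.cooldown]
        return free[0] if free else None
    async def record_success(self, key, model):
        self.success.append(key)
    async def record_failure(self, key, model, error_type):
        self.cooldown.add(key)
    async def release_key(self, key, model):
        pass

async def acompletion_sketch(mgr, call_api, provider, model, max_retries=3):
    while True:  # outer loop: rotate across keys
        key = await mgr.acquire_key(provider, model)
        if key is None:
            raise RuntimeError("all keys failed or are on cooldown")
        try:
            for attempt in range(max_retries):  # inner loop: same key
                try:
                    response = await call_api(model=model, api_key=key)
                    await mgr.record_success(key, model)
                    return response
                except ServerError:
                    # temporary 5xx: exponential backoff, retry the SAME key
                    await asyncio.sleep(0.01 * 2 ** attempt)
            await mgr.record_failure(key, model, "server_error")
        except RateLimitError:
            # rotation trigger: cool this key down and move to the next one
            await mgr.record_failure(key, model, "rate_limit")
        finally:
            await mgr.release_key(key, model)

async def flaky_api(model, api_key):
    if api_key == "bad-key":
        raise RateLimitError("rate limited")
    return {"key_used": api_key}

mgr = FakeUsageManager(["bad-key", "good-key"])
result = asyncio.run(acompletion_sketch(mgr, flaky_api, "gemini", "gemini-1.5-pro"))
print(result["key_used"])  # good-key
```

The rate-limited key is cooled down and released, and the outer loop transparently retries with the next available key.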

+ ### 2.2. `usage_manager.py` - Stateful Concurrency & Usage Management

+ This class is the stateful core of the library, managing concurrency, usage, and cooldowns.

+ #### Key Concepts

+ * **Async-Native & Lazy-Loaded**: The class is fully asynchronous, using `aiofiles` for non-blocking file I/O. The usage data from the JSON file is loaded only when the first request is made (`_lazy_init`).
+ * **Fine-Grained Locking**: Each API key is associated with its own `asyncio.Lock` and `asyncio.Condition` object. This allows for a highly granular and efficient locking strategy.
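The lazy-load pattern can be illustrated with a minimal sketch. This is not the library's actual class; it uses plain stdlib file I/O where the real manager uses `aiofiles` (plus a `filelock` for cross-process safety), but it shows the key property: concurrent callers trigger at most one read of the JSON state.

```python
import asyncio
import json
import tempfile
from pathlib import Path

class LazyUsageStore:
    """Lazy-load sketch: state is read from disk at most once, on first
    use, guarded by an asyncio.Lock so concurrent callers don't race."""
    def __init__(self, path):
        self.path = Path(path)
        self._data = None
        self._init_lock = asyncio.Lock()
        self.loads = 0  # counts how many times the file was actually read

    async def _lazy_init(self):
        async with self._init_lock:
            if self._data is None:  # only the first caller performs the load
                self._data = json.loads(self.path.read_text())
                self.loads += 1

    async def get(self, key):
        await self._lazy_init()
        return self._data.get(key)

async def demo():
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump({"KEY_1": {"requests": 42}}, f)
    store = LazyUsageStore(f.name)
    # Five concurrent readers; the file is still loaded exactly once.
    results = await asyncio.gather(*(store.get("KEY_1") for _ in range(5)))
    Path(f.name).unlink()
    return store.loads, results[0]

loads, usage = asyncio.run(demo())
print(loads)  # 1
```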
 
 

+ #### Tiered Key Acquisition (`acquire_key`)

+ This method implements the intelligent logic for selecting the best key for a job.

  1. **Filtering**: It first filters out any keys that are on a global or model-specific cooldown.
  2. **Tiering**: It categorizes the remaining, valid keys into two tiers:
  - **Tier 1 (Ideal)**: Keys that are completely free (not being used by any model).
+ - **Tier 2 (Acceptable)**: Keys that are currently in use, but for *different models* than the one being requested. This allows a single key to be used for concurrent calls to, for example, `gemini-1.5-pro` and `gemini-1.5-flash`.
+ 3. **Selection**: It attempts to acquire a lock on a key, prioritizing Tier 1 over Tier 2. Within each tier, it prioritizes the key with the lowest usage count.
+ 4. **Waiting**: If no keys in Tier 1 or Tier 2 can be locked, it means all eligible keys are currently handling requests for the *same model*. The method then `await`s on the `asyncio.Condition` of the best available key, waiting efficiently until it is notified that a key has been released.
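The filtering, tiering, and selection steps can be sketched as a pure function. The data structures here (`in_use` as a key-to-models map, `usage` as a counter) are illustrative assumptions, not the real `UsageManager` internals:

```python
def pick_key(keys, cooldowns, in_use, model, usage):
    """Tiered selection sketch: Tier 1 = completely free keys, Tier 2 =
    keys busy only with *other* models; the least-used key wins a tier."""
    eligible = [k for k in keys if k not in cooldowns]
    tier1 = [k for k in eligible if not in_use.get(k)]
    tier2 = [k for k in eligible if in_use.get(k) and model not in in_use[k]]
    for tier in (tier1, tier2):
        if tier:
            return min(tier, key=lambda k: usage.get(k, 0))
    return None  # caller would then wait on the best key's asyncio.Condition

keys = ["k1", "k2", "k3"]
in_use = {"k1": {"gemini-1.5-pro"}, "k2": set(), "k3": {"gemini-1.5-flash"}}
usage = {"k1": 5, "k2": 9, "k3": 2}

print(pick_key(keys, set(), in_use, "gemini-1.5-pro", usage))   # k2 (Tier 1)
print(pick_key(keys, {"k2"}, in_use, "gemini-1.5-pro", usage))  # k3 (Tier 2)
```

With `k2` free it wins despite its higher usage count (Tier 1 beats Tier 2); once `k2` is on cooldown, `k3` qualifies because it is only busy with a different model.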

+ #### Failure Handling & Cooldowns (`record_failure`)

+ * **Escalating Backoff**: When a failure is recorded, it applies a cooldown that increases with the number of consecutive failures for that specific key-model pair (e.g., 10s, 30s, 60s, up to 2 hours).
+ * **Authentication Errors**: These are treated more severely, applying an immediate 5-minute key-level lockout.
+ * **Key-Level Lockouts**: If a single key accumulates 3 or more long-term (2-hour) cooldowns across different models, the manager assumes the key is compromised or disabled and applies a 5-minute global lockout on the key.
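An escalating schedule like the one described can be sketched in a few lines. The documented steps are 10s, 30s, 60s, capped at 2 hours; the intermediate values below are assumptions for illustration:

```python
def cooldown_seconds(consecutive_failures):
    """Escalating cooldown sketch: later entries in the schedule apply to
    later consecutive failures, capped at the final 2-hour value."""
    schedule = [10, 30, 60, 300, 1800, 7200]  # middle steps are assumed
    idx = min(max(consecutive_failures - 1, 0), len(schedule) - 1)
    return schedule[idx]

print(cooldown_seconds(1), cooldown_seconds(3), cooldown_seconds(99))  # 10 60 7200
```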

  ### Data Structure

  ### 2.3. `error_handler.py`

+ This module provides a centralized function, `classify_error`, which is a significant improvement over simple boolean checks.
+
+ * It takes a raw exception from `litellm` and returns a `ClassifiedError` data object.
+ * This object contains the `error_type` (e.g., `'rate_limit'`, `'authentication'`), the original exception, the status code, and any `retry_after` information extracted from the error message.
+ * This structured classification allows the `RotatingClient` to make more intelligent decisions about whether to retry with the same key or rotate to a new one.
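The shape of such a classifier can be sketched as below. This is a status-code based approximation; the real function also inspects `litellm`'s exception class hierarchy, and the regex for `retry_after` is an assumption:

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClassifiedError:
    error_type: str
    status_code: Optional[int]
    retry_after: Optional[float]
    original: Exception

def classify_error(exc, status_code=None):
    """Map a raw exception to a structured error the caller can act on."""
    retry = None
    # Try to pull a "retry in Ns" hint out of the provider's message.
    m = re.search(r"retry.{0,10}?(\d+(?:\.\d+)?)\s*s", str(exc), re.IGNORECASE)
    if m:
        retry = float(m.group(1))
    if status_code == 429:
        error_type = "rate_limit"
    elif status_code in (401, 403):
        error_type = "authentication"
    elif status_code is not None and status_code >= 500:
        error_type = "server_error"
    else:
        error_type = "unknown"
    return ClassifiedError(error_type, status_code, retry, exc)

err = classify_error(Exception("Rate limited. Please retry in 30s"), status_code=429)
print(err.error_type, err.retry_after)  # rate_limit 30.0
```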
+
+ ### 2.4. `providers/` - Provider Plugins
+
+ The provider plugin system allows for easy extension. The `__init__.py` file in this directory dynamically scans for all modules ending in `_provider.py`, imports the provider class from each, and registers it in the `PROVIDER_PLUGINS` dictionary. This makes adding new providers as simple as dropping a new file into the directory.
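A convention-based scan of this kind can be sketched with `importlib`. This is illustrative (the real `__init__.py` registers provider *classes* into `PROVIDER_PLUGINS`, and `gemini_provider.py` here is a throwaway file created for the demo):

```python
import importlib.util
import sys
import tempfile
from pathlib import Path

def discover_providers(directory):
    """Import every `*_provider.py` in `directory` and register it under
    the provider name taken from the filename prefix."""
    plugins = {}
    for path in sorted(Path(directory).glob("*_provider.py")):
        name = path.stem.removesuffix("_provider")  # gemini_provider -> gemini
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        sys.modules[path.stem] = module
        spec.loader.exec_module(module)
        plugins[name] = module
    return plugins

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "gemini_provider.py").write_text("MODELS = ['gemini-1.5-pro']\n")
    Path(tmp, "notes.py").write_text("# ignored: wrong filename suffix\n")
    plugins = discover_providers(tmp)
    provider_names = list(plugins)
    models = plugins["gemini"].MODELS

print(provider_names)  # ['gemini']
```

Files that do not match the `*_provider.py` convention are skipped, which is what makes "drop a file in the directory" sufficient to add a provider.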
+
+ ---
+
+ ## 3. `proxy_app` - The FastAPI Proxy
+
+ The `proxy_app` directory contains the FastAPI application that serves the `rotator_library`.
+
+ ### 3.1. `main.py` - The FastAPI App
+
+ This file defines the web server and its endpoints.
+
+ #### Lifespan Management

+ The application uses FastAPI's `lifespan` context manager to manage the `RotatingClient` instance. The client is initialized when the application starts and gracefully closed (releasing its `httpx` resources) when the application shuts down. This ensures that a single, stateful client instance is shared across all requests.
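A FastAPI lifespan is just an async context manager passed as `FastAPI(lifespan=...)`. The pattern can be shown without FastAPI itself; `DummyRotatingClient` and the `state`/`events` bookkeeping are stand-ins for the real client and app state:

```python
import asyncio
from contextlib import asynccontextmanager

class DummyRotatingClient:
    """Stand-in for the real RotatingClient (illustrative only)."""
    def __init__(self):
        self.closed = False
    async def aclose(self):
        self.closed = True

state = {}
events = []

@asynccontextmanager
async def lifespan(app):
    # Startup: create the single shared client instance.
    state["client"] = DummyRotatingClient()
    events.append("startup")
    try:
        yield
    finally:
        # Shutdown: gracefully release the client's resources.
        await state["client"].aclose()
        events.append("shutdown")

async def serve_once():
    # FastAPI would drive this context manager around the app's lifetime;
    # here we simulate one serving period.
    async with lifespan(app=None):
        events.append("serving")

asyncio.run(serve_once())
print(events)  # ['startup', 'serving', 'shutdown']
```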
 
 

+ #### Endpoints

+ * `POST /v1/chat/completions`: The main endpoint for chat requests.
+ * `POST /v1/embeddings`: The endpoint for creating embeddings.
+ * `GET /v1/models`: Returns a list of all available models from configured providers.
+ * `GET /v1/providers`: Returns a list of all configured providers.
+ * `POST /v1/token-count`: Calculates the token count for a given message payload.

+ #### Authentication

+ All endpoints are protected by the `verify_api_key` dependency, which checks for a valid `Authorization: Bearer <PROXY_API_KEY>` header.
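The essence of the bearer-token check can be sketched as a plain function. This is not the actual dependency: the real one is a FastAPI dependency that raises `HTTPException(401)` rather than returning a boolean, and the key value here is the placeholder from the `.env` example:

```python
import hmac

PROXY_API_KEY = "a-very-secret-and-unique-key"  # would be read from .env

def verify_api_key(authorization_header):
    """Return True only for a well-formed 'Bearer <PROXY_API_KEY>' header."""
    scheme, _, token = (authorization_header or "").partition(" ")
    if scheme != "Bearer" or not token:
        return False
    # hmac.compare_digest gives a constant-time comparison
    return hmac.compare_digest(token, PROXY_API_KEY)

print(verify_api_key("Bearer a-very-secret-and-unique-key"))  # True
print(verify_api_key("Bearer wrong-key"))                     # False
```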

+ #### Streaming Response Handling

+ For streaming requests, the `chat_completions` endpoint returns a `StreamingResponse` whose content is generated by the `streaming_response_wrapper` function. This wrapper serves two purposes:
+ 1. It passes the chunks from the `RotatingClient`'s stream directly to the user.
+ 2. It aggregates the full response in the background so that it can be logged completely once the stream is finished.
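The pass-through-and-aggregate idea can be sketched with an async generator. The chunk format and `fake_stream` below are simplified stand-ins, not the proxy's actual wire format:

```python
import asyncio
import json

async def streaming_wrapper_sketch(stream, log_sink):
    """Forward SSE chunks unchanged while collecting the full text so it
    can be logged in one piece once the stream ends."""
    parts = []
    async for chunk in stream:
        payload = json.loads(chunk.removeprefix("data: "))
        parts.append(payload["choices"][0]["delta"].get("content", ""))
        yield chunk  # forwarded to the consumer untouched
    log_sink.append("".join(parts))  # logged after the stream is finished

async def fake_stream():
    for word in ["Hel", "lo"]:
        yield "data: " + json.dumps({"choices": [{"delta": {"content": word}}]})

async def demo():
    log = []
    received = [c async for c in streaming_wrapper_sketch(fake_stream(), log)]
    return received, log

received, log = asyncio.run(demo())
print(log)  # ['Hello']
```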

+ ### 3.2. `request_logger.py`

+ This module provides the `log_request_response` function, which writes the request and response data to a timestamped JSON file in the `logs/` directory. It handles creating separate directories for `completions` and `embeddings`.

+ ### 3.3. `build.py`

+ This is a utility script for creating a standalone executable of the proxy application using PyInstaller. It includes logic to dynamically find all provider plugins and explicitly include them as hidden imports, ensuring they are bundled into the final executable.
README.md CHANGED
@@ -15,10 +15,10 @@ Your proxy is now running! You can now use it in your applications.

  ## Detailed Setup and Features

- This project provides a robust solution for managing and rotating API keys for various Large Language Model (LLM) providers. It consists of two main components:

- 1. A reusable Python library (`rotating-api-key-client`) for intelligently rotating API keys.
- 2. A FastAPI proxy application that uses this library to provide an OpenAI-compatible endpoint.

  ## Features

@@ -31,15 +31,30 @@ This project provides a robust solution for managing and rotating API keys for v
  - **Provider Agnostic**: Compatible with any provider supported by `litellm`.
  - **OpenAI-Compatible Proxy**: Offers a familiar API interface with additional endpoints for model and provider discovery.

- ## Quick Start Guide

- This guide will get you up and running in just a few minutes.

- ### 1. Setup

- First, clone the repository and install the required dependencies.

- **For Linux/macOS:**
  ```bash
  # Clone the repository
  git clone https://github.com/Mirrowel/LLM-API-Key-Proxy.git
@@ -53,7 +68,7 @@ source venv/bin/activate
  pip install -r requirements.txt
  ```

- **For Windows:**
  ```powershell
  # Clone the repository
  git clone https://github.com/Mirrowel/LLM-API-Key-Proxy.git
@@ -67,34 +82,32 @@ python -m venv venv
  pip install -r requirements.txt
  ```

- ### 2. Configure API Keys

- Next, create your `.env` file by copying the provided example. This file is where you will store all your secret keys.

- **For Linux/macOS:**
  ```bash
  cp .env.example .env
  ```

- **For Windows:**
  ```powershell
  copy .env.example .env
  ```

- Now, open the new `.env` file and replace the placeholder values with your actual API keys.

  **Refer to the `.env.example` file for the correct format and a full list of supported providers.**

- The two main types of keys are:
-
- 1. **`PROXY_API_KEY`**: This is a secret key *you create*. It is used to authorize requests to *your* proxy, preventing unauthorized use.
  2. **Provider Keys**: These are the API keys you get from LLM providers (like Gemini, OpenAI, etc.). The proxy automatically finds them based on their name (e.g., `GEMINI_API_KEY_1`).

  **Example `.env` configuration:**
  ```env
  # A secret key for your proxy server to authenticate requests.
  # This can be any secret string you choose.
- PROXY_API_KEY="YOUR_PROXY_API_KEY"

  # --- Provider API Keys ---
  # Add your keys from various providers below.
@@ -153,9 +166,9 @@ curl -X POST http://127.0.0.1:8000/v1/chat/completions \

  ## Advanced Usage

- ### Using with the OpenAI Python Library

- The proxy is OpenAI-compatible, so you can use it directly with the `openai` Python client. This is the recommended way to integrate the proxy into your applications.

  ```python
  import openai
@@ -163,12 +176,12 @@ import openai
  # Point the client to your local proxy
  client = openai.OpenAI(
  base_url="http://127.0.0.1:8000/v1",
- api_key="your-super-secret-proxy-key" # Use your proxy key here
  )

  # Make a request
  response = client.chat.completions.create(
- model="gemini/gemini-2.5-flash-preview", # Specify provider and model
  messages=[
  {"role": "user", "content": "Write a short poem about space."}
@@ -177,6 +190,21 @@ response = client.chat.completions.create(
  print(response.choices[0].message.content)
  ```

  ### Available API Endpoints

  - `POST /v1/chat/completions`: The main endpoint for making chat requests.
@@ -185,6 +213,22 @@ print(response.choices[0].message.content)
  - `GET /v1/providers`: Returns a list of all configured providers.
  - `POST /v1/token-count`: Calculates the token count for a given message payload.

  ### Enabling Request Logging

  For debugging purposes, you can log the full request and response for every API call. To enable this, start the proxy with the `--enable-request-logging` flag:
@@ -199,25 +243,15 @@ uvicorn src.proxy_app.main:app --reload -- --enable-request-logging
  ./proxy_app.exe --enable-request-logging
  ```

- Logs will be saved in the `logs/` directory.

- ## How It Works

- The core of this project is the `RotatingClient` library, which manages a pool of API keys with a sophisticated concurrency model. When a request is made, the client:

- 1. **Acquires the Best Key**: It requests the best available key from the `UsageManager`. The manager uses a tiered locking strategy to find a key that is not on cooldown and preferably not in use. If a key is busy with another request for the *same model*, it waits. Otherwise, it allows concurrent use for *different models*.
- 2. **Makes the Request**: It uses the acquired key to make the API call via `litellm`.
- 3. **Handles Errors**:
- - It uses a `classify_error` function to determine the failure type.
- - For **server errors**, it retries the request with the same key using exponential backoff.
- - For **rate-limit or auth errors**, it records the failure, applies an escalating cooldown for that specific key-model pair, and the client immediately tries the next available key.
- 4. **Tracks Usage & Releases Key**: On a successful request, it records usage stats. The key's lock is then released, notifying any waiting requests that it is available.
-
- ## Troubleshooting
-
- - **`401 Unauthorized`**: Ensure your `PROXY_API_KEY` is set correctly in the `.env` file and included in the `Authorization` header of your request.
- - **`500 Internal Server Error`**: Check the console logs of the `uvicorn` server for detailed error messages. This could indicate an issue with one of your provider API keys or a problem with the provider's service.
- - **All keys on cooldown**: If you see a message that all keys are on cooldown, it means all your keys for a specific provider have recently failed. Check the `logs/` directory for details on why the failures occurred.

  ## Library and Technical Docs
 

  ## Detailed Setup and Features

+ This project provides a robust, self-hosted solution for managing and rotating API keys for various Large Language Model (LLM) providers. It consists of two main components:

+ 1. A reusable Python library (`rotating-api-key-client`) for intelligently rotating API keys with advanced concurrency and error handling.
+ 2. A FastAPI proxy application that uses this library to provide a single, unified, and OpenAI-compatible endpoint for all your LLM requests.

  ## Features

  - **Provider Agnostic**: Compatible with any provider supported by `litellm`.
  - **OpenAI-Compatible Proxy**: Offers a familiar API interface with additional endpoints for model and provider discovery.

+ ---
+
+ ## 1. Quick Start (Windows Executable)
+
+ This is the fastest way to get started for most users on Windows.
+
+ 1. **Download the latest release** from the [GitHub Releases page](https://github.com/Mirrowel/LLM-API-Key-Proxy/releases/latest).
+ 2. Unzip the downloaded file.
+ 3. **Run `setup_env.bat`**. A window will open to help you add your API keys. Follow the on-screen instructions.
+ 4. **Run `proxy_app.exe`**. This will start the proxy server in a new terminal window.

+ Your proxy is now running and ready to use at `http://127.0.0.1:8000`.

+ ---
+
+ ## 2. Detailed Setup (From Source)
+
+ This guide is for users who want to run the proxy from the source code on any operating system.

+ ### Step 1: Clone and Install

+ First, clone the repository and install the required dependencies into a virtual environment.
+
+ **Linux/macOS:**
  ```bash
  # Clone the repository
  git clone https://github.com/Mirrowel/LLM-API-Key-Proxy.git

  pip install -r requirements.txt
  ```

+ **Windows:**
  ```powershell
  # Clone the repository
  git clone https://github.com/Mirrowel/LLM-API-Key-Proxy.git

  pip install -r requirements.txt
  ```

+ ### Step 2: Configure API Keys

+ Create a `.env` file to store your secret keys. You can do this by copying the example file.

+ **Linux/macOS:**
  ```bash
  cp .env.example .env
  ```

+ **Windows:**
  ```powershell
  copy .env.example .env
  ```

+ Now, open the new `.env` file and add your keys.

  **Refer to the `.env.example` file for the correct format and a full list of supported providers.**

+ 1. **`PROXY_API_KEY`**: This is a secret key **you create**. It is used to authorize requests to *your* proxy, preventing unauthorized use.
  2. **Provider Keys**: These are the API keys you get from LLM providers (like Gemini, OpenAI, etc.). The proxy automatically finds them based on their name (e.g., `GEMINI_API_KEY_1`).

  **Example `.env` configuration:**
  ```env
  # A secret key for your proxy server to authenticate requests.
  # This can be any secret string you choose.
+ PROXY_API_KEY="a-very-secret-and-unique-key"

  # --- Provider API Keys ---
  # Add your keys from various providers below.
  ## 3. Advanced Usage

+ ### Using with the OpenAI Python Library (Recommended)

+ The proxy is OpenAI-compatible, so you can use it directly with the `openai` Python client.

  ```python
  import openai

  # Point the client to your local proxy
  client = openai.OpenAI(
  base_url="http://127.0.0.1:8000/v1",
+ api_key="a-very-secret-and-unique-key" # Use your PROXY_API_KEY here
  )

  # Make a request
  response = client.chat.completions.create(
+ model="gemini/gemini-2.5-flash", # Specify provider and model
  messages=[
  {"role": "user", "content": "Write a short poem about space."}
  ]

  print(response.choices[0].message.content)
  ```

+ ### Using with `curl`
+
+ You can also send requests directly using tools like `curl`.
+
+ ```bash
+ curl -X POST http://127.0.0.1:8000/v1/chat/completions \
+ -H "Content-Type: application/json" \
+ -H "Authorization: Bearer a-very-secret-and-unique-key" \
+ -d '{
+ "model": "gemini/gemini-2.5-flash",
+ "messages": [{"role": "user", "content": "What is the capital of France?"}]
+ }'
+ ```
+
  ### Available API Endpoints

  - `POST /v1/chat/completions`: The main endpoint for making chat requests.

  - `GET /v1/providers`: Returns a list of all configured providers.
  - `POST /v1/token-count`: Calculates the token count for a given message payload.

+ ---
+
+ ## 4. Advanced Topics
+
+ ### How It Works
+
+ The core of this project is the `RotatingClient` library. When a request is made, the client:
+
+ 1. **Acquires the Best Key**: It requests the best available key from the `UsageManager`. The manager uses a tiered locking strategy to find a key that is not on cooldown and preferably not in use. If a key is busy with another request for the *same model*, it waits. Otherwise, it allows concurrent use for *different models*.
+ 2. **Makes the Request**: It uses the acquired key to make the API call via `litellm`.
+ 3. **Handles Errors**:
+ - It uses a `classify_error` function to determine the failure type.
+ - For **server errors**, it retries the request with the same key using exponential backoff.
+ - For **rate-limit or auth errors**, it records the failure, applies an escalating cooldown for that specific key-model pair, and the client immediately tries the next available key.
+ 4. **Tracks Usage & Releases Key**: On a successful request, it records usage stats. The key's lock is then released, notifying any waiting requests that it is available.
+
  ### Enabling Request Logging

  For debugging purposes, you can log the full request and response for every API call. To enable this, start the proxy with the `--enable-request-logging` flag:

  ./proxy_app.exe --enable-request-logging
  ```

+ Logs will be saved as JSON files in the `logs/` directory.

+ ### Troubleshooting

+ - **`401 Unauthorized`**: Ensure your `PROXY_API_KEY` is set correctly in the `.env` file and included in the `Authorization: Bearer <key>` header of your request.
+ - **`500 Internal Server Error`**: Check the console logs of the `uvicorn` server for detailed error messages. This could indicate an issue with one of your provider API keys (e.g., it's invalid or has been revoked) or a problem with the provider's service.
+ - **All keys on cooldown**: If you see a message that all keys are on cooldown, it means all your keys for a specific provider have recently failed. Check the `logs/` directory (if enabled) or the `key_usage.json` file for details on why the failures occurred.

+ ---

  ## Library and Technical Docs
src/rotator_library/README.md CHANGED
@@ -2,24 +2,26 @@

 A robust, asynchronous, and thread-safe client that intelligently rotates and retries API keys for use with `litellm`. This library is designed to make your interactions with LLM providers more resilient, concurrent, and efficient.

- ## Features

 - **Asynchronous by Design**: Built with `asyncio` and `httpx` for high-performance, non-blocking I/O.
- - **Advanced Concurrency Control**: A single key can be used for multiple concurrent requests to *different* models, maximizing throughput while ensuring thread safety.
- - **Smart Key Rotation**: Acquires the least-used, available key using a tiered, model-aware locking strategy.
- - **Escalating Per-Model Cooldowns**: If a key fails, it's placed on a temporary, escalating cooldown for that specific model.
- - **Automatic Retries**: Retries requests on transient server errors with exponential backoff.
- - **Detailed Usage Tracking**: Tracks daily and global usage for each key, including token counts and approximate cost.
- - **Automatic Daily Resets**: Automatically resets cooldowns and archives stats daily.
 - **Provider Agnostic**: Works with any provider supported by `litellm`.
- - **Extensible**: Easily add support for new providers through a plugin-based architecture.

 ## Installation

- To install the library, you can install it directly from a local path, which is recommended for development.

 ```bash
- # The -e flag installs it in "editable" mode
 pip install -e .
 ```

@@ -31,11 +33,18 @@ This is the main class for interacting with the library. It is designed to be a

 ```python
 from rotating_api_key_client import RotatingClient

 client = RotatingClient(
-     api_keys: Dict[str, List[str]],
-     max_retries: int = 2,
-     usage_file_path: str = "key_usage.json"
 )
 ```

@@ -45,19 +54,21 @@ client = RotatingClient(

 ### Concurrency and Resource Management

- The `RotatingClient` is asynchronous and manages an `httpx.AsyncClient` internally. It's crucial to close the client properly to release resources. This can be done manually or by using an `async with` block.

- **Manual Management:**
 ```python
- client = RotatingClient(api_keys=api_keys)
- # ... use the client ...
- await client.close()
- ```

- **Recommended (`async with`):**
- ```python
- async with RotatingClient(api_keys=api_keys) as client:
-     # ... use the client ...
 ```

 ### Methods

@@ -71,24 +82,26 @@ This is the primary method for making API calls. It's a wrapper around `litellm.
 - For non-streaming requests, it returns the `litellm` response object.
 - For streaming requests, it returns an async generator that yields OpenAI-compatible Server-Sent Events (SSE). The wrapper ensures that key locks are released and usage is recorded only after the stream is fully consumed.

- **Example:**

 ```python
- import asyncio
- from rotating_api_key_client import RotatingClient
-
- async def main():
-     api_keys = {"gemini": ["key1", "key2"]}
     async with RotatingClient(api_keys=api_keys) as client:
-         response = await client.acompletion(
-             model="gemini/gemini-2.5-flash-preview-05-20",
-             messages=[{"role": "user", "content": "Hello!"}]
         )
-         print(response)

- asyncio.run(main())
 ```

@@ -124,10 +137,13 @@ from typing import List
 import httpx

 class MyProvider(ProviderInterface):
-     async def get_models(self, api_key: str, http_client: httpx.AsyncClient) -> List[str]:
         # Logic to fetch and return a list of model names
         # The model names should be prefixed with the provider name.
         # e.g., ["my-provider/model-1", "my-provider/model-2"]
         pass
 ```

 A robust, asynchronous, and thread-safe client that intelligently rotates and retries API keys for use with `litellm`. This library is designed to make your interactions with LLM providers more resilient, concurrent, and efficient.

+ ## Key Features

 - **Asynchronous by Design**: Built with `asyncio` and `httpx` for high-performance, non-blocking I/O.
+ - **Advanced Concurrency Control**: A single API key can be used for multiple concurrent requests to *different* models, maximizing throughput while ensuring thread safety. Requests for the *same model* using the same key are queued, preventing conflicts.
+ - **Smart Key Rotation**: Acquires the least-used, available key using a tiered, model-aware locking strategy to distribute load evenly.
+ - **Intelligent Error Handling**:
+   - **Escalating Per-Model Cooldowns**: If a key fails, it's placed on a temporary, escalating cooldown for that specific model, allowing it to continue being used for others.
+   - **Automatic Retries**: Retries requests on transient server errors (e.g., 5xx) with exponential backoff.
+   - **Key-Level Lockouts**: If a key fails across multiple models, it's temporarily taken out of rotation entirely.
+ - **Robust Streaming Support**: The client includes a wrapper for streaming responses that can reassemble fragmented JSON chunks and intelligently detect and handle errors that occur mid-stream.
+ - **Detailed Usage Tracking**: Tracks daily and global usage for each key, including token counts and approximate cost, persisted to a JSON file.
+ - **Automatic Daily Resets**: Automatically resets cooldowns and archives stats daily to keep the system running smoothly.
 - **Provider Agnostic**: Works with any provider supported by `litellm`.
+ - **Extensible**: Easily add support for new providers through a simple plugin-based architecture.

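As a rough illustration of an escalating cooldown like the one described above, a schedule might double with each consecutive failure of a key-model pair up to a cap. The exact schedule used by the library may differ; this is only a sketch:

```python
def cooldown_seconds(consecutive_failures: int, base: float = 5.0, cap: float = 3600.0) -> float:
    """Escalating backoff: doubles with each consecutive failure, up to a cap."""
    return min(base * 2 ** (consecutive_failures - 1), cap)

# 5s, 10s, 20s, 40s, 80s for the first five failures of a key-model pair.
schedule = [cooldown_seconds(n) for n in range(1, 6)]
```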
 ## Installation

+ You can install the library directly from a local path. The `-e` flag installs it in "editable" mode, which is recommended for development.

 ```bash
 pip install -e .
 ```
 
 

 ```python
 from rotating_api_key_client import RotatingClient
+ from typing import Dict, List
+
+ # Define your API keys, grouped by provider
+ api_keys: Dict[str, List[str]] = {
+     "gemini": ["your_gemini_key_1", "your_gemini_key_2"],
+     "openai": ["your_openai_key_1"],
+ }

 client = RotatingClient(
+     api_keys=api_keys,
+     max_retries=2,
+     usage_file_path="key_usage.json"
 )
 ```
 
 

 ### Concurrency and Resource Management

+ The `RotatingClient` is asynchronous and manages an `httpx.AsyncClient` internally. It's crucial to close the client properly to release resources. The recommended way is to use an `async with` block, which handles setup and teardown automatically.

 ```python
+ import asyncio
+
+ async def main():
+     async with RotatingClient(api_keys=api_keys) as client:
+         # ... use the client ...
+         response = await client.acompletion(
+             model="gemini/gemini-1.5-flash",
+             messages=[{"role": "user", "content": "Hello!"}]
+         )
+         print(response)
+
+ asyncio.run(main())
 ```

 ### Methods
 
 - For non-streaming requests, it returns the `litellm` response object.
 - For streaming requests, it returns an async generator that yields OpenAI-compatible Server-Sent Events (SSE). The wrapper ensures that key locks are released and usage is recorded only after the stream is fully consumed.

+ **Streaming Example:**

 ```python
+ async def stream_example():
     async with RotatingClient(api_keys=api_keys) as client:
+         response_stream = await client.acompletion(
+             model="gemini/gemini-1.5-flash",
+             messages=[{"role": "user", "content": "Tell me a long story."}],
+             stream=True
         )
+         async for chunk in response_stream:
+             print(chunk)
+
+ asyncio.run(stream_example())
 ```
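The chunk reassembly mentioned under "Robust Streaming Support" can be pictured as a small buffer that accumulates fragments until they parse as a complete JSON object. This is an illustrative toy, not the library's wrapper, which additionally handles SSE framing and mid-stream error payloads:

```python
import json

def reassemble_json_chunks(chunks):
    """Buffer partial JSON fragments and yield each object once it parses."""
    buffer = ""
    for fragment in chunks:
        buffer += fragment
        try:
            obj = json.loads(buffer)
        except json.JSONDecodeError:
            continue  # still incomplete; keep buffering
        buffer = ""
        yield obj

# A delta payload split at an arbitrary boundary, as a provider might send it:
fragments = [
    '{"choices": [{"delta": {"con',
    'tent": "Hel"}}]}',
    '{"choices": [{"delta": {"content": "lo"}}]}',
]
events = list(reassemble_json_chunks(fragments))
```

The first fragment fails to parse on its own, so it is held until the second fragment completes it; each complete object is then yielded downstream.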
 
+ #### `async def aembedding(self, **kwargs) -> Any:`
+
+ A wrapper around `litellm.aembedding` that provides the same key rotation and retry logic for embedding requests.
+
 #### `def token_count(self, model: str, text: str = None, messages: List[Dict[str, str]] = None) -> int:`

 Calculates the token count for a given text or list of messages using `litellm.token_counter`.
 
 import httpx

 class MyProvider(ProviderInterface):
+     async def get_models(self, api_key: str, client: httpx.AsyncClient) -> List[str]:
         # Logic to fetch and return a list of model names
         # The model names should be prefixed with the provider name.
         # e.g., ["my-provider/model-1", "my-provider/model-2"]
+         # Example:
+         # response = await client.get("https://api.myprovider.com/models", headers={"Auth": api_key})
+         # return [f"my-provider/{model['id']}" for model in response.json()]
         pass
 ```
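To make the plugin pattern concrete, here is a self-contained sketch. The `ProviderInterface` stand-in and the `PROVIDERS` registry below are illustrative only; the library's actual interface also receives an `httpx.AsyncClient`, and its discovery mechanism may differ:

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Dict, List, Type

class ProviderInterface(ABC):
    """Illustrative stand-in for the library's provider interface."""
    @abstractmethod
    async def get_models(self, api_key: str) -> List[str]: ...

# Registry mapping provider names to their implementing classes.
PROVIDERS: Dict[str, Type[ProviderInterface]] = {}

def register(name: str):
    """Decorator that adds a provider class to the registry."""
    def wrap(cls):
        PROVIDERS[name] = cls
        return cls
    return wrap

@register("my-provider")
class MyProvider(ProviderInterface):
    async def get_models(self, api_key: str) -> List[str]:
        # A real implementation would call the provider's HTTP API here.
        return ["my-provider/model-1", "my-provider/model-2"]

models = asyncio.run(PROVIDERS["my-provider"]().get_models("dummy-key"))
```

A client could then look up any registered provider by name and list its models without knowing the concrete class in advance.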
149