Mirrowel committed on
Commit
37e5eea
·
1 Parent(s): 939a72b

docs(readme): 📚 documentation expansion to cover new PR features(and in general)


Refine and expand the project documentation to help users onboard faster and operate the proxy in a variety of environments.

Files changed (5)
  1. .env.example +5 -1
  2. DOCUMENTATION.md +306 -127
  3. Deployment guide.md +11 -0
  4. README.md +197 -37
  5. src/rotator_library/README.md +88 -28
.env.example CHANGED
@@ -94,6 +94,10 @@ GEMINI_CLI_OAUTH_1=""
94
  # Path to your Qwen credential file (e.g., ~/.qwen/oauth_creds.json).
95
  QWEN_CODE_OAUTH_1=""
96
 
97
 
98
  # ------------------------------------------------------------------------------
99
  # | [ADVANCED] Provider-Specific Settings |
@@ -153,7 +157,7 @@ WHITELIST_MODELS_OPENAI=""
153
  MAX_CONCURRENT_REQUESTS_PER_KEY_OPENAI=1
154
  MAX_CONCURRENT_REQUESTS_PER_KEY_GEMINI=1
155
  MAX_CONCURRENT_REQUESTS_PER_KEY_ANTHROPIC=1
156
-
157
 
158
  # ------------------------------------------------------------------------------
159
  # | [ADVANCED] Proxy Configuration |
 
94
  # Path to your Qwen credential file (e.g., ~/.qwen/oauth_creds.json).
95
  QWEN_CODE_OAUTH_1=""
96
 
97
+ # --- iFlow ---
98
+ # Path to your iFlow credential file (e.g., ~/.iflow/oauth_creds.json).
99
+ IFLOW_OAUTH_1=""
100
+
101
 
102
  # ------------------------------------------------------------------------------
103
  # | [ADVANCED] Provider-Specific Settings |
 
157
  MAX_CONCURRENT_REQUESTS_PER_KEY_OPENAI=1
158
  MAX_CONCURRENT_REQUESTS_PER_KEY_GEMINI=1
159
  MAX_CONCURRENT_REQUESTS_PER_KEY_ANTHROPIC=1
160
+ MAX_CONCURRENT_REQUESTS_PER_KEY_IFLOW=1
161
 
162
  # ------------------------------------------------------------------------------
163
  # | [ADVANCED] Proxy Configuration |
DOCUMENTATION.md CHANGED
@@ -1,12 +1,15 @@
1
  # Technical Documentation: Universal LLM API Proxy & Resilience Library
2
 
3
- This document provides a detailed technical explanation of the project's two main components: the Universal LLM API Proxy and the Resilience Library that powers it.
4
 
5
  ## 1. Architecture Overview
6
 
7
  The project is a monorepo containing two primary components:
8
 
9
- 1. **The Proxy Application (`proxy_app`)**: This is the user-facing component. It's a FastAPI application that uses `litellm` to create a universal, OpenAI-compatible API. Its primary role is to abstract away the complexity of dealing with multiple LLM providers, offering a single point of entry for applications like agentic coders.
10
  2. **The Resilience Library (`rotator_library`)**: This is the core engine that provides high availability. It is consumed by the proxy app to manage a pool of API keys, handle errors gracefully, and ensure requests are completed successfully even when individual keys or provider endpoints face issues.
11
 
12
  This architecture cleanly separates the API interface from the resilience logic, making the library a portable and powerful tool for any application needing robust API key management.
@@ -28,180 +31,356 @@ The client is initialized with your provider API keys, retry settings, and a new
28
  ```python
29
  client = RotatingClient(
30
  api_keys=api_keys,
 
31
  max_retries=2,
32
- global_timeout=30 # in seconds
33
  )
34
  ```
35
 
36
- - `global_timeout`: A crucial new parameter that sets a hard time limit for the entire request lifecycle, from the moment `acompletion` is called until a response is returned or the timeout is exceeded.
37
 
38
  #### Core Responsibilities
39
 
40
- * Managing a shared `httpx.AsyncClient` for all non-blocking HTTP requests.
41
- * Interfacing with the `UsageManager` to acquire and release API keys.
42
- * Dynamically loading and using provider-specific plugins from the `providers/` directory.
43
- * Executing API calls via `litellm` with a robust, **deadline-driven** retry and key selection strategy.
44
- * Providing a safe, stateful wrapper for handling streaming responses.
45
- * Filtering available models using configurable whitelists and blacklists.
 
46
 
47
  #### Model Filtering Logic
48
 
49
- The `RotatingClient` provides fine-grained control over which models are exposed via the `/v1/models` endpoint. This is handled by the `get_available_models` method, which is called by `get_all_available_models`.
50
 
51
- The logic is as follows:
52
- 1. The client is initialized with `ignore_models` (blacklist) and `whitelist_models` dictionaries.
53
- 2. When `get_available_models` is called for a provider, it first fetches all models from the provider's API.
54
- 3. It then iterates through this list of actual models and applies the following rules:
55
- - **Whitelist Check**: It first checks if the model matches any pattern in the provider's whitelist. If it does, the model is **immediately included** in the final list, and the blacklist is ignored for this model.
56
- - **Blacklist Check**: If the model is *not* on the whitelist, it is then checked against the blacklist. If it matches a pattern, it is excluded.
57
- - **Default**: If a model is on neither list, it is included.
58
- 4. This ensures that the whitelist always acts as a definitive override to the blacklist.
59
 
60
  #### Request Lifecycle: A Deadline-Driven Approach
61
 
62
- The request lifecycle has been redesigned around a single, authoritative time budget to ensure predictable performance and prevent requests from hanging indefinitely.
63
 
64
  1. **Deadline Establishment**: The moment `acompletion` or `aembedding` is called, a `deadline` is calculated: `time.time() + self.global_timeout`. This `deadline` is the absolute point in time by which the entire operation must complete.
65
 
66
- 2. **Deadline-Aware Key Selection Loop**: The main `while` loop now has a critical secondary condition: `while len(tried_keys) < len(keys_for_provider) and time.time() < deadline:`. The loop will exit immediately if the `deadline` is reached, regardless of how many keys are left to try.
67
 
68
- 3. **Deadline-Aware Key Acquisition**: The `self.usage_manager.acquire_key()` method now accepts the `deadline`. The `UsageManager` will not wait indefinitely for a key; if it cannot acquire one before the `deadline` is met, it will raise a `NoAvailableKeysError`, causing the request to fail fast with a "busy" error.
69
-
70
- 4. **Deadline-Aware Retries**: When a transient error occurs, the client calculates the necessary `wait_time` for an exponential backoff. It then checks if this wait time fits within the remaining budget (`deadline - time.time()`).
71
- - **If it fits**: It waits (`asyncio.sleep`) and retries with the same key.
72
- - **If it exceeds the budget**: It skips the wait entirely, logs a warning, and immediately rotates to the next key to avoid wasting time.
73
-
74
- 5. **Refined Error Propagation**:
75
- - **Fatal Errors**: Invalid requests or authentication errors are raised immediately to the client.
76
- - **Intermittent Errors**: Temporary issues like server errors and provider-side capacity limits are now handled internally. The error is logged, the key is rotated, but the exception is **not** propagated to the end client. This prevents the client from seeing disruptive, intermittent failures.
77
- - **Final Failure**: A non-streaming request will only return `None` (indicating failure) if either a) the global `deadline` is exceeded, or b) all keys for the provider have been tried and have failed. A streaming request will yield a final `[DONE]` with an error message in the same scenarios.
78
 
79
  ### 2.2. `usage_manager.py` - Stateful Concurrency & Usage Management
80
 
81
- This class is the stateful core of the library, managing concurrency, usage, and cooldowns.
82
 
83
  #### Key Concepts
84
 
85
- * **Async-Native & Lazy-Loaded**: The class is fully asynchronous, using `aiofiles` for non-blocking file I/O. The usage data from the JSON file is loaded only when the first request is made (`_lazy_init`).
86
- * **Fine-Grained Locking**: Each API key is associated with its own `asyncio.Lock` and `asyncio.Condition` object. This allows for a highly granular and efficient locking strategy.
87
-
88
- #### Tiered Key Acquisition (`acquire_key`)
89
-
90
- This method implements the intelligent logic for selecting the best key for a job, now with deadline awareness.
91
-
92
- 1. **Deadline Enforcement**: The entire acquisition process runs in a `while time.time() < deadline:` loop. If a key cannot be found before the deadline, the method raises `NoAvailableKeysError`.
93
- 2. **Filtering**: It first filters out any keys that are on a global or model-specific cooldown.
94
- 3. **Tiering**: It categorizes the remaining, valid keys into two tiers:
95
- - **Tier 1 (Ideal)**: Keys that are completely free (not being used by any model).
96
- - **Tier 2 (Acceptable)**: Keys that are currently in use, but for *different models* than the one being requested. This allows a single key to be used for concurrent calls to, for example, `gemini-1.5-pro` and `gemini-1.5-flash`.
97
- 4. **Selection**: It attempts to acquire a lock on a key, prioritizing Tier 1 over Tier 2. Within each tier, it prioritizes the key with the lowest usage count.
98
- 5. **Waiting**: If no keys in Tier 1 or Tier 2 can be locked, it means all eligible keys are currently handling requests for the *same model*. The method then `await`s on the `asyncio.Condition` of the best available key. Crucially, this wait is itself timed out by the remaining request budget, preventing indefinite waits.
99
-
100
- #### Failure Handling & Cooldowns (`record_failure`)
101
-
102
- * **Escalating Backoff**: When a failure is recorded, it applies a cooldown that increases with the number of consecutive failures for that specific key-model pair (e.g., 10s, 30s, 60s, up to 2 hours).
103
- * **Authentication Errors**: These are treated more severely, applying an immediate 5-minute key-level lockout.
104
- * **Key-Level Lockouts**: If a single key accumulates 3 or more long-term (2-hour) cooldowns across different models, the manager assumes the key is compromised or disabled and applies a 5-minute global lockout on the key.
105
-
106
- ### Data Structure
107
-
108
- The `key_usage.json` file has a more complex structure to store this detailed state:
109
- ```json
110
- {
111
- "api_key_hash": {
112
- "daily": {
113
- "date": "YYYY-MM-DD",
114
- "models": {
115
- "gemini/gemini-1.5-pro": {
116
- "success_count": 10,
117
- "prompt_tokens": 5000,
118
- "completion_tokens": 10000,
119
- "approx_cost": 0.075
120
- }
121
- }
122
- },
123
- "global": { /* ... similar to daily, but accumulates over time ... */ },
124
- "model_cooldowns": {
125
- "gemini/gemini-1.5-flash": 1719987600.0
126
- },
127
- "failures": {
128
- "gemini/gemini-1.5-flash": {
129
- "consecutive_failures": 2
130
- }
131
- },
132
- "key_cooldown_until": null,
133
- "last_daily_reset": "YYYY-MM-DD"
134
- }
135
- }
136
  ```
137
 
138
- ## 3. `error_handler.py`
139
 
140
- This module provides a centralized function, `classify_error`, which is a significant improvement over simple boolean checks.
141
 
142
- * It takes a raw exception from `litellm` and returns a `ClassifiedError` data object.
143
- * This object contains the `error_type` (e.g., `'rate_limit'`, `'authentication'`), the original exception, the status code, and any `retry_after` information extracted from the error message.
144
- * This structured classification allows the `RotatingClient` to make more intelligent decisions about whether to retry with the same key or rotate to a new one.
145
 
146
- ### 2.4. `providers/` - Provider Plugins
147
 
148
- The provider plugin system allows for easy extension. The `__init__.py` file in this directory dynamically scans for all modules ending in `_provider.py`, imports the provider class from each, and registers it in the `PROVIDER_PLUGINS` dictionary. This makes adding new providers as simple as dropping a new file into the directory.
149
 
150
  ---
151
 
152
- ## 3. `proxy_app` - The FastAPI Proxy
153
 
154
- The `proxy_app` directory contains the FastAPI application that serves the `rotator_library`.
155
 
156
- ### 3.1. `main.py` - The FastAPI App
157
 
158
- This file defines the web server and its endpoints.
159
 
160
- #### Lifespan Management
161
 
162
- The application uses FastAPI's `lifespan` context manager to manage the `RotatingClient` instance. The client is initialized when the application starts and gracefully closed (releasing its `httpx` resources) when the application shuts down. This ensures that a single, stateful client instance is shared across all requests.
163
 
164
- #### Endpoints
165
 
166
- * `POST /v1/chat/completions`: The main endpoint for chat requests.
167
- * `POST /v1/embeddings`: The endpoint for creating embeddings.
168
- * `GET /v1/models`: Returns a list of all available models from configured providers.
169
- * `GET /v1/providers`: Returns a list of all configured providers.
170
- * `POST /v1/token-count`: Calculates the token count for a given message payload.
171
 
172
- #### Authentication
173
 
174
- All endpoints are protected by the `verify_api_key` dependency, which checks for a valid `Authorization: Bearer <PROXY_API_KEY>` header.
175
 
176
- #### Streaming Response Handling
177
 
178
- For streaming requests, the `chat_completions` endpoint returns a `StreamingResponse` whose content is generated by the `streaming_response_wrapper` function. This wrapper serves two purposes:
179
- 1. It passes the chunks from the `RotatingClient`'s stream directly to the user.
180
- 2. It aggregates the full response in the background so that it can be logged completely once the stream is finished.
181
 
182
- ### 3.2. `detailed_logger.py` - Comprehensive Transaction Logging
 
183
 
184
- To facilitate robust debugging and performance analysis, the proxy includes a powerful detailed logging system, enabled by the `--enable-request-logging` command-line flag. This system is managed by the `DetailedLogger` class in `detailed_logger.py`.
185
 
186
- Unlike simple logging, this system creates a **unique directory for every single transaction**, ensuring that all related data is isolated and easy to analyze.
187
 
188
- #### Log Directory Structure
189
 
190
- When logging is enabled, each request will generate a new directory inside `logs/detailed_logs/` with a name like `YYYYMMDD_HHMMSS_unique-uuid`. Inside this directory, you will find a complete record of the transaction:
191
 
192
- - **`request.json`**: Contains the full incoming request, including HTTP headers and the JSON body.
193
- - **`streaming_chunks.jsonl`**: For streaming requests, this file contains a timestamped log of every individual data chunk received from the provider. This is invaluable for debugging malformed streams or partial responses.
194
- - **`final_response.json`**: Contains the complete final response from the provider, including the status code, headers, and full JSON body. For streaming requests, this body is the fully reassembled message.
195
- - **`metadata.json`**: A summary file for quick analysis, containing:
196
- - `request_id`: The unique identifier for the transaction.
197
- - `duration_ms`: The total time taken for the request to complete.
198
- - `status_code`: The final HTTP status code returned by the provider.
199
- - `model`: The model used for the request.
200
- - `usage`: Token usage statistics (`prompt`, `completion`, `total`).
201
- - `finish_reason`: The reason the model stopped generating tokens.
202
- - `reasoning_found`: A boolean indicating if a `reasoning` field was detected in the response.
203
- - `reasoning_content`: The extracted content of the `reasoning` field, if found.
204
 
205
- ### 3.3. `build.py`
206
 
207
- This is a utility script for creating a standalone executable of the proxy application using PyInstaller. It includes logic to dynamically find all provider plugins and explicitly include them as hidden imports, ensuring they are bundled into the final executable.
 
1
  # Technical Documentation: Universal LLM API Proxy & Resilience Library
2
 
3
+ This document provides a detailed technical explanation of the project's architecture, internal components, and data flows. It is intended for developers who want to understand how the system achieves high availability and resilience.
4
 
5
  ## 1. Architecture Overview
6
 
7
  The project is a monorepo containing two primary components:
8
 
9
+ 1. **The Proxy Application (`proxy_app`)**: This is the user-facing component. It's a FastAPI application that acts as a universal gateway. It uses `litellm` to translate requests to various provider formats and includes:
10
+ * **Batch Manager**: Optimizes high-volume embedding requests.
11
+ * **Detailed Logger**: Provides per-request file logging for debugging.
12
+ * **OpenAI-Compatible Endpoints**: `/v1/chat/completions`, `/v1/embeddings`, etc.
13
  2. **The Resilience Library (`rotator_library`)**: This is the core engine that provides high availability. It is consumed by the proxy app to manage a pool of API keys, handle errors gracefully, and ensure requests are completed successfully even when individual keys or provider endpoints face issues.
14
 
15
  This architecture cleanly separates the API interface from the resilience logic, making the library a portable and powerful tool for any application needing robust API key management.
 
31
  ```python
32
  client = RotatingClient(
33
  api_keys=api_keys,
34
+ oauth_credentials=oauth_credentials,
35
  max_retries=2,
36
+ usage_file_path="key_usage.json",
37
+ configure_logging=True,
38
+ global_timeout=30,
39
+ abort_on_callback_error=True,
40
+ litellm_provider_params={},
41
+ ignore_models={},
42
+ whitelist_models={},
43
+ enable_request_logging=False,
44
+ max_concurrent_requests_per_key={}
45
  )
46
  ```
47
 
48
+ - `api_keys` (`Optional[Dict[str, List[str]]]`, default: `None`): A dictionary mapping provider names to a list of API keys.
49
+ - `oauth_credentials` (`Optional[Dict[str, List[str]]]`, default: `None`): A dictionary mapping provider names to a list of file paths to OAuth credential JSON files.
50
+ - `max_retries` (`int`, default: `2`): The number of times to retry a request with the *same key* if a transient server error occurs.
51
+ - `usage_file_path` (`str`, default: `"key_usage.json"`): The path to the JSON file where usage statistics are persisted.
52
+ - `configure_logging` (`bool`, default: `True`): If `True`, configures the library's logger to propagate logs to the root logger.
53
+ - `global_timeout` (`int`, default: `30`): A hard time limit (in seconds) for the entire request lifecycle.
54
+ - `abort_on_callback_error` (`bool`, default: `True`): If `True`, any exception raised by `pre_request_callback` will abort the request.
55
+ - `litellm_provider_params` (`Optional[Dict[str, Any]]`, default: `None`): Extra parameters to pass to `litellm` for specific providers.
56
+ - `ignore_models` (`Optional[Dict[str, List[str]]]`, default: `None`): Blacklist of models to exclude (supports wildcards).
57
+ - `whitelist_models` (`Optional[Dict[str, List[str]]]`, default: `None`): Whitelist of models to always include, overriding `ignore_models`.
58
+ - `enable_request_logging` (`bool`, default: `False`): If `True`, enables detailed per-request file logging.
59
+ - `max_concurrent_requests_per_key` (`Optional[Dict[str, int]]`, default: `None`): Max concurrent requests allowed for a single API key per provider.
60
 
61
  #### Core Responsibilities
62
 
63
+ * **Lifecycle Management**: Manages a shared `httpx.AsyncClient` for all non-blocking HTTP requests.
64
+ * **Key Management**: Interfacing with the `UsageManager` to acquire and release API keys based on load and health.
65
+ * **Plugin System**: Dynamically loading and using provider-specific plugins from the `providers/` directory.
66
+ * **Execution Logic**: Executing API calls via `litellm` with a robust, **deadline-driven** retry and key selection strategy.
67
+ * **Streaming Safety**: Providing a safe, stateful wrapper (`_safe_streaming_wrapper`) for handling streaming responses, buffering incomplete JSON chunks, and detecting mid-stream errors.
68
+ * **Model Filtering**: Filtering available models using configurable whitelists and blacklists.
69
+ * **Request Sanitization**: Automatically cleaning invalid parameters (like `dimensions` for non-OpenAI models) via `request_sanitizer.py`.
70
 
71
  #### Model Filtering Logic
72
 
73
+ The `RotatingClient` provides fine-grained control over which models are exposed via the `/v1/models` endpoint. This is handled by the `get_available_models` method.
74
 
75
+ The logic applies in the following order:
76
+ 1. **Whitelist Check**: If a provider has a whitelist defined (`WHITELIST_MODELS_<PROVIDER>`), any model on that list will **always be available**, even if it matches a blacklist pattern. This acts as a definitive override.
77
+ 2. **Blacklist Check**: For any model *not* on the whitelist, the client checks the blacklist (`IGNORE_MODELS_<PROVIDER>`). If the model matches a blacklist pattern (supports wildcards like `*-preview`), it is excluded.
78
+ 3. **Default**: If a model is on neither list, it is included.
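This precedence can be sketched as a minimal helper (hypothetical name; `fnmatch` stands in for the wildcard matching the lists support):

```python
from fnmatch import fnmatch

def is_model_exposed(model, whitelist, blacklist):
    """Whitelist wins outright; the blacklist only applies to non-whitelisted models."""
    if any(fnmatch(model, pat) for pat in whitelist):
        return True   # definitive override
    if any(fnmatch(model, pat) for pat in blacklist):
        return False  # excluded by wildcard pattern
    return True       # on neither list: included by default
```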
79
 
80
  #### Request Lifecycle: A Deadline-Driven Approach
81
 
82
+ The request lifecycle has been designed around a single, authoritative time budget to ensure predictable performance:
83
 
84
  1. **Deadline Establishment**: The moment `acompletion` or `aembedding` is called, a `deadline` is calculated: `time.time() + self.global_timeout`. This `deadline` is the absolute point in time by which the entire operation must complete.
85
+ 2. **Deadline-Aware Key Selection**: The main loop checks this deadline before every key acquisition attempt. If the deadline is exceeded, the request fails immediately.
86
+ 3. **Deadline-Aware Key Acquisition**: The `UsageManager` itself takes this `deadline`. It will only wait for a key (if all are busy) until the deadline is reached.
87
+ 4. **Deadline-Aware Retries**: If a transient error occurs (like a 500 or 429), the client calculates the backoff time. If waiting would push the total time past the deadline, the wait is skipped, and the client immediately rotates to the next key.
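The retry decision in step 4 can be sketched as follows (hypothetical helper; the real client tracks this state internally during the key-selection loop):

```python
import time

def backoff_or_rotate(attempt, deadline, base=1.0):
    """Return seconds to sleep before retrying the same key,
    or None to signal an immediate rotation to the next key."""
    wait = base * (2 ** attempt)        # exponential backoff
    remaining = deadline - time.time()  # budget left before the hard deadline
    return wait if wait <= remaining else None
```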
88
 
89
+ #### Streaming Resilience
90
 
91
+ The `_safe_streaming_wrapper` is a critical component for stability. It:
92
+ * **Buffers Fragments**: Reads raw chunks from the stream and buffers them until a valid JSON object can be parsed. This handles providers that may split JSON tokens across network packets.
93
+ * **Error Interception**: Detects if a chunk contains an API error (like a quota limit) instead of content, and raises a specific `StreamedAPIError`.
94
+ * **Quota Handling**: If a specific "quota exceeded" error is detected mid-stream multiple times, it can terminate the stream gracefully to prevent infinite retry loops on oversized inputs.
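The fragment buffering can be sketched for a newline-delimited stream (hypothetical helper; the real wrapper also inspects parsed chunks for error payloads):

```python
import json

def drain_buffer(buffer):
    """Split a raw buffer into parsed JSON objects plus any incomplete leftover.

    A fragment split across network packets stays in the leftover
    until the rest of it arrives with the next chunk.
    """
    objects, leftover = [], ""
    for line in buffer.split("\n"):
        if not line.strip():
            continue
        try:
            objects.append(json.loads(line))
        except json.JSONDecodeError:
            leftover += line  # incomplete fragment: keep for the next chunk
    return objects, leftover
```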
95
 
96
  ### 2.2. `usage_manager.py` - Stateful Concurrency & Usage Management
97
 
98
+ This class is the stateful core of the library, managing concurrency, usage tracking, and cooldowns.
99
 
100
  #### Key Concepts
101
 
102
+ * **Async-Native & Lazy-Loaded**: Fully asynchronous, using `aiofiles` for non-blocking file I/O. Usage data is loaded only when needed.
103
+ * **Fine-Grained Locking**: Each API key has its own `asyncio.Lock` and `asyncio.Condition`. This allows for highly granular control.
104
+
105
+ #### Tiered Key Acquisition Strategy
106
+
107
+ The `acquire_key` method uses a sophisticated strategy to balance load:
108
+
109
+ 1. **Filtering**: Keys currently on cooldown (global or model-specific) are excluded.
110
+ 2. **Tiering**: Valid keys are split into two tiers:
111
+ * **Tier 1 (Ideal)**: Keys that are completely idle (0 concurrent requests).
112
+ * **Tier 2 (Acceptable)**: Keys that are busy but still under their configured `MAX_CONCURRENT_REQUESTS_PER_KEY_<PROVIDER>` limit for the requested model. This allows a single key to be used multiple times for the same model, maximizing throughput.
113
+ 3. **Prioritization**: Within each tier, keys with the **lowest daily usage** are prioritized to spread costs evenly.
114
+ 4. **Concurrency Limits**: Checks against `max_concurrent` limits to prevent overloading a single key.
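The tiering above can be sketched as (hypothetical data shapes; the real `UsageManager` holds this state behind per-key locks):

```python
def pick_key(keys, active, usage, model, max_concurrent):
    """active: {key: {model: in-flight count}}; usage: {key: daily request count}."""
    def load(k):
        return sum(active.get(k, {}).values())
    # Tier 1: completely idle keys
    tier1 = [k for k in keys if load(k) == 0]
    # Tier 2: busy keys still under the per-model concurrency limit
    tier2 = [k for k in keys
             if load(k) > 0 and active.get(k, {}).get(model, 0) < max_concurrent]
    for tier in (tier1, tier2):
        if tier:
            # within a tier, prefer the key with the lowest daily usage
            return min(tier, key=lambda k: usage.get(k, 0))
    return None  # all eligible keys saturated
```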
115
+
116
+ #### Failure Handling & Cooldowns
117
+
118
+ * **Escalating Backoff**: When a failure occurs, the key gets a temporary cooldown for that specific model. Consecutive failures increase this time (10s -> 30s -> 60s -> 120s).
119
+ * **Key-Level Lockouts**: If a key accumulates failures across multiple distinct models (3+), it is assumed to be dead/revoked and placed on a global 5-minute lockout.
120
+ * **Authentication Errors**: Immediate 5-minute global lockout.
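The escalating schedule can be sketched as (hypothetical helper; the real manager also persists per-key failure counts across restarts):

```python
def cooldown_seconds(consecutive_failures, schedule=(10, 30, 60, 120)):
    """Escalating per-model cooldown; failures beyond the schedule stay at the cap."""
    idx = min(consecutive_failures, len(schedule)) - 1
    return schedule[max(idx, 0)]
```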
121
+
122
+ ### 2.3. `batch_manager.py` - Efficient Request Aggregation
123
+
124
+ The `EmbeddingBatcher` class optimizes high-throughput embedding workloads.
125
+
126
+ * **Mechanism**: It uses an `asyncio.Queue` to collect incoming requests.
127
+ * **Triggers**: A batch is dispatched when either:
128
+ 1. The queue size reaches `batch_size` (default: 64).
129
+ 2. A time window (`timeout`, default: 0.1s) elapses since the first request in the batch.
130
+ * **Efficiency**: This reduces dozens of HTTP calls to a single API request, significantly reducing overhead and rate limit usage.
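The two dispatch triggers can be sketched with a plain `asyncio.Queue` (hypothetical function name; the real `EmbeddingBatcher` runs this in a background task):

```python
import asyncio

async def batch_collect(queue, batch_size=64, timeout=0.1):
    """Collect up to batch_size items; flush early once the time
    window elapses after the first item arrives."""
    batch = [await queue.get()]  # block until the first request
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while len(batch) < batch_size:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break  # time window elapsed: dispatch what we have
        try:
            batch.append(await asyncio.wait_for(queue.get(), remaining))
        except asyncio.TimeoutError:
            break
    return batch
```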
131
+
132
+ ### 2.4. `background_refresher.py` - Automated Token Maintenance
133
+
134
+ The `BackgroundRefresher` ensures that OAuth tokens (for providers like Gemini CLI, Qwen, iFlow) never expire while the proxy is running.
135
+
136
+ * **Periodic Checks**: It runs a background task that wakes up at a configurable interval (default: 3600 seconds/1 hour).
137
+ * **Proactive Refresh**: It iterates through all loaded OAuth credentials and calls their `proactively_refresh` method to ensure tokens are valid before they are needed.
138
+
139
+ ### 2.6. Credential Management Architecture
140
+
141
+ The `CredentialManager` class (`credential_manager.py`) centralizes the lifecycle of all API credentials. It adheres to a "Local First" philosophy.
142
+
143
+ #### 2.6.1. Automated Discovery & Preparation
144
+
145
+ On startup (unless `SKIP_OAUTH_INIT_CHECK=true`), the manager performs a comprehensive sweep:
146
+
147
+ 1. **System-Wide Scan**: Searches for OAuth credential files in standard locations:
148
+ - `~/.gemini/` → All `*.json` files (typically `credentials.json`)
149
+ - `~/.qwen/` → All `*.json` files (typically `oauth_creds.json`)
150
+ - `~/.iflow/` → All `*.json` files
151
+
152
+ 2. **Local Import**: Valid credentials are **copied** (not moved) to the project's `oauth_creds/` directory with standardized names:
153
+ - `gemini_cli_oauth_1.json`, `gemini_cli_oauth_2.json`, etc.
154
+ - `qwen_code_oauth_1.json`, `qwen_code_oauth_2.json`, etc.
155
+ - `iflow_oauth_1.json`, `iflow_oauth_2.json`, etc.
156
+
157
+ 3. **Intelligent Deduplication**:
158
+ - The manager inspects each credential file for a `_proxy_metadata` field containing the user's email or ID
159
+ - If this field doesn't exist, it's added during import using provider-specific APIs (e.g., fetching Google account email for Gemini)
160
+ - Duplicate accounts (same email/ID) are detected and skipped with a warning log
161
+ - Prevents the same account from being added multiple times, even if the files are in different locations
162
+
163
+ 4. **Isolation**: The project's credentials in `oauth_creds/` are completely isolated from system-wide credentials, preventing cross-contamination
164
+
165
+ #### 2.6.2. Credential Loading & Stateless Operation
166
+
167
+ The manager supports loading credentials from two sources, with a clear priority:
168
+
169
+ **Priority 1: Local Files** (`oauth_creds/` directory)
170
+ - Standard `.json` files are loaded first
171
+ - Naming convention: `{provider}_oauth_{number}.json`
172
+ - Example: `oauth_creds/gemini_cli_oauth_1.json`
173
+
174
+ **Priority 2: Environment Variables** (Stateless Deployment)
175
+ - If no local files are found, the manager checks for provider-specific environment variables
176
+ - This is the key to "Stateless Deployment" for platforms like Railway, Render, Heroku
177
+
178
+ **Gemini CLI Environment Variables:**
179
+ ```
180
+ GEMINI_CLI_ACCESS_TOKEN
181
+ GEMINI_CLI_REFRESH_TOKEN
182
+ GEMINI_CLI_EXPIRY_DATE
183
+ GEMINI_CLI_EMAIL
184
+ GEMINI_CLI_PROJECT_ID (optional)
185
+ GEMINI_CLI_CLIENT_ID (optional)
186
  ```
187
 
188
+ **Qwen Code Environment Variables:**
189
+ ```
190
+ QWEN_CODE_ACCESS_TOKEN
191
+ QWEN_CODE_REFRESH_TOKEN
192
+ QWEN_CODE_EXPIRY_DATE
193
+ QWEN_CODE_EMAIL
194
+ ```
195
+
196
+ **iFlow Environment Variables:**
197
+ ```
198
+ IFLOW_ACCESS_TOKEN
199
+ IFLOW_REFRESH_TOKEN
200
+ IFLOW_EXPIRY_DATE
201
+ IFLOW_EMAIL
202
+ IFLOW_API_KEY
203
+ ```
204
 
205
+ **How it works:**
206
+ - If the manager finds (e.g.) `GEMINI_CLI_ACCESS_TOKEN`, it constructs an in-memory credential object that mimics the file structure
207
+ - The credential behaves exactly like a file-based credential (automatic refresh, expiry detection, etc.)
208
+ - No physical files are created or needed on the host system
209
+ - Perfect for ephemeral containers or read-only filesystems
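The fallback can be sketched as (hypothetical field names mirroring the file structure described above; the real manager builds a full credential object with refresh support):

```python
import os

def load_env_credential(prefix="GEMINI_CLI"):
    """Build an in-memory credential dict from environment variables,
    or return None if the provider's variables are absent."""
    access = os.environ.get(f"{prefix}_ACCESS_TOKEN")
    if not access:
        return None
    return {
        "access_token": access,
        "refresh_token": os.environ.get(f"{prefix}_REFRESH_TOKEN"),
        "expiry_date": os.environ.get(f"{prefix}_EXPIRY_DATE"),
        "email": os.environ.get(f"{prefix}_EMAIL"),
    }
```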
210
 
211
+ #### 2.6.3. Credential Tool Integration
212
 
213
+ The `credential_tool.py` provides a user-friendly CLI interface to the `CredentialManager`:
214
 
215
+ **Key Functions:**
216
+ 1. **OAuth Setup**: Wraps provider-specific `AuthBase` classes (`GeminiAuthBase`, `QwenAuthBase`, `IFlowAuthBase`) to handle interactive login flows
217
+ 2. **Credential Export**: Reads local `.json` files and generates `.env` format output for stateless deployment
218
+ 3. **API Key Management**: Adds or updates `PROVIDER_API_KEY_N` entries in the `.env` file
219
 
220
  ---
221
 
222
+ ### 2.7. Request Sanitizer (`request_sanitizer.py`)
223
 
224
+ The `sanitize_request_payload` function ensures requests are compatible with each provider's specific requirements:
225
 
226
+ **Parameter Cleaning Logic:**
227
 
228
+ 1. **`dimensions` Parameter**:
229
+ - Only supported by OpenAI's `text-embedding-3-small` and `text-embedding-3-large` models
230
+ - Automatically removed for all other models to prevent `400 Bad Request` errors
231
 
232
+ 2. **`thinking` Parameter** (Gemini-specific):
233
+ - Format: `{"type": "enabled", "budget_tokens": -1}`
234
+ - Only valid for `gemini/gemini-2.5-pro` and `gemini/gemini-2.5-flash`
235
+ - Removed for all other models
236
 
237
+ **Provider-Specific Tool Schema Cleaning:**
238
 
239
+ Implemented in individual provider classes (`QwenCodeProvider`, `IFlowProvider`):
240
 
241
+ - **Recursively removes** unsupported properties from tool function schemas:
242
+ - `strict`: OpenAI-specific, causes validation errors on Qwen/iFlow
243
+ - `additionalProperties`: Same issue
244
+ - **Prevents `400 Bad Request` errors** when using complex tool definitions
245
+ - Applied automatically before sending requests to the provider
246
 
247
+ ---
248
 
249
+ ### 2.8. Error Classification (`error_handler.py`)
250
 
251
+ The `ClassifiedError` class wraps all exceptions from `litellm` and categorizes them for intelligent handling:
252
 
253
+ **Error Types:**
254
+ ```python
255
+ class ErrorType(Enum):
256
+ RATE_LIMIT = "rate_limit" # 429 errors, temporary backoff needed
257
+ AUTHENTICATION = "authentication" # 401/403, invalid/revoked key
258
+ SERVER_ERROR = "server_error" # 500/502/503, provider infrastructure issues
259
+ QUOTA = "quota" # Daily/monthly quota exceeded
260
+ CONTEXT_LENGTH = "context_length" # Input too long for model
261
+ CONTENT_FILTER = "content_filter" # Request blocked by safety filters
262
+ NOT_FOUND = "not_found" # Model/endpoint doesn't exist
263
+ TIMEOUT = "timeout" # Request took too long
264
+ UNKNOWN = "unknown" # Unclassified error
265
+ ```
266
 
267
+ **Classification Logic:**
268
+
269
+ 1. **Status Code Analysis**: Primary classification method
270
+ - `401`/`403` → `AUTHENTICATION`
271
+ - `429` → `RATE_LIMIT`
272
+ - `400` with "context_length" or "tokens" → `CONTEXT_LENGTH`
273
+ - `400` with "quota" → `QUOTA`
274
+ - `500`/`502`/`503` → `SERVER_ERROR`
275
+
276
+ 2. **Message Analysis**: Fallback for ambiguous errors
277
+ - Searches for keywords like "quota exceeded", "rate limit", "invalid api key"
278
+
279
+ 3. **Provider-Specific Overrides**: Some providers use non-standard error formats
280
+
281
+ **Usage in Client:**
282
+ - `AUTHENTICATION` → Immediate 5-minute global lockout
283
+ - `RATE_LIMIT`/`QUOTA` → Escalating per-model cooldown
284
+ - `SERVER_ERROR` → Retry with same key (up to `max_retries`)
285
+ - `CONTEXT_LENGTH`/`CONTENT_FILTER` → Immediate failure (user needs to fix request)
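A minimal sketch of this two-stage classification (status code first, message keywords as a fallback) might look like the following. The function name and the reduced enum are illustrative, not the library's exact API:

```python
from enum import Enum

class ErrorType(Enum):
    RATE_LIMIT = "rate_limit"
    AUTHENTICATION = "authentication"
    SERVER_ERROR = "server_error"
    QUOTA = "quota"
    CONTEXT_LENGTH = "context_length"
    UNKNOWN = "unknown"

def classify(status, message):
    # Stage 1: status code analysis.
    if status in (401, 403):
        return ErrorType.AUTHENTICATION
    if status == 429:
        return ErrorType.RATE_LIMIT
    if status in (500, 502, 503):
        return ErrorType.SERVER_ERROR
    if status == 400:
        lowered = message.lower()
        if "context_length" in lowered or "tokens" in lowered:
            return ErrorType.CONTEXT_LENGTH
        if "quota" in lowered:
            return ErrorType.QUOTA
    # Stage 2: keyword fallback for ambiguous errors.
    lowered = message.lower()
    if "rate limit" in lowered:
        return ErrorType.RATE_LIMIT
    if "invalid api key" in lowered:
        return ErrorType.AUTHENTICATION
    return ErrorType.UNKNOWN
```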
286
+
287
+ ---
288
+
289
+ ### 2.9. Cooldown Management (`cooldown_manager.py`)
290
+
291
+ The `CooldownManager` handles IP or account-level rate limiting that affects all keys for a provider:
292
+
293
+ **Purpose:**
294
+ - Some providers (like NVIDIA NIM) have rate limits tied to account/IP rather than API key
295
+ - When a 429 error occurs, ALL keys for that provider must be paused
296
+
297
+ **Key Methods:**
298
+
299
+ 1. **`is_cooling_down(provider: str) -> bool`**:
300
+ - Checks if a provider is currently in a global cooldown period
301
+ - Returns `True` if the current time is still within the cooldown window
302
+
303
+ 2. **`start_cooldown(provider: str, duration: int)`**:
304
+ - Initiates or extends a cooldown for a provider
305
+ - Duration is typically 60-120 seconds for 429 errors
306
+
307
+ 3. **`get_cooldown_remaining(provider: str) -> float`**:
308
+ - Returns remaining cooldown time in seconds
309
+ - Used for logging and diagnostics
310
+
311
+ **Integration with UsageManager:**
312
+ - When a key fails with `RATE_LIMIT` error type, the client checks if it's likely an IP-level limit
313
+ - If so, `CooldownManager.start_cooldown()` is called for the entire provider
314
+ - All subsequent `acquire_key()` calls for that provider will wait until the cooldown expires
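The three methods above can be sketched with a plain monotonic clock. This is a simplified, non-async illustration; the real manager also coordinates with `UsageManager`:

```python
import time

class CooldownManager:
    """Simplified sketch of the provider-wide cooldown described above."""

    def __init__(self):
        self._cooldown_until = {}  # provider -> monotonic deadline

    def start_cooldown(self, provider, duration):
        # Extend, never shorten, an existing cooldown.
        deadline = time.monotonic() + duration
        self._cooldown_until[provider] = max(
            self._cooldown_until.get(provider, 0.0), deadline
        )

    def is_cooling_down(self, provider):
        return time.monotonic() < self._cooldown_until.get(provider, 0.0)

    def get_cooldown_remaining(self, provider):
        return max(0.0, self._cooldown_until.get(provider, 0.0) - time.monotonic())

manager = CooldownManager()
manager.start_cooldown("nvidia_nim", 60)
```

Using `max()` in `start_cooldown` means repeated 429s extend the window rather than accidentally shortening it.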
315
+
316
+ ---
317
+
318
+ ## 3. Provider Specific Implementations
319
+
320
+ The library handles provider idiosyncrasies through specialized "Provider" classes in `src/rotator_library/providers/`.
321
+
322
+ ### 3.1. Gemini CLI (`gemini_cli_provider.py`)
323
+
324
+ The `GeminiCliProvider` is the most complex implementation, mimicking the Google Cloud Code extension.
325
+
326
+ #### Authentication (`gemini_auth_base.py`)
327
+
328
+ * **OAuth Flow**: Uses a standard OAuth 2.0 authorization-code flow with a loopback redirect. The `credential_tool` spins up a local web server (`localhost:8085`) to capture the callback from Google's auth page.
329
+ * **Token Lifecycle**:
330
+ * **Proactive Refresh**: Tokens are refreshed 5 minutes before expiry.
331
+ * **Atomic Writes**: Credential files are updated using a temp-file-and-move strategy to prevent corruption during writes.
332
+ * **Revocation Handling**: If a `400` or `401` occurs during refresh, the token is marked as revoked, preventing infinite retry loops.
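The temp-file-and-move strategy mentioned above can be illustrated as follows (a sketch; the helper name is hypothetical). `os.replace` is atomic on both POSIX and Windows, so a crash mid-write leaves either the old file or the new one, never a half-written credential file:

```python
import json
import os
import tempfile

def write_credentials_atomically(path, creds):
    """Write JSON to a temp file in the target directory, then swap it in."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(creds, handle, indent=2)
        os.replace(tmp_path, path)  # atomic swap; readers never see a partial file
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file on any failure
        raise

demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "oauth_creds.json")
write_credentials_atomically(demo_path, {"access_token": "abc", "expiry": 0})
```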
333
+
334
+ #### Project ID Discovery (Zero-Config)
335
+
336
+ The provider employs a sophisticated, cached discovery mechanism to find a valid Google Cloud Project ID:
337
+ 1. **Configuration**: Checks `GEMINI_CLI_PROJECT_ID` first.
338
+ 2. **Code Assist API**: Tries `CODE_ASSIST_ENDPOINT:loadCodeAssist`. This returns the project associated with the Cloud Code extension.
339
+ 3. **Onboarding Flow**: If step 2 fails, it triggers the `onboardUser` endpoint. This initiates a Long-Running Operation (LRO) that automatically provisions a free-tier Google Cloud Project for the user. The proxy polls this operation for up to 5 minutes until completion.
340
+ 4. **Resource Manager**: As a final fallback, it lists all active projects via the Cloud Resource Manager API and selects the first one.
341
+
342
+ #### Rate Limit Handling
343
+
344
+ * **Internal Endpoints**: Uses `https://cloudcode-pa.googleapis.com/v1internal`, which typically has higher quotas than the public API.
345
+ * **Smart Fallback**: If `gemini-2.5-pro` hits a rate limit (`429`), the provider transparently retries the request using `gemini-2.5-pro-preview-06-05`. This fallback chain is configurable in code.
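The fallback behaviour can be sketched as a simple chain walk. The `FALLBACK_CHAIN` mapping and `RateLimited` exception are illustrative stand-ins for the provider's internal logic:

```python
# Hypothetical names; the real chain lives inside GeminiCliProvider.
FALLBACK_CHAIN = {
    "gemini-2.5-pro": ["gemini-2.5-pro-preview-06-05"],
}

class RateLimited(Exception):
    pass

def complete_with_fallback(model, call):
    """Try the requested model, then each configured fallback on a 429."""
    for candidate in [model, *FALLBACK_CHAIN.get(model, [])]:
        try:
            return call(candidate)
        except RateLimited:
            continue
    raise RateLimited(f"All fallbacks exhausted for {model}")

def fake_call(model):
    # Simulate the primary model being rate-limited.
    if model == "gemini-2.5-pro":
        raise RateLimited()
    return f"ok:{model}"

result = complete_with_fallback("gemini-2.5-pro", fake_call)
```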
346
+
347
+ ### 3.2. Qwen Code (`qwen_code_provider.py`)
348
+
349
+ * **Dual Auth**: Supports both standard API keys (direct) and OAuth (via `QwenAuthBase`).
350
+ * **Device Flow**: Implements the OAuth Device Authorization Grant (RFC 8628). It displays a code to the user and polls the token endpoint until the user authorizes the device in their browser.
351
+ * **Dummy Tool Injection**: To work around a Qwen API bug where streams hang if `tools` is empty but `tool_choice` logic is present, the provider injects a benign `do_not_call_me` tool.
352
+ * **Schema Cleaning**: Recursively removes `strict` and `additionalProperties` from tool schemas, as Qwen's validation is stricter than OpenAI's.
353
+ * **Reasoning Parsing**: Detects `<think>` tags in the raw stream and redirects their content to a separate `reasoning_content` field in the delta, mimicking the OpenAI o1 format.
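As a simplified, non-streaming illustration of that redirection (the real provider handles `<think>` tags incrementally across stream chunks):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text):
    """Separate <think>...</think> content from the visible reply."""
    reasoning = "".join(THINK_RE.findall(text))
    content = THINK_RE.sub("", text)
    return {"content": content.strip(), "reasoning_content": reasoning.strip()}

delta = split_reasoning("<think>User wants JSON.</think>Here is the JSON.")
```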
354
+
355
+ ### 3.3. iFlow (`iflow_provider.py`)
356
+
357
+ * **Hybrid Auth**: Uses a custom OAuth flow (Authorization Code) to obtain an `access_token`. However, the *actual* API calls use a separate `apiKey` that is retrieved from the user's profile (`/api/oauth/getUserInfo`) using the access token.
358
+ * **Callback Server**: The auth flow spins up a local server on port `11451` to capture the redirect.
359
+ * **Token Management**: Automatically refreshes the OAuth token and re-fetches the API key if needed.
360
+ * **Schema Cleaning**: Similar to Qwen, it aggressively sanitizes tool schemas to prevent 400 errors.
361
+ * **Dedicated Logging**: Implements `_IFlowFileLogger` to capture raw chunks for debugging proprietary API behaviors.
362
+
363
+ ### 3.4. Google Gemini (`gemini_provider.py`)
364
+
365
+ * **Thinking Parameter**: Automatically translates the OpenAI-style `thinking` parameter into the native reasoning configuration expected by Gemini 2.5 models.
366
+ * **Safety Settings**: Ensures default safety settings (blocking nothing) are applied if not provided, preventing over-sensitive refusals.
367
+
368
+ ---
369
 
370
+ ## 4. Logging & Debugging
371
 
372
+ ### `detailed_logger.py`
373
 
374
+ To facilitate robust debugging, the proxy includes a comprehensive transaction logging system.
375
 
376
+ * **Unique IDs**: Every request generates a UUID.
377
+ * **Directory Structure**: Logs are stored in `logs/detailed_logs/YYYYMMDD_HHMMSS_{uuid}/`.
378
+ * **Artifacts**:
379
+ * `request.json`: The exact payload sent to the proxy.
380
+ * `final_response.json`: The complete reassembled response.
381
+ * `streaming_chunks.jsonl`: A line-by-line log of every SSE chunk received from the provider.
382
+ * `metadata.json`: Performance metrics (duration, token usage, model used).
383
 
384
+ This level of detail allows developers to trace exactly why a request failed or why a specific key was rotated.
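The directory naming convention above can be reproduced with a couple of standard-library calls (an illustrative helper, not the logger's actual code):

```python
import uuid
from datetime import datetime
from pathlib import Path

def new_transaction_dir(base="logs/detailed_logs"):
    """Build a per-request directory path of the form YYYYMMDD_HHMMSS_{uuid}."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return Path(base) / f"{stamp}_{uuid.uuid4()}"

path = new_transaction_dir()
```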
 
 
385
 
 
386
 
 
Deployment guide.md CHANGED
@@ -69,8 +69,19 @@ OPENROUTER_API_KEY_1="your-openrouter-key"
69
 
70
  - Supported providers: Check LiteLLM docs for a full list and specifics (e.g., GEMINI, OPENROUTER, NVIDIA_NIM).
71
  - Tip: Start with 1-2 providers to test. Don't share this file publicly!
 
72
  4. Save the file. (We'll upload it to Render in Step 5.)
73
 
 
74
  ## Step 4: Create a New Web Service on Render
75
 
76
  1. Log in to render.com and go to your Dashboard.
 
69
 
70
  - Supported providers: Check LiteLLM docs for a full list and specifics (e.g., GEMINI, OPENROUTER, NVIDIA_NIM).
71
  - Tip: Start with 1-2 providers to test. Don't share this file publicly!
72
+
73
+ ### Advanced: Stateless Deployment for OAuth Providers (Gemini CLI, Qwen, iFlow)
74
+ If you are using providers that require complex OAuth files (like **Gemini CLI**, **Qwen Code**, or **iFlow**), you don't need to upload the JSON files manually. The proxy includes a tool to "export" these credentials into environment variables.
75
+
76
+ 1. Run the credential tool locally: `python -m rotator_library.credential_tool`
77
+ 2. Select the "Export ... to .env" option for your provider.
78
+ 3. The tool will generate a file (e.g., `gemini_cli_user_at_gmail.env`) containing variables like `GEMINI_CLI_ACCESS_TOKEN`, `GEMINI_CLI_REFRESH_TOKEN`, etc.
79
+ 4. Copy the contents of this file and paste them directly into your `.env` file or Render's "Environment Variables" section.
80
+ 5. The proxy will automatically detect and use these variables—no file upload required!
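+ For reference, the exported file is a plain env fragment. The exact set of variables depends on the provider; the names below follow the `GEMINI_CLI_*` pattern from step 3, and the values are placeholders:
+
+ ```env
+ # Illustrative excerpt of an exported credential file (placeholder values)
+ GEMINI_CLI_ACCESS_TOKEN="ya29.a0...placeholder"
+ GEMINI_CLI_REFRESH_TOKEN="1//0g...placeholder"
+ # The tool also emits the remaining token metadata your credential file contains.
+ ```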
81
+
82
  4. Save the file. (We'll upload it to Render in Step 5.)
83
 
84
+
85
  ## Step 4: Create a New Web Service on Render
86
 
87
  1. Log in to render.com and go to your Dashboard.
README.md CHANGED
@@ -1,18 +1,6 @@
1
  # Universal LLM API Proxy & Resilience Library [![ko-fi](https://ko-fi.com/img/githubbutton_sm.svg)](https://ko-fi.com/C0C0UZS4P)
2
  [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/Mirrowel/LLM-API-Key-Proxy) [![zread](https://img.shields.io/badge/Ask_Zread-_.svg?style=flat&color=00b0aa&labelColor=000000&logo=data%3Aimage%2Fsvg%2Bxml%3Bbase64%2CPHN2ZyB3aWR0aD0iMTYiIGhlaWdodD0iMTYiIHZpZXdCb3g9IjAgMCAxNiAxNiIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPHBhdGggZD0iTTQuOTYxNTYgMS42MDAxSDIuMjQxNTZDMS44ODgxIDEuNjAwMSAxLjYwMTU2IDEuODg2NjQgMS42MDE1NiAyLjI0MDFWNC45NjAxQzEuNjAxNTYgNS4zMTM1NiAxLjg4ODEgNS42MDAxIDIuMjQxNTYgNS42MDAxSDQuOTYxNTZDNS4zMTUwMiA1LjYwMDEgNS42MDE1NiA1LjMxMzU2IDUuNjAxNTYgNC45NjAxVjIuMjQwMUM1LjYwMTU2IDEuODg2NjQgNS4zMTUwMiAxLjYwMDEgNC45NjE1NiAxLjYwMDFaIiBmaWxsPSIjZmZmIi8%2BCjxwYXRoIGQ9Ik00Ljk2MTU2IDEwLjM5OTlIMi4yNDE1NkMxLjg4ODEgMTAuMzk5OSAxLjYwMTU2IDEwLjY4NjQgMS42MDE1NiAxMS4wMzk5VjEzLjc1OTlDMS42MDE1NiAxNC4xMTM0IDEuODg4MSAxNC4zOTk5IDIuMjQxNTYgMTQuMzk5OUg0Ljk2MTU2QzUuMzE1MDIgMTQuMzk5OSA1LjYwMTU2IDE0LjExMzQgNS42MDE1NiAxMy43NTk5VjExLjAzOTlDNS42MDE1NiAxMC42ODY0IDUuMzE1MDIgMTAuMzk5OSA0Ljk2MTU2IDEwLjM5OTlaIiBmaWxsPSIjZmZmIi8%2BCjxwYXRoIGQ9Ik0xMy43NTg0IDEuNjAwMUgxMS4wMzg0QzEwLjY4NSAxLjYwMDEgMTAuMzk4NCAxLjg4NjY0IDEwLjM5ODQgMi4yNDAxVjQuOTYwMUMxMC4zOTg0IDUuMzEzNTYgMTAuNjg1IDUuNjAwMSAxMS4wMzg0IDUuNjAwMUgxMy43NTg0QzE0LjExMTkgNS42MDAxIDE0LjM5ODQgNS4zMTM1NiAxNC4zOTg0IDQuOTYwMVYyLjI0MDFDMTQuMzk4NCAxLjg4NjY0IDE0LjExMTkgMS42MDAxIDEzLjc1ODQgMS42MDAxWiIgZmlsbD0iI2ZmZiIvPgo8cGF0aCBkPSJNNCAxMkwxMiA0TDQgMTJaIiBmaWxsPSIjZmZmIi8%2BCjxwYXRoIGQ9Ik00IDEyTDEyIDQiIHN0cm9rZT0iI2ZmZiIgc3Ryb2tlLXdpZHRoPSIxLjUiIHN0cm9rZS1saW5lY2FwPSJyb3VuZCIvPgo8L3N2Zz4K&logoColor=ffffff)](https://zread.ai/Mirrowel/LLM-API-Key-Proxy)
3
 
4
- ## Easy Setup for Beginners (Windows)
5
-
6
- This is the fastest way to get started.
7
-
8
- 1. **Download the latest release** from the [GitHub Releases page](https://github.com/Mirrowel/LLM-API-Key-Proxy/releases/latest).
9
- 2. Unzip the downloaded file.
10
- 3. **Double-click `setup_env.bat`**. A window will open to help you add your API keys. Follow the on-screen instructions.
11
- 4. **Double-click `proxy_app.exe`**. This will start the proxy server.
12
-
13
- Your proxy is now running! You can now use it in your applications.
14
-
15
- ---
16
 
17
  ## Detailed Setup and Features
18
 
@@ -26,9 +14,12 @@ This project provides a powerful solution for developers building complex applic
26
  - **Universal API Endpoint**: Simplifies development by providing a single, OpenAI-compatible interface for diverse LLM providers.
27
  - **High Availability**: The underlying library ensures your application remains operational by gracefully handling transient provider errors and API key-specific issues.
28
  - **Resilient Performance**: A global timeout on all requests prevents your application from hanging on unresponsive provider APIs.
29
- - **Efficient Concurrency**: Maximizes throughput by allowing a single API key to handle multiple concurrent requests to different models.
30
  - **Intelligent Key Management**: Optimizes request distribution across your pool of keys by selecting the best available one for each call.
31
- - **Automated OAuth Discovery**: Automatically discovers, validates, and manages OAuth credentials from standard provider directories (e.g., `~/.gemini/`, `~/.qwen/`, `~/.iflow/`). No manual `.env` configuration is required for supported providers.
 
 
 
32
  - **Duplicate Credential Detection**: Intelligently detects if multiple local credential files belong to the same user account and logs a warning, preventing redundancy in your key pool.
33
  - **Escalating Per-Model Cooldowns**: If a key fails for a specific model, it's placed on a temporary, escalating cooldown for that model, allowing it to be used with others.
34
  - **Automatic Daily Resets**: Cooldowns and usage statistics are automatically reset daily, making the system self-maintaining.
@@ -37,18 +28,65 @@ This project provides a powerful solution for developers building complex applic
37
  - **OpenAI-Compatible Proxy**: Offers a familiar API interface with additional endpoints for model and provider discovery.
38
  - **Advanced Model Filtering**: Supports both blacklists and whitelists to give you fine-grained control over which models are available through the proxy.
39
 
 
40
  ---
41
 
42
- ## 1. Quick Start (Windows Executable)
43
 
44
- This is the fastest way to get started for most users on Windows.
45
 
46
  1. **Download the latest release** from the [GitHub Releases page](https://github.com/Mirrowel/LLM-API-Key-Proxy/releases/latest).
47
  2. Unzip the downloaded file.
48
- 3. **Run `setup_env.bat`**. A window will open to help you add your API keys. Follow the on-screen instructions.
49
- 4. **Run `proxy_app.exe`**. This will start the proxy server in a new terminal window.
 
50
 
51
- Your proxy is now running and ready to use at `http://127.0.0.1:8000`.
 
52
 
53
  ---
54
 
@@ -121,22 +159,67 @@ You only need to create a `.env` file to set your `PROXY_API_KEY` and to overrid
121
 
122
  #### Interactive Credential Management Tool
123
 
124
- For easier credential management, you can use the interactive credential tool:
125
 
126
  ```bash
127
  python -m rotator_library.credential_tool
128
  ```
129
 
130
- This tool provides:
131
- 1. **Add OAuth Credential** - Interactive OAuth flow for Gemini CLI, Qwen Code, and iFlow
132
- 2. **Add API Key** - Add API keys for any LiteLLM-supported provider
133
- 3. **Export Gemini CLI to .env** - NEW! Export OAuth credentials to environment variables for stateless deployments
134
 
135
- **For Stateless Hosting (Railway, Render, Vercel, etc.):**
136
- - Use option 3 to export your Gemini CLI credentials to `.env` format
137
- - The generated file contains all necessary environment variables
138
- - Simply paste these into your hosting platform's environment settings
139
- - No file persistence required - credentials load automatically from environment variables
 
140
 
141
  **Example `.env` configuration:**
142
  ```env
@@ -269,17 +352,21 @@ curl -X POST http://127.0.0.1:8000/v1/chat/completions \
269
 
270
  ## 4. Advanced Topics
271
 
272
  ### How It Works
273
 
274
- When a request is made to the proxy, the application uses its core resilience library to ensure the request is handled reliably:
275
 
276
- 1. **Selects an Optimal Key**: The `UsageManager` selects the best available key from your pool. It uses a tiered locking strategy to find a healthy, available key, prioritizing those with the least recent usage. This allows for concurrent requests to different models using the same key, maximizing efficiency.
277
- 2. **Makes the Request**: The proxy uses the acquired key to make the API call to the target provider via `litellm`.
278
- 3. **Manages Errors Gracefully**:
279
- - It uses a `classify_error` function to determine the failure type.
280
- - For **transient server errors**, it retries the request with the same key using exponential backoff.
281
- - For **key-specific issues (e.g., authentication or provider-side limits)**, it temporarily places that key on a cooldown for the specific model and seamlessly retries the request with the next available key from the pool.
282
- 4. **Tracks Usage & Releases Key**: On a successful request, it records usage stats. The key is then released back into the available pool, ready for the next request.
283
 
284
  ### Command-Line Arguments and Scripts
285
 
@@ -289,11 +376,84 @@ The proxy server can be configured at runtime using the following command-line a
289
  - `--port`: The port to run the server on. Defaults to `8000`.
290
  - `--enable-request-logging`: A flag to enable detailed, per-request logging. When active, the proxy creates a unique directory for each transaction in the `logs/detailed_logs/` folder, containing the full request, response, streaming chunks, and performance metadata. This is highly recommended for debugging.
291
 
292
  **Example:**
293
  ```bash
294
  python src/proxy_app/main.py --host 127.0.0.1 --port 9999 --enable-request-logging
295
  ```
296
 
 
297
  #### Windows Batch Scripts
298
 
299
  For convenience on Windows, you can use the provided `.bat` scripts in the root directory to run the proxy with common configurations:
 
1
  # Universal LLM API Proxy & Resilience Library [![ko-fi](https://ko-fi.com/img/githubbutton_sm.svg)](https://ko-fi.com/C0C0UZS4P)
2
  [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/Mirrowel/LLM-API-Key-Proxy) [![zread](https://img.shields.io/badge/Ask_Zread-_.svg?style=flat&color=00b0aa&labelColor=000000&logo=data%3Aimage%2Fsvg%2Bxml%3Bbase64%2CPHN2ZyB3aWR0aD0iMTYiIGhlaWdodD0iMTYiIHZpZXdCb3g9IjAgMCAxNiAxNiIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPHBhdGggZD0iTTQuOTYxNTYgMS42MDAxSDIuMjQxNTZDMS44ODgxIDEuNjAwMSAxLjYwMTU2IDEuODg2NjQgMS42MDE1NiAyLjI0MDFWNC45NjAxQzEuNjAxNTYgNS4zMTM1NiAxLjg4ODEgNS42MDAxIDIuMjQxNTYgNS42MDAxSDQuOTYxNTZDNS4zMTUwMiA1LjYwMDEgNS42MDE1NiA1LjMxMzU2IDUuNjAxNTYgNC45NjAxVjIuMjQwMUM1LjYwMTU2IDEuODg2NjQgNS4zMTUwMiAxLjYwMDEgNC45NjE1NiAxLjYwMDFaIiBmaWxsPSIjZmZmIi8%2BCjxwYXRoIGQ9Ik00Ljk2MTU2IDEwLjM5OTlIMi4yNDE1NkMxLjg4ODEgMTAuMzk5OSAxLjYwMTU2IDEwLjY4NjQgMS42MDE1NiAxMS4wMzk5VjEzLjc1OTlDMS42MDE1NiAxNC4xMTM0IDEuODg4MSAxNC4zOTk5IDIuMjQxNTYgMTQuMzk5OUg0Ljk2MTU2QzUuMzE1MDIgMTQuMzk5OSA1LjYwMTU2IDE0LjExMzQgNS42MDE1NiAxMy43NTk5VjExLjAzOTlDNS42MDE1NiAxMC42ODY0IDUuMzE1MDIgMTAuMzk5OSA0Ljk2MTU2IDEwLjM5OTlaIiBmaWxsPSIjZmZmIi8%2BCjxwYXRoIGQ9Ik0xMy43NTg0IDEuNjAwMUgxMS4wMzg0QzEwLjY4NSAxLjYwMDEgMTAuMzk4NCAxLjg4NjY0IDEwLjM5ODQgMi4yNDAxVjQuOTYwMUMxMC4zOTg0IDUuMzEzNTYgMTAuNjg1IDUuNjAwMSAxMS4wMzg0IDUuNjAwMUgxMy43NTg0QzE0LjExMTkgNS42MDAxIDE0LjM5ODQgNS4zMTM1NiAxNC4zOTg0IDQuOTYwMVYyLjI0MDFDMTQuMzk4NCAxLjg4NjY0IDE0LjExMTkgMS42MDAxIDEzLjc1ODQgMS42MDAxWiIgZmlsbD0iI2ZmZiIvPgo8cGF0aCBkPSJNNCAxMkwxMiA0TDQgMTJaIiBmaWxsPSIjZmZmIi8%2BCjxwYXRoIGQ9Ik00IDEyTDEyIDQiIHN0cm9rZT0iI2ZmZiIgc3Ryb2tlLXdpZHRoPSIxLjUiIHN0cm9rZS1saW5lY2FwPSJyb3VuZCIvPgo8L3N2Zz4K&logoColor=ffffff)](https://zread.ai/Mirrowel/LLM-API-Key-Proxy)
3
 
 
4
 
5
  ## Detailed Setup and Features
6
 
 
14
  - **Universal API Endpoint**: Simplifies development by providing a single, OpenAI-compatible interface for diverse LLM providers.
15
  - **High Availability**: The underlying library ensures your application remains operational by gracefully handling transient provider errors and API key-specific issues.
16
  - **Resilient Performance**: A global timeout on all requests prevents your application from hanging on unresponsive provider APIs.
17
+ - **Advanced Concurrency Control**: A single API key can be used for multiple concurrent requests. By default, it supports concurrent requests to *different* models. With configuration (`MAX_CONCURRENT_REQUESTS_PER_KEY_<PROVIDER>`), it can also support multiple concurrent requests to the *same* model using the same key.
18
  - **Intelligent Key Management**: Optimizes request distribution across your pool of keys by selecting the best available one for each call.
19
+ - **Automated OAuth Discovery**: Automatically discovers, validates, and manages OAuth credentials from standard provider directories (e.g., `~/.gemini/`, `~/.qwen/`, `~/.iflow/`).
20
+ - **Stateless Deployment Support**: Deploy easily to platforms like Railway, Render, or Vercel. The new export tool converts complex OAuth credentials (Gemini CLI, Qwen, iFlow) into simple environment variables, removing the need for persistent storage or file uploads.
21
+ - **Batch Request Processing**: Efficiently aggregates multiple embedding requests into single batch API calls, improving throughput and reducing rate limit hits.
22
+ - **New Provider Support**: Full support for **iFlow** (API Key & OAuth), **Qwen Code** (API Key & OAuth), and **NVIDIA NIM** with DeepSeek thinking support, including special handling for their API quirks (tool schema cleaning, reasoning support, dedicated logging).
23
  - **Duplicate Credential Detection**: Intelligently detects if multiple local credential files belong to the same user account and logs a warning, preventing redundancy in your key pool.
24
  - **Escalating Per-Model Cooldowns**: If a key fails for a specific model, it's placed on a temporary, escalating cooldown for that model, allowing it to be used with others.
25
  - **Automatic Daily Resets**: Cooldowns and usage statistics are automatically reset daily, making the system self-maintaining.
 
28
  - **OpenAI-Compatible Proxy**: Offers a familiar API interface with additional endpoints for model and provider discovery.
29
  - **Advanced Model Filtering**: Supports both blacklists and whitelists to give you fine-grained control over which models are available through the proxy.
30
 
31
+
32
  ---
33
 
34
+ ## 1. Quick Start
35
 
36
+ ### Windows (Simplest)
37
 
38
  1. **Download the latest release** from the [GitHub Releases page](https://github.com/Mirrowel/LLM-API-Key-Proxy/releases/latest).
39
  2. Unzip the downloaded file.
40
+ 3. **Run `launcher.bat`**. This all-in-one script allows you to:
41
+ - Add/Manage credentials interactively.
42
+ - Configure the server (Host, Port, Logging).
43
+ - Run the proxy server.
44
+ - Build the executable from source (if Python is installed).
45
+
46
+ ### macOS / Linux
47
+
48
+ **Option A: Using the Executable (Recommended)**
49
+ If you downloaded the pre-compiled binary for your platform, no Python installation is required.
50
+
51
+ 1. **Download the latest release** from the GitHub Releases page.
52
+ 2. Open a terminal and make the binary executable:
53
+ ```bash
54
+ chmod +x proxy_app
55
+ ```
56
+ 3. **Run the Proxy**:
57
+ ```bash
58
+ ./proxy_app --host 0.0.0.0 --port 8000
59
+ ```
60
+ 4. **Manage Credentials**:
61
+ ```bash
62
+ ./proxy_app --add-credential
63
+ ```
64
+
65
+ **Option B: Manual Setup (Source Code)**
66
+ If you are running from source, use these commands:
67
+
68
+ **1. Install Dependencies**
69
+ ```bash
70
+ # Ensure you have Python 3.10+ installed
71
+ python3 -m venv venv
72
+ source venv/bin/activate
73
+ pip install -r requirements.txt
74
+ ```
75
+
76
+ **2. Add Credentials (Interactive Tool)**
77
+ ```bash
78
+ # Equivalent to "Add Credentials"
79
+ export PYTHONPATH=$PYTHONPATH:$(pwd)/src
80
+ python src/proxy_app/main.py --add-credential
81
+ ```
82
 
83
+ **3. Run the Proxy**
84
+ ```bash
85
+ # Equivalent to "Run Proxy"
86
+ export PYTHONPATH=$PYTHONPATH:$(pwd)/src
87
+ python src/proxy_app/main.py --host 0.0.0.0 --port 8000
88
+ ```
89
+ *To enable logging, add `--enable-request-logging` to the command.*
90
 
91
  ---
92
 
 
159
 
160
  #### Interactive Credential Management Tool
161
 
162
+ The proxy includes a powerful interactive CLI tool for managing all your credentials. This is the recommended way to set up credentials:
163
 
164
  ```bash
165
  python -m rotator_library.credential_tool
166
  ```
167
 
168
+ **Main Menu Features:**
 
 
 
169
 
170
+ 1. **Add OAuth Credential** - Interactive OAuth flow for Gemini CLI, Qwen Code, and iFlow
171
+ - Automatically opens your browser for authentication
172
+ - Handles the entire OAuth flow including callbacks
173
+ - Saves credentials to the local `oauth_creds/` directory
174
+ - For Gemini CLI: Automatically discovers or creates a Google Cloud project
175
+ - For Qwen Code: Uses Device Code flow (you'll enter a code in your browser)
176
+ - For iFlow: Starts a local callback server on port 11451
177
+
178
+ 2. **Add API Key** - Add standard API keys for any LiteLLM-supported provider
179
+ - Interactive prompts guide you through the process
180
+ - Automatically saves to your `.env` file
181
+ - Supports multiple keys per provider (numbered automatically)
182
+
183
+ 3. **Export Credentials to .env** - The "Stateless Deployment" feature
184
+ - Converts file-based OAuth credentials into environment variables
185
+ - Essential for platforms without persistent file storage
186
+ - Generates a ready-to-paste `.env` block for each credential
187
+
188
+ **Stateless Deployment Workflow (Railway, Render, Vercel, etc.):**
189
+
190
+ If you're deploying to a platform without persistent file storage:
191
+
192
+ 1. **Setup credentials locally first**:
193
+ ```bash
194
+ python -m rotator_library.credential_tool
195
+ # Select "Add OAuth Credential" and complete the flow
196
+ ```
197
+
198
+ 2. **Export to environment variables**:
199
+ ```bash
200
+ python -m rotator_library.credential_tool
201
+ # Select "Export Gemini CLI to .env" (or Qwen/iFlow)
202
+ # Choose your credential file
203
+ ```
204
+
205
+ 3. **Copy the generated output**:
206
+ - The tool creates a file like `gemini_cli_credential_1.env`
207
+ - Contains all necessary `GEMINI_CLI_*` variables
208
+
209
+ 4. **Paste into your hosting platform**:
210
+ - Add each variable to your platform's environment settings
211
+ - Set `SKIP_OAUTH_INIT_CHECK=true` to skip interactive validation
212
+ - No credential files needed; everything loads from environment variables
213
+
214
+ **Local-First OAuth Management:**
215
+
216
+ The proxy uses a "local-first" approach for OAuth credentials:
217
+
218
+ - **Local Storage**: All OAuth credentials are stored in `oauth_creds/` directory
219
+ - **Automatic Discovery**: On first run, the proxy scans system paths (`~/.gemini/`, `~/.qwen/`, `~/.iflow/`) and imports found credentials
220
+ - **Deduplication**: Intelligently detects duplicate accounts (by email/user ID) and warns you
221
+ - **Priority**: Local files take priority over system-wide credentials
222
+ - **No System Pollution**: Your project's credentials are isolated from global system credentials
223
 
224
  **Example `.env` configuration:**
225
  ```env
 
352
 
353
  ## 4. Advanced Topics
354
 
355
+ ### Batch Request Processing
356
+
357
+ The proxy includes a `Batch Manager` that optimizes high-volume embedding requests.
358
+ - **Automatic Aggregation**: Multiple individual embedding requests are automatically collected into a single batch API call.
359
+ - **Configurable**: Works out of the box, but can be tuned for specific needs.
360
+ - **Benefits**: Significantly reduces the number of HTTP requests to providers, helping you stay within rate limits while improving throughput.
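As a rough sketch of the aggregation idea (greatly simplified, with illustrative names only), individual requests collect into a pending list and all resolve together once the batch fills:

```python
import asyncio

class BatchAggregator:
    """Sketch: collect individual embedding inputs, flush them as one call."""

    def __init__(self, backend, max_batch=16):
        self._backend = backend      # async fn: list of texts -> list of vectors
        self._max_batch = max_batch
        self._pending = []           # list of (text, future) pairs

    async def embed(self, text):
        fut = asyncio.get_running_loop().create_future()
        self._pending.append((text, fut))
        if len(self._pending) >= self._max_batch:
            await self.flush()
        return await fut

    async def flush(self):
        batch, self._pending = self._pending, []
        if not batch:
            return
        vectors = await self._backend([text for text, _ in batch])
        for (_, fut), vector in zip(batch, vectors):
            fut.set_result(vector)

async def fake_backend(texts):
    # Stand-in for the real batched embeddings call.
    return [len(t) for t in texts]

async def demo():
    agg = BatchAggregator(fake_backend, max_batch=3)
    return await asyncio.gather(agg.embed("a"), agg.embed("bb"), agg.embed("ccc"))

results = asyncio.run(demo())
```

A production-grade version also flushes on a short timer so that partially filled batches never wait indefinitely.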
361
+
362
  ### How It Works
363
 
364
+ The proxy is built on a robust architecture:
365
 
366
+ 1. **Intelligent Routing**: The `UsageManager` selects the best available key from your pool. It prioritizes idle keys first, then keys that can handle concurrency, ensuring optimal load balancing.
367
+ 2. **Resilience & Deadlines**: Every request has a strict deadline (`global_timeout`). If a provider is slow or fails, the proxy retries with a different key immediately, ensuring your application never hangs.
368
+ 3. **Batching**: High-volume embedding requests are automatically aggregated into optimized batches, reducing API calls and staying within rate limits.
369
+ 4. **Deep Observability**: (Optional) Detailed logs capture every byte of the transaction, including raw streaming chunks, for precise debugging of complex agentic interactions.
 
 
 
370
 
371
  ### Command-Line Arguments and Scripts
372
 
 
376
  - `--port`: The port to run the server on. Defaults to `8000`.
377
  - `--enable-request-logging`: A flag to enable detailed, per-request logging. When active, the proxy creates a unique directory for each transaction in the `logs/detailed_logs/` folder, containing the full request, response, streaming chunks, and performance metadata. This is highly recommended for debugging.
378
 
379
+ ### New Provider Highlights
380
+
381
+ #### **Gemini CLI (Advanced)**
382
+ A powerful provider that mimics the Google Cloud Code extension.
383
+ - **Zero-Config Project Discovery**: Automatically finds your Google Cloud Project ID or onboards you to a free-tier project if none exists.
384
+ - **Internal API Access**: Uses high-limit internal endpoints (`cloudcode-pa.googleapis.com`) rather than the public Vertex AI API.
385
+ - **Smart Rate Limiting**: Automatically falls back to preview models (e.g., `gemini-2.5-pro-preview`) if the main model hits a rate limit.
386
+
387
+ #### **Qwen Code**
388
+ - **Dual Authentication**: Use either standard API keys or OAuth 2.0 Device Flow credentials.
389
+ - **Schema Cleaning**: Automatically removes `strict` and `additionalProperties` from tool schemas to prevent API errors.
390
+ - **Stream Stability**: Injects a dummy `do_not_call_me` tool to prevent stream corruption issues when no tools are provided.
391
+ - **Reasoning Support**: Parses `<think>` tags in responses and exposes them as `reasoning_content` (similar to OpenAI's o1 format).
392
+ - **Dedicated Logging**: Optional per-request file logging to `logs/qwen_code_logs/` for debugging.
393
+ - **Custom Models**: Define additional models via `QWEN_CODE_MODELS` environment variable (JSON array format).
394
+
395
+ #### **iFlow**
396
+ - **Dual Authentication**: Use either standard API keys or OAuth 2.0 Authorization Code Flow.
397
+ - **Hybrid Auth**: OAuth flow provides an access token, but actual API calls use a separate `apiKey` retrieved from user profile.
398
+ - **Local Callback Server**: OAuth flow runs a temporary server on port 11451 to capture the redirect.
399
+ - **Schema Cleaning**: Same as Qwen Code - removes unsupported properties from tool schemas.
400
+ - **Stream Stability**: Injects placeholder tools to stabilize streaming for empty tool lists.
401
+ - **Dedicated Logging**: Optional per-request file logging to `logs/iflow_logs/` for debugging proprietary API behaviors.
402
+ - **Custom Models**: Define additional models via `IFLOW_MODELS` environment variable (JSON array format).
403
+
404
+
405
+ ### Advanced Configuration
+ 
+ The following advanced settings can be added to your `.env` file:
+ 
+ #### OAuth and Refresh Settings
+ 
+ - **`OAUTH_REFRESH_INTERVAL`**: Controls how often (in seconds) the background refresher checks for expired OAuth tokens. Default is `3600` (1 hour).
+   ```env
+   OAUTH_REFRESH_INTERVAL=1800 # Check every 30 minutes
+   ```
+ 
+ - **`SKIP_OAUTH_INIT_CHECK`**: Set to `true` to skip the interactive OAuth setup/validation check on startup. Essential for non-interactive environments like Docker containers or CI/CD pipelines.
+   ```env
+   SKIP_OAUTH_INIT_CHECK=true
+   ```
+ 
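Conceptually, the background refresher is a periodic task that honors this interval until shutdown. A minimal sketch of that loop (names and structure are hypothetical, not the proxy's actual code):

```python
import asyncio

async def oauth_refresh_loop(refresh_fn, interval_s: float, stop: asyncio.Event):
    """Call refresh_fn, then sleep up to interval_s, until stop is set."""
    while not stop.is_set():
        await refresh_fn()
        try:
            await asyncio.wait_for(stop.wait(), timeout=interval_s)
        except asyncio.TimeoutError:
            pass  # interval elapsed; run another refresh pass

async def demo():
    calls = []
    stop = asyncio.Event()

    async def refresh():
        calls.append(1)
        if len(calls) >= 3:
            stop.set()  # shut the loop down after three passes

    await oauth_refresh_loop(refresh, 0.01, stop)
    return len(calls)

count = asyncio.run(demo())
```

Waiting on the stop event (rather than a plain `sleep`) lets the loop exit promptly on shutdown instead of blocking for up to a full interval.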
+ #### Concurrency Control
+ 
+ - **`MAX_CONCURRENT_REQUESTS_PER_KEY_<PROVIDER>`**: Sets the maximum number of simultaneous requests allowed per API key for a specific provider. Default is `1` (no concurrency). Useful for high-throughput providers.
+   ```env
+   MAX_CONCURRENT_REQUESTS_PER_KEY_OPENAI=3
+   MAX_CONCURRENT_REQUESTS_PER_KEY_ANTHROPIC=2
+   MAX_CONCURRENT_REQUESTS_PER_KEY_GEMINI=1
+   ```
+ 
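Conceptually, this setting sizes a semaphore for each (provider, key) pair. A minimal sketch of the idea (class and method names are hypothetical, not the proxy's internals):

```python
import asyncio

class KeyConcurrencyLimiter:
    """Caps simultaneous requests per (provider, key) pair with one semaphore each."""

    def __init__(self, limits: dict, default: int = 1):
        self._limits = limits          # e.g. {"openai": 3}
        self._default = default
        self._semaphores = {}

    def _sem(self, provider: str, key: str) -> asyncio.Semaphore:
        pair = (provider, key)
        if pair not in self._semaphores:
            self._semaphores[pair] = asyncio.Semaphore(self._limits.get(provider, self._default))
        return self._semaphores[pair]

    async def run(self, provider: str, key: str, coro_fn):
        async with self._sem(provider, key):
            return await coro_fn()

async def main():
    limiter = KeyConcurrencyLimiter({"openai": 3})

    async def fake_call():
        await asyncio.sleep(0)  # stand-in for a real provider request
        return "ok"

    # Five requests on one key; at most three run at any instant.
    return await asyncio.gather(*(limiter.run("openai", "sk-1", fake_call) for _ in range(5)))

results = asyncio.run(main())
```

With the default of 1, the semaphore degenerates to a mutex, which is why keys serialize requests unless you raise the limit.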
+ #### Custom Model Lists
+ 
+ For providers that support custom model definitions (Qwen Code, iFlow), you can override the default model list:
+ 
+ - **`QWEN_CODE_MODELS`**: JSON array of custom Qwen Code models. These models take priority over hardcoded defaults.
+   ```env
+   QWEN_CODE_MODELS='["qwen3-coder-plus", "qwen3-coder-flash", "custom-model-id"]'
+   ```
+ 
+ - **`IFLOW_MODELS`**: JSON array of custom iFlow models. These models take priority over hardcoded defaults.
+   ```env
+   IFLOW_MODELS='["glm-4.6", "qwen3-coder-plus", "deepseek-v3.2"]'
+   ```
+ 
+ #### Provider-Specific Settings
+ 
+ - **`GEMINI_CLI_PROJECT_ID`**: Manually specify a Google Cloud Project ID for Gemini CLI OAuth. Only needed if automatic discovery fails.
+   ```env
+   GEMINI_CLI_PROJECT_ID="your-gcp-project-id"
+   ```
+ 
  **Example:**
  ```bash
  python src/proxy_app/main.py --host 127.0.0.1 --port 9999 --enable-request-logging
  ```
  
+ 
  #### Windows Batch Scripts
  
  For convenience on Windows, you can use the provided `.bat` scripts in the root directory to run the proxy with common configurations:
src/rotator_library/README.md CHANGED
@@ -5,16 +5,21 @@ A robust, asynchronous, and thread-safe Python library for managing a pool of AP
  ## Key Features
  
  - **Asynchronous by Design**: Built with `asyncio` and `httpx` for high-performance, non-blocking I/O.
- - **Advanced Concurrency Control**: A single API key can be used for multiple concurrent requests to *different* models, maximizing throughput while ensuring thread safety. Requests for the *same model* using the same key are queued, preventing conflicts.
  - **Smart Key Management**: Selects the optimal key for each request using a tiered, model-aware locking strategy to distribute load evenly and maximize availability.
- - **Deadline-Driven Requests**: A global timeout ensures that no request, including all retries and key selections, exceeds a specified time limit, preventing indefinite hangs.
  - **Intelligent Error Handling**:
-   - **Escalating Per-Model Cooldowns**: If a key fails, it's placed on a temporary, escalating cooldown for that specific model, allowing it to continue being used for others.
-   - **Deadline-Aware Retries**: Retries requests on transient server errors with exponential backoff, but only if the wait time fits within the global request budget.
-   - **Key-Level Lockouts**: If a key fails across multiple models, it's temporarily taken out of rotation entirely.
- - **Robust Streaming Support**: The client includes a wrapper for streaming responses that can reassemble fragmented JSON chunks and intelligently detect and handle errors that occur mid-stream.
- - **Detailed Usage Tracking**: Tracks daily and global usage for each key, including token counts and approximate cost, persisted to a JSON file.
- - **Automatic Daily Resets**: Automatically resets cooldowns and archives stats daily to keep the system running smoothly.
  - **Provider Agnostic**: Works with any provider supported by `litellm`.
  - **Extensible**: Easily add support for new providers through a simple plugin-based architecture.
  

@@ -35,7 +40,7 @@ This is the main class for interacting with the library. It is designed to be a
  ```python
  import os
  from dotenv import load_dotenv
- from rotating_api_key_client import RotatingClient
  
  # Load environment variables from .env file
  load_dotenv()

@@ -51,27 +56,43 @@ for key, value in os.environ.items():
          api_keys[provider] = []
      api_keys[provider].append(value)
  
- if not api_keys:
-     raise ValueError("No provider API keys found in environment variables.")
  
  client = RotatingClient(
      api_keys=api_keys,
      max_retries=2,
      usage_file_path="key_usage.json",
-     global_timeout=30  # Default is 30 seconds
  )
  ```
  
- - `api_keys`: A dictionary where keys are provider names (e.g., `"openai"`, `"gemini"`) and values are lists of API keys for that provider.
- - `max_retries`: The number of times to retry a request with the *same key* if a transient server error occurs.
- - `usage_file_path`: The path to the JSON file where key usage data will be stored.
- - `global_timeout`: A hard time limit (in seconds) for the entire request lifecycle. If the total time exceeds this, the request will fail.
- - `ignore_models`: A dictionary where keys are provider names and values are lists of model names/patterns to exclude (blacklist).
- - `whitelist_models`: A dictionary where keys are provider names and values are lists of model names/patterns to always include, overriding any blacklists.
  
  ### Concurrency and Resource Management
  
- The `RotatingClient` is asynchronous and manages an `httpx.AsyncClient` internally. It's crucial to close the client properly to release resources. The recommended way is to use an `async with` block, which handles setup and teardown automatically.
  
  ```python
  import asyncio

@@ -131,14 +152,56 @@ Fetches a list of available models for a specific provider, applying any configu
  
  Fetches a dictionary of all available models, grouped by provider, or as a single flat list if `grouped=False`.
  
  ## Error Handling and Cooldowns
  
  The client uses a sophisticated error handling mechanism:
  
- - **Error Classification**: All exceptions from `litellm` are passed through a `classify_error` function to determine their type (`rate_limit`, `authentication`, `server_error`, etc.).
  - **Server Errors**: The client will retry the request with the *same key* up to `max_retries` times, using an exponential backoff strategy.
  - **Key-Specific Errors (Authentication, Quota, etc.)**: The client records the failure in the `UsageManager`, which applies an escalating cooldown to the key for that specific model. The client then immediately acquires a new key and continues its attempt to complete the request.
- - **Key-Level Lockouts**: If a key fails on multiple different models, the `UsageManager` can apply a key-level lockout, taking it out of rotation entirely for a short period.
  
  ### Global Timeout and Deadline-Driven Logic
  

@@ -146,7 +209,7 @@ To ensure predictable performance, the client now operates on a strict time budg
  
  - **Deadline Enforcement**: When a request starts, a `deadline` is set. The entire process, including all key rotations and retries, must complete before this deadline.
  - **Deadline-Aware Retries**: If a retry requires a wait time that would exceed the remaining budget, the wait is skipped, and the client immediately rotates to the next key.
- - **Silent Internal Errors**: Intermittent failures like provider capacity limits or temporary server errors are logged internally but are **not raised** to the caller. The client will simply rotate to the next key. A non-streaming request will only return `None` (or a streaming request will end) if the global timeout is exceeded or all keys have been exhausted. This creates a more stable experience for the end-user, as they are shielded from transient backend issues.
  
  ## Extending with Provider Plugins
  

@@ -162,13 +225,9 @@ from typing import List
  import httpx
  
  class MyProvider(ProviderInterface):
-     async def get_models(self, api_key: str, client: httpx.AsyncClient) -> List[str]:
          # Logic to fetch and return a list of model names
-         # The model names should be prefixed with the provider name.
-         # e.g., ["my-provider/model-1", "my-provider/model-2"]
-         # Example:
-         # response = await client.get("https://api.myprovider.com/models", headers={"Auth": api_key})
-         # return [f"my-provider/{model['id']}" for model in response.json()]
          pass
  ```

@@ -177,3 +236,4 @@ The system will automatically discover and register your new provider.
  ## Detailed Documentation
  
  For a more in-depth technical explanation of the library's architecture, including the `UsageManager`'s concurrency model and the error classification system, please refer to the [Technical Documentation](../../DOCUMENTATION.md).

  ## Key Features
  
  - **Asynchronous by Design**: Built with `asyncio` and `httpx` for high-performance, non-blocking I/O.
+ - **Advanced Concurrency Control**: A single API key can be used for multiple concurrent requests. By default, it supports concurrent requests to *different* models. With configuration (`MAX_CONCURRENT_REQUESTS_PER_KEY_<PROVIDER>`), it can also support multiple concurrent requests to the *same* model using the same key.
  - **Smart Key Management**: Selects the optimal key for each request using a tiered, model-aware locking strategy to distribute load evenly and maximize availability.
+ - **Deadline-Driven Requests**: A global timeout ensures that no request, including all retries and key selections, exceeds a specified time limit.
+ - **OAuth & API Key Support**: Built-in support for standard API keys and complex OAuth flows.
+   - **Gemini CLI**: Full OAuth 2.0 web flow with automatic project discovery and free-tier onboarding.
+   - **Qwen Code**: Device Code flow support.
+   - **iFlow**: Authorization Code flow with local callback handling.
+ - **Stateless Deployment Ready**: Can load complex OAuth credentials from environment variables, eliminating the need for physical credential files in containerized environments.
  - **Intelligent Error Handling**:
+   - **Escalating Per-Model Cooldowns**: Failed keys are placed on a temporary, escalating cooldown for specific models.
+   - **Key-Level Lockouts**: Keys failing across multiple models are temporarily removed from rotation.
+   - **Stream Recovery**: The client detects mid-stream errors (like quota limits) and gracefully handles them.
+ - **Robust Streaming Support**: Includes a wrapper for streaming responses that reassembles fragmented JSON chunks.
+ - **Detailed Usage Tracking**: Tracks daily and global usage for each key, persisted to a JSON file.
+ - **Automatic Daily Resets**: Automatically resets cooldowns and archives stats daily.
  - **Provider Agnostic**: Works with any provider supported by `litellm`.
  - **Extensible**: Easily add support for new providers through a simple plugin-based architecture.
  
  ```python
  import os
  from dotenv import load_dotenv
+ from rotator_library import RotatingClient
  
  # Load environment variables from .env file
  load_dotenv()
      api_keys[provider] = []
  api_keys[provider].append(value)
  
+ # Initialize empty dictionary for OAuth credentials (or load from CredentialManager)
+ oauth_credentials = {}
  
  client = RotatingClient(
      api_keys=api_keys,
+     oauth_credentials=oauth_credentials,
      max_retries=2,
      usage_file_path="key_usage.json",
+     configure_logging=True,
+     global_timeout=30,
+     abort_on_callback_error=True,
+     litellm_provider_params={},
+     ignore_models={},
+     whitelist_models={},
+     enable_request_logging=False,
+     max_concurrent_requests_per_key={}
  )
  ```
  
+ #### Arguments
+ 
+ - `api_keys` (`Optional[Dict[str, List[str]]]`): A dictionary mapping provider names (e.g., "openai", "anthropic") to a list of API keys.
+ - `oauth_credentials` (`Optional[Dict[str, List[str]]]`): A dictionary mapping provider names (e.g., "gemini_cli", "qwen_code") to a list of file paths to OAuth credential JSON files.
+ - `max_retries` (`int`, default: `2`): The number of times to retry a request with the *same key* if a transient server error (e.g., 500, 503) occurs.
+ - `usage_file_path` (`str`, default: `"key_usage.json"`): The path to the JSON file where usage statistics (tokens, cost, success counts) are persisted.
+ - `configure_logging` (`bool`, default: `True`): If `True`, configures the library's logger to propagate logs to the root logger. Set to `False` if you want to handle logging configuration manually.
+ - `global_timeout` (`int`, default: `30`): A hard time limit (in seconds) for the entire request lifecycle. If the request (including all retries) takes longer than this, it is aborted.
+ - `abort_on_callback_error` (`bool`, default: `True`): If `True`, any exception raised by `pre_request_callback` will abort the request. If `False`, the error is logged and the request proceeds.
+ - `litellm_provider_params` (`Optional[Dict[str, Any]]`, default: `None`): A dictionary of extra parameters to pass to `litellm` for specific providers.
+ - `ignore_models` (`Optional[Dict[str, List[str]]]`, default: `None`): A dictionary where keys are provider names and values are lists of model names/patterns to exclude (blacklist). Supports wildcards (e.g., `"*-preview"`).
+ - `whitelist_models` (`Optional[Dict[str, List[str]]]`, default: `None`): A dictionary where keys are provider names and values are lists of model names/patterns to always include, overriding `ignore_models`.
+ - `enable_request_logging` (`bool`, default: `False`): If `True`, enables detailed per-request file logging (useful for debugging complex interactions).
+ - `max_concurrent_requests_per_key` (`Optional[Dict[str, int]]`, default: `None`): A dictionary defining the maximum number of concurrent requests allowed for a single API key for a specific provider. Defaults to 1 if not specified.
  
  ### Concurrency and Resource Management
94
 
95
+ The `RotatingClient` is asynchronous and manages an `httpx.AsyncClient` internally. It's crucial to close the client properly to release resources. The recommended way is to use an `async with` block.
96
 
97
  ```python
98
  import asyncio
 
152
 
153
  Fetches a dictionary of all available models, grouped by provider, or as a single flat list if `grouped=False`.
154
 
155
+ ## Credential Tool
156
+
157
+ The library includes a utility to manage credentials easily:
158
+
159
+ ```bash
160
+ python -m src.rotator_library.credential_tool
161
+ ```
162
+
163
+ Use this tool to:
164
+ 1. **Initialize OAuth**: Run the interactive login flows for Gemini, Qwen, and iFlow.
165
+ 2. **Export Credentials**: Generate `.env` compatible configuration blocks from your saved OAuth JSON files. This is essential for setting up stateless deployments.
166
+ 
+ ## Provider Specifics
+ 
+ ### Qwen Code
+ - **Auth**: Uses OAuth 2.0 Device Flow. Requires manual entry of email/identifier if not returned by the provider.
+ - **Resilience**: Injects a dummy tool (`do_not_call_me`) into requests with no tools to prevent known stream corruption issues on the API.
+ - **Reasoning**: Parses `<think>` tags in the response and exposes them as `reasoning_content`.
+ - **Schema Cleaning**: Recursively removes `strict` and `additionalProperties` from all tool schemas. Qwen's API has stricter validation than OpenAI's, and these properties cause `400 Bad Request` errors.
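The recursive clean-up described above amounts to dropping the two offending keys at every nesting level of the schema. A minimal sketch (not the library's actual code):

```python
def clean_tool_schema(node):
    """Recursively drop `strict` and `additionalProperties` from a tool schema."""
    if isinstance(node, dict):
        return {
            key: clean_tool_schema(value)
            for key, value in node.items()
            if key not in ("strict", "additionalProperties")
        }
    if isinstance(node, list):
        return [clean_tool_schema(item) for item in node]
    return node

tool = {
    "type": "function",
    "function": {
        "name": "lookup",
        "strict": True,
        "parameters": {"type": "object", "additionalProperties": False, "properties": {}},
    },
}
cleaned = clean_tool_schema(tool)
```

The traversal must recurse through nested `properties` and `items` as well, since OpenAI-style schemas can set `additionalProperties` at any depth.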
+ 
+ ### iFlow
+ - **Auth**: Uses Authorization Code Flow with a local callback server (port 11451).
+ - **Key Separation**: Distinguishes between the OAuth `access_token` (used to fetch user info) and the `api_key` (used for actual chat requests).
+ - **Resilience**: Similar to Qwen, injects a placeholder tool to stabilize streaming for empty tool lists.
+ - **Schema Cleaning**: Recursively removes `strict` and `additionalProperties` from all tool schemas to prevent API validation errors.
+ - **Custom Models**: Supports model definitions via the `IFLOW_MODELS` environment variable (JSON array of model IDs or objects).
+ 
+ ### NVIDIA NIM
+ - **Discovery**: Dynamically fetches available models from the NVIDIA API.
+ - **Thinking**: Automatically injects the `thinking` parameter into `extra_body` for DeepSeek models (`deepseek-v3.1`, etc.) when `reasoning_effort` is set to low/medium/high.
+ 
+ ### Google Gemini (CLI)
+ - **Auth**: Simulates the Google Cloud CLI authentication flow.
+ - **Project Discovery**: Automatically discovers the default Google Cloud Project ID.
+ - **Rate Limits**: Implements smart fallback strategies (e.g., switching from `gemini-1.5-pro` to `gemini-1.5-pro-002`) when rate limits are hit.
+ 
  ## Error Handling and Cooldowns
  
  The client uses a sophisticated error handling mechanism:
  
+ - **Error Classification**: All exceptions from `litellm` are passed through a `classify_error` function to determine their type (`rate_limit`, `authentication`, `server_error`, `quota`, `context_length`, etc.).
  - **Server Errors**: The client will retry the request with the *same key* up to `max_retries` times, using an exponential backoff strategy.
  - **Key-Specific Errors (Authentication, Quota, etc.)**: The client records the failure in the `UsageManager`, which applies an escalating cooldown to the key for that specific model. The client then immediately acquires a new key and continues its attempt to complete the request.
+ - **Escalating Cooldown Strategy**: Consecutive failures for a key on the same model result in increasing cooldown periods:
+   - 1st failure: 10 seconds
+   - 2nd failure: 30 seconds
+   - 3rd failure: 60 seconds
+   - 4th+ failure: 120 seconds
+ - **Key-Level Lockouts**: If a key fails on multiple different models (3+ distinct models), the `UsageManager` applies a global 5-minute lockout for that key, removing it from rotation entirely.
+ - **Authentication Errors**: Immediate 5-minute global lockout (the key is assumed revoked or invalid).
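The escalation schedule above reduces to a simple lookup; a sketch of the idea (the real logic lives in the `UsageManager`):

```python
def cooldown_seconds(consecutive_failures: int) -> int:
    """Escalating per-model cooldown schedule: 10s, 30s, 60s, then 120s."""
    schedule = {1: 10, 2: 30, 3: 60}
    return schedule.get(consecutive_failures, 120)

durations = [cooldown_seconds(n) for n in range(1, 6)]  # failures 1 through 5
```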
  
  ### Global Timeout and Deadline-Driven Logic
  
  - **Deadline Enforcement**: When a request starts, a `deadline` is set. The entire process, including all key rotations and retries, must complete before this deadline.
  - **Deadline-Aware Retries**: If a retry requires a wait time that would exceed the remaining budget, the wait is skipped, and the client immediately rotates to the next key.
+ - **Silent Internal Errors**: Intermittent failures like provider capacity limits or temporary server errors are logged internally but are **not raised** to the caller. The client will simply rotate to the next key.
  
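The deadline-aware retry decision can be pictured as a single check against the remaining budget. A sketch under the assumption of exponential backoff (function name and backoff base are illustrative, not the library's actual code):

```python
import time
from typing import Optional

def backoff_within_deadline(attempt: int, deadline: float, base: float = 1.0) -> Optional[float]:
    """Return the backoff delay for this attempt, or None when waiting would miss the deadline."""
    delay = base * (2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
    remaining = deadline - time.monotonic()
    return delay if delay <= remaining else None

deadline = time.monotonic() + 3.0
first = backoff_within_deadline(0, deadline)  # 1.0s fits within the ~3s budget
third = backoff_within_deadline(2, deadline)  # 4.0s would exceed it
```

When the helper returns `None`, the client skips the wait and rotates to the next key instead of sleeping past the deadline.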
  ## Extending with Provider Plugins
  
  import httpx
  
  class MyProvider(ProviderInterface):
+     async def get_models(self, credential: str, client: httpx.AsyncClient) -> List[str]:
          # Logic to fetch and return a list of model names
+         # The credential argument allows using the key to fetch models
          pass
  ```
  
  ## Detailed Documentation
  
  For a more in-depth technical explanation of the library's architecture, including the `UsageManager`'s concurrency model and the error classification system, please refer to the [Technical Documentation](../../DOCUMENTATION.md).
+ 