Mirrowel committed on
Commit
37e5eea
·
1 Parent(s): 939a72b

docs(readme): 📚 documentation expansion to cover new PR features(and in general)


Refine and expand the project documentation to help users onboard faster and operate the proxy in a variety of environments.

Files changed (5)
  1. .env.example +5 -1
  2. DOCUMENTATION.md +306 -127
  3. Deployment guide.md +11 -0
  4. README.md +197 -37
  5. src/rotator_library/README.md +88 -28
.env.example CHANGED
@@ -94,6 +94,10 @@ GEMINI_CLI_OAUTH_1=""
94
  # Path to your Qwen credential file (e.g., ~/.qwen/oauth_creds.json).
95
  QWEN_CODE_OAUTH_1=""
96
 
97
 
98
  # ------------------------------------------------------------------------------
99
  # | [ADVANCED] Provider-Specific Settings |
@@ -153,7 +157,7 @@ WHITELIST_MODELS_OPENAI=""
153
  MAX_CONCURRENT_REQUESTS_PER_KEY_OPENAI=1
154
  MAX_CONCURRENT_REQUESTS_PER_KEY_GEMINI=1
155
  MAX_CONCURRENT_REQUESTS_PER_KEY_ANTHROPIC=1
156
-
157
 
158
  # ------------------------------------------------------------------------------
159
  # | [ADVANCED] Proxy Configuration |
 
94
  # Path to your Qwen credential file (e.g., ~/.qwen/oauth_creds.json).
95
  QWEN_CODE_OAUTH_1=""
96
 
97
+ # --- iFlow ---
98
+ # Path to your iFlow credential file (e.g., ~/.iflow/oauth_creds.json).
99
+ IFLOW_OAUTH_1=""
100
+
101
 
102
  # ------------------------------------------------------------------------------
103
  # | [ADVANCED] Provider-Specific Settings |
 
157
  MAX_CONCURRENT_REQUESTS_PER_KEY_OPENAI=1
158
  MAX_CONCURRENT_REQUESTS_PER_KEY_GEMINI=1
159
  MAX_CONCURRENT_REQUESTS_PER_KEY_ANTHROPIC=1
160
+ MAX_CONCURRENT_REQUESTS_PER_KEY_IFLOW=1
161
 
162
  # ------------------------------------------------------------------------------
163
  # | [ADVANCED] Proxy Configuration |
DOCUMENTATION.md CHANGED
@@ -1,12 +1,15 @@
1
  # Technical Documentation: Universal LLM API Proxy & Resilience Library
2
 
3
- This document provides a detailed technical explanation of the project's two main components: the Universal LLM API Proxy and the Resilience Library that powers it.
4
 
5
  ## 1. Architecture Overview
6
 
7
  The project is a monorepo containing two primary components:
8
 
9
- 1. **The Proxy Application (`proxy_app`)**: This is the user-facing component. It's a FastAPI application that uses `litellm` to create a universal, OpenAI-compatible API. Its primary role is to abstract away the complexity of dealing with multiple LLM providers, offering a single point of entry for applications like agentic coders.
10
  2. **The Resilience Library (`rotator_library`)**: This is the core engine that provides high availability. It is consumed by the proxy app to manage a pool of API keys, handle errors gracefully, and ensure requests are completed successfully even when individual keys or provider endpoints face issues.
11
 
12
  This architecture cleanly separates the API interface from the resilience logic, making the library a portable and powerful tool for any application needing robust API key management.
@@ -28,180 +31,356 @@ The client is initialized with your provider API keys, retry settings, and a new
28
  ```python
29
  client = RotatingClient(
30
  api_keys=api_keys,
 
31
  max_retries=2,
32
- global_timeout=30 # in seconds
33
  )
34
  ```
35
 
36
- - `global_timeout`: A crucial new parameter that sets a hard time limit for the entire request lifecycle, from the moment `acompletion` is called until a response is returned or the timeout is exceeded.
37
 
38
  #### Core Responsibilities
39
 
40
- * Managing a shared `httpx.AsyncClient` for all non-blocking HTTP requests.
41
- * Interfacing with the `UsageManager` to acquire and release API keys.
42
- * Dynamically loading and using provider-specific plugins from the `providers/` directory.
43
- * Executing API calls via `litellm` with a robust, **deadline-driven** retry and key selection strategy.
44
- * Providing a safe, stateful wrapper for handling streaming responses.
45
- * Filtering available models using configurable whitelists and blacklists.
 
46
 
47
  #### Model Filtering Logic
48
 
49
- The `RotatingClient` provides fine-grained control over which models are exposed via the `/v1/models` endpoint. This is handled by the `get_available_models` method, which is called by `get_all_available_models`.
50
 
51
- The logic is as follows:
52
- 1. The client is initialized with `ignore_models` (blacklist) and `whitelist_models` dictionaries.
53
- 2. When `get_available_models` is called for a provider, it first fetches all models from the provider's API.
54
- 3. It then iterates through this list of actual models and applies the following rules:
55
- - **Whitelist Check**: It first checks if the model matches any pattern in the provider's whitelist. If it does, the model is **immediately included** in the final list, and the blacklist is ignored for this model.
56
- - **Blacklist Check**: If the model is *not* on the whitelist, it is then checked against the blacklist. If it matches a pattern, it is excluded.
57
- - **Default**: If a model is on neither list, it is included.
58
- 4. This ensures that the whitelist always acts as a definitive override to the blacklist.
59
 
60
  #### Request Lifecycle: A Deadline-Driven Approach
61
 
62
- The request lifecycle has been redesigned around a single, authoritative time budget to ensure predictable performance and prevent requests from hanging indefinitely.
63
 
64
  1. **Deadline Establishment**: The moment `acompletion` or `aembedding` is called, a `deadline` is calculated: `time.time() + self.global_timeout`. This `deadline` is the absolute point in time by which the entire operation must complete.
65
 
66
- 2. **Deadline-Aware Key Selection Loop**: The main `while` loop now has a critical secondary condition: `while len(tried_keys) < len(keys_for_provider) and time.time() < deadline:`. The loop will exit immediately if the `deadline` is reached, regardless of how many keys are left to try.
67
 
68
- 3. **Deadline-Aware Key Acquisition**: The `self.usage_manager.acquire_key()` method now accepts the `deadline`. The `UsageManager` will not wait indefinitely for a key; if it cannot acquire one before the `deadline` is met, it will raise a `NoAvailableKeysError`, causing the request to fail fast with a "busy" error.
69
-
70
- 4. **Deadline-Aware Retries**: When a transient error occurs, the client calculates the necessary `wait_time` for an exponential backoff. It then checks if this wait time fits within the remaining budget (`deadline - time.time()`).
71
- - **If it fits**: It waits (`asyncio.sleep`) and retries with the same key.
72
- - **If it exceeds the budget**: It skips the wait entirely, logs a warning, and immediately rotates to the next key to avoid wasting time.
73
-
74
- 5. **Refined Error Propagation**:
75
- - **Fatal Errors**: Invalid requests or authentication errors are raised immediately to the client.
76
- - **Intermittent Errors**: Temporary issues like server errors and provider-side capacity limits are now handled internally. The error is logged, the key is rotated, but the exception is **not** propagated to the end client. This prevents the client from seeing disruptive, intermittent failures.
77
- - **Final Failure**: A non-streaming request will only return `None` (indicating failure) if either a) the global `deadline` is exceeded, or b) all keys for the provider have been tried and have failed. A streaming request will yield a final `[DONE]` with an error message in the same scenarios.
78
 
79
  ### 2.2. `usage_manager.py` - Stateful Concurrency & Usage Management
80
 
81
- This class is the stateful core of the library, managing concurrency, usage, and cooldowns.
82
 
83
  #### Key Concepts
84
 
85
- * **Async-Native & Lazy-Loaded**: The class is fully asynchronous, using `aiofiles` for non-blocking file I/O. The usage data from the JSON file is loaded only when the first request is made (`_lazy_init`).
86
- * **Fine-Grained Locking**: Each API key is associated with its own `asyncio.Lock` and `asyncio.Condition` object. This allows for a highly granular and efficient locking strategy.
87
-
88
- #### Tiered Key Acquisition (`acquire_key`)
89
-
90
- This method implements the intelligent logic for selecting the best key for a job, now with deadline awareness.
91
-
92
- 1. **Deadline Enforcement**: The entire acquisition process runs in a `while time.time() < deadline:` loop. If a key cannot be found before the deadline, the method raises `NoAvailableKeysError`.
93
- 2. **Filtering**: It first filters out any keys that are on a global or model-specific cooldown.
94
- 3. **Tiering**: It categorizes the remaining, valid keys into two tiers:
95
- - **Tier 1 (Ideal)**: Keys that are completely free (not being used by any model).
96
- - **Tier 2 (Acceptable)**: Keys that are currently in use, but for *different models* than the one being requested. This allows a single key to be used for concurrent calls to, for example, `gemini-1.5-pro` and `gemini-1.5-flash`.
97
- 4. **Selection**: It attempts to acquire a lock on a key, prioritizing Tier 1 over Tier 2. Within each tier, it prioritizes the key with the lowest usage count.
98
- 5. **Waiting**: If no keys in Tier 1 or Tier 2 can be locked, it means all eligible keys are currently handling requests for the *same model*. The method then `await`s on the `asyncio.Condition` of the best available key. Crucially, this wait is itself timed out by the remaining request budget, preventing indefinite waits.
99
-
100
- #### Failure Handling & Cooldowns (`record_failure`)
101
-
102
- * **Escalating Backoff**: When a failure is recorded, it applies a cooldown that increases with the number of consecutive failures for that specific key-model pair (e.g., 10s, 30s, 60s, up to 2 hours).
103
- * **Authentication Errors**: These are treated more severely, applying an immediate 5-minute key-level lockout.
104
- * **Key-Level Lockouts**: If a single key accumulates 3 or more long-term (2-hour) cooldowns across different models, the manager assumes the key is compromised or disabled and applies a 5-minute global lockout on the key.
105
-
106
- ### Data Structure
107
-
108
- The `key_usage.json` file has a more complex structure to store this detailed state:
109
- ```json
110
- {
111
- "api_key_hash": {
112
- "daily": {
113
- "date": "YYYY-MM-DD",
114
- "models": {
115
- "gemini/gemini-1.5-pro": {
116
- "success_count": 10,
117
- "prompt_tokens": 5000,
118
- "completion_tokens": 10000,
119
- "approx_cost": 0.075
120
- }
121
- }
122
- },
123
- "global": { /* ... similar to daily, but accumulates over time ... */ },
124
- "model_cooldowns": {
125
- "gemini/gemini-1.5-flash": 1719987600.0
126
- },
127
- "failures": {
128
- "gemini/gemini-1.5-flash": {
129
- "consecutive_failures": 2
130
- }
131
- },
132
- "key_cooldown_until": null,
133
- "last_daily_reset": "YYYY-MM-DD"
134
- }
135
- }
136
  ```
137
 
138
- ## 3. `error_handler.py`
139
 
140
- This module provides a centralized function, `classify_error`, which is a significant improvement over simple boolean checks.
141
 
142
- * It takes a raw exception from `litellm` and returns a `ClassifiedError` data object.
143
- * This object contains the `error_type` (e.g., `'rate_limit'`, `'authentication'`), the original exception, the status code, and any `retry_after` information extracted from the error message.
144
- * This structured classification allows the `RotatingClient` to make more intelligent decisions about whether to retry with the same key or rotate to a new one.
145
 
146
- ### 2.4. `providers/` - Provider Plugins
147
 
148
- The provider plugin system allows for easy extension. The `__init__.py` file in this directory dynamically scans for all modules ending in `_provider.py`, imports the provider class from each, and registers it in the `PROVIDER_PLUGINS` dictionary. This makes adding new providers as simple as dropping a new file into the directory.
149
 
150
  ---
151
 
152
- ## 3. `proxy_app` - The FastAPI Proxy
153
 
154
- The `proxy_app` directory contains the FastAPI application that serves the `rotator_library`.
155
 
156
- ### 3.1. `main.py` - The FastAPI App
157
 
158
- This file defines the web server and its endpoints.
159
 
160
- #### Lifespan Management
161
 
162
- The application uses FastAPI's `lifespan` context manager to manage the `RotatingClient` instance. The client is initialized when the application starts and gracefully closed (releasing its `httpx` resources) when the application shuts down. This ensures that a single, stateful client instance is shared across all requests.
163
 
164
- #### Endpoints
165
 
166
- * `POST /v1/chat/completions`: The main endpoint for chat requests.
167
- * `POST /v1/embeddings`: The endpoint for creating embeddings.
168
- * `GET /v1/models`: Returns a list of all available models from configured providers.
169
- * `GET /v1/providers`: Returns a list of all configured providers.
170
- * `POST /v1/token-count`: Calculates the token count for a given message payload.
171
 
172
- #### Authentication
173
 
174
- All endpoints are protected by the `verify_api_key` dependency, which checks for a valid `Authorization: Bearer <PROXY_API_KEY>` header.
175
 
176
- #### Streaming Response Handling
177
 
178
- For streaming requests, the `chat_completions` endpoint returns a `StreamingResponse` whose content is generated by the `streaming_response_wrapper` function. This wrapper serves two purposes:
179
- 1. It passes the chunks from the `RotatingClient`'s stream directly to the user.
180
- 2. It aggregates the full response in the background so that it can be logged completely once the stream is finished.
181
 
182
- ### 3.2. `detailed_logger.py` - Comprehensive Transaction Logging
 
183
 
184
- To facilitate robust debugging and performance analysis, the proxy includes a powerful detailed logging system, enabled by the `--enable-request-logging` command-line flag. This system is managed by the `DetailedLogger` class in `detailed_logger.py`.
185
 
186
- Unlike simple logging, this system creates a **unique directory for every single transaction**, ensuring that all related data is isolated and easy to analyze.
187
 
188
- #### Log Directory Structure
189
 
190
- When logging is enabled, each request will generate a new directory inside `logs/detailed_logs/` with a name like `YYYYMMDD_HHMMSS_unique-uuid`. Inside this directory, you will find a complete record of the transaction:
191
 
192
- - **`request.json`**: Contains the full incoming request, including HTTP headers and the JSON body.
193
- - **`streaming_chunks.jsonl`**: For streaming requests, this file contains a timestamped log of every individual data chunk received from the provider. This is invaluable for debugging malformed streams or partial responses.
194
- - **`final_response.json`**: Contains the complete final response from the provider, including the status code, headers, and full JSON body. For streaming requests, this body is the fully reassembled message.
195
- - **`metadata.json`**: A summary file for quick analysis, containing:
196
- - `request_id`: The unique identifier for the transaction.
197
- - `duration_ms`: The total time taken for the request to complete.
198
- - `status_code`: The final HTTP status code returned by the provider.
199
- - `model`: The model used for the request.
200
- - `usage`: Token usage statistics (`prompt`, `completion`, `total`).
201
- - `finish_reason`: The reason the model stopped generating tokens.
202
- - `reasoning_found`: A boolean indicating if a `reasoning` field was detected in the response.
203
- - `reasoning_content`: The extracted content of the `reasoning` field, if found.
204
 
205
- ### 3.3. `build.py`
206
 
207
- This is a utility script for creating a standalone executable of the proxy application using PyInstaller. It includes logic to dynamically find all provider plugins and explicitly include them as hidden imports, ensuring they are bundled into the final executable.
 
1
  # Technical Documentation: Universal LLM API Proxy & Resilience Library
2
 
3
+ This document provides a detailed technical explanation of the project's architecture, internal components, and data flows. It is intended for developers who want to understand how the system achieves high availability and resilience.
4
 
5
  ## 1. Architecture Overview
6
 
7
  The project is a monorepo containing two primary components:
8
 
9
+ 1. **The Proxy Application (`proxy_app`)**: This is the user-facing component. It's a FastAPI application that acts as a universal gateway. It uses `litellm` to translate requests to various provider formats and includes:
10
+ * **Batch Manager**: Optimizes high-volume embedding requests.
11
+ * **Detailed Logger**: Provides per-request file logging for debugging.
12
+ * **OpenAI-Compatible Endpoints**: `/v1/chat/completions`, `/v1/embeddings`, etc.
13
  2. **The Resilience Library (`rotator_library`)**: This is the core engine that provides high availability. It is consumed by the proxy app to manage a pool of API keys, handle errors gracefully, and ensure requests are completed successfully even when individual keys or provider endpoints face issues.
14
 
15
  This architecture cleanly separates the API interface from the resilience logic, making the library a portable and powerful tool for any application needing robust API key management.
 
31
  ```python
32
  client = RotatingClient(
33
  api_keys=api_keys,
34
+ oauth_credentials=oauth_credentials,
35
  max_retries=2,
36
+ usage_file_path="key_usage.json",
37
+ configure_logging=True,
38
+ global_timeout=30,
39
+ abort_on_callback_error=True,
40
+ litellm_provider_params={},
41
+ ignore_models={},
42
+ whitelist_models={},
43
+ enable_request_logging=False,
44
+ max_concurrent_requests_per_key={}
45
  )
46
  ```
47
 
48
+ - `api_keys` (`Optional[Dict[str, List[str]]]`, default: `None`): A dictionary mapping provider names to a list of API keys.
49
+ - `oauth_credentials` (`Optional[Dict[str, List[str]]]`, default: `None`): A dictionary mapping provider names to a list of file paths to OAuth credential JSON files.
50
+ - `max_retries` (`int`, default: `2`): The number of times to retry a request with the *same key* if a transient server error occurs.
51
+ - `usage_file_path` (`str`, default: `"key_usage.json"`): The path to the JSON file where usage statistics are persisted.
52
+ - `configure_logging` (`bool`, default: `True`): If `True`, configures the library's logger to propagate logs to the root logger.
53
+ - `global_timeout` (`int`, default: `30`): A hard time limit (in seconds) for the entire request lifecycle.
54
+ - `abort_on_callback_error` (`bool`, default: `True`): If `True`, any exception raised by `pre_request_callback` will abort the request.
55
+ - `litellm_provider_params` (`Optional[Dict[str, Any]]`, default: `None`): Extra parameters to pass to `litellm` for specific providers.
56
+ - `ignore_models` (`Optional[Dict[str, List[str]]]`, default: `None`): Blacklist of models to exclude (supports wildcards).
57
+ - `whitelist_models` (`Optional[Dict[str, List[str]]]`, default: `None`): Whitelist of models to always include, overriding `ignore_models`.
58
+ - `enable_request_logging` (`bool`, default: `False`): If `True`, enables detailed per-request file logging.
59
+ - `max_concurrent_requests_per_key` (`Optional[Dict[str, int]]`, default: `None`): Max concurrent requests allowed for a single API key per provider.
60
 
61
  #### Core Responsibilities
62
 
63
+ * **Lifecycle Management**: Manages a shared `httpx.AsyncClient` for all non-blocking HTTP requests.
64
+ * **Key Management**: Interfacing with the `UsageManager` to acquire and release API keys based on load and health.
65
+ * **Plugin System**: Dynamically loading and using provider-specific plugins from the `providers/` directory.
66
+ * **Execution Logic**: Executing API calls via `litellm` with a robust, **deadline-driven** retry and key selection strategy.
67
+ * **Streaming Safety**: Providing a safe, stateful wrapper (`_safe_streaming_wrapper`) for handling streaming responses, buffering incomplete JSON chunks, and detecting mid-stream errors.
68
+ * **Model Filtering**: Filtering available models using configurable whitelists and blacklists.
69
+ * **Request Sanitization**: Automatically cleaning invalid parameters (like `dimensions` for non-OpenAI models) via `request_sanitizer.py`.
70
 
71
  #### Model Filtering Logic
72
 
73
+ The `RotatingClient` provides fine-grained control over which models are exposed via the `/v1/models` endpoint. This is handled by the `get_available_models` method.
74
 
75
+ The logic applies in the following order:
76
+ 1. **Whitelist Check**: If a provider has a whitelist defined (`WHITELIST_MODELS_<PROVIDER>`), any model on that list will **always be available**, even if it matches a blacklist pattern. This acts as a definitive override.
77
+ 2. **Blacklist Check**: For any model *not* on the whitelist, the client checks the blacklist (`IGNORE_MODELS_<PROVIDER>`). If the model matches a blacklist pattern (supports wildcards like `*-preview`), it is excluded.
78
+ 3. **Default**: If a model is on neither list, it is included.
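This precedence can be sketched as a minimal helper (hypothetical name; `fnmatch` stands in for the wildcard matching the lists support):

```python
from fnmatch import fnmatch

def is_model_exposed(model, whitelist, blacklist):
    """Whitelist wins outright; the blacklist only applies to non-whitelisted models."""
    if any(fnmatch(model, pat) for pat in whitelist):
        return True   # definitive override
    if any(fnmatch(model, pat) for pat in blacklist):
        return False  # excluded by wildcard pattern
    return True       # on neither list: included by default
```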
79
 
80
  #### Request Lifecycle: A Deadline-Driven Approach
81
 
82
+ The request lifecycle has been designed around a single, authoritative time budget to ensure predictable performance:
83
 
84
  1. **Deadline Establishment**: The moment `acompletion` or `aembedding` is called, a `deadline` is calculated: `time.time() + self.global_timeout`. This `deadline` is the absolute point in time by which the entire operation must complete.
85
+ 2. **Deadline-Aware Key Selection**: The main loop checks this deadline before every key acquisition attempt. If the deadline is exceeded, the request fails immediately.
86
+ 3. **Deadline-Aware Key Acquisition**: The `UsageManager` itself takes this `deadline`. It will only wait for a key (if all are busy) until the deadline is reached.
87
+ 4. **Deadline-Aware Retries**: If a transient error occurs (like a 500 or 429), the client calculates the backoff time. If waiting would push the total time past the deadline, the wait is skipped, and the client immediately rotates to the next key.
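The retry decision in step 4 can be sketched as follows (hypothetical helper; the real client tracks this state internally during the key-selection loop):

```python
import time

def backoff_or_rotate(attempt, deadline, base=1.0):
    """Return seconds to sleep before retrying the same key,
    or None to signal an immediate rotation to the next key."""
    wait = base * (2 ** attempt)        # exponential backoff
    remaining = deadline - time.time()  # budget left before the hard deadline
    return wait if wait <= remaining else None
```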
88
 
89
+ #### Streaming Resilience
90
 
91
+ The `_safe_streaming_wrapper` is a critical component for stability. It:
92
+ * **Buffers Fragments**: Reads raw chunks from the stream and buffers them until a valid JSON object can be parsed. This handles providers that may split JSON tokens across network packets.
93
+ * **Error Interception**: Detects if a chunk contains an API error (like a quota limit) instead of content, and raises a specific `StreamedAPIError`.
94
+ * **Quota Handling**: If a specific "quota exceeded" error is detected mid-stream multiple times, it can terminate the stream gracefully to prevent infinite retry loops on oversized inputs.
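The fragment buffering can be sketched for a newline-delimited stream (hypothetical helper; the real wrapper also inspects parsed chunks for error payloads):

```python
import json

def drain_buffer(buffer):
    """Split a raw buffer into parsed JSON objects plus any incomplete leftover.

    A fragment split across network packets stays in the leftover
    until the rest of it arrives with the next chunk.
    """
    objects, leftover = [], ""
    for line in buffer.split("\n"):
        if not line.strip():
            continue
        try:
            objects.append(json.loads(line))
        except json.JSONDecodeError:
            leftover += line  # incomplete fragment: keep for the next chunk
    return objects, leftover
```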
95
 
96
  ### 2.2. `usage_manager.py` - Stateful Concurrency & Usage Management
97
 
98
+ This class is the stateful core of the library, managing concurrency, usage tracking, and cooldowns.
99
 
100
  #### Key Concepts
101
 
102
+ * **Async-Native & Lazy-Loaded**: Fully asynchronous, using `aiofiles` for non-blocking file I/O. Usage data is loaded only when needed.
103
+ * **Fine-Grained Locking**: Each API key has its own `asyncio.Lock` and `asyncio.Condition`. This allows for highly granular control.
104
+
105
+ #### Tiered Key Acquisition Strategy
106
+
107
+ The `acquire_key` method uses a sophisticated strategy to balance load:
108
+
109
+ 1. **Filtering**: Keys currently on cooldown (global or model-specific) are excluded.
110
+ 2. **Tiering**: Valid keys are split into two tiers:
111
+ * **Tier 1 (Ideal)**: Keys that are completely idle (0 concurrent requests).
112
+ * **Tier 2 (Acceptable)**: Keys that are busy but still under their configured `MAX_CONCURRENT_REQUESTS_PER_KEY_<PROVIDER>` limit for the requested model. This allows a single key to be used multiple times for the same model, maximizing throughput.
113
+ 3. **Prioritization**: Within each tier, keys with the **lowest daily usage** are prioritized to spread costs evenly.
114
+ 4. **Concurrency Limits**: Checks against `max_concurrent` limits to prevent overloading a single key.
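The tiering above can be sketched as (hypothetical data shapes; the real `UsageManager` holds this state behind per-key locks):

```python
def pick_key(keys, active, usage, model, max_concurrent):
    """active: {key: {model: in-flight count}}; usage: {key: daily request count}."""
    def load(k):
        return sum(active.get(k, {}).values())
    # Tier 1: completely idle keys
    tier1 = [k for k in keys if load(k) == 0]
    # Tier 2: busy keys still under the per-model concurrency limit
    tier2 = [k for k in keys
             if load(k) > 0 and active.get(k, {}).get(model, 0) < max_concurrent]
    for tier in (tier1, tier2):
        if tier:
            # within a tier, prefer the key with the lowest daily usage
            return min(tier, key=lambda k: usage.get(k, 0))
    return None  # all eligible keys saturated
```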
115
+
116
+ #### Failure Handling & Cooldowns
117
+
118
+ * **Escalating Backoff**: When a failure occurs, the key gets a temporary cooldown for that specific model. Consecutive failures increase this time (10s -> 30s -> 60s -> 120s).
119
+ * **Key-Level Lockouts**: If a key accumulates failures across multiple distinct models (3+), it is assumed to be dead/revoked and placed on a global 5-minute lockout.
120
+ * **Authentication Errors**: Immediate 5-minute global lockout.
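The escalating schedule can be sketched as (hypothetical helper; the real manager also persists per-key failure counts across restarts):

```python
def cooldown_seconds(consecutive_failures, schedule=(10, 30, 60, 120)):
    """Escalating per-model cooldown; failures beyond the schedule stay at the cap."""
    idx = min(consecutive_failures, len(schedule)) - 1
    return schedule[max(idx, 0)]
```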
121
+
122
+ ### 2.3. `batch_manager.py` - Efficient Request Aggregation
123
+
124
+ The `EmbeddingBatcher` class optimizes high-throughput embedding workloads.
125
+
126
+ * **Mechanism**: It uses an `asyncio.Queue` to collect incoming requests.
127
+ * **Triggers**: A batch is dispatched when either:
128
+ 1. The queue size reaches `batch_size` (default: 64).
129
+ 2. A time window (`timeout`, default: 0.1s) elapses since the first request in the batch.
130
+ * **Efficiency**: This reduces dozens of HTTP calls to a single API request, significantly reducing overhead and rate limit usage.
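The two dispatch triggers can be sketched with a plain `asyncio.Queue` (hypothetical function name; the real `EmbeddingBatcher` runs this in a background task):

```python
import asyncio

async def batch_collect(queue, batch_size=64, timeout=0.1):
    """Collect up to batch_size items; flush early once the time
    window elapses after the first item arrives."""
    batch = [await queue.get()]  # block until the first request
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while len(batch) < batch_size:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break  # time window elapsed: dispatch what we have
        try:
            batch.append(await asyncio.wait_for(queue.get(), remaining))
        except asyncio.TimeoutError:
            break
    return batch
```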
131
+
132
+ ### 2.4. `background_refresher.py` - Automated Token Maintenance
133
+
134
+ The `BackgroundRefresher` ensures that OAuth tokens (for providers like Gemini CLI, Qwen, iFlow) never expire while the proxy is running.
135
+
136
+ * **Periodic Checks**: It runs a background task that wakes up at a configurable interval (default: 3600 seconds/1 hour).
137
+ * **Proactive Refresh**: It iterates through all loaded OAuth credentials and calls their `proactively_refresh` method to ensure tokens are valid before they are needed.
138
+
139
+ ### 2.6. Credential Management Architecture
140
+
141
+ The `CredentialManager` class (`credential_manager.py`) centralizes the lifecycle of all API credentials. It adheres to a "Local First" philosophy.
142
+
143
+ #### 2.6.1. Automated Discovery & Preparation
144
+
145
+ On startup (unless `SKIP_OAUTH_INIT_CHECK=true`), the manager performs a comprehensive sweep:
146
+
147
+ 1. **System-Wide Scan**: Searches for OAuth credential files in standard locations:
148
+ - `~/.gemini/` → All `*.json` files (typically `credentials.json`)
149
+ - `~/.qwen/` → All `*.json` files (typically `oauth_creds.json`)
150
+ - `~/.iflow/` → All `*.json` files
151
+
152
+ 2. **Local Import**: Valid credentials are **copied** (not moved) to the project's `oauth_creds/` directory with standardized names:
153
+ - `gemini_cli_oauth_1.json`, `gemini_cli_oauth_2.json`, etc.
154
+ - `qwen_code_oauth_1.json`, `qwen_code_oauth_2.json`, etc.
155
+ - `iflow_oauth_1.json`, `iflow_oauth_2.json`, etc.
156
+
157
+ 3. **Intelligent Deduplication**:
158
+ - The manager inspects each credential file for a `_proxy_metadata` field containing the user's email or ID
159
+ - If this field doesn't exist, it's added during import using provider-specific APIs (e.g., fetching Google account email for Gemini)
160
+ - Duplicate accounts (same email/ID) are detected and skipped with a warning log
161
+ - Prevents the same account from being added multiple times, even if the files are in different locations
162
+
163
+ 4. **Isolation**: The project's credentials in `oauth_creds/` are completely isolated from system-wide credentials, preventing cross-contamination
164
+
165
+ #### 2.6.2. Credential Loading & Stateless Operation
166
+
167
+ The manager supports loading credentials from two sources, with a clear priority:
168
+
169
+ **Priority 1: Local Files** (`oauth_creds/` directory)
170
+ - Standard `.json` files are loaded first
171
+ - Naming convention: `{provider}_oauth_{number}.json`
172
+ - Example: `oauth_creds/gemini_cli_oauth_1.json`
173
+
174
+ **Priority 2: Environment Variables** (Stateless Deployment)
175
+ - If no local files are found, the manager checks for provider-specific environment variables
176
+ - This is the key to "Stateless Deployment" for platforms like Railway, Render, Heroku
177
+
178
+ **Gemini CLI Environment Variables:**
179
+ ```
180
+ GEMINI_CLI_ACCESS_TOKEN
181
+ GEMINI_CLI_REFRESH_TOKEN
182
+ GEMINI_CLI_EXPIRY_DATE
183
+ GEMINI_CLI_EMAIL
184
+ GEMINI_CLI_PROJECT_ID (optional)
185
+ GEMINI_CLI_CLIENT_ID (optional)
186
  ```
187
 
188
+ **Qwen Code Environment Variables:**
189
+ ```
190
+ QWEN_CODE_ACCESS_TOKEN
191
+ QWEN_CODE_REFRESH_TOKEN
192
+ QWEN_CODE_EXPIRY_DATE
193
+ QWEN_CODE_EMAIL
194
+ ```
195
+
196
+ **iFlow Environment Variables:**
197
+ ```
198
+ IFLOW_ACCESS_TOKEN
199
+ IFLOW_REFRESH_TOKEN
200
+ IFLOW_EXPIRY_DATE
201
+ IFLOW_EMAIL
202
+ IFLOW_API_KEY
203
+ ```
204
 
205
+ **How it works:**
206
+ - If the manager finds (e.g.) `GEMINI_CLI_ACCESS_TOKEN`, it constructs an in-memory credential object that mimics the file structure
207
+ - The credential behaves exactly like a file-based credential (automatic refresh, expiry detection, etc.)
208
+ - No physical files are created or needed on the host system
209
+ - Perfect for ephemeral containers or read-only filesystems
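The fallback can be sketched as (hypothetical field names mirroring the file structure described above; the real manager builds a full credential object with refresh support):

```python
import os

def load_env_credential(prefix="GEMINI_CLI"):
    """Build an in-memory credential dict from environment variables,
    or return None if the provider's variables are absent."""
    access = os.environ.get(f"{prefix}_ACCESS_TOKEN")
    if not access:
        return None
    return {
        "access_token": access,
        "refresh_token": os.environ.get(f"{prefix}_REFRESH_TOKEN"),
        "expiry_date": os.environ.get(f"{prefix}_EXPIRY_DATE"),
        "email": os.environ.get(f"{prefix}_EMAIL"),
    }
```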
210
 
211
+ #### 2.6.3. Credential Tool Integration
212
 
213
+ The `credential_tool.py` provides a user-friendly CLI interface to the `CredentialManager`:
214
 
215
+ **Key Functions:**
216
+ 1. **OAuth Setup**: Wraps provider-specific `AuthBase` classes (`GeminiAuthBase`, `QwenAuthBase`, `IFlowAuthBase`) to handle interactive login flows
217
+ 2. **Credential Export**: Reads local `.json` files and generates `.env` format output for stateless deployment
218
+ 3. **API Key Management**: Adds or updates `PROVIDER_API_KEY_N` entries in the `.env` file
219
 
220
  ---
221
 
222
+ ### 2.7. Request Sanitizer (`request_sanitizer.py`)
223
 
224
+ The `sanitize_request_payload` function ensures requests are compatible with each provider's specific requirements:
225
 
226
+ **Parameter Cleaning Logic:**
227
 
228
+ 1. **`dimensions` Parameter**:
229
+ - Only supported by OpenAI's `text-embedding-3-small` and `text-embedding-3-large` models
230
+ - Automatically removed for all other models to prevent `400 Bad Request` errors
231
 
232
+ 2. **`thinking` Parameter** (Gemini-specific):
233
+ - Format: `{"type": "enabled", "budget_tokens": -1}`
234
+ - Only valid for `gemini/gemini-2.5-pro` and `gemini/gemini-2.5-flash`
235
+ - Removed for all other models
236
 
237
+ **Provider-Specific Tool Schema Cleaning:**
238
 
239
+ Implemented in individual provider classes (`QwenCodeProvider`, `IFlowProvider`):
240
 
241
+ - **Recursively removes** unsupported properties from tool function schemas:
242
+ - `strict`: OpenAI-specific, causes validation errors on Qwen/iFlow
243
+ - `additionalProperties`: Same issue
244
+ - **Prevents `400 Bad Request` errors** when using complex tool definitions
245
+ - Applied automatically before sending requests to the provider
246
 
247
+ ---
248
 
249
+ ### 2.8. Error Classification (`error_handler.py`)
250
 
251
+ The `ClassifiedError` class wraps all exceptions from `litellm` and categorizes them for intelligent handling:
252
 
253
+ **Error Types:**
254
+ ```python
255
+ class ErrorType(Enum):
256
+ RATE_LIMIT = "rate_limit" # 429 errors, temporary backoff needed
257
+ AUTHENTICATION = "authentication" # 401/403, invalid/revoked key
258
+ SERVER_ERROR = "server_error" # 500/502/503, provider infrastructure issues
259
+ QUOTA = "quota" # Daily/monthly quota exceeded
260
+ CONTEXT_LENGTH = "context_length" # Input too long for model
261
+ CONTENT_FILTER = "content_filter" # Request blocked by safety filters
262
+ NOT_FOUND = "not_found" # Model/endpoint doesn't exist
263
+ TIMEOUT = "timeout" # Request took too long
264
+ UNKNOWN = "unknown" # Unclassified error
265
+ ```
266
 
267
+ **Classification Logic:**
268
+
269
+ 1. **Status Code Analysis**: Primary classification method
270
+ - `401`/`403` → `AUTHENTICATION`
271
+ - `429` → `RATE_LIMIT`
272
+ - `400` with "context_length" or "tokens" → `CONTEXT_LENGTH`
273
+ - `400` with "quota" → `QUOTA`
274
+ - `500`/`502`/`503` → `SERVER_ERROR`
275
+
276
+ 2. **Message Analysis**: Fallback for ambiguous errors
277
+ - Searches for keywords like "quota exceeded", "rate limit", "invalid api key"
278
+
279
+ 3. **Provider-Specific Overrides**: Some providers use non-standard error formats
280
+
281
+ **Usage in Client:**
282
+ - `AUTHENTICATION` → Immediate 5-minute global lockout
283
+ - `RATE_LIMIT`/`QUOTA` → Escalating per-model cooldown
284
+ - `SERVER_ERROR` → Retry with same key (up to `max_retries`)
285
+ - `CONTEXT_LENGTH`/`CONTENT_FILTER` → Immediate failure (user needs to fix request)
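A minimal sketch of this two-stage classification (status code first, message keywords as a fallback) might look like the following. The function name and the reduced enum are illustrative, not the library's exact API:

```python
from enum import Enum

class ErrorType(Enum):
    RATE_LIMIT = "rate_limit"
    AUTHENTICATION = "authentication"
    SERVER_ERROR = "server_error"
    QUOTA = "quota"
    CONTEXT_LENGTH = "context_length"
    UNKNOWN = "unknown"

def classify(status, message):
    # Stage 1: status code analysis.
    if status in (401, 403):
        return ErrorType.AUTHENTICATION
    if status == 429:
        return ErrorType.RATE_LIMIT
    if status in (500, 502, 503):
        return ErrorType.SERVER_ERROR
    if status == 400:
        lowered = message.lower()
        if "context_length" in lowered or "tokens" in lowered:
            return ErrorType.CONTEXT_LENGTH
        if "quota" in lowered:
            return ErrorType.QUOTA
    # Stage 2: keyword fallback for ambiguous errors.
    lowered = message.lower()
    if "rate limit" in lowered:
        return ErrorType.RATE_LIMIT
    if "invalid api key" in lowered:
        return ErrorType.AUTHENTICATION
    return ErrorType.UNKNOWN
```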
286
+
287
+ ---
288
+
289
+ ### 2.9. Cooldown Management (`cooldown_manager.py`)
290
+
291
+ The `CooldownManager` handles IP or account-level rate limiting that affects all keys for a provider:
292
+
293
+ **Purpose:**
294
+ - Some providers (like NVIDIA NIM) have rate limits tied to account/IP rather than API key
295
+ - When a 429 error occurs, ALL keys for that provider must be paused
296
+
297
+ **Key Methods:**
298
+
299
+ 1. **`is_cooling_down(provider: str) -> bool`**:
300
+ - Checks if a provider is currently in a global cooldown period
301
+ - Returns `True` if the current time is still within the cooldown window
302
+
303
+ 2. **`start_cooldown(provider: str, duration: int)`**:
304
+ - Initiates or extends a cooldown for a provider
305
+ - Duration is typically 60-120 seconds for 429 errors
306
+
307
+ 3. **`get_cooldown_remaining(provider: str) -> float`**:
308
+ - Returns remaining cooldown time in seconds
309
+ - Used for logging and diagnostics
310
+
311
+ **Integration with UsageManager:**
312
+ - When a key fails with `RATE_LIMIT` error type, the client checks if it's likely an IP-level limit
313
+ - If so, `CooldownManager.start_cooldown()` is called for the entire provider
314
+ - All subsequent `acquire_key()` calls for that provider will wait until the cooldown expires
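The three methods above can be sketched with a plain monotonic clock. This is a simplified, non-async illustration; the real manager also coordinates with `UsageManager`:

```python
import time

class CooldownManager:
    """Simplified sketch of the provider-wide cooldown described above."""

    def __init__(self):
        self._cooldown_until = {}  # provider -> monotonic deadline

    def start_cooldown(self, provider, duration):
        # Extend, never shorten, an existing cooldown.
        deadline = time.monotonic() + duration
        self._cooldown_until[provider] = max(
            self._cooldown_until.get(provider, 0.0), deadline
        )

    def is_cooling_down(self, provider):
        return time.monotonic() < self._cooldown_until.get(provider, 0.0)

    def get_cooldown_remaining(self, provider):
        return max(0.0, self._cooldown_until.get(provider, 0.0) - time.monotonic())

manager = CooldownManager()
manager.start_cooldown("nvidia_nim", 60)
```

Using `max()` in `start_cooldown` means repeated 429s extend the window rather than accidentally shortening it.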
315
+
316
+ ---
317
+
318
+ ## 3. Provider Specific Implementations
319
+
320
+ The library handles provider idiosyncrasies through specialized "Provider" classes in `src/rotator_library/providers/`.
321
+
322
+ ### 3.1. Gemini CLI (`gemini_cli_provider.py`)
323
+
324
+ The `GeminiCliProvider` is the most complex implementation, mimicking the Google Cloud Code extension.
325
+
326
+ #### Authentication (`gemini_auth_base.py`)
327
+
328
+ * **OAuth Flow**: Uses a standard OAuth 2.0 authorization-code flow with a loopback redirect. The `credential_tool` spins up a local web server (`localhost:8085`) to capture the callback from Google's auth page.
329
+ * **Token Lifecycle**:
330
+ * **Proactive Refresh**: Tokens are refreshed 5 minutes before expiry.
331
+ * **Atomic Writes**: Credential files are updated using a temp-file-and-move strategy to prevent corruption during writes.
332
+ * **Revocation Handling**: If a `400` or `401` occurs during refresh, the token is marked as revoked, preventing infinite retry loops.
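The temp-file-and-move strategy mentioned above can be illustrated as follows (a sketch; the helper name is hypothetical). `os.replace` is atomic on both POSIX and Windows, so a crash mid-write leaves either the old file or the new one, never a half-written credential file:

```python
import json
import os
import tempfile

def write_credentials_atomically(path, creds):
    """Write JSON to a temp file in the target directory, then swap it in."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(creds, handle, indent=2)
        os.replace(tmp_path, path)  # atomic swap; readers never see a partial file
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file on any failure
        raise

demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "oauth_creds.json")
write_credentials_atomically(demo_path, {"access_token": "abc", "expiry": 0})
```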
333
+
334
+ #### Project ID Discovery (Zero-Config)
335
+
336
+ The provider employs a sophisticated, cached discovery mechanism to find a valid Google Cloud Project ID:
337
+ 1. **Configuration**: Checks `GEMINI_CLI_PROJECT_ID` first.
338
+ 2. **Code Assist API**: Tries `CODE_ASSIST_ENDPOINT:loadCodeAssist`. This returns the project associated with the Cloud Code extension.
339
+ 3. **Onboarding Flow**: If step 2 fails, it triggers the `onboardUser` endpoint. This initiates a Long-Running Operation (LRO) that automatically provisions a free-tier Google Cloud Project for the user. The proxy polls this operation for up to 5 minutes until completion.
340
+ 4. **Resource Manager**: As a final fallback, it lists all active projects via the Cloud Resource Manager API and selects the first one.
341
+
342
+ #### Rate Limit Handling
343
+
344
+ * **Internal Endpoints**: Uses `https://cloudcode-pa.googleapis.com/v1internal`, which typically has higher quotas than the public API.
345
+ * **Smart Fallback**: If `gemini-2.5-pro` hits a rate limit (`429`), the provider transparently retries the request using `gemini-2.5-pro-preview-06-05`. This fallback chain is configurable in code.
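The fallback behaviour can be sketched as a simple chain walk. The `FALLBACK_CHAIN` mapping and `RateLimited` exception are illustrative stand-ins for the provider's internal logic:

```python
# Hypothetical names; the real chain lives inside GeminiCliProvider.
FALLBACK_CHAIN = {
    "gemini-2.5-pro": ["gemini-2.5-pro-preview-06-05"],
}

class RateLimited(Exception):
    pass

def complete_with_fallback(model, call):
    """Try the requested model, then each configured fallback on a 429."""
    for candidate in [model, *FALLBACK_CHAIN.get(model, [])]:
        try:
            return call(candidate)
        except RateLimited:
            continue
    raise RateLimited(f"All fallbacks exhausted for {model}")

def fake_call(model):
    # Simulate the primary model being rate-limited.
    if model == "gemini-2.5-pro":
        raise RateLimited()
    return f"ok:{model}"

result = complete_with_fallback("gemini-2.5-pro", fake_call)
```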
346
+
347
+ ### 3.2. Qwen Code (`qwen_code_provider.py`)
348
+
349
+ * **Dual Auth**: Supports both standard API keys (direct) and OAuth (via `QwenAuthBase`).
350
+ * **Device Flow**: Implements the OAuth Device Authorization Grant (RFC 8628). It displays a code to the user and polls the token endpoint until the user authorizes the device in their browser.
351
+ * **Dummy Tool Injection**: To work around a Qwen API bug where streams hang if `tools` is empty but `tool_choice` logic is present, the provider injects a benign `do_not_call_me` tool.
352
+ * **Schema Cleaning**: Recursively removes `strict` and `additionalProperties` from tool schemas, as Qwen's validation is stricter than OpenAI's.
353
+ * **Reasoning Parsing**: Detects `<think>` tags in the raw stream and redirects their content to a separate `reasoning_content` field in the delta, mimicking the OpenAI o1 format.
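As a simplified, non-streaming illustration of that redirection (the real provider handles `<think>` tags incrementally across stream chunks):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text):
    """Separate <think>...</think> content from the visible reply."""
    reasoning = "".join(THINK_RE.findall(text))
    content = THINK_RE.sub("", text)
    return {"content": content.strip(), "reasoning_content": reasoning.strip()}

delta = split_reasoning("<think>User wants JSON.</think>Here is the JSON.")
```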
354
+
355
+ ### 3.3. iFlow (`iflow_provider.py`)
356
+
357
+ * **Hybrid Auth**: Uses a custom OAuth flow (Authorization Code) to obtain an `access_token`. However, the *actual* API calls use a separate `apiKey` that is retrieved from the user's profile (`/api/oauth/getUserInfo`) using the access token.
358
+ * **Callback Server**: The auth flow spins up a local server on port `11451` to capture the redirect.
359
+ * **Token Management**: Automatically refreshes the OAuth token and re-fetches the API key if needed.
360
+ * **Schema Cleaning**: Similar to Qwen, it aggressively sanitizes tool schemas to prevent 400 errors.
361
+ * **Dedicated Logging**: Implements `_IFlowFileLogger` to capture raw chunks for debugging proprietary API behaviors.
362
+
363
+ ### 3.4. Google Gemini (`gemini_provider.py`)
364
+
365
+ * **Thinking Parameter**: Automatically translates the OpenAI-style `thinking` parameter into the native reasoning configuration expected by Gemini 2.5 models.
366
+ * **Safety Settings**: Ensures default safety settings (blocking nothing) are applied if not provided, preventing over-sensitive refusals.
367
+
368
+ ---
369
 
370
+ ## 4. Logging & Debugging
371
 
372
+ ### `detailed_logger.py`
373
 
374
+ To facilitate robust debugging, the proxy includes a comprehensive transaction logging system.
375
 
376
+ * **Unique IDs**: Every request generates a UUID.
377
+ * **Directory Structure**: Logs are stored in `logs/detailed_logs/YYYYMMDD_HHMMSS_{uuid}/`.
378
+ * **Artifacts**:
379
+ * `request.json`: The exact payload sent to the proxy.
380
+ * `final_response.json`: The complete reassembled response.
381
+ * `streaming_chunks.jsonl`: A line-by-line log of every SSE chunk received from the provider.
382
+ * `metadata.json`: Performance metrics (duration, token usage, model used).
383
 
384
+ This level of detail allows developers to trace exactly why a request failed or why a specific key was rotated.
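The directory naming convention above can be reproduced with a couple of standard-library calls (an illustrative helper, not the logger's actual code):

```python
import uuid
from datetime import datetime
from pathlib import Path

def new_transaction_dir(base="logs/detailed_logs"):
    """Build a per-request directory path of the form YYYYMMDD_HHMMSS_{uuid}."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return Path(base) / f"{stamp}_{uuid.uuid4()}"

path = new_transaction_dir()
```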
 
 
385
 
 
386
 
 
Deployment guide.md CHANGED
@@ -69,8 +69,19 @@ OPENROUTER_API_KEY_1="your-openrouter-key"
69
 
70
  - Supported providers: Check LiteLLM docs for a full list and specifics (e.g., GEMINI, OPENROUTER, NVIDIA_NIM).
71
  - Tip: Start with 1-2 providers to test. Don't share this file publicly!
 
72
  4. Save the file. (We'll upload it to Render in Step 5.)
73
 
 
74
  ## Step 4: Create a New Web Service on Render
75
 
76
  1. Log in to render.com and go to your Dashboard.
 
69
 
70
  - Supported providers: Check LiteLLM docs for a full list and specifics (e.g., GEMINI, OPENROUTER, NVIDIA_NIM).
71
  - Tip: Start with 1-2 providers to test. Don't share this file publicly!
72
+
73
+ ### Advanced: Stateless Deployment for OAuth Providers (Gemini CLI, Qwen, iFlow)
74
+ If you are using providers that require complex OAuth files (like **Gemini CLI**, **Qwen Code**, or **iFlow**), you don't need to upload the JSON files manually. The proxy includes a tool to "export" these credentials into environment variables.
75
+
76
+ 1. Run the credential tool locally: `python -m rotator_library.credential_tool`
77
+ 2. Select the "Export ... to .env" option for your provider.
78
+ 3. The tool will generate a file (e.g., `gemini_cli_user_at_gmail.env`) containing variables like `GEMINI_CLI_ACCESS_TOKEN`, `GEMINI_CLI_REFRESH_TOKEN`, etc.
79
+ 4. Copy the contents of this file and paste them directly into your `.env` file or Render's "Environment Variables" section.
80
+ 5. The proxy will automatically detect and use these variables—no file upload required!
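+ For reference, the exported file is a plain env fragment. The exact set of variables depends on the provider; the names below follow the `GEMINI_CLI_*` pattern from step 3, and the values are placeholders:
+
+ ```env
+ # Illustrative excerpt of an exported credential file (placeholder values)
+ GEMINI_CLI_ACCESS_TOKEN="ya29.a0...placeholder"
+ GEMINI_CLI_REFRESH_TOKEN="1//0g...placeholder"
+ # The tool also emits the remaining token metadata your credential file contains.
+ ```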
81
+
82
  4. Save the file. (We'll upload it to Render in Step 5.)
83
 
84
+
85
  ## Step 4: Create a New Web Service on Render
86
 
87
  1. Log in to render.com and go to your Dashboard.
README.md CHANGED
@@ -1,18 +1,6 @@
1
  # Universal LLM API Proxy & Resilience Library [![ko-fi](https://ko-fi.com/img/githubbutton_sm.svg)](https://ko-fi.com/C0C0UZS4P)
2
  [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/Mirrowel/LLM-API-Key-Proxy) [![zread](https://img.shields.io/badge/Ask_Zread-_.svg?style=flat&color=00b0aa&labelColor=000000&logo=data%3Aimage%2Fsvg%2Bxml%3Bbase64%2CPHN2ZyB3aWR0aD0iMTYiIGhlaWdodD0iMTYiIHZpZXdCb3g9IjAgMCAxNiAxNiIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPHBhdGggZD0iTTQuOTYxNTYgMS42MDAxSDIuMjQxNTZDMS44ODgxIDEuNjAwMSAxLjYwMTU2IDEuODg2NjQgMS42MDE1NiAyLjI0MDFWNC45NjAxQzEuNjAxNTYgNS4zMTM1NiAxLjg4ODEgNS42MDAxIDIuMjQxNTYgNS42MDAxSDQuOTYxNTZDNS4zMTUwMiA1LjYwMDEgNS42MDE1NiA1LjMxMzU2IDUuNjAxNTYgNC45NjAxVjIuMjQwMUM1LjYwMTU2IDEuODg2NjQgNS4zMTUwMiAxLjYwMDEgNC45NjE1NiAxLjYwMDFaIiBmaWxsPSIjZmZmIi8%2BCjxwYXRoIGQ9Ik00Ljk2MTU2IDEwLjM5OTlIMi4yNDE1NkMxLjg4ODEgMTAuMzk5OSAxLjYwMTU2IDEwLjY4NjQgMS42MDE1NiAxMS4wMzk5VjEzLjc1OTlDMS42MDE1NiAxNC4xMTM0IDEuODg4MSAxNC4zOTk5IDIuMjQxNTYgMTQuMzk5OUg0Ljk2MTU2QzUuMzE1MDIgMTQuMzk5OSA1LjYwMTU2IDE0LjExMzQgNS42MDE1NiAxMy43NTk5VjExLjAzOTlDNS42MDE1NiAxMC42ODY0IDUuMzE1MDIgMTAuMzk5OSA0Ljk2MTU2IDEwLjM5OTlaIiBmaWxsPSIjZmZmIi8%2BCjxwYXRoIGQ9Ik0xMy43NTg0IDEuNjAwMUgxMS4wMzg0QzEwLjY4NSAxLjYwMDEgMTAuMzk4NCAxLjg4NjY0IDEwLjM5ODQgMi4yNDAxVjQuOTYwMUMxMC4zOTg0IDUuMzEzNTYgMTAuNjg1IDUuNjAwMSAxMS4wMzg0IDUuNjAwMUgxMy43NTg0QzE0LjExMTkgNS42MDAxIDE0LjM5ODQgNS4zMTM1NiAxNC4zOTg0IDQuOTYwMVYyLjI0MDFDMTQuMzk4NCAxLjg4NjY0IDE0LjExMTkgMS42MDAxIDEzLjc1ODQgMS42MDAxWiIgZmlsbD0iI2ZmZiIvPgo8cGF0aCBkPSJNNCAxMkwxMiA0TDQgMTJaIiBmaWxsPSIjZmZmIi8%2BCjxwYXRoIGQ9Ik00IDEyTDEyIDQiIHN0cm9rZT0iI2ZmZiIgc3Ryb2tlLXdpZHRoPSIxLjUiIHN0cm9rZS1saW5lY2FwPSJyb3VuZCIvPgo8L3N2Zz4K&logoColor=ffffff)](https://zread.ai/Mirrowel/LLM-API-Key-Proxy)
3
 
4
- ## Easy Setup for Beginners (Windows)
5
-
6
- This is the fastest way to get started.
7
-
8
- 1. **Download the latest release** from the [GitHub Releases page](https://github.com/Mirrowel/LLM-API-Key-Proxy/releases/latest).
9
- 2. Unzip the downloaded file.
10
- 3. **Double-click `setup_env.bat`**. A window will open to help you add your API keys. Follow the on-screen instructions.
11
- 4. **Double-click `proxy_app.exe`**. This will start the proxy server.
12
-
13
- Your proxy is now running! You can now use it in your applications.
14
-
15
- ---
16
 
17
  ## Detailed Setup and Features
18
 
@@ -26,9 +14,12 @@ This project provides a powerful solution for developers building complex applic
26
  - **Universal API Endpoint**: Simplifies development by providing a single, OpenAI-compatible interface for diverse LLM providers.
27
  - **High Availability**: The underlying library ensures your application remains operational by gracefully handling transient provider errors and API key-specific issues.
28
  - **Resilient Performance**: A global timeout on all requests prevents your application from hanging on unresponsive provider APIs.
29
- - **Efficient Concurrency**: Maximizes throughput by allowing a single API key to handle multiple concurrent requests to different models.
30
  - **Intelligent Key Management**: Optimizes request distribution across your pool of keys by selecting the best available one for each call.
31
- - **Automated OAuth Discovery**: Automatically discovers, validates, and manages OAuth credentials from standard provider directories (e.g., `~/.gemini/`, `~/.qwen/`, `~/.iflow/`). No manual `.env` configuration is required for supported providers.
 
 
 
32
  - **Duplicate Credential Detection**: Intelligently detects if multiple local credential files belong to the same user account and logs a warning, preventing redundancy in your key pool.
33
  - **Escalating Per-Model Cooldowns**: If a key fails for a specific model, it's placed on a temporary, escalating cooldown for that model, allowing it to be used with others.
34
  - **Automatic Daily Resets**: Cooldowns and usage statistics are automatically reset daily, making the system self-maintaining.
@@ -37,18 +28,65 @@ This project provides a powerful solution for developers building complex applic
37
  - **OpenAI-Compatible Proxy**: Offers a familiar API interface with additional endpoints for model and provider discovery.
38
  - **Advanced Model Filtering**: Supports both blacklists and whitelists to give you fine-grained control over which models are available through the proxy.
39
 
 
40
  ---
41
 
42
- ## 1. Quick Start (Windows Executable)
43
 
44
- This is the fastest way to get started for most users on Windows.
45
 
46
  1. **Download the latest release** from the [GitHub Releases page](https://github.com/Mirrowel/LLM-API-Key-Proxy/releases/latest).
47
  2. Unzip the downloaded file.
48
- 3. **Run `setup_env.bat`**. A window will open to help you add your API keys. Follow the on-screen instructions.
49
- 4. **Run `proxy_app.exe`**. This will start the proxy server in a new terminal window.
 
50
 
51
- Your proxy is now running and ready to use at `http://127.0.0.1:8000`.
 
52
 
53
  ---
54
 
@@ -121,22 +159,67 @@ You only need to create a `.env` file to set your `PROXY_API_KEY` and to overrid
121
 
122
  #### Interactive Credential Management Tool
123
 
124
- For easier credential management, you can use the interactive credential tool:
125
 
126
  ```bash
127
  python -m rotator_library.credential_tool
128
  ```
129
 
130
- This tool provides:
131
- 1. **Add OAuth Credential** - Interactive OAuth flow for Gemini CLI, Qwen Code, and iFlow
132
- 2. **Add API Key** - Add API keys for any LiteLLM-supported provider
133
- 3. **Export Gemini CLI to .env** - NEW! Export OAuth credentials to environment variables for stateless deployments
134
 
135
- **For Stateless Hosting (Railway, Render, Vercel, etc.):**
136
- - Use option 3 to export your Gemini CLI credentials to `.env` format
137
- - The generated file contains all necessary environment variables
138
- - Simply paste these into your hosting platform's environment settings
139
- - No file persistence required - credentials load automatically from environment variables
 
140
 
141
  **Example `.env` configuration:**
142
  ```env
@@ -269,17 +352,21 @@ curl -X POST http://127.0.0.1:8000/v1/chat/completions \
269
 
270
  ## 4. Advanced Topics
271
 
272
  ### How It Works
273
 
274
- When a request is made to the proxy, the application uses its core resilience library to ensure the request is handled reliably:
275
 
276
- 1. **Selects an Optimal Key**: The `UsageManager` selects the best available key from your pool. It uses a tiered locking strategy to find a healthy, available key, prioritizing those with the least recent usage. This allows for concurrent requests to different models using the same key, maximizing efficiency.
277
- 2. **Makes the Request**: The proxy uses the acquired key to make the API call to the target provider via `litellm`.
278
- 3. **Manages Errors Gracefully**:
279
- - It uses a `classify_error` function to determine the failure type.
280
- - For **transient server errors**, it retries the request with the same key using exponential backoff.
281
- - For **key-specific issues (e.g., authentication or provider-side limits)**, it temporarily places that key on a cooldown for the specific model and seamlessly retries the request with the next available key from the pool.
282
- 4. **Tracks Usage & Releases Key**: On a successful request, it records usage stats. The key is then released back into the available pool, ready for the next request.
283
 
284
  ### Command-Line Arguments and Scripts
285
 
@@ -289,11 +376,84 @@ The proxy server can be configured at runtime using the following command-line a
289
  - `--port`: The port to run the server on. Defaults to `8000`.
290
  - `--enable-request-logging`: A flag to enable detailed, per-request logging. When active, the proxy creates a unique directory for each transaction in the `logs/detailed_logs/` folder, containing the full request, response, streaming chunks, and performance metadata. This is highly recommended for debugging.
291
 
292
  **Example:**
293
  ```bash
294
  python src/proxy_app/main.py --host 127.0.0.1 --port 9999 --enable-request-logging
295
  ```
296
 
 
297
  #### Windows Batch Scripts
298
 
299
  For convenience on Windows, you can use the provided `.bat` scripts in the root directory to run the proxy with common configurations:
 
1
  # Universal LLM API Proxy & Resilience Library [![ko-fi](https://ko-fi.com/img/githubbutton_sm.svg)](https://ko-fi.com/C0C0UZS4P)
2
  [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/Mirrowel/LLM-API-Key-Proxy) [![zread](https://img.shields.io/badge/Ask_Zread-_.svg?style=flat&color=00b0aa&labelColor=000000&logo=data%3Aimage%2Fsvg%2Bxml%3Bbase64%2CPHN2ZyB3aWR0aD0iMTYiIGhlaWdodD0iMTYiIHZpZXdCb3g9IjAgMCAxNiAxNiIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPHBhdGggZD0iTTQuOTYxNTYgMS42MDAxSDIuMjQxNTZDMS44ODgxIDEuNjAwMSAxLjYwMTU2IDEuODg2NjQgMS42MDE1NiAyLjI0MDFWNC45NjAxQzEuNjAxNTYgNS4zMTM1NiAxLjg4ODEgNS42MDAxIDIuMjQxNTYgNS42MDAxSDQuOTYxNTZDNS4zMTUwMiA1LjYwMDEgNS42MDE1NiA1LjMxMzU2IDUuNjAxNTYgNC45NjAxVjIuMjQwMUM1LjYwMTU2IDEuODg2NjQgNS4zMTUwMiAxLjYwMDEgNC45NjE1NiAxLjYwMDFaIiBmaWxsPSIjZmZmIi8%2BCjxwYXRoIGQ9Ik00Ljk2MTU2IDEwLjM5OTlIMi4yNDE1NkMxLjg4ODEgMTAuMzk5OSAxLjYwMTU2IDEwLjY4NjQgMS42MDE1NiAxMS4wMzk5VjEzLjc1OTlDMS42MDE1NiAxNC4xMTM0IDEuODg4MSAxNC4zOTk5IDIuMjQxNTYgMTQuMzk5OUg0Ljk2MTU2QzUuMzE1MDIgMTQuMzk5OSA1LjYwMTU2IDE0LjExMzQgNS42MDE1NiAxMy43NTk5VjExLjAzOTlDNS42MDE1NiAxMC42ODY0IDUuMzE1MDIgMTAuMzk5OSA0Ljk2MTU2IDEwLjM5OTlaIiBmaWxsPSIjZmZmIi8%2BCjxwYXRoIGQ9Ik0xMy43NTg0IDEuNjAwMUgxMS4wMzg0QzEwLjY4NSAxLjYwMDEgMTAuMzk4NCAxLjg4NjY0IDEwLjM5ODQgMi4yNDAxVjQuOTYwMUMxMC4zOTg0IDUuMzEzNTYgMTAuNjg1IDUuNjAwMSAxMS4wMzg0IDUuNjAwMUgxMy43NTg0QzE0LjExMTkgNS42MDAxIDE0LjM5ODQgNS4zMTM1NiAxNC4zOTg0IDQuOTYwMVYyLjI0MDFDMTQuMzk4NCAxLjg4NjY0IDE0LjExMTkgMS42MDAxIDEzLjc1ODQgMS42MDAxWiIgZmlsbD0iI2ZmZiIvPgo8cGF0aCBkPSJNNCAxMkwxMiA0TDQgMTJaIiBmaWxsPSIjZmZmIi8%2BCjxwYXRoIGQ9Ik00IDEyTDEyIDQiIHN0cm9rZT0iI2ZmZiIgc3Ryb2tlLXdpZHRoPSIxLjUiIHN0cm9rZS1saW5lY2FwPSJyb3VuZCIvPgo8L3N2Zz4K&logoColor=ffffff)](https://zread.ai/Mirrowel/LLM-API-Key-Proxy)
3
 
 
4
 
5
  ## Detailed Setup and Features
6
 
 
14
  - **Universal API Endpoint**: Simplifies development by providing a single, OpenAI-compatible interface for diverse LLM providers.
15
  - **High Availability**: The underlying library ensures your application remains operational by gracefully handling transient provider errors and API key-specific issues.
16
  - **Resilient Performance**: A global timeout on all requests prevents your application from hanging on unresponsive provider APIs.
17
+ - **Advanced Concurrency Control**: A single API key can be used for multiple concurrent requests. By default, it supports concurrent requests to *different* models. With configuration (`MAX_CONCURRENT_REQUESTS_PER_KEY_<PROVIDER>`), it can also support multiple concurrent requests to the *same* model using the same key.
18
  - **Intelligent Key Management**: Optimizes request distribution across your pool of keys by selecting the best available one for each call.
19
+ - **Automated OAuth Discovery**: Automatically discovers, validates, and manages OAuth credentials from standard provider directories (e.g., `~/.gemini/`, `~/.qwen/`, `~/.iflow/`).
20
+ - **Stateless Deployment Support**: Deploy easily to platforms like Railway, Render, or Vercel. The new export tool converts complex OAuth credentials (Gemini CLI, Qwen, iFlow) into simple environment variables, removing the need for persistent storage or file uploads.
21
+ - **Batch Request Processing**: Efficiently aggregates multiple embedding requests into single batch API calls, improving throughput and reducing rate limit hits.
22
+ - **New Provider Support**: Full support for **iFlow** (API Key & OAuth), **Qwen Code** (API Key & OAuth), and **NVIDIA NIM** with DeepSeek thinking support, including special handling for their API quirks (tool schema cleaning, reasoning support, dedicated logging).
23
  - **Duplicate Credential Detection**: Intelligently detects if multiple local credential files belong to the same user account and logs a warning, preventing redundancy in your key pool.
24
  - **Escalating Per-Model Cooldowns**: If a key fails for a specific model, it's placed on a temporary, escalating cooldown for that model, allowing it to be used with others.
25
  - **Automatic Daily Resets**: Cooldowns and usage statistics are automatically reset daily, making the system self-maintaining.
 
28
  - **OpenAI-Compatible Proxy**: Offers a familiar API interface with additional endpoints for model and provider discovery.
29
  - **Advanced Model Filtering**: Supports both blacklists and whitelists to give you fine-grained control over which models are available through the proxy.
30
 
31
+
32
  ---
33
 
34
+ ## 1. Quick Start
35
 
36
+ ### Windows (Simplest)
37
 
38
  1. **Download the latest release** from the [GitHub Releases page](https://github.com/Mirrowel/LLM-API-Key-Proxy/releases/latest).
39
  2. Unzip the downloaded file.
40
+ 3. **Run `launcher.bat`**. This all-in-one script allows you to:
41
+ - Add/Manage credentials interactively.
42
+ - Configure the server (Host, Port, Logging).
43
+ - Run the proxy server.
44
+ - Build the executable from source (if Python is installed).
45
+
46
+ ### macOS / Linux
47
+
48
+ **Option A: Using the Executable (Recommended)**
49
+ If you downloaded the pre-compiled binary for your platform, no Python installation is required.
50
+
51
+ 1. **Download the latest release** from the GitHub Releases page.
52
+ 2. Open a terminal and make the binary executable:
53
+ ```bash
54
+ chmod +x proxy_app
55
+ ```
56
+ 3. **Run the Proxy**:
57
+ ```bash
58
+ ./proxy_app --host 0.0.0.0 --port 8000
59
+ ```
60
+ 4. **Manage Credentials**:
61
+ ```bash
62
+ ./proxy_app --add-credential
63
+ ```
64
+
65
+ **Option B: Manual Setup (Source Code)**
66
+ If you are running from source, use these commands:
67
+
68
+ **1. Install Dependencies**
69
+ ```bash
70
+ # Ensure you have Python 3.10+ installed
71
+ python3 -m venv venv
72
+ source venv/bin/activate
73
+ pip install -r requirements.txt
74
+ ```
75
+
76
+ **2. Add Credentials (Interactive Tool)**
77
+ ```bash
78
+ # Equivalent to "Add Credentials"
79
+ export PYTHONPATH=$PYTHONPATH:$(pwd)/src
80
+ python src/proxy_app/main.py --add-credential
81
+ ```
82
 
83
+ **3. Run the Proxy**
84
+ ```bash
85
+ # Equivalent to "Run Proxy"
86
+ export PYTHONPATH=$PYTHONPATH:$(pwd)/src
87
+ python src/proxy_app/main.py --host 0.0.0.0 --port 8000
88
+ ```
89
+ *To enable logging, add `--enable-request-logging` to the command.*
90
 
91
  ---
92
 
 
159
 
160
  #### Interactive Credential Management Tool
161
 
162
+ The proxy includes a powerful interactive CLI tool for managing all your credentials. This is the recommended way to set up credentials:
163
 
164
  ```bash
165
  python -m rotator_library.credential_tool
166
  ```
167
 
168
+ **Main Menu Features:**
 
 
 
169
 
170
+ 1. **Add OAuth Credential** - Interactive OAuth flow for Gemini CLI, Qwen Code, and iFlow
171
+ - Automatically opens your browser for authentication
172
+ - Handles the entire OAuth flow including callbacks
173
+ - Saves credentials to the local `oauth_creds/` directory
174
+ - For Gemini CLI: Automatically discovers or creates a Google Cloud project
175
+ - For Qwen Code: Uses Device Code flow (you'll enter a code in your browser)
176
+ - For iFlow: Starts a local callback server on port 11451
177
+
178
+ 2. **Add API Key** - Add standard API keys for any LiteLLM-supported provider
179
+ - Interactive prompts guide you through the process
180
+ - Automatically saves to your `.env` file
181
+ - Supports multiple keys per provider (numbered automatically)
182
+
183
+ 3. **Export Credentials to .env** - The "Stateless Deployment" feature
184
+ - Converts file-based OAuth credentials into environment variables
185
+ - Essential for platforms without persistent file storage
186
+ - Generates a ready-to-paste `.env` block for each credential
187
+
188
+ **Stateless Deployment Workflow (Railway, Render, Vercel, etc.):**
189
+
190
+ If you're deploying to a platform without persistent file storage:
191
+
192
+ 1. **Setup credentials locally first**:
193
+ ```bash
194
+ python -m rotator_library.credential_tool
195
+ # Select "Add OAuth Credential" and complete the flow
196
+ ```
197
+
198
+ 2. **Export to environment variables**:
199
+ ```bash
200
+ python -m rotator_library.credential_tool
201
+ # Select "Export Gemini CLI to .env" (or Qwen/iFlow)
202
+ # Choose your credential file
203
+ ```
204
+
205
+ 3. **Copy the generated output**:
206
+ - The tool creates a file like `gemini_cli_credential_1.env`
207
+ - Contains all necessary `GEMINI_CLI_*` variables
208
+
209
+ 4. **Paste into your hosting platform**:
210
+ - Add each variable to your platform's environment settings
211
+ - Set `SKIP_OAUTH_INIT_CHECK=true` to skip interactive validation
212
+ - No credential files needed; everything loads from environment variables
213
+
214
+ **Local-First OAuth Management:**
215
+
216
+ The proxy uses a "local-first" approach for OAuth credentials:
217
+
218
+ - **Local Storage**: All OAuth credentials are stored in `oauth_creds/` directory
219
+ - **Automatic Discovery**: On first run, the proxy scans system paths (`~/.gemini/`, `~/.qwen/`, `~/.iflow/`) and imports found credentials
220
+ - **Deduplication**: Intelligently detects duplicate accounts (by email/user ID) and warns you
221
+ - **Priority**: Local files take priority over system-wide credentials
222
+ - **No System Pollution**: Your project's credentials are isolated from global system credentials
223
 
224
  **Example `.env` configuration:**
225
  ```env
 
352
 
353
  ## 4. Advanced Topics
354
 
355
+ ### Batch Request Processing
356
+
357
+ The proxy includes a `Batch Manager` that optimizes high-volume embedding requests.
358
+ - **Automatic Aggregation**: Multiple individual embedding requests are automatically collected into a single batch API call.
359
+ - **Configurable**: Works out of the box, but can be tuned for specific needs.
360
+ - **Benefits**: Significantly reduces the number of HTTP requests to providers, helping you stay within rate limits while improving throughput.
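As a rough sketch of the aggregation idea (greatly simplified, with illustrative names only), individual requests collect into a pending list and all resolve together once the batch fills:

```python
import asyncio

class BatchAggregator:
    """Sketch: collect individual embedding inputs, flush them as one call."""

    def __init__(self, backend, max_batch=16):
        self._backend = backend      # async fn: list of texts -> list of vectors
        self._max_batch = max_batch
        self._pending = []           # list of (text, future) pairs

    async def embed(self, text):
        fut = asyncio.get_running_loop().create_future()
        self._pending.append((text, fut))
        if len(self._pending) >= self._max_batch:
            await self.flush()
        return await fut

    async def flush(self):
        batch, self._pending = self._pending, []
        if not batch:
            return
        vectors = await self._backend([text for text, _ in batch])
        for (_, fut), vector in zip(batch, vectors):
            fut.set_result(vector)

async def fake_backend(texts):
    # Stand-in for the real batched embeddings call.
    return [len(t) for t in texts]

async def demo():
    agg = BatchAggregator(fake_backend, max_batch=3)
    return await asyncio.gather(agg.embed("a"), agg.embed("bb"), agg.embed("ccc"))

results = asyncio.run(demo())
```

A production-grade version also flushes on a short timer so that partially filled batches never wait indefinitely.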
361
+
362
  ### How It Works
363
 
364
+ The proxy is built on a robust architecture:
365
 
366
+ 1. **Intelligent Routing**: The `UsageManager` selects the best available key from your pool. It prioritizes idle keys first, then keys that can handle concurrency, ensuring optimal load balancing.
367
+ 2. **Resilience & Deadlines**: Every request has a strict deadline (`global_timeout`). If a provider is slow or fails, the proxy retries with a different key immediately, ensuring your application never hangs.
368
+ 3. **Batching**: High-volume embedding requests are automatically aggregated into optimized batches, reducing API calls and staying within rate limits.
369
+ 4. **Deep Observability**: (Optional) Detailed logs capture every byte of the transaction, including raw streaming chunks, for precise debugging of complex agentic interactions.
 
 
 
370
 
371
  ### Command-Line Arguments and Scripts
372
 
 
376
  - `--port`: The port to run the server on. Defaults to `8000`.
377
  - `--enable-request-logging`: A flag to enable detailed, per-request logging. When active, the proxy creates a unique directory for each transaction in the `logs/detailed_logs/` folder, containing the full request, response, streaming chunks, and performance metadata. This is highly recommended for debugging.
378
 
379
+ ### New Provider Highlights
380
+
381
+ #### **Gemini CLI (Advanced)**
382
+ A powerful provider that mimics the Google Cloud Code extension.
383
+ - **Zero-Config Project Discovery**: Automatically finds your Google Cloud Project ID or onboards you to a free-tier project if none exists.
384
+ - **Internal API Access**: Uses high-limit internal endpoints (`cloudcode-pa.googleapis.com`) rather than the public Vertex AI API.
385
+ - **Smart Rate Limiting**: Automatically falls back to preview models (e.g., `gemini-2.5-pro-preview`) if the main model hits a rate limit.
386
+
387
+ #### **Qwen Code**
388
+ - **Dual Authentication**: Use either standard API keys or OAuth 2.0 Device Flow credentials.
389
+ - **Schema Cleaning**: Automatically removes `strict` and `additionalProperties` from tool schemas to prevent API errors.
390
+ - **Stream Stability**: Injects a dummy `do_not_call_me` tool to prevent stream corruption issues when no tools are provided.
391
+ - **Reasoning Support**: Parses `<think>` tags in responses and exposes them as `reasoning_content` (similar to OpenAI's o1 format).
392
+ - **Dedicated Logging**: Optional per-request file logging to `logs/qwen_code_logs/` for debugging.
393
+ - **Custom Models**: Define additional models via `QWEN_CODE_MODELS` environment variable (JSON array format).
394
+
395
+ #### **iFlow**
396
+ - **Dual Authentication**: Use either standard API keys or OAuth 2.0 Authorization Code Flow.
397
+ - **Hybrid Auth**: OAuth flow provides an access token, but actual API calls use a separate `apiKey` retrieved from user profile.
398
+ - **Local Callback Server**: OAuth flow runs a temporary server on port 11451 to capture the redirect.
399
+ - **Schema Cleaning**: Same as Qwen Code - removes unsupported properties from tool schemas.
400
+ - **Stream Stability**: Injects placeholder tools to stabilize streaming for empty tool lists.
401
+ - **Dedicated Logging**: Optional per-request file logging to `logs/iflow_logs/` for debugging proprietary API behaviors.
402
+ - **Custom Models**: Define additional models via `IFLOW_MODELS` environment variable (JSON array format).
403
+
404
+
405
+ ### Advanced Configuration
+ 
+ The following advanced settings can be added to your `.env` file:
+ 
+ #### OAuth and Refresh Settings
+ 
+ - **`OAUTH_REFRESH_INTERVAL`**: Controls how often (in seconds) the background refresher checks for expired OAuth tokens. Default is `3600` (1 hour).
+   ```env
+   OAUTH_REFRESH_INTERVAL=1800 # Check every 30 minutes
+   ```
+ 
+ - **`SKIP_OAUTH_INIT_CHECK`**: Set to `true` to skip the interactive OAuth setup/validation check on startup. Essential for non-interactive environments like Docker containers or CI/CD pipelines.
+   ```env
+   SKIP_OAUTH_INIT_CHECK=true
+   ```
+ 
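Conceptually, the background refresher is a periodic task that honors this interval until shutdown. A minimal sketch of that loop (names and structure are hypothetical, not the proxy's actual code):

```python
import asyncio

async def oauth_refresh_loop(refresh_fn, interval_s: float, stop: asyncio.Event):
    """Call refresh_fn, then sleep up to interval_s, until stop is set."""
    while not stop.is_set():
        await refresh_fn()
        try:
            await asyncio.wait_for(stop.wait(), timeout=interval_s)
        except asyncio.TimeoutError:
            pass  # interval elapsed; run another refresh pass

async def demo():
    calls = []
    stop = asyncio.Event()

    async def refresh():
        calls.append(1)
        if len(calls) >= 3:
            stop.set()  # shut the loop down after three passes

    await oauth_refresh_loop(refresh, 0.01, stop)
    return len(calls)

count = asyncio.run(demo())
```

Waiting on the stop event (rather than a plain `sleep`) lets the loop exit promptly on shutdown instead of blocking for up to a full interval.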
+ #### Concurrency Control
+ 
+ - **`MAX_CONCURRENT_REQUESTS_PER_KEY_<PROVIDER>`**: Sets the maximum number of simultaneous requests allowed per API key for a specific provider. Default is `1` (no concurrency). Useful for high-throughput providers.
+   ```env
+   MAX_CONCURRENT_REQUESTS_PER_KEY_OPENAI=3
+   MAX_CONCURRENT_REQUESTS_PER_KEY_ANTHROPIC=2
+   MAX_CONCURRENT_REQUESTS_PER_KEY_GEMINI=1
+   ```
+ 
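Conceptually, this setting sizes a semaphore for each (provider, key) pair. A minimal sketch of the idea (class and method names are hypothetical, not the proxy's internals):

```python
import asyncio

class KeyConcurrencyLimiter:
    """Caps simultaneous requests per (provider, key) pair with one semaphore each."""

    def __init__(self, limits: dict, default: int = 1):
        self._limits = limits          # e.g. {"openai": 3}
        self._default = default
        self._semaphores = {}

    def _sem(self, provider: str, key: str) -> asyncio.Semaphore:
        pair = (provider, key)
        if pair not in self._semaphores:
            self._semaphores[pair] = asyncio.Semaphore(self._limits.get(provider, self._default))
        return self._semaphores[pair]

    async def run(self, provider: str, key: str, coro_fn):
        async with self._sem(provider, key):
            return await coro_fn()

async def main():
    limiter = KeyConcurrencyLimiter({"openai": 3})

    async def fake_call():
        await asyncio.sleep(0)  # stand-in for a real provider request
        return "ok"

    # Five requests on one key; at most three run at any instant.
    return await asyncio.gather(*(limiter.run("openai", "sk-1", fake_call) for _ in range(5)))

results = asyncio.run(main())
```

With the default of 1, the semaphore degenerates to a mutex, which is why keys serialize requests unless you raise the limit.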
+ #### Custom Model Lists
+ 
+ For providers that support custom model definitions (Qwen Code, iFlow), you can override the default model list:
+ 
+ - **`QWEN_CODE_MODELS`**: JSON array of custom Qwen Code models. These models take priority over hardcoded defaults.
+   ```env
+   QWEN_CODE_MODELS='["qwen3-coder-plus", "qwen3-coder-flash", "custom-model-id"]'
+   ```
+ 
+ - **`IFLOW_MODELS`**: JSON array of custom iFlow models. These models take priority over hardcoded defaults.
+   ```env
+   IFLOW_MODELS='["glm-4.6", "qwen3-coder-plus", "deepseek-v3.2"]'
+   ```
+ 
+ #### Provider-Specific Settings
+ 
+ - **`GEMINI_CLI_PROJECT_ID`**: Manually specify a Google Cloud Project ID for Gemini CLI OAuth. Only needed if automatic discovery fails.
+   ```env
+   GEMINI_CLI_PROJECT_ID="your-gcp-project-id"
+   ```
+ 
  **Example:**
  ```bash
  python src/proxy_app/main.py --host 127.0.0.1 --port 9999 --enable-request-logging
  ```
  
+ 
  #### Windows Batch Scripts
  
  For convenience on Windows, you can use the provided `.bat` scripts in the root directory to run the proxy with common configurations:
src/rotator_library/README.md CHANGED
@@ -5,16 +5,21 @@ A robust, asynchronous, and thread-safe Python library for managing a pool of AP
  ## Key Features
  
  - **Asynchronous by Design**: Built with `asyncio` and `httpx` for high-performance, non-blocking I/O.
- - **Advanced Concurrency Control**: A single API key can be used for multiple concurrent requests to *different* models, maximizing throughput while ensuring thread safety. Requests for the *same model* using the same key are queued, preventing conflicts.
  - **Smart Key Management**: Selects the optimal key for each request using a tiered, model-aware locking strategy to distribute load evenly and maximize availability.
- - **Deadline-Driven Requests**: A global timeout ensures that no request, including all retries and key selections, exceeds a specified time limit, preventing indefinite hangs.
  - **Intelligent Error Handling**:
-   - **Escalating Per-Model Cooldowns**: If a key fails, it's placed on a temporary, escalating cooldown for that specific model, allowing it to continue being used for others.
-   - **Deadline-Aware Retries**: Retries requests on transient server errors with exponential backoff, but only if the wait time fits within the global request budget.
-   - **Key-Level Lockouts**: If a key fails across multiple models, it's temporarily taken out of rotation entirely.
- - **Robust Streaming Support**: The client includes a wrapper for streaming responses that can reassemble fragmented JSON chunks and intelligently detect and handle errors that occur mid-stream.
- - **Detailed Usage Tracking**: Tracks daily and global usage for each key, including token counts and approximate cost, persisted to a JSON file.
- - **Automatic Daily Resets**: Automatically resets cooldowns and archives stats daily to keep the system running smoothly.
  - **Provider Agnostic**: Works with any provider supported by `litellm`.
  - **Extensible**: Easily add support for new providers through a simple plugin-based architecture.
  

@@ -35,7 +40,7 @@ This is the main class for interacting with the library. It is designed to be a
  ```python
  import os
  from dotenv import load_dotenv
- from rotating_api_key_client import RotatingClient
  
  # Load environment variables from .env file
  load_dotenv()

@@ -51,27 +56,43 @@ for key, value in os.environ.items():
          api_keys[provider] = []
      api_keys[provider].append(value)
  
- if not api_keys:
-     raise ValueError("No provider API keys found in environment variables.")
  
  client = RotatingClient(
      api_keys=api_keys,
      max_retries=2,
      usage_file_path="key_usage.json",
-     global_timeout=30  # Default is 30 seconds
  )
  ```
  
- - `api_keys`: A dictionary where keys are provider names (e.g., `"openai"`, `"gemini"`) and values are lists of API keys for that provider.
- - `max_retries`: The number of times to retry a request with the *same key* if a transient server error occurs.
- - `usage_file_path`: The path to the JSON file where key usage data will be stored.
- - `global_timeout`: A hard time limit (in seconds) for the entire request lifecycle. If the total time exceeds this, the request will fail.
- - `ignore_models`: A dictionary where keys are provider names and values are lists of model names/patterns to exclude (blacklist).
- - `whitelist_models`: A dictionary where keys are provider names and values are lists of model names/patterns to always include, overriding any blacklists.
  
  ### Concurrency and Resource Management
  
- The `RotatingClient` is asynchronous and manages an `httpx.AsyncClient` internally. It's crucial to close the client properly to release resources. The recommended way is to use an `async with` block, which handles setup and teardown automatically.
  
  ```python
  import asyncio

@@ -131,14 +152,56 @@ Fetches a list of available models for a specific provider, applying any configu
  
  Fetches a dictionary of all available models, grouped by provider, or as a single flat list if `grouped=False`.
  
  ## Error Handling and Cooldowns
  
  The client uses a sophisticated error handling mechanism:
  
- - **Error Classification**: All exceptions from `litellm` are passed through a `classify_error` function to determine their type (`rate_limit`, `authentication`, `server_error`, etc.).
  - **Server Errors**: The client will retry the request with the *same key* up to `max_retries` times, using an exponential backoff strategy.
  - **Key-Specific Errors (Authentication, Quota, etc.)**: The client records the failure in the `UsageManager`, which applies an escalating cooldown to the key for that specific model. The client then immediately acquires a new key and continues its attempt to complete the request.
- - **Key-Level Lockouts**: If a key fails on multiple different models, the `UsageManager` can apply a key-level lockout, taking it out of rotation entirely for a short period.
  
  ### Global Timeout and Deadline-Driven Logic
  

@@ -146,7 +209,7 @@ To ensure predictable performance, the client now operates on a strict time budg
  
  - **Deadline Enforcement**: When a request starts, a `deadline` is set. The entire process, including all key rotations and retries, must complete before this deadline.
  - **Deadline-Aware Retries**: If a retry requires a wait time that would exceed the remaining budget, the wait is skipped, and the client immediately rotates to the next key.
- - **Silent Internal Errors**: Intermittent failures like provider capacity limits or temporary server errors are logged internally but are **not raised** to the caller. The client will simply rotate to the next key. A non-streaming request will only return `None` (or a streaming request will end) if the global timeout is exceeded or all keys have been exhausted. This creates a more stable experience for the end-user, as they are shielded from transient backend issues.
  
  ## Extending with Provider Plugins
  

@@ -162,13 +225,9 @@ from typing import List
  import httpx
  
  class MyProvider(ProviderInterface):
-     async def get_models(self, api_key: str, client: httpx.AsyncClient) -> List[str]:
          # Logic to fetch and return a list of model names
-         # The model names should be prefixed with the provider name.
-         # e.g., ["my-provider/model-1", "my-provider/model-2"]
-         # Example:
-         # response = await client.get("https://api.myprovider.com/models", headers={"Auth": api_key})
-         # return [f"my-provider/{model['id']}" for model in response.json()]
          pass
  ```

@@ -177,3 +236,4 @@ The system will automatically discover and register your new provider.
  ## Detailed Documentation
  
  For a more in-depth technical explanation of the library's architecture, including the `UsageManager`'s concurrency model and the error classification system, please refer to the [Technical Documentation](../../DOCUMENTATION.md).

  ## Key Features
  
  - **Asynchronous by Design**: Built with `asyncio` and `httpx` for high-performance, non-blocking I/O.
+ - **Advanced Concurrency Control**: A single API key can be used for multiple concurrent requests. By default, it supports concurrent requests to *different* models. With configuration (`MAX_CONCURRENT_REQUESTS_PER_KEY_<PROVIDER>`), it can also support multiple concurrent requests to the *same* model using the same key.
  - **Smart Key Management**: Selects the optimal key for each request using a tiered, model-aware locking strategy to distribute load evenly and maximize availability.
+ - **Deadline-Driven Requests**: A global timeout ensures that no request, including all retries and key selections, exceeds a specified time limit.
+ - **OAuth & API Key Support**: Built-in support for standard API keys and complex OAuth flows.
+   - **Gemini CLI**: Full OAuth 2.0 web flow with automatic project discovery and free-tier onboarding.
+   - **Qwen Code**: Device Code flow support.
+   - **iFlow**: Authorization Code flow with local callback handling.
+ - **Stateless Deployment Ready**: Can load complex OAuth credentials from environment variables, eliminating the need for physical credential files in containerized environments.
  - **Intelligent Error Handling**:
+   - **Escalating Per-Model Cooldowns**: Failed keys are placed on a temporary, escalating cooldown for specific models.
+   - **Key-Level Lockouts**: Keys failing across multiple models are temporarily removed from rotation.
+   - **Stream Recovery**: The client detects mid-stream errors (like quota limits) and gracefully handles them.
+ - **Robust Streaming Support**: Includes a wrapper for streaming responses that reassembles fragmented JSON chunks.
+ - **Detailed Usage Tracking**: Tracks daily and global usage for each key, persisted to a JSON file.
+ - **Automatic Daily Resets**: Automatically resets cooldowns and archives stats daily.
  - **Provider Agnostic**: Works with any provider supported by `litellm`.
  - **Extensible**: Easily add support for new providers through a simple plugin-based architecture.
  
  ```python
  import os
  from dotenv import load_dotenv
+ from rotator_library import RotatingClient
  
  # Load environment variables from .env file
  load_dotenv()
      api_keys[provider] = []
  api_keys[provider].append(value)
  
+ # Initialize empty dictionary for OAuth credentials (or load from CredentialManager)
+ oauth_credentials = {}
  
  client = RotatingClient(
      api_keys=api_keys,
+     oauth_credentials=oauth_credentials,
      max_retries=2,
      usage_file_path="key_usage.json",
+     configure_logging=True,
+     global_timeout=30,
+     abort_on_callback_error=True,
+     litellm_provider_params={},
+     ignore_models={},
+     whitelist_models={},
+     enable_request_logging=False,
+     max_concurrent_requests_per_key={}
  )
  ```
  
+ #### Arguments
+ 
+ - `api_keys` (`Optional[Dict[str, List[str]]]`): A dictionary mapping provider names (e.g., "openai", "anthropic") to a list of API keys.
+ - `oauth_credentials` (`Optional[Dict[str, List[str]]]`): A dictionary mapping provider names (e.g., "gemini_cli", "qwen_code") to a list of file paths to OAuth credential JSON files.
+ - `max_retries` (`int`, default: `2`): The number of times to retry a request with the *same key* if a transient server error (e.g., 500, 503) occurs.
+ - `usage_file_path` (`str`, default: `"key_usage.json"`): The path to the JSON file where usage statistics (tokens, cost, success counts) are persisted.
+ - `configure_logging` (`bool`, default: `True`): If `True`, configures the library's logger to propagate logs to the root logger. Set to `False` if you want to handle logging configuration manually.
+ - `global_timeout` (`int`, default: `30`): A hard time limit (in seconds) for the entire request lifecycle. If the request (including all retries) takes longer than this, it is aborted.
+ - `abort_on_callback_error` (`bool`, default: `True`): If `True`, any exception raised by `pre_request_callback` will abort the request. If `False`, the error is logged and the request proceeds.
+ - `litellm_provider_params` (`Optional[Dict[str, Any]]`, default: `None`): A dictionary of extra parameters to pass to `litellm` for specific providers.
+ - `ignore_models` (`Optional[Dict[str, List[str]]]`, default: `None`): A dictionary where keys are provider names and values are lists of model names/patterns to exclude (blacklist). Supports wildcards (e.g., `"*-preview"`).
+ - `whitelist_models` (`Optional[Dict[str, List[str]]]`, default: `None`): A dictionary where keys are provider names and values are lists of model names/patterns to always include, overriding `ignore_models`.
+ - `enable_request_logging` (`bool`, default: `False`): If `True`, enables detailed per-request file logging (useful for debugging complex interactions).
+ - `max_concurrent_requests_per_key` (`Optional[Dict[str, int]]`, default: `None`): A dictionary defining the maximum number of concurrent requests allowed for a single API key for a specific provider. Defaults to 1 if not specified.
  
  ### Concurrency and Resource Management
94
 
95
+ The `RotatingClient` is asynchronous and manages an `httpx.AsyncClient` internally. It's crucial to close the client properly to release resources. The recommended way is to use an `async with` block.
96
 
97
  ```python
98
  import asyncio
 
152
 
153
  Fetches a dictionary of all available models, grouped by provider, or as a single flat list if `grouped=False`.
154
 
155
+ ## Credential Tool
156
+
157
+ The library includes a utility to manage credentials easily:
158
+
159
+ ```bash
160
+ python -m src.rotator_library.credential_tool
161
+ ```
162
+
163
+ Use this tool to:
164
+ 1. **Initialize OAuth**: Run the interactive login flows for Gemini, Qwen, and iFlow.
165
+ 2. **Export Credentials**: Generate `.env` compatible configuration blocks from your saved OAuth JSON files. This is essential for setting up stateless deployments.
166
+ 
+ ## Provider Specifics
+ 
+ ### Qwen Code
+ - **Auth**: Uses OAuth 2.0 Device Flow. Requires manual entry of email/identifier if not returned by the provider.
+ - **Resilience**: Injects a dummy tool (`do_not_call_me`) into requests with no tools to prevent known stream corruption issues on the API.
+ - **Reasoning**: Parses `<think>` tags in the response and exposes them as `reasoning_content`.
+ - **Schema Cleaning**: Recursively removes `strict` and `additionalProperties` from all tool schemas. Qwen's API has stricter validation than OpenAI's, and these properties cause `400 Bad Request` errors.
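The recursive clean-up described above amounts to dropping the two offending keys at every nesting level of the schema. A minimal sketch (not the library's actual code):

```python
def clean_tool_schema(node):
    """Recursively drop `strict` and `additionalProperties` from a tool schema."""
    if isinstance(node, dict):
        return {
            key: clean_tool_schema(value)
            for key, value in node.items()
            if key not in ("strict", "additionalProperties")
        }
    if isinstance(node, list):
        return [clean_tool_schema(item) for item in node]
    return node

tool = {
    "type": "function",
    "function": {
        "name": "lookup",
        "strict": True,
        "parameters": {"type": "object", "additionalProperties": False, "properties": {}},
    },
}
cleaned = clean_tool_schema(tool)
```

The traversal must recurse through nested `properties` and `items` as well, since OpenAI-style schemas can set `additionalProperties` at any depth.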
+ 
+ ### iFlow
+ - **Auth**: Uses Authorization Code Flow with a local callback server (port 11451).
+ - **Key Separation**: Distinguishes between the OAuth `access_token` (used to fetch user info) and the `api_key` (used for actual chat requests).
+ - **Resilience**: Similar to Qwen, injects a placeholder tool to stabilize streaming for empty tool lists.
+ - **Schema Cleaning**: Recursively removes `strict` and `additionalProperties` from all tool schemas to prevent API validation errors.
+ - **Custom Models**: Supports model definitions via the `IFLOW_MODELS` environment variable (JSON array of model IDs or objects).
+ 
+ ### NVIDIA NIM
+ - **Discovery**: Dynamically fetches available models from the NVIDIA API.
+ - **Thinking**: Automatically injects the `thinking` parameter into `extra_body` for DeepSeek models (`deepseek-v3.1`, etc.) when `reasoning_effort` is set to low/medium/high.
+ 
+ ### Google Gemini (CLI)
+ - **Auth**: Simulates the Google Cloud CLI authentication flow.
+ - **Project Discovery**: Automatically discovers the default Google Cloud Project ID.
+ - **Rate Limits**: Implements smart fallback strategies (e.g., switching from `gemini-1.5-pro` to `gemini-1.5-pro-002`) when rate limits are hit.
+ 
  ## Error Handling and Cooldowns
  
  The client uses a sophisticated error handling mechanism:
  
+ - **Error Classification**: All exceptions from `litellm` are passed through a `classify_error` function to determine their type (`rate_limit`, `authentication`, `server_error`, `quota`, `context_length`, etc.).
  - **Server Errors**: The client will retry the request with the *same key* up to `max_retries` times, using an exponential backoff strategy.
  - **Key-Specific Errors (Authentication, Quota, etc.)**: The client records the failure in the `UsageManager`, which applies an escalating cooldown to the key for that specific model. The client then immediately acquires a new key and continues its attempt to complete the request.
+ - **Escalating Cooldown Strategy**: Consecutive failures for a key on the same model result in increasing cooldown periods:
+   - 1st failure: 10 seconds
+   - 2nd failure: 30 seconds
+   - 3rd failure: 60 seconds
+   - 4th+ failure: 120 seconds
+ - **Key-Level Lockouts**: If a key fails on multiple different models (3+ distinct models), the `UsageManager` applies a global 5-minute lockout for that key, removing it from rotation entirely.
+ - **Authentication Errors**: Immediate 5-minute global lockout (the key is assumed revoked or invalid).
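The escalation schedule above reduces to a simple lookup; a sketch of the idea (the real logic lives in the `UsageManager`):

```python
def cooldown_seconds(consecutive_failures: int) -> int:
    """Escalating per-model cooldown schedule: 10s, 30s, 60s, then 120s."""
    schedule = {1: 10, 2: 30, 3: 60}
    return schedule.get(consecutive_failures, 120)

durations = [cooldown_seconds(n) for n in range(1, 6)]  # failures 1 through 5
```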
  
  ### Global Timeout and Deadline-Driven Logic
  
  - **Deadline Enforcement**: When a request starts, a `deadline` is set. The entire process, including all key rotations and retries, must complete before this deadline.
  - **Deadline-Aware Retries**: If a retry requires a wait time that would exceed the remaining budget, the wait is skipped, and the client immediately rotates to the next key.
+ - **Silent Internal Errors**: Intermittent failures like provider capacity limits or temporary server errors are logged internally but are **not raised** to the caller. The client will simply rotate to the next key.
  
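The deadline-aware retry decision can be pictured as a single check against the remaining budget. A sketch under the assumption of exponential backoff (function name and backoff base are illustrative, not the library's actual code):

```python
import time
from typing import Optional

def backoff_within_deadline(attempt: int, deadline: float, base: float = 1.0) -> Optional[float]:
    """Return the backoff delay for this attempt, or None when waiting would miss the deadline."""
    delay = base * (2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
    remaining = deadline - time.monotonic()
    return delay if delay <= remaining else None

deadline = time.monotonic() + 3.0
first = backoff_within_deadline(0, deadline)  # 1.0s fits within the ~3s budget
third = backoff_within_deadline(2, deadline)  # 4.0s would exceed it
```

When the helper returns `None`, the client skips the wait and rotates to the next key instead of sleeping past the deadline.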
  ## Extending with Provider Plugins
  
  import httpx
  
  class MyProvider(ProviderInterface):
+     async def get_models(self, credential: str, client: httpx.AsyncClient) -> List[str]:
          # Logic to fetch and return a list of model names
+         # The credential argument allows using the key to fetch models
          pass
  ```
  
  ## Detailed Documentation
  
  For a more in-depth technical explanation of the library's architecture, including the `UsageManager`'s concurrency model and the error classification system, please refer to the [Technical Documentation](../../DOCUMENTATION.md).
+ 