Technical Documentation: rotating-api-key-client
This document provides a detailed technical explanation of the rotating-api-key-client library, its components, and its internal workings. The library has evolved into a sophisticated, asynchronous client for managing LLM API keys with a strong focus on concurrency, resilience, and state management.
1. client.py - The RotatingClient
The RotatingClient is the central component, orchestrating API calls, key management, and error handling. It is designed as a long-lived, async-native object.
Core Responsibilities
- Managing an httpx.AsyncClient for non-blocking HTTP requests.
- Interfacing with the UsageManager to acquire and release API keys.
- Handling provider-specific request modifications.
- Executing API calls via litellm with a robust retry and rotation strategy.
- Providing a safe wrapper for streaming responses.
Request Lifecycle (acompletion)
When acompletion is called, it follows these steps:
1. Provider and Key Validation: It extracts the provider from the model name and ensures keys are configured for it.
2. Key Acquisition Loop: The client enters a loop to find a valid key and complete the request, iterating through all keys for the provider until one succeeds or all have been tried.
   a. Acquire Best Key: It calls self.usage_manager.acquire_key(). This is a blocking call that waits until a suitable key is available, based on the manager's tiered locking strategy (see the UsageManager section).
   b. Prepare Request: It prepares the litellm keyword arguments. This includes:
      - Request Sanitization: Calling sanitize_request_payload() to remove parameters that might be unsupported by the target model, preventing errors.
      - Provider-Specific Logic: Applying special handling for providers like Gemini (safety settings), Gemma (system prompts), and Chutes.ai (api_base and model name remapping).
3. Retry Loop: Once a key is acquired, it enters an inner retry loop (for attempt in range(self.max_retries)):
   a. API Call: It calls litellm.acompletion with the acquired key.
   b. Success (Non-Streaming):
      - It calls self.usage_manager.record_success() to update usage stats and clear any cooldowns for the key-model pair.
      - It calls self.usage_manager.release_key() to release the lock on the key for this model.
      - It returns the response, and the process ends.
   c. Success (Streaming): It returns a _safe_streaming_wrapper async generator. This wrapper is critical:
      - It yields SSE-formatted chunks to the consumer.
      - After the stream is fully consumed, its finally block ensures that record_success() and release_key() are called. This guarantees that the key lock is held for the entire duration of the stream and released correctly, even if the consumer abandons the stream.
   d. Failure: If an exception occurs:
      - The failure is logged in detail by log_failure().
      - The exception is passed to classify_error() to get a structured ClassifiedError object.
      - Server Error: If the error type is server_error, it waits with exponential backoff and retries the request with the same key.
      - Rotation Error (Rate Limit, Auth, etc.): For any other error, it is treated as a rotation trigger. self.usage_manager.record_failure() is called to apply an escalating cooldown, and self.usage_manager.release_key() releases the lock. The inner attempt loop is broken, and the outer while loop continues, acquiring a new key.
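The acquire/retry/rotate lifecycle above can be sketched as follows. This is a simplified, hypothetical illustration: ServerError, RotationError, rotating_completion, and the call_api/usage_manager arguments stand in for litellm's exception types, litellm.acompletion, and the real UsageManager; streaming is omitted.

```python
import asyncio

class ServerError(Exception):
    """Transient 5xx-style failure: retry on the same key."""

class RotationError(Exception):
    """Rate-limit/auth-style failure: rotate to a new key."""

async def rotating_completion(usage_manager, call_api, model, max_retries=3, **kwargs):
    """Outer loop rotates keys; inner loop retries transient errors."""
    while True:
        # Blocks until a suitable key is free (tiered strategy in the real manager).
        key = await usage_manager.acquire_key(model)
        try:
            for attempt in range(max_retries):
                try:
                    response = await call_api(api_key=key, model=model, **kwargs)
                    await usage_manager.record_success(key, model)
                    return response
                except ServerError:
                    # Exponential backoff, then retry with the same key.
                    await asyncio.sleep((2 ** attempt) * 0.01)
                except RotationError:
                    # Apply a cooldown, then break out to acquire a different key.
                    await usage_manager.record_failure(key, model)
                    break
            else:
                # Retries exhausted on server errors: also recorded as a failure.
                await usage_manager.record_failure(key, model)
        finally:
            await usage_manager.release_key(key, model)
```

Note how release_key sits in a finally block, mirroring the guarantee the real client makes for both normal and streaming responses.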
2. usage_manager.py - Stateful Concurrency & Usage Management
This class is the heart of the library's state management and concurrency control. It is a stateful, async-native service that ensures keys are used efficiently and safely across multiple concurrent requests.
Key Concepts
- Asynchronous Design & Lazy Loading: The entire class is asynchronous, using aiofiles for non-blocking file I/O and a _lazy_init pattern. The usage data from the JSON file is loaded only when the first request is made.
- Concurrency Primitives:
  - filelock: A file-level lock (.json.lock) prevents race conditions if multiple processes are running and sharing the same usage file.
  - asyncio.Lock & asyncio.Condition: Each key has its own asyncio.Lock and asyncio.Condition object. This enables the fine-grained, model-aware locking strategy.
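A minimal sketch of the _lazy_init idea, using only the standard library (the real class additionally uses aiofiles for async file I/O and filelock for cross-process safety; LazyUsageStore and its fields are illustrative, not the library's actual names):

```python
import asyncio
import json

class LazyUsageStore:
    """Parse the usage JSON only on first access, guarded by an
    asyncio.Lock so concurrent first callers don't both load it."""

    def __init__(self, raw_json="{}"):
        self._raw = raw_json          # stands in for the file on disk
        self._data = None
        self._init_lock = asyncio.Lock()

    async def _lazy_init(self):
        if self._data is None:
            async with self._init_lock:
                if self._data is None:   # double-check after acquiring the lock
                    self._data = json.loads(self._raw)
        return self._data

    async def get(self):
        return await self._lazy_init()
```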
Tiered Key Acquisition (acquire_key)
This method implements the core logic for selecting a key. It is a "smart" blocking call.
- Filtering: It first filters out any keys that are on a global or model-specific cooldown.
- Tiering: It categorizes the remaining, valid keys into two tiers:
- Tier 1 (Ideal): Keys that are completely free (not being used by any model).
- Tier 2 (Acceptable): Keys that are currently in use, but for different models than the one being requested.
- Selection: It attempts to acquire a lock on a key, prioritizing Tier 1 over Tier 2. Within each tier, it prioritizes the least-used key.
- Waiting: If no keys in Tier 1 or Tier 2 can be locked, it means all eligible keys are currently handling requests for the same model. The method then awaits on the asyncio.Condition of the best available key, waiting until it is notified that the key has been released.
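The filtering, tiering, and least-used selection steps can be sketched like this. select_key and the per-key dictionary fields ('cooldowns', 'in_use', 'uses') are hypothetical simplifications; the real manager also has to actually lock the chosen key and fall back to waiting on its condition.

```python
import time

def select_key(keys, model):
    """Pick the best eligible key for `model`, or None if the caller
    must wait. Each entry in `keys` is a dict with 'cooldowns'
    (model -> unix expiry), 'in_use' (set of models currently holding
    the key), and 'uses' (total use count)."""
    now = time.time()
    # Filtering: drop keys on a cooldown for this model.
    eligible = [k for k in keys if k["cooldowns"].get(model, 0) <= now]
    # Tier 1: completely free keys.
    tier1 = [k for k in eligible if not k["in_use"]]
    # Tier 2: busy keys, but only for *other* models.
    tier2 = [k for k in eligible if k["in_use"] and model not in k["in_use"]]
    for tier in (tier1, tier2):
        if tier:
            return min(tier, key=lambda k: k["uses"])  # least-used first
    return None  # all eligible keys busy with this model: wait on a Condition
```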
Failure Handling & Cooldowns (record_failure)
- Escalating Backoff: When a failure is recorded, it applies a cooldown that increases with the number of consecutive failures for a specific key-model pair (e.g., 10s, 30s, 60s, up to 2 hours).
- Authentication Errors: These are treated more severely, applying an immediate 5-minute key-level lockout.
- Key-Level Lockouts: If a single key accumulates 3 or more long-term (2-hour) cooldowns across different models, the manager assumes the key is compromised or disabled and applies a 5-minute global lockout on the key.
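An illustrative escalation schedule consistent with the examples above (10s, 30s, 60s, capped at 2 hours, with a 5-minute lockout for auth errors); the exact ladder and cap used by the library may differ.

```python
def cooldown_seconds(consecutive_failures, auth_error=False):
    """Return the cooldown to apply after the Nth consecutive failure
    for a key-model pair. Illustrative values only."""
    if auth_error:
        return 5 * 60                      # immediate 5-minute lockout
    ladder = [10, 30, 60, 300, 900, 3600, 7200]  # caps at 2 hours
    index = min(consecutive_failures - 1, len(ladder) - 1)
    return ladder[max(index, 0)]
```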
Data Structure
The key_usage.json file uses the following structure to store this detailed state:
{
"api_key_hash": {
"daily": {
"date": "YYYY-MM-DD",
"models": {
"gemini/gemini-1.5-pro": {
"success_count": 10,
"prompt_tokens": 5000,
"completion_tokens": 10000,
"approx_cost": 0.075
}
}
},
"global": { /* ... similar to daily, but accumulates over time ... */ },
"model_cooldowns": {
"gemini/gemini-1.5-flash": 1719987600.0
},
"failures": {
"gemini/gemini-1.5-flash": {
"consecutive_failures": 2
}
},
"key_cooldown_until": null,
"last_daily_reset": "YYYY-MM-DD"
}
}
3. error_handler.py
This module provides a centralized function, classify_error, which is a significant improvement over the previous boolean checks.
- It takes a raw exception from litellm and returns a ClassifiedError data object.
- This object contains the error_type (e.g., 'rate_limit', 'authentication', 'server_error'), the original exception, the status code, and any retry_after information extracted from the error message.
- This structured classification allows the RotatingClient to make more intelligent decisions about whether to retry with the same key or rotate to a new one.
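A sketch of what such a classifier might look like. The field names mirror the description above, but the classification rules, the retry_after regex, and the exact fields are illustrative, not the library's actual implementation.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClassifiedError:
    """Structured view of a raw provider exception."""
    error_type: str                  # 'rate_limit', 'authentication', 'server_error', ...
    original: Exception
    status_code: Optional[int]
    retry_after: Optional[float]     # seconds, if the message advertises one

def classify_error(exc, status_code=None):
    """Classify an exception by status code and message text (sketch)."""
    message = str(exc).lower()
    match = re.search(r"retry.after[:\s]+(\d+)", message)
    retry_after = float(match.group(1)) if match else None
    if status_code == 429 or "rate limit" in message:
        error_type = "rate_limit"
    elif status_code in (401, 403) or "invalid api key" in message:
        error_type = "authentication"
    elif status_code is not None and status_code >= 500:
        error_type = "server_error"
    else:
        error_type = "unknown"
    return ClassifiedError(error_type, exc, status_code, retry_after)
```

The caller then branches on error_type alone, which is what lets the client keep its retry-vs-rotate decision in one place.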
4. request_sanitizer.py (New Module)
- This module's purpose is to prevent InvalidRequestError exceptions from litellm that occur when a payload contains parameters not supported by the target model (e.g., sending a thinking parameter to a model that doesn't support it).
- The sanitize_request_payload function is called just before litellm.acompletion to strip out any such unsupported parameters, making the system more robust.
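A minimal sketch of the idea, using an invented allow-list (litellm derives model capabilities from its own metadata; the SUPPORTED_PARAMS table and its entries below are assumptions made purely for illustration):

```python
# Hypothetical allow-list: model name -> parameters it accepts.
SUPPORTED_PARAMS = {
    "gemini/gemini-1.5-pro": {"model", "messages", "temperature", "max_tokens", "thinking"},
    "gpt-4o": {"model", "messages", "temperature", "max_tokens"},
}

def sanitize_request_payload(payload, model):
    """Drop any keyword arguments the target model does not support;
    leave payloads for unknown models untouched. Sketch only."""
    allowed = SUPPORTED_PARAMS.get(model)
    if allowed is None:
        return dict(payload)
    return {k: v for k, v in payload.items() if k in allowed}
```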
5. providers/ - Provider Plugins
The provider plugin system remains for fetching model lists. The interface now correctly specifies that the get_models method receives an httpx.AsyncClient instance, which it should use to make its API calls. This ensures all HTTP traffic goes through the client's managed session.
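A sketch of what a provider plugin could look like under that interface. The base class, method signature details, and ExampleProvider's URL are assumptions; the client argument is duck-typed here to keep the sketch dependency-free, but in the library it is the shared httpx.AsyncClient.

```python
import asyncio
from abc import ABC, abstractmethod

class Provider(ABC):
    """Hypothetical plugin base: all HTTP traffic goes through the
    client the RotatingClient passes in, never a fresh session."""

    @abstractmethod
    async def get_models(self, client) -> list:
        """Fetch the provider's model list using the shared HTTP client."""

class ExampleProvider(Provider):
    # Placeholder endpoint; a real plugin would use its provider's URL.
    MODELS_URL = "https://example.invalid/v1/models"

    async def get_models(self, client) -> list:
        response = await client.get(self.MODELS_URL)
        response.raise_for_status()
        return [m["id"] for m in response.json()["data"]]
```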
6. proxy_app/ - The Proxy Application
The proxy_app directory contains the FastAPI application that serves the rotating client.
main.py - The FastAPI App
This file contains the FastAPI application that exposes the RotatingClient through an OpenAI-compatible API.
Command-Line Arguments
--enable-request-logging: This flag enables logging of all incoming requests and outgoing responses to the logs/ directory. This is useful for debugging and monitoring the proxy's activity. By default, this is disabled.
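A minimal sketch of how such a flag can be declared with argparse (build_arg_parser is a hypothetical helper; the real main.py may parse additional options):

```python
import argparse

def build_arg_parser():
    """Declare the proxy's CLI options (sketch)."""
    parser = argparse.ArgumentParser(prog="proxy_app")
    parser.add_argument(
        "--enable-request-logging",
        action="store_true",   # False unless the flag is passed
        help="Log all incoming requests and outgoing responses to the logs/ directory.",
    )
    return parser
```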